DiSC - Record-Boundary Discovery in Web Documents

Digital Symposium Collection 2000

David W. Embley, Y. S. Jiang, and Yiu-Kai Ng

Abstract

Extracting information from unstructured or semistructured Web documents often requires recognizing and delimiting records. (By "record" we mean a group of information relevant to some entity.) Unless documents that contain multiple records are first chunked according to record boundaries, extraction of record information is unlikely to succeed. In this paper we describe a heuristic approach to discovering record boundaries in Web documents. In our approach, we capture the structure of a document as a tree of nested HTML tags, locate the subtree containing the records of interest, identify candidate separator tags within the subtree using five independent heuristics, and select a consensus separator tag based on a combined heuristic. Our approach is fast (it runs in linear time for practical cases within the context of the larger data-extraction problem) and accurate (100% in the experiments we conducted).
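
The pipeline outlined in the abstract (tag tree, record subtree, candidate separator tags, consensus) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not name the five heuristics or the combination rule, so everything specific here is an assumption made for illustration: the record subtree is taken to be the node with the most child tags, only three stand-in heuristics are used (tag frequency, regularity of the text size between occurrences, and a hand-picked `LIKELY_SEPARATORS` priority list), and the consensus is a plain sum of ranks. All names (`Node`, `TagTreeBuilder`, `segment_sizes`, `rank`, `find_separator`, `SAMPLE`) are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser
from statistics import pstdev

VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}
# Assumption: a hand-picked priority list of tags that often separate records.
LIKELY_SEPARATORS = ["hr", "tr", "p", "li", "br", "b"]


class Node:
    def __init__(self, tag, parent=None):
        self.tag, self.parent = tag, parent
        self.items = []  # child Nodes and text lengths (ints), in document order

    def child_tags(self):
        return [i.tag for i in self.items if isinstance(i, Node)]

    def text_size(self):
        return sum(i.text_size() if isinstance(i, Node) else i for i in self.items)


class TagTreeBuilder(HTMLParser):
    """Builds a tree of nested tags; unclosed non-void tags are not repaired."""

    def __init__(self):
        super().__init__()
        self.root = self.current = Node("#document")
        self.nodes = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.items.append(node)
        self.nodes.append(node)
        if tag not in VOID_TAGS:
            self.current = node  # descend only into tags that can hold content

    def handle_endtag(self, tag):
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent  # close the nearest matching open tag

    def handle_data(self, data):
        if data.strip():
            self.current.items.append(len(data.strip()))


def segment_sizes(subtree, tag):
    """Text size of each chunk that starts at an occurrence of `tag`."""
    sizes = []
    for item in subtree.items:
        if isinstance(item, Node) and item.tag == tag:
            sizes.append(0)  # a new chunk starts at each candidate separator
        if sizes:  # content before the first occurrence is ignored
            sizes[-1] += item.text_size() if isinstance(item, Node) else item
    return sizes


def rank(tags, key):
    """1-based rank by ascending `key`; tags with equal scores share a rank."""
    scores = {t: key(t) for t in tags}
    return {t: 1 + sum(s < scores[t] for s in scores.values()) for t in tags}


def find_separator(html):
    builder = TagTreeBuilder()
    builder.feed(html)
    builder.close()
    # Assumption: the records sit in the subtree with the most child tags.
    subtree = max(builder.nodes, key=lambda n: len(n.child_tags()))
    counts = Counter(subtree.child_tags())
    tags = list(counts)
    if not tags:
        return None
    rankings = [
        rank(tags, lambda t: -counts[t]),  # most frequent tag first
        rank(tags, lambda t: pstdev(segment_sizes(subtree, t))),  # most regular first
        rank(tags, lambda t: LIKELY_SEPARATORS.index(t)
             if t in LIKELY_SEPARATORS else len(LIKELY_SEPARATORS)),
    ]
    totals = {t: sum(r[t] for r in rankings) for t in tags}
    return min(tags, key=totals.get)  # stand-in consensus: lowest rank sum


SAMPLE = """<html><body><h1>Obituaries</h1><div>
<hr><b>Ann Smith</b><br>Died Monday, aged 81.<br>Services are on Friday at noon.
<hr><b>Bob Jones</b><br>Died Tuesday, aged 77.<br>Services are on Saturday at ten.
<hr><b>Cy Young</b><br>Died Wednesday, aged 90.<br>Services are on Sunday at two.
</div></body></html>"""

print(find_separator(SAMPLE))  # prints: hr
```

In this sample no single stand-in heuristic settles the question: `br` is the most frequent tag, while `hr` and `b` split the text into equally regular chunks. Only the combined ranking selects `hr`, which is the motivation for a consensus step.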

BIBTEX

@inproceedings{DBLP:conf/sigmod/EmbleyJN99,
  author    = {David W. Embley and
               Y. S. Jiang and
               Yiu-Kai Ng},
  editor    = {Alex Delis and
               Christos Faloutsos and
               Shahram Ghandeharizadeh},
  title     = {Record-Boundary Discovery in Web Documents},
  booktitle = {SIGMOD 1999, Proceedings ACM SIGMOD International Conference
               on Management of Data, June 1-3, 1999, Philadelphia, Pennsylvania,
               USA},
  publisher = {ACM Press},
  year      = {1999},
  isbn      = {1-58113-084-8},
  pages     = {467-478},
  crossref  = {DBLP:conf/sigmod/99},
  bibsource = {DBLP, http://dblp.uni-trier.de}
}

Copyright (C) 2000 ACM