Record-Boundary Discovery in Web Documents
D.W. Embley, Brigham Young University, embley@cs.byu.edu
Y.S. Jiang, Brigham Young University, jiang@cs.byu.edu
Y.-K. Ng*, Brigham Young University, ng@cs.byu.edu
Abstract
Extraction of information from unstructured or semistructured
Web documents often requires the recognition and delimitation
of records. (By "record" we mean a group of information
relevant to some entity.) Unless documents that contain multiple
records are first chunked according to record boundaries,
extraction of record information is unlikely to succeed.
In this paper we describe a heuristic approach to discovering
record boundaries in Web documents. In this approach, we
capture the structure of a document as a tree of nested HTML
tags, locate the subtree containing the records of interest,
identify candidate separator tags within the subtree using
five independent heuristics, and select a consensus separator
tag based on a combined heuristic. Our approach is fast (it runs
in linear time for practical cases within the context of the larger
data-extraction problem) and accurate (98% or better in the
experiments we conducted).
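
The pipeline summarized above can be illustrated with a minimal Python sketch. It is not the authors' implementation, and it rests on assumptions the abstract does not state: the subtree of interest is taken to be the node with the most direct children, and a single stand-in ranking (how often each tag occurs among those children) replaces the five heuristics and their combination. The names Node, TagTreeBuilder, build_tag_tree, highest_fanout, and rank_candidates, as well as the sample page, are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser

# Tags that never enclose content, so the builder must not descend into them.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class Node:
    """One node of the tag tree (hypothetical helper type)."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []


class TagTreeBuilder(HTMLParser):
    """Builds a tree of nested tags; text content is ignored."""

    def __init__(self):
        super().__init__()
        self.root = Node("document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        # Walk up to the nearest open element with this name, if there is one,
        # and continue building from its parent.
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent


def build_tag_tree(html):
    builder = TagTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def highest_fanout(root):
    """Assumption: the records live under the node with the most children."""
    best = root
    stack = [root]
    while stack:
        node = stack.pop()
        if len(node.children) > len(best.children):
            best = node
        stack.extend(node.children)
    return best


def rank_candidates(subtree):
    """Stand-in for the combined heuristic: rank child tags by frequency."""
    counts = Counter(child.tag for child in subtree.children)
    return counts.most_common()


# Hypothetical sample page with three records separated by <hr>.
SAMPLE = """
<html><body><h1>Listings</h1>
<div>
<hr><b>Item one</b><br>first description
<hr><b>Item two</b> second description
<hr>Item three, third description
</div>
</body></html>
"""

subtree = highest_fanout(build_tag_tree(SAMPLE))
print(subtree.tag)               # div
print(rank_candidates(subtree))  # [('hr', 3), ('b', 2), ('br', 1)]
```

In this toy page the most frequent child tag, hr, happens to be the record separator. A single frequency count is easily misled on real pages, which is the motivation for seeking a consensus among several independent heuristics.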
Keywords: data extraction, data structuring, unstructured data,
data records, record boundaries, record-boundary discovery,
World-Wide Web.