Record-Boundary Discovery in Web Documents

D.W. Embley       Y.S. Jiang       Y.-K. Ng*
Brigham Young University       Brigham Young University       Brigham Young University
embley@cs.byu.edu       jiang@cs.byu.edu       ng@cs.byu.edu

Abstract

Extracting information from unstructured or semistructured Web documents often requires recognizing and delimiting records. (By "record" we mean a group of information relevant to some entity.) Unless documents that contain multiple records are first chunked according to record boundaries, extraction of record information is unlikely to succeed. In this paper we describe a heuristic approach to discovering record boundaries in Web documents. In this approach, we capture the structure of a document as a tree of nested HTML tags, locate the subtree containing the records of interest, identify candidate separator tags within the subtree using five independent heuristics, and select a consensus separator tag based on a combined heuristic. Our approach is fast (it runs in linear time for practical cases within the context of the larger data-extraction problem) and accurate (98% or better in the experiments we conducted).

Keywords: data extraction, data structuring, unstructured data, data records, record boundaries, record-boundary discovery, World-Wide Web.
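
The pipeline outlined in the abstract (tag tree, subtree of interest, candidate separator tags, consensus choice) can be illustrated with a minimal sketch. Everything specific in it is an assumption made for illustration and is not taken from the abstract: the subtree of interest is taken to be the node with the most children, only two toy heuristics are used in place of the five independent ones, and the consensus is a plain sum of ranks. The names TagTreeBuilder, SEPARATOR_PRIORITY, and guess_separator are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser

# Tags that never enclose content, so the builder does not descend into them.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}

# Illustrative (assumed) ordering of tags that often separate records.
SEPARATOR_PRIORITY = ["hr", "tr", "td", "a", "table", "p", "br",
                      "h4", "h1", "strong", "b", "i"]


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []


class TagTreeBuilder(HTMLParser):
    """Builds a tree of nested HTML tags, ignoring text content."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Climb to the nearest open ancestor with this tag; this implicitly
        # closes unclosed inner tags. Stray end tags are ignored.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent


def iter_nodes(node):
    yield node
    for child in node.children:
        yield from iter_nodes(child)


def rank(scores):
    """Map each tag to the number of tags with a strictly higher score."""
    return {t: sum(1 for s in scores.values() if s > scores[t]) for t in scores}


def priority_index(tag):
    if tag in SEPARATOR_PRIORITY:
        return SEPARATOR_PRIORITY.index(tag)
    return len(SEPARATOR_PRIORITY)


def guess_separator(html):
    """Return a guessed record-separator tag, or None if there are no tags."""
    builder = TagTreeBuilder()
    builder.feed(html)
    builder.close()
    # Assumption: the records live under the node with the highest fan-out.
    subtree = max(iter_nodes(builder.root), key=lambda n: len(n.children))
    counts = Counter(child.tag for child in subtree.children)
    if not counts:
        return None
    by_count = rank(counts)  # toy heuristic 1: frequent child tags rank first
    by_priority = rank({t: -priority_index(t) for t in counts})  # toy heuristic 2
    # Simplified consensus: the tag with the lowest combined rank wins.
    return min(counts, key=lambda t: by_count[t] + by_priority[t])


SAMPLE = """<html><body><h1>Listings</h1><table><tr><td>
<b>Item A</b> first description<hr>
<b>Item B</b> second description<hr>
<b>Item C</b> third description<hr>
</td></tr></table></body></html>"""

if __name__ == "__main__":
    print(guess_separator(SAMPLE))
```

On the sample document the td node has the highest fan-out, its children b and hr occur equally often, and the priority list breaks the tie in favor of hr. A rank sum is only one possible way to combine heuristics; it is used here because it needs no tuning.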