WebView: A Tool for Retrieving Internal Structures and Extracting Information from HTML Documents
Seung-Jin Lim and Yiu-Kai Ng
Session 2A: Document Retrieval
HTML [Rag96, Sei96] is a well-accepted and widely used
language for creating platform-independent documents for the Web,
and HTML documents are semistructured by virtue of the HTML
specification. We propose a tool, called WebView, that constructs the
semistructured data graph (SDG) of an HTML document H to capture the internal
structure of the data embedded in H and in the documents linked to it,
directly or indirectly. On top of the SDG, WebView provides a query
processing capability for evaluating SQL-like queries posed against the SDG,
and hence against the source document(s), to extract information from them.
Existing methods for extracting structured information from HTML documents
with a static internal structure, such as wrappers and integrators for data
warehousing, can benefit from WebView.
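
The idea of building a graph from an HTML document and then querying it can be illustrated with a minimal Python sketch. This is not WebView's actual algorithm or its SDG definition, neither of which is given in this abstract. It assumes the graph is simply the element tree of one document, with text as leaf nodes, and it does not follow links to other documents. The names `Node`, `GraphBuilder`, `build_graph`, `select`, and `text_of` are hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser


# Hypothetical node type; the paper's real SDG structure may differ.
@dataclass
class Node:
    label: str                                    # tag name, or "#text" for a text leaf
    text: str = ""                                # content of a "#text" leaf
    attrs: dict = field(default_factory=dict)     # HTML attributes of an element
    children: list = field(default_factory=list)  # child nodes in document order


# Elements that never have a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class GraphBuilder(HTMLParser):
    """Builds a tree of Node objects from the parser's tag events."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs=dict(attrs))
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest matching open element; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].label == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if data.strip():
            self.stack[-1].children.append(Node("#text", text=data.strip()))


def build_graph(html):
    builder = GraphBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def select(node, label, where=lambda n: True):
    """Rough analogue of: SELECT label FROM graph WHERE predicate."""
    found = [node] if node.label == label and where(node) else []
    for child in node.children:
        found.extend(select(child, label, where))
    return found


def text_of(node):
    """Concatenates all text leaves below a node."""
    if node.label == "#text":
        return node.text
    return " ".join(t for t in (text_of(c) for c in node.children) if t)
```

A query such as "select every anchor that has an `href`" would then read `select(build_graph(page), "a", where=lambda n: "href" in n.attrs)`, where `page` is the HTML source as a string. A real SQL-like front end would parse the query text into such a label and predicate. Here they are passed directly to keep the sketch short.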
Keywords: Knowledge discovery, HTML documents, information
visualization, semistructured data, internet
Copyright (C) 2000 ACM