Digital Symposium Collection 2000  

WebView: A Tool for Retrieving Internal Structures and Extracting Information from HTML Documents

Seung-Jin Lim and Yiu-Kai Ng

Session 2A: Document Retrieval

Abstract

HTML [Rag96, Sei96] is a well-accepted and widely used language for creating platform-independent documents to be posted on the Web, and HTML documents are semistructured in nature according to the HTML specification. We propose a tool, called WebView, that constructs the semistructured data graph (SDG) of an HTML document H to capture the internal structure of the data embedded in H and in the documents linked to it, directly or indirectly. On top of the SDG, WebView provides query processing for evaluating SQL-like queries posed against the SDG, i.e., against the source document(s), in order to extract information from them. Existing methods for extracting structured information from HTML documents with a static internal structure, such as wrappers and integrators for data warehousing, can benefit from WebView.

Keywords: Knowledge discovery, HTML documents, information visualization, semistructured data, Internet
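To make the idea of querying a document's internal structure concrete, the following is a minimal sketch and not WebView's actual SDG construction or query language, neither of which is specified in this abstract. It assumes that a plain element tree built with Python's standard `html.parser` can stand in for the graph, and that a small `select` function can stand in for an SQL-like query. The names `Node`, `GraphBuilder`, `build_graph`, `walk`, and `select` are hypothetical, and linked documents are not followed.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag (partial list, enough for a sketch).
VOID_TAGS = {"area", "base", "br", "col", "hr", "img", "input", "link", "meta"}


class Node:
    """One element of the document; a hypothetical stand-in for an SDG node."""

    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []
        self.text = ""


class GraphBuilder(HTMLParser):
    """Builds a tree of Node objects that mirrors the nesting of HTML tags."""

    def __init__(self):
        super().__init__()
        self.root = Node("document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node  # descend into the new element

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this tag, tolerating
        # unclosed tags; ignore end tags that match nothing.
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        piece = data.strip()
        if piece:
            self.current.text = (self.current.text + " " + piece).strip()


def build_graph(html):
    builder = GraphBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def walk(node):
    """Yield node and all of its descendants in document order."""
    yield node
    for child in node.children:
        yield from walk(child)


def select(root, tag, where=lambda n: True):
    """Roughly: SELECT n FROM graph WHERE n.tag = tag AND where(n)."""
    return [n for n in walk(root) if n.tag == tag and where(n)]


html = (
    "<html><body><h1>Papers</h1><ul>"
    '<li><a href="a.html">WebView</a></li>'
    '<li><a href="b.html">Other</a></li>'
    "</ul></body></html>"
)
graph = build_graph(html)
links = select(graph, "a", where=lambda n: n.attrs.get("href", "").startswith("a"))
print([(n.attrs["href"], n.text) for n in links])
```

The query at the end plays the role of something like "select the anchors whose href starts with a"; a wrapper for a page with a static layout could be written as a handful of such selections over the tree.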

Copyright (C) 2000 ACM