Digital Symposium Collection 2000  

Distributed Hypertext Resource Discovery Through Examples

Soumen Chakrabarti, Martin van den Berg, and Byron Dom

Abstract
We describe the architecture of a hypertext resource discovery system using a relational database. Such a system can answer questions that combine page contents, metadata, and hyperlink structure in powerful ways, such as "find the number of links from an environmental protection page to a page about oil and natural gas over the last year." A key problem in populating the database of such a system is discovering web resources related to the topics involved in such queries. We argue that a keyword-based "find similar" search based on a giant all-purpose crawler is neither necessary nor adequate for resource discovery. Instead, we exploit two properties: pages tend to cite pages on related topics, and, given that a page u cites a page about a desired topic, it is very likely that u cites additional desirable pages. We exploit these properties by using a crawler controlled by two hypertext mining programs: (1) a classifier that evaluates the relevance of a region of the web to the user's interest, and (2) a distiller that evaluates a page as an access point for a large neighborhood of relevant pages. Our implementation uses IBM's Universal Database, not only for robust data storage but also for integrating the computations of the classifier and distiller into the database. This results in a significant increase in I/O efficiency: a factor of ten for the classifier and a factor of three for the distiller. In addition, ad hoc SQL queries can be used to monitor the crawler and dynamically change crawling strategies. We report on experiments to establish that our system is efficient, effective, and robust.
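
The example question quoted in the abstract can be illustrated with a small relational sketch. This is not the authors' schema or code: the table names (page, link), the column names, the sample rows, and the helper functions are all hypothetical, and SQLite stands in for IBM's Universal Database purely so that the sketch is self-contained. It assumes each crawled page has already been assigned a single topic label by some classifier and that each link carries the date on which it was found.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical crawl schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE page (url TEXT PRIMARY KEY, topic TEXT, relevance REAL);
        CREATE TABLE link (src TEXT, dst TEXT, found_on TEXT);
    """)
    # Hypothetical sample data; topic labels would come from a classifier.
    conn.executemany("INSERT INTO page VALUES (?, ?, ?)", [
        ("http://a.example/", "environmental protection", 0.9),
        ("http://b.example/", "oil and natural gas", 0.8),
        ("http://c.example/", "oil and natural gas", 0.7),
        ("http://d.example/", "sports", 0.1),
    ])
    conn.executemany("INSERT INTO link VALUES (?, ?, ?)", [
        ("http://a.example/", "http://b.example/", "1999-03-01"),
        ("http://a.example/", "http://c.example/", "1997-01-15"),
        ("http://a.example/", "http://d.example/", "1999-04-01"),
        ("http://d.example/", "http://b.example/", "1999-05-01"),
    ])
    return conn


def count_topic_links(conn, src_topic, dst_topic, since):
    """Count links from pages on src_topic to pages on dst_topic
    that were found on or after the ISO date string `since`."""
    row = conn.execute("""
        SELECT COUNT(*)
        FROM link
        JOIN page AS s ON s.url = link.src
        JOIN page AS d ON d.url = link.dst
        WHERE s.topic = ? AND d.topic = ? AND link.found_on >= ?
    """, (src_topic, dst_topic, since)).fetchone()
    return row[0]


conn = build_demo_db()
recent = count_topic_links(
    conn, "environmental protection", "oil and natural gas", "1998-09-01")
print(recent)
```

Storing dates as ISO 8601 strings lets the date filter work as a plain string comparison. The same join pattern could, under the same assumptions, serve the ad hoc monitoring queries the abstract mentions, for example by aggregating the relevance column over recently crawled pages.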


BIBTEX

@inproceedings{DBLP:conf/vldb/ChakrabartiBD99,
  author    = {Soumen Chakrabarti and
               Martin van den Berg and
               Byron Dom},
  editor    = {Malcolm P. Atkinson and
               Maria E. Orlowska and
               Patrick Valduriez and
               Stanley B. Zdonik and
               Michael L. Brodie},
  title     = {Distributed Hypertext Resource Discovery Through Examples},
  booktitle = {VLDB'99, Proceedings of 25th International Conference on Very
               Large Data Bases, September 7-10, 1999, Edinburgh, Scotland,
               UK},
  publisher = {Morgan Kaufmann},
  year      = {1999},
  isbn      = {1-55860-615-5},
  pages     = {375-386},
  crossref  = {DBLP:conf/vldb/99},
  bibsource = {DBLP, http://dblp.uni-trier.de}
}


Copyright(C) 2000 ACM