Distributed Hypertext Resource Discovery Through Examples.

Soumen Chakrabarti, Martin van den Berg, Byron Dom: Distributed Hypertext Resource Discovery Through Examples. VLDB 1999: 375-386
  author    = {Soumen Chakrabarti and
               Martin van den Berg and
               Byron Dom},
  editor    = {Malcolm P. Atkinson and
               Maria E. Orlowska and
               Patrick Valduriez and
               Stanley B. Zdonik and
               Michael L. Brodie},
  title     = {Distributed Hypertext Resource Discovery Through Examples},
  booktitle = {VLDB'99, Proceedings of 25th International Conference on Very
               Large Data Bases, September 7-10, 1999, Edinburgh, Scotland,
               UK},
  publisher = {Morgan Kaufmann},
  year      = {1999},
  isbn      = {1-55860-615-7},
  pages     = {375-386},
  ee        = {db/conf/vldb/ChakrabartiBD99.html},
  crossref  = {DBLP:conf/vldb/99},
  bibsource = {DBLP,}


We describe the architecture of a hypertext resource discovery system using a relational database. Such a system can answer questions that combine page contents, meta-data, and hyperlink structure in powerful ways, such as "find the number of links from an environmental protection page to a page about oil and natural gas over the last year." A key problem in populating the database in such a system is to discover web resources related to the topics involved in such queries. We argue that a keyword-based "find similar" search based on a giant all-purpose crawler is neither necessary nor adequate for resource discovery. Instead, we exploit two properties: pages tend to cite pages with related topics, and, given that a page u cites a page about a desired topic, it is very likely that u cites additional desirable pages. We exploit these properties by using a crawler controlled by two hypertext mining programs: (1) a classifier that evaluates the relevance of a region of the web to the user's interest, and (2) a distiller that evaluates a page as an access point for a large neighborhood of relevant pages. Our implementation uses IBM's Universal Database, not only for robust data storage, but also for integrating the computations of the classifier and distiller into the database. This results in a significant increase in I/O efficiency: a factor of ten for the classifier and a factor of three for the distiller. In addition, ad-hoc SQL queries can be used to monitor the crawler and dynamically change crawling strategies. We report on experiments to establish that our system is efficient, effective, and robust.
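
To make the example query in the abstract concrete, the following is a minimal sketch of how such a question might be phrased once pages and links are stored relationally. It is an illustration only, not the schema or code from the paper: the `page` and `link` tables, their columns, the sample rows, and the helper names `build_demo_db` and `count_links` are all assumptions, and SQLite (through Python's standard `sqlite3` module) stands in for IBM's Universal Database.

```python
import sqlite3

# Hypothetical two-table schema (an assumption, not taken from the paper):
# one row per crawled page with the topic a classifier assigned to it, and
# one row per hyperlink with the ISO date on which the crawler first saw it.
SCHEMA = """
CREATE TABLE page (url TEXT PRIMARY KEY, topic TEXT NOT NULL);
CREATE TABLE link (src TEXT NOT NULL, dst TEXT NOT NULL, found TEXT NOT NULL);
"""

# Made-up sample data for the demonstration.
PAGES = [
    ("a.example.org", "environmental protection"),
    ("b.example.org", "environmental protection"),
    ("c.example.com", "oil and natural gas"),
    ("d.example.com", "oil and natural gas"),
    ("e.example.net", "other"),
]
LINKS = [
    ("a.example.org", "c.example.com", "1999-03-01"),
    ("a.example.org", "d.example.com", "1997-05-01"),  # older than the cutoff
    ("b.example.org", "c.example.com", "1999-01-15"),
    ("e.example.net", "c.example.com", "1999-02-01"),  # source is off-topic
    ("a.example.org", "b.example.org", "1999-02-02"),  # target is off-topic
]


def build_demo_db():
    """Create an in-memory database holding the sample pages and links."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO page VALUES (?, ?)", PAGES)
    conn.executemany("INSERT INTO link VALUES (?, ?, ?)", LINKS)
    conn.commit()
    return conn


def count_links(conn, src_topic, dst_topic, since):
    """Count links from src_topic pages to dst_topic pages found on or after `since`."""
    row = conn.execute(
        """
        SELECT COUNT(*)
        FROM link
        JOIN page AS s ON s.url = link.src
        JOIN page AS d ON d.url = link.dst
        WHERE s.topic = ? AND d.topic = ? AND link.found >= ?
        """,
        (src_topic, dst_topic, since),
    ).fetchone()
    return row[0]


if __name__ == "__main__":
    conn = build_demo_db()
    print(count_links(conn, "environmental protection",
                      "oil and natural gas", "1998-09-01"))
```

Storing dates as ISO `YYYY-MM-DD` strings lets a plain string comparison act as the "over the last year" filter in this sketch. A real system would presumably use a native date type and would fill the `topic` column from the classifier's output.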

Copyright © 1999 by the VLDB Endowment. Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by the permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, requires a fee and/or special permission from the Endowment.


Printed Edition

Malcolm P. Atkinson, Maria E. Orlowska, Patrick Valduriez, Stanley B. Zdonik, Michael L. Brodie (Eds.): VLDB'99, Proceedings of 25th International Conference on Very Large Data Bases, September 7-10, 1999, Edinburgh, Scotland, UK. Morgan Kaufmann 1999, ISBN 1-55860-615-7

