Digital Symposium Collection 2000  

Extracting Large-Scale Knowledge Bases from the Web

S. Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan, and Andrew Tomkins

Abstract
The subject of this paper is the creation of knowledge bases by enumerating and organizing all web occurrences of certain subgraphs. We focus on subgraphs that are signatures of web phenomena such as tightly focused topic communities, webrings, taxonomy trees, and keiretsus. For instance, the signature of a webring is a central page with bidirectional links to a number of other pages. We develop novel algorithms for such enumeration problems. A key technical contribution is the development of a model for the evolution of the web graph, based on experimental observations derived from a snapshot of the web. We argue that our algorithms run efficiently in this model, and use the model to explain some statistical phenomena on the web that emerged during our experiments. Finally, we describe the design and implementation of Campfire, a knowledge base of over one hundred thousand web communities.
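
To make the webring signature concrete, here is a minimal Python sketch that scans a small link graph for pages having bidirectional links to several other pages. It is an illustration only: the abstract does not describe the paper's actual enumeration algorithms, and the function name `find_webring_centers`, the `min_members` threshold, and the toy edge list are all assumptions introduced here. A naive in-memory scan like this would not scale to a real web snapshot.

```python
from collections import defaultdict


def find_webring_centers(edges, min_members=3):
    """Return {center: sorted members} for every page that has
    bidirectional links to at least `min_members` other pages.

    `edges` is an iterable of (source, destination) hyperlinks.
    Hypothetical helper: the threshold is an assumption, not a
    value taken from the paper.
    """
    out_links = defaultdict(set)
    for src, dst in edges:
        if src != dst:  # ignore self-links
            out_links[src].add(dst)

    centers = {}
    for page, targets in out_links.items():
        # A member is a page that is linked from `page` and links back to it.
        members = sorted(t for t in targets if page in out_links.get(t, ()))
        if len(members) >= min_members:
            centers[page] = members
    return centers


# Toy graph (made up for illustration): "hub" exchanges links with a, b, c,
# but only links one way to d.
edges = [
    ("hub", "a"), ("a", "hub"),
    ("hub", "b"), ("b", "hub"),
    ("hub", "c"), ("c", "hub"),
    ("hub", "d"),
    ("a", "b"),
]

print(find_webring_centers(edges))
```

On this toy graph only `hub` qualifies, with members `a`, `b`, and `c`; page `d` is excluded because it never links back.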


BIBTEX

@inproceedings{DBLP:conf/vldb/KumarRRT99,
  author    = {S. Ravi Kumar and
               Prabhakar Raghavan and
               Sridhar Rajagopalan and
               Andrew Tomkins},
  editor    = {Malcolm P. Atkinson and
               Maria E. Orlowska and
               Patrick Valduriez and
               Stanley B. Zdonik and
               Michael L. Brodie},
  title     = {Extracting Large-Scale Knowledge Bases from the Web},
  booktitle = {VLDB'99, Proceedings of 25th International Conference on Very
               Large Data Bases, September 7-10, 1999, Edinburgh, Scotland,
               UK},
  publisher = {Morgan Kaufmann},
  year      = {1999},
  isbn      = {1-55860-615-5},
  pages     = {639-650},
  crossref  = {DBLP:conf/vldb/99},
  bibsource = {DBLP, http://dblp.uni-trier.de}
}

Copyright(C) 2000 ACM