Providing Database-like Access to the Web
Using Queries Based on Textual Similarity
William W. Cohen (AT&T Labs - Research)
Most databases contain "name constants" like course numbers,
personal names, and place names that correspond to entities in the
real world. Previous work in integration of heterogeneous databases
has assumed that local name constants can be mapped into an
appropriate global domain by normalization. Here we assume instead
that the names are given in natural language text. We then propose a
logic for database integration called WHIRL which reasons explicitly
about the similarity of local names, as measured using the
vector-space model commonly adopted in statistical information
retrieval. An implemented data integration system based on WHIRL has
successfully integrated information from several dozen Web sites in
two domains.
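
To make the notion of vector-space similarity between names concrete, the following is a minimal sketch in Python, not WHIRL itself. It assumes a simple alphanumeric tokenizer and one common smoothed TF-IDF weighting; the abstract does not specify either, so both are illustrative choices, as are the function names and the example course names.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Assumption: lowercase alphanumeric runs serve as terms.
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(names):
    """Return one unit-length sparse TF-IDF vector (a dict) per name."""
    tokenized = [tokenize(name) for name in names]
    n = len(tokenized)
    # Document frequency: number of names in which each term occurs.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Assumed weighting: (1 + log tf) * log(1 + n / df), a smoothed
        # variant that keeps every weight strictly positive.
        vec = {t: (1 + math.log(c)) * math.log(1 + n / df[t])
               for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()} if norm else {})
    return vectors


def cosine(u, v):
    """Cosine similarity of two unit-length sparse vectors."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(t, 0.0) for t, w in u.items())


# Hypothetical local names for courses drawn from different sources.
names = [
    "Intro to Databases (CS 145)",
    "CS 145: Introduction to Databases",
    "CS 229: Machine Learning",
]
vecs = tfidf_vectors(names)
same_course = cosine(vecs[0], vecs[1])
other_course = cosine(vecs[0], vecs[2])
print(same_course > other_course)
```

Under these assumptions, two differently worded names for the same course score higher than names for different courses, because rare shared terms such as "145" and "databases" carry more weight than the common term "cs". No normalization into a global domain is needed to rank the candidate matches.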
More information is available
on the author's homepage.