Integration of Heterogeneous Databases Without Common Domains
Using Queries Based on Textual Similarity
William W. Cohen (AT&T Labs - Research)
Most databases contain "name constants" like course numbers,
personal names, and place names that correspond to entities in the
real world. Previous work in integration of heterogeneous databases
has assumed that local name constants can be mapped into an
appropriate global domain by normalization. However, in many cases,
this assumption does not hold; determining if two name constants
should be considered identical can require detailed knowledge of the
world, the purpose of the user's query, or both. In this paper, we
reject the assumption that global domains can be easily constructed,
and assume instead that the names are given in natural language text.
We then propose a logic called WHIRL which reasons explicitly about
the similarity of local names, as measured using the vector-space
model commonly adopted in statistical information retrieval. We
describe an efficient implementation of WHIRL and evaluate it
experimentally on data extracted from the World Wide Web. We show
that WHIRL is much faster than naive inference methods, even for short
queries. We also show that inferences made by WHIRL are surprisingly
accurate, equaling the accuracy of hand-coded normalization routines
on one benchmark problem, and outperforming exact matching with a
plausible global domain on a second.
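
To make the similarity measure concrete, here is a minimal sketch of how two name constants could be compared under a vector-space model. It is an illustration only, not WHIRL's inference algorithm: it scores every pair of names exhaustively, which is the kind of naive method the abstract contrasts WHIRL against. The function names (tokenize, tfidf_vectors, cosine, similarity_join), the particular TF-IDF weighting, and the example company names are all assumptions made for this sketch.

```python
import math
from collections import Counter


def tokenize(name):
    # Lowercase the name and split on anything that is not a letter or digit.
    cleaned = "".join(c.lower() if c.isalnum() else " " for c in name)
    return cleaned.split()


def tfidf_vectors(docs):
    # Build unit-length TF-IDF vectors (as dicts) for a list of short texts.
    # Assumed weighting: (1 + log tf) * log(N / df); other variants exist.
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    df = Counter(t for toks in tokenized for t in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vec = {t: (1 + math.log(c)) * math.log(n / df[t]) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()} if norm > 0 else {})
    return vectors


def cosine(u, v):
    # Dot product of two unit-length sparse vectors, i.e. their cosine similarity.
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def similarity_join(left, right, k=3):
    # Naive all-pairs join: score every (left, right) pair, return the k best.
    vecs = tfidf_vectors(left + right)
    lv, rv = vecs[:len(left)], vecs[len(left):]
    pairs = [(cosine(a, b), x, y)
             for x, a in zip(left, lv)
             for y, b in zip(right, rv)]
    pairs.sort(key=lambda p: -p[0])
    return pairs[:k]


if __name__ == "__main__":
    # Hypothetical name columns from two databases with no shared key.
    left = ["General Motors Corporation", "Apple Computer"]
    right = ["General Motors Corp", "Apple Computer Inc", "Motorola Inc"]
    for score, x, y in similarity_join(left, right, k=2):
        print(f"{score:.3f}  {x!r} ~ {y!r}")
```

Because rare tokens receive higher weights than common ones, names that share distinctive words rank above names that share only frequent words, and no hand-written normalization routine is involved.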
More information is available
on the author's homepage.