Enhanced hypertext categorization using hyperlinks
Soumen Chakrabarti (IBM Almaden)
Byron Dom (IBM Almaden)
Piotr Indyk (Stanford University)
A major challenge in indexing unstructured hypertext databases is to
automatically extract meta-data that enables structured search using
topic taxonomies, circumvents keyword ambiguity, and improves the
quality of search and profile-based routing and filtering.
Therefore, an accurate classifier is an essential component of a
hypertext database. Hyperlinks pose new problems
not addressed in the extensive text classification literature.
Links clearly contain high-quality semantic clues that a purely
term-based classifier cannot use, but exploiting link
information is non-trivial because that information is noisy.
Naive use of terms in the link neighborhood of a document
can even degrade accuracy.
Our contribution is to propose robust
statistical models and a relaxation labeling
technique for better classification
by exploiting link information in a small neighborhood around
documents. Our technique also
adapts gracefully to the fraction of neighboring documents having
known topics.
We experimented with pre-classified samples from Yahoo! and the
US Patent Database. In previous work, we developed a text
classifier that misclassified only 13% of the documents in the
well-known Reuters benchmark; this was comparable to the best
results ever obtained.
This classifier misclassified 36% of the patents,
indicating that classifying hypertext can be more difficult
than classifying text. Naively using terms in neighboring documents
increased error to 38%; our hypertext classifier reduced it
to 21%. Results with the Yahoo! sample were
more dramatic: the text classifier showed 68% error,
whereas our hypertext classifier reduced this to only 21%.
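
To make the relaxation labeling idea above more concrete, here is a minimal sketch in Python. It is not the statistical model used in this work; it only illustrates the general pattern of starting from text-based class probabilities and repeatedly revising each document's estimate using the current estimates of its linked neighbors. The function name `relaxation_labeling`, the `compat` class-compatibility matrix, and the multiplicative update rule are all assumptions made for illustration.

```python
from typing import Dict, List


def relaxation_labeling(
    text_probs: Dict[str, List[float]],
    links: Dict[str, List[str]],
    known: Dict[str, int],
    compat: List[List[float]],
    iterations: int = 10,
) -> Dict[str, List[float]]:
    """Illustrative relaxation labeling (hypothetical, simplified).

    text_probs: per-document class probabilities from a text-only classifier;
                every document, including neighbors, must appear here.
    links:      document -> list of neighboring documents.
    known:      documents whose class is already known (may be empty).
    compat:     compat[c][k] is an assumed weight for a class-c document
                being linked to a class-k document.
    """
    num_classes = len(compat)

    def one_hot(c: int) -> List[float]:
        return [1.0 if k == c else 0.0 for k in range(num_classes)]

    # Known documents start (and stay) fixed; others start from text evidence.
    probs = {
        doc: one_hot(known[doc]) if doc in known else list(p)
        for doc, p in text_probs.items()
    }

    for _ in range(iterations):
        updated = {}
        for doc, prior in text_probs.items():
            if doc in known:
                updated[doc] = probs[doc]
                continue
            scores = list(prior)
            for nb in links.get(doc, []):
                nb_probs = probs[nb]
                for c in range(num_classes):
                    # Expected compatibility of class c with the neighbor's
                    # current class distribution.
                    scores[c] *= sum(
                        compat[c][k] * nb_probs[k] for k in range(num_classes)
                    )
            total = sum(scores)
            updated[doc] = (
                [s / total for s in scores] if total > 0 else list(prior)
            )
        probs = updated
    return probs


# Toy usage: text alone slightly favors class 1 for "a",
# but both of its neighbors are known to be class 0.
result = relaxation_labeling(
    text_probs={"a": [0.45, 0.55], "b": [0.5, 0.5], "c": [0.5, 0.5]},
    links={"a": ["b", "c"], "b": ["a"], "c": ["a"]},
    known={"b": 0, "c": 0},
    compat=[[0.9, 0.1], [0.1, 0.9]],
)
```

In this toy run the neighbors' known labels outweigh the weak text evidence for document "a". Passing an empty `known` dictionary leaves every document to be estimated jointly, which is one simple way to picture how such a scheme can cope with any fraction of neighbors having known topics.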