@inproceedings{DBLP:conf/sigir/SchutzeHP95,
  author    = {Hinrich Sch{\"u}tze and David A. Hull and Jan O. Pedersen},
  editor    = {Edward A. Fox and Peter Ingwersen and Raya Fidel},
  title     = {A Comparison of Classifiers and Document Representations for the Routing Problem},
  booktitle = {SIGIR'95, Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. Seattle, Washington, USA, July 9-13, 1995 (Special Issue of the SIGIR Forum)},
  publisher = {ACM Press},
  year      = {1995},
  isbn      = {0-89791-714-6},
  pages     = {229-237},
  ee        = {db/conf/sigir/SchutzeHP95.html},
  crossref  = {DBLP:conf/sigir/95},
  bibsource = {DBLP, http://dblp.uni-trier.de}
}
In this paper, we compare learning techniques based on statistical classification to traditional methods of relevance feedback for the document routing problem. We consider three classification techniques which have decision rules that are derived via explicit error minimization: linear discriminant analysis, logistic regression, and neural networks. We demonstrate that the classifiers perform 10-15% better than relevance feedback via Rocchio expansion for the TREC-2 and TREC-3 routing tasks.
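For readers unfamiliar with the baseline, the following is a minimal sketch of Rocchio query expansion in its textbook form. It is not the paper's implementation: the function name `rocchio`, the sparse dictionary representation, and the default weights `alpha`, `beta`, and `gamma` are all assumptions chosen for illustration.

```python
from collections import defaultdict


def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Return an expanded query as a dict mapping term -> weight.

    query:       dict term -> weight for the original query
    relevant:    list of dicts (term -> weight), judged relevant documents
    nonrelevant: list of dicts (term -> weight), judged non-relevant documents
    alpha, beta, gamma: illustrative defaults, not values from the paper
    """
    expanded = defaultdict(float)
    for term, weight in query.items():
        expanded[term] += alpha * weight
    # Add the relevant centroid and subtract the non-relevant centroid.
    for docs, coeff in ((relevant, beta), (nonrelevant, -gamma)):
        if not docs:
            continue
        for doc in docs:
            for term, weight in doc.items():
                expanded[term] += coeff * weight / len(docs)
    # Terms whose weight is not positive are dropped (a common convention).
    return {term: w for term, w in expanded.items() if w > 0}


query = {"routing": 1.0}
relevant = [{"routing": 1.0, "classifier": 2.0}, {"classifier": 2.0}]
nonrelevant = [{"sports": 1.0}]
expanded = rocchio(query, relevant, nonrelevant)
```

Unlike the classifiers compared in the paper, this update has no explicit error criterion: it shifts the query toward the centroid of the relevant documents and away from the centroid of the non-relevant ones.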
Error minimization is difficult in high-dimensional feature spaces because the convergence process is slow and the models are prone to overfitting. We use two different strategies, latent semantic indexing and optimal term selection, to reduce the number of features. Our results indicate that features based on latent semantic indexing are more effective for techniques such as linear discriminant analysis and logistic regression, which have no way to protect against overfitting. Neural networks perform equally well with either set of features and can take advantage of the additional information available when both feature sets are used as input.
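The abstract does not say which criterion "optimal term selection" uses. As one plausible illustration of selecting a small feature set, the sketch below ranks terms by a chi-square statistic over a 2x2 term/relevance contingency table. The choice of chi-square and the helper names `chi_square` and `select_terms` are assumptions for this example, not details taken from the abstract.

```python
def chi_square(a, b, c, d):
    """Chi-square statistic for a 2x2 table.

    a: relevant docs containing the term      b: non-relevant docs containing it
    c: relevant docs without the term         d: non-relevant docs without it
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom


def select_terms(docs, labels, k):
    """Return the k terms with the highest chi-square score.

    docs:   list of sets of terms
    labels: list of 0/1 relevance judgments, aligned with docs
    Ties are broken alphabetically so the result is deterministic.
    """
    n_rel = sum(labels)
    n_non = len(labels) - n_rel
    vocab = set().union(*docs)
    scores = {}
    for term in vocab:
        a = sum(1 for doc, y in zip(docs, labels) if y and term in doc)
        b = sum(1 for doc, y in zip(docs, labels) if not y and term in doc)
        scores[term] = chi_square(a, b, n_rel - a, n_non - b)
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


docs = [
    {"trade", "tariff"},
    {"trade", "export"},
    {"football", "trade"},
    {"football", "score"},
]
labels = [1, 1, 0, 0]
top = select_terms(docs, labels, 1)
```

The statistic is symmetric, so a term that strongly signals non-relevance scores as highly as one that signals relevance. Either kind is informative for a classifier, which is why a reduced term set of this sort can serve as input to the models described above.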
Copyright © 1995 by the ACM, Inc., used by permission. Permission to make digital or hard copies is granted provided that copies are not made or distributed for profit or direct commercial advantage, and that copies show this notice on the first page or initial screen of a display along with the full citation.