Digital Symposium Collection 2000  

 
 
 
 
 
 

 





















Similarity Search in High Dimensions via Hashing

Aristides Gionis, Piotr Indyk, and Rajeev Motwani

  View Paper (PDF)  

Return to High-Dimensional Queries

Abstract
The nearest- or near-neighbor query problems arise in a large variety of database applications, usually in the context of similarity searching. Of late, there has been increasing interest in building search/index structures for performing similarity search over high-dimensional data, e.g., image databases, document collections, time-series databases, and genome databases. Unfortunately, all known techniques for solving this problem fall prey to the "curse of dimensionality." That is, the data structures scale poorly with data dimensionality; in fact, if the number of dimensions exceeds 10 to 20, searching in k-d trees and related structures involves the inspection of a large fraction of the database, thereby doing no better than brute-force linear search. It has been suggested that since the selection of features and the choice of a distance metric in typical applications is rather heuristic, determining an approximate nearest neighbor should suffice for most practical purposes. In this paper, we examine a novel scheme for approximate similarity search based on hashing. The basic idea is to hash the points from the database so as to ensure that the probability of collision is much higher for objects that are close to each other than for those that are far apart. We provide experimental evidence that our method gives significant improvement in running time over other methods for searching in high-dimensional spaces based on hierarchical tree decomposition. Experimental results also indicate that our scheme scales well even for a relatively large number of dimensions (more than 50).


References

Note: References link to DBLP on the Web.

[1]
...
[2]
Sunil Arya , David M. Mount , Nathan S. Netanyahu , Ruth Silverman , Angela Y. Wu : An Optimal Algorithm for Approximate Nearest Neighbor Searching. SODA 1994 : 573-582
[3]
Jon Louis Bentley : Multidimensional Binary Search Trees Used for Associative Searching. CACM 18(9) : 509-517(1975)
[4]
Stefan Berchtold , Daniel A. Keim : High-Dimensional Index Structures, Database Support for Next Decade's Applications (Tutorial). SIGMOD Conference 1998 : 501
[5]
...
[6]
Timothy M. Chan : Approximate Nearest Neighbor Queries Revisited. Symposium on Computational Geometry 1997 : 352-358
[7]
Scott Cost , Steven Salzberg : A Weighted Nearest Neighbor Algorithm for Learning with Symbolic Features. Machine Learning 10 : 57-78(1993)
[8]
Surajit Chaudhuri , Rajeev Motwani , Vivek R. Narasayya : Random Sampling for Histogram Construction: How much is enough? SIGMOD Conference 1998 : 436-447
[9]
...
[10]
Paolo Ciaccia , Marco Patella , Pavel Zezula : A Cost Model for Similarity Queries in Metric Spaces. PODS 1998 : 59-68
[11]
Scott C. Deerwester , Susan T. Dumais , Thomas K. Landauer , George W. Furnas , Richard A. Harshman : Indexing by Latent Semantic Analysis. JASIS 41(6) : 391-407(1990)
[12]
...
[13]
...
[14]
...
[15]
Christos Faloutsos , Ron Barber , Myron Flickner , Jim Hafner , Wayne Niblack , Dragutin Petkovic , William Equitz : Efficient and Effective Querying by Image Content. JIIS 3(3/4) : 231-262(1994)
[16]
...
[17]
Myron Flickner , Harpreet S. Sawhney , Jonathan Ashley , Qian Huang , Byron Dom , Monika Gorkani , Jim Hafner , Denis Lee , Dragutin Petkovic , David Steele , Peter Yanker : Query by Image and Video Content: The QBIC System. IEEE Computer 28(9) : 23-32(1995)
[18]
Jerome H. Friedman , Jon Louis Bentley , Raphael A. Finkel : An Algorithm for Finding Best Matches in Logarithmic Expected Time. TOMS 3(3) : 209-226(1977)
[19]
...
[20]
...
[21]
Trevor Hastie , Robert Tibshirani : Discriminant Adaptive Nearest Neighbor Classification. KDD 1995 : 142-149
[22]
...
[23]
...
[24]
Piotr Indyk , Rajeev Motwani : Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality. STOC 1998 : 604-613
[25]
Kothuri Venkata Ravi Kanth , Divyakant Agrawal , Ambuj K. Singh : Dimensionality Reduction for Similarity Searching in Dynamic Databases. SIGMOD Conference 1998 : 166-176
[26]
...
[27]
...
[28]
Norio Katayama , Shin'ichi Satoh : The SR-tree: An Index Structure for High-Dimensional Nearest Neighbor Queries. SIGMOD Conference 1997 : 369-380
[29]
Nathan Linial , Eran London , Yuri Rabinovich : The geometry of graphs and some of its algorithmic applications. FOCS 1994 : 577-591
[30]
...
[31]
...
[32]
...
[33]
Gurmeet Singh Manku , Sridhar Rajagopalan , Bruce G. Lindsay : Approximate Medians and other Quantiles in One Pass and with Limited Memory. SIGMOD Conference 1998 : 426-435
[34]
Yossi Matias , Jeffrey Scott Vitter , Min Wang : Wavelet-Based Histograms for Selectivity Estimation. SIGMOD Conference 1998 : 448-459
[35]
Rajeev Motwani , Prabhakar Raghavan : Randomized Algorithms. Cambridge University Press 1995, ISBN 0-521-47465-5
[36]
...
[37]
Alex Pentland , Rosalind W. Picard , S. Scarloff : Photobook: Tools for Content-Based Manipulation of Image Databases. Storage and Retrieval for Image and Video Databases (SPIE) 1994 : 34-47
[38]
Gerard Salton , Michael McGill : Introduction to Modern Information Retrieval. McGraw-Hill Book Company 1984, ISBN 0-07-054484-0
[39]
Hanan Samet : The Design and Analysis of Spatial Data Structures. Addison-Wesley 1990
[40]
...
[41]
Timos K. Sellis , Nick Roussopoulos , Christos Faloutsos : Multidimensional Access Methods: Trees Have Grown Everywhere. VLDB 1997 : 13-14
[42]
...
[43]
John R. Smith , Shih-Fu Chang : Visually Searching the Web for Content. IEEE Multimedia 4(3) : 12-20(1997)
[44]
...
[45]
Roger Weber , Hans-Jörg Schek , Stephen Blott : A Quantitative Analysis and Performance Study for Similarity-Search Methods in High-Dimensional Spaces. VLDB 1998 : 194-205

BIBTEX

@inproceedings{DBLP:conf/vldb/GionisIM99,
  author    = {Aristides Gionis and
                Piotr Indyk and
                Rajeev Motwani},
   editor    = {Malcolm P. Atkinson and
                Maria E. Orlowska and
                Patrick Valduriez and
                Stanley B. Zdonik and
                Michael L. Brodie},
   title     = {Similarity Search in High Dimensions via Hashing},
   booktitle = {VLDB'99, Proceedings of 25th International Conference on Very
                Large Data Bases, September 7-10, 1999, Edinburgh, Scotland,
                UK},
   publisher = {Morgan Kaufmann},
   year      = {1999},
   isbn      = {1-55860-615-5},
   pages     = {518-529},
   crossref  = {DBLP:conf/vldb/99},
   bibsource = {DBLP, http://dblp.uni-trier.de} } },


























Copyright(C) 2000 ACM