CURE: An Efficient Clustering Algorithm for Large Databases.

Sudipto Guha, Rajeev Rastogi, Kyuseok Shim: CURE: An Efficient Clustering Algorithm for Large Databases. SIGMOD Conference 1998: 73-84

@inproceedings{DBLP:conf/sigmod/GuhaRS98,
  author    = {Sudipto Guha and
               Rajeev Rastogi and
               Kyuseok Shim},
  editor    = {Laura M. Haas and
               Ashutosh Tiwary},
  title     = {CURE: An Efficient Clustering Algorithm for Large Databases},
  booktitle = {SIGMOD 1998, Proceedings ACM SIGMOD International Conference
               on Management of Data, June 2-4, 1998, Seattle, Washington, USA},
  publisher = {ACM Press},
  year      = {1998},
  isbn      = {0-89791-995-5},
  pages     = {73-84},
  ee        = {http://doi.acm.org/10.1145/276304.276312, db/conf/sigmod/GuhaRS98.html},
  crossref  = {DBLP:conf/sigmod/98},
  bibsource = {DBLP, http://dblp.uni-trier.de}
}

BibTeX

Abstract

Clustering, in data mining, is useful for discovering groups and identifying interesting distributions in the underlying data. Traditional clustering algorithms either favor clusters with spherical shapes and similar sizes, or are very fragile in the presence of outliers. We propose a new clustering algorithm called CURE that is more robust to outliers, and identifies clusters having non-spherical shapes and wide variances in size. CURE achieves this by representing each cluster by a certain fixed number of points that are generated by selecting well scattered points from the cluster and then shrinking them toward the center of the cluster by a specified fraction. Having more than one representative point per cluster allows CURE to adjust well to the geometry of non-spherical shapes and the shrinking helps to dampen the effects of outliers. To handle large databases, CURE employs a combination of random sampling and partitioning. A random sample drawn from the data set is first partitioned and each partition is partially clustered. The partial clusters are then clustered in a second pass to yield the desired clusters. Our experimental results confirm that the quality of clusters produced by CURE is much better than those found by existing algorithms. Furthermore, they demonstrate that random sampling and partitioning enable CURE to not only outperform existing algorithms but also to scale well for large databases without sacrificing clustering quality.

Copyright © 1998 by the ACM, Inc., used by permission. Permission to make digital or hard copies is granted provided that copies are not made or distributed for profit or direct commercial advantage, and that copies show this notice on the first page or initial screen of a display along with the full citation.

ACM SIGMOD DiSC

CDROM Version: Load the CDROM "DiSC, Volume 1 Number 1" and ...

Windows: Click the letter of your CD drive
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
Mac: Click here
UNIX/LINUX: mount the CD and click on the path of your mount point:
/Anthology/Disc99_1 or /cdrom

Online Version (ACM WWW Account required): Full Text in PDF Format

ACM SIGMOD Anthology

DVD Version: Load ACM SIGMOD Anthology DVD 1" and ...

Windows: Click the letter of your CD drive
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
Mac: Click here
UNIX/LINUX: mount the DVD and click on the path of your mount point:
/Anthology/aDVD1 or /dvd

BibTeX

Printed Edition

Laura M. Haas, Ashutosh Tiwary (Eds.): SIGMOD 1998, Proceedings ACM SIGMOD International Conference on Management of Data, June 2-4, 1998, Seattle, Washington, USA. ACM Press 1998, ISBN 0-89791-995-5 BibTeX , SIGMOD Record 27(2), June 1998
Contents

Online Edition: ACM SIGMOD

[Abstract]
[Full Text (Postscript)]

References

[BKSS90]: Norbert Beckmann, Hans-Peter Kriegel, Ralf Schneider, Bernhard Seeger: The R*-Tree: An Efficient and Robust Access Method for Points and Rectangles. SIGMOD Conference 1990: 322-331 BibTeX
[CLR90]: Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest: Introduction to Algorithms. The MIT Press and McGraw-Hill Book Company 1989, ISBN 0-262-03141-8,0-07-013143-0
BibTeX
[EKSX96]: Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu: A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. KDD 1996: 226-231 BibTeX
[EKX95]: Martin Ester, Hans-Peter Kriegel, Xiaowei Xu: A Database Interface for Clustering in Large Spatial Databases. KDD 1995: 94-99 BibTeX
[GRS97]: ...
[JD88]: Anil K. Jain, Richard C. Dubes: Algorithms for Clustering Data. Prentice-Hall 1988
BibTeX
[MR95]: Rajeev Motwani, Prabhakar Raghavan: Randomized Algorithms. Cambridge University Press 1995, ISBN 0-521-47465-5
BibTeX
[NH94]: Raymond T. Ng, Jiawei Han: Efficient and Effective Clustering Methods for Spatial Data Mining. VLDB 1994: 144-155 BibTeX
[Ols93]: ...
[Sam89]: ...
[Sam90]: Hanan Samet: The Design and Analysis of Spatial Data Structures. Addison-Wesley 1990
BibTeX
[SRF87]: Timos K. Sellis, Nick Roussopoulos, Christos Faloutsos: The R+-Tree: A Dynamic Index for Multi-Dimensional Objects. VLDB 1987: 507-518 BibTeX
[Vit85]: Jeffrey Scott Vitter: Random Sampling with a Reservoir. ACM Trans. Math. Softw. 11(1): 37-57(1985) BibTeX
[ZRL96]: Tian Zhang, Raghu Ramakrishnan, Miron Livny: BIRCH: An Efficient Data Clustering Method for Very Large Databases. SIGMOD Conference 1996: 103-114 BibTeX

Referenced by

Kaushik Chakrabarti, Sharad Mehrotra: Local Dimensionality Reduction: A New Approach to Indexing High Dimensional Spaces. VLDB 2000: 89-100
Erik Riedel, Christos Faloutsos, Gregory R. Ganger, David Nagle: Data Mining on an OLTP System (Nearly) for Free. SIGMOD Conference 2000: 13-21
Sridhar Ramaswamy, Rajeev Rastogi, Kyuseok Shim: Efficient Algorithms for Mining Outliers from Large Data Sets. SIGMOD Conference 2000: 427-438
Christopher R. Palmer, Christos Faloutsos: Density Biased Sampling: An Improved Method for Data Mining and Clustering. SIGMOD Conference 2000: 82-92
Charu C. Aggarwal, Philip S. Yu: Finding Generalized Projected Clusters In High Dimensional Spaces. SIGMOD Conference 2000: 70-81
Minos N. Garofalakis, Rajeev Rastogi, S. Seshadri, Kyuseok Shim: Data Mining and the Web: Past, Present and Future. Workshop on Web Information and Data Management 1999: 43-47
H. V. Jagadish, J. Madar, Raymond T. Ng: Semantic Compression and Pattern Extraction with Fascicles. VLDB 1999: 186-198
Ju-Hong Lee, Deok-Hwan Kim, Chin-Wan Chung: Multi-dimensional Selectivity Estimation Using Compressed Histogram Information. SIGMOD Conference 1999: 205-214
Mihael Ankerst, Markus M. Breunig, Hans-Peter Kriegel, Jörg Sander: OPTICS: Ordering Points To Identify the Clustering Structure. SIGMOD Conference 1999: 49-60
Charu C. Aggarwal, Cecilia Magdalena Procopiuc, Joel L. Wolf, Philip S. Yu, Jong Soo Park: Fast Algorithms for Projected Clustering. SIGMOD Conference 1999: 61-72
Venkatesh Ganti, Johannes Gehrke, Raghu Ramakrishnan: A Framework for Measuring Changes in Data Characteristics. PODS 1999: 126-137
Davood Rafiei: On Similarity-Based Queries for Time Series Data. ICDE 1999: 410-417
Sudipto Guha, Rajeev Rastogi, Kyuseok Shim: ROCK: A Robust Clustering Algorithm for Categorical Attributes. ICDE 1999: 512-521
Venkatesh Ganti, Raghu Ramakrishnan, Johannes Gehrke, Allison L. Powell, James C. French: Clustering Large Datasets in Arbitrary Metric Spaces. ICDE 1999: 502-511
Tadeusz Morzy, Marek Wojciechowski, Maciej Zakrzewicz: Pattern-Oriented Hierachical Clustering. ADBIS 1999: 179-190

BibTeX

ACM SIGMOD Anthology - DBLP: [Home | Search: Author, Title | Conferences | Journals]