CURE: An Efficient Clustering Algorithm for Large Databases.
Sudipto Guha, Rajeev Rastogi, Kyuseok Shim:
CURE: An Efficient Clustering Algorithm for Large Databases.
SIGMOD Conference 1998: 73-84@inproceedings{DBLP:conf/sigmod/GuhaRS98,
author = {Sudipto Guha and
Rajeev Rastogi and
Kyuseok Shim},
editor = {Laura M. Haas and
Ashutosh Tiwary},
title = {CURE: An Efficient Clustering Algorithm for Large Databases},
booktitle = {SIGMOD 1998, Proceedings ACM SIGMOD International Conference
on Management of Data, June 2-4, 1998, Seattle, Washington, USA},
publisher = {ACM Press},
year = {1998},
isbn = {0-89791-995-5},
pages = {73-84},
ee = {http://doi.acm.org/10.1145/276304.276312, db/conf/sigmod/GuhaRS98.html},
crossref = {DBLP:conf/sigmod/98},
bibsource = {DBLP, http://dblp.uni-trier.de}
}
BibTeX
Abstract
Clustering, in data mining, is useful for discovering groups and
identifying interesting distributions in the underlying data.
Traditional clustering algorithms either favor clusters with spherical
shapes and similar sizes, or are very fragile in the presence of
outliers. We propose a new clustering algorithm called CURE that is
more robust to outliers, and identifies clusters having non-spherical
shapes and wide variances in size. CURE achieves this by representing
each cluster by a certain fixed number of points that are generated by
selecting well scattered points from the cluster and then shrinking
them toward the center of the cluster by a specified fraction.
Having more than one representative point per cluster allows CURE to
adjust well to the geometry of non-spherical shapes and the shrinking
helps to dampen the effects of outliers. To handle large databases,
CURE employs a combination of random sampling and partitioning. A random
sample drawn from the data set is first partitioned and each partition
is partially clustered. The partial clusters are then clustered in
a second pass to yield the desired clusters. Our experimental results
confirm that the quality of clusters produced by CURE is much better
than those found by existing algorithms. Furthermore, they demonstrate
that random sampling and partitioning enable CURE to not only outperform
existing algorithms but also to scale well for large databases without
sacrificing clustering quality.
Copyright © 1998 by the ACM,
Inc., used by permission. Permission to make
digital or hard copies is granted provided that
copies are not made or distributed for profit or
direct commercial advantage, and that copies show
this notice on the first page or initial screen of
a display along with the full citation.
CDROM Version: Load the CDROM "DiSC, Volume 1 Number 1" and ...
Online Version (ACM WWW Account required): Full Text in PDF Format
DVD Version: Load ACM SIGMOD Anthology DVD 1" and ...
BibTeX
Printed Edition
Laura M. Haas, Ashutosh Tiwary (Eds.):
SIGMOD 1998, Proceedings ACM SIGMOD International Conference on Management of Data, June 2-4, 1998, Seattle, Washington, USA.
ACM Press 1998, ISBN 0-89791-995-5 BibTeX
,
SIGMOD Record 27(2),
June 1998
Contents
[Abstract]
[Full Text (Postscript)]
References
- [BKSS90]
- Norbert Beckmann, Hans-Peter Kriegel, Ralf Schneider, Bernhard Seeger:
The R*-Tree: An Efficient and Robust Access Method for Points and Rectangles.
SIGMOD Conference 1990: 322-331 BibTeX
- [CLR90]
- Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest:
Introduction to Algorithms.
The MIT Press and McGraw-Hill Book Company 1989, ISBN 0-262-03141-8,0-07-013143-0
BibTeX
- [EKSX96]
- Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu:
A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise.
KDD 1996: 226-231 BibTeX
- [EKX95]
- Martin Ester, Hans-Peter Kriegel, Xiaowei Xu:
A Database Interface for Clustering in Large Spatial Databases.
KDD 1995: 94-99 BibTeX
- [GRS97]
- ...
- [JD88]
- Anil K. Jain, Richard C. Dubes:
Algorithms for Clustering Data.
Prentice-Hall 1988
BibTeX
- [MR95]
- Rajeev Motwani, Prabhakar Raghavan:
Randomized Algorithms.
Cambridge University Press 1995, ISBN 0-521-47465-5
BibTeX
- [NH94]
- Raymond T. Ng, Jiawei Han:
Efficient and Effective Clustering Methods for Spatial Data Mining.
VLDB 1994: 144-155 BibTeX
- [Ols93]
- ...
- [Sam89]
- ...
- [Sam90]
- Hanan Samet:
The Design and Analysis of Spatial Data Structures.
Addison-Wesley 1990
BibTeX
- [SRF87]
- Timos K. Sellis, Nick Roussopoulos, Christos Faloutsos:
The R+-Tree: A Dynamic Index for Multi-Dimensional Objects.
VLDB 1987: 507-518 BibTeX
- [Vit85]
- Jeffrey Scott Vitter:
Random Sampling with a Reservoir.
ACM Trans. Math. Softw. 11(1): 37-57(1985) BibTeX
- [ZRL96]
- Tian Zhang, Raghu Ramakrishnan, Miron Livny:
BIRCH: An Efficient Data Clustering Method for Very Large Databases.
SIGMOD Conference 1996: 103-114 BibTeX
Referenced by
- Kaushik Chakrabarti, Sharad Mehrotra:
Local Dimensionality Reduction: A New Approach to Indexing High Dimensional Spaces.
VLDB 2000: 89-100
- Erik Riedel, Christos Faloutsos, Gregory R. Ganger, David Nagle:
Data Mining on an OLTP System (Nearly) for Free.
SIGMOD Conference 2000: 13-21
- Sridhar Ramaswamy, Rajeev Rastogi, Kyuseok Shim:
Efficient Algorithms for Mining Outliers from Large Data Sets.
SIGMOD Conference 2000: 427-438
- Christopher R. Palmer, Christos Faloutsos:
Density Biased Sampling: An Improved Method for Data Mining and Clustering.
SIGMOD Conference 2000: 82-92
- Charu C. Aggarwal, Philip S. Yu:
Finding Generalized Projected Clusters In High Dimensional Spaces.
SIGMOD Conference 2000: 70-81
- Minos N. Garofalakis, Rajeev Rastogi, S. Seshadri, Kyuseok Shim:
Data Mining and the Web: Past, Present and Future.
Workshop on Web Information and Data Management 1999: 43-47
- H. V. Jagadish, J. Madar, Raymond T. Ng:
Semantic Compression and Pattern Extraction with Fascicles.
VLDB 1999: 186-198
- Ju-Hong Lee, Deok-Hwan Kim, Chin-Wan Chung:
Multi-dimensional Selectivity Estimation Using Compressed Histogram Information.
SIGMOD Conference 1999: 205-214
- Mihael Ankerst, Markus M. Breunig, Hans-Peter Kriegel, Jörg Sander:
OPTICS: Ordering Points To Identify the Clustering Structure.
SIGMOD Conference 1999: 49-60
- Charu C. Aggarwal, Cecilia Magdalena Procopiuc, Joel L. Wolf, Philip S. Yu, Jong Soo Park:
Fast Algorithms for Projected Clustering.
SIGMOD Conference 1999: 61-72
- Venkatesh Ganti, Johannes Gehrke, Raghu Ramakrishnan:
A Framework for Measuring Changes in Data Characteristics.
PODS 1999: 126-137
- Davood Rafiei:
On Similarity-Based Queries for Time Series Data.
ICDE 1999: 410-417
- Sudipto Guha, Rajeev Rastogi, Kyuseok Shim:
ROCK: A Robust Clustering Algorithm for Categorical Attributes.
ICDE 1999: 512-521
- Venkatesh Ganti, Raghu Ramakrishnan, Johannes Gehrke, Allison L. Powell, James C. French:
Clustering Large Datasets in Arbitrary Metric Spaces.
ICDE 1999: 502-511
- Tadeusz Morzy, Marek Wojciechowski, Maciej Zakrzewicz:
Pattern-Oriented Hierachical Clustering.
ADBIS 1999: 179-190
BibTeX
ACM SIGMOD Anthology - DBLP:
[Home | Search: Author, Title | Conferences | Journals]
ACM SIGMOD Anthology: Copyright © by ACM (info@acm.org), Corrections: anthology@acm.org
DBLP: Copyright © by Michael Ley (ley@uni-trier.de), last change: Sat May 16 23:40:42 2009