CURE: An Efficient Clustering Algorithm for Large Databases
Sudipto Guha (Stanford University)
Rajeev Rastogi (Bell Laboratories)
Kyuseok Shim (Bell Laboratories)
Abstract
Clustering, in data mining, is useful for discovering groups and
identifying interesting distributions in the underlying data.
Traditional clustering algorithms either favor clusters with spherical
shapes and similar sizes, or are very fragile in the presence of
outliers. We propose a new clustering algorithm called CURE that is
more robust to outliers, and identifies clusters having non-spherical
shapes and wide variances in size. CURE achieves this by representing
each cluster by a certain fixed number of points that are generated by
selecting well scattered points from the cluster and then shrinking
them toward the center of the cluster by a specified fraction.
Having more than one representative point per cluster allows CURE to
adjust well to the geometry of non-spherical shapes and the shrinking
helps to dampen the effects of outliers. To handle large databases,
CURE employs a combination of random sampling and partitioning. A random
sample drawn from the data set is first partitioned and each partition
is partially clustered. The partial clusters are then clustered in
a second pass to yield the desired clusters. Our experimental results
confirm that the quality of clusters produced by CURE is much better
than those found by existing algorithms. Furthermore, they demonstrate
that random sampling and partitioning enable CURE to not only outperform
existing algorithms but also to scale well for large databases without
sacrificing clustering quality.