DiSC - Fast Algorithms for Projected Clustering

Digital Symposium Collection 2000

Fast Algorithms for Projected Clustering

Charu C. Aggarwal, Cecilia Magdalena Procopiuc, Joel L. Wolf, Philip S. Yu, and Jong Soo Park
View Paper (PDF)

Return to Clustering

Abstract

The clustering problem is well known in the database literature for its numerous applications in problems such as customer segmentation, classification and trend analysis. Unfortunately, all known algorithms tend to break down in high dimensional spaces because of the inherent sparsity of the points. In such high dimensional spaces not all dimensions may be relevant to a given cluster. One way of handling this is to pick the closely correlated dimensions and find clusters in the corresponding subspace. Traditional feature selection algorithms attempt to achieve this. The weakness of this approach is that in typical high dimensional data mining applications different sets of points may cluster better for different subsets of dimensions. The number of dimensions in each such cluster-specific subspace may also vary. Hence, it may be impossible to find a single small subset of dimensions for all the clusters. We therefore discuss a generalization of the clustering problem, referred to as the projected clustering problem, in which the subsets of dimensions selected are specific to the clusters themselves. We develop an algorithmic framework for solving the projected clustering problem, and test its performance on synthetic data.

References

Note: References link to DBLP on the Web.

[1]: Rakesh Agrawal , Johannes Gehrke , Dimitrios Gunopulos , Prabhakar Raghavan : Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications. SIGMOD Conference 1998 : 94-105
[2]: ...
[3]: ...
[4]: ...
[5]: ...
[6]: Richard Dubes , Anil K. Jain : Clustering Methodologies in Exploratory Data Analysis. Advances in Computers 19 : 113-228(1980)
[7]: Martin Ester , Hans-Peter Kriegel , Xiaowei Xu : A Database Interface for Clustering in Large Spatial Databases. KDD 1995 : 94-99
[8]: Martin Ester , Hans-Peter Kriegel , Xiaowei Xu : Knowledge Discovery in Large Spatial Databases: Focusing Techniques for Efficient Class Identification. SSD 1995 : 67-82
[9]: Martin Ester , Hans-Peter Kriegel , Jörg Sander , Xiaowei Xu : A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. KDD 1996 : 226-231
[10]: Upendra Shardanand , Pattie Maes : Social Information Filtering: Algorithms for Automating "Word of Mouth". CHI 1995 : 210-217
[11]: Douglas Fisher : Knowledge Acquisition via Incremental Conceptual Clustering. Machine Learning 2(2) : 139-172(1987)
[12]: Douglas Fisher : Optimization and Simplification of Hierarchical Clusterings. KDD 1995 : 118-123
[13]: David Gibson , Jon M. Kleinberg , Prabhakar Raghavan : Clustering Categorical Data: An Approach Based on Dynamical Systems. VLDB 1998 : 311-322
[14]: Teofilo F. Gonzalez : Clustering to Minimize the Maximum Intercluster Distance. TCS 38 : 293-306(1985)
[15]: Sudipto Guha , Rajeev Rastogi , Kyuseok Shim : CURE: An Efficient Clustering Algorithm for Large Databases. SIGMOD Conference 1998 : 73-84
[16]: ...
[17]: Anil K. Jain , Richard C. Dubes: Algorithms for Clustering Data. Prentice-Hall 1988
[18]: L. Kaufman, P. J. Rousseeuw: Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley 1990
[19]: Ron Kohavi , Dan Sommerfield : Feature Subset Selection Using the Wrapper Method: Overfitting and Dynamic Search Space Topology. KDD 1995 : 192-197
[20]: ...
[21]: Raymond T. Ng , Jiawei Han : Efficient and Effective Clustering Methods for Spatial Data Mining. VLDB 1994 : 144-155
[22]: Stefan Berchtold , Christian Böhm , Daniel A. Keim , Hans-Peter Kriegel : A Cost Model For Nearest Neighbor Search in High-Dimensional Data Space. PODS 1997 : 78-86
[23]: ...
[24]: Xiaowei Xu , Martin Ester , Hans-Peter Kriegel , Jörg Sander : A Distribution-Based Clustering Algorithm for Mining in Large Spatial Databases. ICDE 1998 : 324-331
[25]: ...
[26]: Tian Zhang , Raghu Ramakrishnan , Miron Livny : BIRCH: An Efficient Data Clustering Method for Very Large Databases. SIGMOD Conf. 1996 : 103-114

BIBTEX

@inproceedings{DBLP:conf/sigmod/AggarwalPWYP99,
  author    = {Charu C. Aggarwal and
                Cecilia Magdalena Procopiuc and
                Joel L. Wolf and
                Philip S. Yu and
                Jong Soo Park},
   editor    = {Alex Delis and
                Christos Faloutsos and
                Shahram Ghandeharizadeh},
   title     = {Fast Algorithms for Projected Clustering},
   booktitle = {SIGMOD 1999, Proceedings ACM SIGMOD International Conference
                on Management of Data, June 1-3, 1999, Philadephia, Pennsylvania,
                USA},
   publisher = {ACM Press},
   year      = {1999},
   isbn      = {1-58113-084-8},
   pages     = {61-72},
   crossref  = {DBLP:conf/sigmod/99},
   bibsource = {DBLP, http://dblp.uni-trier.de} } },

Copyright(C) 2000 ACM