Automatic Subspace Clustering of High Dimensional Data for Data Mining
Applications
Rakesh Agrawal (IBM Almaden RC)
Johannes Gehrke (IBM Almaden RC/Univ. of Wisconsin)
Dimitrios Gunopulos (IBM Almaden RC)
Prabhakar Raghavan (IBM Almaden RC)
Data mining applications place special requirements on clustering
algorithms, including
the ability to find clusters embedded in subspaces of high dimensional
data, scalability, end-user comprehensibility of the results,
non-presumption of any canonical data distribution, and insensitivity
to the order of input records.
We present CLIQUE, a clustering algorithm that satisfies each of these
requirements.
CLIQUE identifies dense clusters in subspaces of maximum dimensionality.
It generates cluster descriptions in the form of DNF expressions that
are minimized for ease of comprehension. It produces identical results
irrespective of the order in which input records are presented and does
not presume any specific mathematical form for data distribution.
Through experiments, we show that CLIQUE efficiently finds accurate
clusters in large high dimensional datasets.
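
To make the idea of dense regions in subspaces and their DNF descriptions concrete, the following Python sketch may help. It is not the authors' implementation. It assumes details this abstract does not state: each dimension of data scaled to [0, 1) is cut into xi equal-width intervals, a grid cell counts as dense when it holds more than a fraction tau of the points, and higher-dimensional subspaces are searched bottom-up from lower-dimensional ones that contain dense cells. The names dense_units, to_dnf, xi, and tau are hypothetical. The sketch leaves out two steps named above: it does not merge adjacent dense cells into clusters, and it does not minimize the DNF expression.

```python
from collections import Counter
from itertools import combinations


def dense_units(points, xi, tau):
    """Map each subspace (sorted tuple of dimension indices) to its dense cells.

    points: list of equal-length tuples with coordinates in [0, 1) (assumed).
    xi:     number of equal-width intervals per dimension (assumed parameter).
    tau:    a cell is dense if it holds more than tau * len(points) points.
    """
    n, d = len(points), len(points[0])
    min_count = tau * n
    # Discretize once: grid[i][j] is the interval index of point i in dimension j.
    grid = [tuple(min(int(x * xi), xi - 1) for x in p) for p in points]

    def count_dense(dims):
        # Counting is a pure aggregation, so record order cannot affect it.
        counts = Counter(tuple(g[j] for j in dims) for g in grid)
        return {cell for cell, c in counts.items() if c > min_count}

    # Level 1: one-dimensional subspaces that contain at least one dense interval.
    level = {}
    for j in range(d):
        cells = count_dense((j,))
        if cells:
            level[(j,)] = cells

    result, k = {}, 1
    while level:
        result.update(level)
        # Candidate (k+1)-dimensional subspaces: unions of two surviving k-dim ones.
        candidates = set()
        for a, b in combinations(sorted(level), 2):
            merged = tuple(sorted(set(a) | set(b)))
            if len(merged) == k + 1:
                candidates.add(merged)
        next_level = {}
        for dims in candidates:
            # Prune: a dense cell projects onto dense cells in every k-dim
            # sub-subspace, so all of those sub-subspaces must have survived.
            if any(sub not in level for sub in combinations(dims, k)):
                continue
            cells = count_dense(dims)
            if cells:
                next_level[dims] = cells
        level, k = next_level, k + 1
    return result


def to_dnf(dims, cells, xi):
    """Render the dense cells of one subspace as an unminimized DNF string."""
    width = 1.0 / xi
    terms = []
    for cell in sorted(cells):
        literals = [
            f"({c * width:.2f} <= x{j} < {(c + 1) * width:.2f})"
            for j, c in zip(dims, cell)
        ]
        terms.append("(" + " AND ".join(literals) + ")")
    return " OR ".join(terms)
```

Calling dense_units on a list of coordinate tuples returns a dictionary keyed by subspace, and to_dnf turns the cells of any one subspace into a readable expression. This sketch prunes whole subspaces rather than individual candidate cells, which keeps the code short at the cost of some extra counting.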