BIRCH: An Efficient Data Clustering Method for Very Large Databases.

Tian Zhang, Raghu Ramakrishnan, Miron Livny: BIRCH: An Efficient Data Clustering Method for Very Large Databases. SIGMOD Conference 1996: 103-114

@inproceedings{DBLP:conf/sigmod/ZhangRL96,
  author    = {Tian Zhang and
               Raghu Ramakrishnan and
               Miron Livny},
  editor    = {H. V. Jagadish and
               Inderpal Singh Mumick},
  title     = {BIRCH: An Efficient Data Clustering Method for Very Large Databases},
  booktitle = {Proceedings of the 1996 ACM SIGMOD International Conference on
               Management of Data, Montreal, Quebec, Canada, June 4-6, 1996},
  publisher = {ACM Press},
  year      = {1996},
  pages     = {103-114},
  ee        = {http://doi.acm.org/10.1145/233269.233324, db/conf/sigmod/ZhangRL96.html},
  crossref  = {DBLP:conf/sigmod/96},
  bibsource = {DBLP, http://dblp.uni-trier.de}
}

BibTeX

Abstract

Finding useful patterns in large datasets has attracted considerable interest recently, and one of the most widely studied problems in this area is the identification of clusters, or densely populated regions, in a multi-dimensional dataset. Prior work does not adequately address the problem of large datasets and minimization of I/O costs.
This paper presents a data clustering method named BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies), and demonstrates that it is especially suitable for very large databases. BIRCH incrementally and dynamically clusters incoming multi-dimensional metric data points to try to produce the best quality clustering with the available resources (i.e., availbale memory and time constraints). BIRCH can typically find a good clustering with a single scan of the data, and improve the quality further with a few additional scans. BIRCH is also the first clustering algorithm proposed in the database area to handle "noise" (data points that are not part of the underlying pattern) effectively.
We evaluate BIRCH's time/space efficiency, data input order sensitivity, and clustering quality through several experiments. We also present a performance comparisons of BIRCH versus CLARANS, a clustering method proposed recently for large datasets, and show that BIRCH is consistently superior.

Copyright © 1996 by the ACM, Inc., used by permission. Permission to make digital or hard copies is granted provided that copies are not made or distributed for profit or direct commercial advantage, and that copies show this notice on the first page or initial screen of a display along with the full citation.

ACM SIGMOD Anthology

Online Version (ACM WWW Account required): Full Text in PDF Format

CDROM Version: Load the CDROM "Volume 1 Issue 1, SIGMOD '93-'97" and ...

Windows: Click the letter of your CD drive
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
Mac: Click here
UNIX/LINUX: mount the CD and click on the path of your mount point:
/Anthology/smod9397 or /cdrom

DVD Version: Load ACM SIGMOD Anthology DVD 1" and ...

Windows: Click the letter of your CD drive
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
Mac: Click here
UNIX/LINUX: mount the DVD and click on the path of your mount point:
/Anthology/aDVD1 or /dvd

BibTeX

Printed Edition

H. V. Jagadish, Inderpal Singh Mumick (Eds.): Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, Montreal, Quebec, Canada, June 4-6, 1996. ACM Press 1996 BibTeX , SIGMOD Record 25(2), June 1996
Contents

Online Edition: ACM Digital Library

[Index Terms]
[Full Text in PDF Format, 1451 KB]

References

[CKS88]: ...
[DH73]: ...
[DJ80]: ...
[EKX95a]: Martin Ester, Hans-Peter Kriegel, Xiaowei Xu: A Database Interface for Clustering in Large Spatial Databases. KDD 1995: 94-99 BibTeX
[EKX95b]: Martin Ester, Hans-Peter Kriegel, Xiaowei Xu: Knowledge Discovery in Large Spatial Databases: Focusing Techniques for Efficient Class Identification. SSD 1995: 67-82 BibTeX
[Fis87]: Douglas H. Fisher: Knowledge Acquisition via Incremental Conceptual Clustering. Machine Learning 2(2): 139-172(1987) BibTeX
[Fis95]: ...
[GG92]: ...
[KR90]: ...
[Leb87]: ...
[Lee81]: ...
[Mur83]: Fionn Murtagh: A Survey of Recent Advances in Hierarchical Clustering Algorithms. Comput. J. 26(4): 354-359(1983) BibTeX
[NH94]: Raymond T. Ng, Jiawei Han: Efficient and Effective Clustering Methods for Spatial Data Mining. VLDB 1994: 144-155 BibTeX
[Ols93]: ...
[ZRL95]: ...

Referenced by

Anthony K. H. Tung, Raymond T. Ng, Laks V. S. Lakshmanan, Jiawei Han: Constraint-based clustering in large databases. ICDT 2001: 405-419
Jeff Edmonds, Jarek Gryz, Dongming Liang, Renée J. Miller: Mining for Empty Rectangles in Large Data Sets. ICDT 2001: 174-188
Gholamhosein Sheikholeslami, Surojit Chatterjee, Aidong Zhang: WaveCluster: A Wavelet Based Clustering Approach for Spatial Data in Very Large Databases. VLDB J. 8(3-4): 289-304(2000)
Edwin M. Knorr, Raymond T. Ng, V. Tucakov: Distance-Based Outliers: Algorithms and Applications. VLDB J. 8(3-4): 237-253(2000)
David Gibson, Jon M. Kleinberg, Prabhakar Raghavan: Clustering Categorical Data: An Approach Based on Dynamical Systems. VLDB J. 8(3-4): 222-236(2000)
Theodore Johnson, Laks V. S. Lakshmanan, Raymond T. Ng: The 3W Model and Algebra for Unified Data Mining. VLDB 2000: 21-32
Kaushik Chakrabarti, Sharad Mehrotra: Local Dimensionality Reduction: A New Approach to Indexing High Dimensional Spaces. VLDB 2000: 89-100
Sridhar Ramaswamy, Rajeev Rastogi, Kyuseok Shim: Efficient Algorithms for Mining Outliers from Large Data Sets. SIGMOD Conference 2000: 427-438
Christopher R. Palmer, Christos Faloutsos: Density Biased Sampling: An Improved Method for Data Mining and Clustering. SIGMOD Conference 2000: 82-92
Carlos Ordonez, Paul Cereghini: SQLEM: Fast Clustering in SQL using the EM Algorithm. SIGMOD Conference 2000: 559-570
Markus M. Breunig, Hans-Peter Kriegel, Raymond T. Ng, Jörg Sander: LOF: Identifying Density-Based Local Outliers. SIGMOD Conference 2000: 93-104
Charu C. Aggarwal, Philip S. Yu: Finding Generalized Projected Clusters In High Dimensional Spaces. SIGMOD Conference 2000: 70-81
Minos N. Garofalakis, Rajeev Rastogi, S. Seshadri, Kyuseok Shim: Data Mining and the Web: Past, Present and Future. Workshop on Web Information and Data Management 1999: 43-47
Alexander Hinneburg, Daniel A. Keim: Optimal Grid-Clustering: Towards Breaking the Curse of Dimensionality in High-Dimensional Clustering. VLDB 1999: 506-517
H. V. Jagadish, J. Madar, Raymond T. Ng: Semantic Compression and Pattern Extraction with Fascicles. VLDB 1999: 186-198
Wen-Chi Hou: A Framework for Statistical Data Mining with Summary Tables. SSDBM 1999: 14-23
Apostol Natsev, Rajeev Rastogi, Kyuseok Shim: WALRUS: A Similarity Retrieval Algorithm for Image Databases. SIGMOD Conference 1999: 395-406
Ju-Hong Lee, Deok-Hwan Kim, Chin-Wan Chung: Multi-dimensional Selectivity Estimation Using Compressed Histogram Information. SIGMOD Conference 1999: 205-214
Alin Deutsch, Mary F. Fernández, Dan Suciu: Storing Semistructured Data with STORED. SIGMOD Conference 1999: 431-442
Mihael Ankerst, Markus M. Breunig, Hans-Peter Kriegel, Jörg Sander: OPTICS: Ordering Points To Identify the Clustering Structure. SIGMOD Conference 1999: 49-60
Charu C. Aggarwal, Cecilia Magdalena Procopiuc, Joel L. Wolf, Philip S. Yu, Jong Soo Park: Fast Algorithms for Projected Clustering. SIGMOD Conference 1999: 61-72
Venkatesh Ganti, Johannes Gehrke, Raghu Ramakrishnan: A Framework for Measuring Changes in Data Characteristics. PODS 1999: 126-137
Wei Wang, Jiong Yang, Richard R. Muntz: STING+: An Approach to Active Spatial Data Mining. ICDE 1999: 116-125
Sudipto Guha, Rajeev Rastogi, Kyuseok Shim: ROCK: A Robust Clustering Algorithm for Categorical Attributes. ICDE 1999: 512-521
Venkatesh Ganti, Raghu Ramakrishnan, Johannes Gehrke, Allison L. Powell, James C. French: Clustering Large Datasets in Arbitrary Metric Spaces. ICDE 1999: 502-511
Philip S. Yu: Data Mining and Personalization Technologies. DASFAA 1999: 6-13
Pedro Furtado, Henrique Madeira: Summary Grids: Building Accurate Multidimensional Histograms. DASFAA 1999: 187-194
Tadeusz Morzy, Marek Wojciechowski, Maciej Zakrzewicz: Pattern-Oriented Hierachical Clustering. ADBIS 1999: 179-190
Chye-Lin Chee, Hongjun Lu, Hong Tang, C. V. Ramamoorthy: Adaptive Prefetching and Storage Reorganization In A Log-Structured Storage System. IEEE Trans. Knowl. Data Eng. 10(5): 824-838(1998)
Wendy Chang, Gholamhosein Sheikholeslami, Jia Wang, Aidong Zhang: Data Resource Selection in Distributed Visual Information Systems. IEEE Trans. Knowl. Data Eng. 10(6): 926-946(1998)
Jiawei Han: Towards On-Line Analytical Mining in Large Databases. SIGMOD Record 27(1): 97-107(1998)
G. D. Ramkumar, Arun N. Swami: Clustering Data Without Distance Functions. IEEE Data Eng. Bull. 21(1): 9-14(1998)
Gholamhosein Sheikholeslami, Surojit Chatterjee, Aidong Zhang: WaveCluster: A Multi-Resolution Clustering Approach for Very Large Spatial Databases. VLDB 1998: 428-439
Edwin M. Knorr, Raymond T. Ng: Algorithms for Mining Distance-Based Outliers in Large Datasets. VLDB 1998: 392-403
David Gibson, Jon M. Kleinberg, Prabhakar Raghavan: Clustering Categorical Data: An Approach Based on Dynamical Systems. VLDB 1998: 311-322
Martin Ester, Hans-Peter Kriegel, Jörg Sander, Michael Wimmer, Xiaowei Xu: Incremental Clustering for Mining in a Data Warehousing Environment. VLDB 1998: 323-333
Sudipto Guha, Rajeev Rastogi, Kyuseok Shim: CURE: An Efficient Clustering Algorithm for Large Databases. SIGMOD Conference 1998: 73-84
Rakesh Agrawal, Johannes Gehrke, Dimitrios Gunopulos, Prabhakar Raghavan: Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications. SIGMOD Conference 1998: 94-105
Xiaowei Xu, Martin Ester, Hans-Peter Kriegel, Jörg Sander: A Distribution-Based Clustering Algorithm for Mining in Large Spatial Databases. ICDE 1998: 324-331
Wendy Chang, Deepak Murthy, Aidong Zhang, Tanveer Fathima Syeda-Mahmood: Global Integration of Visual Databases. ICDE 1998: 542-549
Daniel Barbará, William DuMouchel, Christos Faloutsos, Peter J. Haas, Joseph M. Hellerstein, Yannis E. Ioannidis, H. V. Jagadish, Theodore Johnson, Raymond T. Ng, Viswanath Poosala, Kenneth A. Ross, Kenneth C. Sevcik: The New Jersey Data Reduction Report. IEEE Data Eng. Bull. 20(4): 3-45(1997)
Wei Wang, Jiong Yang, Richard R. Muntz: STING: A Statistical Information Grid Approach to Spatial Data Mining. VLDB 1997: 186-195
Renée J. Miller, Yuping Yang: Association Rules over Interval Data. SIGMOD Conference 1997: 452-461
Miron Livny, Raghu Ramakrishnan, Kevin S. Beyer, Guangshun Chen, Donko Donjerkovic, Shilpa Lawande, Jussi Myllymaki, R. Kent Wenger: DEVise: Integrated Querying and Visual Exploration of Large Datasets (Demo Abstract). SIGMOD Conference 1997: 517-520
Miron Livny, Raghu Ramakrishnan, Kevin S. Beyer, Guangshun Chen, Donko Donjerkovic, Shilpa Lawande, Jussi Myllymaki, R. Kent Wenger: DEVise: Integrated Querying and Visualization of Large Datasets. SIGMOD Conference 1997: 301-312
Flip Korn, H. V. Jagadish, Christos Faloutsos: Efficiently Supporting Ad Hoc Queries in Large Datasets of Time Sequences. SIGMOD Conference 1997: 289-300
Jiawei Han, Krzysztof Koperski, Nebojsa Stefanovic: GeoMiner: A System Prototype for Spatial Data Mining. SIGMOD Conference 1997: 553-556
Ming-Syan Chen, Jiawei Han, Philip S. Yu: Data Mining: An Overview from a Database Perspective. IEEE Trans. Knowl. Data Eng. 8(6): 866-883(1996)

BibTeX

ACM SIGMOD Anthology - DBLP: [Home | Search: Author, Title | Conferences | Journals]