DiSC - Semantic Compression and Pattern Extraction with Fascicles

Digital Symposium Collection 2000

Semantic Compression and Pattern Extraction with Fascicles

H. V. Jagadish, J. Madar, and Raymond T. Ng
View Paper (PDF)

Return to Data Mining Algorithms

Abstract

Often many recoords in a database share similar values for several attributes. If one is able to identify and group together records that share similar values for some - even if not all - attributes, one can both obtain a more parsimonious representation of the data, and gain useful insight into the data from a mining perspective.

In this paper, we introduce the notion of fascicles . A fascicle F(k,t) is a subset of records that have k compact attributes. An attribute A of a collection F of records is compact if the width of the range of A -values (for numeric attributes) or the number of distinct A -values (for categorial attributes) of all the records in F does not exceed t . We introduce and study two problems related to fascicles. First, we consider how to find fascicles such that the total storage of the relation is minimized. Second, we study how best to extract fascicles whose sizes exceed a given minimum threshold (i.e., support) and that represent patterns of maximal quality, where quality is measured by the pair (k,t) . We develop algorithms to attack both of the above problems. We show that these two problems are very hard to solve optimally. But we demonstrate empirically that good solutions can be obtained using our algorithms.

References

Note: References link to DBLP on the Web.

[1]: Rakesh Agrawal , Johannes Gehrke , Dimitrios Gunopulos , Prabhakar Raghavan : Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications. SIGMOD Conference 1998 : 94-105
[2]: Rakesh Agrawal , Tomasz Imielinski , Arun N. Swami : Mining Association Rules between Sets of Items in Large Databases. SIGMOD Conference 1993 : 207-216
[3]: Rakesh Agrawal , Ramakrishnan Srikant : Fast Algorithms for Mining Association Rules in Large Databases. VLDB 1994 : 487-499
[4]: Daniel Barbará , William DuMouchel , Christos Faloutsos , Peter J. Haas , Joseph M. Hellerstein , Yannis E. Ioannidis , H. V. Jagadish , Theodore Johnson , Raymond T. Ng , Viswanath Poosala , Kenneth A. Ross , Kenneth C. Sevcik : The New Jersey Data Reduction Report. Data Engineering Bulletin 20(4) : 3-45(1997)
[5]: Sergey Brin , Rajeev Motwani , Jeffrey D. Ullman , Shalom Tsur : Dynamic Itemset Counting and Implication Rules for Market Basket Data. SIGMOD Conference 1997 : 255-264
[6]: Martin Ester , Hans-Peter Kriegel , Jörg Sander , Xiaowei Xu : A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. KDD 1996 : 226-231
[7]: Christos Faloutsos , King-Ip Lin : FastMap: A Fast Algorithm for Indexing, Data-Mining and Visualization of Traditional and Multimedia Datasets. SIGMOD Conference 1995 : 163-174
[8]: ...
[9]: ...
[10]: Sudipto Guha , Rajeev Rastogi , Kyuseok Shim : CURE: An Efficient Clustering Algorithm for Large Databases. SIGMOD Conference 1998 : 73-84
[11]: Theodore Johnson , Tamraparni Dasu : Comparing Massive High-Dimensional Data Sets. KDD 1998 : 229-233
[12]: ...
[13]: Mika Klemettinen , Heikki Mannila , Pirjo Ronkainen , Hannu Toivonen , A. Inkeri Verkamo : Finding Interesting Rules from Large Sets of Discovered Association Rules. CIKM 1994 : 401-407
[14]: Flip Korn , H. V. Jagadish , Christos Faloutsos : Efficiently Supporting Ad Hoc Queries in Large Datasets of Time Sequences. SIGMOD Conference 1997 : 289-300
[15]: ...
[16]: Heikki Mannila , Hannu Toivonen : Levelwise Search and Borders of Theories in Knowledge Discovery. Data Mining and Knowledge Discovery 1(3) : 241-258(1997)
[17]: Raymond T. Ng , Jiawei Han : Efficient and Effective Clustering Methods for Spatial Data Mining. VLDB 1994 : 144-155
[18]: Wee Keong Ng , Chinya V. Ravishankar : Block-Oriented Compression Techniques for Large Statistical Databases. TKDE 9(2) : 314-328(1997)
[19]: Frank Olken , Doron Rotem : Rearranging Data to Maximize the Efficiency of Compression. PODS 1986 : 78-90
[20]: Michael B. Smyth : Power Domains. JCSS 16(1) : 23-36(1978)
[21]: Tian Zhang , Raghu Ramakrishnan , Miron Livny : BIRCH: An Efficient Data Clustering Method for Very Large Databases. SIGMOD Conf. 1996 : 103-114

Referenced by

Richard T. Snodgrass : Review - Semantic Compression and Pattern Extraction with Fascicles. ACM SIGMOD Digital Review 1 : (1999)

BIBTEX

@inproceedings{DBLP:conf/vldb/JagadishMN99,
  author    = {H. V. Jagadish and
                J. Madar and
                Raymond T. Ng},
   editor    = {Malcolm P. Atkinson and
                Maria E. Orlowska and
                Patrick Valduriez and
                Stanley B. Zdonik and
                Michael L. Brodie},
   title     = {Semantic Compression and Pattern Extraction with Fascicles},
   booktitle = {VLDB'99, Proceedings of 25th International Conference on Very
                Large Data Bases, September 7-10, 1999, Edinburgh, Scotland,
                UK},
   publisher = {Morgan Kaufmann},
   year      = {1999},
   isbn      = {1-55860-615-5},
   pages     = {186-198},
   crossref  = {DBLP:conf/vldb/99},
   bibsource = {DBLP, http://dblp.uni-trier.de} } },

Copyright(C) 2000 ACM