Combinatorial Pattern Discovery for Scientific Data: Some Preliminary Results.

Jason Tsong-Li Wang, Gung-Wei Chirn, Thomas G. Marr, Bruce A. Shapiro, Dennis Shasha, Kaizhong Zhang: Combinatorial Pattern Discovery for Scientific Data: Some Preliminary Results. SIGMOD Conference 1994: 115-125

@inproceedings{DBLP:conf/sigmod/WangCMSSZ94,
  author    = {Jason Tsong-Li Wang and
               Gung-Wei Chirn and
               Thomas G. Marr and
               Bruce A. Shapiro and
               Dennis Shasha and
               Kaizhong Zhang},
  editor    = {Richard T. Snodgrass and
               Marianne Winslett},
  title     = {Combinatorial Pattern Discovery for Scientific Data: Some Preliminary
               Results},
  booktitle = {Proceedings of the 1994 ACM SIGMOD International Conference on
               Management of Data, Minneapolis, Minnesota, May 24-27, 1994},
  publisher = {ACM Press},
  year      = {1994},
  pages     = {115-125},
  ee        = {http://doi.acm.org/10.1145/191839.191863, db/conf/sigmod/WangCMSSZ94.html},
  crossref  = {DBLP:conf/sigmod/94},
  bibsource = {DBLP, http://dblp.uni-trier.de}
}

BibTeX

Abstract

Suppose you are given a set of natural entities (e.g., proteins, organisms, weather patterns, etc.) that possess some important common externally observable properties. You also have a structural description of the entities (e.g., sequence, topological, or geometrical data) and a distance metric. Combinatorial pattern discovery is the activity of finding patterns in the structural data that might explain these common properties based on the metric.

This paper presents an example of combinatorial pattern discovery: the discovery of patterns in protein databases. The structural representation we consider are strings and the distance metric is string edit distance permitting variable length don't cares. Our techniques incorporate string matching algorithms and novel heuristics for discovery and optimization, most of which generalize to other combinatorial structures. Experimental results of applying the techniques to both generated data and functionally related protein families obtained from the Cold Spring Harbor Laboratory show the effectiveness of the proposed techniques. When we apply the discovered patterns to perform protein classification, they give information that is complementary to the best protein classifier available today.

Copyright © 1994 by the ACM, Inc., used by permission. Permission to make digital or hard copies is granted provided that copies are not made or distributed for profit or direct commercial advantage, and that copies show this notice on the first page or initial screen of a display along with the full citation.

ACM SIGMOD Anthology

Online Version (ACM WWW Account required): Full Text in PDF Format

CDROM Version: Load the CDROM "Volume 1 Issue 1, SIGMOD '93-'97" and ...

Windows: Click the letter of your CD drive
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
Mac: Click here
UNIX/LINUX: mount the CD and click on the path of your mount point:
/Anthology/smod9397 or /cdrom

DVD Version: Load ACM SIGMOD Anthology DVD 1" and ...

Windows: Click the letter of your CD drive
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
Mac: Click here
UNIX/LINUX: mount the DVD and click on the path of your mount point:
/Anthology/aDVD1 or /dvd

BibTeX

Printed Edition

Richard T. Snodgrass, Marianne Winslett (Eds.): Proceedings of the 1994 ACM SIGMOD International Conference on Management of Data, Minneapolis, Minnesota, May 24-27, 1994. ACM Press 1994 BibTeX , SIGMOD Record 23(2), June 1994
Contents

Online Edition: ACM Digital Library

[Abstract and Index Terms]
[Full Text in PDF Format, 1016 KB]

References

[1]: Rakesh Agrawal, Sakti P. Ghosh, Tomasz Imielinski, Balakrishna R. Iyer, Arun N. Swami: An Interval Classifier for Database Mining Applications. VLDB 1992: 560-573 BibTeX
[2]: Rakesh Agrawal, Tomasz Imielinski, Arun N. Swami: Mining Association Rules between Sets of Items in Large Databases. SIGMOD Conference 1993: 207-216 BibTeX
[3]: ...
[4]: ...
[5]: ...
[6]: ...
[7]: ...
[8]: ...
[9]: Vasant Dhar, Alexander Tuzhilin: Abstract-Driven Pattern Discovery in Databases. IEEE Trans. Knowl. Data Eng. 5(6): 926-938(1993) BibTeX
[10]: William J. Frawley, Gregory Piatetsky-Shapiro, Christopher J. Matheus: Knowledge Discovery in Databases: An Overview. Knowledge Discovery in Databases 1991: 1-30 BibTeX
[11]: Karen A. Frenkel: The Human Genome Project and Informatics. Commun. ACM 34(11): 40-51(1991) BibTeX
[12]: ...
[13]: Peter J. Haas, Arun N. Swami: Sequential Sampling Procedures for Query Size Estimation. SIGMOD Conference 1992: 341-350 BibTeX
[14]: Jiawei Han, Yandong Cai, Nick Cercone: Knowledge Discovery in Databases: An Attribute-Oriented Approach. VLDB 1992: 547-559 BibTeX
[15]: ...
[16]: Wen-Chi Hou, Gultekin Özsoyoglu: Statistical Estimators for Aggregate Relational Algebra Queries. ACM Trans. Database Syst. 16(4): 600-654(1991) BibTeX
[17]: Lucas Chi Kwong Hui: Color Set Size Problem with Application to String Matching. CPM 1992: 230-243 BibTeX
[18]: ...
[19]: ...
[20]: Gad M. Landau, Uzi Vishkin: Fast Parallel and Serial Approximate String Matching. J. Algorithms 10(2): 157-169(1989) BibTeX
[21]: ...
[22]: Richard J. Lipton, Jeffrey F. Naughton, Donovan A. Schneider: Practical Selectivity Estimation through Adaptive Sampling. SIGMOD Conference 1990: 1-11 BibTeX
[23]: Edward M. McCreight: A Space-Economical Suffix Tree Construction Algorithm. J. ACM 23(2): 262-272(1976) BibTeX
[24]: Donald R. Morrison: PATRICIA - Practical Algorithm To Retrieve Information Coded in Alphanumeric. J. ACM 15(4): 514-534(1968) BibTeX
[25]: Frank Olken, Doron Rotem: Random Sampling from B+ Trees. VLDB 1989: 269-277 BibTeX
[26]: ...
[27]: ...
[28]: ...
[29]: ...
[30]: Dennis Shasha, Jason Tsong-Li Wang: New Techniques for Best-Match Retrieval. ACM Trans. Inf. Syst. 8(2): 140-158(1990) BibTeX
[31]: Abraham Silberschatz, Michael Stonebraker, Jeffrey D. Ullman: Database Systems: Achievements and Opportunities. Commun. ACM 34(10): 110-120(1991) BibTeX
[32]: Bharathi Subramanian, Stanley B. Zdonik, Theodore W. Leung, Scott L. Vandenberg: Ordered Types in the AQUA Data Model. DBPL 1993: 115-135 BibTeX
[33]: Shalom Tsur: Data Dredging. IEEE Data Eng. Bull. 13(4): 58-63(1990) BibTeX
[34]: ...
[35]: Robert A. Wagner, Michael J. Fischer: The String-to-String Correction Problem. J. ACM 21(1): 168-173(1974) BibTeX
[36]: Jason Tsong-Li Wang, Kaizhong Zhang, Karpjoo Jeong, Dennis Shasha: A System for Approximate Tree Matching. IEEE Trans. Knowl. Data Eng. 6(4): 559-571(1994) BibTeX
[37]: Jason Tsong-Li Wang, Dennis Shasha: Query Processing for Distance Metrics. VLDB 1990: 602-613 BibTeX
[38]: ...
[39]: Sun Wu, Udi Manber: Fast Text Searching Allowing Errors. Commun. ACM 35(10): 83-91(1992) BibTeX
[40]: Kaizhong Zhang, Dennis Shasha: Simple Fast Algorithms for the Editing Distance Between Trees and Related Problems. SIAM J. Comput. 18(6): 1245-1262(1989) BibTeX
[41]: Kaizhong Zhang, Dennis Shasha, Jason Tsong-Li Wang: Approximate Tree Matching in the Presence of Variable Length Don't Cares. J. Algorithms 16(1): 33-66(1994) BibTeX
[42]: George Kingsley Zipf: Human Behaviour and the Principle of Least Effort: an Introduction to Human Ecology. Addison-Wesley 1949
BibTeX

Referenced by

Minos N. Garofalakis, Rajeev Rastogi, Kyuseok Shim: SPIRIT: Sequential Pattern Mining with Regular Expression Constraints. VLDB 1999: 223-234
Sudarshan S. Chawathe: Comparing Hierarchical Data in External Memory. VLDB 1999: 90-101
Ming-Syan Chen, Jong Soo Park, Philip S. Yu: Efficient Data Mining for Path Traversal Patterns. IEEE Trans. Knowl. Data Eng. 10(2): 209-221(1998)
Claudio Bettini, Xiaoyang Sean Wang, Sushil Jajodia, Jia-Ling Lin: Discovering Frequent Event Patterns with Multiple Granularities in Time Sequences. IEEE Trans. Knowl. Data Eng. 10(2): 222-237(1998)
Claudio Bettini, Xiaoyang Sean Wang, Sushil Jajodia: Mining Temporal Relationships with Multiple Granularities in Time Sequences. IEEE Data Eng. Bull. 21(1): 32-38(1998)
Bin Li, Dennis Shasha: Free Parallel Data Mining. SIGMOD Conference 1998: 541-543
Jong Soo Park, Ming-Syan Chen, Philip S. Yu: Using a Hash-Based Method with Transaction Trimming for Mining Association Rules. IEEE Trans. Knowl. Data Eng. 9(5): 813-825(1997)
P. Krishnan, Jeffrey Scott Vitter, Balakrishna R. Iyer: Estimating Alphanumeric Selectivity in the Presence of Wildcards. SIGMOD Conference 1996: 282-293
Ramakrishnan Srikant, Rakesh Agrawal: Mining Sequential Patterns: Generalizations and Performance Improvements. EDBT 1996: 3-17
Ashok Savasere, Edward Omiecinski, Shamkant B. Navathe: An Efficient Algorithm for Mining Association Rules in Large Databases. VLDB 1995: 432-444
Rakesh Agrawal, King-Ip Lin, Harpreet S. Sawhney, Kyuseok Shim: Fast Similarity Search in the Presence of Noise, Scaling, and Translation in Time-Series Databases. VLDB 1995: 490-501
Jason Tsong-Li Wang, Kaizhong Zhang, Dennis Shasha: Pattern Matching and Pattern Discovery in Scientific, Program, and Document Databases. SIGMOD Conference 1995: 487
Rakesh Agrawal, Ramakrishnan Srikant: Mining Sequential Patterns. ICDE 1995: 3-14

BibTeX

ACM SIGMOD Anthology - DBLP: [Home | Search: Author, Title | Conferences | Journals]