Second Generation Data Mining: Concepts and Implementation

Tomasz Imielinski
Department of Computer Science
Rutgers, the State University of New Jersey
Piscataway, NJ 08854

Contact Information

Tomasz Imielinski
Phone: (732) 445-3546
Fax : (732) 445-0537
Email: imielins@cs.rutgers.edu

WWW PAGE

http://www.cs.rutgers.edu/datamining/

Keywords

Data Mining, Second Generation, Rule Induction, MSQL

Project Award Information

Award Number: IRI 95-09819
Duration: 9/1/95-2/28/99
Title: A Query-Based Approach to Database Mining: Database Tools for Rule Discovery

Project Summary

We introduce a concept of second generation data mining which involves not only rule generation but also rule rule management and rule postprocessing. To support this paradigm, we developed a declarative query language - MSQL, and a C++ API to provide the user with flexible yet powerful means to post-process the large number of rules generated. A prototype second generation data-mining system called Discovery Board was also developed as part of this research.

Goals, Objectives, and Targeted Activities

The primary goal of this research was to develop efficient algorithms for query based rule-induction. We successfully demonstrated this approach by implementing various algorithms for rule induction, and providing users with tools to query, store, and manage the rules generated from data.

We are now trying to incorporate other forms of discovery objects (trees, regression or discrimination functions, etc) into the Discovery Board's API to allow for cross referencing knowledge generated via different statistical and machine learning techniques. This poses a new challenge since the type of information which must be catalogued for the various learning algorithm differ greatly in form and semantics.

Another extension to the system includes providing the users with an ability to remember past mining-sessions. This will allow analysts to increase their productivity, and to cross-reference their efforts. A side benefit to this is that answers to new queries can often be derived as a subset of answers from past queries (even if posed by other users), saving time and processing power. We are still formalizing this notion of "mining-sessions".

Along the same lines as above, we are developing heuristics for efficiently "re-mining" the same, or similar data in the face of updates and modifications. The heuristics we are developing involve enabling the system to "learn" about the correlation distribution from the previous mining on the data, and use this information stored persistently to greatly improve the processing time on future iterations of rule-generaion on the modified data.

Indication of Success

We have developed the concept of second generation data mining, and built a proof-of-concept system -- "Discovery Board" which embodies our approach. Second generation data mining involves not just massive rule-generation from databases, but also querying, storage and management of these rules, and providing primitives for embedding them into applications as first class objects.

We have developed a number of algorithms for efficient rule-generation from large databases, and have performed substantial performance studies to identify the key issues that affects performance and usability of such programs. We have built a complex performance testbed to test different rule induction engines. In the testbed one can generate database instances which reflect predefined statistical data patterns. In this way we can study the impact of data distribution on different rule generation methods.

In addition, we have also formally defined MSQL, the rule querying and rule generation language as an extension to the OQL specification, proposed in the ODMG 2.0 standard. The system is now fully object oriented allowing to apply rule induction engine to any class and any set of methods, both user-defined as well as materialized.

We have also introduced the notions of metamethods (method constructors) which allow the engine to be applied to complex data such as temporal sequences and multi-dimensional data.

Project Impact

The project supported two graduate students during their graduate studies. Several papers and demos at various data-mining conferences were presented in the last three years. A PhD dissertation titled: "Second Generation Data Mining: Concepts and Implementations" was accomplished by a graduate student directly funded from this grant.

As part of the colloboration with Helen Berman from the Chemistry Department on her NSF sponsored Nucleid acid database project, we helped port the system to SGI platform. Preliminary studies were done confirming the hypotheses of researchers about the molecular structure of DNA. We are continually adapting our system to make it more useful to highly domain-specific applications such as these.

The Discovery Board system was also extensively beta-tested at major insurance companies, resulting in a feedback loop, which provided us with real life experiences of people actively mining-using this system.

In one of the trials, a major insurance company that was trying to explore anomalies in their medical claims database. Discovery Board helped them quickly isolate pockets of high cost claims, and typical scenarios within each disease group that would lead to a high cost claim. This allowed the company to take preventive steps in several cases, by either providing pro-active customer care, or more required tests in different age group patients. We also worked with another group within the same insurance company, mining on the customer and policy database. The system helped them identify characteristics of people which are likely to drop out of their health plan. It also helped spot a couple of counties in NJ where the policy dropout rates were much above the norm.

Our industrial partners were very pleased with the results and we are now exploring a colloboration with another insurance company to perform data mining on their web-server log files.

Project References

T.Imielinski "From File Mining to Database Mining" at "Workshop on Database Mining, SIGMOD 1996

T.Imielinski and A.Virmani and A. Abdulghani. "DataMine - Application Programming Interface and Query Language for Database Mining" KDD96- the 2nd KDD conference, Portland, August 1996

T.Imielinski and H.Manilla "Database mining - new frontier" in Communications of the ACM, November 1996

H.Hirsh and Ronen Feldman "Query based approach to knowledge discovery in text" in 2nd KDD conference, Portland, August 1996

A. Virmani "Second Generation Data Mining: Concepts and Implementation", draft Ph.D. thesis, March 1998, Rutgers University, NJ

T.Imielinski and A. Virmani "MSQL - query language for data mining" to be submitted

T. Imielinski and A. Virmani "Performance results of the Discovery Board Rule Induction Engine" to be submitted

I. Dagan, R. Feldman., and H. Hirsh. "Keyword-Based Browsing and Analysis of Large Document Sets" In Proceedings of the 1996 Symposium on Document Analysis and Information Retrieval, pp. 191-208, Las Vegas, Nevada April 1996.

R. Feldman and H. Hirsh. "Mining Associations in Text in the Presence of Background Knowledge" In Proceedings of the 2nd International Conference on Knowledge Discovery (KDD-96), pp. 343-346, Portland, Aug 1996.

R. Feldman, A. Amir, Y. Aumann, A. Zilberstein, and H. Hirsh. "Incremental Algorithms for Association Generation" In Proceedings of the 1st Pacific Asia Conference on Knowledge Discovery and Data Mining (PAKDD97), 14 pages, Singapore, 1997.

R. Feldman and H. Hirsh "Exploiting Background Information in Knowledge Discovery from Text" accepted for publication, Journal of Intelligent Information Systems.

Area Background

Many government and commercial organizations possess extremely large databases, with sizes often measured in terabytes, containing such information as consumer data, astronomical observations, biological sequences, financial records, census and tax data, etc. The extraction of information from such large databases has become known as {\em database mining} or {\em knowledge discovery in databases}, and is an area where machine learning techniques must meet performance requirements of very large database systems.

One particular database-mining task, the problem of {\em rule discovery}, involves finding sets of conditions on fields of a record that allow one to predict the value of other fields of a record. Rule discovery is viewed as an interactive process with a ``human in the loop,'' an iterative process where the user is not only trying to discover interesting results, but also interesting questions to ask.

Area References

William J. Frawley, Gregory Piatetsky-Shapiro, and Chris Matheus, "Knowledge Discovery in Databases: An Overview", AI Magazine, Fall, 1992.

Rakesh Agrawal, Tomasz Imielinski, Arun Swami, "Mining Associations in Large Databases", SIGMOD 93, Washington, DC, 1993.

Leo Breiman et al.,"Classification and Regression Trees", Wadsworth, Belmont, 1984.

T.Imielinski and H.Manilla "Database mining - new frontier", in Communications of the ACM, Nov 1996