We are now trying to incorporate other forms of discovery objects (trees, regression or discrimination functions, etc) into the Discovery Board's API to allow for cross referencing knowledge generated via different statistical and machine learning techniques. This poses a new challenge since the type of information which must be catalogued for the various learning algorithm differ greatly in form and semantics.
Another extension to the system includes providing the users with an ability to remember past mining-sessions. This will allow analysts to increase their productivity, and to cross-reference their efforts. A side benefit to this is that answers to new queries can often be derived as a subset of answers from past queries (even if posed by other users), saving time and processing power. We are still formalizing this notion of "mining-sessions".
Along the same lines as above, we are developing heuristics for efficiently "re-mining" the same, or similar data in the face of updates and modifications. The heuristics we are developing involve enabling the system to "learn" about the correlation distribution from the previous mining on the data, and use this information stored persistently to greatly improve the processing time on future iterations of rule-generaion on the modified data.
We have developed a number of algorithms for efficient rule-generation from large databases, and have performed substantial performance studies to identify the key issues that affects performance and usability of such programs. We have built a complex performance testbed to test different rule induction engines. In the testbed one can generate database instances which reflect predefined statistical data patterns. In this way we can study the impact of data distribution on different rule generation methods.
In addition, we have also formally defined MSQL, the rule querying and rule generation language as an extension to the OQL specification, proposed in the ODMG 2.0 standard. The system is now fully object oriented allowing to apply rule induction engine to any class and any set of methods, both user-defined as well as materialized.
We have also introduced the notions of metamethods (method constructors) which allow the engine to be applied to complex data such as temporal sequences and multi-dimensional data.
As part of the colloboration with Helen Berman from the Chemistry Department on her NSF sponsored Nucleid acid database project, we helped port the system to SGI platform. Preliminary studies were done confirming the hypotheses of researchers about the molecular structure of DNA. We are continually adapting our system to make it more useful to highly domain-specific applications such as these.
The Discovery Board system was also extensively beta-tested at major insurance companies, resulting in a feedback loop, which provided us with real life experiences of people actively mining-using this system.
In one of the trials, a major insurance company that was trying to explore anomalies in their medical claims database. Discovery Board helped them quickly isolate pockets of high cost claims, and typical scenarios within each disease group that would lead to a high cost claim. This allowed the company to take preventive steps in several cases, by either providing pro-active customer care, or more required tests in different age group patients. We also worked with another group within the same insurance company, mining on the customer and policy database. The system helped them identify characteristics of people which are likely to drop out of their health plan. It also helped spot a couple of counties in NJ where the policy dropout rates were much above the norm.
Our industrial partners were very pleased with the results and we are now exploring a colloboration with another insurance company to perform data mining on their web-server log files.
T.Imielinski and A.Virmani and A. Abdulghani. "DataMine - Application Programming Interface and Query Language for Database Mining" KDD96- the 2nd KDD conference, Portland, August 1996
T.Imielinski and H.Manilla "Database mining - new frontier" in Communications of the ACM, November 1996
H.Hirsh and Ronen Feldman "Query based approach to knowledge discovery in text" in 2nd KDD conference, Portland, August 1996
A. Virmani "Second Generation Data Mining: Concepts and Implementation", draft Ph.D. thesis, March 1998, Rutgers University, NJ
T.Imielinski and A. Virmani "MSQL - query language for data mining" to be submitted
T. Imielinski and A. Virmani "Performance results of the Discovery Board Rule Induction Engine" to be submitted
I. Dagan, R. Feldman., and H. Hirsh. "Keyword-Based Browsing and Analysis of Large Document Sets" In Proceedings of the 1996 Symposium on Document Analysis and Information Retrieval, pp. 191-208, Las Vegas, Nevada April 1996.
R. Feldman and H. Hirsh. "Mining Associations in Text in the Presence of Background Knowledge" In Proceedings of the 2nd International Conference on Knowledge Discovery (KDD-96), pp. 343-346, Portland, Aug 1996.
R. Feldman, A. Amir, Y. Aumann, A. Zilberstein, and H. Hirsh. "Incremental Algorithms for Association Generation" In Proceedings of the 1st Pacific Asia Conference on Knowledge Discovery and Data Mining (PAKDD97), 14 pages, Singapore, 1997.
R. Feldman and H. Hirsh "Exploiting Background Information in Knowledge Discovery from Text" accepted for publication, Journal of Intelligent Information Systems.
One particular database-mining task, the problem of {\em rule discovery}, involves finding sets of conditions on fields of a record that allow one to predict the value of other fields of a record. Rule discovery is viewed as an interactive process with a ``human in the loop,'' an iterative process where the user is not only trying to discover interesting results, but also interesting questions to ask.
Rakesh Agrawal, Tomasz Imielinski, Arun Swami, "Mining Associations in Large Databases", SIGMOD 93, Washington, DC, 1993.
Leo Breiman et al.,"Classification and Regression Trees", Wadsworth, Belmont, 1984.
T.Imielinski and H.Manilla "Database mining - new frontier", in Communications of the ACM, Nov 1996