Optimistic Decision Tree Construction

Johannes Gehrke       Venkatesh Ganti       Raghu Ramakrishnan
University of Wisconsin-Madison       University of Wisconsin-Madison       University of Wisconsin-Madison
johannes@cs.wisc.edu       vganti@cs.wisc.edu       raghu@cs.wisc.edu

Abstract

Classification is an important data mining problem. Given a training database of records, each tagged with a class label, the goal of classification is to build a concise model that can be used to predict the class label of future, unlabeled records. Decision trees are a very popular class of classifiers. All current algorithms for constructing decision trees, including main-memory algorithms, make one scan over the training database per level of the tree.

In this paper, we introduce a new algorithm for decision tree construction that addresses both performance and functionality issues of previous work. The algorithm constructs several levels of the tree in only two scans over the training database, resulting in an average performance gain of 300\% over previous work. The key to this performance improvement is a novel {\em optimistic} approach to tree construction, in which statistical techniques are exploited to construct the tree based on a small subset of the data, while guaranteeing that any difference with respect to the ``real'' tree (i.e., the tree that would be constructed by examining all the data) is detected and corrected. If correction is not possible, the affected part of the tree must be processed again; in practice, this situation rarely arises, and when it does it usually affects only a small portion of the tree and can be handled with little added cost.
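To convey the flavor of the optimistic approach, the following Python sketch illustrates the idea for a single numeric split under simplifying assumptions (binary class labels, an exhaustive threshold search, and a plain bootstrap for the confidence interval). All function names and parameters here are ours, chosen for exposition; they are not the algorithm's actual interface.

\begin{verbatim}
import random

def gini(labels):
    # Gini impurity of a sequence of 0/1 class labels.
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)

def exact_best_split(records):
    # Exhaustive best threshold on one numeric attribute;
    # records is a list of (value, class_label) pairs.
    records = sorted(records)
    best_t, best_score = None, float("inf")
    for i in range(1, len(records)):
        t = (records[i - 1][0] + records[i][0]) / 2.0
        left = [c for v, c in records if v <= t]
        right = [c for v, c in records if v > t]
        score = (len(left) * gini(left)
                 + len(right) * gini(right)) / len(records)
        if score < best_score:
            best_t, best_score = t, score
    return best_t

def optimistic_split(database, sample_size=500, n_boot=25):
    # Guess a split from an in-memory sample; bootstrap the sample
    # to obtain a confidence interval around the guessed threshold.
    sample = random.sample(database, min(sample_size, len(database)))
    guesses = [exact_best_split(random.choices(sample, k=len(sample)))
               for _ in range(n_boot)]
    guesses = [g for g in guesses if g is not None]
    return exact_best_split(sample), (min(guesses), max(guesses))

def verified_split(database):
    # One "cleanup" pass over the full data: if the exact split lies
    # inside the bootstrap interval, the optimistically built node is
    # confirmed (with its threshold adjusted); otherwise the node must
    # be reprocessed.
    guess, (low, high) = optimistic_split(database)
    exact = exact_best_split(database)
    if low <= exact <= high:
        return exact
    raise RuntimeError("split outside interval; reprocess this node")
\end{verbatim}

The sketch recomputes the exact split exhaustively only for brevity; in the algorithm proper, as the abstract indicates, the checks for all tree nodes are folded into the two scans over the training database.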

Beyond offering faster tree construction, our algorithm can incrementally update the tree (under both insertions and deletions) in a dynamic environment in which the training dataset changes over time. The update operation is much cheaper than completely rebuilding the tree, and the resulting tree is guaranteed to be identical to the tree that would be produced by a complete rebuild.