Digital Symposium Collection 2000  

 
 
 
 
 
 

 





















BOAT-Optimistic Decision Tree Construction

Johannes Gehrke, Venkatesh Ganti, Raghu Ramakrishnan, and Wei-Yin Loh

  View Paper (PDF)  

Return to Data Mining - Association Rules and Decision Trees

Abstract
Classification is an important data mining problem. Given a training database of records, each tagged with a class label, the goal of classification is to build a concise model that can be used to predict the class label of future, unlabeled records. A very popular class of classifiers are decision trees. All current algorithms to construct decision trees, including all main-memory algorithms, make one scan over the training database per level of the tree. We introduce a new algorithm (BOAT) for decision tree construction that improves upon earlier algorithms in both performance and functionality. BOAT constructs several levels of the tree in only two scans over the training database, resulting in an average performance gain of 300% over previous work. The key to this performance improvement is a novel optimistic approach to tree construction in which we construct an initial tree using a small subset of the data and refine it to arrive at the final tree. We guarantee that any difference with respect to the “real” tree (i.e., the tree that would be constructed by examining all the data in a traditional way) is detected and corrected. The correction step occasionally requires us to make additional scans over subsets of the data; typically, this situation rarely arises, and can be addressed with little added cost. Beyond offering faster tree construction, BOAT is the first scalable algorithm with the ability to incrementally update the tree with respect to both insertions and deletions over the dataset. This property is valuable in dynamic environments such as data warehouses, in which the training dataset changes over time. The BOAT update operation is much cheaper than completely re-building the tree, and the resulting tree is guaranteed to be identical to the tree that would be produced by a complete re-build.


References

Note: References link to DBLP on the Web.

[AD97]
...
[AGI+92]
Rakesh Agrawal , Sakti P. Ghosh , Tomasz Imielinski , Balakrishna R. Iyer , Arun N. Swami : An Interval Classifier for Database Mining Applications. VLDB 1992 : 560-573
[AIS93]
Rakesh Agrawal , Tomasz Imielinski , Arun N. Swami : Database Mining: A Performance Perspective. TKDE 5(6) : 914-925(1993)
[BFOS84]
...
[ET93]
...
[FMM96]
Takeshi Fukuda , Yasuhiko Morimoto , Shinichi Morishita , Takeshi Tokuyama : Constructing Efficient Decision Trees by Using Optimized Numeric Association Rules. VLDB 1996 : 146-155
[FMMT96a]
Takeshi Fukuda , Yasuhiko Morimoto , Shinichi Morishita , Takeshi Tokuyama : Data Mining Using Two-Dimensional Optimized Accociation Rules: Scheme, Algorithms, and Visualization. SIGMOD Conf. 1996 : 13-23
[FMMT96b]
Takeshi Fukuda , Yasuhiko Morimoto , Shinichi Morishita , Takeshi Tokuyama : Mining Optimized Association Rules for Numeric Attributes. PODS 1996 : 182-191
[FPSSU96]
Usama M. Fayyad , Gregory Piatetsky-Shapiro , Padhraic Smyth , Ramasamy Uthurusamy (Eds.): Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press 1996, ISBN 0-262-56097-6
Contents
[GRG98]
Johannes Gehrke , Raghu Ramakrishnan , Venkatesh Ganti : RainForest - A Framework for Fast Decision Tree Construction of Large Datasets. VLDB 1998 : 416-427
[Han97]
...
[LLS97]
...
[LS97]
...
[LS97]
...
[Man94]
...
[MAR96]
Manish Mehta , Rakesh Agrawal , Jorma Rissanen : SLIQ: A Fast Scalable Classifier for Data Mining. EDBT 1996 : 18-32
[MFM+98]
Yasuhiko Morimoto , Takeshi Fukuda , Hirofumi Matsuzawa , Takeshi Tokuyama , Kunikazu Yoda : Algorithms for Mining Association Rules for Binary Segmentations of Huge Categorical Databases. VLDB 1998 : 380-391
[MS98]
Nimrod Megiddo , Ramakrishnan Srikant : Discovering Predictive Association Rules. KDD 1998 : 274-278
[MST94]
Donald Michie , D. J. Spiegelhalter, C. C. Taylor: Machine Learning, Neural and Statistical Classification. Ellis Horwood 1994, ISBN 0-13-106360-X
[Mur95]
...
[Olk93]
...
[Qui86]
J. Ross Quinlan : Induction of Decision Trees. Machine Learning 1(1) : 81-106(1986)
[RS98]
Rajeev Rastogi , Kyuseok Shim : PUBLIC: A Decision Tree Classifier that Integrates Building and Pruning. VLDB 1998 : 404-415
[SAM96]
John C. Shafer , Rakesh Agrawal , Manish Mehta : SPRINT: A Scalable Parallel Classifier for Data Mining. VLDB 1996 : 544-555
[UBC97]
Paul E. Utgoff , Neil C. Berkman , Jeffery A. Clouse : Decision Tree Induction Based on Efficient Tree Restructuring. Machine Learning 29(1) : 5-44(1997)
[Utg89]
Paul E. Utgoff : Incremental Induction of Decision Trees. Machine Learning 4 : 161-186(1989)

BIBTEX

@inproceedings{DBLP:conf/sigmod/GehrkeGRL99,
  author    = {Johannes Gehrke and
                Venkatesh Ganti and
                Raghu Ramakrishnan and
                Wei-Yin Loh},
   editor    = {Alex Delis and
                Christos Faloutsos and
                Shahram Ghandeharizadeh},
   title     = {BOAT-Optimistic Decision Tree Construction},
   booktitle = {SIGMOD 1999, Proceedings ACM SIGMOD International Conference
                on Management of Data, June 1-3, 1999, Philadephia, Pennsylvania,
                USA},
   publisher = {ACM Press},
   year      = {1999},
   isbn      = {1-58113-084-8},
   pages     = {169-180},
   crossref  = {DBLP:conf/sigmod/99},
   bibsource = {DBLP, http://dblp.uni-trier.de} } },


























Copyright(C) 2000 ACM