Algorithms for Mining Distance-Based Outliers in Large Datasets.

Edwin M. Knorr, Raymond T. Ng: Algorithms for Mining Distance-Based Outliers in Large Datasets. VLDB 1998: 392-403

@inproceedings{DBLP:conf/vldb/KnorrN98,
  author    = {Edwin M. Knorr and
               Raymond T. Ng},
  editor    = {Ashish Gupta and
               Oded Shmueli and
               Jennifer Widom},
  title     = {Algorithms for Mining Distance-Based Outliers in Large Datasets},
  booktitle = {VLDB'98, Proceedings of 24rd International Conference on Very
               Large Data Bases, August 24-27, 1998, New York City, New York,
               USA},
  publisher = {Morgan Kaufmann},
  year      = {1998},
  isbn      = {1-55860-566-5},
  pages     = {392-403},
  ee        = {db/conf/vldb/KnorrN98.html},
  crossref  = {DBLP:conf/vldb/98},
  bibsource = {DBLP, http://dblp.uni-trier.de}
}

BibTeX

Abstract

This paper deals with finding outliers (exceptions) in large, multidimensional datasets. The identification of outliers can lead to the discovery of truly unexpected knowledge in areas such as electronic commerce, credit card fraud, and even the analysis of performance statistics of professional athletes. Existing methods that we have seen for finding outliers in large datasets can only deal efficiently with two dimensions/attributes of a dataset. Here, we study the notion of DB- (Distance- Based) outliers. While we provide formal and empirical evidence showing the usefulness of DB-outliers, we focus on the development of algorithms for computingsuch outliers.

First, we present two simple algorithms, both having a complexity of O(k N²), k being the dimensionality and N being the number of objects in the dataset. These algorithms readily support datasets with many more than two attributes. Second, we present an optimized cell-based algorithm that has a complexitythat is linear wrt N, but exponential wrt k. Third, for datasets that are mainly disk-resident, we present another version of the cell-based algorithm that guarantees at most 3 passes over a dataset. We provide experimental results showing that these cell- based algorithms are by far the best for k <=4.

Copyright © 1998 by the VLDB Endowment. Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by the permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, requires a fee and/or special permission from the Endowment.

Online Paper

Download PDF file (www.vldb.org, Darmstadt, Germany)
Download PDF file (www.acm.org, New York, USA)

ACM SIGMOD DiSC

CDROM Version: Load the CDROM "DiSC, Volume 1 Number 1" and ...

Windows: Click the letter of your CD drive
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
Mac: Click here
UNIX/LINUX: mount the CD and click on the path of your mount point:
/Anthology/Disc99_1 or /cdrom

ACM SIGMOD Anthology

DVD Version: Load ACM SIGMOD Anthology DVD 1" and ...

Windows: Click the letter of your CD drive
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
Mac: Click here
UNIX/LINUX: mount the DVD and click on the path of your mount point:
/Anthology/aDVD1 or /dvd

BibTeX

Printed Edition

Ashish Gupta, Oded Shmueli, Jennifer Widom (Eds.): VLDB'98, Proceedings of 24rd International Conference on Very Large Data Bases, August 24-27, 1998, New York City, New York, USA. Morgan Kaufmann 1998, ISBN 1-55860-566-5
Contents BibTeX

References

[AAR96]: Andreas Arning, Rakesh Agrawal, Prabhakar Raghavan: A Linear Method for Deviation Detection in Large Databases. KDD 1996: 164-169 BibTeX
[AGI+92]: Rakesh Agrawal, Sakti P. Ghosh, Tomasz Imielinski, Balakrishna R. Iyer, Arun N. Swami: An Interval Classifier for Database Mining Applications. VLDB 1992: 560-573 BibTeX
[AIS93]: Rakesh Agrawal, Tomasz Imielinski, Arun N. Swami: Mining Association Rules between Sets of Items in Large Databases. SIGMOD Conference 1993: 207-216 BibTeX
[AL88]: ...
[BCP+97]: Inderpal S. Bhandari, Edward Colet, Jennifer Parker, Zachary Pines, Rajiv Pratap, Krishnakumar Ramanujam: Advanced Scout: Data Mining and Knowledge Discovery in NBA Data. Data Min. Knowl. Discov. 1(1): 121-125(1997) BibTeX
[Ben75]: Jon Louis Bentley: Multidimensional Binary Search Trees Used for Associative Searching. Commun. ACM 18(9): 509-517(1975) BibTeX
[BL94]: ...
[EKSX96]: Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu: A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. KDD 1996: 226-231 BibTeX
[FPP78]: ...
[Gut84]: Antonin Guttman: R-Trees: A Dynamic Index Structure for Spatial Searching. SIGMOD Conference 1984: 47-57 BibTeX
[Haw80]: ...
[HCC92]: Jiawei Han, Yandong Cai, Nick Cercone: Knowledge Discovery in Databases: An Attribute-Oriented Approach. VLDB 1992: 547-559 BibTeX
[HKP97]: Joseph M. Hellerstein, Elias Koutsoupias, Christos H. Papadimitriou: On the Analysis of Indexing Schemes. PODS 1997: 249-256 BibTeX
[JW92]: ...
[KN96]: Edwin M. Knorr, Raymond T. Ng: Finding Aggregate Proximity Relationships and Commonalities in Spatial Data Mining. IEEE Trans. Knowl. Data Eng. 8(6): 884-897(1996) BibTeX
[KN97]: Edwin M. Knorr, Raymond T. Ng: A Unified Notion of Outliers: Properties and Computation. KDD 1997: 219-222 BibTeX
[Kno97]: ...
[MT96]: Heikki Mannila, Hannu Toivonen: Discovering Generalized Episodes Using Minimal Occurrences. KDD 1996: 146-151 BibTeX
[MTV95]: Heikki Mannila, Hannu Toivonen, A. Inkeri Verkamo: Discovering Frequent Episodes in Sequences. KDD 1995: 210-215 BibTeX
[NH94]: Raymond T. Ng, Jiawei Han: Efficient and Effective Clustering Methods for Spatial Data Mining. VLDB 1994: 144-155 BibTeX
[PS88]: ...
[RR96]: ...
[Sam90]: Hanan Samet: The Design and Analysis of Spatial Data Structures. Addison-Wesley 1990
BibTeX
[ZRL96]: Tian Zhang, Raghu Ramakrishnan, Miron Livny: BIRCH: An Efficient Data Clustering Method for Very Large Databases. SIGMOD Conference 1996: 103-114 BibTeX

Referenced by

Edwin M. Knorr, Raymond T. Ng, V. Tucakov: Distance-Based Outliers: Algorithms and Applications. VLDB J. 8(3-4): 237-253(2000)
Themistoklis Palpanas: Knowledge Discovery in Data Warehouses. SIGMOD Record 29(3): 88-100(2000)
Sridhar Ramaswamy, Rajeev Rastogi, Kyuseok Shim: Efficient Algorithms for Mining Outliers from Large Data Sets. SIGMOD Conference 2000: 427-438
Markus M. Breunig, Hans-Peter Kriegel, Raymond T. Ng, Jörg Sander: LOF: Identifying Density-Based Local Outliers. SIGMOD Conference 2000: 93-104
Minos N. Garofalakis, Rajeev Rastogi, S. Seshadri, Kyuseok Shim: Data Mining and the Web: Past, Present and Future. Workshop on Web Information and Data Management 1999: 43-47
H. V. Jagadish, Nick Koudas, S. Muthukrishnan: Mining Deviants in a Time Series Database. VLDB 1999: 102-113
Edwin M. Knorr, Raymond T. Ng: Finding Intensional Knowledge of Distance-Based Outliers. VLDB 1999: 211-222
Venkatesh Ganti, Johannes Gehrke, Raghu Ramakrishnan: A Framework for Measuring Changes in Data Characteristics. PODS 1999: 126-137

BibTeX

ACM SIGMOD Anthology - DBLP: [Home | Search: Author, Title | Conferences | Journals]

VLDB Proceedings: Copyright © by VLDB Endowment,
ACM SIGMOD Anthology: Copyright © by ACM (info@acm.org), Corrections: anthology@acm.org
DBLP: Copyright © by Michael Ley (ley@uni-trier.de), last change: Sat May 16 23:46:22 2009