ACM SIGMOD Anthology TODS dblp.uni-trier.de

A Linear-Time Probabilistic Counting Algorithm for Database Applications.

Kyu-Young Whang, Brad T. Vander Zanden, Howard M. Taylor: A Linear-Time Probabilistic Counting Algorithm for Database Applications. ACM Trans. Database Syst. 15(2): 208-229(1990)
@article{DBLP:journals/tods/WhangVT90,
  author    = {Kyu-Young Whang and
               Brad T. Vander Zanden and
               Howard M. Taylor},
  title     = {A Linear-Time Probabilistic Counting Algorithm for Database Applications},
  journal   = {ACM Trans. Database Syst.},
  volume    = {15},
  number    = {2},
  year      = {1990},
  pages     = {208-229},
  ee        = {http://doi.acm.org/10.1145/78922.78925, db/journals/tods/WhangVT90.html},
  bibsource = {DBLP, http://dblp.uni-trier.de}
}
BibTeX

Abstract

We present a probabilistic algorithm for counting the number of unique values in the presence of duplicates. This algorithm has O(q) time complexity, where q is the number of values including duplicates, and produces an estimation with an arbitrary accuracy prespecified by the user using only a small amount of space. Traditionally, accurate counts of unique values were obtained by sorting, which has O(q log q) time complexity. Our technique, called linear counting, is based on hashing. We present a comprehensive theoretical and experimental analysis of linear counting. The analysis reveals an interesting result: A load factor (number of unique values/hash table size) much larger than 1.0 (e.g., 12) can be used for accurate estimation (e.g., 1% of error). We present this technique with two important applications to database problems: namely, (1) obtaining the column cardinality (the number of unique values in a column of a relation) and (2) obtaining the join selectivity (the number of unique values in the join column resulting from an unconditional join divided by the number of unique join column values in the relation to he joined). These two parameters are important statistics that are used in relational query optimization and physical database design.

Copyright © 1990 by the ACM, Inc., used by permission. Permission to make digital or hard copies is granted provided that copies are not made or distributed for profit or direct commercial advantage, and that copies show this notice on the first page or initial screen of a display along with the full citation.


Joint ACM SIGMOD / IEEE Computer Society Anthology

CDROM Version: Load the CDROM "Volume 3 Issue 1, TODS 1976-1990" and ... DVD Version: Load ACM SIGMOD Anthology DVD 2" and ... BibTeX

References

[1]
Morton M. Astrahan, Mario Schkolnick, Kyu-Young Whang: Approximating the number of unique values of an attribute without sorting. Inf. Syst. 12(1): 11-15(1987) BibTeX
[2]
Dina Bitton, David J. DeWitt: Duplicate Record Elimination in Large Data Files. ACM Trans. Database Syst. 8(2): 255-265(1983) BibTeX
[3]
...
[4]
Robert Demolombe: Estimation of the Number of Tuples Satisfying a Query Expressed in Predicate Calculus Language. VLDB 1980: 55-63 BibTeX
[5]
...
[6]
Erol Gelenbe, Danièle Gardy: The Size of Projections of Relations Satisfying a Functional Dependency. VLDB 1982: 325-333 BibTeX
[7]
...
[8]
...
[9]
Frank Olken, Doron Rotem: Simple Random Sampling from Relational Databases. VLDB 1986: 160-169 BibTeX
[10]
Neil C. Rowe: Antisampling for Estimation: An Overview. IEEE Trans. Software Eng. 11(10): 1081-1091(1985) BibTeX
[11]
Patricia G. Selinger, Morton M. Astrahan, Donald D. Chamberlin, Raymond A. Lorie, Thomas G. Price: Access Path Selection in a Relational Database Management System. SIGMOD Conference 1979: 23-34 BibTeX
[12]
...
[13]
Brad T. Vander Zanden, Howard M. Taylor, Dina Bitton: Estimating Block Accessses when Attributes are Correlated. VLDB 1986: 119-127 BibTeX
[14]
Kyu-Young Whang, Gio Wiederhold, Daniel Sagalowicz: Estimating Block Accesses in Database Organizations: A Closed Noniterative Formula. Commun. ACM 26(11): 940-944(1983) BibTeX
[15]
Kyu-Young Whang, Gio Wiederhold, Daniel Sagalowicz: Separability - An Approach to Physical Database Design. IEEE Trans. Computers 33(3): 209-222(1984) BibTeX
[16]
Gio Wiederhold, Ramez Elmasri: The Structural Model for Database Design. ER 1979: 237-258 BibTeX
[17]
...
[18]
Karel Youssefi, Eugene Wong: Query Processing in a Relational Database Management System. VLDB 1979: 409-417 BibTeX
[19]
Clement T. Yu, K. Lam, M. K. Siu, Z. Meral Özsoyoglu: Performance Analysis of three Related Assignment Problems. SIGMOD Conference 1979: 82-92 BibTeX

Referenced by

  1. Moses Charikar, Surajit Chaudhuri, Rajeev Motwani, Vivek R. Narasayya: Towards Estimation Error Guarantees for Distinct Values. PODS 2000: 268-279
  2. Min Fang, Narayanan Shivakumar, Hector Garcia-Molina, Rajeev Motwani, Jeffrey D. Ullman: Computing Iceberg Queries Efficiently. VLDB 1998: 299-310
  3. Phillip B. Gibbons, Yossi Matias: New Sampling-Based Summary Statistics for Improving Approximate Query Answers. SIGMOD Conference 1998: 331-342
  4. P. Krishnan, Jeffrey Scott Vitter, Balakrishna R. Iyer: Estimating Alphanumeric Selectivity in the Presence of Wildcards. SIGMOD Conference 1996: 282-293
  5. Paolo Ciaccia, Dario Maio: Domains and Active Domains: What This Distinction Implies for the Estimation of Projection Sizes in Relational Databases. IEEE Trans. Knowl. Data Eng. 7(4): 641-655(1995)
  6. Peter J. Haas, Jeffrey F. Naughton, S. Seshadri, Lynne Stokes: Sampling-Based Estimation of the Number of Distinct Values of an Attribute. VLDB 1995: 311-322
  7. Kyu-Young Whang, Sang-Wook Kim, Gio Wiederhold: Dynamic Maintenance of Data Distribution for Selectivity Estimation. VLDB J. 3(1): 29-51(1994)
  8. Jeffrey F. Naughton, S. Seshadri: On Estimating the Size of Projections. ICDT 1990: 499-513
BibTeX
ACM SIGMOD Anthology - DBLP: [Home | Search: Author, Title | Conferences | Journals]
TODS, ACM SIGMOD Anthology: Copyright © by ACM (info@acm.org), Corrections: anthology@acm.org
DBLP: Copyright © by Michael Ley (ley@uni-trier.de), last change: Tue Jun 24 18:39:08 2008