A Linear-Time Probabilistic Counting Algorithm for Database Applications.
Kyu-Young Whang, Brad T. Vander Zanden, Howard M. Taylor:
A Linear-Time Probabilistic Counting Algorithm for Database Applications.
ACM Trans. Database Syst. 15(2): 208-229(1990)@article{DBLP:journals/tods/WhangVT90,
author = {Kyu-Young Whang and
Brad T. Vander Zanden and
Howard M. Taylor},
title = {A Linear-Time Probabilistic Counting Algorithm for Database Applications},
journal = {ACM Trans. Database Syst.},
volume = {15},
number = {2},
year = {1990},
pages = {208-229},
ee = {http://doi.acm.org/10.1145/78922.78925, db/journals/tods/WhangVT90.html},
bibsource = {DBLP, http://dblp.uni-trier.de}
}
BibTeX
Abstract
We present a probabilistic algorithm for counting the number of
unique values in the presence of duplicates. This algorithm has
O(q) time complexity, where q is the number of values
including duplicates, and produces an estimation with an arbitrary
accuracy prespecified by the user using only a small amount of
space. Traditionally, accurate counts of unique values were
obtained by sorting, which has O(q log q) time complexity.
Our technique, called linear counting, is based on hashing. We
present a comprehensive theoretical and experimental analysis of
linear counting. The analysis reveals an interesting result: A load
factor (number of unique values/hash table size) much larger than
1.0 (e.g., 12) can be used for accurate estimation (e.g., 1% of
error). We present this technique with two important applications to
database problems: namely, (1) obtaining the column cardinality (the
number of unique values in a column of a relation) and (2) obtaining
the join selectivity (the number of unique values in the join column
resulting from an unconditional join divided by the number of unique
join column values in the relation to he joined). These two
parameters are important statistics that are used in relational
query optimization and physical database design.
Copyright © 1990 by the ACM,
Inc., used by permission. Permission to make
digital or hard copies is granted provided that
copies are not made or distributed for profit or
direct commercial advantage, and that copies show
this notice on the first page or initial screen of
a display along with the full citation.
CDROM Version: Load the CDROM "Volume 3 Issue 1, TODS 1976-1990" and ...
DVD Version: Load ACM SIGMOD Anthology DVD 2" and ...
BibTeX
References
- [1]
- Morton M. Astrahan, Mario Schkolnick, Kyu-Young Whang:
Approximating the number of unique values of an attribute without sorting.
Inf. Syst. 12(1): 11-15(1987) BibTeX
- [2]
- Dina Bitton, David J. DeWitt:
Duplicate Record Elimination in Large Data Files.
ACM Trans. Database Syst. 8(2): 255-265(1983) BibTeX
- [3]
- ...
- [4]
- Robert Demolombe:
Estimation of the Number of Tuples Satisfying a Query Expressed in Predicate Calculus Language.
VLDB 1980: 55-63 BibTeX
- [5]
- ...
- [6]
- Erol Gelenbe, Danièle Gardy:
The Size of Projections of Relations Satisfying a Functional Dependency.
VLDB 1982: 325-333 BibTeX
- [7]
- ...
- [8]
- ...
- [9]
- Frank Olken, Doron Rotem:
Simple Random Sampling from Relational Databases.
VLDB 1986: 160-169 BibTeX
- [10]
- Neil C. Rowe:
Antisampling for Estimation: An Overview.
IEEE Trans. Software Eng. 11(10): 1081-1091(1985) BibTeX
- [11]
- Patricia G. Selinger, Morton M. Astrahan, Donald D. Chamberlin, Raymond A. Lorie, Thomas G. Price:
Access Path Selection in a Relational Database Management System.
SIGMOD Conference 1979: 23-34 BibTeX
- [12]
- ...
- [13]
- Brad T. Vander Zanden, Howard M. Taylor, Dina Bitton:
Estimating Block Accessses when Attributes are Correlated.
VLDB 1986: 119-127 BibTeX
- [14]
- Kyu-Young Whang, Gio Wiederhold, Daniel Sagalowicz:
Estimating Block Accesses in Database Organizations: A Closed Noniterative Formula.
Commun. ACM 26(11): 940-944(1983) BibTeX
- [15]
- Kyu-Young Whang, Gio Wiederhold, Daniel Sagalowicz:
Separability - An Approach to Physical Database Design.
IEEE Trans. Computers 33(3): 209-222(1984) BibTeX
- [16]
- Gio Wiederhold, Ramez Elmasri:
The Structural Model for Database Design.
ER 1979: 237-258 BibTeX
- [17]
- ...
- [18]
- Karel Youssefi, Eugene Wong:
Query Processing in a Relational Database Management System.
VLDB 1979: 409-417 BibTeX
- [19]
- Clement T. Yu, K. Lam, M. K. Siu, Z. Meral Özsoyoglu:
Performance Analysis of three Related Assignment Problems.
SIGMOD Conference 1979: 82-92 BibTeX
Referenced by
- Moses Charikar, Surajit Chaudhuri, Rajeev Motwani, Vivek R. Narasayya:
Towards Estimation Error Guarantees for Distinct Values.
PODS 2000: 268-279
- Min Fang, Narayanan Shivakumar, Hector Garcia-Molina, Rajeev Motwani, Jeffrey D. Ullman:
Computing Iceberg Queries Efficiently.
VLDB 1998: 299-310
- Phillip B. Gibbons, Yossi Matias:
New Sampling-Based Summary Statistics for Improving Approximate Query Answers.
SIGMOD Conference 1998: 331-342
- P. Krishnan, Jeffrey Scott Vitter, Balakrishna R. Iyer:
Estimating Alphanumeric Selectivity in the Presence of Wildcards.
SIGMOD Conference 1996: 282-293
- Paolo Ciaccia, Dario Maio:
Domains and Active Domains: What This Distinction Implies for the Estimation of Projection Sizes in Relational Databases.
IEEE Trans. Knowl. Data Eng. 7(4): 641-655(1995)
- Peter J. Haas, Jeffrey F. Naughton, S. Seshadri, Lynne Stokes:
Sampling-Based Estimation of the Number of Distinct Values of an Attribute.
VLDB 1995: 311-322
- Kyu-Young Whang, Sang-Wook Kim, Gio Wiederhold:
Dynamic Maintenance of Data Distribution for Selectivity Estimation.
VLDB J. 3(1): 29-51(1994)
- Jeffrey F. Naughton, S. Seshadri:
On Estimating the Size of Projections.
ICDT 1990: 499-513
BibTeX
ACM SIGMOD Anthology - DBLP:
[Home | Search: Author, Title | Conferences | Journals]
TODS, ACM SIGMOD Anthology: Copyright © by ACM (info@acm.org), Corrections: anthology@acm.org
DBLP: Copyright © by Michael Ley (ley@uni-trier.de), last change: Tue Jun 24 18:39:08 2008