ACM SIGMOD Anthology VLDB dblp.uni-trier.de

Sampling-Based Estimation of the Number of Distinct Values of an Attribute.

Peter J. Haas, Jeffrey F. Naughton, S. Seshadri, Lynne Stokes: Sampling-Based Estimation of the Number of Distinct Values of an Attribute. VLDB 1995: 311-322
@inproceedings{DBLP:conf/vldb/HaasNSS95,
  author    = {Peter J. Haas and
               Jeffrey F. Naughton and
               S. Seshadri and
               Lynne Stokes},
  editor    = {Umeshwar Dayal and
               Peter M. D. Gray and
               Shojiro Nishio},
  title     = {Sampling-Based Estimation of the Number of Distinct Values of
               an Attribute},
  booktitle = {VLDB'95, Proceedings of 21th International Conference on Very
               Large Data Bases, September 11-15, 1995, Zurich, Switzerland},
  publisher = {Morgan Kaufmann},
  year      = {1995},
  isbn      = {1-55860-379-4},
  pages     = {311-322},
  ee        = {db/conf/vldb/HaasNSS95.html},
  crossref  = {DBLP:conf/vldb/95},
  bibsource = {DBLP, http://dblp.uni-trier.de}
}
BibTeX

Abstract

We provide several new sampling-based estimators of the number of distinctvalues of an attribute in a relation. We compare these new estimators to estimators from the database and statistical literature empirically, using a large number of attribute-value distributions drawn from a variety of real-world databases. This appears to be the first extensive comparison of distinct-value estimators in either the database or statistical literature, and is certainly the first to use highly- skewed data of the sort frequently encountered in database applications. Our experiments indicate that a new "hybrid" estimator yields the highest precision on average for a given sampling fraction. This estimator explicitly takes into account the degree of skew in the data and combines a new "smoothed jackknife" estimator with an estimator due to Shlosser. We investigate how the hybrid estimator behaves as we scale up the size ofthe database.

Copyright © 1995 by the VLDB Endowment. Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by the permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, requires a fee and/or special permission from the Endowment.


Online Paper

ACM SIGMOD Anthology

CDROM Version: Load the CDROM "Volume 1 Issue 5, VLDB '89-'97" and ... DVD Version: Load ACM SIGMOD Anthology DVD 1" and ... BibTeX

Printed Edition

Umeshwar Dayal, Peter M. D. Gray, Shojiro Nishio (Eds.): VLDB'95, Proceedings of 21th International Conference on Very Large Data Bases, September 11-15, 1995, Zurich, Switzerland. Morgan Kaufmann 1995, ISBN 1-55860-379-4
Contents BibTeX

References

[AS72]
...
[ASW87]
Morton M. Astrahan, Mario Schkolnick, Kyu-Young Whang: Approximating the number of unique values of an attribute without sorting. Inf. Syst. 12(1): 11-15(1987) BibTeX
[BFi93]
...
[BO78]
...
[BO79]
...
[BFe93]
Quentin L. Burrell, Michael R. Fenton: Yes, the GIGP Really Does Work - and Is Workable! JASIS 44(2): 61-69(1993) BibTeX
[Cha84]
...
[CL92]
...
[ET93]
...
[FM85]
Philippe Flajolet, G. Nigel Martin: Probabilistic Counting Algorithms for Data Base Applications. J. Comput. Syst. Sci. 31(2): 182-209(1985) BibTeX
[GG82]
Erol Gelenbe, Danièle Gardy: On the Size of Projections: I. Inf. Process. Lett. 14(1): 18-21(1982) BibTeX
[Goo49]
...
[HN+93]
Peter J. Haas, Jeffrey F. Naughton, S. Seshadri, Arun N. Swami: Selectivity and Cost Estimation for Joins Based on Random Sampling. J. Comput. Syst. Sci. 52(3): 550-569(1996) BibTeX
[HS94]
Joseph M. Hellerstein, Michael Stonebraker: Predicate Migration: Optimizing Queries with Expensive Predicates. SIGMOD Conference 1993: 267-276 BibTeX
[HF83]
...
[HOT88]
Wen-Chi Hou, Gultekin Özsoyoglu, Baldeo K. Taneja: Statistical Estimators for Relational Algebra Expressions. PODS 1988: 276-287 BibTeX
[HOT89]
Wen-Chi Hou, Gultekin Özsoyoglu, Baldeo K. Taneja: Processing Aggregate Relational Queries with Hard Time Constraints. SIGMOD Conference 1989: 68-77 BibTeX
[LP56]
...
[NS90]
Jeffrey F. Naughton, S. Seshadri: On Estimating the Size of Projections. ICDT 1990: 499-513 BibTeX
[Olk93]
...
[ODT91]
Gultekin Özsoyoglu, Kaizheng Du, A. Tjahjana, Wen-Chi Hou, D. Y. Rowland: On Estimating COUNT, SUM, and AVERAGE. DEXA 1991: 406-412 BibTeX
[SSW92]
...
[SAC+79]
Patricia G. Selinger, Morton M. Astrahan, Donald D. Chamberlin, Raymond A. Lorie, Thomas G. Price: Access Path Selection in a Relational Database Management System. SIGMOD Conference 1979: 23-34 BibTeX
[Shl81]
...
[Si86a]
...
[Si86b]
...
[Si92]
H. S. Sichel: Anatomy of the Generalized Inverse Gaussian-Poisson Distribution with Special Applications to Bibliometric Studies. Inf. Process. Manage. 28(1): 5-18(1992) BibTeX
[SvB84]
...
[WVT90]
Kyu-Young Whang, Brad T. Vander Zanden, Howard M. Taylor: A Linear-Time Probabilistic Counting Algorithm for Database Applications. ACM Trans. Database Syst. 15(2): 208-229(1990) BibTeX

Referenced by

  1. Christopher R. Palmer, Christos Faloutsos: Density Biased Sampling: An Improved Method for Data Mining and Clustering. SIGMOD Conference 2000: 82-92
  2. Moses Charikar, Surajit Chaudhuri, Rajeev Motwani, Vivek R. Narasayya: Towards Estimation Error Guarantees for Distinct Values. PODS 2000: 268-279
  3. Surajit Chaudhuri, Rajeev Motwani: On Sampling and Relational Operators. IEEE Data Eng. Bull. 22(4): 41-46(1999)
  4. Ju-Hong Lee, Deok-Hwan Kim, Chin-Wan Chung: Multi-dimensional Selectivity Estimation Using Compressed Histogram Information. SIGMOD Conference 1999: 205-214
  5. Swarup Acharya, Viswanath Poosala, Sridhar Ramaswamy: Selectivity Estimation in Spatial Databases. SIGMOD Conference 1999: 13-24
  6. Swarup Acharya, Phillip B. Gibbons, Viswanath Poosala, Sridhar Ramaswamy: Join Synopses for Approximate Query Answering. SIGMOD Conference 1999: 275-286
  7. Noga Alon, Phillip B. Gibbons, Yossi Matias, Mario Szegedy: Tracking Join and Self-Join Sizes in Limited Storage. PODS 1999: 10-20
  8. Hidetoshi Uchiyama, Kanda Runapongsa, Toby J. Teorey: A Progressive View Materialization Algorithm. DOLAP 1999: 36-41
  9. Joseph M. Hellerstein: Optimization Techniques for Queries with Expensive Methods. ACM Trans. Database Syst. 23(2): 113-157(1998)
  10. M. Seetha Lakshmi, Shaoyu Zhou: Selectivity Estimation in Extensible Databases - A Neural Network Approach. VLDB 1998: 623-627
  11. Johannes Gehrke, Raghu Ramakrishnan, Venkatesh Ganti: RainForest - A Framework for Fast Decision Tree Construction of Large Datasets. VLDB 1998: 416-427
  12. Phillip B. Gibbons, Yossi Matias: New Sampling-Based Summary Statistics for Improving Approximate Query Answers. SIGMOD Conference 1998: 331-342
  13. Surajit Chaudhuri, Rajeev Motwani, Vivek R. Narasayya: Random Sampling for Histogram Construction: How much is enough? SIGMOD Conference 1998: 436-447
  14. Surajit Chaudhuri: An Overview of Query Optimization in Relational Systems. PODS 1998: 34-43
  15. Takeshi Fukuda, Hirofumi Matsuzawa: Parallel Processing of Multiple Aggregate Queries on Shared-Nothing Multiprocessors. EDBT 1998: 278-292
  16. Daniel Barbará, William DuMouchel, Christos Faloutsos, Peter J. Haas, Joseph M. Hellerstein, Yannis E. Ioannidis, H. V. Jagadish, Theodore Johnson, Raymond T. Ng, Viswanath Poosala, Kenneth A. Ross, Kenneth C. Sevcik: The New Jersey Data Reduction Report. IEEE Data Eng. Bull. 20(4): 3-45(1997)
  17. Himanshu Gupta, Venky Harinarayan, Anand Rajaraman, Jeffrey D. Ullman: Index Selection for OLAP. ICDE 1997: 208-219
  18. Amit Shukla, Prasad Deshpande, Jeffrey F. Naughton, Karthikeyan Ramasamy: Storage Estimation for Multidimensional Aggregates in the Presence of Hierarchies. VLDB 1996: 522-531
  19. Christos Faloutsos, Yossi Matias, Abraham Silberschatz: Modeling Skewed Distribution Using Multifractals and the `80-20' Law. VLDB 1996: 307-317
  20. Sameet Agarwal, Rakesh Agrawal, Prasad Deshpande, Ashish Gupta, Jeffrey F. Naughton, Raghu Ramakrishnan, Sunita Sarawagi: On the Computation of Multidimensional Aggregates. VLDB 1996: 506-521
  21. Viswanath Poosala, Yannis E. Ioannidis, Peter J. Haas, Eugene J. Shekita: Improved Histograms for Selectivity Estimation of Range Predicates. SIGMOD Conference 1996: 294-305
  22. Joseph M. Hellerstein, Jeffrey F. Naughton: Query Execution Techniques for Caching Expensive Methods. SIGMOD Conference 1996: 423-434
  23. Venky Harinarayan, Anand Rajaraman, Jeffrey D. Ullman: Implementing Data Cubes Efficiently. SIGMOD Conference 1996: 205-216
  24. Sumit Ganguly, Phillip B. Gibbons, Yossi Matias, Abraham Silberschatz: Bifocal Sampling for Skew-Resistant Join Size Estimation. SIGMOD Conference 1996: 271-281
BibTeX
ACM SIGMOD Anthology - DBLP: [Home | Search: Author, Title | Conferences | Journals]
VLDB Proceedings: Copyright © by VLDB Endowment,
ACM SIGMOD Anthology: Copyright © by ACM (info@acm.org), Corrections: anthology@acm.org
DBLP: Copyright © by Michael Ley (ley@uni-trier.de), last change: Sat May 16 23:46:05 2009