Digital Symposium Collection 2000  

 
 
 
 
 
 

 




















Tracking Join and Self-Join Sizes in Limited Storage

Noga Alon, Phillip B. Gibbons, Yossi Matias, and Mario Szegedy

  View Paper (PDF)  

Return to Query Optimization

Abstract
Query optimizers rely on fast, high-quality estimates of result sizes in order to select between various join plans. Self-join sizes of relations provide bounds on the join size of any pairs of such relations. It also indicates the degree of skew in the data, and has been advocated for several estimation procedures. Exact computation of the self-join size requires storage proportional to, the number of distinct attribute values, which may be prohibitively large. In this paper, we study algorithms for tracking (approximate) self-join sizes in limited storage in the presence of insertions and deletions to the relations. Such algorithms detect changes in the degree of skew without an expensive recomputation from the base data. We show that an algorithm based on a tug-ofwar approach provides a more accurate estimation than one based on a sample-and-count approach which is in turn more accurate than a sampling-only approach. Next, we study algorithms for tracking (approximate) join sizes in limited storage; the goal is to maintain a small signature of each relation such that join sizes can be accurately estimated between any pairs of relations. We show that taking random samples for join signatures can lead to inaccurate estimation unless the sample size is quite large; moreover, by a lower bound we show, no other signature scheme can significantly improve upon sampling without further assumptions. These negative results are shown to hold even in the presence of sanity bounds. On the other hand, we present a join signature scheme based on tug-of-war signatures that probvides guarantees on join size estimation as a function of the self-join sizes of the joining relations; this scheme can significantly improve upon the sampling scheme.


References

Note: References link to DBLP on the Web.

[AGP99]
...
[AGPR99]
Swarup Acharya , Phillip B. Gibbons , Viswanath Poosala , Sridhar Ramaswamy : Join Synopses for Approximate Query Answering. SIGMOD Conference 1999 : 275-286
[AMS96]
Noga Alon , Yossi Matias , Mario Szegedy : The Space Complexity of Approximating the Frequency Moments. STOC 1996 : 20-29
[BDF+97]
Daniel Barbará , William DuMouchel , Christos Faloutsos , Peter J. Haas , Joseph M. Hellerstein , Yannis E. Ioannidis , H. V. Jagadish , Theodore Johnson , Raymond T. Ng , Viswanath Poosala , Kenneth A. Ross , Kenneth C. Sevcik : The New Jersey Data Reduction Report. Data Engineering Bulletin 20(4) : 3-45(1997)
[GGMS96]
Sumit Ganguly , Phillip B. Gibbons , Yossi Matias , Abraham Silberschatz : Bifocal Sampling for Skew-Resistant Join Size Estimation. SIGMOD Conf. 1996 : 271-281
[GM98a]
Phillip B. Gibbons , Yossi Matias : New Sampling-Based Summary Statistics for Improving Approximate Query Answers. SIGMOD Conference 1998 : 331-342
[GM98b]
...
[GMP97a]
...
[GMP97b]
Phillip B. Gibbons , Yossi Matias , Viswanath Poosala : Fast Incremental Maintenance of Approximate Histograms. VLDB 1997 : 466-475
[Goo89]
...
[HH99]
Peter J. Haas , Joseph M. Hellerstein : Ripple Joins for Online Aggregation. SIGMOD Conference 1999 : 287-298
[HHW97]
Joseph M. Hellerstein , Peter J. Haas , Helen Wang : Online Aggregation. SIGMOD Conference 1997 : 171-182
[HNSS93]
Peter J. Haas , Jeffrey F. Naughton , S. Seshadri , Arun N. Swami : Fixed-Precision Estimation of Join Selectivity. PODS 1993 : 190-201
[HNSS95]
Peter J. Haas , Jeffrey F. Naughton , S. Seshadri , Lynne Stokes : Sampling-Based Estimation of the Number of Distinct Values of an Attribute. VLDB 1995 : 311-322
[HÖT88]
Wen-Chi Hou , Gultekin Özsoyoglu , Baldeo K. Taneja : Statistical Estimators for Relational Algebra Expressions. PODS 1988 : 276-287
[IP95]
Yannis E. Ioannidis , Viswanath Poosala : Balancing Histogram Optimality and Practicality for Query Result Size Estimation. SIGMOD Conference 1995 : 233-244
[LN95]
Richard J. Lipton , Jeffrey F. Naughton : Query Size Estimation by Adaptive Sampling. JCSS 51(1) : 18-25(1995)
[LNS90]
Richard J. Lipton , Jeffrey F. Naughton , Donovan A. Schneider : Practical Selectivity Estimation through Adaptive Sampling. SIGMOD Conference 1990 : 1-11
[MRL98]
Gurmeet Singh Manku , Sridhar Rajagopalan , Bruce G. Lindsay : Approximate Medians and other Quantiles in One Pass and with Limited Memory. SIGMOD Conference 1998 : 426-435
[MS99]
...
[Poo97]
Viswanath Poosala : Histogram-Based Estimation Techniques in Database Systems. Ph.D. thesis, Univ. of Wisconsin-Madison 1997
[Vit85]
Jeffrey Scott Vitter : Random Sampling with a Reservoir. TOMS 11(1) : 37-57(1985)
[VL93]
Susan V. Vrbsky , Jane W.-S. Liu : APPROXIMATE - A Query Processor that Produces Monotonically Improving Approximate Answers. TKDE 5(6) : 1056-1068(1993)

BIBTEX

@inproceedings{DBLP:conf/pods/AlonGMS99,
  author    = {Noga Alon and
                Phillip B. Gibbons and
                Yossi Matias and
                Mario Szegedy},
   title     = {Tracking Join and Self-Join Sizes in Limited Storage},
   booktitle = {Proceedings of the Eighteenth ACM SIGACT-SIGMOD-SIGART Symposium
                on Principles of Database Systems, May 31 - June 2, 1999, Philadelphia,
                Pennsylvania},
   publisher = {ACM Press},
   year      = {1999},
   isbn      = {1-58113-062-7},
   pages     = {10-20},
   crossref  = {DBLP:conf/pods/99},
   bibsource = {DBLP, http://dblp.uni-trier.de} } },


























Copyright(C) 2000 ACM