Digital Symposium Collection 2000  

 
 
 
 
 
 

 





















Random Sampling Techniques for Space Efficient Online Computation of Order Statistics of Large Datasets

Gurmeet Singh Manku, Sridhar Rajagopalan, and Bruce G. Lindsay

  View Paper (PDF)  

Return to Sampling - Approximate Answers

Abstract
In a recent paper [MRL98], we had described a general framework for single pass approximate quantile finnding algorithms. This framework included several known algorithms as special cases. We had identified a new algorithm, within the framework, which had a significantly smaller requirement for main memory than other known algorithms. In this paper, we address two issues left open in our earlier paper. First, all known and space efficient algorithms for approximate quantile finding require advance knowledge of the length of the input sequence. Many important database applications employing quantiles cannot provide this information. In this paper, we present a novel non-uniform random sampling scheme and an extension of our framework. Together, they form the basis of a new algorithm which computes approximate quantiles without knowing the input sequence length. Second, if the desired quantile is an extreme value (e.g., within the top 1% of the elements), the space requirements of currently known algorithms are overly pessimistic. We provide a simple algorithm which estimates extreme values using less space than required by the earlier more general technique for computing all quantiles. Our principal observation here is that random sampling is quantifiably better when estimating extreme values than is the case with the median.


References

Note: References link to DBLP on the Web.

[ARS97]
Khaled Alsabti , Sanjay Ranka , Vineet Singh : A One-Pass Algorithm for Accurately Estimating Quantiles for Disk-Resident Data. VLDB 1997 : 346-355
[AS95]
Rakesh Agrawal , Arun N. Swami : A One-Pass Space-Efficient Algorithm for Finding Quantiles. COMAD 1995 : 0-
[BFP+73]
Manuel Blum , Robert W. Floyd , Vaughan R. Pratt , Ronald L. Rivest , Robert Endre Tarjan : Time Bounds for Selection. JCSS 7(4) : 448-461(1973)
[CMN98]
Surajit Chaudhuri , Rajeev Motwani , Vivek R. Narasayya : Random Sampling for Histogram Construction: How much is enough? SIGMOD Conference 1998 : 436-447
[CT91]
...
[DB2]
...
[DNS91]
David J. DeWitt , Jeffrey F. Naughton , Donovan A. Schneider : Parallel Sorting on a Shared-Nothing Architecture using Probabilistic Splitting. PDIS 1991 : 280-291
[GM98]
Phillip B. Gibbons , Yossi Matias : New Sampling-Based Summary Statistics for Improving Approximate Query Answers. SIGMOD Conference 1998 : 331-342
[GM99]
...
[GMP97]
Phillip B. Gibbons , Yossi Matias , Viswanath Poosala : Fast Incremental Maintenance of Approximate Histograms. VLDB 1997 : 466-475
[Hel97]
Joseph M. Hellerstein : Online Processing Redux. Data Engineering Bulletin 20(3) : 20-29(1997)
[Hoe63]
...
[Inf]
...
[MP80]
J. Ian Munro , Michael S. Paterson : Selection and Sorting with Limited Storage. TCS 12 : 315-323(1980)
[MRL98]
Gurmeet Singh Manku , Sridhar Rajagopalan , Bruce G. Lindsay : Approximate Medians and other Quantiles in One Pass and with Limited Memory. SIGMOD Conference 1998 : 426-435
[Pat97]
Michael S. Paterson : Progress in Selection. SWAT 1996 : 368-379
[PIHS96]
Viswanath Poosala , Yannis E. Ioannidis , Peter J. Haas , Eugene J. Shekita : Improved Histograms for Selectivity Estimation of Range Predicates. SIGMOD Conf. 1996 : 294-305
[Poh69]
...
[SALP79]
Patricia G. Selinger , Morton M. Astrahan , Donald D. Chamberlin , Raymond A. Lorie , Thomas G. Price : Access Path Selection in a Relational Database Management System. SIGMOD Conference 1979 : 23-34
[Vit85]
Jeffrey Scott Vitter : Random Sampling with a Reservoir. TOMS 11(1) : 37-57(1985)
[Yao74]
...

Referenced by

  1. Peter J. Haas : Techniques for Online Exploration of Large Object-Relational Datasets. SSDBM 1999 : 4-12

BIBTEX

@inproceedings{DBLP:conf/sigmod/MankuRL99,
  author    = {Gurmeet Singh Manku and
                Sridhar Rajagopalan and
                Bruce G. Lindsay},
   editor    = {Alex Delis and
                Christos Faloutsos and
                Shahram Ghandeharizadeh},
   title     = {Random Sampling Techniques for Space Efficient Online Computation
                of Order Statistics of Large Datasets},
   booktitle = {SIGMOD 1999, Proceedings ACM SIGMOD International Conference
                on Management of Data, June 1-3, 1999, Philadephia, Pennsylvania,
                USA},
   publisher = {ACM Press},
   year      = {1999},
   isbn      = {1-58113-084-8},
   pages     = {251-262},
   crossref  = {DBLP:conf/sigmod/99},
   bibsource = {DBLP, http://dblp.uni-trier.de} } },


























Copyright(C) 2000 ACM