Space-efficient random sampling techniques for computing order statistics online

Gurmeet Singh Manku*       Sridhar Rajagopalan       Bruce G. Lindsay
IBM Almaden       IBM Almaden       IBM Almaden
manku@almaden.ibm.com       sridhar@almaden.ibm.com       bruce@almaden.ibm.com

Abstract

In a previous paper (SIGMOD '98), we described a general framework for one-pass approximate quantile-finding algorithms that require very little main memory. In this paper, we address two issues that arise in the context of our earlier work.

First, all previously studied deterministic and randomized algorithms seem to require advance knowledge of the length of the input sequence. Some important database applications that employ quantiles cannot provide a reliable estimate of input size at the outset. For example, quantiles can serve as a new aggregation operator, in which case the input might be an intermediate table whose cardinality is known, at best, as a crude estimate from the query optimizer. For dynamic database tables, approximate quantiles can be used to maintain equi-depth histograms, which are expected to remain accurate irrespective of the current size of the table.

Second, if the desired quantile is an extreme value (e.g., the top 100,000 elements, or the top 1% of a 10-million-item dataset), then the amount of space required by currently known algorithms seems overly pessimistic. There appears to be no prior work focused on finding extreme values efficiently.

We present random sampling methods that address and resolve these issues. For the first problem, our sampling methods are tailored to work well with the deterministic algorithm we presented in our earlier work. A novel aspect of this sampling scheme is that the sampling rate is non-uniform. For the second, we show that computing extreme values using random sampling is far more precise than computing quantiles close to the median using the same amount of space.
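As a rough illustration of the baseline idea, the sketch below estimates a quantile from a uniform random sample maintained online with reservoir sampling. The sample size and the reservoir method are illustrative assumptions for this sketch; they are not the non-uniform sampling scheme or the deterministic algorithm described in this paper.

```python
import random


def approx_quantile(stream, phi, sample_size, rng=random.Random(0)):
    """Estimate the phi-quantile of a data stream from a uniform
    random sample kept with reservoir sampling.

    Note: a uniform-rate sketch for illustration only; the paper's
    scheme uses a non-uniform sampling rate and does not require
    the stream length in advance either.
    """
    reservoir = []
    for i, x in enumerate(stream):
        if i < sample_size:
            reservoir.append(x)
        else:
            # Each of the first i+1 items ends up in the reservoir
            # with equal probability sample_size / (i + 1).
            j = rng.randrange(i + 1)
            if j < sample_size:
                reservoir[j] = x
    reservoir.sort()
    # The phi-quantile of the sample approximates that of the stream.
    k = min(int(phi * len(reservoir)), len(reservoir) - 1)
    return reservoir[k]
```

With a sample of 1,000 items from a stream of a million, the estimated median is typically within a few percent of the true median, while an extreme quantile such as the 0.99-quantile lands proportionally much closer to its true rank, consistent with the precision gap the abstract describes.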