The authors show how to estimate aggregate query results in data warehouses by sampling. The cool thing is that their techniques come with error bounds; in their experiments, these bounds were conservative but still useful.
Basic problem: if you sample all tables independently and then compute aggregates over the joined samples, that doesn't work in general. For example, if you join R and S on a key of R matching a key of S, each matching pair of tuples survives only if both sides are sampled, so taking a 1/10 sample of each table yields a join whose expected size is 1/100 of the real join's size. So one must be more clever.
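A minimal sketch of why independent sampling shrinks a key-key join quadratically; the toy tables and the 10% rate here are illustrative, not from the paper:

```python
import random

random.seed(0)

# Toy tables: R and S share a key domain, so the full key-key join
# R JOIN S produces one output row per common key.
n = 100_000
R = set(range(n))   # keys of R
S = set(range(n))   # keys of S; the full join has n rows

rate = 0.10
R_sample = {k for k in R if random.random() < rate}
S_sample = {k for k in S if random.random() < rate}

full_join = len(R & S)                   # n rows
sample_join = len(R_sample & S_sample)   # a key survives in BOTH samples
                                         # with probability rate**2 = 1/100

print(full_join)    # 100000
print(sample_join)  # roughly 1000, i.e. ~1/100 of the real join
```

The join of the samples is not a 1/10 sample of the join; it is a 1/100 sample, and for non-key joins the output can be even more badly skewed.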
Their basic idea: suppose you have tables R, S, T, ... linked by foreign-key joins, with R as the root (i.e., R has a field that references the key of S, and so on down the chain). Take a sample of R and then perform all these foreign-key joins starting from the sample, giving R', S', T', .... Now, if a query involves R, S, T and includes the foreign-key links among them, the query can be answered with great accuracy from R', S', T'. The error can be estimated by partitioning R' into several pieces, computing the result on each, and looking at the variance across the partitions. A sketch of both steps follows. Please see the paper for other technical details.
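A minimal sketch of the root-sampling idea and the partition-based error estimate, under assumptions of mine: the schema R(r_id, s_id, amount) with foreign key s_id into S(s_id, weight), the 1% sampling rate, and the SUM query are all hypothetical, and the variance step is a plain split-into-k-pieces standard error rather than the paper's exact estimator:

```python
import random
import statistics

random.seed(1)

# Hypothetical schema: every R-tuple references exactly one S-tuple.
S = {s_id: {"weight": random.uniform(0.5, 1.5)} for s_id in range(1000)}
R = [{"r_id": i, "s_id": random.randrange(1000), "amount": random.uniform(0, 100)}
     for i in range(100_000)]

# Step 1: sample ONLY the root table R.
rate = 0.01
R_prime = [r for r in R if random.random() < rate]

# Step 2: follow the foreign key from the sample. Each sampled R-tuple
# finds its unique matching S-tuple, so no join rows are lost, unlike
# the independent-sampling scheme above.
joined = [(r, S[r["s_id"]]) for r in R_prime]

# Target query: SUM(amount * weight) over R JOIN S, scaled up by 1/rate.
def scaled_sum(rows, scale):
    return scale * sum(r["amount"] * s["weight"] for r, s in rows)

estimate = scaled_sum(joined, 1.0 / rate)

# Step 3: error estimate. Split the sample into k partitions; each is a
# (rate/k)-sample of R, so each yields its own estimate of the full sum.
# The spread of those estimates gives a standard error for the answer.
k = 10
parts = [joined[i::k] for i in range(k)]
part_estimates = [scaled_sum(p, k / rate) for p in parts]
stderr = statistics.stdev(part_estimates) / k ** 0.5

true_value = sum(r["amount"] * S[r["s_id"]]["weight"] for r in R)
print(f"true={true_value:.0f} estimate={estimate:.0f} stderr~{stderr:.0f}")
```

Note that the mean of the k per-partition estimates equals the overall estimate, which is why the variance across partitions is a natural way to bound its error.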
Copyright © 1999 by the author(s). Review published with permission.