Snakes and Sandwiches: Optimal Clustering Strategies for a Data Warehouse

H.V. Jagadish       Laks V.S. Lakshmanan*       Divesh Srivastava
AT&T Labs Research       Concordia University       AT&T Labs Research
jag@research.att.com       laks@cs.concordia.ca       divesh@research.att.com

Abstract

Physical layout of data is a crucial determinant of performance in a data warehouse. The optimal clustering of data on disk, for minimizing expected I/O, depends on the query workload. In practice, we often have a reasonable sense of the likelihood of different classes of queries, e.g., 40\% of the queries concern calls made from {\sl some} specific telephone number in {\sl some} month. In this paper, we address the problem of finding an optimal clustering of records of a fact table on disk, given an expected workload in the form of a probability distribution over query classes.

Attributes in a data warehouse fact table typically have hierarchies defined on them (by means of auxiliary dimension tables). The product of the dimensional hierarchy levels forms a lattice and leads to a natural notion of query classes. Optimal clustering is a combinatorially explosive problem with a huge search space (doubly exponential in number of hierarchy levels). We identify an important subclass of clustering strategies called {\em lattice paths}, and present a dynamic programming algorithm for finding the optimal lattice path clustering, in time linear in the lattice size.

We additionally propose a technique called {\em snaking}, which when applied to a lattice path, always reduces its cost. For a representative class of star schemas, we show that for every workload, there is a snaked lattice path which is globally optimal. Further, we prove that the clustering obtained by applying snaking to the optimal lattice path is never much worse than the globally optimal snaked lattice path clustering. We show: (i) there are workloads where the often proposed Hilbert curve clustering can be arbitrarily worse than the optimal snaked lattice path, but (ii) when the workload is unknown a priori, the worst case performance of the Hilbert curve is better than that of snaked lattice paths.

We complement our analyses and validate the practical utility of our techniques with experiments using TPC-D benchmark data.