|




















|
|
 |
|
 |
|
Snakes and Sandwiches: Optimal Clustering Strategies for a Data Warehouse
|
H. V. Jagadish,
Laks V. S. Lakshmanan, and
Divesh Srivastava
View Paper (PDF)
Return to Clustering
Physical layout of data is a crucial determinant of performance in a data warehouse. The optimal clustering of data on disk, for minimizing expected I/O, depends on the query workload. In practice, we often have a reasonable sense of the likelihood of different classes of queries, e.g., 40% of the queries concern calls made from some specific telephone number in some month. In this paper, we address the problem of finding an optimal clustering of records of a fact table on disk, given an expected workload in the form of a probability distribution over query classes. Attributes in a data warehouse fact table typically have hierarchies defined on them (by means of auxiliary dimension tables). The product of the dimensional hierarchy levels forms a lattice and leads to a natural notion of query classes. Optimal clustering in this context is a combinatorially explosive problem with a huge search space (doubly exponential in number of hierarchy levels). We identify an important subclass of clustering strategies called lattice paths, and present a dynamic programming algorithm for finding the optimal lattice path clustering, in time linear in the lattice size. We additionally propose a technique called snaking, which when applied to a lattice path, always reduces its cost. For a representative class of star schemas, we show that for every workload, there is a snaked lattice path which is globally optimal. Further, we prove that the clustering obtained by applying snaking to the optimal lattice path is never much worse than the globally optimal snaked lattice path clustering. We complement our analyses and validate the practical utility of our techniques with experiments using TPC-D benchmark data.
Note: References link to DBLP on the Web.
-
[1]
-
Elena Baralis
,
Stefano Paraboschi
,
Ernest Teniente
: Materialized Views Selection in a Multidimensional Database.
VLDB 1997
: 156-165
-
[2]
-
Prasad Deshpande
,
Karthikeyan Ramasamy
,
Amit Shukla
,
Jeffrey F. Naughton
: Caching Multidimensional Queries Using Chunks.
SIGMOD Conference 1998
: 259-270
-
[3]
-
Christos Faloutsos
: Multiattribute Hashing Using Gray Codes.
SIGMOD Conference 1986
: 227-238
-
[4]
-
Christos Faloutsos
: Gray Codes for Partial Match and Range Queries.
TSE 14(10)
: 1381-1393(1988)
-
[5]
-
...
-
[6]
-
Christos Faloutsos
,
Shari Roseman
: Fractals for Secondary Key Retrieval.
PODS 1989
: 247-252
-
[7]
-
Christos Faloutsos
,
H. V. Jagadish
,
Yannis Manolopoulos
: Analysis of the n-Dimensional Quadtree Decomposition for Arbitrary Hyperectangles.
TKDE 9(3)
: 373-383(1997)
-
[8]
-
Ashish Gupta
,
Venky Harinarayan
,
Dallan Quass
: Aggregate-Query Processing in Data Warehousing Environments.
VLDB 1995
: 358-369
-
[9]
-
Himanshu Gupta
,
Venky Harinarayan
,
Anand Rajaraman
,
Jeffrey D. Ullman
: Index Selection for OLAP.
ICDE 1997
: 208-219
-
[10]
-
Venky Harinarayan
,
Anand Rajaraman
,
Jeffrey D. Ullman
: Implementing Data Cubes Efficiently.
SIGMOD Conf. 1996
: 205-216
-
[11]
-
Joseph M. Hellerstein
,
Elias Koutsoupias
,
Christos H. Papadimitriou
: On the Analysis of Indexing Schemes.
PODS 1997
: 249-256
-
[12]
-
H. V. Jagadish
: Linear Clustering of Objects with Multiple Atributes.
SIGMOD Conference 1990
: 332-342
-
[13]
-
H. V. Jagadish
: Analysis of the Hilbert Curve for Representing Two-Dimensional Space.
IPL 62(1)
: 17-22(1997)
-
[14]
-
...
-
[15]
-
...
-
[16]
-
Patrick E. O'Neil
,
Dallan Quass
: Improved Query Performance with Variant Indexes.
SIGMOD Conference 1997
: 38-49
-
[17]
-
Jack A. Orenstein
,
T. H. Merrett
: A Class of Data Structures for Associative Searching.
PODS 1984
: 181-190
-
[18]
-
Vasilis Samoladas
,
Daniel P. Miranker
: A Lower Bound Theorem for Indexing Schemes and Its Application to Multidimensional Range Queries.
PODS 1998
: 44-51
-
[19]
-
Peter Scheuermann
,
Junho Shim
,
Radek Vingralek
: WATCHMAN : A Data Warehouse Intelligent Cache Manager.
VLDB 1996
: 51-62
-
[20]
-
Divesh Srivastava
,
Shaul Dar
,
H. V. Jagadish
,
Alon Y. Levy
: Answering Queries with Aggregation Using Views.
VLDB 1996
: 318-329
-
[21]
-
Sunita Sarawagi
,
Michael Stonebraker
: Efficient Organization of Large Multidimensional Arrays.
ICDE 1994
: 328-336
-
[22]
-
Yuzuru Tanaka
: Adaptive Segmentation Schemes for Large Relational Database Machines.
IWDM 1983
: 293-318
@inproceedings{DBLP:conf/sigmod/JagadishLS99,
author = {H. V. Jagadish and
Laks V. S. Lakshmanan and
Divesh Srivastava},
editor = {Alex Delis and
Christos Faloutsos and
Shahram Ghandeharizadeh},
title = {Snakes and Sandwiches: Optimal Clustering Strategies for a Data
Warehouse},
booktitle = {SIGMOD 1999, Proceedings ACM SIGMOD International Conference
on Management of Data, June 1-3, 1999, Philadephia, Pennsylvania,
USA},
publisher = {ACM Press},
year = {1999},
isbn = {1-58113-084-8},
pages = {37-48},
crossref = {DBLP:conf/sigmod/99},
bibsource = {DBLP, http://dblp.uni-trier.de} } },
Copyright(C) 2000 ACM
|
|
|
|
|
|
|