Digital Symposium Collection 2000  

 
 
 
 
 
 

 





















Multi-Dimensional Substring Selectivity Estimation

H. V. Jagadish, Olga Kapitskaia, Raymond T. Ng, and Divesh Srivastava

  View Paper (PDF)  

Return to Document Classification and Information Retrieval

Abstract
With the explosion of the Internet, LDAP directories and XML, there is an ever greater need to evaluate queries involving (sub)string matching. In many cases, matches need to be on multiple attributes/dimensions, with correlations between dimensions. Effective query optimization in this context requires good selectivity estimates.

In this paper, we use multi-dimensional count-suffix trees as the basic framework for substring selectivity estimation. Given the enormous size of these trees for large databases, we develop a space and time efficient probabilistic algorithm to construct multi-dimensional pruned count-suffix trees directly. We then present two techniques to obtain good estimates for a given multi-dimensional substring matching query, using a pruned count-suffix tree. The first one, called GNO (for Greedy Non-Overlap), generalizes the greedy parsing suggested by Krishnan et al. [9] for one-dimensional substring selectivity estimation. The second one, called MO (for Maximal Overlap), uses all maximal multi-dimensional substrings of the query for estimation; these multi-dimensional substrings help to capture the correlation that may exists between strings in the multiple dimensions. We demonstrate experimentally, using real data sets, the MO is substantially superior to GNO in the quality of the estimate.


References

Note: References link to DBLP on the Web.

[1]
Raffaele Giancarlo : A Generalization of the Suffix Tree to Square Matrices, with Applications. SIAM J. Comput. 24(3) : 520-562(1995)
[2]
Raffaele Giancarlo , Roberto Grossi : On the Construction of Classes of Suffix Trees for Square Matrices: Algorithms and Applications. Information and Computation 130(2) : 151-182(1996)
[3]
Phillip B. Gibbons , Yossi Matias : New Sampling-Based Summary Statistics for Improving Approximate Query Answers. SIGMOD Conference 1998 : 331-342
[4]
...
[5]
Yannis E. Ioannidis : Universality of Serial Histograms. VLDB 1993 : 256-267
[6]
Yannis E. Ioannidis , Viswanath Poosala : Balancing Histogram Optimality and Practicality for Query Result Size Estimation. SIGMOD Conference 1995 : 233-244
[7]
H. V. Jagadish , Nick Koudas , S. Muthukrishnan , Viswanath Poosala , Kenneth C. Sevcik , Torsten Suel : Optimal Histograms with Quality Guarantees. VLDB 1998 : 275-286
[8]
H. V. Jagadish , Raymond T. Ng , Divesh Srivastava : Substring Selectivity Estimation. PODS 1999 : 249-260
[9]
P. Krishnan , Jeffrey Scott Vitter , Balakrishna R. Iyer : Estimating Alphanumeric Selectivity in the Presence of Wildcards. SIGMOD Conf. 1996 : 282-293
[10]
Richard J. Lipton , Jeffrey F. Naughton : Query Size Estimation by Adaptive Sampling. PODS 1990 : 40-46
[11]
Edward M. McCreight : A Space-Economical Suffix Tree Construction Algorithm. JACM 23(2) : 262-272(1976)
[12]
M. Muralikrishna , David J. DeWitt : Equi-Depth Histograms For Estimating Selectivity Factors For Multi-Dimensional Queries. SIGMOD Conference 1988 : 28-36
[13]
Viswanath Poosala , Yannis E. Ioannidis , Peter J. Haas , Eugene J. Shekita : Improved Histograms for Selectivity Estimation of Range Predicates. SIGMOD Conf. 1996 : 294-305
[14]
Viswanath Poosala , Yannis E. Ioannidis : Selectivity Estimation Without the Attribute Value Independence Assumption. VLDB 1997 : 486-495
[15]
Patricia G. Selinger , Morton M. Astrahan , Donald D. Chamberlin , Raymond A. Lorie , Thomas G. Price : Access Path Selection in a Relational Database Management System. SIGMOD Conference 1979 : 23-34
[16]
...
[17]
Min Wang , Jeffrey Scott Vitter , Balakrishna R. Iyer : Selectivity Estimation in the Presence of Alphanumeric Correlations. ICDE 1997 : 169-180
[18]
Peter Weiner : Linear Pattern Matching Algorithms. FOCS 1973 : 1-11

BIBTEX

@inproceedings{DBLP:conf/vldb/JagadishKNS99,
  author    = {H. V. Jagadish and
                Olga Kapitskaia and
                Raymond T. Ng and
                Divesh Srivastava},
   editor    = {Malcolm P. Atkinson and
                Maria E. Orlowska and
                Patrick Valduriez and
                Stanley B. Zdonik and
                Michael L. Brodie},
   title     = {Multi-Dimensional Substring Selectivity Estimation},
   booktitle = {VLDB'99, Proceedings of 25th International Conference on Very
                Large Data Bases, September 7-10, 1999, Edinburgh, Scotland,
                UK},
   publisher = {Morgan Kaufmann},
   year      = {1999},
   isbn      = {1-55860-615-5},
   pages     = {387-398},
   crossref  = {DBLP:conf/vldb/99},
   bibsource = {DBLP, http://dblp.uni-trier.de} } },


























Copyright(C) 2000 ACM