![]() |
![]() |
Selectivity estimation (in one incarnation: the problem of quickly determining the approximate number of records in a table that satisfies a given predicate) plays an important role in cost-based query optimization.
When string data (white and yellow page directory entries, XML documents) are stored in DBMSs, as is becoming popular, queries typically involve the use of wildcard (attribute like "*substring_value*") predicates. For effective query optimization, one needs to accurately estimate the selectivity of wildcard predicates over string domains. However, almost all of the traditional work on selectivity estimation has focused on point (attribute = value) and range (attribute in [min_value, max_value]) predicates over ordered numeric domains. It is not obvious how these traditional techniques can be used to deal with wildcard predicates in the string domain.
The interesting observation made by Krishnan et al. in this paper is that a variant of the end-biased histogram, obtained by summarizing and pruning a suffix tree (a data structure for efficiently answering wildcard queries), can be used to great advantage for estimating the selectivity of wildcard predicates. The key insights are that:
One may disagree (as I have done in subsequent papers on this subject) with the greedy parsing technique and the statistical assumptions used in the paper, but this paper deserves to be read for the key insights it provides on issues of string manipulation in databases.
Copyright © 2000 by the author(s). Review published with permission.