Review - Estimating Alphanumeric Selectivity in the Presence of Wildcards.

Divesh Srivastava: Review - Estimating Alphanumeric Selectivity in the Presence of Wildcards. ACM SIGMOD Digital Review 2: (2000) BibTeX

Review

Selectivity estimation (in one incarnation: the problem of quickly determining the approximate number of records in a table that satisfies a given predicate) plays an important role in cost-based query optimization.

When string data (white and yellow page directory entries, XML documents) are stored in DBMSs, as is becoming popular, queries typically involve the use of wildcard (attribute like "*substring_value*") predicates. For effective query optimization, one needs to accurately estimate the selectivity of wildcard predicates over string domains. However, almost all of the traditional work on selectivity estimation has focused on point (attribute = value) and range (attribute in [min_value, max_value]) predicates over ordered numeric domains. It is not obvious how these traditional techniques can be used to deal with wildcard predicates in the string domain.

The interesting observation made by Krishnan et al. in this paper is that a variant of the end-biased histogram, obtained by summarizing and pruning a suffix tree (a data structure for efficiently answering wildcard queries), can be used to great advantage for estimating the selectivity of wildcard predicates. The key insights are that:

the pruned suffix tree maintains counts of commonly occurring substrings in the database,
the (possibly long) wildcard query string can always be parsed into a set of smaller substrings, each of whose selectivities is present in the pruned suffix tree, and
the selectivity of the query string can be estimated by combining the (known) selectivities of these smaller substrings, using suitable statistical assumptions.

One may disagree (as I have done in subsequent papers on this subject) with the greedy parsing technique and the statistical assumptions used in the paper, but this paper deserves to be read for the key insights it provides on issues of string manipulation in databases.

References

[1]: P. Krishnan, Jeffrey Scott Vitter, Balakrishna R. Iyer: Estimating Alphanumeric Selectivity in the Presence of Wildcards. SIGMOD Conference 1996: 282-293 BibTeX

BibTeX

Digital Review - DBLP: [Home | Search: Author, Title | Conferences | Journals]