Storing Semistructured Data with STORED

Alin Deutsch       Mary Fernandez       Dan Suciu*
University of Pennsylvania       AT&T Labs - Research       AT&T Labs - Research
adeutsch@gradient.cis.upenn.edu       mff@research.att.com       suciu@research.att.com

Abstract

Systems for managing and querying semistructured data sources often store the data either in proprietary object repositories or in a tagged-text format. No existing system uses a commercial relational database management system (RDBMS) to store or manage semistructured data sources.

In this paper, we describe a technique for using relational database management systems to store and manage semistructured data. The core of our approach is a new and powerful mapping between the semistructured data model and the relational data model. This mapping exploits any patterns that exist in a given semistructured-data source to efficiently store the data in a relational format. The mapping is always lossless: parts of the semistructured data that do not fit into the relational schema are stored in an "overflow" store.
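The lossless split between a relational store and an overflow store can be illustrated with a minimal sketch. It is not the STORED language or the system's implementation: the column names, the function names `split_object` and `reconstruct`, and the representation of objects as Python dictionaries are all assumptions made for illustration.

```python
# Hypothetical relational schema: labels assumed to occur in most objects.
RELATIONAL_COLUMNS = ("name", "phone")


def split_object(oid, obj):
    """Split one semistructured object into a relational row and overflow triples.

    Assumes `obj` is a dict from labels to values, with no None values.
    """
    row = {"oid": oid}
    for col in RELATIONAL_COLUMNS:
        value = obj.get(col)
        # Only a single atomic value fits a column; anything else stays NULL.
        row[col] = value if isinstance(value, (str, int, float)) else None

    overflow = []
    for label, value in obj.items():
        stored_in_row = label in RELATIONAL_COLUMNS and row[label] is not None
        if not stored_in_row:
            # Whatever the row could not hold goes to the overflow store.
            overflow.append((oid, label, value))
    return row, overflow


def reconstruct(row, overflow):
    """Rebuild the original object from its row and overflow triples."""
    obj = {c: v for c, v in row.items() if c != "oid" and v is not None}
    for _, label, value in overflow:
        obj[label] = value
    return obj


person = {"name": "Alice", "phone": "555-0100", "fax": "555-0101"}
row, overflow = split_object(1, person)
# Losslessness: the row plus the overflow triples recover the original object.
assert reconstruct(row, overflow) == person
```

In this sketch the regular part of each object lands in fixed columns, while irregular labels such as `fax` are kept as (object id, label, value) triples, so no data is discarded.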

We present a new query language, called STORED, to specify this mapping. STORED has separate components for the relational and overflow mappings. For some applications, the mapping is discovered automatically by the system, based on a given data instance; for other applications, the mapping is defined by the application programmer. We have extended data mining techniques to generate storage mappings that are efficient for a given semistructured data instance and, possibly, a given query mix. Once the mapping is specified, the system rewrites queries and updates on the semistructured data model to the relational model (and the overflow). We describe an algorithm for rewriting queries and updates.
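As a rough illustration of deriving a mapping from a data instance, the sketch below picks relational columns by label frequency. This is a simple support-threshold heuristic assumed for illustration, not the data mining technique the paper develops; the function name `choose_columns` and the `min_support` parameter are hypothetical.

```python
from collections import Counter


def choose_columns(objects, min_support=0.5):
    """Return the labels that occur in at least `min_support` of the objects.

    Assumes each object is a dict from labels to values. Labels below the
    threshold would be left to the overflow store.
    """
    # Iterating a dict yields its keys, so each label counts once per object.
    counts = Counter(label for obj in objects for label in obj)
    threshold = min_support * len(objects)
    return sorted(label for label, n in counts.items() if n >= threshold)


people = [
    {"name": "Alice", "phone": "555-0100"},
    {"name": "Bob"},
    {"name": "Carol", "fax": "555-0101"},
    {"name": "Dave", "phone": "555-0102"},
]
columns = choose_columns(people, min_support=0.5)
```

Raising `min_support` yields a narrower table with fewer NULLs but sends more data to the overflow store; lowering it does the opposite.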

We are particularly interested in applying these techniques to XML data, which can be viewed as an instance of semistructured data. We show how the information in the document-type descriptor (DTD) of an XML document can be used to improve our techniques.