Extracting Schema from Semistructured Data
Svetlozar Nestorov (Stanford University)
Serge Abiteboul (INRIA)
Rajeev Motwani (Stanford University)
Semistructured data is characterized by the lack of any fixed and
rigid schema, although typically the data has some implicit
structure. While the lack of fixed schema makes extracting
semistructured data fairly easy and an attractive goal, presenting and
querying such data is greatly impaired. Thus, a critical problem is
the discovery of the structure implicit in semistructured data and,
subsequently, the recasting of the raw data in terms of this
structure. In this paper, we consider a very general form of
semistructured data based on labeled, directed graphs. We show that
such data can be typed using the greatest fixpoint semantics of
monadic datalog programs. We present an algorithm for approximate
typing of semistructured data. We establish that the general problem
of finding an optimal such typing is NP-hard, but present some
heuristics and techniques based on clustering that allow efficient and
near-optimal treatment of the problem. We also present some
preliminary experimental results.
The full version of this paper appears here.