Tutorial: Text Dominated Databases, Theory Practice and Experience.

Gaston H. Gonnet: Tutorial: Text Dominated Databases, Theory Practice and Experience. PODS 1994: 301-302
@inproceedings{DBLP:conf/pods/Gonnet94,
  author    = {Gaston H. Gonnet},
  title     = {Tutorial: Text Dominated Databases, Theory Practice and Experience},
  booktitle = {Proceedings of the Thirteenth ACM SIGACT-SIGMOD-SIGART Symposium
               on Principles of Database Systems, May 24-26, 1994, Minneapolis,
               Minnesota},
  publisher = {ACM Press},
  year      = {1994},
  isbn      = {0-89791-642-5},
  pages     = {301-302},
  ee        = {http://doi.acm.org/10.1145/182591.182655, db/conf/pods/pods94-301.html},
  crossref  = {DBLP:conf/pods/94},
  bibsource = {DBLP, http://dblp.uni-trier.de}
}

Abstract

The application of database technology is seen as essential to the operation of a conventional business enterprise. However, there is a universe of business information, namely text, which is currently stored, accessed, and manipulated in an ad hoc fashion with none of the consistency and discipline of the database approach. Environments supporting both text and relational data are implemented through application programs within which separate repositories are accessed explicitly. Not only is this inconvenient for application programmers, but the disjointness of the data impedes efforts to ensure data consistency, query optimization and database transparency.

Since 1984, Waterloo's effort in text dominated databases has been focused primarily on the development of software for sophisticated searching and editing of massive and complex textual objects, for example the Oxford English Dictionary. Many examples and experiences are taken directly from this project. In our experience, the OED is an excellent example of a text database (570Mb of text) with significant and precise structure.

We will first explore the differences between text databases and traditional databases. These differences have to do with structure, format, and typical usage. For example, text databases can be highly structured, but not in the same way as an RDB. Naturally, text is organized in a hierarchical/nested structure with multiple levels (e.g. book - chapter - paragraph - sentence - citation - author - lastname). Structured text is better described by a grammar; a context-free grammar is usually enough. Usage and querying are typically different too. Text databases are normally repositories for information, so they are not modified as often as standard databases (cf. a newspaper database vs. a banking database). They tend to grow with new material, and this material stays in the database until it is removed completely, if ever.
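
As an illustration of the point that nested text can be described by a grammar, the following is a minimal sketch and not taken from the tutorial. It assumes a deliberately simplified grammar in which each production has the form "element -> allowed child elements", and it represents a document as nested (name, children) tuples. The names GRAMMAR, conforms and leaf_paths are hypothetical.

```python
# Hypothetical grammar: each element name maps to the element names it may contain.
# An empty list means the element holds raw text only.
GRAMMAR = {
    "book": ["chapter"],
    "chapter": ["paragraph"],
    "paragraph": ["sentence"],
    "sentence": [],
}

def conforms(node, grammar=GRAMMAR):
    """Return True if a (name, children) tree obeys the grammar."""
    name, children = node
    if name not in grammar:
        return False
    for child in children:
        if isinstance(child, str):
            # Raw text is only allowed under elements with no element children.
            if grammar[name]:
                return False
        elif child[0] not in grammar[name] or not conforms(child, grammar):
            return False
    return True

def leaf_paths(node, prefix=()):
    """Yield (path, text) pairs, where path records the nesting of each text leaf."""
    name, children = node
    here = prefix + (name,)
    for child in children:
        if isinstance(child, str):
            yield "/".join(here), child
        else:
            yield from leaf_paths(child, here)

doc = ("book", [("chapter", [("paragraph", [("sentence", ["Text is nested."])])])])
malformed = ("book", [("sentence", ["A sentence cannot sit directly in a book."])])
```

Here conforms(doc) holds while conforms(malformed) does not, and leaf_paths(doc) recovers the nesting path of each piece of text, which is the kind of structural context a flat relational row does not carry by itself.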

Next we will describe the primitives used for text searching and compare them to the primitives used in RDB. While there is some correspondence, the sets of primitives are different. It is not obvious how to map the standard text searching primitives onto standard DB primitives or vice versa. We will show some examples of such mappings and/or extensions which try to solve this problem (e.g. SFQL [ATA91]).
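
To make the mapping problem concrete, here is a hedged sketch of one common text primitive, proximity search ("word a within k words of word b"), written once directly over a token list and once as a relational self-join over a hypothetical table occ(word, pos). This is an illustration only: it is not SFQL and not a mapping proposed in the tutorial, and the function names and the table layout are assumptions.

```python
import sqlite3

def tokenize(text):
    """Split on whitespace, strip simple punctuation, and lowercase."""
    return [w.strip(".,;:!?").lower() for w in text.split()]

def near_text(words, a, b, k):
    """Text primitive: pairs (i, j) with a at position i, b at j, |i - j| <= k."""
    pos_a = [i for i, w in enumerate(words) if w == a]
    pos_b = [j for j, w in enumerate(words) if w == b]
    return sorted((i, j) for i in pos_a for j in pos_b if abs(i - j) <= k)

def near_sql(words, a, b, k):
    """The same query as a self-join over a (word, pos) occurrence table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE occ (word TEXT, pos INTEGER)")
    conn.executemany(
        "INSERT INTO occ VALUES (?, ?)",
        [(w, i) for i, w in enumerate(words)],
    )
    rows = conn.execute(
        "SELECT x.pos, y.pos FROM occ AS x JOIN occ AS y "
        "ON abs(x.pos - y.pos) <= ? "
        "WHERE x.word = ? AND y.word = ? "
        "ORDER BY x.pos, y.pos",
        (k, a, b),
    ).fetchall()
    conn.close()
    return rows

words = tokenize("The dictionary entry cites an author; the author wrote a dictionary.")
```

Both functions return the same position pairs, but the relational version needs one row per word occurrence and a join on an inequality, which suggests why the correspondence between the two sets of primitives is not a natural one.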

The impact of SGML (Standard Generalized Markup Language) [Goldfarb90] on the structuring of text databases is very noticeable. Because it is an ISO standard, it has been adopted by several communities of users, and its acceptance is growing much faster than that of standard databases. We have investigated the role of SGML in data modelling, with particular attention to the role of embedded markup. We will show how we can achieve data modelling through text and data modelling of text.

Finally, another important type of text database has appeared on the scene: databases that are computer-readable as well as human-readable. For the most part, these databases are self-descriptive and easily extensible. Many scientific communities have adopted this strategy with formidable success in producing, sharing, and using large amounts of information. A case worth mentioning is that of the molecular biology databases. We will examine some of the reasons why this type of DB is becoming increasingly popular.
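
The sketch below illustrates what "self-descriptive and easily extensible" can mean in practice. It assumes a hypothetical line-oriented layout in which every line starts with a short tag naming its field and a line containing only "//" ends a record. The layout and the function name parse_records are invented for illustration and do not reproduce the format of any particular database.

```python
def parse_records(text):
    """Parse tagged lines into a list of {tag: [values]} dictionaries.

    Assumed layout (hypothetical): "TAG  value" per line, "//" ends a record.
    """
    records, current = [], {}
    for line in text.splitlines():
        line = line.rstrip()
        if not line:
            continue
        if line == "//":
            records.append(current)
            current = {}
            continue
        tag, _, value = line.partition(" ")
        # Repeated tags accumulate, so multi-line fields need no special syntax.
        current.setdefault(tag, []).append(value.strip())
    if current:
        records.append(current)
    return records

sample = (
    "ID  X1\n"
    "DE  first description line\n"
    "DE  second description line\n"
    "XX  a field this reader has never seen\n"
    "//\n"
    "ID  X2\n"
    "//\n"
)
records = parse_records(sample)
```

Because each line names its own field, the reader keeps the unfamiliar XX field without any schema change, and a person can still read the same file directly.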

Further reading. A comprehensive description of basic text searching algorithms can be found in [Handbook of Algorithms and Data Structures in Pascal and C, Second edition, G. Gonnet and R. Baeza-Yates, Addison-Wesley, 1991, Chapter 7]. The book [Information Retrieval: Algorithms and Data Structures, edited by Frakes, W. and Baeza-Yates, R., Prentice-Hall, 1992] contains a collection of contributed chapters, some of which are particularly relevant to this topic. The yearly conference "Combinatorial Pattern Matching" (last proceedings published by Springer Verlag, CPM-93, Padova, Italy) contains a variety of papers covering the more theoretical fringe of this topic. It is uncertain whether the SFQL proposal will be successful; its main reference continues to be [ATA 89-9C SFQL Committee. Advanced Retrieval Standard-SFQL: Structured Full-text Query Language, October 1991, ATA Specification 100, Air Transport Association]. Several books have been written on SGML; since it is now a standard (and rather readable compared to other standards), the best citation is [Information Processing, Text and Office Systems, Standard Generalized Markup Language, SGML and SGML Support Facilities and SGML Document Interchange Format (SDIF), ISO 8879, 9069].

Copyright © 1994 by the ACM, Inc., used by permission. Permission to make digital or hard copies is granted provided that copies are not made or distributed for profit or direct commercial advantage, and that copies show this notice on the first page or initial screen of a display along with the full citation.


Printed Edition

Proceedings of the Thirteenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, May 24-26, 1994, Minneapolis, Minnesota. ACM Press 1994, ISBN 0-89791-642-5

Referenced by

  1. Paolo Atzeni, Giansalvatore Mecca: Cut & Paste. PODS 1997: 144-153
  2. Anthony J. Bonner, Giansalvatore Mecca: Querying String Databases with Transducers. DBPL 1997: 118-135
  3. Masatoshi Yoshikawa, Osamu Ichikawa, Shunsuke Uemura: Amalgamating SGML Documents and Databases. EDBT 1996: 259-274
  4. Giansalvatore Mecca, Anthony J. Bonner: Sequences, Datalog and Transducers. PODS 1995: 23-35
  5. Giansalvatore Mecca, Anthony J. Bonner: Finite Query Languages for Sequence Databases. DBPL 1995: 12