NoDoSE - A Tool for Semi-Automatically Extracting Semi-Structured Data from Text Documents.

Brad Adelberg: NoDoSE - A Tool for Semi-Automatically Extracting Semi-Structured Data from Text Documents. SIGMOD Conference 1998: 283-294
@inproceedings{DBLP:conf/sigmod/Adelberg98,
  author    = {Brad Adelberg},
  editor    = {Laura M. Haas and
               Ashutosh Tiwary},
  title     = {NoDoSE - A Tool for Semi-Automatically Extracting Semi-Structured
               Data from Text Documents},
  booktitle = {SIGMOD 1998, Proceedings ACM SIGMOD International Conference
               on Management of Data, June 2-4, 1998, Seattle, Washington, USA},
  publisher = {ACM Press},
  year      = {1998},
  isbn      = {0-89791-995-5},
  pages     = {283-294},
  ee        = {http://doi.acm.org/10.1145/276304.276330, db/conf/sigmod/Adelberg98.html},
  crossref  = {DBLP:conf/sigmod/98},
  bibsource = {DBLP, http://dblp.uni-trier.de}
}

Abstract

Often interesting structured or semistructured data is not in database systems but in HTML pages, text files, or on paper. The data in these formats is not usable by standard query processing engines and hence users need a way of extracting data from these sources into a DBMS or of writing wrappers around the sources. This paper describes NoDoSE, the Northwestern Document Structure Extractor, which is an interactive tool for semi-automatically determining the structure of such documents and then extracting their data. Using a GUI, the user hierarchically decomposes the file, outlining its interesting regions and then describing their semantics. This task is expedited by a mining component that attempts to infer the grammar of the file from the information the user has input so far. Once the format of a document has been determined, its data can be extracted into a number of useful forms. This paper describes both the NoDoSE architecture, which can be used as a test bed for structure mining algorithms in general, and the mining algorithms that have been developed by the author. The prototype, which is written in Java, is described and experiences parsing a variety of documents are reported.
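
The decomposition step described in the abstract can be pictured as a tree of nested, labeled regions over the file's text. The following is a minimal Python sketch of that idea only, not NoDoSE's implementation (the prototype is written in Java, and its mining algorithms are not reproduced here). The names Region, extract, and outline_example are hypothetical, and outline_example hard-codes the selections a user would make in the GUI for a toy file with one "author, year" record per line.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Region:
    """A labeled span of the document, as [start, end) character offsets."""
    label: str
    start: int
    end: int
    children: List["Region"] = field(default_factory=list)

    def add(self, child: "Region") -> "Region":
        # The decomposition is hierarchical, so a child must lie inside its parent.
        if not (self.start <= child.start <= child.end <= self.end):
            raise ValueError(f"{child.label} is not nested inside {self.label}")
        self.children.append(child)
        return child


def extract(region: Region, text: str):
    """Return a leaf's text, or a list of (label, value) pairs for an inner node."""
    if not region.children:
        return text[region.start:region.end]
    ordered = sorted(region.children, key=lambda r: r.start)
    return [(child.label, extract(child, text)) for child in ordered]


def outline_example(text: str) -> Region:
    """Stand-in for the user's GUI selections (assumption: 'author, year' lines)."""
    doc = Region("document", 0, len(text))
    offset = 0
    for line in text.splitlines(keepends=True):
        body = line.rstrip("\n")
        record = doc.add(Region("record", offset, offset + len(body)))
        comma = body.index(", ")
        record.add(Region("author", offset, offset + comma))
        record.add(Region("year", offset + comma + 2, offset + len(body)))
        offset += len(line)
    return doc


if __name__ == "__main__":
    sample = "Adelberg, 1998\nAbiteboul, 1997\n"
    print(extract(outline_example(sample), sample))
```

Storing offsets rather than copied strings keeps every region tied to its position in the original file, which is what would let a grammar inferred from a few outlined records be checked against the rest of the text.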

Copyright © 1998 by the ACM, Inc., used by permission. Permission to make digital or hard copies is granted provided that copies are not made or distributed for profit or direct commercial advantage, and that copies show this notice on the first page or initial screen of a display along with the full citation.


Printed Edition

Laura M. Haas, Ashutosh Tiwary (Eds.): SIGMOD 1998, Proceedings ACM SIGMOD International Conference on Management of Data, June 2-4, 1998, Seattle, Washington, USA. ACM Press 1998, ISBN 0-89791-995-5, SIGMOD Record 27(2), June 1998

Referenced by

  1. Hasan Davulcu, Guizhen Yang, Michael Kifer, I. V. Ramakrishnan: Computational Aspects of Resilient Data Extraction from Semistructured Sources. PODS 2000: 136-144
  2. David Mattox, Leonard J. Seligman, Kenneth Smith: Rapper: A Wrapper Generator with Linguistic Knowledge. Workshop on Web Information and Data Management 1999: 6-11
  3. Arnaud Sahuguet, Fabien Azavant: Building Light-Weight Wrappers for Legacy Web Data-Sources Using W4F. VLDB 1999: 738-741
  4. David W. Embley, Y. S. Jiang, Yiu-Kai Ng: Record-Boundary Discovery in Web Documents. SIGMOD Conference 1999: 467-478
  5. Brad Adelberg, Matthew Denny: Nodose Version 2.0. SIGMOD Conference 1999: 559-561
  6. Stéphane Grumbach, Giansalvatore Mecca: In Search of the Lost Schema. ICDT 1999: 314-331
  7. Kerstin Schwarz, Ingo Schmitt, Can Türker, Michael Höding, Eyk Hildebrandt, Sören Balko, Stefan Conrad, Gunter Saake: Design Support for Database Federations. ER 1999: 445-459
  8. Wolfgang May, Rainer Himmeröder, Georg Lausen, Bertram Ludäscher: A Unified Framework for Wrapping, Mediating and Restructuring Information from the Web. ER (Workshops) 1999: 307-320
  9. Michael Christoffel, Sebastian Pulkowski, Bethina Schmitt, Peter C. Lockemann: Electronic Market: The Roadmap for University Libraries and Members to Survive in the Information Jungle. SIGMOD Record 27(4): 68-73(1998)
  10. Tao Guan, Miao Liu, Lawrence V. Saxton: Structure-Based Queries over the World Wide Web. ER 1998: 107-120
  11. David W. Embley, Douglas M. Campbell, Y. S. Jiang, Stephen W. Liddle, Yiu-Kai Ng, Dallan Quass, Randy D. Smith: A Conceptual-Modeling Approach to Extracting Data from the Web. ER 1998: 78-91