ACM SIGMOD Anthology TODS dblp.uni-trier.de

Duplicate Record Elimination in Large Data Files.

Dina Bitton, David J. DeWitt: Duplicate Record Elimination in Large Data Files. ACM Trans. Database Syst. 8(2): 255-265(1983)
@article{DBLP:journals/tods/BittonD83,
  author    = {Dina Bitton and
               David J. DeWitt},
  title     = {Duplicate Record Elimination in Large Data Files},
  journal   = {ACM Trans. Database Syst.},
  volume    = {8},
  number    = {2},
  year      = {1983},
  pages     = {255-265},
  ee        = {http://doi.acm.org/10.1145/319983.319987, db/journals/tods/BittonD83.html},
  bibsource = {DBLP, http://dblp.uni-trier.de}
}
BibTeX

Abstract

The issue of duplicate elimination for large data files in which many occurrences of the same record may appear is addressed. A comprehensive cost analysis of the duplicate elimination operation is presented. This analysis is based on a combinatorial model developed for estimating the size of intermediate runs produced by a modified merge-sort procedure. The performance of this modified merge-sort procedure is demonstrated to be significantly superior to the standard duplicate elimination technique of sorting followed by a sequential pass to locate duplicate records. The results can also be used to provide critical input to a query optimizer in a relational database system.

Copyright © 1983 by the ACM, Inc., used by permission. Permission to make digital or hard copies is granted provided that copies are not made or distributed for profit or direct commercial advantage, and that copies show this notice on the first page or initial screen of a display along with the full citation.


Joint ACM SIGMOD / IEEE Computer Society Anthology

CDROM Version: Load the CDROM "Volume 3 Issue 1, TODS 1976-1990" and ... DVD Version: Load ACM SIGMOD Anthology DVD 2" and ... BibTeX

References

[1]
Morton M. Astrahan, Mike W. Blasgen, Donald D. Chamberlin, Kapali P. Eswaran, Jim Gray, Patricia P. Griffiths, W. Frank King III, Raymond A. Lorie, Paul R. McJones, James W. Mehl, Gianfranco R. Putzolu, Irving L. Traiger, Bradford W. Wade, Vera Watson: System R: Relational Approach to Database Management. ACM Trans. Database Syst. 1(2): 97-137(1976) BibTeX
[2]
Edward Babb: Implementing a Relational Database by Means of Specialized Hardware. ACM Trans. Database Syst. 4(1): 1-29(1979) BibTeX
[3]
Donald E. Knuth: The Art of Computer Programming, Volume III: Sorting and Searching. Addison-Wesley 1973, ISBN 0-201-03803-X
BibTeX
[4]
J. Ian Munro, Philip M. Spira: Sorting and Searching in Multisets. SIAM J. Comput. 5(1): 1-8(1976) BibTeX

Referenced by

  1. Min Fang, Narayanan Shivakumar, Hector Garcia-Molina, Rajeev Motwani, Jeffrey D. Ullman: Computing Iceberg Queries Efficiently. VLDB 1998: 299-310
  2. Michael H. Böhlen, Richard T. Snodgrass, Michael D. Soo: Coalescing in Temporal Databases. VLDB 1996: 180-191
  3. David E. Simmen, Eugene J. Shekita, Timothy Malkemus: Fundamental Techniques for Order Optimization. SIGMOD Conference 1996: 57-67
  4. Goetz Graefe, Richard L. Cole: Fast Algorithms for Universal Quantification in Large Databases. ACM Trans. Database Syst. 20(2): 187-236(1995)
  5. Paolo Ciaccia, Dario Maio: Domains and Active Domains: What This Distinction Implies for the Estimation of Projection Sizes in Relational Databases. IEEE Trans. Knowl. Data Eng. 7(4): 641-655(1995)
  6. Mauricio A. Hernández, Salvatore J. Stolfo: The Merge/Purge Problem for Large Databases. SIGMOD Conference 1995: 127-138
  7. Goetz Graefe, Ann Linville, Leonard D. Shapiro: Sort versus Hash Revisited. IEEE Trans. Knowl. Data Eng. 6(6): 934-944(1994)
  8. Goetz Graefe: Volcano - An Extensible and Parallel Query Evaluation System. IEEE Trans. Knowl. Data Eng. 6(1): 120-135(1994)
  9. Goetz Graefe: Query Evaluation Techniques for Large Databases. ACM Comput. Surv. 25(2): 73-170(1993)
  10. Jan Van den Bussche, Dirk Van Gucht: A Hierarchy of Faithful Set Creation in Pure OODB's. ICDT 1992: 326-340
  11. Shu-Shang Wei, Yao-Nan Lien, Dik Lun Lee, Ten-Hwang Lai: Hot-Spot Based Compostion Algorithm. ICDE 1992: 48-55
  12. Mauro Negri, Giuseppe Pelagatti: Distributive Join: A New Algorithm for Joining Relations. ACM Trans. Database Syst. 16(4): 655-669(1991)
  13. Mahdi Abdelguerfi, Arun K. Sood: Computational Complexity of Sorting and Joining Relations with Duplicates. IEEE Trans. Knowl. Data Eng. 3(4): 496-503(1991)
  14. Kyu-Young Whang, Brad T. Vander Zanden, Howard M. Taylor: A Linear-Time Probabilistic Counting Algorithm for Database Applications. ACM Trans. Database Syst. 15(2): 208-229(1990)
  15. Laura M. Haas, Walter Chang, Guy M. Lohman, John McPherson, Paul F. Wilms, George Lapis, Bruce G. Lindsay, Hamid Pirahesh, Michael J. Carey, Eugene J. Shekita: Starburst Mid-Flight: As the Dust Clears. IEEE Trans. Knowl. Data Eng. 2(1): 143-160(1990)
  16. Lothar F. Mackert, Guy M. Lohman: Index Scans Using a Finite LRU Buffer: A Validated I/O Model. ACM Trans. Database Syst. 14(3): 401-424(1989)
  17. Rakesh Agrawal, Shaul Dar, H. V. Jagadish: Composition of Database Relations. ICDE 1989: 102-108
  18. Chaitanya K. Baru, Ophir Frieder: Implementing Relational Database Operations in a Cube-Connected Multicomputer System. ICDE 1987: 36-43
  19. Johann Christoph Freytag, Nathan Goodman: Translating Aggregate Queries into Iterative Programs. VLDB 1986: 138-146
  20. Kjell Bratbergsengen: Hashing Methods and Relational Algebra Operations. VLDB 1984: 323-333
  21. Dina Bitton, David J. DeWitt, Carolyn Turbyfill: Benchmarking Database Systems A Systematic Approach. VLDB 1983: 8-19
BibTeX
ACM SIGMOD Anthology - DBLP: [Home | Search: Author, Title | Conferences | Journals]
TODS, ACM SIGMOD Anthology: Copyright © by ACM (info@acm.org), Corrections: anthology@acm.org
DBLP: Copyright © by Michael Ley (ley@uni-trier.de), last change: Tue Jun 24 18:38:51 2008