An Adaptive Query Execution Engine for Data Integration

Zachary Ives*       Daniela Florescu       Marc Friedman
University of Washington       INRIA       University of Washington
zives@cs.washington.edu       Daniela.Florescu@inria.fr       friedman@cs.washington.edu

Alon Levy       Daniel S. Weld
University of Washington       University of Washington
alon@cs.washington.edu       weld@cs.washington.edu

Abstract

Traditional database query execution techniques are inadequate for data integration purposes for three reasons: the absence of quality statistics on the data (which resides in remote sources), the unpredictable and bursty data arrival rates from these sources, and the frequent overlap and redundancy of the data sources. This paper presents the fully implemented Tukwila data integration system and describes three novel, adaptive-execution features that combine to yield improved performance. (1) The adaptive double-pipelined hash join reduces latency and transforms incrementally into a hybrid-hash join in the face of insufficient memory. (2) The dynamic collector operator computes robust and efficient unions over possibly overlapping data sources. (3) Interleaved planning and execution with partial optimization allows Tukwila to recover quickly from poor optimizer decisions based on inaccurate estimates. We demonstrate that the Tukwila architecture extends previous innovations in adaptive execution (such as query scrambling, mid-execution reoptimization, and choose nodes), and we present experimental evidence that our techniques result in significant speedup.
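
To make the first feature concrete, the following is a minimal in-memory sketch of the general double-pipelined (symmetric) hash join idea: each input builds its own hash table and probes the other's, so results are emitted as soon as both matching tuples have arrived. It is an illustration under stated assumptions, not Tukwila's implementation. The function name `double_pipelined_join`, its parameters, and the round-robin arrival order are hypothetical, and the sketch omits the adaptive overflow handling that converts the operator into a hybrid-hash join when memory runs out.

```python
from collections import defaultdict
from itertools import zip_longest


def double_pipelined_join(left, right, left_key, right_key):
    """Yield (left_tuple, right_tuple) pairs with equal join keys.

    Hypothetical sketch: both inputs are hashed as they arrive, and every
    new tuple immediately probes the other side's table, so output starts
    before either input has been fully read.
    """
    left_table = defaultdict(list)   # join key -> left tuples seen so far
    right_table = defaultdict(list)  # join key -> right tuples seen so far
    missing = object()               # sentinel for an exhausted input

    # Assumption: tuples arrive alternately from the two sources. A real
    # engine would instead consume whichever source has data available.
    for l, r in zip_longest(left, right, fillvalue=missing):
        if l is not missing:
            k = left_key(l)
            left_table[k].append(l)
            # Probe the right table for tuples that arrived earlier.
            for match in right_table.get(k, ()):
                yield (l, match)
        if r is not missing:
            k = right_key(r)
            right_table[k].append(r)
            # Probe the left table, which already holds this round's l.
            for match in left_table.get(k, ()):
                yield (match, r)


# Usage with made-up data: join two small relations on their first field.
orders = [(1, "a"), (2, "b"), (1, "c")]
customers = [(2, "X"), (1, "Y")]
results = list(double_pipelined_join(
    orders, customers, lambda t: t[0], lambda t: t[0]))
```

Because each tuple is inserted into its own table before the other side probes it, every matching pair is produced exactly once regardless of arrival order. The cost is that both hash tables must stay in memory, which is the pressure that the adaptive conversion to a hybrid-hash join described above addresses.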