Fast Queries Over Heterogeneous Data Through Engine Customization

Industry and academia are becoming increasingly data-driven and data-intensive, relying on the analysis of a wide variety of heterogeneous datasets to gain insights. The differing data models and formats make it significantly harder to analyze a combination of diverse datasets. Serving all queries with a single, general-purpose query engine is slow. On the other hand, using a specialized engine for each heterogeneous dataset increases complexity: queries touching a combination of datasets require an integration layer over the different engines. This paper presents a system design that natively supports heterogeneous data formats while also minimizing query execution times. For multi-format support, the design uses an expressive query algebra that enables operations over various data models. For minimal execution times, it uses a code generation mechanism to mimic the system and storage most appropriate for answering a query quickly. We validate our design by building Proteus, a query engine that natively supports queries over CSV, JSON, and relational binary data, and that specializes itself to each query, dataset, and workload via code generation. Proteus outperforms state-of-the-art open-source and commercial systems on both synthetic and real-world workloads without being tied to a single data model or format, all while exposing a single query interface to users.
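
To make the idea of per-query specialization concrete, the following is a minimal Python sketch, not Proteus's actual implementation. It assumes a toy setting in which a hypothetical `generate_query` function emits and compiles a scan-filter-project loop tailored to one input format (CSV or newline-delimited JSON) and one query. All function names, the string-based predicate and projection, and the use of Python's `exec` as the code generation backend are illustrative assumptions that do not come from the paper.

```python
import csv
import io
import json


def csv_scan_source(fields, header):
    """Hypothetical CSV 'plugin': emit a loop that binds fields by position."""
    # Column positions are resolved once, at generation time, so the emitted
    # loop indexes columns directly instead of looking names up per row.
    lines = ["for _row in csv.reader(io.StringIO(raw)):"]
    for f in fields:
        lines.append(f"    {f} = _row[{header.index(f)}]")
    return lines


def json_scan_source(fields):
    """Hypothetical JSON 'plugin': emit a loop over newline-delimited objects."""
    lines = ["for _line in raw.splitlines():", "    _obj = json.loads(_line)"]
    for f in fields:
        lines.append(f"    {f} = _obj[{f!r}]")
    return lines


def generate_query(fmt, fields, predicate, projection, header=None):
    """Emit and compile a function specialized to one query and one format.

    `predicate` and `projection` are Python expressions over the field names;
    they stand in for a real query algebra, which this sketch does not model.
    """
    if fmt == "csv":
        scan = csv_scan_source(fields, header)
    elif fmt == "json":
        scan = json_scan_source(fields)
    else:
        raise ValueError(f"unsupported format: {fmt}")

    body = ["def query(raw):", "    out = []"]
    body += ["    " + line for line in scan]
    body.append(f"        if {predicate}:")
    body.append(f"            out.append({projection})")
    body.append("    return out")
    source = "\n".join(body)

    # Compile the generated source; the namespace holds only what it needs.
    namespace = {"csv": csv, "io": io, "json": json}
    exec(source, namespace)
    return namespace["query"], source


# The same logical query, specialized separately for each input format.
q_csv, csv_source = generate_query(
    "csv", ["name", "age"], "int(age) > 30", "name", header=["name", "age"]
)
q_json, json_source = generate_query("json", ["name", "age"], "age > 30", "name")
```

Here the caller sees a single interface (`generate_query`), while each generated function contains only the access code its own format needs: positional indexing for CSV and key lookups for JSON. Printing `csv_source` or `json_source` shows the specialized loop that was compiled.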

Published in:
Proceedings of the VLDB Endowment, 9, 12, 972-983
Presented at:
42nd International Conference on Very Large Data Bases, New Delhi, India, September 5-9, 2016
New York, Assoc Computing Machinery

 Record created 2016-08-08, last modified 2018-01-28
