Adaptive Query Processing on Raw Data Files

Nowadays, business and scientific applications accumulate data at an increasing pace. This growth of information has already started to outgrow the capabilities of database management systems (DBMS). In a typical DBMS usage scenario, the user must define a schema, load the data, and tune the system for an expected workload before submitting any queries. Copying data into a database is a significant investment of time and resources, and in many cases it is unnecessary or, given the explosive data growth, no longer feasible in practice. Additionally, the way a DBMS stores and organizes data during loading determines how the data will be accessed for a given workload and thus bounds the maximum performance. Selecting the underlying data layout (row-store or column-store) is a critical first tuning decision that cannot be changed afterwards. Nevertheless, query analysis today is not static; it evolves as queries change. Hence, static design decisions can be suboptimal. In this thesis, we advocate in situ query processing as the principal way to manage data in a database. To address the heavy initialization cost that comes with data loading, we reconsider the data loading phase and redesign traditional query processing architectures to work efficiently over raw data files. We present adaptive data loading as an alternative to traditional full a priori data loading. We explore the potential of in situ query processing in the context of current DBMS architectures. We identify performance bottlenecks specific to in situ processing, and we introduce an adaptive indexing mechanism (positional map) that maintains positional information to provide efficient access to raw data files, together with a flexible caching structure and techniques for collecting statistics over raw data files. Moreover, we design a flexible query engine that is not built around a single storage layout but can exploit different storage layouts and data execution strategies in a single engine.
It decides during query processing which design fits the input queries and adapts the underlying data storage accordingly. By applying code generation techniques, we dynamically generate access operators tailored for specific classes of queries. This thesis revises the traditional paradigm of loading, tuning, and then querying by using in situ query processing as the principal way to minimize data-to-query time. We show that raw data files should not be considered "outside" the DBMS and that full data loading should not be a requirement for exploiting database technology. On the contrary, techniques specifically tailored to overcome the limitations of accessing raw data files can eliminate the data loading overhead, making raw data files a first-class citizen that is fully integrated with the query engine. The proposed roadmap can provide guidance on how to convert any traditional DBMS into an efficient in situ query engine.
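
As a rough illustration of the positional map idea mentioned above, the following minimal Python sketch records the byte positions of a column's fields the first time a query touches that column, so later queries can slice the raw bytes directly instead of tokenizing every row again. It is not the thesis implementation: the class name `PositionalMap`, its methods, the `full_scans` counter, and the in-memory comma-delimited input are all assumptions made for this example.

```python
class PositionalMap:
    """Hypothetical sketch: caches byte spans of fields in a delimited raw file."""

    def __init__(self, raw: bytes, delimiter: bytes = b","):
        self.raw = raw                # raw file contents (kept in memory for simplicity)
        self.delimiter = delimiter
        self.field_spans = {}         # column index -> [(start, end), ...], one span per row
        self.full_scans = 0           # how many times the raw data had to be tokenized

    def _scan_column(self, column: int) -> None:
        """Tokenize every row once and remember where the column's field lives."""
        spans = []
        row_start = 0
        for line in self.raw.splitlines(keepends=True):
            fields = line.rstrip(b"\r\n").split(self.delimiter)
            # Field start = row start + lengths of preceding fields + their delimiters.
            start = (row_start
                     + sum(len(f) for f in fields[:column])
                     + column * len(self.delimiter))
            spans.append((start, start + len(fields[column])))
            row_start += len(line)
        self.field_spans[column] = spans
        self.full_scans += 1

    def read_column(self, column: int) -> list:
        """Return the column's raw field values, scanning only on first access."""
        if column not in self.field_spans:
            self._scan_column(column)
        return [self.raw[s:e] for s, e in self.field_spans[column]]


raw = b"1,20.5,ok\n2,21.0,ok\n"
pmap = PositionalMap(raw)
first = pmap.read_column(1)    # tokenizes the rows and populates the map
second = pmap.read_column(1)   # served from the stored positions, no re-tokenizing
```

The map is populated only for columns that queries actually request, which mirrors the adaptive, query-driven character described in the abstract. A real engine would work against files on disk and bound the memory used by the map, which this sketch does not attempt.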

