Modern applications accumulate data at an exponentially increasing rate, and traditional database systems struggle to keep up. Decision support systems used in industry rely heavily on data analysis and require real-time responses irrespective of data size. To offer real-time support, traditional databases require long preprocessing steps, such as data loading and offline tuning. Loading transforms raw data into a format that reduces data access cost. Through tuning, database systems build access paths (e.g., indexes) that improve query performance by avoiding or reducing unnecessary data access. The decision on which access paths to build depends on the expected workload; thus, the database system assumes knowledge of future queries. However, decision support systems and data exploration applications have shifting requirements. As a consequence, an offline tuner with no a priori knowledge of the full workload cannot decide on the optimal set of access paths. Furthermore, access path size grows along with the input data, so building precise access paths over the entire dataset limits the scalability of database systems.

Apart from long database preprocessing, offering efficient data access despite increasing data volume becomes harder due to hardware architectural constraints such as memory size. To achieve low query latency, modern database systems store data in main memory. However, there is a physical limit on the main memory available in a server, so applications must trade memory space for query efficiency. To provide high performance irrespective of dataset growth and query workload, a database system needs to (i) shift tuning decisions from offline to query time, (ii) enable the query engine to exploit application properties when choosing fast access paths, and (iii) reduce the size of access paths to limit storage cost.
In this thesis, we present query processing techniques that adapt to the workload, application requirements, and available storage resources. Specifically, to address dynamic workloads, we turn access path creation into a continuous process that fully adapts to incoming queries. We assign all decisions on data access and access path materialization to the database optimizer at query time, and enable access path materialization to take place as a by-product of query execution, thereby removing the need for long offline tuning steps. Furthermore, we take advantage of application characteristics (precision requirements, resource availability) and design a system that can adaptively trade precision and resources for performance. By combining precise and approximate access paths, the database system reduces query response time and minimizes resource utilization; approximate access paths (e.g., sketches) require less space than their precise counterparts and offer constant access time. By improving data processing performance while reducing storage requirements through (i) adaptive access path materialization and (ii) approximate, space-efficient access paths where appropriate, our work minimizes data access cost and provides real-time responses for data exploration applications irrespective of data growth.
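To give a flavor of how an access path can be materialized as a by-product of query execution, the following sketch illustrates the general idea in the style of database cracking: each range query partitions the touched portion of a column around its predicate bounds and records the resulting pivots, so later queries on overlapping ranges scan fewer tuples. This is a minimal illustrative sketch, not the system described in the thesis; the class and method names are hypothetical.

```python
# Adaptive, query-driven index refinement (database-cracking style).
# Illustrative only: each query physically reorganizes one piece of the
# column and leaves behind pivots, so no offline tuning step is needed.
import bisect

class CrackedColumn:
    def __init__(self, values):
        self.data = list(values)   # reorganized incrementally by queries
        self.pivots = []           # sorted (value, index) crack positions

    def _crack(self, value):
        """Partition the piece containing `value` so that all elements
        smaller than `value` precede it; return its boundary index."""
        keys = [p[0] for p in self.pivots]
        i = bisect.bisect_left(keys, value)
        lo = self.pivots[i - 1][1] if i > 0 else 0
        hi = self.pivots[i][1] if i < len(self.pivots) else len(self.data)
        piece = self.data[lo:hi]
        smaller = [v for v in piece if v < value]
        larger = [v for v in piece if v >= value]
        self.data[lo:hi] = smaller + larger
        self.pivots.insert(i, (value, lo + len(smaller)))
        return lo + len(smaller)

    def range_query(self, low, high):
        # Answering [low, high) refines the physical order as a side effect.
        start = self._crack(low)
        end = self._crack(high)
        return self.data[start:end]

col = CrackedColumn([5, 2, 9, 1, 7, 3, 8])
print(sorted(col.range_query(3, 8)))  # [3, 5, 7]
```

Repeating the same query touches only the already-cracked piece between the two recorded pivots, which is how the access path converges toward an index without any upfront construction cost.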
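The space/precision trade-off of approximate access paths can be made concrete with a Count-Min sketch, a standard summary structure that answers frequency queries in constant time using fixed space, at the cost of bounded overestimation. This is a generic textbook example under assumed parameters, not the specific access paths developed in the thesis.

```python
# Minimal Count-Min sketch: fixed-size table, O(depth) updates and
# queries regardless of how much data has been ingested.
# Width/depth values and the hashing scheme are illustrative assumptions.
import hashlib

class CountMinSketch:
    def __init__(self, width=1024, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _buckets(self, key):
        # One deterministic hash per row, derived by salting with the row id.
        for row in range(self.depth):
            h = hashlib.sha1(f"{row}:{key}".encode()).hexdigest()
            yield row, int(h, 16) % self.width

    def add(self, key, count=1):
        for row, col in self._buckets(key):
            self.table[row][col] += count

    def estimate(self, key):
        # Never underestimates; overestimation is bounded by collisions.
        return min(self.table[row][col] for row, col in self._buckets(key))

cms = CountMinSketch()
for item in ["a"] * 100 + ["b"] * 5:
    cms.add(item)
print(cms.estimate("a"))  # at least 100; exact unless all rows collide
```

The sketch occupies width × depth counters no matter how many items are inserted, which is the constant-space, constant-access-time behavior contrasted above with precise access paths that grow with the data.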