Toward timely, predictable and cost-effective data analytics

Modern industrial, government, and academic organizations are collecting massive amounts of data at an unprecedented scale and pace. The ability to perform timely, predictable and cost-effective analytical processing of such large data sets in order to extract deep insights is now a key ingredient for success. Traditional database systems (DBMS) are, however, not the first choice for servicing these modern applications, despite 40 years of database research. This is due to the fact that modern applications exhibit different behavior from the one assumed by DBMS: a) timely data exploration as a new trend is characterized by ad-hoc queries and a short user interaction period, leaving little time for DBMS to do good performance tuning, b) accurate statistics representing relevant summary information about distributions of ever increasing data are frequently missing, resulting in suboptimal plan decisions and consequently poor and unpredictable query execution performance, and c) cloud service providers - a major winner in the data analytics game due to the low cost of (shared) storage - have shifted the control over data storage from DBMS to the cloud providers, making it harder for DBMS to optimize data access. This thesis demonstrates that database systems can still provide timely, predictable and cost-effective analytical processing, if they use an agile and adaptive approach. In particular, DBMS need to adapt at three levels (to workload, data and hardware characteristics) in order to stabilize and optimize performance and cost when faced with requirements posed by modern data analytics applications. Workload-driven data ingestion is introduced with NoDB as a means to enable efficient data exploration and reduce the data-to-insight time (i.e., the time to load the data and tune the system) by doing these steps lazily and incrementally as a side-effect of posed queries as opposed to mandatory first steps. Data-driven runtime access path decision making introduced with Smooth Scan alleviates suboptimal query execution, postponing the decision on access paths from query optimization, where statistics are heavily exploited, to query execution, where the system can obtain more details about data distributions. Smooth Scan uses access path morphing from one physical alternative to another to fit the observed data distributions, which removes the need for a priori access path decisions and substantially improves the predictability of DBMS. Hardware-driven query execution introduced with Skipper enables the usage of cold storage devices (CSD) as a cost-effective solution for storing the ever increasing customer data. Skipper uses an out-of-order CSD-driven query execution model based on multi-way joins coupled with efficient cache and I/O scheduling policies to hide the non-uniform access latencies of CSD. This thesis advocates runtime adaptivity as a key to dealing with raising uncertainty about workload characteristics that modern data analytics applications exhibit. Overall, the techniques introduced in this thesis through the three levels of adaptivity (workload, data and hardware-driven adaptivity) increase the usability of database systems and the user satisfaction in the case of big data exploration, making low-cost data analytics reality.

Related material