Repository logo

Infoscience

  • English
  • French
Log In
Logo EPFL, École polytechnique fédérale de Lausanne

Infoscience

  • English
  • French
Log In
  1. Home
  2. Academic and Research Output
  3. EPFL thesis
  4. Toward timely, predictable and cost-effective data analytics
 
doctoral thesis

Toward timely, predictable and cost-effective data analytics

Borovica-Gajic, Renata  
2016

Modern industrial, government, and academic organizations are collecting massive amounts of data at an unprecedented scale and pace. The ability to perform timely, predictable and cost-effective analytical processing of such large data sets in order to extract deep insights is now a key ingredient for success. Traditional database systems (DBMS) are, however, not the first choice for servicing these modern applications, despite 40 years of database research. This is due to the fact that modern applications exhibit different behavior from the one assumed by DBMS: a) timely data exploration as a new trend is characterized by ad-hoc queries and a short user interaction period, leaving little time for DBMS to do good performance tuning, b) accurate statistics representing relevant summary information about distributions of ever increasing data are frequently missing, resulting in suboptimal plan decisions and consequently poor and unpredictable query execution performance, and c) cloud service providers - a major winner in the data analytics game due to the low cost of (shared) storage - have shifted the control over data storage from DBMS to the cloud providers, making it harder for DBMS to optimize data access. This thesis demonstrates that database systems can still provide timely, predictable and cost-effective analytical processing, if they use an agile and adaptive approach. In particular, DBMS need to adapt at three levels (to workload, data and hardware characteristics) in order to stabilize and optimize performance and cost when faced with requirements posed by modern data analytics applications. Workload-driven data ingestion is introduced with NoDB as a means to enable efficient data exploration and reduce the data-to-insight time (i.e., the time to load the data and tune the system) by doing these steps lazily and incrementally as a side-effect of posed queries as opposed to mandatory first steps. Data-driven runtime access path decision making introduced with Smooth Scan alleviates suboptimal query execution, postponing the decision on access paths from query optimization, where statistics are heavily exploited, to query execution, where the system can obtain more details about data distributions. Smooth Scan uses access path morphing from one physical alternative to another to fit the observed data distributions, which removes the need for a priori access path decisions and substantially improves the predictability of DBMS. Hardware-driven query execution introduced with Skipper enables the usage of cold storage devices (CSD) as a cost-effective solution for storing the ever increasing customer data. Skipper uses an out-of-order CSD-driven query execution model based on multi-way joins coupled with efficient cache and I/O scheduling policies to hide the non-uniform access latencies of CSD. This thesis advocates runtime adaptivity as a key to dealing with raising uncertainty about workload characteristics that modern data analytics applications exhibit. Overall, the techniques introduced in this thesis through the three levels of adaptivity (workload, data and hardware-driven adaptivity) increase the usability of database systems and the user satisfaction in the case of big data exploration, making low-cost data analytics reality.

  • Files
  • Details
  • Metrics
Type
doctoral thesis
DOI
10.5075/epfl-thesis-7140
Author(s)
Borovica-Gajic, Renata  
Advisors
Ailamaki, Anastasia  
Jury

Prof. Christoph Koch (président) ; professeure Anastasia Ailamaki (directeur de thèse) ; Prof. Karl Aberer, Dr Surajit Chaudhuri, Dr Goetz Graefe (rapporteurs)

Date Issued

2016

Publisher

EPFL

Publisher place

Lausanne

Public defense year

2016-11-03

Thesis number

7140

Total of pages

227

Subjects

Database management systems

•

data analytics

•

data exploration

•

raw data access

•

robust query execution

•

predictable query performance

•

adaptive query processing

•

hardware-software co-design

•

cold storage devices

•

cost-effective data analysis

EPFL units
DIAS  
Faculty
IC  
School
IIF  
Doctoral School
EDIC  
Available on Infoscience
October 25, 2016
Use this identifier to reference this record
https://infoscience.epfl.ch/handle/20.500.14299/130642
Logo EPFL, École polytechnique fédérale de Lausanne
  • Contact
  • infoscience@epfl.ch

  • Follow us on Facebook
  • Follow us on Instagram
  • Follow us on LinkedIn
  • Follow us on X
  • Follow us on Youtube
AccessibilityLegal noticePrivacy policyCookie settingsEnd User AgreementGet helpFeedback

Infoscience is a service managed and provided by the Library and IT Services of EPFL. © EPFL, tous droits réservés