Just-in-time Analytics Over Heterogeneous Data and Hardware

Karpathiotakis, Manolis

doi:10.5075/epfl-thesis-8077

Karpathiotakis, Manolis

2017

Télécharger

Formats

Format
BibTeX
MARC
MARCXML
DublinCore
EndNote
NLM
RefWorks
RIS

Fichiers

Résumé

Industry and academia are continuously becoming more data-driven and data-intensive, relying on the analysis of a wide variety of datasets to gain insights. At the same time, data variety increases continuously across multiple axes. First, data comes in multiple formats, such as the binary tabular data of a DBMS, raw textual files, and domain-specific formats. Second, different datasets follow different data models, such as the relational and the hierarchical one. Data location also varies: Some datasets reside in a central "data lake", whereas others lie in remote data sources. In addition, users execute widely different analysis tasks over all these data types. Finally, the process of gathering and integrating diverse datasets introduces several inconsistencies and redundancies in the data, such as duplicate entries for the same real-world concept. In summary, heterogeneity significantly affects the way data analysis is performed. In this thesis, we aim for data virtualization: Abstracting data out of its original form and manipulating it regardless of the way it is stored or structured, without a performance penalty. To achieve data virtualization, we design and implement systems that i) mask heterogeneity through the use of heterogeneity-aware, high-level building blocks and ii) offer fast responses through on-demand adaptation techniques. Regarding the high-level building blocks, we use a query language and algebra to handle multiple collection types, such as relations and hierarchies, express transformations between these collection types, as well as express complex data cleaning tasks over them. In addition, we design a location-aware compiler and optimizer that masks away the complexity of accessing multiple remote data sources. Regarding on-demand adaptation, we present a design to produce a new system per query. The design uses customization mechanisms that trigger runtime code generation to mimic the system most appropriate to answer a query fast: Query operators are thus created based on the query workload and the underlying data models; the data access layer is created based on the underlying data formats. In addition, we exploit emerging hardware by customizing the system implementation based on the available heterogeneous processors â CPUs and GPGPUs. We thus pair each workload with its ideal processor type. The end result is a just-in-time database system that is specific to the query, data, workload, and hardware instance. This thesis redesigns the data management stack to natively cater for data heterogeneity and exploit hardware heterogeneity. Instead of centralizing all relevant datasets, converting them to a single representation, and loading them in a monolithic, static, suboptimal system, our design embraces heterogeneity. Overall, our design decouples the type of performed analysis from the original data layout; users can perform their analysis across data stores, data models, and data formats, but at the same time experience the performance offered by a custom system that has been built on demand to serve their specific use case.

Détails

Titre Just-in-time Analytics Over Heterogeneous Data and Hardware

Auteur(s) Karpathiotakis, Manolis

Directeur(s)

Ailamaki, Anastasia

Pagination 201

Date 2017

Editeur Lausanne, EPFL

Mots-clés (libres)

data management; data analytics; query processing; query compilation; heterogeneous data

Langue Anglais

DOI https://doi.org/10.5075/epfl-thesis-8077

Autres identifiant(s) urn: urn:nbn:ch:bel-epfl-thesis8077-4

Laboratoires DIAS

Le document apparaît dans Production scientifique et compétences > I&C - Faculté Informatique & Communications > IINFCOM > DIAS - Laboratoire de systèmes et applications de traitement de données massives
Production scientifique et compétences > Thèses EPFL
Travail produit à l'EPFL
Publié
Thèses

Date de création de la notice 2017-11-28

Actions

Aperçu

Sélectionner le fichier :