Columnar Storage Optimization and Caching for Data Lakes

Jin, Guodong; Bian, Haoqiong; Chen, Yueguo; Du, Xiaoyong

2022

Abstract

As a unified data repository, a data lake plays a vital role in enterprise data management and analysis. It composes raw files into tables that are processed in situ by various computation engines and applications. The read performance of these tables is therefore of great importance for analytical workloads in data lakes. In this paper, we improve read performance along two dimensions: (1) storage-layout optimization, which improves I/O efficiency; and (2) data caching, which reduces the number of I/Os. We observe that storage-layout optimization in existing work is limited by the physical row-group boundaries determined at data ingestion, while the existing caches in the data lake software stack are not dedicated to analytical queries on column stores. We therefore apply inter-row-group layout optimization to overcome the former limitation and propose a columnar caching mechanism with a lazy replacing policy for analytical workloads. We also show initial evaluation results that support our design.

Details

Title Columnar Storage Optimization and Caching for Data Lakes

Author(s) Jin, Guodong; Bian, Haoqiong; Chen, Yueguo; Du, Xiaoyong

Published in Proceedings of the 25th International Conference on Extending Database Technology (EDBT 2022)

Pagination 4

Pages 419–423

Presented at 25th International Conference on Extending Database Technology (EDBT 2022), Edinburgh, UK, March 29 - April 1, 2022

Date 2022-03-29

Laboratories DIAS

The document appears in Scientific output and expertise > I&C - School of Computer and Communication Sciences > IINFCOM > DIAS - Data-Intensive Applications and Systems Laboratory
Peer-reviewed publications
Conference papers
Work produced at EPFL

Record creation date 2022-12-09