Abstract

As a unified data repository, a data lake plays a vital role in enterprise data management and analysis. It organizes raw files into tables that are processed in situ by various computation engines and applications, so the read performance of these tables is critical for analytical workloads. In this paper, we improve read performance along two dimensions: (1) storage-layout optimization, which improves I/O efficiency, and (2) data caching, which reduces the amount of I/O. We observe that storage-layout optimization in existing work is limited by the physical row-group boundaries determined at data ingestion, while the existing caches in the data-lake software stack are not designed for analytical queries on column stores. We therefore apply inter-row-group layout optimization to overcome the former limitation and propose a columnar caching mechanism with a lazy replacing policy for analytical workloads. We also present initial evaluation results that support our design.
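To make the idea of caching at column granularity concrete, the following is a minimal sketch, not the mechanism proposed in the paper. It assumes cache entries are column chunks keyed by a hypothetical `(file path, row-group index, column name)` tuple, and it uses plain LRU eviction as a stand-in, since the abstract does not specify how the lazy replacing policy works. The class name `ColumnChunkCache` and the `load_chunk` callback are illustrative assumptions.

```python
from collections import OrderedDict


class ColumnChunkCache:
    """Byte-budgeted cache of column chunks (illustrative sketch only).

    Keys are assumed to be (file_path, row_group_index, column_name) tuples.
    Eviction is plain LRU, used here as a placeholder; it is NOT the lazy
    replacing policy from the paper, whose details the abstract omits.
    """

    def __init__(self, capacity_bytes):
        self.capacity_bytes = capacity_bytes
        self.used_bytes = 0
        self.hits = 0
        self.misses = 0
        self._chunks = OrderedDict()  # key -> bytes, oldest entry first

    def get(self, key, load_chunk):
        """Return the chunk for `key`, calling `load_chunk(key)` on a miss."""
        if key in self._chunks:
            self._chunks.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self._chunks[key]
        self.misses += 1
        data = load_chunk(key)  # hypothetical reader for one column chunk
        self._admit(key, data)
        return data

    def _admit(self, key, data):
        if len(data) > self.capacity_bytes:
            return  # chunk exceeds the whole budget; serve it uncached
        while self.used_bytes + len(data) > self.capacity_bytes:
            _, evicted = self._chunks.popitem(last=False)  # drop the LRU entry
            self.used_bytes -= len(evicted)
        self._chunks[key] = data
        self.used_bytes += len(data)
```

Because a query that touches only a few columns populates the cache with just those column chunks, the cache budget is not spent on columns the workload never reads, which is the general motivation for caching column stores at this granularity.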

Details