Infoscience

conference paper

No data left behind: real-time insights from a complex data ecosystem

Karpathiotakis, Manos • Floratou, Avrilia • Ozcan, Fatima • Ailamaki, Anastasia
September 24, 2017
Proceedings of the 2017 Symposium on Cloud Computing
Symposium on Cloud Computing, 2017

The typical enterprise data architecture consists of several actively updated data sources (e.g., NoSQL systems, data warehouses), and a central data lake such as HDFS, in which all the data is periodically loaded through ETL processes. To simplify query processing, state-of-the-art data analysis approaches solely operate on top of the local, historical data in the data lake, and ignore the fresh tail end of data that resides in the original remote sources. However, as many business operations depend on real-time analytics, this approach is no longer viable. The alternative is hand-crafting the analysis task to explicitly consider the characteristics of the various data sources and identify optimization opportunities, rendering the overall analysis non-declarative and convoluted. Based on our experiences operating in data lake environments, we design System-PV, a real-time analytics system that masks the complexity of dealing with multiple data sources while offering minimal response times. System-PV extends Spark with a sophisticated data virtualization module that supports multiple applications - from SQL queries to machine learning. The module features a location-aware compiler that considers source complexity, and a two-phase optimizer that produces and refines the query plans, not only for SQL queries but for all other types of analysis as well. The experiments show that System-PV is often faster than Spark by more than an order of magnitude. In addition, the experiments show that the approach of accessing both the historical and the remote fresh data is viable, as it performs comparably to solely operating on top of the local, historical data.
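
The difference between lake-only analysis and analysis that also reads the fresh remote tail can be illustrated with a minimal sketch. It is not System-PV's implementation and does not use Spark. The `Row` type, the `fresh_tail` and `total_amount` helpers, the integer `watermark` marking the last ETL load, and the sample data are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Row:
    ts: int        # event timestamp (hypothetical integer clock)
    region: str
    amount: float


def fresh_tail(remote: Iterable[Row], watermark: int,
               predicate: Callable[[Row], bool]) -> List[Row]:
    """Rows newer than the last ETL load, filtered at the remote source."""
    return [r for r in remote if r.ts > watermark and predicate(r)]


def total_amount(lake: Iterable[Row], remote: Iterable[Row],
                 watermark: int, predicate: Callable[[Row], bool]) -> float:
    """Aggregate over historical lake rows plus the fresh remote tail."""
    historical = [r for r in lake if predicate(r)]
    return sum(r.amount for r in historical + fresh_tail(remote, watermark, predicate))


def is_eu(row: Row) -> bool:
    return row.region == "EU"


# Assumption: ETL last ran at ts=2, so the lake holds rows up to ts=2,
# while the remote source also holds two newer rows.
lake = [Row(1, "EU", 10.0), Row(2, "US", 5.0)]
remote = lake + [Row(3, "EU", 7.0), Row(4, "US", 1.0)]
watermark = 2

lake_only = sum(r.amount for r in lake if is_eu(r))     # ignores fresh data
combined = total_amount(lake, remote, watermark, is_eu)  # includes fresh tail
print(lake_only, combined)
```

In this toy setting the lake-only answer misses the EU row written after the last load, while the combined query fetches only the matching rows past the watermark from the remote source. A real system would also have to decide, per source, which operations are worth pushing down, which is the kind of decision the abstract attributes to the location-aware compiler and two-phase optimizer.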

Type: conference paper
DOI: 10.1145/3127479.3131208
Author(s): Karpathiotakis, Manos; Floratou, Avrilia; Ozcan, Fatima; Ailamaki, Anastasia
Date Issued: 2017-09-24
Published in: Proceedings of the 2017 Symposium on Cloud Computing
ISBN of the book: 978-1-4503-5028-0
Editorial or Peer reviewed: REVIEWED
Written at: EPFL
EPFL units: DIAS
Event name: Symposium on Cloud Computing, 2017
Event place: Santa Clara, California
Event date: September 25-27, 2017

Available on Infoscience: September 18, 2018
Use this identifier to reference this record: https://infoscience.epfl.ch/handle/20.500.14299/148330

Infoscience is a service managed and provided by the Library and IT Services of EPFL. © EPFL, all rights reserved