Infoscience

conference paper

DiNoDB: Efficient Large-Scale Raw Data Analytics

Tian, Yongchao • Alagiannis, Ioannis • Liarou, Erietta • Ailamaki, Anastasia • Michiardi, Pietro • Vukolic, Marko
2014
Proceedings of the First International Workshop on Bringing the Value of "Big Data" to Users (Data4U 2014)
1st International Workshop on Bringing the Value of "Big Data" to Users (Data4U 2014)

Modern big data workflows, found for example in machine learning use cases, often involve repeated cycles of batch analytics and interactive analytics on temporary data. Whereas batch analytics solutions for large volumes of raw data are well established (e.g., Hadoop, MapReduce), state-of-the-art interactive analytics solutions (e.g., distributed shared-nothing RDBMSs) require a data loading and/or transformation phase, which is inherently expensive for temporary data. In this paper, we propose a novel scalable distributed solution for in-situ data analytics that offers both scalable batch and interactive data analytics on raw data, hence avoiding the loading-phase bottleneck of RDBMSs. Our system combines a MapReduce-based platform with the recently proposed NoDB paradigm, which optimizes traditional centralized RDBMSs for in-situ queries of raw files. We revisit NoDB's centralized design and scale it out to support multiple clients and data processing nodes, producing a new distributed data analytics system we call Distributed NoDB (DiNoDB). DiNoDB leverages MapReduce batch queries to produce critical pieces of metadata (e.g., distributed positional maps and vertical indices) that speed up interactive queries without the overheads of the data loading and data movement phases, allowing users to quickly and efficiently exploit their data. Our experimental analysis demonstrates that DiNoDB significantly reduces the data-to-query latency with respect to comparable state-of-the-art distributed query engines, such as Shark, Hive and HadoopDB.
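
To make the idea of a positional map concrete, the following is a minimal single-machine sketch in Python. It is not DiNoDB's implementation: the function names (build_positional_map, read_column), the comma-delimited layout, and the sample data are all assumptions made for illustration, and quoted fields are not handled. One pass over the raw bytes records where selected columns start and end in each row; a later query then slices those byte ranges directly instead of re-tokenizing every line.

```python
def build_positional_map(raw: bytes, columns, delimiter=b","):
    """Scan the raw bytes once and record, per row, the absolute
    (start, end) byte offsets of each requested column.

    Hypothetical helper for illustration; assumes unquoted,
    single-byte-delimited fields.
    """
    pmap = []
    offset = 0  # absolute offset of the current line within `raw`
    for line in raw.splitlines(keepends=True):
        body = line.rstrip(b"\r\n")
        # Field start positions: 0, then one past every delimiter.
        starts = [0]
        pos = body.find(delimiter)
        while pos != -1:
            starts.append(pos + 1)
            pos = body.find(delimiter, pos + 1)
        # A field ends just before the next delimiter, or at end of line.
        ends = [s - 1 for s in starts[1:]] + [len(body)]
        pmap.append({c: (offset + starts[c], offset + ends[c]) for c in columns})
        offset += len(line)
    return pmap


def read_column(raw: bytes, pmap, column):
    """Answer a single-column query by slicing the raw bytes at the
    recorded offsets, without splitting any line again."""
    return [raw[start:end].decode() for start, end in (row[column] for row in pmap)]


# Assumed sample data: id, name, age.
raw = b"1,alice,30\n2,bob,25\n3,carol,41\n"
pmap = build_positional_map(raw, columns=[1, 2])
ages = read_column(raw, pmap, 2)    # ['30', '25', '41']
names = read_column(raw, pmap, 1)   # ['alice', 'bob', 'carol']
```

In this sketch the map is built eagerly in one scan, which loosely mirrors the abstract's description of producing the metadata in a batch step and reusing it for interactive queries. How DiNoDB actually partitions, stores, and distributes its positional maps is not specified on this page.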

Type
conference paper
DOI
10.1145/2658840.2658841
Author(s)
Tian, Yongchao
Alagiannis, Ioannis  
Liarou, Erietta  
Ailamaki, Anastasia  
Michiardi, Pietro
Vukolic, Marko
Date Issued

2014

Published in
Proceedings of the First International Workshop on Bringing the Value of "Big Data" to Users (Data4U 2014)
Subjects

Distributed database • In situ query • Positional map file

Editorial or Peer reviewed

REVIEWED

Written at

EPFL

EPFL units
DIAS  
Event name
1st International Workshop on Bringing the Value of "Big Data" to Users (Data4U 2014)

Event place
Hangzhou, China

Event date
September 1, 2014

Available on Infoscience
August 19, 2015
Use this identifier to reference this record
https://infoscience.epfl.ch/handle/20.500.14299/117145
Infoscience is a service managed and provided by the Library and IT Services of EPFL. © EPFL, all rights reserved