Pfimbi: Accelerating Big Data Jobs Through Flow-Controlled Data Replication

The performance of HDFS is critical to big data software stacks and has been at the forefront of recent efforts from the industry and the open source community. A key problem is the lack of flexibility in how data replication is performed. To address this problem, this paper presents Pfimbi, the first alternative to HDFS that supports both synchronous and flow-controlled asynchronous data replication. Pfimbi has numerous benefits: It accelerates jobs, exploits under-utilized storage I/O bandwidth, and supports hierarchical storage I/O bandwidth allocation policies. We demonstrate that for a job trace derived from a Facebook workload, Pfimbi improves the average job runtime by 18% and by up to 46% in the best case. We also demonstrate that flow control is crucial to fully exploiting the benefits of asynchronous replication; removing Pfimbi’s flow control mechanisms resulted in a 2.7x increase in job runtime.

Presented at:
32nd International Conference on Massive Storage Systems and Technology (MSST 2016), Santa Clara, May 2-6, 2016

 Record created 2016-05-25, last modified 2018-03-17
