Persona: A High-Performance Bioinformatics Framework

Byma, Stuart Anthony; Whitlock, Sam David; Flueratoru, Laura; Tseng, Ethan; Kozyrakis, Christos; Bugnion, Edouard; Larus, James

2017

Abstract

Next-generation genome sequencing technology has reached a point at which it is becoming cost-effective to sequence all patients. Biobanks and researchers are faced with an oncoming deluge of genomic data, whose processing requires new and scalable bioinformatics architectures and systems. Processing raw genetic sequence data is computationally expensive and datasets are large. Current software systems can require many hours to process a single genome and generally run only on a single computer. Common file formats are monolithic and row-oriented, which is a barrier to distributed computation. To address these challenges, we built Persona, a cluster-scale, high-throughput bioinformatics framework. Persona currently supports paired-read alignment, sorting, and duplicate marking using well-known algorithms and techniques. Persona can significantly reduce end-to-end processing times for bioinformatics computations. A new Aggregate Genomic Data (AGD) format unifies sample data and analysis results, while enabling efficient distributed computation and I/O. In a case study on sequence alignment, Persona sustains 1.353 gigabases aligned per second with 101-base-pair reads on a 32-node cluster and can align a full genome in approximately 16.7 seconds using the SNAP algorithm. Our results demonstrate that: (1) alignment computation with Persona scales linearly across servers with no measurable completion-time imbalance and negligible framework overheads; (2) on a single server, sorting with Persona and AGD is up to 2.3× faster than commonly used tools, while duplicate marking is 3× faster; (3) with AGD, a 7-node COTS network storage system can service up to 60 alignment compute nodes; (4) server cost dominates for a balanced system running Persona, while long-term data storage dwarfs the cost of computation.
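The alignment figures quoted above can be related to one another with simple arithmetic. The following is a minimal sketch, not taken from the paper: it assumes that the 16.7-second genome alignment ran at the quoted sustained rate on the same 32-node cluster, and that linear scaling means throughput divides evenly across nodes. The function name and dictionary keys are hypothetical.

```python
# Back-of-the-envelope figures derived from the numbers in the abstract.
# Assumption (not stated in the source): the ~16.7 s full-genome alignment
# ran at the quoted sustained rate on the same 32-node cluster.

THROUGHPUT_BASES_PER_S = 1.353e9  # 1.353 gigabases aligned per second
READ_LENGTH_BP = 101              # base pairs per read
CLUSTER_NODES = 32                # alignment compute nodes
GENOME_ALIGN_SECONDS = 16.7       # approximate full-genome alignment time
STORAGE_NODES = 7                 # COTS network storage nodes
MAX_COMPUTE_NODES = 60            # compute nodes that storage system can serve


def derived_figures():
    """Return quantities implied by the abstract's headline numbers."""
    total_bases = THROUGHPUT_BASES_PER_S * GENOME_ALIGN_SECONDS
    return {
        # Even split across nodes, assuming perfectly linear scaling.
        "bases_per_node_per_s": THROUGHPUT_BASES_PER_S / CLUSTER_NODES,
        "reads_per_s": THROUGHPUT_BASES_PER_S / READ_LENGTH_BP,
        "total_bases": total_bases,
        "total_reads": total_bases / READ_LENGTH_BP,
        "compute_nodes_per_storage_node": MAX_COMPUTE_NODES / STORAGE_NODES,
    }


if __name__ == "__main__":
    for name, value in derived_figures().items():
        print(f"{name}: {value:,.1f}")
```

Under these assumptions, the full-genome run corresponds to roughly 22.6 gigabases, or about 224 million reads of 101 base pairs, and each storage node serves between eight and nine compute nodes.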

Details

Title Persona: A High-Performance Bioinformatics Framework

Author(s) Byma, Stuart Anthony; Whitlock, Sam David; Flueratoru, Laura; Tseng, Ethan; Kozyrakis, Christos; Bugnion, Edouard; Larus, James

Conference USENIX Annual Technical Conference 2017, Santa Clara, California, USA, July 12-14, 2017

Date 2017

Keywords

systems; bioinformatics; tensorflow; big data

Laboratories UPLARUS; DCSL

Record appears in Peer-reviewed publications; Conference Papers; Work produced at EPFL; Published

Record creation date 2017-07-06
