conference paper not in proceedings

Persona: A High-Performance Bioinformatics Framework

Byma, Stuart Anthony  
•
Whitlock, Sam David  
•
Flueratoru, Laura
2017
USENIX Annual Technical Conference 2017

Next-generation genome sequencing technology has reached a point at which it is becoming cost-effective to sequence all patients. Biobanks and researchers are faced with an oncoming deluge of genomic data, whose processing requires new and scalable bioinformatics architectures and systems. Processing raw genetic sequence data is computationally expensive and datasets are large. Current software systems can require many hours to process a single genome and generally run only on a single computer. Common file formats are monolithic and row-oriented, a barrier to distributed computation. To address these challenges, we built Persona, a cluster-scale, high-throughput bioinformatics framework. Persona currently supports paired-read alignment, sorting, and duplicate marking using well-known algorithms and techniques. Persona can significantly reduce end-to-end processing times for bioinformatics computations. A new Aggregate Genomic Data (AGD) format unifies sample data and analysis results, while enabling efficient distributed computation and I/O. In a case study on sequence alignment, Persona sustains 1.353 gigabases aligned per second with 101-base-pair reads on a 32-node cluster and can align a full genome in ~16.7 seconds using the SNAP algorithm. Our results demonstrate that: (1) alignment computation with Persona scales linearly across servers with no measurable completion-time imbalance and negligible framework overheads; (2) on a single server, sorting with Persona and AGD is up to 2.3× faster than commonly used tools, while duplicate marking is 3× faster; (3) with AGD, a 7-node COTS network storage system can service up to 60 alignment compute nodes; (4) server cost dominates for a balanced system running Persona, while long-term data storage dwarfs the cost of computation.
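To put the headline numbers in perspective, the following back-of-envelope sketch derives a few quantities from the figures quoted in the abstract. It is not taken from the paper: the function name and dictionary keys are illustrative, the per-node figure assumes work is split evenly across the 32 nodes (in keeping with the reported linear scaling), and the implied dataset size assumes the sustained throughput holds for the whole ~16.7-second run.

```python
# Figures quoted in the abstract.
THROUGHPUT_GBASES_PER_S = 1.353  # sustained gigabases aligned per second
READ_LENGTH_BP = 101             # base pairs per read
CLUSTER_NODES = 32               # alignment cluster size
GENOME_ALIGN_SECONDS = 16.7      # approximate time to align a full genome
STORAGE_NODES = 7                # COTS network storage system size
MAX_COMPUTE_NODES = 60           # compute nodes that storage can service


def derived_figures():
    """Return quantities implied by the abstract's numbers (illustrative only)."""
    bases_per_s = THROUGHPUT_GBASES_PER_S * 1e9
    # Assumption: sustained throughput applies over the full alignment run.
    dataset_gbases = THROUGHPUT_GBASES_PER_S * GENOME_ALIGN_SECONDS
    return {
        "reads_per_s": bases_per_s / READ_LENGTH_BP,
        # Assumption: work is divided evenly across all nodes.
        "per_node_mbases_per_s": bases_per_s / CLUSTER_NODES / 1e6,
        "dataset_gbases": dataset_gbases,
        "dataset_reads": dataset_gbases * 1e9 / READ_LENGTH_BP,
        "compute_nodes_per_storage_node": MAX_COMPUTE_NODES / STORAGE_NODES,
    }


if __name__ == "__main__":
    for name, value in derived_figures().items():
        print(f"{name}: {value:,.2f}")
```

Under these assumptions, the cluster aligns roughly 13.4 million reads per second (about 42 megabases per second per node), the full-genome run corresponds to roughly 22.6 gigabases or about 224 million reads, and each storage node serves between eight and nine compute nodes.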

Type
conference paper not in proceedings
Author(s)
Byma, Stuart Anthony  
Whitlock, Sam David  
Flueratoru, Laura
Tseng, Ethan
Kozyrakis, Christos
Bugnion, Edouard  
Larus, James  
Date Issued

2017

Subjects
systems • bioinformatics • tensorflow • big data

URL
https://www.usenix.org/conference/atc17/technical-sessions/presentation/byma
Editorial or Peer reviewed

REVIEWED

Written at

EPFL

EPFL units
VLSC  
DCSL  
Event name
USENIX Annual Technical Conference 2017
Event place
Santa Clara, California, USA
Event date
July 12-14, 2017

Available on Infoscience
July 6, 2017
Use this identifier to reference this record
https://infoscience.epfl.ch/handle/20.500.14299/138821