Repository logo

Infoscience

  • English
  • French
Log In
Logo EPFL, École polytechnique fédérale de Lausanne

Infoscience

  • English
  • French
Log In
  1. Home
  2. Academic and Research Output
  3. EPFL thesis
  4. Parallel and Scalable Bioinformatics
 
doctoral thesis

Parallel and Scalable Bioinformatics

Byma, Stuart Anthony  
2020

The field of genomics is likely to become the largest producer of data as a consequence of the large-scale application of next-generation sequencing technology for biological research and personalized medical treatments. The raw sequence data produced by these methods is limited in usefulness and requires computational analysis to unlock its potential. Bioinformatics is a field that combines biology, genomics, and computer science to build algorithms and software to analyze biological data. Some of the current bioinformatics tools are having difficulty keeping up with the increasing rate of data production. For example, raw sequence preprocessing, which involves aligning subsequences to a reference genome, sorting, and other operations, can take many hours. Downstream processing applications also require computational innovation -- protein sequence similarity search, an important tool in protein function characterization and the study of evolution, can take weeks or months to build high-quality databases, even relatively small ones composed of just a few thousand genomes.

This thesis shows that these computational challenges can be effectively and efficiently solved by a combination of fine-grained parallelism and horizontal scaling on highly-parallel compute clusters and data centers. This is shown through three primary contributions.

First, the preprocessing of whole-genome sequencing reads is addressed with Persona. Persona is a high performance and scalable bioinformatics system that unifies data, tools, algorithms, and processes for alignment, sorting, duplicate marking, and other operations in a common framework that scales linearly. For example, Persona can align 220 million short reads in ~17 seconds using a 32-node cluster. Second, a new technique for measuring and analyzing heap usage is introduced, which can help bioinformatics and other programs make more efficient use of memory, leading to performance gains of up to 10%. Finally, to accelerate protein similarity search, a new clustering algorithm is introduced that exposes parallelism, which, when combined with dynamic load-balancing, allows for efficient and scalable execution, leading to speedups of over 1400x over existing methods.

  • Files
  • Details
  • Metrics
Type
doctoral thesis
DOI
10.5075/epfl-thesis-10141
Author(s)
Byma, Stuart Anthony  
Advisors
Larus, James Richard  
Jury

Prof. Babak Falsafi (président) ; Prof. James Richard Larus (directeur de thèse) ; Prof. Edouard Bugnion, Prof. Ioannis Xenarios, Prof. Christophe Dessimoz (rapporteurs)

Date Issued

2020

Publisher

EPFL

Publisher place

Lausanne

Public defense year

2020-05-08

Thesis number

10141

Total of pages

149

Subjects

bioinformatics

•

frameworks

•

big data

•

data center

•

scale-out

•

profiling

•

parallel algorithms

•

clustering

•

protein similarity search

EPFL units
VLSC  
Faculty
IC  
School
IIF  
Doctoral School
EDIC  
Available on Infoscience
April 28, 2020
Use this identifier to reference this record
https://infoscience.epfl.ch/handle/20.500.14299/168424
Logo EPFL, École polytechnique fédérale de Lausanne
  • Contact
  • infoscience@epfl.ch

  • Follow us on Facebook
  • Follow us on Instagram
  • Follow us on LinkedIn
  • Follow us on X
  • Follow us on Youtube
AccessibilityLegal noticePrivacy policyCookie settingsEnd User AgreementGet helpFeedback

Infoscience is a service managed and provided by the Library and IT Services of EPFL. © EPFL, tous droits réservés