Parallel and Scalable Bioinformatics

Byma, Stuart Anthony

doi:10.5075/epfl-thesis-10141

doctoral thesis

2020

The field of genomics is likely to become the largest producer of data as a consequence of the large-scale application of next-generation sequencing technology to biological research and personalized medical treatments. The raw sequence data produced by these methods is of limited use on its own and requires computational analysis to unlock its potential. Bioinformatics is a field that combines biology, genomics, and computer science to build algorithms and software for analyzing biological data. Some current bioinformatics tools have difficulty keeping up with the increasing rate of data production. For example, raw sequence preprocessing, which involves aligning subsequences to a reference genome, sorting, and other operations, can take many hours. Downstream processing applications also require computational innovation: for protein sequence similarity search, an important tool in protein function characterization and the study of evolution, building high-quality databases can take weeks or months, even for relatively small databases composed of just a few thousand genomes.

This thesis shows that these computational challenges can be solved effectively and efficiently by combining fine-grained parallelism with horizontal scaling on highly parallel compute clusters and data centers. This is shown through three primary contributions.

First, the preprocessing of whole-genome sequencing reads is addressed with Persona, a high-performance, scalable bioinformatics system that unifies data, tools, algorithms, and processes for alignment, sorting, duplicate marking, and other operations in a common framework that scales linearly. For example, Persona can align 220 million short reads in ~17 seconds on a 32-node cluster. Second, a new technique for measuring and analyzing heap usage is introduced, which can help bioinformatics and other programs use memory more efficiently, leading to performance gains of up to 10%. Finally, to accelerate protein similarity search, a new clustering algorithm is introduced that exposes parallelism and, when combined with dynamic load balancing, allows efficient and scalable execution, leading to speedups of more than 1400x over existing methods.

Type

doctoral thesis

DOI

10.5075/epfl-thesis-10141

Author(s)

Byma, Stuart Anthony

Advisors

Larus, James Richard

Jury

Prof. Babak Falsafi (president); Prof. James Richard Larus (thesis director); Prof. Edouard Bugnion, Prof. Ioannis Xenarios, Prof. Christophe Dessimoz (examiners)

Date Issued

2020

Publisher

EPFL

Publisher place

Lausanne

Public defense date

2020-05-08

Thesis number

10141

Total of pages

149

Subjects

bioinformatics • frameworks • big data • data center • scale-out • profiling • parallel algorithms • clustering • protein similarity search

EPFL units

Faculty

School

Doctoral School

Available on Infoscience

April 28, 2020

Use this identifier to reference this record

https://infoscience.epfl.ch/handle/20.500.14299/168424