Parallel and Scalable Bioinformatics

Byma, Stuart Anthony

doi:10.5075/epfl-thesis-10141

doctoral thesis

2020

Genomics is likely to become the largest producer of data, a consequence of the large-scale application of next-generation sequencing technology to biological research and personalized medical treatment. The raw sequence data produced by these methods is of limited use on its own and requires computational analysis to unlock its potential. Bioinformatics combines biology, genomics, and computer science to build algorithms and software that analyze biological data. Some current bioinformatics tools have difficulty keeping up with the increasing rate of data production. For example, raw sequence preprocessing, which involves aligning subsequences to a reference genome, sorting, and other operations, can take many hours. Downstream processing applications also require computational innovation: building high-quality databases for protein sequence similarity search, an important tool in protein function characterization and the study of evolution, can take weeks or months, even for relatively small databases composed of just a few thousand genomes.

This thesis shows that these computational challenges can be solved effectively and efficiently by combining fine-grained parallelism with horizontal scaling on highly parallel compute clusters and data centers. This is shown through three primary contributions.

First, Persona addresses the preprocessing of whole-genome sequencing reads. Persona is a high-performance, scalable bioinformatics system that unifies data, tools, algorithms, and processes for alignment, sorting, duplicate marking, and other operations in a common framework that scales linearly. For example, Persona can align 220 million short reads in ~17 seconds on a 32-node cluster. Second, a new technique for measuring and analyzing heap usage is introduced; it can help bioinformatics and other programs use memory more efficiently, yielding performance gains of up to 10%. Finally, to accelerate protein similarity search, a new clustering algorithm is introduced that exposes parallelism and, combined with dynamic load balancing, allows efficient and scalable execution, yielding speedups of more than 1400x over existing methods.
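
To put the alignment figure in perspective, the quoted numbers can be turned into an approximate throughput. The short calculation below is only a back-of-the-envelope sketch: it takes the 220 million reads, ~17 seconds, and 32 nodes from the abstract at face value, and it assumes the work is spread evenly across nodes, which the abstract does not state explicitly. The function name `alignment_throughput` is hypothetical and used only for illustration.

```python
def alignment_throughput(reads, seconds, nodes):
    """Return (cluster-wide reads/s, per-node reads/s).

    Assumes an even split of work across nodes, which is an
    illustrative simplification rather than a measured result.
    """
    cluster_rate = reads / seconds
    per_node_rate = cluster_rate / nodes
    return cluster_rate, per_node_rate


if __name__ == "__main__":
    # Figures quoted in the abstract; the runtime is approximate (~17 s).
    cluster_rate, per_node_rate = alignment_throughput(220_000_000, 17.0, 32)
    print(f"cluster:  {cluster_rate / 1e6:.2f} million reads/s")
    print(f"per node: {per_node_rate / 1e3:.0f} thousand reads/s")
```

Under these assumptions the cluster aligns roughly 13 million reads per second, or about 400 thousand reads per second per node.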

Name: EPFL_TH10141.pdf
Type: N/a
Access type: openaccess
License Condition: Copyright
Size: 5.01 MB
Format: Adobe PDF
Checksum (MD5): 3c62427dfef44441b18fcf7c8f1f851d
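
The MD5 checksum listed for the PDF can be used to check that a downloaded copy is intact. The following is a minimal sketch using Python's standard `hashlib` module; the helper names `md5_of_file` and `matches_listed_checksum` are hypothetical, and the local file path is whatever the reader saved the PDF as. MD5 is suitable here only for detecting accidental corruption, not for security purposes.

```python
import hashlib
import sys

# Checksum as listed in this record for EPFL_TH10141.pdf.
EXPECTED_MD5 = "3c62427dfef44441b18fcf7c8f1f851d"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_checksum(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest equals the expected value."""
    return md5_of_file(path) == expected.lower()


if __name__ == "__main__" and len(sys.argv) == 2:
    # Usage: python verify_md5.py <path to downloaded PDF>
    print("OK" if matches_listed_checksum(sys.argv[1]) else "MISMATCH")
```

A result of `MISMATCH` would suggest an incomplete or altered download, in which case fetching the file again is the simplest remedy.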