
Infoscience

doctoral thesis

Parallel and Scalable Bioinformatics

Byma, Stuart Anthony  
2020

The field of genomics is likely to become the largest producer of data as a consequence of the large-scale application of next-generation sequencing technology for biological research and personalized medical treatments. The raw sequence data produced by these methods is of limited use on its own and requires computational analysis to unlock its potential. Bioinformatics is a field that combines biology, genomics, and computer science to build algorithms and software to analyze biological data. Some of the current bioinformatics tools are having difficulty keeping up with the increasing rate of data production. For example, raw sequence preprocessing, which involves aligning subsequences to a reference genome, sorting, and other operations, can take many hours. Downstream processing applications also require computational innovation: for protein sequence similarity search, an important tool in protein function characterization and the study of evolution, building high-quality databases can take weeks or months, even for relatively small ones composed of just a few thousand genomes.

This thesis shows that these computational challenges can be effectively and efficiently solved by a combination of fine-grained parallelism and horizontal scaling on highly parallel compute clusters and data centers. It does so through three primary contributions.

First, the preprocessing of whole-genome sequencing reads is addressed with Persona. Persona is a high-performance and scalable bioinformatics system that unifies data, tools, algorithms, and processes for alignment, sorting, duplicate marking, and other operations in a common framework that scales linearly. For example, Persona can align 220 million short reads in ~17 seconds using a 32-node cluster. Second, a new technique for measuring and analyzing heap usage is introduced, which can help bioinformatics and other programs make more efficient use of memory, leading to performance gains of up to 10%. Finally, to accelerate protein similarity search, a new clustering algorithm is introduced that exposes parallelism, which, when combined with dynamic load-balancing, allows for efficient and scalable execution, leading to speedups of more than 1400x over existing methods.
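
As a rough sanity check on the Persona figure quoted above, the numbers imply an aggregate and a per-node alignment rate. The sketch below only performs that arithmetic. It assumes that the ~17 seconds is wall-clock time for the whole 32-node cluster and that the work is spread evenly across nodes; the abstract states neither explicitly. The function name `alignment_throughput` is hypothetical and is not part of Persona.

```python
from typing import Tuple


def alignment_throughput(total_reads: float, seconds: float, nodes: int) -> Tuple[float, float]:
    """Return (aggregate reads/s, reads/s per node).

    Assumption: the work is split evenly across nodes, so the per-node
    rate is simply the aggregate rate divided by the node count.
    """
    if seconds <= 0 or nodes <= 0:
        raise ValueError("seconds and nodes must be positive")
    aggregate = total_reads / seconds
    return aggregate, aggregate / nodes


if __name__ == "__main__":
    # Figures taken from the abstract: 220 million reads, ~17 s, 32 nodes.
    aggregate, per_node = alignment_throughput(220e6, 17, 32)
    print(f"aggregate: {aggregate / 1e6:.1f} million reads/s")
    print(f"per node:  {per_node / 1e3:.0f} thousand reads/s")
```

Under these assumptions, the quoted figures work out to roughly 12.9 million reads per second in aggregate, or about 400 thousand reads per second per node.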

Name: EPFL_TH10141.pdf
Type: N/a
Access type: openaccess
License Condition: Copyright
Size: 5.01 MB
Format: Adobe PDF
Checksum (MD5): 3c62427dfef44441b18fcf7c8f1f851d


Infoscience is a service managed and provided by the Library and IT Services of EPFL. © EPFL, all rights reserved