Scaling Out Bioinformatics in the Data Center

Whole Genome Sequencing is a process in the field of bioinformatics that transforms biological samples of DNA into an electronic dataset of genetic bases. The process consists of two sequential components. First, the laboratory process of sequencing transforms the biological DNA samples into a digital format by transcribing the sequence of bases that make up short snippets of DNA. Second, a genomic workflow uses software to transform this data into a representation that is useful for genomic analysis. Recent advancements in High-Throughput Sequencing technology enable the laboratory phase to produce data faster and at a lower cost than prior techniques could. The applications and file formats used in workflows have not advanced commensurately to accommodate this deluge of genomic data. This thesis introduces a redesign of genomic workflows, their component applications, and the underlying file formats in order to scale out workflows across a data center. The design builds upon two components. First, a unified file format supplants the myriad existing formats in order to accommodate scale-out multi-machine I/O. The file format imposes minimal feature requirements upon the storage system, thereby enabling its use both in high-performance systems for processing and in cost-effective cold storage systems for long-term storage. Second, a new cloud computing framework provides an API for composing workflows in an abstract logical description and delegating the execution of the logic to a common runtime. The framework's runtime executes the logical workflow description on scale-out hardware resources while abstracting the execution details. We combine the file format and framework to build a new set of workflows that scale out across data center resources.
These scale-out workflows incorporate existing workflow applications (compartmentalized into libraries that the framework invokes) and new applications that leverage the features provided by the scale-out architecture. All workflows delegate work distribution and task concurrency to the framework's runtime and utilize a common set of subcomponents for auxiliary code (e.g., I/O with various storage systems, processing different file formats). These workflows are able to scale out across a data center to the point of saturating the throughput of one or more hardware resources.
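
The separation between a logical workflow description and the runtime that executes it can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not the thesis's actual API: the names Workflow, LocalRuntime, drop_ambiguous, and gc_fraction are hypothetical, the stages are toy stand-ins for real genomic applications, and a local thread pool stands in for scale-out data center resources.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Workflow:
    """Hypothetical logical description: an ordered list of named stages."""
    stages: List[Tuple[str, Callable]] = field(default_factory=list)

    def stage(self, name: str, fn: Callable) -> "Workflow":
        self.stages.append((name, fn))
        return self  # returning self allows chained composition


class LocalRuntime:
    """Hypothetical runtime: owns work distribution and concurrency."""

    def __init__(self, workers: int = 4):
        self.workers = workers

    def _run_chunk(self, workflow: Workflow, chunk):
        # Apply every stage, in order, to one chunk of input data.
        for _name, fn in workflow.stages:
            chunk = fn(chunk)
        return chunk

    def run(self, workflow: Workflow, chunks):
        # Chunks are processed concurrently; map() preserves input order.
        with ThreadPoolExecutor(max_workers=self.workers) as pool:
            return list(pool.map(lambda c: self._run_chunk(workflow, c), chunks))


# Toy stages (assumptions) standing in for real workflow applications.
def drop_ambiguous(reads):
    """Discard reads that contain an ambiguous base call 'N'."""
    return [r for r in reads if "N" not in r]


def gc_fraction(reads):
    """Pair each read with the fraction of its bases that are G or C."""
    return [(r, (r.count("G") + r.count("C")) / len(r)) for r in reads]


# The workflow states only what to compute; the runtime decides how.
workflow = Workflow().stage("filter", drop_ambiguous).stage("gc", gc_fraction)
chunks = [["ACGT", "NNAC"], ["GGCC", "ATAT"]]
results = LocalRuntime(workers=2).run(workflow, chunks)
```

In this sketch the workflow contains no threading or scheduling code, so a different runtime could execute the same description without changes to the stages.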

Bugnion, Edouard
Lausanne, EPFL

 Record created 2019-07-24, last modified 2019-07-24
