On Secure Cloud Computing for Genomic Data: From Storage to Analysis

Huang, Zhicong

doi:10.5075/epfl-thesis-8367

doctoral thesis

On Secure Cloud Computing for Genomic Data: From Storage to Analysis

2018

Although privacy is generally considered to be the right of an individual or group to control information about themselves, such a right has become challenging to protect in the digital era, this is exemplified by the case of cloud-based genomic computing. Despite the rapid progress in understanding, producing, and using genomic information, the practice of genomic data protection remains a fairly underdeveloped area. One of the indisputable reasons is that most nonexpert individuals do not realize the sensitive nature of their genomic data, unless it has been used against them. Many commercial organizations take advantage of their customers by taking control of personal genomic information, if customers want to benefit from services such as genetic analysis; even worse, these organizations often do not enforce proper protection, which could result in embarrassing data breaches. In this thesis, we investigate the potential threats of cloud- based genomic computing systems and propose various countermeasures by taking into account the functionality requirement.

We begin with the most basic system where only symmetric encryption is needed for the cloud storage of genomic data, and we propose a new solution that protects the data against brute-force attacks that threaten the security of password-based encryption in direct-to-consumer companies. The solution employs honey encryption, where plaintext messages need to be transformed to a different space with uniform distribution on elements. We present a novel distribution-transformation encoder. We provide formal security proof of our solution.

We analyze the scenario where efficient searching on encrypted data is necessary. We propose a system that provides fast retrieval on encrypted compressed data and that enables individuals to authorize access to fine-grained regions during data retrieval. Our solution addresses three critical dimensions in platforms that use large genomic data: encryption, compression, and efficient data retrieval. Compared with a previous de facto standard solution for storing aligned genomic data, our solution uses 18% less storage.

To enable complicated data analysis, we focus on a proposal for secure quality-control of genomic data by using secure multi-party computation based on garbled circuits. Our proposal is for aggregated genomic data sharing, where researchers want to collaborate to perform large-scale genome-wide association studies in order to identify significant genetic variants for certain diseases. Data quality control is the very first stage of such a collaboration and remains a driving factor for further steps. We investigate the feasibility of advanced cryptographic techniques in the data protection of this phase. We demonstrate that for certain protocols, our solution is efficient and scalable.

With the advent of precision medicine based on genomic data, the future of big data has become clearly inseparable from cloud-based genomic computing. It is important to continuously re-evaluate the standards of cloud-based genomic computing as novel technologies are developed, security threats arise, and more complex genomic analyses become possible. This is not only a battle against cyber criminals, but also against rigid and ignorant practices. Integrative solutions that carefully consider the use and misuse of personal genomic data are essential for ensuring secure, effective storage and maximizing utility in treating and preventing disease.

Name

EPFL_TH8367.pdf

Access type

openaccess

Size

7.38 MB

Format

Adobe PDF

Checksum (MD5)

34ea283f8fba76fd93b51ebd37d992fc