Self-indexing Algorithms for Graph Representations of Genomic References
High-throughput sequencing technologies continue to become more cost-effective and accurate and produce an immense volume of sequencing data. This data must be efficiently stored, transported, searched, browsed, and analysed to be useful in genomic pipelines, and the effectiveness of these pipelines depend on accessible and expressive data representations. Appropriate indexed representations are therefore necessary to facilitate access, extract, and utilise the information present in the data. Common indexed representations are based on linear reference sequences. Although linear references offer sufficient approximations for some applications, they do not offer a better or complete representation of the true underlying biological content. Applications such as de novo assembly and transcriptomic analysis are inherently graph-based problems. Furthermore, graph representations that include the diversity of each organism provide advantages in biological analysis and studies. In such cases, graphs called pangenomes are constructed as collections of genomes consolidated into a graph, where the genomic sequence of each individual is represented in the graph as walks. This approach is favoured in various genomic pipelines for its innate capacity to more accurately depict variants and alternative sequences. Despite their various advantages in representing underlying biological structures, indexing and random access to graph representations are computationally more difficult. Indexing methods on sequences, such as Burrows-Wheeler transform-based indices and minimiser tables, are well-studied for string matching problems, and offer efficient solutions on sequences. However, adapting these structures to suit the needs of graph-based workloads remains an open research problem. This thesis focuses on developing representations, data structures, and algorithms suited for indexing, compressing, and querying graph-based representations. The thesis first presents a state-of-the-art a block-based compressive FM-Index codec supporting the handling and querying of massive files, even on distributed systems. This is followed by the main contribution of the thesis: the development of a novel self-indexing algorithm for string labelled graphs of arbitrary topology, called the FM-Directory. The FM-Directory is demonstrated to answer queries empirically in linear time for typical sequence analysis problems, (polynomial time in the worst case) and can be implemented in compressed form. These developments are followed up by an implementation of the algorithms, called GIN-TONIC, and benchmarks, which exhibit excellent practical performance. This discussion is then supported by a rigorous mathematical analysis of the data structure and several possible extensions and interesting use cases that benefit from a full-graph index. The thesis also explores two other string problems. The first recasts search in Levenshtein spaces to geometric search through sequence embeddings to Euclidean spaces learned on random walks in edit spaces, and the quality of these embeddings is evaluated in tasks such as string retrieval and clustering. The second investigates compression and assembly in the context of genomics through a lightweight multiple sequence alignment algorithm optimised for sequences sharing a global common region.
EPFL_TH10994.pdf
main document
openaccess
N/A
7.87 MB
Adobe PDF
f2943cb746b82cd9c21aeafd0af8c6b7