Repository logo

Infoscience

  • English
  • French
Log In
Logo EPFL, École polytechnique fédérale de Lausanne

Infoscience

  • English
  • French
Log In
  1. Home
  2. Academic and Research Output
  3. EPFL thesis
  4. Self-indexing Algorithms for Graph Representations of Genomic References
 
doctoral thesis

Self-indexing Algorithms for Graph Representations of Genomic References

Öztürk, Ünsal  
2024

High-throughput sequencing technologies continue to become more cost-effective and accurate and produce an immense volume of sequencing data. This data must be efficiently stored, transported, searched, browsed, and analysed to be useful in genomic pipelines, and the effectiveness of these pipelines depend on accessible and expressive data representations. Appropriate indexed representations are therefore necessary to facilitate access, extract, and utilise the information present in the data. Common indexed representations are based on linear reference sequences. Although linear references offer sufficient approximations for some applications, they do not offer a better or complete representation of the true underlying biological content. Applications such as de novo assembly and transcriptomic analysis are inherently graph-based problems. Furthermore, graph representations that include the diversity of each organism provide advantages in biological analysis and studies. In such cases, graphs called pangenomes are constructed as collections of genomes consolidated into a graph, where the genomic sequence of each individual is represented in the graph as walks. This approach is favoured in various genomic pipelines for its innate capacity to more accurately depict variants and alternative sequences. Despite their various advantages in representing underlying biological structures, indexing and random access to graph representations are computationally more difficult. Indexing methods on sequences, such as Burrows-Wheeler transform-based indices and minimiser tables, are well-studied for string matching problems, and offer efficient solutions on sequences. However, adapting these structures to suit the needs of graph-based workloads remains an open research problem. This thesis focuses on developing representations, data structures, and algorithms suited for indexing, compressing, and querying graph-based representations. The thesis first presents a state-of-the-art a block-based compressive FM-Index codec supporting the handling and querying of massive files, even on distributed systems. This is followed by the main contribution of the thesis: the development of a novel self-indexing algorithm for string labelled graphs of arbitrary topology, called the FM-Directory. The FM-Directory is demonstrated to answer queries empirically in linear time for typical sequence analysis problems, (polynomial time in the worst case) and can be implemented in compressed form. These developments are followed up by an implementation of the algorithms, called GIN-TONIC, and benchmarks, which exhibit excellent practical performance. This discussion is then supported by a rigorous mathematical analysis of the data structure and several possible extensions and interesting use cases that benefit from a full-graph index. The thesis also explores two other string problems. The first recasts search in Levenshtein spaces to geometric search through sequence embeddings to Euclidean spaces learned on random walks in edit spaces, and the quality of these embeddings is evaluated in tasks such as string retrieval and clustering. The second investigates compression and assembly in the context of genomics through a lightweight multiple sequence alignment algorithm optimised for sequences sharing a global common region.

  • Files
  • Details
  • Metrics
Loading...
Thumbnail Image
Name

EPFL_TH10994.pdf

Type

Main Document

Version

Not Applicable (or Unknown)

Access type

openaccess

License Condition

N/A

Size

7.87 MB

Format

Adobe PDF

Checksum (MD5)

f2943cb746b82cd9c21aeafd0af8c6b7

Logo EPFL, École polytechnique fédérale de Lausanne
  • Contact
  • infoscience@epfl.ch

  • Follow us on Facebook
  • Follow us on Instagram
  • Follow us on LinkedIn
  • Follow us on X
  • Follow us on Youtube
AccessibilityLegal noticePrivacy policyCookie settingsEnd User AgreementGet helpFeedback

Infoscience is a service managed and provided by the Library and IT Services of EPFL. © EPFL, tous droits réservés