Infoscience

doctoral thesis

On the relevance of quality score metadata in genomic sequence data for omics applications

Hernandez Lopez, Ana Angelica  
2019

High-throughput sequencing of DNA molecules has revolutionized biomedical research by enabling the quantitative analysis of the genome to study its function, structure and dynamics. It is driving sequencing-based experiments in the life sciences, as evidenced by the plethora of emergent omics applications powered by sequence data. However, the capacity to generate massive sequence datasets greatly outpaces our ability to analyze them, which is the notorious bottleneck in omics analyses. With the democratization of computational analyses, practical solutions for the storage, distribution and processing of sequence data will become a necessity for the progress of life science research. The intrinsically high-entropy metadata known as quality scores is largely responsible for the substantial size of sequence data files. Although several efforts have sought to show that lossy representation of quality scores has only a marginal impact on downstream analyses, no consensus exists on the limits of "safe" lossy representation. In this research work, we study the effect of lossy quality score representation on three applications (variant calling, gene expression and sequence alignment) to assess the relevance of this metadata for omics analyses. We confirmed negligible impact and discovered that it is possible to compute a threshold value for transparent quality score distortion in sequence alignment, allowing the identification of a "safe" representation for the quality score scale. These results align with current trends in sequencing platforms, which are pushing for coarser resolutions to reduce the storage footprint of sequence data.

Type
doctoral thesis
DOI
10.5075/epfl-thesis-9812
Author(s)
Hernandez Lopez, Ana Angelica  
Advisors
Mattavelli, Marco  
Jury

Prof. Andreas Peter Burg (president); Dr Marco Mattavelli (thesis director); Dr Jean-Marc Vesin, Dr Paolo Ribeca, Prof. Françoise Prêteux (examiners)

Date Issued

2019

Publisher

EPFL

Publisher place

Lausanne

Public defense year

2019-12-12

Thesis number

9812

Total of pages

173

Subjects

  • High-throughput sequencing
  • genomic sequence metadata
  • quality scores
  • variant calling
  • gene expression
  • sequence alignment
  • lossy compression of quality scores
  • omics

EPFL units
SCI-STI-MM  
Faculty
STI  
School
IEL  
Doctoral School
EDIC  
Available on Infoscience
December 9, 2019
Use this identifier to reference this record
https://infoscience.epfl.ch/handle/20.500.14299/163860

Infoscience is a service managed and provided by the Library and IT Services of EPFL. © EPFL, all rights reserved