Infoscience

research article

Multienzyme deep learning models improve peptide de novo sequencing by mass spectrometry proteomics

Gueto-Tettay, Carlos
•
Tang, Di
•
Happonen, Lotta
January 1, 2023
PLOS Computational Biology

Author summary

In recent years, the application of deep learning has represented a breakthrough in the mass spectrometry (MS) field by improving the assignment of the correct amino acid sequence from observed MS spectra without prior knowledge, a task known as de novo MS-based peptide sequencing. However, like other modern neural networks, these models do not generalize well enough: they perform poorly on test sets of peptides with highly varied N- and C-termini. To mitigate this generalizability problem, we conducted a systematic investigation to uncover the requirements for building generalized models and boosting performance on the MS-based de novo peptide sequencing task. Several experiments confirmed that the peptide diversity of the training set directly impacts the generalizability of the resulting model. The data showed that the best models were the multienzyme models (MEMs), i.e., models trained on a compendium of highly diverse peptides, such as the one generated by digesting a broad range of species samples with a group of proteases. The applicability of these MEMs was then established by fully de novo sequencing eight of the ten polypeptide chains of five commercial antibodies and extracting over 10,000 proving peptides.

Generating and analyzing overlapping peptides through multienzymatic digestion is an efficient procedure for de novo protein sequencing from bottom-up mass spectrometry (MS). Despite improved instrumentation and software, de novo MS data analysis remains challenging. In recent years, deep learning models have represented a performance breakthrough. Incorporating that technology into de novo protein sequencing workflows requires machine-learning models capable of handling highly diverse MS data. In this study, we analyzed the requirements for assembling such generalizable deep learning models by systematically varying the composition and size of the training set. We assessed the performance of the generated models using two test sets composed of peptides originating from the multienzyme digestion of samples from various species. The peptide recall values on the test sets showed that the deep learning models generated from a collection of peptides with highly diverse N- and C-termini generalized 76% better than the termini-restricted ones. Moreover, expanding the training set by adding peptides from the digestion of samples from several species with five proteases led to a 2- to 3-fold generalizability gain. Furthermore, we tested the applicability of these multienzyme deep learning models (MEMs) by fully de novo sequencing the heavy and light monomeric chains of five commercial monoclonal antibodies (mAbs). MEMs extracted over 10,000 matching and overlapping peptides across mAb samples digested with six different proteases, achieving 100% sequence coverage for eight of the ten polypeptide chains. We anticipate that the MEMs' proven improvements to de novo analysis will positively impact several applications, such as peptidomics and the analysis of samples of high complexity or unknown nature.
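
The two evaluation figures quoted in the abstract, peptide recall and sequence coverage, can be illustrated with a small sketch. The Python below is a toy illustration under stated assumptions, not the authors' pipeline: the function names `peptide_recall` and `sequence_coverage`, the exact-string-match criterion, and the example sequences are all hypothetical.

```python
from typing import Iterable, Sequence


def peptide_recall(predicted: Sequence[str], reference: Sequence[str]) -> float:
    """Fraction of spectra whose predicted peptide equals the reference exactly.

    Assumption: one prediction per spectrum, compared by plain string equality.
    """
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not reference:
        return 0.0
    hits = sum(p == r for p, r in zip(predicted, reference))
    return hits / len(reference)


def sequence_coverage(protein: str, peptides: Iterable[str]) -> float:
    """Fraction of protein residues covered by at least one matching peptide.

    Assumption: a peptide counts only if it occurs verbatim in the protein.
    """
    if not protein:
        return 0.0
    covered = [False] * len(protein)
    for pep in peptides:
        if not pep:
            continue
        # Mark every occurrence of the peptide, including overlapping ones.
        start = protein.find(pep)
        while start != -1:
            for i in range(start, start + len(pep)):
                covered[i] = True
            start = protein.find(pep, start + 1)
    return sum(covered) / len(protein)


if __name__ == "__main__":
    # Hypothetical chain and overlapping peptides, as if from different proteases.
    chain = "MKTAYIAKQR"
    overlapping = ["MKTAY", "TAYIAK", "AKQR"]
    print(sequence_coverage(chain, overlapping))  # 1.0
    print(peptide_recall(["PEPTIDE", "AAK"], ["PEPTIDE", "AAR"]))  # 0.5
```

The toy example shows why overlapping peptides from several digestions matter: no single peptide spans the chain, but together they cover every residue. A practical evaluation would likely need extra care that this sketch omits, for example treating the isobaric residues leucine and isoleucine as equivalent.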

Type
research article
DOI
10.1371/journal.pcbi.1010457
Web of Science ID

WOS:000955726600001

Author(s)
Gueto-Tettay, Carlos
Tang, Di
Happonen, Lotta
Heusel, Moritz
Khakzad, Hamed  
Malmstrom, Johan
Malmstrom, Lars
Date Issued

2023-01-01

Publisher

Public Library of Science

Published in
PLOS Computational Biology
Volume

19

Issue

1

Article Number

e1010457

Subjects

Biochemical Research Methods • Mathematical & Computational Biology • Biochemistry & Molecular Biology • false discovery rates • low-energy • quantification • identification • proteases • trypsin • comet • hcd

Editorial or Peer reviewed

REVIEWED

Written at

EPFL

EPFL units
LPDI  
Available on Infoscience
April 24, 2023
Use this identifier to reference this record
https://infoscience.epfl.ch/handle/20.500.14299/197054