Infoscience

 
research article

Generative power of a protein language model trained on multiple sequence alignments

Sgarbossa, Damiano • Lupo, Umberto • Bitbol, Anne-Florence
February 3, 2023
eLife

Computational models starting from large ensembles of evolutionarily related protein sequences capture a representation of protein families and learn constraints associated with protein structure and function. They thus open the possibility of generating novel sequences belonging to protein families. Protein language models trained on multiple sequence alignments, such as MSA Transformer, are highly attractive candidates for this purpose. We propose and test an iterative method that directly employs the masked language modeling objective to generate sequences using MSA Transformer. We demonstrate that the resulting sequences score as well as natural sequences on homology, coevolution, and structure-based measures. For large protein families, our synthetic sequences have similar or better properties compared to sequences generated by Potts models, including experimentally validated ones. Moreover, for small protein families, our generation method based on MSA Transformer outperforms Potts models. Our method also reproduces the higher-order statistics and the distribution of sequences in sequence space of natural data more accurately than Potts models. MSA Transformer is thus a strong candidate for protein sequence generation and protein design.
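The abstract describes the generation method only in outline: sequences are produced iteratively by applying the masked language modeling objective. The following is a minimal sketch of that general idea, not the authors' implementation. It assumes that each iteration masks alignment positions at random and replaces them with a model's greedy prediction. The names `mask_msa`, `column_frequency_predictor`, and `iterative_masked_generation`, the mask symbol, and the default values of `p_mask` and `n_iter` are all hypothetical choices made for illustration. In particular, `column_frequency_predictor` is a toy stand-in for a real masked language model such as MSA Transformer, which is not called here.

```python
import random

ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"  # 20 amino acids plus the alignment gap
MASK = "#"  # mask symbol chosen for this sketch (an assumption)


def mask_msa(msa, p_mask, rng):
    """Mask each position independently with probability p_mask.

    Returns the masked alignment as a list of character lists, together
    with the (row, column) coordinates of the masked positions.
    """
    masked = [list(seq) for seq in msa]
    coords = []
    for i, row in enumerate(masked):
        for j in range(len(row)):
            if rng.random() < p_mask:
                row[j] = MASK
                coords.append((i, j))
    return masked, coords


def column_frequency_predictor(masked_msa, i, j):
    """Toy stand-in for a masked language model (hypothetical).

    Returns the most common unmasked symbol in column j, or a gap if the
    whole column is masked. A real model would use the full alignment
    context; the row index i is accepted only to match that interface.
    """
    counts = {}
    for row in masked_msa:
        if row[j] != MASK:
            counts[row[j]] = counts.get(row[j], 0) + 1
    if not counts:
        return "-"
    # sorted() makes tie-breaking deterministic
    return max(sorted(counts), key=counts.get)


def iterative_masked_generation(msa, predict, p_mask=0.1, n_iter=10, seed=0):
    """Repeatedly mask the alignment and fill the masks greedily.

    All predictions within one iteration are made from the same masked
    input, so filled positions do not influence each other until the
    next iteration.
    """
    rng = random.Random(seed)
    current = list(msa)
    for _ in range(n_iter):
        masked, coords = mask_msa(current, p_mask, rng)
        filled = [row[:] for row in masked]
        for i, j in coords:
            filled[i][j] = predict(masked, i, j)
        current = ["".join(row) for row in filled]
    return current


# Usage on a tiny made-up alignment (not real data)
toy_msa = ["ACDE", "ACDF", "ACDE", "GCDE"]
generated = iterative_masked_generation(toy_msa, column_frequency_predictor)
```

Passing the predictor as an argument keeps the masking loop independent of the model, so in principle a wrapper around an actual protein language model could be substituted for the toy predictor without changing the loop.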

Type: research article
DOI: 10.7554/eLife.79854
Web of Science ID: WOS:000958883300001
Author(s): Sgarbossa, Damiano; Lupo, Umberto; Bitbol, Anne-Florence
Date Issued: 2023-02-03
Publisher: eLIFE SCIENCES PUBL LTD
Published in: eLife
Volume: 12
Article Number: e79854

Subjects
  • Biology
  • Life Sciences & Biomedicine - Other Topics
  • protein sequences
  • protein families
  • protein language models
  • deep learning
  • protein sequence generation
  • protein design
  • coevolutionary landscape
  • design
  • information
  • contacts
  • identification
  • capture

Editorial or Peer reviewed: REVIEWED
Written at: EPFL
EPFL units: UPBITBOL
Available on Infoscience: April 24, 2023
Use this identifier to reference this record: https://infoscience.epfl.ch/handle/20.500.14299/197081
Infoscience is a service managed and provided by the Library and IT Services of EPFL. © EPFL, all rights reserved