Infoscience

research article

Statistical physics of interacting proteins: Impact of dataset size and quality assessed in synthetic sequences

Gandarilla-Pérez, Carlos A. • Mergny, Pierre • Weigt, Martin • Bitbol, Anne-Florence
March 20, 2020
Physical Review E

Identifying protein-protein interactions is crucial for a systems-level understanding of the cell. Recently, algorithms based on inverse statistical physics, e.g., direct coupling analysis (DCA), have made it possible to use evolutionarily related sequences to address two conceptually related inference tasks: finding pairs of interacting proteins and identifying pairs of residues that form contacts between interacting proteins. Here we address two underlying questions: How are the performances of both inference tasks related? How does performance depend on dataset size and quality? To this end, we formalize both tasks using Ising models defined over stochastic block models, with individual blocks representing single proteins and interblock couplings representing protein-protein interactions; controlled synthetic sequence data are generated by Monte Carlo simulations. We show that DCA is able to address both inference tasks accurately when sufficiently large training sets of known interaction partners are available, and that an iterative pairing algorithm makes predictions possible even without a training set. Noise in the training data deteriorates performance. In both tasks we find a quadratic scaling relating dataset quality and size that is consistent with noise adding in square-root fashion and signal adding linearly when increasing the dataset. This implies that it is generally good to incorporate more data even if their quality is imperfect, thereby shedding light on the empirically observed performance of DCA applied to natural protein sequences.
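The data-generation setup described in the abstract (an Ising model whose couplings follow a two-block stochastic block model, sampled by Monte Carlo) can be illustrated with a minimal sketch. This is not the authors' code: the function names, block size, coupling strength, connection probabilities, unit temperature, and number of sweeps are all hypothetical values chosen for illustration, and a plain single-spin-flip Metropolis update is assumed.

```python
import math
import random


def make_block_couplings(block_size, p_intra, p_inter, coupling, rng):
    """Random symmetric couplings for two blocks (two hypothetical 'proteins').

    Sites 0..block_size-1 form block A, the remaining sites form block B.
    Each pair of sites is coupled with probability p_intra (same block) or
    p_inter (different blocks), as in a stochastic block model.
    """
    n = 2 * block_size
    J = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            same_block = (i < block_size) == (j < block_size)
            p = p_intra if same_block else p_inter
            if rng.random() < p:
                J[i][j] = J[j][i] = coupling
    return J


def sample_sequence(J, n_sweeps, rng):
    """Draw one spin configuration with single-spin-flip Metropolis updates.

    Assumed energy: E(s) = -sum_{i<j} J_ij s_i s_j, at unit temperature.
    """
    n = len(J)
    s = [rng.choice((-1, 1)) for _ in range(n)]
    for _ in range(n_sweeps * n):
        i = rng.randrange(n)
        local_field = sum(J[i][j] * s[j] for j in range(n))
        delta_e = 2.0 * s[i] * local_field  # energy change if s_i is flipped
        if delta_e <= 0 or rng.random() < math.exp(-delta_e):
            s[i] = -s[i]
    return s


# Illustrative parameters only; none of these values come from the paper.
rng = random.Random(0)
J = make_block_couplings(block_size=5, p_intra=0.4, p_inter=0.2,
                         coupling=0.5, rng=rng)
sequences = [sample_sequence(J, n_sweeps=20, rng=rng) for _ in range(100)]
```

Each entry of `sequences` plays the role of one synthetic "sequence" of a pair of proteins; nonzero entries of `J` that connect block A to block B stand in for inter-protein contacts that an inference method would try to recover.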

Type: research article
DOI: 10.1103/PhysRevE.101.032413
ArXiv ID: 1912.10956
Author(s): Gandarilla-Pérez, Carlos A.; Mergny, Pierre; Weigt, Martin; Bitbol, Anne-Florence
Date Issued: 2020-03-20
Published in: Physical Review E
Volume: 101
Issue: 3
Article Number: 032413
Editorial or Peer reviewed: REVIEWED
Written at: EPFL
EPFL units: UPBITBOL
Available on Infoscience: March 25, 2020
Use this identifier to reference this record: https://infoscience.epfl.ch/handle/20.500.14299/167662