Infoscience

research article

Benchmarking informatics approaches for virus discovery: caution is needed when combining in silico identification methods

Hegarty, Bridget • Riddell, James • Bastien, Eric
February 20, 2024
mSystems

Understanding the ecological impacts of viruses on natural and engineered ecosystems relies on the accurate identification of viral sequences from community sequencing data. To maximize viral recovery from metagenomes, researchers frequently combine viral identification tools. However, the effectiveness of this strategy is unknown. Here, we benchmarked combinations of six widely used informatics tools for viral identification and analysis (VirSorter, VirSorter2, VIBRANT, DeepVirFinder, CheckV, and Kaiju), called "rulesets." Rulesets were tested against mock metagenomes composed of taxonomically diverse sequence types and diverse aquatic metagenomes to assess the effects of the degree of viral enrichment and habitat on tool performance. We found that six rulesets achieved equivalent accuracy [Matthews Correlation Coefficient (MCC) = 0.77, Padj >= 0.05]. Each contained VirSorter2, and five used our "tuning removal" rule designed to remove non-viral contamination. While DeepVirFinder, VIBRANT, and VirSorter were each found once in these high-accuracy rulesets, they were not found in combination with each other: combining tools does not lead to optimal performance. Our validation suggests that the MCC plateau at 0.77 is partly caused by inaccurate labeling within reference sequence databases. In aquatic metagenomes, our highest MCC ruleset identified more viral sequences in virus-enriched (44%-46%) than in cellular metagenomes (7%-19%). While improved algorithms may lead to more accurate viral identification tools, this should be done in tandem with careful curation of sequence databases. We recommend using the VirSorter2 ruleset and our empirically derived tuning removal rule. Our analysis provides insight into methods for in silico viral identification and will enable more robust viral identification from metagenomic data sets.
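The abstract reports accuracy as a Matthews Correlation Coefficient (MCC). As a minimal sketch of how that metric is computed from a confusion matrix, the Python below uses the standard MCC formula. The function names `confusion_counts` and `mcc` and the example labels are hypothetical illustrations, not the authors' code or data.

```python
import math


def confusion_counts(truth, predicted):
    """Count (tp, tn, fp, fn) from parallel lists of booleans (True = viral)."""
    tp = tn = fp = fn = 0
    for is_viral, called_viral in zip(truth, predicted):
        if is_viral and called_viral:
            tp += 1
        elif not is_viral and not called_viral:
            tn += 1
        elif called_viral:
            fp += 1  # non-viral sequence called viral
        else:
            fn += 1  # viral sequence missed
    return tp, tn, fp, fn


def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient; returns 0.0 when the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical ground truth and ruleset calls for six sequences.
truth = [True, True, True, False, False, False]
calls = [True, True, False, False, False, True]
print(mcc(*confusion_counts(truth, calls)))
```

MCC ranges from -1 (every call wrong) through 0 (no better than chance) to 1 (every call correct). It uses all four cells of the confusion matrix, which makes it less sensitive to class imbalance than plain accuracy.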

Type
research article
DOI
10.1128/msystems.01105-23
Web of Science ID

WOS:001167259900002

Author(s)
Hegarty, Bridget
Riddell, James
Bastien, Eric
Langenfeld, Kathryn
Lindback, Morgan
Saini, Jaspreet Singh  
Wing, Anthony
Zhang, Jessica
Duhaime, Melissa
Date Issued

2024-02-20

Publisher

Amer Soc Microbiology

Published in
mSystems
Subjects
  • Life Sciences & Biomedicine
  • Bacteriophages
  • Viral Discovery
  • Microbial Ecology
  • Metagenomics

Editorial or Peer reviewed

REVIEWED

Written at

EPFL

EPFL units
LBE  
Funder / Grant Number

National Science Foundation (NSF)

Blue Sky Initiative of the University of Michigan College of Engineering

2055455

National Science Foundation

DGE-134012

Available on Infoscience
March 18, 2024
Use this identifier to reference this record
https://infoscience.epfl.ch/handle/20.500.14299/206480

Infoscience is a service managed and provided by the Library and IT Services of EPFL. © EPFL, all rights reserved