On Computing Breakpoint Distances for Genomes with Duplicate Genes

Shao, Mingfu; Moret, Bernard M. E.

doi:10.1089/cmb.2016.0149

2017

Abstract

A fundamental problem in comparative genomics is to compute the distance between two genomes in terms of their higher-level organization (given by genes or syntenic blocks). For two genomes without duplicate genes, we can easily define (and almost always efficiently compute) a variety of distance measures, but the problem is NP-hard under most models when genomes contain duplicate genes. To tackle duplicate genes, three formulations (exemplar, maximum matching, and any matching) have been proposed, all of which aim to build a matching between homologous genes so as to minimize some distance measure. Of the many distance measures, the breakpoint distance (the number of nonconserved adjacencies) was the first to be studied and remains of significant interest because it is simple and model-free. The three breakpoint distance problems corresponding to the three formulations have been widely studied. Although we provided a solution last year for the exemplar problem that runs very fast on full genomes, computing optimal solutions for the other two problems has remained challenging. In this article, we describe very fast, exact algorithms for these two problems. Our algorithms rely on a compact integer linear program that we further simplify by developing an algorithm to remove variables, based on new results on the structure of adjacencies and matchings. Through extensive experiments using both simulations and biological data sets, we show that our algorithms run in seconds on mammalian genomes and scale well beyond. We also apply these algorithms (as well as the classic orthology tool MSOAR) to create orthology assignments, then compare their quality in terms of both accuracy and coverage. We find that our algorithm for the any matching formulation significantly outperforms the other methods in terms of accuracy while achieving nearly maximum coverage.
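
To make the notion of "nonconserved adjacencies" concrete, the following is a minimal sketch of the breakpoint distance for the easy case only: two genomes without duplicate genes. It is not the ILP-based algorithm of the article. It assumes each genome is a single linear chromosome written as a list of signed integers (sign gives gene orientation) and ignores chromosome ends; the function names `adjacencies` and `breakpoint_distance` are illustrative, and the article's exact definition may differ in how it treats telomeres.

```python
def adjacencies(genome):
    """Return the set of adjacencies of a signed linear gene order.

    Assumption: a genome is a list of nonzero signed integers, one per gene,
    with no gene appearing twice.
    """
    adj = set()
    for a, b in zip(genome, genome[1:]):
        # Reading (a, b) on the forward strand is the same adjacency as
        # reading (-b, -a) on the reverse strand, so keep one canonical form.
        adj.add(min((a, b), (-b, -a)))
    return adj


def breakpoint_distance(genome1, genome2):
    """Count adjacencies of genome1 that are not conserved in genome2."""
    genes1 = [abs(g) for g in genome1]
    genes2 = [abs(g) for g in genome2]
    if len(set(genes1)) != len(genes1) or len(set(genes2)) != len(genes2):
        raise ValueError("this sketch handles only genomes without duplicate genes")
    if set(genes1) != set(genes2):
        raise ValueError("both genomes must contain the same set of genes")
    return len(adjacencies(genome1) - adjacencies(genome2))


if __name__ == "__main__":
    original = [1, 2, 3, 4, 5]
    inverted = [1, -3, -2, 4, 5]  # segment (2, 3) inverted
    print(breakpoint_distance(original, inverted))
```

In this example the inversion breaks the adjacencies on either side of the inverted segment and conserves the one inside it. With duplicate genes, a matching between homologous copies has to be chosen before adjacencies can be compared, which is where the three formulations described above come in.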

Details

Title On Computing Breakpoint Distances for Genomes with Duplicate Genes

Author(s) Shao, Mingfu ; Moret, Bernard M. E.

Published in Journal of Computational Biology

Pagination 10

Volume 24

Issue 6

Pages 571-580

Date 2017

Publisher New Rochelle, Mary Ann Liebert, Inc

ISSN 1066-5277

Keywords

breakpoint distance; exemplar; gene family; ILP; intermediate; maximum matching; orthology assignment

DOI https://doi.org/10.1089/cmb.2016.0149

Laboratories LCBB

Record creation date 2017-07-10