Probabilistic base calling of Solexa sequencing data

Rougemont, Jacques; Amzallag, Arnaud; Iseli, Christian; Farinelli, Laurent; Xenarios, Ioannis; Naef, Felix

doi:10.1186/1471-2105-9-431

research article

2008

BMC bioinformatics

BACKGROUND: Solexa/Illumina short-read ultra-high throughput DNA sequencing technology produces millions of short tags (up to 36 bases) by parallel sequencing-by-synthesis of DNA colonies. The processing and statistical analysis of such high-throughput data pose new challenges; currently a fair proportion of the tags are routinely discarded because they cannot be matched to a reference sequence, which reduces the effective throughput of the technology.

RESULTS: We propose a novel base calling algorithm that uses model-based clustering and probability theory to identify ambiguous bases and code them with IUPAC symbols. We also select optimal sub-tags using a score based on information content to remove uncertain bases towards the ends of the reads.

CONCLUSION: We show that the method increases genome coverage and the number of usable tags by an average of 15% compared with Solexa's data processing pipeline. An R package is provided which allows fast and accurate base calling of Solexa's fluorescence intensity files and the production of informative diagnostic plots.
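The two ideas named in the abstract, coding ambiguous bases with IUPAC symbols and trimming reads with an information-content score, can be illustrated with a minimal Python sketch. This is not the authors' algorithm or their R package: it assumes per-base probabilities over A, C, G, T are already available (the paper derives them by model-based clustering of fluorescence intensities), and the function names, the `mass` threshold, and the `min_bits` penalty are hypothetical choices made only for illustration.

```python
import math

# Standard IUPAC nucleotide codes, keyed by the sorted set of bases they denote.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "AC": "M", "AG": "R", "AT": "W", "CG": "S", "CT": "Y", "GT": "K",
    "ACG": "V", "ACT": "H", "AGT": "D", "CGT": "B",
    "ACGT": "N",
}


def call_base(probs, mass=0.9):
    """Return the IUPAC symbol for the smallest set of most likely bases
    whose total probability reaches `mass` (a hypothetical threshold)."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    chosen, total = [], 0.0
    for base, p in ranked:
        chosen.append(base)
        total += p
        if total >= mass:
            break
    return IUPAC["".join(sorted(chosen))]


def information(probs):
    """Information content of one position in bits: 2 minus Shannon entropy."""
    entropy = -sum(p * math.log2(p) for p in probs.values() if p > 0)
    return 2.0 - entropy


def best_subtag(read_probs, min_bits=1.0):
    """Return the half-open span (start, end) of the contiguous sub-tag that
    maximizes the sum of (information - min_bits) over its positions.
    The scoring rule is an assumption; the scan is Kadane's algorithm."""
    best_sum, best_span = 0.0, (0, 0)
    cur_sum, cur_start = 0.0, 0
    for i, probs in enumerate(read_probs):
        if cur_sum <= 0.0:
            # A non-positive running score cannot help; restart the window here.
            cur_sum, cur_start = 0.0, i
        cur_sum += information(probs) - min_bits
        if cur_sum > best_sum:
            best_sum, best_span = cur_sum, (cur_start, i + 1)
    return best_span


# Toy read: two confident A calls, one A/G ambiguity, one confident A,
# then two uninformative positions of the kind seen near read ends.
sharp = {"A": 0.97, "C": 0.01, "G": 0.01, "T": 0.01}
a_or_g = {"A": 0.50, "C": 0.03, "G": 0.45, "T": 0.02}
flat = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
read = [sharp, sharp, a_or_g, sharp, flat, flat]

start, end = best_subtag(read)
subtag = "".join(call_base(p) for p in read[start:end])
print(start, end, subtag)
```

In this toy read the ambiguous position is kept inside the sub-tag and coded as `R` (A or G) rather than forced to a single base, while the two uninformative trailing positions are trimmed. The authors' actual score and thresholds are described in the article itself.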

Type

research article

DOI

10.1186/1471-2105-9-431

Web of Science ID

WOS:000260490200001

Author(s)

Rougemont, Jacques

Amzallag, Arnaud

Iseli, Christian

Farinelli, Laurent

Xenarios, Ioannis

Naef, Felix

Date Issued

2008

Publisher

BioMed Central

Published in

BMC bioinformatics

Volume

9

Start page

431

Subjects

Software

Editorial or Peer reviewed

REVIEWED

Written at

EPFL

EPFL units

UPNAE

Available on Infoscience

November 1, 2010

Use this identifier to reference this record

https://infoscience.epfl.ch/handle/20.500.14299/56539