Repository logo

Infoscience

  • English
  • French
Log In
Logo EPFL, École polytechnique fédérale de Lausanne

Infoscience

  • English
  • French
Log In
  1. Home
  2. Academic and Research Output
  3. Conferences, Workshops, Symposiums, and Seminars
  4. Parallel and Scalable Precise Clustering
 
Loading...
Thumbnail Image
conference paper

Parallel and Scalable Precise Clustering

Byma, Stuart  
•
Dhasade, Akash
•
Altenhoff, Adrian
Show more
January 1, 2020
Pact '20: Proceedings Of The Acm International Conference On Parallel Architectures And Compilation Techniques
ACM International Conference on Parallel Architectures and Compilation Techniques (PACT)

This paper describes a new technique for parallelizing protein clustering, an important bioinformatics computation for the analysis of protein sequences. Protein clustering identifies groups of proteins that are similar because they share long sequences of similar amino acids. Given a collection of protein sequences, clustering can significantly reduce the computational effort required to identify all similar sequences by avoiding many negative comparisons. The challenge, however, is to build a clustering that misses as few similar sequences (or elements, more generally) as possible.

In this paper, we introduce precise clustering, a property that requires each pair of similar elements to appear together in at least one cluster. We show that transitivity in the data can be leveraged to merge clusters while maintaining a precise clustering, providing a basis for independently forming clusters. This allows us reformulate clustering as a bottom-up merge of independent clusters in a new algorithm called ClusterMerge. ClusterMerge exposes parallelism, enabling fast and scalable implementations.

We apply ClusterMerge to find similar amino acid sequences in a collection of proteins. ClusterMerge identifies 99.8% of similar pairs found by a full O(n(2)) comparison, with only half as many comparisons. More importantly, ClusterMerge is highly amenable to parallel and distributed computation. Our implementation achieves a speedup of 604 times on 768 cores (1400 times faster than a comparable single-threaded clustering implementation), a strong scaling efficiency of 90%, and a weak scaling efficiency of nearly 100%.

  • Details
  • Metrics
Type
conference paper
DOI
10.1145/3410463.3414646
Web of Science ID

WOS:000723645400021

Author(s)
Byma, Stuart  
•
Dhasade, Akash
•
Altenhoff, Adrian
•
Dessimoz, Christophe
•
Larus, James R.  
Date Issued

2020-01-01

Publisher

ASSOC COMPUTING MACHINERY

Publisher place

New York

Published in
Pact '20: Proceedings Of The Acm International Conference On Parallel Architectures And Compilation Techniques
ISBN of the book

978-1-4503-8075-1

Series title/Series vol.

International Conference on Parallel Architectures and Compilation Techniques

Start page

217

End page

228

Subjects

Computer Science, Hardware & Architecture

•

Computer Science, Software Engineering

•

Computer Science, Theory & Methods

•

Computer Science

•

bioinformatics

•

protein clustering

•

parallel algorithms

•

algorithm

•

protein

•

web

Peer reviewed

REVIEWED

Written at

EPFL

EPFL units
UPLARUS  
Event nameEvent placeEvent date
ACM International Conference on Parallel Architectures and Compilation Techniques (PACT)

ELECTR NETWORK

Oct 03-07, 2020

Available on Infoscience
December 18, 2021
Use this identifier to reference this record
https://infoscience.epfl.ch/handle/20.500.14299/183935
Logo EPFL, École polytechnique fédérale de Lausanne
  • Contact
  • infoscience@epfl.ch

  • Follow us on Facebook
  • Follow us on Instagram
  • Follow us on LinkedIn
  • Follow us on X
  • Follow us on Youtube
AccessibilityLegal noticePrivacy policyCookie settingsEnd User AgreementGet helpFeedback

Infoscience is a service managed and provided by the Library and IT Services of EPFL. © EPFL, tous droits réservés