Assessing the Crash-Failure Assumption of Group Communication Protocols

Mena, Sergio; Basile, Claudio; Kalbarczyk, Zbigniew; Schiper, André; Iyer, Ravi K.

doi:10.1109/ISSRE.2005.9

2005

Abstract

Designing and correctly implementing Group Communication Systems (GCSs) is notoriously difficult. Assuming that processes fail only by crashing provides a powerful means to simplify the theoretical development of these systems. When making this assumption, however, one should not forget that clean crash failures provide only a coarse approximation of the effects that errors can have in distributed systems. Ignoring such a discrepancy can lead to complex GCS-based applications that pay a large price in terms of performance overhead yet fail to deliver the promised level of dependability. This paper provides a thorough study of error effects in real systems by demonstrating an error-injection-driven design methodology, where error injection is integrated in the core steps of the design process of a robust fault-tolerant system. The methodology is demonstrated for the Fortika toolkit, a Java-based GCS. Error injection enables us to uncover subtle reliability bottlenecks both in the design of Fortika and in the implementation of Java. Based on the obtained insights, we enhance Fortika's design to reduce the identified bottlenecks. Finally, a comparison of the results obtained for Fortika with the results obtained for the OCaml-based Ensemble system in a previous work allows us to investigate the reliability implications that the choice of the development platform (Java versus OCaml) can have.

Details

Title Assessing the Crash-Failure Assumption of Group Communication Protocols

Author(s) Mena, Sergio ; Basile, Claudio ; Kalbarczyk, Zbigniew ; Schiper, André ; Iyer, Ravi K.

Published in Proceedings of the 16th IEEE International Symposium on Software Reliability Engineering

Conference 16th IEEE International Symposium on Software Reliability Engineering, Chicago, USA, November 8-11, 2005

Date 2005

Keywords

Fault injection; Atomic broadcast; Crash-stop model; Group communication; Fault tolerance

DOI https://doi.org/10.1109/ISSRE.2005.9

Laboratories LSR

Record appears in Scientific production and competences > I&C - School of Computer and Communication Sciences > IC Archives > LSR - Distributed Systems Laboratory; Peer-reviewed publications; Conference Papers; Work produced at EPFL; Published

Record creation date 2005-11-25
