Improving availability with recursive microreboots: a soft-state system case study

Candea, G.; Cutler, J.; Fox, A.

doi:10.1016/j.peva.2003.07.007

2004

Abstract

Even after decades of software engineering research, complex computer systems still fail. This paper makes the case for increasing research emphasis on dependability and, specifically, on improving availability by reducing time-to-recover. All software fails at some point, so systems must be able to recover from failures. Recovery itself can fail too, so systems must know how to intelligently retry their recovery. We present here a recursive approach, in which a minimal subset of components is recovered first; if that does not work, progressively larger subsets are recovered. Our domain of interest is Internet services; these systems experience primarily transient or intermittent failures that can typically be resolved by rebooting. Conceding that failure-free software will continue to elude us for years to come, we undertake a systematic investigation of fine-grained component-level restarts, or microreboots, as high-availability medicine. Building and maintaining an accurate model of large Internet systems is nearly impossible, due to their scale and constantly evolving nature, so we take an application-generic approach that relies on empirical observations to manage recovery. We apply recursive microreboots to Mercury, a commercial off-the-shelf (COTS)-based satellite ground station that is based on an Internet service platform. Mercury has been in successful operation for over 3 years. From our experience with Mercury, we draw design guidelines and lessons for the application of recursive microreboots to other software systems. We also present a set of guidelines for building systems amenable to recursive reboots, known as "crash-only software systems".
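
The recursive approach described in the abstract (recover a minimal subset of components first, then progressively larger subsets if that does not work) can be illustrated with a small sketch. This is not the authors' implementation: the `RestartGroup` class, the `recursive_recover` function, the component names, and the simulated health check are all hypothetical, chosen only to make the escalation idea concrete.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class RestartGroup:
    """Hypothetical node in a restart tree: restarting a group
    restarts every component in it and in the groups below it."""
    name: str
    components: List[str] = field(default_factory=list)
    parent: Optional["RestartGroup"] = None
    children: List["RestartGroup"] = field(default_factory=list)

    def add(self, child: "RestartGroup") -> "RestartGroup":
        child.parent = self
        self.children.append(child)
        return child

    def all_components(self) -> List[str]:
        found = list(self.components)
        for child in self.children:
            found.extend(child.all_components())
        return found


def recursive_recover(
    start: RestartGroup,
    restart: Callable[[str], None],
    is_healthy: Callable[[], bool],
) -> List[str]:
    """Restart the smallest group first; if the system is still unhealthy,
    escalate to the enclosing group. Returns the group names tried, in order."""
    attempted = []
    group: Optional[RestartGroup] = start
    while group is not None:
        for component in group.all_components():
            restart(component)
        attempted.append(group.name)
        if is_healthy():
            return attempted
        group = group.parent  # the smaller restart did not cure the failure
    raise RuntimeError("recovery failed even after restarting the whole system")


# Hypothetical three-component system (names are illustrative only).
root = RestartGroup("whole-system", components=["bus"])
pair = root.add(RestartGroup("tracker+radio"))
tracker = pair.add(RestartGroup("tracker", components=["tracker"]))
pair.add(RestartGroup("radio", components=["radio"]))

restarted: List[str] = []


def is_healthy() -> bool:
    # Simulated fault: cured only once both tracker and radio were restarted.
    return {"tracker", "radio"} <= set(restarted)


order = recursive_recover(tracker, restarted.append, is_healthy)
print(order)
```

In this simulated run the restart of the single `tracker` component is not enough, so recovery escalates once to the enclosing group and stops there without rebooting the whole system. The health check stands in for the empirical observations the abstract mentions; a real system would need its own failure detection.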

Details

Title Improving availability with recursive microreboots: a soft-state system case study

Author(s) Candea, G. ; Cutler, J. ; Fox, A.

Published in Performance Evaluation

Volume 56

Issue 1-4

Pages 213-248

Date 2004

ISSN 0166-5316

Keywords

computer bootstrapping; fault tolerant computing; Internet; software performance evaluation; system recovery; software engineering; recursive microreboots; dependable systems; Internet service platform; recovery-oriented computing; failure-free software; COTS

DOI https://doi.org/10.1016/j.peva.2003.07.007

Laboratories DSLAB

Record Appears in Scientific production and competences > I&C - School of Computer and Communication Sciences > IINFCOM > DSLAB - Dependable Systems Laboratory
Peer-reviewed publications
Work outside EPFL
Journal Articles
Published

Record creation date 2006-12-22