conference paper

Reducing recovery time in a small recursively restartable system

Candea, G. • Cutler, J. • Fox, A. • Doshi, R. • Garg, P. • Gowda, R.
2002
Proceedings International Conference on Dependable Systems and Networks

We present ideas on how to structure software systems for high availability by considering the MTTR/MTTF characteristics of components in addition to traditional criteria such as functionality or state sharing. Recursive restartability (RR), a recently proposed technique for achieving high availability, exploits partial restarts at various levels within complex software infrastructures to recover from transient failures and rejuvenate software components. Here we refine the original proposal and apply the RR philosophy to Mercury, a COTS-based satellite ground station that has been in operation for over two years. We develop three techniques for transforming component group boundaries so that time-to-recover is reduced, thereby increasing system availability. We also extend RR by defining the notions of an oracle, a restart group, and a restart policy, and show how to reason about system properties in terms of restart groups. From our experience applying RR to Mercury, we draw design guidelines and lessons for the systematic application of recursive restartability to other software systems amenable to RR.
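As a rough illustration of why shorter recovery raises availability, the sketch below uses the standard steady-state formula availability = MTTF / (MTTF + MTTR) together with a toy restart tree. It is not the paper's implementation: the class `RestartGroup`, the function `recover`, the component names, and all timing values are hypothetical, and the `cured_by` callback merely stands in for what the abstract calls an oracle. The escalation rule used here (restart the smallest enclosing group first, then its ancestors) is one assumed restart policy among many possible ones.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


def availability(mttf: float, mttr: float) -> float:
    """Steady-state availability: the fraction of time the system is up."""
    return mttf / (mttf + mttr)


@dataclass
class RestartGroup:
    """Node in a hypothetical restart tree; restarting it restarts all descendants."""
    name: str
    restart_seconds: float  # assumed time to restart this whole group
    children: List["RestartGroup"] = field(default_factory=list)
    # Excluded from repr/compare to avoid recursion through the parent/child cycle.
    parent: Optional["RestartGroup"] = field(default=None, repr=False, compare=False)

    def add(self, child: "RestartGroup") -> "RestartGroup":
        child.parent = self
        self.children.append(child)
        return child


def recover(failed: RestartGroup, cured_by: Callable[[RestartGroup], bool]) -> float:
    """Assumed escalating policy: restart the failed group, then each ancestor in turn.

    `cured_by(group)` is a stand-in oracle that says whether restarting `group`
    cures the failure. Returns the total time spent restarting.
    """
    elapsed = 0.0
    node: Optional[RestartGroup] = failed
    while node is not None:
        elapsed += node.restart_seconds
        if cured_by(node):
            return elapsed
        node = node.parent
    raise RuntimeError("failure not cured even by restarting the root")


# Made-up tree and timings, for illustration only.
root = RestartGroup("system", 30.0)
radio = root.add(RestartGroup("radio", 5.0))

t_partial = recover(radio, lambda g: True)       # cured by the partial restart: 5.0
t_full = recover(radio, lambda g: g is root)     # escalates to the root: 5.0 + 30.0

# With the same (assumed) MTTF, the shorter recovery gives higher availability.
a_partial = availability(1000.0, t_partial)
a_full = availability(1000.0, t_full)
```

In this toy setting, a failure that a partial restart cures costs 5 seconds rather than 35, which is the intuition behind choosing group boundaries that keep time-to-recover low.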

Type
conference paper
DOI
10.1109/DSN.2002.1029006
Author(s)
Candea, G.  
Cutler, J.
Fox, A.
Doshi, R.
Garg, P.
Gowda, R.
Date Issued

2002

Published in
Proceedings International Conference on Dependable Systems and Networks
Start page

605

End page

614

Subjects

  • aerospace computing
  • ground support systems
  • software reliability
  • system recovery
  • software systems
  • MTTR/MTTF characteristics
  • functionality
  • state sharing
  • small recursively restartable system
  • recovery time reduction
  • high availability
  • partial restarts
  • complex software infrastructures
  • transient failure recovery
  • software component rejuvenation
  • Mercury
  • COTS-based satellite ground station
  • component group boundaries
  • oracle
  • restart group
  • restart policy

Editorial or Peer reviewed

REVIEWED

Written at

OTHER

EPFL units
DSLAB  
Available on Infoscience
December 22, 2006
Use this identifier to reference this record
https://infoscience.epfl.ch/handle/20.500.14299/238632