Fault-tolerant parallel applications with dynamic parallel schedules: a programmer's perspective

Dynamic Parallel Schedules (DPS) is a flow-graph-based framework for developing parallel applications on clusters of workstations. The DPS flow graph execution model enables automatic pipelined parallel execution of applications. DPS supports graceful degradation of parallel applications when nodes fail. The fault-tolerance mechanism relies on a set of backup threads, stored in the volatile storage of alternate nodes, that are kept up to date both by duplicating transmitted data objects and by periodic checkpointing. The state of a failed node can be reconstructed on its backup threads by re-executing the application from the last checkpoint. A valid execution order is automatically deduced from the flow graph. Adding fault tolerance to a DPS application requires only minor changes to the application's source code. The present contribution focuses on the development of fault-tolerant parallel applications with DPS from a programmer's perspective.
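
The recovery scheme described in the abstract (a backup kept current by duplicated data objects and periodic checkpoints, then re-execution from the last checkpoint) can be illustrated with a minimal single-process sketch. This is not the DPS API: DPS itself is not shown here, and the names `Backup`, `Worker`, `apply_operation`, and `recover`, the integer data objects, and the checkpoint interval are all hypothetical choices made for illustration only.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class Backup:
    """Hypothetical backup thread, held in the memory of another node."""
    checkpoint: dict = field(default_factory=dict)
    log: list = field(default_factory=list)  # data objects duplicated since the checkpoint

    def store_checkpoint(self, state):
        # A new checkpoint makes the older duplicated data objects unnecessary.
        self.checkpoint = copy.deepcopy(state)
        self.log.clear()

    def duplicate(self, data_object):
        self.log.append(data_object)


def apply_operation(state, data_object):
    """Hypothetical deterministic operation: accumulate the incoming values."""
    state["total"] += data_object
    state["processed"] += 1


class Worker:
    """Hypothetical active thread that mirrors its inputs to a backup."""

    def __init__(self, backup, checkpoint_every=3):
        self.state = {"total": 0, "processed": 0}
        self.backup = backup
        self.checkpoint_every = checkpoint_every
        backup.store_checkpoint(self.state)

    def process(self, data_object):
        # Duplicate the data object to the backup before processing it.
        self.backup.duplicate(data_object)
        apply_operation(self.state, data_object)
        # Periodic checkpoint (interval is an assumption of this sketch).
        if self.state["processed"] % self.checkpoint_every == 0:
            self.backup.store_checkpoint(self.state)


def recover(backup):
    """Rebuild the failed worker's state: restore the checkpoint, then replay the log."""
    state = copy.deepcopy(backup.checkpoint)
    for data_object in backup.log:
        apply_operation(state, data_object)
    return state
```

Because the operation is deterministic, calling `recover(backup)` after the worker has processed any sequence of data objects yields the same state the worker held at the moment of failure. The sketch replays a single log in arrival order; in DPS, according to the abstract, a valid execution order is deduced from the flow graph.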

Published in:
Dependable Systems: Software, Computing, Networks. Research Results of the DICS Program, 195-210

 Record created 2007-01-31, last modified 2018-03-17
