Abstract

Fault injection is an often-overlooked component of the software test cycle, yet it is critical for building robust systems. The main reasons for this neglect are ineffectual tools, an overwhelmingly large number of possible faults to inject, and the extensive manual labor such tests require. We present AFEX, a system that automates fault injection for software systems, finds and ranks important faults faster and more accurately than random injection, and automatically characterizes the quality of the resulting fault sets. AFEX is parallelized, so that test time decreases linearly with the number of available test nodes. AFEX also includes four fault injectors that simulate faults at major layers of the system stack: hardware, network, libraries, and human operators. We show how AFEX uses metric-driven search algorithms to efficiently find top-ranked faults in real systems such as MySQL cluster and rsync.
