Files

Abstract

Advancements in technology enable integration of multiple devices on a single core, resulting in increased on chip power and temperature densities. Higher temperatures, in turn, present a significant challenge for reliability. In this work we propose a comprehensive framework for analyzing reliability of multi-core systems, considering permanent faults. We show that aggressive power management can have an impact on reliability due to temperature cycling. Our cycle-accurate simulation methodology shows fine-grained variations of device failure rates over short time scales, thus enabling workload analysis and scheduling to control the reliability impact. On the other hand, the statistical reliability simulator and optimizer give a view into the long time horizon reliability analysis—over system lifetime, and help us optimize a power management policy under reliability and performance constraints. We show that our optimization strategy can achieve large power savings while still meeting the reliability and performance constraints.

Details

PDF