Near-optimal thermal monitoring framework for many-core systems on chip
Chip designers place on-chip thermal sensors to measure local temperatures and thereby prevent thermal runaway in many-core processing architectures. However, the quality of the thermal reconstruction depends directly on the number of sensors placed, which should be minimized while still guaranteeing detection of all worst-case temperature gradients. In this paper, we present a complete framework for the thermal management of complex many-core architectures that precisely recovers the thermal distribution from a minimal number of sensors. The proposed sensor placement algorithm is guaranteed to reduce the impact of noisy measurements on the reconstructed thermal distribution. We achieve significant improvements over the state of the art in terms of both computational complexity and reconstruction precision. For example, for a 64-core SoC with 64 noisy sensors (σ^2 = 4), we achieve an average reconstruction error of 1.5 °C, which is less than half of what previous state-of-the-art methods achieve. We also study the practical limits of the proposed method and show that realistic workloads are not needed to learn the model and efficiently place the sensors. In fact, we show that the reconstruction error does not increase significantly if the power traces of the components are randomly generated or if only part of the correct workload is available.
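The abstract does not spell out the reconstruction step, so the following is only a minimal sketch of the general idea of recovering a full thermal map from a few noisy sensor readings. It assumes (this is an assumption, not a statement of the paper's method) that the map lies in a low-dimensional linear subspace and that the coefficients are estimated by ordinary least squares. The basis vectors, sensor positions, coefficient values, and the names `solve` and `reconstruct` are all hypothetical and chosen for illustration; the sensor positions are fixed by hand and are not produced by the proposed placement algorithm.

```python
import math
import random

def solve(A, b):
    """Solve the square system A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def reconstruct(basis, sensors, readings):
    """Least-squares estimate of the full map from readings taken at `sensors`.

    basis: K vectors of length N (assumed low-dimensional model, hypothetical).
    sensors: indices of the cells that carry a sensor.
    """
    K, N, S = len(basis), len(basis[0]), len(sensors)
    # Basis restricted to the sensor locations (S x K).
    B = [[basis[k][s] for k in range(K)] for s in sensors]
    # Normal equations: (B^T B) alpha = B^T y.
    G = [[sum(B[m][i] * B[m][j] for m in range(S)) for j in range(K)] for i in range(K)]
    rhs = [sum(B[m][i] * readings[m] for m in range(S)) for i in range(K)]
    alpha = solve(G, rhs)
    return [sum(alpha[k] * basis[k][n] for k in range(K)) for n in range(N)]

# Toy 1-D "die" with 16 cells and a 3-dimensional basis (illustrative only).
N = 16
basis = [
    [1.0] * N,
    [n / N for n in range(N)],
    [math.cos(math.pi * n / N) for n in range(N)],
]
coeffs = [50.0, 20.0, 5.0]  # hypothetical model coefficients
true_map = [sum(c * b[n] for c, b in zip(coeffs, basis)) for n in range(N)]

sensors = [0, 5, 10, 15]  # hand-picked positions, not an optimized placement
rng = random.Random(0)
sigma = 2.0  # noise standard deviation, i.e. variance 4 as in the abstract's example

clean_readings = [true_map[s] for s in sensors]
noisy_readings = [true_map[s] + rng.gauss(0.0, sigma) for s in sensors]

clean_estimate = reconstruct(basis, sensors, clean_readings)
noisy_estimate = reconstruct(basis, sensors, noisy_readings)

clean_error = max(abs(a - b) for a, b in zip(clean_estimate, true_map))
noisy_rmse = math.sqrt(sum((a - b) ** 2 for a, b in zip(noisy_estimate, true_map)) / N)
print("max error without noise:", clean_error)
print("RMSE with noise:", noisy_rmse)
```

Without noise the four readings determine the three coefficients exactly, so the map is recovered up to rounding. With noise, the error depends on how well conditioned the restricted basis is, which is why the choice of sensor locations matters. In practice a library routine such as `numpy.linalg.lstsq` would replace the hand-written solver.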