Infoscience

research article

COCKTAIL: Multi-Core Co-Optimization Framework With Proactive Reliability Management

Huang, Darong; Pahlevan, Ali; Zapater Sancho, Marina; et al.
2022
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems

High-performance computing (HPC) servers must handle a growing number of increasingly complex tasks, which makes energy efficiency a central challenge. Beyond energy efficiency, it is essential to manage the lifetime limitations of power-hungry server components (e.g., cores and caches) so that the server does not fail before the end of its intended lifetime. Traditional approaches focus either on hybrid caches, which reduce the leakage power of conventional static random-access memory (SRAM) caches and thus increase energy efficiency, or on the trade-off between lifetime and performance in multi-core processors. However, these approaches lack the flexibility and applicability that HPC tasks require for multi-parametric optimization across quality of service (QoS), lifetime reliability, and energy efficiency. In this paper, we therefore propose COCKTAIL, a holistic framework that jointly optimizes the energy efficiency of multi-core server processors and task performance in the HPC context while guaranteeing lifetime reliability. First, we analyze the best cache technology among traditional SRAM and resistive random-access memory (RRAM), within the context of hybrid cache architectures, to improve energy efficiency and manage cache endurance limits with respect to task requirements. Second, we introduce a novel, efficient proactive queue optimization policy that reorders HPC tasks for execution, considering their end time and their possible reliability effects on the hybrid caches. Third, we present a dynamic model predictive control (MPC)-based reliability management method that maximizes task performance by controlling the frequency, temperature, and target lifetime of the server processor. Our results demonstrate that, while consuming similar energy, COCKTAIL provides up to 60% QoS improvement compared to the latest state-of-the-art energy optimization and reliability management techniques in the HPC context. Moreover, our strategy guarantees a design lifetime longer than 5 years for the whole HPC system.

Name: COCKTAIL_final_version_v3.pdf
Type: Postprint
Version: http://purl.org/coar/version/c_ab4af688f83e57aa
Access type: openaccess
License Condition: Copyright
Size: 6.54 MB
Format: Adobe PDF
Checksum (MD5): ec1316ed64953f33fbdf5fb7d16f1304
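
The MD5 checksum listed for the file can be used to confirm that a downloaded copy is intact. The following is a minimal sketch in Python using only the standard library. The function names `md5_of_file` and `matches_record` are hypothetical helpers introduced for illustration, and the local file path is an assumption about where the PDF was saved.

```python
import hashlib

# Checksum value as listed in the file record above.
EXPECTED_MD5 = "ec1316ed64953f33fbdf5fb7d16f1304"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of the file at `path`.

    The file is read in chunks so large PDFs do not need to fit in memory.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_record(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest equals `expected` (case-insensitive)."""
    return md5_of_file(path) == expected.lower()


if __name__ == "__main__":
    # Assumed local path; adjust to wherever the PDF was downloaded.
    print(matches_record("COCKTAIL_final_version_v3.pdf"))
```

A mismatch usually points to an incomplete or corrupted download. MD5 is adequate for this kind of integrity check, but it is not collision-resistant and should not be treated as proof of authenticity.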


Infoscience is a service managed and provided by the Library and IT Services of EPFL. © EPFL, all rights reserved