JUNE 18–22, 2017

Presentation Details

Name: Global Extensible Open Power Manager: A Vehicle for HPC Community Collaboration Toward Co-Designed Energy Management Solutions
Time: Tuesday, June 20, 2017
09:30 am - 10:00 am
Room:   Panorama 3
Messe Frankfurt
Breaks:   10:00 am - 11:00 am Coffee Break
07:30 am - 10:00 am Welcome Coffee
Speaker:   Jonathan Eastep, Intel
Abstract:   Performance of future large-scale HPC systems will be constrained by power costs. Some HPC centers are already incentivized through government taxes to operate their systems at more energy-efficient points below peak performance and power. Other centers may prefer peak performance today, but they face a different kind of cost pressure to deploy more efficient systems in the future. System power draw is increasing substantially from generation to generation; without a breakthrough in system energy efficiency, industry trends forewarn that by 2022 large-scale systems will exceed the limits of the existing power delivery infrastructure at typical centers by a 2-3x margin, forcing costly upgrades or limited system scale. Overcoming this 2-3x gap will require co-designed hardware and software energy management solutions, as well as increased collaboration between hardware vendors and the HPC software community. In this work, we introduce the Global Extensible Open Power Manager (GEOPM): a tree-hierarchical, open source runtime framework that we are contributing to the HPC community to accelerate collaboration and research toward co-designed energy management solutions. Through its extensible plug-in architecture, GEOPM enables rapid prototyping of new energy management strategies. Different plug-ins can be tailored to the specific performance or energy efficiency priorities of each HPC center. To demonstrate the potential of the framework, this work develops an example plug-in. This power rebalancing plug-in targets power-capped systems and improves efficiency by minimizing job time-to-solution within a power budget. Initial results demonstrate runtime improvements of up to 32% on CORAL system procurement benchmarks on a Xeon Phi cluster.