JUNE 18–22, 2017

Session Details

Name: BoF 26: Leveraging Machine Learning for Parallel Performance Analysis
Time: Wednesday, June 21, 2017
03:45 pm - 04:45 pm
Room:   Kontrast  
Breaks:   04:45 pm - 05:15 pm Coffee Break
Speakers:   Hans-Christian Hoppe, Intel
  Jesús Labarta, BSC
  Bernd Mohr, JSC
  Felix Wolf, TU Darmstadt
Abstract:   Tools for the thorough analysis and characterisation of HPC and high performance data analytics (HPDA) applications have long been a research area, and several families of extremely capable tools have emerged, some of them able to scale to the largest HPC systems and to correlate a wide variety of metrics. These tools create vast amounts of data for real-world applications and systems. Sifting through this sea of data and driving the analysis process requires deep competency and dedication on the part of the tool user, or the assistance of an experienced performance engineer. Different approaches to automating the analysis process have been tried, and statistics and rule-based systems have improved the situation. The spectacular advances in machine learning (ML) and artificial intelligence, in particular in deep neural networks (DNNs), promise to bring significant benefits to parallel performance analysis and characterisation; examples are the automatic detection of patterns in metrics and guidance of the tool user through the analysis process. This BoF assembles a distinguished panel of speakers from the relevant performance tools efforts (including BSC, JSC and TU Dresden), and invites experts from the machine learning area. Short presentations will highlight the ways ML can be taken up in this field, and show success stories of using ML in the adjacent fields of system monitoring and resource management. A panel-style discussion, led and minuted by the BoF organisers, will follow the presentations. The presentations and the minutes of the discussion with the audience will be made publicly available.

Targeted Audience
This BoF will be of prime interest to developers of performance analysis and characterisation tools for parallel HPC workloads, and to machine learning experts. Both groups will profit from the technical discussion by identifying ways to combine ML with traditional performance analysis. A secondary audience consists of developers of HPC and HPDA applications, who can learn about, and influence, significant new workload analysis and characterisation features made possible by combining traditional tool approaches with machine learning.