JUNE 18–22, 2017

Session Details

Name: BoF 16: Scaling Up/Out Deep Learning on HPC Clusters
Time: Wednesday, June 21, 2017
08:30 am - 09:30 am
Room:   Kontrast  
Breaks: 08:00 am - 09:00 am Welcome Coffee
Speaker:   David N. Lombard, Intel
  Jun Nakajima, Intel
  Matthieu Ospici, Atos
  Karl W. Schulz, Intel
Abstract:   Deep learning techniques are increasingly used in many areas because they can equal or even surpass human-level performance on object recognition and classification problems. To reach such performance, the underlying neural network must contain many layers (a very deep network) and the model must be trained on a huge dataset. Consequently, training is highly compute- and I/O-intensive. Furthermore, the development workflow requires many iterations to empirically evaluate candidate neural network architectures. Each cycle involves training a model, which can be time consuming (e.g. days or weeks). HPC clusters equipped with accelerators (such as GPUs and FPGAs) and low-latency networks are therefore a logical choice for running this kind of application, in particular during the development phase, to speed up training and improve development productivity. Scaling deep learning up/out involves different techniques for parallel processing, namely data parallelism and model parallelism, which require iterative synchronization across the cluster nodes.

This BoF addresses the use of HPC clusters for training deep learning models, with the following agenda:
- a brief introduction to deep learning science
- an example of a deep learning application with TensorFlow
- an overview of the motivations for using HPC technologies and the associated challenges
- technologies to scale up/out the training of deep learning models on HPC clusters
- the different ways to implement a “Deep learning as a service” stack on HPC
- open discussions

Targeted Audience
Anybody interested in large-scale deep learning, especially its technical problems (e.g. various bottlenecks), their solutions, and the advantages of using HPC clusters. Attendees will also learn about popular machine learning frameworks such as TensorFlow.