JUNE 18–22, 2017

Presentation Details

Name: A Superfacility Model for Data Intensive Science
Time: Wednesday, June 21, 2017
11:00 am - 11:20 am
Room: Panorama 2
Messe Frankfurt
Breaks: 10:00 am - 11:00 am Coffee Break
Speaker: Kathy Yelick, LBNL & UC Berkeley
In the same way that the Internet has combined with web content and search engines to revolutionize every aspect of our lives, the scientific process is poised to undergo a radical transformation based on the ability to access, analyze, and merge complex data sets. Scientists will be able to combine their own data with that of other scientists, validating models, interpreting experiments, re-using and re-analyzing data, and making use of sophisticated mathematical analyses and simulations to drive the discovery of relationships across data sets. This “scientific web” will yield higher quality science, more insights per experiment, a higher impact from major investments in scientific instruments, and an increased democratization of science—allowing people from a wide variety of backgrounds to participate in the scientific process.

Scientists have always demanded some of the fastest computers for simulations, and while that demand has not abated, there is a new driver for computer performance: the need to analyze large experimental and observational data sets. With exponential growth rates in detectors, sequencers, and other observational technologies, data sets across many science disciplines are outstripping the storage, computing, and algorithmic techniques available to individual scientists. The first step in realizing this transformation is to reconsider the model used for scientific user facilities, including experimental facilities, wide area networks, and computing and data facilities. To maximize scientific productivity and the efficiency of the infrastructure, these facilities should be viewed as a single, tightly integrated “superfacility” in which data streams between locations and experiments can be integrated with high-speed analytics and simulation.
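As a toy illustration (not drawn from the talk itself), the streaming pattern behind the superfacility model — an experimental facility emitting data that a computing facility analyzes as it arrives, rather than after the fact — can be sketched in a few lines of Python. The producer, consumer, and reduction step here are all hypothetical stand-ins for real instruments and analytics.

```python
import queue
import threading

def detector(out_q, n_frames=5):
    """Hypothetical experimental facility: emit raw data frames as they are taken."""
    for i in range(n_frames):
        out_q.put([i * j for j in range(4)])  # a toy "frame" of detector readings
    out_q.put(None)  # end-of-stream sentinel

def analysis_facility(in_q, results):
    """Hypothetical computing facility: reduce each frame while the stream is live."""
    while True:
        frame = in_q.get()
        if frame is None:
            break
        results.append(sum(frame))  # stand-in for real analytics or simulation

stream = queue.Queue()   # stand-in for the wide-area network link
reduced = []
producer = threading.Thread(target=detector, args=(stream,))
consumer = threading.Thread(target=analysis_facility, args=(stream, reduced))
producer.start(); consumer.start()
producer.join(); consumer.join()
print(reduced)  # one reduced value per frame
```

The point of the sketch is only the coupling: analysis overlaps with acquisition, so feedback can in principle reach the experiment while it is still running.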

Equally important to this model is the need for advanced research in computer science, applied mathematics, and statistics to deal with increasingly sophisticated scientific questions and the complexity of the data. In this talk I will describe some examples of how science disciplines such as biology, materials science, and cosmology are changing in the face of their own data explosions, and how this change raises a set of research questions driven by the scale of the data sets, the data rates, inherent noise and complexity, and the need to “fuse” disparate data sets. What is really needed for data-driven science workloads in terms of hardware, systems software, networks, and programming environments, and how well can those be supported on systems that also run simulation codes? How will the imminent hardware disruptions affect the ability to perform data analysis computations, and what types of algorithms will be required?