JUNE 18–22, 2017

Presentation Details

Name: (RP16) Optimizing Massive Data Access for Large Scale Population Genomics Analysis Using HDF5
Time: Tuesday, June 20, 2017
08:35 am - 09:45 am
Room: Substanz 1+2
Breaks: 07:30 am - 10:00 am Welcome Coffee
Presenter: Hui Yan, NSCCGZ
As more and more DNA sequencing data is generated, population-scale modeling becomes possible for both scientific and clinical purposes. The traditional plain organization and layout of these data volumes, however, does not fit large-scale analysis well. Genotype imputation needs to analyze the same genome region for all individuals, so a small part of each of a large number of files must be read. This kind of data access significantly increases the workload of the parallel file system and causes a performance bottleneck. To tackle this, the HDF5 file format is employed as a container for the raw data files. Each human chromosome is stored in a single HDF5 file, and two layouts inside the file are proposed and tested. The first is one-dimensional: data is distributed by individual/sample. The second is two-dimensional: data is distributed along both fixed-size genome regions and individuals/samples. Our experiments show that both layouts improve performance significantly; a 3.4x speedup is observed. The two-dimensional layout performs even better because it makes it feasible to locate a specific region. Our work thus relieves the metadata congestion and improves data access performance.
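The two layouts described in the abstract can be illustrated with a small model of which storage units a region query has to touch. This is only a sketch under stated assumptions, not the authors' implementation. The function names, the half-open coordinate convention, and the block size are all hypothetical, and plain Python keys stand in for HDF5 datasets.

```python
def region_blocks(start, end, block_size):
    """Indices of the fixed-size blocks that overlap the half-open region [start, end)."""
    if start < 0 or end <= start or block_size <= 0:
        raise ValueError("expected 0 <= start < end and block_size > 0")
    first = start // block_size
    last = (end - 1) // block_size
    return range(first, last + 1)


def one_dim_keys(samples):
    """One-dimensional layout (assumed): one storage unit per individual/sample.

    A region query still visits every sample's unit and must find the
    region inside each of them.
    """
    return list(samples)


def two_dim_keys(start, end, block_size, samples):
    """Two-dimensional layout (assumed): one storage unit per (block, sample) pair.

    The blocks covering the region are computed from the coordinates,
    so only those units are visited for each sample.
    """
    return [(b, s) for b in region_blocks(start, end, block_size) for s in samples]


if __name__ == "__main__":
    samples = ["sample_A", "sample_B"]  # hypothetical sample identifiers
    print(one_dim_keys(samples))
    print(two_dim_keys(1_500_000, 2_100_000, 1_000_000, samples))
```

In this model the two-dimensional layout computes the block indices from the query coordinates alone, which is one plausible reading of why locating a region is easier in that layout.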

Junrong Yang, South China University of Technology
Peihao Liu, National University of Defense Technology
Guixin Guo, National Supercomputer Center in Guangzhou
Hanquan Liang, National Supercomputer Center in Guangzhou
BingQiang Wang, National Supercomputer Center in Guangzhou
Shoubin Dong, South China University of Technology
Yutong Lu, National Supercomputer Center in Guangzhou

RP16_Yan.pdf (864 KB)