The HDF Group
2016 ESIP Winter Meeting, January 8, 2016

Data Container Study: HDF5 in a POSIX File System
or HDF5 C3: Compression, Chunking, Clusters

Aleksandar Jelenak, John Readey, H. Joe Lee, Ted Habermann
Hardware
• Using the Open Science Data Cloud Griffin cluster
• Xeon systems with 1-16 cores
• 60 compute nodes
• 10 Gb Ethernet
• Ephemeral local POSIX file system
• Shared persistent storage (Ceph object store, S3 API)
Software
• HDF5 library v1.8.15
• Compression libraries: MAFISC, GZIP, Blosc
• Operating system: Ubuntu Linux
• Linux development tools
• Any HDF5-supported C compiler
• HDF5 tools: h5dump, h5repack, etc.
• Python 3
• Python packages: h5py, NumPy, ipyparallel, PyTables
Data
• NCEP/DOE Reanalysis II, for GSSTF, Daily Grid, v3
  • 0.25×0.25 deg, global
  • Time span 1987-2008
  • 7,850 daily files, 120 GB
• NOAA Coral Reef Temperature Anomaly Database (CoRTAD), version 5
  • 0.04165×0.04165 deg (~4 km), global
  • Time range 1982-2012, weekly time step
  • 8 files, 253 GB
Workflow

Data Ingest/Preprocessing
1. Download data as HDF5 files from the archive and transfer them to the S3 object store.
2. Repack the original file(s) using HDF5 chunking and compression; transfer to the S3 store.
3. Collate data from the original files into one file with HDF5 chunking and compression; transfer to the S3 store.
4. Index the data in the file(s) by collecting descriptive statistics (min, max, etc.) for each HDF5 chunk.

Data Analysis
1. Launch a number of VMs and connect them into an ipyparallel cluster.
2. Distribute the input HDF5 data from the S3 store to the cluster VMs.
3. Execute the data analysis task on the cluster VMs (see the sketch below).
4. Collect the data analysis results from the cluster VMs and prepare the report.
5. Shut down the cluster and VMs.
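The analysis half of this workflow maps naturally onto ipyparallel (listed on the Software slide). The sketch below is only an illustration of that pattern, not the study's actual driver script: the file paths and the `analyze_file` task are hypothetical, and it assumes a cluster has already been started and the input files are visible to every engine.

```python
# Sketch of the data-analysis phase (hypothetical task and paths).
# Assumes an ipyparallel cluster is already running (e.g. `ipcluster start`)
# and the input HDF5 files have been distributed to the engines.
import ipyparallel as ipp

def analyze_file(path):
    """Placeholder per-file analysis task executed on a cluster engine."""
    import h5py
    import numpy as np
    with h5py.File(path, "r") as f:
        data = f["/some_dataset"][...]   # hypothetical dataset name
    return path, float(np.nanmean(data))

rc = ipp.Client()                        # connect to the running cluster
view = rc.load_balanced_view()           # dynamic load balancing across engines

files = ["/data/ncep/file_%04d.h5" % i for i in range(10)]   # hypothetical inputs
results = view.map_sync(analyze_file, files)                  # fan out, collect results

for path, mean in results:
    print(path, mean)
```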
System Architecture
HDF5 Chunks
• Chunking is one of the storage layouts for HDF5 datasets
• An HDF5 dataset's byte stream is broken up into chunks that are stored at various locations in the file
• Chunks are of equal size in the dataset's dataspace but may not be of equal byte size in the file
• HDF5 filtering works on chunks only (see the example below)
  • Filters for compression/decompression, scaling, checksum calculation, etc.
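As an illustration of these points, the h5py snippet below (with made-up array sizes and file names) creates a dataset with an explicit chunk shape and stacks two per-chunk filters, gzip compression and a Fletcher-32 checksum; both are applied chunk by chunk as the data are written.

```python
# Minimal illustration of chunked storage plus per-chunk filters (made-up sizes).
import h5py
import numpy as np

data = np.random.rand(10, 720, 1440).astype("f4")

with h5py.File("chunk_demo.h5", "w") as f:
    dset = f.create_dataset(
        "tair",
        data=data,
        chunks=(1, 72, 144),      # equal-sized chunks in the dataset's dataspace
        compression="gzip",       # compression filter, applied per chunk
        compression_opts=9,
        fletcher32=True,          # checksum filter, also applied per chunk
    )
    print(dset.chunks)            # -> (1, 72, 144)
```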
Findings: Chunking
• Two different chunking algorithms:
  • Unidata's optimal chunking formula for 3D datasets
  • h5py formula
• Three different chunk sizes chosen for the collated NCEP data set (see the sketch below):
  • Synoptic map: 1×72×144
  • Data rod: 7850×1×1
  • Data cube: 25×20×20
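The three access patterns can be expressed directly as chunk shapes in h5py, while passing `chunks=True` lets h5py's own chunk-guessing formula pick a shape. The file and dataset names in this sketch are hypothetical.

```python
# Sketch: the three chunk shapes considered for the collated 7850x720x1440 dataset,
# plus h5py's automatic chunk guessing (chunks=True). Names are hypothetical.
import h5py

shape = (7850, 720, 1440)
layouts = {
    "synoptic_map": (1, 72, 144),    # one whole global map per chunk
    "data_rod":     (7850, 1, 1),    # full time series for one grid cell
    "data_cube":    (25, 20, 20),    # compromise between the two
}

with h5py.File("chunk_shapes_demo.h5", "w") as f:
    for name, chunk in layouts.items():
        f.create_dataset(name, shape=shape, dtype="f4", chunks=chunk)
    # Let h5py's chunk-guessing formula choose a shape instead:
    auto = f.create_dataset("auto_chunked", shape=shape, dtype="f4", chunks=True)
    print(auto.chunks)
```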
Findings: Chunking
• Input was the collated NCEP data file:
  • 7850×720×1440, 5 datasets, 121 gigabytes
• Outputs:

Chunk Size   Filter         File Size Change (%)   Runtime (hours)
1×72×144     GZIP level 9   -63.6                  9.5
7850×1×1     GZIP level 9   -62.1                  10
25×20×20     GZIP level 9   -64.5                  6
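The repacked variants behind these numbers combine HDF5 chunking with GZIP level 9. As a generic illustration (with placeholder file names, not the study's actual script), one such repack can be driven from Python with the h5repack tool listed on the Software slide:

```python
# Generic illustration: repack a file with a new chunk shape and GZIP level 9
# using the h5repack command-line tool (placeholder file names).
import subprocess

src = "GSSTF_collated.h5"            # hypothetical input
dst = "GSSTF_collated_1x72x144.h5"   # hypothetical output

subprocess.run(
    [
        "h5repack",
        "-l", "CHUNK=1x72x144",   # apply this chunk shape to all datasets
        "-f", "GZIP=9",           # gzip (deflate) filter, level 9
        src,
        dst,
    ],
    check=True,
)
```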
Findings: Compression
• Compression filters: GZIP, SZIP, MAFISC, Blosc
• NCEP data set: 7,850 files
• Chunk size: 45×180

Filter          Total File Size Change (%)   Runtime (hours)
GZIP, level 9   -63.2                        7.33
SZIP            -70.3                        22
MAFISC          -86.4                        22
Blosc           -61.5                        4.67
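Of the filters compared above, GZIP and SZIP ship with the HDF5 library, while MAFISC and Blosc are provided as external filter plugins. Through PyTables (also on the Software slide) the Blosc filter can be applied directly; the small sketch below uses made-up data and file names.

```python
# Sketch: writing a Blosc-compressed array with PyTables (made-up data and names).
import numpy as np
import tables

data = np.random.rand(10, 720, 1440).astype("f4")

filters = tables.Filters(complevel=9, complib="blosc")   # Blosc filter, level 9

with tables.open_file("blosc_demo.h5", "w") as f:
    f.create_carray(f.root, "tair", obj=data, filters=filters,
                    chunkshape=(1, 72, 144))
    # GZIP would be complib="zlib"; SZIP is not available as a PyTables complib.
```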
Data Indexing
• Value range information (min, max) captured for each HDF5 dataset chunk (sketched below)
• These values, plus each chunk's coordinates in the dataset's dataspace, are stored in a PyTables file
• ~30 minutes to collect index data from the collated NCEP data file
• Work on incorporating this information into processing is ongoing
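A stripped-down version of that indexing step might look like the following: walk a dataset chunk by chunk, record the min/max and chunk corner coordinates, and append them to a PyTables table. The file and dataset names are hypothetical, and the study's actual index schema may differ.

```python
# Sketch of per-chunk min/max indexing into a PyTables file
# (hypothetical file/dataset names; the actual schema may differ).
import itertools
import h5py
import numpy as np
import tables

class ChunkIndex(tables.IsDescription):
    """One row per HDF5 chunk: corner coordinates plus value range."""
    i0 = tables.Int64Col()        # chunk corner along the time axis
    i1 = tables.Int64Col()        # chunk corner along the latitude axis
    i2 = tables.Int64Col()        # chunk corner along the longitude axis
    vmin = tables.Float64Col()
    vmax = tables.Float64Col()

with h5py.File("GSSTF_collated.h5", "r") as src, \
     tables.open_file("chunk_index.h5", "w") as idx:
    dset = src["/tair"]                            # hypothetical dataset
    table = idx.create_table(idx.root, "tair_index", ChunkIndex)
    row = table.row
    chunk = dset.chunks                            # chunk shape, e.g. (25, 20, 20)
    corners = [range(0, dim, c) for dim, c in zip(dset.shape, chunk)]
    for i0, i1, i2 in itertools.product(*corners):
        block = dset[i0:i0 + chunk[0], i1:i1 + chunk[1], i2:i2 + chunk[2]]
        row["i0"], row["i1"], row["i2"] = i0, i1, i2
        row["vmin"] = float(np.nanmin(block))
        row["vmax"] = float(np.nanmax(block))
        row.append()
    table.flush()
```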
Findings: Parallel
• Load time improved with up to 16 nodes
• Run time improved super-linearly with more nodes (up to 64)
Conclusion
• Using a computing environment where the POSIX file system is not persistent storage poses unique challenges
• Chunk size does influence runtime
• Compression filter performance: Blosc < GZIP9 < MAFISC
• Increasing the number of compute nodes reduces the observed differences in runtime