The HDF Group
2016 ESIP Winter Meeting, January 8, 2016

Data Container Study: HDF5 in a POSIX File System
or HDF5 C3: Compression, Chunking, Clusters

Aleksandar Jelenak, John Readey, H. Joe Lee, Ted Habermann
Hardware
• Using the Open Science Data Cloud Griffin cluster
• Xeon systems with 1-16 cores
• 60 compute nodes
• 10 Gb Ethernet
• Ephemeral local POSIX file system
• Shared persistent storage (Ceph object store, S3 API)
Software
• HDF5 library v1.8.15
• Compression libraries: MAFISC, GZIP, Blosc
• Operating system: Ubuntu Linux
• Linux development tools
• Any HDF5-supported C compiler
• HDF5 tools: h5dump, h5repack, etc.
• Python 3
• Python packages: h5py, NumPy, ipyparallel, PyTables
Data
• NCEP/DOE Reanalysis II, for GSSTF, Daily Grid, v3
  • 0.25×0.25 deg, global
  • Time span 1987-2008
  • 7,850 daily files, 120 GB
• NOAA Coral Reef Temperature Anomaly Database (CoRTAD), version 5
  • 0.04165×0.04165 deg (~4 km), global
  • Time range 1982-2012, weekly time step
  • 8 files, 253 GB
Workflow

Data Ingest/Preprocessing
1. Download data as HDF5 files from the archive and transfer them to the S3 object store.
2. Repack the original file(s) using HDF5 chunking and compression; transfer to the S3 store.
3. Collate data from the original files into one file with HDF5 chunking and compression; transfer to the S3 store.
4. Index the data in the file(s) by collecting descriptive statistics (min, max, etc.) for each HDF5 chunk.

Data Analysis
1. Launch a number of VMs and connect them into an ipyparallel cluster.
2. Distribute the input HDF5 data from the S3 store to the cluster VMs.
3. Execute the data analysis task on the cluster VMs (see the sketch below).
4. Collect the data analysis results from the cluster VMs and prepare the report.
5. Shut down the cluster and VMs.
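The analysis half of this workflow maps naturally onto ipyparallel (listed on the Software slide). The sketch below is only an illustration of that pattern, not the study's actual driver script: the file paths and the `analyze_file` task are hypothetical, and it assumes a cluster has already been started and the input files are visible to every engine.

```python
# Sketch of the data-analysis phase (hypothetical task and paths).
# Assumes an ipyparallel cluster is already running (e.g. `ipcluster start`)
# and the input HDF5 files have been distributed to the engines.
import ipyparallel as ipp

def analyze_file(path):
    """Placeholder per-file analysis task executed on a cluster engine."""
    import h5py
    import numpy as np
    with h5py.File(path, "r") as f:
        data = f["/some_dataset"][...]   # hypothetical dataset name
    return path, float(np.nanmean(data))

rc = ipp.Client()                        # connect to the running cluster
view = rc.load_balanced_view()           # dynamic load balancing across engines

files = ["/data/ncep/file_%04d.h5" % i for i in range(10)]   # hypothetical inputs
results = view.map_sync(analyze_file, files)                  # fan out, collect results

for path, mean in results:
    print(path, mean)
```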
System Architecture
HDF5 Chunks
• Chunking is one of the storage layouts for HDF5 datasets
• An HDF5 dataset's byte stream is broken up into chunks that are stored at various locations in the file
• Chunks are of equal size in the dataset's dataspace but may not be of equal byte size in the file
• HDF5 filtering works on chunks only (see the example below)
  • Filters for compression/decompression, scaling, checksum calculation, etc.
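As an illustration of these points, the h5py snippet below (with made-up array sizes and file names) creates a dataset with an explicit chunk shape and stacks two per-chunk filters, gzip compression and a Fletcher-32 checksum; both are applied chunk by chunk as the data are written.

```python
# Minimal illustration of chunked storage plus per-chunk filters (made-up sizes).
import h5py
import numpy as np

data = np.random.rand(10, 720, 1440).astype("f4")

with h5py.File("chunk_demo.h5", "w") as f:
    dset = f.create_dataset(
        "tair",
        data=data,
        chunks=(1, 72, 144),      # equal-sized chunks in the dataset's dataspace
        compression="gzip",       # compression filter, applied per chunk
        compression_opts=9,
        fletcher32=True,          # checksum filter, also applied per chunk
    )
    print(dset.chunks)            # -> (1, 72, 144)
```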
Findings: Chunking
• Two different chunking algorithms:
  • Unidata's optimal chunking formula for 3D datasets
  • h5py formula
• Three different chunk sizes chosen for the collated NCEP data set (see the sketch below):
  • Synoptic map: 1×72×144
  • Data rod: 7850×1×1
  • Data cube: 25×20×20
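The three access patterns can be expressed directly as chunk shapes in h5py, while passing `chunks=True` lets h5py's own chunk-guessing formula pick a shape. The file and dataset names in this sketch are hypothetical.

```python
# Sketch: the three chunk shapes considered for the collated 7850x720x1440 dataset,
# plus h5py's automatic chunk guessing (chunks=True). Names are hypothetical.
import h5py

shape = (7850, 720, 1440)
layouts = {
    "synoptic_map": (1, 72, 144),    # one whole global map per chunk
    "data_rod":     (7850, 1, 1),    # full time series for one grid cell
    "data_cube":    (25, 20, 20),    # compromise between the two
}

with h5py.File("chunk_shapes_demo.h5", "w") as f:
    for name, chunk in layouts.items():
        f.create_dataset(name, shape=shape, dtype="f4", chunks=chunk)
    # Let h5py's chunk-guessing formula choose a shape instead:
    auto = f.create_dataset("auto_chunked", shape=shape, dtype="f4", chunks=True)
    print(auto.chunks)
```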
Findings: Chunking
• Input was the collated NCEP data file:
  • 7850×720×1440, 5 datasets, 121 gigabytes
• Outputs:

Chunk Size   Filter         File Size Change (%)   Runtime (hours)
1×72×144     GZIP level 9   -63.6                  9.5
7850×1×1     GZIP level 9   -62.1                  10
25×20×20     GZIP level 9   -64.5                  6
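The repacked variants behind these numbers combine HDF5 chunking with GZIP level 9. As a generic illustration (with placeholder file names, not the study's actual script), one such repack can be driven from Python with the h5repack tool listed on the Software slide:

```python
# Generic illustration: repack a file with a new chunk shape and GZIP level 9
# using the h5repack command-line tool (placeholder file names).
import subprocess

src = "GSSTF_collated.h5"            # hypothetical input
dst = "GSSTF_collated_1x72x144.h5"   # hypothetical output

subprocess.run(
    [
        "h5repack",
        "-l", "CHUNK=1x72x144",   # apply this chunk shape to all datasets
        "-f", "GZIP=9",           # gzip (deflate) filter, level 9
        src,
        dst,
    ],
    check=True,
)
```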
Findings: Compression
• Compression filters: GZIP, SZIP, MAFISC, Blosc
• NCEP data set: 7,850 files
• Chunk size: 45×180

Filter          Total File Size Change (%)   Runtime (hours)
GZIP, level 9   -63.2                        7.33
SZIP            -70.3                        22
MAFISC          -86.4                        22
Blosc           -61.5                        4.67
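Of the filters compared above, GZIP and SZIP ship with the HDF5 library, while MAFISC and Blosc are provided as external filter plugins. Through PyTables (also on the Software slide) the Blosc filter can be applied directly; the small sketch below uses made-up data and file names.

```python
# Sketch: writing a Blosc-compressed array with PyTables (made-up data and names).
import numpy as np
import tables

data = np.random.rand(10, 720, 1440).astype("f4")

filters = tables.Filters(complevel=9, complib="blosc")   # Blosc filter, level 9

with tables.open_file("blosc_demo.h5", "w") as f:
    f.create_carray(f.root, "tair", obj=data, filters=filters,
                    chunkshape=(1, 72, 144))
    # GZIP would be complib="zlib"; SZIP is not available as a PyTables complib.
```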
Data Indexing
• Value range information (min, max) captured for each HDF5 dataset chunk (sketched below)
• These values, plus each chunk's coordinates in the dataset's dataspace, are stored in a PyTables file
• ~30 minutes to collect index data from the collated NCEP data file
• Work on incorporating this information into processing is ongoing
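A stripped-down version of that indexing step might look like the following: walk a dataset chunk by chunk, record the min/max and chunk corner coordinates, and append them to a PyTables table. The file and dataset names are hypothetical, and the study's actual index schema may differ.

```python
# Sketch of per-chunk min/max indexing into a PyTables file
# (hypothetical file/dataset names; the actual schema may differ).
import itertools
import h5py
import numpy as np
import tables

class ChunkIndex(tables.IsDescription):
    """One row per HDF5 chunk: corner coordinates plus value range."""
    i0 = tables.Int64Col()        # chunk corner along the time axis
    i1 = tables.Int64Col()        # chunk corner along the latitude axis
    i2 = tables.Int64Col()        # chunk corner along the longitude axis
    vmin = tables.Float64Col()
    vmax = tables.Float64Col()

with h5py.File("GSSTF_collated.h5", "r") as src, \
     tables.open_file("chunk_index.h5", "w") as idx:
    dset = src["/tair"]                            # hypothetical dataset
    table = idx.create_table(idx.root, "tair_index", ChunkIndex)
    row = table.row
    chunk = dset.chunks                            # chunk shape, e.g. (25, 20, 20)
    corners = [range(0, dim, c) for dim, c in zip(dset.shape, chunk)]
    for i0, i1, i2 in itertools.product(*corners):
        block = dset[i0:i0 + chunk[0], i1:i1 + chunk[1], i2:i2 + chunk[2]]
        row["i0"], row["i1"], row["i2"] = i0, i1, i2
        row["vmin"] = float(np.nanmin(block))
        row["vmax"] = float(np.nanmax(block))
        row.append()
    table.flush()
```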
Findings: Parallel
• Load time improved with up to 16 nodes
• Run time improved super-linearly with more nodes (up to 64)
Conclusion
• Using a computing environment where the POSIX file system is not persistent storage poses unique challenges
• Chunk size does influence runtime
• Compression filter performance: Blosc < GZIP9 < MAFISC
• Increasing the number of compute nodes reduces the observed differences in runtime