TRANSCRIPT
Slide 1: HDF5 Advanced Topics: Object Properties, Storage Methods and Filters, Datatypes
HDF and HDF-EOS Workshop VIII
October 26, 2004
Slide 2: Topics
• General introduction to HDF5 properties
• HDF5 dataset properties
  – I/O and storage properties (filters)
• HDF5 file properties
  – I/O and storage properties (drivers)
• Datatypes
  – Compound
  – Variable length
  – Reference to object and dataset region
Slide 3: General Introduction to HDF5 Properties
Slide 4: Properties: Definition
• A mechanism to control different features of HDF5 objects
  – Implemented via the H5P interface ("property lists")
  – The HDF5 library sets objects' default features
  – HDF5 property lists modify the default features
    • At object creation time (creation properties)
    • At object access time (access or transfer properties)
Slide 5: Properties: Definitions
• A property list is a list of name-value pairs
  – Values may be of any datatype
• A property list is passed as an optional parameter to the HDF5 APIs
• Property lists are used or ignored by all layers of the library, as needed
Slide 6: Types of Properties
• Predefined and user-defined property lists
• Predefined:
  – File creation
  – File access
  – Dataset creation
  – Dataset access
• Will cover each of these
Slide 7: Properties (Example): HDF5 File
• H5Fcreate(…, creation_prop_id, …)
• Creation properties (how is the file created?)
  – Library defaults
    • No user block
    • Predefined sizes of offsets and addresses of objects in the file (e.g. 64-bit on DEC Alpha, 32-bit on Windows)
  – User settings
    • User block
    • 32-bit sizes on a 64-bit platform
    • Control over B-trees for chunked storage (split factor)
Slide 8: Properties (Example): HDF5 File
• H5Fcreate(…, access_prop_id)
• Access properties, or drivers (how is the file accessed? what is its physical layout on disk?)
  – Library defaults
    • STDIO library (UNIX fwrite, fread)
  – User defined
    • MPI I/O for parallel access
    • Family of files (a 100 GB HDF5 file represented by 50 2 GB UNIX files)
    • Size of the chunk cache
Slide 9: Properties (Example): HDF5 Dataset
• H5Dcreate(…, creation_prop_id)
• Creation properties (how is the dataset created?)
  – Library defaults
    • Storage: contiguous
    • Compression: none
    • Space is allocated when data is first written
    • No fill value is written
  – User settings
    • Storage: compact, chunked, or external
    • Compression
    • Fill value
    • Control over space allocation in the file for raw data
      – At creation time
      – At write time
Slide 10: Properties (Example): HDF5 Dataset
• H5Dwrite/H5Dread(…, access_prop_id)
• Access (transfer) properties
  – Library defaults
    • 1 MB conversion buffer
    • Error detection on read (if it was set during write)
    • MPI independent I/O for parallel access
  – User defined
    • MPI collective I/O for parallel access
    • Size of the datatype conversion buffer
    • Control over partial I/O to improve performance
Slide 11: Properties: Programming Model
• Use a predefined property type
  – H5P_FILE_CREATE
  – H5P_FILE_ACCESS
  – H5P_DATASET_CREATE
  – H5P_DATASET_ACCESS
• Create a new property instance
  – H5Pcreate
  – H5Pcopy
  – H5*get_access_plist; H5*get_create_plist
• Modify the property (see the H5P APIs)
• Use the property to modify an object feature
• Close the property when done
  – H5Pclose
Slide 12: Properties: Programming Model
• General model of usage: get a property list, set values, pass it to the library

hid_t plist = H5Pcreate(predefined_plist);      /* or H5Pcopy(existing_plist) */
/* or: hid_t plist = H5Xget_create_plist(…);     or H5Xget_access_plist(…)    */
H5Pset_foo(plist, vals);
H5Xdo_something(Xid, …, plist);
H5Pclose(plist);
Slide 13: HDF5 Dataset Creation Properties and Predefined Filters
Slide 14: Dataset Creation Properties
• Storage
  – Contiguous (default)
  – Compact
  – Chunked
  – External
• Filters applied to raw data
  – Compression
  – Checksum
• Fill value
• Space allocation for raw data in the file
Slide 15: Dataset Creation Properties: Storage Layouts
• Storage layout is important for I/O performance and the size of HDF5 files
• Contiguous (default)
  – Used when data will be written/read all at once
  – H5Dcreate(…, H5P_DEFAULT)
• Compact
  – Used for small datasets (on the order of bytes) for better I/O
  – Raw data is written/read at the time the dataset is opened
  – The file is less fragmented
  – To create a compact dataset, follow the properties programming model
Slide 16: Creating a Compact Dataset
• Create a dataset creation property list
• Set the property list to use the compact storage layout
• Create the dataset with the above property list

plist = H5Pcreate(H5P_DATASET_CREATE);
H5Pset_layout(plist, H5D_COMPACT);
dset_id = H5Dcreate(…, "Compact", …, plist);
H5Pclose(plist);
Slide 17: Creating a Chunked Dataset
• Chunked layout is needed for
  – Extendible datasets
  – Compression and other filters
  – Improving partial I/O for big datasets
[Figure: a chunked, extendible dataset gives better subsetting access time; for a subset that overlaps two chunks, only those two chunks are written/read]
Slide 18: Creating a Chunked Dataset
• Create a dataset creation property list
• Set the property list to use the chunked storage layout
• Create the dataset with the above property list

plist = H5Pcreate(H5P_DATASET_CREATE);
H5Pset_chunk(plist, rank, ch_dims);
dset_id = H5Dcreate(…, "Chunked", …, plist);
H5Pclose(plist);
Slide 19: Dataset Creation Properties: Compression and Other I/O Pipeline Filters
• HDF5 provides a mechanism ("I/O filters") to manipulate data while transferring it between memory and disk
• H5Z and H5P interfaces
• HDF5 predefined filters (H5P interface)
  – Compression (gzip, szip)
  – Shuffling and checksum filters
• User-defined filters (H5Z and H5P interfaces)
  – Example: bzip2 compression
    http://hdf.ncsa.uiuc.edu/HDF5/papers/bzip2
Slide 20: Compression and Other I/O Pipeline Filters (continued)
• Currently used only with chunked datasets
• Filters can be combined
  – gzip + shuffle + checksum filters
  – Checksum filter + user-defined encryption filter
• Filters are called in the order they are defined on writing, and in reverse order on reading
• The user is responsible for the sanity of the filter pipeline
  – gzip + szip + shuffle doesn't make sense
  – Shuffle + szip does
Slide 21: Creating a Compressed Dataset
• Compression
  – Improves transmission speed
  – Improves storage efficiency
  – Requires chunking
  – May increase the CPU time needed for compression
[Figure: chunks are compressed as they are transferred from memory to the file]
Slide 22: Creating Compressed Datasets
• Create a dataset creation property list
• Set chunking (and specify chunk dimensions)
• Set the compression method
• Create the dataset with the above property list

plist = H5Pcreate(H5P_DATASET_CREATE);
H5Pset_chunk(plist, ndims, chkdims);
H5Pset_deflate(plist, level);                          /* gzip */
/* or */
H5Pset_szip(plist, options_mask, pixels_per_block);    /* szip */
dset_id = H5Dcreate(file_id, "comp-data", H5T_NATIVE_FLOAT, space_id, plist);
Slide 23: Creating an External Dataset
• The dataset's raw data is stored in an external file
• Easy to include existing data in an HDF5 file
• Easy to export raw data if the application needs it
• Disadvantage: the user has to keep track of the additional files to preserve the integrity of the HDF5 file
[Figure: the HDF5 file holds the metadata for dataset "A"; the raw data for "A" is stored in an external file]
Slide 24: Creating an External Dataset
• Create a dataset creation property list
• Set the property list to use external storage
• Create the dataset with the above property list

plist = H5Pcreate(H5P_DATASET_CREATE);
H5Pset_external(plist, "raw_data.ext", offset, size);
dset_id = H5Dcreate(…, "External", …, plist);
H5Pclose(plist);
Slide 25: Example of External Files
This example shows how a contiguous, one-dimensional dataset is partitioned into three parts, each stored in a segment of an external file.

plist = H5Pcreate(H5P_DATASET_CREATE);
H5Pset_external(plist, "raw.data", 3000, 1000);
H5Pset_external(plist, "raw.data", 0, 2500);
H5Pset_external(plist, "raw.data", 4500, 1500);
Slide 26: Checksum Filter
• HDF5 includes the Fletcher32 checksum algorithm for error detection
• It is automatically included in HDF5
• To use this filter you must add it to the filter pipeline with H5Pset_filter
[Figure: a checksum value is stored with each chunk as it is written from memory to the file]
Slide 27: Enabling the Checksum Filter
• Create a dataset creation property list
• Set chunking (and specify chunk dimensions)
• Add the filter to the pipeline
• Create the dataset with this property list
• Close the property list

plist = H5Pcreate(H5P_DATASET_CREATE);
H5Pset_chunk(plist, ndims, chkdims);
H5Pset_filter(plist, H5Z_FILTER_FLETCHER32, 0, 0, NULL);
H5Dcreate(…, "Checksum", …, plist);
H5Pclose(plist);
Slide 28: Shuffling Filter
• Predefined HDF5 filter
• Not a compression; a change of byte order in a stream of data
• Example: the values 1, 23, 43
  – In hexadecimal: 0x01, 0x17, 0x2B
  – As 4-byte big-endian integers:
    0x00 0x00 0x00 0x01  0x00 0x00 0x00 0x17  0x00 0x00 0x00 0x2B
  – After shuffling (bytes regrouped by position within the integer):
    0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x01 0x17 0x2B
Slide 29
[Figure: byte stream before shuffling: 00 00 00 01 | 00 00 00 17 | 00 00 00 2B; after shuffling: 00 00 00 00 00 00 00 00 00 01 17 2B]
Slide 30: Enabling the Shuffling Filter
• Create a dataset creation property list
• Set chunking (and specify chunk dimensions)
• Add the shuffle filter to the pipeline
• Define the compression filter
• Create the dataset with this property list
• Close the property list

plist = H5Pcreate(H5P_DATASET_CREATE);
H5Pset_chunk(plist, ndims, chkdims);
H5Pset_shuffle(plist);
H5Pset_deflate(plist, level);
H5Dcreate(…, "BetterComp", …, plist);
H5Pclose(plist);
Slide 31: Effect of Data Shuffling (H5Pset_shuffle + H5Pset_deflate)
• Write a 4-byte integer dataset of 256x256x1024 elements (256 MB)
• Using chunks of 256x16x1024 (16 MB)
• Values: random integers between 0 and 255

             File size   Total time (s)   Write time (s)
No shuffle   102.9 MB    671.049          629.45
Shuffle      67.34 MB    83.353           78.268

Compression combined with shuffling provides
• A better compression ratio
• Better I/O performance
Slide 32: HDF5 Dataset Access (Transfer) Properties
Slide 33: Dataset Access/Transfer Properties
• Used to improve performance
• H5Pset_buffer
  – Sets the size of the datatype conversion buffer used during I/O
  – The size should be large enough to hold a slice along the slowest-changing dimension
  – Example: for a 100x200x300 hyperslab, use a 200x300 buffer
• H5Pset_hyper_vector_size
  – Sets the number of hyperslab offset/length pairs gathered per I/O call
  – Improves performance for partial I/O
Slide 34: Dataset Access/Transfer Properties
• H5Pset_edc_check
  – For datasets created with the error-detection filter enabled
  – Enables error checking during read operations
  – H5Z_ENABLE_EDC (default)
  – H5Z_DISABLE_EDC
• H5Pset_dxpl_mpio
  – Sets the data transfer mode for parallel I/O
  – H5FD_MPIO_INDEPENDENT (default)
  – H5FD_MPIO_COLLECTIVE
Slide 35: User-defined Filters
Slide 36: Standard Interface for User-defined Filters
• H5Zregister: registers a filter so that HDF5 knows about it
• H5Zunregister: unregisters a filter
• H5Pset_filter: adds a filter to the filter pipeline
• H5Pget_filter: returns information about a filter in the pipeline
• H5Zfilter_avail: checks whether a filter is available
Slide 37: File Creation Properties
Slide 38: File Creation Properties
• H5Pset_userblock
  – The user block stores user-defined information (e.g. ASCII text describing the file) at the beginning of the file
    cat my.txt hdf5.h5 > myhdf5.h5
  – Sets the size of the user block: 512 bytes, 1024 bytes, …, 2^N
• H5Pset_sizes
  – Sets the byte size of the offsets and lengths used to address objects in the file
• H5Pset_sym_k
  – Controls the rank of B-trees for group nodes; default is 16
• H5Pset_istore_k
  – Controls the rank of B-trees for chunked datasets; default is 32
Slide 39: File Access Properties
Slide 40: File Access Properties (Performance)
• H5Pset_cache
  – Sets metadata cache and raw data chunk cache parameters
  – Improper sizes will degrade performance
• H5Pset_meta_block_size
  – Reduces the number of small objects in the file
  – A block of metadata is written in a single I/O operation (default 2 KB)
  – The VFL driver has to set H5FD_FEAT_AGGREGATE_METADATA
• H5Pset_sieve_buf_size
  – Improves partial I/O by reading larger "sieve" blocks of raw data
Slide 41: File Access Properties (Physical Storage and Use of Low-level I/O Libraries)
• VFL layer: file drivers
• Define the physical storage of the HDF5 file
  – Memory driver (HDF5 file in the application's memory)
  – Stream driver (HDF5 file written to a socket)
  – Split/multi file driver
  – Family driver
• Define the low-level I/O library
  – MPI I/O driver for parallel access
  – STDIO vs. SEC2
Slide 42: Files Needn't Be Files - the Virtual File Layer
• VFL: a public API for writing I/O drivers
[Figure: an hid_t "file" handle passes through the VFL (virtual file I/O layer) to one of several I/O drivers - stdio, mpio, memory, family, split, network, SRB - which map the HDF5 "storage" onto files, memory, the network, or an SRB repository]
Slide 43: Split Files
• Allows you to split metadata and raw data into separate files
• The files may reside on different file systems for better I/O
• Disadvantage: the user has to keep track of the files
[Figure: one logical HDF5 file with datasets "A" and "B" is stored as a metadata file plus a raw data file holding the data for A and B]
Slide 44: Creating Split Files
• Create a file access property list
• Set up the property list to use the split driver
• Create the file with this property list
• Close the property list

plist = H5Pcreate(H5P_FILE_ACCESS);
H5Pset_fapl_split(plist, ".met", H5P_DEFAULT, ".dat", H5P_DEFAULT);
file = H5Fcreate(H5FILE_NAME, H5F_ACC_TRUNC, H5P_DEFAULT, plist);
H5Pclose(plist);
Slide 45: File Families
• Allows you to access files larger than 2 GB on file systems that don't support large files
• Any HDF5 file can be split into a family of files, and vice versa
• The family member size must be a power of two
Slide 46: Creating a File Family
• Create a file access property list
• Set up the property list to use a file family
• Create the file with this property list

plist = H5Pcreate(H5P_FILE_ACCESS);
H5Pset_fapl_family(plist, family_size, H5P_DEFAULT);
file = H5Fcreate(H5FILE_NAME, H5F_ACC_TRUNC, H5P_DEFAULT, plist);
H5Pclose(plist);
Slide 47: HDF5 Datatypes
Slide 48: Datatypes
• A datatype is
  – A classification specifying the interpretation of a data element
  – For a given data element, it specifies
    • the set of possible values it can have
    • the operations that can be performed
    • how values of that type are stored
  – May be shared between different datasets in one file
Slide 49: HDF5 Datatypes
• Atomic types
  – Standard integer and float
  – User-definable scalars (e.g. a 13-bit integer)
  – Bitfields
  – Variable-length types (e.g. strings)
  – Pointers: references to objects/dataset regions
  – Enumerations: names mapped to integers
Slide 50: General Operations on HDF5 Datatypes
• Create
  – H5Tcreate creates a datatype of the H5T_COMPOUND, H5T_OPAQUE, or H5T_ENUM classes
• Copy
  – H5Tcopy creates another instance of a datatype; can be applied to any datatype
• Commit
  – H5Tcommit creates a named datatype object in the HDF5 file; a committed datatype can be shared between different datasets
• Open
  – H5Topen opens a datatype stored in the file
• Close
  – H5Tclose closes a datatype object
Slide 51: Programming Model for HDF5 Datatypes
• Use predefined HDF5 types
  – No need to close them
• OR
  – Create
    • Create a datatype (by copying an existing one, or from one of the H5T_COMPOUND, H5T_ENUM, or H5T_OPAQUE classes)
    • Create a datatype by querying the datatype of a dataset
  – Or open a committed datatype from the file
• (Optional) Discover datatype properties (size, precision, members, etc.)
• Use the datatype to create a dataset/attribute, to write/read a dataset/attribute, or to set a fill value
• (Optional) Save the datatype in the file
• Close
Slide 52: HDF5 Compound Datatypes
• Compound types
  – Comparable to C structs
  – Members can be atomic or compound types
  – Members can be multidimensional
  – Can be written/read by a field or a set of fields
  – Not all data filters can be applied (e.g. shuffle, SZIP)
  – Use H5Tcreate(H5T_COMPOUND) and H5Tinsert calls to create a compound datatype
  – See the H5Tget_member* functions for discovering the properties of an HDF5 compound datatype
Slide 53: HDF5 Fixed- and Variable-length Array Storage
[Figure: fixed-length storage holds the same number of data elements per record, while variable-length storage holds a varying number of data elements alongside each time value]
Slide 54: HDF5 Variable-length Datatypes: Programming Issues
• Each element is represented by a C struct:

typedef struct {
    size_t len;  /* length of the data, in base-type elements */
    void  *p;    /* pointer to the data */
} hvl_t;

• The base type can be any HDF5 type
Slide 55: HDF5 Variable-length Datatypes
[Figure: a dataset with a variable-length datatype stores (length, pointer) descriptors in its raw data; the actual variable-length data lives in the file's global heap]
Slide 56: HDF Information
• HDF Information Center
  – http://hdf.ncsa.uiuc.edu/
• HDF help email address
  – [email protected]
• HDF users mailing list
  – [email protected]