HDF
Hierarchical Data Format
Nancy Yeager
Mike Folk
NCSA
University of Illinois at Urbana-Champaign, USA
[email protected]@ncsa.uiuc.edu
What is HDF?
• A scientific data format and supporting software
• Stores images, multidimensional arrays, tables, annotations
• Free and commercial software support
• Emphasis on standards
• Users from many engineering and scientific fields
• Biggest user: NASA Earth Observing System Data and Information System (EOSDIS)
• HDF4 and HDF5
HDF file with a mixture of objects
[Figure: an HDF file containing a group, a 2-D array, a 3-D array, raster images, a palette, a table, and an annotation reading "March 15, 1990. Simulation with k=10.0, beta=1.22e3. Calculate the magnitude ..."]
Table:
lat | lon | temp
----|-----|-----
 12 |  23 |  3.1
 15 |  24 |  4.2
 17 |  21 |  3.6
HDF software layers
• Applications: utilities and applications for manipulating, viewing, and analyzing data in HDF files.
• Application Programming Interfaces and Low-level Interface: a software library
– High-level, object-specific APIs.
– Low-level I/O drivers.
• HDF file: a physical file or other medium (network, memory, etc.).
A Sampling of HDF Visualization and Data Analysis Tools
MATLAB, IDL               | Commercial
NOeSYS, Transform         | Commercial
HDF Explorer              | Commercial
JHV, Java Browser         | NCSA shareware
Scientific Data Browser   | NCSA shareware
DIAL                      | Raytheon
EOSDIS
• Open standard for exchange of remote-sensed data
– Scores of instruments and datasets
– 2+ terabytes per day
– 1,000 primary users, 30,000 secondary
• HDF Requirements
– Support for scientists, data producers, archiving, etc.
– Library and file structure optimization
– HDF tools, utilities, access software
– Software maintenance and QA
HDF Shortcomings Exposed by EOSDIS
• Very large datasets
– Object and file sizes > 2 GB
– Number of objects > 20 K
• Concurrent access: parallel I/O, threads
• Richer, more flexible data model
– complex data structures
– complex subsetting
HDF5
• A successor to HDF (currently HDF4.1)
• A new API, file structure, library
• Addresses new demands:
– Supports large data models
– Parallel I/O (MPIO), threads (not implemented yet)
– Complex data structures
– Smaller, faster
HDF5 Data Model: Group
• A UNIX-like directory structure containing groups, datasets, annotations
• Directory is a graph, rather than a tree
[Figure: a group hierarchy with the paths “/”, “/foo”, and “/foo/bar”]
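The group model above can be illustrated with a minimal sketch. It is a plain in-memory model, not the HDF5 library API: the class `Group`, the function `resolve`, and the link name `shortcut` are all hypothetical names chosen for illustration. It shows UNIX-like path lookup and why the directory is a graph: one object can be reached through more than one link.

```python
# Hypothetical in-memory model of HDF5-style groups; not the HDF5 library API.

class Group:
    """A container whose members can be groups or datasets."""

    def __init__(self):
        self.members = {}  # link name -> Group or dataset object

    def link(self, name, obj):
        """Add a named link to obj and return obj for convenience."""
        self.members[name] = obj
        return obj


def resolve(root, path):
    """Walk a UNIX-like path such as "/foo/bar" starting at the root group."""
    node = root
    for name in path.strip("/").split("/"):
        if name:  # an empty name means the path was just "/"
            node = node.members[name]
    return node


root = Group()
foo = root.link("foo", Group())
bar = foo.link("bar", [1, 2, 3])   # a list stands in for a dataset
root.link("shortcut", bar)         # a second link to the same object

# Two different paths reach the same object, so the structure is a graph.
assert resolve(root, "/foo/bar") is resolve(root, "/shortcut")
```

Because links are just name-to-object entries, nothing in this model prevents sharing or cycles, which is the property a strict tree would rule out.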
HDF5 Data Model: Dataset
A multidimensional array of records
[Figure: a 5 x 3 array of records; dimensionality: 5 x 3; number types of a record: int8, int4, int16, float32]
Data records can be:
• Atomic datatype (standard integer)
• Compound datatype (C structs)
• Multidimensional
• Pointer (reference to dataset, region)
Dataset components
• array of data elements
• metadata
– datatype
– dataspace
– attributes
– storage info
[Figure: dataset “Fred”, made of a metadata header and data. Datatype: int16. Dataspace: rank and dimensions Dim_1=5, Dim_2=4, Dim_3=2. Attributes: time = 32.4, pressure = 987, temp = 56. Storage info: chunked; compressed.]
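As a rough sketch of how the components listed above fit together, the following models a dataset as a data array plus a metadata header. The class and field names (`Dataspace`, `Dataset`, `storage`) are assumptions made for illustration and are not HDF5 API names; the values are taken from the “Fred” figure, with the rank derived from the three listed dimensions.

```python
from dataclasses import dataclass, field


@dataclass
class Dataspace:
    """Shape information: the rank is the number of dimensions."""
    dims: tuple

    @property
    def rank(self):
        return len(self.dims)

    @property
    def size(self):
        """Total number of elements described by the dataspace."""
        n = 1
        for d in self.dims:
            n *= d
        return n


@dataclass
class Dataset:
    """Metadata header (datatype, dataspace, attributes, storage) plus data."""
    name: str
    datatype: str
    dataspace: Dataspace
    attributes: dict = field(default_factory=dict)
    storage: tuple = ("contiguous",)
    data: list = field(default_factory=list)


fred = Dataset(
    name="Fred",
    datatype="int16",
    dataspace=Dataspace(dims=(5, 4, 2)),
    attributes={"time": 32.4, "pressure": 987, "temp": 56},
    storage=("chunked", "compressed"),
)
# Flat list of zeros standing in for the array of data elements.
fred.data = [0] * fred.dataspace.size
```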
Dataset elements (datatypes)
• standard integer & float
• user-definable scalars (e.g. 13-bit integer)
• variable length types (e.g. strings)
• pointers - references to objects/regions of datasets
• enumeration - names mapped to integers
• compound types
– Comparable to C structs
– Members can be atomic or compound types
– Members can be multidimensional
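To make the comparison with C structs concrete, here is a small sketch that packs compound records with Python's standard `struct` module. The record layout (an int8 id, an int16 pressure, a float32 temp) and the member names are hypothetical; the point is only that a compound element is a fixed-size group of atomic members, so any record can be located by a byte offset.

```python
import struct

# Hypothetical record layout: int8 id, int16 pressure, float32 temp.
# "<" selects little-endian with no padding, so a record is 1 + 2 + 4 bytes.
RECORD = struct.Struct("<bhf")

records = [(1, 987, 56.0), (2, 990, 57.5)]

# Pack all records back to back, the way an array of structs is laid out.
raw = b"".join(RECORD.pack(*r) for r in records)

# Read back only the second record by computing its byte offset.
second = RECORD.unpack_from(raw, offset=1 * RECORD.size)
```

A nested compound member could be modeled the same way by concatenating format strings, at the cost of flattening the nesting.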
Attributes
• Are small pieces of data
• Attached to datasets or groups
• Operations are scaled down versions of the dataset operations
– Not extendible
– No compression
– No partial I/O
Dataset features
• Extendible in any direction
• Special storage options
– contiguous
– external, chunked, compressed
– users can add others
• User-defined attribute list
Dataset Storage Options
• chunked: better subsetting access time; extendable
• compressed: improves storage efficiency, transmission speed
• extendable: arrays can be extended in any direction
• split file: metadata in one file, raw data in another
[Figure: split-file storage of dataset “Fred”, with the metadata for Fred and the data for Fred held in separate files, File A and File B]
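The benefit of chunked, compressed storage for subsetting can be sketched as follows. This is a toy model under stated assumptions, not the HDF5 file layout: the dataset shape, the chunk shape, and the function names `write_chunks` and `read_element` are all hypothetical. Each chunk is compressed on its own, so reading one element only decompresses the chunk that holds it.

```python
import zlib

ROWS, COLS = 4, 6                # hypothetical dataset shape
CHUNK_ROWS, CHUNK_COLS = 2, 3    # hypothetical chunk shape

# Small values (0..23) so each element fits in one byte.
data = [[r * COLS + c for c in range(COLS)] for r in range(ROWS)]


def write_chunks(array):
    """Store each chunk separately, compressed, keyed by chunk coordinates."""
    chunks = {}
    for cr in range(0, ROWS, CHUNK_ROWS):
        for cc in range(0, COLS, CHUNK_COLS):
            values = [array[r][c]
                      for r in range(cr, cr + CHUNK_ROWS)
                      for c in range(cc, cc + CHUNK_COLS)]
            key = (cr // CHUNK_ROWS, cc // CHUNK_COLS)
            chunks[key] = zlib.compress(bytes(values))
    return chunks


def read_element(chunks, r, c):
    """Decompress only the one chunk that holds element (r, c)."""
    raw = zlib.decompress(chunks[(r // CHUNK_ROWS, c // CHUNK_COLS)])
    return raw[(r % CHUNK_ROWS) * CHUNK_COLS + (c % CHUNK_COLS)]


chunks = write_chunks(data)
assert read_element(chunks, 3, 4) == data[3][4]
```

Extending such a dataset would amount to adding new entries to the chunk dictionary, which is one way to see why chunking and extendability go together.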
Dataset Selection Options
• Selection describes how data points are organized to form a dataset
• Select a subset of points for partial I/O
• Selection can be:
– a set of points
– a region within an array
• Selections describe array in memory or in the file
• Selection region in memory can be different shape from selection region in file
[Figure: four example selections]
(a) A hyperslab from a 2D array to the corner of a smaller 2D array
(b) A regular series of blocks from a 2D array to a contiguous sequence at a certain offset in a 1D array
(c) A sequence of points from a 2D array to a sequence of points in a 3D array.
(d) Union of hyperslabs in file to union of hyperslabs in memory. Number of elements must be equal.
Sub-selection Options
• Flexibility in mappings between data in memory and object in file
• Selection regions can be
– points
– hyperslabs
– unions of hyperslabs
• Selection region in memory can be different shape from selection in file
• Supports parallel I/O via MPI-I/O
– hyperslab selections translated to MPI derived types before performing I/O
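A minimal sketch of a hyperslab selection, assuming a regular pattern described by a start, stride, count, and block per dimension. The function `hyperslab`, the array sizes, and the chosen pattern are illustrative assumptions rather than the library's implementation. The example copies a regular series of 2 x 2 blocks from a 2-D "file" array into a contiguous run at an offset in a 1-D "memory" buffer, so the two selections differ in shape but hold the same number of elements.

```python
from itertools import product


def hyperslab(start, stride, count, block):
    """List the coordinates picked by a regular start/stride/count/block pattern."""
    axes = []
    for s, st, n, b in zip(start, stride, count, block):
        # n blocks of b consecutive indices, with block origins st apart.
        axes.append([s + i * st + j for i in range(n) for j in range(b)])
    return list(product(*axes))


# 2-D "file" array, 4 x 6, holding 0..23 in row-major order.
file_array = [[r * 6 + c for c in range(6)] for r in range(4)]

# Two 2 x 2 blocks in the file: rows 0-1, columns 0-1 and columns 3-4.
file_sel = hyperslab(start=(0, 0), stride=(2, 3), count=(1, 2), block=(2, 2))

# 1-D "memory" buffer; the selection is a contiguous run starting at offset 1.
memory = [None] * 10
mem_sel = hyperslab(start=(1,), stride=(1,), count=(len(file_sel),), block=(1,))

# The shapes differ, but the number of selected elements must be equal.
assert len(file_sel) == len(mem_sel)
for (r, c), (m,) in zip(file_sel, mem_sel):
    memory[m] = file_array[r][c]
```

A union of hyperslabs could be modeled by concatenating several such coordinate lists, provided the file side and memory side still select equally many elements.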
HDF5 Raw Data Pipeline
• Handles data transformations between file and memory.
• Deals with multiple storage options
– chunking, compression, number conversion,...
• Optimized performance for common usage
• Hooks for new filters
– compression schemes, encryption, checksum,...
– user-specified filters
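The idea of a pipeline with hooks for filters can be sketched as a list of objects that each transform a byte buffer on write and undo the transformation on read. The classes `Deflate` and `Checksum` and the functions `write_pipeline` and `read_pipeline` are hypothetical names; this shows the general filter-chain pattern, not how the HDF5 library implements it.

```python
import zlib


class Deflate:
    """Compression filter built on zlib."""

    def encode(self, buf):
        return zlib.compress(buf)

    def decode(self, buf):
        return zlib.decompress(buf)


class Checksum:
    """Appends a CRC-32 on write and verifies it on read."""

    def encode(self, buf):
        return buf + zlib.crc32(buf).to_bytes(4, "big")

    def decode(self, buf):
        body, stored = buf[:-4], int.from_bytes(buf[-4:], "big")
        if zlib.crc32(body) != stored:
            raise ValueError("checksum mismatch")
        return body


def write_pipeline(buf, filters):
    """Apply each filter in order on the way from memory to file."""
    for f in filters:
        buf = f.encode(buf)
    return buf


def read_pipeline(buf, filters):
    """Undo the filters in reverse order on the way from file to memory."""
    for f in reversed(filters):
        buf = f.decode(buf)
    return buf


filters = [Deflate(), Checksum()]
raw = bytes(range(16)) * 64
stored = write_pipeline(raw, filters)
restored = read_pipeline(stored, filters)
assert restored == raw
```

A user-specified filter in this sketch is any object with matching `encode` and `decode` methods appended to the list.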
Rich Framework for Building Search Applications
[Figure: an ERBE cloud tracking index dataset whose records point into a radiance dataset and a surface temperature dataset. Each index record contains a pointer to a region in a dataset.]
Data structures for building efficient external indexes and storing them in the data file
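An index whose records point to regions of a dataset can be sketched as below. All names and values here are hypothetical: `RegionRef`, `IndexRecord`, the keys `cloud_system_1` and `cloud_system_2`, and the 10 x 10 stand-in for the radiance dataset are invented for illustration and are not ERBE data or HDF5 API names.

```python
from dataclasses import dataclass


@dataclass
class RegionRef:
    """Pointer to a rectangular region of a named dataset."""
    dataset: str   # name of the dataset pointed to
    start: tuple   # (row, col) of the region's corner
    shape: tuple   # (n_rows, n_cols) of the region


@dataclass
class IndexRecord:
    """One index entry: a search key plus a pointer to the matching region."""
    key: str
    region: RegionRef


# Hypothetical 10 x 10 dataset standing in for the radiance data.
datasets = {
    "radiance": [[r * 10 + c for c in range(10)] for r in range(10)],
}

index = [
    IndexRecord("cloud_system_1", RegionRef("radiance", (2, 3), (2, 2))),
    IndexRecord("cloud_system_2", RegionRef("radiance", (7, 0), (1, 3))),
]


def dereference(ref):
    """Return the sub-array that a region reference points to."""
    array = datasets[ref.dataset]
    r0, c0 = ref.start
    nr, nc = ref.shape
    return [row[c0:c0 + nc] for row in array[r0:r0 + nr]]


def lookup(key):
    """Search the index, then read only the referenced region."""
    for rec in index:
        if rec.key == key:
            return dereference(rec.region)
    raise KeyError(key)
```

Because the index holds only keys and small region descriptions, it can be searched without touching the large datasets, which are read only for the regions that match.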
Rich Framework for Building Search Applications
• Efficient data structures for building external indexes and storing them in file with data
• ERBE cloud system tracking
– Automation of ftp and tape storage on a PC for large data volumes (90GB in; 13,000 images out)
– Code to sort data from time-ordered basis to spatial time sequences for 100 million footprints per data mo..
HDF Information
• HDF Information Center
– http://hdf.ncsa.uiuc.edu/
• HDF Help email address
– [email protected]@ncsa.uiuc.edu
• HDF users mailing list
– [email protected]@ncsa.uiuc.edu