Upload: anabel-garrett

Post on 05-Jan-2016


Page 1: HDF Hierarchical Data Format Nancy Yeager Mike Folk NCSA University of Illinois at Urbana-Champaign, USA nyeager@ncsa.uiuc.edu

HDF

Hierarchical Data Format

Nancy Yeager

Mike Folk

NCSA

University of Illinois at Urbana-Champaign, USA

nyeager@ncsa.uiuc.edu

Page 2

What is HDF?

• A scientific data format and supporting software
• Stores images, multidimensional arrays, tables, annotations
• Free and commercial software support
• Emphasis on standards
• Users from many engineering and scientific fields
• Biggest user: NASA Earth Observing System Data and Information System (EOSDIS)
• HDF4 and HDF5

Page 3

HDF file with a mixture of objects

[Figure: an HDF file containing a 3-D array, a 2-D array, raster images, a palette, a group, an annotation, and a table.]

Annotation: "March 15, 1990. Simulation with k=10.0, beta=1.22e3. Calculate the magnitude ..."

Table:

lat | lon | temp
----|-----|-----
 12 |  23 | 3.1
 15 |  24 | 4.2
 17 |  21 | 3.6

Page 4

HDF software layers

• Applications: utilities and applications for manipulating, viewing, and analyzing data in HDF files.
• A software library:
– Application Programming Interfaces: high-level, object-specific APIs.
– Low-level Interface: low-level I/O drivers.
• HDF file: a physical file or other medium (network, memory, etc.).

Page 5

A Sampling of HDF Visualization and Data Analysis Tools

Tool                    | Source
------------------------|---------------
MATLAB, IDL             | Commercial
NOeSYS, Transform       | Commercial
HDF Explorer            | Commercial
JHV, Java Browser       | NCSA shareware
Scientific Data Browser | NCSA shareware
DIAL                    | Raytheon

Page 6

EOSDIS

• Open standard for exchange of remote-sensed data
– Scores of instruments and datasets
– 2+ terabytes per day
– 1,000 primary users, 30,000 secondary
• HDF Requirements
– Support for scientists, data producers, archiving, etc.
– Library and file structure optimization
– HDF tools, utilities, access software
– Software maintenance and QA

Page 7

HDF Shortcomings Exposed by EOSDIS

• Very Large Datasets
– Object and File Sizes > 2GB
– Number of Objects > 20 K
• Concurrent Access: parallel I/O, threads
• Richer, More Flexible Data Model
– complex data structures
– complex subsetting

Page 8

HDF5

• A successor to HDF (currently HDF4.1)
• A new API, file structure, library
• Addresses New Demands:
– Supports Large Data Models
– Parallel I/O (MPIO), threads (not implemented yet)
– Complex Data Structures
– Smaller, faster

Page 9

HDF5 Data Model: Group

[Figure: a hierarchy of objects addressed by the paths "/", "/foo", and "/foo/bar".]

• A UNIX-like directory structure containing groups, datasets, annotations
• Directory is a graph, rather than a tree
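
The difference between a graph and a tree can be sketched in a few lines. The example below is plain Python, not the HDF5 library API; the `Group` class, the `resolve` helper, and the link names `shortcut` and `top` are hypothetical and exist only for illustration. It shows two paths reaching the same object and a link that points back to the root, neither of which a strict tree allows.

```python
# Illustrative only: plain Python objects standing in for HDF5 groups and datasets.

class Group:
    """A named collection of links to other groups or datasets."""

    def __init__(self):
        self.links = {}  # link name -> Group or dataset stand-in

    def link(self, name, obj):
        """Add a link called `name` that points at `obj`, and return `obj`."""
        self.links[name] = obj
        return obj


def resolve(root, path):
    """Follow a UNIX-like path such as "/foo/bar", starting at the root group."""
    obj = root
    for part in path.strip("/").split("/"):
        if part:  # "/" alone yields one empty component, which is skipped
            obj = obj.links[part]
    return obj


root = Group()
foo = root.link("foo", Group())
bar = foo.link("bar", [1, 2, 3])   # a list stands in for a dataset
root.link("shortcut", bar)         # a second link to the same object
foo.link("top", root)              # a link back to the root forms a cycle

print(resolve(root, "/foo/bar") is resolve(root, "/shortcut"))
print(resolve(root, "/foo/top/foo/bar"))
```

Because links are stored by name in each group, an object has no single parent, which is what makes the structure a graph.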

Page 10

HDF5 Data Model: Dataset

A multidimensional array of records

[Figure: a 5 x 3 array of records.]

Dimensionality: 5 x 3

Record (number type): int8, int4, int16, float32

Page 11

Data Records can be:

• Atomic Datatype (standard integer)
• Compound Datatype (C structs)
• Multidimensional
• Pointer (reference to dataset, region)

Record (number type): int8, int4, int16, float32

Page 12

Dataset components

• array of data elements
• metadata
– datatype
– dataspace
– attributes
– storage info

[Figure: dataset "Fred", made up of a metadata header and the data.]

Metadata header for dataset "Fred":

• Dataspace: rank and dimensions (Dim_1=5, Dim_2=4, Dim_3=2)
• Datatype: int16
• Attributes: time = 32.4, pressure = 987, temp = 56
• Storage info: chunked; compressed

Page 13

Dataset elements (datatypes)

• standard integer & float
• user-definable scalars (e.g. 13-bit integer)
• variable length types (e.g. strings)
• pointers - references to objects/regions of datasets
• enumeration - names mapped to integers
• compound types
– Comparable to C structs
– Members can be atomic or compound types
– Members can be multidimensional
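
To make the comparison with C structs concrete, the sketch below packs fixed-size records with Python's standard `struct` module. It does not use the HDF5 library; the record layout (int8, int16, float32, little-endian, no padding) and the helper names `pack_records` and `unpack_record` are assumptions chosen for illustration.

```python
import struct

# Hypothetical compound record: int8, int16, float32, little-endian, no padding.
RECORD = struct.Struct("<bhf")


def pack_records(records):
    """Serialize a sequence of (int8, int16, float32) tuples into one byte string."""
    return b"".join(RECORD.pack(*r) for r in records)


def unpack_record(buf, index):
    """Read back the record at position `index` without touching the others."""
    return RECORD.unpack_from(buf, index * RECORD.size)


records = [(1, 300, 1.5), (-2, -4000, 0.25)]
buf = pack_records(records)

print(RECORD.size)            # bytes per record
print(unpack_record(buf, 1))  # second record only
```

Because every record has the same size, the element at any index sits at a computable byte offset, which is what lets an array of compound elements be read piecewise.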

Page 14

Attributes

• Are small pieces of data
• Attached to datasets or groups
• Operations are scaled down versions of the dataset operations
– Not extendible
– No compression
– No partial I/O

Page 15

Dataset features

• Extendible in any direction
• Special storage options
– contiguous
– external, chunked, compressed
– users can add others
• User-defined attribute list

Page 16

Dataset Storage Options

• chunked: better subsetting access time; extendable
• compressed: improves storage efficiency, transmission speed
• extendable: arrays can be extended in any direction
• split file: metadata in one file, raw data in another

[Figure: split-file storage of dataset "Fred", with the metadata for Fred in File A and the data for Fred in File B.]
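
Why chunking helps subsetting can be shown with a little arithmetic. The sketch below is plain Python rather than the HDF5 API; the function names and the 1000 x 1000 array with 100 x 100 chunks are assumptions made for the example. It counts how many chunks a rectangular region overlaps.

```python
from itertools import product


def chunks_touched(start, count, chunk_shape):
    """Return the set of chunk coordinates overlapped by a rectangular region."""
    ranges = []
    for s, c, ch in zip(start, count, chunk_shape):
        first = s // ch            # chunk holding the first selected index
        last = (s + c - 1) // ch   # chunk holding the last selected index
        ranges.append(range(first, last + 1))
    return set(product(*ranges))


def chunks_total(shape, chunk_shape):
    """Number of chunks needed to cover an array of the given shape."""
    n = 1
    for dim, ch in zip(shape, chunk_shape):
        n *= -(-dim // ch)  # ceiling division
    return n


shape, chunk_shape = (1000, 1000), (100, 100)
touched = chunks_touched(start=(250, 250), count=(100, 100), chunk_shape=chunk_shape)

print(len(touched), "of", chunks_total(shape, chunk_shape), "chunks read")
# Extending the array by one row only adds a new row of chunks.
print(chunks_total((1001, 1000), chunk_shape))
```

Only the overlapped chunks need to be read (and decompressed, if compression is on), and growing the array adds chunks without rewriting the existing ones.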

Page 17

Dataset Selection Options

• Selection describes how data points are organized to form a dataset
• Select a subset of points for partial I/O
• Selection can be:
– a set of points
– a region within an array
• Selections describe array in memory or in the file

Page 18

Selection region in memory can be different shape from selection region in file

(a) A hyperslab from a 2D array to the corner of a smaller 2D array

(b) A regular series of blocks from a 2D array to a contiguous sequence at a certain offset in a 1D array

(c) A sequence of points from a 2D array to a sequence of points in a 3D array.

(d) Union of hyperslabs in file to union of hyperslabs in memory. Number of elements must be equal.
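
Case (a) can be sketched without the library. The code below is plain Python, not the HDF5 API; the `hyperslab` and `transfer` helpers, the start/stride/count/block parameterization, and the 6 x 8 and 4 x 4 array sizes are assumptions for illustration. It enumerates the coordinates of a regular hyperslab and copies them pairwise from a "file" array into the corner of a smaller "memory" array.

```python
from itertools import product


def hyperslab(start, stride, count, block):
    """List the coordinates selected by a regular hyperslab, in row-major order."""
    per_dim = []
    for s, st, c, b in zip(start, stride, count, block):
        # `count` blocks of `block` consecutive indices, spaced `stride` apart
        per_dim.append([s + i * st + j for i in range(c) for j in range(b)])
    return list(product(*per_dim))


def transfer(src, src_coords, dst, dst_coords):
    """Copy 2D elements pairwise; both selections must have equal element counts."""
    if len(src_coords) != len(dst_coords):
        raise ValueError("selections must contain the same number of elements")
    for (sr, sc), (dr, dc) in zip(src_coords, dst_coords):
        dst[dr][dc] = src[sr][sc]


# "File" array: 6 x 8, each value encodes its own row and column.
file_data = [[r * 10 + c for c in range(8)] for r in range(6)]
# "Memory" array: a smaller 4 x 4 buffer.
mem = [[0] * 4 for _ in range(4)]

file_sel = hyperslab(start=(1, 2), stride=(1, 1), count=(2, 3), block=(1, 1))
mem_sel = hyperslab(start=(0, 0), stride=(1, 1), count=(2, 3), block=(1, 1))
transfer(file_data, file_sel, mem, mem_sel)

for row in mem:
    print(row)
```

The two selections only have to agree on the number of elements, so the memory-side selection could just as well have a different shape or offset.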

Page 19

Sub-selection Options

• Flexibility in mappings between data in memory and object in file
• Selection regions can be
– points
– hyperslabs
– unions of hyperslabs
• Selection region in memory can be different shape from selection in file
• Supports parallel I/O via MPI-I/O
– hyperslab selections translated to MPI derived types before performing I/O

Page 20

HDF5 Raw Data Pipeline

• Handles data transformations between file and memory.
• Deals with multiple storage options
– chunking, compression, number conversion,...
• Optimized performance for common usage
• Hooks for new filters
– compression schemes, encryption, checksum,...
– user-specified filters
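
The idea of a pipeline with hooks for filters can be sketched as a list of paired forward and inverse functions. The example below uses only Python's standard `zlib` module and is not the HDF5 filter interface; the function names and the choice of a deflate stage followed by a CRC-32 checksum stage are assumptions for illustration.

```python
import zlib


def deflate(data):
    return zlib.compress(data)


def inflate(data):
    return zlib.decompress(data)


def add_checksum(data):
    """Append a 4-byte CRC-32 of the payload."""
    return data + zlib.crc32(data).to_bytes(4, "big")


def verify_checksum(data):
    """Strip the trailing CRC-32 and raise if it does not match the payload."""
    payload, stored = data[:-4], int.from_bytes(data[-4:], "big")
    if zlib.crc32(payload) != stored:
        raise ValueError("checksum mismatch")
    return payload


# Each stage is a (forward, inverse) pair; a user-specified filter is one more pair.
PIPELINE = [(deflate, inflate), (add_checksum, verify_checksum)]


def write_chunk(raw, pipeline=PIPELINE):
    """Apply every forward filter in order on the way to storage."""
    for forward, _ in pipeline:
        raw = forward(raw)
    return raw


def read_chunk(stored, pipeline=PIPELINE):
    """Apply the inverse filters in reverse order on the way back to memory."""
    for _, inverse in reversed(pipeline):
        stored = inverse(stored)
    return stored


chunk = bytes(1000)  # 1000 zero bytes stand in for one chunk of raw data
stored = write_chunk(chunk)
print(len(chunk), "bytes in memory,", len(stored), "bytes stored")
print(read_chunk(stored) == chunk)
```

Keeping each filter as an independent pair means stages can be added or reordered without changing the read and write loops.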

Page 21

Rich Framework for Building Search Applications

[Figure: an ERBE cloud tracking Index Dataset whose records point into a Radiance Dataset and a Surface Temperature Dataset.]

INDEX RECORD CONTAINS POINTER to REGION in DATASET

Data Structures for Building Efficient External Indexes and Storing them in the Data file
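
An index whose records point at regions of a dataset can be sketched as follows. This is plain Python, not the HDF5 reference API; the `RegionRef` and `IndexRecord` layouts, the labels, and the small `radiance` array are hypothetical stand-ins chosen for illustration.

```python
from collections import namedtuple

# Hypothetical layouts: a region reference names a dataset plus a 2D start and count.
RegionRef = namedtuple("RegionRef", "dataset start count")
IndexRecord = namedtuple("IndexRecord", "label region")


def read_region(datasets, ref):
    """Return only the rectangular region that the reference points at."""
    data = datasets[ref.dataset]
    (r0, c0), (nr, nc) = ref.start, ref.count
    return [row[c0:c0 + nc] for row in data[r0:r0 + nr]]


def lookup(index, label):
    """Find the region reference stored in the index record with this label."""
    return next(rec.region for rec in index if rec.label == label)


# A 5 x 6 array stands in for a radiance dataset; values encode row and column.
datasets = {"radiance": [[r * 10 + c for c in range(6)] for r in range(5)]}

# The index is itself just another small dataset stored alongside the data.
index = [
    IndexRecord("cloud-1", RegionRef("radiance", (1, 2), (2, 3))),
    IndexRecord("cloud-2", RegionRef("radiance", (3, 0), (1, 2))),
]

print(read_region(datasets, lookup(index, "cloud-1")))
```

A search then scans the small index and reads only the referenced regions, rather than scanning the large dataset itself.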

Page 22

Rich Framework for Building Search Applications

• Efficient data Structures for Building External Indexes and Storing them in file with data
• ERBE cloud system tracking
– Automation of ftp and tape storage on a PC for large data volumes (90GB in; 13,000 images out)
– Code to sort data from time-ordered basis to spatial time sequences for 100 million footprints per data mo..

Page 23

HDF Information

• HDF Information Center
– http://hdf.ncsa.uiuc.edu/
• HDF Help email address
– [email protected]@ncsa.uiuc.edu
• HDF users mailing list
– [email protected]@ncsa.uiuc.edu