Projection Indexes in HDF5
Post on 01-Jan-2016
1
Projection Indexes in HDF5
Rishi Rakesh Sinha
The HDF Group
2
Science Produces Large Datasets
Observation/experiment driven
Simulation driven
Information driven
144 MB/hr
200 GB/run
> 7GB/expt
3
Why Not Commercial DBMSs?
Proprietary format
Lack of portability
Low scalability
Lack of desirable access modes
Presence of expensive concurrency control and logging mechanisms
Expensive parallel versions
4
State of the Art Not Enough
Scientific file formats and associated I/O APIs
Concentrating on HDF5
Data recovery is navigational
Subsetting only on a small set of attributes
5
Why Indexes?
Easy
Not So Easy
6
Previous Indexing Efforts
Implicit indexing in HDF5
JPL use of HDF Vdatas
HDF-EOS point data
PyTables
HDF5 internal B-Tree structures
7
Why a Standard Indexing API?
Avoid duplication of effort
PyTables
Standardize indexing in HDF5
Standard API can be implemented differently
Make indexes portable
Store indexes in HDF5 files
8
H5IN API
Create_index
  Parameters: location of index, location of data, binning information, memory limits
  Returns: location of the index
Query
  Parameters: dataset to query, query string
  Returns: selection representing the subset of the data corresponding to the query
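The two calls listed on this slide can be sketched in a few lines of Python. This is only an illustration of the call shape, not the H5IN implementation: the in-memory `FILE` dictionary standing in for an HDF5 file, the `__indexes__` registry, the `bin_width` form of the binning information, the row-count form of the memory limit, and the `lo < name < hi` query syntax are all assumptions made for the example. The temperature values are borrowed from the table on slide 15.

```python
import re

# Hypothetical in-memory stand-in for an HDF5 file: path -> list of values.
FILE = {"/DAY3/Temperature": [52, 42, 57, 22, 67]}


def create_index(file, index_path, data_path, bin_width, memory_limit):
    """Build a binned index for file[data_path] and store it at index_path.

    bin_width is an assumed form of the "binning information";
    memory_limit caps how many values are processed per pass.
    Returns the location of the index, as the slide describes.
    """
    data = file[data_path]
    codes = []
    for start in range(0, len(data), memory_limit):
        chunk = data[start:start + memory_limit]
        codes.extend(value // bin_width for value in chunk)
    file[index_path] = {"bin_width": bin_width, "codes": codes}
    # Assumed registry so that a query can find the index from the dataset.
    file.setdefault("__indexes__", {})[data_path] = index_path
    return index_path


def query(file, data_path, query_string):
    """Evaluate a query of the assumed form 'lo < name < hi'.

    The attribute name is parsed but not checked against data_path.
    Returns a selection: the row positions that satisfy the condition.
    """
    match = re.fullmatch(r"\s*(-?\d+)\s*<\s*\w+\s*<\s*(-?\d+)\s*", query_string)
    if match is None:
        raise ValueError("unsupported query: " + query_string)
    low, high = int(match.group(1)), int(match.group(2))
    index = file[file["__indexes__"][data_path]]
    data = file[data_path]
    width = index["bin_width"]
    selection = []
    for row, code in enumerate(index["codes"]):
        # Skip rows whose bin cannot overlap (low, high); check the rest exactly.
        if code < low // width or code > high // width:
            continue
        if low < data[row] < high:
            selection.append(row)
    return selection


location = create_index(FILE, "/DAY3/T_IN", "/DAY3/Temperature", 10, 2)
selection = query(FILE, "/DAY3/Temperature", "40 < Temperature < 60")
```

Returning row positions rather than values keeps the result independent of the dataset that was indexed, which matches the slide's description of the query result as a selection.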
9
Design Decisions
Limited scope of the prototype
Index stored in a separate dataset
Returns a selection
Projection index
Support for simple boolean queries
10
Limited Scope
1st indexing prototype in HDF5
Presence of implicit indexing
Index on single datasets
Query over single datasets
Conditions should be over a single dataset
Result could be mapped to a separate dataset
11
Index Storage
Root Group: /
DAY1 DAY2 DAY3 DAY4
F1 F2 F3
Location Data  Pressure  Temperature
12
Index Storage
Root Group: /
DAY3
F1 F2 F3
Location Data
LD_INDEX
F1 F2
13
Index Storage
Root Group: /
DAY3
Pressure  Temperature
T_IN  P_IN
Pressure  Temperature
14
Returns a Selection
Temperature Pressure
Concise Storage
Efficient Boolean operations
FIND PRESSURE WHERE TEMP IN [100, 200]
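As a rough illustration of the query above, the sketch below computes a selection on the temperature values and then uses it to read the matching pressure values. The data values are made up, the helper names are hypothetical, and the `(start, count)` run encoding is only one assumed way a selection could be stored concisely.

```python
def select_in_range(values, low, high):
    """Return the row positions whose value lies in [low, high]."""
    return [row for row, value in enumerate(values) if low <= value <= high]


def to_runs(selection):
    """Compress sorted row positions into (start, count) runs."""
    runs = []
    for row in selection:
        if runs and runs[-1][0] + runs[-1][1] == row:
            # The row extends the current run by one element.
            runs[-1] = (runs[-1][0], runs[-1][1] + 1)
        else:
            runs.append((row, 1))
    return runs


def read_selection(values, selection):
    """Read only the selected rows from another dataset of the same length."""
    return [values[row] for row in selection]


# Made-up sample data, one value per row in each dataset.
temperature = [90, 120, 150, 210, 180, 95]
pressure = [31, 29, 30, 28, 32, 27]

# FIND PRESSURE WHERE TEMP IN [100, 200]
selection = select_in_range(temperature, 100, 200)
runs = to_runs(selection)
result = read_selection(pressure, selection)
```

The selection is computed from one dataset and applied to another, so the index on temperature never needs to store pressure values.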
15
Projection Index
Temp  Category  Pressure
52    A         32
42    D         34
57    F         21
22    A         22
67    D         27
A
D
F
A
F
D
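A projection index is, in essence, a copy of one column kept in the same row order as the table, so that position i in the index corresponds to row i of the data. The sketch below builds one from the five table rows on this slide; the function names are hypothetical and the list-of-tuples layout is an assumption made for the example.

```python
# The five rows from the table on this slide: (Temp, Category, Pressure).
COLUMNS = ("Temp", "Category", "Pressure")
ROWS = [
    (52, "A", 32),
    (42, "D", 34),
    (57, "F", 21),
    (22, "A", 22),
    (67, "D", 27),
]


def build_projection_index(rows, columns, name):
    """Copy one column out of the table, preserving row order."""
    position = columns.index(name)
    return [row[position] for row in rows]


def lookup(index, predicate):
    """Scan the projected column and return the matching row positions."""
    return [row for row, value in enumerate(index) if predicate(value)]


category_index = build_projection_index(ROWS, COLUMNS, "Category")
rows_with_a = lookup(category_index, lambda category: category == "A")
```

A lookup scans only the narrow projected column instead of whole rows, and the returned positions can be used to fetch the other columns.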
16
Binning
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
1-3  4-6  7-9  10-12  13-15
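The binning shown on this slide groups the values 1 to 15 into five equal-width bins. A minimal sketch of that mapping follows, assuming integer values and a fixed bin width of 3 starting at 1; the function names and the dictionary layout are hypothetical.

```python
def bin_label(value, first=1, width=3):
    """Return the label of the equal-width bin containing value, e.g. '4-6'."""
    start = first + ((value - first) // width) * width
    return f"{start}-{start + width - 1}"


def build_bins(values, first=1, width=3):
    """Map each bin label to the row positions of the values in that bin."""
    bins = {}
    for row, value in enumerate(values):
        bins.setdefault(bin_label(value, first, width), []).append(row)
    return bins


values = list(range(1, 16))  # the values 1..15 from the slide
bins = build_bins(values)
labels = list(bins)
```

A range query then only needs to inspect the bins that overlap the requested range, with an exact check on the values in the boundary bins.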
17
Projection Index
[Figure: labels Temp and Pressure; values shown: 60, 50, 40 and 31, 30, 29]
18
Why Projection Index?
Data is read only
Mostly, a dataset is not changed once written
Index does not need to be updated
Projection indexes are well suited
Number of disk accesses is the same as for a B-Tree
Multidimensional queries are not considered
19
Only Simple Boolean Queries
Query Format
SELECT SELECTION
WHERE c11 < Attribute1 < c12
AND c21 < Attribute2 < c22
…
Because results are selections, boolean operations can be done inside the library
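The query format above can be evaluated by computing one selection per range condition and combining the selections with AND. The following is a sketch under assumptions: selections are modeled as Python sets of row positions, the `(low, name, high)` condition tuples replace a parsed query string, and the sample values are taken from the table on slide 15.

```python
def range_selection(column, low, high):
    """Return the set of row positions where low < value < high."""
    return {row for row, value in enumerate(column) if low < value < high}


def evaluate(columns, conditions):
    """AND together range conditions given as (low, name, high) tuples.

    Returns the sorted row positions that satisfy every condition.
    An empty condition list yields an empty result in this sketch.
    """
    result = None
    for low, name, high in conditions:
        selection = range_selection(columns[name], low, high)
        # Set intersection implements the AND between conditions.
        result = selection if result is None else result & selection
    return sorted(result) if result is not None else []


columns = {
    "Temp": [52, 42, 57, 22, 67],
    "Pressure": [32, 34, 21, 22, 27],
}

# WHERE 40 < Temp < 60 AND 20 < Pressure < 33
matches = evaluate(columns, [(40, "Temp", 60), (20, "Pressure", 33)])
```

Each condition touches only its own column, and the boolean combination operates purely on row positions, which is why it can be done without rereading the data.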
20
Conclusion
Developing a standard indexing API in HDF5
Creating a proof of concept prototype using projection indexes
Take first step towards developing a query language for HDF5
21
Future Work
Multi-dimensionality
Multiple datasets in same file
Multiple datasets across files
Indexes on attributes
Allow user to index subset of datasets