Projection Indexes in HDF5

Post on 01-Jan-2016


1

Projection Indexes in HDF5

Rishi Rakesh Sinha

The HDF Group

2

Science Produces Large Datasets

Observation/experiment driven

Simulation driven

Information driven

144 MB/hr

200 GB/run

> 7GB/expt

3

Why Not Commercial DBMSs?

Proprietary format

Lack of portability

Low scalability

Lack of desirable access modes

Presence of expensive concurrency control and logging mechanisms

Expensive parallel versions

4

State of the Art Not Enough

Scientific file formats and associated I/O APIs

Concentrating on HDF5

Data recovery is navigational

Subsetting only on a small set of attributes

5

Why Indexes?

Easy

Not So Easy

6

Previous Indexing Efforts

Implicit indexing in HDF5

JPL use of HDF Vdatas

HDF-EOS point data

PyTables

HDF5 internal B-Tree structures

7

Why a Standard Indexing API?

Avoid duplication of effort

PyTables

Standardize indexing in HDF5

Standard API can be differently implemented

Make indexes portable

Store indexes in HDF5 files

8

H5IN API

Create_index

Parameters: location of index, location of data, binning information, memory limits

Returns: location of the index

Query

Parameters: dataset to query, query string

Returns: selection representing subset of the data corresponding to the query
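The shape of these two calls can be sketched in plain Python. This is only an illustration of the inputs and outputs listed on the slide, not the H5IN prototype itself: the dictionaries standing in for an HDF5 file, the function signatures, the `bin_width` parameter, the paths, and the `lo < value < hi` query-string syntax are all assumptions, and memory limits are left out.

```python
import re

# Hypothetical in-memory stand-ins for an HDF5 file (not the H5IN API itself).
datasets = {"/DAY3/Temperature": [52, 42, 57, 22, 67]}
indexes = {}    # index location -> index contents
index_of = {}   # dataset location -> index location

def create_index(index_path, data_path, bin_width):
    """Build a binned index over one dataset and return the index location."""
    bins = {}
    for pos, value in enumerate(datasets[data_path]):
        bins.setdefault(value // bin_width, []).append(pos)
    indexes[index_path] = {"width": bin_width, "bins": bins}
    index_of[data_path] = index_path
    return index_path

def query(data_path, query_string):
    """Evaluate 'lo < value < hi' and return a selection (sorted positions)."""
    match = re.fullmatch(r"\s*(-?\d+)\s*<\s*value\s*<\s*(-?\d+)\s*", query_string)
    if match is None:
        raise ValueError("unsupported query: " + query_string)
    lo, hi = int(match.group(1)), int(match.group(2))
    index = indexes[index_of[data_path]]
    data = datasets[data_path]
    selection = []
    # Visit only the bins that can overlap (lo, hi), then check each candidate.
    for b in range(lo // index["width"], hi // index["width"] + 1):
        for pos in index["bins"].get(b, []):
            if lo < data[pos] < hi:
                selection.append(pos)
    return sorted(selection)

location = create_index("/DAY3/T_IN", "/DAY3/Temperature", bin_width=10)
selection = query("/DAY3/Temperature", "40 < value < 60")
```

In this sketch the caller never touches the index directly: `create_index` hands back a location, and `query` hands back element positions that can then be used to read data.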

9

Design Decisions

Limited scope of the prototype

Index stored in a separate dataset

Returns a selection

Projection index

Support for simple boolean queries

10

Limited Scope

1st indexing prototype in HDF5

Presence of implicit indexing

Index on single datasets

Query over single datasets

Conditions should be over a single dataset

Result could be mapped to a separate dataset

11

Index Storage

Root Group: /

DAY1 DAY2 DAY3 DAY4

F1 F2 F3

Location Data, Pressure, Temperature

12

Index Storage

Root Group: /

DAY3

F1 F2 F3

Location Data

LD_INDEX

F1 F2

13

Index Storage

Root Group: /

DAY3

Pressure, Temperature

T_IN P_IN

Pressure, Temperature

14

Returns a Selection

Temperature Pressure

Concise storage

Efficient Boolean operations

FIND PRESSURE WHERE TEMP IN [100, 200]
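A minimal sketch of what the query above asks for, assuming Temperature and Pressure are two equally sized datasets held here as plain Python lists with made-up values. The selection is computed from one dataset and then applied to the other, which is why returning a selection rather than the data itself is useful.

```python
# Hypothetical values; position i in both lists refers to the same element.
temperature = [95, 120, 180, 210, 150]
pressure = [30, 31, 29, 33, 32]

# The selection holds element positions only, so it is not tied to the
# dataset it was computed from.
selection = [i for i, t in enumerate(temperature) if 100 <= t <= 200]

# FIND PRESSURE WHERE TEMP IN [100, 200]: apply the selection to Pressure.
result = [pressure[i] for i in selection]
```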

15

Projection Index

Temp  Category  Pressure
52    A         32
42    D         34
57    F         21
22    A         22
67    D         27

A D F A F D
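As a rough illustration, a projection index can be thought of as one column of the table stored on its own, in row order. The sketch below uses the five table rows from this slide; the names `category_index` and `lookup` are hypothetical, and a real index would live in an HDF5 dataset rather than a Python list.

```python
# Rows of the table on this slide: (Temp, Category, Pressure).
rows = [
    (52, "A", 32),
    (42, "D", 34),
    (57, "F", 21),
    (22, "A", 22),
    (67, "D", 27),
]

# Projection index on Category: the column's values alone, in row order.
category_index = [category for _, category, _ in rows]

def lookup(index, wanted):
    """Scan the compact column and return the matching row positions."""
    return [pos for pos, value in enumerate(index) if value == wanted]

positions = lookup(category_index, "A")
pressures = [rows[pos][2] for pos in positions]
```

Only the narrow Category column is scanned to find the matching positions; the full rows are touched only for the positions that matched.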

16

Binning

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

1-3 4-6 7-9 10-12 13-15
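The grouping shown on this slide can be reproduced with a few lines of Python. This is a sketch assuming equal-width bins of three values starting at 1; the helper name `bin_label` and its parameters are illustrative, and the prototype's actual binning scheme may differ.

```python
def bin_label(value, low=1, width=3):
    """Return the label of the equal-width bin that contains value."""
    start = low + ((value - low) // width) * width
    return f"{start}-{start + width - 1}"

# Group the values 1..15 by bin, as on the slide.
bins = {}
for v in range(1, 16):
    bins.setdefault(bin_label(v), []).append(v)
```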

17

Projection Index

[Figure: values 60, 50, 40 and 31, 30, 29; labels Pressure and Temp]

18

Why Projection Index?

Data is read only

Mostly, a dataset is not changed once written

Index does not need to be updated

Projection indexes are well suited

Number of disk accesses is the same as for a B-Tree

Multidimensional queries are not being considered

19

Only Simple Boolean Queries

Query Format

SELECT SELECTION

WHERE c11 < Attribute1 < c12

AND c21 < Attribute2 < c22

…

Because results are selections, boolean operations can be done inside the library
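One way to picture this: each range condition yields its own selection, and AND becomes an intersection of selections. The sketch below uses Python sets of positions and made-up attribute values; the helper `range_selection` is hypothetical and simply scans the values, where the prototype would consult an index.

```python
def range_selection(values, lo, hi):
    """Positions whose value satisfies lo < value < hi."""
    return {pos for pos, v in enumerate(values) if lo < v < hi}

# Hypothetical attribute values; position i refers to the same element in both.
attribute1 = [52, 42, 57, 22, 67]
attribute2 = [32, 34, 21, 22, 27]

# WHERE 40 < Attribute1 < 60 AND 20 < Attribute2 < 33
combined = range_selection(attribute1, 40, 60) & range_selection(attribute2, 20, 33)
result = sorted(combined)
```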

20

Conclusion

Developing a standard indexing API in HDF5

Creating a proof-of-concept prototype using projection indexes

Taking a first step towards developing a query language for HDF5

21

Future Work

Multi-dimensionality

Multiple datasets in same file

Multiple datasets across files

Indexes on attributes

Allow user to index subset of datasets
