Ohio State University Department of Computer Science and Engineering

Automatic Data Virtualization - Supporting XML based abstractions on HDF5 Datasets

Swarup Kumar Sahoo

Gagan Agrawal

Roadmap

• Motivation
• Introduction
• System Overview
• XQuery, Low and High Level schema and HDF5 storage
• Compiler Analysis and Algorithm
• Experiment
• Summary and Future Work

Motivation

• Emergence of grid-based data repositories
  – Can enable sharing of data
• Emergence of applications that process large datasets
  – Complicated by complex and specialized storage formats
• Need for easily portable applications
  – Compatibility with web/grid services

Data Virtualization

[Figure labels: Data Service, Data Virtualization (an abstract view of data), dataset]

By Global Grid Forum’s DAIS working group:
• A Data Virtualization describes an abstract view of data.
• A Data Service implements the mechanism to access and process data through the Data Virtualization.

Introduction: Automatic Data Virtualization

• Goal: Enable automatic creation of efficient data services
  – Support a high-level or abstract view of data
  – Data is stored in low-level format
• Application development:
  – Assume a high-level or virtual view
• Application execution:
  – On actual low-level layout

Overview of Our Automatic Data Virtualization Work

• Previous work on XML Based virtualization
  – Techniques for XQuery Compilation (Li and Agrawal, ICS 2003, DBPL 2003)
  – Supporting XML Based high-level abstraction on flat-file datasets (LCPC 2003, XIME-P 2004)
• Relational Table/SQL Based Implementation
  – Supporting SQL Select and Where (HPDC 2004)
  – Supporting SQL-3 Aggregations (LCPC 2004)

XML-based Virtualization

[Figure labels: TEXT, NetCDF, RDBMS, HDF5, XML, XQuery, ???]

Challenges and Contributions

• Challenges
  – Compiler generates efficient data processing code
    » Uses the information about the low-level layout and mapping between virtual and low-level layout
  – Challenge in compilation
    » High level to low level
    » To ensure high locality in processing of large datasets
• Contributions of this paper
  – An improved data-centric transformation algorithm
  – An implementation specific to HDF5 as the low-level format

System Overview

[Figure labels: High-level XML Schema, Mapping Schema, Low-level XML Schema, XQuery Source Code, Compiler, Generated Code, HDF5 Library, Processor and Disk]

XQuery and HDF5

• High-level declarative languages ease application development
  – XQuery is a high-level language for processing XML datasets
  – Derived from database, declarative, and functional languages!
• HDF5:
  – Hierarchical Data Format
  – Widely used in scientific communities
  – A case study with a format which has optimized access libraries

Use of XML Schemas

• High-level schema
  – XML is used to provide a virtual view of the dataset
• Low-level schema
  – Reflects actual physical layout in HDF5
• Mapping schema:
  – Describes mapping between each element of high-level schema and low-level schema

Oil Reservoir Simulation

• Support cost-effective Oil Production
• Simulations on a 3-D grid
• 17 variables and cell locations in 3-D grid at each time step
• Computation of bypassed regions
  – Expression to determine if a cell is bypassed for a time-step
  – Within a spatial region and range of time steps
  – Grid cells that are bypassed for every time-step in the range

[Figure: Oil Reservoir management]

High-Level Schema

<xs:element name="data" maxOccurs="unbounded">
  <xs:complexType>
    <xs:sequence>
      <xs:element name="x" type="xs:integer"/>
      <xs:element name="y" type="xs:integer"/>
      <xs:element name="z" type="xs:integer"/>
      <xs:element name="time" type="xs:integer"/>
      <xs:element name="velocity" type="xs:float"/>
      <xs:element name="mom" type="xs:float"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>

High-Level XQuery Code Of Oil Reservoir management

unordered(
  for $i in ($x1 to $x2)
  for $j in ($y1 to $y2)
  for $k in ($z1 to $z2)
  let $p := document("OilRes.xml")/data
  where ($p/x = $i) and ($p/y = $j) and ($p/z = $k)
        and ($p/time >= $tmin) and ($p/time <= $tmax)
  return
    <info>
      <coord> {$i, $j, $k} </coord>
      <summary> { analyze($p) } </summary>
    </info>
)
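One way to read the query is as a direct loop nest over the high-level view. The Python sketch below is only an illustration: the `Data` record mirrors the high-level schema, while the body of `analyze` and its 0.5 threshold are hypothetical, since the slides do not give the bypass expression. The `scans` counter shows that a direct translation walks the whole dataset once per (i, j, k).

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Data:
    """One <data> element of the high-level schema."""
    x: int
    y: int
    z: int
    time: int
    velocity: float
    mom: float

def analyze(records):
    # Hypothetical stand-in for analyze($p): a cell counts as bypassed
    # only if every selected time step has low velocity.
    return all(r.velocity < 0.5 for r in records)

def run_query(dataset, x_range, y_range, z_range, tmin, tmax):
    """Direct translation: one full pass over `dataset` per (i, j, k)."""
    results = {}
    scans = 0
    for i in range(x_range[0], x_range[1] + 1):
        for j in range(y_range[0], y_range[1] + 1):
            for k in range(z_range[0], z_range[1] + 1):
                scans += 1  # the let/where clause walks the whole dataset
                p = [r for r in dataset
                     if r.x == i and r.y == j and r.z == k
                     and tmin <= r.time <= tmax]
                results[(i, j, k)] = analyze(p)
    return results, scans

# Two cells, two time steps each (made-up values).
dataset = [
    Data(0, 0, 0, 1, 0.1, 1.0), Data(0, 0, 0, 2, 0.2, 1.0),
    Data(1, 0, 0, 1, 0.1, 1.0), Data(1, 0, 0, 2, 0.9, 1.0),
]
results, scans = run_query(dataset, (0, 1), (0, 0), (0, 0), 1, 2)
```

With these made-up values the cell (0, 0, 0) is reported as bypassed and (1, 0, 0) is not, at the cost of one dataset pass per cell.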

Low-Level Schema

<file name="info">
  <sequence>
    <group name="data">
      <attribute name="time">
        <datatype> integer </datatype>
        <dataspace>
          <rank> 1 </rank>
          <dimension> [1] </dimension>
        </dataspace>
      </attribute>
      <dataset name="velocity">
        <datatype> float </datatype>
        <dataspace>
          <rank> 1 </rank>
          <dimension> [x] </dimension>
        </dataspace>
      </dataset>
      ..............
    </group>
  </sequence>
</file>

Mapping Schema

//high/data/velocity    //low/info/data/velocity

//high/data/time        //low/info/data/time

//high/data/mom         //low/info/data/mom [index(//low/info/data/velocity, 1)]

//high/data/x           //low/coord/x [index(//low/info/data/velocity, 1)]

Compiler Analysis

• Problem with direct translation:
  – Each let expression involves a complete scan over the dataset
  – So the final code will need several passes over the data
• Solution:
  – Apply Data Centric Transformations to read each portion of the HDF5 dataset only once

Naïve Strategy

[Figure labels: Dataset, Output]

Requires 3 Scans

Data Centric Strategy

[Figure labels: Datasets, Output]

Requires just one scan

Data Centric Transformation

• Overall Idea in Data-Centric Transformation
  – Iterate over each data element in actual storage
  – Find out iterations of the original loop in which they are accessed.
  – Execute computation corresponding to those iterations.
• Previous Work
  – Pingali et al.: blocking
  – Ferreira and Agrawal: data-parallel Java on disk-resident datasets
  – Li and Agrawal: XQuery, invert getData functions
• Our contribution:
  – Use Low-Level and Mapping Schema
  – Extend the idea when multiple datasets need to be accessed

Data Centric Transformation

• Mapping Function T : Iteration space → High-Level data

• Mapping Function C : High-Level data → Low-Level data

• Mapping Function C · T = M : Iteration space → Low-Level data

• Our goal is to compute $M^{-1}$.
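As a small worked instance of these mappings, the Python sketch below assumes a row-major linearized storage of a 3-D grid. The extents `NX`, `NY`, `NZ` and the layout itself are assumptions made for illustration; in the actual system the layout comes from the low-level and mapping schemas.

```python
NX, NY, NZ = 4, 3, 2  # hypothetical grid extents

def T(i, j, k):
    """Iteration (i, j, k) -> high-level element it selects (x=i, y=j, z=k)."""
    return (i, j, k)

def C(x, y, z):
    """High-level element -> low-level offset (assumed row-major layout)."""
    return (x * NY + y) * NZ + z

def M(i, j, k):
    """M = C . T : iteration space -> low-level data."""
    return C(*T(i, j, k))

def M_inv(offset):
    """Low-level offset -> the iteration (i, j, k) that accesses it."""
    ij, k = divmod(offset, NZ)
    i, j = divmod(ij, NY)
    return i, j, k

# Walking storage in order and applying M_inv visits every iteration once.
iterations = [M_inv(off) for off in range(NX * NY * NZ)]
```

Because `M_inv` is applied to offsets in storage order, the loop over `iterations` follows the physical layout rather than the order of the original `for` clauses.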

Data Centric Transformation

• Choose one dataset as base dataset S1 from n datasets to be accessed

• Apply $M_1^{-1}$ to compute set of iterations.

• The expression $M_i \cdot M_1^{-1}$ gives the portion of dataset Si that needs to be accessed along with S1

• Choice of base dataset might impact the data locality.
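The composition $M_i \cdot M_1^{-1}$ can be illustrated with two datasets that hold the same grid in different orders. In this Python sketch both layouts and the grid extents are hypothetical; it only shows how a contiguous chunk of the base dataset S1 translates into offsets of a second dataset S2.

```python
NX, NY, NZ = 4, 3, 2  # hypothetical grid extents

def M1(i, j, k):
    """Layout of base dataset S1: x varies slowest (assumed)."""
    return (i * NY + j) * NZ + k

def M1_inv(off):
    """Offset in S1 -> iteration (i, j, k)."""
    ij, k = divmod(off, NZ)
    i, j = divmod(ij, NY)
    return i, j, k

def M2(i, j, k):
    """Layout of dataset S2: z varies slowest (assumed)."""
    return (k * NY + j) * NX + i

def companion_offsets(s1_start, s1_stop):
    """Offsets of S2 needed together with the S1 chunk [s1_start, s1_stop)."""
    return sorted(M2(*M1_inv(off)) for off in range(s1_start, s1_stop))

chunk = companion_offsets(0, 6)
```

Here the first six contiguous elements of S1 map to the strided S2 offsets 0, 4, 8, 12, 16, 20, which is one way the choice of base dataset can affect locality.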

Choice of Base Dataset

• Min-IO-Volume Strategy
  – Minimize repeated access to any dataset
• Min-Seek-Time Strategy
  – Minimize any discontinuity in access

Template for Generated Code

Generated_Query
{
  Create an abstract iteration space using Source code.
  Allocate and initialize an array of output element corresponding to iteration space.
  For k = 1, …, NO_OF_CHUNKS
  {
    Read kth chunk of dataset S1 using HDF5 functions and structural tree.
    Foreach of the other datasets S2, … , Sn
      access the required chunk of the dataset.
    Foreach data element in the chunks of data
    {
      compute the iteration instance.
      apply the reduction computation and update the output.
    }
  }
}
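The template can be mimicked in a few lines of Python. This is a sketch under stated assumptions, not the generated code itself: plain lists stand in for HDF5 datasets and slicing stands in for HDF5 chunk reads, the data is assumed to be stored time step by time step, and the bypass test (`v < 0.5 and m < 2.0`) is hypothetical.

```python
def generated_query(velocity, mom, n_cells, tmin, tmax, chunk_size):
    """Single chunked pass over the base dataset `velocity` (S1)."""
    # Output array: one reduction slot per iteration (grid cell).
    bypassed = [True] * n_cells
    reads = 0
    for start in range(0, len(velocity), chunk_size):
        stop = min(start + chunk_size, len(velocity))
        v_chunk = velocity[start:stop]  # stand-in for an HDF5 chunk read of S1
        m_chunk = mom[start:stop]       # matching portion of S2
        reads += 1
        for off, (v, m) in enumerate(zip(v_chunk, m_chunk), start):
            # Inverse mapping: stored element -> (time step, iteration).
            t, cell = divmod(off, n_cells)
            if tmin <= t <= tmax:
                # Hypothetical bypass test; the real expression is not given.
                bypassed[cell] = bypassed[cell] and (v < 0.5 and m < 2.0)
    return bypassed, reads

# 3 cells, 2 time steps, stored time step by time step (made-up values).
velocity = [0.1, 0.9, 0.2, 0.3, 0.1, 0.4]
mom = [1.0, 1.0, 1.0, 1.0, 1.0, 3.0]
bypassed, reads = generated_query(velocity, mom, n_cells=3,
                                  tmin=0, tmax=1, chunk_size=4)
```

Each stored element is visited exactly once, and the per-cell reduction is updated as soon as the element is read, regardless of how the chunk boundaries fall.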

Experiment

[Figure: Impact of Strategy and Chunk-Size, Dataset1. x-axis: Read Chunk-Size (x1000 elements), ticks 1, 5, 15, 31, 62, 125; y-axis: Time (sec), 0 to 1400; series: Min-Seek-Time, Min-IO-Volume]

200*200*200 grid with 10 time steps (1.28 GB)

50*50*50 Storage Chunk Size

Experiment

[Figure: Impact of Strategy and Chunk-Size, Dataset2. x-axis: Read Chunk-Size (x1000 elements), ticks 1, 5, 15, 31, 62; y-axis: Time (sec), 0 to 350; series: Min-Seek-Time, Min-IO-Volume]

50*50*50 grid with 200 time steps (400 MB)

25*25*25 Storage Chunk Size

Key Observations

• Overall minimum execution time
  – Min-IO-Volume strategy when read chunk size matches storage chunk size
• Execution time
  – Very sensitive to Read Chunk-Size in Min-IO-Volume Strategy
  – Not sensitive to Read Chunk-Size in Min-Seek-Time Strategy due to buffering of Storage chunks

Summary

• Compiler techniques
  – Support High-level abstractions on complex low-level data formats
  – Enable use of the same source code across a variety of data formats
  – Perform data centric transformations automatically
  – Experimental results show that a minor change in strategy can affect performance significantly
• Future Work
  – Cost models to guide strategy and chunk size selection
  – Compare performance with manual implementations
  – Parallelizing data processing
  – Extend applicability of the algorithm to a more general class of queries