Page 1: Compiler Support for Efficient Processing of XML Datasets

Compiler Support for Efficient Processing of XML Datasets

Xiaogang Li

Renato Ferreira

Gagan Agrawal

The Ohio State University

Page 2: Compiler Support for Efficient Processing of XML Datasets

Motivation

- The need
  - Analysis of datasets is becoming crucial for scientific advances
  - Complex data formats complicate processing
  - Need for applications that are easily portable: compatibility with web/grid services
- The opportunity
  - The emergence of XML and related technologies developed by W3C
  - XML is already extensively used as part of Grid/Distributed Computing
- Can XML help in scientific data processing?

Page 3: Compiler Support for Efficient Processing of XML Datasets

Motivation (Contd.)

- High-level declarative languages ease application development
  - Popularity of Matlab for scientific computations
  - New challenges in compiling them for efficient execution
- XQuery is a high-level language for processing XML datasets
  - Derived from database, declarative, and functional languages!

Page 4: Compiler Support for Efficient Processing of XML Datasets

Approach / Contributions

Show how XML schemas and XQuery simplify scientific data processing

Identify issues in compiling XQuery for this class of applications

New compiler analyses and algorithms

Developed and evaluated a compiler

Page 5: Compiler Support for Efficient Processing of XML Datasets

Outline

- XQuery language features
- Scientific data processing applications
- Compilation challenges
- Compiler analyses
  - Identifying and parallelizing reductions
  - Data-centric transformations
  - Type inferencing for XQuery to C++ translation
- System architecture
  - Use of Active Data Repository (ADR) as the runtime system
- Experimental results
- Conclusions

Page 6: Compiler Support for Efficient Processing of XML Datasets

XQuery Overview

XQuery
- A language for querying and processing XML documents
- Functional language
- Single assignment
- Strongly typed

XQuery expressions
- for let where return (FLWR)
- unordered
- path expression

unordered(
  for $d in document("depts.xml")//deptno
  let $e := document("emps.xml")//emp[deptno = $d]
  where count($e) >= 10
  return
    <big-dept>
      { $d,
        <count>{ count($e) }</count>,
        <avg>{ avg($e/salary) }</avg> }
    </big-dept>
)

Page 7: Compiler Support for Efficient Processing of XML Datasets

Target Applications

Focus on scientific data processing applications

- Arise in a number of scientific domains
- Frequently involve generalized reductions
- Can be expressed conveniently in XQuery
- Low-level data layout is hidden from the application programmers

Page 8: Compiler Support for Efficient Processing of XML Datasets

Satellite- XQuery Code

unordered (
  for $i in ($minx to $maxx)
  for $j in ($miny to $maxy)
  let $p := document("sate.xml")/data/pixel
  return
    <pixel>
      <latitude> {$i} </latitude>
      <longitude> {$j} </longitude>
      <sum>{accumulate($p)}</sum>
    </pixel>
)

define function accumulate ($p) as double {
  let $inp := item-at($p, 1)
  let $NVDI := (($inp/band1 - $inp/band0) div ($inp/band1 + $inp/band0) + 1) * 512
  return
    if (empty($p)) then 0
    else { max($NVDI, accumulate(subsequence($p, 2))) }
}

Page 9: Compiler Support for Efficient Processing of XML Datasets

VMScope- XQuery Code

unordered (
  for $i in ($x1 to $x2)
  for $j in ($y1 to $y2)
  let $p := document("vmscope.xml")/data/pixel[(x = $i) and (y = $j) and (scale >= $z1)]
  return
    <pixel>
      <latitude> {$i} </latitude>
      <longitude> {$j} </longitude>
      <sum>{accumulate($p)}</sum>
    </pixel>
)

define function accumulate ($p) as element {
  if (empty($p)) then $null
  else
    let $max := accumulate(subsequence($p, 2))
    let $q := item-at($p, 1)
    return
      if (($max != $null) and ($q/scale < $max/scale)) then $max
      else $q
}

Page 10: Compiler Support for Efficient Processing of XML Datasets

Challenges

- Reductions expressed as recursive functions
  - Direct execution can be very expensive
- Generating code in an imperative language
  - For either direct compilation or use as part of a runtime system
  - Requires type conversion
- Enhancing locality
  - Data-centric execution on XQuery constructs
  - Use information on low-level data layout

Page 11: Compiler Support for Efficient Processing of XML Datasets

System Architecture

[Figure: system architecture diagram. Components shown: XML Schema, XML Files, XQuery Sources, Compiler, XML Mapping Service, Data Indexing Service, Data Distribution Service.]

Page 12: Compiler Support for Efficient Processing of XML Datasets

Compilation Framework

[Figure: compilation framework diagram. Inputs: XQuery, Schema. Compiler Front-end: XQuery Parser, Schema Parser. Compiler Analysis: Data Centric Analysis, Type Analysis, Recursion Analysis. ADR Code Generation: Local Reduction, Global Reduction.]

Page 13: Compiler Support for Efficient Processing of XML Datasets

Compiler Analysis and Tasks

- Analysis of recursive functions
  - Extract the associative and commutative operations involved
  - Transform the recursive function into iterative operations
- Data-centric transformation
  - Reconstruct unordered loops of the query
  - New strategy requires only one scan of the dataset
- Type inferencing and analysis

Page 14: Compiler Support for Efficient Processing of XML Datasets

Recursion Analysis

Assumption
- Expressed in canonical form
  1) Linear recursive function
  2) Operation is associative and commutative

Objective
- Extract associative and commutative operations
- Extract initialization conditions
- Transform into iterative operations
- Generate a global reduction function

Canonical Form

define function F($t) {
  if (p1) then F1($t)
  else F2(F3($t), F4(F(F5($t))))
}
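To make the recursive-to-iterative step concrete, here is a minimal C++ sketch that evaluates a function in the canonical form both recursively and as a loop. It is an illustration under stated assumptions, not the compiler's actual output: p1 is taken to test for an empty sequence, F1 returns an initial value, F5 drops the first item, F4 is the identity, and F2 is associative and commutative. The names reduce_recursive and reduce_iterative are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Recursive evaluation mirroring the canonical form:
//   F(t) = if empty(t) then init else f2(f3(first(t)), F(rest(t)))
// (F4 is assumed to be the identity in this sketch.)
template <typename T, typename R, typename F3, typename F2>
R reduce_recursive(const std::vector<T>& t, std::size_t start, R init,
                   F3 f3, F2 f2) {
    assert(start <= t.size());
    if (start == t.size()) return init;  // p1 holds: nothing left to process
    return f2(f3(t[start]), reduce_recursive(t, start + 1, init, f3, f2));
}

// Iterative evaluation: fold the items into an accumulator one by one.
// The items are combined in a different order than in the recursion,
// which is only safe because f2 is assumed associative and commutative.
template <typename T, typename R, typename F3, typename F2>
R reduce_iterative(const std::vector<T>& t, R init, F3 f3, F2 f2) {
    R acc = init;
    for (const T& item : t) acc = f2(acc, f3(item));
    return acc;
}
```

The loop needs no call stack and touches each item once, which is what makes the recursive form expensive by comparison when executed directly.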

Page 15: Compiler Support for Efficient Processing of XML Datasets

Recursion Analysis - Algorithm

Algorithm
1. Add leaf nodes that represent or are defined by the recursive function to set S.
2. Keep in S only the nodes that may be returned as the final value.
3. Recursively find the least common ancestor A of all nodes in S.
4. Return the subtree with A as root.
5. Examine whether the subtree represents an associative and commutative operation.

Example

define function accumulate ($p) as double {
  if (empty($p)) then 0
  else
    let $val := accumulate(subsequence($p, 2))
    let $q := item-at($p, 1)
    return
      if ($q >= $val) then $val
      else $q
}
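Once the operation in the example is recognized as associative and commutative (here it selects the smaller of $q and $val), the reduction can be split into a local phase per node and a global phase that merges the partial results. The C++ sketch below shows one way this could look. It is not the generated ADR code, and the function names and the contiguous chunking scheme are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Operation extracted from the example:
//   if ($q >= $val) then $val else $q
inline double combine(double q, double val) { return q >= val ? val : q; }

// Local reduction: one pass over the items held by a single node.
double local_reduction(const std::vector<double>& items, double init) {
    double acc = init;
    for (double q : items) acc = combine(q, acc);
    return acc;
}

// Global reduction: merge the per-node partial results with the same operation.
double global_reduction(const std::vector<double>& partials, double init) {
    return local_reduction(partials, init);
}

// Hypothetical driver: split the data into contiguous chunks, reduce each
// chunk locally (these calls are independent and could run in parallel),
// then combine the partial results.
double reduce_partitioned(const std::vector<double>& data,
                          std::size_t num_nodes, double init) {
    assert(num_nodes > 0);
    std::vector<double> partials;
    const std::size_t chunk = (data.size() + num_nodes - 1) / num_nodes;
    for (std::size_t begin = 0; begin < data.size(); begin += chunk) {
        const std::size_t end = std::min(begin + chunk, data.size());
        const std::vector<double> part(
            data.begin() + static_cast<std::ptrdiff_t>(begin),
            data.begin() + static_cast<std::ptrdiff_t>(end));
        partials.push_back(local_reduction(part, init));
    }
    return global_reduction(partials, init);
}
```

Reusing the initial value in every chunk is harmless for this operation because combining with the same value twice gives the same result; for an operation such as addition, the initial value would have to be the identity.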

Page 16: Compiler Support for Efficient Processing of XML Datasets

Data Centric Transformation

Objective
- Reconstruct the unordered loops of the query so that only one scan of the entire dataset is sufficient

Algorithm
1. Perform any loop fusion that is necessary
2. Generate an abstract iteration space
3. Extract the necessary and sufficient conditions that map an element in the dataset to the iteration space
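The difference between the two loop structures can be sketched in C++ as follows. This is a simplified illustration, not the compiler's output: the Pixel layout, the use of max as the reduction, and the function names are assumptions. The naive version walks the iteration space and searches the dataset for each point; the data-centric version walks the dataset once and uses the mapping condition (x within [x1, x2], y within [y1, y2]) to find the output cell each element belongs to.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct Pixel { int x; int y; double value; };  // hypothetical record layout

// Position of iteration-space point (i, j) in a row-major output array.
std::size_t cell_index(int i, int j, int x1, int y1, int height) {
    return static_cast<std::size_t>(i - x1) * static_cast<std::size_t>(height)
         + static_cast<std::size_t>(j - y1);
}

// Naive strategy: for every output point, scan the whole dataset.
std::vector<double> naive_strategy(const std::vector<Pixel>& data,
                                   int x1, int x2, int y1, int y2) {
    assert(x1 <= x2 && y1 <= y2);
    const int width = x2 - x1 + 1, height = y2 - y1 + 1;
    std::vector<double> out(
        static_cast<std::size_t>(width) * static_cast<std::size_t>(height), 0.0);
    for (int i = x1; i <= x2; ++i)
        for (int j = y1; j <= y2; ++j)
            for (const Pixel& p : data)
                if (p.x == i && p.y == j) {
                    double& cell = out[cell_index(i, j, x1, y1, height)];
                    cell = std::max(cell, p.value);
                }
    return out;
}

// Data-centric strategy: scan the dataset once and map each element
// to the iteration-space point it contributes to.
std::vector<double> data_centric_strategy(const std::vector<Pixel>& data,
                                          int x1, int x2, int y1, int y2) {
    assert(x1 <= x2 && y1 <= y2);
    const int width = x2 - x1 + 1, height = y2 - y1 + 1;
    std::vector<double> out(
        static_cast<std::size_t>(width) * static_cast<std::size_t>(height), 0.0);
    for (const Pixel& p : data) {
        if (p.x < x1 || p.x > x2 || p.y < y1 || p.y > y2) continue;  // not mapped
        double& cell = out[cell_index(p.x, p.y, x1, y1, height)];
        cell = std::max(cell, p.value);
    }
    return out;
}
```

Both functions produce the same output array; the second reads each dataset element exactly once, regardless of the size of the iteration space.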

Page 17: Compiler Support for Efficient Processing of XML Datasets

Naïve Strategy

[Figure: Dataset and Output]

Requires 3 Scans

Page 18: Compiler Support for Efficient Processing of XML Datasets

Data Centric Strategy

[Figure: Datasets and Output]

Requires just one scan

Page 19: Compiler Support for Efficient Processing of XML Datasets

Type inferencing

Objective
- Type analysis for generation of C/C++ code

Challenge
- An XQuery expression may return multiple types
- XQuery supports parametric polymorphism

Algorithm
- Constraint-based type inference algorithm
- Bottom-up and recursive
- Compute the set of all possible types an expression may return

typeswitch ($pixel) {
  case element double pixel return max($pixel/d_value, 0)
  case element integer pixel return max($pixel/i_value, 0)
  default return 0
}

Page 20: Compiler Support for Efficient Processing of XML Datasets

Type inference

C++ code generation

- Use a union to represent multiple simple types
- Use C++ class polymorphism to represent multiple complex types
- Use function cloning for parametric polymorphism

class t_pixel pixel;
struct tmp_result_1 {
    union { double r1; int r2; };
};
tmp_result_1 t1;
if (pixel.tag == double_pixel)
    t1.r1 = max_1(pixel.d_value, 0);
else if (pixel.tag == integer_pixel)
    t1.r2 = max_2(pixel.i_value, 0);
else
    t1.r2 = 0;
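The slide's fragment leaves t_pixel, the tags, and the cloned max functions undefined. A self-contained version might look like the sketch below. Everything beyond the names that appear on the slide is an assumption: the PixelTag and ResultKind enums, the kind field that records which union member is active, the eval_typeswitch wrapper, and the as_double accessor are all hypothetical additions for illustration.

```cpp
#include <algorithm>
#include <cassert>

enum class PixelTag { double_pixel, integer_pixel, other };  // hypothetical tags
enum class ResultKind { Double, Int };                       // hypothetical

// Hypothetical stand-in for the schema-derived pixel class.
struct t_pixel {
    PixelTag tag;
    double d_value;
    int i_value;
};

// Union for the simple types the typeswitch may return, plus a field
// (added for this sketch) that records which member is active.
struct tmp_result_1 {
    ResultKind kind;
    union { double r1; int r2; };
};

// Clones of a polymorphic max, one per concrete argument type.
inline double max_1(double a, double b) { return std::max(a, b); }
inline int max_2(int a, int b) { return std::max(a, b); }

// Imperative counterpart of the typeswitch expression.
tmp_result_1 eval_typeswitch(const t_pixel& pixel) {
    tmp_result_1 t1;
    if (pixel.tag == PixelTag::double_pixel) {
        t1.kind = ResultKind::Double;
        t1.r1 = max_1(pixel.d_value, 0.0);
    } else if (pixel.tag == PixelTag::integer_pixel) {
        t1.kind = ResultKind::Int;
        t1.r2 = max_2(pixel.i_value, 0);
    } else {
        t1.kind = ResultKind::Int;  // default branch returns the integer 0
        t1.r2 = 0;
    }
    return t1;
}

// Checked access to the double member of the union.
inline double as_double(const tmp_result_1& r) {
    assert(r.kind == ResultKind::Double);
    return r.r1;
}
```

Recording the active member alongside the union is a design choice of this sketch; without it, a caller cannot tell whether to read r1 or r2.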

Page 21: Compiler Support for Efficient Processing of XML Datasets

Evaluating Data centric Transformation

[Figure: two bar charts, Virtual Microscope and Satellite, comparing the Opt and Naïve versions at dataset sizes 150M and 600M.]

Page 22: Compiler Support for Efficient Processing of XML Datasets

Parallel Performance- VMScope

[Figure: chart for VMScope with x-axis values 1, 2, 4, 8 and two series, COMM and NO COMM.]

Page 23: Compiler Support for Efficient Processing of XML Datasets

Parallel Performance- Satellite

[Figure: chart for Satellite with x-axis values 1, 2, 4, 8 and two series, COMM and NON-COMM.]

Page 24: Compiler Support for Efficient Processing of XML Datasets

Conclusions

• A case for the use of XML technologies in scientific data analysis
• XQuery: a data parallel language?
• Identified and addressed compilation challenges
• A compilation system has been built
• Very large performance gains from data-centric transformations
• Achieves good parallel performance