towards a benchmark for etl workflows panos vassiliadis anastasios karagiannis vasiliki tziovara...

58
Towards a Benchmark for ETL Workflows Panos Vassiliadis Anastasios Karagiannis Vasiliki Tziovara Alkis Simitsis Univ. of Ioannina Almaden Research Center

Post on 18-Dec-2015

219 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Towards a Benchmark for ETL Workflows Panos Vassiliadis Anastasios Karagiannis Vasiliki Tziovara Alkis Simitsis Univ. of Ioannina Almaden Research Center

Towards a Benchmark for ETL Workflows

Panos VassiliadisAnastasios KaragiannisVasiliki Tziovara

Alkis Simitsis

Univ. of Ioannina

Almaden Research Center

Page 2: Towards a Benchmark for ETL Workflows Panos Vassiliadis Anastasios Karagiannis Vasiliki Tziovara Alkis Simitsis Univ. of Ioannina Almaden Research Center

2

Goal of this work

• The goal of this paper is to contribute towards a benchmark that can serve as the experimental testbed for the assessment of ETL methods or tools concerning their basic behavioral properties (measures) over a broad range of ETL workflows.

Page 3: Towards a Benchmark for ETL Workflows Panos Vassiliadis Anastasios Karagiannis Vasiliki Tziovara Alkis Simitsis Univ. of Ioannina Almaden Research Center

3

Roadmap

• Motivation• Goals & parameters of the benchmark• Micro-macro view of an ETL workflow• Open issues

Page 4: Towards a Benchmark for ETL Workflows Panos Vassiliadis Anastasios Karagiannis Vasiliki Tziovara Alkis Simitsis Univ. of Ioannina Almaden Research Center

4

Roadmap

• Motivation• Goals & parameters of the benchmark• Micro-macro view of an ETL workflow• Open issues

Page 5: Towards a Benchmark for ETL Workflows Panos Vassiliadis Anastasios Karagiannis Vasiliki Tziovara Alkis Simitsis Univ. of Ioannina Almaden Research Center

5

Add_SPK1

SUPPKEY=1

SK1

DS.PS1.PKEY, LOOKUP_PS.SKEY,

SUPPKEY

$2€

COST DATE

DS.PS2 Add_SPK2

SUPPKEY=2

SK2

DS.PS2.PKEY, LOOKUP_PS.SKEY,

SUPPKEYCOST DATE=SYSDATE

AddDate CheckQTY

QTY>0

U

DS.PS1

Log

rejected

Log

rejected

A2EDate

NotNULL

Log

rejected

Log

rejected

Log

rejected

DIFF1

DS.PS_NEW1.PKEY,DS.PS_OLD1.PKEYDS.PS_NEW

1

DS.PS_OLD

1

DW.PARTSUPP

Aggregate1

PKEY, DAYMIN(COST)

Aggregate2

PKEY, MONTHAVG(COST)

V2

V1

TIME

DW.PARTSUPP.DATE,DAY

FTP1S1_PARTSU

PP

S2_PARTSUPP

FTP2

DS.PS_NEW

2

DIFF2

DS.PS_OLD

2

DS.PS_NEW2.PKEY,DS.PS_OLD2.PKEY

Sources DW

DSA

ETL workflows

Page 6: Towards a Benchmark for ETL Workflows Panos Vassiliadis Anastasios Karagiannis Vasiliki Tziovara Alkis Simitsis Univ. of Ioannina Almaden Research Center

6

Off-line vs. Active ETL

• Off-line mode: the workflow is executed during a specific time window of some hours (typically at night), when the systems are not servicing their end-users. – Due to the low load of both the source systems and the

warehouse, both the refreshment of data and any other administrative activities (cleanups, auditing, etc) are easier to complete.

• Active mode: the sources continuously (with high frequency) try to send new data to the warehouse. – Not necessarily instantly; rather, small groups of data are

collected and sent to the warehouse for further processing.

• Differences of the two modes:– frequency of the workflow execution– load incurred to all the systems involved.

Page 7: Towards a Benchmark for ETL Workflows Panos Vassiliadis Anastasios Karagiannis Vasiliki Tziovara Alkis Simitsis Univ. of Ioannina Almaden Research Center

7

Problem & Motivation

• When experimenting with ETL wokflows– what are the important problem parameters

& what are the realistic values for them?– what test suites should we use?

• We have faced the problem twice so far:– Logical optimization of the ETL process

(transposition of activities to speed up the workflow – ICDE05, TKDE05)

– Physical optimization of the ETL process (which physical operators to use for optimal execution of the process – DOLAP07)

Page 8: Towards a Benchmark for ETL Workflows Panos Vassiliadis Anastasios Karagiannis Vasiliki Tziovara Alkis Simitsis Univ. of Ioannina Almaden Research Center

8

Logical Optimization

PARTSPARTS1

A2E(DATE)

σ(€COST)

γ(DATE)

1 987

2 5 6

UNN(€COST)

3

$2€($COST)

4

PARTS2

PARTS

PARTS1

σ(€COST)

U

γ(DATE)

1

97

3

2 8_2 6

NN(€COST)

σ(€COST)

8_1

$2€($COST)

4

PARTS2A2E(DATE)

5

Can we push selection early enough?

Can we aggregate before $2€ takes place?

How about naming conflicts?

Page 9: Towards a Benchmark for ETL Workflows Panos Vassiliadis Anastasios Karagiannis Vasiliki Tziovara Alkis Simitsis Univ. of Ioannina Almaden Research Center

9

Problem & Motivation

• Existing standards are insufficient– TPC-H– TPC-DS

• Practical cases are not publishable• We resort in devising our own ad-

hoc test scenarios– either through a specific set of

scenarios– or, through a scenario generator

Page 10: Towards a Benchmark for ETL Workflows Panos Vassiliadis Anastasios Karagiannis Vasiliki Tziovara Alkis Simitsis Univ. of Ioannina Almaden Research Center

10

Roadmap

• Motivation• Goals & parameters of the benchmark• Micro-macro view of an ETL workflow• Open issues

Page 11: Towards a Benchmark for ETL Workflows Panos Vassiliadis Anastasios Karagiannis Vasiliki Tziovara Alkis Simitsis Univ. of Ioannina Almaden Research Center

11

Goal of this work

• We are interested in understanding– The fundamental families of activities

performed in an ETL scenario– The frequent ways with which activities

and recordsets interconnect in an ETL scenario

– The important parameters to be tuned in an experiment & the appropriate values for them

– The appropriate measures to be measured during an experiment

Page 12: Towards a Benchmark for ETL Workflows Panos Vassiliadis Anastasios Karagiannis Vasiliki Tziovara Alkis Simitsis Univ. of Ioannina Almaden Research Center

12

The goal is not …

• To compare existing systems…• To provide specialized performance

measures for very specific tasks in the overall process…

• To exhaustively enumerate all the possible alternatives for specific operations…

• …

Page 13: Towards a Benchmark for ETL Workflows Panos Vassiliadis Anastasios Karagiannis Vasiliki Tziovara Alkis Simitsis Univ. of Ioannina Almaden Research Center

13

Fundamental goals of any ETL workflow

• Effectiveness– Does the workflow execution reach the

maximum possible (or, at least, the minimum tolerable) level of data freshness, completeness and consistency in the warehouse within the necessary time (or resource) constraints?

– Is the workflow execution resilient to occasional failures?

• Efficiency– How fast is the workflow executed?– What resource overheads does the workflow

incur at the source and the warehouse side?

Page 14: Towards a Benchmark for ETL Workflows Panos Vassiliadis Anastasios Karagiannis Vasiliki Tziovara Alkis Simitsis Univ. of Ioannina Almaden Research Center

14

Effectiveness

• Bring the data from the sources to the DW s.t.:– Periodically, as many of them as possible are

loaded. – The loaded data are as fresh as possible– At “checkpoint” timestamps, all data have been

loaded in the warehouse– Data respect both database and business rules.

• Potential problems :– Recovery from failures. – Missing changes at the source. – Invalid records due to failed transactions.

Page 15: Towards a Benchmark for ETL Workflows Panos Vassiliadis Anastasios Karagiannis Vasiliki Tziovara Alkis Simitsis Univ. of Ioannina Almaden Research Center

15

Efficiency

• Data exchange between sources and warehouse is done as fast as possible

• Minimal overhead is incurred – at the source systems. – at the data warehouse.

Page 16: Towards a Benchmark for ETL Workflows Panos Vassiliadis Anastasios Karagiannis Vasiliki Tziovara Alkis Simitsis Univ. of Ioannina Almaden Research Center

16

Problem parameters

• the size of the workflow (i.e., the number of nodes contained in the graph),

• the structure of the workflow (i.e., the variation of the nature of the involved nodes and their interconnection as the workflow graph)

• the size of input data originating from the sources,

• the overall selectivity of the workflow, based on the selectivities of the activities of the workflow,

• the values of probabilities of failure.

Page 17: Towards a Benchmark for ETL Workflows Panos Vassiliadis Anastasios Karagiannis Vasiliki Tziovara Alkis Simitsis Univ. of Ioannina Almaden Research Center

17

Measures

• Measures for data freshness and data consistency– (M1.1) Percentage of data that violate

business rules.– (M1.2) Percentage of data that should

be present at their appropriate warehouse targets, but they are not.

• Measures for the resilience to failures – (M2) Percentage of successfully

resumed workflow executions.

Page 18: Towards a Benchmark for ETL Workflows Panos Vassiliadis Anastasios Karagiannis Vasiliki Tziovara Alkis Simitsis Univ. of Ioannina Almaden Research Center

18

Measures

• Speed of the overall process– (M3.1) Throughput of regular workflow

execution (this may also be measured as total completion time).

– (M3.2) Throughput of workflow execution including a specific percentage of failures and their resumption.

– (M3.3) Average latency per tuple in regular execution.

Page 19: Towards a Benchmark for ETL Workflows Panos Vassiliadis Anastasios Karagiannis Vasiliki Tziovara Alkis Simitsis Univ. of Ioannina Almaden Research Center

19

Measures• Overheads

– (M4.1) Min/Max/Avg/ timeline of memory consumed by the ETL process at the source system.

– (M4.2) Time needed to complete the processing of a certain number of OLTP transactions in the presence (as opposed to the absence) of ETL software at the source, in regular source operation.

– (M4.3) The same as 4.2, but in the case of source failure, where ETL tasks are to be performed too, concerning the recovered data.

– (M4.4) Min/Max/Avg/ timeline of memory consumed by the ETL process at the warehouse system.

– (M4.5) (active warehousing) Time needed to complete the processing of a certain number of decision support queries in the presence (as opposed to the absence) of ETL software at the warehouse, in regular operation.

– (M4.6) The same as M4.5, but in the case of any (source or warehouse) failure, where ETL tasks are to be performed too at the warehouse side.

Page 20: Towards a Benchmark for ETL Workflows Panos Vassiliadis Anastasios Karagiannis Vasiliki Tziovara Alkis Simitsis Univ. of Ioannina Almaden Research Center

20

Goal of this work

• Covered:The important parameters to be tuned in

an experiment & the appropriate values for them

The appropriate measures to be measured during an experiment

• Pending: The fundamental families of activities

performed in an ETL scenario The frequent ways with which activities

and recordsets interconnect in an ETL scenario

Page 21: Towards a Benchmark for ETL Workflows Panos Vassiliadis Anastasios Karagiannis Vasiliki Tziovara Alkis Simitsis Univ. of Ioannina Almaden Research Center

21

Roadmap

• Motivation• Goals & parameters of the benchmark• Micro-macro view of an ETL workflow• Open issues

Page 22: Towards a Benchmark for ETL Workflows Panos Vassiliadis Anastasios Karagiannis Vasiliki Tziovara Alkis Simitsis Univ. of Ioannina Almaden Research Center

22

Formal treatment

• We model an ETL workflow as a directed acyclic graph G(V,E). – Each node vV is either an activity a or

a recordset r. – An edge (a,b)E denotes that b

receives data from node a for further processing.

Page 23: Towards a Benchmark for ETL Workflows Panos Vassiliadis Anastasios Karagiannis Vasiliki Tziovara Alkis Simitsis Univ. of Ioannina Almaden Research Center

23

Formal treatment• Intra-node constraints

– Each recordset r is a pair (r.name, r.schema), with the schema being a finite list of attribute names.

– Each activity a is a tuple (N,I,O,S,A). N is the activity’s name. I is a finite set of input schemata. O is a finite set of output schemata. S is a declarative description of the relationship of its output schema with its input schemata in an appropriate language (without delving into algorithmic or implementation issues). A is the algorithm chosen for activity’s execution.

• Inter-node constraints – The data consumer of a recordset cannot be another

recordset. Still, more than one consumer is allowed for recordsets.

– Each activity must have at least one provider, either another activity or a recordset. When an activity has more than one data providers, these providers can be other activities or activities combined with recordsets.

– Feedback of data is not allowed; i.e., the data consumer of an activity cannot be the same activity.

Page 24: Towards a Benchmark for ETL Workflows Panos Vassiliadis Anastasios Karagiannis Vasiliki Tziovara Alkis Simitsis Univ. of Ioannina Almaden Research Center

24

Resulting problem…

• Due to the intra-node characteristics, we need a “taxonomy” for ETL activities (micro-level)

• Due to the infinite possibilities of connecting nodes (activities and recordsets) we need a set of “design patterns” as abstractions of how frequently encountered ETL graphs look like (macro-level)

Page 25: Towards a Benchmark for ETL Workflows Panos Vassiliadis Anastasios Karagiannis Vasiliki Tziovara Alkis Simitsis Univ. of Ioannina Almaden Research Center

25

Micro level

• Micro level: derive a set of fundamental classes, where frequently encountered activities can be classified

• Why a taxonomy of ETL activities?– Impossible to predict any possible

script / algorithm / operator /…– No algebra available right now

Not necessary only for the benchmark, but orthogonally for other tasks (e.g.,

optimization, statistics, etc)

Page 26: Towards a Benchmark for ETL Workflows Panos Vassiliadis Anastasios Karagiannis Vasiliki Tziovara Alkis Simitsis Univ. of Ioannina Almaden Research Center

26

Page 27: Towards a Benchmark for ETL Workflows Panos Vassiliadis Anastasios Karagiannis Vasiliki Tziovara Alkis Simitsis Univ. of Ioannina Almaden Research Center

27

Logical classification

• Row-level operations, which are locally applied to a single row.

• Router operations, which locally decide, for each row, which of the many (output) destinations it should be sent to.

• Unary Grouper operations, which transform a set of rows to a single row.

• Unary Holistic operations, which perform a transformation to the entire data set. These are usually blocking operations.

• Binary or N-ary operations, which combine many inputs into one output.

Page 28: Towards a Benchmark for ETL Workflows Panos Vassiliadis Anastasios Karagiannis Vasiliki Tziovara Alkis Simitsis Univ. of Ioannina Almaden Research Center

28

Micro level

blocking

semi-blocking

non-blocking

binary# inputs

Final classification

Physical-level characteristics

unary

N-ary

row-level

router grouper

Page 29: Towards a Benchmark for ETL Workflows Panos Vassiliadis Anastasios Karagiannis Vasiliki Tziovara Alkis Simitsis Univ. of Ioannina Almaden Research Center

29

Micro-level

  Unary N-ary

Blocking HolisticGrouper

HolisticGrouper

Semi-blocking Grouper Grouper

Non-blocking Row-levelRouter

 

Page 30: Towards a Benchmark for ETL Workflows Panos Vassiliadis Anastasios Karagiannis Vasiliki Tziovara Alkis Simitsis Univ. of Ioannina Almaden Research Center

30

Macro level

• Even harder!• How to derive a set of typical

structural patterns for an ETL scenario?– Top down: delve to the fundamental

constituents of such a scenario– Bottom up: explore scenarios and try to

abstract common parts

• We did a little bit of both, and derived a fundamental pattern of structure

Page 31: Towards a Benchmark for ETL Workflows Panos Vassiliadis Anastasios Karagiannis Vasiliki Tziovara Alkis Simitsis Univ. of Ioannina Almaden Research Center

31

A butterfly is an ETL workflow that consists of three distinct components:

• Body: a central, detailed point of persistence (fact or dimension table) that is populated with the data produced by the left wing.

• Left wing: sources, activities and intermediate results. Performs extraction, cleaning and transformation + loads the data to the body.

• Right wing: materialized views, reports, spreadsheets, as well as the activities that populate them, to support reporting and analysis

Butterflies to the rescue!

Page 32: Towards a Benchmark for ETL Workflows Panos Vassiliadis Anastasios Karagiannis Vasiliki Tziovara Alkis Simitsis Univ. of Ioannina Almaden Research Center

32

γA,Β

4

V

S

R Z

γA

5

W

100000

100000

sel1=0.6

sel2=0.1

sel4=0.2

p1=0.003

p2=0.004

p4=0.001

σA<600

1

σA>300

2

sel5=0.5

p5=0.005

wv A=A

3

sel3=0.2

p3=0.001

Butterflies to the rescue!

Page 33: Towards a Benchmark for ETL Workflows Panos Vassiliadis Anastasios Karagiannis Vasiliki Tziovara Alkis Simitsis Univ. of Ioannina Almaden Research Center

33

γA,Β

4

V

S

R Z

γA

5

W

100000

100000

sel1=0.6

sel2=0.1

sel4=0.2

p1=0.003

p2=0.004

p4=0.001

σA<600

1

σA>300

2

sel5=0.5

p5=0.005

wv A=A

3

sel3=0.2

p3=0.001

Butterflies to the rescue!

Page 34: Towards a Benchmark for ETL Workflows Panos Vassiliadis Anastasios Karagiannis Vasiliki Tziovara Alkis Simitsis Univ. of Ioannina Almaden Research Center

34

Butterfly classes

• Butterflies constitute a fundamental pattern of reference– Line– Balanced butterfly

• Left-winged variants (heavy of the ETL part)– Primary flow– Wishbone– Tree

• Right winged variants (heavy on the “reporting” part)– Fork

• Irregular variants

Page 35: Towards a Benchmark for ETL Workflows Panos Vassiliadis Anastasios Karagiannis Vasiliki Tziovara Alkis Simitsis Univ. of Ioannina Almaden Research Center

35

1 2 3 5 6 8

Not Null (Part Key, Order Key, Supp Key)

Currency (Ext. Price, Discount,

Tax)Derive Fnc

(Profit)

Sum (Profit), Sum (Ext. Price)Group by (Part Key, Line Status)

Return Status = True

Sum (Profit), Sum (Ext. Price)Group by (Part Key)

11

Line Status <> ‘Delivered’

Line Item.D+

DW.LView01

View02

View03

4DW.D+9

Line Status = ‘Delivered’7

10

12

Line

Page 36: Towards a Benchmark for ETL Workflows Panos Vassiliadis Anastasios Karagiannis Vasiliki Tziovara Alkis Simitsis Univ. of Ioannina Almaden Research Center

36

1 2 3

6

Not Null (CustKey) SK(custkey) PhoneFormat

New - Old

Customer.new

CUSTOMER

7C.D+

Error

4 5

SK(custkey) PhoneFormat

Customer.old

Cnew

Cold

Wishbone

Page 37: Towards a Benchmark for ETL Workflows Panos Vassiliadis Anastasios Karagiannis Vasiliki Tziovara Alkis Simitsis Univ. of Ioannina Almaden Research Center

37

Order

1

2L-status

L-cust3

L-ord

4 DW.Order

SK (OrderStatus)

SK (CustKey)

SK (Order Key)

SCD T1Insert or Update

Primary Flow

Page 38: Towards a Benchmark for ETL Workflows Panos Vassiliadis Anastasios Karagiannis Vasiliki Tziovara Alkis Simitsis Univ. of Ioannina Almaden Research Center

38

7

4SK(custkey)

DIFF I,U

PS1.new PS1new

PS1old

1

Sort(Pkey, SuppKey)

135

DIFF I,U

PS2.new PS2new

PS2old

2

Sort(Pkey, SuppKey)

6

DIFF I,U

PS3.new PS3new

PS3old

3

Sort(Pkey, SuppKey)

10

UNION SORTED (Pkey)

DW.PS11

PS.D 12

Sum (AvailQty)Group by (Part Key)

View04

Insert or Update

Insert or Update

9

SK(custkey)

8

SK(custkey)

Tree

Page 39: Towards a Benchmark for ETL Workflows Panos Vassiliadis Anastasios Karagiannis Vasiliki Tziovara Alkis Simitsis Univ. of Ioannina Almaden Research Center

39

Line Item.D+

2 3 4

DW.Lineitem

6 7

8

12

View05

View06

View08

SK(Part Key, Order Key, Supp Key)

Date Key (Ship Date, Receipt

Date)

Currency (Ext. Price, Discount,

Tax)

Derive Fnc (Profit)

1

Sum (Profit), Sum (Ext. Price)Group by (Part Key, Line Status)

10 View07

Sum (Profit), Sum (Ext. Price)Group by (Part Key, Line Status)

Sum (Profit), Avg (Discount)Group by (Part Key, Supp Key)

Avg (Profit), Avg(Ext. Price)Group by (Part Key, Line Status)5

D+.LI

13

11

9

Fork

Page 40: Towards a Benchmark for ETL Workflows Panos Vassiliadis Anastasios Karagiannis Vasiliki Tziovara Alkis Simitsis Univ. of Ioannina Almaden Research Center

40

Part Supp+

Supplier+

4

1 2

DW.PS

View097

8 10

View12

View10

5 12 14

View13

View11

SK(PartKey, SuppKey) Derive Fnc

(Total Cost)

SK(SuppKey)Phone Format

(Phone)

SP_Supp Key = S_Supp Key

Max (S. C.), Min (S. C.)Group by (Nation Key, Part Key)

Max (S. C.), Min (S. C.)Group by (Part Key)

Sum (T. C.) Group by (Nation Key, Supp Key)

Sum (T. C.)Group by (Supp Key)

3

D+.PS

DW.S 6

D+.S

9 11

1513

Full recomputation for all right-wing views

Balanced Butterfly

Page 41: Towards a Benchmark for ETL Workflows Panos Vassiliadis Anastasios Karagiannis Vasiliki Tziovara Alkis Simitsis Univ. of Ioannina Almaden Research Center

41

SK+RK(partkey)

2

1

DIFFPart.new PART

Part.old 5P.DU

P.D++ 4

Most recent PART3

SK+RK(partkey)

Balanced ButterflySlowly Changing Dimension of

Type II

Not a typical butterfly…

Page 42: Towards a Benchmark for ETL Workflows Panos Vassiliadis Anastasios Karagiannis Vasiliki Tziovara Alkis Simitsis Univ. of Ioannina Almaden Research Center

42

Roadmap

• Motivation• Goals & parameters of the benchmark• Micro-macro view of an ETL workflow• Open issues

Page 43: Towards a Benchmark for ETL Workflows Panos Vassiliadis Anastasios Karagiannis Vasiliki Tziovara Alkis Simitsis Univ. of Ioannina Almaden Research Center

43

Open Issues• Data sizes

– the numbers given by TPC-H can be a valid point of reference for data warehouse contents.

– Important: fraction of source data over the warehouse contents. Values in the range 0.01 to 0.7?

• Selectivity of the left wing of a butterfly – Values between 0.5 and 1.2?

• Failure rates – Range of 10-4 and 10-2?

• Workflow size – Although we provide scenarios of small scale, medium–size

and large-size scenarios are also needed.

• A careful assignment of values based on a large number of real-world case studies (that we do not possess) should be a topic for a full-fledged benchmark.

Page 44: Towards a Benchmark for ETL Workflows Panos Vassiliadis Anastasios Karagiannis Vasiliki Tziovara Alkis Simitsis Univ. of Ioannina Almaden Research Center

44

Open Issues

• Nature of data (now: relational; also: XML, multimedia, …)

• Active vs. off-line modus operandi• Auxiliary structures and processes

(indexes, backup & maintenance scenarios, etc)

• Parallelism and Partitioning

Page 45: Towards a Benchmark for ETL Workflows Panos Vassiliadis Anastasios Karagiannis Vasiliki Tziovara Alkis Simitsis Univ. of Ioannina Almaden Research Center

45

Message to you

• We need a commonly agreed benchmark that realistically reflects real-world ETL scenarios

• Butterflies to the rescue• Feedback from the industry is

necessary – workflow complexity – the frequencies of typically encountered ETL

operations)

Page 46: Towards a Benchmark for ETL Workflows Panos Vassiliadis Anastasios Karagiannis Vasiliki Tziovara Alkis Simitsis Univ. of Ioannina Almaden Research Center

46

Thank you!

All pictures are imported from MS Clipart

Page 47: Towards a Benchmark for ETL Workflows Panos Vassiliadis Anastasios Karagiannis Vasiliki Tziovara Alkis Simitsis Univ. of Ioannina Almaden Research Center

47

Auxiliary slides

Page 48: Towards a Benchmark for ETL Workflows Panos Vassiliadis Anastasios Karagiannis Vasiliki Tziovara Alkis Simitsis Univ. of Ioannina Almaden Research Center

48

Page 49: Towards a Benchmark for ETL Workflows Panos Vassiliadis Anastasios Karagiannis Vasiliki Tziovara Alkis Simitsis Univ. of Ioannina Almaden Research Center

49

V WγA,B

3

R σA>300

1

σB>400

2

Line

Page 50: Towards a Benchmark for ETL Workflows Panos Vassiliadis Anastasios Karagiannis Vasiliki Tziovara Alkis Simitsis Univ. of Ioannina Almaden Research Center

50

γΑ

4

V

S

R

W

σA<600

1

γΒ

2

wv A=A

3

Wishbone

Page 51: Towards a Benchmark for ETL Workflows Panos Vassiliadis Anastasios Karagiannis Vasiliki Tziovara Alkis Simitsis Univ. of Ioannina Almaden Research Center

51

WR σA>300

1

σB>400

2

wv A=A

3

S

Primary Flow

Page 52: Towards a Benchmark for ETL Workflows Panos Vassiliadis Anastasios Karagiannis Vasiliki Tziovara Alkis Simitsis Univ. of Ioannina Almaden Research Center

52

Q

R σA>50 X

S σA<50

2

Y

wv A=A

3

DW

Z

wv A=A

4

1

Tree

Page 53: Towards a Benchmark for ETL Workflows Panos Vassiliadis Anastasios Karagiannis Vasiliki Tziovara Alkis Simitsis Univ. of Ioannina Almaden Research Center

53

γA

2

V

Z

σA>300

1

γA,Β

3

W

γB4

Y

R

Fork

Page 54: Towards a Benchmark for ETL Workflows Panos Vassiliadis Anastasios Karagiannis Vasiliki Tziovara Alkis Simitsis Univ. of Ioannina Almaden Research Center

54

σA=500

3

V

U

γB,C

5

W

TγA

2

γA

6

Q

R σA>300

1γC

4

Z

Right Deep Hierarchy

Page 55: Towards a Benchmark for ETL Workflows Panos Vassiliadis Anastasios Karagiannis Vasiliki Tziovara Alkis Simitsis Univ. of Ioannina Almaden Research Center

55

Body

Right wing Left wing

n1

n2

nm

n1

n2

nk

V

Page 56: Towards a Benchmark for ETL Workflows Panos Vassiliadis Anastasios Karagiannis Vasiliki Tziovara Alkis Simitsis Univ. of Ioannina Almaden Research Center

56

R

t1

t2A B

S

R t2 A B

S

Original state

(a) Sequential execution

t1

R

A B

S

(b) pipelining

t1t2

R1 A1 B1

S1

t1

(c) partitioning

Rk Ak Bk

Sk

tk

...tn

t3

...tn

t3

...tn

t2k...tn

Partitioning & parallelism

Page 57: Towards a Benchmark for ETL Workflows Panos Vassiliadis Anastasios Karagiannis Vasiliki Tziovara Alkis Simitsis Univ. of Ioannina Almaden Research Center

57

Statistics per pattern

Legend:

•N+M (left wing + right wing)

•INCR: incremental maintenance

•I/U: insert and/or update

•FULL: full recomputation

Page 58: Towards a Benchmark for ETL Workflows Panos Vassiliadis Anastasios Karagiannis Vasiliki Tziovara Alkis Simitsis Univ. of Ioannina Almaden Research Center

58