Towards a Benchmark for ETL Workflows

Panos Vassiliadis, Anastasios Karagiannis, Vasiliki Tziovara
Univ. of Ioannina

Alkis Simitsis
Almaden Research Center
2
Goal of this work
• The goal of this paper is to contribute towards a benchmark that can serve as an experimental testbed for assessing ETL methods and tools with respect to their basic behavioral properties (measures) over a broad range of ETL workflows.
3
Roadmap
• Motivation
• Goals & parameters of the benchmark
• Micro-macro view of an ETL workflow
• Open issues
4
Roadmap
• Motivation
• Goals & parameters of the benchmark
• Micro-macro view of an ETL workflow
• Open issues
5
ETL workflows

[Figure: an example ETL workflow. Source tables S1_PARTSUPP and S2_PARTSUPP are shipped via FTP1/FTP2 into the data staging area (DSA) as DS.PS_NEW1/DS.PS_NEW2; DIFF1/DIFF2 compare them with DS.PS_OLD1/DS.PS_OLD2 on PKEY to isolate new rows; surrogate keys are assigned by Add_SPK1/Add_SPK2 (SUPPKEY=1/2, via lookups SK1/SK2 over DS.PSi.PKEY and LOOKUP_PS.SKEY); dates and quantities are checked (A2EDate NotNULL, AddDate with DATE=SYSDATE, CheckQTY with QTY>0) and dollar costs are converted to euros ($2€), with rejected rows logged at each check; the united (U) flow loads DW.PARTSUPP, from which Aggregate1 (PKEY, DAY, MIN(COST)) populates view V1 and Aggregate2 (PKEY, MONTH, AVG(COST)) populates view V2, using the TIME table on DW.PARTSUPP.DATE/DAY. Overall flow: Sources → DSA → DW.]
6
Off-line vs. Active ETL
• Off-line mode: the workflow is executed during a specific time window of some hours (typically at night), when the systems are not servicing their end-users. – Due to the low load of both the source systems and the
warehouse, both the refreshment of data and any other administrative activities (cleanups, auditing, etc) are easier to complete.
• Active mode: the sources continuously (with high frequency) try to send new data to the warehouse. – Not necessarily instantly; rather, small groups of data are
collected and sent to the warehouse for further processing.
• Differences of the two modes:– frequency of the workflow execution– load incurred to all the systems involved.
7
Problem & Motivation
• When experimenting with ETL workflows:
  – what are the important problem parameters & what are realistic values for them?
  – what test suites should we use?
• We have faced the problem twice so far:
  – Logical optimization of the ETL process (transposition of activities to speed up the workflow – ICDE05, TKDE05)
  – Physical optimization of the ETL process (which physical operators to use for optimal execution of the process – DOLAP07)
8
Logical Optimization
[Figure: two equivalent logical plans for a workflow over PARTS1 and PARTS2, combining A2E(DATE), $2€($COST), NN(€COST), a union (U), a selection σ(€COST) and an aggregation γ(DATE); in the optimized plan the selection σ(€COST) is split (8_1, 8_2) and pushed below the union, closer to the sources.]

• Can we push selection early enough? (a safety-check sketch follows below)
• Can we aggregate before $2€ takes place?
• How about naming conflicts?
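To make the first question concrete, here is a minimal sketch (our own illustration, not the ICDE05/TKDE05 algorithm) of the standard safety check behind pushing a selection below another activity: the swap is legal only if the selection's predicate uses no attribute that the earlier activity generates or modifies.

# Minimal sketch of a selection push-down check, assuming each activity
# declares which attributes it generates or modifies. Illustrative only.

def can_push_selection_before(selection_attrs, activity_modified_attrs):
    """A selection may be swapped before an activity only if its
    predicate uses no attribute the activity generates or modifies."""
    return selection_attrs.isdisjoint(activity_modified_attrs)

# sigma(EUR_COST) cannot jump over $2EUR, which produces EUR_COST ...
print(can_push_selection_before({"EUR_COST"}, {"EUR_COST"}))   # False
# ... but it can jump over A2E(DATE), which only touches DATE.
print(can_push_selection_before({"EUR_COST"}, {"DATE"}))       # True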
9
Problem & Motivation
• Existing standards are insufficient:
  – TPC-H
  – TPC-DS
• Practical cases are not publishable
• We resort to devising our own ad-hoc test scenarios:
  – either through a specific set of scenarios
  – or through a scenario generator
10
Roadmap
• Motivation
• Goals & parameters of the benchmark
• Micro-macro view of an ETL workflow
• Open issues
11
Goal of this work
• We are interested in understanding:
  – The fundamental families of activities performed in an ETL scenario
  – The frequent ways in which activities and recordsets interconnect in an ETL scenario
  – The important parameters to be tuned in an experiment & the appropriate values for them
  – The appropriate measures to be measured during an experiment
12
The goal is not …
• To compare existing systems…
• To provide specialized performance measures for very specific tasks in the overall process…
• To exhaustively enumerate all the possible alternatives for specific operations…
• …
13
Fundamental goals of any ETL workflow
• Effectiveness
  – Does the workflow execution reach the maximum possible (or, at least, the minimum tolerable) level of data freshness, completeness and consistency in the warehouse within the necessary time (or resource) constraints?
  – Is the workflow execution resilient to occasional failures?
• Efficiency
  – How fast is the workflow executed?
  – What resource overheads does the workflow incur at the source and the warehouse side?
14
Effectiveness
• Bring the data from the sources to the DW s.t.:
  – Periodically, as many of them as possible are loaded.
  – The loaded data are as fresh as possible.
  – At “checkpoint” timestamps, all data have been loaded in the warehouse.
  – Data respect both database and business rules.
• Potential problems:
  – Recovery from failures.
  – Missing changes at the source.
  – Invalid records due to failed transactions.
15
Efficiency
• Data exchange between sources and warehouse is done as fast as possible
• Minimal overhead is incurred:
  – at the source systems.
  – at the data warehouse.
16
Problem parameters
• the size of the workflow (i.e., the number of nodes contained in the graph),
• the structure of the workflow (i.e., the variation of the nature of the involved nodes and their interconnection as the workflow graph),
• the size of the input data originating from the sources,
• the overall selectivity of the workflow, based on the selectivities of its activities,
• the probabilities of failure (a sketch gathering these parameters follows below).
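As an illustration only, these parameters can be gathered into a single configuration record; the field names and example values below are our own, with ranges echoing the Open Issues discussion later in the talk.

# Hypothetical configuration record for one benchmark experiment.
# Field names are illustrative, not part of the benchmark proposal.
from dataclasses import dataclass

@dataclass
class ETLBenchmarkConfig:
    workflow_size: int          # number of nodes in the workflow graph
    structure: str              # e.g., "line", "wishbone", "tree", "fork", "balanced"
    input_size: int             # rows originating from the sources
    overall_selectivity: float  # combination of the activities' selectivities
    failure_probability: float  # per-activity probability of failure

config = ETLBenchmarkConfig(
    workflow_size=20,
    structure="balanced",
    input_size=100_000,
    overall_selectivity=0.6,
    failure_probability=1e-3,
)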
17
Measures
• Measures for data freshness and data consistency:
  – (M1.1) Percentage of data that violate business rules.
  – (M1.2) Percentage of data that should be present at their appropriate warehouse targets, but are not.
• Measures for the resilience to failures:
  – (M2) Percentage of successfully resumed workflow executions.
18
Measures
• Speed of the overall process (a measurement sketch follows below):
  – (M3.1) Throughput of regular workflow execution (this may also be measured as total completion time).
  – (M3.2) Throughput of workflow execution including a specific percentage of failures and their resumption.
  – (M3.3) Average latency per tuple in regular execution.
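A minimal sketch of how M3.1 and M3.3 could be computed from per-tuple entry/exit timestamps; the record layout is a hypothetical choice, since the benchmark does not prescribe one.

# Hypothetical computation of M3.1 (throughput) and M3.3 (average
# per-tuple latency) from per-tuple (entry_time, exit_time) stamps.

def throughput_and_latency(stamps):
    """stamps: list of (entry_time, exit_time) pairs in seconds."""
    if not stamps:
        return 0.0, 0.0
    start = min(entry for entry, _ in stamps)
    end = max(exit_ for _, exit_ in stamps)
    throughput = len(stamps) / (end - start)  # tuples per second (M3.1)
    latency = sum(exit_ - entry for entry, exit_ in stamps) / len(stamps)  # M3.3
    return throughput, latency

tp, lat = throughput_and_latency([(0.0, 0.4), (0.1, 0.5), (0.2, 0.9)])
print(f"M3.1 = {tp:.1f} tuples/s, M3.3 = {lat:.2f} s")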
19
Measures

• Overheads (a memory-sampling sketch follows below):
  – (M4.1) Min/Max/Avg timeline of memory consumed by the ETL process at the source system.
  – (M4.2) Time needed to complete the processing of a certain number of OLTP transactions in the presence (as opposed to the absence) of ETL software at the source, in regular source operation.
  – (M4.3) The same as M4.2, but in the case of source failure, where ETL tasks must also be performed on the recovered data.
  – (M4.4) Min/Max/Avg timeline of memory consumed by the ETL process at the warehouse system.
  – (M4.5) (active warehousing) Time needed to complete the processing of a certain number of decision support queries in the presence (as opposed to the absence) of ETL software at the warehouse, in regular operation.
  – (M4.6) The same as M4.5, but in the case of any (source or warehouse) failure, where ETL tasks must also be performed at the warehouse side.
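For M4.1/M4.4, one plausible (not prescribed) way to obtain the Min/Max/Avg memory timeline is to sample the ETL process's resident set size; the sketch assumes the third-party psutil package and a known process id.

# Sketch for M4.1/M4.4: sample an ETL process's resident memory to build
# a Min/Max/Avg timeline. Assumes the `psutil` package and a known pid.
import time
import psutil

def memory_timeline(pid, duration_s=10.0, interval_s=0.5):
    proc = psutil.Process(pid)
    samples = []
    deadline = time.time() + duration_s
    while time.time() < deadline:
        samples.append(proc.memory_info().rss)  # resident set size, in bytes
        time.sleep(interval_s)
    return min(samples), max(samples), sum(samples) / len(samples)

# mn, mx, avg = memory_timeline(etl_pid)  # etl_pid: pid of the ETL process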
20
Goal of this work
• Covered:
  – The important parameters to be tuned in an experiment & the appropriate values for them
  – The appropriate measures to be measured during an experiment
• Pending:
  – The fundamental families of activities performed in an ETL scenario
  – The frequent ways in which activities and recordsets interconnect in an ETL scenario
21
Roadmap
• Motivation• Goals & parameters of the benchmark• Micro-macro view of an ETL workflow• Open issues
22
Formal treatment
• We model an ETL workflow as a directed acyclic graph G(V,E).
  – Each node v ∈ V is either an activity a or a recordset r.
  – An edge (a,b) ∈ E denotes that b receives data from node a for further processing.
23
Formal treatment

• Intra-node constraints
  – Each recordset r is a pair (r.name, r.schema), with the schema being a finite list of attribute names.
  – Each activity a is a tuple (N, I, O, S, A). N is the activity’s name. I is a finite set of input schemata. O is a finite set of output schemata. S is a declarative description of the relationship of its output schema with its input schemata in an appropriate language (without delving into algorithmic or implementation issues). A is the algorithm chosen for the activity’s execution.
• Inter-node constraints (see the sketch below)
  – The data consumer of a recordset cannot be another recordset. Still, more than one consumer is allowed for recordsets.
  – Each activity must have at least one provider, either another activity or a recordset. When an activity has more than one data provider, these providers can be other activities or activities combined with recordsets.
  – Feedback of data is not allowed; i.e., the data consumer of an activity cannot be the same activity.
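To make the constraints concrete, here is a minimal Python sketch (our own illustration, not part of the paper) that models the graph and validates the inter-node constraints.

# Minimal sketch of the workflow model G(V,E) with the inter-node
# constraints above. Names are our own, for illustration only.

RECORDSET, ACTIVITY = "recordset", "activity"

def check_constraints(kinds, edges):
    """kinds: node -> RECORDSET|ACTIVITY; edges: (provider, consumer) pairs."""
    providers = {n: [] for n in kinds}
    for a, b in edges:
        if kinds[a] == RECORDSET and kinds[b] == RECORDSET:
            raise ValueError(f"recordset {a} cannot feed recordset {b}")
        if a == b:
            raise ValueError(f"no feedback: {a} cannot feed itself")
        providers[b].append(a)
    for n, kind in kinds.items():
        if kind == ACTIVITY and not providers[n]:
            raise ValueError(f"activity {n} has no provider")

kinds = {"R": RECORDSET, "sigma": ACTIVITY, "V": RECORDSET}
check_constraints(kinds, [("R", "sigma"), ("sigma", "V")])  # passes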
24
Resulting problem…
• Due to the intra-node characteristics, we need a “taxonomy” for ETL activities (micro-level)
• Due to the infinite possibilities of connecting nodes (activities and recordsets), we need a set of “design patterns” as abstractions of what frequently encountered ETL graphs look like (macro-level)
25
Micro level
• Micro level: derive a set of fundamental classes into which frequently encountered activities can be classified
• Why a taxonomy of ETL activities?
  – Impossible to predict every possible script / algorithm / operator / …
  – No algebra available right now
• Necessary not only for the benchmark, but also, orthogonally, for other tasks (e.g., optimization, statistics, etc.)
27
Logical classification
• Row-level operations, which are locally applied to a single row.
• Router operations, which locally decide, for each row, which of the many (output) destinations it should be sent to.
• Unary Grouper operations, which transform a set of rows to a single row.
• Unary Holistic operations, which perform a transformation to the entire data set. These are usually blocking operations.
• Binary or N-ary operations, which combine many inputs into one output.
28
Micro level
[Figure: the final classification combines physical-level characteristics (blocking, semi-blocking, non-blocking) with the number of inputs (unary, binary/N-ary) and the logical classes (row-level, router, grouper).]
29
Micro-level

                 Unary                N-ary
  Blocking       Holistic, Grouper    Holistic, Grouper
  Semi-blocking  Grouper              Grouper
  Non-blocking   Row-level, Router
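The two-dimensional classification above lends itself to a small lookup structure. The sketch below is our own encoding of the table; the example operators in the comments are illustrative guesses, not taken from the talk.

# Our own encoding of the micro-level classification table.
from enum import Enum

class Arity(Enum):
    UNARY = "unary"
    N_ARY = "n-ary"

class Phase(Enum):  # physical-level characteristic
    BLOCKING = "blocking"
    SEMI_BLOCKING = "semi-blocking"
    NON_BLOCKING = "non-blocking"

MICRO_LEVEL = {
    (Phase.BLOCKING, Arity.UNARY): {"holistic", "grouper"},      # e.g., sort
    (Phase.BLOCKING, Arity.N_ARY): {"holistic", "grouper"},
    (Phase.SEMI_BLOCKING, Arity.UNARY): {"grouper"},
    (Phase.SEMI_BLOCKING, Arity.N_ARY): {"grouper"},
    (Phase.NON_BLOCKING, Arity.UNARY): {"row-level", "router"},  # e.g., filter
}

print(MICRO_LEVEL[(Phase.NON_BLOCKING, Arity.UNARY)])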
30
Macro level
• Even harder!
• How to derive a set of typical structural patterns for an ETL scenario?
  – Top down: delve into the fundamental constituents of such a scenario
  – Bottom up: explore scenarios and try to abstract common parts
• We did a little bit of both, and derived a fundamental pattern of structure
31
Butterflies to the rescue!

A butterfly is an ETL workflow that consists of three distinct components:
• Body: a central, detailed point of persistence (a fact or dimension table) that is populated with the data produced by the left wing.
• Left wing: sources, activities and intermediate results; performs extraction, cleaning and transformation, and loads the data to the body.
• Right wing: materialized views, reports, spreadsheets, as well as the activities that populate them, to support reporting and analysis.
32
Butterflies to the rescue!

[Figure: an example butterfly. Sources R and S (100,000 rows each) pass through σA<600 (activity 1: sel1=0.6, p1=0.003) and σA>300 (activity 2: sel2=0.1, p2=0.004), are joined on A=A (activity 3: sel3=0.2, p3=0.001) into the body V; the right wing derives W via γA,B (activity 4: sel4=0.2, p4=0.001) and Z via γA (activity 5: sel5=0.5, p5=0.005). Expected row counts are propagated in the sketch below.]
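Reading each selectivity as the ratio of an activity's output size to the combined size of its inputs (our own simplifying assumption; the slide does not fix this semantics), the annotations yield expected row counts:

# Propagating expected row counts through the example butterfly,
# assuming selectivity = |output| / (sum of |inputs|). This reading of
# the annotations is our own simplification.
r = s = 100_000
after_sel1 = 0.6 * r                      # sigma A<600 on R -> 60,000
after_sel2 = 0.1 * s                      # sigma A>300 on S -> 10,000
body_v = 0.2 * (after_sel1 + after_sel2)  # join on A=A      -> 14,000
view_w = 0.2 * body_v                     # gamma A,B        ->  2,800
view_z = 0.5 * body_v                     # gamma A          ->  7,000
print(body_v, view_w, view_z)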
34
Butterfly classes
• Butterflies constitute a fundamental pattern of reference
  – Line
  – Balanced butterfly
• Left-winged variants (heavy on the ETL part)
  – Primary flow
  – Wishbone
  – Tree
• Right-winged variants (heavy on the “reporting” part)
  – Fork
• Irregular variants (a toy classifier sketch follows below)
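As a toy illustration (entirely ours, with arbitrary thresholds), a workflow could be binned into these classes from the relative weight of its wings:

# Toy classifier (our own, arbitrary thresholds): bin a butterfly by the
# relative weight of its wings, measured in number of nodes.
def butterfly_class(left_nodes, right_nodes):
    ratio = left_nodes / (left_nodes + right_nodes)
    if ratio > 0.7:
        return "left-winged"   # e.g., primary flow, wishbone, tree
    if ratio < 0.3:
        return "right-winged"  # e.g., fork
    return "balanced"          # or a line, when both wings are thin chains

print(butterfly_class(12, 2))  # left-winged
print(butterfly_class(5, 6))   # balanced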
35
Line

[Figure: a Line workflow over LineItem.D+: NotNull (Part Key, Order Key, Supp Key), Currency (Ext. Price, Discount, Tax) and Derive Fnc (Profit), then a router on Line Status (= ‘Delivered’ loads DW.L and DW.D+; <> ‘Delivered’ continues); aggregations Sum(Profit), Sum(Ext. Price) grouped by (Part Key, Line Status) and by (Part Key), plus a Return Status = True filter, populate View01–View03.]
36
Wishbone

[Figure: a Wishbone workflow: Customer.new (Cnew) passes through NotNull(CustKey), SK(custkey) and PhoneFormat, with rejections sent to an Error file; Customer.old (Cold) passes through SK(custkey) and PhoneFormat; a New - Old comparison merges the two flows into CUSTOMER / C.D+.]
37
Primary Flow

[Figure: a Primary Flow workflow: Order rows pass through surrogate-key assignments SK(OrderStatus), SK(CustKey) and SK(Order Key), using lookups L-status, L-cust and L-ord, followed by an SCD Type 1 insert-or-update into DW.Order.]
38
Tree

[Figure: a Tree workflow: three source flows PSi.new vs. PSi.old (i = 1, 2, 3), each sorted on (Pkey, SuppKey) and diffed (DIFF I,U), pass through SK(custkey) and are merged by a UNION SORTED (Pkey); the result is inserted or updated into DW.PS and PS.D, and Sum(AvailQty) grouped by (Part Key) populates View04.]
39
Fork

[Figure: a Fork workflow: LineItem.D+ passes through SK(Part Key, Order Key, Supp Key), Date Key (Ship Date, Receipt Date), Currency (Ext. Price, Discount, Tax) and Derive Fnc (Profit) into DW.Lineitem and D+.LI; the right wing computes Sum(Profit)/Sum(Ext. Price), Avg(Profit)/Avg(Ext. Price) and Sum(Profit)/Avg(Discount) aggregations grouped by (Part Key, Line Status) and (Part Key, Supp Key), populating View05–View08.]
40
Balanced Butterfly

[Figure: a Balanced Butterfly: Part Supp+ passes through SK(PartKey, SuppKey) and Derive Fnc (Total Cost); Supplier+ passes through SK(SuppKey) and Phone Format (Phone); a join on SP_Supp Key = S_Supp Key loads DW.PS, D+.PS, DW.S and D+.S; the right wing computes Max/Min (Supply Cost) grouped by (Nation Key, Part Key) and by (Part Key), and Sum (Total Cost) grouped by (Nation Key, Supp Key) and by (Supp Key), populating View09–View13. Full recomputation for all right-wing views.]
41
Slowly Changing Dimension of Type II

[Figure: Part.new and Part.old are compared by DIFF (1); the resulting flows pass through SK+RK(partkey) (2, 3) and populate PART, the “most recent PART” snapshot (4), P.D++ and P.DU (5).]

Not a typical butterfly…
42
Roadmap
• Motivation
• Goals & parameters of the benchmark
• Micro-macro view of an ETL workflow
• Open issues
43
Open Issues

• Data sizes
  – The numbers given by TPC-H can be a valid point of reference for data warehouse contents.
  – Important: the fraction of source data over the warehouse contents. Values in the range 0.01 to 0.7?
• Selectivity of the left wing of a butterfly
  – Values between 0.5 and 1.2?
• Failure rates
  – Range of 10⁻⁴ to 10⁻²?
• Workflow size
  – Although we provide scenarios of small scale, medium-size and large-size scenarios are also needed.
• A careful assignment of values based on a large number of real-world case studies (which we do not possess) should be a topic for a full-fledged benchmark. (A sampling sketch over these ranges follows below.)
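Purely to illustrate how these tentative ranges could drive experiment generation, the sketch below samples one configuration per run; the ranges are the slide's open questions, not settled values.

# Sampling one experimental configuration from the tentative ranges
# above. The ranges are open questions on the slide, not settled values.
import random

def sample_config(rng=random):
    return {
        "source_fraction": rng.uniform(0.01, 0.7),       # source data / DW contents
        "left_wing_selectivity": rng.uniform(0.5, 1.2),  # >1 models row generation
        "failure_rate": 10 ** rng.uniform(-4, -2),       # log-uniform in [1e-4, 1e-2]
        "workflow_scale": rng.choice(["small", "medium", "large"]),
    }

print(sample_config())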
44
Open Issues
• Nature of data (now: relational; also: XML, multimedia, …)
• Active vs. off-line modus operandi
• Auxiliary structures and processes (indexes, backup & maintenance scenarios, etc.)
• Parallelism and partitioning
45
Message to you
• We need a commonly agreed benchmark that realistically reflects real-world ETL scenarios
• Butterflies to the rescue
• Feedback from the industry is necessary:
  – workflow complexity
  – the frequencies of typically encountered ETL operations
46
Thank you!
All pictures are imported from MS Clipart
47
Auxiliary slides
49
Line

[Figure: Line example: R → σA>300 (1) → σB>400 (2) → V → γA,B (3) → W.]
50
Wishbone

[Figure: Wishbone example: R → σA<600 (1) and S → γB (2) are joined on A=A (3) into the body V; γA (4) populates W.]
51
Primary Flow

[Figure: Primary Flow example: R → σA>300 (1) → σB>400 (2), joined with S on A=A (3), loads W.]
52
Tree

[Figure: Tree example: R → σA>50 (1) → X and S → σA<50 (2) → Y converge through two joins on A=A (3, 4), via intermediate recordsets Z and Q, into DW.]
53
Fork

[Figure: Fork example: R → σA>300 (1) into the body V; the right wing computes γA (2) → Z, γA,B (3) → W and γB (4) → Y.]
54
Right Deep Hierarchy

[Figure: Right Deep Hierarchy example: R → σA>300 (1) feeds a right-deep cascade of aggregations and selections: γA (2) → T, σA=500 (3) → V and U, γC (4) → Z, γB,C (5) → W, γA (6) → Q.]
55
[Figure: the generic butterfly shape: a left wing of nodes n1 … nk feeding the body V, which feeds a right wing of nodes n1 … nm.]
56
Partitioning & parallelism

[Figure: execution strategies for a flow R → A → B → S over tuples t1 … tn: (a) sequential execution, one activity at a time; (b) pipelining, with tuples streaming through A and B concurrently; (c) partitioning, with the flow cloned into k parallel instances R1 → A1 → B1 → S1 … Rk → Ak → Bk → Sk, each handling a fraction of the tuples. Strategy (c) is sketched below.]
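A compact sketch of strategy (c), our own illustration rather than anything prescribed by the talk: hash-partition the input and run one clone of a row-level flow per partition in a process pool.

# Sketch of strategy (c): hash-partition tuples and run one clone of a
# row-level flow A -> B per partition. Our own illustration.
from multiprocessing import Pool

def flow(row):       # a stand-in row-level flow: A then B
    a = row * 2      # activity A (illustrative)
    return a + 1     # activity B (illustrative)

def run_partitioned(rows, k=4):
    partitions = [[] for _ in range(k)]
    for row in rows:
        partitions[hash(row) % k].append(row)  # partitioner
    with Pool(k) as pool:
        return [pool.map(flow, part) for part in partitions]

if __name__ == "__main__":
    print(run_partitioned(range(10), k=2))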
57
Statistics per pattern
Legend:
•N+M (left wing + right wing)
•INCR: incremental maintenance
•I/U: insert and/or update
•FULL: full recomputation
58