Towards a Benchmark for ETL Workflows

Panos Vassiliadis, Anastasios Karagiannis, Vasiliki Tziovara
Univ. of Ioannina

Alkis Simitsis
Almaden Research Center
2
Goal of this work
• The goal of this paper is to contribute towards a benchmark that can serve as an experimental testbed for assessing ETL methods and tools with respect to their basic behavioral properties (measures) over a broad range of ETL workflows.
3
Roadmap
• Motivation
• Goals & parameters of the benchmark
• Micro-macro view of an ETL workflow
• Open issues
4
Roadmap
• Motivation
• Goals & parameters of the benchmark
• Micro-macro view of an ETL workflow
• Open issues
5
ETL workflows

[Figure: an example ETL workflow. Source tables S1_PARTSUPP and S2_PARTSUPP are shipped via FTP1/FTP2 into the data staging area (DSA) as DS.PS_NEW1/DS.PS_NEW2; DIFF1/DIFF2 compare them with DS.PS_OLD1/DS.PS_OLD2 on PKEY to isolate new rows; surrogate keys are assigned by Add_SPK1/Add_SPK2 (SUPPKEY=1/2, via lookups SK1/SK2 over DS.PSi.PKEY and LOOKUP_PS.SKEY); dates and quantities are checked (A2EDate NotNULL, AddDate with DATE=SYSDATE, CheckQTY with QTY>0) and dollar costs are converted to euros ($2€), with rejected rows logged at each check; the united (U) flow loads DW.PARTSUPP, from which Aggregate1 (PKEY, DAY, MIN(COST)) populates view V1 and Aggregate2 (PKEY, MONTH, AVG(COST)) populates view V2, using the TIME table on DW.PARTSUPP.DATE/DAY. Overall flow: Sources → DSA → DW.]
6
Off-line vs. Active ETL
• Off-line mode: the workflow is executed during a specific time window of some hours (typically at night), when the systems are not servicing their end-users. – Due to the low load of both the source systems and the
warehouse, both the refreshment of data and any other administrative activities (cleanups, auditing, etc) are easier to complete.
• Active mode: the sources continuously (with high frequency) try to send new data to the warehouse. – Not necessarily instantly; rather, small groups of data are
collected and sent to the warehouse for further processing.
• Differences of the two modes:– frequency of the workflow execution– load incurred to all the systems involved.
7
Problem & Motivation
• When experimenting with ETL workflows:
  – what are the important problem parameters & what are realistic values for them?
  – what test suites should we use?
• We have faced the problem twice so far:
  – Logical optimization of the ETL process (transposition of activities to speed up the workflow – ICDE05, TKDE05)
  – Physical optimization of the ETL process (which physical operators to use for optimal execution of the process – DOLAP07)
8
Logical Optimization
[Figure: two equivalent logical plans for a workflow over PARTS1 and PARTS2, combining A2E(DATE), $2€($COST), NN(€COST), a union (U), a selection σ(€COST) and an aggregation γ(DATE); in the optimized plan the selection σ(€COST) is split (8_1, 8_2) and pushed below the union, closer to the sources.]

• Can we push selection early enough? (a safety-check sketch follows below)
• Can we aggregate before $2€ takes place?
• How about naming conflicts?
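To make the first question concrete, here is a minimal sketch (our own illustration, not the ICDE05/TKDE05 algorithm) of the standard safety check behind pushing a selection below another activity: the swap is legal only if the selection's predicate uses no attribute that the earlier activity generates or modifies.

# Minimal sketch of a selection push-down check, assuming each activity
# declares which attributes it generates or modifies. Illustrative only.

def can_push_selection_before(selection_attrs, activity_modified_attrs):
    """A selection may be swapped before an activity only if its
    predicate uses no attribute the activity generates or modifies."""
    return selection_attrs.isdisjoint(activity_modified_attrs)

# sigma(EUR_COST) cannot jump over $2EUR, which produces EUR_COST ...
print(can_push_selection_before({"EUR_COST"}, {"EUR_COST"}))   # False
# ... but it can jump over A2E(DATE), which only touches DATE.
print(can_push_selection_before({"EUR_COST"}, {"DATE"}))       # True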
9
Problem & Motivation
• Existing standards are insufficient:
  – TPC-H
  – TPC-DS
• Practical cases are not publishable
• We resort to devising our own ad-hoc test scenarios:
  – either through a specific set of scenarios
  – or through a scenario generator
10
Roadmap
• Motivation
• Goals & parameters of the benchmark
• Micro-macro view of an ETL workflow
• Open issues
11
Goal of this work
• We are interested in understanding:
  – The fundamental families of activities performed in an ETL scenario
  – The frequent ways in which activities and recordsets interconnect in an ETL scenario
  – The important parameters to be tuned in an experiment & the appropriate values for them
  – The appropriate measures to be measured during an experiment
12
The goal is not …
• To compare existing systems…
• To provide specialized performance measures for very specific tasks in the overall process…
• To exhaustively enumerate all the possible alternatives for specific operations…
• …
13
Fundamental goals of any ETL workflow
• Effectiveness
  – Does the workflow execution reach the maximum possible (or, at least, the minimum tolerable) level of data freshness, completeness and consistency in the warehouse within the necessary time (or resource) constraints?
  – Is the workflow execution resilient to occasional failures?
• Efficiency
  – How fast is the workflow executed?
  – What resource overheads does the workflow incur at the source and the warehouse side?
14
Effectiveness
• Bring the data from the sources to the DW s.t.:
  – Periodically, as many of them as possible are loaded.
  – The loaded data are as fresh as possible.
  – At “checkpoint” timestamps, all data have been loaded in the warehouse.
  – Data respect both database and business rules.
• Potential problems:
  – Recovery from failures.
  – Missing changes at the source.
  – Invalid records due to failed transactions.
15
Efficiency
• Data exchange between sources and warehouse is done as fast as possible
• Minimal overhead is incurred:
  – at the source systems.
  – at the data warehouse.
16
Problem parameters
• the size of the workflow (i.e., the number of nodes contained in the graph),
• the structure of the workflow (i.e., the variation of the nature of the involved nodes and their interconnection as the workflow graph),
• the size of the input data originating from the sources,
• the overall selectivity of the workflow, based on the selectivities of its activities,
• the probabilities of failure (a sketch gathering these parameters follows below).
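As an illustration only, these parameters can be gathered into a single configuration record; the field names and example values below are our own, with ranges echoing the Open Issues discussion later in the talk.

# Hypothetical configuration record for one benchmark experiment.
# Field names are illustrative, not part of the benchmark proposal.
from dataclasses import dataclass

@dataclass
class ETLBenchmarkConfig:
    workflow_size: int          # number of nodes in the workflow graph
    structure: str              # e.g., "line", "wishbone", "tree", "fork", "balanced"
    input_size: int             # rows originating from the sources
    overall_selectivity: float  # combination of the activities' selectivities
    failure_probability: float  # per-activity probability of failure

config = ETLBenchmarkConfig(
    workflow_size=20,
    structure="balanced",
    input_size=100_000,
    overall_selectivity=0.6,
    failure_probability=1e-3,
)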
17
Measures
• Measures for data freshness and data consistency:
  – (M1.1) Percentage of data that violate business rules.
  – (M1.2) Percentage of data that should be present at their appropriate warehouse targets, but are not.
• Measures for the resilience to failures:
  – (M2) Percentage of successfully resumed workflow executions.
18
Measures
• Speed of the overall process (a measurement sketch follows below):
  – (M3.1) Throughput of regular workflow execution (this may also be measured as total completion time).
  – (M3.2) Throughput of workflow execution including a specific percentage of failures and their resumption.
  – (M3.3) Average latency per tuple in regular execution.
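A minimal sketch of how M3.1 and M3.3 could be computed from per-tuple entry/exit timestamps; the record layout is a hypothetical choice, since the benchmark does not prescribe one.

# Hypothetical computation of M3.1 (throughput) and M3.3 (average
# per-tuple latency) from per-tuple (entry_time, exit_time) stamps.

def throughput_and_latency(stamps):
    """stamps: list of (entry_time, exit_time) pairs in seconds."""
    if not stamps:
        return 0.0, 0.0
    start = min(entry for entry, _ in stamps)
    end = max(exit_ for _, exit_ in stamps)
    throughput = len(stamps) / (end - start)  # tuples per second (M3.1)
    latency = sum(exit_ - entry for entry, exit_ in stamps) / len(stamps)  # M3.3
    return throughput, latency

tp, lat = throughput_and_latency([(0.0, 0.4), (0.1, 0.5), (0.2, 0.9)])
print(f"M3.1 = {tp:.1f} tuples/s, M3.3 = {lat:.2f} s")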
19
Measures

• Overheads (a memory-sampling sketch follows below):
  – (M4.1) Min/Max/Avg timeline of memory consumed by the ETL process at the source system.
  – (M4.2) Time needed to complete the processing of a certain number of OLTP transactions in the presence (as opposed to the absence) of ETL software at the source, in regular source operation.
  – (M4.3) The same as M4.2, but in the case of source failure, where ETL tasks must also be performed on the recovered data.
  – (M4.4) Min/Max/Avg timeline of memory consumed by the ETL process at the warehouse system.
  – (M4.5) (active warehousing) Time needed to complete the processing of a certain number of decision support queries in the presence (as opposed to the absence) of ETL software at the warehouse, in regular operation.
  – (M4.6) The same as M4.5, but in the case of any (source or warehouse) failure, where ETL tasks must also be performed at the warehouse side.
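For M4.1/M4.4, one plausible (not prescribed) way to obtain the Min/Max/Avg memory timeline is to sample the ETL process's resident set size; the sketch assumes the third-party psutil package and a known process id.

# Sketch for M4.1/M4.4: sample an ETL process's resident memory to build
# a Min/Max/Avg timeline. Assumes the `psutil` package and a known pid.
import time
import psutil

def memory_timeline(pid, duration_s=10.0, interval_s=0.5):
    proc = psutil.Process(pid)
    samples = []
    deadline = time.time() + duration_s
    while time.time() < deadline:
        samples.append(proc.memory_info().rss)  # resident set size, in bytes
        time.sleep(interval_s)
    return min(samples), max(samples), sum(samples) / len(samples)

# mn, mx, avg = memory_timeline(etl_pid)  # etl_pid: pid of the ETL process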
20
Goal of this work
• Covered:
  – The important parameters to be tuned in an experiment & the appropriate values for them
  – The appropriate measures to be measured during an experiment
• Pending:
  – The fundamental families of activities performed in an ETL scenario
  – The frequent ways in which activities and recordsets interconnect in an ETL scenario
21
Roadmap
• Motivation• Goals & parameters of the benchmark• Micro-macro view of an ETL workflow• Open issues
22
Formal treatment
• We model an ETL workflow as a directed acyclic graph G(V,E).
  – Each node v ∈ V is either an activity a or a recordset r.
  – An edge (a,b) ∈ E denotes that b receives data from node a for further processing.
23
Formal treatment

• Intra-node constraints
  – Each recordset r is a pair (r.name, r.schema), with the schema being a finite list of attribute names.
  – Each activity a is a tuple (N, I, O, S, A). N is the activity’s name. I is a finite set of input schemata. O is a finite set of output schemata. S is a declarative description of the relationship of its output schema with its input schemata in an appropriate language (without delving into algorithmic or implementation issues). A is the algorithm chosen for the activity’s execution.
• Inter-node constraints (see the sketch below)
  – The data consumer of a recordset cannot be another recordset. Still, more than one consumer is allowed for recordsets.
  – Each activity must have at least one provider, either another activity or a recordset. When an activity has more than one data provider, these providers can be other activities or activities combined with recordsets.
  – Feedback of data is not allowed; i.e., the data consumer of an activity cannot be the same activity.
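To make the constraints concrete, here is a minimal Python sketch (our own illustration, not part of the paper) that models the graph and validates the inter-node constraints.

# Minimal sketch of the workflow model G(V,E) with the inter-node
# constraints above. Names are our own, for illustration only.

RECORDSET, ACTIVITY = "recordset", "activity"

def check_constraints(kinds, edges):
    """kinds: node -> RECORDSET|ACTIVITY; edges: (provider, consumer) pairs."""
    providers = {n: [] for n in kinds}
    for a, b in edges:
        if kinds[a] == RECORDSET and kinds[b] == RECORDSET:
            raise ValueError(f"recordset {a} cannot feed recordset {b}")
        if a == b:
            raise ValueError(f"no feedback: {a} cannot feed itself")
        providers[b].append(a)
    for n, kind in kinds.items():
        if kind == ACTIVITY and not providers[n]:
            raise ValueError(f"activity {n} has no provider")

kinds = {"R": RECORDSET, "sigma": ACTIVITY, "V": RECORDSET}
check_constraints(kinds, [("R", "sigma"), ("sigma", "V")])  # passes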
24
Resulting problem…
• Due to the intra-node characteristics, we need a “taxonomy” for ETL activities (micro-level)
• Due to the infinite possibilities of connecting nodes (activities and recordsets), we need a set of “design patterns” as abstractions of what frequently encountered ETL graphs look like (macro-level)
25
Micro level
• Micro level: derive a set of fundamental classes into which frequently encountered activities can be classified
• Why a taxonomy of ETL activities?
  – Impossible to predict every possible script / algorithm / operator / …
  – No algebra available right now
• Necessary not only for the benchmark, but also, orthogonally, for other tasks (e.g., optimization, statistics, etc.)
27
Logical classification
• Row-level operations, which are locally applied to a single row.
• Router operations, which locally decide, for each row, which of the many (output) destinations it should be sent to.
• Unary Grouper operations, which transform a set of rows to a single row.
• Unary Holistic operations, which perform a transformation to the entire data set. These are usually blocking operations.
• Binary or N-ary operations, which combine many inputs into one output.
28
Micro level
[Figure: the final classification combines physical-level characteristics (blocking, semi-blocking, non-blocking) with the number of inputs (unary, binary/N-ary) and the logical classes (row-level, router, grouper).]
29
Micro-level

                 Unary                N-ary
  Blocking       Holistic, Grouper    Holistic, Grouper
  Semi-blocking  Grouper              Grouper
  Non-blocking   Row-level, Router
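The two-dimensional classification above lends itself to a small lookup structure. The sketch below is our own encoding of the table; the example operators in the comments are illustrative guesses, not taken from the talk.

# Our own encoding of the micro-level classification table.
from enum import Enum

class Arity(Enum):
    UNARY = "unary"
    N_ARY = "n-ary"

class Phase(Enum):  # physical-level characteristic
    BLOCKING = "blocking"
    SEMI_BLOCKING = "semi-blocking"
    NON_BLOCKING = "non-blocking"

MICRO_LEVEL = {
    (Phase.BLOCKING, Arity.UNARY): {"holistic", "grouper"},      # e.g., sort
    (Phase.BLOCKING, Arity.N_ARY): {"holistic", "grouper"},
    (Phase.SEMI_BLOCKING, Arity.UNARY): {"grouper"},
    (Phase.SEMI_BLOCKING, Arity.N_ARY): {"grouper"},
    (Phase.NON_BLOCKING, Arity.UNARY): {"row-level", "router"},  # e.g., filter
}

print(MICRO_LEVEL[(Phase.NON_BLOCKING, Arity.UNARY)])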
30
Macro level
• Even harder!
• How to derive a set of typical structural patterns for an ETL scenario?
  – Top down: delve into the fundamental constituents of such a scenario
  – Bottom up: explore scenarios and try to abstract common parts
• We did a little bit of both, and derived a fundamental pattern of structure
31
Butterflies to the rescue!

A butterfly is an ETL workflow that consists of three distinct components:
• Body: a central, detailed point of persistence (a fact or dimension table) that is populated with the data produced by the left wing.
• Left wing: sources, activities and intermediate results; performs extraction, cleaning and transformation, and loads the data to the body.
• Right wing: materialized views, reports, spreadsheets, as well as the activities that populate them, to support reporting and analysis.
32
Butterflies to the rescue!

[Figure: an example butterfly. Sources R and S (100,000 rows each) pass through σA<600 (activity 1: sel1=0.6, p1=0.003) and σA>300 (activity 2: sel2=0.1, p2=0.004), are joined on A=A (activity 3: sel3=0.2, p3=0.001) into the body V; the right wing derives W via γA,B (activity 4: sel4=0.2, p4=0.001) and Z via γA (activity 5: sel5=0.5, p5=0.005). Expected row counts are propagated in the sketch below.]
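Reading each selectivity as the ratio of an activity's output size to the combined size of its inputs (our own simplifying assumption; the slide does not fix this semantics), the annotations yield expected row counts:

# Propagating expected row counts through the example butterfly,
# assuming selectivity = |output| / (sum of |inputs|). This reading of
# the annotations is our own simplification.
r = s = 100_000
after_sel1 = 0.6 * r                      # sigma A<600 on R -> 60,000
after_sel2 = 0.1 * s                      # sigma A>300 on S -> 10,000
body_v = 0.2 * (after_sel1 + after_sel2)  # join on A=A      -> 14,000
view_w = 0.2 * body_v                     # gamma A,B        ->  2,800
view_z = 0.5 * body_v                     # gamma A          ->  7,000
print(body_v, view_w, view_z)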
34
Butterfly classes
• Butterflies constitute a fundamental pattern of reference
  – Line
  – Balanced butterfly
• Left-winged variants (heavy on the ETL part)
  – Primary flow
  – Wishbone
  – Tree
• Right-winged variants (heavy on the “reporting” part)
  – Fork
• Irregular variants (a toy classifier sketch follows below)
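As a toy illustration (entirely ours, with arbitrary thresholds), a workflow could be binned into these classes from the relative weight of its wings:

# Toy classifier (our own, arbitrary thresholds): bin a butterfly by the
# relative weight of its wings, measured in number of nodes.
def butterfly_class(left_nodes, right_nodes):
    ratio = left_nodes / (left_nodes + right_nodes)
    if ratio > 0.7:
        return "left-winged"   # e.g., primary flow, wishbone, tree
    if ratio < 0.3:
        return "right-winged"  # e.g., fork
    return "balanced"          # or a line, when both wings are thin chains

print(butterfly_class(12, 2))  # left-winged
print(butterfly_class(5, 6))   # balanced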
35
Line

[Figure: a Line workflow over LineItem.D+: NotNull (Part Key, Order Key, Supp Key), Currency (Ext. Price, Discount, Tax) and Derive Fnc (Profit), then a router on Line Status (= ‘Delivered’ loads DW.L and DW.D+; <> ‘Delivered’ continues); aggregations Sum(Profit), Sum(Ext. Price) grouped by (Part Key, Line Status) and by (Part Key), plus a Return Status = True filter, populate View01–View03.]
36
Wishbone

[Figure: a Wishbone workflow: Customer.new (Cnew) passes through NotNull(CustKey), SK(custkey) and PhoneFormat, with rejections sent to an Error file; Customer.old (Cold) passes through SK(custkey) and PhoneFormat; a New - Old comparison merges the two flows into CUSTOMER / C.D+.]
37
Primary Flow

[Figure: a Primary Flow workflow: Order rows pass through surrogate-key assignments SK(OrderStatus), SK(CustKey) and SK(Order Key), using lookups L-status, L-cust and L-ord, followed by an SCD Type 1 insert-or-update into DW.Order.]
38
Tree

[Figure: a Tree workflow: three source flows PSi.new vs. PSi.old (i = 1, 2, 3), each sorted on (Pkey, SuppKey) and diffed (DIFF I,U), pass through SK(custkey) and are merged by a UNION SORTED (Pkey); the result is inserted or updated into DW.PS and PS.D, and Sum(AvailQty) grouped by (Part Key) populates View04.]
39
Fork

[Figure: a Fork workflow: LineItem.D+ passes through SK(Part Key, Order Key, Supp Key), Date Key (Ship Date, Receipt Date), Currency (Ext. Price, Discount, Tax) and Derive Fnc (Profit) into DW.Lineitem and D+.LI; the right wing computes Sum(Profit)/Sum(Ext. Price), Avg(Profit)/Avg(Ext. Price) and Sum(Profit)/Avg(Discount) aggregations grouped by (Part Key, Line Status) and (Part Key, Supp Key), populating View05–View08.]
40
Balanced Butterfly

[Figure: a Balanced Butterfly: Part Supp+ passes through SK(PartKey, SuppKey) and Derive Fnc (Total Cost); Supplier+ passes through SK(SuppKey) and Phone Format (Phone); a join on SP_Supp Key = S_Supp Key loads DW.PS, D+.PS, DW.S and D+.S; the right wing computes Max/Min (Supply Cost) grouped by (Nation Key, Part Key) and by (Part Key), and Sum (Total Cost) grouped by (Nation Key, Supp Key) and by (Supp Key), populating View09–View13. Full recomputation for all right-wing views.]
41
Slowly Changing Dimension of Type II

[Figure: Part.new and Part.old are compared by DIFF (1); the resulting flows pass through SK+RK(partkey) (2, 3) and populate PART, the “most recent PART” snapshot (4), P.D++ and P.DU (5).]

Not a typical butterfly…
42
Roadmap
• Motivation
• Goals & parameters of the benchmark
• Micro-macro view of an ETL workflow
• Open issues
43
Open Issues

• Data sizes
  – The numbers given by TPC-H can be a valid point of reference for data warehouse contents.
  – Important: the fraction of source data over the warehouse contents. Values in the range 0.01 to 0.7?
• Selectivity of the left wing of a butterfly
  – Values between 0.5 and 1.2?
• Failure rates
  – Range of 10⁻⁴ to 10⁻²?
• Workflow size
  – Although we provide scenarios of small scale, medium-size and large-size scenarios are also needed.
• A careful assignment of values based on a large number of real-world case studies (which we do not possess) should be a topic for a full-fledged benchmark. (A sampling sketch over these ranges follows below.)
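Purely to illustrate how these tentative ranges could drive experiment generation, the sketch below samples one configuration per run; the ranges are the slide's open questions, not settled values.

# Sampling one experimental configuration from the tentative ranges
# above. The ranges are open questions on the slide, not settled values.
import random

def sample_config(rng=random):
    return {
        "source_fraction": rng.uniform(0.01, 0.7),       # source data / DW contents
        "left_wing_selectivity": rng.uniform(0.5, 1.2),  # >1 models row generation
        "failure_rate": 10 ** rng.uniform(-4, -2),       # log-uniform in [1e-4, 1e-2]
        "workflow_scale": rng.choice(["small", "medium", "large"]),
    }

print(sample_config())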
44
Open Issues
• Nature of data (now: relational; also: XML, multimedia, …)
• Active vs. off-line modus operandi
• Auxiliary structures and processes (indexes, backup & maintenance scenarios, etc.)
• Parallelism and partitioning
45
Message to you
• We need a commonly agreed benchmark that realistically reflects real-world ETL scenarios
• Butterflies to the rescue
• Feedback from the industry is necessary:
  – workflow complexity
  – the frequencies of typically encountered ETL operations
46
Thank you!
All pictures are imported from MS Clipart
47
Auxiliary slides
49
Line

[Figure: Line example: R → σA>300 (1) → σB>400 (2) → V → γA,B (3) → W.]
50
Wishbone

[Figure: Wishbone example: R → σA<600 (1) and S → γB (2) are joined on A=A (3) into the body V; γA (4) populates W.]
51
Primary Flow

[Figure: Primary Flow example: R → σA>300 (1) → σB>400 (2), joined with S on A=A (3), loads W.]
52
Tree

[Figure: Tree example: R → σA>50 (1) → X and S → σA<50 (2) → Y converge through two joins on A=A (3, 4), via intermediate recordsets Z and Q, into DW.]
53
Fork

[Figure: Fork example: R → σA>300 (1) into the body V; the right wing computes γA (2) → Z, γA,B (3) → W and γB (4) → Y.]
54
Right Deep Hierarchy

[Figure: Right Deep Hierarchy example: R → σA>300 (1) feeds a right-deep cascade of aggregations and selections: γA (2) → T, σA=500 (3) → V and U, γC (4) → Z, γB,C (5) → W, γA (6) → Q.]
55
[Figure: the generic butterfly shape: a left wing of nodes n1 … nk feeding the body V, which feeds a right wing of nodes n1 … nm.]
56
Partitioning & parallelism

[Figure: execution strategies for a flow R → A → B → S over tuples t1 … tn: (a) sequential execution, one activity at a time; (b) pipelining, with tuples streaming through A and B concurrently; (c) partitioning, with the flow cloned into k parallel instances R1 → A1 → B1 → S1 … Rk → Ak → Bk → Sk, each handling a fraction of the tuples. Strategy (c) is sketched below.]
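A compact sketch of strategy (c), our own illustration rather than anything prescribed by the talk: hash-partition the input and run one clone of a row-level flow per partition in a process pool.

# Sketch of strategy (c): hash-partition tuples and run one clone of a
# row-level flow A -> B per partition. Our own illustration.
from multiprocessing import Pool

def flow(row):       # a stand-in row-level flow: A then B
    a = row * 2      # activity A (illustrative)
    return a + 1     # activity B (illustrative)

def run_partitioned(rows, k=4):
    partitions = [[] for _ in range(k)]
    for row in rows:
        partitions[hash(row) % k].append(row)  # partitioner
    with Pool(k) as pool:
        return [pool.map(flow, part) for part in partitions]

if __name__ == "__main__":
    print(run_partitioned(range(10), k=2))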
57
Statistics per pattern
Legend:
•N+M (left wing + right wing)
•INCR: incremental maintenance
•I/U: insert and/or update
•FULL: full recomputation
58