a taxonomy of etl activities panos vassiliadis 1, alkis simitsis 2, eftychia baikousi 1 (1)...
TRANSCRIPT
A Taxonomy of ETL Activities
Panos Vassiliadis1, Alkis Simitsis2, Eftychia Baikousi1
(1) University of Ioannina{pvassil,ebaikou}@cs.uoi.gr
(2) HP [email protected]
Outline
Motivation
Normal Form
Taxonomy
Design Patterns
Conclusions
DOLAP'09, Hong Kong, Nov. 2009 2
Outline
Motivation
Normal Form
Taxonomy
Design Patterns
Conclusions
DOLAP'09, Hong Kong, Nov. 2009 3
DOLAP'09, Hong Kong, Nov. 2009 4
A developer’s problemUlysses is a developer in the IT department of his
company & he has a problem: he was assigned the task to populate a data mart of the organization
with source data =>
He needs to construct an ETL workflow
DOLAP'09, Hong Kong, Nov. 2009 5
A developer’s problem
Ulysses has a variety of solutions:Off-the-self ETL toolPL/SQL scripts Embedded SQL in Java, C/C++, …Python, Perl, … scripts…
DOLAP'09, Hong Kong, Nov. 2009 6
Ulysses also knows that…
Relational DBMSs perform query optimization using a simple model of data (the relational one)
a (relational) algebra both at the logical and physical levels
Software engineers construct their (O-O) code based onbest practices, or patterns, that allow the efficient construction, maintenance, testing, and execution of code
quality measures that allow the assessment of the quality of the constructed software
DOLAP'09, Hong Kong, Nov. 2009 7
Ulysses also knows that…
Relational DBMSs perform query optimization using a simple model of data (the relational one)
a (relational) algebra both at the logical and physical levels
Software engineers construct their (O-O) code based onbest practices, or patterns, that allow the efficient construction, maintenance, testing, and execution of code
quality measures that allow the assessment of the quality of the constructed software
Wouldn’t it be nice if we had similar tools for ETL flows?
DOLAP'09, Hong Kong, Nov. 2009 8
Can we help Ulysses? Yes, we can!
We can have a uniform model
to cover a large variety of ETL activities with complicated semantics and provide the framework for the optimization of ETL flows
a taxonomy of ETL activities based on their internals and the elementary components that characterize them with respect to their properties
a set of archetype patterns that combine ETL activities in a practical manner that can be used by an ETL engine for optimization, tuning, parallelization, …
Outline
Motivation
Normal Form
Taxonomy
Design Patterns
Conclusions
DOLAP'09, Hong Kong, Nov. 2009 9
DOLAP'09, Hong Kong, Nov. 2009 10
Rationale & Summary of Contribution
Is there a uniform, reference way to describe ETL activities?
Can we classify both simple activities and complicated activities in an meaningful manner?
Is it possible to come up with a simple “algebra” of operations for ETL activities and workflows?
DOLAP'09, Hong Kong, Nov. 2009 11
Rationale & Summary of Contribution
We introduce a uniform modeling notation it works for all kinds of ETL activities and ETL flows
both, experience and practice verifies that
it is based on a mapping to structure of matter Particle ↔ Simple operation Atom ↔ Simple activity Molecule ↔ Complex activity Compound ↔ ETL flow
typically, atoms and molecules can be represented as scriptsa particle is a certain function call inside the scriptETL tools provide atoms and not molecules
DOLAP'09, Hong Kong, Nov. 2009 12
Particle
Output schema
Input schema
A1
A2
A3
A4
A5
A6
A4
A5
A6
A1
A2
A3
A7New field: A7
Rejected fields
ETL Atoms
Merging & Routing Particle
Output schema 1Input schema 1
A1
A2
A3
A4
A5
A6
A4
A5
A6A1
A2
A3
Rejected fields
Output schema 2
A1
A2
A3
Output schema 3
A1
A2
A3
A4
A5
A6
A4
A5
A6
Input schema 2
A1
A2
A3
A4
A5
A6
A7
A8
ETL atom
an ETL activity that performs exactly one job
it involves exactly one ETL particle
it can be unary or n-ary
it may contain multiple output schemata
it may project out input attributes and generate new ones at the output
DOLAP'09, Hong Kong, Nov. 2009 13
ETL Molecules
Merging Particle
Output schema 1Input schema 1
A1
A2
A3
A4
A5
A6
A4
A5
A6A1
A2
A3
Rejected fields
Output schema 2
A1
A2
A3
Output schema 3
A1
A2
A3
A4
A5
A6
A4
A5
A6
Input schema 2
A1
A2
A3
A4
A5
A6
A7
A8
Routing Particle
Transformation Particles
ETL molecule
it transfers data from input to output schemata
but there is a linear workflow of particles in between these two groups of schemata.
Benefits
A normal form is a single, reference way to theoretically discuss properties of ETL activitiesSeveral results can be expressed under this formalism
logical optimization is a prominent caseeasy to show that results around swapping of activities (ICDE’05, TKDE’05) fit nicely in this paper
e.g., when can we change the particle order within the same molecule? what kind of (inter-/intra-) molecule operations can we do?
compose molecules (to be able to form flows)split or merge molecules (i.e., SW modules)exchange particles between molecules or swap molecules (for optimization reasons)
DOLAP'09, Hong Kong, Nov. 2009 14
Benefits
Example inter-/intra-molecule operations
Coupling
Activity swapping
DOLAP'09, Hong Kong, Nov. 2009 15
ba
b
b
b
...
a
abba
aIb,k
ObbOa,j
Outline
Motivation
Normal Form
Taxonomy
Design Patterns
Conclusions
DOLAP'09, Hong Kong, Nov. 2009 16
DOLAP'09, Hong Kong, Nov. 2009 17
Taxonomy of ETL activities
The taxonomy is based on interrelationship of input-output w.r.t.
schemata
processing of incoming tuples
Exploit taxonomical characteristics forlogical & physical optimization of a workflow
parallelization of activities
any other tuning for improving efficiency, resilience to failures
Taxonomy of ETL activities
DOLAP'09, Hong Kong, Nov. 2009 18
blocking
semi-blocking
non-blocking
# inputs
Final classificatio
n
Physical-level characteristics
unary
N-ary
Other
DOLAP'09, Hong Kong, Nov. 2009 19
Applicability
DOLAP'09, Hong Kong, Nov. 2009 20
Benefits
Example uses of taxonomy as heuristics for optimization1:1
locally processedeasily exchangeable with one another (under conditions)easily parallelizable
N:1aggregators are blocking / semi-blocking at bestorder or hash sensitivehard to parallelize
1:Ntuple producersorder generators
Binary, primary flowssemi-blockingcan be treated as local if right input can be hashed in main memory
Routersstrong possibility to parallelize and enhance pipelining
Restmostly blocking, very hard to include in pipelining
Outline
Motivation
Normal Form
Taxonomy
Design Patterns
Conclusions
DOLAP'09, Hong Kong, Nov. 2009 21
DOLAP'09, Hong Kong, Nov. 2009 22
Design Patterns
So far, we have given Ulysses the database related foundations
a notation / model to express ETL molecules (activities) in a uniform way
a taxonomy for the particles (elementary operations) that compose molecules
a set of cases where we can show that optimizations can be decided on the grounds of the above
Now we discuss design related foundations
DOLAP'09, Hong Kong, Nov. 2009 23
Design patterns for ETL
Apart from a normal form for ETL activities, we strive to establish the benefits of having normal forms for combinations of activities
We have introduced butterflies and a set of ETL design patterns (TPC-TC’09) having ‘appropriate’ combinations of activities based on their taxonomical characteristics
These patterns have been proved to be beneficiary for many purposes (design, efficiency, benchmarking)
DOLAP'09, Hong Kong, Nov. 2009 24
γA,Β
4
V
S
R Z
γA
5
W
100000
100000
sel1=0.6
sel2=0.1
sel4=0.2
p1=0.003
p2=0.004
p4=0.001
σA<600
1
σA>300
2
sel5=0.5
p5=0.005
wv A=A
3
sel3=0.2
p3=0.001
Butterflies for ETL
DOLAP'09, Hong Kong, Nov. 2009 25
(a) Line (b) Wishbone
(c) Primary Flow (d) Tree
(e) Flat Hierarchy - Fork (f) Deep Hierarchy
Butterfly classes
V WγA,B
3
R σA>300
1
σB>400
2
γΑ
4
V
S
R
W
σA<600
1
γΒ
2
wv A=A
3
WR σA>300
1
σB>400
2
wv A=A
3
SQ
R σA>50 X
S σA<50
2
Y
wv A=A
3
DW
Z
wv A=A
4
1
γA
2
V
Z
σA>300
1
γA,Β
3
W
γB4
Y
R
σA=500
3
V
U
γB,C
5
W
TγA
2
γA
6
Q
R σA>300
1γC
4
Z
DOLAP'09, Hong Kong, Nov. 2009 26
Benefits
Design Guidelinesallows designers to communicate designs (document, maintain, …) better
Performancethe synthesis of ETL scenarios should be performed in a manner that allows the underlying engine to exploit them.
Beneficiary for the purpose of benchmarking ETL as well
DOLAP'09, Hong Kong, Nov. 2009 27
Benefits in efficiency
In V. Tziovara’s MSc @ UoI (also DOLAP’07) it is shown that we can introduce sorters intentionally to a workflow to exploit common sorting of data for sequences of order-dependent operatorsDifferent butterflies can benefit from different policies
DOLAP'09, Hong Kong, Nov. 2009 28
Benefits in scheduling
Different scheduling policies can be used for different butterflies (A. Karagiannis MSc, UoI)
Outline
Motivation
Normal Form
Taxonomy
Design Patterns
Conclusions
DOLAP'09, Hong Kong, Nov. 2009 29
DOLAP'09, Hong Kong, Nov. 2009 30
Conclusions
We have presented and discusseda uniform model to cover a large variety of ETL activities with complicated semantics and provide the framework for the optimization of ETL flowsa taxonomy for the internals of ETL activities that allows us to treat them as black-boxes in optimization, scheduling, parallelization, …a set of design patterns for combinations of ETL activities in archetype patterns
Future work on optimizationoptimization techniquesalternative physical implementations for the same logical operationsresource allocation and scheduling policiespossibilities for parallelization