a taxonomy of etl activities
DESCRIPTION
A Taxonomy of ETL Activities. Panos Vassiliadis 1 , Alkis Simitsis 2 , Eftychia Baikousi 1 (1) University of Ioannina {pvassil,ebaikou}@cs.uoi . gr (2) HP Labs [email protected]. Outline. Motivation Normal Form Taxonomy Design Patterns Conclusions. Outline. Motivation Normal Form - PowerPoint PPT PresentationTRANSCRIPT
![Page 1: A Taxonomy of ETL Activities](https://reader036.vdocuments.us/reader036/viewer/2022062315/56815a9c550346895dc81bec/html5/thumbnails/1.jpg)
A Taxonomy of ETL Activities
Panos Vassiliadis1, Alkis Simitsis2, Eftychia Baikousi1
(1) University of Ioannina{pvassil,ebaikou}@cs.uoi.gr
(2) HP [email protected]
![Page 2: A Taxonomy of ETL Activities](https://reader036.vdocuments.us/reader036/viewer/2022062315/56815a9c550346895dc81bec/html5/thumbnails/2.jpg)
Outline
MotivationNormal FormTaxonomyDesign PatternsConclusions
DOLAP'09, Hong Kong, Nov. 2009 2
![Page 3: A Taxonomy of ETL Activities](https://reader036.vdocuments.us/reader036/viewer/2022062315/56815a9c550346895dc81bec/html5/thumbnails/3.jpg)
Outline
MotivationNormal FormTaxonomyDesign PatternsConclusions
DOLAP'09, Hong Kong, Nov. 2009 3
![Page 4: A Taxonomy of ETL Activities](https://reader036.vdocuments.us/reader036/viewer/2022062315/56815a9c550346895dc81bec/html5/thumbnails/4.jpg)
DOLAP'09, Hong Kong, Nov. 2009 4
A developer’s problemUlysses is a developer in the IT department of his
company & he has a problem: he was assigned the task to populate a data mart of the organization
with source data =>
He needs to construct an ETL workflow
![Page 5: A Taxonomy of ETL Activities](https://reader036.vdocuments.us/reader036/viewer/2022062315/56815a9c550346895dc81bec/html5/thumbnails/5.jpg)
DOLAP'09, Hong Kong, Nov. 2009 5
A developer’s problem
Ulysses has a variety of solutions:Off-the-self ETL toolPL/SQL scripts Embedded SQL in Java, C/C++, …Python, Perl, … scripts…
![Page 6: A Taxonomy of ETL Activities](https://reader036.vdocuments.us/reader036/viewer/2022062315/56815a9c550346895dc81bec/html5/thumbnails/6.jpg)
DOLAP'09, Hong Kong, Nov. 2009 6
Ulysses also knows that…
Relational DBMSs perform query optimization using a simple model of data (the relational one)a (relational) algebra both at the logical and physical levels
Software engineers construct their (O-O) code based onbest practices, or patterns, that allow the efficient construction, maintenance, testing, and execution of codequality measures that allow the assessment of the quality of the constructed software
![Page 7: A Taxonomy of ETL Activities](https://reader036.vdocuments.us/reader036/viewer/2022062315/56815a9c550346895dc81bec/html5/thumbnails/7.jpg)
DOLAP'09, Hong Kong, Nov. 2009 7
Ulysses also knows that…
Relational DBMSs perform query optimization using a simple model of data (the relational one)a (relational) algebra both at the logical and physical levels
Software engineers construct their (O-O) code based onbest practices, or patterns, that allow the efficient construction, maintenance, testing, and execution of codequality measures that allow the assessment of the quality of the constructed software
Wouldn’t it be nice if we had similar tools for ETL flows?
![Page 8: A Taxonomy of ETL Activities](https://reader036.vdocuments.us/reader036/viewer/2022062315/56815a9c550346895dc81bec/html5/thumbnails/8.jpg)
DOLAP'09, Hong Kong, Nov. 2009 8
Can we help Ulysses? Yes, we can!
We can have a uniform model
to cover a large variety of ETL activities with complicated semantics and provide the framework for the optimization of ETL flows
a taxonomy of ETL activities based on their internals and the elementary components that characterize them with respect to their properties
a set of archetype patterns that combine ETL activities in a practical manner that can be used by an ETL engine for optimization, tuning, parallelization, …
![Page 9: A Taxonomy of ETL Activities](https://reader036.vdocuments.us/reader036/viewer/2022062315/56815a9c550346895dc81bec/html5/thumbnails/9.jpg)
Outline
MotivationNormal FormTaxonomyDesign PatternsConclusions
DOLAP'09, Hong Kong, Nov. 2009 9
![Page 10: A Taxonomy of ETL Activities](https://reader036.vdocuments.us/reader036/viewer/2022062315/56815a9c550346895dc81bec/html5/thumbnails/10.jpg)
DOLAP'09, Hong Kong, Nov. 2009 10
Rationale & Summary of Contribution
Is there a uniform, reference way to describe ETL activities?Can we classify both simple activities and complicated activities in an meaningful manner?Is it possible to come up with a simple “algebra” of operations for ETL activities and workflows?
![Page 11: A Taxonomy of ETL Activities](https://reader036.vdocuments.us/reader036/viewer/2022062315/56815a9c550346895dc81bec/html5/thumbnails/11.jpg)
DOLAP'09, Hong Kong, Nov. 2009 11
Rationale & Summary of Contribution
We introduce a uniform modeling notation it works for all kinds of ETL activities and ETL flows
both, experience and practice verifies thatit is based on a mapping to structure of matter
Particle ↔ Simple operation Atom ↔ Simple activity Molecule ↔ Complex activity Compound ↔ ETL flow
typically, atoms and molecules can be represented as scriptsa particle is a certain function call inside the scriptETL tools provide atoms and not molecules
![Page 12: A Taxonomy of ETL Activities](https://reader036.vdocuments.us/reader036/viewer/2022062315/56815a9c550346895dc81bec/html5/thumbnails/12.jpg)
DOLAP'09, Hong Kong, Nov. 2009 12
Particle
Output schema
Input schema
A1
A2
A3
A4
A5
A6
A4
A5
A6
A1
A2
A3
A7New field: A7
Rejected fields
ETL Atoms
Merging & Routing Particle
Output schema 1Input schema 1
A1
A2
A3
A4
A5
A6
A4
A5
A6A1
A2
A3
Rejected fields
Output schema 2
A1
A2
A3
Output schema 3
A1
A2
A3
A4
A5
A6
A4
A5
A6
Input schema 2
A1
A2
A3
A4
A5
A6
A7
A8
ETL atom an ETL activity that performs exactly one jobit involves exactly one ETL particle it can be unary or n-ary it may contain multiple output schematait may project out input attributes and generate
new ones at the output
![Page 13: A Taxonomy of ETL Activities](https://reader036.vdocuments.us/reader036/viewer/2022062315/56815a9c550346895dc81bec/html5/thumbnails/13.jpg)
DOLAP'09, Hong Kong, Nov. 2009 13
ETL Molecules
Merging Particle
Output schema 1Input schema 1
A1
A2
A3
A4
A5
A6
A4
A5
A6A1
A2
A3
Rejected fields
Output schema 2
A1
A2
A3
Output schema 3
A1
A2
A3
A4
A5
A6
A4
A5
A6
Input schema 2
A1
A2
A3
A4
A5
A6
A7
A8
Routing Particle
Transformation Particles
ETL molecule it transfers data from
input to output schemata
but there is a linear workflow of particles in between these two groups of schemata.
![Page 14: A Taxonomy of ETL Activities](https://reader036.vdocuments.us/reader036/viewer/2022062315/56815a9c550346895dc81bec/html5/thumbnails/14.jpg)
Benefits
A normal form is a single, reference way to theoretically discuss properties of ETL activitiesSeveral results can be expressed under this formalism
logical optimization is a prominent caseeasy to show that results around swapping of activities (ICDE’05, TKDE’05) fit nicely in this paper
e.g., when can we change the particle order within the same molecule? what kind of (inter-/intra-) molecule operations can we do?
compose molecules (to be able to form flows)split or merge molecules (i.e., SW modules)exchange particles between molecules or swap molecules (for optimization reasons)
DOLAP'09, Hong Kong, Nov. 2009 14
![Page 15: A Taxonomy of ETL Activities](https://reader036.vdocuments.us/reader036/viewer/2022062315/56815a9c550346895dc81bec/html5/thumbnails/15.jpg)
Benefits
Example inter-/intra-molecule operations
Coupling
Activity swapping
DOLAP'09, Hong Kong, Nov. 2009 15
ba
b
b
b
...
a
abba
aIb,k
ObbOa,j
![Page 16: A Taxonomy of ETL Activities](https://reader036.vdocuments.us/reader036/viewer/2022062315/56815a9c550346895dc81bec/html5/thumbnails/16.jpg)
Outline
MotivationNormal FormTaxonomyDesign PatternsConclusions
DOLAP'09, Hong Kong, Nov. 2009 16
![Page 17: A Taxonomy of ETL Activities](https://reader036.vdocuments.us/reader036/viewer/2022062315/56815a9c550346895dc81bec/html5/thumbnails/17.jpg)
DOLAP'09, Hong Kong, Nov. 2009 17
Taxonomy of ETL activities
The taxonomy is based on interrelationship of input-output w.r.t.
schemata processing of incoming tuples
Exploit taxonomical characteristics forlogical & physical optimization of a workflowparallelization of activitiesany other tuning for improving efficiency, resilience to failures
![Page 18: A Taxonomy of ETL Activities](https://reader036.vdocuments.us/reader036/viewer/2022062315/56815a9c550346895dc81bec/html5/thumbnails/18.jpg)
Taxonomy of ETL activities
DOLAP'09, Hong Kong, Nov. 2009 18
blocking
semi-blocking
non-blocking
# inputs
Final classificatio
n
Physical-level characteristics
unaryN-ary
Other
![Page 19: A Taxonomy of ETL Activities](https://reader036.vdocuments.us/reader036/viewer/2022062315/56815a9c550346895dc81bec/html5/thumbnails/19.jpg)
DOLAP'09, Hong Kong, Nov. 2009 19
Applicability
![Page 20: A Taxonomy of ETL Activities](https://reader036.vdocuments.us/reader036/viewer/2022062315/56815a9c550346895dc81bec/html5/thumbnails/20.jpg)
DOLAP'09, Hong Kong, Nov. 2009 20
Benefits
Example uses of taxonomy as heuristics for optimization1:1
locally processedeasily exchangeable with one another (under conditions)easily parallelizable
N:1aggregators are blocking / semi-blocking at bestorder or hash sensitivehard to parallelize
1:Ntuple producersorder generators
Binary, primary flowssemi-blockingcan be treated as local if right input can be hashed in main memory
Routersstrong possibility to parallelize and enhance pipelining
Restmostly blocking, very hard to include in pipelining
![Page 21: A Taxonomy of ETL Activities](https://reader036.vdocuments.us/reader036/viewer/2022062315/56815a9c550346895dc81bec/html5/thumbnails/21.jpg)
Outline
MotivationNormal FormTaxonomyDesign PatternsConclusions
DOLAP'09, Hong Kong, Nov. 2009 21
![Page 22: A Taxonomy of ETL Activities](https://reader036.vdocuments.us/reader036/viewer/2022062315/56815a9c550346895dc81bec/html5/thumbnails/22.jpg)
DOLAP'09, Hong Kong, Nov. 2009 22
Design Patterns
So far, we have given Ulysses the database related foundations
a notation / model to express ETL molecules (activities) in a uniform waya taxonomy for the particles (elementary operations) that compose moleculesa set of cases where we can show that optimizations can be decided on the grounds of the above
Now we discuss design related foundations
![Page 23: A Taxonomy of ETL Activities](https://reader036.vdocuments.us/reader036/viewer/2022062315/56815a9c550346895dc81bec/html5/thumbnails/23.jpg)
DOLAP'09, Hong Kong, Nov. 2009 23
Design patterns for ETL
Apart from a normal form for ETL activities, we strive to establish the benefits of having normal forms for combinations of activitiesWe have introduced butterflies and a set of ETL design patterns (TPC-TC’09) having ‘appropriate’ combinations of activities based on their taxonomical characteristicsThese patterns have been proved to be beneficiary for many purposes (design, efficiency, benchmarking)
![Page 24: A Taxonomy of ETL Activities](https://reader036.vdocuments.us/reader036/viewer/2022062315/56815a9c550346895dc81bec/html5/thumbnails/24.jpg)
DOLAP'09, Hong Kong, Nov. 2009 24
γA,Β
4
V
S
R Z
γA
5
W
100000
100000
sel1=0.6
sel2=0.1
sel4=0.2
p1=0.003
p2=0.004
p4=0.001
σA<600
1
σA>300
2
sel5=0.5p5=0.005
wv A=A
3
sel3=0.2p3=0.001
Butterflies for ETL
![Page 25: A Taxonomy of ETL Activities](https://reader036.vdocuments.us/reader036/viewer/2022062315/56815a9c550346895dc81bec/html5/thumbnails/25.jpg)
DOLAP'09, Hong Kong, Nov. 2009 25
(a) Line (b) Wishbone
(c) Primary Flow (d) Tree
(e) Flat Hierarchy - Fork (f) Deep Hierarchy
Butterfly classes
V WγA,B
3
R σA>300
1
σB>400
2γΑ
4
V
S
R
W
σA<600
1
γΒ
2
wv A=A
3
WR σA>300
1
σB>400
2
wv A=A
3
SQ
R σA>50 X
S σA<50
2
Y
wv A=A
3
DW
Z
wv A=A
4
1
γA2
V
Z
σA>300
1
γA,Β3
W
γB4
Y
RσA=500
3
V
U
γB,C
5
W
TγA
2
γA
6
Q
R σA>300
1γC
4
Z
![Page 26: A Taxonomy of ETL Activities](https://reader036.vdocuments.us/reader036/viewer/2022062315/56815a9c550346895dc81bec/html5/thumbnails/26.jpg)
DOLAP'09, Hong Kong, Nov. 2009 26
Benefits
Design Guidelinesallows designers to communicate designs (document, maintain, …) better
Performancethe synthesis of ETL scenarios should be performed in a manner that allows the underlying engine to exploit them.
Beneficiary for the purpose of benchmarking ETL as well
![Page 27: A Taxonomy of ETL Activities](https://reader036.vdocuments.us/reader036/viewer/2022062315/56815a9c550346895dc81bec/html5/thumbnails/27.jpg)
DOLAP'09, Hong Kong, Nov. 2009 27
Benefits in efficiency
In V. Tziovara’s MSc @ UoI (also DOLAP’07) it is shown that we can introduce sorters intentionally to a workflow to exploit common sorting of data for sequences of order-dependent operatorsDifferent butterflies can benefit from different policies
![Page 28: A Taxonomy of ETL Activities](https://reader036.vdocuments.us/reader036/viewer/2022062315/56815a9c550346895dc81bec/html5/thumbnails/28.jpg)
DOLAP'09, Hong Kong, Nov. 2009 28
Benefits in scheduling
Different scheduling policies can be used for different butterflies (A. Karagiannis MSc, UoI)
![Page 29: A Taxonomy of ETL Activities](https://reader036.vdocuments.us/reader036/viewer/2022062315/56815a9c550346895dc81bec/html5/thumbnails/29.jpg)
Outline
MotivationNormal FormTaxonomyDesign PatternsConclusions
DOLAP'09, Hong Kong, Nov. 2009 29
![Page 30: A Taxonomy of ETL Activities](https://reader036.vdocuments.us/reader036/viewer/2022062315/56815a9c550346895dc81bec/html5/thumbnails/30.jpg)
DOLAP'09, Hong Kong, Nov. 2009 30
Conclusions
We have presented and discusseda uniform model to cover a large variety of ETL activities with complicated semantics and provide the framework for the optimization of ETL flowsa taxonomy for the internals of ETL activities that allows us to treat them as black-boxes in optimization, scheduling, parallelization, …a set of design patterns for combinations of ETL activities in archetype patterns
Future work on optimizationoptimization techniquesalternative physical implementations for the same logical operationsresource allocation and scheduling policiespossibilities for parallelization