formal specification and optimization of etl workflows
TRANSCRIPT
ETL Workflows: From Formal Specification to Optimization
Timos SellisNational Technical University of Athens
(joint work with Alkis Simitsis, IBM Almaden Research Center, Panos Vassiliadis, Univ. of Ioannina and Dimitris Skoutas, NTUA)
Timos Sellis - ADBIS 2007 2
Data Warehouse Environment
Timos Sellis - ADBIS 2007 3
Extract-Transform-Load (ETL)
Sources
Extract Transform & Clean
DW
Load
DSA
Timos Sellis - ADBIS 2007 4
Motivation ETL and Data Cleaning tools cost
30% of effort and expenses in the budget of the DW 55% of the total costs of DW runtime 80% of the development time in a DW project
ETL market: a multi-million market IBM paid $1.1 billion dollars for Ascential
ETL tools in the market software packages in-house development
No standard, no common model most vendors implement a core set of operators and provide GUI to
create a data flow
Timos Sellis - ADBIS 2007 5
Problems The key factors underlying the main problems of
ETL processes are: vastness of the data volumes quality problems, since data is not always clean and has
to be cleansed performance, since the whole process has to take place
within a specific time window evolution of the sources and the data warehouse can
eventually lead, even to daily maintenance operations
Timos Sellis - ADBIS 2007 7
Outline Conceptual Model Logical Model
Architecture Graph Operational Semantics Quality Metrics (will not touch this)
Conceptual to Logical Optimization of ETL Workflows Research Challenges
Timos Sellis - ADBIS 2007 8
Outline Conceptual Model Logical Model
Architecture Graph Operational Semantics Quality Metrics (will not touch this)
Conceptual to Logical Optimization of ETL Workflows Research Challenges
Timos Sellis - ADBIS 2007 9
Conceptual Model Design goals
we need a simple model, sufficient for the early stages of the data warehouse design
we need convenient means of communication among different groups of people involved in the DW project (e.g., dba’s and business managers)
we need to be able to model what our sources “talk” about
Semantic goals we need richer semantics to
describe sources reason about them
Timos Sellis - ADBIS 2007 10
Conceptual Model
S1.PARTSUPPS2.PARTSUPP DW.PARTSUPP
American toEuropean Date$ € Date = SysDate()
SK
f
SUM(S2.Cost)
SUM(S2.Qty)
S2.Date
S2.PKey
S2.SuppKey
f NN
f
SK
PK
Cost
Qty
Date
PKey
SuppKey
Cost
Qty
PKey
SuppKey
Cost
Date
Qty
PKey
SuppKey
Annual PartSupp’s
RecentPartSupp’s
{XOR}
Due to acccuracy and small size
(< update window)
Necessary providers:S1 and S2
{Duration<4h}
U
γ
PS1
Qty
PKey
SuppKey
PS2
Cost
PKey
SuppKey
+
PS1.Pkey+=PS2.PKeyPS1.SuppKey+=PS2.SuppKey
PS1.Dept+=PS2.Dept
Dept
Dept
Dept
Timos Sellis - ADBIS 2007 11
Conceptual Model Key idea
an ontology-based approach to facilitate the conceptual design An ontology
is a “formal, explicit specification of a shared conceptualization” describes the knowledge in a domain in terms of classes, properties,
and relationships between them machine processable formal semantics reasoning mechanisms
Method construct an appropriate application vocabulary annotate the data sources generate the application ontology apply reasoning techniques to select relevant sources and identify
required transformations
Timos Sellis - ADBIS 2007 12
Paris/Rome/Athens
Conceptual Model
Above 200 Euros
prices in Dollars
From 500 to 1500 Euros
software/hardware
only software products
Paris/Rome/Athens
Rome/Athens
software
hardware
Timos Sellis - ADBIS 2007 13
Conceptual Model
Application vocabularyVC = {product, store}
VPproduct = {pid, pName, quantity, price, type, storage}
VPstore = {sid, sName, city, street}
VFpid = {source_pid, dw_pid}
VFsid = {source_sid, dw_sid}
VFprice = {dollars, euros}
VTtype = {software, hardware}
VTcity = {paris, rome, athens}
Datastore mappings
Datastore annotation
Timos Sellis - ADBIS 2007 14
Conceptual Model The class hierarchy Definition for class
DS1_Products
Timos Sellis - ADBIS 2007 15
Conceptual Model Reasoning on the mappings s(location, city, street)c(street, number, street)
Timos Sellis - ADBIS 2007 16
Conceptual Model Reasoning on the definitions
f(pid, Source_Pid, DW_Pid)σ(price, From_500_To_1500)nn(quantity)σ(type, {software})
Timos Sellis - ADBIS 2007 17
Outline Conceptual Model Logical Model
Architecture Graph Operational Semantics Quality Metrics (will not touch this)
Conceptual to Logical Optimization of ETL Workflows Research Challenges
Timos Sellis - ADBIS 2007 18
Logical Model
AddAttr2
SOURCE
SK2
DS.PS1.PKEY, LOOKUP_PS.SKEY,
SOURCE
$2€
COST DATE
DS.PS1 SK1
DS.PS2.PKEY, LOOKUP_PS.SKEY,
SOURCECOST DATE=SYSDATE
AddDate U
DS.PS2
Log
rejected
Log
rejected
A2EDate
NotNULL
Log
rejected
Log
rejected
Log
rejected
DIFF2
DS.PSNEW2.PKEY,DS.PSOLD2.PKEY
DS.PSNEW2
DS.PSOLD2
DW.PARTS Aggregate1
PKEY, DAYMIN(COST)
Aggregate2
PKEY, MONTHAVG(COST)
V2
V1
TIME
DW.PARTSUPP.DATE,DAY
FTP2S2.PARTS
S1.PARTS FTP1
DS.PSNEW1
DIFF1
DS.PSOLD1
DS.PSNEW1.PKEY,DS.PSOLD1.PKEY
SourcesDW
DSA
Log
rejected
γ
QTY,COST
PK
PKEY,DATE
Log
rejected
Timos Sellis - ADBIS 2007 19
Logical Model Main question:
What information should we put inside a metadata repository to be able to answer questions like: what is the architecture of my DW back stage? which attributes/tables are involved in the population of
an attribute? what part of the scenario is affected if we delete an
attribute?
Timos Sellis - ADBIS 2007 20
Architecture Graph
$2€
COST DATE
DS.PS1 SK1
DS.PS2.PKEY, LOOKUP_PS.SKEY,
SOURCECOST DATE=SYSDATE
AddDate U
Log
rejected
A2EDate
NotNULL
Log
rejected
Log
rejected
Log
rejected
DIFF2
DS.PSNEW2.PKEY,DS.PSOLD2.PKEY
DS.PSNEW2
DS.PSOLD2
DW.PARTS Aggregate1
PKEY, DAYMIN(COST)
Aggregate2
PKEY, MONTHAVG(COST)
V2
V1
TIME
DW.PARTSUPP.DATE,DAY
FTP2S2.PARTS
S1.PARTS FTP1
DS.PSNEW1
DIFF1
DS.PSOLD1
DS.PSNEW1.PKEY,DS.PSOLD1.PKEY
SourcesDW
DSA
γ
QTY,COST
PK
PKEY,DATE
Log
rejected
AddAttr2
SOURCE
SK2
DS.PS1.PKEY, LOOKUP_PS.SKEY,
SOURCE
DS.PS2
Log
rejected
Log
rejected
Timos Sellis - ADBIS 2007 21
Architecture GraphExample
Add_Attr2 SK2DS.PS2 TMP_STOR.PARTSUPP
PKEY
QTY
COST
PKEY
QTY
COST
PKEY
QTY
COST
DATE
PKEY
QTY
COST
DATE
PKEY
QTY
COST
DATE
PKEY
QTY
COST
DATEDATE DATE
SOURCE SOURCE SOURCE SOURCE
SKEYAddConst2
in out
1
LOOKUP2
PKEY
SOURCE
SKEY
PKEY
SOURCE
LPKEY
LSOURCE
LSKEY
IN IN INOUT OUT OUT
OUT
PAR PAR
SUPPKEY SUPPKEY SUPPKEY SUPPKEY SUPPKEYSUPPKEY
2
Timos Sellis - ADBIS 2007 22
Architecture Graph
SK2
PKEY
QTY
COST
DATE
PKEY
QTY
COST
DATE
SOURCE SOURCE
SKEY
PKEY
SOURCE
LPKEY
LSOURCE
LSKEY
IN OUT
PAR
SUPPKEY SUPPKEY
input schema
output schema
functionality schema
projected-out schema
generated schema
Add_Attr2DS.PS2 TMP_STOR.PARTSUPP
PKEY
QTY
COST
PKEY
QTY
COST
PKEY
QTY
COST
DATE
PKEY
QTY
COST
DATEDATE DATE
SOURCE SOURCE
AddConst2
in out
1
LOOKUP2
PKEY
SOURCE
SKEY
IN INOUT OUT
OUT
PAR
SUPPKEY SUPPKEY SUPPKEY SUPPKEY
Example
2
Timos Sellis - ADBIS 2007 23
Semantics We provide graph modeling techniques for several
kinds of activities: update (INS, UPD, DEL) activities aggregates rules employing negation and aliases functions
Timos Sellis - ADBIS 2007 24
Logical Model Question revisited
What information should we put inside a metadata repository to be able to answer questions like: what is the architecture of my DW back stage?
it is described as the Architecture Graph which attributes/tables are involved in the population of
an attribute? what part of the scenario is affected if we delete an
attribute? follow the appropriate path in the Architecture Graph
Timos Sellis - ADBIS 2007 25
Outline Conceptual Model Logical Model
Architecture Graph Operational Semantics Metrics
Conceptual to Logical Optimization of ETL Workflows Research Challenges
Timos Sellis - ADBIS 2007 26
Conceptual to Logical Similarities
concepts and attributes recordsets and attributes part-of and provider relationships
DW.PARTS
PKey
Cost
Qty
Date
DW.PARTS
Qty
Cost
PKey
Date
... ...
Timos Sellis - ADBIS 2007 27
Conceptual to Logical Discrepancies
PKey PKeySK
S1 DW
R01: sk.a_in1(pkey <-S1(pkey).
R02: sk.a_in2(pkey,source,skey)<-lookUp(l_pkey,source,l_skey),pkey=l_pkey, skey=l_skey, source=1.
R03: sk.a_out(pkey, skey)<-add_sk1.a_in1(pkey),add_sk1.a_in2(pkey,source,skey).
R04: S2(skey)<-sk.a_out(pkey,skey).
R01: sk.a_in1(pkey <-S1(pkey).
R02: sk.a_in2(pkey,source,skey)<-lookUp(l_pkey,source,l_skey),pkey=l_pkey, skey=l_skey, source=1.
R03: sk.a_out(pkey, skey)<-add_sk1.a_in1(pkey),add_sk1.a_in2(pkey,source,skey).
R04: S2(skey)<-sk.a_out(pkey,skey).
Operational Semantics of SK :
SK
PKeyPKey
S1
PKey
S2
SKey
LookUp
l_PKey
l_SKey
Source
SKey
PKey
SKey
Source
1
=
=
IN2
Program
SK_in1
R03
SK_in2
SK_out
R01 R04
headhead
head
R02
head
Transformationsvs.
Activities
Timos Sellis - ADBIS 2007 28
Conceptual to Logical Correctness
our methodology is a semi-automatic procedure the designer should examine, complement or change the outcome of
this methodology Optimization
the constraints in the determination of the execution order require only that the placement of activities should be semantically correct
the logical workflow produced by this methodology is not the only possible logical workflow
the final choice of an ETL workflow depends on other parameters, e.g., the quality of the designquality of the design of an ETL workflow the execution costexecution cost of an ETL workflow
Timos Sellis - ADBIS 2007 29
SK
S2.PARTSUPP DW.PARTSUPP
γ
f1
S2.Date
f2Cost
Qty
Date
PKey
SuppKey
Cost
Date
Qty
PKey
SuppKey
PK
Conceptual to Logical Execution order…
which is the proper execution order?
Timos Sellis - ADBIS 2007 30
Conceptual to Logical
γ DW.PARTSUPPf2SKS2.PART
SUPP f1 PK
Execution order…
order equivalence?
SK,f1,f2 or SK,f2,f1 or ... ?
Timos Sellis - ADBIS 2007 31
Outline Conceptual Model Logical Model
Architecture Graph Operational Semantics Metrics
Conceptual to Logical Optimization of ETL Workflows Research Challenges
Timos Sellis - ADBIS 2007 32
Optimization of ETL Workflows Common settlement
ad-hoc optimization based on the experience of the designer execute ETL workflow as it is; hopefully, the optimizer of the DBMS
would improve the performance An ETL workflow is NOT a big query
ETL is very procedural in nature, each ETL operator is a low-level operator in procedural languages like PL/SQL
Traditional query optimization techniques are not enough existence of functions
where it is allowed to push an activity before/after a function? existence of black-box activities
unknown semantics can not interfere in their interior
naming conflicts
Timos Sellis - ADBIS 2007 33
Optimization of ETL Workflows How can we improve an ETL workflow in terms of execution
time? We model the ETL processes optimization problem as a state
search problem we consider each ETL workflow as a state we construct the search space the optimal state is chosen according to our cost model’s criteria, in
order to minimize the execution time of an ETL workflow
Timos Sellis - ADBIS 2007 34
Optimization of ETL Workflows Transition from one state to the other
SWA: interchange two activities of the workflow FAC: replace homologous tasks in parallel flows with an equivalent
task over a flow to which these parallel flows converge DIS: divide tasks of a joint flow to clones applied to parallel flows
that converge towards the joint flow MER / SPL: merge / split group of activities
a2
SWA(a1,a2) SWA(a2,a1)
a2 a1
a1
FAC(ab,a1,a2) DIS(ab,a)
aab
ab
a1
a2
a,a1,a2: homologous activities
a2a1
SPL(a1+2,a1,a2)
a1+2
MER(a1+2,a1,a2)
Timos Sellis - ADBIS 2007 35
Optimization of ETL Workflows
AddAttr2
SOURCE
SK2
DS.PS1.PKEY, LOOKUP_PS.SKEY,
SOURCE
$2€
COST DATE
DS.PS1 SK1
DS.PS2.PKEY, LOOKUP_PS.SKEY,
SOURCECOST DATE=SYSDATE
AddDate U
DS.PS2
Log
rejected
Log
rejected
A2EDate
NotNULL
Log
rejected
Log
rejected
Log
rejected
DIFF2
DS.PSNEW2.PKEY,DS.PSOLD2.PKEY
DS.PSNEW2
DS.PSOLD2
DW.PARTS Aggregate1
PKEY, DAYMIN(COST)
Aggregate2
PKEY, MONTHAVG(COST)
V2
V1
TIME
DW.PARTSUPP.DATE,DAY
FTP2S2.PARTS
S1.PARTS FTP1
DS.PSNEW1
DIFF1
DS.PSOLD1
DS.PSNEW1.PKEY,DS.PSOLD1.PKEY
SourcesDW
DSA
Log
rejected
γ
QTY,COST
PK
PKEY,DATE
Log
rejected
Timos Sellis - ADBIS 2007 36
Optimization of ETL Workflows
AddAttr2
SOURCE
SK2
DS.PS1.PKEY, LOOKUP_PS.SKEY,
SOURCE
$2€
COST DATE
DS.PS1 SK1
DS.PS2.PKEY, LOOKUP_PS.SKEY,
SOURCECOST DATE=SYSDATE
AddDate
U
DS.PS2
Log
rejected
Log
rejected
A2EDate
NotNULL
Log
rejected
Log
rejected
Log
rejected
Log
rejected
γ
QTY,COST
PK
PKEY,DATE
Log
rejected
DW.PARTS
Pre-Processing Phase I: SWA’s in local groups
Phase II: FAC’s
homologousPhase III: DIS’s
Phase IV: SWA’s in local
groups
Post-Processing
Timos Sellis - ADBIS 2007 37
Optimization of ETL Workflows
AddAttr2
SOURCE
SK2
DS.PS1.PKEY, LOOKUP_PS.SKEY,
SOURCE
$2€
COST DATE
DS.PS1 SK1
DS.PS2.PKEY, LOOKUP_PS.SKEY,
SOURCECOST DATE=SYSDATE
AddDate
U
DS.PS2
Log
rejected
Log
rejected
A2EDate
NotNULL
Log
rejected
Log
rejected
Log
rejected
Log
rejected
γ
QTY,COST
PK
PKEY,DATE
Log
rejected
DW.PARTS
Pre-Processing Phase I: SWA’s in local groups
Phase II: FAC’s
homologousPhase III: DIS’s
Phase IV: SWA’s in local
groups
Post-Processing
Timos Sellis - ADBIS 2007 38
Optimization of ETL Workflows
AddAttr2
SOURCE
SK2
DS.PS1.PKEY, LOOKUP_PS.SKEY,
SOURCE
$2€
COST DATE
DS.PS1 SK1
DS.PS2.PKEY, LOOKUP_PS.SKEY,
SOURCECOST DATE=SYSDATE
AddDate
U
DS.PS2
Log
rejected
Log
rejected
A2EDate
NotNULL
Log
rejected
Log
rejected
Log
rejected
Log
rejected
γ
QTY,COST
PK
PKEY,DATE
Log
rejected
DW.PARTS
Pre-Processing Phase I: SWA’s in local groups
Phase II: FAC’s
homologousPhase III: DIS’s
Phase IV: SWA’s in local
groups
Post-Processing
Timos Sellis - ADBIS 2007 39
Optimization of ETL WorkflowsPre-Processing Phase I: SWA’s
in local groups
Phase II: FAC’s
homologousPhase III: DIS’s
Phase IV: SWA’s in local
groups
Post-Processing
DS.PS1
DS.PS2 AddAttr2
SOURCE
SK2
DS.PS1.PKEY, LOOKUP_PS.SKEY,
SOURCE
$2€
COST DATE
SK1
DS.PS2.PKEY, LOOKUP_PS.SKEY,
SOURCECOST DATE=SYSDATE
AddDate
ULog
rejected
Log
rejected
A2EDate
NotNULL
Log
rejected
Log
rejected
Log
rejected
Log
rejected
γ
QTY,COST
PK
PKEY,DATE
Log
rejected
DW.PARTS
$2€ A2EDate
€
Adate
€
Edate
$
Adate
€
Adate
in out in out
AddAttr SK2
pkey
source
skey
source
pkey pkey
source
in out in out
skey
Timos Sellis - ADBIS 2007 40
Optimization of ETL WorkflowsPre-Processing Phase I: SWA’s
in local groups
Phase II: FAC’s
homologousPhase III: DIS’s
Phase IV: SWA’s in local
groups
Post-Processing
DS.PS1
DS.PS2 AddAttr2
SOURCE
SK2
DS.PS1.PKEY, LOOKUP_PS.SKEY,
SOURCE
$2€
COST DATE
SK1
DS.PS2.PKEY, LOOKUP_PS.SKEY,
SOURCECOST DATE=SYSDATE
AddDate
ULog
rejected
Log
rejected
A2EDate
NotNULL
Log
rejected
Log
rejected
Log
rejected
Log
rejected
γ
QTY,COST
PK
PKEY,DATE
Log
rejected
DW.PARTS
A2EDate $2€
€
Edate
€
Adate
€
Adate
$
Adate
in out in out
AddAttr SK2
pkey
source
skey
source
pkey pkey
source
in out in out
skey
??
Timos Sellis - ADBIS 2007 41
Optimization of ETL WorkflowsPre-Processing Phase I: SWA’s
in local groups
Phase II: FAC’s
homologousPhase III: DIS’s
Phase IV: SWA’s in local
groups
Post-Processing
DS.PS1
DS.PS2 AddAttr2
SOURCE
SK2
DS.PS1.PKEY, LOOKUP_PS.SKEY,
SOURCE
$2€
COST DATE
SK1
DS.PS2.PKEY, LOOKUP_PS.SKEY,
SOURCECOST DATE=SYSDATE
AddDate
ULog
rejected
Log
rejected
A2EDate
NotNULL
Log
rejected
Log
rejected
Log
rejected
Log
rejected
γ
QTY,COST
PK
PKEY,DATE
Log
rejected
DW.PARTS
AddAttr SK2
pkey
source
skey
source
pkey pkey
source
in out in out
skey
x
A2EDate $2€
$
Edate
€
Εdate
$
Adate
$
Εdate
in out in out
Timos Sellis - ADBIS 2007 42
Optimization of ETL Workflows
AddAttr2
SOURCE
SK2
DS.PS1.PKEY, LOOKUP_PS.SKEY,
SOURCE
$2€
COST DATE
DS.PS1 SK1
DS.PS2.PKEY, LOOKUP_PS.SKEY,
SOURCECOST DATE=SYSDATE
AddDate
U
DS.PS2
Log
rejected
Log
rejected
A2EDate
NotNULL
Log
rejected
Log
rejected
Log
rejected
Log
rejected
γ
QTY,COST
PK
PKEY,DATE
Log
rejected
DW.PARTS
Pre-Processing Phase I: SWA’s in local groups
Phase II: FAC’s
homologousPhase III: DIS’s
Phase IV: SWA’s in local
groups
Post-Processing
SK2
U
SK1
Timos Sellis - ADBIS 2007 43
Optimization of ETL Workflows
AddAttr2
SOURCE
SK2
DS.PS1.PKEY, LOOKUP_PS.SKEY,
SOURCE
$2€
COST DATE
DS.PS1 SK1
DS.PS2.PKEY, LOOKUP_PS.SKEY,
SOURCECOST DATE=SYSDATE
AddDate
U
DS.PS2
Log
rejected
Log
rejected
A2EDate
NotNULL
Log
rejected
Log
rejected
Log
rejected
Log
rejected
γ
QTY,COST
PK
PKEY,DATE
Log
rejected
DW.PARTS
Pre-Processing Phase I: SWA’s in local groups
Phase II: FAC’s
homologousPhase III: DIS’s
Phase IV: SWA’s in local
groups
Post-Processing
SK1,2U
Timos Sellis - ADBIS 2007 44
Optimization of ETL Workflows
AddAttr2
SOURCE
SK2
DS.PS1.PKEY, LOOKUP_PS.SKEY,
SOURCE
$2€
COST DATE
DS.PS1 SK1
DS.PS2.PKEY, LOOKUP_PS.SKEY,
SOURCECOST DATE=SYSDATE
AddDate
U
DS.PS2
Log
rejected
Log
rejected
A2EDate
NotNULL
Log
rejected
Log
rejected
Log
rejected
Log
rejected
γ
QTY,COST
PK
PKEY,DATE
Log
rejected
DW.PARTS
Pre-Processing Phase I: SWA’s in local groups
Phase II: FAC’s
homologousPhase III: DIS’s
Phase IV: SWA’s in local
groups
Post-Processing
UPKU
Timos Sellis - ADBIS 2007 45
Optimization of ETL Workflows
AddAttr2
SOURCE
SK2
DS.PS1.PKEY, LOOKUP_PS.SKEY,
SOURCE
$2€
COST DATE
DS.PS1 SK1
DS.PS2.PKEY, LOOKUP_PS.SKEY,
SOURCECOST DATE=SYSDATE
AddDate
U
DS.PS2
Log
rejected
Log
rejected
A2EDate
NotNULL
Log
rejected
Log
rejected
Log
rejected
Log
rejected
γ
QTY,COST
PK
PKEY,DATE
Log
rejected
DW.PARTS
Pre-Processing Phase I: SWA’s in local groups
Phase II: FAC’s
homologousPhase III: DIS’s
Phase IV: SWA’s in local
groups
Post-Processing
PK2
U
PK1
Timos Sellis - ADBIS 2007 46
Optimization of ETL Workflows
AddAttr2
SOURCE
SK2
DS.PS1.PKEY, LOOKUP_PS.SKEY,
SOURCE
$2€
COST DATE
DS.PS1 SK1
DS.PS2.PKEY, LOOKUP_PS.SKEY,
SOURCECOST DATE=SYSDATE
AddDate
U
DS.PS2
Log
rejected
Log
rejected
A2EDate
NotNULL
Log
rejected
Log
rejected
Log
rejected
Log
rejected
γ
QTY,COST
PK1
PKEY,DATE
Log
rejected
DW.PARTS
Pre-Processing Phase I: SWA’s in local groups
Phase II: FAC’s
homologousPhase III: DIS’s
Phase IV: SWA’s in local
groups
Post-Processing
PK2
PKEY,DATE
Log
rejected
Timos Sellis - ADBIS 2007 47
Optimization of ETL Workflows Algorithms
exhaustive (1) heuristic (2) greedy (3)
in phases I & IV perform only those transitions that drive to a better state
Results algorithms (2) and (3) improve the performance of ETL workflows
over 70% (avg) during a satisfactory for DW’s period of time (in a time range of sec..10min)
Timos Sellis - ADBIS 2007 48
Optimization of ETL Workflows Physical optimization
several physical operators implement the same semantic operation e.g., a logical join can be performed in more than one way:
nested loops, sort-merge join, hash-join Solution
employ different alternatives for the physical execution of each logical-level activity
take into consideration the effects of possible system failures to the workflow operation the introduction of sorter activities in the physical design
Algorithms exhaustive search of the search space state-space pruning based on experimental observations and the benefits of
sorter introduction given by the formula:
Timos Sellis - ADBIS 2007 49
Outline Conceptual Model Logical Model
Architecture Graph Operational Semantics Metrics
Conceptual to Logical Optimization of ETL Workflows Research Challenges
Timos Sellis - ADBIS 2007 50
Research Challenges A unified way to represent ETL processes
an algebra or a declarative language a benchmark for ETL processes
useful for evaluating optimization techniques, implementation algorithms for specific ETL activities, and so on(a first attempt was presented in QDB’07, in conj. w/ VLDB’07)
Extension of the ETL mechanisms for non-traditional data XML/HTML, spatial, biomedical data, …
Apply the techniques described to similar environments e.g., Active DWs
meet the high demand of applications for up-to-date information are refreshed on-line and achieve a higher consistency between the
stored information and the latest data updates
Timos Sellis - ADBIS 2007 51
Real time ETL
Real-time Stream of S1 updates DS
Relation R
Source Relation
S
Join module
Load shedder
(Active) ETL activities for regular DW load
...
...
DSA
DW
Active ETL workflow for approximate, real-time reporting
Off-line synchronization
IntroductoryStage
GrowthStage
MaturityStage Decline Stage
TotalMarketSales
Time
10090
8070
6050
4030
4050
Real-time DW refreshment
Offline update of reports
DW refreshmentData to
be loaded
Timos Sellis - ADBIS 2007 52
Research Challenges Take into consideration
privacy issues, physical constraints Evolution of ETL processes
changes may occur in the sources, DW, business requirements, … (a first attempt was presented in DaWaK’07)
Can it scale? think of new models for the case of large distributed environments
with many sources, e.g. P2P can the techniques scale? can they adapt to the different semantics, like approximate and
incomplete answers? can we make the techniques “goal”-driven rather than strict: e.g. I want
to have 100% over this week’s data, 80% over last week’s, etc? how to integrate static and dynamic cases (peers come and leave, others
stay there for a long period)?
Timos Sellis - ADBIS 2007 53
Conclusions Conceptual Model Logical Model
Architecture Graph Operational Semantics Metrics
Conceptual to Logical Optimization of ETL Workflows Research Challenges
DOLAP ’02,’06 – NLDB ’07 – J. IJSWIS ’07
CAiSE ’02 – DMDW ’02CAiSE ’03 - J. Inf.Sys. ’05ER ’05 – DaWaK ’05
DOLAP ’05 – J. DSS ’05ICDE ’05 – J. TKDE ’05 – DOLAP
’07
Real ETL – MeshJoin, ICDE ’07
Timos Sellis - ADBIS 2007 54
Thank you!