model driven etl-design - driving the test process ¾conclusions. 4 - model driven etl design...

64
Atos, Atos and fish symbol, Atos Origin and fish symbol, Atos Consulting, and the fish symbol itself are registered trademarks of Atos Origin SA. © 2006 Atos Origin. Private for the client. This report or any part of it, may not be copied, circulated, quoted without prior written approval from Atos Origin or the client. Model driven ETL-design Improving ETL Development by structuring logical design

Upload: ngotuong

Post on 06-Mar-2018

222 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: Model driven ETL-design - Driving the test process ¾Conclusions. 4 - Model Driven ETL Design Agenda ¾The problem area ¾Presentation of the proposed solution ¾The solution explained

Atos, Atos and fish symbol, Atos Origin and fish symbol, Atos Consulting, and the fish symbol itself are registered trademarks of Atos Origin SA.© 2006 Atos Origin. Private for the client. This report or any part of it, may not be copied, circulated, quoted without prior written approval from Atos Origin or the client.

Model driven ETL-designImproving ETL Development by structuring logical design

Page 2: Model driven ETL-design - Driving the test process ¾Conclusions. 4 - Model Driven ETL Design Agenda ¾The problem area ¾Presentation of the proposed solution ¾The solution explained

2 - Model Driven ETL Design

Mark Zwijsen

Senior Consultant Business Intelligence and Data Warehousing

25 years IT-experience

Industries:Telecom, Public Sector, Retail, Transport

Customers include:KPN, Dutch Government, KLM, Unilever

Publications in dutch IT-magazines

Page 3: Model driven ETL-design - Driving the test process ¾Conclusions. 4 - Model Driven ETL Design Agenda ¾The problem area ¾Presentation of the proposed solution ¾The solution explained

3 - Model Driven ETL Design

Agenda

The problem area

Presentation of the proposed solution

The solution explained by examples

The step from logical to technical

Case tools and code generation

Driving the test process

Conclusions

Page 4: Model driven ETL-design - Driving the test process ¾Conclusions. 4 - Model Driven ETL Design Agenda ¾The problem area ¾Presentation of the proposed solution ¾The solution explained

4 - Model Driven ETL Design

Agenda

The problem area

Presentation of the proposed solution

The solution explained by examples

The step from logical to technical

Case tools and code generation

Driving the test process

Conclusions

Page 5: Model driven ETL-design - Driving the test process ¾Conclusions. 4 - Model Driven ETL Design Agenda ¾The problem area ¾Presentation of the proposed solution ¾The solution explained

5 - Model Driven ETL Design

Problem Area

Average datawarehouse project :100+ tablesOften as many or even many more ETL-programs

CauseA separate ETL-program is constructed per target tableOften more than 1 ETL-program for a target (e.g. when table is fed from more sources)

Datawarehouse is subject to change

ResultFrequent changes to ETL-programsThus huge design- and development cost

An efficient design and development method can result in significant cost saving

Page 6: Model driven ETL-design - Driving the test process ¾Conclusions. 4 - Model Driven ETL Design Agenda ¾The problem area ¾Presentation of the proposed solution ¾The solution explained

6 - Model Driven ETL Design

Levels in Design and Development

Logical

Technical

SourceData

TargetData

DataMapping

ERD ERD

PhysicalData Model

PhysicalData Model

ETLProgram

????

??

In data modelling, the logical level is aimed towards structure, meaning, interdependence of information, without any notice of the technical details, whichcan differ from platform to platform

For ETL, designs often are a mixture of functional and technical specifications

This is caused by the absence of a formal ETL design language, and by the common practice in projects : ‘build first, document afterwards’

Page 7: Model driven ETL-design - Driving the test process ¾Conclusions. 4 - Model Driven ETL Design Agenda ¾The problem area ¾Presentation of the proposed solution ¾The solution explained

7 - Model Driven ETL Design

Quality = Sustaining Usability

Usage

Mainten

ance Transition

CompleteAccurateConsistentEfficientAccessible

PortableReusableIntegratable

StructuredUnderstandableFlexibleTestableChangeable

There is a need for high-quality logical ETL designs

Page 8: Model driven ETL-design - Driving the test process ¾Conclusions. 4 - Model Driven ETL Design Agenda ¾The problem area ¾Presentation of the proposed solution ¾The solution explained

8 - Model Driven ETL Design

Quality Issues for ETL Designs

UsageComplete, Accurate, Consistent, Efficient, Accessible• The objective is to BUILD WITHOUT FURTHER QUESTIONS

MaintenanceStructured, Understandable, Flexible, Testable, Changeable• The objective is to ALLOW FOR EASY ALTERATION

TransitionPortable, Reusable, Integratable• The objective is to BE PLATFORM/TOOL INDEPENDENT

Page 9: Model driven ETL-design - Driving the test process ¾Conclusions. 4 - Model Driven ETL Design Agenda ¾The problem area ¾Presentation of the proposed solution ¾The solution explained

9 - Model Driven ETL Design

Mapping on logical level: horizontal split

Logical

SourceData

TargetData

DataMapping

ERD Normalizedmodel

Star scheme+history

Generally, the mapping from source to target can globally be split in 2 phases

The first phase comprises the ‘hard part’of transforming from the source to the target language (the actual extracting and transforming steps)

The second phase comprises the ‘easy part’: handling history, aggregation, denormalization (the actual loading step)

Page 10: Model driven ETL-design - Driving the test process ¾Conclusions. 4 - Model Driven ETL Design Agenda ¾The problem area ¾Presentation of the proposed solution ¾The solution explained

10 - Model Driven ETL Design

Second Phase: Normalized Model to Target Model is relatively straighforward

DIM_CUSTOMER

CUSTOMER_KEYCUSTOMER_NUMBERCUSTOMER_NAMECUSTOMER_TYPECUSTOMER_IDENTIFICATIONCUSTOMER_COCNRCUSTOMER_ZIPCODECUSTOMER_DEMOGRAPHYCUSTOMER_SECTORVALID_FROMVALID_UNTIL

Base Object(s) of DimensionSCD Type per object/attribute

Usage of generated dimensionkeyType of validity-indicators

TBL_CUSTOMER

CUSTOMER_NUMBERCOCNRZIPCODECUSTOMER_NAMECUSTOMER_IDENTIFICATIONCUSTOMER_TYPE

TBL_COCDATA

COCNRCOC_NAMESECTOR_CODE

TBL_DEMOGRAPHY

ZIPCODEDEMOGRAPHY_TYPE

Page 11: Model driven ETL-design - Driving the test process ¾Conclusions. 4 - Model Driven ETL Design Agenda ¾The problem area ¾Presentation of the proposed solution ¾The solution explained

11 - Model Driven ETL Design

First phase: Source Model to Normalized Model, is the challenging part

Detect/Determine validityFilter irrelevant data

Standardize codes, names, etc.Merge multiple row-records

Join disparate recordsEliminate irrelevant levels

Clean dirty dataSplit recurring fields

DeduplicateResolve self-references

Etc.Etc.

TBL_CUSTOMER

CUSTOMER_NUMBERCOCNRZIPCODECUSTOMER_NAMECUSTOMER_IDENTIFICATIONCUSTOMER_TYPE

TBL_COCDATA

COCNRCOC_NAMESECTOR_CODE

TBL_DEMOGRAPHY

ZIPCODEDEMOGRAPHY_TYPE

CUSTOMER

CUSTOMER_NRKVKNUMMERADDRESS_NRNAMEPRIVATE_BUSINESS_FLAGIDENTIFICATIONSHIPTO_ADDRESS_NRTIMESTAMP

ADDRESS

ADDRESS_NRVALID_FROMCUSTOMER_NRSTREETHOUSE_NRHOUSE_NR_EXTPOSTCODECITYTIMESTAMP

COMPANYREGISTER

KVKNUMMERNAAMBRANCHE_CODEOMVANG

DEMOGRAFIE

POSTCODEDEMOGRAFIE_TYPE

Page 12: Model driven ETL-design - Driving the test process ¾Conclusions. 4 - Model Driven ETL Design Agenda ¾The problem area ¾Presentation of the proposed solution ¾The solution explained

12 - Model Driven ETL Design

Detect/Determine validityFilter irrelevant data

Standardize codes, names, etc.Merge multiple row-records

Join disparate recordsEliminate irrelevant levels

Clean dirty dataSplit recurring fields

DeduplicateResolve self-references

Etc.Etc.

Most ETL-programs move Source Model to Target Model in one flow

Base Object(s) of DimensionSCD Type per object/attribute

Usage of generated dimensionkeyType of validity-indicators

CUSTOMER

CUSTOMER_NRKVKNUMMERADDRESS_NRNAMEPRIVATE_BUSINESS_FLAGIDENTIFICATIONSHIPTO_ADDRESS_NRTIMESTAMP

ADDRESS

ADDRESS_NRVALID_FROMCUSTOMER_NRSTREETHOUSE_NRHOUSE_NR_EXTPOSTCODECITYTIMESTAMP

COMPANYREGISTER

KVKNUMMERNAAMBRANCHE_CODEOMVANG

DEMOGRAFIE

POSTCODEDEMOGRAFIE_TYPE

TBL_CUSTOMER

CUSTOMER_NUMBERCOCNRZIPCODECUSTOMER_NAMECUSTOMER_IDENTIFICATIONCUSTOMER_TYPE

TBL_COCDATA

COCNRCOC_NAMESECTOR_CODE

TBL_DEMOGRAPHY

ZIPCODEDEMOGRAPHY_TYPE

DIM_CUSTOMER

CUSTOMER_KEYCUSTOMER_NUMBERCUSTOMER_NAMECUSTOMER_TYPECUSTOMER_IDENTIFICATIONCUSTOMER_COCNRCUSTOMER_ZIPCODECUSTOMER_DEMOGRAPHYCUSTOMER_SECTORVALID_FROMVALID_UNTIL

Page 13: Model driven ETL-design - Driving the test process ¾Conclusions. 4 - Model Driven ETL Design Agenda ¾The problem area ¾Presentation of the proposed solution ¾The solution explained

13 - Model Driven ETL Design

Agenda

The problem area

Presentation of the proposed solution

The solution explained by examples

The step from logical to technical

Case tools and code generation

Driving the test process

Conclusions

Page 14: Model driven ETL-design - Driving the test process ¾Conclusions. 4 - Model Driven ETL Design Agenda ¾The problem area ¾Presentation of the proposed solution ¾The solution explained

14 - Model Driven ETL Design

Standard Mapping Model

Collect Enrich Filter Validate Convert

Sourcemodel

Triggerentity

Filterrules

Validationrules

Conversionrules

Targetmodel

Reprocess

Enrichmentrules

Historyrules

HistoryLink

Linkrules

Consists of 8 specific logical transformation steps

Page 15: Model driven ETL-design - Driving the test process ¾Conclusions. 4 - Model Driven ETL Design Agenda ¾The problem area ¾Presentation of the proposed solution ¾The solution explained

15 - Model Driven ETL Design

The logical transformation steps can begrouped according to their orientationtowards either source or target model

Sourcemodel orientedCollect, Enrich, FilterRules for these steps are mainly in the context of the source model

Quality orientedValidate

Target orientedLink, ConvertRules for these steps are mainly in the context of the target model

Page 16: Model driven ETL-design - Driving the test process ¾Conclusions. 4 - Model Driven ETL Design Agenda ¾The problem area ¾Presentation of the proposed solution ¾The solution explained

16 - Model Driven ETL Design

Output results of the logical steps

Transformation step Output

COLLECT All candidates with their implicit attributes

ENRICH All candidates with the direct and indirect related attributes

FILTER All relevant candidates with the direct and indirect related attributes

LINK All relevant candidates with the direct and indirect related attributes, plus the foreign key attributes of the target model

VALIDATE All relevant and validated candidates with the direct and indirect relatedattributes, plus the foreign key attributes of the target model

CONVERT All relevant and validated candidatesplus all needed attributes of the target model

Page 17: Model driven ETL-design - Driving the test process ¾Conclusions. 4 - Model Driven ETL Design Agenda ¾The problem area ¾Presentation of the proposed solution ¾The solution explained

17 - Model Driven ETL Design

Agenda

The problem area

Presentation of the proposed solution

The solution explained by examples

The step from logical to technical

Case tools and code generation

Driving the test process

Conclusions

Page 18: Model driven ETL-design - Driving the test process ¾Conclusions. 4 - Model Driven ETL Design Agenda ¾The problem area ¾Presentation of the proposed solution ¾The solution explained

18 - Model Driven ETL Design

Target model: star scheme of phonecall factwith 3 dimensions

PHONECALL

CATEGORY_KEYCALENDAR_KEYCUSTOMER_KEYCALL_DURATIONCALL_CHARGE

DIM_CUSTOMER

CUSTOMER_KEYCUSTOMER_NUMBERCUSTOMER_NAMECUSTOMER_TYPECUSTOMER_IDENTIFICATIONCUSTOMER_COCNRCUSTOMER_ZIPCODECUSTOMER_DEMOGRAPHYCUSTOMER_SECTORVALID_FROMVALID_UNTIL

DIM_CALENDAR

CALENDAR_KEYDATEMONTHYEAR

DIM_CATEGORY

CATEGORY_KEYCATEGORY_CODECATEGORY_OMSCHRIJVING

Page 19: Model driven ETL-design - Driving the test process ¾Conclusions. 4 - Model Driven ETL Design Agenda ¾The problem area ¾Presentation of the proposed solution ¾The solution explained

19 - Model Driven ETL Design

Source model: customer care system, 2 external sources

CUSTOMER CARE SYSTEEM

DEMOGRAFIEBEDRIJVENREGISTER

CUSTOMERCUSTOMER_NRADDRESS_NRNAMEPRIVATE_BUSINESS_FLAGIDENTIFICATIONSHIPTO_ADDRESS_NRTIMESTAMP ADDRESS

ADDRESS_NRVALID_FROMSTREETHOUSE_NRHOUSE_NR_EXTPOSTCODECITYTIMESTAMP

BEDRIJVENREGISTERKVKNUMMERNAAMBRANCHE_CODEOMVANG

DEMOGRAFIEPOSTCODEDEMOGRAFIE_TYPE

SUBSCRIPTIONSUBSCRIPTION_IDCUSTOMER_NRITEM_NRVALID_FROMVALID_UNTILTIMESTAMP

CDRSTARTDATETIMEDURATIONMOBILE_NUMBERCALLED_NUMBERCHARGE

MOBILE_ NUMBER_ASSIGNMENTMOBILE_NUMBERVALID_FROMVALID_UNTILSUBSCRIPTION_ID

MOBILE_NUMBERMOBILE_NUMBERSTATUS

Page 20: Model driven ETL-design - Driving the test process ¾Conclusions. 4 - Model Driven ETL Design Agenda ¾The problem area ¾Presentation of the proposed solution ¾The solution explained

20 - Model Driven ETL Design

Collect: Retrieve candidates

Collect Enhance Filter Validate Convert

Sourcemodel

Triggerentity

Filterrules

Validationrules

Conversionrules

Targetmodel

Reprocess

Enhancementrules

Historyrules

HistoryLink

Linkrules

Page 21: Model driven ETL-design - Driving the test process ¾Conclusions. 4 - Model Driven ETL Design Agenda ¾The problem area ¾Presentation of the proposed solution ¾The solution explained

21 - Model Driven ETL Design

Collect: Retrieve candidates

Retrieve data from the Trigger table

The Trigger table is the table that, in essence, contains the candidates to beprocessed

It is called the Trigger table, because the records trigger the process

Page 22: Model driven ETL-design - Driving the test process ¾Conclusions. 4 - Model Driven ETL Design Agenda ¾The problem area ¾Presentation of the proposed solution ¾The solution explained

22 - Model Driven ETL Design

Collect: Retrieve candidatesVisualized as tables with rows (candidates) and columns (attributes)

Customer

R#

1

2

3

4

5

CN AN Nm PB Id ST TS

5

4

3

2

1

R#

CDR

ChCnMnDrSd

Output = All candidates with their direct attributes

Page 23: Model driven ETL-design - Driving the test process ¾Conclusions. 4 - Model Driven ETL Design Agenda ¾The problem area ¾Presentation of the proposed solution ¾The solution explained

23 - Model Driven ETL Design

Collect Enrich Filter Validate Convert

Sourcemodel

Triggerentiteiten

Filterrules

Validationrules

Conversionrules

Targetmodel

Reprocess

Enrichmentrules

Historyrules

HistoryLink

Linkrules

Enrich : add other source attributes

Page 24: Model driven ETL-design - Driving the test process ¾Conclusions. 4 - Model Driven ETL Design Agenda ¾The problem area ¾Presentation of the proposed solution ¾The solution explained

24 - Model Driven ETL Design

Enrich : add other source attributes

Trigger candidates may not yet possess all necessary attributes

Data is combined from different sources

Data is denormalized to populate a dimension

Page 25: Model driven ETL-design - Driving the test process ¾Conclusions. 4 - Model Driven ETL Design Agenda ¾The problem area ¾Presentation of the proposed solution ¾The solution explained

25 - Model Driven ETL Design

Enrich : add other source attributes Extra columns are added to the table

Customer Address BedrGeg Demogr

BC Ov DTSt Hn Hx Pc CtR#

1

2

3

4

5

CN AN Nm PB Id ST TS

InCnId

Subscription

St

MobileNr

5

4

3

2

1

R#

CDR

ChCnMnDrSd

Output = All candidates with their direct and indirect related attributes

Page 26: Model driven ETL-design - Driving the test process ¾Conclusions. 4 - Model Driven ETL Design Agenda ¾The problem area ¾Presentation of the proposed solution ¾The solution explained

26 - Model Driven ETL Design

Filter: separate relevant/non-relevant candidates

Collect Enrich Filter Validate Convert

Sourcemodel

Triggerentity

Filterrules

Validationrules

Conversionrules

Targetmodel

Reprocess

Enrichmentrules

Historyrules

HistoryLink

Linkrules

Page 27: Model driven ETL-design - Driving the test process ¾Conclusions. 4 - Model Driven ETL Design Agenda ¾The problem area ¾Presentation of the proposed solution ¾The solution explained

27 - Model Driven ETL Design

Filter: separate relevant/non-relevant candidates

For various reasons, candidates can be irrelevantOnly changed records need to be processedSpecific categories are not interesting for reporting

Page 28: Model driven ETL-design - Driving the test process ¾Conclusions. 4 - Model Driven ETL Design Agenda ¾The problem area ¾Presentation of the proposed solution ¾The solution explained

28 - Model Driven ETL Design

Customer Address BedrGeg Demogr

BC Ov DTSt Hn Hx Pc CtR#

1

2

3

4

5

CN AN Nm PB Id ST TS

CDR MobileNr Subscription

St Id Cn InR#

1

2

3

4

5

Sd Dr Mn Cn Ch

Output = All relevant candidates with their direct and indirect related attributes

Filter: separate relevant/non-relevant candidates, based on attribute values

Page 29: Model driven ETL-design - Driving the test process ¾Conclusions. 4 - Model Driven ETL Design Agenda ¾The problem area ¾Presentation of the proposed solution ¾The solution explained

29 - Model Driven ETL Design

Link: resolve target relations

Collect Enrich Filter Validate Convert

Sourcemodel

Triggerentity

Filterrules

Validationrules

Conversionrules

Targetmodel

Reprocess

Enrichmentrules

Historyrules

HistoryLink

Linkrules

Page 30: Model driven ETL-design - Driving the test process ¾Conclusions. 4 - Model Driven ETL Design Agenda ¾The problem area ¾Presentation of the proposed solution ¾The solution explained

30 - Model Driven ETL Design

Link: resolve target relations

Determine foreign key values to referenced tables

Describe the rules that must be used to determine these values

In star scheme models, this step mainly applies for fact tables

Page 31: Model driven ETL-design - Driving the test process ¾Conclusions. 4 - Model Driven ETL Design Agenda ¾The problem area ¾Presentation of the proposed solution ¾The solution explained

31 - Model Driven ETL Design

Link: resolve target relations

Customer Address BedrGeg Demogr

BC Ov DTSt Hn Hx Pc CtR#

1

2

3

4

5

CN AN Nm PB Id ST TS

CDR MobileNr Subscription Gespreksfeit

Id Cn In Klant-key Cat-key Kal-KeyStR#

1

2

3

4

5

Sd Dr Mn Cn Ch

Output = All relevant candidates with their direct and indirect related attributes,plus the foreign key attributes of the target model

Page 32: Model driven ETL-design - Driving the test process ¾Conclusions. 4 - Model Driven ETL Design Agenda ¾The problem area ¾Presentation of the proposed solution ¾The solution explained

32 - Model Driven ETL Design

Validate: check candidate quality

Collect Enrich Filter Validate Convert

Sourcemodel

Triggerentity

Filterrules

Validationrules

Conversionrules

Targetmodel

Reprocess

Enrichmentrules

Historyrules

HistoryLink

Linkrules

Page 33: Model driven ETL-design - Driving the test process ¾Conclusions. 4 - Model Driven ETL Design Agenda ¾The problem area ¾Presentation of the proposed solution ¾The solution explained

33 - Model Driven ETL Design

Validate: check candidate quality

Candidates need to fulfil certain quality requirements, likeDomain compliance of attributesLinks must be resolvedEnrichment must succeed

If a validation rule fails, appropriate actions must be taken, likeApply a default valueReject the candidate, or perhaps reject the whole collection of candidatesMove the candidate to a suspend table for later reprocessing

Page 34: Model driven ETL-design - Driving the test process ¾Conclusions. 4 - Model Driven ETL Design Agenda ¾The problem area ¾Presentation of the proposed solution ¾The solution explained

34 - Model Driven ETL Design

Validate: check candidate quality

Customer Address BedrGeg Demogr

BC Ov DTSt Hn Hx Pc CtR#

1

2

3

4

5

CN AN Nm PB Id ST TS

CDR MobileNr Subscription Gespreksfeit

Id Cn In Klant-key Cat-key Kal-KeyStR#

1

2

3

4

5

Sd Dr Mn Cn Ch

Output = All relevant and validated candidates with their direct and indirect related attributes, plus the foreign key attributes of the target model

Page 35: Model driven ETL-design - Driving the test process ¾Conclusions. 4 - Model Driven ETL Design Agenda ¾The problem area ¾Presentation of the proposed solution ¾The solution explained

35 - Model Driven ETL Design

Reprocess: handle invalidated candidates

Collect Enhance Filter Validate Convert

Sourcemodel

Triggerentity

Filterrules

Validationrules

Conversionrules

Targetmodel

Reprocess

Enhancementrules

Historyrules

HistoryLink

Linkrules

Page 36: Model driven ETL-design - Driving the test process ¾Conclusions. 4 - Model Driven ETL Design Agenda ¾The problem area ¾Presentation of the proposed solution ¾The solution explained

36 - Model Driven ETL Design

Convert: derive target attributes

Collect Enhance Filter Validate Convert

Sourcemodel

Triggerentity

Filterrules

Validationrules

Conversionrules

Targetmodel

Reprocess

Enhancementrules

Historyrules

HistoryLink

Linkrules

Page 37: Model driven ETL-design - Driving the test process ¾Conclusions. 4 - Model Driven ETL Design Agenda ¾The problem area ¾Presentation of the proposed solution ¾The solution explained

37 - Model Driven ETL Design

Convert: derive target attributes

All target attributes are derived from the available candidate attributes

Page 38: Model driven ETL-design - Driving the test process ¾Conclusions. 4 - Model Driven ETL Design Agenda ¾The problem area ¾Presentation of the proposed solution ¾The solution explained

38 - Model Driven ETL Design

Convert: derive target attributes

Customer Address BedrGeg Demogr KLANT

DT Nr Nm Tp Id KvK Pc Dg BrBC OvSt Hn Hx Pc CtR#

1

2

3

4

5

CN AN Nm PB Id ST TS

CDR MobileNr Subscription Gespreksfeit

Id Cn In Klant-key Cat-key Kal-Key Duur OmzetStR#

1

2

3

4

5

Sd Dr Mn Cn Ch

Output: All relevant and validated candidatesplus all needed attributes of the target model

Page 39: Model driven ETL-design - Driving the test process ¾Conclusions. 4 - Model Driven ETL Design Agenda ¾The problem area ¾Presentation of the proposed solution ¾The solution explained

39 - Model Driven ETL Design

History: handle changes in time

Collect Enhance Filter Validate Convert

Sourcemodel

Triggerentity

Filterrules

Validationrules

Conversionrules

Targetmodel

Reprocess

Enhancementrules

Historyrules

HistoryLink

Linkrules

Page 40: Model driven ETL-design - Driving the test process ¾Conclusions. 4 - Model Driven ETL Design Agenda ¾The problem area ¾Presentation of the proposed solution ¾The solution explained

40 - Model Driven ETL Design

History: handle changes in time

This step comprisesSlowly Changing Dimension rulesAggregation rules

Page 41: Model driven ETL-design - Driving the test process ¾Conclusions. 4 - Model Driven ETL Design Agenda ¾The problem area ¾Presentation of the proposed solution ¾The solution explained

41 - Model Driven ETL Design

Transformation rules Customer Dimension (1)

Nr Step Rule1 Source

modelCUSTOMER

2 Sourcemodel

ADDRESS

3 Sourcemodel

Bedrijvenregister

4 Sourcemodel

Demografie

5 Collect CUSTOMER

These rules can be entered and presented in an easy-to-use table

Page 42: Model driven ETL-design - Driving the test process ¾Conclusions. 4 - Model Driven ETL Design Agenda ¾The problem area ¾Presentation of the proposed solution ¾The solution explained

42 - Model Driven ETL Design

Transformation rules Customer Dimension (2)

Nr Step Rule6

7

Enrich CUSTOMER – BEDRIJVENREGISTER:

CUSTOMER.IDENTIFICATION = BEDRIJVENREGISTER.KVKNUMMER

-en-

CUSTOMER.PRIVATE_BUSINESS_FLAG = ‘Z’Enrich CUSTOMER – ADDRESS:

CUSTOMER.ADDRESS_NR = ADDRESS.ADDRESS_NR

-en-

CUSTOMER.TIMESTAMP >= ADDRESS.VALID_FROM

-en-

kies MAX(ADDRESS.VALID_FROM)

Page 43: Model driven ETL-design - Driving the test process ¾Conclusions. 4 - Model Driven ETL Design Agenda ¾The problem area ¾Presentation of the proposed solution ¾The solution explained

43 - Model Driven ETL Design

Transformation rules Customer Dimension (3)

Nr Step Rule

8 Enrich ADDRESS – DEMOGRAFIE:

ADDRESS.POSTCODE = DEMOGRAFIE.POSTCODE9 Filter CUSTOMER.TIMESTAMP within [VerwerkingsPeriode]

10 Filter CUSTOMER.CUSTOMER_NR >= 1000

11

12

13

Validate CUSTOMER.PRIVATE_BUSINESS_FLAG = ‘Z’ or ‘P’

Convert KLANT_NUMMER = CUSTOMER.CUSTOMER_NR

Convert KLANT_NAAM = CUSTOMER.NAME

Page 44: Model driven ETL-design - Driving the test process ¾Conclusions. 4 - Model Driven ETL Design Agenda ¾The problem area ¾Presentation of the proposed solution ¾The solution explained

44 - Model Driven ETL Design

Transformation rules KLANT (4)

Nr Step Rule

14

15

16

17 Convert KLANT_POSTCODE = ADDRESS.POSTCODE

18 Convert KLANT_DEMOGRAFIE = DEMOGRAFIE.DEMOGRAFIE_TYPE

19 Convert KLANT_BRANCHE = BEDRIJVENREGISTER.BRANCHE_CODE

Convert KLANT_TYPE = CUSTOMER.PRIVATE_BUSINESS_FLAG

Convert KLANT_IDENTIFICATIE = CUSTOMER.IDENTIFICATION

Convert KLANT_KVKNUMMER = BEDRIJVENREGISTER.KVKNUMMER

Page 45: Model driven ETL-design - Driving the test process ¾Conclusions. 4 - Model Driven ETL Design Agenda ¾The problem area ¾Presentation of the proposed solution ¾The solution explained

45 - Model Driven ETL Design

Agenda

The problem area

Presentation of the proposed solution

The solution explained by examples

The step from logical to technical

Case tools and code generation

Driving the test process

Conclusions

Page 46: Model driven ETL-design - Driving the test process ¾Conclusions. 4 - Model Driven ETL Design Agenda ¾The problem area ¾Presentation of the proposed solution ¾The solution explained

46 - Model Driven ETL Design

From logical to technical transformationmodel

In fact, this is also a kind of mapping

Now we make use of development expertise, tool expertise, standards and techniques

Here we also must apply the design rules according to the quality model:Usage• Complete, Accurate, Consistent, Efficient, AccessibleMaintenance• Structured, Understandable, Flexible, Testable, ChangeableTransition• Portable, Reusable, Integrateable

Page 47: Model driven ETL-design - Driving the test process ¾Conclusions. 4 - Model Driven ETL Design Agenda ¾The problem area ¾Presentation of the proposed solution ¾The solution explained

47 - Model Driven ETL Design

Logical to technical: CUSTOMER

Collect Enhance Filter Validate Convertteren

Sourcemodel

Triggerentity

Filterrules

Validationrules

Conversionrules

Targetmodel

Reprocess

Enhancementrules

Historyrules

HistoryLink

Linkrules

Cust

BDR

AddrKlant

err

RDR1LKP1 LKP2

EXP1

WRT1RTR1

5 6 7 8 9 10 11 12-19

Demo

FLT1

1 2 3 4

Below is an example of a technical mapping in a specific tool, graphically presented

Each component of this mapping relates to a logical design step

Page 48: Model driven ETL-design - Driving the test process ¾Conclusions. 4 - Model Driven ETL Design Agenda ¾The problem area ¾Presentation of the proposed solution ¾The solution explained

48 - Model Driven ETL Design

‘Mapping’ table between Logical and Technical model

Step RDR1 FLT1 RTR1 LKP1 LKP2 EXP1

Validate 11

Convert 12

13

...

19

1

2

10

3

4

5

6

7

8

9

Source model

Collect

Enhance

Filter

Helps in checking completeness

Page 49: Model driven ETL-design - Driving the test process ¾Conclusions. 4 - Model Driven ETL Design Agenda ¾The problem area ¾Presentation of the proposed solution ¾The solution explained

49 - Model Driven ETL Design

Mapping-table Logical-Technical

Often this seems a trivial table

This is especially the case when there is a 1-1 relation between a logical functionand a technical ETL-program

However, there are many 1-1 situations that are not that trivial

Also, in case of a n-1 or 1-n or m-n relation, this table proves very valuable

For a specic ETL-tool (SAS, Powercenter, etc.) standards can be defined as whatsort of transformation is used for what sort of functional rule.

Page 50: Model driven ETL-design - Driving the test process ¾Conclusions. 4 - Model Driven ETL Design Agenda ¾The problem area ¾Presentation of the proposed solution ¾The solution explained

50 - Model Driven ETL Design

Agenda

The problem area

Presentation of the proposed solution

The solution explained by examples

The step from logical to technical

Case tools and code generation

Driving the test process

Conclusions

Page 51: Model driven ETL-design - Driving the test process ¾Conclusions. 4 - Model Driven ETL Design Agenda ¾The problem area ¾Presentation of the proposed solution ¾The solution explained

51 - Model Driven ETL Design

Case tools and code generation

Can this ETL design and development process be automated?

Can we use a CASE-tool here?

Do such CASE-tools exist in the market?

What role does CWM play in this area?

Page 52: Model driven ETL-design - Driving the test process ¾Conclusions. 4 - Model Driven ETL Design Agenda ¾The problem area ¾Presentation of the proposed solution ¾The solution explained

52 - Model Driven ETL Design

Target modelSource model

Mapping as supported by ER-tools

Model

Table 1

col1

col2

col3

Table 2

col1

col2

Model

Table 1

col1

col2

Table 2

col1

col2

Table 3

ƒ( )

ƒ( )

Reference to source model

Reference to source table

ƒ( )

col1

ƒ( )Reference to column

Plus formula

Page 53: Model driven ETL-design - Driving the test process ¾Conclusions. 4 - Model Driven ETL Design Agenda ¾The problem area ¾Presentation of the proposed solution ¾The solution explained

53 - Model Driven ETL Design

CWM – Common Warehouse Metadata

Website scanSAS, Informatica, Oracle, Microsoft, Business Objects

ImpressionSupport for CWM is growingETL tools can export mapping to CWM, in order to support for lineage or impact analysisCWM can not be used to design an ETL program

Page 54: Model driven ETL-design - Driving the test process ¾Conclusions. 4 - Model Driven ETL Design Agenda ¾The problem area ¾Presentation of the proposed solution ¾The solution explained

54 - Model Driven ETL Design

Agenda

The problem area

Presentation of the proposed solution

The solution explained by examples

The step from logical to technical

Case tools and code generation

Driving the test process

Conclusions

Page 55: Model driven ETL-design - Driving the test process ¾Conclusions. 4 - Model Driven ETL Design Agenda ¾The problem area ¾Presentation of the proposed solution ¾The solution explained

55 - Model Driven ETL Design

Driving the test processLogicalDesign

DefineTest cases

Input OutputProcess

Testcase X Prediction Y

RunTest

InputRecords

OutputRecords

ETLprogram

?

Testing is an essential step in software development

From the logical design, testcases are defined

Testcases are used to create testdata

Testdata is used as input for the ETL program to be tested

Page 56: Model driven ETL-design - Driving the test process ¾Conclusions. 4 - Model Driven ETL Design Agenda ¾The problem area ¾Presentation of the proposed solution ¾The solution explained

56 - Model Driven ETL Design

How to do adequate and efficient testing

The logical ETL design is both input for the software developer as for the tester

So, the design must be highly usable for both disciplines

To do testing efficiently, one needs to define testdata sets as small as possible, yetit must contain all logical testcases

Using production or production like data is not an option, because it often has a toolarge volume, and the predicting the outcome on an adequate detailed level is hard if not impossible

Page 57: Model driven ETL-design - Driving the test process ¾Conclusions. 4 - Model Driven ETL Design Agenda ¾The problem area ¾Presentation of the proposed solution ¾The solution explained

57 - Model Driven ETL Design

Example : square root function

The test approach is based on the theory of graphs

The graph depicts all decision points and possible paths a function can take fromstart to stop

Every route (= chain of paths) is a potential test case

start

stop

X?>=0

<0

Error message Calculate square root

Page 58: Model driven ETL-design - Driving the test process ¾Conclusions. 4 - Model Driven ETL Design Agenda ¾The problem area ¾Presentation of the proposed solution ¾The solution explained

58 - Model Driven ETL Design

Creating test graph from logical ETL model

The logical ETL model is very useful when specifying test cases and creating a test graph

Especially, the steps ‘Selection’ and ‘Validation’ are very important. They map one-to-one to decision points in the graph

History-handling adds its own decision pointsNew record vs. changed recordType 1 or type 2 changes

Page 59: Model driven ETL-design - Driving the test process ¾Conclusions. 4 - Model Driven ETL Design Agenda ¾The problem area ¾Presentation of the proposed solution ¾The solution explained

59 - Model Driven ETL Design

Graph of ETL programstart

stop

8 - Validate

5-Select

6-Select

7-Validate

ok failure

<1000 >=1000

before

between

P Z not in {P,Z}

History

new

unchanged

changed

after

Page 60: Model driven ETL-design - Driving the test process ¾Conclusions. 4 - Model Driven ETL Design Agenda ¾The problem area ¾Presentation of the proposed solution ¾The solution explained

60 - Model Driven ETL Design

Determining test cases

Search in the transformation rules for the decision points;

For each decision point, specify the possible options;

Create a graph with the decision points and options;

Specify the test cases by combination/sequence of options;

Create test data for each test case

Page 61: Model driven ETL-design - Driving the test process ¾Conclusions. 4 - Model Driven ETL Design Agenda ¾The problem area ¾Presentation of the proposed solution ¾The solution explained

61 - Model Driven ETL Design

Agenda

The problem area

Presentation of the proposed solution

The solution explained by examples

The step from logical to technical

Case tools and code generation

Driving the test process

Conclusions

Page 62: Model driven ETL-design - Driving the test process ¾Conclusions. 4 - Model Driven ETL Design Agenda ¾The problem area ¾Presentation of the proposed solution ¾The solution explained

62 - Model Driven ETL Design

Conclusions (1)

The problem areaDesign and development cost of ETL-programs represents a big slice of the Data warehousing Cost pieA structured high-quality design approach will yield benefits

Presentation of the proposed solutionA logical transformation model, independent of technical implementation issuesStrongly linked to the source and target data model

The solution explained by examplesApplicable for all kinds of data warehouse data modelsApplicable for transactional and dimensional entity types

From logical to technicalPer tool conversion standards will enable efficiency

Page 63: Model driven ETL-design - Driving the test process ¾Conclusions. 4 - Model Driven ETL Design Agenda ¾The problem area ¾Presentation of the proposed solution ¾The solution explained

63 - Model Driven ETL Design

Conclusions (2)

Case tools and code generationProgress is mainly found in ER-modelling toolsCWM useful for reporting lineage and impact, not for ETL code generation

Driving the test processModel driven logical design is a solid base for preparing test casesBased on the theory of graphsThus the design meets the quality criterium ‘testablility’

OVERALL CONCLUSIONA high-quality design approachBoosting efficiencyProven in a number of ETL projects

Page 64: Model driven ETL-design - Driving the test process ¾Conclusions. 4 - Model Driven ETL Design Agenda ¾The problem area ¾Presentation of the proposed solution ¾The solution explained

Atos, Atos and fish symbol, Atos Origin and fish symbol, Atos Consulting, and the fish symbol itself are registered trademarks of Atos Origin SA.© 2006 Atos Origin. Private for the client. This report or any part of it, may not be copied, circulated, quoted without prior written approval from Atos Origin or the client.

Thank [email protected]