The Data Warehouse ETL Toolkit - Chapter 02

Upload: abacus83

Post on 01-Jun-2018

  • 8/9/2019 The Data Warehouse ETL Toolkit - Chapter 02

    1/31

    The Data Warehouse ETL Toolkit

    by Ralph Kimball

    VSV Training

    Chapter 2: ETL Data Structures

    Prepared by: Hien Bui

    Date: 09/02/2008

    2.0 Introduce ETL Data Structures

    The ETL team will need a number of different data structures to meet all the legitimate staging needs.

    2.1 To Stage or Not to Stage

    The decision to store data in a physical staging area versus processing it in memory is ultimately the choice of the ETL architect.

    The issue with determining whether to stage your data or not depends on two conflicting objectives:

    Getting the data from the originating source to the ultimate target as fast as possible

    Having the ability to recover from failure without restarting from the beginning of the process

    2.1 To Stage or Not to Stage (cont)

    Consider the following reasons for staging data before it is loaded into the data warehouse:

    Recoverability: In most enterprise environments, it's a good practice to stage the data as soon as it has been extracted from the source system and then again immediately after each of the major transformation steps.

    Backup: Quite often, massive volume prevents the data warehouse from being reliably backed up at the database level.

    Auditing: Many times the data lineage between the source and target is lost in the ETL code.
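The recoverability argument above can be illustrated with a minimal sketch. It assumes each step's output is small enough to stage as a JSON file; the function `run_with_staging`, the step names, and the sample rows are hypothetical and not taken from the book.

```python
import json
import tempfile
from pathlib import Path


def run_with_staging(rows, steps, stage_dir):
    """Run each (name, function) step and stage its output as a JSON file.

    If a step's staged file already exists, its contents are reused and the
    step is skipped, so a rerun resumes after the last completed step.
    """
    stage_dir = Path(stage_dir)
    executed = []
    for index, (name, func) in enumerate(steps):
        staged = stage_dir / f"{index:02d}_{name}.json"
        if staged.exists():
            rows = json.loads(staged.read_text())
        else:
            rows = func(rows)
            staged.write_text(json.dumps(rows))
            executed.append(name)
    return rows, executed


def extract(_):
    # Hypothetical source extract; a real job would query the source system.
    return [{"id": 1, "name": " ann "}, {"id": 2, "name": "BOB"}]


def clean(rows):
    # Hypothetical transformation step: trim and normalize names.
    return [{"id": r["id"], "name": r["name"].strip().title()} for r in rows]


steps = [("extract", extract), ("clean", clean)]
with tempfile.TemporaryDirectory() as tmp:
    first_rows, first_executed = run_with_staging(None, steps, tmp)
    # The second run finds every staged file and executes nothing.
    second_rows, second_executed = run_with_staging(None, steps, tmp)
```

The trade-off is the one the slide names: every staged write slows the path from source to target, in exchange for not restarting from the beginning after a failure.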

    2.2 Designing the Staging Area

    The staging area stores data on its way to the final presentation area of the data warehouse.

    Make sure you give serious thought to the various roles that staging can play in your overall data warehouse operations.

    A given staging file can also be used for restarting the job flow if a serious problem develops downstream, and the staging file can be a form of audit or proof that the data had specific content when it was processed.

    2.2 Designing the Staging Area (cont)

    The data-staging area must be owned by the ETL team.

    Users are not allowed in the staging area for any reason.

    Reports cannot access data from the staging area.

    2.2 Designing the Staging Area (cont)

    The volumetric worksheet lists each table in the staging area with the following information:

    Table Name: The name of the table or file in the staging area.

    Update Strategy: This field indicates how the table is maintained.

    Load Frequency: Reveals how often the table is loaded or changed by the ETL process.

    ETL Job(s): Staging tables are populated or updated via ETL jobs.

    Initial Row Count: The ETL team must estimate how many rows each table in the staging area initially contains.

    2.2 Designing the Staging Area (cont)

    Average Row Length: For size-estimation purposes, you must supply the DBA with the average row length in each staging table.

    Grows With: Even though tables are updated on a scheduled interval, they don't necessarily grow each time they are touched.

    Expected Monthly Rows: This estimate is based on history and business rules.

    Expected Monthly Bytes: Expected Monthly Bytes is a calculation of Average Row Length times Expected Monthly Rows.

    2.2 Designing the Staging Area (cont)

    Initial Table Size: The initial table size is usually represented in bytes or megabytes.

    Table Size 6 Months: An estimation of table sizes after six months of activity helps the DBA team to estimate how the staging database or file system grows.
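The size columns of the volumetric worksheet can be sketched as simple arithmetic. Only the Expected Monthly Bytes formula is stated on the slides; the formulas for initial size and six-month size below are assumptions (steady monthly growth on top of the initial load), and the class name, table name, and figures are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class StagingTableEstimate:
    """One row of a volumetric worksheet (hypothetical field names)."""

    table_name: str
    initial_row_count: int
    average_row_length: int  # bytes per row
    expected_monthly_rows: int

    @property
    def expected_monthly_bytes(self) -> int:
        # Stated on the worksheet: average row length times expected monthly rows.
        return self.average_row_length * self.expected_monthly_rows

    @property
    def initial_table_size(self) -> int:
        # Assumption: average row length times initial row count.
        return self.average_row_length * self.initial_row_count

    @property
    def table_size_6_months(self) -> int:
        # Assumption: initial size plus six months of steady growth.
        return self.initial_table_size + 6 * self.expected_monthly_bytes


estimate = StagingTableEstimate(
    table_name="stg_customer",
    initial_row_count=100_000,
    average_row_length=200,
    expected_monthly_rows=5_000,
)
print(estimate.expected_monthly_bytes, estimate.table_size_6_months)
```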

    The ETL architect needs to arrange for the allocation and configuration of data files that reside on the file system as part of the data-staging area to support the ETL process.

    2.3 Data Structures in the ETL System

    In this section, we describe the important types of data structures you are likely to need in your ETL system.

    2.3.1 Flat Files

    When data is stored in columns and rows within a file on your file system to emulate a database table, it is referred to as a flat or sequential file.

    Arguments in favor of relational tables:

    It is always faster to WRITE to a flat file as long as you are truncating or inserting.

    There is no real concept of UPDATING existing records of a flat file efficiently.

    When you READ from a staging table in the ETL system, being able to work in SQL and get automatic database parallelism for free is a very elegant approach.

    2.3.1 Flat Files (cont)

    Staging source data for safekeeping and recovery.

    Sorting data: Sorting is a prerequisite to virtually every data integration task.

    Filtering: Suppose you need to filter on an attribute that is not indexed on the source database.

    Replacing/substituting text strings.

    Aggregation

    Referencing source data.
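Several of the flat-file uses listed above (filtering, sorting, text substitution, aggregation) can be sketched with Python's standard `csv` module. The file contents, column names, and the choice of filter are hypothetical; a real job would read the staged file from disk rather than from an in-memory string.

```python
import csv
import io
from collections import defaultdict

# Hypothetical staged flat file, held in memory to keep the sketch self-contained.
FLAT_FILE = """order_id,region,amount
3,west,20.00
1,east,10.50
2,west,5.25
4,north,7.00
"""

rows = list(csv.DictReader(io.StringIO(FLAT_FILE)))

# Filtering: drop a region without needing an index on the source database.
filtered = [r for r in rows if r["region"] != "north"]

# Sorting: order by the numeric order id.
sorted_rows = sorted(filtered, key=lambda r: int(r["order_id"]))

# Replacing/substituting text strings.
for r in sorted_rows:
    r["region"] = r["region"].replace("west", "WEST")

# Aggregation: total amount per region.
totals = defaultdict(float)
for r in sorted_rows:
    totals[r["region"]] += float(r["amount"])

order_ids = [r["order_id"] for r in sorted_rows]
```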

    2.3.2 XML Data Sets

    XML is a language for data communication.

    XML metadata consists of tags unambiguously identifying each item in an XML document.

    XML has extensive capability for declaring hierarchical structures, such as complex forms with nested repeating subfields.

    XML is today an extremely effective medium for moving data between otherwise incompatible systems.

    2.3.2 XML Data Sets (cont)

    DTDs, XML Schemas, and XSLT

    The DTD declaration for our customer example could be cast as:

    <!ELEMENT Customer (Name, Address, City, State, Postalcode)>

    <!ELEMENT Name (#PCDATA)>
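As a rough illustration of how an ETL job might flatten a document shaped like this DTD into a staging row, here is a sketch using Python's standard `xml.etree.ElementTree`. The sample values and the function `customer_to_row` are hypothetical. Note that ElementTree does not validate against a DTD, so the only structural check here is the root tag.

```python
import xml.etree.ElementTree as ET

# Hypothetical customer document following the element names in the DTD.
CUSTOMER_XML = """<Customer>
  <Name>Jane Doe</Name>
  <Address>1 Main St</Address>
  <City>Springfield</City>
  <State>IL</State>
  <Postalcode>62701</Postalcode>
</Customer>"""

FIELDS = ["Name", "Address", "City", "State", "Postalcode"]


def customer_to_row(xml_text):
    """Flatten one Customer element into a dict keyed by child tag."""
    root = ET.fromstring(xml_text)
    if root.tag != "Customer":
        raise ValueError(f"unexpected root element: {root.tag}")
    # findtext returns None for a missing child, which surfaces gaps downstream.
    return {field: root.findtext(field) for field in FIELDS}


row = customer_to_row(CUSTOMER_XML)
```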

    XML Schemas contain much more database-oriented information about data types.

    XSLT is a general mechanism for translating one XML document into another XML document.

    2.3.3 Relational Tables

    Apparent metadata.

    2.3.4 Independent DBMS Working Tables

    If you decide to store your staging data in a DBMS, you have several architecture options when you are modeling the data-staging schema.

    To justify the use of independent staging tables, we'll use one of our favorite aphorisms: Keep it simple.

    Most of the time, the reason you create a staging table is to set the data down so you can again manipulate it using SQL or a scripting language.
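A minimal sketch of an independent staging table, using Python's built-in `sqlite3` as a stand-in for the staging DBMS. The table has no keys or relationships to other tables; the data is set down and then manipulated with plain SQL. The table name, columns, and rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# An independent working table: no foreign keys, no constraints.
conn.execute("CREATE TABLE stg_customer (source_id TEXT, name TEXT, state TEXT)")
conn.executemany(
    "INSERT INTO stg_customer VALUES (?, ?, ?)",
    [("A1", " jane doe ", "il"), ("A2", "john roe", "ny")],
)

# Manipulate the staged data in place with SQL.
conn.execute("UPDATE stg_customer SET name = TRIM(name), state = UPPER(state)")

cleaned = conn.execute(
    "SELECT source_id, name, state FROM stg_customer ORDER BY source_id"
).fetchall()
conn.close()
```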

    2.3.5 Third Normal Form Entity/Relation Models

    There are arguments that the data-staging area is perhaps the central repository of all the enterprise data that eventually gets loaded into the data warehouse.

    Remember two of the goals for designing your ETL processes we described at the beginning of this chapter: Make them fast and make them recoverable.

    2.3.6 Nonrelational Data Sources

    A common reason for creating a dedicated staging environment is to integrate nonrelational data.

    The power of ETL tools in handling heterogeneous data minimizes the need to store all of the necessary data in a single database.

    2.3.7 Dimensional Data Models: The Handoff from the Back Room to the Front Room

    Dimensional data structures are the target of the ETL processes, and these tables sit at the boundary between the back room and the front room.

    Dimensional data models are by far the most popular data structures for end user querying and analysis.

    This section is a brief introduction to the main table types in a dimension model.

    2.3.8 Fact Tables

    A single measurement creates a single fact table record.

    2.3.9 Dimension Tables

    The dimensional model does not anticipate or depend upon the intended query uses.

    2.3.10 Atomic and Aggregate Fact Tables

    It's good practice to partition fact tables stored in the staging area because its resulting aggregates will most likely be based on a specific period, perhaps monthly or quarterly.

    Dimensionally designed tables in the staging area are in many cases required for populating online analytic processing (OLAP) cubes.
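To make the relationship between atomic and aggregate fact tables concrete, the following sketch rolls atomic fact rows up to a monthly grain. The tuple layout (date, product key, quantity), the sample rows, and the function name `aggregate_monthly` are hypothetical; in practice this would more likely be a SQL GROUP BY over a period-partitioned table.

```python
from collections import defaultdict
from datetime import date

# Hypothetical atomic fact rows: one row per measurement.
atomic_facts = [
    (date(2008, 1, 5), "P1", 10),
    (date(2008, 1, 20), "P1", 5),
    (date(2008, 2, 3), "P1", 7),
    (date(2008, 2, 9), "P2", 4),
]


def aggregate_monthly(facts):
    """Sum quantity per (year, month, product key)."""
    totals = defaultdict(int)
    for sale_date, product_key, quantity in facts:
        totals[(sale_date.year, sale_date.month, product_key)] += quantity
    return dict(totals)


monthly = aggregate_monthly(atomic_facts)
```

Because each aggregate row depends only on the atomic rows of its own month, partitioning the atomic table by period lets a monthly aggregate be rebuilt from a single partition.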

    2.3.11 Surrogate Key Mapping Tables

    Surrogate key mapping tables are designed to map natural keys from the disparate source systems to their master data warehouse surrogate key.

    Mapping tables can be equally effective if they are stored in a database or on the file system.
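A surrogate key mapping table can be sketched as a lookup from (source system, natural key) to an integer surrogate key. The class `SurrogateKeyMap`, its methods, and the system names below are hypothetical, and the mapping is held in memory only; as the slide notes, a database table or a file would serve equally well.

```python
class SurrogateKeyMap:
    """In-memory mapping of (source system, natural key) to a surrogate key."""

    def __init__(self):
        self._map = {}
        self._next_key = 1

    def assign(self, source_system, natural_key):
        """Return the existing surrogate key, or issue the next integer."""
        pair = (source_system, natural_key)
        if pair not in self._map:
            self._map[pair] = self._next_key
            self._next_key += 1
        return self._map[pair]

    def link(self, source_system, natural_key, surrogate_key):
        """Point another system's natural key at an already issued surrogate key."""
        if surrogate_key not in self._map.values():
            raise KeyError(f"unknown surrogate key: {surrogate_key}")
        self._map[(source_system, natural_key)] = surrogate_key


key_map = SurrogateKeyMap()
crm_key = key_map.assign("crm", "C-100")       # first customer seen
key_map.link("billing", "9001", crm_key)       # same customer in another system
billing_key = key_map.assign("billing", "9001")
other_key = key_map.assign("crm", "C-200")     # a different customer
```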

    2.4 Planning and Design Standards

    The data-staging area must be a controlled environment.

    People, especially developers, are very creative when it comes to reusing existing resources.

    2.4.1 Impact Analysis

    Impact analysis examines the metadata associated with an object (in this case a table or column) and determines what is affected by a change to its structure or content.
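One way to picture impact analysis is as a walk over dependency metadata. The sketch below assumes the metadata is available as a simple mapping from each object to the objects that read from it; the object names and the function `impacted_objects` are hypothetical, and real metadata repositories are considerably richer.

```python
from collections import deque

# Hypothetical dependency metadata: object -> objects that depend on it.
DEPENDENTS = {
    "src.customer.name": ["stg_customer.name"],
    "stg_customer.name": ["dim_customer.customer_name"],
    "dim_customer.customer_name": ["rpt_sales_by_customer"],
    "stg_customer.state": ["dim_customer.state"],
}


def impacted_objects(changed, dependents):
    """Return every downstream object affected by a change, breadth-first."""
    seen = set()
    order = []
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for child in dependents.get(current, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


impacted = impacted_objects("stg_customer.name", DEPENDENTS)
```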

    2.4.2 Metadata Capture

    Metadata has many different meanings depending on its context.

    Types of metadata derived by the staging area include the following:

    Data Lineage

    Business Definitions

    Technical Definitions

    Process Metadata

    2.4.3 Naming Conventions

    The data-staging area may contain tables or elements that are not in the data warehouse presentation layer and do not have established naming standards.

    Work with the data warehouse team and DBA group to embellish the existing naming standards to include special data-staging tables.

    2.4.4 Auditing Data Transformation Steps

    Replacing natural keys with surrogate keys

    Combining and deduplicating entities

    Conforming commonly used attributes in dimensions

    Standardizing calculations, creating conformed key performance indicators (KPIs)

    Correcting and coercing data in the data cleaning routines

    2.5 Summary

    We have reviewed the primary data structures you need in your ETL system.

    We started by making the case for staging data in many places, for transient and permanent needs.

    A mature ETL environment will be a mixture of flat files and independent relational tables.

    We touched on some best-practice issues, including adopting a set of consistent design standards, performing systematic impact analyses on your table designs, and changing those table designs.