1597 edw big data analytics kimball
8/12/2019 1597 EDW Big Data Analytics Kimball
The Evolving Role of the Enterprise Data Warehouse in the Era of Big Data Analytics
A Kimball Group White Paper
By Ralph Kimball
Table of Contents
Executive Summary
About the Author
Introduction
Data is an asset on the balance sheet
Raising the curtain on big data analytics
Use cases for big data analytics
Making sense of big data analytic use cases
Big data analytics system requirements
Extended relational database management systems
MapReduce/Hadoop systems
How MapReduce works in Hadoop
Tools for the Hadoop environment
Feature convergence in the coming decade
Reusable analytics
Complex event processing (CEP)
Data warehouse cultural changes in the coming decade
Sandboxes
Low latency
Continuous thirst for more exquisite detail
Light touch data waits for its relevance to be exposed
Simple analysis of all the data trumps sophisticated analysis of some of the data
Data structures should be declared at query time, not at data load time
The EDW supporting big data analytics must be magnetic, agile, and deep
The conflict between abstraction and control
Data warehouse organization changes in the coming decade
Technical skill sets required
New organizations required
New development paradigms required
Lessons from the early data warehousing era
Analytics in the cloud
Whither EDW?
Acknowledgements
References
Executive Summary
In this white paper, we describe the rapidly evolving landscape for designing an enterprise data warehouse (EDW) to support business analytics in the era of "big data." We describe the scope and challenges of building and evolving a very stable and successful EDW architecture to meet new business requirements. These include extreme integration, semi-structured and unstructured data sources, petabytes of behavioral and image data accessed through MapReduce/Hadoop as well as massively parallel relational databases, and then structuring the EDW to support advanced analytics. This paper provides detailed guidance for designing and administering the necessary processes for deployment. This white paper has been written in response to a lack of specific guidance in the industry as to how the EDW needs to respond to the big data analytics challenge, and what necessary design elements are needed to support these new requirements.
About the Author
Ralph Kimball founded the Kimball Group. Since the mid 1980s, he has been the data warehouse/business intelligence (DW/BI) industry's thought leader on the dimensional approach and has trained more than 10,000 IT professionals. Prior to working at Metaphor and founding Red Brick Systems, Ralph co-invented the Star workstation at Xerox's Palo Alto Research Center (PARC). Ralph has his Ph.D. in Electrical Engineering from Stanford University.
The Kimball Group is the source for dimensional DW/BI consulting and education, consistent with our best-selling Toolkit book series, Design Tips, and award-winning articles. Visit www.kimballgroup.com for more information.
Introduction
What is big data? Its bigness is actually not the most interesting characteristic. Big data is structured, semistructured, unstructured, and raw data in many different formats, in some cases looking totally different from the clean scalar numbers and text we have stored in our data warehouses for the last 30 years. Much big data cannot be analyzed with anything that looks like SQL. But most important, big data is a paradigm shift in how we think about data assets: where we collect them, how we analyze them, and how we monetize the insights from the analysis. The big data revolution is about finding new value within and outside conventional data sources. An additional approach is needed because the software and hardware environments of the past have not been able to capture, manage, or process the new forms of data within reasonable development times or processing times. We are challenged to reorganize our information management landscape to extend a remarkably stable and successful EDW architecture to this new era of big data analytics.
In reading this white paper please bear in mind that the consistent view of this author has always been that the "data warehouse" comprises the complete ecosystem for extracting, cleaning, integrating and delivering data to decision makers, and therefore includes the extract-transform-load (ETL) and business intelligence (BI) functions considered as outside of the data warehouse by more conservative writers. This author has always taken the view that data warehousing has a very comprehensive role in capturing all forms of enterprise data, and then preparing that data for the most effective use by decision-makers all across the enterprise. This white paper takes the aggressive view that the enterprise data warehouse is on the verge of a very exciting new set of responsibilities. The scope of the EDW will increase dramatically.
Also, in this white paper, although we consistently use the term ETL to describe the movement of data within the enterprise data warehouse, the conventional use of this term does not do justice to the much larger responsibility of moving data across networks and between systems and between profoundly different processes in the world of big data analytics. ETL is a portion of a much larger technology called data integration (DI). Since we have used ETL consistently in our books and classes for many years, we will keep that terminology in this paper, bearing in mind that ETL is meant in the larger sense of DI.
This white paper stands back from the marketplace as it exists in early 2011 to highlight the clearly emerging new trends brought by the big data revolution. And a revolution it is. As James Markarian, Informatica's Executive Vice President and Chief Technology Officer, remarked: "the database market has finally gotten interesting again." Because many of the new big data tools and approaches are version 1 or even version 0 developments, the landscape will continue to change rapidly. However, there is growing awareness in the marketplace that new kinds of analysis are possible and that key competitors, especially e-commerce enterprises, are already taking advantage of the new paradigm. This white paper is intended to be a guide to help business intelligence, data warehousing and information management professionals and management teams understand and prepare for big data as a complementary extension to their current EDW architecture.
Data is an asset on the balance sheet
Enterprises increasingly recognize that data itself is an asset that should appear on the balance sheet in the same way that traditional assets from the manufacturing age such as equipment and land have always appeared. There are several ways to determine the value of the data asset, including
cost to produce the data
cost to replace the data if it is lost
revenue or profit opportunity provided by the data
revenue or profit loss if data falls into competitors' hands
legal exposure from fines and lawsuits if data is exposed to the wrong parties
But more important than the data itself, enterprises have shown that insights from data can be monetized. When an e-commerce site detects an increase in favorable click throughs from an experimental ad treatment, that insight can be taken to the bottom line immediately. This direct cause-and-effect is easily understood by management, and an analytic research group that consistently demonstrates these insights is looked upon as a strategic resource for the enterprise by the highest levels of management. This growth in business awareness of the value of data-driven insights is rapidly spreading outward from the e-commerce world to virtually every business segment.
Data warehousing, of course, has been demonstrating the value of data-driven insights for at least 20 years. But until quite recently data warehousing has been focused on historical transaction data. During the past decade from 2000 to 2009, three major seismic shifts occurred in data warehousing. The first, early in the decade, was the decisive introduction of low latency operational data into the data warehouse together with the existing historical data. Of course, many of these new operational data use cases benefited from real-time data, in some cases demanding instantaneous delivery. The second seismic shift, which grew steadily throughout the decade, was the gathering of customer behavior data, which not only included traditional transactions such as purchases and click throughs but added huge volumes of "subtransactions" that represented measurable events leading up to the transactions themselves. For example, all the web page events a customer engaged in prior to the final transaction event became a record of customer behavior. "Good paths" through these web page event histories gave lots of insight into productive (i.e., monetizable) customer behavior.
The third seismic event, which is gathering enormous momentum as we transition into the current decade, is the extraction of product preferences and customers' sentiments from social media, especially the massive quantities of machine-generated unstructured data generated by the new business paradigms of dot-com companies. It is this final seismic shift that has pushed many enterprises into looking seriously at unstructured data for the first time, and asking "how on earth do we analyze this stuff?" The point here is not that unstructured data is some new thing recently discovered, but rather that the analysis of unstructured data has gone mainstream just recently.
Raising the curtain on big data analytics
Use cases for big data analytics
Big data analytics use cases are spreading like wildfire. Here is a set of use cases reported recently, including a benchmark set of "Hadoop-able" use cases proposed by Jeff Hammerbacher, Chief Scientist for Cloudera. Following these brief descriptions is a table summarizing the salient structure and processing characteristics of each use case. Note that none of these use cases can be satisfied with scalar numeric data, nor can any be properly analyzed by simple SQL statements. All of them can be scaled into the petabyte range and beyond with appropriate business assumptions.
Search ranking. All search engines attempt to rank the relevance of a web page to a search request against all other possible web pages. Google's PageRank algorithm is, of course, the poster child for this use case.
Ad tracking. E-commerce sites typically record an enormous river of data including every page event in every user session. This allows for very short turnaround of experiments in ad placement, color, size, wording, and other features. When an experiment shows that such a feature change in an ad results in improved click through behavior, the change can be implemented virtually in real time.
Location and proximity tracking. Many use cases add precise GPS location tracking, together with frequent updates, in operational applications, security analysis, navigation, and social media. Precise location tracking opens the door for an enormous ocean of data about other locations near the GPS measurement. These other locations may represent opportunities for sales or services.
Causal factor discovery. Point-of-sale data has long been able to show us when the sales of a product go sharply up or down. But searching for the causal factors that explain these deviations has been, at best, a guessing game or an art form. The answers may be found in competitive pricing data, competitive promotional data including print and television media, weather, holidays, national events including disasters, and virally spread opinions found in social media. See the next use case as well.
Social CRM. This use case is one of the hottest new areas for marketing analysis. The Altimeter Group has described a very useful set of key performance indicators for social CRM that include share of voice, audience engagement, conversation reach, active advocates, advocate influence, advocacy impact, resolution rate, resolution time, satisfaction score, topic trends, sentiment ratio, and idea impact. The calculation of these KPIs involves in-depth trolling of a huge array of data sources, especially unstructured social media.
Document similarity testing. Two documents can be compared to derive a metric of similarity. There is a large body of academic research and tested algorithms, for example latent semantic analysis, that is just now finding its way to driving monetized insights of interest to big data practitioners. For example, a single source document can be used as a kind of multifaceted template to compare against a large set of target documents. This could be used for threat discovery, sentiment analysis, and opinion polls. For example: "find all the documents that agree with my source document on global warming."
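As a rough illustration of using one source document as a template against a set of targets, the sketch below ranks target documents by the cosine similarity of simple word-count vectors. This is a deliberately simpler stand-in for latent semantic analysis, not LSA itself, and it measures shared vocabulary rather than agreement. The function names and sample texts are hypothetical.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Lowercase the text and count word occurrences (bag of words)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(source, targets):
    """Return (score, text) pairs, most similar to the source first."""
    src = term_vector(source)
    scored = [(cosine_similarity(src, term_vector(t)), t) for t in targets]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Hypothetical sample documents, for illustration only.
source_doc = "global warming is driven by rising carbon emissions"
target_docs = [
    "rising carbon emissions are driving global warming",
    "the quarterly sales report shows rising revenue",
    "a recipe for lemon cake",
]
ranking = rank_by_similarity(source_doc, target_docs)
```

At big data scale the same comparison would run as a scan over the full target set, which is why this use case stresses both free text handling and access to all the data.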
Genomics analysis: e.g., commercial seed gene sequencing. A few months ago the cotton research community was thrilled by a genome sequencing announcement that stated in part "The sequence will serve a critical role as the reference for future assembly of the larger cotton crop genome. Cotton is the most important fiber crop worldwide and this sequence information will open the way for more rapid breeding for higher yield, better fiber quality and adaptation to environmental stresses and for insect and disease resistance. Scientist Ryan Rapp stressed the importance of involving the cotton research community in analyzing the sequence, identifying genes and gene families and determining the future directions of research." (SeedQuest, Sept 22, 2010). This use case is just one example of a whole industry that is being formed to address genomics analysis broadly, beyond this example of seed gene sequencing.
Discovery of customer cohort groups. Customer cohort groups are used by many enterprises to identify common demographic trends and behavior histories. We are all familiar with Amazon's cohort groups when they say "other customers who bought the same book as you have also bought the following books." Of course, if you can sell your product or service to one member of a cohort group, then all the rest may be reasonable prospects. Cohort groups are represented logically and graphically as links, and much of the analysis of cohort groups involves specialized link analysis algorithms.
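To make the idea of cohort groups as links concrete, here is a minimal sketch that counts co-purchase links between items and derives an "also bought" list. It is a toy illustration under assumed data, not a description of Amazon's method; the customer ids, item names, and function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def co_purchase_links(baskets):
    """Count, for each pair of items, how many customers bought both.

    `baskets` maps a customer id to the set of items that customer bought.
    Each counted pair is a weighted link in the cohort graph.
    """
    links = defaultdict(int)
    for items in baskets.values():
        for a, b in combinations(sorted(items), 2):
            links[(a, b)] += 1
    return links


def also_bought(item, baskets):
    """Items bought by customers who also bought `item`, most common first."""
    counts = defaultdict(int)
    for items in baskets.values():
        if item in items:
            for other in items - {item}:
                counts[other] += 1
    return sorted(counts, key=lambda other: (-counts[other], other))


# Hypothetical purchase histories, for illustration only.
baskets = {
    "cust1": {"book_a", "book_b"},
    "cust2": {"book_a", "book_b", "book_c"},
    "cust3": {"book_a", "book_c"},
    "cust4": {"book_d"},
}
links = co_purchase_links(baskets)
suggestions = also_bought("book_a", baskets)
```

The pair counting grows quadratically with basket size, which hints at why link analysis over full customer histories becomes a big data problem.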
In-flight aircraft status. This use case as well as the following two use cases are made possible by the introduction of sensor technology everywhere. In the case of aircraft systems, the in-flight status of hundreds of variables on engines, fuel systems, hydraulics, and electrical systems is measured and transmitted every few milliseconds. The value of this use case is not just the engineering telemetry data that could be analyzed at some future point in time; the data also drives real-time adaptive control, fuel usage, part failure prediction, and pilot notification.
Smart utility meters. It didn't take long for utility companies to figure out that a smart meter can be used for more than just the monthly readout that produces the customer's utility bill. By drastically cranking up the frequency of the readouts to as much as one readout per second per meter across the entire customer landscape, many useful analyses can be performed including dynamic load-balancing, failure response, adaptive pricing, and longer-term strategies for incenting customers to utilize the utility more effectively (either from the customer's point of view or the utility's point of view!)
Data bag exploration. There are many situations in commercial environments and in the research communities where large volumes of raw data are collected. One example might be data collected about structure fires. Beyond the predictable dimensions of time, place, primary cause of fire, and responding firefighters, there may be a wealth of unpredictable anecdotal data that at best can be modeled as a disorderly collection of name value pairs, such as "contributing weather=lightning." Another example would be the listing of all relevant financial assets for a defendant in a lawsuit. Again such a list is likely to be a disorderly collection of name value pairs, such as "shared real estate ownership=condominium." The list of examples like this is endless. What they have in common is the need to encapsulate the disorderly collection of name value pairs which is generally known as a "data bag." Complex data bags may contain both name value pairs as well as embedded sub data bags. The challenge in this use case is to find a common way to approach the analysis of data bags when the content of the data may need to be discovered after the data is loaded.
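A small sketch may help show what "discovering content after the data is loaded" can look like. It models a data bag as a nested dictionary, which is an assumption made for illustration; the record layout, the dotted-path naming, and the function names are all hypothetical.

```python
def discover_names(bag, prefix=""):
    """Walk a data bag and yield every name, using dotted paths for sub-bags.

    A bag is modeled here as a dict whose values are either scalars or
    further dicts (embedded sub data bags).
    """
    for name, value in bag.items():
        path = f"{prefix}{name}"
        if isinstance(value, dict):
            yield from discover_names(value, prefix=f"{path}.")
        else:
            yield path


def name_frequencies(bags):
    """Count how often each name appears across many loaded bags."""
    counts = {}
    for bag in bags:
        for path in discover_names(bag):
            counts[path] = counts.get(path, 0) + 1
    return counts


# Hypothetical structure-fire records: fixed dimensions plus a free-form bag.
fires = [
    {"primary_cause": "electrical",
     "bag": {"contributing weather": "lightning"}},
    {"primary_cause": "cooking",
     "bag": {"contributing weather": "high wind",
             "building": {"sprinklers": "none", "stories": 3}}},
]
frequencies = name_frequencies(f["bag"] for f in fires)
```

The point of the sketch is that no schema for the bag is declared up front: the set of names, and how common each one is, emerges only by scanning the loaded data.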
The final two use cases are old and venerable examples that even predate data warehousing itself. But new life has been breathed into these use cases because of the exciting potential of ultra-atomic customer behavior data.
Loan risk analysis and insurance policy underwriting. In order to evaluate the risk of a prospective loan or a prospective insurance policy, many data sources can be brought into play, ranging from payment histories and detailed credit behavior to employment data and financial asset disclosures. In some cases the collateral for a loan or the insured item may be accompanied by image data.
Customer churn analysis. Enterprises concerned with churn want to understand the predictive factors leading up to the loss of a customer, including that customer's detailed behavior as well as many external factors including the economy, life stage and other demographics of the customer, and finally real time competitive issues.
Making sense of big data analytic use cases
Certainly the purpose of developing this list of use cases is to convince the reader that the use cases come in all shapes and sizes and formats, and require many specialized approaches to analyze. Up until very recently all these use cases existed as separate endeavors, often involving special purpose built systems. But the industry awareness of the "big data analytics challenge" is motivating everyone to look for the architectural similarities and differences across all these use cases. Any given enterprise is increasingly likely to encounter one or more of these use cases. That realization is driving the interest in system architectures that address the big data analytics problem in a general way. Please study the following table.
Table columns, left to right: Vector, matrix, or complex structure; Free text; Image or binary data; Data bags; Iterative logic or complex branching; Advanced analytic routines; Rapidly repeated measurements; Extreme low latency; Access to all data required.
Each X marks one characteristic that applies to the use case; the column position of each X was not preserved.
Search ranking X X X X X X
Ad tracking X X X X X X X X
Location & proximity X X X X X
Causal discovery X X X X X X X
Social CRM X X X X X X X X
Document similarity X X X X X X X
Genomic analysis X X X X X
Cohort groups X X X X X X
In-flight engine status X X X X X X
Smart utility meters X X X X X X
Building sensors X X X X X X X X
Satellite images X X X X
CAT scans X X X X X X
Financial fraud X X X X X X X X X
Hacking detection X X X X X X X X X
Game gestures X X X X X X X X
Big science X X X X X X X X X
Data bag exploration X X X X X X
Risk analysis X X X X X X X X
Churn analysis X X X X X X X
The sheer density of this table makes it clear that systems to support big data analytics have to look very different than the classic relational database systems from the 1980s and 1990s. The original RDBMSs were not built to handle any of the requirements represented as columns in this table!
Big data analytics system requirements
Before discussing the exciting new technical and architectural developments of the 2010s, let's summarize the overall requirements for supporting big data analytics, keeping in mind that we are not requiring a single system or a single vendor's technology to provide a blanket solution for every use case. From the perspective of 2011, we have the luxury of standing back from all these use cases gathered in the last few years, and we are now in a position to surround the requirements with some confidence.
The development of big data analytics has reached a point where it needs an overall mission statement and identity independent of a list of use cases. Many of us have lived through earlier instantiations of advanced analytics that went by the names of advanced statistics, artificial intelligence and data mining. None of these earlier waves became a coherent theme that transcended the individual examples, as compelling as those examples were.
Here is an attempt to step back and define the characteristics of big data analytics at the highest levels. In the following, the term "UDF" is used in the broadest sense of any user defined function or program or algorithm that may appear anywhere in the end-to-end analysis architecture.
In the coming 2010s decade, the analysis of big data will require a technology or combination of technologies capable of:
scaling to easily support petabytes (thousands of terabytes) of data
being distributed across thousands of processors, potentially geographically unaware, and potentially heterogeneous
subsecond response time for highly constrained standard SQL queries
embedding arbitrarily complex user-defined functions (UDFs) within processing requests
implementing UDFs in a wide variety of industry-standard procedural languages
assembling extensive libraries of reusable UDFs crossing most or all use cases
executing UDFs as "relation scans" over petabyte sized data sets in a few minutes
supporting a wide variety of data types growing to include images, waveforms, arbitrarily hierarchical data structures, and data bags
loading data to be ready for analysis, at very high rates, at least gigabytes per second
integrating data from multiple sources during the load process at very high rates (GB/sec)
loading data before declaring or discovering its structure
executing certain streaming analytic queries in real time on incoming load data
updating data in place at full load speeds
joining a billion row dimension table to a trillion row fact table without pre-clustering the dimension table with the fact table
scheduling and execution of complex multi-hundred node workflows
being configured without being subject to a single point of failure
failover and process continuation when processing nodes fail
supporting extreme mixed workloads including thousands of geographically dispersed on-line users and programs executing a variety of requests ranging from ad hoc queries to strategic analysis, and while loading data in batch and streaming fashion
Two architectures have emerged to address big data analytics: extended RDBMS, and MapReduce/Hadoop. These architectures are being implemented as completely separate systems and in various interesting hybrid combinations involving both architectures. We will start by discussing the architectures separately.
Extended relational database management systems
All of the major relational database management system vendors are adding features to address big data analytics from a solid relational perspective. The two most significant architectural developments have been the overtaking of the high end of the market with massively parallel processing (MPP), and the growing adoption of columnar storage. When MPP and columnar storage techniques are combined, a number of the system requirements in the above list can start to be addressed, including:
scaling to support exabytes (thousands of petabytes) of data
being distributed across tens of thousands of geographically dispersed processors
subsecond response time for highly constrained standard SQL queries
updating data in place at full load speeds
being configured without being subject to a single point of failure
failover and process continuation when processing nodes fail
Additionally, RDBMS vendors are adding some complex user-defined functions (UDFs) to their syntax, but the kind of general purpose procedural language computing required by big data analytics is not being satisfied in relational environments at this time.
In a similar vein, RDBMS vendors are allowing complex data structures to be stored in individual fields. These kinds of embedded complex data structures have been known as "blobs" for many years. It's important to understand that relational databases have a hard time providing general support for interpreting blobs since blobs do not fit the relational paradigm. An RDBMS indeed provides some value by hosting the blobs in a structured framework, but much of the complex interpretation and computation on the blobs must be done with specially crafted UDFs, or BI application layer clients. Blobs are related to data bags discussed elsewhere in this paper. See the section entitled "Data structures should be declared at query time."
MPP implementations have never satisfactorily addressed the "big join" issue, in which a billion row dimension table must be joined to a trillion row fact table without resorting to clustered storage. The big join crisis occurs when an ad hoc constraint is placed against the dimension table, resulting in a potentially very large set of dimension keys that must be physically downloaded into every one of the physical segments of the trillion row fact table stored separately in the MPP system. Since the dimension keys are scattered randomly across the separate segments of the trillion row fact table, it is very hard to avoid a lengthy download step of the very large dimension table to every one of the fact table storage partitions. To be fair, the MapReduce/Hadoop architecture has not been able to address the big join problem either.
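The cost of the big join can be seen in a toy model. The sketch below applies an ad hoc constraint to a small dimension table and then ships the full set of qualifying keys to every fact table segment, counting how many keys move. It is an illustration under assumed data, not any vendor's join algorithm; the table contents, the `customer_key` column, and the function names are hypothetical.

```python
def qualifying_keys(dimension, predicate):
    """Apply an ad hoc constraint to the dimension table; return matching keys."""
    return {key for key, row in dimension.items() if predicate(row)}


def broadcast_join(dimension, fact_segments, predicate):
    """Join by shipping the full qualifying key set to every fact segment.

    Returns the matching fact rows and the total number of keys shipped.
    """
    keys = qualifying_keys(dimension, predicate)
    matches, keys_shipped = [], 0
    for segment in fact_segments:
        keys_shipped += len(keys)  # each segment needs its own copy of the keys
        matches.extend(row for row in segment if row["customer_key"] in keys)
    return matches, keys_shipped


# Tiny hypothetical tables; real ones would hold billions and trillions of rows.
dimension = {k: {"region": "west" if k % 2 else "east"} for k in range(1, 9)}
fact_rows = [{"customer_key": 1 + i % 8, "amount": 10 * i} for i in range(40)]
# Fact rows are spread over segments with no regard to customer_key.
fact_segments = [fact_rows[i::4] for i in range(4)]

matches, keys_shipped = broadcast_join(
    dimension, fact_segments, lambda row: row["region"] == "west"
)
```

The keys shipped grow as the number of qualifying keys times the number of segments. With, say, 100 million qualifying keys and 1,000 segments (both numbers assumed purely for scale), that is 100 billion key deliveries before any fact row is examined.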
Columnar data storage fits the relational paradigm, and especially dimensionally modeled databases, very well. Besides the significant advantage of high compression of sparse data, columnar databases allow a very large number of columns compared to row-oriented databases, and place little overhead on the system when columns are added to an existing schema. The most significant Achilles' heel, at least in 2011, is the slow loading speed of data into the columnar format. Although impressive load speed improvements are being announced by columnar database vendors, they have still not achieved the gigabytes-per-second requirement listed above.
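The following sketch illustrates why sparse data compresses well once it is stored by column, and why adding a column is cheap. It uses run-length encoding as one common columnar compression technique; the paper does not say which techniques vendors use, so that choice, along with the sample rows and function names, is an assumption for illustration.

```python
def to_columns(rows):
    """Pivot a list of row dicts into one list per column."""
    names = sorted({name for row in rows for name in row})
    return {name: [row.get(name) for row in rows] for name in names}


def run_length_encode(values):
    """Collapse consecutive repeats into [value, run_length] pairs."""
    runs = []
    for value in values:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return runs


# Hypothetical fact rows with a sparse "promo_code" column.
rows = [
    {"store": "A", "promo_code": None},
    {"store": "A", "promo_code": None},
    {"store": "A", "promo_code": "SPRING"},
    {"store": "B", "promo_code": None},
    {"store": "B", "promo_code": None},
]
columns = to_columns(rows)
encoded = {name: run_length_encode(col) for name, col in columns.items()}

# Adding a column touches only the new column's storage, not existing ones.
columns["returned"] = [False] * len(rows)
```

The pivot step in `to_columns` also hints at the load-speed problem noted above: every incoming row has to be taken apart and appended to many separate column structures.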
The standard RDBMS architecture for implementing an enterprise data warehouse based on dimensional modeling principles is simple and well understood, as shown in Figure 1. Recall that throughout this white paper, the EDW is defined in the comprehensive sense to include all back room and front room processes including ETL, data presentation, and BI applications.
Figure 1. The standard RDBMS based architecture for an enterprise data warehouse
Source: The Data Warehouse Lifecycle Toolkit, 2nd edition, Kimball et al. (2008)
In this standard EDW architecture the ETL system is a major component that sits between the source systems and the presentation servers that are responsible for exposing all data to business intelligence applications. In this view, the ETL system adds significant value by cleaning, conforming, and arranging the data into a series of dimensional schemas which are then stored physically in the presentation server. A crucial element of this architecture is the preparation of conformed dimensions in the ETL system, which serve as the basis of integration for the BI applications.
It is the strong conviction of this author that deferring the building of the dimensional structures and the issues of integration until query time is the wrong architecture. Such a "deferred computation" approach requires an unduly expensive query optimizer to correctly query complex non-dimensional models every time a query is presented. The calculation of integration at query processing time generally requires complex application logic in the BI tools which also might have to be executed for every query.
The extended RDBMS architecture to support big data analytics preserves the standard architecture with a number of important additions, shown below in Figure 2 with large arrows:
Figure 2. The extended RDBMS based architecture for an enterprise data warehouse
The fact that the high-level enterprise data warehouse architecture is not materially changed by the introduction of new data structures, or a growing library of specially crafted user-defined functions, or powerful procedural language-based programs acting as powerful BI clients, is the charm of the extended RDBMS approach to big data analytics. The major RDBMS players are able to marshal their enormous legacy of millions of lines of code, powerful governance capabilities, and system stability built over decades of serving the marketplace.
However, it is the opinion of this author that the extended RDBMS systems cannot be the only solution for big data analytics. At some point, tacking on non-relational data structures and non-relational processing algorithms to the basic, coherent RDBMS architecture will become unwieldy and inefficient. The Swiss Army knife analogy comes to mind. Another analogy closer to the topic is the programming language PL/1. Originally designed as an overarching, multipurpose, powerful programming language for all forms of data and all applications, it ultimately became a bloated and sprawling corpus that tried to do too many things in a single language. Since the heyday of PL/1 there has been a wonderful evolution of more narrowly focused programming languages with many new concepts and features that simply couldn't be tacked onto PL/1 after a certain point. Relational database management systems do so many things so well that there is no danger of suffering the same fate as PL/1. The big data analytics space is growing so rapidly and in such exciting and unexpected new directions that a lighter weight, more flexible and more agile processing framework in addition to RDBMS systems may be a reasonable alternative.
MapReduce/Hadoop systems
MapReduce is a processing framework originally developed by Google in the early 2000s for performing web page searches across thousands of physically separated machines. The MapReduce approach is extremely general. Complete MapReduce systems can be implemented in a variety of languages although the most significant implementation is in Java. MapReduce is really a UDF (user defined function) execution framework, where the "F" can be extraordinarily complex. Originally targeted to building Google's web page search index, a MapReduce job can be defined for virtually any data structure and any application. The target processors that actually perform the requested computation can be identical (a "cluster"), or can be a heterogeneous mix of processor types (a "grid"). The data in each processor upon which the ultimate computation is performed can be stored in a database, or more commonly in a file system, and can be in any digital format.
The most significant implementation of MapReduce is Apache Hadoop, known simply as Hadoop. Hadoop is an open source, top-level Apache project, with thousands of contributors and a whole industry of diverse applications. Hadoop runs natively on its own distributed file system (HDFS) and can also read and write to Amazon S3 and others. Conventional database vendors are also implementing interfaces to allow Hadoop jobs to be run over massively distributed instances of their databases.
AswewillseewhenwegiveabriefoverviewofhowaHadoopjobworks,bandwidth
betweentheseparateprocessorscanbeahugeissue.HDFSisaso-called"rackaware"filesystembecausethecentralnamenodeknowswhichnodesresideonthe
samerackandwhichareconnectedbymorethanonenetworkhop.Hadoopexploits
therelationshipbetweenthecentraljobdispatcherandHDFStosignificantlyoptimize
amassivelydistributedprocessingtaskbyhavingdetailedknowledgeofwheredata
actuallyresides.Thisalsoimpliesthatacriticalaspectofperformancecontrolisco-
locatingsegmentsofdataonactualphysicalhardwarerackssothattheMapReduce
communicationcanbeaccomplishedatbackplanespeedsratherthanslowernetwork
speeds.Notethatremotecloud-basedfilesystemssuchasAmazonS3and
CloudStore are, by their nature, unable to provide the rack aware benefit. Of course,
cloud-based file systems have a number of compelling advantages which we'll
discuss later.
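
The rack aware idea can be made concrete with a small sketch. The node names, the rack map, and the distance values (0 for the same node, 2 for the same rack, 4 across racks) are assumptions chosen for illustration, not Hadoop's actual interfaces. The sketch only shows why knowing where data resides lets a dispatcher prefer the cheapest copy.

```python
# Hypothetical topology: node name -> rack name (illustrative only).
RACK_OF = {"node1": "rackA", "node2": "rackA", "node3": "rackB"}


def network_distance(a, b):
    """Assumed cost model: 0 on the same node, 2 on the same rack, 4 across racks."""
    if a == b:
        return 0
    if RACK_OF[a] == RACK_OF[b]:
        return 2
    return 4


def pick_replica(task_node, replica_nodes):
    """Choose the replica that is cheapest to read from task_node."""
    return min(replica_nodes, key=lambda n: network_distance(task_node, n))


# node1 shares a rack with node2, so it is preferred over node3.
print(pick_replica("node2", ["node3", "node1"]))
```

A cloud object store hides this topology from the dispatcher, which is why it cannot offer the same benefit.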
How MapReduce works in Hadoop

A MapReduce job is submitted to a centralized JobTracker, which in turn schedules
parts of the job to a number of TaskTracker nodes. Although, in general, a
TaskTracker may fail and its task can be reassigned by the JobTracker, the
JobTracker is a single point of failure. If the JobTracker halts, the MapReduce job
must be restarted or be resumed from intermediate snapshots.

A MapReduce job is always divided into two distinct phases, map and reduce. The
overall input to a MapReduce job is divided into many equal-sized splits, each of
which is assigned a map task. The map function is then applied to each record in
each split. For large jobs, the JobTracker schedules these map tasks in parallel. The
overall performance of a MapReduce job depends significantly on achieving a
balance of enough parallel splits to keep many machines busy, but not so many
parallel splits that the interprocess communication of managing all the splits bogs
down the overall job. When MapReduce is run over the HDFS file system, a typical
default split size is 64 MB of input data.
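
To give a feel for the numbers, the following sketch computes the splits for an input of a given size. The function name and the byte-range representation are hypothetical; only the 64 MB split size comes from the text above.

```python
SPLIT_SIZE = 64 * 1024 * 1024  # 64 MB, the typical default cited above


def split_ranges(file_size, split_size=SPLIT_SIZE):
    """Return (offset, length) pairs covering a file of file_size bytes."""
    return [
        (offset, min(split_size, file_size - offset))
        for offset in range(0, file_size, split_size)
    ]


one_terabyte = 1024 ** 4
splits = split_ranges(one_terabyte)
print(len(splits))  # 16384 map tasks for 1 TB of input
```

Halving the split size doubles the number of map tasks, which is the trade-off between parallelism and coordination overhead described above.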
As the name suggests, the map task is the first half of the MapReduce job. Each map
task produces a set of intermediate result records which are written to the local disk of
the machine performing the map task. The second half of the MapReduce job, the
reduce task, may run on any processing node. The outputs of the mappers (nodes
running map tasks) are sorted and partitioned in such a way that these outputs can be
transferred to the reducers (nodes running the reduce task). The final outputs of the
reducers comprise the sorted and partitioned results set of the overall MapReduce
job. In MapReduce running over HDFS, the results set is written to HDFS and is
replicated for reliability.

In Figure 3, we show this task flow for a MapReduce job with three mapper nodes
feeding two reducer nodes, by reproducing figure 2.3 from Tom White's book,
Hadoop, The Definitive Guide, 2nd Edition, (O'Reilly, 2010).
Figure 3. An example MapReduce job

In Tom White's book, a simple MapReduce job is described which we extend
somewhat here. Suppose that the original data before the splits are applied consists
of a very large number (perhaps billions) of unsorted temperature measurements, one
per record. Such measurements could come from many thousands of automatic
sensors located around the United States. The splits are assigned to the separate
mapper nodes to equalize as much as possible the number of records going to each
node. The mapper inputs take the form of key-value pairs, in this case a
sequential record identifier and the full record containing the temperature
measurements as well as other data. The job of each mapper is simply to parse the
records presented to it and extract the year, the state, and the temperature, which
become the second set of key-value pairs passed from the mapper to the reducer.
The job of each reducer is to find the maximum reported temperature for each state
and each distinct year in the records passed to it. Each reducer is responsible for a
state, so in order to accomplish the transfer, the output of each mapper must be
sorted so that the key-value pairs can be dispatched to the appropriate reducers. In
this case there would be 50 reducers, one for each state. These sorted blocks are
then transferred to the reducers in a step which is a critical feature of the MapReduce
architecture, where it is called the "shuffle."
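
The temperature job can be simulated in a few lines of single-process Python. This is only a sketch of the data flow, not Hadoop code: the comma-separated record layout, the sample records, and the function names are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical record layout: "sensor_id,YYYY-MM-DD,state,temperature"
RECORDS = [
    "s1,2009-07-01,CA,101.5",
    "s2,2009-07-02,CA,98.0",
    "s3,2010-01-15,CA,71.2",
    "s4,2009-08-11,TX,104.0",
    "s5,2009-08-12,TX,99.9",
]


def mapper(record_id, line):
    """Parse one record and emit a ((state, year), temperature) pair."""
    _sensor, date, state, temp = line.split(",")
    yield (state, int(date[:4])), float(temp)


def shuffle(mapped_pairs):
    """Group values by key and route each key to the reducer for its state."""
    by_reducer = defaultdict(lambda: defaultdict(list))
    for (state, year), temp in mapped_pairs:
        by_reducer[state][(state, year)].append(temp)
    return by_reducer


def reducer(key, temps):
    """Emit the maximum temperature seen for one (state, year) key."""
    return key, max(temps)


def run_job(records):
    """Run map, shuffle, and reduce in sequence over the input records."""
    mapped = [pair for rid, line in enumerate(records) for pair in mapper(rid, line)]
    results = {}
    for _state, groups in shuffle(mapped).items():
        for key in sorted(groups):
            out_key, out_value = reducer(key, groups[key])
            results[out_key] = out_value
    return results


print(run_job(RECORDS))
```

In a real cluster the three functions run on different machines, and the `shuffle` step is where data physically moves across the network.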
Notice that the shuffle involves a true physical transfer of data between processing
nodes. This makes the value of the rack aware feature more obvious, since a lot of
data needs to be moved from the mappers to the reducers. The clever reader may
wonder if this data transfer could be reduced by having the mapper outputs combined
so that many readings from a single state and year are given to the reducer as a
single key-value pair rather than many. The answer is yes, and Hadoop provides a
combiner function to accomplish exactly this end.
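
A sketch of what a combiner does for the maximum-temperature job follows. The function and sample data are hypothetical, and the code runs in one process rather than on a mapper node. Pre-aggregating is safe here because taking a maximum is associative and commutative, so the maximum of local maxima equals the overall maximum.

```python
def combine(mapper_output):
    """Collapse one mapper's output to a single maximum per (state, year) key."""
    local_max = {}
    for key, temp in mapper_output:
        if key not in local_max or temp > local_max[key]:
            local_max[key] = temp
    return sorted(local_max.items())


mapper_output = [
    (("CA", 2009), 101.5),
    (("CA", 2009), 98.0),
    (("CA", 2009), 99.1),
    (("TX", 2009), 104.0),
]
combined = combine(mapper_output)
print(len(mapper_output), "->", len(combined))  # 4 -> 2 pairs to shuffle
```

The same trick would not work unchanged for an operation such as a mean, where the combiner would have to forward sums and counts instead of a finished value.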
Each reducer receives a large number of state/year-temperature key-value pairs, and
finds the maximum temperature for a given year. These maximum temperatures for
each year are the final output from each reducer.

This approach can be scaled more or less indefinitely. Really serious MapReduce
jobs running on HDFS may have hundreds or thousands of mappers and reducers,
processing petabytes of input data.

At this point the appeal of the MapReduce/Hadoop approach should be clear. There
are virtually no restrictions on the form of the inputs to the overall job. There only
needs to be some rational basis for creating splits and reading records, in this case
the record identifier in Tom White's example. Actual logic in the mappers and the
reducers can be programmed in virtually any programming language and can be as
simple as the above example, or much more complicated UDFs. The reader should
be able to visualize how some of the more complex use cases (e.g., comparison of
satellite images) described earlier in the paper could fit into this framework.
Tools for the Hadoop environment

What we have described thus far is the core processing component when
MapReduce is run in the Hadoop environment. This is roughly equivalent to
describing the inner processing loop in a relational database management system. In
both cases there's a lot more to these systems to implement a complete functioning
environment. The following is a brief overview of typical tools used in a
MapReduce/Hadoop environment. We group these tools by overall function. Tom White's
book, mentioned above, is an excellent starting point for understanding how these
tools are used.

Getting data in and getting data out

ETL platforms -- ETL platforms, with their long history of importing and
exporting data to relational databases, provide specific interfaces for moving
data into and out of HDFS. The platform-based approach, as contrasted with
hand coding, provides extensive support for metadata, data quality,
documentation, and a visual style of system building.

Sqoop -- Sqoop, developed by Cloudera, is an open source tool that allows
importing data from a relational source to HDFS and exporting data from
HDFS to a relational target. Data imported by Sqoop into HDFS can be used
both by MapReduce applications and HBase applications. HBase is described
below.

Scribe -- Scribe, developed at Facebook and released as open source, is used
to aggregate log data from a large number of Web servers.

Flume -- Flume, developed by Cloudera, is a distributed reliable streaming data
collection service. It uses a central configuration managed by ZooKeeper and
supports tunable reliability and automatic failover and recovery.
Programming

Low-level MapReduce programming -- primary code for mappers and
reducers can be written in a number of languages. Hadoop's native language
is Java but Hadoop exposes APIs for writing code in other languages such as
Ruby and Python. An interface to C++ is provided, which is named Hadoop
Pipes. Programming MapReduce at the lowest level obviously provides the
most potential power, but this level of programming is very much like
assembly language programming. It can be very laborious, especially when
attempting to do conceptually simple tasks like joining two data sets.

High-level MapReduce programming -- Apache Pig, or simply Pig, is a
client-side open-source application providing a high-level programming language for
processing large data sets in MapReduce. The programming language itself is
called Pig Latin. Hive is an alternative application designed to look much more
like SQL, and is used for data warehousing use cases. When employed for the
appropriate use cases, Pig and Hive provide enormous programming
productivity benefits over low-level MapReduce programming, often by a
factor of 10 or more. Pig and Hive lift the application developer's perspective
up from managing the detailed mapper and reducer processes to more of an
applications focus.

Integrated development environment -- MapReduce/Hadoop development
needs to move decisively away from bare hand coding to be adopted by
mainstream IT shops. An integrated development environment for
MapReduce/Hadoop needs to include editors for source code, compilers, tools
for automating system builds, debuggers, and a version control system.

Integrated application environment -- an even higher layer above an integrated
development environment could be called an integrated application
environment, where complex reusable analytic routines are assembled into
complete applications via a graphical user interface. This kind of environment
might be able to use open source algorithms such as those provided by the
Apache Mahout project, which distributes machine learning algorithms on the
Hadoop platform.

Cascading -- Cascading is another tool that is an abstraction layer for writing
complex MapReduce applications. It is best described as a thin Java library
typically invoked from the command line to be used as a query API and process
scheduler. It is not intended to be a comprehensive alternative to Pig or Hive.

HBase -- HBase is an open-source, nonrelational, column-oriented database
that runs directly on Hadoop. It is not a MapReduce implementation. A
principal differentiator of HBase from Pig or Hive (MapReduce
implementations) is the ability to provide real-time read and write
random access to very large data sets.

Oozie -- Oozie is a server-based workflow engine specialized in running
workflow jobs with actions that execute Hadoop jobs, such as MapReduce,
Pig, Hive, Sqoop, HDFS operations, and sub-workflows.
ZooKeeper -- ZooKeeper is a centralized configuration manager for distributed
applications. ZooKeeper can be used independently of Hadoop as well.
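
To illustrate why low-level MapReduce programming is described above as laborious even for a simple join, here is a single-process sketch of a reduce-side join. The data sets, the "C" and "O" source tags, and the function names are all hypothetical. A high-level, SQL-like language would express the same result as a single join clause.

```python
from collections import defaultdict

customers = [(1, "Ann"), (2, "Bob")]         # (customer_id, name)
orders = [(1, 30.0), (1, 12.5), (2, 8.0)]    # (customer_id, amount)


def map_tagged(customers, orders):
    """Map phase: tag every record with its source so the reducer can tell them apart."""
    for cid, name in customers:
        yield cid, ("C", name)
    for cid, amount in orders:
        yield cid, ("O", amount)


def reduce_join(cid, tagged_values):
    """Reduce phase: pair each order with the customer name for one key."""
    names = [value for tag, value in tagged_values if tag == "C"]
    amounts = [value for tag, value in tagged_values if tag == "O"]
    return [(cid, name, amount) for name in names for amount in amounts]


# Grouping by key stands in for the shuffle step.
groups = defaultdict(list)
for key, value in map_tagged(customers, orders):
    groups[key].append(value)

joined = [row for cid in sorted(groups) for row in reduce_join(cid, groups[cid])]
print(joined)
```

The programmer has to invent the tagging scheme, handle the grouping, and write the pairing logic by hand, which is the bookkeeping that Pig and Hive take over.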
Administering

Embedded Hadoop admin features -- Hadoop supports a comprehensive
runtime environment including edit log, safe mode operation, audit logging,
file system check, data node block verifier, data node block distribution
balancer, performance monitor, comprehensive log files, metrics for
administrators, counters for MapReduce users, metadata backup, data
backup, file system balancer, and commissioning and decommissioning of nodes.

Java management extensions -- a standard Java API for monitoring and
managing applications

GangliaContext -- an open source distributed monitoring system for very large
clusters
Feature convergence in the coming decade

It is safe to say that relational database management systems and
MapReduce/Hadoop systems will increasingly find ways to coexist gracefully in the coming
decade. But the systems have distinct characteristics, as depicted in the following
table:

Relational DBMSs                                   MapReduce/Hadoop
-------------------------------------------------  ----------------------------------------------------
Proprietary, mostly                                Open source
Expensive                                          Less expensive
Data requires structuring                          Data does not require structuring
Great for speedy indexed lookups                   Great for massive full data scans
Deep support for relational semantics              Indirect support for relational semantics, e.g. Hive
Indirect support for complex data structures       Deep support for complex data structures
Indirect support for iteration, complex branching  Deep support for iteration, complex branching
Deep support for transaction processing            Little or no support for transaction processing

In the upcoming decade RDBMSs will extend their support for hosting complex data
types as "blobs," and will extend APIs for arbitrary analytic routines to operate on the
contents of records. MapReduce/Hadoop systems, especially Hive, will deepen their
support for SQL interfaces and fuller support of the complete SQL language. But
neither will take over the market for big data analytics exclusively. As remarked
earlier, RDBMSs cannot provide "relational" semantics for many of the complex use
cases required by big data analytics. At best, RDBMSs will provide relational structure
surrounding the complex payloads.
Similarly, MapReduce/Hadoop systems will never take over ACID-compliant
transaction processing, or become superior to RDBMSs for indexed queries on row
and column oriented tables.

As this paper is being written, significant advances are being made in developing
hybrid systems using both relational database technology and MapReduce/Hadoop
technology. Figure 4 illustrates two primary alternatives. The first alternative delivers
the data directly into a MapReduce/Hadoop configuration for primary non-relational
analysis. As we have described, this analysis can range the full gamut from complex
analytical routines to simple sorting that looks like a conventional ETL step. When the
MapReduce/Hadoop step is complete, the results are loaded into an RDBMS for
conventional structured querying with SQL.

The second alternative configuration loads the data directly to an RDBMS, even when
the primary data payloads are not conventional scalar measurements. At that point
two analysis modes are possible. The data can be analyzed with specially crafted
user-defined functions, effectively from the BI layer, or passed to a downstream
MapReduce/Hadoop application.

In the future even more complex combinations will tie these architectures more
closely together, including MapReduce systems whose mappers and reducers are
actually relational databases, and relational database systems whose underlying
storage consists of HDFS files.

Figure 4. Alternative hybrid architectures using both RDBMS and Hadoop.

It will probably be difficult for IT organizations to sort out the vendors' claims, which will
almost certainly be that their systems do everything. In some cases these claims
are "objection removers," which means that they are claims that have a grain of truth
to them, and are made to make you feel good, but do not stand up to scrutiny in a
competitive and practical environment. Buyer beware!

Reusable analytics

Up to this point we have begged the question of where all the special analytic
software comes from. Big data analytics will never prosper if every instance is a
custom coded solution. Both the RDBMS and the open-source communities
recognize this and two main development themes have emerged. High-end statistical
analysis vendors, such as SAS, have developed extensive and proprietary reusable
libraries for a wide range of analytic applications, including advanced statistics, data
mining, predictive analytics, feature detection, linear models, discriminant analysis,
and many others. The open source community has a number of initiatives, the most
notable of which are Hadoop-ML and Apache Mahout. Quoting from Hadoop-ML's
website:

    Hadoop-ML (is) an infrastructure to facilitate the implementation of parallel
    machine learning/data mining (ML/DM) algorithms on Hadoop. Hadoop-ML
    has been designed to allow for the specification of both task-parallel and
    data-parallel ML/DM algorithms. Furthermore, it supports the composition of
    parallel ML/DM algorithms using both serial as well as parallel building blocks
    -- this allows one to write reusable parallel code. The proposed abstraction
    eases the implementation process by requiring the user to only specify
    computations and their dependencies, without worrying about scheduling,
    data management, and communication. As a consequence, the codes are
    portable in that the user never needs to write Hadoop-specific code. This
    potentially allows one to leverage future parallelization platforms without
    rewriting one's code.

Apache Mahout provides free implementations of machine learning algorithms on
the Hadoop platform.
Complex event processing (CEP)

Complex event processing (CEP) consists of processing events happening inside and
outside an organization to identify meaningful patterns in order to take subsequent
action in real time. For example, CEP is used in utility networks (electrical, gas and
water) to identify possible issues before they become detrimental. These CEP
deployments allow for real-time intervention for critical network or infrastructure
situations. The combination of deep DW analytics and CEP can be applied in retail
customer settings to analyze behavior and identify situations where a company may
lose a customer or be able to sell them additional products or services at the time of
their direct engagement. In banking, sophisticated analytics might help to identify the
10 most common patterns of fraud and CEP can then be used to watch for those
patterns so they may be thwarted before a loss.
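
As a rough sketch of the "watch for a known pattern" half of this division of labor, the code below flags an account that produces more than a set number of events inside a sliding time window. The pattern itself, the thresholds, and the event layout are invented for illustration. A real CEP engine would express such rules in its own language.

```python
from collections import defaultdict, deque


def watch(events, max_count=3, window_seconds=60):
    """Yield (account, timestamp) whenever an account exceeds max_count
    events inside a sliding window of window_seconds."""
    recent = defaultdict(deque)
    for timestamp, account in events:
        timestamps = recent[account]
        timestamps.append(timestamp)
        # Drop events that have fallen out of the window.
        while timestamps and timestamp - timestamps[0] > window_seconds:
            timestamps.popleft()
        if len(timestamps) > max_count:
            yield account, timestamp


# (timestamp in seconds, account id), in arrival order
events = [(0, "A"), (10, "A"), (20, "B"), (30, "A"), (40, "A"), (200, "A")]
alerts = list(watch(events))
print(alerts)  # [('A', 40)]
```

Because `watch` is a generator, each alert is available as soon as the offending event arrives, which is what allows intervention before a loss rather than after.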
At the time of this white paper, CEP is not generally thought of as part of the EDW,
but this author believes that technical advances in continuous query processing will
cause CEP and EDW to share data and work more closely together in the coming
decade.

Data warehouse cultural changes in the coming decade

The enterprise data warehouse must absolutely stay relevant to the business. As the
value and the visibility of big data analytics grow, the data warehouse must
encompass the new culture, skills, techniques, and systems required for big data
analytics.

Sandboxes

For example, big data analysis encourages exploratory sandboxes for
experimentation. These sandboxes are copies or segments of the massive data sets
being sourced by the organization. Individual analysts or very small groups are
encouraged to analyze the data with a very wide variety of tools, ranging from serious
statistical tools like SAS, Matlab or R, to predictive models, and many forms of ad hoc
querying and visualization through advanced BI graphical interfaces. The analyst
responsible for a given sandbox is allowed to do anything with the data, using any tool
they want, even if the tools they use are not corporate standards. The sandbox
phenomenon has enormous energy but it carries a significant risk to the IT
organization and EDW architecture because it could create isolated and incompatible
stovepipes of data. This point is amplified in the section on organizational changes,
below.

Exploratory sandboxes usually have a limited time duration, lasting weeks or at most
a few months. Their data can be a frozen snapshot, or a window on a certain segment
of incoming data. The analyst may have permission to run an experiment changing a
feature on the product or service in the marketplace, and then performing A/B testing
to see how the change affects customer behavior. Typically, if such an experiment
produces a successful result, the sandbox experiment is terminated, and the feature
goes into production. At that point, tracking applications that may have been
implemented in the sandbox using a quick and dirty prototyping language are usually
reimplemented by other personnel in the EDW environment using corporate standard
tools. In several of the e-commerce enterprises interviewed for this white paper,
analytic sandboxes were extremely important, and in some cases hundreds of the
sandbox experiments were ongoing simultaneously. As one interviewee commented,
"newly discovered patterns have the most disruptive potential, and insights from them
lead to the highest returns on investment."

Architecturally, sandboxes should not be brute force copies of entire data sets, or
even major segments of these data sets. In dimensional modeling parlance, the
analyst needs much more than just a fact table to run the experiment. At a minimum
the analyst also needs one or more very large dimension tables, and possibly
additional fact tables for complete "drill across" analysis. If 100 analysts are creating
brute force copy versions of the data for the sandboxes there will be enormous
wasting of disk space and resources for all the redundant copies. Remember that the
largest dimension tables, such as customer dimensions, can have 500 million rows!
The recommended architecture for a serious sandbox environment is to build each
sandbox using conformed (shared) dimensions which are incorporated into each
sandbox as relational views, or their equivalent under Hadoop applications.
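
The view-based approach can be sketched with Python's built-in sqlite3 module. The table names, columns, and rows are hypothetical, and a production warehouse would use its own RDBMS. The point is only that the sandbox joins its private facts to the shared dimension through a view rather than through a copy.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Conformed dimension, maintained once for the whole warehouse.
    CREATE TABLE customer_dim (customer_key INTEGER PRIMARY KEY, region TEXT);
    INSERT INTO customer_dim VALUES (1, 'West'), (2, 'East'), (3, 'West');

    -- Sandbox-private fact table holding one analyst's experiment.
    CREATE TABLE sandbox_clicks (customer_key INTEGER, clicks INTEGER);
    INSERT INTO sandbox_clicks VALUES (1, 5), (3, 7), (2, 1);

    -- The sandbox sees the dimension through a view, not a copy.
    CREATE VIEW sandbox_customer AS
        SELECT customer_key, region FROM customer_dim;
""")

rows = conn.execute("""
    SELECT d.region, SUM(f.clicks)
    FROM sandbox_clicks f JOIN sandbox_customer d USING (customer_key)
    GROUP BY d.region ORDER BY d.region
""").fetchall()
print(rows)  # [('East', 1), ('West', 12)]
```

With 100 sandboxes, this design stores the 500 million row dimension once instead of 100 times, and every sandbox sees corrections to the dimension as soon as they are made.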
Low latency

An elementary mistake when gathering business requirements during the design of a
data warehouse is to ask the business user if they want "real time" data. Users are
likely to say "of course!" Although perhaps this answer has been somewhat gratuitous
in the past, a good business case can now be made in many situations that more
frequent updates of data delivered to the business with lower and lower latencies are
justified. Both RDBMSs and MapReduce/Hadoop systems struggle with loading
gigantic amounts of data and making that data available within seconds of that data
being created. But the marketplace wants this, and regardless of a technologist's
doubt about the requirement, the requirement is real and over the next decade it must
be addressed.

An interesting angle on low latency data is the desire to begin serious analysis on the
data as it is streaming in, but possibly far before the data collection process even
terminates. There is significant interest in streaming analysis systems which allow
SQL-like queries to process the data as it flows into the system. In some use cases
when the results of a streaming query surpass a threshold, the analysis can be halted
without running the job to the bitter end. An academic effort, known as continuous
query language (CQL), has made impressive progress in defining the requirements
for streaming data processing including clever semantics for dynamically moving time
windows on the streaming data. Look for CQL language extensions and streaming
data query capabilities in the load programs for both RDBMSs and HDFS deployed
data sets. An ideal implementation would allow streaming data analysis to take place
while the data is being loaded at gigabytes per second.
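
The combination of a moving time window and early termination can be sketched in plain Python. This is not CQL, and the function name, readings, and threshold are assumptions for illustration. The sketch maintains an average over the most recent window of readings and stops consuming the stream as soon as that average exceeds the threshold.

```python
from collections import deque


def stream_until_threshold(readings, window_seconds, threshold):
    """Consume (timestamp, value) pairs in arrival order; return the first
    timestamp at which the moving-window average exceeds threshold."""
    window = deque()
    total = 0.0
    for timestamp, value in readings:
        window.append((timestamp, value))
        total += value
        # Evict readings that are window_seconds old or older.
        while window[0][0] <= timestamp - window_seconds:
            _, old_value = window.popleft()
            total -= old_value
        if total / len(window) > threshold:
            return timestamp  # halt without reading the rest of the stream
    return None


readings = [(0, 10), (1, 12), (2, 50), (3, 60), (4, 70), (5, 5)]
print(stream_until_threshold(readings, window_seconds=2, threshold=50))  # 3
```

The readings at timestamps 4 and 5 are never examined, which mirrors the case described above where the job does not need to run to the bitter end.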
The availability of extremely frequent and extremely detailed event measurements
can drive interactive intervention. The use cases where this intervention is important
span many situations ranging from online gaming to product offer suggestions to
financial account fraud responses to the stability of networks.

Continuous thirst for more exquisite detail

Analysts are forever thirsting for more detail in every marketplace observation,
especially of customer behavior. For example every webpage event (a page being
painted on a user's screen) spawns hundreds of records describing every object on
the page. In online games, where every gesture enters the data stream, as many as
100 descriptors are attached to each of these gesture micro-events. For instance, in a
hypothetical online baseball game, when the batter swings at a pitch, everything
describing the position of the players, the score, runners on the bases, and even the
characteristics of the pitch, is stored with that individual record. In both of these
examples, the complete context must be captured within the current record, because
it is impractical to compute this detailed context after the fact from separate data
sources. The lesson for the coming decade is that this thirst for exquisite detail will
only grow. It is possible to imagine thousands of attributes being attached to some
micro-events, and the categories and names of these attributes will grow in
unpredictable ways. This makes the data bag approach discussed earlier in the paper
much more important. It means that positionally dependent schemas, with the keys
(names of the data) pre-declared as column names, are an unworkable design.
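
A minimal sketch of a data bag record follows, with all field names invented for illustration. A small set of fixed dimensional keys surrounds an open-ended set of key-value pairs, so new attributes can appear without any change to a column list.

```python
event = {
    # Fixed, well-structured dimensional context (hypothetical names).
    "event_time": "2011-04-01T12:00:00",
    "customer_key": 1234,
    "page_key": 77,
    # The data bag: an open-ended set of key-value pairs.
    "bag": {"pitch_type": "curve", "runner_on_first": True, "score_home": 3},
}


def bag_value(record, name, default=None):
    """Look up an attribute by name; missing attributes are not an error."""
    return record["bag"].get(name, default)


print(bag_value(event, "pitch_type"))     # curve
print(bag_value(event, "wind_speed", 0))  # 0
```

Attributes are addressed by name rather than by position, so a record written before `wind_speed` existed can still be read by a query that asks for it.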
Finally, a perfect historical reconstruction of interesting events such as web page
exposures needs to be more than just a list of attributes on the web page when it was
displayed, even if that list is enormously detailed. A perfect historical reconstruction of
the webpage needs to be seen through a multimedia user interface, i.e., a browser.

Light touch data waits for its relevance to be exposed

Light touch data is an aspect of the exquisite detail data described in the previous
section. For example, if a customer browses a website extensively before making a
purchase, a great deal of micro-context is stored in all the webpage events prior to the
purchase. When the purchase is made, some of that micro-context suddenly
becomes much more important, and is elevated from "light touch data" to real data. At
that point the sequence of exposures to the selected product or to competitive
products in the same space can be sessionized. These micro-events
are pretty much meaningless before the purchase event, because there are so many
conceivable and irrelevant threads that would be dead ends for analysis. This
requires oceans of light touch data to be stored, waiting for the relevance of selected
threads of these micro-events to eventually be exposed. Conventional seasonality
thinking suggests that at least five quarters (15 months) of this light touch data needs
to be kept online. This is one instance of a remark made consistently during
interviews for this white paper that analysts want "longer tails," which means that they
want more significant histories than they currently get.

Simple analysis of all the data trumps sophisticated analysis of some of the data

Although data sampling has never been a popular technique in data warehousing,
surprisingly the arrival of enormous petabyte-sized data sets has not increased the
interest in analyzing a subset of the data. On the contrary, a number of analysts point
out that monetizable insights can be derived from very small populations that could be
missed by only sampling some of the data. Of course this is a somewhat controversial
point, since the same analysts admit that if you have 1 trillion behavior observation
records, you may be able to find any behavior pattern if you look hard enough.

Another somewhat controversial point raised by some analysts is their concern that
any form of data cleaning on the incoming data could erase interesting low-frequency
"edge cases." Ultimately both the cases of misleading rare behavior patterns, and
misleading corrupted data need to be gently filtered out of the data.

Assuming that the behavior insights from very small populations are valid, there is
widespread recognition that micro-marketing to the small populations is possible, and
doing enough of this can build a sustainable strategic advantage.
A final argument in favor of analyzing complete data sets is that these "relation scans"
do not require indexes or aggregations to be computed in advance of the analysis.
This approach fits well with the basic MapReduce distributed analysis architecture.

Data structures should be declared at query time, not at data load time

A number of analysts interviewed for this white paper said that the enormous data
sets they were trying to analyze needed to be loaded in a queryable state before the
structure and content of the data sets were completely understood. Again, thinking of
the data bag kind of marketplace observation where within a well-structured
dimensional measurement process the actual observation is a disorderly and
potentially unpredictable set of key-value pairs, the structure of this data bag may
need to be discovered, and alternate interpretations of the structures may need to be
possible without reloading the database. One respondent remarked that "yesterday's
fringe data is tomorrow's well-structured data," implying that we need exceptional
flexibility as we explore new kinds of data sources.

A key differentiator between the RDBMS approach and the MapReduce/Hadoop
approach is the deferral of the data structure declaration until query time in the
MapReduce/Hadoop systems. An objection from the RDBMS community is that forcing
every MapReduce job to declare the target data structure promotes a kind of chaos
because every analyst can do their own thing. But that objection seems to miss the
point that a standard data structure declaration can easily be published as a library
module that can be picked up by every analyst when they are implementing their
application.
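
A small sketch shows how a shared structure declaration might work in practice. The JSON record format, the field names, and the functions are assumptions for illustration. The raw lines are stored exactly as they arrived, and a single published parsing function imposes the structure each time a query scans them.

```python
import json


# Shared "library module": one published interpretation of the raw records.
def parse_observation(raw_line):
    """Declare the structure at read time: raw text in, typed fields out."""
    fields = json.loads(raw_line)
    return {
        "customer": str(fields["cust"]),
        "amount": float(fields.get("amt", 0.0)),
    }


# Raw data is stored as loaded, with no schema imposed up front.
RAW = ['{"cust": 7, "amt": "19.99", "extra": "ignored"}', '{"cust": 8}']


def total_amount(raw_lines):
    """A query that applies the shared structure as it scans the raw lines."""
    return sum(parse_observation(line)["amount"] for line in raw_lines)


print(total_amount(RAW))  # 19.99
```

If the interpretation changes, for example when the `extra` field turns out to matter, only `parse_observation` is revised and the stored data does not need to be reloaded.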
The EDW supporting big data analytics must be magnetic, agile, and deep

Cohen and Dolan in their seminal but somewhat controversial paper on big data
analytics argue that EDWs must shed some old orthodoxies in order to be magnetic,
agile, and deep. A magnetic environment places the fewest impediments on the
incorporation of new, unexpected, and potentially dirty data sources. Specifically, this
supports the need to defer declaration of data structures until after the data is loaded.
According to Cohen and Dolan, an agile environment eschews long-range careful
design and planning! And a deep environment allows running sophisticated analytic
algorithms on massive data sets without sampling, or perhaps even cleaning. We
have made these points elsewhere in this white paper but Cohen and Dolan's paper
is a particularly potent, if unusual, argument. Read this paper to get some provocative
perspectives! A link to Cohen and Dolan's paper is provided in the references section
at the end of this white paper.

The conflict between abstraction and control

In the MapReduce/Hadoop world, Pig and Hive are widely regarded as valuable
abstractions that allow the programmer to focus on database semantics rather than
programming directly in Java. But several analysts interviewed for this paper
remarked that too much abstraction and too much distancing from where the data
actually is stored can be disastrously inefficient. This seems like a reasonable
concern when dealing with the very largest data sets, where a bad algorithm could
result in run times measured in days. For the breaking wave of the biggest data sets,
programming tools will need to allow considerable control over the storage strategy,
and the processing approaches, but without requiring programming using the lowest
level code.

Data warehouse organization changes in the coming decade

The growing importance of big data analytics amounts to something between a
midcourse correction and a revolution for enterprise data warehousing. New skill sets,
new organizations, new development paradigms, and new technology will need to be
absorbed by many enterprises, especially those facing the use cases described in this
paper. Not every enterprise needs to jump into the petabyte ocean, but it is this
author's prediction that the upcoming decade will see a steady growth in the
percentage of large enterprises recognizing the value of big data analytics.

Most observers would agree that big data analytics falls within "information
management," but the same observers may quibble about whether this affects the
"data warehouse." Rather than worrying about whether the box on the organization
chart labeled EDW has responsibility for big data analytics, we take the perspective
that enterprise data warehousing without the capital letters absolutely encompasses
big data analytics. Having said that, there will be many different organizational
structures and management perspectives as industries expand their information
management. This kind of tinkering and adjusting to the new paradigm is normal and
expected. We went through a very similar phase in the mid-1980s when data
warehousing itself was a new paradigm for IT and the business. Many of the most
successful early data warehousing initiatives started in the business organizations
and were eventually incorporated into those IT organizations that then made major
commitments to being business relevant. It is likely the same evolution will take place
with big data analytics.

The challenge before information managers in large enterprises is how to encourage
three separate data warehouse endeavors: conventional RDBMS applications,
MapReduce/Hadoop applications, and advanced analytics.

Technical skill sets required

It is worth repeating here the message of the very first sentence of this white paper.
Petabyte scale data sets are of course a big challenge but big data analysis is often
about difficulties other than data volume. You can have fast arriving data or complex
data or complex analyses which are very challenging even if all you have are
terabytes of data!

The care and feeding of RDBMS-oriented data warehouses involves a
comprehensive set of skills that is pretty well understood: SQL programming, ETL
platform expertise, database modeling, task scheduling, system building and
maintenance skills, one or more scripting languages such as Python or Perl, UNIX or
Windows operating system skills, and business intelligence tools skills. SQL
programming,whichisatthecoreofanRDBMSimplementation,isadeclarative
language,whichcontrastswiththemindsetoftheprocedurallanguageskillsneeded
forMapReduce/Hadoopprogramming,atleastinJava.Thedatawarehouseteam
alsoneedstohaveagoodpartnershipwithinotherareasofITincludingstorage
management,security,networking,andsupportofmobiledevices.Finally,gooddata
warehousingalsorequiresanextensiveinvolvementwiththebusinesscommunity,
andwiththecognitivepsychologyofend-users!
ThecareandfeedingofMapReduce/Hadoopdatawarehouses,includinganyofthe
bigdataanalyticsusecasesdescribedinthispaper,involvesasetofskillsthatonly
partiallyoverlaptraditionalRDBMSdatawarehouseskills.Thereinliesasignificant
challenge.Thesenewskillsincludelower-levelprogramminglanguagessuchas
Java,C++,Ruby,Python,andMapReduceinterfacesmostcommonlyavailablevia
Java.Althoughtherequirementtoprogramviaproceduralbasedlower-level
programminglanguageswillbereducedsignificantlyduringtheupcomingdecadein
favorofPig,Hive,andHBase,itmaybeeasiertorecruitMapReduce/Hadoop
applicationdevelopersfromtheprogrammingcommunityratherthanthedata
warehousecommunity,ifthedatawarehousejobapplicantslackprogrammingandUNIXskills.IfMapReduce/Hadoopdatawarehousesaremanagedexclusivelywith
opensourcetools,thenZookeeperandOozieskillswillbeneededtoo.Keepinmind
thattheopen-sourcecommunityinnovatesquickly.Hive,PigandHBasearenotthe
lastwordinhigh-levelinterfacestoHadoopforanalysis.Itislikelythatwewillsee
muchmoreinnovationinthisdecadeincludingentirelynewinterfaces.
ETLplatformprovidershaveabigopportunitytoprovidemuchofthegluethatwilltie
togetherthebigdatasources,MapReduce/Hadoopapplications,andexisting
relationaldatabases.DeveloperswithETLplatformskillswillbeabletoleveragea
greatdealoftheirexperienceandinstinctsinsystembuildingwhentheyincorporate
MapReduce/Hadoopapplications.
Finally, the analysts whom we have described as often working in sandbox
environments will arrive with an eclectic and unpredictable set of skills starting with
deep analytic expertise. For these people it is probably more important to be
conversant in SAS, Matlab, or R than to have specific programming language or
operating system skills. Such individuals typically will arrive with UNIX skills and
some reasonable programming proficiency, and most of these people are extremely
tolerant of learning new complex technical environments. Perhaps the biggest
challenge with traditional analysts is getting them to rely on the other resources
available to them within IT, rather than building their own extract and data delivery
pipelines. This is a tricky balance because you want to give the analysts unusual
freedom, but you need to look over their shoulders to make sure that they are not
wasting their time.
New organizations required
At this early stage of the big data analytics revolution, there is no question that the
analysts must be part of the business organization, both to understand the
microscopic workings of the business and to be able to conduct the kind of rapid
turnaround experiments and investigations we have described in this paper. As we
have described, these analysts must be heavily supported in a technical sense, with
potentially massive compute power and data transfer bandwidth. So although the
analysts may reside in the business organizations, this is a great opportunity for IT to
gain credibility and presence with the business. It would be a significant mistake and
a lost opportunity for the analysts and their sandboxes to exist as rogue technical
outposts in the business world without recognizing and taking advantage of their deep
dependence on the traditional IT world.
In some organizations we interviewed for this white paper, we saw separate analytic
groups embedded within different business organizations, but without very much
cross-communication or common identity established among the analytic groups. In
some noteworthy cases, this lack of an "analytic community" led to lost opportunities
to leverage each other's work: multiple groups reinvented the same
approaches and duplicated programming efforts and infrastructure demands as they
made separate copies of the same data.
We recommend that a cross-divisional analytics community be established mimicking
some of the successful data warehouse community building efforts we have seen in
the past decade. Such a community should have regular cross-divisional meetings, as
well as a kind of private LinkedIn application to promote awareness of all the contacts
and perspectives and resources that these individuals collect in their own
investigations, and a private web portal where information and news events are
shared. Periodic talks can be given, hopefully inviting members of the business
community as well, and above all the analytics community needs T-shirts and mugs!
New development paradigms required
Even before the arrival of big data analytics, data warehousing has been transforming
itself to provide more rapid response to new opportunities and to be more in touch
with the business community. Some of the practices of the agile software
development movement have been successfully adopted by the data warehouse
community, although realistically this has not been a highly visible transformation.
But, in particular, the agile development approach supports the data warehouse by
being organized around small teams driven by the business, not typically by IT. An
agile development effort also produces frequent tangible deliveries, de-emphasizes
documentation and formal development methodologies, and tolerates midcourse
correction and the incremental acceptance of new requirements. The most sensitive
ingredient for success of agile development projects is the personality and skills of the
business leader who ultimately is in charge. The agile business leader needs to be a
thoughtful and sophisticated observer of the development process and the realities of
the information world. Hopefully the agile business leader is a pretty good manager as
well.
Big data analytics certainly opens the door to business involvement since the central
analysis is probably done in the business environment directly. But it is
unlikely that the professional analyst is the right person to be the overall agile data
warehouse project leader. The agile project leader needs to be well skilled in
facilitating short effective meetings, resolving issues and development choices,
determining the truth of progress reports from individual developers, communicating
with the rest of the organization, and getting funding for initiatives.
Traditional data warehouse development has discovered the attractiveness of
building incrementally from a modest start, but with a good architectural foundation
that provides a blueprint for where future development will go. This author has
described in many papers the techniques for "graceful modification" of dimensional
data warehouse schemas. In a dimensionally modeled data warehouse, new
measurement facts, new dimensional attributes, and even new dimensions can be
added to existing data warehouse applications without changing, invalidating, or
rolling over existing information delivery pipelines to the end users. Many of the use
cases we have described in this paper for big data analytics suggest that new facts,
new attributes, and new dimensions will routinely become available.
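
The idea of graceful modification can be illustrated with a minimal sketch. The tables, column names, and values below (product_dim, sales_fact, sentiment_segment, page_views) are hypothetical and chosen only for illustration, and SQLite stands in for whatever RDBMS hosts the warehouse. The sketch adds a new dimensional attribute and a new fact to an existing star schema and checks that a report query written before the change still returns the same answer.

```python
import sqlite3

# Hypothetical star schema: one dimension table and one fact table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, product_name TEXT);
    CREATE TABLE sales_fact (product_key INTEGER, sales_amount REAL);
    INSERT INTO product_dim VALUES (1, 'Widget'), (2, 'Gadget');
    INSERT INTO sales_fact VALUES (1, 100.0), (2, 250.0), (1, 50.0);
""")

# A report query that existed before any schema change.
EXISTING_REPORT = """
    SELECT d.product_name, SUM(f.sales_amount)
    FROM sales_fact f
    JOIN product_dim d ON d.product_key = f.product_key
    GROUP BY d.product_name
    ORDER BY d.product_name
"""
before = conn.execute(EXISTING_REPORT).fetchall()

# Graceful modifications: a new dimensional attribute and a new fact.
# Existing rows receive a default value, so nothing already loaded is invalidated.
conn.execute(
    "ALTER TABLE product_dim ADD COLUMN sentiment_segment TEXT DEFAULT 'Unknown'"
)
conn.execute("ALTER TABLE sales_fact ADD COLUMN page_views INTEGER DEFAULT 0")
conn.execute(
    "UPDATE product_dim SET sentiment_segment = 'Positive' WHERE product_key = 1"
)

# The old report is unaffected by the additions.
after = conn.execute(EXISTING_REPORT).fetchall()

# A new report can use the new attribute immediately.
new_report = conn.execute("""
    SELECT d.sentiment_segment, SUM(f.sales_amount)
    FROM sales_fact f
    JOIN product_dim d ON d.product_key = f.product_key
    GROUP BY d.sentiment_segment
    ORDER BY d.sentiment_segment
""").fetchall()
```

Because the existing query names only the columns it needs, the added columns are invisible to it. This is one plausible reading of "without changing, invalidating, or rolling over" existing pipelines, not a complete account of the technique.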
Integration of new data sources into a data warehouse has always been a
significant challenge, since often these new data sources arrive without any thought
to integration with existing data sources. This will certainly be the case with big data
analytics. Again for dimensionally modeled data warehouses, this author has
described techniques for incremental integration, where "enterprise dimensional
attributes" are defined and planted in the dimensions of the separate data sources.
We call these conformed dimensions. The development and deployment of
conformed dimensions fits the agile development approach beautifully, since this
kind of integration can be implemented one data source at a time, and one
dimensional attribute at a time, again in a way that is nondestructive to existing
applications. Please see the references section at the end of this white paper for
more information on conformed dimensions.
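
As a rough illustration of how a conformed attribute supports integration, the following sketch uses two hypothetical sources, a sales system and a web clickstream, that have unrelated local keys but share one enterprise attribute, here called customer_category. All names and values are assumptions made for the example. Each source is summarized separately by the conformed attribute, and the two answer sets are then merged on that attribute.

```python
from collections import defaultdict

# Source 1 (hypothetical): sales system with its own customer keys.
sales_customer_dim = {
    101: {"customer_category": "Retail"},
    102: {"customer_category": "Wholesale"},
}
sales_facts = [(101, 120.0), (102, 900.0), (101, 30.0)]  # (customer key, amount)

# Source 2 (hypothetical): clickstream with unrelated visitor keys.
# Only the conformed attribute customer_category is shared with source 1.
web_visitor_dim = {
    "v-9": {"customer_category": "Retail"},
    "v-7": {"customer_category": "Wholesale"},
    "v-3": {"customer_category": "Retail"},
}
click_facts = [("v-9", 14), ("v-7", 3), ("v-3", 6)]  # (visitor key, clicks)


def summarize(facts, dim, attribute):
    """Total a measure grouped by one dimensional attribute."""
    totals = defaultdict(float)
    for key, measure in facts:
        totals[dim[key][attribute]] += measure
    return dict(totals)


def drill_across(left, right):
    """Merge two answer sets on their shared row labels."""
    labels = sorted(set(left) | set(right))
    return [(label, left.get(label), right.get(label)) for label in labels]


sales_by_category = summarize(sales_facts, sales_customer_dim, "customer_category")
clicks_by_category = summarize(click_facts, web_visitor_dim, "customer_category")
report = drill_across(sales_by_category, clicks_by_category)
```

Neither source had to adopt the other's keys. Planting one agreed attribute in each dimension is enough for the combined report, and a further source or attribute could be added later without disturbing the two already integrated.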
Finally, at least one organization interviewed for this white paper has taken agility to
its logical extreme. Individual developers are given complete end-to-end
responsibility for a project, all the way from original sourcing of the data, through
experimental analysis, re-implementing the project for production use, and working
with the end users and their BI tools in supportive mode. Although this development
approach remains an experiment, early results are very interesting because these
developers feel a significant sense of responsibility and pride for their projects.
Lessons from the early data warehousing era
It took most of the 1990s for organizations to understand what a data warehouse
was and how to build and manage those kinds of systems. Interestingly, at the end
of the 1990s, data warehousing was effectively relabeled as business intelligence.
This was a very positive development because it reflected the need for the business
to own and take responsibility for the uses of data.
The earliest data warehouse pioneers had no choice but to do their own systems
integration, assembling best-of-breed components and coping with the inevitable
incompatibilities of dealing with multiple vendors. By the end of the 1990s,
the best-of-breed approach gave way to vendor stacks of integrated products, a
trend that continues today. At this point, there are only a few independent
vendors in the data warehouse space, and those vendors have succeeded by
interfacing with nearly every conceivable format and interface, thereby providing
bridges between the more limited proprietary vendor stacks.
With the benefit of hindsight gained from the traditional data warehouse experience,
the big data analytics version of data warehousing is likely to consolidate quite
quickly. Only the bravest organizations with very strong software development skills
should consider rolling their own big data analytics applications directly on raw
MapReduce/Hadoop. For information management organizations wishing to focus on
the business issues rather than on the breaking wave of software development, a
packaged Hadoop distribution (e.g., Cloudera) makes a lot of sense. The leading ETL
platform vendors likely will also introduce packaged environments for handling many
of the phases of MapReduce/Hadoop development.
Analytics in the cloud
This white paper has not discussed cloud implementations of big data analytics. Most
of the enterprises interviewed for this white paper were not using public cloud
implementations for their production analytics. Nevertheless, cloud implementations
may be very attractive in the startup phase for an analytics effort. A cloud service can
provide instant scalability during this startup phase, without committing to a massive
legacy investment in hardware. Data analysis projects can be turned on and turned
off on short notice. Recall that typical analytic environments may involve hundreds of
separate sandboxes and parallel experiments.
Many of the organizations interviewed for this paper stated that mature analytics
should be brought in-house, perhaps implemented technically as a cloud but within
the confines of the organization. Of course, such an in-house cloud may reduce fears
of security and privacy breaches (fairly or not).
A remote cloud implementation raises issues of network bandwidth, especially in a
broadly integrated application with multiple very large data sets in different locations.
Imagine solving the big join problem where your trillion row fact table is out on the
cloud, and your billion row dimension table is located in-house.
Although the best performing systems try to achieve a three-way balance among
CPU, disk speed, and bandwidth, most organizations interviewed for this paper
predicted that bandwidth would emerge as the number one limiting factor for big data
analytics system performance.
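
A back-of-the-envelope calculation gives a feel for why bandwidth dominates the big join described above. The row counts come from the text. The row width of 100 bytes and the 10 gigabit per second link are assumptions chosen only to make the arithmetic concrete, and the sketch ignores compression, protocol overhead, and parallel links.

```python
def transfer_hours(rows, bytes_per_row, link_gbits_per_sec):
    """Hours needed to move a table across a network link at full line rate."""
    total_bits = rows * bytes_per_row * 8
    seconds = total_bits / (link_gbits_per_sec * 1e9)
    return seconds / 3600


# Assumed figures (not from the paper): 100-byte rows, 10 Gbit/s link.
BYTES_PER_ROW = 100
LINK_GBITS_PER_SEC = 10

fact_hours = transfer_hours(10**12, BYTES_PER_ROW, LINK_GBITS_PER_SEC)  # trillion rows
dim_hours = transfer_hours(10**9, BYTES_PER_ROW, LINK_GBITS_PER_SEC)    # billion rows
```

Under these assumed figures, moving the billion row dimension table takes about 80 seconds, while moving the trillion row fact table takes roughly 22 hours. Whichever way the join is planned, the network transfer rather than CPU or disk speed is likely to set the elapsed time.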
Whither EDW?
The enterprise data warehouse must expand to encompass big data analytics as part
of overall information management. The mission of the data warehouse has always
been to collect the data assets of the organization and structure them in a way that is
most useful to decision-makers. Although some organizations may persist with a box
on the org chart labeled EDW that is restricted to traditional reporting activities on
transactional data, the scope of the EDW should grow to reflect these new big data
developments. In some sense there are only two functions of IT: getting the data in
(transaction processing), and getting the data out. The EDW is getting the data out.
The big choice facing shops with growing big data analytics investments is whether to
choose an RDBMS-only solution, or a dual RDBMS and MapReduce/Hadoop
solution. This author predicts that the dual solution will dominate, and in many cases
the two architectures will not exist as separate islands but rather will have rich data
pipelines going in both directions. It is safe to say that both architectures will evolve
hugely over the next decade, but this author predicts that both architectures will share
the big data analytics marketplace at the end of the decade.
Sometimes when an exciting new technology arrives, there is a tendency to close the
door on older technologies as if they were going to go away. Data warehousing has
built an enormous legacy of experience, best practices, supporting structures,
technical expertise, and credibility with the business world. This will be the foundation
for information management in the upcoming decade as data warehousing expands
to include big data analytics.
Acknowledgments
This author is grateful for Informatica's sponsoring of this white paper and for
providing absolutely no "vendor bias." The opinions in this white paper are solely the
responsibility of the author.
A number of smart and knowledgeable big data practitioners made themselves
available for interviews during the research phase of the white paper. These
individuals provided many useful insights. In alphabetic order by organization, we
thank
Amr Awadallah, Mike Olson, Cloudera
Brian Dolan, Discovix
Oliver Ratzesberger, eBay
Alex Ignatius, Electronic Arts
William Schmarzo, EMC
Ashish Thusoo, Facebook
Julianna DeLua, John Haddad, Sanjay Krishnamurthi, Ron Lunasin, Informatica
Nicholas Wakefield, LinkedIn
Dan Graham, Dilip Krishna, Ron Kunze, Teradata
Professor Michael Franklin, Computer Science Department, U.C. Berkeley
Raymie Stata, Yahoo!
Dan McCaffrey, Ken Rudin, Zynga
References
An Architecture for Data Quality, a Kimball Group Whitepaper: http://vip.informatica.com/?elqPURLPage=8784
Essential Steps for the Integrated EDW, a Kimball Group Whitepaper: http://vip.informatica.com/?elqPURLPage=8785
Hadoop: The Definitive Guide, 2nd Edition, Tom White, O'Reilly (2011)
Hadoop-ML website: http://videolectures.net/nipsworkshops09_pednault_hmli/
MAD Skills: New Analysis Practices for Big Data, Cohen, Dolan et al., http://db.cs.berkeley.edu/jmh/papers/madskills-032009.pdf