1597 edw big data analytics kimball


  • 8/12/2019 1597 EDW Big Data Analytics Kimball

    1/33

    The Evolving Role of the Enterprise Data Warehouse in the Era of Big Data Analytics

    A Kimball Group White Paper

    By Ralph Kimball

    Table of Contents

    Executive Summary
    About the Author
    Introduction
    Data is an asset on the balance sheet
    Raising the curtain on big data analytics
    Use cases for big data analytics
    Making sense of big data analytic use cases
    Big data analytics system requirements
    Extended relational database management systems
    MapReduce/Hadoop systems
    How MapReduce works in Hadoop
    Tools for the Hadoop environment
    Feature convergence in the coming decade
    Reusable analytics
    Complex event processing (CEP)
    Data warehouse cultural changes in the coming decade
    Sandboxes
    Low latency
    Continuous thirst for more exquisite detail
    Light touch data waits for its relevance to be exposed
    Simple analysis of all the data trumps sophisticated analysis of some of the data
    Data structures should be declared at query time, not at data load time
    The EDW supporting big data analytics must be magnetic, agile, and deep
    The conflict between abstraction and control
    Data warehouse organization changes in the coming decade
    Technical skill sets required
    New organizations required
    New development paradigms required
    Lessons from the early data warehousing era
    Analytics in the cloud
    Whither EDW?
    Acknowledgements
    References

    Executive Summary

    In this white paper, we describe the rapidly evolving landscape for designing an enterprise data warehouse (EDW) to support business analytics in the era of "big data." We describe the scope and challenges of building and evolving a very stable and successful EDW architecture to meet new business requirements. These include extreme integration, semi- and un-structured data sources, petabytes of behavioral and image data accessed through MapReduce/Hadoop as well as massively parallel relational databases, and then structuring the EDW to support advanced analytics. This paper provides detailed guidance for designing and administering the necessary processes for deployment. This white paper has been written in response to a lack of specific guidance in the industry as to how the EDW needs to respond to the big data analytics challenge, and what necessary design elements are needed to support these new requirements.

    About the Author

    Ralph Kimball founded the Kimball Group. Since the mid 1980s, he has been the data warehouse/business intelligence (DW/BI) industry's thought leader on the dimensional approach and has trained more than 10,000 IT professionals. Prior to working at Metaphor and founding Red Brick Systems, Ralph co-invented the Star workstation at Xerox's Palo Alto Research Center (PARC). Ralph has his Ph.D. in Electrical Engineering from Stanford University. The Kimball Group is the source for dimensional DW/BI consulting and education, consistent with our best-selling Toolkit book series, Design Tips, and award-winning articles. Visit www.kimballgroup.com for more information.

    Introduction

    What is big data? Its bigness is actually not the most interesting characteristic. Big data is structured, semistructured, unstructured, and raw data in many different formats, in some cases looking totally different from the clean scalar numbers and text we have stored in our data warehouses for the last 30 years. Much big data cannot be analyzed with anything that looks like SQL. But most important, big data is a paradigm shift in how we think about data assets: where we collect them, how we analyze them, and how we monetize the insights from the analysis. The big data revolution is about finding new value within and outside conventional data sources. An additional approach is needed because the software and hardware environments of the past have not been able to capture, manage, or process the new forms of data within reasonable development times or processing times. We are challenged to reorganize our information management landscape to extend a remarkably stable and successful EDW architecture to this new era of big data analytics.

    In reading this white paper please bear in mind that the consistent view of this author has always been that the "data warehouse" comprises the complete ecosystem for extracting, cleaning, integrating and delivering data to decision makers, and therefore includes the extract-transform-load (ETL) and business intelligence (BI) functions considered as outside of the data warehouse by more conservative writers. This author has always taken the view that data warehousing has a very comprehensive role in capturing all forms of enterprise data, and then preparing that data for the most effective use by decision-makers all across the enterprise. This white paper takes the aggressive view that the enterprise data warehouse is on the verge of a very exciting new set of responsibilities. The scope of the EDW will increase dramatically.

    Also, in this white paper, although we consistently use the term ETL to describe the movement of data within the enterprise data warehouse, the conventional use of this term does not do justice to the much larger responsibility of moving data across networks and between systems and between profoundly different processes in the world of big data analytics. ETL is a portion of a much larger technology called data integration (DI). Since we have used ETL consistently in our books and classes for many years, we will keep that terminology in this paper, bearing in mind that ETL is meant in the larger sense of DI.

    This white paper stands back from the marketplace as it exists in early 2011 to highlight the clearly emerging new trends brought by the big data revolution. And a revolution it is. As James Markarian, Informatica's Executive Vice President and Chief Technology Officer, remarked: "the database market has finally gotten interesting again." Because many of the new big data tools and approaches are version 1 or even version 0 developments, the landscape will continue to change rapidly. However, there is growing awareness in the marketplace that new kinds of analysis are possible and that key competitors, especially e-commerce enterprises, are already taking advantage of the new paradigm. This white paper is intended to be a guide to help business intelligence, data warehousing and information management professionals and management teams understand and prepare for big data as a complementary extension to their current EDW architecture.

    Data is an asset on the balance sheet

    Enterprises increasingly recognize that data itself is an asset that should appear on the balance sheet in the same way that traditional assets from the manufacturing age such as equipment and land have always appeared. There are several ways to determine the value of the data asset, including:

    • cost to produce the data
    • cost to replace the data if it is lost
    • revenue or profit opportunity provided by the data
    • revenue or profit loss if data falls into competitors' hands
    • legal exposure from fines and lawsuits if data is exposed to the wrong parties

    But more important than the data itself, enterprises have shown that insights from data can be monetized. When an e-commerce site detects an increase in favorable click-throughs from an experimental ad treatment, that insight can be taken to the bottom line immediately. This direct cause-and-effect is easily understood by management, and an analytic research group that consistently demonstrates these insights is looked upon as a strategic resource for the enterprise by the highest levels of management. This growth in business awareness of the value of data-driven insights is rapidly spreading outward from the e-commerce world to virtually every business segment.

    Data warehousing, of course, has been demonstrating the value of data-driven insights for at least 20 years. But until quite recently data warehousing has been focused on historical transaction data. During the past decade from 2000 to 2009, three major seismic shifts occurred in data warehousing. The first, early in the decade, was the decisive introduction of low latency operational data into the data warehouse together with the existing historical data. Of course, many of these new operational data use cases benefited from real-time data, in some cases demanding instantaneous delivery. The second seismic shift, growing steadily throughout the decade, was the gathering of customer behavior data, which not only included traditional transactions such as purchases and click-throughs but added huge volumes of "subtransactions" that represented measurable events leading up to the transactions themselves. For example, all the web page events a customer engaged in prior to the final transaction event became a record of customer behavior. "Good paths" through these web page event histories gave lots of insight into productive (i.e., monetizable) customer behavior.

    The third seismic shift, which is gathering enormous momentum as we transition into the current decade, is the extraction of product preferences and customer sentiments from social media, especially the massive quantities of machine-generated unstructured data generated by the new business paradigms of dot-com companies. It is this final seismic shift that has pushed many enterprises into looking seriously at unstructured data for the first time, and asking "how on earth do we analyze this stuff?" The point here is not that unstructured data is some new thing recently discovered, but rather that the analysis of unstructured data has gone mainstream just recently.

    Raising the curtain on big data analytics

    Use cases for big data analytics

    Big data analytics use cases are spreading like wildfire. Here is a set of use cases reported recently, including a benchmark set of "Hadoop-able" use cases proposed by Jeff Hammerbacher, Chief Scientist for Cloudera. Following these brief descriptions is a table summarizing the salient structure and processing characteristics of each use case. Note that none of these use cases can be satisfied with scalar numeric data, nor can any be properly analyzed by simple SQL statements. All of them can be scaled into the petabyte range and beyond with appropriate business assumptions.

    Search ranking. All search engines attempt to rank the relevance of a web page to a search request against all other possible web pages. Google's PageRank algorithm is, of course, the poster child for this use case.

    Ad tracking. E-commerce sites typically record an enormous river of data including every page event in every user session. This allows for very short turnaround of experiments in ad placement, color, size, wording, and other features. When an experiment shows that such a feature change in an ad results in improved click-through behavior, the change can be implemented virtually in real time.

    Location and proximity tracking. Many use cases add precise GPS location tracking, together with frequent updates, in operational applications, security analysis, navigation, and social media. Precise location tracking opens the door for an enormous ocean of data about other locations near the GPS measurement. These other locations may represent opportunities for sales or services.

    Causal factor discovery. Point-of-sale data has long been able to show us when the sales of a product go sharply up or down. But searching for the causal factors that explain these deviations has been, at best, a guessing game or an art form. The answers may be found in competitive pricing data, competitive promotional data including print and television media, weather, holidays, national events including disasters, and virally spread opinions found in social media. See the next use case as well.

    Social CRM. This use case is one of the hottest new areas for marketing analysis. The Altimeter Group has described a very useful set of key performance indicators for social CRM that include share of voice, audience engagement, conversation reach, active advocates, advocate influence, advocacy impact, resolution rate, resolution time, satisfaction score, topic trends, sentiment ratio, and idea impact. The calculation of these KPIs involves in-depth trolling of a huge array of data sources, especially unstructured social media.

    Document similarity testing. Two documents can be compared to derive a metric of similarity. There is a large body of academic research and tested algorithms, for example latent semantic analysis, that is just now finding its way to driving monetized insights of interest to big data practitioners. For example, a single source document can be used as a kind of multifaceted template to compare against a large set of target documents. This could be used for threat discovery, sentiment analysis, and opinion polls. For example: "find all the documents that agree with my source document on global warming."
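    As a rough illustration of deriving a similarity metric between a source document and a set of target documents, the sketch below uses plain cosine similarity over word counts. This is a deliberately simpler technique than the latent semantic analysis mentioned above, and the function names and sample documents are hypothetical.

```python
import math
import re
from collections import Counter


def term_counts(text):
    """Lowercase the text and count each alphabetic word."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors (0.0 to 1.0)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_targets(source, targets):
    """Return (score, target) pairs, most similar to the source first."""
    src = term_counts(source)
    scored = [(cosine_similarity(src, term_counts(t)), t) for t in targets]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Hypothetical sample documents, for illustration only.
source_doc = "global warming is raising sea levels"
target_docs = [
    "sea levels are rising because of global warming",
    "the quarterly sales report was due on friday",
]
ranking = rank_targets(source_doc, target_docs)
```

    Word overlap only measures shared vocabulary. Deciding whether a target document actually "agrees" with the source would need something richer, which is where the more advanced algorithms come in.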

    Genomics analysis: e.g., commercial seed gene sequencing. A few months ago the cotton research community was thrilled by a genome sequencing announcement that stated in part: "The sequence will serve a critical role as the reference for future assembly of the larger cotton crop genome. Cotton is the most important fiber crop worldwide and this sequence information will open the way for more rapid breeding for higher yield, better fiber quality and adaptation to environmental stresses and for insect and disease resistance. Scientist Ryan Rapp stressed the importance of involving the cotton research community in analyzing the sequence, identifying genes and gene families and determining the future directions of research." (SeedQuest, Sept 22, 2010). This use case is just one example of a whole industry that is being formed to address genomics analysis broadly, beyond this example of seed gene sequencing.

    Discovery of customer cohort groups. Customer cohort groups are used by many enterprises to identify common demographic trends and behavior histories. We are all familiar with Amazon's cohort groups when they say other customers who bought the same book as you have also bought the following books. Of course, if you can sell your product or service to one member of a cohort group, then all the rest may be reasonable prospects. Cohort groups are represented logically and graphically as links, and much of the analysis of cohort groups involves specialized link analysis algorithms.
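    To make the idea of cohort groups as links concrete, here is a minimal sketch that counts co-purchase links between items and then lists what other buyers of a given item also bought. The customer and book identifiers are made up, and this is a toy form of link analysis, not any retailer's actual algorithm.

```python
from collections import Counter, defaultdict
from itertools import combinations


def co_purchase_links(baskets):
    """Count, for each pair of items, how many customers bought both."""
    links = Counter()
    for items in baskets.values():
        for a, b in combinations(sorted(set(items)), 2):
            links[(a, b)] += 1
    return links


def also_bought(item, links):
    """Items linked to `item`, strongest link first, ties broken by name."""
    related = defaultdict(int)
    for (a, b), weight in links.items():
        if a == item:
            related[b] += weight
        elif b == item:
            related[a] += weight
    return sorted(related.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical purchase histories keyed by customer.
purchases = {
    "cust1": ["book_a", "book_b"],
    "cust2": ["book_a", "book_b", "book_c"],
    "cust3": ["book_a", "book_c"],
    "cust4": ["book_d"],
}
links = co_purchase_links(purchases)
suggestions = also_bought("book_a", links)
```

    The pairwise counting grows quadratically with basket size, which hints at why this kind of analysis becomes a big data problem at scale.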

    In-flight aircraft status. This use case as well as the following two use cases are made possible by the introduction of sensor technology everywhere. In the case of aircraft systems, in-flight status of hundreds of variables on engines, fuel systems, hydraulics, and electrical systems is measured and transmitted every few milliseconds. The value of this use case lies not just in the engineering telemetry data that could be analyzed at some future point in time, but also in driving real-time adaptive control, fuel usage, part failure prediction, and pilot notification.

    Smart utility meters. It didn't take long for utility companies to figure out that a smart meter can be used for more than just the monthly readout that produces the customer's utility bill. By drastically cranking up the frequency of the readouts to as much as one readout per second per meter across the entire customer landscape, many useful analyses can be performed including dynamic load-balancing, failure response, adaptive pricing, and longer-term strategies for incenting customers to utilize the utility more effectively (either from the customer's point of view or the utility's point of view!).


    Data bag exploration. There are many situations in commercial environments and in the research communities where large volumes of raw data are collected. One example might be data collected about structure fires. Beyond the predictable dimensions of time, place, primary cause of fire, and responding firefighters, there may be a wealth of unpredictable anecdotal data that at best can be modeled as a disorderly collection of name value pairs, such as "contributing weather=lightning." Another example would be the listing of all relevant financial assets for a defendant in a lawsuit. Again such a list is likely to be a disorderly collection of name value pairs, such as "shared real estate ownership=condominium." The list of examples like this is endless. What they have in common is the need to encapsulate the disorderly collection of name value pairs, which is generally known as a "data bag." Complex data bags may contain both name value pairs and embedded sub data bags. The challenge in this use case is to find a common way to approach the analysis of data bags when the content of the data may need to be discovered after the data is loaded.
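    One way to picture a data bag is as a nested dictionary whose names are not known in advance. The sketch below assumes that representation and shows how the set of names might be discovered only after the bags are loaded. The field names other than "contributing weather" and all of the values are hypothetical.

```python
from collections import Counter


def iter_paths(bag, prefix=""):
    """Yield (path, value) for every leaf name value pair, descending into sub data bags."""
    for name, value in bag.items():
        path = f"{prefix}/{name}" if prefix else name
        if isinstance(value, dict):
            yield from iter_paths(value, path)
        else:
            yield path, value


def discover_names(bags):
    """Count how often each name path occurs across a collection of data bags."""
    counts = Counter()
    for bag in bags:
        counts.update(path for path, _ in iter_paths(bag))
    return counts


# Two structure fire reports loaded as-is, with no schema declared up front.
fire_reports = [
    {"contributing weather": "lightning", "structure": {"type": "barn", "floors": 1}},
    {"structure": {"type": "house"}, "witness note": "smoke seen at dawn"},
]
name_counts = discover_names(fire_reports)
```

    Here `name_counts` acts as a schema discovered after the fact: it shows which names are common enough to promote to regular dimensions and which remain anecdotal.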

    The final two use cases are old and venerable examples that even predate data warehousing itself. But new life has been breathed into these use cases because of the exciting potential of ultra-atomic customer behavior data.

    Loan risk analysis and insurance policy underwriting. In order to evaluate the risk of a prospective loan or a prospective insurance policy, many data sources can be brought into play, ranging from payment histories and detailed credit behavior to employment data and financial asset disclosures. In some cases the collateral for a loan or the insured item may be accompanied by image data.

    Customer churn analysis. Enterprises concerned with churn want to understand the predictive factors leading up to the loss of a customer, including that customer's detailed behavior as well as many external factors including the economy, life stage and other demographics of the customer, and finally real-time competitive issues.

    Making sense of big data analytic use cases

    Certainly the purpose of developing this list of use cases is to convince the reader that the use cases come in all shapes and sizes and formats, and require many specialized approaches to analyze. Up until very recently all these use cases existed as separate endeavors, often involving special purpose built systems. But the industry awareness of the "big data analytics challenge" is motivating everyone to look for the architectural similarities and differences across all these use cases. Any given enterprise is increasingly likely to encounter one or more of these use cases. That realization is driving the interest in system architectures that address the big data analytics problem in a general way. Please study the following table.

    The sheer density of this table makes it clear that systems to support big data analytics have to look very different from the classic relational database systems from the 1980s and 1990s. The original RDBMSs were not built to handle any of the requirements represented as columns in this table!

    Table columns: Vector, matrix, or complex structure | Free text | Image or binary data | Data bags | Iterative logic or complex branching | Advanced analytic routines | Rapidly repeated measurements | Extreme low latency | Access to all data required

    Search ranking            X X X X X X
    Ad tracking               X X X X X X X X
    Location & proximity      X X X X X
    Causal discovery          X X X X X X X
    Social CRM                X X X X X X X X
    Document similarity       X X X X X X X
    Genomic analysis          X X X X X
    Cohort groups             X X X X X X
    In-flight engine status   X X X X X X
    Smart utility meters      X X X X X X
    Building sensors          X X X X X X X X
    Satellite images          X X X X
    CAT scans                 X X X X X X
    Financial fraud           X X X X X X X X X
    Hacking detection         X X X X X X X X X
    Game gestures             X X X X X X X X
    Big science               X X X X X X X X X
    Data bag exploration      X X X X X X
    Risk analysis             X X X X X X X X
    Churn analysis            X X X X X X X

    Big data analytics system requirements

    Before discussing the exciting new technical and architectural developments of the 2010s, let's summarize the overall requirements for supporting big data analytics, keeping in mind that we are not requiring a single system or a single vendor's technology to provide a blanket solution for every use case. From the perspective of 2011, we have the luxury of standing back from all these use cases gathered in the last few years, and we are now in a position to surround the requirements with some confidence.

    The development of big data analytics has reached a point where it needs an overall mission statement and identity independent of a list of use cases. Many of us have lived through earlier instantiations of advanced analytics that went by the names of advanced statistics, artificial intelligence and data mining. None of these earlier waves became a coherent theme that transcended the individual examples, as compelling as those examples were.

    Here is an attempt to step back and define the characteristics of big data analytics at the highest levels. In the following, the term "UDF" is used in the broadest sense of any user defined function or program or algorithm that may appear anywhere in the end-to-end analysis architecture.

    In the coming 2010s decade, the analysis of big data will require a technology or combination of technologies capable of:

    • scaling to easily support petabytes (thousands of terabytes) of data
    • being distributed across thousands of processors, potentially geographically unaware, and potentially heterogeneous
    • delivering subsecond response time for highly constrained standard SQL queries
    • embedding arbitrarily complex user-defined functions (UDFs) within processing requests
    • implementing UDFs in a wide variety of industry-standard procedural languages
    • assembling extensive libraries of reusable UDFs crossing most or all of the use cases
    • executing UDFs as "relation scans" over petabyte sized data sets in a few minutes
    • supporting a wide variety of data types growing to include images, waveforms, arbitrarily hierarchical data structures, and data bags
    • loading data to be ready for analysis, at very high rates, at least gigabytes per second
    • integrating data from multiple sources during the load process at very high rates (GB/sec)
    • loading data before declaring or discovering its structure
    • executing certain streaming analytic queries in real time on incoming load data
    • updating data in place at full load speeds
    • joining a billion row dimension table to a trillion row fact table without pre-clustering the dimension table with the fact table
    • scheduling and executing complex multi-hundred node workflows
    • being configured without being subject to a single point of failure
    • failing over and continuing processing when processing nodes fail
    • supporting extreme mixed workloads including thousands of geographically dispersed on-line users and programs executing a variety of requests ranging from ad hoc queries to strategic analysis, while loading data in batch and streaming fashion
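    A little arithmetic shows how demanding the scan and load requirements in this list are. The sketch below is only a back-of-the-envelope calculation. It assumes decimal units, reads "a few minutes" as five minutes, and reads "thousands of processors" as 2,000 nodes; none of those figures comes from the list itself.

```python
PETABYTE = 10**15   # bytes, decimal units (assumption)
GIGABYTE = 10**9


def required_scan_rate(data_bytes, seconds):
    """Aggregate bytes per second needed to scan data_bytes within the time budget."""
    return data_bytes / seconds


def load_days(data_bytes, bytes_per_second):
    """Days needed to load data_bytes at a sustained rate."""
    return data_bytes / bytes_per_second / 86_400


scan_minutes = 5      # assumed reading of "a few minutes"
node_count = 2_000    # assumed reading of "thousands of processors"

aggregate = required_scan_rate(PETABYTE, scan_minutes * 60)
per_node = aggregate / node_count
days_at_1_gb_s = load_days(PETABYTE, 1 * GIGABYTE)

print(f"aggregate scan rate: {aggregate / GIGABYTE:,.0f} GB/s")
print(f"per-node scan rate:  {per_node / GIGABYTE:.2f} GB/s")
print(f"loading 1 PB at 1 GB/s takes {days_at_1_gb_s:.1f} days")
```

    Under these assumptions a petabyte relation scan needs an aggregate rate of several thousand GB/s, and a sustained 1 GB/s load takes more than a week to accumulate a petabyte. That is why the requirements call for massive parallelism in both scanning and loading.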

    Two architectures have emerged to address big data analytics: extended RDBMS, and MapReduce/Hadoop. These architectures are being implemented as completely separate systems and in various interesting hybrid combinations involving both architectures. We will start by discussing the architectures separately.

    Extended relational database management systems

    All of the major relational database management system vendors are adding features to address big data analytics from a solid relational perspective. The two most significant architectural developments have been the overtaking of the high end of the market with massively parallel processing (MPP), and the growing adoption of columnar storage. When MPP and columnar storage techniques are combined, a number of the system requirements in the above list can start to be addressed, including:

    • scaling to support exabytes (thousands of petabytes) of data
    • being distributed across tens of thousands of geographically dispersed processors
    • subsecond response time for highly constrained standard SQL queries
    • updating data in place at full load speeds
    • being configured without being subject to a single point of failure
    • failover and process continuation when processing nodes fail

    Additionally, RDBMS vendors are adding some complex user-defined functions (UDFs) to their syntax, but the kind of general purpose procedural language computing required by big data analytics is not being satisfied in relational environments at this time.

    In a similar vein, RDBMS vendors are allowing complex data structures to be stored in individual fields. These kinds of embedded complex data structures have been known as "blobs" for many years. It's important to understand that relational databases have a hard time providing general support for interpreting blobs since blobs do not fit the relational paradigm. An RDBMS indeed provides some value by hosting the blobs in a structured framework, but much of the complex interpretation and computation on the blobs must be done with specially crafted UDFs, or BI application layer clients. Blobs are related to data bags discussed elsewhere in this paper. See the section entitled "Data structures should be declared at query time."

    MPP implementations have never satisfactorily addressed the "big join" issue, where an attempt is made to join a billion row dimension table to a trillion row fact table without resorting to clustered storage. The big join crisis occurs when an ad hoc constraint is placed against the dimension table, resulting in a potentially very large set of dimension keys that must be physically downloaded into every one of the physical segments of the trillion row fact table stored separately in the MPP system. Since the dimension keys are scattered randomly across the separate segments of the trillion row fact table, it is very hard to avoid a lengthy download step of the very large dimension table to every one of the fact table storage partitions. To be fair, the MapReduce/Hadoop architecture has not been able to address the big join problem either.
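    The cost structure of the big join can be seen even in a toy model. The sketch below is a single-process simulation with hypothetical table and column names, not any vendor's implementation. It applies an ad hoc constraint to a small dimension table and then ships the surviving keys to every fact table segment, counting how many keys had to be moved.

```python
def dimension_keys_matching(dimension_rows, predicate):
    """Apply an ad hoc constraint to the dimension table and collect surviving keys."""
    return {row["key"] for row in dimension_rows if predicate(row)}


def join_on_segments(fact_segments, keys):
    """Ship the full key set to every fact segment and filter locally."""
    keys_shipped = 0
    matched = []
    for segment in fact_segments:
        local_keys = set(keys)  # stands in for the network download to this segment
        keys_shipped += len(local_keys)
        matched.extend(row for row in segment if row["customer_key"] in local_keys)
    return matched, keys_shipped


# Hypothetical toy data: 6 customers, facts scattered over 3 segments.
customers = [{"key": k, "state": "CA" if k % 2 == 0 else "NY"} for k in range(6)]
segments = [
    [{"customer_key": 0, "amount": 10}, {"customer_key": 3, "amount": 5}],
    [{"customer_key": 4, "amount": 7}, {"customer_key": 1, "amount": 2}],
    [{"customer_key": 2, "amount": 9}, {"customer_key": 5, "amount": 1}],
]

ca_keys = dimension_keys_matching(customers, lambda row: row["state"] == "CA")
rows, shipped = join_on_segments(segments, ca_keys)
total = sum(row["amount"] for row in rows)
```

    The number of keys shipped is the size of the key set multiplied by the number of segments. With hundreds of millions of surviving keys and thousands of segments, that download step dominates the query.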

    Columnar data storage fits the relational paradigm, and especially dimensionally modeled databases, very well. Besides the significant advantage of high compression of sparse data, columnar databases allow a very large number of columns compared to row-oriented databases, and place little overhead on the system when columns are added to an existing schema. The most significant Achilles' heel, at least in 2011, is the slow loading speed of data into the columnar format. Although impressive load speed improvements are being announced by columnar database vendors, they have still not achieved the gigabytes-per-second requirement listed above.
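    The two advantages claimed for columnar storage can be illustrated with a small sketch. It pivots rows into per-column lists and compresses a sparse column with run-length encoding, which is used here only as one simple example of a compression scheme and is an assumption on our part. The column names and values are hypothetical.

```python
def to_columns(rows, column_names):
    """Pivot a list of row dicts into one list of values per column."""
    return {name: [row.get(name) for row in rows] for name in column_names}


def run_length_encode(values):
    """Collapse consecutive repeats into [value, count] pairs."""
    encoded = []
    for value in values:
        if encoded and encoded[-1][0] == value:
            encoded[-1][1] += 1
        else:
            encoded.append([value, 1])
    return encoded


rows = [
    {"order_id": 1, "promo_code": None},
    {"order_id": 2, "promo_code": None},
    {"order_id": 3, "promo_code": "SPRING"},
    {"order_id": 4, "promo_code": None},
    {"order_id": 5, "promo_code": None},
]
columns = to_columns(rows, ["order_id", "promo_code"])
compressed_promo = run_length_encode(columns["promo_code"])

# Adding a column touches only the new column's storage, not the existing ones.
columns["gift_wrap"] = [None] * len(rows)
```

    The pivot step also hints at the loading cost: every incoming row must be split apart and appended to many separate column structures.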

    The standard RDBMS architecture for implementing an enterprise data warehouse based on dimensional modeling principles is simple and well understood, as shown in Figure 1. Recall that throughout this white paper, the EDW is defined in the comprehensive sense to include all back room and front room processes including ETL, data presentation, and BI applications.

    Figure 1. The standard RDBMS based architecture for an enterprise data warehouse
    Source: The Data Warehouse Lifecycle Toolkit, 2nd edition, Kimball et al. (2008)

    In this standard EDW architecture the ETL system is a major component that sits between the source systems and the presentation servers that are responsible for exposing all data to business intelligence applications. In this view, the ETL system adds significant value by cleaning, conforming, and arranging the data into a series of dimensional schemas which are then stored physically in the presentation server. A crucial element of this architecture is the preparation of conformed dimensions in the ETL system that serve as the basis of integration for the BI applications.

    It is the strong conviction of this author that deferring the building of the dimensional structures and the issues of integration until query time is the wrong architecture. Such a "deferred computation" approach requires an unduly expensive query optimizer to correctly query complex non-dimensional models every time a query is presented. The calculation of integration at query processing time generally requires complex application logic in the BI tools which also might have to be executed for every query.

    The extended RDBMS architecture to support big data analytics preserves the standard architecture with a number of important additions, shown below in Figure 2 with large arrows:

    Figure 2. The extended RDBMS based architecture for an enterprise data warehouse

    The fact that the high-level enterprise data warehouse architecture is not materially changed by the introduction of new data structures, or a growing library of specially crafted user-defined functions, or powerful procedural language-based programs acting as powerful BI clients, is the charm of the extended RDBMS approach to big data analytics. The major RDBMS players are able to marshal their enormous legacy of millions of lines of code, powerful governance capabilities, and system stability built over decades of serving the marketplace.

    However, it is the opinion of this author that the extended RDBMS systems cannot be the only solution for big data analytics. At some point, tacking on non-relational data structures and non-relational processing algorithms to the basic, coherent RDBMS architecture will become unwieldy and inefficient. The Swiss Army knife analogy comes to mind. Another analogy closer to the topic is the programming language PL/1. Originally designed as an overarching, multipurpose, powerful programming language for all forms of data and all applications, it ultimately became a bloated and sprawling corpus that tried to do too many things in a single language. Since the heyday of PL/1 there has been a wonderful evolution of more narrowly focused programming languages with many new concepts and features that simply couldn't be tacked onto PL/1 after a certain point. Relational database management systems do so many things so well that there is no danger of suffering the same fate as PL/1. But the big data analytics space is growing so rapidly and in such exciting and unexpected new directions that a lighter weight, more flexible and more agile processing framework in addition to RDBMS systems may be a reasonable alternative.

    MapReduce/Hadoop systems

    MapReduce is a processing framework originally developed by Google in the early
    2000s for performing web page searches across thousands of physically separated
    machines. The MapReduce approach is extremely general. Complete MapReduce
    systems can be implemented in a variety of languages although the most significant
    implementation is in Java. MapReduce is really a UDF (user defined function)
    execution framework, where the "F" can be extraordinarily complex. Originally
    targeted to building Google's web page search index, a MapReduce job can be
    defined for virtually any data structure and any application. The target processors that
    actually perform the requested computation can be identical (a "cluster"), or can be a
    heterogeneous mix of processor types (a "grid"). The data in each processor upon
    which the ultimate computation is performed can be stored in a database, or more
    commonly in a file system, and can be in any digital format.

    The most significant implementation of MapReduce is Apache Hadoop, known simply
    as Hadoop. Hadoop is an open source, top-level Apache project, with thousands of
    contributors and a whole industry of diverse applications. Hadoop runs natively on its
    own distributed file system (HDFS) and can also read and write to Amazon S3 and
    others. Conventional database vendors are also implementing interfaces to allow
    Hadoop jobs to be run over massively distributed instances of their databases.

    As we will see when we give a brief overview of how a Hadoop job works, bandwidth
    between the separate processors can be a huge issue. HDFS is a so-called "rack
    aware" file system because the central name node knows which nodes reside on the
    same rack and which are connected by more than one network hop. Hadoop exploits
    the relationship between the central job dispatcher and HDFS to significantly optimize
    a massively distributed processing task by having detailed knowledge of where data
    actually resides. This also implies that a critical aspect of performance control is
    co-locating segments of data on actual physical hardware racks so that the MapReduce
    communication can be accomplished at backplane speeds rather than slower network
    speeds. Note that remote cloud-based file systems such as Amazon S3 and

    CloudStore are, by their nature, unable to provide the rack aware benefit. Of course,
    cloud-based file systems have a number of compelling advantages which we'll
    discuss later.

    How MapReduce works in Hadoop

    A MapReduce job is submitted to a centralized JobTracker, which in turn schedules
    parts of the job to a number of TaskTracker nodes. Although in general a
    TaskTracker may fail and its task can be reassigned by the JobTracker, the
    JobTracker is a single point of failure. If the JobTracker halts, the MapReduce job
    must be restarted or be resumed from intermediate snapshots.

    A MapReduce job is always divided into two distinct phases, map and reduce. The
    overall input to a MapReduce job is divided into many equal sized splits, each of
    which is assigned a map task. The map function is then applied to each record in
    each split. For large jobs, the job tracker schedules these map tasks in parallel. The
    overall performance of a MapReduce job depends significantly on achieving a
    balance of enough parallel splits to keep many machines busy, but not so many
    parallel splits that the interprocess communication of managing all the splits bogs
    down the overall job. When MapReduce is run over the HDFS file system, a typical
    default split size is 64 MB of input data.
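
    To make the split arithmetic concrete, the following is a minimal sketch in Python. It
    assumes only the 64 MB split size quoted above; the function names and the 400-slot
    cluster are hypothetical and chosen purely for illustration.

```python
MB = 1024 * 1024
DEFAULT_SPLIT_BYTES = 64 * MB  # the typical HDFS split size quoted in the text


def count_splits(input_bytes: int, split_bytes: int = DEFAULT_SPLIT_BYTES) -> int:
    """Return the number of splits (and so map tasks) for an input of this size."""
    if input_bytes < 0 or split_bytes <= 0:
        raise ValueError("input size must be non-negative and split size positive")
    # Integer ceiling division: a final partial split still needs its own map task.
    return (input_bytes + split_bytes - 1) // split_bytes


def map_waves(n_splits: int, map_slots: int) -> int:
    """Return how many rounds of map tasks run when only map_slots run at once."""
    if map_slots <= 0:
        raise ValueError("map_slots must be positive")
    return (n_splits + map_slots - 1) // map_slots


one_terabyte = 1024 * 1024 * MB
splits = count_splits(one_terabyte)
print(splits)                  # map tasks needed for 1 TB of input
print(map_waves(splits, 400))  # rounds on a hypothetical cluster with 400 map slots
```

    A larger split size lowers the task count and the coordination overhead, at the cost of
    fewer tasks available to keep the machines busy, which is the balance described above.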

    As the name suggests, the map task is the first half of the MapReduce job. Each map
    task produces a set of intermediate result records which are written to the local disk of
    the machine performing the map task. The second half of the MapReduce job, the
    reduce task, may run on any processing node. The outputs of the mappers (nodes
    running map tasks) are sorted and partitioned in such a way that these outputs can be
    transferred to the reducers (nodes running the reduce task). The final outputs of the
    reducers comprise the sorted and partitioned results set of the overall MapReduce
    job. In MapReduce running over HDFS, the results set is written to HDFS and is
    replicated for reliability.

    In Figure 3, we show this task flow for a MapReduce job with three mapper nodes
    feeding two reducer nodes, by reproducing figure 2.3 from Tom White's book,
    Hadoop: The Definitive Guide, 2nd Edition (O'Reilly, 2010).

    Figure 3. An example MapReduce job

    In Tom White's book, a simple MapReduce job is described which we extend
    somewhat here. Suppose that the original data before the splits are applied consists
    of a very large number (perhaps billions) of unsorted temperature measurements, one
    per record. Such measurements could come from many thousands of automatic
    sensors located around the United States. The splits are assigned to the separate
    mapper nodes to equalize as much as possible the number of records going to each
    node. The mapper inputs take the form of key-value pairs, in this case a
    sequential record identifier and the full record containing the temperature
    measurements as well as other data. The job of each mapper is simply to parse the
    records presented to it and extract the year, the state, and the temperature, which
    become the second set of key-value pairs passed from the mapper to the reducer.

    The job of each reducer is to find the maximum reported temperature for each state
    and each distinct year in the records passed to it. Each reducer is responsible for a
    state, so in order to accomplish the transfer, the output of each mapper must be
    sorted so that the key-value pairs can be dispatched to the appropriate reducers. In
    this case there would be 50 reducers, one for each state. These sorted blocks are
    then transferred to the reducers in a step which is a critical feature of the MapReduce
    architecture, where it is called the "shuffle."
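
    The flow just described can be imitated on a single machine. The sketch below is
    plain Python, not Hadoop code: the record layout, the sample readings, and the
    function names are assumptions made for illustration, and the "shuffle" is simply a
    sort followed by grouping on state.

```python
from collections import defaultdict

# Hypothetical raw records: "sensor_id,date,state,temperature_f"
RECORDS = [
    "s1,2009-07-04,TX,101.5",
    "s2,2009-07-05,TX,104.0",
    "s3,2009-01-12,AK,-12.0",
    "s4,2010-08-01,TX,99.0",
    "s5,2009-06-30,AK,71.5",
    "s6,2010-07-19,AK,80.0",
]


def mapper(record_id, record):
    """Parse one (record id, record) input and emit ((state, year), temperature)."""
    _sensor, date, state, temp = record.split(",")
    yield (state, int(date[:4])), float(temp)


def shuffle(mapped_pairs):
    """Sort mapper output and group it by state, one group per reducer."""
    by_state = defaultdict(list)
    for (state, year), temp in sorted(mapped_pairs):
        by_state[state].append((year, temp))
    return by_state


def reducer(state, year_temp_pairs):
    """Return the maximum temperature for each year seen for a single state."""
    maxima = {}
    for year, temp in year_temp_pairs:
        if year not in maxima or temp > maxima[year]:
            maxima[year] = temp
    return {(state, year): temp for year, temp in maxima.items()}


def run_job(records):
    """Run map, shuffle, and reduce in sequence over an in-memory list."""
    mapped = [pair for rid, rec in enumerate(records) for pair in mapper(rid, rec)]
    result = {}
    for state, pairs in shuffle(mapped).items():
        result.update(reducer(state, pairs))
    return result


print(run_job(RECORDS))
```

    In a real job the mappers and reducers run on separate nodes, so the grouping step
    stands in for data that is physically moved across the network.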

    Notice that the shuffle involves a true physical transfer of data between processing
    nodes. This makes the value of the rack aware feature more obvious, since a lot of
    data needs to be moved from the mappers to the reducers. The clever reader may
    wonder if this data transfer could be reduced by having the mapper outputs combined
    so that many readings from a single state and year are given to the reducer as a
    single key-value pair rather than many. The answer is yes, and Hadoop provides a
    combiner function to accomplish exactly this end.
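
    The effect of a combiner can be sketched in a few lines of plain Python. This is an
    illustration of the idea rather than Hadoop's combiner interface; the sample pairs and
    the function name are hypothetical.

```python
def combine(mapper_output):
    """Collapse one mapper's output to a single maximum per (state, year) key."""
    local_max = {}
    for key, temp in mapper_output:
        if key not in local_max or temp > local_max[key]:
            local_max[key] = temp
    return sorted(local_max.items())


# Hypothetical output of a single mapper before the shuffle.
mapper_output = [
    (("TX", 2009), 101.5),
    (("TX", 2009), 104.0),
    (("TX", 2009), 98.2),
    (("AK", 2009), 71.5),
    (("AK", 2009), -12.0),
]

combined = combine(mapper_output)
print(len(mapper_output), "pairs before combining,", len(combined), "after")
print(combined)
```

    Combining is safe here because taking a maximum gives the same answer whether it
    is applied once or in stages. An average would not survive this treatment unless the
    combiner also carried the count of readings behind each partial result.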

    Each reducer receives a large number of state/year-temperature key-value pairs, and
    finds the maximum temperature for a given year. These maximum temperatures for
    each year are the final output from each reducer.

    This approach can be scaled more or less indefinitely. Really serious MapReduce
    jobs running on HDFS may have hundreds or thousands of mappers and reducers,
    processing petabytes of input data.

    At this point the appeal of the MapReduce/Hadoop approach should be clear. There
    are virtually no restrictions on the form of the inputs to the overall job. There only
    needs to be some rational basis for creating splits and reading records, in this case
    the record identifier in Tom White's example. Actual logic in the mappers and the
    reducers can be programmed in virtually any programming language and can be as
    simple as the above example, or much more complicated UDFs. The reader should
    be able to visualize how some of the more complex use cases (e.g., comparison of
    satellite images) described earlier in the paper could fit into this framework.

    Tools for the Hadoop environment

    What we have described thus far is the core processing component when
    MapReduce is run in the Hadoop environment. This is roughly equivalent to
    describing the inner processing loop in a relational database management system. In
    both cases there's a lot more to these systems to implement a complete functioning
    environment. The following is a brief overview of typical tools used in a MapReduce/
    Hadoop environment. We group these tools by overall function. Tom White's book,
    mentioned above, is an excellent starting point for understanding how these tools are
    used.

    Getting data in and getting data out

    ETL platforms -- ETL platforms, with their long history of importing and
    exporting data to relational databases, provide specific interfaces for moving
    data into and out of HDFS. The platform-based approach, as contrasted with
    hand coding, provides extensive support for metadata, data quality,
    documentation, and a visual style of system building.

    Sqoop -- Sqoop, developed by Cloudera, is an open source tool that allows
    importing data from a relational source to HDFS and exporting data from
    HDFS to a relational target. Data imported by Sqoop into HDFS can be used
    both by MapReduce applications and HBase applications. HBase is described
    below.

    Scribe -- Scribe, developed at Facebook and released as open source, is used
    to aggregate log data from a large number of Web servers.

    Flume -- Flume, developed by Cloudera, is a distributed reliable streaming data
    collection service. It uses a central configuration managed by ZooKeeper and
    supports tunable reliability and automatic failover and recovery.

    Programming

    Low-level MapReduce programming -- primary code for mappers and
    reducers can be written in a number of languages. Hadoop's native language
    is Java but Hadoop exposes APIs for writing code in other languages such as
    Ruby and Python. An interface to C++ is provided, which is named Hadoop
    Pipes. Programming MapReduce at the lowest level obviously provides the
    most potential power, but this level of programming is very much like
    assembly language programming. It can be very laborious, especially when
    attempting to do conceptually simple tasks like joining two data sets.

    High level MapReduce programming -- Apache Pig, or simply Pig, is a client-side
    open-source application providing a high level programming language for
    processing large data sets in MapReduce. The programming language itself is
    called Pig Latin. Hive is an alternative application designed to look much more
    like SQL, and is used for data warehousing use cases. When employed for the
    appropriate use cases, Pig and Hive provide enormous programming
    productivity benefits over low-level MapReduce programming, often by a
    factor of 10 or more. Pig and Hive lift the application developer's perspective
    up from managing the detailed mapper and reducer processes to more of an
    applications focus.

    Integrated development environment -- MapReduce/Hadoop development
    needs to move decisively away from bare hand coding to be adopted by
    mainstream IT shops. An integrated development environment for
    MapReduce/Hadoop needs to include editors for source code, compilers, tools
    for automating system builds, debuggers, and a version control system.

    Integrated application environment -- an even higher layer above an integrated
    development environment could be called an integrated application
    environment, where complex reusable analytic routines are assembled into
    complete applications via a graphical user interface. This kind of environment
    might be able to use open source algorithms such as those provided by the Apache
    Mahout project, which distributes machine learning algorithms on the Hadoop
    platform.

    Cascading -- Cascading is another tool that is an abstraction layer for writing
    complex MapReduce applications. It is best described as a thin Java library
    typically invoked from the command line to be used as a query API and process
    scheduler. It is not intended to be a comprehensive alternative to Pig or Hive.

    HBase -- HBase is an open-source, nonrelational, column oriented database
    that runs directly on Hadoop. It is not a MapReduce implementation. A
    principal differentiator of HBase from Pig or Hive (MapReduce
    implementations) is the ability to provide real-time random-access reads and
    writes to very large data sets.

    Oozie -- Oozie is a server-based workflow engine specialized in running
    workflow jobs with actions that execute Hadoop jobs, such as MapReduce,
    Pig, Hive, Sqoop, HDFS operations, and sub-workflows.

    ZooKeeper -- ZooKeeper is a centralized configuration manager for distributed
    applications. ZooKeeper can be used independently of Hadoop as well.
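
    The remark under low-level programming, that joining two data sets is laborious,
    can be illustrated with a sketch of a reduce-side join. This is plain Python that
    imitates the map, shuffle, and reduce steps on one machine; it does not use any
    Hadoop API, and the tables, tags, and function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical inputs: customers (id, name) and orders (order_id, customer_id, amount).
CUSTOMERS = [(1, "Ada"), (2, "Grace")]
ORDERS = [(100, 1, 25.0), (101, 2, 40.0), (102, 1, 10.0), (103, 3, 5.0)]


def map_customer(row):
    """Key a customer row by customer id and tag it with its source."""
    customer_id, name = row
    return customer_id, ("C", name)


def map_order(row):
    """Key an order row by customer id and tag it with its source."""
    order_id, customer_id, amount = row
    return customer_id, ("O", (order_id, amount))


def reduce_join(customer_id, tagged_values):
    """Pair every order with the customer name sharing its key (an inner join)."""
    names = [value for tag, value in tagged_values if tag == "C"]
    orders = [value for tag, value in tagged_values if tag == "O"]
    return [(customer_id, name, order_id, amount)
            for name in names
            for order_id, amount in orders]


def join(customers, orders):
    """Group tagged rows by key (standing in for the shuffle), then reduce."""
    groups = defaultdict(list)
    for key, value in map(map_customer, customers):
        groups[key].append(value)
    for key, value in map(map_order, orders):
        groups[key].append(value)
    joined = []
    for key in sorted(groups):
        joined.extend(reduce_join(key, groups[key]))
    return joined


for row in join(CUSTOMERS, ORDERS):
    print(row)
```

    Even in this toy form the programmer has to tag each source, route both through the
    same key, and separate them again in the reducer. A SQL-like layer such as Hive
    expresses the same intent as a single join clause, which is the productivity gain the
    high-level tools are credited with above.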

    Administering

    Embedded Hadoop admin features -- Hadoop supports a comprehensive
    runtime environment including edit log, safe mode operation, audit logging,
    file system check, data node block verifier, data node block distribution
    balancer, performance monitor, comprehensive log files, metrics for
    administrators, counters for MapReduce users, metadata backup, data
    backup, file system balancer, commissioning and decommissioning nodes.

    Java management extensions -- a standard Java API for monitoring and
    managing applications

    GangliaContext -- an open source distributed monitoring system for very large
    clusters

    Feature convergence in the coming decade

    It is safe to say that relational database management systems and MapReduce/
    Hadoop systems will increasingly find ways to coexist gracefully in the coming
    decade. But the systems have distinct characteristics, as depicted in the following
    table:

    In the upcoming decade RDBMSs will extend their support for hosting complex data
    types as "blobs," and will extend APIs for arbitrary analytic routines to operate on the
    contents of records. MapReduce/Hadoop systems, especially Hive, will deepen their
    support for SQL interfaces and fuller support of the complete SQL language. But
    neither will take over the market for big data analytics exclusively. As remarked
    earlier, RDBMSs cannot provide "relational" semantics for many of the complex use
    cases required by big data analytics. At best, RDBMSs will provide relational structure
    surrounding the complex payloads.

    Relational DBMSs                                    MapReduce/Hadoop
    Proprietary, mostly                                 Open source
    Expensive                                           Less expensive
    Data requires structuring                           Data does not require structuring
    Great for speedy indexed lookups                    Great for massive full data scans
    Deep support for relational semantics               Indirect support for relational semantics, e.g. Hive
    Indirect support for complex data structures        Deep support for complex data structures
    Indirect support for iteration, complex branching   Deep support for iteration, complex branching
    Deep support for transaction processing             Little or no support for transaction processing

    Similarly, MapReduce/Hadoop systems will never take over ACID-compliant
    transaction processing, or become superior to RDBMSs for indexed queries on row
    and column oriented tables.

    As this paper is being written, significant advances are being made in developing
    hybrid systems using both relational database technology and MapReduce/Hadoop
    technology. Figure 4 illustrates two primary alternatives. The first alternative delivers
    the data directly into a MapReduce/Hadoop configuration for primary non-relational
    analysis. As we have described, this analysis can run the full gamut from complex
    analytical routines to simple sorting that looks like a conventional ETL step. When the
    MapReduce/Hadoop step is complete, the results are loaded into an RDBMS for
    conventional structured querying with SQL.
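
    The last step of this first alternative can be sketched with Python's built-in sqlite3
    module standing in for the RDBMS. The reducer output, table name, and column
    names are assumptions for illustration; a production system would use a bulk loader
    rather than row inserts.

```python
import sqlite3

# Hypothetical output of an upstream MapReduce step: (state, year, max_temp_f).
reducer_output = [
    ("TX", 2009, 104.0),
    ("TX", 2010, 99.0),
    ("AK", 2009, 71.5),
    ("AK", 2010, 80.0),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE max_temperature (state TEXT, year INTEGER, max_temp_f REAL)"
)
conn.executemany("INSERT INTO max_temperature VALUES (?, ?, ?)", reducer_output)
conn.commit()

# Once loaded, the results are open to ordinary structured querying.
hottest_by_state = conn.execute(
    "SELECT state, MAX(max_temp_f) FROM max_temperature "
    "GROUP BY state ORDER BY state"
).fetchall()
print(hottest_by_state)
```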

    The second alternative configuration loads the data directly to an RDBMS, even when
    the primary data payloads are not conventional scalar measurements. At that point
    two analysis modes are possible. The data can be analyzed with specially crafted
    user-defined functions, effectively from the BI layer, or passed to a downstream
    MapReduce/Hadoop application.
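
    The user-defined function mode can be sketched with sqlite3 again standing in for
    the RDBMS. Here the non-scalar payload is assumed to be a JSON array stored in a
    text column, and peak_reading is a hypothetical UDF; commercial databases expose
    their own, different mechanisms for registering such functions.

```python
import json
import sqlite3


def peak_reading(payload_json):
    """UDF: return the largest value in a JSON array payload, or None if empty."""
    readings = json.loads(payload_json)
    return max(readings) if readings else None


conn = sqlite3.connect(":memory:")
# Register the Python function so it can be called by name from SQL.
conn.create_function("peak_reading", 1, peak_reading)
conn.execute("CREATE TABLE sensor_fact (sensor_id TEXT, payload TEXT)")
conn.executemany(
    "INSERT INTO sensor_fact VALUES (?, ?)",
    [
        ("s1", json.dumps([3.5, 9.25, 4.0])),
        ("s2", json.dumps([1.0, 2.0])),
        ("s3", json.dumps([])),
    ],
)

rows = conn.execute(
    "SELECT sensor_id, peak_reading(payload) FROM sensor_fact ORDER BY sensor_id"
).fetchall()
print(rows)
```

    The relational engine supplies the structure around the payload (keys, ordering,
    filtering), while the interpretation of the payload itself is left to the function.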

    In the future even more complex combinations will tie these architectures more
    closely together, including MapReduce systems whose mappers and reducers are
    actually relational databases, and relational database systems whose underlying
    storage consists of HDFS files.

    Figure 4. Alternative hybrid architectures using both RDBMS and Hadoop.

    It will probably be difficult for IT organizations to sort out the claims of vendors, who will
    almost certainly say that their systems do everything. In some cases these claims

    are "objection removers," which means that they are claims that have a grain of truth
    to them, and are made to make you feel good, but do not stand up to scrutiny in a
    competitive and practical environment. Buyer beware!

    Reusable analytics

    Up to this point we have set aside the question of where all the special analytic
    software comes from. Big data analytics will never prosper if every instance is a
    custom coded solution. Both the RDBMS and the open-source communities
    recognize this and two main development themes have emerged. High-end statistical
    analysis vendors, such as SAS, have developed extensive and proprietary reusable
    libraries for a wide range of analytic applications, including advanced statistics, data
    mining, predictive analytics, feature detection, linear models, discriminant analysis,
    and many others. The open source community has a number of initiatives, the most
    notable of which are Hadoop-ML and Apache Mahout. Quoting from Hadoop-ML's
    website:

    Hadoop-ML (is) an infrastructure to facilitate the implementation of parallel
    machine learning/data mining (ML/DM) algorithms on Hadoop. Hadoop-ML
    has been designed to allow for the specification of both task-parallel and data-parallel
    ML/DM algorithms. Furthermore, it supports the composition of
    parallel ML/DM algorithms using both serial as well as parallel building blocks
    -- this allows one to write reusable parallel code. The proposed abstraction
    eases the implementation process by requiring the user to only specify
    computations and their dependencies, without worrying about scheduling,
    data management, and communication. As a consequence, the codes are
    portable in that the user never needs to write Hadoop-specific code. This
    potentially allows one to leverage future parallelization platforms without
    rewriting one's code.

    Apache Mahout provides free implementations of machine learning algorithms on
    the Hadoop platform.

    Complex event processing (CEP)

    Complex event processing (CEP) consists of processing events happening inside and
    outside an organization to identify meaningful patterns in order to take subsequent
    action in real time. For example, CEP is used in utility networks (electrical, gas and
    water) to identify possible issues before they become detrimental. These CEP
    deployments allow for real-time intervention for critical network or infrastructure
    situations. The combination of deep DW analytics and CEP can be applied in retail
    customer settings to analyze behavior and identify situations where a company may
    lose a customer or be able to sell them additional products or services at the time of
    their direct engagement. In banking, sophisticated analytics might help to identify the
    10 most common patterns of fraud and CEP can then be used to watch for those
    patterns so they may be thwarted before a loss.
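
    Watching an event stream for a known pattern can be sketched as follows. The
    three-step fraud pattern, the ten-minute window, and the event layout are all
    hypothetical; real CEP engines offer far richer pattern languages than this
    ordered-subsequence check.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 600
# Hypothetical pattern that offline analytics has flagged as suspicious.
PATTERN = ("password_change", "new_payee", "large_transfer")


def contains_in_order(history, pattern):
    """Return True if the event kinds in history contain pattern as a subsequence."""
    kinds = iter(kind for _, kind in history)
    return all(step in kinds for step in pattern)


def watch(events, pattern=PATTERN, window=WINDOW_SECONDS):
    """Yield (account, timestamp) each time an account completes the pattern,
    in order, with every step inside the trailing window.

    events must be (timestamp, account, kind) tuples in timestamp order.
    """
    recent = defaultdict(deque)  # account -> recent (timestamp, kind) events
    for timestamp, account, kind in events:
        history = recent[account]
        history.append((timestamp, kind))
        # Forget events that are too old to take part in a match.
        while history and timestamp - history[0][0] > window:
            history.popleft()
        if kind == pattern[-1] and contains_in_order(history, pattern):
            yield account, timestamp
            history.clear()


EVENTS = [
    (0, "acct-1", "password_change"),
    (30, "acct-2", "large_transfer"),
    (60, "acct-1", "new_payee"),
    (90, "acct-1", "large_transfer"),
    (100, "acct-3", "password_change"),
    (900, "acct-3", "new_payee"),
    (950, "acct-3", "large_transfer"),
]

alerts = list(watch(EVENTS))
print(alerts)
```

    Only the first account triggers an alert: the second never performs the earlier steps,
    and the third takes longer than the window between its first and second steps.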

    At the time of this white paper, CEP is not generally thought of as part of the EDW,
    but this author believes that technical advances in continuous query processing will

    cause CEP and EDW to share data and work more closely together in the coming
    decade.

    Data warehouse cultural changes in the coming decade

    The enterprise data warehouse must absolutely stay relevant to the business. As the
    value and the visibility of big data analytics grow, the data warehouse must
    encompass the new culture, skills, techniques, and systems required for big data
    analytics.

    Sandboxes

    For example, big data analysis encourages exploratory sandboxes for
    experimentation. These sandboxes are copies or segments of the massive data sets
    being sourced by the organization. Individual analysts or very small groups are
    encouraged to analyze the data with a very wide variety of tools, ranging from serious
    statistical tools like SAS, Matlab or R, to predictive models, and many forms of ad
    hoc querying and visualization through advanced BI graphical interfaces. The analyst
    responsible for a given sandbox is allowed to do anything with the data, using any tool
    they want, even if the tools they use are not corporate standards. The sandbox
    phenomenon has enormous energy but it carries a significant risk to the IT
    organization and EDW architecture because it could create isolated and incompatible
    stovepipes of data. This point is amplified in the section on organizational changes,
    below.

    Exploratory sandboxes usually have a limited time duration, lasting weeks or at most
    a few months. Their data can be a frozen snapshot, or a window on a certain segment
    of incoming data. The analyst may have permission to run an experiment changing a
    feature on the product or service in the marketplace, and then perform A/B testing
    to see how the change affects customer behavior. Typically, if such an experiment
    produces a successful result, the sandbox experiment is terminated, and the feature
    goes into production. At that point, tracking applications that may have been
    implemented in the sandbox using a quick and dirty prototyping language are usually
    reimplemented by other personnel in the EDW environment using corporate standard
    tools. In several of the e-commerce enterprises interviewed for this white paper,
    analytic sandboxes were extremely important, and in some cases hundreds of the
    sandbox experiments were ongoing simultaneously. As one interviewee commented,
    "newly discovered patterns have the most disruptive potential, and insights from them
    lead to the highest returns on investment."

    Architecturally, sandboxes should not be brute force copies of entire data sets, or
    even major segments of these data sets. In dimensional modeling parlance, the
    analyst needs much more than just a fact table to run the experiment. At a minimum
    the analyst also needs one or more very large dimension tables, and possibly
    additional fact tables for complete "drill across" analysis. If 100 analysts are creating
    brute force copy versions of the data for the sandboxes there will be enormous
    wasting of disk space and resources for all the redundant copies. Remember that the

    largest dimension tables, such as customer dimensions, can have 500 million rows!
    The recommended architecture for a serious sandbox environment is to build each
    sandbox using conformed (shared) dimensions which are incorporated into each
    sandbox as relational views, or their equivalent under Hadoop applications.
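
    The view-based recommendation can be sketched with sqlite3. The table names,
    the sandbox naming convention, and the sample rows are hypothetical; the point is
    only that the sandbox owns its small fact table while the large dimension is seen
    through a view and never copied.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Shared, conformed dimension maintained once in the warehouse.
    CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, segment TEXT);
    INSERT INTO dim_customer VALUES (1, 'gamer'), (2, 'shopper'), (3, 'gamer');

    -- Sandbox-specific fact table holding only the experiment's observations.
    CREATE TABLE sandbox_a_clicks (customer_key INTEGER, clicks INTEGER);
    INSERT INTO sandbox_a_clicks VALUES (1, 10), (2, 4), (3, 6);

    -- The sandbox sees the dimension through a view, not a private copy.
    CREATE VIEW sandbox_a_customer AS
        SELECT customer_key, segment FROM dim_customer;
    """
)

QUERY = """
    SELECT c.segment, SUM(f.clicks)
    FROM sandbox_a_clicks AS f
    JOIN sandbox_a_customer AS c USING (customer_key)
    GROUP BY c.segment
    ORDER BY c.segment
"""
before = conn.execute(QUERY).fetchall()

# A correction to the shared dimension is visible in the sandbox immediately.
conn.execute("UPDATE dim_customer SET segment = 'shopper' WHERE customer_key = 3")
after = conn.execute(QUERY).fetchall()
print(before, after)
```

    Because every sandbox reads the same dimension rows, results from different
    sandboxes remain comparable, which addresses the stovepipe risk raised earlier.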

    Low latency

    An elementary mistake when gathering business requirements during the design of a
    data warehouse is to ask the business user if they want "real time" data. Users are
    likely to say "of course!" Although perhaps this answer has been somewhat gratuitous
    in the past, a good business case can now be made in many situations that more
    frequent updates of data delivered to the business with lower and lower latencies are
    justified. Both RDBMSs and MapReduce/Hadoop systems struggle with loading
    gigantic amounts of data and making that data available within seconds of that data
    being created. But the marketplace wants this, and regardless of a technologist's
    doubt about the requirement, the requirement is real and over the next decade it must
    be addressed.

    An interesting angle on low latency data is the desire to begin serious analysis on the
    data as it is streaming in, but possibly far before the data collection process even
    terminates. There is significant interest in streaming analysis systems which allow
    SQL-like queries to process the data as it flows into the system. In some use cases
    when the results of a streaming query surpass a threshold, the analysis can be halted
    without running the job to the bitter end. An academic effort, known as continuous
    query language (CQL), has made impressive progress in defining the requirements
    for streaming data processing including clever semantics for dynamically moving time
    windows on the streaming data. Look for CQL language extensions and streaming
    data query capabilities in the load programs for both RDBMSs and HDFS deployed
    data sets. An ideal implementation would allow streaming data analysis to take place
    while the data is being loaded at gigabytes per second.
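
    The two ideas in this paragraph, a moving time window and halting as soon as a
    threshold is crossed, can be sketched in plain Python. This is not CQL; the function
    name, the five-second window, and the sample stream are assumptions for
    illustration.

```python
from collections import deque


def first_window_over(stream, window_seconds, threshold):
    """Consume (timestamp, value) pairs in time order and return the first
    (timestamp, window_sum) at which the sum over the trailing time window
    exceeds threshold, or None if the stream ends first."""
    window = deque()
    running_sum = 0.0
    for timestamp, value in stream:
        window.append((timestamp, value))
        running_sum += value
        # Drop readings that have slid out of the trailing window.
        while window[0][0] <= timestamp - window_seconds:
            _, old_value = window.popleft()
            running_sum -= old_value
        if running_sum > threshold:
            return timestamp, running_sum  # halt early; the rest is never read
    return None


consumed = []


def readings():
    """Hypothetical stream of (second, bytes_transferred) events."""
    for pair in [(0, 5), (1, 5), (2, 5), (10, 20), (11, 30), (12, 40), (13, 50)]:
        consumed.append(pair)
        yield pair


hit = first_window_over(readings(), window_seconds=5, threshold=80)
print(hit, len(consumed))
```

    Because the stream is a generator, the final reading is never pulled once the
    threshold is crossed, which mirrors halting the analysis before the data collection
    ends.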

    The availability of extremely frequent and extremely detailed event measurements
    can drive interactive intervention. The use cases where this intervention is important
    span many situations ranging from online gaming to product offer suggestions to
    financial account fraud responses to the stability of networks.

    Continuous thirst for more exquisite detail

    Analysts are forever thirsting for more detail in every marketplace observation,
    especially of customer behavior. For example, every web page event (a page being
    painted on a user's screen) spawns hundreds of records describing every object on
    the page. In online games, where every gesture enters the data stream, as many as
    100 descriptors are attached to each of these gesture micro-events. For instance, in a
    hypothetical online baseball game, when the batter swings at a pitch, everything
    describing the position of the players, the score, runners on the bases, and even the
    characteristics of the pitch is stored with that individual record. In both of these
    examples, the complete context must be captured within the current record, because
    it is impractical to compute this detailed context after the fact from separate data
    sources. The lesson for the coming decade is that this thirst for exquisite detail will

    only grow. It is possible to imagine thousands of attributes being attached to some
    micro-events, and the categories and names of these attributes will grow in
    unpredictable ways. This makes the data bag approach discussed earlier in the paper
    much more important. It means that positionally dependent schemas, with the keys
    (names of the data) pre-declared as column names, are an unworkable design.
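
    A data bag can be pictured as a small fixed set of dimensional fields plus a free-form
    set of name-value pairs. The sketch below uses Python dictionaries; the event
    fields and attribute names are hypothetical and only loosely follow the baseball
    example above.

```python
from collections import Counter

# Hypothetical micro-events: fixed dimensional context plus a free-form "bag".
events = [
    {"event_id": 1, "game": "baseball", "user": "u1",
     "bag": {"pitch_type": "curve", "pitch_mph": 78, "runners_on": 2}},
    {"event_id": 2, "game": "baseball", "user": "u2",
     "bag": {"pitch_type": "fastball", "pitch_mph": 94, "wind_mph": 12}},
    {"event_id": 3, "game": "baseball", "user": "u1",
     "bag": {"pitch_type": "slider", "stadium_roof": "closed"}},
]


def bag_value(event, key, default=None):
    """Look up an attribute by name; a missing key is not an error."""
    return event["bag"].get(key, default)


def key_census(event_list):
    """Count how often each bag key appears, as a first step in discovering structure."""
    return Counter(key for event in event_list for key in event["bag"])


print(key_census(events))
print([bag_value(e, "pitch_mph") for e in events])
```

    New attribute names can appear in later events without any change to a table
    definition, which is what a column-per-attribute schema cannot offer.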

    Finally, a perfect historical reconstruction of interesting events such as web page
    exposures needs to be more than just a list of attributes on the web page when it was
    displayed, even if that list is enormously detailed. A perfect historical reconstruction of
    the web page needs to be seen through a multimedia user interface, i.e., a browser.

    Light touch data waits for its relevance to be exposed

    Light touch data is an aspect of the exquisite detail data described in the previous
    section. For example, if a customer browses a website extensively before making a
    purchase, a great deal of micro-context is stored in all the web page events prior to the
    purchase. When the purchase is made, some of that micro-context suddenly
    becomes much more important, and is elevated from "light touch data" to real data. At
    that point the sequence of exposures to the selected product or to competitive
    products in the same space can be sessionized. These micro-events
    are pretty much meaningless before the purchase event, because there are so many
    conceivable and irrelevant threads that would be dead ends for analysis. This
    requires oceans of light touch data to be stored, waiting for the relevance of selected
    threads of these micro-events to eventually be exposed. Conventional seasonality
    thinking suggests that at least five quarters (15 months) of this light touch data needs
    to be kept online. This is one instance of a remark made consistently during
    interviews for this white paper that analysts want "longer tails," which means that they
    want more significant histories than they currently get.

    Simple analysis of all the data trumps sophisticated analysis of some of the data

    Although data sampling has never been a popular technique in data warehousing,
    surprisingly the arrival of enormous petabyte sized data sets has not increased the
    interest in analyzing a subset of the data. On the contrary, a number of analysts point
    out that monetizable insights can be derived from very small populations that could be
    missed by only sampling some of the data. Of course this is a somewhat controversial
    point, since the same analysts admit that if you have 1 trillion behavior observation
    records, you may be able to find any behavior pattern if you look hard enough.
    Another somewhat controversial point raised by some analysts is their concern that
    any form of data cleaning on the incoming data could erase interesting low-frequency
    "edge cases." Ultimately both the cases of misleading rare behavior patterns, and
    misleading corrupted data need to be gently filtered out of the data.

    Assuming that the behavior insights from very small populations are valid, there is
    widespread recognition that micro-marketing to the small populations is possible, and
    doing enough of this can build a sustainable strategic advantage.

    A final argument in favor of analyzing complete data sets is that these "relation scans"
    do not require indexes or aggregations to be computed in advance of the analysis.
    This approach fits well with the basic MapReduce distributed analysis architecture.

    Data structures should be declared at query time, not at data load time

    A number of analysts interviewed for this white paper said that the enormous data
    sets they were trying to analyze needed to be loaded in a queryable state before the
    structure and content of the data sets were completely understood. Again, think of
    the data bag kind of marketplace observation where, within a well-structured
    dimensional measurement process, the actual observation is a disorderly and
    potentially unpredictable set of key-value pairs. The structure of this data bag may
    need to be discovered, and alternate interpretations of the structure may need to be
    possible without reloading the database. One respondent remarked that yesterday's
    fringe data is tomorrow's well-structured data, implying that we need exceptional
    flexibility as we explore new kinds of data sources.

    A key differentiator between the RDBMS approach and the MapReduce/Hadoop
    approach is the deferral of the data structure declaration until query time in the
    MapReduce/Hadoop systems. An objection from the RDBMS community is that forcing
    every MapReduce job to declare the target data structure promotes a kind of chaos
    because every analyst can do their own thing. But that objection seems to miss the
    point that a standard data structure declaration can easily be published as a library
    module that can be picked up by every analyst when they are implementing their
    application.
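
    The idea of publishing a standard declaration as a library module can be sketched
    in Python. The PageEvent structure, the module name mentioned in the comment,
    and the raw JSON lines are hypothetical; the raw records are assumed to have been
    loaded untouched, with structure imposed only when they are read.

```python
import json
from dataclasses import dataclass
from typing import Optional


# In practice this declaration would live in a shared, versioned module
# (a hypothetical page_event_schema.py) imported by every analyst's job.
@dataclass(frozen=True)
class PageEvent:
    user: str
    url: str
    dwell_seconds: Optional[float]


def parse_page_event(raw_line: str) -> PageEvent:
    """Impose the shared structure on one raw record at read time."""
    fields = json.loads(raw_line)
    dwell = fields.get("dwell")
    return PageEvent(
        user=str(fields["user"]),
        url=str(fields["url"]),
        dwell_seconds=float(dwell) if dwell is not None else None,
    )


# Raw data was loaded as-is; no structure was declared at load time.
RAW_LINES = [
    '{"user": "u1", "url": "/home", "dwell": "3.5", "campaign": "spring"}',
    '{"user": "u2", "url": "/cart"}',
]

parsed = [parse_page_event(line) for line in RAW_LINES]
print(parsed)
```

    An analyst who needs a different interpretation, for example one that keeps the
    campaign field, can publish a second declaration and rerun the job without
    reloading the underlying data.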

    The EDW supporting big data analytics must be magnetic, agile, and deep

    Cohen and Dolan in their seminal but somewhat controversial paper on big data
    analytics argue that EDWs must shed some old orthodoxies in order to be magnetic,
    agile, and deep. A magnetic environment places the fewest impediments on the
    incorporation of new, unexpected, and potentially dirty data sources. Specifically, this
    supports the need to defer declaration of data structures until after the data is loaded.
    According to Cohen and Dolan, an agile environment eschews long-range careful
    design and planning! And a deep environment allows running sophisticated analytic
    algorithms on massive data sets without sampling, or perhaps even cleaning. We
    have made these points elsewhere in this white paper but Cohen and Dolan's paper
    is a particularly potent, if unusual, argument. Read this paper to get some provocative
    perspectives! A link to Cohen and Dolan's paper is provided in the references section
    at the end of this white paper.

    The conflict between abstraction and control

    In the MapReduce/Hadoop world, Pig and Hive are widely regarded as valuable
    abstractions that allow the programmer to focus on database semantics rather than
    programming directly in Java. But several analysts interviewed for this paper
    remarked that too much abstraction and too much distancing from where the data
    actually is stored can be disastrously inefficient. This seems like a reasonable
    concern when dealing with the very largest data sets, where a bad algorithm could

    result in run times measured in days. For the breaking wave of the biggest data sets,
    programming tools will need to allow considerable control over the storage strategy,
    and the processing approaches, but without requiring programming using the lowest
    level code.

    Data warehouse organization changes in the coming decade

    The growing importance of big data analytics amounts to something between a
    midcourse correction and a revolution for enterprise data warehousing. New skill sets,
    new organizations, new development paradigms, and new technology will need to be
    absorbed by many enterprises, especially those facing the use cases described in this
    paper. Not every enterprise needs to jump into the petabyte ocean, but it is this
    author's prediction that the upcoming decade will see a steady growth in the
    percentage of large enterprises recognizing the value of big data analytics.

    Most observers would agree that big data analytics falls within "information
    management," but the same observers may quibble about whether this affects the "data
    warehouse." Rather than worrying about whether the box on the organization chart labeled
    EDW has responsibility for big data analytics, we take the perspective that enterprise
    data warehousing without the capital letters absolutely encompasses big data analytics.
    Having said that, there will be many different organizational structures and management
    perspectives as industries expand their information management. This kind of tinkering
    and adjusting to the new paradigm is normal and expected. We went through a very similar
    phase in the mid-1980s when data warehousing itself was a new paradigm for IT and the
    business. Many of the most successful early data warehousing initiatives started in the
    business organizations and were eventually incorporated into those IT organizations that
    then made major commitments to being business relevant. It is likely the same evolution
    will take place with big data analytics.

    The challenge before information managers in large enterprises is how to encourage three
    separate data warehouse endeavors: conventional RDBMS applications, MapReduce/Hadoop
    applications, and advanced analytics.

    Technical skill sets required

    It is worth repeating here the message of the very first sentence of this white paper.
    Petabyte-scale data sets are of course a big challenge, but big data analysis is often
    about difficulties other than data volume. You can have fast-arriving data, complex
    data, or complex analyses that are very challenging even if all you have are terabytes
    of data!

    The care and feeding of RDBMS-oriented data warehouses involves a comprehensive set of
    skills that is pretty well understood: SQL programming, ETL platform expertise, database
    modeling, task scheduling, system building and maintenance skills, one or more scripting
    languages such as Python or Perl, UNIX or Windows operating system skills, and business
    intelligence tools skills. SQL
    programming, which is at the core of an RDBMS implementation, is declarative, which
    contrasts with the mindset of the procedural language skills needed for MapReduce/Hadoop
    programming, at least in Java. The data warehouse team also needs to have a good
    partnership with other areas of IT including storage management, security, networking,
    and support of mobile devices. Finally, good data warehousing also requires an extensive
    involvement with the business community, and with the cognitive psychology of end-users!
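
The declarative versus procedural contrast described above can be made concrete with a small side-by-side sketch. It is written in Python rather than Java purely for brevity, runs in a single process, and does not use any Hadoop API; the `SALES` data and all function names are hypothetical. The first function states what result is wanted in SQL, while the second spells out the map, shuffle, and reduce steps by hand.

```python
import sqlite3
from collections import defaultdict

# Hypothetical (region, amount) measurements.
SALES = [("east", 10), ("west", 5), ("east", 7), ("west", 1), ("north", 3)]


def total_by_region_sql(rows):
    """Declarative: describe the result; the database decides how to compute it."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    result = dict(
        conn.execute(
            "SELECT region, SUM(amount) FROM sales GROUP BY region"
        ).fetchall()
    )
    conn.close()
    return result


def map_phase(rows):
    """Procedural step 1: emit a (key, value) pair for every input record."""
    for region, amount in rows:
        yield region, amount


def shuffle_phase(pairs):
    """Procedural step 2: group all values that share a key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Procedural step 3: collapse each group of values into one result."""
    return {key: sum(values) for key, values in groups.items()}


def total_by_region_mapreduce(rows):
    return reduce_phase(shuffle_phase(map_phase(rows)))


sql_totals = total_by_region_sql(SALES)
mr_totals = total_by_region_mapreduce(SALES)
```

Both functions return the same totals per region. The difference lies in how much of the processing strategy the programmer has to write and reason about, which is the mindset shift the paragraph refers to.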

    The care and feeding of MapReduce/Hadoop data warehouses, including any of the big data
    analytics use cases described in this paper, involves a set of skills that only
    partially overlaps traditional RDBMS data warehouse skills. Therein lies a significant
    challenge. These new skills include lower-level programming languages such as Java, C++,
    Ruby, Python, and MapReduce interfaces most commonly available via Java. Although the
    requirement to program via procedural lower-level programming languages will be reduced
    significantly during the upcoming decade in favor of Pig, Hive, and HBase, it may be
    easier to recruit MapReduce/Hadoop application developers from the programming community
    rather than the data warehouse community, if the data warehouse job applicants lack
    programming and UNIX skills. If MapReduce/Hadoop data warehouses are managed exclusively
    with open source tools, then Zookeeper and Oozie skills will be needed too. Keep in mind
    that the open source community innovates quickly. Hive, Pig, and HBase are not the last
    word in high-level interfaces to Hadoop for analysis. It is likely that we will see much
    more innovation in this decade, including entirely new interfaces.

    ETL platform providers have a big opportunity to provide much of the glue that will tie
    together the big data sources, MapReduce/Hadoop applications, and existing relational
    databases. Developers with ETL platform skills will be able to leverage a great deal of
    their experience and instincts in system building when they incorporate MapReduce/Hadoop
    applications.

    Finally, the analysts whom we have described as often working in sandbox environments
    will arrive with an eclectic and unpredictable set of skills starting with deep analytic
    expertise. For these people it is probably more important to be conversant in SAS,
    Matlab, or R than to have specific programming language or operating system skills. Such
    individuals typically will arrive with UNIX skills, and some reasonable programming
    proficiency, and most of these people are extremely tolerant of learning new complex
    technical environments. Perhaps the biggest challenge with traditional analysts is
    getting them to rely on the other resources available to them within IT, rather than
    building their own extract and data delivery pipelines. This is a tricky balance because
    you want to give the analysts unusual freedom, but you need to look over their shoulders
    to make sure that they are not wasting their time.

    New organizations required

    At this early stage of the big data analytics revolution, there is no question that the
    analysts must be part of the business organization, both to understand the microscopic
    workings of the business and to be able to conduct the kind of rapid turnaround
    experiments and investigations we have described in this paper. As we
    have described, these analysts must be heavily supported in a technical sense, with
    potentially massive compute power and data transfer bandwidth. So although the analysts
    may reside in the business organizations, this is a great opportunity for IT to gain
    credibility and presence with the business. It would be a significant mistake and a lost
    opportunity for the analysts and their sandboxes to exist as rogue technical outposts in
    the business world without recognizing and taking advantage of their deep dependence on
    the traditional IT world.

    In some organizations we interviewed for this white paper, we saw separate analytic
    groups embedded within different business organizations, but without very much cross
    communication or common identity established among the analytic groups. In some
    noteworthy cases, this lack of an "analytic community" led to lost opportunities to
    leverage each other's work, to multiple groups reinventing the same approaches, and to
    duplicated programming efforts and infrastructure demands as they made separate copies
    of the same data.

    We recommend that a cross-divisional analytics community be established mimicking some
    of the successful data warehouse community building efforts we have seen in the past
    decade. Such a community should have regular cross-divisional meetings, as well as a
    kind of private LinkedIn application to promote awareness of all the contacts and
    perspectives and resources that these individuals collect in their own investigations,
    and a private web portal where information and news events are shared. Periodic talks
    can be given, hopefully inviting members of the business community as well, and above
    all the analytics community needs T-shirts and mugs!

    New development paradigms required

    Even before the arrival of big data analytics, data warehousing has been transforming
    itself to provide more rapid response to new opportunities and to be more in touch with
    the business community. Some of the practices of the agile software development movement
    have been successfully adopted by the data warehouse community, although realistically
    this has not been a highly visible transformation. But, in particular, the agile
    development approach supports the data warehouse by being organized around small teams
    driven by the business, not typically by IT. An agile development effort also produces
    frequent tangible deliveries, deemphasizes documentation and formal development
    methodologies, and tolerates midcourse correction and the incremental acceptance of new
    requirements. The most sensitive ingredient for success of agile development projects is
    the personality and skills of the business leader who ultimately is in charge. The agile
    business leader needs to be a thoughtful and sophisticated observer of the development
    process and the realities of the information world. Hopefully the agile business leader
    is a pretty good manager as well.

    Big data analytics certainly opens the door to business involvement since the central
    analysis is probably done in the business environment directly. But it is unlikely that
    the professional analyst is the right person to be the overall agile data warehouse
    project leader. The agile project leader needs to be well skilled in facilitating short
    effective meetings, resolving issues and development choices,
    determining the truth of progress reports from individual developers, communicating with
    the rest of the organization, and getting funding for initiatives.

    Traditional data warehouse development has discovered the attractiveness of building
    incrementally from a modest start, but with a good architectural foundation that
    provides a blueprint for where future development will go. This author has described in
    many papers the techniques for "graceful modification" of dimensional data warehouse
    schemas. In a dimensionally modeled data warehouse, new measurement facts, new
    dimensional attributes, and even new dimensions can be added to existing data warehouse
    applications without changing, invalidating, or rolling over existing information
    delivery pipelines to the end users. Many of the use cases we have described in this
    paper for big data analytics suggest that new facts, new attributes, and new dimensions
    will routinely become available.
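
A minimal sketch of graceful modification, using Python's built-in `sqlite3` module, is shown here. The table names, column names, and sample rows are hypothetical and chosen only for illustration. An existing report query is run before and after a new measurement fact and a new dimensional attribute are added, and it returns the same answer both times.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# A tiny hypothetical star schema: one dimension table and one fact table.
conn.executescript(
    """
    CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE fact_sales (product_key INTEGER, quantity INTEGER);
    INSERT INTO dim_product VALUES (1, 'widget'), (2, 'gadget');
    INSERT INTO fact_sales VALUES (1, 3), (2, 4), (1, 2);
    """
)

# An existing end-user report that must keep working unchanged.
EXISTING_REPORT = """
    SELECT p.name, SUM(f.quantity)
    FROM fact_sales f JOIN dim_product p USING (product_key)
    GROUP BY p.name
    ORDER BY p.name
"""
before = conn.execute(EXISTING_REPORT).fetchall()

# Graceful modification: add a new dimensional attribute and a new fact.
conn.execute("ALTER TABLE dim_product ADD COLUMN category TEXT")
conn.execute("ALTER TABLE fact_sales ADD COLUMN discount_amount REAL")
conn.execute("UPDATE dim_product SET category = 'hardware'")

# The old report is unaffected, and new reports can use the new column.
after = conn.execute(EXISTING_REPORT).fetchall()
by_category = conn.execute(
    """
    SELECT p.category, SUM(f.quantity)
    FROM fact_sales f JOIN dim_product p USING (product_key)
    GROUP BY p.category
    """
).fetchall()
conn.close()
```

Adding a whole new dimension would follow the same pattern under these assumptions: a new dimension table plus a new foreign key column on the fact table, with existing queries simply ignoring the addition.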

    Integration of new data sources into a data warehouse has always been a significant
    challenge, since often these new data sources arrive without any thought to integration
    with existing data sources. This will certainly be the case with big data analytics.
    Again for dimensionally modeled data warehouses, this author has described techniques
    for incremental integration, where "enterprise dimensional attributes" are defined and
    planted in the dimensions of the separate data sources. We call these conformed
    dimensions. The development and deployment of conformed dimensions fits the agile
    development approach beautifully, since this kind of integration can be implemented one
    data source at a time, and one dimensional attribute at a time, again in a way that is
    nondestructive to existing applications. Please see the references section at the end of
    this white paper for more information on conformed dimensions.
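
The following Python sketch illustrates, under simplifying assumptions, how a single conformed attribute lets two otherwise unrelated sources be combined. The two sources, their keys, the `region` attribute, and the `summarize` and `drill_across` helpers are all hypothetical. Each source is summarized separately by the conformed attribute, and the two result sets are then merged on the shared attribute values.

```python
from collections import defaultdict

# Source 1 (hypothetical): sales facts keyed by customer number.
sales_customers = {
    101: {"name": "Acme", "region": "EMEA"},
    102: {"name": "Bolt", "region": "APAC"},
}
sales_facts = [(101, 500.0), (102, 200.0), (101, 100.0)]  # (customer, revenue)

# Source 2 (hypothetical): web log facts keyed by visitor id. The keys are
# unrelated to source 1, but the "region" attribute uses the same agreed values.
web_visitors = {
    "v1": {"browser": "firefox", "region": "EMEA"},
    "v2": {"browser": "chrome", "region": "APAC"},
    "v3": {"browser": "safari", "region": "EMEA"},
}
web_facts = [("v1", 12), ("v2", 7), ("v3", 3)]  # (visitor, page views)


def summarize(facts, dimension, attribute):
    """Total one source's measure by a dimension attribute."""
    totals = defaultdict(float)
    for key, measure in facts:
        totals[dimension[key][attribute]] += measure
    return dict(totals)


def drill_across(*summaries):
    """Merge separately computed summaries on their shared row labels."""
    labels = sorted(set().union(*summaries))
    return [(label, *(s.get(label) for s in summaries)) for label in labels]


revenue = summarize(sales_facts, sales_customers, "region")
page_views = summarize(web_facts, web_visitors, "region")
report = drill_across(revenue, page_views)
```

Only the `region` attribute had to be agreed on across the two sources. Further attributes could be conformed later, one at a time, without disturbing this report.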

    Finally, at least one organization interviewed for this white paper has taken agility to
    its logical extreme. Individual developers are given complete end-to-end responsibility
    for a project, all the way from original sourcing of the data, through experimental
    analysis, re-implementing the project for production use, and working with the end-users
    and their BI tools in supportive mode. Although this development approach remains an
    experiment, early results are very interesting because these developers feel a
    significant sense of responsibility and pride for their projects.

    Lessons from the early data warehousing era

    It took most of the 1990s for organizations to understand what a data warehouse was and
    how to build and manage those kinds of systems. Interestingly, at the end of the 1990s,
    data warehousing was effectively relabeled as business intelligence. This was a very
    positive development because it reflected the need for the business to own and take
    responsibility for the uses of data.

    The earliest data warehouse pioneers had no choice but to do their own systems
    integration, assembling best-of-breed components, and coping with the inevitable
    incompatibilities of dealing with multiple vendors. By the end of the 1990s, the
    best-of-breed approach gave way to vendor stacks of integrated products, a trend that
    continues today. At this point, there are only a few independent vendors in the data
    warehouse space, and those vendors have succeeded by
    interfacing with nearly every conceivable format and interface, thereby providing
    bridges between the more limited proprietary vendor stacks.

    With the benefit of hindsight gained from the traditional data warehouse experience, the
    big data analytics version of data warehousing is likely to consolidate quite quickly.
    Only the bravest organizations with very strong software development skills should
    consider rolling their own big data analytics applications directly on raw
    MapReduce/Hadoop. For information management organizations wishing to focus on the
    business issues rather than on the breaking wave of software development, a packaged
    Hadoop distribution (e.g., Cloudera) makes a lot of sense. The leading ETL platform
    vendors likely will also introduce packaged environments for handling many of the phases
    of MapReduce/Hadoop development.

    Analytics in the cloud

    This white paper has not discussed cloud implementations of big data analytics. Most of
    the enterprises interviewed for this white paper were not using public cloud
    implementations for their production analytics. Nevertheless, cloud implementations may
    be very attractive in the startup phase for an analytics effort. A cloud service can
    provide instant scalability during this startup phase, without committing to a massive
    legacy investment in hardware. Data analysis projects can be turned on and turned off on
    short notice. Recall that typical analytic environments may involve hundreds of separate
    sandboxes and parallel experiments.

    Many of the organizations interviewed for this paper stated that mature analytics should
    be brought in-house, perhaps implemented technically as a cloud but within the confines
    of the organization. Of course, such an in-house cloud may reduce fears of security and
    privacy breaches (fairly or not).

    A remote cloud implementation raises issues of network bandwidth, especially in a
    broadly integrated application with multiple very large data sets in different
    locations. Imagine solving the big join problem where your trillion-row fact table is
    out on the cloud, and your billion-row dimension table is located in-house.
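
A back-of-envelope calculation shows why the placement of the two tables matters so much in this big join. The row counts come from the example above; the row widths and the link speed are assumptions chosen only for illustration, and the `transfer_hours` helper is hypothetical. The estimate also ignores compression, protocol overhead, and parallel links.

```python
def transfer_hours(rows, bytes_per_row, link_gbits_per_sec):
    """Hours needed to move a table over a network link at full utilization."""
    total_bits = rows * bytes_per_row * 8
    seconds = total_bits / (link_gbits_per_sec * 1e9)
    return seconds / 3600


FACT_ROWS = 10**12        # trillion-row fact table (from the text)
DIM_ROWS = 10**9          # billion-row dimension table (from the text)
BYTES_PER_FACT_ROW = 40   # assumption: narrow rows of keys and numeric measures
BYTES_PER_DIM_ROW = 200   # assumption: wider rows of descriptive text
LINK_GBPS = 1.0           # assumption: a dedicated 1 Gbit/s link

# Option 1: pull the fact table in-house to join it with the dimension.
fact_hours = transfer_hours(FACT_ROWS, BYTES_PER_FACT_ROW, LINK_GBPS)

# Option 2: push the dimension table out to the cloud instead.
dim_hours = transfer_hours(DIM_ROWS, BYTES_PER_DIM_ROW, LINK_GBPS)

print(f"move fact table:      {fact_hours:,.1f} hours")
print(f"move dimension table: {dim_hours:,.2f} hours")
```

Under these assumptions, moving the fact table takes roughly 89 hours while moving the dimension table takes under half an hour, so the join should be shipped to wherever the fact table lives. Different row widths or link speeds change the numbers but not the lopsided ratio.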

    Although the best-performing systems try to achieve a three-way balance among CPU, disk
    speed, and bandwidth, most organizations interviewed for this paper predicted that
    bandwidth would emerge as the number one limiting factor for big data analytics system
    performance.

    Whither EDW?

    The enterprise data warehouse must expand to encompass big data analytics as part of
    overall information management. The mission of the data warehouse has always been to
    collect the data assets of the organization and structure them in a way that is most
    useful to decision-makers. Although some organizations may persist with a box on the org
    chart labeled EDW that is restricted to traditional reporting activities on
    transactional data, the scope of the EDW should grow to reflect these new big data
    developments. In some sense there are only two functions of IT: getting the data in
    (transaction processing), and getting the data out. The EDW is getting the data out.

    The big choice facing shops with growing big data analytics investments is whether to
    choose an RDBMS-only solution, or a dual RDBMS and MapReduce/Hadoop solution. This
    author predicts that the dual solution will dominate, and in many cases the two
    architectures will not exist as separate islands but rather will have rich data
    pipelines going in both directions. It is safe to say that both architectures will
    evolve hugely over the next decade, but this author predicts that both architectures
    will share the big data analytics marketplace at the end of the decade.

    Sometimes when an exciting new technology arrives, there is a tendency to close the door
    on older technologies as if they were going to go away. Data warehousing has built an
    enormous legacy of experience, best practices, supporting structures, technical
    expertise, and credibility with the business world. This will be the foundation for
    information management in the upcoming decade as data warehousing expands to include big
    data analytics.

    Acknowledgments

    This author is grateful for Informatica's sponsoring of this white paper and for
    providing absolutely no "vendor bias." The opinions in this white paper are solely the
    responsibility of the author.

    A number of smart and knowledgeable big data practitioners made themselves available
    during the research phase of the white paper for interviews. These individuals provided
    many useful insights. In alphabetic order by organization, we thank

    Amr Awadallah, Mike Olson, Cloudera
    Brian Dolan, Discovix
    Oliver Ratzesberger, eBay
    Alex Ignatius, Electronic Arts
    William Schmarzo, EMC
    Ashish Thusoo, Facebook
    Julianna DeLua, John Haddad, Sanjay Krishnamurthi, Ron Lunasin, Informatica
    Nicholas Wakefield, LinkedIn
    Dan Graham, Dilip Krishna, Ron Kunze, Teradata
    Professor Michael Franklin, Computer Science Department, U.C. Berkeley
    Raymie Stata, Yahoo!
    Dan McCaffrey, Ken Rudin, Zynga

    References

    An Architecture for Data Quality, a Kimball Group White Paper:
    http://vip.informatica.com/?elqPURLPage=8784

    Essential Steps for the Integrated EDW, a Kimball Group White Paper:
    http://vip.informatica.com/?elqPURLPage=8785

    Hadoop, The Definitive Guide, 2nd Edition, Tom White, O'Reilly (2011)

    Hadoop-ML website: http://videolectures.net/nipsworkshops09_pednault_hmli/

    MAD Skills: New Analysis Practices for Big Data, Cohen, Dolan et al.,
    http://db.cs.berkeley.edu/jmh/papers/madskills-032009.pdf