todd - london 2 - brining you data together in the cloud · 2019-12-01 · bringing your data...

Post on 02-Aug-2020

0 Views

Category:

Documents

0 Downloads

Preview:

Click to see full reader

TRANSCRIPT

@SnowflakeDB@SnowflakeDB #CloudAnalytics17

LONDON

BringingYourDataTogetherintheCloudToddBeaucheneGlobalAlliancesArchitect,SnowflakeComputing

“Data!Data!Data!Ican'tmakebrickswithoutclay.”-SherlockHolmes

Agenda

• CloudDataEcosystem• DataSources• Methodologies• DataIntegrationSolutions• Conclusion

Cloud Data EcosystemData Integration Business Intelligence &

AnalyticsData Warehouse

Enterprise apps

Data Sources

Corporate

Web

Mobile

IoT

Data Sources

Data SourcesOn-Premises• Typicallybackedbyalocaltransactionaldatabase

• Alldataliveswithinthefirewall

• Customerhasfullaccesstoalldataandsystem

Cloud• Typicallybackedbyaclouddatabase(i.e.RDS)

• CanrunincustomerVPC

• Typicallyoffersfeweroptionsthanon-premises

SaaS• Typically data is only

available via API• Outside of customer

firewall or VPC• Customer has very

little control over handling of data

Real World Example: Consolidated DashboardChallenges• Long-termprojectwithhigh-levelgoals

• Diversedatasources

• Differentrefreshcycles

• Inconsistentresults

Solutions• Agileprojectwithfocused,short-termgoals

• DedicatedschemainEDW

• DailyETLProcess

• DataqualitycheckswithinETL

Methodologies

MethodologiesBulkLoading– Trunc andLoad• Runsatregularintervals• Fulldatasetloadedduringeachrunandexistingdataispurged

• Leastefficientoption,butverysimpletomanage

• Highdatavolumeseveryrun• Morecommonlyusedfordimensiontables

DailyDifferentials• RunsduringnightlyETLwindow• Requireschangedatacapturetoidentifychangedrows

• Generallyconsistsofaseriesofstepswhereeachdependsontheprevioussteps

• Mustincludelogictohandleslowlychangingdimensions

MethodologiesInsert-only– Date-based• ExtractsdatabydaterangetoeliminateneedforCDC

• Simplifiedprocessing• Commonlyusedforfacttables• Changestodatafrompreviousperiodsrequiredeletionofalldataforthegivenrange

DatabaseReplication• Generallyrunsinnear-real-time• Requiresatoolthatistightlyintegratedwiththesourcedatabase

• Schemasmustmatchbetweensourceanddestination

MethodologiesBatchProcessing• Generallyusedwhendataisbeingpushedfromthesource

• Batchfrequencydependsonthevolumeandvelocityofthedata

• Requiresautomatedprocesstoloadbatchesintothedatawarehouse.

Streaming• Generallyusedforhighvolumedata

• Event-basedratherthanrow-based

• Oftenrequiresmicro-batchingofdataforloadintorelationaldatabase

• Rawdatamustusuallybetransformedtosupportanalytics

Data IntegrationSolutions

Data Integration SolutionsCustomCode• Flexiblebutcomplex

• Leveragesin-databaseprocessing

• Challengingtomanageandmaintain

ETL• Simplifieddatatransformationwithnocode

• Built-independencyanderrorhandling

• ReducesdatavolumeswithinEDW

ELT• Leverages benefits of

ETL while shifting data processing to EDW

• Requires tight integration between Data Integration and EDW

• Raw and transformed data in one place

Data Integration SolutionsOn-Premises• Customerownshardwareandsoftwareinstall/configuration

• Don’thavetodealwithfirewalltoaccesslocalsources

Cloud• Customerownssoftwareinstall/configurationbutnothardware

• CanrunincustomerVPCtoprovidedirectaccesstodatawithinVPCorbehindfirewall

SaaS• Fully managed by

service provider• Configurable options

vary by solution• Must find secure ways

to access data not stored inside firewall

Conclusion

Cloud Data Warehousing Best Practices• Leveragethescalablecomputelayertodothebulkofthedata

processing• Isolateloadandtransformjobsfromqueriestopreventresource

contention• Eliminatephysicaldatamartsbyleveragingascalabledataplatform• QAiskey,makesureallchangesmadetodataintegrationtasksare

testedbeforetheyrolltoproduction• Whenmigratingitisimportanttoconvertonesourceatatime

top related