Big Data Technology Ecosystem

Mark Burnette, Pentaho Director Sales Engineering, Hitachi Vantara

Post on 03-Feb-2018

Agenda

• End-to-End Data Delivery Platform
• Ecosystem of Data Technologies
• Mapping an End-to-End Solution
• Case Studies
• Pentaho Key Capabilities
• Summary
• Q&A

End-to-End Data Delivery Platform

Ingest
• Data Agnostic
• Metadata Driven Ingestion
• Data Orchestration

Process
• Native Hadoop Integration
• Scale Up & Scale Out
• Blend Unstructured Data

Publish
• Streamlined Data Refinery
• Data Virtualization
• Machine Learning

Report
• Production Reporting
• Custom Dashboards
• Self-Service Dashboards
• Interactive Analysis
• Embedded Analytics

Delivering Insight

Ingest, Process, Publish, Report

Consumers, Data Analyst, Data Scientists, Data Engineers

Production Reporting, Custom Dashboards, Interactive Analysis, Self-Service Dashboards

Data Integration & Orchestration

Big Data Ecosystem

• Relational Database
• Analytical Databases
• NoSQL Database
• HDFS MapReduce
• SQL on Hadoop
• Distributed Search
• Message Streaming
• Event Stream Processing (ESP)
• Complex Event Processing (CEP)

Data Source Attributes

Volume (Data Size): Small, Medium, Large
Variety (Data Type): Structured, Semi-Structured, Unstructured
Velocity (Processing): Batch, Micro-Batch, RT Streaming
Latency (Reporting): Scheduled, Prompted, Interactive

Technology categories: Analytical Databases, SQL on Hadoop, Relational Database, NoSQL Database, Message Streaming, Distributed Search, Event Stream Processing (ESP), Complex Event Processing (CEP), HDFS MapReduce
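
The four attributes and their three levels can be written down as a small data structure. The following is a minimal Python sketch, not something taken from the deck: the class names, the `DataSourceProfile` type, and the "web clickstream" example are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Volume(Enum):
    SMALL = "Small"
    MEDIUM = "Medium"
    LARGE = "Large"


class Variety(Enum):
    STRUCTURED = "Structured"
    SEMI_STRUCTURED = "Semi-Structured"
    UNSTRUCTURED = "Unstructured"


class Velocity(Enum):
    BATCH = "Batch"
    MICRO_BATCH = "Micro-Batch"
    RT_STREAMING = "RT Streaming"


class Latency(Enum):
    SCHEDULED = "Scheduled"
    PROMPTED = "Prompted"
    INTERACTIVE = "Interactive"


@dataclass(frozen=True)
class DataSourceProfile:
    """One data source described along the four attributes."""
    name: str
    volume: Volume
    variety: Variety
    velocity: Velocity
    latency: Latency


# Hypothetical example source; the name and chosen levels are illustrative.
clickstream = DataSourceProfile(
    name="web clickstream",
    volume=Volume.LARGE,
    variety=Variety.SEMI_STRUCTURED,
    velocity=Velocity.RT_STREAMING,
    latency=Latency.INTERACTIVE,
)
```

Describing each source this way gives a fixed vocabulary for comparing it against the technology profiles that follow.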

Relational Database
MSFT SQL Server, Oracle, MySQL, PostgreSQL, IBM DB2

Volume (Data Size): Small, Medium, Large
Operational databases for OLTP apps that require high transaction loads and user concurrency. They can “scale up” to handle data volumes but lack the ability to easily “scale out” for large data processing.

Variety (Data Type): Structured, Semi-Structured, Unstructured
Structured schema of tables containing rows and columns of data, emphasizing integrity and consistency over speed and scale. Structured data is accessed with the SQL query language.

Velocity (Processing): Batch, Micro-Batch, RT Streaming
Rigid schemas with batch-oriented ingestion and SQL query processing are not designed for continuous streaming data.

Latency (Reporting): Scheduled, Prompted, Interactive
Optimized for frequent small CRUD queries (create, read, update, delete), not for analytic or interactive query workloads on large data.

Rating legend: Good Fit, Not Optimal, Not Recommended, Core Competency
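
The small CRUD queries mentioned above can be illustrated with a short sketch. It uses Python's built-in `sqlite3` module with an in-memory database purely as a stand-in; SQLite is not one of the products named on the slide, and the `accounts` table and its columns are hypothetical.

```python
import sqlite3

# In-memory database as a stand-in for an operational RDBMS.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "id INTEGER PRIMARY KEY, owner TEXT NOT NULL, balance REAL NOT NULL)"
)

# Create: insert one row inside a transaction.
with conn:
    conn.execute(
        "INSERT INTO accounts (id, owner, balance) VALUES (?, ?, ?)",
        (1, "alice", 100.0),
    )

# Read: fetch the row back by primary key.
row = conn.execute(
    "SELECT owner, balance FROM accounts WHERE id = ?", (1,)
).fetchone()

# Update: the transaction commits on success and rolls back on error.
with conn:
    conn.execute(
        "UPDATE accounts SET balance = balance - ? WHERE id = ?", (25.0, 1)
    )
updated = conn.execute(
    "SELECT balance FROM accounts WHERE id = ?", (1,)
).fetchone()[0]

# Delete: remove the row again.
with conn:
    conn.execute("DELETE FROM accounts WHERE id = ?", (1,))
remaining = conn.execute("SELECT COUNT(*) FROM accounts").fetchone()[0]
```

Each statement touches a single row through a fixed schema, which is the workload shape the slide describes as the core competency of relational databases.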

Analytical Database
Columnar, In-Memory, MPP, OLAP
Teradata, Oracle Exadata, IBM Netezza, EMC Greenplum, Vertica

Volume (Data Size): Small, Medium, Large
Data warehouse/mart databases to support BI and advanced analytics workloads. MPP architecture gives the ability to “scale out” to large data volumes at a financial cost.

Variety (Data Type): Structured, Semi-Structured, Unstructured
Structured schema of tables containing rows and columns of data, offering improved speed and scalability over RDBMS but still limited to structured data.

Velocity (Processing): Batch, Micro-Batch, RT Streaming
Rigid schemas with batch-oriented SQL queries are not designed for streaming applications.

Latency (Reporting): Scheduled, Prompted, Interactive
All four types (Columnar, In-Memory, MPP, OLAP) are designed for improved query performance for analytic or interactive query workloads on large data.

Rating legend: Good Fit, Not Optimal, Not Recommended, Core Competency

NoSQL Database
MongoDB, HBase, Cassandra, MarkLogic, Couchbase

Volume (Data Size): Small, Medium, Large
Good for web applications: less web app code to write, debug, and maintain. Scale out: horizontal scaling with auto-sharding of data to support millions of web app users. Compromises on consistency (ACID transactions) in favor of scale and up-time.

Variety (Data Type): Structured, Semi-Structured, Unstructured
Hierarchical, key-value, or document design to capture all types of data in a single location.

Velocity (Processing): Batch, Micro-Batch, RT Streaming
Schema-less design allows for rapid or continuous ingest at scale. Good storage option for the high-throughput, low-latency requirements of streaming applications for real-time views of data. Seen as a key component of the Lambda architecture.

Latency (Reporting): Scheduled, Prompted, Interactive
Low-level query languages, a lack of skills, and a lack of SQL support make NoSQL less appealing for reporting and analysis.

Rating legend: Good Fit, Not Optimal, Not Recommended, Core Competency
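
The idea of horizontal scaling through sharding can be sketched in a few lines. This is a toy illustration under stated assumptions, not how any of the products above implement it: the `ShardedDocumentStore` class is hypothetical, the shards are plain in-process dictionaries, and there is no rebalancing or replication.

```python
import hashlib


class ShardedDocumentStore:
    """Toy key-to-shard routing over schema-less documents (hypothetical)."""

    def __init__(self, num_shards):
        # Each shard stands in for a separate node; here it is just a dict.
        self.shards = [dict() for _ in range(num_shards)]

    def _shard_for(self, key):
        # A stable hash of the key decides which shard owns the document.
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return int(digest, 16) % len(self.shards)

    def put(self, key, document):
        self.shards[self._shard_for(key)][key] = document

    def get(self, key):
        return self.shards[self._shard_for(key)].get(key)


store = ShardedDocumentStore(num_shards=4)
# Documents need not share a schema: fields differ from one record to the next.
store.put("user:1", {"name": "Ada", "tags": ["admin"]})
store.put("user:2", {"name": "Lin", "address": {"city": "Oslo"}})
```

Because the application only ever addresses documents by key, adding capacity means adding shards rather than buying a larger server.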

HDFS MapReduce
Cloudera, Hortonworks, MapR, Pivotal, Amazon EMR, Hitachi HSP, MSFT HDInsight

Volume (Data Size): Small, Medium, Large
The Hadoop Distributed File System is designed to distribute and replicate file blocks, horizontally scaled across multiple commodity data nodes. MapReduce programming takes compute to the data for batch processing of large data volumes.

Variety (Data Type): Structured, Semi-Structured, Unstructured
The file system is schema-less, allowing easy storage of any file type in multiple Hadoop file formats.

Velocity (Processing): Batch, Micro-Batch, RT Streaming
HDFS and MapReduce are designed for distributing batch processing workloads on large data sets, not for micro-batch or streaming use cases.

Latency (Reporting): Scheduled, Prompted, Interactive
MapReduce on HDFS lacks SQL support, and report queries are slow, making it less appealing for reporting and analysis.

Rating legend: Good Fit, Not Optimal, Not Recommended, Core Competency
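
The MapReduce programming model can be sketched with a word count, the customary introductory example. This is a single-process Python illustration of the map, shuffle, and reduce phases, not Hadoop code; the function names and the two sample blocks are assumptions made for the example.

```python
from collections import defaultdict


def map_phase(block):
    """Emit a (word, 1) pair for every word in one file block."""
    for word in block.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


# Each string stands in for a file block stored on a different data node.
blocks = ["big data big", "data lake"]
mapped = [pair for block in blocks for pair in map_phase(block)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
```

In a real cluster the map calls run on the nodes that hold each block, which is what "takes compute to the data" refers to.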

SQL on Hadoop
Batch-oriented, Interactive, and In-Memory
Apache Hive, Apache Drill/Phoenix, Hortonworks Hive on Tez, Cloudera Impala, Pivotal HAWQ, Spark SQL

Volume (Data Size): Small, Medium, Large
SQL queries on a metadata layer (HCatalog) in Hadoop. The queries are converted to MapReduce, Apache Tez, Impala MPP, and Spark and run on different storage formats such as HDFS and HBase.

Variety (Data Type): Structured, Semi-Structured, Unstructured
SQL was designed for structured data. Hadoop files may contain nested data, variable data, and schema-less data. A SQL-on-Hadoop engine must be able to translate all these forms of data to flat relational data and optimize queries (Impala/Drill).

Velocity (Processing): Batch, Micro-Batch, RT Streaming
SQL-on-Hadoop engines require smart and advanced workload managers for multi-user workloads and are designed for query processing, not stream processing.

Latency (Reporting): Scheduled, Prompted, Interactive
Ad-hoc reporting, iterative OLAP, and data mining in single-user and multi-user modes. For multi-user queries, Impala is on average 16.4x faster than Hive-on-Tez and 7.6x faster than Spark SQL with Tungsten, with an average response time of 12.8 s compared to 1.6 minutes or more.

Rating legend: Good Fit, Not Optimal, Not Recommended, Core Competency
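
Translating nested, variable records into flat relational rows can be sketched as follows. This is a simplified Python illustration, not the approach of any particular engine: the `flatten` and `to_rows` helpers and the sample records are hypothetical, only nested objects are handled (not arrays), and missing fields become `None`.

```python
def flatten(record, parent_key="", sep="."):
    """Turn nested dictionaries into one level of dotted column names."""
    flat = {}
    for key, value in record.items():
        column = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, column, sep))
        else:
            flat[column] = value
    return flat


def to_rows(records):
    """Build a common column list and one tuple per record."""
    flat_records = [flatten(r) for r in records]
    columns = sorted({c for r in flat_records for c in r})
    rows = [tuple(r.get(c) for c in columns) for r in flat_records]
    return columns, rows


# Two records with different shapes, as schema-less files often contain.
records = [
    {"id": 1, "user": {"name": "Ada", "geo": {"country": "UK"}}},
    {"id": 2, "user": {"name": "Lin"}, "device": "mobile"},
]
columns, rows = to_rows(records)
```

Once every record has the same columns, ordinary SQL operators such as projection and filtering apply directly.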

Distributed Search
Elasticsearch, Solr (based on Apache Lucene), Amazon CloudSearch

Volume (Data Size): Small, Medium, Large
Search engines have to deal with large systems with millions of documents and are designed for index and search query processing at scale with clustering and a distributed architecture.

Variety (Data Type): Structured, Semi-Structured, Unstructured
XML, CSV, RDBMS, Word, PDF, ActiveMQ, AWS SQS, DynamoDB (Amazon NoSQL), File System, Git, JDBC, JMS, Kafka, LDAP, MongoDB, neo4j, RabbitMQ, Redis, and Twitter.

Velocity (Processing): Batch, Micro-Batch, RT Streaming
ES is scalable to very large clusters with near-real-time search. The demands of real-time web applications require search results in near real time as new content is generated by users. There is some contention when handling concurrent search and index requests.

Latency (Reporting): Scheduled, Prompted, Interactive
Both use a key-value pair query language. Solr is much more oriented towards text search, while Elasticsearch is often used for more advanced querying, filtering, and grouping. Good for interactive search queries but not interactive analytical reporting.

Rating legend: Good Fit, Not Optimal, Not Recommended, Core Competency
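
The index-then-search pattern behind these engines can be sketched with a tiny inverted index, the data structure Lucene-based engines are built around. This is a single-process Python illustration with whitespace tokenization only; the `build_index` and `search` helpers and the sample documents are hypothetical and do not use any Elasticsearch or Solr API.

```python
from collections import defaultdict


def build_index(documents):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every term in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


docs = {
    1: "near real time search",
    2: "real time analytics",
    3: "batch reporting",
}
index = build_index(docs)
hits = search(index, "real time")
```

Lookups touch only the posting sets for the query terms, which is why search stays fast as the document count grows; distributing the index across a cluster applies the same idea per shard.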

Message Streaming
Kafka, JMS, AMQP

Volume (Data Size): Small, Medium, Large
Kafka is an excellent low-latency messaging platform that brokers massive message streams for parallel ingestion into Hadoop.

Variety (Data Type): Structured, Semi-Structured, Unstructured
Data sources such as the Internet of Things, sensors, clickstream, and transactional systems.

Velocity (Processing): Batch, Micro-Batch, RT Streaming
Real-time streaming providing high throughput for both publishing and subscribing, with constant performance even with many terabytes of stored messages. Designed for streaming, and the batch size can be configured for brokering micro-batches of messages.

Latency (Reporting): Scheduled, Prompted, Interactive
Stream topics need to be processed by additional technology such as PDI, ESP, CEP, or query processing engines for reporting.

Rating legend: Good Fit, Not Optimal, Not Recommended, Core Competency
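
Grouping a continuous stream into micro-batches of a configurable size can be sketched as below. This is a plain Python generator, not a Kafka client: the `micro_batches` function is hypothetical, and an ordinary iterator stands in for a message topic.

```python
from itertools import islice


def micro_batches(messages, batch_size):
    """Yield lists of up to batch_size messages from any iterable stream."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(messages)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# A generator stands in for a topic that delivers messages one by one.
stream = (f"event-{i}" for i in range(7))
batches = list(micro_batches(stream, batch_size=3))
```

A batch size of 1 degenerates to message-at-a-time delivery, while larger sizes trade latency for throughput.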

Event Stream Processing (ESP)
Apache Storm

Volume (Data Size): Small, Medium, Large
Apache Storm is a distributed “event-at-a-time” stream processing system for processing large volumes in parallel with sub-second latency.

Variety (Data Type): Structured, Semi-Structured, Unstructured
Storm applications process one incoming event at a time as tuples of data; a tuple may contain objects of any type, from sources such as the Internet of Things, sensors, and transactional systems.

Velocity (Processing): Batch, Micro-Batch, RT Streaming
Storm is extremely fast, with the ability to process over a million messages per second per node. Compromises on fault tolerance by offering “at least once” semantics in favor of speed.

Latency (Reporting): Scheduled, Prompted, Interactive
ESP provides the most recent processed data for all types of reporting. Example ESP use case: stock market tickers showing stock performance with a green up arrow or red down arrow in real time.

Rating legend: Good Fit, Not Optimal, Not Recommended, Core Competency
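
The stock ticker use case can be sketched as event-at-a-time processing. This is a plain Python generator, not Storm code: the `ticker_arrows` function, the `(symbol, price)` tuple layout, and the sample symbols are assumptions made for illustration.

```python
def ticker_arrows(price_events):
    """Process one (symbol, price) tuple at a time and emit a direction."""
    last_price = {}
    for symbol, price in price_events:
        previous = last_price.get(symbol)
        if previous is None or price == previous:
            arrow = "-"      # first tick for this symbol, or unchanged
        elif price > previous:
            arrow = "up"     # rendered as a green up arrow
        else:
            arrow = "down"   # rendered as a red down arrow
        last_price[symbol] = price
        yield symbol, price, arrow


events = [("ABC", 10.0), ("ABC", 10.5), ("XYZ", 7.0), ("ABC", 10.2)]
result = list(ticker_arrows(events))
```

Each event produces output immediately and the only state kept is the last price per symbol, which is what keeps latency low.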

Complex Event Processing (CEP)
Spark, Flink

Volume (Data Size): Small, Medium, Large
Spark and Flink are distributed “micro-batch” stream processing engines for processing large volumes of high-velocity data in parallel with a few seconds of latency.

Variety (Data Type): Structured, Semi-Structured, Unstructured
Complex event processing for the Internet of Things, sensors, and transactional systems. An aggregation-oriented CEP solution is focused on executing online algorithms in response to event data entering the system. Detection-oriented CEP is focused on detecting combinations of events called event patterns or situations.

Velocity (Processing): Batch, Micro-Batch, RT Streaming
Micro-batch processing engines with a few seconds of latency; not as fast as Storm, but with better fault tolerance, guaranteeing “exactly once” semantics for stateful computations. Great for machine learning computations.

Latency (Reporting): Scheduled, Prompted, Interactive
CEP provides the most recent processed data for all types of reporting. Example CEP use case: a user sets up a stock market alert saying "let me know if GOOG stock went up by 10% and stayed up for 3 hours or more".

Rating legend: Good Fit, Not Optimal, Not Recommended, Core Competency
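
The detection-oriented alert in the example use case can be sketched as a pattern check over a price stream. This is a plain Python illustration, not Spark or Flink code. Several details are assumptions the slide does not specify: the `sustained_rise_alert` function is hypothetical, ticks are `(time_in_hours, price)` pairs in time order, and the 10% rise is measured against a baseline price fixed when the alert is set up.

```python
def sustained_rise_alert(ticks, baseline, rise=0.10, hold_hours=3.0):
    """Return the time the alert fires, or None if the pattern never occurs.

    The pattern: price is at least `rise` above `baseline` and stays there
    for `hold_hours` or more without dropping back below the threshold.
    """
    threshold = baseline * (1.0 + rise)
    rise_started = None
    for time_hours, price in ticks:
        if price >= threshold:
            if rise_started is None:
                rise_started = time_hours
            if time_hours - rise_started >= hold_hours:
                return time_hours
        else:
            # Dropping below the threshold resets the pattern.
            rise_started = None
    return None


# Hypothetical hourly prices against a baseline of 100.
held = [(0, 100.0), (1, 111.0), (2, 112.0), (3, 113.0), (4, 112.0)]
dipped = [(0, 100.0), (1, 111.0), (2, 105.0), (3, 112.0), (5, 113.0)]
fired_at = sustained_rise_alert(held, baseline=100.0)
no_alert = sustained_rise_alert(dipped, baseline=100.0)
```

Unlike the ticker arrows of event stream processing, the result depends on a combination of events over time, so the detector has to carry state between them.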

Mapping a Solution

Matrix for Analytics Performance (MAP)

Columns: Relational Database, Analytical Database, NoSQL Database, Hadoop File System (HDFS MR), SQL on Hadoop, Distributed Search, Message Streaming, Event Stream Processing (ESP), Complex Event Processing (CEP)

Rows:
Volume (Data Size): Small, Medium, Large
Variety (Data Type): Structured, Semi-Structured, Unstructured
Velocity (Processing): Batch, Micro-Batch, RT Streaming
Latency (Reporting): Scheduled, Prompted, Interactive

Rating legend: Good Fit, Not Optimal, Core Competency, Not Recommended
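
Using such a matrix to shortlist technologies for a set of requirements can be sketched as a simple lookup. Everything specific here is an assumption: the ratings are illustrative placeholders attached to made-up technology names, not values from the deck, and ranking the four ratings from "Not Recommended" up to "Core Competency" is one plausible reading of the legend.

```python
# Assumed ordering of the legend, from worst fit to best fit.
RANK = {
    "Not Recommended": 0,
    "Not Optimal": 1,
    "Good Fit": 2,
    "Core Competency": 3,
}


def candidates(matrix, requirements, minimum="Good Fit"):
    """Return technologies rated at least `minimum` on every requirement."""
    floor = RANK[minimum]
    return sorted(
        tech
        for tech, ratings in matrix.items()
        if all(RANK[ratings[level]] >= floor for level in requirements)
    )


# Placeholder ratings for hypothetical technologies (not from the deck).
matrix = {
    "Technology A": {
        "Large": "Not Optimal",
        "Structured": "Core Competency",
        "RT Streaming": "Not Recommended",
    },
    "Technology B": {
        "Large": "Core Competency",
        "Structured": "Good Fit",
        "RT Streaming": "Good Fit",
    },
    "Technology C": {
        "Large": "Good Fit",
        "Structured": "Good Fit",
        "RT Streaming": "Core Competency",
    },
}

shortlist = candidates(matrix, ["Large", "RT Streaming"])
```

A solution that spans several stages would run this lookup once per stage, since ingestion, storage, and reporting usually have different requirements.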

Big Data Projects

Diagram labels: PDI (Pentaho Data Integration), Big Data Sources, Hadoop/Data Lake, Analytic Data Sets, Traditional Data, Data Warehouse, Data Marts, Line of Business, Analytics

• Extranet Deployments
• Embedded Analytics
• On-Demand Data Mart
• Self-Service Analytics
• Centralized Analytics at Scale

A Single Flow

Data Prep, Data Engineering, Analytics

Ingestion, Processing, Blending, Data Delivery, Data Discovery/Analysis

Analysis & Dashboards

Administration, Security, Lifecycle Management

Data Provenance

Dynamic Data Pipeline, Monitoring, Automation

Key Takeaways

• Data architecture modernization involves many technologies
• Understanding the ecosystem of data technologies
• Mapping an end-to-end solution
• Pentaho key capabilities
