Processing Big Data with Pentaho
Rakesh Saha
Pentaho Senior Product Manager, Hitachi Vantara
Agenda
• Process big data visually in a future-proof way – Demo
• Combine stream data processing with batch – Demo
Pentaho’s Latest and Upcoming Features for Processing Big Data – Batch or Real-time
Big Data Processing is HARD
“Through 2018, 70% of Hadoop deployments will not meet cost savings and revenue generation objectives due to skills and integration challenges.” – GARTNER (1)
(1) Gartner Analyst, Nick Heudecker; infoworld.com, Sept 2015
1. New Skills Necessary
2. High Effort and Risk
3. Continuous Change
Big Data Integration and Analytics Workflow with Pentaho
Big Data Challenges
• Processing semi-structured, unstructured, and structured data
• Blending big data with traditional data
• Maintaining security and governance of data
• Processing streaming data in real time and historically
• Enabling and operationalizing data science
[Diagram labels: Sensor; LOB Applications; Big or small data; MSG Queue (Kafka, JMS, MQTT); Stream; Pentaho Data Integration; Data Lake; Analytic Database; Machine Learning (R, Python); Feedback Loop; Pentaho Analyzer; Pentaho Reporting; Embedded]
Process Big Data Visually in a Future-Proof Way
Visual Big Data Processing with Pentaho
• What: Visually ingest and process Big Data at enterprise scale
• What's special: Visually develop once and execute on any engine with the Adaptive Execution Layer (AEL)
• Why
  – Difficult to find qualified developers
  – Difficult to keep up with new technologies
• Available since Pentaho 7.1
Adaptive Execution of Big Data
Build Once, Execute on Any Engine
Challenge: With rapidly changing big data technology, coding on various engines can be time-consuming or impossible with existing resources.
Solution: Future-proof data integration and analytics development in a drag-and-drop visual development environment, eliminating the need for specialized coding and API knowledge. Seamlessly switch between execution engines to fit data volume and transformation complexity.
[Diagram labels: PDI; Pentaho Kettle]
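The "build once, execute on any engine" idea can be illustrated with a minimal Python sketch. This is not Pentaho or AEL code: the names `apply_steps`, `run_local`, `run_parallel`, and `execute` are hypothetical, and a thread pool merely stands in for a cluster engine such as Spark. The point is only that the step list is defined once and the engine is chosen at run time.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List

# A "transformation" here is just an ordered list of row-level steps.
# All names below are hypothetical; none of them are Pentaho APIs.
Step = Callable[[dict], dict]


def apply_steps(row: dict, steps: List[Step]) -> dict:
    """Run every step, in order, on a single row."""
    for step in steps:
        row = step(row)
    return row


def run_local(rows: Iterable[dict], steps: List[Step]) -> List[dict]:
    """Stand-in for a single-process engine: rows are handled one by one."""
    return [apply_steps(row, steps) for row in rows]


def run_parallel(rows: Iterable[dict], steps: List[Step], workers: int = 4) -> List[dict]:
    """Stand-in for a cluster engine: rows are handled by a worker pool."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so both engines return the same list.
        return list(pool.map(lambda row: apply_steps(row, steps), rows))


ENGINES = {"local": run_local, "parallel": run_parallel}


def execute(rows, steps, engine="local"):
    """Run the same step list on whichever engine is selected by name."""
    return ENGINES[engine](rows, steps)


# The pipeline is defined once...
steps = [
    lambda r: {**r, "total": r["price"] * r["qty"]},
    lambda r: {**r, "large_order": r["total"] >= 100},
]
orders = [{"price": 20, "qty": 2}, {"price": 60, "qty": 2}]

# ...and executed on either engine without changing the steps.
local_result = execute(orders, steps, engine="local")
parallel_result = execute(orders, steps, engine="parallel")
```

Both calls return the same rows; only the `engine` argument differs, which is the property the slide attributes to switching execution engines.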
Adaptive Execution for Spark
Process Big Data Faster on Spark Without Any Coding
Challenge: Finding the talent and time to work with Spark and newer big data technologies.
Solution: More easily develop big data applications in PDI using adaptive execution to ingest, process and blend data from a range of big data sources and scale on Spark clusters.
[Diagram labels: PDI; Pentaho Kettle]
Upcoming Enhanced Adaptive Execution Layer
• Simplified setup
  – Fewer steps to set up
  – Easy to configure fail-over and load-balancing
• Development productivity
  – Robust transformation error and status reporting
  – Customization of Spark jobs
• Robust enterprise security
  – Client to AEL connection can be secured
  – End-to-end Kerberos impersonation from client tool to cluster
[Diagram labels: PDI Client; HADOOP CLUSTER; Spark/Hadoop Processing Nodes; AEL-Spark Daemon (Edge Nodes); AEL-Spark Engine (Spark Driver); Spark Executors; Hadoop/Spark Compatible Storage Cluster (HDFS, Azure Storage, Amazon S3, etc.)]
Upcoming Big Data File Format Handling
Big Data platforms introduced various data formats to improve performance, compression, and interoperability.
What:
• Visual handling of data files with the Big Data formats Parquet and Avro
  – Reading and writing files with specific steps
  – Natively execute in Spark via AEL
Why:
• Ease of development of Big Data processing
• Performance improvement due to avoidance of intermediate formats
Demonstration
Retail Web Log Data Processing with Pentaho
• Run within Spoon via Pentaho during development and then use a Spark cluster for production
• Uses lookups, sort, Parquet file input/output, and other steps to test parallel and serial processing within the Spark cluster
Combine Stream Processing with Batch Processing
What is Stream Data Processing? And Why?
• Batch data processing is useful, but sometimes businesses need to obtain crucial insights faster and act on them
• Many use cases must consider data 2+ times: on the wire, and then subsequently as historical data
• Get crucial time-sensitive insights
  – React to customer interactions on a website or mobile app
  – Predict risk of equipment breakdown before it happens
Former POV “secure data in DW, then OLAP ASAP afterward” gives way to
Current POV “analyze on the wire, write behind”
NEW Stream Data Processing with Pentaho
• Visually ingest and produce data from/to Kafka using NEW steps
• Process micro-batch chunks of data using either a time-based or a message size-based window
• Switch processing engines between Spark (Streaming) and Native Kettle
• Harden stream processing libraries and steps to process data from traditional message queues
• Benefits:
  – Lower the bar to build streaming applications
  – Enable combining batch and stream data processing
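To make the time-based versus message size-based window concrete, here is a minimal Python sketch of micro-batching over a generic message iterator. It is an illustration under assumptions, not how the Pentaho steps are implemented: `micro_batches`, `max_messages`, `max_seconds`, and `clock` are hypothetical names, and the time limit is only checked when a message arrives.

```python
import time
from typing import Callable, Iterable, Iterator, List, Optional


def micro_batches(
    messages: Iterable[str],
    max_messages: Optional[int] = None,
    max_seconds: Optional[float] = None,
    clock: Callable[[], float] = time.monotonic,
) -> Iterator[List[str]]:
    """Group a stream into micro-batches (hypothetical helper).

    A batch is emitted when it holds max_messages messages, or when
    max_seconds have passed since its first message, whichever comes first.
    The time limit is only evaluated when a new message arrives.
    """
    if max_messages is None and max_seconds is None:
        raise ValueError("set max_messages, max_seconds, or both")
    batch: List[str] = []
    started = 0.0
    for message in messages:
        if not batch:
            started = clock()  # the window opens with its first message
        batch.append(message)
        full = max_messages is not None and len(batch) >= max_messages
        expired = max_seconds is not None and clock() - started >= max_seconds
        if full or expired:
            yield batch
            batch = []
    if batch:
        yield batch  # flush the final partial batch


# Size-based window: at most two messages per micro-batch.
size_windows = list(micro_batches(["a", "b", "c", "d", "e"], max_messages=2))
```

Passing `max_seconds` instead of (or together with) `max_messages` gives the time-based variant; the injectable `clock` exists only so the behaviour can be checked without real waiting.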
How to Process Stream Data in Pentaho
• Steps for Kafka ingestion and publish
  – Kafka Consumer
  – Kafka Producer
• Steps for stream processing
  – Get records from stream
• Ingest and process a continuous stream of data in near real-time in the parent transformation
• Process each micro-batch of stream data in a separate child transformation
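The parent/child split described above can be sketched in plain Python. This is only an analogy under assumptions: `parent_transformation` and `child_transformation` are hypothetical functions, not Pentaho objects, a list stands in for a Kafka topic, and a fixed batch size stands in for the configured window.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator, List


def child_transformation(batch: List[dict]) -> Dict[str, int]:
    """Child: process one micro-batch, here by counting events per store."""
    counts: Dict[str, int] = {}
    for event in batch:
        counts[event["store"]] = counts.get(event["store"], 0) + 1
    return counts


def parent_transformation(stream: Iterable[dict], batch_size: int) -> Iterator[Dict[str, int]]:
    """Parent: keep reading the stream and hand each micro-batch to the child."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return  # only reached when a finite stream is exhausted
        yield child_transformation(batch)


# A finite list stands in for the continuous event stream.
events = [
    {"store": "north"},
    {"store": "south"},
    {"store": "north"},
    {"store": "south"},
    {"store": "south"},
]
results = list(parent_transformation(events, batch_size=2))
```

Keeping the per-batch logic in its own function mirrors the slide's separation: the parent only deals with ingestion and batching, while the child holds the processing that runs once per micro-batch.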
Combined Data Processing Using Spark & Pentaho
[Diagram stages: Ingest; Process; Publish; Reporting]
[Diagram labels: DATA SOURCES (Web Clickstream and Other Logs; Traditional DB/DW and NoSQL Datastores; Traditional Message Bus; IoT Data; Kafka Cluster); Data Collector; HADOOP/SPARK CLUSTER (RT Data Processors; Batch Data Processors; Hadoop MR; HDFS; Data Store; Micro Services); Data Publisher; Kafka Cluster; Analytical Databases; Pentaho Analytics]
Pentaho DI: PDI collects data from sources including Kafka clusters
Pentaho DI: PDI can process streaming data using Spark and Spark Streaming or the Kettle engine in a completely visual way
Pentaho DI: PDI can retrieve processed or blended data from Hadoop/Spark and publish to Kafka clusters or external databases
Demonstration
Retail Store Event Processing
• Can be run within Spoon via Pentaho or within the AEL-Spark engine
• Utilizes Kafka in/out, Parquet out, and other steps to demonstrate stream data ingestion, window processing and much more…
Availability and Roadmap
Availability
• Adaptive Execution Layer (AEL) and Spark-AEL available in Pentaho 7.1
  – Secure Spark integration, high availability, and security of AEL are EE only
  – Supported Hadoop distros: Cloudera CDH in Pentaho 7.1; Cloudera CDH and Hortonworks HDP in Pentaho 8.0
• Kafka steps and stream data processing available in Pentaho 8.0
  – Kafka from Cloudera and Hortonworks to be supported
Roadmap
• Extending AEL to support other Spark distros and other data processing engines
• Advanced stream processing with other real-time messaging protocols and windowing mechanisms
• Enabling Big Data driven machine learning on batch or stream data
• Integrated with the broader Hitachi Vantara portfolio
SUMMARY: Visual Future-Proof Big Data Processing with Pentaho
Visually build stream data processing pipelines for different streaming engines
• Configure stream data processing logic
• Execute logic in multiple stream processing engines without rework
• Connect to streaming data sources
NEW in Pentaho
✓ Native Streaming in PDI
✓ Spark Streaming via AEL
✓ Kafka Connectivity
Leverage the power of Adaptive Execution to future-proof data processing pipelines
• Configure logic without coding
• Switch processing engines without rework
• Handle Big Data formats more efficiently
NEW in Pentaho
✓ Adaptive Execution Layer
✓ Visual Spark via AEL
✓ Native Big Data Format Handling
Next Steps
Want to learn more?
• Meet-the-Experts:
  – Anthony DeShazor
  – Luke Nazarro
  – Carlo Russo
• Recommended Breakout Sessions:
  – Jonathan Jarvis: Understanding Parallelism with PDI and Adaptive Execution with Spark
  – Mark Burnette: Understanding the Big Data Technology Ecosystem