
Processing Big Data with Pentaho

Rakesh Saha
Pentaho Senior Product Manager, Hitachi Vantara

Agenda

• Process big data visually in a future-proof way – Demo

• Combine stream data processing with batch – Demo

Pentaho's Latest and Upcoming Features for Processing Big Data – Batch or Real-time

Big Data Processing is HARD

"Through 2018, 70% of Hadoop deployments will not meet cost savings and revenue generation objectives due to skills and integration challenges." – GARTNER (1)

(1) Gartner Analyst, Nick Heudecker; infoworld.com, Sept 2015

1 New Skills Necessary

2 High Effort and Risk

3 Continuous Change

Big Data Integration and Analytics Workflow with Pentaho

Big Data Challenges

• Processing semi/un/structured data

• Blending big data with traditional data

• Maintaining security, governance of data

• Processing streaming data in real time and historically

• Enabling and operationalizing data science

[Workflow diagram labels: Sensor; Big or small data; MSG Queue (Kafka, JMS, MQTT); Pentaho Data Integration; Data Lake; Analytic Database; Machine Learning (R, Python); Stream; Feedback Loop; Pentaho Analyzer; Pentaho Reporting; LOB Applications; Embedded Pentaho Data Integration]

Process Big Data Visually in a Future Proof Way

Visual Big Data Processing with Pentaho

• What: Visually ingest and process Big Data at enterprise scale

• What Special: Visually develop once and execute on any engine with Adaptive Execution Layer (AEL)

• Why
  – Difficult to find qualified developers
  – Difficult to keep up with new technologies

• Available since Pentaho 7.1

Adaptive Execution of Big Data

Build Once, Execute on Any Engine

Challenge: With rapidly changing big data technology, coding on various engines can be time-consuming or impossible with existing resources

Solution: Future-proof data integration and analytics development in a drag-and-drop visual development environment, eliminating the need for specialized coding and API knowledge. Seamlessly switch between execution engines to fit data volume and transformation complexity

PDI

Pentaho Kettle
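The build-once, execute-on-any-engine idea can be pictured in code terms. The following is a minimal Python sketch, not Pentaho's API: `Step`, `LocalEngine`, `ChunkedEngine`, and `execute` are hypothetical names invented for illustration. The pipeline is defined once as data, and the engine that runs it is chosen at execution time.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Protocol

Row = dict


@dataclass(frozen=True)
class Step:
    """One named row-level operation (hypothetical type, not a PDI class)."""
    name: str
    apply: Callable[[Row], Iterable[Row]]  # returns zero or more output rows


class Engine(Protocol):
    def run(self, steps: List[Step], rows: Iterable[Row]) -> List[Row]: ...


def run_steps(steps: List[Step], rows: Iterable[Row]) -> List[Row]:
    """Apply every step in order to every row."""
    current = list(rows)
    for step in steps:
        current = [out for row in current for out in step.apply(row)]
    return current


class LocalEngine:
    """Runs the whole pipeline in one pass over the full input."""

    def run(self, steps, rows):
        return run_steps(steps, rows)


class ChunkedEngine:
    """Stand-in for a cluster engine: processes each partition separately."""

    def __init__(self, partitions: int = 2):
        self.partitions = partitions

    def run(self, steps, rows):
        rows = list(rows)
        out: List[Row] = []
        for i in range(self.partitions):
            out.extend(run_steps(steps, rows[i::self.partitions]))
        return out


# The pipeline is written once, independent of any engine.
PIPELINE = [
    Step("keep_ok", lambda r: [r] if r["status"] == 200 else []),
    Step("add_kb", lambda r: [{**r, "kb": r["bytes"] / 1024}]),
]


def execute(engine: Engine, rows: Iterable[Row]) -> List[Row]:
    return engine.run(PIPELINE, rows)
```

Calling `execute(LocalEngine(), rows)` or `execute(ChunkedEngine(partitions=4), rows)` changes where the work runs, not the pipeline definition. Because the sketch only uses row-level steps, the partitioned run yields the same rows, though possibly in a different order.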

Adaptive Execution for Spark

Process Big Data Faster on Spark Without Any Coding

Challenge: Finding the talent and time to work with Spark and newer big data technologies

Solution: More easily develop big data applications in PDI using adaptive execution to ingest, process and blend data from a range of big data sources and scale on Spark clusters

PDI

Pentaho Kettle

Upcoming Enhanced Adaptive Execution Layer

• Simplified Setup
  – Fewer steps to set up
  – Easy to configure fail-over, load-balancing

• Development productivity
  – Robust transformation error and status reporting
  – Customization of Spark jobs

• Robust Enterprise Security
  – Client to AEL connection can be secured
  – End-to-end Kerberos impersonation from client tool to cluster

[Architecture diagram labels: PDI Client; AEL-Spark Daemon (Edge Nodes); AEL-Spark Engine (Spark Driver); HADOOP CLUSTER; Spark/Hadoop Processing Nodes; Spark Executors; Hadoop/Spark Compatible Storage Cluster (HDFS, Azure Storage, Amazon S3, Etc…)]

Upcoming Big Data File Format Handling

Big Data platforms introduced various data formats to improve performance, compression, and interoperability

What:

• Visual handling of data files with Big Data formats Parquet and Avro
  – Reading and writing files with specific steps
  – Natively execute in Spark via AEL

Why:

• Ease of development of Big Data processing

• Performance improvement due to avoidance of intermediate formats
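As background on why such formats can help compression: Parquet stores data column by column, while Avro stores it row by row. The toy Python sketch below is not either format's actual encoding and uses no Pentaho or Parquet API; `to_columns` and `run_length_encode` are hypothetical helpers. It only shows that grouping a column's values together exposes repetition that a simple scheme such as run-length encoding can exploit.

```python
from itertools import groupby


def to_columns(rows):
    """Pivot a list of same-shaped row dicts into {column: [values]}."""
    if not rows:
        return {}
    return {key: [row[key] for row in rows] for key in rows[0]}


def run_length_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


# Row-oriented view: one record per sale (illustrative data).
rows = [
    {"store": "A", "sku": 101},
    {"store": "A", "sku": 102},
    {"store": "A", "sku": 103},
    {"store": "B", "sku": 101},
]

# Column-oriented view: all values of a field sit next to each other.
columns = to_columns(rows)

# The repetitive "store" column shrinks from four values to two pairs.
encoded_store = run_length_encode(columns["store"])
```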

Demonstration

Retail Web Log Data Processing with Pentaho

• Run within Spoon via Pentaho during development, and then use a Spark cluster for production

• Uses lookups, sort, Parquet file in/out, and other steps to test parallel and serial processing within the Spark cluster

Combine Stream Processing with Batch Processing

What is Stream Data Processing? And Why?

• Batch data processing is useful, but sometimes businesses need to obtain crucial insights faster and act on them

• Many use cases must consider data 2+ times: on the wire, and then subsequently as historical data

• Get crucial time-sensitive insights
  – React to customer interactions on a website or mobile app
  – Predict risk of equipment breakdown before it happens

Former POV "secure data in DW, then OLAP ASAP afterward" gives way to current POV "analyze on the wire, write behind"
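The "analyze on the wire, write behind" pattern can be sketched in a few lines of Python. This is an illustration under assumptions, not Pentaho code: the event shape, the `threshold` value, and the function names are all hypothetical. Each event is examined the moment it arrives, then stored so that it can be considered a second time as historical data.

```python
def handle_event(event, history, alerts, threshold=90.0):
    """Act on an event immediately, then persist it for later analysis."""
    # 1) On the wire: react right away to a time-sensitive condition.
    if event["temperature"] > threshold:
        alerts.append(event["sensor"])
    # 2) Write behind: keep the event for batch processing afterward.
    history.append(event)


def average_temperature(history, sensor):
    """Batch-style query over the stored events for one sensor."""
    readings = [e["temperature"] for e in history if e["sensor"] == sensor]
    return sum(readings) / len(readings) if readings else None


history, alerts = [], []
for event in [
    {"sensor": "s1", "temperature": 85.0},
    {"sensor": "s2", "temperature": 70.0},
    {"sensor": "s1", "temperature": 95.0},
]:
    handle_event(event, history, alerts)
```

In this sketch the alert fires while the data is still in flight, and the same records remain available for the historical query.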

NEW Stream Data Processing with Pentaho

• Visually ingest and produce data from/to Kafka using NEW steps

• Process micro-batch chunks of data using either a time-based or a message size-based window

• Switch processing engines between Spark (Streaming) and Native Kettle

• Harden stream processing libraries and steps to process data from traditional message queues

• Benefits:
  – Lower the bar to build streaming applications
  – Enable combining batch and stream data processing
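The micro-batch windowing described above can be sketched as a small Python generator. This is not how the Pentaho steps are implemented; `micro_batches` and its parameters are hypothetical, and "message size-based" is read here as a message count, which is an assumption. Timestamps are passed in explicitly so that the behavior is deterministic.

```python
def micro_batches(messages, max_messages=None, max_seconds=None):
    """Group (timestamp, payload) pairs into micro-batches.

    A batch closes when it holds max_messages payloads, or when a message
    arrives max_seconds or more after the first message of the batch.
    """
    if max_messages is None and max_seconds is None:
        raise ValueError("set max_messages, max_seconds, or both")
    batch = []
    window_start = None
    for timestamp, payload in messages:
        # Time-based window: close the batch before adding a late message.
        if batch and max_seconds is not None and timestamp - window_start >= max_seconds:
            yield batch
            batch = []
        if not batch:
            window_start = timestamp
        batch.append(payload)
        # Count-based window: close the batch once it is full.
        if max_messages is not None and len(batch) >= max_messages:
            yield batch
            batch = []
    if batch:
        yield batch  # flush the final partial batch


stream = [(0, "a"), (1, "b"), (2, "c"), (5, "d"), (6, "e")]
by_count = list(micro_batches(stream, max_messages=2))
by_time = list(micro_batches(stream, max_seconds=3))
```

With this input, the count-based window produces batches of at most two messages, while the time-based window splits at the gap between timestamps 2 and 5.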

How to Process Stream Data in Pentaho

• Steps for Kafka ingestion and publish
  – Kafka Consumer
  – Kafka Producer

• Steps for stream processing
  – Get records from stream

• Ingest and process continuous stream of data in near real-time in parent transformation

• Process micro-batch of stream data in separate child transformation
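The parent/child split can be pictured with the following Python sketch. It does not use Kafka or any Pentaho API: an in-memory iterable stands in for the Kafka Consumer step, a callback stands in for the Kafka Producer step, and all function names are hypothetical. The parent keeps pulling records and cuts them into micro-batches, and the child holds the per-batch logic.

```python
from collections import Counter
from itertools import islice


def child_transformation(batch):
    """Process one micro-batch: count events per store."""
    counts = Counter(record["store"] for record in batch)
    return [{"store": store, "events": n} for store, n in sorted(counts.items())]


def parent_transformation(consume, publish, batch_size):
    """Pull records continuously, run the child per micro-batch, publish results."""
    stream = iter(consume)
    while True:
        batch = list(islice(stream, batch_size))
        if not batch:
            break  # a real stream would keep waiting for more records
        for result in child_transformation(batch):
            publish(result)


records = [{"store": "A"}, {"store": "B"}, {"store": "A"}, {"store": "B"}]
published = []
parent_transformation(records, published.append, batch_size=3)
```

Keeping the per-batch logic in its own function mirrors the separation described above: the child can be changed or tested with a plain list of records, without touching the ingestion loop.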

Combined Data Processing Using Spark & Pentaho

[Architecture diagram. Stages: Ingest, Process, Publish, Reporting]

DATA SOURCES: Web Clickstream and Other Logs; Traditional DB/DW and NoSQL Datastores; Traditional Message Bus; IoT Data; Kafka Cluster

HADOOP/SPARK CLUSTER: Data Store; Micro Services; RT Data Processors; Batch Data Processors; Hadoop MR; HDFS

Other labels: Data Collector; Data Publisher; Kafka Cluster; Analytical Databases; Pentaho Analytics

Pentaho DI: PDI collects data from sources including Kafka clusters

Pentaho DI: PDI can process streaming data using Spark and Spark Streaming or the Kettle engine in a completely visual way

Pentaho DI: PDI can retrieve processed or blended data from Hadoop/Spark and publish to Kafka clusters or external databases

Demonstration

Retail Store Event Processing

• Can be run within Spoon via Pentaho or within the AEL-Spark engine

• Uses Kafka in/out, Parquet out, and other steps to demonstrate stream data ingestion, window processing and much more…

Availability and Roadmap

Availability

• Adaptive Execution Layer (AEL) and Spark-AEL available in Pentaho 7.1
  – Secure Spark integration, high-availability and security of AEL are EE only
  – Supported Hadoop distros: Cloudera CDH in Pentaho 7.1; Cloudera CDH and Hortonworks HDP in Pentaho 8.0

• Kafka steps and stream data processing available in Pentaho 8.0
  – Kafka from Cloudera and Hortonworks to be supported

Roadmap

• Extending AEL to support other Spark distros and other data processing engines

• Advanced stream processing with other real-time messaging protocols and windowing mechanism

• Enabling Big Data driven machine learning on batch or stream data

• Integrated with broader Hitachi Vantara portfolio

SUMMARY: Visual Future-Proof Big Data Processing with Pentaho

Visually build stream data processing pipelines for different streaming engines

• Configure stream data processing logic

• Execute logic in multiple stream processing engines without rework

• Connect to streaming data sources

NEW in Pentaho
  ✓ Native Streaming in PDI
  ✓ Spark Streaming via AEL
  ✓ Kafka Connectivity

Leverage the power of Adaptive Execution to future-proof data processing pipelines

• Configure logic without coding

• Switch processing engines without rework

• Handle Big Data formats more efficiently

NEW in Pentaho
  ✓ Adaptive Execution Layer
  ✓ Visual Spark via AEL
  ✓ Native Big Data Format Handling

Next Steps

Want to learn more?

• Meet-the-Experts:
  – Anthony DeShazor
  – Luke Nazarro
  – Carlo Russo

• Recommended Breakout Sessions:
  – Jonathan Jarvis: Understanding Parallelism with PDI and Adaptive Execution with Spark
  – Mark Burnette: Understanding the Big Data Technology Ecosystem
