GNW03: Stream Processing with Apache Kafka by Gwen Shapira

Apache Kafka and Real Time Stream Processing Gwen Shapira System Architect Confluent @gwenshap

Upload: gluent

Post on 07-Jan-2017


TRANSCRIPT

Apache Kafka and Real Time Stream Processing

Gwen Shapira, System Architect

Confluent, @gwenshap

I’ll tell you about

• What is stream processing and why it matters
• What is Apache Kafka
• How Kafka helps stream processing

Stay awake for this part

What is Stream Processing?

Data Processing Paradigm

Request/Response

Batch

Stream Processing

Stream Processing Paradigm

• Data is generated at its own rate as “Streams”
• We can process as much or as little as we want
• Continuously
• Results are available in real-time
• But nothing waits for specific results
• Time for data availability?
• More than “few ms”
• Less than “hours”

This is the world changing bit

• Most of the business is…
• Not urgent enough to require immediate response
• But can’t wait for the next day

• “Streams of events” represents something fundamental
• Same way relational tables are fundamental

Ok, got the streams part. But what about Apache Kafka?

Cross of messaging system and file system

Kafka is all about LOGS

If you understand logs
You understand Kafka

Redo Log:

The most crucial structure for recovery operations… store all changes made to the database as they occur.

Important Point

The redo log is the only reliable source of information about the current state of the database.

But Logs are also a STREAM of events
And Kafka stores those logs

Allowing us to read the past and keep getting updates in the future
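A minimal sketch of the log idea above, in plain Python rather than Kafka itself. The `Log` class, its methods, and the record strings are hypothetical names chosen for illustration: records are only ever appended, each gets an offset, and a reader that remembers its offset can replay the past and then keep polling for new records.

```python
class Log:
    """Append-only sequence of records, addressed by offset (illustrative only)."""

    def __init__(self):
        self._records = []

    def append(self, record):
        """Add a record to the end of the log and return its offset."""
        self._records.append(record)
        return len(self._records) - 1

    def read(self, offset):
        """Return (records, next_offset) for everything from `offset` onward."""
        records = self._records[offset:]
        return records, offset + len(records)


log = Log()
log.append("insert user=1")
log.append("update user=1")

# A new reader starts at offset 0 and replays the past.
past, position = log.read(0)

# Later writes are picked up by reading again from the saved position.
log.append("delete user=1")
future, position = log.read(position)
```

The reader, not the log, owns the position, which is why several independent readers can consume the same log at their own pace.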

Stream Processing

Read a stream, modify it, output another stream
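The read, modify, output shape can be sketched with plain Python generators. This is only an illustration of the pattern, not Kafka code; the record format (`user,amount`), the function names, and the threshold are assumptions made up for the example.

```python
def parse(lines):
    """Read a stream of 'user,amount' lines and yield one event per line."""
    for line in lines:
        user, amount = line.split(",")
        yield {"user": user, "amount": float(amount)}


def large_orders(events, threshold):
    """Modify the stream: keep only events at or above the threshold."""
    for event in events:
        if event["amount"] >= threshold:
            yield event


# The input could be unbounded; each event flows through as it arrives.
input_stream = iter(["alice,12.50", "bob,250.00", "carol,99.99"])
output_stream = large_orders(parse(input_stream), threshold=100.0)

result = list(output_stream)
```

Because each stage consumes one stream and produces another, stages can be chained without any stage needing to see the whole input first.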

Example: CDC-based ETL

If we use Kafka for CDC, does it mean it is ACID?

Stream Processing is Important

Kafka is a collection of logs.

How does Kafka help with stream processing?

First, how do we actually do stream processing?

Method 1: Do it yourself (Hipster stream processing)

Method 2: The Stream Processing Frameworks
• Storm
• Spark
• Flink
• Samza
• Apex
• NiFi
• StreamBase
• InfoSphere Streams
• Google Dataflow (AKA Beam)
• I can go on for 5 more pages…

Few of those are really popular!

• Pro: They handle some hard problems
• Con: It can be too complex

What do I mean by too complex?

[Architecture diagram. Labelled components: Clients ("Swipe here!"), Web App, Flume Agents, Kafka, Local Cache, HBase / Memory, Spark Streaming, HDFS Event Sink, Solr Sink, Hadoop Cluster I, Hadoop Cluster II (Storage, Processing), HDFS, Hive/Impala, Map/Reduce, Spark, Solr, Search. Labelled steps: Fetching & Updating Profiles, Adjusting NRT Stats, Batch Time Adjustments, Analytical Adjustments and Pattern Detection, Automated & Manual Review of NRT Changes and Counters.]

Why so many moving parts?

We needed…
HBase to handle complex state
Spark requires HDFS
Ingest layer
Batch layer to handle re-calculations

What we really wanted was…

[Diagram: Inputs, Kafka, Processor, Output]

Enter Kafka Streams

3 Simplifications:

1. Uses Kafka
2. No Framework
3. Unify Tables and Streams

Doesn’t all stream processing use Kafka?

We use Kafka for…
Partitioning, Scalability, Fault Tolerance

[Diagram: Kafka read by two consumer groups, Group A with three consumers (A, A, A) and Group B with two consumers (B, B)]
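A rough sketch of how partitioning and consumer groups give scalability and fault tolerance. This is not Kafka's implementation: the CRC32 hash is a stand-in for a real partitioner, the round-robin assignment is a simplification of real assignment strategies, and the consumer names and partition count are hypothetical.

```python
import zlib


def partition_for(key, num_partitions):
    """Stand-in partitioner: the same key always maps to the same partition."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def assign(partitions, consumers):
    """Spread partitions over the consumers of one group, round-robin."""
    assignment = {consumer: [] for consumer in consumers}
    for i, partition in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(partition)
    return assignment


partitions = list(range(6))

# Each group sees every partition; within a group the work is divided.
group_a = assign(partitions, ["a1", "a2", "a3"])
group_b = assign(partitions, ["b1", "b2"])

# If consumer a2 fails, its partitions are reassigned to the survivors.
group_a_after_failure = assign(partitions, ["a1", "a3"])
```

Adding consumers to a group spreads the partitions more thinly (scalability), and removing one hands its partitions to the rest (fault tolerance), while a second group reads the same data independently.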

Handling Time

No Framework

• It is just a library that does transformations
• We can add languages on top
• Kafka does everything we needed the framework to do
• You don’t need “framework” to run queries, why do you need it to run queries continuously?

The really important bit: Streams meet Tables

Streams: Things that happen. Events.
Tables: State of things as they are.

Databases: Only states.
Streams: Only events.

We can convert tables to streams and back:

Stream -> Apply -> Table
Table -> Change Capture -> Stream

This is called Table-Stream Duality.
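The two conversions can be sketched in a few lines of plain Python. This is an illustration of the duality only, not the Kafka Streams API; the function names, the `(key, value)` event shape, and the use of `None` as a delete marker are assumptions made for the example.

```python
def apply_stream(changes):
    """Stream -> Apply -> Table: fold change events into the current state."""
    table = {}
    for key, value in changes:
        if value is None:          # assumption: None marks a delete
            table.pop(key, None)
        else:
            table[key] = value
    return table


def capture_changes(before, after):
    """Table -> Change Capture -> Stream: events that turn `before` into `after`."""
    changes = []
    for key in sorted(before.keys() - after.keys()):
        changes.append((key, None))            # row was deleted
    for key in sorted(after):
        if before.get(key) != after[key]:
            changes.append((key, after[key]))  # row was inserted or updated
    return changes


stream = [("alice", "Paris"), ("bob", "Lima"), ("alice", "Rome"), ("bob", None)]
table = apply_stream(stream)

new_table = {"alice": "Rome", "carol": "Oslo"}
changelog = capture_changes(table, new_table)

# Replaying the original stream plus the captured changes rebuilds the new table.
rebuilt = apply_stream(stream + changelog)
```

The table is just the stream applied up to some point, and the stream is just the table's history of changes, so either one can be recovered from the other.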

Streams and Tables sometimes work the same. And sometimes are very different. Kafka Streams handles both.

But… Where do streams come from?

We really like streams
So we created a Stream Data Platform

Where can we learn more?

• http://www.confluent.io/blog
• http://kafka.apache.org/documentation.html
• http://docs.confluent.io/current