Kafka and Hadoop at LinkedIn Meetup


Post on 14-Jul-2015







Kafka & Hadoop
Gwen Shapira / Software Engineer
© 2014 Cloudera, Inc. All rights reserved.

About Me
15 years of moving data around
Formerly consultant
Now Cloudera Engineer:
Sqoop Committer
Kafka
Flume

This gives me a lot of perspective regarding the use of Hadoop.

There's a book on that!

We are also blogging

Getting data from Kafka to Hadoop
"The nice thing about standards is that there are so many of them to choose from."
Andrew S. Tanenbaum

Getting Data from Kafka to Hadoop
There are only bad options.
It's about finding the best one.

https://gist.github.com/gwenshap/9699072

Batch

Camus
Batch MapReduce job. Exactly once semantics. Run once every X minutes.

Camus
[Diagram: ZooKeeper, Kafka, Setup, three Tasks, Clean Up, Processes, HDFS (Topic Offsets, In-process Avro Files, Audit Counts) and Other Systems, connected by arrows labeled A through I.]

A - The setup stage fetches broker URLs and topic information from ZooKeeper.
B - The setup stage persists information about topics and offsets in HDFS for the tasks to read.
C - The tasks read the persisted information from the setup stage.
D - The tasks get events from Kafka.
E - The tasks write data to a temp location in HDFS in the format defined by the user-defined decoder, in this case Avro-formatted files.
F - The tasks move the data in the temp location to a final location when the task is cleaning up.
G - The task writes out audit counts on its activities.
H - A clean up stage reads all the audit counts from all the tasks.
I - The clean up stage reports back to Kafka what has been persisted.

92014 Cloudera, Inc. All rights reserved.Sqoop2From



(Webserver,Rest API,Repository,MapReduce)Client#2014 Cloudera, Inc. All rights reserved.

NiFi!

HiveKa = Hive + Kafka
[Diagram: Hive uses a HiveStorageHandler. KafkaInputFormat.getSplits() gets the topic, partitions and offsets from Kafka during MapReduce setup. Each of the Mappers uses a KafkaRecordReader to get data from Kafka, and AvroSerDe deserializes it.]
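
The split planning in the HiveKa diagram can be pictured roughly as follows. This is a sketch of the general idea only, not HiveKa's implementation: the `Split` type, the function names, and in particular the `max_events_per_split` chunking are assumptions added for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Split:
    topic: str
    partition: int
    start: int  # first offset to read (inclusive)
    end: int    # offset to stop at (exclusive)


def get_splits(topic, offset_ranges, max_events_per_split):
    """Turn per-partition (start, end) offset ranges into mapper-sized splits."""
    if max_events_per_split <= 0:
        raise ValueError("max_events_per_split must be positive")
    splits = []
    for partition, (start, end) in sorted(offset_ranges.items()):
        while start < end:
            stop = min(start + max_events_per_split, end)
            splits.append(Split(topic, partition, start, stop))
            start = stop
    return splits


def read_split(split, fetch):
    """Record-reader analogue: yield every event in the split's offset range.

    `fetch(topic, partition, offset)` is a hypothetical callback standing in
    for a Kafka consumer.
    """
    for offset in range(split.start, split.end):
        yield fetch(split.topic, split.partition, offset)
```

Each split would go to one mapper, so a partition with a long backlog is read by several mappers in parallel while an empty partition produces no split at all.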


Streaming

Flume + Kafka = Flafka

Kafka source + sink for Flume

How does Flume work?
[Diagram: Flume Agent pipeline]
Sources: Twitter, logs, webserver, Kafka
Interceptors: Mask, re-format, validate
Selectors: DR, critical
Channels: Memory, file, Kafka
Sinks: HDFS, HBase, Solr, Kafka
Does not require programming.

But I just want to get data from Kafka to HBase / HDFS

Kafka Channel
[Diagram: Flume Agent with only Channels and Sinks]
Channels: Kafka!
Sinks: HDFS, HBase, Solr
Does not require programming.

Kafka Channel
[Diagram: Flume Agent with Sources, Interceptors, Selectors and Channels, no Sinks]
Sources: Twitter, logs, webserver, Kafka
Interceptors: Mask, re-format, validate
Selectors: DR, critical
Channels: Memory, file, Kafka
Does not require programming.

Spark Streaming
[Diagram: three stages labeled Pre-first Batch, First Batch and Second Batch. In each, a Source feeds a RawInputDStream that produces RDDs, and every RDD goes through Filter, Count and Print in a single pass.]
Micro-batch stream processing framework. Basically Spark code executed in a slightly different context, plus some windowing functions.

Storm
[Diagram: Spout Layer (spouts reading from a Source), fan out to Layer 1 (split-words bolts), shuffle to Layer 2 (Count bolts).]
Stream processing framework. Quite popular. Can be event-based or micro-batching (with Trident). Requires low-level awareness of the API.

Retro Thoughts

Schema
Data often has schema
At least it should
Kafka is unaware, which is good
Need capability to figure out schema for events
Without including it in every event
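
One common way to meet the last two points is to keep schemas in a shared registry and put only a small schema id in each event. The sketch below illustrates that idea under stated assumptions: the `SchemaRegistry` class, the JSON payload layout and the `fields` list format are all hypothetical and do not describe any particular product.

```python
import json


class SchemaRegistry:
    """Hypothetical in-memory registry mapping schema ids to schemas."""

    def __init__(self):
        self._by_id = {}
        self._ids = {}

    def register(self, schema):
        """Return the id for this schema, assigning a new one if unseen."""
        key = json.dumps(schema, sort_keys=True)
        if key not in self._ids:
            schema_id = len(self._by_id) + 1
            self._ids[key] = schema_id
            self._by_id[schema_id] = schema
        return self._ids[key]

    def lookup(self, schema_id):
        return self._by_id[schema_id]


def encode(registry, schema, record):
    """Serialize a record as (schema id, values); field names are not sent."""
    schema_id = registry.register(schema)
    values = [record[name] for name in schema["fields"]]
    return json.dumps({"schema_id": schema_id, "values": values}).encode("utf-8")


def decode(registry, payload):
    """Rebuild the record by looking the schema up from the id in the event."""
    message = json.loads(payload.decode("utf-8"))
    schema = registry.lookup(message["schema_id"])
    return dict(zip(schema["fields"], message["values"]))
```

Kafka itself only ever sees the opaque bytes returned by `encode`, which matches the point that Kafka stays unaware of the schema.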

Kafka in Cloudera Manager

Questions?