Kafka and Hadoop at LinkedIn Meetup
Posted on 14-Jul-2015
Kafka & Hadoop
Gwen Shapira / Software Engineer
2014 Cloudera, Inc. All rights reserved.

About Me
15 years of moving data around
Formerly consultant
Now Cloudera Engineer: Sqoop Committer, Kafka, Flume
This gives me a lot of perspective regarding the use of Hadoop.

There's a book on that!
We are also blogging

Getting Data from Kafka to Hadoop
"The nice thing about standards is that there are so many of them to choose from."
Andrew S. Tanenbaum

There are only bad options.
It's about finding the best one.
https://gist.github.com/gwenshap/96990726

Batch
Camus
Batch MapReduce job. Exactly-once semantics. Run once every X minutes.

Camus (diagram: ZooKeeper, Kafka, Setup, Tasks, Clean Up, HDFS holding topic offsets, in-process Avro files and audit counts, and other systems; arrows labeled A to I)

A - The setup stage fetches broker URLs and topic information from ZooKeeper.
B - The setup stage persists information about topics and offsets in HDFS for the tasks to read.
C - The tasks read the persisted information from the setup stage.
D - The tasks get events from Kafka.
E - The tasks write data to a temp location in HDFS in the format defined by the user-defined decoder, in this case Avro-formatted files.
F - The tasks move the data in the temp location to a final location when the task is cleaning up.
G - The task writes out audit counts on its activities.
H - A clean-up stage reads all the audit counts from all the tasks.
I - The clean-up stage reports back to Kafka what has been persisted.
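
The A to I flow above can be sketched as a toy simulation. This is not Camus code: the `FakeKafka` class, the `run_batch` function, the directory layout, and the JSON-lines output (standing in for Avro) are all assumptions made for illustration. It only shows the order of operations: read the persisted offsets, write to a temp location, move files into place, and record the new offsets last.

```python
import json
import os
import shutil
import tempfile


class FakeKafka:
    """Hypothetical stand-in for Kafka: topic -> partition -> list of events."""

    def __init__(self, data):
        self.data = data

    def latest_offsets(self, topic):
        return {p: len(events) for p, events in self.data[topic].items()}

    def fetch(self, topic, partition, start, end):
        return self.data[topic][partition][start:end]


def run_batch(kafka, topic, root):
    """Run one batch and return audit counts per partition."""
    offsets_file = os.path.join(root, "offsets.json")
    final_dir = os.path.join(root, "final")
    tmp_dir = os.path.join(root, "tmp")
    os.makedirs(final_dir, exist_ok=True)

    # A, B, C: setup collects topic information and the offsets
    # persisted by the previous run.
    committed = {}
    if os.path.exists(offsets_file):
        with open(offsets_file) as f:
            committed = {int(p): o for p, o in json.load(f).items()}
    latest = kafka.latest_offsets(topic)

    # Temp output left behind by a failed run is discarded.
    shutil.rmtree(tmp_dir, ignore_errors=True)
    os.makedirs(tmp_dir)

    audit = {}
    for partition, end in sorted(latest.items()):
        start = committed.get(partition, 0)
        if start >= end:
            continue  # nothing new in this partition
        events = kafka.fetch(topic, partition, start, end)  # D
        name = f"{topic}-{partition}-{start}-{end}.jsonl"
        with open(os.path.join(tmp_dir, name), "w") as f:  # E
            for event in events:
                f.write(json.dumps(event) + "\n")
        audit[partition] = len(events)  # G

    # F: move finished files from the temp location to the final one.
    for name in os.listdir(tmp_dir):
        os.replace(os.path.join(tmp_dir, name), os.path.join(final_dir, name))

    # H, I: after reading the audit counts, record what has been persisted.
    new_offsets = dict(committed)
    for partition in audit:
        new_offsets[partition] = latest[partition]
    with open(offsets_file, "w") as f:
        json.dump(new_offsets, f)
    return audit
```

Running `run_batch` twice in a row against the same data returns an empty audit on the second call, because the offsets file already covers everything that was read.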
Sqoop2
(diagram labels: From; Client; Webserver, REST API, Repository, MapReduce)
NiFi!

HiveKa = Hive + Kafka
(diagram labels: HiveStorageHandler; MapReduce setup: KafkaInputFormat.getSplits() gets topic, partitions and offsets from Kafka; Mappers: KafkaRecordReader gets data from Kafka; AvroSerDe)
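
The setup step in the HiveKa diagram asks Kafka for topics, partitions and offsets so that each mapper knows which range to read. A minimal sketch of that planning step follows. The names `KafkaSplit`, `get_splits`, and `max_events_per_split` are hypothetical and not HiveKa's API; the offsets are passed in as a plain dict instead of being fetched from brokers, and cutting a partition into several chunks is an assumption the slides do not confirm.

```python
from typing import Dict, List, NamedTuple, Tuple


class KafkaSplit(NamedTuple):
    """Hypothetical description of the offset range one mapper reads."""
    topic: str
    partition: int
    start_offset: int  # inclusive
    end_offset: int    # exclusive


def get_splits(
    topic: str,
    offsets: Dict[int, Tuple[int, int]],
    max_events_per_split: int,
) -> List[KafkaSplit]:
    """Turn per-partition (earliest, latest) offsets into mapper-sized splits."""
    if max_events_per_split <= 0:
        raise ValueError("max_events_per_split must be positive")
    splits = []
    for partition in sorted(offsets):
        earliest, latest = offsets[partition]
        start = earliest
        # Cut the partition's offset range into consecutive chunks.
        while start < latest:
            end = min(start + max_events_per_split, latest)
            splits.append(KafkaSplit(topic, partition, start, end))
            start = end
    return splits
```

Each resulting split would then be handed to one mapper, whose record reader fetches only the events between `start_offset` and `end_offset`.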
Streaming

Flume + Kafka = Flafka
Kafka source + sink for Flume

How does Flume work?
Flume Agent: Sources -> Interceptors -> Selectors -> Channels -> Sinks
Sources: Twitter, logs, webserver, Kafka
Interceptors: Mask, re-format, validate
Selectors: DR, critical
Channels: Memory, file, Kafka
Sinks: HDFS, HBase, Solr, Kafka
Does not require programming.

But I just want to get data from Kafka to HBase / HDFS

Kafka Channel
Flume Agent: Channels -> Sinks
Channels: Kafka!
Sinks: HDFS, HBase, Solr
Does not require programming.

Kafka Channel
Flume Agent: Sources -> Interceptors -> Selectors -> Channels
Sources: Twitter, logs, webserver, Kafka
Interceptors: Mask, re-format, validate
Selectors: DR, critical
Channels: Memory, file, Kafka
Does not require programming.

Spark Streaming
(diagram: Source -> Raw Input DStream -> RDD, processed in a single pass of Filter, Count, Print; shown for the pre-first batch, first batch and second batch)
Micro-batch stream processing framework. Basically Spark code executed in a slightly different context, plus some windowing functions.

Storm
(diagram: Spout layer reading from the source -> fan out -> Layer 1 of split-words bolts -> shuffle -> Layer 2 of count bolts)
Stream processing framework. Quite popular. Can be event-based or micro-batching (with Trident). Requires low-level awareness of the API.

Retro Thoughts
Schema
Data often has a schema
At least it should
Kafka is unaware of it, which is good
Need a capability to figure out the schema for events
Without including it in every event
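
One way to meet the last two points is to register each schema once and carry only a small schema ID in every event. The sketch below is an in-memory illustration of that idea, not something defined in the talk: `SchemaRegistry`, `encode`, `decode`, the 4-byte ID prefix, and the JSON payload are all assumptions chosen for the example.

```python
import json
import struct


class SchemaRegistry:
    """Hypothetical in-memory registry mapping schema IDs to schemas."""

    def __init__(self):
        self._by_id = {}
        self._ids = {}

    def register(self, schema: dict) -> int:
        # Identical schemas get the same ID regardless of key order.
        key = json.dumps(schema, sort_keys=True)
        if key not in self._ids:
            schema_id = len(self._ids) + 1
            self._ids[key] = schema_id
            self._by_id[schema_id] = schema
        return self._ids[key]

    def lookup(self, schema_id: int) -> dict:
        return self._by_id[schema_id]


def encode(schema_id: int, record: dict) -> bytes:
    """Prefix the payload with a 4-byte big-endian schema ID."""
    return struct.pack(">I", schema_id) + json.dumps(record).encode("utf-8")


def decode(registry: SchemaRegistry, message: bytes):
    """Read the schema ID, look up the schema, and check the record's fields."""
    (schema_id,) = struct.unpack(">I", message[:4])
    schema = registry.lookup(schema_id)
    record = json.loads(message[4:].decode("utf-8"))
    missing = [field for field in schema["fields"] if field not in record]
    if missing:
        raise ValueError(f"record is missing fields: {missing}")
    return schema, record
```

The message broker only ever sees opaque bytes, so it stays unaware of the schema, while any consumer with access to the registry can recover it from the 4-byte prefix.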
Kafka in Cloudera Manager

Questions?