Kafka and Hadoop at LinkedIn Meetup

Post on 14-Jul-2015




<p>Kafka &amp; Hadoop</p>
<p>Gwen Shapira / Software Engineer</p>
<p>2014 Cloudera, Inc. All rights reserved.</p>
<p>About Me: 15 years of moving data around. Formerly a consultant. Now a Cloudera engineer: Sqoop committer, Kafka, Flume.</p>
<p>Notes: This gives me a lot of perspective regarding the use of Hadoop.</p>
<p>There's a book on that!</p>
<p>We are also blogging.</p>
<p>Getting data from Kafka to Hadoop</p>
<p>"The nice thing about standards is that there are so many of them to choose from." Andrew S. Tanenbaum</p>
<p>Getting Data from Kafka to Hadoop: There are only bad options. It's about finding the best one.</p>
<p>Notes: https://gist.github.com/gwenshap/9699072</p>
<p>Batch</p>
<p>Camus</p>
<p>Notes: Batch MapReduce job. Exactly-once semantics. Run once every X minutes.</p>
<p>Camus (diagram): ZooKeeper, Setup, Topic Offsets, Processes, HDFS, Other Systems, Tasks, In-process Avro Files, Audit Counts, Clean Up, Kafka. The arrows are labeled A to I.</p>
<p>A - The setup stage fetches broker URLs and topic information from ZooKeeper.</p>
<p>B - The setup stage persists information about topics and offsets in HDFS for the tasks to read.</p>
<p>C - The tasks read the persisted information from the setup stage.</p>
<p>D - The tasks get events from Kafka.</p>
<p>E - The tasks write data to a temp location in HDFS in the format defined by the user-defined decoder, in this case Avro-formatted files.</p>
<p>F - The tasks move the data in the temp location to a final location when the task is cleaning up.</p>
<p>G - The task writes out audit counts on its activities.</p>
<p>H - A clean-up stage reads all the audit counts from all the tasks.</p>
<p>I - The clean-up stage reports back to Kafka what has been persisted.</p>
<p>Sqoop2 (diagram): From (RDBMS, HDFS, Hive, HBase); To (RDBMS, HDFS, HBase, Hive, Kafka); Engine (Webserver, REST API, Repository, MapReduce); Client.</p>
<p>NiFi!</p>
<p>HiveKa = Hive + Kafka</p>
<p>HiveKa (diagram): HiveStorageHandler; KafkaInputFormat.getSplits() gets the topic, partitions and offsets from Kafka during MapReduce setup; in the mappers, KafkaRecordReader gets the data from Kafka; AvroSerDe.</p>
<p>Streaming</p>
<p>Flume + Kafka = Flafka</p>
<p>Notes: Kafka source + sink for Flume.</p>
<p>How does it work? Flume agent: Sources (Twitter, logs, webserver, Kafka); Interceptors (mask, re-format, validate); Selectors (DR, critical); Channels (memory, file, Kafka); Sinks (HDFS, HBase, Solr, Kafka).</p>
<p>Notes: Does not require programming.</p>
<p>But I just want to get data from Kafka to HBase / HDFS</p>
<p>Kafka Channel. Flume agent: Channels (Kafka!); Sinks (HDFS, HBase, Solr).</p>
<p>Notes: Does not require programming.</p>
<p>Kafka Channel. Flume agent: Sources (Twitter, logs, webserver, Kafka); Interceptors (mask, re-format, validate); Selectors (DR, critical); Channels (memory, file, Kafka).</p>
<p>Notes: Does not require programming.</p>
<p>Spark Streaming (diagram): Source, RawInputDStream, RDD; a single pass of Filter, Count, Print; shown for the pre-first batch, the first batch and the second batch.</p>
<p>Notes: Micro-batch stream processing framework. Basically Spark code executed in a slightly different context, plus some windowing functions.</p>
<p>Storm (diagram): Source; Spout layer (spouts); fan out to Layer 1 (split-words bolts); shuffle to Layer 2 (count bolts).</p>
<p>Notes: Stream processing framework. Quite popular. Can be event-based or micro-batching (with Trident). Requires low-level awareness of the API.</p>
<p>Retro Thoughts</p>
<p>Schema: Data often has a schema. At least it should. Kafka is unaware of it, which is good. We need the capability to figure out the schema for events without including it in every event.</p>
<p>Kafka in Cloudera Manager</p>
<p>Questions?</p>
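<p>The Camus notes describe a pattern in which tasks write to a temp location, move the output to a final location, and only then report what has been persisted. A minimal sketch of that pattern follows. It is not Camus itself: a Python list stands in for a Kafka partition, JSON lines stand in for Avro files, a local directory stands in for HDFS, and the names run_batch, load_offset and offsets.json are hypothetical.</p>

```python
import json
import os
import tempfile


def load_offset(offsets_path):
    """Return the last committed offset, or 0 if nothing was committed yet."""
    if not os.path.exists(offsets_path):
        return 0
    with open(offsets_path) as f:
        return json.load(f)["offset"]


def run_batch(log, base_dir):
    """Run one batch: stage new events, publish them, then commit the offset.

    `log` is a plain list standing in for one Kafka partition (assumption).
    """
    tmp_dir = os.path.join(base_dir, "tmp")
    final_dir = os.path.join(base_dir, "final")
    offsets_path = os.path.join(base_dir, "offsets.json")
    os.makedirs(tmp_dir, exist_ok=True)
    os.makedirs(final_dir, exist_ok=True)

    # Read the persisted offset and fetch only events after it.
    start = load_offset(offsets_path)
    events = log[start:]
    if not events:
        return start
    end = start + len(events)

    # Write to a temp location first, so readers never see partial output.
    name = f"events-{start}-{end}.jsonl"
    tmp_path = os.path.join(tmp_dir, name)
    with open(tmp_path, "w") as f:
        for event in events:
            f.write(json.dumps(event) + "\n")

    # Move the finished file into the final location.
    os.replace(tmp_path, os.path.join(final_dir, name))

    # Record what has been persisted only after the move succeeded.
    with open(offsets_path, "w") as f:
        json.dump({"offset": end}, f)
    return end


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as base:
        partition = [{"id": i} for i in range(3)]
        print(run_batch(partition, base))  # first run picks up all events
        print(run_batch(partition, base))  # second run finds nothing new
```

<p>The sketch keeps the ordering from the notes (stage, move, then commit) but does not handle a crash between the move and the offset write, which a real job would need to address.</p>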
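<p>The schema slide asks for a way to figure out the schema of an event without including it in every event. One common approach, shown here only as an illustration and not as something the talk prescribes, is to keep schemas in a shared registry and prefix each message with a small schema ID. The SchemaRegistry class, the 4-byte ID prefix and the JSON payload are all assumptions made for this sketch.</p>

```python
import json
import struct


class SchemaRegistry:
    """In-memory stand-in for a shared schema store (hypothetical)."""

    def __init__(self):
        self._by_id = {}
        self._ids = {}

    def register(self, schema):
        """Return a stable ID for the schema, assigning one if it is new."""
        key = json.dumps(schema, sort_keys=True)
        if key not in self._ids:
            schema_id = len(self._by_id) + 1
            self._ids[key] = schema_id
            self._by_id[schema_id] = schema
        return self._ids[key]

    def lookup(self, schema_id):
        return self._by_id[schema_id]


def encode(schema_id, record):
    """Prefix the payload with a 4-byte big-endian schema ID."""
    payload = json.dumps(record).encode("utf-8")
    return struct.pack(">I", schema_id) + payload


def decode(registry, message):
    """Read the schema ID, fetch the schema, and check the record against it."""
    (schema_id,) = struct.unpack(">I", message[:4])
    schema = registry.lookup(schema_id)
    record = json.loads(message[4:].decode("utf-8"))
    missing = [field for field in schema["fields"] if field not in record]
    if missing:
        raise ValueError(f"record is missing fields: {missing}")
    return schema, record


if __name__ == "__main__":
    registry = SchemaRegistry()
    click_id = registry.register({"name": "click", "fields": ["user", "url"]})
    message = encode(click_id, {"user": "u1", "url": "/home"})
    print(decode(registry, message))
```

<p>Each message then carries four extra bytes instead of a full schema, and the message bus itself stays unaware of the schema, as the slide suggests it should.</p>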