Kafka and Hadoop at LinkedIn Meetup
Post on 14-Jul-2015
Kafka & Hadoop
Gwen Shapira, Software Engineer
2014 Cloudera, Inc. All rights reserved.

About Me
15 years of moving data around
Formerly a consultant
Now a Cloudera engineer: Sqoop committer, Kafka, Flume
[Speaker notes: This gives me a lot of perspective regarding the use of Hadoop.]

There's a book on that!

We are also blogging

Getting Data from Kafka to Hadoop
"The nice thing about standards is that there are so many of them to choose from." (Andrew S. Tanenbaum)
[Speaker notes: There are only bad options. It's about finding the best one.]
https://gist.github.com/gwenshap/9699072

Batch

Camus
[Speaker notes: Batch MapReduce job. Exactly-once semantics. Runs once every X minutes.]

Camus architecture (ZooKeeper, Kafka, HDFS):
A - The setup stage fetches broker URLs and topic information from ZooKeeper.
B - The setup stage persists information about topics and offsets in HDFS for the tasks to read.
C - The tasks read the persisted information from the setup stage.
D - The tasks get events from Kafka.
E - The tasks write data to a temp location in HDFS in the format defined by the user-defined decoder, in this case Avro files.
F - The tasks move the data from the temp location to a final location during task cleanup.
G - Each task writes out audit counts of its activities.
H - A cleanup stage reads the audit counts from all the tasks.
I - The cleanup stage reports back to Kafka what has been persisted.

Sqoop2
From: RDBMS, HDFS, Hive, HBase
To: RDBMS, HDFS, HBase, Hive, Kafka
Engine: web server, REST API, repository, MapReduce
Client
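The Camus commit pattern described above (write to a temp location, move it to the final location during cleanup, then persist offsets and audit counts) can be sketched in miniature, with a local directory standing in for HDFS and a list of dicts standing in for Kafka events. All function and file names here are illustrative; this is not the Camus API.

```python
import json
import os
import shutil
import tempfile

def load_offsets(state_dir):
    """Step C: read the offsets persisted by the previous run (if any)."""
    path = os.path.join(state_dir, "offsets.json")
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {}

def run_task(events, start, state_dir, final_dir):
    """Steps D-G: fetch events past the committed offset, write them to a
    temp location, then atomically move them to the final location and
    persist the new offset. A crash before the move leaves only temp
    files behind, so re-running the job is safe."""
    new = events[start:]
    if not new:
        return start
    tmp = tempfile.mkdtemp(prefix="camus-task-")
    part = os.path.join(tmp, "part-0")
    with open(part, "w") as f:
        for event in new:
            f.write(json.dumps(event) + "\n")
    # Step F: the "commit" is a rename, which is atomic within one filesystem
    # (the same property Camus relies on in HDFS).
    os.makedirs(final_dir, exist_ok=True)
    shutil.move(part, os.path.join(final_dir, "part-%d" % start))
    # Steps B and G: persist the new offset and an audit count for the next run.
    with open(os.path.join(state_dir, "offsets.json"), "w") as f:
        json.dump({"offset": start + len(new), "audit_count": len(new)}, f)
    return start + len(new)
```

The key design point mirrored from the slide: data becomes visible in the final directory only via the atomic move, and offsets are committed only after that move, which is what gives the batch job its exactly-once behavior into HDFS.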
NiFi!

HiveKa = Hive + Kafka
A HiveStorageHandler for Kafka: KafkaInputFormat.getSplits() gets topics, partitions, and offsets from Kafka; the MapReduce setup launches mappers; each mapper uses a KafkaRecordReader to fetch data from Kafka; an AvroSerDe decodes the events for Hive.

Streaming

Flume + Kafka = Flafka
Kafka source and sink for Flume

Flume Agent: how does it work?
Sources: Twitter, logs, web servers, Kafka
Interceptors: mask, re-format, validate
Selectors: DR, critical
Channels: memory, file, Kafka
Sinks: HDFS, HBase, Solr, Kafka
[Speaker notes: Does not require programming.]

"But I just want to get data from Kafka to HBase / HDFS"

Kafka Channel
A Flume agent can use Kafka itself as the channel: Kafka as the channel feeding sinks (HDFS, HBase, Solr), or sources, interceptors, and selectors writing into a Kafka channel.
[Speaker notes: Does not require programming.]

Spark Streaming
Each micro-batch flows Source -> RawInputDStream -> RDD, then a single pass of Filter -> Count -> Print; the same pipeline repeats for the pre-first, first, and second batches.
[Speaker notes: Micro-batch stream processing framework. Basically Spark code executed in a slightly different context, plus some windowing functions.]

Storm
A spout layer reads from the source, fans out to split-words bolts (layer 1), which shuffle to count bolts (layer 2).
[Speaker notes: Stream processing framework. Quite popular. Can be event-based or micro-batching (with Trident). Requires low-level awareness of the API.]

Retro Thoughts

Schema
Data often has a schema (at least it should)
Kafka is unaware of it, which is good
We need a way to figure out the schema for events without including it in every event
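One common way to meet that requirement (the approach taken by schema-registry designs, not something the slides prescribe) is to prepend a small schema ID to each message instead of the full schema, and let consumers look the schema up once. A toy sketch, with an in-memory dict standing in for the registry service and JSON standing in for Avro; all names are illustrative:

```python
import json
import struct

# schema_id -> schema definition; stands in for a real registry service.
REGISTRY = {}
_next_id = [1]

def register_schema(schema):
    """Producers store a schema once and get back a small integer ID."""
    sid = _next_id[0]
    _next_id[0] += 1
    REGISTRY[sid] = schema
    return sid

def encode(schema_id, record):
    """Envelope: 4-byte big-endian schema ID, then the serialized payload.
    The per-message overhead is 4 bytes, not the whole schema."""
    return struct.pack(">I", schema_id) + json.dumps(record).encode()

def decode(message):
    """Consumers read the ID, look up the schema, then parse the payload."""
    schema_id = struct.unpack(">I", message[:4])[0]
    schema = REGISTRY[schema_id]
    record = json.loads(message[4:].decode())
    return schema, record
```

This keeps Kafka itself schema-unaware, as the slide advocates: the broker just moves bytes, while producers and consumers agree on the tiny ID-plus-payload envelope out of band.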
Kafka in Cloudera Manager

Questions?