Kafka and Hadoop at LinkedIn Meetup

Kafka & Hadoop

Gwen Shapira / Software Engineer

©2014 Cloudera, Inc. All rights reserved.

About Me

• 15 years of moving data around
• Formerly consultant
• Now Cloudera Engineer:
  – Sqoop Committer
  – Kafka
  – Flume


There’s a book on that!


We are also blogging


Getting Data from Kafka to Hadoop

There are only bad options.

It's about finding the best one.


Batch



Camus

[Diagram: Camus job flow, with steps labeled A through I. Components shown: Kafka, ZooKeeper, Processes, HDFS, Other Systems. Stages shown: Setup (topic offsets), three parallel Tasks, in-process Avro files, audit counts, Clean Up.]
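The labels in the Camus diagram (setup, topic offsets, tasks, audit counts, clean up) suggest an offset-tracked batch pull. The sketch below is a toy in-memory model of that idea, not Camus code: `log`, `committed`, and `run_batch` are hypothetical names, and a plain dict stands in for wherever the offsets are kept between runs.

```python
from collections import Counter

# Toy stand-in for a Kafka topic: partition id -> ordered list of messages.
log = {0: ["a", "b", "c"], 1: ["d", "e"]}

# Toy stand-in for stored offsets: next offset to read, per partition.
committed = {0: 1, 1: 0}


def run_batch(log, committed):
    """One batch run: read each partition from its stored offset to the end."""
    output = {}        # partition -> messages copied in this run
    audit = Counter()  # partition -> number of messages copied
    new_offsets = dict(committed)
    for partition, messages in log.items():
        start = committed.get(partition, 0)
        batch = messages[start:]  # the work one "task" would do
        output[partition] = batch
        audit[partition] = len(batch)
        new_offsets[partition] = start + len(batch)
    # New offsets are returned only after every partition has been copied.
    return output, audit, new_offsets


output, audit, committed = run_batch(log, committed)
print(output)
print(dict(audit))
print(committed)
```

Running `run_batch` a second time with the returned offsets copies nothing new, which is the property that makes a batch job like this safe to re-run.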


Sqoop2

• From: RDBMS, HDFS, Hive, HBase
• To: RDBMS, HDFS, HBase, Hive, Kafka
• Engine: Webserver, REST API, Repository, MapReduce
• Client


NiFi!


HiveKa = Hive + Kafka

[Diagram: HiveKa read path.]
• Hive: Storage Handler
• KafkaInputFormat.getSplits(): get topic, partitions and offsets from Kafka
• MapReduce: set up mappers
• Mappers: KafkaRecordReader gets data from Kafka
• Avro SerDe
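The HiveKa diagram shows `getSplits()` collecting the topic, partitions and offsets before the mappers are set up. A minimal sketch of that planning step follows, assuming one split per partition. HiveKa itself is Java; `Split`, `get_splits`, and the offset values here are hypothetical and only illustrate the shape of the result.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Split:
    topic: str
    partition: int
    start: int  # first offset to read (inclusive)
    end: int    # offset to stop at (exclusive)


def get_splits(topic, offsets):
    """Build one split per partition from {partition: (earliest, latest)}."""
    return [
        Split(topic, partition, earliest, latest)
        for partition, (earliest, latest) in sorted(offsets.items())
        if latest > earliest  # skip partitions with nothing to read
    ]


# Hypothetical offsets, as if they had been fetched from the brokers.
splits = get_splits("events", {0: (0, 100), 1: (40, 40), 2: (10, 25)})
for split in splits:
    print(split)
```

Each split would then be handed to one mapper, whose record reader fetches only the offsets in its range.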


Streaming



Flume + Kafka = Flafka


Flume Agent: How does it work?

• Sources: Twitter, logs, webserver, Kafka…
• Interceptors: Mask, re-format, validate…
• Selectors: DR, critical
• Channels: Memory, file, Kafka
• Sinks: HDFS, HBase, Solr, Kafka
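The five stages of a Flume agent can be modeled in a few lines. This is a toy illustration in plain Python, not Flume configuration or Flume API: the event layout, `mask_interceptor`, `select_channel`, and `drain` are all hypothetical names chosen to mirror the stage names on the slide.

```python
from collections import deque


def mask_interceptor(event):
    """Interceptor: mask a sensitive field before the event reaches a channel."""
    body = dict(event["body"])
    if "card" in body:
        body["card"] = "****"
    return {"headers": event["headers"], "body": body}


def select_channel(event):
    """Selector: route on a header value; everything else goes to 'default'."""
    if event["headers"].get("priority") == "critical":
        return "critical"
    return "default"


# Channels buffer events between the source side and the sinks.
channels = {"critical": deque(), "default": deque()}

# Source: a fixed list stands in for Twitter, logs, a webserver, or Kafka.
source = [
    {"headers": {"priority": "critical"}, "body": {"user": "u1", "card": "1234"}},
    {"headers": {}, "body": {"user": "u2"}},
]

for event in source:
    event = mask_interceptor(event)
    channels[select_channel(event)].append(event)


def drain(channel):
    """Sink: take every buffered event off the channel, in arrival order."""
    delivered = []
    while channel:
        delivered.append(channel.popleft())
    return delivered


critical_out = drain(channels["critical"])
default_out = drain(channels["default"])
print(critical_out)
print(default_out)
```

The channel is the only stage that holds state, which is why the choice between memory, file, and Kafka channels matters for durability.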


But I just want to get data from Kafka to HBase / HDFS



Kafka Channel

Flume Agent
• Channels: Kafka!
• Sinks: HDFS, HBase, Solr


Kafka Channel

Flume Agent
• Sources: Twitter, logs, webserver, Kafka…
• Interceptors: Mask, re-format, validate…
• Selectors: DR, critical
• Channels: Memory, file, Kafka


Spark Streaming

[Diagram: a source feeds raw input into a DStream, which yields one RDD per batch (pre-first batch, first batch, second batch). Each batch runs Filter, Count, Print in a single pass.]
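The micro-batch idea in the Spark Streaming diagram can be shown without Spark. The sketch below is plain Python, not the Spark API: `micro_batches` and `process` are hypothetical helpers, and batches are cut by record count here, whereas Spark Streaming cuts them by time interval.

```python
def micro_batches(records, batch_size):
    """Cut an input stream into batches (stand-ins for the per-batch RDDs)."""
    for i in range(0, len(records), batch_size):
        yield records[i:i + batch_size]


def process(batch):
    """Filter then count, done in a single pass over one batch."""
    return sum(1 for line in batch if "ERROR" in line)


stream = ["INFO start", "ERROR disk", "ERROR net", "ERROR io", "INFO ok"]
counts = [process(batch) for batch in micro_batches(stream, batch_size=2)]

# Print: one result line per batch.
for number, count in enumerate(counts, start=1):
    print(f"batch {number}: {count} error line(s)")
```

The same filter, count, print sequence runs unchanged on every batch, which is the point of the repeated rows in the diagram.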


Storm

[Diagram: Storm topology. Spout layer: spouts reading from the source. Fan out to Layer 1: split-words bolts. Shuffle to Layer 2: count bolts.]
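The Storm topology in the diagram is the classic word count: spouts emit text, a first layer of bolts splits it into words, and a second layer counts. The sketch below models that flow in plain Python rather than the Storm API. One assumption to note: words are routed to counters by a hash of the word (what Storm calls a fields grouping), because a per-word count is only correct if the same word always reaches the same counter. All function names are hypothetical.

```python
import zlib
from collections import Counter


def spout():
    """Spout: emit sentences (a fixed list stands in for a live source)."""
    yield from ["the cat sat", "the dog sat", "the end"]


def split_bolt(sentence):
    """Layer 1 bolt: split a sentence into words."""
    return sentence.split()


NUM_COUNTERS = 3
counters = [Counter() for _ in range(NUM_COUNTERS)]  # Layer 2 bolts


def route(word):
    """Pick a counter by a stable hash so a word always lands on the same one."""
    return zlib.crc32(word.encode("utf-8")) % NUM_COUNTERS


for sentence in spout():
    for word in split_bolt(sentence):
        counters[route(word)][word] += 1

# Merge the per-bolt counts only to display the overall result.
totals = sum(counters, Counter())
print(totals)
```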


Retro Thoughts


Schema

• Data often has schema
• At least it should
• Kafka is unaware – which is good
• Need capability to figure out schema for events
• Without including it in every event
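One way to let consumers figure out an event's schema without shipping the schema in every event is to send a short schema ID and keep the schemas in a shared lookup. The slides do not name a mechanism, so the sketch below is an assumption: `registry`, `encode`, and `decode` are hypothetical, the registry is an in-memory dict, and JSON stands in for a real serialization format such as Avro.

```python
import json
import struct

# Hypothetical in-memory schema registry: schema id -> schema definition.
registry = {1: {"name": "PageView", "fields": ["user", "url"]}}


def encode(schema_id, record):
    """Prefix the payload with a 4-byte schema id instead of the full schema."""
    fields = registry[schema_id]["fields"]
    payload = json.dumps([record[f] for f in fields]).encode("utf-8")
    return struct.pack(">I", schema_id) + payload


def decode(message):
    """Read the schema id, look the schema up, and rebuild the record."""
    (schema_id,) = struct.unpack(">I", message[:4])
    fields = registry[schema_id]["fields"]
    values = json.loads(message[4:].decode("utf-8"))
    return dict(zip(fields, values))


message = encode(1, {"user": "u1", "url": "/home"})
record = decode(message)
print(len(message), record)
```

Kafka still sees only opaque bytes, so the broker stays unaware of schemas while producers and consumers agree on them through the lookup.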


Kafka in Cloudera Manager

Questions?