Kafka & Hadoop - for NYC Kafka Meetup

Post on 02-Dec-2014

DESCRIPTION

How to integrate Kafka & Hadoop today, and how that integration will get better soon.

TRANSCRIPT

Kafka & Hadoop

Gwen Shapira / Software Engineer

©2014 Cloudera, Inc. All rights reserved.

About Me

• 15 years of moving data around
• Formerly consultant
• Now Cloudera Engineer:
  – Flume
  – Sqoop
  – Kafka

There’s a book on that!

We are also blogging

Getting Data from Kafka to Hadoop

There are only bad options.

It's about finding the best one.

Camus

[Diagram: Camus job flow. Labels: ZooKeeper, Kafka, Setup, Topic Offsets, Task (×3), In-process Avro Files (×2), Audit Counts, Clean Up, Processes, HDFS, Other Systems; steps lettered A through I.]

Missing in Action

• Kafka has no MR layer
  – InputFormat, OutputFormat, Utils…
• Sqoop is a generic batch ingest framework
  – Why no Kafka?

Flume + Kafka = Flafka

How does it work?

Flume Agent: Sources → Interceptors → Selectors → Channels → Sinks

• Sources: Twitter, logs, webserver, Kafka…
• Interceptors: Mask, re-format, validate…
• Selectors: DR, critical
• Channels: Memory, file
• Sinks: HDFS, HBase, Solr, Kafka
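
The Flume agent stages named on this slide (sources, interceptors, selectors, channels, sinks) can be modeled as a toy pipeline in plain Python. This is only a sketch of the data flow, not Flume's API: every function name here is hypothetical, and the masking rule and the "ERROR means critical" routing rule are assumptions made for illustration.

```python
import re
from collections import deque

# Hypothetical toy model of a Flume-style agent; none of these names are Flume APIs.

def source():
    """Source: emits raw events (here, hard-coded log lines)."""
    yield "INFO user=alice card=1234567812345678 login ok"
    yield "ERROR user=bob card=8765432187654321 payment failed"
    yield ""  # an empty event that should not reach any sink

def mask_interceptor(event):
    """Interceptor: drops empty events and masks 16-digit numbers."""
    if not event:
        return None
    return re.sub(r"\d{16}", "****", event)

def selector(event):
    """Selector: picks a channel name (assumed rule: ERROR events are critical)."""
    return "critical" if event.startswith("ERROR") else "default"

def run_agent():
    # Channels: in-memory queues that buffer events between sources and sinks.
    channels = {"critical": deque(), "default": deque()}
    for raw in source():
        event = mask_interceptor(raw)
        if event is not None:
            channels[selector(event)].append(event)
    # Sinks: drain each channel; plain lists stand in for HDFS, HBase, and so on.
    sinks = {name: [] for name in channels}
    for name, channel in channels.items():
        while channel:
            sinks[name].append(channel.popleft())
    return sinks

sinks = run_agent()
```

In this model the channel is the only component that holds state between the source side and the sink side, which is why swapping the channel implementation changes the agent's durability without touching the other stages.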

But I just want to get data from Kafka to HBase / HDFS

Kafka Channel

Flume Agent: Channels → Sinks

• Channels: Kafka!
• Sinks: HDFS, HBase, Solr

Spark Streaming

[Diagram: three panels labeled Pre-first Batch, First Batch, and Second Batch. Each shows Source → RawInputDStream → RDD, with one, two, and three RDDs respectively; a Single Pass of Filter → Count → Print is applied to the batches.]
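
The micro-batch idea behind this slide can be sketched in plain Python: the stream is cut into batches (stand-ins for RDDs), and the same Filter → Count → Print pass runs once over each batch. This is not Spark's API. The function names are hypothetical, and batching by a fixed number of events is an assumption made to keep the example deterministic; Spark Streaming cuts batches by time interval.

```python
def micro_batches(events, batch_size):
    """Group an incoming event stream into fixed-size batches (stand-ins for RDDs)."""
    batch = []
    for event in events:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # emit the final, possibly smaller, batch
        yield batch

def single_pass(batch):
    """Run Filter -> Count -> Print over one batch and return the count."""
    errors = [e for e in batch if "ERROR" in e]  # Filter
    count = len(errors)                          # Count
    print(f"errors in batch: {count}")           # Print
    return count

stream = ["INFO a", "ERROR b", "ERROR c", "INFO d", "INFO e", "ERROR f"]
counts = [single_pass(b) for b in micro_batches(stream, batch_size=3)]
```

Each batch is processed independently, so the per-batch counts are separate results rather than a running total.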

Storm

[Diagram: Storm word-count topology in three layers. Spout Layer: a Source feeding two Spouts. Fan-out Layer 1: four "Split words" bolts. Shuffle Layer 2: three "Count" bolts.]
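
The word-count topology on this slide can be imitated in plain Python to show how tuples move between layers. This is a sketch rather than Storm's API, and all names are hypothetical. One assumption to flag: the slide labels the second layer "Shuffle", while this sketch routes each word by a stable hash so that all occurrences of a word reach the same counter (the behavior Storm calls fields grouping), which is what makes per-word counts correct.

```python
import zlib
from collections import Counter

def spout():
    """Spout: emits sentences (hard-coded here in place of a real source)."""
    yield "kafka to hadoop"
    yield "hadoop to hbase"

def split_bolt(sentence):
    """Split-words bolt: fans one sentence out into many word tuples."""
    return sentence.split()

def route(word, n_counters):
    """Pick a counter by stable hash so a given word always lands on the same one."""
    return zlib.crc32(word.encode("utf-8")) % n_counters

def run_topology(n_counters=3):
    counters = [Counter() for _ in range(n_counters)]  # one Counter per count bolt
    for sentence in spout():
        for word in split_bolt(sentence):
            counters[route(word, n_counters)][word] += 1
    return counters

counters = run_topology()
totals = sum(counters, Counter())  # merge per-bolt counts for inspection
```

With purely random routing, a single word's count would be split across several counters and would need a further merge step.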

Retro Thoughts

Schema

• Data often has schema
• At least it should
• Kafka is unaware – which is good
• Need capability to figure out schema for events
• Without including it in every event
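
One way to meet the last two points is to store each schema once in a shared registry and have every event carry only a small schema ID. The sketch below is a minimal in-memory illustration of that idea. The `SchemaRegistry` class, the `encode` and `decode` helpers, and the JSON message layout are all hypothetical; they are not part of Kafka or of any existing registry product.

```python
import json

class SchemaRegistry:
    """Hypothetical in-memory registry mapping a small integer ID to a schema."""

    def __init__(self):
        self._by_id = {}
        self._ids = {}

    def register(self, schema):
        """Return the ID for this schema, assigning a new one on first sight."""
        key = json.dumps(schema, sort_keys=True)
        if key not in self._ids:
            schema_id = len(self._by_id) + 1
            self._ids[key] = schema_id
            self._by_id[schema_id] = schema
        return self._ids[key]

    def lookup(self, schema_id):
        return self._by_id[schema_id]

def encode(registry, schema, record):
    """Producer side: ship only the schema ID plus the field values."""
    schema_id = registry.register(schema)
    values = [record[field] for field in schema["fields"]]
    return json.dumps({"schema_id": schema_id, "values": values})

def decode(registry, message):
    """Consumer side: fetch the schema by ID and rebuild the named record."""
    payload = json.loads(message)
    schema = registry.lookup(payload["schema_id"])
    return dict(zip(schema["fields"], payload["values"]))

registry = SchemaRegistry()
click_schema = {"name": "click", "fields": ["user", "url"]}
message = encode(registry, click_schema, {"user": "alice", "url": "/home"})
record = decode(registry, message)
```

The message itself contains no field names, so the broker stays unaware of the schema while consumers can still recover it.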

Kafka in Cloudera Manager

Visit us at

Booth #305

BOOK SIGNINGS THEATER SESSIONS

TECHNICAL DEMOS GIVEAWAYS
