Apache Samza: Reliable Stream Processing Atop Apache Kafka and Hadoop YARN

Jakob Homan, London HUG

Upload: blueboxtraveler

Post on 27-Jan-2015

DESCRIPTION

Overview of Apache Samza presented to the London HUG, October 22, 2013

TRANSCRIPT

Page 1: Apache Samza: Reliable Stream Processing Atop Apache Kafka and Hadoop YARN

Apache Samza

Reliable Stream Processing Atop Apache Kafka and Hadoop YARN

Jakob Homan London HUG

Page 2: Apache Samza: Reliable Stream Processing Atop Apache Kafka and Hadoop YARN

Who I am

• Samza for five months
• Before that: Hadoop, Hive, Giraph
• Say hi: @blueboxtraveler

Page 3: Apache Samza: Reliable Stream Processing Atop Apache Kafka and Hadoop YARN

Things we would like to do (better)

Page 4: Apache Samza: Reliable Stream Processing Atop Apache Kafka and Hadoop YARN

Provide timely, relevant updates to your newsfeed

Page 5: Apache Samza: Reliable Stream Processing Atop Apache Kafka and Hadoop YARN

Update search results with new information as it appears

Page 6: Apache Samza: Reliable Stream Processing Atop Apache Kafka and Hadoop YARN

Sculpt metrics and logs into useful shapes

Page 7: Apache Samza: Reliable Stream Processing Atop Apache Kafka and Hadoop YARN

Tools?

Response latency

• RPC: synchronous
• Samza: milliseconds to minutes
• Later. Possibly much later.

Page 8: Apache Samza: Reliable Stream Processing Atop Apache Kafka and Hadoop YARN

Frame(work) of reference

Classic Hadoop

• Storage layer: HDFS
• Execution engine: Map-Reduce
• API: map(k, v) => (k, v); reduce(k, list(v)) => (k, v)

Samza

• Storage layer: Kafka
• Execution engine: YARN
• API: process(msg(k, v)) => msg(k, v)

Page 9: Apache Samza: Reliable Stream Processing Atop Apache Kafka and Hadoop YARN

Storage layer: Kafka

Page 10: Apache Samza: Reliable Stream Processing Atop Apache Kafka and Hadoop YARN

Apache Kafka

• Persistent, reliable, distributed message queue

Shiny new logo!

Page 11: Apache Samza: Reliable Stream Processing Atop Apache Kafka and Hadoop YARN

At LinkedIn

• 10+ billion writes per day
• 172k messages per second (average)
• 55+ billion messages per day to real-time consumers

Page 12: Apache Samza: Reliable Stream Processing Atop Apache Kafka and Hadoop YARN

Quick aside…

Kafka: First among (pluggable) equals

LinkedIn: Espresso and Databus

Coming soon? HDFS, ActiveMQ, Amazon SQS

Page 13: Apache Samza: Reliable Stream Processing Atop Apache Kafka and Hadoop YARN

Kafka in four bullet points

• Producers send messages to brokers
• Messages are key, value pairs
• Brokers store messages in topics for consumers
• Consumers pull messages from brokers
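The four bullets can be sketched as a toy, in-memory model. This is not the Kafka client API: `ToyBroker`, `Message`, `send`, and `pull` are hypothetical names invented for illustration, and the sketch ignores persistence, partitioning, and networking.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy stand-in for a broker (hypothetical; not the Kafka API).
final class ToyBroker {
    // Messages are key, value pairs.
    static final class Message {
        final String key;
        final String value;

        Message(String key, String value) {
            this.key = key;
            this.value = value;
        }
    }

    // The broker stores messages per topic, in arrival order.
    private final Map<String, List<Message>> topics = new HashMap<>();

    // Producers send messages to the broker.
    void send(String topic, String key, String value) {
        topics.computeIfAbsent(topic, t -> new ArrayList<>()).add(new Message(key, value));
    }

    // Consumers pull: they ask for up to maxMessages starting at an offset
    // that they keep track of themselves.
    List<Message> pull(String topic, int fromOffset, int maxMessages) {
        List<Message> log = topics.getOrDefault(topic, Collections.<Message>emptyList());
        int start = Math.min(Math.max(fromOffset, 0), log.size());
        int count = Math.max(0, Math.min(maxMessages, log.size() - start));
        return new ArrayList<>(log.subList(start, start + count));
    }
}
```

Because the consumer supplies the offset, it can re-read older messages simply by pulling from a smaller offset again.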

Page 14: Apache Samza: Reliable Stream Processing Atop Apache Kafka and Hadoop YARN

A Kafka Topic

• Key 534: “Very sleepy”
• Key 755: “Car nicked!”
• Key 234: “The ref’s blind!”
• Key 534: “Nicked a car!”

Topic: StatusUpdateEvent

Key: User ID of user who updated the status

Value: Timestamp, new status, geolocation, etc.

Page 15: Apache Samza: Reliable Stream Processing Atop Apache Kafka and Hadoop YARN

Kafka topics are partitioned

[Diagram: messages, each a key plus message contents, spread across Partition 0, Partition 1, and Partition 2]

For our purposes, hash partitioned on the key!
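Hash partitioning on the key can be sketched in a few lines. This is a minimal illustration: `KeyPartitioner` and `partitionFor` are hypothetical names, and `String.hashCode()` stands in for whatever hash function the producer actually uses.

```java
// Minimal sketch of hash partitioning on the message key.
// KeyPartitioner is a hypothetical helper, not part of Kafka or Samza.
final class KeyPartitioner {
    private KeyPartitioner() {}

    // Maps a key to a partition number in [0, numPartitions).
    static int partitionFor(String key, int numPartitions) {
        if (numPartitions <= 0) {
            throw new IllegalArgumentException("numPartitions must be positive");
        }
        // floorMod keeps the result non-negative even when hashCode() is negative.
        return Math.floorMod(key.hashCode(), numPartitions);
    }
}
```

The property that matters is that the same key always maps to the same partition, so every message for a given user ID lands in one place.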

Page 16: Apache Samza: Reliable Stream Processing Atop Apache Kafka and Hadoop YARN

A Samza job

Input topics

• StatusUpdateEvent
• NewConnectionEvent
• LikeUpdateEvent

Some code

class MyStreamTask implements StreamTask { … }

Output topics

• NewsUpdatePost
• UpdatesPerHourMetric

Page 17: Apache Samza: Reliable Stream Processing Atop Apache Kafka and Hadoop YARN

Execution engine: YARN

Page 18: Apache Samza: Reliable Stream Processing Atop Apache Kafka and Hadoop YARN

What we use YARN for

• Distributing our tasks across multiple machines
• Letting us know when one has died
• Distributing a replacement
• Isolating our tasks from each other

Page 19: Apache Samza: Reliable Stream Processing Atop Apache Kafka and Hadoop YARN

YARN: Execution and reliability

[Diagram: two machines, each with a YARN Node Manager (Node Manager 1, Node Manager 2), a Kafka Broker, and a Samza TaskRunner (Partition 0, Partition 1) that calls MyStreamTask:process(), alongside a Samza App Master]

Page 20: Apache Samza: Reliable Stream Processing Atop Apache Kafka and Hadoop YARN

Co-partitioning of topics

[Diagram: StatusUpdateEvent, Partition 0 and NewConnectionEvent, Partition 0 both feed Samza TaskRunner: Partition 0, which calls MyStreamTask:process() and writes to NewsUpdatePost]

An instance of StreamTask is responsible for a specific partition
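A rough sketch of why co-partitioning matters: if both input topics are partitioned the same way on the same key, every event for a given user reaches the same task instance, whichever topic it arrived on. `CoPartitionedRouter` and `PartitionTask` are hypothetical stand-ins for illustration, not Samza classes, and the sketch assumes both topics have the same number of partitions.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical router: one task instance per partition number, shared by
// every input topic (assumes all topics have the same partition count).
final class CoPartitionedRouter {
    // Stand-in for a StreamTask instance bound to one partition.
    static final class PartitionTask {
        final int partition;
        final List<String> seen = new ArrayList<>();

        PartitionTask(int partition) {
            this.partition = partition;
        }

        void process(String topic, String key, String value) {
            seen.add(topic + ": " + key + " -> " + value);
        }
    }

    private final int numPartitions;
    private final Map<Integer, PartitionTask> tasks = new HashMap<>();

    CoPartitionedRouter(int numPartitions) {
        if (numPartitions <= 0) {
            throw new IllegalArgumentException("numPartitions must be positive");
        }
        this.numPartitions = numPartitions;
    }

    // Delivers a message to the task that owns the key's partition and returns that task.
    PartitionTask route(String topic, String key, String value) {
        int partition = Math.floorMod(key.hashCode(), numPartitions);
        PartitionTask task = tasks.computeIfAbsent(partition, PartitionTask::new);
        task.process(topic, key, value);
        return task;
    }
}
```

Routing a StatusUpdateEvent and a NewConnectionEvent for the same user ID returns the same `PartitionTask`, which is what lets a task keep per-user state locally.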

Page 21: Apache Samza: Reliable Stream Processing Atop Apache Kafka and Hadoop YARN

API: process()

Page 22: Apache Samza: Reliable Stream Processing Atop Apache Kafka and Hadoop YARN

public interface StreamTask {
  void process(IncomingMessageEnvelope envelope,
               MessageCollector collector,
               TaskCoordinator coordinator);
}

• IncomingMessageEnvelope: getKey(), getMsg()
• MessageCollector: sendMsg(topic, key, value)
• TaskCoordinator: commit(), shutdown()
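To show how the three arguments fit together, here is a small sketch of a task that forwards each message to an output topic and commits every few messages. The interfaces are declared locally so the example compiles without Samza on the classpath; their method names follow the slide and should not be read as the signatures of any released Samza version. `ForwardingTask`, `SimpleEnvelope`, and the `commitEvery` parameter are hypothetical.

```java
// Local stand-ins that mirror the method names on the slide (assumption:
// simplified String keys and values; not the real Samza types).
interface IncomingMessageEnvelope {
    String getKey();
    String getMsg();
}

interface MessageCollector {
    void sendMsg(String topic, String key, String value);
}

interface TaskCoordinator {
    void commit();
    void shutdown();
}

interface StreamTask {
    void process(IncomingMessageEnvelope envelope,
                 MessageCollector collector,
                 TaskCoordinator coordinator);
}

// Hypothetical envelope implementation used to feed the task.
final class SimpleEnvelope implements IncomingMessageEnvelope {
    private final String key;
    private final String msg;

    SimpleEnvelope(String key, String msg) {
        this.key = key;
        this.msg = msg;
    }

    @Override public String getKey() { return key; }
    @Override public String getMsg() { return msg; }
}

// Forwards every message to NewsUpdatePost and commits every commitEvery messages.
final class ForwardingTask implements StreamTask {
    private final int commitEvery;
    private int sinceLastCommit = 0;

    ForwardingTask(int commitEvery) {
        if (commitEvery <= 0) {
            throw new IllegalArgumentException("commitEvery must be positive");
        }
        this.commitEvery = commitEvery;
    }

    @Override
    public void process(IncomingMessageEnvelope envelope,
                        MessageCollector collector,
                        TaskCoordinator coordinator) {
        collector.sendMsg("NewsUpdatePost", envelope.getKey(), envelope.getMsg());
        sinceLastCommit++;
        if (sinceLastCommit >= commitEvery) {
            coordinator.commit();
            sinceLastCommit = 0;
        }
    }
}
```

The envelope is read-only input, the collector is the only way out, and the coordinator is how a task asks the framework to checkpoint or stop.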

Page 23: Apache Samza: Reliable Stream Processing Atop Apache Kafka and Hadoop YARN

Awesome feature: State

• Generic data store interface
• Key-value out-of-box
  – More soon? Bloom filter, Lucene, etc.
• Restored by Samza upon task crash

[Diagram: Samza TaskRunner: Partition 0 calls MyStreamTask:process(), which reads and writes the task's stored state]
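One way restoration after a crash could work is to record every write in a durable log and replay that log when the task restarts. The sketch below assumes that approach; the slide itself does not say how Samza does it. `ChangelogBackedStore` is a hypothetical class, and the in-memory list stands in for a durable log.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical key-value store that can be rebuilt from a log of its writes.
final class ChangelogBackedStore {
    private final Map<String, String> data = new HashMap<>();
    private final List<String[]> changelog;

    // Replays every logged {key, value} entry in order to rebuild the state.
    ChangelogBackedStore(List<String[]> changelog) {
        this.changelog = changelog;
        for (String[] entry : changelog) {
            data.put(entry[0], entry[1]);
        }
    }

    // Appends the write to the log before applying it to the local map.
    void put(String key, String value) {
        changelog.add(new String[] {key, value});
        data.put(key, value);
    }

    String get(String key) {
        return data.get(key);
    }
}
```

A replacement task that is handed the same log ends up with the same contents as the task that crashed, because later entries for a key overwrite earlier ones during replay.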

Page 24: Apache Samza: Reliable Stream Processing Atop Apache Kafka and Hadoop YARN

(Pseudo)code snippet: Newsfeed

• Consume StatusUpdateEvent
  – Send those updates to all your connections via the NewsUpdatePost topic
• Consume NewConnectionEvent
  – Maintain state of connections to know who to send to

Page 25: Apache Samza: Reliable Stream Processing Atop Apache Kafka and Hadoop YARN

public class NewsFeed implements StreamTask {
  void process(envelope, collector, coordinator) {
    msg = envelope.getMsg()
    userId = msg.get("userID")
    if (msg.get("type") == STATUS_UPDATE) {
      // Fan the new status out to every stored connection of this user
      foreach (conn : kvStore.get(userId)) {
        collector.send("NewsUpdatePost", new Msg(conn, msg.get("newStatus")))
      }
    } else {
      // NewConnectionEvent: remember the new connection for later fan-out
      newConn = msg.get("newConnection")
      connections = kvStore.get(userId)
      kvStore.put(userId, connections ++ newConn)
    }
  }
}
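The pseudocode above can be turned into a small runnable sketch. Everything here is a simplification made for illustration: messages are plain `Map<String, String>` values, the key-value store is an in-memory map, and `NewsFeedSketch` and `Sender` are hypothetical names rather than Samza types.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Runnable approximation of the newsfeed pseudocode (hypothetical types throughout).
final class NewsFeedSketch {
    static final String STATUS_UPDATE = "STATUS_UPDATE";

    // Stand-in for the collector: sends (key, value) to a named topic.
    interface Sender {
        void send(String topic, String key, String value);
    }

    // Stand-in for the task's key-value state: user ID -> that user's connections.
    private final Map<String, List<String>> kvStore = new HashMap<>();

    void process(Map<String, String> msg, Sender collector) {
        String userId = msg.get("userID");
        if (STATUS_UPDATE.equals(msg.get("type"))) {
            // Fan the new status out to every known connection of this user.
            List<String> connections =
                kvStore.getOrDefault(userId, Collections.<String>emptyList());
            for (String conn : connections) {
                collector.send("NewsUpdatePost", conn, msg.get("newStatus"));
            }
        } else {
            // Treat anything else as a new-connection event and remember it.
            kvStore.computeIfAbsent(userId, id -> new ArrayList<>())
                   .add(msg.get("newConnection"));
        }
    }
}
```

This only works because both event types for a user reach the same task instance, so the connection list written by one branch is visible to the other.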

Page 26: Apache Samza: Reliable Stream Processing Atop Apache Kafka and Hadoop YARN

Current status

Page 27: Apache Samza: Reliable Stream Processing Atop Apache Kafka and Hadoop YARN

Hello, Samza!

• Up and running in 3 minutes
• Consume Wikipedia edits live
• Generate stats on those edits

Cool, eh? bit.ly/hello-samza

Page 28: Apache Samza: Reliable Stream Processing Atop Apache Kafka and Hadoop YARN

samza.incubator.apache.org

bit.ly/samza_newbie_issues

Page 29: Apache Samza: Reliable Stream Processing Atop Apache Kafka and Hadoop YARN

Cheers!

• Quick start: bit.ly/hello-samza
• Project homepage: samza.incubator.apache.org
• Newbie issues: bit.ly/samza_newbie_issues
• Detailed Samza and YARN talk: bit.ly/samza_and_yarn
• Twitter: @samzastream