Apache Samza: Reliable Stream Processing Atop Apache Kafka and Hadoop YARN

Apache Samza

Reliable Stream Processing Atop Apache Kafka and Hadoop YARN

Jakob Homan London HUG

Who I am

• Samza for five months
• Before that Hadoop, Hive, Giraph
• Say hi: @blueboxtraveler

Things we would like to do (better)

Provide timely, relevant updates to your newsfeed

Update search results with new information as it appears

Sculpt metrics and logs into useful shapes

Tools?

Response latency

RPC: Synchronous

Samza: Milliseconds to minutes

Later. Possibly much later.

Frame(work) of reference

                   Classic Hadoop                    Samza
Storage layer      HDFS                              Kafka
Execution engine   Map-Reduce                        YARN
API                map(k, v) => (k, v)               process(msg(k, v)) => msg(k, v)
                   reduce(k, list(v)) => (k, v)

Storage layer: Kafka

Apache Kafka

• Persistent, reliable, distributed message queue

Shiny new logo!

At LinkedIn

10+ billion writes per day

172k messages per second (average)

55+ billion messages per day to real-time consumers

Quick aside…

Kafka: First among (pluggable) equals

LinkedIn: Espresso and Databus

Coming soon? HDFS, ActiveMQ, Amazon SQS

Kafka in four bullet points

• Producers send messages to brokers
• Messages are key, value pairs
• Brokers store messages in topics for consumers
• Consumers pull messages from brokers
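The four points above can be sketched with a toy, in-memory broker. This is an illustration only and not the Kafka client API: the class `ToyBroker`, its `Message` type, and the `send`/`pull` methods are all hypothetical names chosen for the sketch.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy in-memory stand-in for a broker; NOT the Kafka client API.
class ToyBroker {
    // A message is a key/value pair.
    static final class Message {
        final String key;
        final String value;
        Message(String key, String value) { this.key = key; this.value = value; }
    }

    // Each topic is an append-only list of messages.
    private final Map<String, List<Message>> topics = new HashMap<>();

    // Producers send messages to the broker, which stores them under a topic.
    void send(String topic, String key, String value) {
        topics.computeIfAbsent(topic, t -> new ArrayList<>()).add(new Message(key, value));
    }

    // Consumers pull: they ask for everything from an offset that they track themselves.
    List<Message> pull(String topic, int fromOffset) {
        List<Message> log = topics.getOrDefault(topic, new ArrayList<>());
        int start = Math.max(0, fromOffset);
        if (start >= log.size()) return new ArrayList<>();
        return new ArrayList<>(log.subList(start, log.size()));
    }
}
```

The point of the sketch is the direction of control: the broker never pushes, and a consumer that remembers its last offset can simply pull again later to pick up where it left off.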

A Kafka Topic

Key 534: “Very sleepy”
Key 755: “Car nicked!”
Key 234: “The ref’s blind!”
Key 534: “Nicked a car!”

Topic: StatusUpdateEvent

Key: User ID of user who updated the status

Value: Timestamp, new status, geolocation, etc.

Kafka topics are partitioned

[Figure: messages, each a key plus message contents, spread across Partition 0, Partition 1, and Partition 2]

For our purposes, hash partitioned on the key!
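Hash partitioning on the key can be sketched in a few lines. This is a minimal illustration, assuming Java's `String.hashCode()` as the hash; Kafka's own partitioner may use a different hash function, and `ToyPartitioner` and `partitionFor` are hypothetical names.

```java
// Illustrative only: maps a message key to a partition number by hashing it.
class ToyPartitioner {
    static int partitionFor(String key, int numPartitions) {
        // floorMod keeps the result in [0, numPartitions) even when hashCode() is negative.
        return Math.floorMod(key.hashCode(), numPartitions);
    }
}
```

Whatever hash is used, the property that matters is that the same key always maps to the same partition, so all messages for one user end up in one place.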

A Samza job

Input topics

• StatusUpdateEvent
• NewConnectionEvent
• LikeUpdateEvent

Some code

class MyStreamTask implements StreamTask { … }

Output topics

• NewsUpdatePost
• UpdatesPerHourMetric

Execution engine: YARN

What we use YARN for

• Distributing our tasks across multiple machines
• Letting us know when one has died
• Distributing a replacement
• Isolating our tasks from each other

YARN: Execution and reliability

[Diagram labels: two machines; Node Manager 1 and Node Manager 2; Samza App Master; Samza TaskRunner: Partition 0, running MyStreamTask:process(); two Kafka Brokers]

Co-partitioning of topics


[Diagram: Samza TaskRunner: Partition 0 reads StatusUpdateEvent, Partition 0 and NewConnectionEvent, Partition 0, and writes to NewsUpdatePost]

An instance of StreamTask is responsible for a specific partition
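Why co-partitioning matters can be sketched as follows. The sketch assumes both input topics have the same number of partitions and are partitioned with the same hash; `CoPartitionDemo`, `deliver`, and `seenByTask` are hypothetical names, and a plain list stands in for each task instance.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: routes messages from several topics to one "task" per partition.
class CoPartitionDemo {
    static final int NUM_PARTITIONS = 3; // assumed equal for every input topic

    static int partitionFor(String key) {
        return Math.floorMod(key.hashCode(), NUM_PARTITIONS);
    }

    // One list per partition, standing in for the task instance that owns that partition.
    final List<List<String>> seenByTask = new ArrayList<>();

    CoPartitionDemo() {
        for (int i = 0; i < NUM_PARTITIONS; i++) seenByTask.add(new ArrayList<>());
    }

    // A message from any input topic goes to the task that owns the key's partition.
    void deliver(String topic, String key) {
        seenByTask.get(partitionFor(key)).add(topic + ":" + key);
    }
}
```

Because the partition depends only on the key, a user's StatusUpdateEvent and NewConnectionEvent messages reach the same task instance, which is what lets that instance keep per-user state locally.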

API: process()

public interface StreamTask {
  void process(IncomingMessageEnvelope envelope,
               MessageCollector collector,
               TaskCoordinator coordinator);
}

envelope: getKey(), getMsg()
collector: sendMsg(topic, key, value)
coordinator: commit(), shutdown()
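A task written against this API might look like the sketch below. The interfaces here are stand-ins shaped like the simplified method names on the slide, not the real Samza classes, so the `Sketch*` types, `ShoutTask`, and the `ShoutedStatus` topic are all assumptions made for illustration.

```java
// Stand-ins shaped like the slide's simplified API; NOT the real org.apache.samza classes.
interface SketchEnvelope { String getKey(); String getMsg(); }
interface SketchCollector { void sendMsg(String topic, String key, String value); }
interface SketchCoordinator { void commit(); void shutdown(); }
interface SketchStreamTask {
    void process(SketchEnvelope envelope, SketchCollector collector, SketchCoordinator coordinator);
}

// Hypothetical task: forwards each incoming message in upper case to an output topic.
class ShoutTask implements SketchStreamTask {
    @Override
    public void process(SketchEnvelope envelope, SketchCollector collector,
                        SketchCoordinator coordinator) {
        // Read from the envelope, write through the collector.
        collector.sendMsg("ShoutedStatus", envelope.getKey(), envelope.getMsg().toUpperCase());
        // In this sketch, commit() is just a signal that progress can be recorded.
        coordinator.commit();
    }
}
```

The task itself holds no threads, sockets, or offsets: it is handed one message at a time, and everything else stays with the framework.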

Awesome feature: State

• Generic data store interface
• Key-value out-of-box
  – More soon? Bloom filter, Lucene, etc.

• Restored by Samza upon task crash
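The slide does not say how restoration works. One common approach, assumed here purely for illustration, is to record every write in a durable changelog and replay it when a replacement task starts. `ChangelogStore` is a hypothetical class, and a plain list stands in for the durable log.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy key-value store that logs every write so a replacement can rebuild the same state.
class ChangelogStore {
    private final Map<String, String> data = new HashMap<>();
    private final List<String[]> changelog; // durable in a real system; a list in this sketch

    ChangelogStore(List<String[]> changelog) {
        this.changelog = changelog;
        // Restore: replay earlier writes in order, so the last write per key wins.
        for (String[] entry : changelog) {
            data.put(entry[0], entry[1]);
        }
    }

    void put(String key, String value) {
        changelog.add(new String[] {key, value}); // log first, then apply
        data.put(key, value);
    }

    String get(String key) {
        return data.get(key);
    }
}
```

Constructing a second `ChangelogStore` over the same log models a replacement task coming up after a crash: it sees the values the first store had written.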



Store state

(Pseudo)code snippet: Newsfeed

• Consume StatusUpdateEvent
  – Send those updates to all your connections via the NewsUpdatePost topic
• Consume NewConnectionEvent
  – Maintain state of connections to know who to send to

public class NewsFeed implements StreamTask {
  void process(envelope, collector, coordinator) {
    msg = envelope.getMsg()
    userId = msg.get("userID")
    if (msg.get("type") == STATUS_UPDATE) {
      foreach (conn : kvStore.get(userId)) {
        collector.send("NewsUpdatePost", new Msg(conn, msg.get("newStatus")))
      }
    } else {
      newConn = msg.get("newConnection")
      connections = kvStore.get(userId)
      kvStore.put(userId, connections ++ newConn)
    }
  }
}
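For readers who want to run the idea, here is a self-contained Java sketch of the same newsfeed logic. It does not use the real Samza API: `FeedCollector` and `NewsFeedSketch` are hypothetical stand-ins, messages are plain maps, an ordinary `HashMap` replaces the Samza-managed store, and the message type is assumed to be the string `STATUS_UPDATE`.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Stand-in for the collector; NOT the real Samza MessageCollector.
interface FeedCollector {
    void send(String topic, String recipient, String status);
}

class NewsFeedSketch {
    // Connections per user; a plain map stands in for the Samza-managed key-value store.
    private final Map<String, List<String>> kvStore = new HashMap<>();

    // Field names ("type", "userID", "newStatus", "newConnection") follow the slide's pseudocode.
    void process(Map<String, String> msg, FeedCollector collector) {
        String userId = msg.get("userID");
        if ("STATUS_UPDATE".equals(msg.get("type"))) {
            // Fan the new status out to every known connection of this user.
            for (String conn : kvStore.getOrDefault(userId, new ArrayList<>())) {
                collector.send("NewsUpdatePost", conn, msg.get("newStatus"));
            }
        } else {
            // Anything else is treated as a new connection and remembered for later updates.
            kvStore.computeIfAbsent(userId, k -> new ArrayList<>())
                   .add(msg.get("newConnection"));
        }
    }
}
```

Feeding it a new-connection message for a user and then a status update from that user results in one `NewsUpdatePost` addressed to the new connection. A status update from a user with no recorded connections sends nothing.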

Current status

Hello, Samza!

Cool, eh? bit.ly/hello-samza

Consume Wikipedia edits live

Up and running in 3 minutes

Generate stats on those edits

samza.incubator.apache.org bit.ly/samza_newbie_issues

Cheers!

• Quick start: bit.ly/hello-samza
• Project homepage: samza.incubator.apache.org
• Newbie issues: bit.ly/samza_newbie_issues
• Detailed Samza and YARN talk: bit.ly/samza_and_yarn
• Twitter: @samzastream
