Storm

Distributed and fault-tolerant realtime computation

Nathan Marz, Twitter


DESCRIPTION

My presentation of Storm at the Bay Area Hadoop User Group on January 18th, 2012.

TRANSCRIPT

Page 1: Storm

Nathan Marz, Twitter

Distributed and fault-tolerant realtime computation

Storm

Page 2: Storm

Basic info

• Open sourced September 19th, 2011

• Implementation is 12,000 lines of code

• Used by over 25 companies

• >2280 watchers on GitHub (most watched JVM project)

• Very active mailing list

• >1700 messages

• >520 members

Page 3: Storm

Hadoop

Batch computation

Distributed, fault-tolerant

Page 4: Storm

Storm

Realtime computation

Distributed, fault-tolerant

Page 5: Storm

Hadoop

• Large, finite jobs

• Process a lot of data at once

• High latency

Page 6: Storm

Storm

• Infinite computations called topologies

• Process infinite streams of data

• Tuple-at-a-time computational model

• Low latency

Page 7: Storm

Before Storm

Queues Workers

Page 8: Storm

Example

(simplified)

Page 9: Storm

Example

Workers schemify tweets and append to Hadoop

Page 10: Storm

Example

Workers update statistics on URLs by incrementing counters in Cassandra

Page 11: Storm

Problems

• Scaling is painful

• Poor fault-tolerance

• Coding is tedious

Page 12: Storm

What we want

• Guaranteed data processing

• Horizontal scalability

• Fault-tolerance

• No intermediate message brokers!

• Higher level abstraction than message passing

• “Just works”

Page 13: Storm

Storm

Guaranteed data processing

Horizontal scalability

Fault-tolerance

No intermediate message brokers!

Higher level abstraction than message passing

“Just works”

Page 14: Storm

Use cases

• Stream processing

• Continuous computation

• Distributed RPC

Page 15: Storm

Storm Cluster

Page 16: Storm

Storm Cluster

The master node runs “Nimbus” (similar to the Hadoop JobTracker)

Page 17: Storm

Storm Cluster

A ZooKeeper cluster is used for cluster coordination

Page 18: Storm

Storm Cluster

The worker nodes run the worker processes

Page 19: Storm

Starting a topology
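The command on this slide isn't in the transcript. For reference, a topology is started with the storm client by running the jar's main class, along these lines (jar, class, and topology names here are placeholders):

    storm jar mytopology.jar com.mycompany.MyTopology my-topology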

Page 20: Storm

Killing a topology
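And a running topology is stopped by name, roughly:

    storm kill my-topology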

Page 21: Storm

Concepts

• Streams

• Spouts

• Bolts

• Topologies

Page 22: Storm

Streams

Unbounded sequence of tuples

Tuple → Tuple → Tuple → Tuple → Tuple → Tuple → Tuple → ...

Page 23: Storm

Spouts

Source of streams

Page 24: Storm

Spout examples

• Read from Kestrel queue

• Read from Twitter streaming API

Page 25: Storm

Bolts

Process input streams and produce new streams

Page 26: Storm

Bolts

• Functions

• Filters

• Aggregation

• Joins

• Talk to databases

Page 27: Storm

Topology

Network of spouts and bolts

Page 28: Storm

Tasks

Spouts and bolts execute as many tasks across the cluster

Page 29: Storm

Task execution

Tasks are spread across the cluster


Page 31: Storm

Stream grouping

When a tuple is emitted, which task does it go to?

Page 32: Storm

Stream grouping

• Shuffle grouping: pick a random task

• Fields grouping: mod hashing on a subset of tuple fields (see the sketch after this list)

• All grouping: send to all tasks

• Global grouping: pick task with lowest id
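A rough illustration of the fields-grouping rule from the list above (illustrative Java only, not Storm's actual internals): the chosen field values are hashed and the result is taken mod the consumer's task count, so tuples with equal field values always reach the same task.

    import java.util.List;

    // Hash the grouping field values and mod by the consumer's task count,
    // so equal field values always route to the same task.
    static int chooseTask(List<Object> groupingValues, int numConsumerTasks) {
        return Math.floorMod(groupingValues.hashCode(), numConsumerTasks);
    }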

Page 33: Storm

Topology

(Diagram: a topology whose stream connections are labeled with their groupings: shuffle, fields on [“url”], fields on [“id1”, “id2”], and all)

Page 34: Storm

Streaming word count

TopologyBuilder is used to construct topologies in Java
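The builder code on this and the next few slides isn't in the transcript; what follows is a reconstruction in the style of the storm-starter word-count example (backtype.storm-era API with integer component ids; class names and parallelism values are assumptions). It starts with the builder itself:

    TopologyBuilder builder = new TopologyBuilder();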

Page 35: Storm

Streaming word count

Define a spout in the topology with parallelism of 5 tasks
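Continuing the sketch, the spout is registered under a component id together with its parallelism (the spout class is assumed):

    // Component id 1, executed as 5 tasks across the cluster
    builder.setSpout(1, new RandomSentenceSpout(), 5);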

Page 36: Storm

Streaming word count

Split sentences into words with parallelism of 8 tasks
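Sketch continued: the bolt subscribes to the spout's stream with a shuffle grouping, spreading sentences randomly across its tasks:

    // Component id 2, 8 tasks; shuffle grouping picks a random task per tuple
    builder.setBolt(2, new SplitSentence(), 8).shuffleGrouping(1);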

Page 37: Storm

Consumer decides what data it receives and how it gets grouped

Streaming word count

Split sentences into words with parallelism of 8 tasks

Page 38: Storm

Streaming word count

Create a word count stream
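Sketch continued: a fields grouping on “word” guarantees that every occurrence of a given word goes to the same counting task, which is what makes the counts correct (parallelism assumed):

    // Component id 3; fields grouping on "word" keeps per-word counts on one task
    builder.setBolt(3, new WordCount(), 12).fieldsGrouping(2, new Fields("word"));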

Page 39: Storm

Streaming word count

splitsentence.py
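This slide shows the bolt implemented in Python via Storm's multilang protocol. On the Java side such a bolt is declared as a ShellBolt that runs the script as a subprocess, roughly as in storm-starter:

    public static class SplitSentence extends ShellBolt implements IRichBolt {
        public SplitSentence() {
            super("python", "splitsentence.py");  // run the Python bolt as a subprocess
        }
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word"));
        }
        public Map<String, Object> getComponentConfiguration() {
            return null;  // no component-specific configuration
        }
    }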

Page 40: Storm

Streaming word count

Page 41: Storm

Streaming word count

Submitting topology to a cluster
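Submission, sketched with an assumed topology name and worker count:

    Config conf = new Config();
    conf.setNumWorkers(4);  // number of worker processes to request
    StormSubmitter.submitTopology("word-count", conf, builder.createTopology());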

Page 42: Storm

Streaming word count

Running topology in local mode
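Local mode simulates a Storm cluster in-process, which is how the demos in this talk run; a sketch:

    LocalCluster cluster = new LocalCluster();
    cluster.submitTopology("word-count", conf, builder.createTopology());
    Utils.sleep(10000);  // let the topology run briefly (illustrative)
    cluster.shutdown();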

Page 43: Storm

Demo

Page 44: Storm

Distributed RPC

Data flow for Distributed RPC
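From the caller's perspective a DRPC invocation looks like a blocking function call; a hedged sketch of the client side (host name assumed, 3772 being the conventional DRPC port):

    DRPCClient client = new DRPCClient("drpc-host", 3772);
    // Invokes the "reach" function served by a topology on the cluster
    String reach = client.execute("reach", "http://example.com/");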

Page 45: Storm

DRPC Example

Computing “reach” of a URL on the fly

Page 46: Storm

Reach

Reach is the number of unique people exposed to a URL on Twitter

Page 47: Storm

Computing reach

(Diagram: URL → tweeters of the URL → followers of each tweeter → distinct followers → count = reach)

Page 48: Storm

Reach topology
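The topology code on these slides isn't in the transcript; a reconstruction along the lines of storm-starter's ReachTopology (class names and parallelism values assumed):

    LinearDRPCTopologyBuilder builder = new LinearDRPCTopologyBuilder("reach");
    builder.addBolt(new GetTweeters(), 3);          // URL -> tweeters
    builder.addBolt(new GetFollowers(), 12)         // tweeter -> followers
           .shuffleGrouping();
    builder.addBolt(new PartialUniquer(), 6)        // dedupe per request id
           .fieldsGrouping(new Fields("id", "follower"));
    builder.addBolt(new CountAggregator(), 2)       // sum the partial counts
           .fieldsGrouping(new Fields("id"));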


Page 51: Storm

Reach topology

Keep a set of followers for each request id in memory

Page 52: Storm

Reach topology

Update the followers set when a new follower is received

Page 53: Storm

Reach topology

Emit the partial count after receiving all followers for a request id
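A self-contained sketch of the partial-uniquing logic described on the last three slides (plain Java with assumed method names; the real bolt is built on Storm's batch/coordination machinery):

    import java.util.*;

    class PartialUniquerSketch {
        private final Map<Object, Set<String>> followers = new HashMap<>();

        // Called for each incoming (request id, follower) tuple.
        void receive(Object requestId, String follower) {
            followers.computeIfAbsent(requestId, k -> new HashSet<>())
                     .add(follower);  // duplicate followers collapse in the set
        }

        // Called once all followers for the request id have arrived:
        // emit this task's partial distinct count and free the memory.
        int partialCount(Object requestId) {
            return followers.remove(requestId).size();
        }
    }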

Page 54: Storm

Demo

Page 55: Storm

Guaranteeing message processing

“Tuple tree”

Page 56: Storm

Guaranteeing message processing

• A spout tuple is not fully processed until all tuples in the tree have been completed

Page 57: Storm

Guaranteeing message processing

• If the tuple tree is not completed within a specified timeout, the spout tuple is replayed

Page 58: Storm

Guaranteeing message processing

Reliability API

Page 59: Storm

Guaranteeing message processing

“Anchoring” creates a new edge in the tuple tree

Page 60: Storm

Guaranteeing message processing

Marks a single node in the tree as complete
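Putting the two calls together, a bolt using the reliability API anchors each emit to its input tuple and then acks the input; a minimal sketch of an execute method (the collector field is assumed):

    public void execute(Tuple tuple) {
        for (String word : tuple.getString(0).split(" ")) {
            // Anchoring: passing the input tuple adds an edge to the tuple tree
            collector.emit(tuple, new Values(word));
        }
        // Ack marks this node of the tree as complete
        collector.ack(tuple);
    }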

Page 61: Storm

Guaranteeing message processing

• Storm tracks tuple trees for you in an extremely efficient way

Page 62: Storm

Transactional topologies

How do you do idempotent counting with an at-least-once delivery guarantee?

Page 63: Storm

Transactional topologies

Won’t you overcount?

Page 64: Storm

Transactional topologies

Transactional topologies solve this problem

Page 65: Storm

Transactional topologies

Built completely on top of Storm’s primitives of streams, spouts, and bolts

Page 66: Storm

Transactional topologies

Process small batches of tuples

(Diagram: Batch 1 → Batch 2 → Batch 3 flowing through the topology)

Page 67: Storm

Transactional topologies

If a batch fails, replay the whole batch

Page 68: Storm

Transactional topologies

Once a batch is completed, commit the batch

Page 69: Storm

Transactional topologies

Bolts can optionally implement a “commit” method

Page 70: Storm

Transactional topologies

Commits are ordered. If there’s a failure during commit, the whole batch + commit is retried.

(Diagram: Commit 1 → Commit 2 → Commit 3 → Commit 4 → Commit 4, the last commit retried after a failure)

Page 71: Storm

Example

Page 72: Storm

Example

New instance of this object for every transaction attempt

Page 73: Storm

Example

Aggregate the count for this batch

Page 74: Storm

Example

Only update the database if transaction ids differ

Page 75: Storm

Example

This enables idempotency since commits are ordered
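A self-contained sketch of the trick these example slides describe (names assumed; not the exact code from the deck). The stored count carries the id of the transaction that last wrote it, so a replayed commit is detected and skipped:

    import java.math.BigInteger;
    import java.util.HashMap;
    import java.util.Map;

    class IdempotentCounter {
        static class Stored {
            final long count; final BigInteger txId;
            Stored(long count, BigInteger txId) { this.count = count; this.txId = txId; }
        }

        private final Map<String, Stored> db = new HashMap<>();  // stand-in for a real store

        void commit(BigInteger txId, long batchCount) {
            Stored prev = db.get("count");
            if (prev == null || !txId.equals(prev.txId)) {
                long base = (prev == null) ? 0 : prev.count;
                // First commit of this transaction: fold in the batch's count
                db.put("count", new Stored(base + batchCount, txId));
            }
            // Same txId: the batch already committed and is being replayed,
            // so skipping the write prevents overcounting
        }
    }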

Page 76: Storm

Example

(Credit goes to the Kafka guys for figuring out this trick)

Page 77: Storm

Transactional topologies

Multiple batches can be processed in parallel, but commits are guaranteed to be ordered

Page 78: Storm

Transactional topologies

• Will be available in the next version of Storm (0.7.0)

• Requires a source queue that can replay identical batches of messages

• Aiming for the first TransactionalSpout implementation to use Kafka

Page 79: Storm

Storm UI

Page 80: Storm

Storm on EC2

https://github.com/nathanmarz/storm-deploy

One-click deploy tool

Page 81: Storm

Starter code

https://github.com/nathanmarz/storm-starter

Example topologies

Page 82: Storm

Documentation

Page 83: Storm

Ecosystem

• Scala, JRuby, and Clojure DSLs

• Kestrel, AMQP, JMS, and other spout adapters

• Serializers

• Multilang adapters

• Cassandra, MongoDB integration

Page 84: Storm

Questions?

http://github.com/nathanmarz/storm