Learning Stream Processing with Apache Storm

Post on 30-Jun-2015


DESCRIPTION

Over the last couple of years, Apache Storm has become a de facto standard for developing real-time analytics and complex event processing applications. Storm lets you tackle real-time data processing challenges the same way Hadoop enables batch processing of Big Data: it gives companies "Fast Data" alongside their "Big Data". Typical use cases include fraud detection, operational intelligence, machine learning, ETL, and analytics. In this meetup, Eugene Dvorkin, Architect @ WebMD and NYC Storm User Group organizer, teaches Apache Storm and stream processing fundamentals. While the meeting is geared toward new Storm users, experienced users may find something interesting as well. The following topics will be covered:

• Why use Apache Storm?
• Common use cases
• Storm architecture: components, concepts, topology
• Building a simple Storm topology with Java and Groovy
• Trident and micro-batch processing
• Fault tolerance and guaranteed message delivery
• Running and monitoring Storm in production
• Kafka
• Storm at WebMD
• Resources

TRANSCRIPT


CONTACT ME @edvorkin


Real-time medical news from a curated Twitter feed

Every second, on average, around 6,000 tweets are tweeted on Twitter, which corresponds to over 350,000 tweets sent per minute, 500 million tweets per day

350,000 tweets per minute; even 1% of that stream is 3,500 tweets per minute

• How to scale

• How to deal with failures

• What to do with failed messages

• A lot of infrastructure concerns

• Complexity

• Tedious coding

*Image credit: Nathan Marz, SlideShare: Storm

Inherently BATCH-Oriented System

• Exponential rise in real-time data

• New business opportunity

• Economics of OSS and commodity hardware

Stream processing has emerged as a key use case*

*Source: Discover HDP2.1: Apache Storm for Stream Data Processing. Hortonworks. 2014

• Detecting fraud while someone is swiping a credit card

• Placing an ad on a website while someone is reading a specific article

• Alerts on application and machine failures

• Using stream processing in a batch-oriented fashion


Created by Nathan Marz

Acquired by Twitter

Open sourced

Apache Incubator Project

Part of Hortonworks HDP2 platform


Top Level Apache Project

Most mature, widely adopted framework

Source: http://storm.incubator.apache.org/

Process endless streams of data

1M+ messages / sec on a 10-15 node cluster


Guaranteed message processing


Tuples, Streams, Spouts, Bolts and Topologies


TUPLE

Storm's data type: an immutable list of key/value pairs of any data type

Example: word: "Hello", count: 25, frequency: 0.25
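A tuple can be pictured as a small immutable map from field names to values. The sketch below is plain Java for illustration only; Storm's real Tuple interface ships in storm-core, and the class and method names here are invented for the example:

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Conceptual sketch of a Storm tuple: an immutable list of named values.
public class TupleSketch {
    private final Map<String, Object> values;

    public TupleSketch(Map<String, Object> values) {
        // Defensive copy, then wrap as unmodifiable: the tuple is immutable.
        this.values = Collections.unmodifiableMap(new LinkedHashMap<>(values));
    }

    public Object getValueByField(String field) {
        return values.get(field);
    }

    public static void main(String[] args) {
        Map<String, Object> fields = new LinkedHashMap<>();
        fields.put("word", "Hello");
        fields.put("count", 25);
        fields.put("frequency", 0.25);
        TupleSketch tuple = new TupleSketch(fields);
        System.out.println(tuple.getValueByField("word"));  // Hello
    }
}
```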

STREAM

An unbounded sequence of tuples passed between nodes

SPOUT

The Source of the Stream

Reads from a stream of data: queues, web logs, API calls, databases

Spout responsibilities

BOLT

• Process tuples and perform actions: calculations, API calls, DB calls

• Produce new output stream based on computations

A bolt acts as a function F(x) over the tuple stream

• A topology is a network of spouts and bolts

• Defines data flow


• May have multiple spouts


• Each spout and bolt may have many instances that perform all the processing in parallel


How tuples are sent between instances of spouts and bolts: stream groupings

• Shuffle grouping: random distribution of tuples across bolt tasks.

• Fields grouping: routes tuples to bolt tasks based on the value of a field; the same value always routes to the same task.

• All grouping: replicates the tuple stream across all the bolt tasks; each task receives a copy of the tuple.

• Global grouping: routes all tuples in the stream to a single task; should be used with caution.
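The field-based routing described above can be illustrated with a tiny hash computation. This is an illustrative sketch, not Storm's internal code; it only shows why the same field value always lands on the same bolt task:

```java
import java.util.Objects;

// Sketch of value-based (fields) routing: hash the grouping field and take it
// modulo the number of bolt tasks, so equal values always pick the same task.
public class FieldsGroupingSketch {
    static int taskFor(Object fieldValue, int numTasks) {
        // floorMod keeps the result in [0, numTasks) even for negative hashes.
        return Math.floorMod(Objects.hashCode(fieldValue), numTasks);
    }

    public static void main(String[] args) {
        int tasks = 4;
        // The same word is always routed to the same task index:
        System.out.println(taskFor("storm", tasks) == taskFor("storm", tasks));  // true
    }
}
```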


Gradle:

compile 'org.apache.storm:storm-core:0.9.2'

Maven:

<dependency>
  <groupId>org.apache.storm</groupId>
  <artifactId>storm-core</artifactId>
  <version>0.9.2</version>
</dependency>

Example: word count over the sentence "Two households, both alike in dignity"

The spout emits tuples with a "sentence" field; the split bolt emits tuples with a "word" field.

Counts after one sentence: Two 1, Households 1, Both 1, Alike 1, In 1, Dignity 1

Final counts after many sentences: Two 20, Households 24, Both 22, Alike 1, In 1, Dignity 10

Data Source

SplitSentenceBolt

Resource initialization

WordCountBolt

PrinterBolt

Linking it all together
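Leaving the Storm runtime aside, the data flow of this topology (split a sentence into words, count each word, print the counts) can be sketched in plain Java. The method names mirror the bolts on the slides but are otherwise invented for the example:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java sketch of the word-count data flow: split -> count -> print.
public class WordCountSketch {
    // SplitSentenceBolt stage: lowercase, strip punctuation, split on whitespace.
    static List<String> splitSentence(String sentence) {
        return Arrays.asList(
            sentence.toLowerCase().replaceAll("[^a-z ]", "").split("\\s+"));
    }

    // WordCountBolt stage: keep a running count per word.
    static Map<String, Integer> countWords(List<String> words) {
        Map<String, Integer> counts = new HashMap<>();
        for (String w : words) counts.merge(w, 1, Integer::sum);
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts =
            countWords(splitSentence("Two households, both alike in dignity"));
        // PrinterBolt stage: emit the final counts.
        counts.forEach((word, count) -> System.out.println(word + " " + count));
    }
}
```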

How to scale stream processing


Storm main components

• Nodes: machines in a Storm cluster

• Workers: JVM processes running on a node; one or more per node

• Executors: Java threads running within a worker JVM process

• Tasks: instances of spouts and bolts


How tuples are sent between instances of spouts and bolts


Tuple tree

Reliable vs unreliable topologies

Methods from ISpout interface

Reliability in Bolts

Anchoring, ack, fail
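Storm's acker tracks a tuple tree with an XOR trick: each anchored tuple id is XORed into a checksum when it is emitted and again when it is acked, so the checksum returns to zero exactly when the whole tree has been processed. A minimal plain-Java sketch of the idea (not Storm's actual code):

```java
// Sketch of XOR-based tuple-tree tracking: x ^ x == 0, so XORing every id in
// twice (once on emit, once on ack) leaves zero only when all tuples are acked.
public class AckerSketch {
    private long ackVal = 0;

    void emitted(long tupleId) { ackVal ^= tupleId; }  // anchor a new tuple
    void acked(long tupleId)   { ackVal ^= tupleId; }  // tuple fully processed
    boolean treeComplete()     { return ackVal == 0; }

    public static void main(String[] args) {
        AckerSketch acker = new AckerSketch();
        acker.emitted(0xA1L);
        acker.emitted(0xB2L);
        acker.acked(0xA1L);
        System.out.println(acker.treeComplete());  // false: 0xB2 still pending
        acker.acked(0xB2L);
        System.out.println(acker.treeComplete());  // true: tree fully acked
    }
}
```

This constant-memory checksum is what lets Storm guarantee message processing for millions of in-flight tuples without storing the whole tree.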

Unit testing Storm components

a

BDD style of testing

Extending OutputCollector



Physical View


Deploying a topology to a cluster

storm jar wordcount-1.0.jar com.demo.storm.WordCountTopology word-count-topology

Monitoring and performance tuning


Run under supervision: Monit, supervisord

Nimbus moves work to another node

The Supervisor will restart the worker

Micro-Batch Stream Processing


Functions, Filters, aggregations, joins, grouping

Ordered batches of tuples. Batches can be partitioned.

Similar to Pig or Cascading

Transactional spouts

Trident has first-class abstractions for reading from and writing to stateful sources


Stream processed in small batches

• Each batch has a unique ID which is always the same on each replay

• If one tuple fails, the whole batch is reprocessed

• Higher throughput than core Storm, but higher latency as well

How does Trident provide exactly-once semantics?

Store the count along with the batch ID:

COUNT 100, BATCHID 1

After 10 more tuples arrive with batchId 2:

COUNT 110, BATCHID 2

Failure: batch 2 is replayed with the same batchId (2), so the stored state is not updated twice

• The spout must replay a batch exactly as it was played before

• The Trident API hides the complexity of dealing with batchIDs
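The update rule above can be sketched in a few lines of plain Java. This is a deliberate simplification of what a Trident state implementation does (names invented for the example): skip a batch whose id was already applied, so a replay never double-counts:

```java
// Sketch of exactly-once state updates keyed by batch id: a replayed batch
// carries the same id, so it is recognized and not applied a second time.
public class TridentStateSketch {
    long count = 0;
    long lastBatchId = 0;

    void applyBatch(long batchId, int tuplesInBatch) {
        if (batchId == lastBatchId) return;  // replayed batch: already applied
        count += tuplesInBatch;
        lastBatchId = batchId;
    }

    public static void main(String[] args) {
        TridentStateSketch state = new TridentStateSketch();
        state.applyBatch(1, 100);  // COUNT 100, BATCHID 1
        state.applyBatch(2, 10);   // COUNT 110, BATCHID 2
        state.applyBatch(2, 10);   // batch 2 replayed after a failure: skipped
        System.out.println(state.count);  // 110, not 120
    }
}
```

The sketch assumes batches are applied in order, which is exactly the ordering guarantee Trident provides for state updates.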

Word count with Trident

Styles of computation


Enhancing Twitter feed with lead Image and Title

• Readability enhancements

• Image scaling

• Remove duplicates

• Custom business logic

Writing a Twitter spout

Status

Use the Twitter4J Java library

Use an existing spout from the storm-contrib project on GitHub

Spouts exist for: Twitter, Kafka, JMS, RabbitMQ, Amazon SQS, Kinesis, MongoDB…

• Storm takes care of scalability and fault tolerance

• What happens if there is a burst in traffic?

Introducing Queuing Layer with Kafka


Solr Indexing

Processing Groovy rules (DSL) at scale in real time


Statsd and Storm Metrics API

http://www.michael-noll.com/blog/2013/11/06/sending-metrics-from-storm-to-graphite/

• Use cache if you can: for example Google Guava caching utilities

• In memory DB

• Tick tuples (for batch updates)
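As a stand-in for the Guava caching utilities mentioned above, here is a minimal LRU cache built on the JDK's own LinkedHashMap; the class name and capacity are invented for the example:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal LRU cache: LinkedHashMap in access order evicts the least recently
// used entry once the cache exceeds its capacity.
public class LruCacheSketch<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    public LruCacheSketch(int maxEntries) {
        super(16, 0.75f, true);  // accessOrder = true: iteration follows usage
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries;  // evict once over capacity
    }

    public static void main(String[] args) {
        LruCacheSketch<String, Integer> cache = new LruCacheSketch<>(2);
        cache.put("a", 1);
        cache.put("b", 2);
        cache.get("a");     // touch "a" so "b" becomes least recently used
        cache.put("c", 3);  // evicts "b"
        System.out.println(cache.keySet());  // [a, c]
    }
}
```

Inside a bolt, such a cache avoids a DB or API round trip per tuple for hot keys, which is usually the single biggest latency win.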

• Linear classification (Perceptron, Passive-Aggressive, Winnow, AROW)

• Linear regression (Perceptron, Passive-Aggressive)

• Clustering (KMeans)

• Feature scaling (standardization, normalization)

• Text feature extraction

• Stream statistics (mean, variance)

• Pre-Trained Twitter sentiment classifier

Trident-ML

http://www.michael-noll.com http://www.bigdata-cookbook.com/post/72320512609/storm-metrics-how-to http://svendvanderveken.wordpress.com/

edvorkin/Storm_Demo_Spring2GX

Go ahead. Ask away.
