Learning Stream Processing with Apache Storm

Post on 30-Jun-2015


DESCRIPTION

Over the last couple of years, Apache Storm has become a de facto standard for developing real-time analytics and complex event processing applications. Storm lets you tackle real-time data processing challenges the same way Hadoop enables batch processing of Big Data: it gives companies "Fast Data" alongside their "Big Data". Typical use cases include fraud detection, operational intelligence, machine learning, ETL, and analytics. In this meetup, Eugene Dvorkin, Architect @ WebMD and NYC Storm User Group organizer, teaches Apache Storm and stream processing fundamentals. While the meeting is geared toward new Storm users, experienced users may find something interesting as well. The following topics will be covered:

• Why use Apache Storm?
• Common use cases
• Storm architecture: components, concepts, topology
• Building a simple Storm topology with Java and Groovy
• Trident and micro-batch processing
• Fault tolerance and guaranteed message delivery
• Running and monitoring Storm in production
• Kafka
• Storm at WebMD
• Resources

TRANSCRIPT


CONTACT ME @edvorkin


Real-time medical news from a curated Twitter feed

Every second, on average, around 6,000 tweets are tweeted on Twitter, which corresponds to over 350,000 tweets sent per minute, 500 million tweets per day

350,000 tweets per minute; even 1% of that stream is 3,500 tweets per minute

• How to scale

• How to deal with failures

• What to do with failed messages

• A lot of infrastructure concerns

• Complexity

• Tedious coding

*Image credit: Nathan Marz, SlideShare: Storm

Inherently BATCH-Oriented System

• Exponential rise in real-time data

• New business opportunity

• Economics of OSS and commodity hardware

Stream processing has emerged as a key use case*

*Source: Discover HDP2.1: Apache Storm for Stream Data Processing. Hortonworks. 2014

• Detecting fraud while someone is swiping a credit card

• Placing an ad on a website while someone is reading a specific article

• Alerts on application and machine failures

• Using stream processing in a batch-oriented fashion


Created by Nathan Marz

Acquired by Twitter

Open sourced

Apache Incubator Project

Part of Hortonworks HDP2 platform


Top Level Apache Project

Most mature, widely adopted framework

Source: http://storm.incubator.apache.org/

Process endless streams of data

1M+ messages / sec on a 10-15 node cluster


Guaranteed message processing


Tuples, Streams, Spouts, Bolts and Topologies


TUPLE

Storm's data type: an immutable list of key/value pairs of any data type

Example: word: "Hello", count: 25, frequency: 0.25
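A tuple can be pictured as a small immutable map from field names to values. The sketch below is plain Java for illustration only; Storm's real Tuple interface ships in storm-core, and the class and method names here are invented for the example:

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Conceptual sketch of a Storm tuple: an immutable list of named values.
public class TupleSketch {
    private final Map<String, Object> values;

    public TupleSketch(Map<String, Object> values) {
        // Defensive copy, then wrap as unmodifiable: the tuple is immutable.
        this.values = Collections.unmodifiableMap(new LinkedHashMap<>(values));
    }

    public Object getValueByField(String field) {
        return values.get(field);
    }

    public static void main(String[] args) {
        Map<String, Object> fields = new LinkedHashMap<>();
        fields.put("word", "Hello");
        fields.put("count", 25);
        fields.put("frequency", 0.25);
        TupleSketch tuple = new TupleSketch(fields);
        System.out.println(tuple.getValueByField("word"));  // Hello
    }
}
```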

STREAM

An unbounded sequence of tuples passed between nodes

SPOUT

The Source of the Stream

Reads from a stream of data: queues, web logs, API calls, databases

Spout responsibilities

BOLT

• Process tuples and perform actions: calculations, API calls, DB calls

• Produce new output stream based on computations

A bolt acts as a function F(x) over the tuple stream

• A topology is a network of spouts and bolts

• Defines data flow


• May have multiple spouts


• Each spout and bolt may have many instances that perform all the processing in parallel


How tuples are sent between instances of spouts and bolts: stream groupings

• Shuffle grouping: random distribution of tuples across bolt tasks.

• Fields grouping: routes tuples to bolt tasks based on the value of a field; the same value always routes to the same task.

• All grouping: replicates the tuple stream across all the bolt tasks; each task receives a copy of the tuple.

• Global grouping: routes all tuples in the stream to a single task; should be used with caution.
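The field-based routing described above can be illustrated with a tiny hash computation. This is an illustrative sketch, not Storm's internal code; it only shows why the same field value always lands on the same bolt task:

```java
import java.util.Objects;

// Sketch of value-based (fields) routing: hash the grouping field and take it
// modulo the number of bolt tasks, so equal values always pick the same task.
public class FieldsGroupingSketch {
    static int taskFor(Object fieldValue, int numTasks) {
        // floorMod keeps the result in [0, numTasks) even for negative hashes.
        return Math.floorMod(Objects.hashCode(fieldValue), numTasks);
    }

    public static void main(String[] args) {
        int tasks = 4;
        // The same word is always routed to the same task index:
        System.out.println(taskFor("storm", tasks) == taskFor("storm", tasks));  // true
    }
}
```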


Gradle:

compile 'org.apache.storm:storm-core:0.9.2'

Maven:

<dependency>
  <groupId>org.apache.storm</groupId>
  <artifactId>storm-core</artifactId>
  <version>0.9.2</version>
</dependency>

Example: word count over the sentence "Two households, both alike in dignity"

The spout emits tuples with a "sentence" field; the split bolt emits tuples with a "word" field.

Counts after one sentence: Two 1, Households 1, Both 1, Alike 1, In 1, Dignity 1

Final counts after many sentences: Two 20, Households 24, Both 22, Alike 1, In 1, Dignity 10

Data Source

SplitSentenceBolt

Resource initialization

WordCountBolt

PrinterBolt

Linking it all together
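Leaving the Storm runtime aside, the data flow of this topology (split a sentence into words, count each word, print the counts) can be sketched in plain Java. The method names mirror the bolts on the slides but are otherwise invented for the example:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java sketch of the word-count data flow: split -> count -> print.
public class WordCountSketch {
    // SplitSentenceBolt stage: lowercase, strip punctuation, split on whitespace.
    static List<String> splitSentence(String sentence) {
        return Arrays.asList(
            sentence.toLowerCase().replaceAll("[^a-z ]", "").split("\\s+"));
    }

    // WordCountBolt stage: keep a running count per word.
    static Map<String, Integer> countWords(List<String> words) {
        Map<String, Integer> counts = new HashMap<>();
        for (String w : words) counts.merge(w, 1, Integer::sum);
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts =
            countWords(splitSentence("Two households, both alike in dignity"));
        // PrinterBolt stage: emit the final counts.
        counts.forEach((word, count) -> System.out.println(word + " " + count));
    }
}
```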

How to scale stream processing


Storm main components

• Nodes: machines in a Storm cluster

• Workers: JVM processes running on a node; one or more per node

• Executors: Java threads running within a worker JVM process

• Tasks: instances of spouts and bolts


How tuples are sent between instances of spouts and bolts


Tuple tree

Reliable vs unreliable topologies

Methods from ISpout interface

Reliability in Bolts

Anchoring, ack, fail
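Storm's acker tracks a tuple tree with an XOR trick: each anchored tuple id is XORed into a checksum when it is emitted and again when it is acked, so the checksum returns to zero exactly when the whole tree has been processed. A minimal plain-Java sketch of the idea (not Storm's actual code):

```java
// Sketch of XOR-based tuple-tree tracking: x ^ x == 0, so XORing every id in
// twice (once on emit, once on ack) leaves zero only when all tuples are acked.
public class AckerSketch {
    private long ackVal = 0;

    void emitted(long tupleId) { ackVal ^= tupleId; }  // anchor a new tuple
    void acked(long tupleId)   { ackVal ^= tupleId; }  // tuple fully processed
    boolean treeComplete()     { return ackVal == 0; }

    public static void main(String[] args) {
        AckerSketch acker = new AckerSketch();
        acker.emitted(0xA1L);
        acker.emitted(0xB2L);
        acker.acked(0xA1L);
        System.out.println(acker.treeComplete());  // false: 0xB2 still pending
        acker.acked(0xB2L);
        System.out.println(acker.treeComplete());  // true: tree fully acked
    }
}
```

This constant-memory checksum is what lets Storm guarantee message processing for millions of in-flight tuples without storing the whole tree.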

Unit testing Storm components

a

BDD style of testing

Extending OutputCollector



Physical View


Deploying a topology to a cluster

storm jar wordcount-1.0.jar com.demo.storm.WordCountTopology word-count-topology

Monitoring and performance tuning


Run under supervision: Monit, supervisord

Nimbus moves work to another node

The Supervisor will restart the worker

Micro-Batch Stream Processing


Functions, Filters, aggregations, joins, grouping

Ordered batches of tuples. Batches can be partitioned.

Similar to Pig or Cascading

Transactional spouts

Trident has first-class abstractions for reading from and writing to stateful sources


Stream processed in small batches

• Each batch has a unique ID which is always the same on each replay

• If one tuple fails, the whole batch is reprocessed

• Higher throughput than core Storm, but higher latency as well

How does Trident provide exactly-once semantics?

Store the count along with the batch ID:

COUNT 100, BATCHID 1

After 10 more tuples arrive with batchId 2:

COUNT 110, BATCHID 2

Failure: batch 2 is replayed with the same batchId (2), so the stored state is not updated twice

• The spout must replay a batch exactly as it was played before

• The Trident API hides the complexity of dealing with batchIDs
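The update rule above can be sketched in a few lines of plain Java. This is a deliberate simplification of what a Trident state implementation does (names invented for the example): skip a batch whose id was already applied, so a replay never double-counts:

```java
// Sketch of exactly-once state updates keyed by batch id: a replayed batch
// carries the same id, so it is recognized and not applied a second time.
public class TridentStateSketch {
    long count = 0;
    long lastBatchId = 0;

    void applyBatch(long batchId, int tuplesInBatch) {
        if (batchId == lastBatchId) return;  // replayed batch: already applied
        count += tuplesInBatch;
        lastBatchId = batchId;
    }

    public static void main(String[] args) {
        TridentStateSketch state = new TridentStateSketch();
        state.applyBatch(1, 100);  // COUNT 100, BATCHID 1
        state.applyBatch(2, 10);   // COUNT 110, BATCHID 2
        state.applyBatch(2, 10);   // batch 2 replayed after a failure: skipped
        System.out.println(state.count);  // 110, not 120
    }
}
```

The sketch assumes batches are applied in order, which is exactly the ordering guarantee Trident provides for state updates.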

Word count with Trident

Styles of computation


Enhancing Twitter feed with lead Image and Title

• Readability enhancements

• Image scaling

• Remove duplicates

• Custom business logic

Writing a Twitter spout

Status

Use the Twitter4J Java library

Use an existing spout from the storm-contrib project on GitHub

Spouts exist for: Twitter, Kafka, JMS, RabbitMQ, Amazon SQS, Kinesis, MongoDB…

• Storm takes care of scalability and fault tolerance

• What happens if there is a burst in traffic?

Introducing Queuing Layer with Kafka


Solr Indexing

Processing Groovy rules (DSL) at scale in real time


Statsd and Storm Metrics API

http://www.michael-noll.com/blog/2013/11/06/sending-metrics-from-storm-to-graphite/

• Use cache if you can: for example Google Guava caching utilities

• In memory DB

• Tick tuples (for batch updates)
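As a stand-in for the Guava caching utilities mentioned above, here is a minimal LRU cache built on the JDK's own LinkedHashMap; the class name and capacity are invented for the example:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal LRU cache: LinkedHashMap in access order evicts the least recently
// used entry once the cache exceeds its capacity.
public class LruCacheSketch<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    public LruCacheSketch(int maxEntries) {
        super(16, 0.75f, true);  // accessOrder = true: iteration follows usage
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries;  // evict once over capacity
    }

    public static void main(String[] args) {
        LruCacheSketch<String, Integer> cache = new LruCacheSketch<>(2);
        cache.put("a", 1);
        cache.put("b", 2);
        cache.get("a");     // touch "a" so "b" becomes least recently used
        cache.put("c", 3);  // evicts "b"
        System.out.println(cache.keySet());  // [a, c]
    }
}
```

Inside a bolt, such a cache avoids a DB or API round trip per tuple for hot keys, which is usually the single biggest latency win.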

• Linear classification (Perceptron, Passive-Aggressive, Winnow, AROW)

• Linear regression (Perceptron, Passive-Aggressive)

• Clustering (KMeans)

• Feature scaling (standardization, normalization)

• Text feature extraction

• Stream statistics (mean, variance)

• Pre-Trained Twitter sentiment classifier

Trident-ML

http://www.michael-noll.com http://www.bigdata-cookbook.com/post/72320512609/storm-metrics-how-to http://svendvanderveken.wordpress.com/

edvorkin/Storm_Demo_Spring2GX

Go ahead. Ask away.
