Apache Storm Basics

Apache Storm Parallel Real Time Computation

Upload: joao-paulo-leonidas-fernandes-dias-da-silva

Post on 07-Apr-2017


TRANSCRIPT

Page 1: Apache Storm Basics

Apache Storm: Parallel Real Time Computation

Page 2: Apache Storm Basics

What’s Storm

• It’s a distributed real time computation system

• It’s free and open source

Page 3: Apache Storm Basics

Storm Applications

• Real time analytics

• Online machine learning

• Distributed RPC

• Others

Page 4: Apache Storm Basics

Storm Qualities

• Broad set of use cases

• Scalable

• Guaranteed no data loss

• Robust / Fault Tolerant

• Programming language agnostic

Page 5: Apache Storm Basics

Storm Architecture

Page 6: Apache Storm Basics

Streams

• A stream is an unbounded sequence of tuples.

• Streams are defined with a schema that names the fields in the stream’s tuples.
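
As a rough illustration in plain Python (not the Storm API), a stream can be modelled as a generator that never ends, with a schema that names each field. The field names `user_id` and `url` and the function `click_stream` are made up for this sketch.

```python
from itertools import count, islice

# Hypothetical schema: the names of the fields in every tuple of the stream.
SCHEMA = ("user_id", "url")

def click_stream():
    """An unbounded sequence of tuples that all follow SCHEMA."""
    for n in count():
        yield (n % 3, f"/page/{n}")

# The stream never ends, so a consumer takes only as many tuples as it needs.
first = list(islice(click_stream(), 4))

# The schema lets a consumer refer to values by field name.
named = [dict(zip(SCHEMA, t)) for t in first]
print(named[0])
```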

Page 7: Apache Storm Basics

Spouts

• A spout is a source of streams for a given topology.

• It reads data from an external source and emits it into the topology as tuples.
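
A minimal sketch of the spout idea in plain Python (not the Storm API): an in-memory queue stands in for the external source, and the hypothetical `queue_spout` emits each message as a one-field tuple.

```python
import queue

def queue_spout(external):
    """Drain an external queue, emitting each message as a one-field tuple."""
    while True:
        try:
            message = external.get_nowait()
        except queue.Empty:
            return  # this sketch stops here; a real spout would keep polling
        yield (message,)

# Stand-in for an external system such as a message broker.
external = queue.Queue()
for msg in ["a", "b", "c"]:
    external.put(msg)

emitted = list(queue_spout(external))
print(emitted)
```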

Page 8: Apache Storm Basics

Bolts

• A bolt is the processing element in the topology.

• Bolts can do simple stream transformations such as filtering, aggregations, functions, and joins.
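
The kinds of transformation listed above can be sketched as chained Python generators. This is only an illustration of the concept, not Storm code; the bolt names and the sample words are invented.

```python
from collections import Counter

def lowercase_bolt(stream):
    """Function bolt: normalise each word to lower case."""
    for (word,) in stream:
        yield (word.lower(),)

def filter_bolt(stream, min_len=3):
    """Filter bolt: drop words shorter than min_len."""
    for (word,) in stream:
        if len(word) >= min_len:
            yield (word,)

def count_bolt(stream):
    """Aggregation bolt: count how often each word appears."""
    counts = Counter()
    for (word,) in stream:
        counts[word] += 1
    return counts

words = [("The",), ("cow",), ("is",), ("the",), ("COW",)]
counts = count_bolt(filter_bolt(lowercase_bolt(words)))
print(dict(counts))
```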

Page 9: Apache Storm Basics

Topologies

• A topology contains all the logic for the realtime application.

• A topology is a graph of spouts and bolts that are connected by stream groupings.
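
One way to picture the graph is as a mapping from each component to the components it subscribes to, with the grouping recorded on each edge. The word-count components below are hypothetical, and the sketch uses Python's standard `graphlib` rather than any Storm class.

```python
from graphlib import TopologicalSorter

# Hypothetical word-count topology: each component maps to its upstream
# components, together with the grouping used on that edge.
topology = {
    "sentences": {},                      # spout: no inputs
    "split": {"sentences": "shuffle"},    # bolt fed by the spout
    "count": {"split": "fields(word)"},   # bolt fed by the split bolt
}

# Components without inputs are the spouts; everything else is a bolt.
spouts = [name for name, inputs in topology.items() if not inputs]

# A valid processing order follows the edges from spouts to downstream bolts.
graph = {name: set(inputs) for name, inputs in topology.items()}
order = list(TopologicalSorter(graph).static_order())
print(spouts, order)
```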

Page 10: Apache Storm Basics

Tasks

• Each spout or bolt executes as a number of tasks across the cluster.

• Each task corresponds to one thread of execution.

• Stream groupings define how to send tuples from one set of tasks to another set of tasks.

Page 11: Apache Storm Basics

Stream Groupings

• A stream grouping defines for a given bolt which streams it should receive as input.

• A stream grouping also defines how the stream’s tuples are partitioned among the bolt tasks.

Page 12: Apache Storm Basics

Shuffle Grouping

• Tuples are randomly distributed across the bolt's tasks such that each task is guaranteed to get an equal number of tuples.
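
One possible way to get both randomness and an even load is to walk through a freshly shuffled list of tasks on every round. The sketch below is plain Python and an assumption about the approach, not Storm's actual code; `shuffle_grouping` is a made-up name.

```python
import random
from collections import Counter

def shuffle_grouping(num_tasks, seed=None):
    """Yield target task indices: random order, but even over each round."""
    rng = random.Random(seed)
    while True:
        tasks = list(range(num_tasks))
        rng.shuffle(tasks)
        yield from tasks

chooser = shuffle_grouping(4, seed=0)
assignments = [next(chooser) for _ in range(1000)]

# Every round of 4 tuples hits each task once, so 1000 tuples split evenly.
load = Counter(assignments)
print(sorted(load.items()))
```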

Page 13: Apache Storm Basics

Fields Grouping

• The stream is partitioned by the fields specified in the grouping.

• If the stream is grouped by the "user-id" field, tuples with the same "user-id" will always go to the same task.
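
A minimal sketch of the idea: hash the values of the grouping fields and take the result modulo the number of tasks. The CRC32 hash is a stand-in chosen for this example, not necessarily what Storm uses, and the sample tuples are invented.

```python
import zlib

def fields_grouping(tup, schema, group_fields, num_tasks):
    """Pick a task index from the values of the grouping fields only."""
    key = "|".join(str(tup[schema.index(f)]) for f in group_fields)
    return zlib.crc32(key.encode("utf-8")) % num_tasks

schema = ("user-id", "action")
tuples = [("alice", "login"), ("bob", "view"), ("alice", "logout")]

# Both "alice" tuples land on the same task, whatever their other fields are.
tasks = [fields_grouping(t, schema, ["user-id"], 4) for t in tuples]
print(tasks)
```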

Page 14: Apache Storm Basics

Global Grouping

• The entire stream goes to a single one of the bolt's tasks. Specifically, it goes to the task with the lowest id.
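
In a sketch, the rule reduces to always choosing the smallest task id. The task ids and partial counts below are made up; a typical use is merging partial results at a single place.

```python
def global_grouping(task_ids):
    """Every tuple goes to the task with the lowest id."""
    return min(task_ids)

task_ids = [7, 5, 6]  # hypothetical ids of the receiving bolt's tasks

# Partial counts emitted upstream all arrive at the same task.
partial_counts = [3, 4, 5]
received = {task: [] for task in task_ids}
for value in partial_counts:
    received[global_grouping(task_ids)].append(value)

total = sum(received[5])
print(received, total)
```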

Page 15: Apache Storm Basics

Workers

• Topologies execute across one or more worker processes.

• Each worker process is a physical JVM and executes a subset of all the tasks for the topology.

• If the combined parallelism of the topology is 300 and 50 workers are allocated, then each worker will execute 6 tasks.
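
The arithmetic in the last bullet is 300 / 50 = 6. The sketch below checks it by spreading tasks over workers round-robin; that placement rule is an assumption made for the example, since the real placement is decided by Storm's scheduler.

```python
from collections import Counter

def assign_tasks(num_tasks, num_workers):
    """Spread tasks over workers round-robin (illustrative placement only)."""
    return {task: task % num_workers for task in range(num_tasks)}

assignment = assign_tasks(num_tasks=300, num_workers=50)

# Count how many tasks each worker ended up with.
per_worker = Counter(assignment.values())
print(set(per_worker.values()))
```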

Page 16: Apache Storm Basics

A Basic Storm Topology

Page 17: Apache Storm Basics

A (not so) Basic Storm Topology

Page 18: Apache Storm Basics

Demo

Page 19: Apache Storm Basics

Thanks!