apache storm concepts
TRANSCRIPT
Apache Storm ConceptsAndré Dias
Highlights
● Background● Storm’s History● Concepts● Integrations● Real Cases● Q&A
Background
BackgroundBig Data V’s
BackgroundBig Data V’s
VolumePetabytes / Terabytes
VelocityReal-time / Near Real-time
VarietySensors, Blog Posts, Logs, Social Networks...
BackgroundData Streaming
BackgroundData Streaming
BackgroundData Streaming
Now Please… I’m in traffic…
● Discovery● Ingest● Process● Persist● Analyze● Expose
BackgroundData Value Chain
● Discovery● Ingest● Process● Persist● Analyze● Expose
BackgroundData Value Chain (Storm Focus)
BackgroundData Value Chain
● Discovery● Ingest● Process● Persist● Analyze● Expose
BackgroundData Value Chain (Ingest)
Stream ProcessingData in Motion
Batch Processing
● Data processing architecture
○ Generic
○ Scalable
○ Fault-tolerant (human/hardware)
● Low latency
BackgroundLambda Architecture (LA)
BackgroundLambda Architecture (LA)
Storm’s HistoryNathan Marz
● Backtype (2008), acquired by Twitter (2011)
● Lambda Architecture Creator
● September, 2011. Storm Creation
● September, 2013. Storm entered to ASF
Why Use It?● Scalable, as Hadoop
● No data loss, reliable
● Fault tolerant
● Language agnostic
● Real-time, real fast
Why Use It?
Storm is TLP !!
Concepts
Concepts
● Tuple● Streams● Spouts● Bolts● Topologies / Trident API● Stream Groupings
ConceptsSpouts
● Source of Streams
● Data Consumers (Ingestion)
● Emits Tuples
ConceptsBolts
● Units of Work to tuples
● Data streaming logic
● Can emit tuples as well
● Data store integration
ConceptsTopology
● Data Streaming Flow Representation
● DAG (Direct Acyclic Graph) of Spouts and
Bolts
● Streaming computation
● Each node as individual task (parallel
execution)
● Stateless
ConceptsTrident API
● Abstraction Layer over low-level Storm API
● More Complex Topologies
● Stateful
● Micro-batch
● High-level API (similar to Pig / Cascading -
Hadoop)
● Message processed at least once
(guaranteed)
ConceptsTrident API - Micro Batch
● Trident Batches
○ are Ordered
ConceptsTrident API - Micro Batch
● Trident Batches
○ can be Partitioned
ConceptsStreaming Groups
● Data Flow Control over
Topologies
Architecture
ArchitectureComponents - Nimbus
● Master node (similar to JobTracker)
● Monitor and distribute the processing
workload across worker nodes
● Stores all its data into Zookeeper
ArchitectureComponents - Supervisor
● Worker node (similar to TaskTracker)
● Monitor and distribute the processing
workload across worker nodes
● Stores all its data into Zookeeper
ArchitectureOverview
● Master-slave approach
● Cluster coordination
(Zookeeper)
● Nimbus HA
Integrations
Real Cases
Collector sensor information to a
Data Lake
Micro-batch user contents, content feeds
and application logs
Real-time user music
recommendations
Q&A
THANK YOU!Use Storm