hadoop and storm - ajug talk
DESCRIPTION
Storm is a distributed realtime computation system. Similar to how Hadoop provides a set of general primitives for doing batch processing, Storm provides a set of general primitives for doing realtime computation. Storm often coexists in Big Data architectures with Hadoop. We will talk about different approaches to this interoperability between the systems, their benefits & trade-offs, and a new open source library available for high throughput use.TRANSCRIPT
©MapR Technologies
Hadoop and StormAJUG 5/21/2013
whoami
• Brad Anderson
• Solutions Architect at MapR (Atlanta)
• ATLHUG co-chair
• NoSQL East Conference 2009
• “boorad” most places (twitter, github)
Hadoop: A Paradigm Shift
Distributed computing platform– Large clusters– Commodity hardware
Pioneered at Google– Google File System, MapReduce and BigTable
Commercially available as Hadoop
Ship the Function to the Data
SAN/NAS
data data data
data data data
data data data
data data data
data data data
function
RDBMS
Traditional Architecture
data
function
data
function
data
function
data
function
data
function
data
function
data
function
data
function
data
function
data
function
data
function
data
function
Distributed Computing
MapReduce Flow
Input
Map Combine
Shuffleand sort
Reduce
Output
Reduce
Variation: No Reduce NecessaryExample: Batch File Transformation
Input
Map Output
MPG M4V
Variation: Multiple MapReducesExample: Fraud Detection in User Transactions
LDA training
Transaction data
LDA scoring
HBase /MapR M7 Edition
G2 score
Candidate events for
analyst review
95 %-ile LDA anomaly
MapReduce
http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation
Pig
MR Equivalent to Pig Script
Hive
MapR Distribution for Apache Hadoop
Complete Hadoop distribution
Comprehensive management suite
Industry-standard interfaces
Enterprise-grade dependability
Enterprise-grade security (US Intelligence Agency)
Patents - IP
Higher performance
Pig
Hive
HBase
Mahout
Oozie
Whirr
Map Reduce
Cascading
Nagios
Ganglia
MapR Control System
MapR Data Platform
MapR Control System
MapR Data Platform
Flume
Sqoop
HCatalog
Zookeeper
Avro
Map
Reduc
e
Hadoop Use Cases
ETL/EDW Offload
Sensor / Telemetry Data
Recommendation Engine
Search•ML algorithms•eDiscovery
Fleet Management
Fraud Detection / Risk Management
Traffic Decongestion
One Platform for Big Data
…
99.999% HA
Data Protection
Disaster Recovery
Scalability &
PerformanceEnterprise Integration
Multi-tenancy
MapReduce
File-Based Applications SQL Database Search Stream
Processing
Batch
Interactive
Realtime
BatchLog file Analysis
Data Warehouse OffloadFraud Detection
Clickstream Analytics
RealtimeSensor Analysis
“Twitterscraping”Telematics
Process Optimization
InteractiveForensic Analysis
Analytic ModelingBI User Focus
©MapR Technologies
Storm
“Hadoop for Realtime”
©MapR Technologies
Before Storm
Queues Workers
©MapR Technologies
Example
(simplified)
©MapR Technologies
Storm
Guaranteed data processing
Horizontal scalability
Fault-tolerance
No intermediate message brokers!
Higher level abstraction than message passing
“Just works”
©MapR Technologies
Unbounded sequence of tuples
Tuple Tuple Tuple Tuple Tuple Tuple Tuple
Streams
©MapR Technologies
Source of streams
Tuple Tuple Tuple Tuple Tuple Tuple Tuple
Tuple Tuple Tuple Tuple Tuple Tuple Tuple
Spouts
©MapR Technologies
public interface ISpout extends Serializable { void open(Map conf, TopologyContext context, SpoutOutputCollector collector); void close(); void nextTuple(); void ack(Object msgId); void fail(Object msgId);}
Spouts
©MapR Technologies
Processes input streams and produces new streams
Tuple Tuple Tuple Tuple Tuple Tuple Tuple
Tuple Tuple Tuple Tuple Tuple Tuple Tuple
Tuple Tuple Tuple Tuple
Bolts
©MapR Technologies
public class DoubleAndTripleBolt extends BaseRichBolt { private OutputCollectorBase _collector;
public void prepare(Map conf, TopologyContext context, OutputCollectorBase collector) { _collector = collector; }
public void execute(Tuple input) { int val = input.getInteger(0); _collector.emit(input, new Values(val*2, val*3)); _collector.ack(input); }
public void declareOutputFields(OutputFieldsDeclarer declarer) { declarer.declare(new Fields("double", "triple")); } }
Bolts
©MapR Technologies
Network of spouts and bolts
Topologies
©MapR Technologies
TridentTopology topology = new TridentTopology(); TridentState wordCounts = topology.newStream("spout1", spout) .each(new Fields("sentence"), new Split(), new Fields("word")) .groupBy(new Fields("word")) .persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count")) .parallelismHint(6);
Trident
Cascading for Storm
Storm
©MapR Technologies
Hadoop
batchprocesses
Apps
Business
Value
RawData
realtime processes
Queue (
Kafk
a)
Parallel Cluster Ingest
©MapR Technologies
Hadoop
batchprocesses
Apps
Business
Value
RawData
realtime processes
Storm
TailSpoutFr
anzQueue (
Kafk
a)
StormKafka
Twitter API
TweetLoggerKafka
ClusterKafka
ClusterKafka
Cluster
Kafka API
Storm
Web Service NAS
Web Data
Hadoop
Flume
HDFS Data
TwitterAPI
Catcher Storm
Topic Queue
Web-server
http
Web Data
MapR
TweetLogger
Scaling EstimatesTwitter Firehose
Old School – 8+ separate clusters, 20-25 nodes• >3 Kafka nodes• >2 TweetLoggers• 5-10 Hadoop• >2 Catcher nodes• >3 Storm• 3 zookeepers• NAS for web storage• >2 web servers
MapR – One Platform• 5-10 nodes total• Any node does any job• Full HA included• Backups included
©MapR Technologies
github
• Watch TailSpout & Franz development
• https://github.com/{tdunning | boorad | pfcurtis}/mapr-spout
• And our example Twitter implementation
• https://github.com/{tdunning | boorad | pfcurtis}/mapr-spout-test
Demo