Storm and Cassandra
DESCRIPTION
Slides from a talk given at the NYC Cassandra Meetup, discussing how Storm works and how it integrates well with Apache Cassandra. The talk also segues into an example project that uses Storm and Cassandra to implement a scalable, reactive web crawler: http://github.com/tjake/stormscraper
TRANSCRIPT
Storm and Cassandra
Cassandra NYC Meetup 11/5/2013
Jake Luciani (@tjake)
What is Storm?
• Distributed event processor
• Provides constructs to reliably process all events
• Simple conceptual model
• New to Apache Incubator: http://wiki.apache.org/incubator/StormProposal
Storm Concepts
Spout - Collects work and submits it to be processed. Tracks the success or failure of each tuple.
Bolt - Processes tuples and optionally emits more tuples.
Tuple - A collection of data that is passed within Storm.
Stream - Identifies the outputs of a spout or bolt. Forces tuples to have a declared structure.
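The four concepts above can be illustrated with a small, framework-free Python sketch. It does not use Storm's API: `ListSpout`, `SplitBolt`, and `run` are hypothetical names invented for this illustration, and a real topology would run spouts and bolts on separate workers rather than in one loop.

```python
from dataclasses import dataclass, field


@dataclass
class ListSpout:
    """Toy spout: emits items from a list and records ack/fail per tuple id."""
    items: list
    acked: set = field(default_factory=set)
    failed: set = field(default_factory=set)

    def next_tuples(self):
        # Each tuple carries an id so its outcome can be reported back.
        for tuple_id, value in enumerate(self.items):
            yield tuple_id, {"sentence": value}

    def ack(self, tuple_id):
        self.acked.add(tuple_id)

    def fail(self, tuple_id):
        self.failed.add(tuple_id)


class SplitBolt:
    """Toy bolt: consumes a 'sentence' tuple and emits one 'word' tuple per word."""
    output_fields = ("word",)  # the declared structure of the output stream

    def process(self, tup):
        if not isinstance(tup["sentence"], str):
            raise TypeError("sentence must be a string")
        return [{"word": w} for w in tup["sentence"].split()]


def run(spout, bolt):
    """Drive the spout through the bolt, acking or failing each input tuple."""
    emitted = []
    for tuple_id, tup in spout.next_tuples():
        try:
            emitted.extend(bolt.process(tup))
            spout.ack(tuple_id)
        except Exception:
            spout.fail(tuple_id)
    return emitted


# The second item is malformed on purpose, so the spout sees one failure.
spout = ListSpout(["storm and cassandra", None, "hello"])
words = run(spout, SplitBolt())
```

After the run, `spout.acked` and `spout.failed` hold the ids of the tuples that succeeded and failed, which is the information a spout would need to decide what to replay.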
Storm Topologies
A directed graph of spouts and bolts connected via streams
[Diagram labels: Host A, Host B, Host C; Zookeeper; Firehose; A-F, G-P, Q-Z; Cassandra (optional)]
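The idea of a topology as a directed graph can be sketched in a few lines of Python. This is only a model of the graph structure, not Storm's topology builder; the `Topology` class and the component names in the usage lines are assumptions made for illustration.

```python
from collections import defaultdict, deque


class Topology:
    """Toy directed graph of named components connected by streams."""

    def __init__(self):
        self.edges = defaultdict(list)  # source component -> downstream components
        self.components = set()

    def connect(self, source, target):
        # A stream from `source` to `target` is one directed edge.
        self.components.update((source, target))
        self.edges[source].append(target)

    def downstream(self, start):
        """Return every component reachable from `start`, in breadth-first order."""
        seen, order, queue = {start}, [], deque([start])
        while queue:
            node = queue.popleft()
            for nxt in self.edges[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    order.append(nxt)
                    queue.append(nxt)
        return order


# Hypothetical wiring: one spout feeding a two-stage chain of bolts.
topo = Topology()
topo.connect("firehose_spout", "counter_bolt")
topo.connect("counter_bolt", "ranker_bolt")
reachable = topo.downstream("firehose_spout")
```

Walking the graph from a spout shows which bolts a tuple from that spout can eventually reach.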
Example Topologies
• Track the top 10 most popular links being shared in the last N minutes.
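As a rough illustration of the "top 10 links in the last N minutes" example, here is a single-process sliding-window counter in Python. It is a sketch under stated assumptions: events arrive in non-decreasing timestamp order, everything fits in memory, and the class name `SlidingTopLinks` is hypothetical. A real topology would split the counting across bolts.

```python
from collections import Counter, deque


class SlidingTopLinks:
    """Counts link shares inside a time window and reports the most popular."""

    def __init__(self, window_seconds, top_n=10):
        self.window_seconds = window_seconds
        self.top_n = top_n
        self.events = deque()   # (timestamp, url), oldest first
        self.counts = Counter()

    def _expire(self, now):
        # Drop events that have fallen out of the window ending at `now`.
        while self.events and self.events[0][0] <= now - self.window_seconds:
            _, url = self.events.popleft()
            self.counts[url] -= 1
            if self.counts[url] == 0:
                del self.counts[url]

    def record(self, url, timestamp):
        self.events.append((timestamp, url))
        self.counts[url] += 1
        self._expire(timestamp)

    def top(self, now):
        self._expire(now)
        return self.counts.most_common(self.top_n)


tracker = SlidingTopLinks(window_seconds=60, top_n=2)
tracker.record("http://a.example", 0)
tracker.record("http://b.example", 10)
tracker.record("http://a.example", 20)
tracker.record("http://c.example", 30)
leader = tracker.top(now=30)[0]
```

Calling `top` with a later `now` expires old shares first, so a link that was popular an hour ago drops out of the ranking on its own.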
Where does data end up?
• Storm supports built-in RPC, so client requests can effectively become a spout.
• Put the data into a database…
• Why Cassandra though?
Why Cassandra?
• Cassandra’s data model allows incremental modifications to rows.
• Different bolts can update different parts of a Cassandra row asynchronously.
Example
StormScraper
A web crawling system built on Storm + Cassandra
http://github.com/tjake/stormscraper
StormScraper C* Data Model
CREATE TABLE pages (
    url text,
    scrape_date timestamp,
    title text,
    html text,
    text text,
    inbound_links set<text>,
    outbound_links set<text>,
    PRIMARY KEY (url, scrape_date)
);
CREATE TABLE scrape_list (
    url text PRIMARY KEY,
    last_update timestamp,
    depth int
);
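To make the "different bolts update different parts of a row" point concrete against the `pages` table, the Python sketch below builds the kind of CQL `UPDATE` each writer might issue. These are not StormScraper's actual queries: the function names are hypothetical, and the `%s` placeholders are an assumption about the client driver's bind style. Nothing here connects to a cluster; each function only returns a statement and its parameters.

```python
from datetime import datetime, timezone

# Each function returns (cql, params) for a hypothetical writer bolt.
# All three target the same row (url, scrape_date) but different columns,
# so they can run in any order without reading the row first.


def html_write(url, scrape_date, title, html):
    cql = ("UPDATE pages SET title = %s, html = %s "
           "WHERE url = %s AND scrape_date = %s")
    return cql, (title, html, url, scrape_date)


def text_write(url, scrape_date, text):
    cql = "UPDATE pages SET text = %s WHERE url = %s AND scrape_date = %s"
    return cql, (text, url, scrape_date)


def link_write(url, scrape_date, outbound_links):
    # Adding to a set column appends elements instead of overwriting the set.
    cql = ("UPDATE pages SET outbound_links = outbound_links + %s "
           "WHERE url = %s AND scrape_date = %s")
    return cql, (set(outbound_links), url, scrape_date)


scraped_at = datetime(2013, 11, 5, tzinfo=timezone.utc)
statements = [
    html_write("http://example.com", scraped_at, "Example", "<html>...</html>"),
    text_write("http://example.com", scraped_at, "Example text"),
    link_write("http://example.com", scraped_at, ["http://example.org"]),
]
```

Because a CQL `UPDATE` is an upsert, whichever writer runs first creates the row and the others fill in their own columns afterwards.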
StormScraper Topology
[Diagram, built up over successive slides. Components: Url Spout, Scraper Bolt, Text Extraction Bolt, Html Writer, Link Writer, Text Writer, Cassandra. The final slides add a "Fail" label.]
Storm Summary
• Powerful
• But easy to make mistakes
• Wrong tuple expectation, names, types
• Bad topology wiring
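One way to catch the mistakes listed above before a topology runs is to compare the fields each component declares with the fields its downstream components expect. The Python sketch below is a hypothetical pre-flight check, not a Storm feature; `check_wiring` and all component and field names are assumptions for illustration.

```python
def check_wiring(declared, expected, wiring):
    """Return a list of human-readable wiring problems (empty if none).

    declared: component -> tuple of field names it emits
    expected: component -> tuple of field names it reads from its input
    wiring:   list of (source, target) edges
    """
    problems = []
    for source, target in wiring:
        if source not in declared:
            problems.append(f"{target}: unknown upstream component {source!r}")
            continue
        missing = [f for f in expected.get(target, ()) if f not in declared[source]]
        if missing:
            problems.append(f"{target}: {source} does not declare {missing}")
    return problems


declared = {"url_spout": ("url", "depth"), "scraper": ("url", "html")}
expected = {"scraper": ("url",), "text_extractor": ("url", "body")}
wiring = [
    ("url_spout", "scraper"),        # fine: 'url' is declared
    ("scraper", "text_extractor"),   # wrong field name: 'body' vs 'html'
    ("scrapper", "html_writer"),     # typo in the upstream component name
]
problems = check_wiring(declared, expected, wiring)
```

The check covers names only; type mismatches between what a bolt emits and what the next bolt assumes would need a separate check.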
Thank You! Q&A?