c*ollege credit: cep distribtued processing on cassandra with storm
DESCRIPTION
Cassandra provides facilities to integrate with Hadoop. This is sufficient for distributed batch processing, but doesn’t address CEP distributed processing. This webinar will demonstrate use of Cassandra in Storm. Storm provides a data flow and processing layer that can be used to integrate Cassandra with other external persistences mechanisms (e.g. Elastic Search) or calculate dimensional counts for reporting and dashboards. We’ll dive into a sample Storm topology that reads and writes from Cassandra using storm-cassandra bolts.TRANSCRIPT
COMPLEX EVENT PROCESSING W/
CASSANDRA
Brian O’NeillLead Architect, Health Market [email protected]@boneill42
Taylor GoetzDevelopment Lead, Health Market [email protected]@ptgoetz
Use CaseWhat is CEP? Why? What for?
Storm BackgroundCluster configuration
Examples / Demo Future : Trident
Agenda
Our Products
Master Data ManagementGood, bad doctors?
Prescriber eligibility and remediation.
Cassandra to the Rescue
1000’s of Feeds
Δt
C* Masterfile
Big Data for us == Variety of Data
But…
Search unstructured data Real-time Analytics / Reporting Transactional Processing
Changes reflected immediately.Wide-row Indexes
What might that look like?
C*
RDBMS
I’m happy
Dro
pwiz
ard
wid
e-ro
w in
dex
Provide for Polyglot Persistence
What we did wrong… (part 1)
Could not react to transactional changes Needed extra logic to track what changed Took too long
What we did wrong… (part 2)
Good IntentionsTake the onus off the clients
Bad Result Guaranteeing executionWrite Overhead
C*Wide Row….Indexing
TRIGGERS
What is Complex Event Processing? Event processing is a method of tracking and
analyzing (processing) streams of information (data) about things that happen (events),[1] and deriving a conclusion from them.
Complex event processing, or CEP, is event processing that combines data from multiple sources[2] to infer events or patterns that suggest more complicated circumstances.
http://en.wikipedia.org/wiki/Complex_event_processing
Imagine…
Treating CRUD operations as events in a system.
Then Suddenly,
CEP = (ETL or Analytics)
What Storm is to us…
CrudOp
A High Throughput Data Processing Pipeline
RDBMSSoR
Dimensional Counts
ETL Enrichment
Fuzzy Index
Enter Storm
Storm Overview
Open-Sourced by Twitter in 2011 Distributed Realtime Computation System Fault Tolerant Highly Scalable Guaranteed Processing Operates on one or more streams of data (i.e.
CEP)
Anatomy of a Storm Cluster
NimbusMaster Node
ZookeeperCluster Coordination
SupervisorsWorker Nodes
Storm Components
SpoutsStream Sources
BoltsUnit of Computation
TopologiesCombination of n Spouts and n BoltsDefines the overall “Computation”
Storm Spouts
Represents a source (stream) of dataQueues (JMS, Kafka, Kestrel, etc.)Twitter FirehoseSensor Data
Emits “Tuples” (Events) based on sourcePrimary Storm data structureSet of Key-Value pairs
Storm Bolts
Receive Tuples from Spouts or other Bolts Operate on, or React to Data
Functions/Filters/Joins/AggregationsDatabase writes/lookups
Optionally emit additional Tuples
Storm Topologies
Data flow between spouts and bolts Routing of Tuples between spouts/bolts
Stream “Groupings” Parallelism of Components Long-Lived
Storm and Cassandra
Use Cases:Write Storm Tuple data to C*
○ Computation Results○ Pre-computed indices
Read data from C* and emit Storm Tuples○ Dynamic Lookups
http://github.com/hmsonline/storm-cassandra
Storm Cassandra Bolt Types
CassandraBolt
C*CassandraLookupBolt
STORM
CassandraBoltWrites data to CassandraAvailable in Batching and Non-Batching
CassandraLookupBoltReads data from Cassandra
http://github.com/hmsonline/storm-cassandra
Storm-Cassandra Project
Provides generic Bolts for writing/reading Storm Tuples to/from C*
TupleTuple
Mapper Rows
C*TuplesColumnsMapper Columns
STORM
http://github.com/hmsonline/storm-cassandra
Storm-Cassandra Project
TupleMapper InterfaceTells the CassandraBolt how to write a tuple to an
arbitrary data model
Given a Storm Tuple:Map to Column FamilyMap to Row KeyMap to Columns
http://github.com/hmsonline/storm-cassandra
Storm-Cassandra Project
ColumnsMapper InterfaceTells the CassandraLookupBolt how to transform a
C* row into a Storm Tuple
Given a C* Row Key and list of Columns:Return a list of Storm Tuples
http://github.com/hmsonline/storm-cassandra
Storm-Cassandra Project Current State:
Version 0.4.0-WIPUses Astyanax ClientSeveral out-of-the-box *Mapper Implementations:
○ Basic Key-Value Columns○ Value-less Columns○ Counter Columns○ Lookup by row key○ Lookup by range query
Initial pass at Trident supportInitial pass at Composite Column Support
http://github.com/hmsonline/storm-cassandra
Storm-Cassandra Project
Future Plans:Switch to CQL (???)Full Trident Support
http://github.com/hmsonline/storm-cassandra
Word Count Demo
http://github.com/hmsonline/storm-cassandra
DRPC
Reach Demo
Next Level : Trident
Trident
Provides a higher-level abstraction for stream processingConstructs for state management and Batching
Adds additional primitives that abstract away common topological patterns
Deprecates transactional topologies Distributes with Storm
Sample Trident Operations Partition Local
Functions ( execute(x) x + y )Filters ( isKeep(x) 0,x )PartitionAggregate
○ Combiner ( pairwise combining )○ Reducer ( iterative accumulation )○ Aggregator ( byoa )
A sample topologyTridentTopology topology = new TridentTopology(); TridentState wordCounts = topology.newStream("spout1", spout) .each(new Fields("sentence"),
new Split(), new Fields("word"))
.groupBy(new Fields("word")) .persistentAggregate(
MemcachedState.opaque(serverLocations), new Count(), new Fields("count"))
.parallelismHint(6);
https://github.com/nathanmarz/storm/wiki/Trident-state
Trident StateSequenced writes by batch/transaction id. Spouts
Transactional○ Batch contents never change
Opaque○ Batch contents can change
StateTransactional
○ Store tx_id with counts to maintain sequencing of writes.Opaque
○ Store previous value in order to overwrite the current value when contents of a batch change.
Shameless Shoutouts
HMS (https://github.com/hmsonline/)storm-cassandrastorm-elastic-searchstorm-jdbi (coming soon)
ptgoetz (https://github.com/ptgoetz) storm-jmsstorm-signals
Brian O’NeillLead Architect, Health Market [email protected]@boneill42
Taylor GoetzDevelopment Lead, Health Market [email protected]@ptgoetz
THANKS!