cassandra and storm at health market sceince
TRANSCRIPT
STORM AND CASSANDRA AT HEALTH MARKET
SCIENCE
P. Taylor GoetzDevelopment Lead, Health Market [email protected]@ptgoetz
Cassandra @ HMS Storm Overview Storm-Cassandra Examples / Demo Future : Trident
Agenda
Our Products
Master Data ManagementGood, bad doctors?
Prescriber eligibility and remediation.
Cassandra to the Rescue
1000’s of Feeds
Δt
C* Masterfile
Big Data for us == Variety of Data
But…
Search unstructured data Real-time Analytics / Reporting Transactional Processing
Changes reflected immediately.Wide-row Indexes
What might that look like?
C*
RDBMS
I’m happy
Dro
pwiz
ard
wid
e-ro
w in
dex
Provide for Polyglot Persistence
What we did wrong…
Could not react to transactional changes Needed extra logic to track what changed Took too long
C* at HMS
Load, Standardize, Match, ConsolidateWrite results to C*
Track Changes over TimePractitioner DataFeed Quality
C* at HMS
Best PracticesPrefer Write over Read (esp. Read before Write)Avoid Queries/Scans
○ Fetch by key whenever possible○ Put Comparators to work○ Pre-compute whenever possible
How?
Treat All Data as ImmutableUpdates are inserts with new version/timestamp
Data ModelHeavy use of compositesTimestamps/Versions in Keys
Treat Feeds as Real-Time Streams
What Storm is to us…
CrudOp
A High Throughput Data Processing Pipeline
RDBMSSoR
Dimensional Counts
ETL Enrichment
Fuzzy Index
Enter Storm
Storm Overview
Open-Sourced by Twitter in 2011 Distributed Realtime Computation System Fault Tolerant Highly Scalable Guaranteed Processing Operates on one or more streams of data
Anatomy of a Storm Cluster
NimbusMaster Node
ZookeeperCluster Coordination
SupervisorsWorker Nodes
Storm Primatives
StreamsUnbounded sequence of tuples
SpoutsStream Sources
BoltsUnit of Computation
TopologiesCombination of n Spouts and n BoltsDefines the overall “Computation”
Storm Spouts
Represents a source (stream) of dataQueues (JMS, Kafka, Kestrel, etc.)Twitter FirehoseSensor Data
Emits “Tuples” (Events) based on sourcePrimary Storm data structureSet of Key-Value pairs
Storm Bolts
Receive Tuples from Spouts or other Bolts Operate on, or React to Data
Functions/Filters/Joins/AggregationsDatabase writes/lookups
Optionally emit additional Tuples
Storm Topologies
Data flow between spouts and bolts Routing of Tuples between spouts/bolts
Stream “Groupings” Parallelism of Components Long-Lived
Storm Topologies
Storm and Cassandra
Use Cases:Write Storm Tuple data to C*
○ Computation Results○ Pre-computed indices
Read data from C* and emit Storm Tuples○ Dynamic Lookups
http://github.com/hmsonline/storm-cassandra
Storm Cassandra Bolt Types
CassandraBolt
C*CassandraLookupBolt
STORM
CassandraBoltWrites data to CassandraAvailable in Batching and Non-Batching
CassandraLookupBoltReads data from Cassandra
http://github.com/hmsonline/storm-cassandra
Storm-Cassandra Project
Provides generic Bolts for writing/reading Storm Tuples to/from C*
TupleTuple
Mapper Rows
C*TuplesColumnsMapper Columns
STORM
http://github.com/hmsonline/storm-cassandra
Storm-Cassandra Project
TupleMapper InterfaceTells the CassandraBolt how to write a tuple to an
arbitrary data model
Given a Storm Tuple:Map to Column FamilyMap to Row KeyMap to Columns
http://github.com/hmsonline/storm-cassandra
Storm-Cassandra Project
ColumnsMapper InterfaceTells the CassandraLookupBolt how to transform a
C* row into a Storm Tuple
Given a C* Row Key and list of Columns:Return a list of Storm Tuples
http://github.com/hmsonline/storm-cassandra
Storm-Cassandra Project Current State:
Version 0.4.0Uses Astyanax ClientSeveral out-of-the-box *Mapper Implementations:
○ Basic Key-Value Columns○ Value-less Columns○ Counter Columns○ Lookup by row key○ Lookup by range query
Composite Key/Column SupportTrident support
http://github.com/hmsonline/storm-cassandra
Storm-Cassandra Project
Future Plans:Switch to CQLEnhanced Trident Support
http://github.com/hmsonline/storm-cassandra
Word Count Demo
http://github.com/hmsonline/storm-cassandra
DRPC
Reach Demo
Next Level : Trident
Trident
Provides a higher-level abstraction for stream processingConstructs for state management and Batching
Adds additional primitives that abstract away common topological patterns
Deprecates transactional topologies Distributes with Storm
Sample Trident Operations Partition Local
Functions ( execute(x) x + y )Filters ( isKeep(x) 0,x )PartitionAggregate
○ Combiner ( pairwise combining )○ Reducer ( iterative accumulation )○ Aggregator ( byoa )
A sample topologyTridentTopology topology = new TridentTopology(); TridentState wordCounts = topology.newStream("spout1", spout) .each(new Fields("sentence"),
new Split(), new Fields("word"))
.groupBy(new Fields("word")) .persistentAggregate(
MemcachedState.opaque(serverLocations), new Count(), new Fields("count"))
.parallelismHint(6);
https://github.com/nathanmarz/storm/wiki/Trident-state
Trident StateSequenced writes by batch/transaction id. Spouts
Transactional○ Batch contents never change
Opaque○ Batch contents can change
StateTransactional
○ Store tx_id with counts to maintain sequencing of writes.Opaque
○ Store previous value in order to overwrite the current value when contents of a batch change.
Shameless Shoutouts
HMS (https://github.com/hmsonline/)storm-cassandrastorm-elastic-searchstorm-jdbi (coming soon)
ptgoetz (https://github.com/ptgoetz) storm-jmsstorm-signals
P. Taylor GoetzDevelopment Lead, Health Market [email protected]@ptgoetz
THANKS!