Cassandra and Storm at Health Market Science

STORM AND CASSANDRA AT HEALTH MARKET SCIENCE. P. Taylor Goetz, Development Lead, Health Market Science. [email protected] @ptgoetz

Upload: p-taylor-goetz

Post on 06-May-2015


TRANSCRIPT

Page 1: Cassandra and Storm at Health Market Science

STORM AND CASSANDRA AT HEALTH MARKET SCIENCE

P. Taylor Goetz
Development Lead, Health Market Science
[email protected]
@ptgoetz

Page 2: Cassandra and Storm at Health Market Science

Agenda

Cassandra @ HMS
Storm Overview
Storm-Cassandra
Examples / Demo
Future: Trident

Page 3: Cassandra and Storm at Health Market Science

Our Products

Master Data Management
Good, bad doctors?

Prescriber eligibility and remediation.

Page 4: Cassandra and Storm at Health Market Science

Cassandra to the Rescue

1000s of Feeds

Δt

C* Masterfile

Big Data for us == Variety of Data

Page 5: Cassandra and Storm at Health Market Science

But…

Search unstructured data
Real-time Analytics / Reporting
Transactional Processing
Changes reflected immediately.
Wide-row Indexes

Page 6: Cassandra and Storm at Health Market Science

What might that look like?

[Diagram labels: C*, RDBMS, Dropwizard, wide-row index, “I’m happy”]

Provide for Polyglot Persistence

Page 7: Cassandra and Storm at Health Market Science

What we did wrong…

Could not react to transactional changes
Needed extra logic to track what changed
Took too long

Page 8: Cassandra and Storm at Health Market Science

C* at HMS

Load, Standardize, Match, Consolidate
Write results to C*

Track Changes over Time
Practitioner Data
Feed Quality

Page 9: Cassandra and Storm at Health Market Science

C* at HMS

Best Practices
Prefer writes over reads (especially avoid read-before-write)
Avoid Queries/Scans
○ Fetch by key whenever possible
○ Put Comparators to work
○ Pre-compute whenever possible

Page 10: Cassandra and Storm at Health Market Science

How?

Treat All Data as Immutable
Updates are inserts with new version/timestamp

Data Model
Heavy use of composites
Timestamps/Versions in Keys

Treat Feeds as Real-Time Streams
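
The "updates are inserts" idea can be sketched in plain Java, with a sorted map standing in for a wide row whose column names are composites ending in a version. This is only an illustration: `VersionedRow` and its methods are hypothetical names, and the in-memory `TreeMap` is a stand-in for C*, not the HMS data model.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory stand-in for one wide row; not the HMS schema.
class VersionedRow {
    // The composite column name (field, version) is modeled as a nested sorted map.
    private final TreeMap<String, TreeMap<Long, String>> columns = new TreeMap<>();

    // An "update" is an insert under a new version (versions are assumed unique).
    void write(String field, long version, String value) {
        columns.computeIfAbsent(field, f -> new TreeMap<>()).put(version, value);
    }

    // Latest value: the highest version, located through the sort order.
    String latest(String field) {
        TreeMap<Long, String> versions = columns.get(field);
        return versions == null ? null : versions.lastEntry().getValue();
    }

    // Value as of a point in time: the greatest version <= asOf.
    String asOf(String field, long asOf) {
        TreeMap<Long, String> versions = columns.get(field);
        if (versions == null) return null;
        Map.Entry<Long, String> entry = versions.floorEntry(asOf);
        return entry == null ? null : entry.getValue();
    }

    // Number of versions retained for a field.
    int versionCount(String field) {
        TreeMap<Long, String> versions = columns.get(field);
        return versions == null ? 0 : versions.size();
    }
}
```

Because nothing is overwritten, earlier versions stay readable, which is what makes tracking changes over time a key lookup rather than a scan.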

Page 11: Cassandra and Storm at Health Market Science

What Storm is to us…

CrudOp

A High Throughput Data Processing Pipeline

RDBMS SoR

Dimensional Counts

ETL Enrichment

Fuzzy Index

Page 12: Cassandra and Storm at Health Market Science

Enter Storm

Page 13: Cassandra and Storm at Health Market Science

Storm Overview

Open-Sourced by Twitter in 2011
Distributed Realtime Computation System
Fault Tolerant
Highly Scalable
Guaranteed Processing
Operates on one or more streams of data

Page 14: Cassandra and Storm at Health Market Science

Anatomy of a Storm Cluster

Nimbus
Master Node

Zookeeper
Cluster Coordination

Supervisors
Worker Nodes

Page 15: Cassandra and Storm at Health Market Science

Storm Primitives

Streams
Unbounded sequence of tuples

Spouts
Stream Sources

Bolts
Unit of Computation

Topologies
Combination of n Spouts and n Bolts
Defines the overall “Computation”

Page 16: Cassandra and Storm at Health Market Science

Storm Spouts

Represents a source (stream) of data
Queues (JMS, Kafka, Kestrel, etc.)
Twitter Firehose
Sensor Data

Emits “Tuples” (Events) based on source
Primary Storm data structure
Set of Key-Value pairs

Page 17: Cassandra and Storm at Health Market Science

Storm Bolts

Receive Tuples from Spouts or other Bolts
Operate on, or React to Data
Functions/Filters/Joins/Aggregations
Database writes/lookups

Optionally emit additional Tuples

Page 18: Cassandra and Storm at Health Market Science

Storm Topologies

Data flow between spouts and bolts
Routing of Tuples between spouts/bolts
Stream “Groupings”
Parallelism of Components
Long-Lived
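
To make routing and parallelism concrete, here is a toy model in plain Java of a fields-style grouping: tuples carrying the same key always land on the same bolt task. None of these types are Storm classes. `ToyTopology`, `taskFor`, and `run` are hypothetical names, and the hash-modulo routing is a sketch of the idea, not Storm's implementation.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only; none of these types are Storm classes.
class ToyTopology {
    // Fields-style grouping: the same key always routes to the same task index.
    static int taskFor(String key, int numTasks) {
        return Math.floorMod(key.hashCode(), numTasks);
    }

    // Spout stand-in: a finite list of sentences, split into words.
    // Bolt stand-in: each task keeps its own private word counts.
    static List<Map<String, Integer>> run(List<String> sentences, int numTasks) {
        List<Map<String, Integer>> tasks = new ArrayList<>();
        for (int i = 0; i < numTasks; i++) {
            tasks.add(new HashMap<>());
        }
        for (String sentence : sentences) {
            for (String word : sentence.split("\\s+")) {
                if (word.isEmpty()) continue;
                // Route the word to its task and increment that task's count.
                tasks.get(taskFor(word, numTasks)).merge(word, 1, Integer::sum);
            }
        }
        return tasks;
    }
}
```

Since a given word is only ever counted by one task, the per-task counts need no cross-task coordination, which is the property a grouping on a field is meant to provide.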

Page 19: Cassandra and Storm at Health Market Science

Storm Topologies

Page 20: Cassandra and Storm at Health Market Science

Storm and Cassandra

Use Cases:
Write Storm Tuple data to C*
○ Computation Results
○ Pre-computed indices
Read data from C* and emit Storm Tuples
○ Dynamic Lookups

http://github.com/hmsonline/storm-cassandra

Page 21: Cassandra and Storm at Health Market Science

Storm Cassandra Bolt Types

[Diagram labels: STORM, CassandraBolt, CassandraLookupBolt, C*]

CassandraBolt
Writes data to Cassandra
Available in Batching and Non-Batching

CassandraLookupBolt
Reads data from Cassandra

Page 22: Cassandra and Storm at Health Market Science

Storm-Cassandra Project

Provides generic Bolts for writing/reading Storm Tuples to/from C*

[Diagram labels: STORM, Tuple, TupleMapper, Rows, C*, Columns, ColumnsMapper, Tuples]

Page 23: Cassandra and Storm at Health Market Science

Storm-Cassandra Project

TupleMapper Interface
Tells the CassandraBolt how to write a tuple to an arbitrary data model

Given a Storm Tuple:
Map to Column Family
Map to Row Key
Map to Columns
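
The three mapping steps might look like the following in plain Java. This is a simplified, hypothetical interface: `SimpleTupleMapper`, `WordCountMapper`, and the `word_counts` column family are invented for illustration, a `Map<String, Object>` stands in for a Storm Tuple, and the real storm-cassandra signatures may differ.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical, simplified interface; the real storm-cassandra signatures may differ.
// A Map<String, Object> stands in for a Storm Tuple.
interface SimpleTupleMapper {
    String mapToColumnFamily(Map<String, Object> tuple);
    String mapToRowKey(Map<String, Object> tuple);
    Map<String, String> mapToColumns(Map<String, Object> tuple);
}

// Example mapper for a word-count tuple with the fields "word" and "count".
class WordCountMapper implements SimpleTupleMapper {
    // Every tuple goes to one (assumed) column family.
    public String mapToColumnFamily(Map<String, Object> tuple) {
        return "word_counts";
    }

    // The word itself becomes the row key.
    public String mapToRowKey(Map<String, Object> tuple) {
        return (String) tuple.get("word");
    }

    // The count becomes a single column named "count".
    public Map<String, String> mapToColumns(Map<String, Object> tuple) {
        Map<String, String> columns = new LinkedHashMap<>();
        columns.put("count", String.valueOf(tuple.get("count")));
        return columns;
    }
}
```

Keeping the mapping in its own object is what lets one generic bolt write to an arbitrary data model: the bolt handles the write, the mapper decides where it goes.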


Page 24: Cassandra and Storm at Health Market Science

Storm-Cassandra Project

ColumnsMapper Interface
Tells the CassandraLookupBolt how to transform a C* row into a Storm Tuple

Given a C* Row Key and list of Columns:
Return a list of Storm Tuples
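
The read direction can be sketched the same way. `SimpleColumnsMapper` and `mapToTuples` are hypothetical names, lists of strings stand in for Storm Tuples, and emitting one tuple per column is just one possible mapping, not necessarily what the library's bundled mappers do.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; not the storm-cassandra ColumnsMapper signature.
class SimpleColumnsMapper {
    // Emits one output tuple per column: [rowKey, columnName, columnValue].
    static List<List<String>> mapToTuples(String rowKey, Map<String, String> columns) {
        List<List<String>> tuples = new ArrayList<>();
        for (Map.Entry<String, String> column : columns.entrySet()) {
            tuples.add(Arrays.asList(rowKey, column.getKey(), column.getValue()));
        }
        return tuples;
    }
}
```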


Page 25: Cassandra and Storm at Health Market Science

Storm-Cassandra Project

Current State:
Version 0.4.0
Uses Astyanax Client
Several out-of-the-box *Mapper Implementations:
○ Basic Key-Value Columns
○ Value-less Columns
○ Counter Columns
○ Lookup by row key
○ Lookup by range query
Composite Key/Column Support
Trident support

Page 26: Cassandra and Storm at Health Market Science

Storm-Cassandra Project

Future Plans:
Switch to CQL
Enhanced Trident Support

Page 27: Cassandra and Storm at Health Market Science

Word Count Demo

Page 28: Cassandra and Storm at Health Market Science

DRPC

Page 29: Cassandra and Storm at Health Market Science

Reach Demo

Page 30: Cassandra and Storm at Health Market Science

Next Level : Trident

Page 31: Cassandra and Storm at Health Market Science

Trident

Provides a higher-level abstraction for stream processing
Constructs for state management and Batching

Adds additional primitives that abstract away common topological patterns

Deprecates transactional topologies
Distributed with Storm

Page 32: Cassandra and Storm at Health Market Science

Sample Trident Operations

Partition Local
Functions ( execute(x) → x + y )
Filters ( isKeep(x) → 0, x )
PartitionAggregate
○ Combiner ( pairwise combining )
○ Reducer ( iterative accumulation )
○ Aggregator ( byoa )
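
The difference between pairwise combining and iterative accumulation can be sketched in plain Java. The `Combiner` and `Reducer` interfaces below are simplified stand-ins that mirror the general shape of Trident's combiner and reducer aggregators, with a `String` in place of a Trident tuple. `AggregateDemo` and its helper methods are hypothetical.

```java
import java.util.List;

// Simplified stand-ins; a String takes the place of a Trident tuple.
interface Combiner<T> {
    T init(String tuple);   // value contributed by a single tuple
    T combine(T a, T b);    // merge two partial results
    T zero();               // identity value for an empty partition
}

interface Reducer<T> {
    T init();                         // starting accumulator
    T reduce(T current, String tuple); // fold one tuple into the accumulator
}

class AggregateDemo {
    static final Combiner<Long> COUNT = new Combiner<Long>() {
        public Long init(String tuple) { return 1L; }
        public Long combine(Long a, Long b) { return a + b; }
        public Long zero() { return 0L; }
    };

    static final Reducer<Long> COUNT_REDUCER = new Reducer<Long>() {
        public Long init() { return 0L; }
        public Long reduce(Long current, String tuple) { return current + 1; }
    };

    // Pairwise combining: a partial result per partition, then partials are merged.
    static <T> T combineAll(Combiner<T> c, List<List<String>> partitions) {
        T total = c.zero();
        for (List<String> partition : partitions) {
            T partial = c.zero();
            for (String tuple : partition) {
                partial = c.combine(partial, c.init(tuple));
            }
            total = c.combine(total, partial);
        }
        return total;
    }

    // Iterative accumulation: one running value folded over every tuple in order.
    static <T> T reduceAll(Reducer<T> r, List<String> tuples) {
        T acc = r.init();
        for (String tuple : tuples) {
            acc = r.reduce(acc, tuple);
        }
        return acc;
    }
}
```

A combiner can produce partial results on each partition before anything crosses the network, whereas a reducer needs to see the tuples themselves.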

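Page 33: Cassandra and Storm at Health Market Science

A sample topology

    // spout, Split, and serverLocations are defined elsewhere in the original example.
    TridentTopology topology = new TridentTopology();
    TridentState wordCounts =
        topology.newStream("spout1", spout)
            // Split each "sentence" into individual "word" tuples.
            .each(new Fields("sentence"), new Split(), new Fields("word"))
            // Group by word so each word is counted independently.
            .groupBy(new Fields("word"))
            // Persist a running count per word in Memcached (opaque state).
            .persistentAggregate(
                MemcachedState.opaque(serverLocations),
                new Count(),
                new Fields("count"))
            .parallelismHint(6);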

https://github.com/nathanmarz/storm/wiki/Trident-state

Page 34: Cassandra and Storm at Health Market Science

Trident State

Sequenced writes by batch/transaction id.

Spouts
Transactional
○ Batch contents never change
Opaque
○ Batch contents can change

State
Transactional
○ Store tx_id with counts to maintain sequencing of writes.
Opaque
○ Store previous value in order to overwrite the current value when contents of a batch change.
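
The two state-update rules can be sketched in plain Java as follows. `TransactionalCount` and `OpaqueCount` are hypothetical in-memory classes, not Trident State implementations, and a starting `txid` of -1 is assumed to mean "no batch applied yet".

```java
// Hypothetical in-memory sketches of the two update rules; not Trident classes.

// Transactional: the stored txid says whether this batch was already applied.
class TransactionalCount {
    long value = 0;
    long txid = -1; // assumed sentinel: no batch applied yet

    void apply(long batchTxid, long batchCount) {
        if (batchTxid == txid) {
            return; // replay of the same batch with identical contents: skip
        }
        value += batchCount;
        txid = batchTxid;
    }
}

// Opaque: a replayed batch may contain different tuples, so the previous
// value is kept and the current value is recomputed from it on replay.
class OpaqueCount {
    long value = 0;
    long prevValue = 0;
    long txid = -1; // assumed sentinel: no batch applied yet

    void apply(long batchTxid, long batchCount) {
        if (batchTxid == txid) {
            // Same txid, possibly changed contents: overwrite the current value.
            value = prevValue + batchCount;
        } else {
            // New txid: the current value becomes the new baseline.
            prevValue = value;
            value += batchCount;
            txid = batchTxid;
        }
    }
}
```

For example, if batch 1 is first applied with a count of 3 and then replayed with a count of 5, the opaque state ends at 5 rather than 8, while the transactional state would simply ignore the replay.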

Page 35: Cassandra and Storm at Health Market Science

Shameless Shoutouts

HMS (https://github.com/hmsonline/)
storm-cassandra
storm-elastic-search
storm-jdbi (coming soon)

ptgoetz (https://github.com/ptgoetz)
storm-jms
storm-signals

Page 36: Cassandra and Storm at Health Market Science

P. Taylor Goetz
Development Lead, Health Market Science
[email protected]
@ptgoetz

THANKS!