c*ollege credit: cep distribtued processing on cassandra with storm

35
COMPLEX EVENT PROCESSING W/ CASSANDRA an O’Neill d Architect, Health Market Science [email protected] neill42 Taylor Goetz Development Lead, Health Market Scie [email protected] @ptgoetz

Upload: datastax

Post on 26-Jan-2015

103 views

Category:

Technology


0 download

DESCRIPTION

Cassandra provides facilities to integrate with Hadoop. This is sufficient for distributed batch processing, but doesn’t address CEP distributed processing. This webinar will demonstrate use of Cassandra in Storm. Storm provides a data flow and processing layer that can be used to integrate Cassandra with other external persistences mechanisms (e.g. Elastic Search) or calculate dimensional counts for reporting and dashboards. We’ll dive into a sample Storm topology that reads and writes from Cassandra using storm-cassandra bolts.

TRANSCRIPT

Page 1: C*ollege Credit: CEP Distribtued Processing on Cassandra with Storm

COMPLEX EVENT PROCESSING W/

CASSANDRA

Brian O’NeillLead Architect, Health Market [email protected]@boneill42

Taylor GoetzDevelopment Lead, Health Market [email protected]@ptgoetz

Page 2: C*ollege Credit: CEP Distribtued Processing on Cassandra with Storm

Use CaseWhat is CEP? Why? What for?

Storm BackgroundCluster configuration

Examples / Demo Future : Trident

Agenda

Page 3: C*ollege Credit: CEP Distribtued Processing on Cassandra with Storm

Our Products

Master Data ManagementGood, bad doctors?

Prescriber eligibility and remediation.

Page 4: C*ollege Credit: CEP Distribtued Processing on Cassandra with Storm

Cassandra to the Rescue

1000’s of Feeds

Δt

C* Masterfile

Big Data for us == Variety of Data

Page 5: C*ollege Credit: CEP Distribtued Processing on Cassandra with Storm

But…

Search unstructured data Real-time Analytics / Reporting Transactional Processing

Changes reflected immediately.Wide-row Indexes

Page 6: C*ollege Credit: CEP Distribtued Processing on Cassandra with Storm

What might that look like?

C*

RDBMS

I’m happy

Dro

pwiz

ard

wid

e-ro

w in

dex

Provide for Polyglot Persistence

Page 7: C*ollege Credit: CEP Distribtued Processing on Cassandra with Storm

What we did wrong… (part 1)

Could not react to transactional changes Needed extra logic to track what changed Took too long

Page 8: C*ollege Credit: CEP Distribtued Processing on Cassandra with Storm

What we did wrong… (part 2)

Good IntentionsTake the onus off the clients

Bad Result Guaranteeing executionWrite Overhead

C*Wide Row….Indexing

TRIGGERS

Page 9: C*ollege Credit: CEP Distribtued Processing on Cassandra with Storm

What is Complex Event Processing? Event processing is a method of tracking and

analyzing (processing) streams of information (data) about things that happen (events),[1] and deriving a conclusion from them.

Complex event processing, or CEP, is event processing that combines data from multiple sources[2] to infer events or patterns that suggest more complicated circumstances.

http://en.wikipedia.org/wiki/Complex_event_processing

Page 10: C*ollege Credit: CEP Distribtued Processing on Cassandra with Storm

Imagine…

Treating CRUD operations as events in a system.

Then Suddenly,

CEP = (ETL or Analytics)

Page 11: C*ollege Credit: CEP Distribtued Processing on Cassandra with Storm

What Storm is to us…

CrudOp

A High Throughput Data Processing Pipeline

RDBMSSoR

Dimensional Counts

ETL Enrichment

Fuzzy Index

Page 12: C*ollege Credit: CEP Distribtued Processing on Cassandra with Storm

Enter Storm

Page 13: C*ollege Credit: CEP Distribtued Processing on Cassandra with Storm

Storm Overview

Open-Sourced by Twitter in 2011 Distributed Realtime Computation System Fault Tolerant Highly Scalable Guaranteed Processing Operates on one or more streams of data (i.e.

CEP)

Page 14: C*ollege Credit: CEP Distribtued Processing on Cassandra with Storm

Anatomy of a Storm Cluster

NimbusMaster Node

ZookeeperCluster Coordination

SupervisorsWorker Nodes

Page 15: C*ollege Credit: CEP Distribtued Processing on Cassandra with Storm

Storm Components

SpoutsStream Sources

BoltsUnit of Computation

TopologiesCombination of n Spouts and n BoltsDefines the overall “Computation”

Page 16: C*ollege Credit: CEP Distribtued Processing on Cassandra with Storm

Storm Spouts

Represents a source (stream) of dataQueues (JMS, Kafka, Kestrel, etc.)Twitter FirehoseSensor Data

Emits “Tuples” (Events) based on sourcePrimary Storm data structureSet of Key-Value pairs

Page 17: C*ollege Credit: CEP Distribtued Processing on Cassandra with Storm

Storm Bolts

Receive Tuples from Spouts or other Bolts Operate on, or React to Data

Functions/Filters/Joins/AggregationsDatabase writes/lookups

Optionally emit additional Tuples

Page 18: C*ollege Credit: CEP Distribtued Processing on Cassandra with Storm

Storm Topologies

Data flow between spouts and bolts Routing of Tuples between spouts/bolts

Stream “Groupings” Parallelism of Components Long-Lived

Page 19: C*ollege Credit: CEP Distribtued Processing on Cassandra with Storm

Storm and Cassandra

Use Cases:Write Storm Tuple data to C*

○ Computation Results○ Pre-computed indices

Read data from C* and emit Storm Tuples○ Dynamic Lookups

http://github.com/hmsonline/storm-cassandra

Page 20: C*ollege Credit: CEP Distribtued Processing on Cassandra with Storm

Storm Cassandra Bolt Types

CassandraBolt

C*CassandraLookupBolt

STORM

CassandraBoltWrites data to CassandraAvailable in Batching and Non-Batching

CassandraLookupBoltReads data from Cassandra

http://github.com/hmsonline/storm-cassandra

Page 21: C*ollege Credit: CEP Distribtued Processing on Cassandra with Storm

Storm-Cassandra Project

Provides generic Bolts for writing/reading Storm Tuples to/from C*

TupleTuple

Mapper Rows

C*TuplesColumnsMapper Columns

STORM

http://github.com/hmsonline/storm-cassandra

Page 22: C*ollege Credit: CEP Distribtued Processing on Cassandra with Storm

Storm-Cassandra Project

TupleMapper InterfaceTells the CassandraBolt how to write a tuple to an

arbitrary data model

Given a Storm Tuple:Map to Column FamilyMap to Row KeyMap to Columns

http://github.com/hmsonline/storm-cassandra

Page 23: C*ollege Credit: CEP Distribtued Processing on Cassandra with Storm

Storm-Cassandra Project

ColumnsMapper InterfaceTells the CassandraLookupBolt how to transform a

C* row into a Storm Tuple

Given a C* Row Key and list of Columns:Return a list of Storm Tuples

http://github.com/hmsonline/storm-cassandra

Page 24: C*ollege Credit: CEP Distribtued Processing on Cassandra with Storm

Storm-Cassandra Project Current State:

Version 0.4.0-WIPUses Astyanax ClientSeveral out-of-the-box *Mapper Implementations:

○ Basic Key-Value Columns○ Value-less Columns○ Counter Columns○ Lookup by row key○ Lookup by range query

Initial pass at Trident supportInitial pass at Composite Column Support

http://github.com/hmsonline/storm-cassandra

Page 25: C*ollege Credit: CEP Distribtued Processing on Cassandra with Storm

Storm-Cassandra Project

Future Plans:Switch to CQL (???)Full Trident Support

http://github.com/hmsonline/storm-cassandra

Page 26: C*ollege Credit: CEP Distribtued Processing on Cassandra with Storm

Word Count Demo

http://github.com/hmsonline/storm-cassandra

Page 27: C*ollege Credit: CEP Distribtued Processing on Cassandra with Storm

DRPC

Page 28: C*ollege Credit: CEP Distribtued Processing on Cassandra with Storm

Reach Demo

Page 29: C*ollege Credit: CEP Distribtued Processing on Cassandra with Storm

Next Level : Trident

Page 30: C*ollege Credit: CEP Distribtued Processing on Cassandra with Storm

Trident

Provides a higher-level abstraction for stream processingConstructs for state management and Batching

Adds additional primitives that abstract away common topological patterns

Deprecates transactional topologies Distributes with Storm

Page 31: C*ollege Credit: CEP Distribtued Processing on Cassandra with Storm

Sample Trident Operations Partition Local

Functions ( execute(x) x + y )Filters ( isKeep(x) 0,x )PartitionAggregate

○ Combiner ( pairwise combining )○ Reducer ( iterative accumulation )○ Aggregator ( byoa )

Page 32: C*ollege Credit: CEP Distribtued Processing on Cassandra with Storm

A sample topologyTridentTopology topology = new TridentTopology(); TridentState wordCounts = topology.newStream("spout1", spout) .each(new Fields("sentence"),

new Split(), new Fields("word"))

.groupBy(new Fields("word")) .persistentAggregate(

MemcachedState.opaque(serverLocations), new Count(), new Fields("count"))

.parallelismHint(6);

https://github.com/nathanmarz/storm/wiki/Trident-state

Page 33: C*ollege Credit: CEP Distribtued Processing on Cassandra with Storm

Trident StateSequenced writes by batch/transaction id. Spouts

Transactional○ Batch contents never change

Opaque○ Batch contents can change

StateTransactional

○ Store tx_id with counts to maintain sequencing of writes.Opaque

○ Store previous value in order to overwrite the current value when contents of a batch change.

Page 34: C*ollege Credit: CEP Distribtued Processing on Cassandra with Storm

Shameless Shoutouts

HMS (https://github.com/hmsonline/)storm-cassandrastorm-elastic-searchstorm-jdbi (coming soon)

ptgoetz (https://github.com/ptgoetz) storm-jmsstorm-signals

Page 35: C*ollege Credit: CEP Distribtued Processing on Cassandra with Storm

Brian O’NeillLead Architect, Health Market [email protected]@boneill42

Taylor GoetzDevelopment Lead, Health Market [email protected]@ptgoetz

THANKS!