c*ollege credit: cep distribtued processing on cassandra with storm

Post on 26-Jan-2015

103 Views

Category:

Technology

0 Downloads

Preview:

Click to see full reader

DESCRIPTION

Cassandra provides facilities to integrate with Hadoop. This is sufficient for distributed batch processing, but doesn’t address CEP distributed processing. This webinar will demonstrate use of Cassandra in Storm. Storm provides a data flow and processing layer that can be used to integrate Cassandra with other external persistences mechanisms (e.g. Elastic Search) or calculate dimensional counts for reporting and dashboards. We’ll dive into a sample Storm topology that reads and writes from Cassandra using storm-cassandra bolts.

TRANSCRIPT

COMPLEX EVENT PROCESSING W/

CASSANDRA

Brian O’NeillLead Architect, Health Market Scienceboneill@healthmarketscience.com@boneill42

Taylor GoetzDevelopment Lead, Health Market Scienceptgoetz@healthmarketscience.com@ptgoetz

Use CaseWhat is CEP? Why? What for?

Storm BackgroundCluster configuration

Examples / Demo Future : Trident

Agenda

Our Products

Master Data ManagementGood, bad doctors?

Prescriber eligibility and remediation.

Cassandra to the Rescue

1000’s of Feeds

Δt

C* Masterfile

Big Data for us == Variety of Data

But…

Search unstructured data Real-time Analytics / Reporting Transactional Processing

Changes reflected immediately.Wide-row Indexes

What might that look like?

C*

RDBMS

I’m happy

Dro

pwiz

ard

wid

e-ro

w in

dex

Provide for Polyglot Persistence

What we did wrong… (part 1)

Could not react to transactional changes Needed extra logic to track what changed Took too long

What we did wrong… (part 2)

Good IntentionsTake the onus off the clients

Bad Result Guaranteeing executionWrite Overhead

C*Wide Row….Indexing

TRIGGERS

What is Complex Event Processing? Event processing is a method of tracking and

analyzing (processing) streams of information (data) about things that happen (events),[1] and deriving a conclusion from them.

Complex event processing, or CEP, is event processing that combines data from multiple sources[2] to infer events or patterns that suggest more complicated circumstances.

http://en.wikipedia.org/wiki/Complex_event_processing

Imagine…

Treating CRUD operations as events in a system.

Then Suddenly,

CEP = (ETL or Analytics)

What Storm is to us…

CrudOp

A High Throughput Data Processing Pipeline

RDBMSSoR

Dimensional Counts

ETL Enrichment

Fuzzy Index

Enter Storm

Storm Overview

Open-Sourced by Twitter in 2011 Distributed Realtime Computation System Fault Tolerant Highly Scalable Guaranteed Processing Operates on one or more streams of data (i.e.

CEP)

Anatomy of a Storm Cluster

NimbusMaster Node

ZookeeperCluster Coordination

SupervisorsWorker Nodes

Storm Components

SpoutsStream Sources

BoltsUnit of Computation

TopologiesCombination of n Spouts and n BoltsDefines the overall “Computation”

Storm Spouts

Represents a source (stream) of dataQueues (JMS, Kafka, Kestrel, etc.)Twitter FirehoseSensor Data

Emits “Tuples” (Events) based on sourcePrimary Storm data structureSet of Key-Value pairs

Storm Bolts

Receive Tuples from Spouts or other Bolts Operate on, or React to Data

Functions/Filters/Joins/AggregationsDatabase writes/lookups

Optionally emit additional Tuples

Storm Topologies

Data flow between spouts and bolts Routing of Tuples between spouts/bolts

Stream “Groupings” Parallelism of Components Long-Lived

Storm and Cassandra

Use Cases:Write Storm Tuple data to C*

○ Computation Results○ Pre-computed indices

Read data from C* and emit Storm Tuples○ Dynamic Lookups

http://github.com/hmsonline/storm-cassandra

Storm Cassandra Bolt Types

CassandraBolt

C*CassandraLookupBolt

STORM

CassandraBoltWrites data to CassandraAvailable in Batching and Non-Batching

CassandraLookupBoltReads data from Cassandra

http://github.com/hmsonline/storm-cassandra

Storm-Cassandra Project

Provides generic Bolts for writing/reading Storm Tuples to/from C*

TupleTuple

Mapper Rows

C*TuplesColumnsMapper Columns

STORM

http://github.com/hmsonline/storm-cassandra

Storm-Cassandra Project

TupleMapper InterfaceTells the CassandraBolt how to write a tuple to an

arbitrary data model

Given a Storm Tuple:Map to Column FamilyMap to Row KeyMap to Columns

http://github.com/hmsonline/storm-cassandra

Storm-Cassandra Project

ColumnsMapper InterfaceTells the CassandraLookupBolt how to transform a

C* row into a Storm Tuple

Given a C* Row Key and list of Columns:Return a list of Storm Tuples

http://github.com/hmsonline/storm-cassandra

Storm-Cassandra Project Current State:

Version 0.4.0-WIPUses Astyanax ClientSeveral out-of-the-box *Mapper Implementations:

○ Basic Key-Value Columns○ Value-less Columns○ Counter Columns○ Lookup by row key○ Lookup by range query

Initial pass at Trident supportInitial pass at Composite Column Support

http://github.com/hmsonline/storm-cassandra

Storm-Cassandra Project

Future Plans:Switch to CQL (???)Full Trident Support

http://github.com/hmsonline/storm-cassandra

Word Count Demo

http://github.com/hmsonline/storm-cassandra

DRPC

Reach Demo

Next Level : Trident

Trident

Provides a higher-level abstraction for stream processingConstructs for state management and Batching

Adds additional primitives that abstract away common topological patterns

Deprecates transactional topologies Distributes with Storm

Sample Trident Operations Partition Local

Functions ( execute(x) x + y )Filters ( isKeep(x) 0,x )PartitionAggregate

○ Combiner ( pairwise combining )○ Reducer ( iterative accumulation )○ Aggregator ( byoa )

A sample topologyTridentTopology topology = new TridentTopology(); TridentState wordCounts = topology.newStream("spout1", spout) .each(new Fields("sentence"),

new Split(), new Fields("word"))

.groupBy(new Fields("word")) .persistentAggregate(

MemcachedState.opaque(serverLocations), new Count(), new Fields("count"))

.parallelismHint(6);

https://github.com/nathanmarz/storm/wiki/Trident-state

Trident StateSequenced writes by batch/transaction id. Spouts

Transactional○ Batch contents never change

Opaque○ Batch contents can change

StateTransactional

○ Store tx_id with counts to maintain sequencing of writes.Opaque

○ Store previous value in order to overwrite the current value when contents of a batch change.

Shameless Shoutouts

HMS (https://github.com/hmsonline/)storm-cassandrastorm-elastic-searchstorm-jdbi (coming soon)

ptgoetz (https://github.com/ptgoetz) storm-jmsstorm-signals

Brian O’NeillLead Architect, Health Market Scienceboneill@healthmarketscience.com@boneill42

Taylor GoetzDevelopment Lead, Health Market Scienceptgoetz@healthmarketscience.com@ptgoetz

THANKS!

top related