hms nyc* talk

49
1• 800.593.4467 • [email protected] The Big Data Quadfecta Brian O’Neill Lead Architect, Health Market Science @boneill42, [email protected] Taylor Goetz Development Lead, Health Market Science @ptgoetz, [email protected]

Upload: brian-oneill

Post on 10-May-2015

620 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: Hms nyc* talk

1• 8

00.5

93.4

467

• in

fo@

heal

thm

arke

tsci

ence

.com

The Big Data Quadfecta

Brian O’NeillLead Architect, Health Market Science@boneill42, [email protected]

Taylor GoetzDevelopment Lead, Health Market Science@ptgoetz, [email protected]

Page 2: Hms nyc* talk

1• 8

00.5

93.4

467

• in

fo@

heal

thm

arke

tsci

ence

.com

Quadfecta?

1. Quadfecta• A legendary beirut/beer pong shot that lands on

the tops of four cups simultaneously. Considered the rarest shot in the game, topping even the trifecta, 2-cup knockover-and-sink, and simultaneous 6-cup game-ending double bounce-in.

• Kafka• Storm• Elastic Search• Cassandra

Page 3: Hms nyc* talk

1• 8

00.5

93.4

467

• in

fo@

heal

thm

arke

tsci

ence

.com

3 V’s

Volume Variety Velocity

Page 4: Hms nyc* talk

1• 8

00.5

93.4

467

• in

fo@

heal

thm

arke

tsci

ence

.com

The Use Case

Page 5: Hms nyc* talk

1• 8

00.5

93.4

467

• in

fo@

heal

thm

arke

tsci

ence

.com

Our Mission

Prescriber eligibility and remediation

Eliminate fraud, waste and abuse

Insights into the healthcare space

Page 6: Hms nyc* talk

1• 8

00.5

93.4

467

• in

fo@

heal

thm

arke

tsci

ence

.com

The BusinessBusiness Solutions

Health Care Provider & Facilities

Variety/Velocity• >l2000 of sources

• 6 Million unique HCPs

• 10+ years history

Data Challenges• Constant change in

real world data

• Conflicting & partial info

• Frequent changes to source structure

• Authoritative sources vs. crowdsource

• Predicting source quality

Master Data SolutionsMedical Procedures &

Diagnosis

Volume/Velocity• ~1B claims annually

• +5B records annually

• 5+ years history

Data Challenges• Sources have

incomplete capture

• Overlapping source data

• Statistical projections & biases

• Social media type relationships

Medical Claims Data

CompleteView, Expense Manager,

CompleteSpend

Prescriber Eligibility/Remdi

ation

Analtyics (Influencer Networks)

Page 7: Hms nyc* talk

1• 8

00.5

93.4

467

• in

fo@

heal

thm

arke

tsci

ence

.com

Our SolutionsBusiness

Needs

Finance & LegalBusiness SystemsCompliance Sales & Marketing

SolutionsProvider Data ComplianceData Assessment, Integration &

Enrichment Services

01010011

Market Intelligence

HMSAuthoritative

SourcesPDC Federal StateMedical Claims Web Derived

AdvancedTechnology

Master Data Management

Page 8: Hms nyc* talk

1• 8

00.5

93.4

467

• in

fo@

heal

thm

arke

tsci

ence

.com

Datacenter

¾ Petabytes of raw storage

Virtualized (VMware)

On a SAN

Should we go physical???

Page 9: Hms nyc* talk

1• 8

00.5

93.4

467

• in

fo@

heal

thm

arke

tsci

ence

.com

Under the Hood

Visualization

Dashboard / Reports

Structured Storage

RelationalIndexing

Flexible Storage

NoSQL Graph(s)

Interfacing

Web Services

Distributed Processing

Standardize

Validate

MatchConsolidat

e

Analytics

Data Sources

Government

Web

Customer

I’m happy

User Interface

Page 10: Hms nyc* talk

1• 8

00.5

93.4

467

• in

fo@

heal

thm

arke

tsci

ence

.com

Master Data Management

Harvested

Government

PrivateSchema Change!

Page 11: Hms nyc* talk

1• 8

00.5

93.4

467

• in

fo@

heal

thm

arke

tsci

ence

.com

The Design

Page 12: Hms nyc* talk

1• 8

00.5

93.4

467

• in

fo@

heal

thm

arke

tsci

ence

.com

System of Record

Flexibility (Variety)Scalability (Velocity + Volume)

Page 13: Hms nyc* talk

1• 8

00.5

93.4

467

• in

fo@

heal

thm

arke

tsci

ence

.com

Primitives of Distributed Processing

emit/proce

ss(tuple(…

))

map<key<map<[], value>>

pop(push(v))

index(field, type)

Kafka

Page 14: Hms nyc* talk

1• 8

00.5

93.4

467

• in

fo@

heal

thm

arke

tsci

ence

.com

Design Principles

PatternsIdempotent Operations

Elegantly handle replay

Immutable dataAssertions of facts over time

Anti-PatternsTransactions / Locking

Page 15: Hms nyc* talk

1• 8

00.5

93.4

467

• in

fo@

heal

thm

arke

tsci

ence

.com

State / Counting

Exactly-once semantics for state

Create small batchesOrder batchesBatch 1

Batch 2

Batch 3

Batch 3’

4

13

13

6

Batch Total

1 4

3 4 (wait)

2 10 (+6)

3 23 (+13)

3’ 23 (+0)

Page 16: Hms nyc* talk

1• 8

00.5

93.4

467

• in

fo@

heal

thm

arke

tsci

ence

.com

What we did wrong…

Could not react to transactional changes

Needed extra logic to track what changed

Took too long

Page 17: Hms nyc* talk

1• 8

00.5

93.4

467

• in

fo@

heal

thm

arke

tsci

ence

.com

What we did wrong… (II)

AOP-based triggersWorked well initially.Business Processes captured as side-effects.

Page 18: Hms nyc* talk

1• 8

00.5

93.4

467

• in

fo@

heal

thm

arke

tsci

ence

.com

What we did right.

REST APIs for Loose Coupling

See Virgil:https://github.com/hmsonline/virgil

But really… watch out for Intraverthttps://github.com/zznate/intravert-ug

Page 19: Hms nyc* talk

1• 8

00.5

93.4

467

• in

fo@

heal

thm

arke

tsci

ence

.com

Kafka• Millions of Messages• Replay Enabled• No transactions / Lightning Fast

Page 20: Hms nyc* talk

1• 8

00.5

93.4

467

• in

fo@

heal

thm

arke

tsci

ence

.com

Elastic Search• Edit Distance / Soundex• Native Scalability• Fuzzy Search• Geospatial• Facets

Page 21: Hms nyc* talk

1• 8

00.5

93.4

467

• in

fo@

heal

thm

arke

tsci

ence

.com

Storm• Guaranteed once semantics• Well-designed processing

abstraction• Beats BYODP• Momentum

Page 22: Hms nyc* talk

1• 8

00.5

93.4

467

• in

fo@

heal

thm

arke

tsci

ence

.com

The System

KafkaQueue(s)

Offset

C*

A

BC

C* ES1Kafka

ElasticSearch

ES2C*

REST API

NP. We can route around

it.

NP. Replication Factor > 1.

NP. Rewind!

Page 23: Hms nyc* talk

1• 8

00.5

93.4

467

• in

fo@

heal

thm

arke

tsci

ence

.com

What comes after Quadfecta?

?

Page 24: Hms nyc* talk

1• 8

00.5

93.4

467

• in

fo@

heal

thm

arke

tsci

ence

.com

Real-Time Integration

Real-time CRUD via Web Services

DRPC“Real-time” Queue

Not quite sure?

Page 25: Hms nyc* talk

1• 8

00.5

93.4

467

• in

fo@

heal

thm

arke

tsci

ence

.com

The Storm/C* Bridge

Page 26: Hms nyc* talk

1• 8

00.5

93.4

467

• in

fo@

heal

thm

arke

tsci

ence

.com

Anatomy of a Storm Cluster

NimbusMaster Node

ZookeeperCluster Coordination

SupervisorsWorker Nodes

Page 27: Hms nyc* talk

1• 8

00.5

93.4

467

• in

fo@

heal

thm

arke

tsci

ence

.com

Storm Primatives

StreamsUnbounded sequence of tuples

SpoutsStream Sources

BoltsUnit of Computation

TopologiesCombination of n Spouts and n BoltsDefines the overall “Computation”

Page 28: Hms nyc* talk

1• 8

00.5

93.4

467

• in

fo@

heal

thm

arke

tsci

ence

.com

Storm Spouts

Represents a source (stream) of dataQueues (JMS, Kafka, Kestrel, etc.)Twitter FirehoseSensor Data

Emits “Tuples” (Events) based on source

Primary Storm data structureSet of Key-Value pairs

Page 29: Hms nyc* talk

1• 8

00.5

93.4

467

• in

fo@

heal

thm

arke

tsci

ence

.com

Storm Bolts

Receive Tuples from Spouts or other Bolts

Operate on, or React to DataFunctions/Filters/Joins/AggregationsDatabase writes/lookups

Optionally emit additional Tuples

Page 30: Hms nyc* talk

1• 8

00.5

93.4

467

• in

fo@

heal

thm

arke

tsci

ence

.com

Storm Topologies

Data flow between spouts and bolts

Routing of Tuples between spouts/bolts

Stream “Groupings”

Parallelism of Components

Long-Lived

Page 31: Hms nyc* talk

1• 8

00.5

93.4

467

• in

fo@

heal

thm

arke

tsci

ence

.com

Storm Topologies

Page 32: Hms nyc* talk

1• 8

00.5

93.4

467

• in

fo@

heal

thm

arke

tsci

ence

.com

Storm and Cassandra

Use Cases:Write Storm Tuple data to C*

Computation ResultsPre-compute indices

Read data from C* and emit Storm Tuples

Dynamic Lookupshttp://github.com/hmsonline/storm-cassandra

Page 33: Hms nyc* talk

1• 8

00.5

93.4

467

• in

fo@

heal

thm

arke

tsci

ence

.com

Storm Cassandra Bolt Types

CassandraBolt

CassandraLookupBolt

CassandraBoltWrites data to CassandraAvailable in Batching and Non-Batching

CassandraLookupBoltReads data from Cassandra

http://github.com/hmsonline/storm-cassandra

C*STORM

Page 34: Hms nyc* talk

1• 8

00.5

93.4

467

• in

fo@

heal

thm

arke

tsci

ence

.com

Storm-Cassandra Project

Provides generic Bolts for writing/reading Storm Tuples to/from C*

TupleTuple

Mapper Rows

C*Tuples

ColumnsMapper Columns

STORM

http://github.com/hmsonline/storm-cassandra

Page 35: Hms nyc* talk

1• 8

00.5

93.4

467

• in

fo@

heal

thm

arke

tsci

ence

.com

Storm-Cassandra Project

TupleMapper InterfaceTells the CassandraBolt how to write a tuple to an arbitrary data model

Given a Storm Tuple:Map to Column FamilyMap to Row KeyMap to Columns

http://github.com/hmsonline/storm-cassandra

Page 36: Hms nyc* talk

1• 8

00.5

93.4

467

• in

fo@

heal

thm

arke

tsci

ence

.com

Storm-Cassandra Project

ColumnsMapper InterfaceTells the CassandraLookupBolt how to transform a C* row into a Storm Tuple

Given a C* Row Key and list of Columns:

Return a list of Storm Tuples

http://github.com/hmsonline/storm-cassandra

Page 37: Hms nyc* talk

1• 8

00.5

93.4

467

• in

fo@

heal

thm

arke

tsci

ence

.com

Storm-Cassandra Project

Current State:Version 0.4.0Uses Astyanax ClientSeveral out-of-the-box *Mapper Implementations:

Basic Key-Value ColumnsValue-less ColumnsCounter ColumnsLookup by row keyLookup by range query

Composite Key/Column SupportTrident support

http://github.com/hmsonline/storm-cassandra

Page 38: Hms nyc* talk

1• 8

00.5

93.4

467

• in

fo@

heal

thm

arke

tsci

ence

.com

Storm-Cassandra Project

Future Plans:Switch to CQLEnhanced Trident Support

http://github.com/hmsonline/storm-cassandra

Page 39: Hms nyc* talk

1• 8

00.5

93.4

467

• in

fo@

heal

thm

arke

tsci

ence

.com

Persistent Word Count

http://github.com/hmsonline/storm-cassandra

Page 40: Hms nyc* talk

1• 8

00.5

93.4

467

• in

fo@

heal

thm

arke

tsci

ence

.com

DRPC

Page 41: Hms nyc* talk

1• 8

00.5

93.4

467

• in

fo@

heal

thm

arke

tsci

ence

.com

“Reach” Computation

Page 42: Hms nyc* talk

1• 8

00.5

93.4

467

• in

fo@

heal

thm

arke

tsci

ence

.com

MDM Topology*

*Notional

Page 43: Hms nyc* talk

1• 8

00.5

93.4

467

• in

fo@

heal

thm

arke

tsci

ence

.com

Load Topology

Page 44: Hms nyc* talk

1• 8

00.5

93.4

467

• in

fo@

heal

thm

arke

tsci

ence

.com

Shameless Shoutouts

HMS (https://github.com/hmsonline/)storm-cassandrastorm-elastic-searchstorm-jdbi (coming soon)

ptgoetz (https://github.com/ptgoetz) storm-jmsstorm-signals

Page 45: Hms nyc* talk

1• 8

00.5

93.4

467

• in

fo@

heal

thm

arke

tsci

ence

.com

Next Level : Trident

Page 46: Hms nyc* talk

1• 8

00.5

93.4

467

• in

fo@

heal

thm

arke

tsci

ence

.com

Trident

Provides a higher-level abstraction for stream processing

Constructs for state management and Batching

Adds additional primitives that abstract away common topological patterns

Deprecates transactional topologies

Distributes with Storm

Page 47: Hms nyc* talk

1• 8

00.5

93.4

467

• in

fo@

heal

thm

arke

tsci

ence

.com

Sample Trident Operations

Partition LocalFunctions ( execute(x) x + y )

Filters ( isKeep(x) 0,x )

PartitionAggregateCombiner ( pairwise combining )Reducer ( iterative accumulation )Aggregator ( byoa )

Page 48: Hms nyc* talk

1• 8

00.5

93.4

467

• in

fo@

heal

thm

arke

tsci

ence

.com

A sample topology

TridentTopology topology = new TridentTopology();

TridentState wordCounts =

topology.newStream("spout1", spout)

.each(new Fields("sentence"),

new Split(),

new Fields("word"))

.groupBy(new Fields("word"))

.persistentAggregate(

MemcachedState.opaque(serverLocations),

new Count(),

new Fields("count"))

.parallelismHint(6);

https://github.com/nathanmarz/storm/wiki/Trident-state

Page 49: Hms nyc* talk

1• 8

00.5

93.4

467

• in

fo@

heal

thm

arke

tsci

ence

.com

Trident State

Sequenced writes by batch/transaction id.

SpoutsTransactional

Batch contents never change

OpaqueBatch contents can change

StateTransactional

Store tx_id with counts to maintain sequencing of writes.

OpaqueStore previous value in order to overwrite the current value when contents of a batch change.