Using MongoDB + Hadoop Together



DESCRIPTION

Learn how to leverage the power of MongoDB and Hadoop together with this presentation.

TRANSCRIPT

Page 1: Using MongoDB + Hadoop Together

Using MongoDB and Hadoop Together For Success

Enterprise Architect, MongoDB

Buzz Moschetti [email protected]

#MongoDB

Page 2: Using MongoDB + Hadoop Together

Who is your Presenter?

•  Yes, I use “Buzz” on my business cards

•  Former Investment Bank Chief Architect at JPMorganChase and Bear Stearns before that

•  Over 25 years of designing and building systems

   •  Big and small
   •  Super-specialized to broadly useful in any vertical
   •  “Traditional” to completely disruptive
•  Advocate of language leverage and strong factoring

•  Still programming – using emacs, of course

Page 3: Using MongoDB + Hadoop Together

Agenda

•  (Occasionally) Brutal Truths about Big Data
•  The Key To Success in Large Scale Data Management
•  Review of Directed Content Business Architecture
•  Technical Implementation Examples
   •  Recommendation Capability
   •  Realtime Trade / Position Risk
•  Q & A

Page 4: Using MongoDB + Hadoop Together

Truths

•  Clear definition of Big Data still maturing
•  Efficiently operationalizing Big Data is non-trivial
   •  Developing, debugging, understanding MapReduce
   •  Cluster monitoring & management, job scheduling/recovery
   •  If you thought regular ETL Hell was bad….
•  Big Data is not about math/set accuracy
   •  The last 25,000 items in a 25,497,612-item set “don’t matter”
•  Big Data questions are best asked periodically
   •  “Are we there yet?”
•  Realtime means … realtime

Page 5: Using MongoDB + Hadoop Together

It’s About The Functions, not the Terms

DON’T ASK:
•  Is this an operations or an analytics problem?
•  Is this online or offline?
•  What query language should we use?
•  What is my integration strategy across tools?

ASK INSTEAD:
•  Am I incrementally addressing data (esp. writes)?
•  Am I computing a precise answer or a trend?
•  Do I need to operate on this data in realtime?
•  What is my holistic architecture?

Page 6: Using MongoDB + Hadoop Together

Success in Big Data: MongoDB + Hadoop

•  Efficient Operationalization
   •  Robust data movements
   •  Clarity and fidelity of data movements
   •  Designing for change
•  Analysis Feedback
   •  Data computed in Hadoop integrated back into MongoDB

Page 7: Using MongoDB + Hadoop Together

What We’re Going to “Build” Today: a Realtime Directed Content System

•  Based on what users click, “recommended” content is returned in addition to the target

•  The example is sector (manufacturing, financial services, retail) neutral

•  System dynamically updates behavior in response to user activity

Page 8: Using MongoDB + Hadoop Together

The Participants and Their Roles

Directed Content System

•  Customers
•  Content Creators: generate and tag content from a known domain of tags
•  Management/Strategy: make decisions based on trends and other summarized data
•  Analysts/Data Scientists: operate on data to identify trends and develop tag domains
•  Developers/ProdOps: bring it all together (apps, SDLC, integration, etc.)

Page 9: Using MongoDB + Hadoop Together

Priority #1: Maximizing User Value

Considerations/Requirements

Maximize realtime user value and experience

Provide management reporting and trend analysis

Engineer for Day 2 agility on recommendation engine

Provide scrubbed click history for customer

Permit low-cost horizontal scaling

Minimize technical integration

Minimize technical footprint

Use conventional and/or approved tools

Provide a RESTful service layer

…..

Page 10: Using MongoDB + Hadoop Together

The Architecture

[Diagram: App(s) connected to both MongoDB and Hadoop/MapReduce]

Page 11: Using MongoDB + Hadoop Together

Complementary Strengths

MongoDB + App(s):
•  Standard design paradigm (objects, tools, 3rd-party products, IDEs, test drivers, skill pool, etc.)
•  Language flexibility (Java, C#, C++, Python, Scala, …)
•  Webscale deployment model: appservers, DMZ, monitoring
•  High-performance, rich-shape CRUD

Hadoop + MapReduce:
•  MapReduce design paradigm
•  Node deployment model
•  Very large set operations
•  Computationally intensive, longer duration
•  Read-dominated workload

Page 12: Using MongoDB + Hadoop Together

“Legacy” Approach: Somewhat unidirectional

[Diagram: nightly ETL from MongoDB and other sources into Hadoop]

•  Extract data from MongoDB and other sources nightly (or weekly)
•  Generate reports for people to read
•  Same pains as existing ETL: reconciliation, transformation, change management …

Page 13: Using MongoDB + Hadoop Together

Somewhat better approach

[Diagram: nightly ETL from MongoDB into Hadoop, plus ETL of summary data back to MongoDB]

•  Extract data from MongoDB and other sources nightly (or weekly)
•  Generate reports for people to read
•  Move important summary data back to MongoDB for consumption by apps
•  Still in an ETL-dominated landscape

Page 14: Using MongoDB + Hadoop Together

…but the overall problem remains:

•  How do we integrate and operate on both periodically generated data and realtime current data, in realtime?
•  Lackluster integration between OLTP and Hadoop
•  It’s not just about the database: you need a realtime profile and a realtime profile-update function

Page 15: Using MongoDB + Hadoop Together

The legacy problem in pseudocode

onContentClick() {
    String[] tags = content.getTags();
    Resource[] r = f1(database, tags);
}

•  Realtime intraday state not well-handled

•  Baselining is a different problem than click handling

Page 16: Using MongoDB + Hadoop Together

The Right Approach

•  Users have a specific Profile entity
•  The Profile captures trend analytics as baselining information
•  The Profile has per-tag “counters” that are updated with each interaction / click (see the update sketch below)
•  Counters plus baselining are passed to the fetch function
•  The fetch function itself could be dynamic!
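To make the per-tag “counters” concrete, here is a minimal sketch of the per-click Profile update using the MongoDB Java driver. The content database, profiles collection, and the count field are illustrative assumptions, not something the deck specifies:

import com.mongodb.BasicDBObject;
import com.mongodb.DBCollection;
import com.mongodb.MongoClient;
import java.util.Date;

public class ProfileClickUpdate {
    public static void main(String[] args) {
        MongoClient client = new MongoClient("localhost", 27017);
        DBCollection profiles = client.getDB("content").getCollection("profiles");

        String user = "Bob";
        String tag = "PETS";
        String url = "http://example.com/contentX";

        // One atomic update per click: bump the per-tag counter and
        // append the click event to that tag's history array.
        BasicDBObject query = new BasicDBObject("user", user);
        BasicDBObject update = new BasicDBObject()
            .append("$inc", new BasicDBObject("tags." + tag + ".count", 1))
            .append("$push", new BasicDBObject("tags." + tag + ".hist",
                new BasicDBObject("ts", new Date()).append("url", url)));
        profiles.update(query, update, true /* upsert */, false /* multi */);

        client.close();
    }
}

Because the counter and history live inside the user’s Profile document, the click handler touches exactly one document per click, which is what keeps the realtime path fast.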

Page 17: Using MongoDB + Hadoop Together

24 hours in the life of The System

•  Assume some content has been created and tagged

•  Two systematized tags: Pets & PowerTools

Page 18: Using MongoDB + Hadoop Together

Monday, 1:30AM EST

•  Fetch all user Profiles from MongoDB; load into Hadoop
   •  Or skip this step if using the MongoDB-Hadoop connector!


Page 19: Using MongoDB + Hadoop Together

MongoDB-Hadoop MapReduce Example

import java.io.IOException;
import java.util.Date;
import java.util.List;
import java.util.Set;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Mapper;
import org.bson.BSONObject;

public class ProfileMapper
        extends Mapper<Object, BSONObject, IntWritable, IntWritable> {

    @Override
    public void map(final Object pKey,
                    final BSONObject pValue,
                    final Context pContext)
            throws IOException, InterruptedException {

        String user = (String) pValue.get("user");
        Date d1 = (Date) pValue.get("lastUpdate");
        int count = 0;
        // "tags" is a subdocument: one entry per tag, each with a click history.
        BSONObject tags = (BSONObject) pValue.get("tags");
        Set<String> keys = tags.keySet();
        for (String tag : keys) {
            count += ((List) ((BSONObject) tags.get(tag)).get("hist")).size();
        }
        int avg = count / keys.size();
        pContext.write(new IntWritable(count), new IntWritable(avg));
    }
}
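The mapper needs a small driver to wire it to the connector. A minimal job-setup sketch, assuming the MongoDB-Hadoop v1 connector classes (MongoInputFormat, MongoConfigUtil) and an illustrative input URI:

import com.mongodb.hadoop.MongoInputFormat;
import com.mongodb.hadoop.util.MongoConfigUtil;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ProfileJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Point the connector straight at the profiles collection: no ETL step.
        MongoConfigUtil.setInputURI(conf, "mongodb://localhost:27017/content.profiles");

        Job job = Job.getInstance(conf, "profile-baseline");
        job.setJarByClass(ProfileJob.class);
        job.setMapperClass(ProfileMapper.class);
        job.setInputFormatClass(MongoInputFormat.class);
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(IntWritable.class);
        FileOutputFormat.setOutputPath(job, new Path(args[0]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}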

Page 20: Using MongoDB + Hadoop Together

MongoDB-Hadoop v1 (today)

✓  V1 adapter draws data directly from MongoDB
✓  No ETL, scripts, change management, etc.
✓  Storage optimized: NO data copies

[Diagram: a Hadoop MapReduce Mapper reading directly from MongoDB via the MongoDB-Hadoop v1 connector]

Page 21: Using MongoDB + Hadoop Together

MongoDB-Hadoop v2 (soon)

✓  V2 flows data directly into HDFS via a special MongoDB secondary
✓  No ETL, scripts, change management, etc.
✓  Data is copied – but still one data fabric
✓  Realtime data with snapshotting as an option

[Diagram: a special MongoDB secondary streaming data into HDFS, where MapReduce Mappers consume it]

Page 22: Using MongoDB + Hadoop Together

Monday, 1:45AM EST

•  Grind through all content data and user Profile data to produce:
   •  Tags based on feature extraction (vs. creator-applied tags)
   •  Trend baseline per user for tags Pets and PowerTools
•  Load Profiles with new baselines back into MongoDB

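The deck never shows the reduce side of this nightly grind. A purely illustrative reducer sketch, assuming the map phase emits a (user+tag, per-day click count) pair and the baseline figure is a simple per-day average:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class BaselineReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text userTag, Iterable<IntWritable> dailyCounts,
                       Context context)
            throws IOException, InterruptedException {
        // Average the daily click counts for one (user, tag) key to get
        // a single baseline value; a real job would emit a whole series.
        int sum = 0, days = 0;
        for (IntWritable c : dailyCounts) {
            sum += c.get();
            days++;
        }
        context.write(userTag, new IntWritable(days == 0 ? 0 : sum / days));
    }
}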

Page 23: Using MongoDB + Hadoop Together

Monday, 8AM EST

•  User Bob logs in and his Profile is retrieved from MongoDB
•  Bob clicks on Content X, which is already tagged as “Pets”
•  Bob has clicked on Pets-tagged content many times
•  Adjust the Profile for tag “Pets” and save it back to MongoDB

•  Analysis = f(Profile)

•  Analysis can be “anything”; it is simply a result. It could trigger an ad, a compliance alert, etc.


Page 24: Using MongoDB + Hadoop Together

Monday, 8:02AM EST

•  Bob clicks on Content Y, which is already tagged as “Spices”
•  “Spices” is a new tag type for Bob
•  Adjust the Profile for tag “Spices” and save it back to MongoDB
•  Analysis = f(Profile)


Page 25: Using MongoDB + Hadoop Together

Profile in Detail

{
  user: "Bob",
  personalData: {
    zip: "10024",
    gender: "M"
  },
  tags: {
    PETS: {
      algo: "A4",
      baseline: [0,0,10,4,1322,44,23, … ],
      hist: [
        { ts: datetime1, url: url1 },
        { ts: datetime2, url: url2 }  // 100 more
      ]
    },
    SPICE: {
      hist: [
        { ts: datetime3, url: url3 }
      ]
    }
  }
}

Page 26: Using MongoDB + Hadoop Together

Tag-based algorithm detail

getRecommendedContent(profile, ["PETS", other]) {
    if algo for a tag available {
        filter = algo(profile, tag);
    }
    fetch N recommendations (filter);
}

A4(profile, tag) {
    weight = get tag ("PETS") global weighting;
    adjustForPersonalBaseline(weight, "PETS" baseline);
    if "PETS" clicked more than 2 times in past 10 mins
        then weight += 10;
    if "PETS" clicked more than 10 times in past 2 days
        then weight += 3;
    return new filter({"PETS", weight}, globals)
}
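The A4 rule translates almost directly to Java. A minimal sketch, assuming the tag’s click history is available as a list of timestamps and the global tag weighting is looked up elsewhere; both assumptions stand in for the deck’s fuller profile model:

import java.util.Date;
import java.util.List;

public class A4Algo {
    static final long TEN_MINUTES = 10L * 60 * 1000;
    static final long TWO_DAYS = 2L * 24 * 60 * 60 * 1000;

    // Returns the filter weight for one tag, per the A4 pseudocode above.
    static int weightFor(List<Date> clickTimes, int globalWeight) {
        long now = System.currentTimeMillis();
        int lastTenMinutes = 0, lastTwoDays = 0;
        for (Date ts : clickTimes) {
            long age = now - ts.getTime();
            if (age <= TEN_MINUTES) lastTenMinutes++;
            if (age <= TWO_DAYS) lastTwoDays++;
        }
        // adjustForPersonalBaseline(...) would fold in the nightly baseline here.
        int weight = globalWeight;             // start from the global tag weighting
        if (lastTenMinutes > 2) weight += 10;  // hot right now: strong boost
        if (lastTwoDays > 10) weight += 3;     // active lately: mild boost
        return weight;
    }
}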

Page 27: Using MongoDB + Hadoop Together

Tuesday, 1AM EST


•  Fetch all user Profiles from MongoDB; load into Hadoop
   •  Or skip this step if using the MongoDB-Hadoop connector!

Page 28: Using MongoDB + Hadoop Together

Tuesday, 1:30AM EST

•  Grind through all content data and user Profile data to produce:
   •  Tags based on feature extraction (vs. creator-applied tags)
   •  Trend baselines for Pets, PowerTools, and Spice
•  Data can be specific to an individual or to a group
•  Load new baselines back into MongoDB


Page 29: Using MongoDB + Hadoop Together

New Profile in Detail

{
  user: "Bob",
  personalData: {
    zip: "10024",
    gender: "M"
  },
  tags: {
    PETS: {
      algo: "A4",
      baseline: [0,4,10,4,1322,44,23, … ],
      hist: [
        { ts: datetime1, url: url1 },
        { ts: datetime2, url: url2 }  // 100 more
      ]
    },
    SPICE: {
      baseline: [1],
      hist: [
        { ts: datetime3, url: url3 }
      ]
    }
  }
}

Page 30: Using MongoDB + Hadoop Together

Tuesday, 1:35AM EST

•  Perform maintenance on user Profiles (see the sketch below)
   •  Click-history trimming (variety of algorithms)
   •  “Dead tag” removal
   •  Update of auxiliary reference data

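A minimal sketch of the click-history trimming step, assuming MongoDB’s $push with $each/$slice keeps only the newest 50 entries (matching the trimmed Profile on the next slide); OBSOLETE_TAG is a hypothetical dead tag:

import com.mongodb.BasicDBObject;
import com.mongodb.DBCollection;
import com.mongodb.MongoClient;
import java.util.ArrayList;

public class ProfileMaintenance {
    public static void main(String[] args) {
        MongoClient client = new MongoClient("localhost", 27017);
        DBCollection profiles = client.getDB("content").getCollection("profiles");

        // Trim each PETS click history to its most recent 50 entries:
        // $push with an empty $each plus $slice keeps only the array tail.
        BasicDBObject trim = new BasicDBObject("$push",
            new BasicDBObject("tags.PETS.hist",
                new BasicDBObject("$each", new ArrayList<Object>())
                    .append("$slice", -50)));
        profiles.update(new BasicDBObject(), trim, false, true /* all profiles */);

        // "Dead tag" removal: drop a tag subdocument no longer in the domain.
        profiles.update(new BasicDBObject(),
            new BasicDBObject("$unset",
                new BasicDBObject("tags.OBSOLETE_TAG", "")),
            false, true);

        client.close();
    }
}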

Page 31: Using MongoDB + Hadoop Together

New Profile in Detail

{
  user: "Bob",
  personalData: {
    zip: "10022",
    gender: "M"
  },
  tags: {
    PETS: {
      algo: "A4",
      baseline: [ 1322,44,23, … ],
      hist: [
        { ts: datetime1, url: url1 }  // 50 more
      ]
    },
    SPICE: {
      algo: "Z1",
      baseline: [1],
      hist: [
        { ts: datetime3, url: url3 }
      ]
    }
  }
}

Page 32: Using MongoDB + Hadoop Together

Feel free to run the baselining more frequently

… but avoid “Are We There Yet?”


Page 33: Using MongoDB + Hadoop Together

Nearterm / Realtime Questions & Actions

With respect to the Customer:
•  What has Bob done over the past 24 hours?
•  Given an input, make a logic decision in 100 ms or less

With respect to the Provider:
•  What are all current users doing or looking at?
•  Can we correlate single events to shifts in behavior in the near term?

Page 34: Using MongoDB + Hadoop Together

Longterm/ Not Realtime Questions & Actions

With respect to the Customer:
•  Any way to explain historic performance / actions?
•  What are recommendations for the future?

With respect to the Provider:
•  Can we correlate multiple events from multiple sources over a long period of time to identify trends?
•  What is my entire customer base doing over 2 years?
   •  Show me a time vs. aggregate tag-hit chart
   •  Slice and dice and aggregate tags vs. XYZ
   •  What tags are trending up or down?

Page 35: Using MongoDB + Hadoop Together

Another Example: Realtime Risk

[Diagram: Applications log trade activities to Trade Processing and query trades; Applications query risk via a Risk Service backed by a Risk Calculation engine (Spark); a Risk Params Admin function feeds the calculation; Analysis/Reporting (Impala) runs over Hadoop alongside other HDFS data]

Page 36: Using MongoDB + Hadoop Together

Recording a trade

1. Bank makes a trade
2. Trade sent to Trade Processing
3. Trade Processing writes trade to MongoDB
4. Realtime replicate trade to Hadoop/HDFS

Non-functional notes:
•  High volume of data ingestion (10,000s or more events per second)
•  Durable storage of trade data
•  Store trade events across all asset classes
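A minimal sketch of step 3, the durable trade write into MongoDB; the risk database, trades collection, and field values are illustrative assumptions:

import com.mongodb.BasicDBObject;
import com.mongodb.DBCollection;
import com.mongodb.MongoClient;
import com.mongodb.WriteConcern;
import java.util.Date;

public class TradeCapture {
    public static void main(String[] args) {
        MongoClient client = new MongoClient("localhost", 27017);
        DBCollection trades = client.getDB("risk").getCollection("trades");

        // A majority write concern trades a little latency for the durable
        // storage the non-functional notes call for.
        BasicDBObject trade = new BasicDBObject("dealId", "D-12345")
            .append("counterparty", "ACME")
            .append("assetClass", "IRS")
            .append("notional", 10000000L)
            .append("book", "RATES-NY")
            .append("ts", new Date());
        trades.insert(trade, WriteConcern.MAJORITY);

        client.close();
    }
}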

Page 37: Using MongoDB + Hadoop Together

Querying deal / trade / event data

1. Query on deal attributes (id, counterparty, asset class, termination date, notional amount, book)
2. MongoDB performs an index-optimized query and Trade Processing assembles Deal/Trade/Event data into a response packet
3. Return response packet to caller

Non-functional notes:
•  System can support very high volume (10,000s or more queries per second)
•  Millisecond response times
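A minimal sketch of an index-optimized deal query with the Java driver; the compound-index fields are assumptions drawn from the deal attributes listed in step 1:

import com.mongodb.BasicDBObject;
import com.mongodb.DBCollection;
import com.mongodb.DBCursor;
import com.mongodb.MongoClient;

public class TradeQuery {
    public static void main(String[] args) {
        MongoClient client = new MongoClient("localhost", 27017);
        DBCollection trades = client.getDB("risk").getCollection("trades");

        // A compound index over common deal attributes lets MongoDB answer
        // these lookups in milliseconds even at high query volume.
        trades.createIndex(new BasicDBObject("counterparty", 1)
            .append("assetClass", 1)
            .append("book", 1));

        DBCursor cursor = trades.find(new BasicDBObject("counterparty", "ACME")
            .append("assetClass", "IRS"));
        while (cursor.hasNext()) {
            System.out.println(cursor.next());
        }
        client.close();
    }
}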

Page 38: Using MongoDB + Hadoop Together

Updating intra-day risk data

1. Mirror of trade data already stored in HDFS; trade data partitioned into time windows
2. Signal/timer kicks off a “run”
3. Spark ingests the new partition of trade data as an RDD, then calculates and merges risk data based on the latest trades
4. Risk data written directly to MongoDB, where it is indexed and available for online queries / aggregations / application logic
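A minimal sketch of such a run using Spark’s Java API, assuming trades are mirrored into HDFS as CSV time-window partitions and results flow back through the MongoDB-Hadoop connector’s MongoOutputFormat; the paths, URI, and the toy notional-per-book “risk” figure are all illustrative:

import com.mongodb.hadoop.MongoOutputFormat;
import org.apache.hadoop.conf.Configuration;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.bson.BSONObject;
import org.bson.BasicBSONObject;
import scala.Tuple2;

public class RiskRun {
    public static void main(String[] args) {
        JavaSparkContext sc =
            new JavaSparkContext(new SparkConf().setAppName("intraday-risk"));

        // Steps 1-3: ingest the latest time-window partition of trade data.
        JavaRDD<String> trades = sc.textFile("hdfs:///trades/2015-06-14-13/");

        // Toy "risk": notional summed per book; stands in for the real
        // per-bucket risk calculation and merge.
        JavaPairRDD<String, Double> risk = trades
            .mapToPair(line -> {
                String[] f = line.split(",");   // book,notional,...
                return new Tuple2<>(f[0], Double.parseDouble(f[1]));
            })
            .reduceByKey(Double::sum);

        // Step 4: write results straight back to MongoDB, where they are
        // indexed and immediately queryable.
        Configuration outConf = new Configuration();
        outConf.set("mongo.output.uri", "mongodb://localhost:27017/risk.intraday");
        risk.mapToPair(t -> new Tuple2<Object, BSONObject>(t._1,
                new BasicBSONObject("book", t._1).append("risk", t._2)))
            .saveAsNewAPIHadoopFile("file:///unused", Object.class,
                BSONObject.class, MongoOutputFormat.class, outConf);

        sc.stop();
    }
}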

Page 39: Using MongoDB + Hadoop Together

Querying detail & aggregated risk on demand

1. Applications can use the full MongoDB query API to access risk data and trade data
2. Risk data can be indexed on multiple fields for fast access by multiple dimensions
3. Hadoop jobs periodically apply incremental updates to risk data with no downtime
4. Interpolated / matrix risk can be computed on the fly

Non-functional notes:
•  System can support very high volume (10,000s or more queries per second)
•  Millisecond response times
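A minimal sketch of an on-demand roll-up using the aggregation API; the desk and risk field names are illustrative assumptions about the stored risk documents:

import com.mongodb.AggregationOutput;
import com.mongodb.BasicDBObject;
import com.mongodb.DBCollection;
import com.mongodb.DBObject;
import com.mongodb.MongoClient;
import java.util.Arrays;

public class RiskRollup {
    public static void main(String[] args) {
        MongoClient client = new MongoClient("localhost", 27017);
        DBCollection risk = client.getDB("risk").getCollection("intraday");

        // Roll precomputed per-book risk up to the desk level on demand,
        // with no Hadoop round-trip.
        DBObject match = new BasicDBObject("$match",
            new BasicDBObject("assetClass", "IRS"));
        DBObject group = new BasicDBObject("$group",
            new BasicDBObject("_id", "$desk")
                .append("totalRisk", new BasicDBObject("$sum", "$risk")));

        AggregationOutput out = risk.aggregate(Arrays.asList(match, group));
        for (DBObject row : out.results()) {
            System.out.println(row);
        }
        client.close();
    }
}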

Page 40: Using MongoDB + Hadoop Together

Trade Analytics & Reporting

1. Impala provides full SQL access to all content in Hadoop
2. Dashboards and reporting frameworks deliver periodic information to consumers
3. A breadth of data-discovery / ad-hoc analysis tools can be brought to bear on all data in Hadoop

Non-functional notes:
•  Lower query frequency
•  Full SQL query flexibility
•  Most queries / analyses yield value by accessing large volumes of data (e.g. all events in the last 30 days – or 30 months)

[Diagram: Applications, Dashboards, Reports, and Ad-hoc Analysis tools querying all Hadoop data through Impala]
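Since Impala speaks the HiveServer2 protocol, a Java client can reach it over JDBC with the Hive driver. A minimal sketch, assuming an illustrative trades table and host; connection details vary by cluster:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class TradeReport {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://impala-host:21050/;auth=noSasl");
             Statement stmt = conn.createStatement();
             // Full SQL over all events in the last 30 days.
             ResultSet rs = stmt.executeQuery(
                 "SELECT book, COUNT(*) AS trades, SUM(notional) AS total " +
                 "FROM trades WHERE ts >= now() - interval 30 days " +
                 "GROUP BY book ORDER BY total DESC")) {
            while (rs.next()) {
                System.out.println(rs.getString("book") + ": "
                    + rs.getLong("trades") + " trades");
            }
        }
    }
}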

Page 41: Using MongoDB + Hadoop Together

The Key To Success: It is One System

[Diagram: App(s), MongoDB, Hadoop, and MapReduce operating as one integrated system]

Page 42: Using MongoDB + Hadoop Together

Q&A [email protected]

Page 43: Using MongoDB + Hadoop Together

Thank You

Buzz Moschetti [email protected]

#MongoDB