Using MongoDB + Hadoop Together

DESCRIPTION
Learn how to leverage the power of MongoDB and Hadoop together with this presentation.

TRANSCRIPT
Using MongoDB and Hadoop Together For Success
Enterprise Architect, MongoDB
Buzz Moschetti [email protected]
#MongoDB
Who is your Presenter? • Yes, I use “Buzz” on my business cards
• Former Investment Bank Chief Architect at JPMorganChase and Bear Stearns before that
• Over 25 years of designing and building systems
• Big and small • Super-specialized to broadly useful in any vertical • “Traditional” to completely disruptive • Advocate of language leverage and strong factoring
• Still programming – using emacs, of course
Agenda • (Occasionally) Brutal Truths about Big Data
• The Key To Success in Large Scale Data Management • Review of Directed Content Business Architecture
• Technical Implementation Examples • Recommendation Capability • Realtime Trade / Position Risk
• Q & A
Truths • Clear definition of Big Data still maturing
• Efficiently operationalizing Big Data is non-trivial • Developing, debugging, understanding MapReduce • Cluster monitoring & management, job scheduling/recovery • If you thought regular ETL Hell was bad….
• Big Data is not about math/set accuracy • The last 25,000 items in a 25,497,612-item set “don’t matter”
• Big Data questions are best asked periodically • “Are we there yet?”
• Realtime means … realtime
It’s About The Functions, not the Terms DON’T ASK: • Is this an operations or an analytics problem? • Is this online or offline? • What query language should we use? • What is my integration strategy across tools?
ASK INSTEAD: • Am I incrementally addressing data (esp. writes)? • Am I computing a precise answer or a trend? • Do I need to operate on this data in realtime? • What is my holistic architecture?
Success in Big Data: MongoDB + Hadoop
• Efficient Operationalization • Robust data movements • Clarity and fidelity of data movements • Designing for change
• Analysis Feedback • Data computed in Hadoop integrated back into MongoDB
What We’re Going to “Build” today Realtime Directed Content System
• Based on what users click, “recommended” content is returned in addition to the target
• The example is sector (manufacturing, financial services, retail) neutral
• System dynamically updates behavior in response to user activity
The Participants and Their Roles
Directed Content System
Customers
Content Creators
Management/ Strategy
Analysts/ Data Scientists
Generate and tag content from a known domain of tags
Make decisions based on trends and other summarized data
Operate on data to identify trends and develop tag domains
Developers/ ProdOps
Bring it all together: apps, SDLC, integration, etc.
Priority #1: Maximizing User Value
Considerations/Requirements
Maximize realtime user value and experience
Provide management reporting and trend analysis
Engineer for Day 2 agility on recommendation engine
Provide scrubbed click history for customer
Permit low-cost horizontal scaling
Minimize technical integration
Minimize technical footprint
Use conventional and/or approved tools
Provide a RESTful service layer
…..
The Architecture
MongoDB Hadoop App(s) MapReduce
Complementary Strengths
MongoDB Hadoop App(s) MapReduce
• Standard design paradigm (objects, tools, 3rd party products, IDEs, test drivers, skill pool, etc. etc.)
• Language flexibility (Java, C#, C++, Python, Scala, …)
• Webscale deployment model • appservers, DMZ, monitoring
• High performance rich shape CRUD
• MapReduce design paradigm • Node deployment model • Very large set operations • Computationally intensive, longer duration • Read-dominated workload
“Legacy” Approach: Somewhat unidirectional
MongoDB Hadoop App(s) MapReduce
• Extract data from MongoDB and other sources nightly (or weekly)
• Generate reports for people to read
• Same pains as existing ETL: reconciliation, transformation, change management …
ETL
Somewhat better approach
MongoDB Hadoop App(s) MapReduce
• Extract data from MongoDB and other sources nightly (or weekly)
• Generate reports for people to read
• Move important summary data back to MongoDB for consumption by apps
• Still in ETL-dominated landscape
ETL
ETL
…but the overall problem remains:
• How to realtime integrate and operate upon both periodically generated data and realtime current data?
• Lackluster integration between OLTP and Hadoop
• It’s not just about the database: you need a realtime profile and profile update function
The legacy problem in pseudocode
onContentClick() {
    String[] tags = content.getTags();
    Resource[] r = f1(database, tags);
}
• Realtime intraday state not well-handled
• Baselining is a different problem than click handling
The Right Approach • Users have a specific Profile entity
• The Profile captures trend analytics as baselining information
• The Profile has per-tag “counters” that are updated with each interaction / click
• Counters plus baselining are passed to fetch function
• The fetch function itself could be dynamic!
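The per-tag counter idea above can be shown as a minimal in-memory sketch. In production these counters live on the Profile document in MongoDB; the class and method names here (Profile, recordClick) are illustrative, not from any product API.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal in-memory sketch of the per-tag counters on a user Profile.
class Profile {
    final String user;
    final Map<String, Integer> tagCounters = new HashMap<>();

    Profile(String user) { this.user = user; }

    // Called on each interaction: bump the counter for every tag
    // attached to the clicked content.
    void recordClick(String... tags) {
        for (String tag : tags) {
            tagCounters.merge(tag, 1, Integer::sum);
        }
    }

    int counter(String tag) {
        return tagCounters.getOrDefault(tag, 0);
    }
}
```

The counters, together with the nightly baseline, are what get passed to the fetch function.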
24 hours in the life of The System
• Assume some content has been created and tagged
• Two systematized tags: Pets & PowerTools
Monday, 1:30AM EST
• Fetch all user Profiles from MongoDB; load into Hadoop • Or skip if using the MongoDB-Hadoop connector!
MongoDB Hadoop App(s) MapReduce
MongoDB-Hadoop MapReduce Example

import java.io.IOException;
import java.util.Date;
import java.util.List;
import java.util.Set;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Mapper;
import org.bson.BSONObject;

public class ProfileMapper
        extends Mapper<Object, BSONObject, IntWritable, IntWritable> {

    @Override
    public void map(final Object pKey,
                    final BSONObject pValue,
                    final Context pContext)
            throws IOException, InterruptedException {
        String user = (String) pValue.get("user");
        Date d1 = (Date) pValue.get("lastUpdate");
        int count = 0;
        BSONObject tags = (BSONObject) pValue.get("tags");
        Set<String> keys = tags.keySet();
        for (String tag : keys) {
            // hist is a BSON array; count the clicks recorded per tag
            List hist = (List) ((BSONObject) tags.get(tag)).get("hist");
            count += hist.size();
        }
        int avg = count / keys.size();
        pContext.write(new IntWritable(count), new IntWritable(avg));
    }
}
MongoDB-Hadoop v1 (today)
Hadoop
✓ V1 adapter draws data directly from MongoDB
✓ No ETL, scripts, change management, etc.
✓ Storage optimized: NO data copies
MongoDB-Hadoop v2 (soon)
Hadoop
✓ V2 flows data directly into HDFS via a special MongoDB secondary
✓ No ETL, scripts, change management, etc.
✓ Data is copied – but still one data fabric
✓ Realtime data with snapshotting as an option
Monday, 1:45AM EST
• Grind through all content data and user Profile data to produce: • Tags based on feature extraction (vs. creator-applied tags) • Trend baseline per user for tags Pets and PowerTools
• Load Profiles with new baseline back into MongoDB
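The nightly "trend baseline per user" production can be illustrated with a small sketch that buckets a tag's click history into per-day counts, matching the baseline arrays shown on the Profile documents below. The one-bucket-per-day layout (oldest first) is an assumption for illustration; the real job would run inside Hadoop.

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;
import java.util.List;

// Sketch of the baselining step: turn a tag's raw click dates into a
// per-day count array, oldest day first.
class Baseliner {
    static int[] dailyBaseline(List<LocalDate> clickDates, LocalDate start, int days) {
        int[] baseline = new int[days];
        for (LocalDate d : clickDates) {
            long idx = ChronoUnit.DAYS.between(start, d);
            if (idx >= 0 && idx < days) {
                baseline[(int) idx]++;
            }
        }
        return baseline;
    }
}
```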
MongoDB Hadoop App(s) MapReduce
Monday, 8AM EST
• User Bob logs in and Profile retrieved from MongoDB • Bob clicks on Content X which is already tagged as “Pets” • Bob has clicked on Pets tagged content many times • Adjust Profile for tag “Pets” and save back to MongoDB
• Analysis = f(Profile)
• Analysis can be “anything”; it is simply a result. It could trigger an ad, a compliance alert, etc.
MongoDB Hadoop App(s) MapReduce
Monday, 8:02AM EST
• Bob clicks on Content Y which is already tagged as “Spices” • Spice is a new tag type for Bob • Adjust Profile for tag “Spices” and save back to MongoDB • Analysis = f(profile)
MongoDB Hadoop App(s) MapReduce
Profile in Detail

{
    user: "Bob",
    personalData: {
        zip: "10024",
        gender: "M"
    },
    tags: {
        PETS: { algo: "A4",
            baseline: [0,0,10,4,1322,44,23, ... ],
            hist: [
                { ts: datetime1, url: url1 },
                { ts: datetime2, url: url2 } // 100 more
            ]},
        SPICE: { hist: [
            { ts: datetime3, url: url3 }
        ]}
    }
}
Tag-based algorithm detail

getRecommendedContent(profile, ["PETS", other]) {
    if algo for a tag available {
        filter = algo(profile, tag);
    }
    fetch N recommendations (filter);
}

A4(profile, tag) {
    weight = get tag ("PETS") global weighting;
    adjustForPersonalBaseline(weight, "PETS" baseline);
    if "PETS" clicked more than 2 times in past 10 mins
        then weight += 10;
    if "PETS" clicked more than 10 times in past 2 days
        then weight += 3;
    return new filter({"PETS", weight}, globals);
}
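The A4 rule can also be rendered as runnable Java. The thresholds and +10/+3 bumps come from the pseudocode; the method names and the click-history-as-timestamps representation are illustrative assumptions.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;

// Runnable sketch of the A4 weighting rule: boost a tag's weight when
// its click history shows a short-term burst or sustained recent interest.
class A4 {
    static long clicksWithin(List<Instant> hist, Instant now, Duration window) {
        return hist.stream()
                   .filter(ts -> !ts.isBefore(now.minus(window)))
                   .count();
    }

    static int weight(int globalWeight, List<Instant> hist, Instant now) {
        int weight = globalWeight;
        if (clicksWithin(hist, now, Duration.ofMinutes(10)) > 2) {
            weight += 10;   // clicked more than 2 times in past 10 mins
        }
        if (clicksWithin(hist, now, Duration.ofDays(2)) > 10) {
            weight += 3;    // clicked more than 10 times in past 2 days
        }
        return weight;
    }
}
```

Because the algorithm name is stored on the Profile (algo: "A4"), the fetch function can dispatch to a different rule per tag, which is what makes the engine dynamic.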
Tuesday, 1AM EST
MongoDB Hadoop App(s) MapReduce
• Fetch all user Profiles from MongoDB; load into Hadoop • Or skip if using the MongoDB-Hadoop connector!
Tuesday, 1:30AM EST
• Grind through all content data and user profile data to produce: • Tags based on feature extraction (vs. creator-applied tags) • Trend baseline for Pets and PowerTools and Spice
• Data can be specific to individual or by group • Load new baselines back into MongoDB
MongoDB Hadoop App(s) MapReduce
New Profile in Detail

{
    user: "Bob",
    personalData: {
        zip: "10024",
        gender: "M"
    },
    tags: {
        PETS: { algo: "A4",
            baseline: [0,4,10,4,1322,44,23, ... ],
            hist: [
                { ts: datetime1, url: url1 },
                { ts: datetime2, url: url2 } // 100 more
            ]},
        SPICE: {
            baseline: [1],
            hist: [
                { ts: datetime3, url: url3 }
            ]}
    }
}
Tuesday, 1:35AM EST
• Perform maintenance on user Profiles • Click history trimming (variety of algorithms) • “Dead tag” removal • Update of auxiliary reference data
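One of the "variety of algorithms" for click history trimming can be sketched minimally: keep only the newest N entries of a tag's hist array. N and the keep-newest policy are assumptions here; real trimming might be time-based instead.

```java
import java.util.List;

// Sketch of click-history trimming during Profile maintenance:
// retain only the most recent n entries of a hist array.
class HistTrimmer {
    static <T> List<T> trimToLast(List<T> hist, int n) {
        int from = Math.max(0, hist.size() - n);
        return hist.subList(from, hist.size());
    }
}
```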
MongoDB Hadoop App(s) MapReduce
New Profile in Detail

{
    user: "Bob",
    personalData: {
        zip: "10022",
        gender: "M"
    },
    tags: {
        PETS: { algo: "A4",
            baseline: [ 1322,44,23, ... ],
            hist: [
                { ts: datetime1, url: url1 } // 50 more
            ]},
        SPICE: { algo: "Z1",
            baseline: [1],
            hist: [
                { ts: datetime3, url: url3 }
            ]}
    }
}
Feel free to run the baselining more frequently
… but avoid “Are We There Yet?”
MongoDB Hadoop App(s) MapReduce
Nearterm / Realtime Questions & Actions
With respect to the Customer: • What has Bob done over the past 24 hours? • Given an input, make a logic decision in 100ms or less
With respect to the Provider: • What are all current users doing or looking at? • Can we nearterm correlate single events to shifts in behavior?
Longterm/ Not Realtime Questions & Actions
With respect to the Customer: • Any way to explain historic performance / actions? • What are recommendations for the future?
With respect to the Provider: • Can we correlate multiple events from multiple sources
over a long period of time to identify trends? • What is my entire customer base doing over 2 years? • Show me a time vs. aggregate tag hit chart • Slice and dice and aggregate tags vs. XYZ • What tags are trending up or down?
Applications
Another Example: Realtime Risk
Trade Processing
Risk Calculation
(Spark)
Risk Service
Log trade activities
Query trades
Query Risk
Risk Params Admin
Analysis/Reporting (Impala)
OTHER HDFS DATA
OTHER HDFS DATA
Recording a trade
Applications
Trade Processing
1. Bank makes a trade 2. Trade sent to Trade Processing 3. Trade Processing writes trade to MongoDB 4. Realtime replicate trade to Hadoop/HDFS
Non-functional notes:
• High volume of data ingestion (10,000s or more events per second)
• Durable storage of trade data
• Store trade events across all asset classes
Querying deal / trade / event data
1. Query on deal attributes (id, counterparty, asset class, termination date, notional amount, book)
2. MongoDB performs index-optimized query and Trade Processing assembles Deal/Trade/Event data into response packet
3. Return response packet to caller
Non-functional notes:
• System can support very high volume (10,000s or more queries per second)
• Millisecond response times
Applications
Trade Processing
Updating intra-day risk data
1. Mirror of trade data already stored in HDFS; trade data partitioned into time windows
2. Signal/timer kicks off a “run”
3. Spark ingests the new partition of trade data as an RDD, then calculates and merges risk data based on the latest trades
4. Risk data written directly to MongoDB and indexed and available for online queries / aggregations / applications logic
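The merge in steps 3–4 can be sketched as a pure function: risk numbers computed for the newest trade window are folded into the running per-book totals before being written back to MongoDB. Treating risk as a single additive scalar per book is a simplifying assumption for illustration; the real Spark job operates on RDDs of richer risk records.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the incremental risk merge: fold the latest window's
// per-book risk deltas into the current totals.
class RiskMerger {
    static Map<String, Double> merge(Map<String, Double> current,
                                     Map<String, Double> windowRisk) {
        Map<String, Double> merged = new HashMap<>(current);
        windowRisk.forEach((book, delta) -> merged.merge(book, delta, Double::sum));
        return merged;
    }
}
```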
Applications
Risk Service
Risk Calculation
(Spark)
Querying detail & aggregated risk on demand
1. Applications can use full MongoDB query API to access risk data and trade data
2. Risk data can be indexed on multiple fields for fast access by multiple dimensions
3. Hadoop jobs periodically apply incremental updates to risk data with no down time
4. Interpolated / matrix risk can be computed on-the-fly
Non-functional notes:
• System can support very high volume (10,000s or more queries per second)
• Millisecond response times
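Point 4 above, "interpolated risk computed on-the-fly", can be shown in its simplest form: linear interpolation between two stored risk points. The linear model and the tenor axis are illustrative assumptions; real matrix risk would interpolate across several dimensions.

```java
// Sketch of on-the-fly risk interpolation: given risk values r0 and r1
// stored at tenors t0 and t1, estimate risk at an intermediate tenor t.
class RiskInterpolator {
    static double interpolate(double t0, double r0,
                              double t1, double r1, double t) {
        return r0 + (r1 - r0) * (t - t0) / (t1 - t0);
    }
}
```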
Applications
Risk Service
Trade Analytics & Reporting
1. Impala provides full SQL access to all content in Hadoop
2. Dashboards and Reporting frameworks deliver periodic information to consumers
3. Breadth of data discovery / ad-hoc analysis tools can be brought to bear on all data in Hadoop
Non-functional notes:
• Lower query frequency
• Full SQL query flexibility
• Most queries / analysis yield value by accessing large volumes of data (e.g. all events in the last 30 days – or 30 months)
Applications
Impala
Dashboards Reports
Ad-hoc Analysis
The Key To Success: It is One System
MongoDB
Hadoop
App(s)
MapReduce