Using MongoDB + Hadoop Together

DESCRIPTION
Learn how to leverage the power of MongoDB and Hadoop together with this presentation.

TRANSCRIPT
Using MongoDB and Hadoop Together For Success
Enterprise Architect, MongoDB
Buzz Moschetti [email protected]
#MongoDB
Who is your Presenter? • Yes, I use “Buzz” on my business cards
• Former Investment Bank Chief Architect at JPMorganChase and Bear Stearns before that
• Over 25 years of designing and building systems
• Big and small • Super-specialized to broadly useful in any vertical • “Traditional” to completely disruptive • Advocate of language leverage and strong factoring
• Still programming – using emacs, of course
Agenda • (Occasionally) Brutal Truths about Big Data
• The Key To Success in Large Scale Data Management • Review of Directed Content Business Architecture
• Technical Implementation Examples • Recommendation Capability • Realtime Trade / Position Risk
• Q & A
Truths • Clear definition of Big Data still maturing
• Efficiently operationalizing Big Data is non-trivial • Developing, debugging, understanding MapReduce • Cluster monitoring & management, job scheduling/recovery • If you thought regular ETL Hell was bad….
• Big Data is not about math/set accuracy • The last 25,000 items in a 25,497,612-item set “don’t matter”
• Big Data questions are best asked periodically • “Are we there yet?”
• Realtime means … realtime
It’s About The Functions, not the Terms DON’T ASK: • Is this an operations or an analytics problem? • Is this online or offline? • What query language should we use? • What is my integration strategy across tools?
ASK INSTEAD: • Am I incrementally addressing data (esp. writes)? • Am I computing a precise answer or a trend? • Do I need to operate on this data in realtime? • What is my holistic architecture?
Success in Big Data: MongoDB + Hadoop
• Efficient Operationalization • Robust data movements • Clarity and fidelity of data movements • Designing for change
• Analysis Feedback • Data computed in Hadoop integrated back into MongoDB
What We’re Going to “Build” today Realtime Directed Content System
• Based on what users click, “recommended” content is returned in addition to the target
• The example is sector (manufacturing, financial services, retail) neutral
• System dynamically updates behavior in response to user activity
The Participants and Their Roles
Directed Content System
Customers
Content Creators
Management/ Strategy
Analysts/ Data Scientists
Generate and tag content from a known domain of tags
Make decisions based on trends and other summarized data
Operate on data to identify trends and develop tag domains
Developers/ ProdOps
Bring it all together: apps, SDLC, integration, etc.
Priority #1: Maximizing User Value
Considerations/Requirements
Maximize realtime user value and experience
Provide management reporting and trend analysis
Engineer for Day 2 agility on recommendation engine
Provide scrubbed click history for customer
Permit low-cost horizontal scaling
Minimize technical integration
Minimize technical footprint
Use conventional and/or approved tools
Provide a RESTful service layer
…..
The Architecture
MongoDB Hadoop App(s) MapReduce
Complementary Strengths
MongoDB Hadoop App(s) MapReduce
• Standard design paradigm (objects, tools, 3rd party products, IDEs, test drivers, skill pool, etc. etc.)
• Language flexibility (Java, C#, C++, Python, Scala, …)
• Webscale deployment model • appservers, DMZ, monitoring
• High performance rich shape CRUD
• MapReduce design paradigm • Node deployment model • Very large set operations • Computationally intensive, longer duration • Read-dominated workload
“Legacy” Approach: Somewhat unidirectional
MongoDB Hadoop App(s) MapReduce
• Extract data from MongoDB and other sources nightly (or weekly)
• Generate reports for people to read
• Same pains as existing ETL: reconciliation, transformation, change management …
ETL
Somewhat better approach
MongoDB Hadoop App(s) MapReduce
• Extract data from MongoDB and other sources nightly (or weekly)
• Generate reports for people to read
• Move important summary data back to MongoDB for consumption by apps
• Still in ETL-dominated landscape
ETL
ETL
…but the overall problem remains:
• How to realtime integrate and operate upon both periodically generated data and realtime current data?
• Lackluster integration between OLTP and Hadoop
• It’s not just about the database: you need a realtime profile and profile update function
The legacy problem in pseudocode
onContentClick() {
    String[] tags = content.getTags();
    Resource[] r = f1(database, tags);
}
• Realtime intraday state not well-handled
• Baselining is a different problem than click handling
The Right Approach • Users have a specific Profile entity
• The Profile captures trend analytics as baselining information
• The Profile has per-tag “counters” that are updated with each interaction / click
• Counters plus baselining are passed to fetch function
• The fetch function itself could be dynamic!
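The per-tag counter idea above can be shown as a minimal in-memory sketch. In production these counters live on the Profile document in MongoDB; the class and method names here (Profile, recordClick) are illustrative, not from any product API.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal in-memory sketch of the per-tag counters on a user Profile.
class Profile {
    final String user;
    final Map<String, Integer> tagCounters = new HashMap<>();

    Profile(String user) { this.user = user; }

    // Called on each interaction: bump the counter for every tag
    // attached to the clicked content.
    void recordClick(String... tags) {
        for (String tag : tags) {
            tagCounters.merge(tag, 1, Integer::sum);
        }
    }

    int counter(String tag) {
        return tagCounters.getOrDefault(tag, 0);
    }
}
```

The counters, together with the nightly baseline, are what get passed to the fetch function.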
24 hours in the life of The System
• Assume some content has been created and tagged
• Two systematized tags: Pets & PowerTools
Monday, 1:30AM EST
• Fetch all user Profiles from MongoDB; load into Hadoop • Or skip if using the MongoDB-Hadoop connector!
MongoDB Hadoop App(s) MapReduce
MongoDB-Hadoop MapReduce Example

import java.io.IOException;
import java.util.Date;
import java.util.List;
import java.util.Set;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Mapper;
import org.bson.BSONObject;

public class ProfileMapper
        extends Mapper<Object, BSONObject, IntWritable, IntWritable> {

    @Override
    public void map(final Object pKey,
                    final BSONObject pValue,
                    final Context pContext)
            throws IOException, InterruptedException {
        String user = (String) pValue.get("user");
        Date d1 = (Date) pValue.get("lastUpdate");
        int count = 0;
        BSONObject tags = (BSONObject) pValue.get("tags");
        Set<String> keys = tags.keySet();
        for (String tag : keys) {
            // hist is a BSON array; count the clicks recorded per tag
            List hist = (List) ((BSONObject) tags.get(tag)).get("hist");
            count += hist.size();
        }
        int avg = count / keys.size();
        pContext.write(new IntWritable(count), new IntWritable(avg));
    }
}
MongoDB-Hadoop v1 (today)
Hadoop
✓ V1 adapter draws data directly from MongoDB
✓ No ETL, scripts, change management, etc.
✓ Storage optimized: NO data copies
MongoDB-Hadoop v2 (soon)
Hadoop
✓ V2 flows data directly into HDFS via a special MongoDB secondary
✓ No ETL, scripts, change management, etc.
✓ Data is copied – but still one data fabric
✓ Realtime data with snapshotting as an option
Monday, 1:45AM EST
• Grind through all content data and user Profile data to produce: • Tags based on feature extraction (vs. creator-applied tags) • Trend baseline per user for tags Pets and PowerTools
• Load Profiles with new baseline back into MongoDB
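The nightly "trend baseline per user" production can be illustrated with a small sketch that buckets a tag's click history into per-day counts, matching the baseline arrays shown on the Profile documents below. The one-bucket-per-day layout (oldest first) is an assumption for illustration; the real job would run inside Hadoop.

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;
import java.util.List;

// Sketch of the baselining step: turn a tag's raw click dates into a
// per-day count array, oldest day first.
class Baseliner {
    static int[] dailyBaseline(List<LocalDate> clickDates, LocalDate start, int days) {
        int[] baseline = new int[days];
        for (LocalDate d : clickDates) {
            long idx = ChronoUnit.DAYS.between(start, d);
            if (idx >= 0 && idx < days) {
                baseline[(int) idx]++;
            }
        }
        return baseline;
    }
}
```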
MongoDB Hadoop App(s) MapReduce
Monday, 8AM EST
• User Bob logs in and Profile retrieved from MongoDB • Bob clicks on Content X which is already tagged as “Pets” • Bob has clicked on Pets tagged content many times • Adjust Profile for tag “Pets” and save back to MongoDB
• Analysis = f(Profile)
• Analysis can be “anything”; it is simply a result. It could trigger an ad, a compliance alert, etc.
MongoDB Hadoop App(s) MapReduce
Monday, 8:02AM EST
• Bob clicks on Content Y which is already tagged as “Spices” • Spice is a new tag type for Bob • Adjust Profile for tag “Spices” and save back to MongoDB • Analysis = f(profile)
MongoDB Hadoop App(s) MapReduce
Profile in Detail

{
    user: "Bob",
    personalData: {
        zip: "10024",
        gender: "M"
    },
    tags: {
        PETS: { algo: "A4",
            baseline: [0,0,10,4,1322,44,23, ... ],
            hist: [
                { ts: datetime1, url: url1 },
                { ts: datetime2, url: url2 } // 100 more
            ]},
        SPICE: { hist: [
            { ts: datetime3, url: url3 }
        ]}
    }
}
Tag-based algorithm detail

getRecommendedContent(profile, ["PETS", other]) {
    if algo for a tag available {
        filter = algo(profile, tag);
    }
    fetch N recommendations (filter);
}

A4(profile, tag) {
    weight = get tag ("PETS") global weighting;
    adjustForPersonalBaseline(weight, "PETS" baseline);
    if "PETS" clicked more than 2 times in past 10 mins
        then weight += 10;
    if "PETS" clicked more than 10 times in past 2 days
        then weight += 3;
    return new filter({"PETS", weight}, globals);
}
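The A4 rule can also be rendered as runnable Java. The thresholds and +10/+3 bumps come from the pseudocode; the method names and the click-history-as-timestamps representation are illustrative assumptions.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;

// Runnable sketch of the A4 weighting rule: boost a tag's weight when
// its click history shows a short-term burst or sustained recent interest.
class A4 {
    static long clicksWithin(List<Instant> hist, Instant now, Duration window) {
        return hist.stream()
                   .filter(ts -> !ts.isBefore(now.minus(window)))
                   .count();
    }

    static int weight(int globalWeight, List<Instant> hist, Instant now) {
        int weight = globalWeight;
        if (clicksWithin(hist, now, Duration.ofMinutes(10)) > 2) {
            weight += 10;   // clicked more than 2 times in past 10 mins
        }
        if (clicksWithin(hist, now, Duration.ofDays(2)) > 10) {
            weight += 3;    // clicked more than 10 times in past 2 days
        }
        return weight;
    }
}
```

Because the algorithm name is stored on the Profile (algo: "A4"), the fetch function can dispatch to a different rule per tag, which is what makes the engine dynamic.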
Tuesday, 1AM EST
MongoDB Hadoop App(s) MapReduce
• Fetch all user Profiles from MongoDB; load into Hadoop • Or skip if using the MongoDB-Hadoop connector!
Tuesday, 1:30AM EST
• Grind through all content data and user profile data to produce: • Tags based on feature extraction (vs. creator-applied tags) • Trend baseline for Pets and PowerTools and Spice
• Data can be specific to individual or by group • Load new baselines back into MongoDB
MongoDB Hadoop App(s) MapReduce
New Profile in Detail

{
    user: "Bob",
    personalData: {
        zip: "10024",
        gender: "M"
    },
    tags: {
        PETS: { algo: "A4",
            baseline: [0,4,10,4,1322,44,23, ... ],
            hist: [
                { ts: datetime1, url: url1 },
                { ts: datetime2, url: url2 } // 100 more
            ]},
        SPICE: {
            baseline: [1],
            hist: [
                { ts: datetime3, url: url3 }
            ]}
    }
}
Tuesday, 1:35AM EST
• Perform maintenance on user Profiles • Click history trimming (variety of algorithms) • “Dead tag” removal • Update of auxiliary reference data
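One of the "variety of algorithms" for click history trimming can be sketched minimally: keep only the newest N entries of a tag's hist array. N and the keep-newest policy are assumptions here; real trimming might be time-based instead.

```java
import java.util.List;

// Sketch of click-history trimming during Profile maintenance:
// retain only the most recent n entries of a hist array.
class HistTrimmer {
    static <T> List<T> trimToLast(List<T> hist, int n) {
        int from = Math.max(0, hist.size() - n);
        return hist.subList(from, hist.size());
    }
}
```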
MongoDB Hadoop App(s) MapReduce
New Profile in Detail

{
    user: "Bob",
    personalData: {
        zip: "10022",
        gender: "M"
    },
    tags: {
        PETS: { algo: "A4",
            baseline: [ 1322,44,23, ... ],
            hist: [
                { ts: datetime1, url: url1 } // 50 more
            ]},
        SPICE: { algo: "Z1",
            baseline: [1],
            hist: [
                { ts: datetime3, url: url3 }
            ]}
    }
}
Feel free to run the baselining more frequently
… but avoid “Are We There Yet?”
MongoDB Hadoop App(s) MapReduce
Nearterm / Realtime Questions & Actions
With respect to the Customer: • What has Bob done over the past 24 hours? • Given an input, make a logic decision in 100ms or less
With respect to the Provider: • What are all current users doing or looking at? • Can we nearterm correlate single events to shifts in behavior?
Longterm/ Not Realtime Questions & Actions
With respect to the Customer: • Any way to explain historic performance / actions? • What are recommendations for the future?
With respect to the Provider: • Can we correlate multiple events from multiple sources
over a long period of time to identify trends? • What is my entire customer base doing over 2 years? • Show me a time vs. aggregate tag hit chart • Slice and dice and aggregate tags vs. XYZ • What tags are trending up or down?
Applications
Another Example: Realtime Risk
Trade Processing
Risk Calculation
(Spark)
Risk Service
Log trade activities
Query trades
Query Risk
Risk Params Admin
Analysis/Reporting (Impala)
OTHER HDFS DATA
OTHER HDFS DATA
Recording a trade
Applications
Trade Processing
1. Bank makes a trade 2. Trade sent to Trade Processing 3. Trade Processing writes trade to MongoDB 4. Realtime replicate trade to Hadoop/HDFS
Non-functional notes:
• High volume of data ingestion (10,000s or more events per second)
• Durable storage of trade data
• Store trade events across all asset classes
Querying deal / trade / event data
1. Query on deal attributes (id, counterparty, asset class, termination date, notional amount, book)
2. MongoDB performs index-optimized query and Trade Processing assembles Deal/Trade/Event data into response packet
3. Return response packet to caller
Non-functional notes:
• System can support very high volume (10,000s or more queries per second)
• Millisecond response times
Applications
Trade Processing
Updating intra-day risk data
1. Mirror of trade data already stored in HDFS; trade data partitioned into time windows
2. Signal/timer kicks off a “run”
3. Spark ingests the new partition of trade data as an RDD, then calculates and merges risk data based on the latest trades
4. Risk data written directly to MongoDB and indexed and available for online queries / aggregations / applications logic
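The merge in steps 3–4 can be sketched as a pure function: risk numbers computed for the newest trade window are folded into the running per-book totals before being written back to MongoDB. Treating risk as a single additive scalar per book is a simplifying assumption for illustration; the real Spark job operates on RDDs of richer risk records.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the incremental risk merge: fold the latest window's
// per-book risk deltas into the current totals.
class RiskMerger {
    static Map<String, Double> merge(Map<String, Double> current,
                                     Map<String, Double> windowRisk) {
        Map<String, Double> merged = new HashMap<>(current);
        windowRisk.forEach((book, delta) -> merged.merge(book, delta, Double::sum));
        return merged;
    }
}
```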
Applications
Risk Service
Risk Calculation
(Spark)
Querying detail & aggregated risk on demand
1. Applications can use full MongoDB query API to access risk data and trade data
2. Risk data can be indexed on multiple fields for fast access by multiple dimensions
3. Hadoop jobs periodically apply incremental updates to risk data with no down time
4. Interpolated / matrix risk can be computed on-the-fly
Non-functional notes:
• System can support very high volume (10,000s or more queries per second)
• Millisecond response times
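Point 4 above, "interpolated risk computed on-the-fly", can be shown in its simplest form: linear interpolation between two stored risk points. The linear model and the tenor axis are illustrative assumptions; real matrix risk would interpolate across several dimensions.

```java
// Sketch of on-the-fly risk interpolation: given risk values r0 and r1
// stored at tenors t0 and t1, estimate risk at an intermediate tenor t.
class RiskInterpolator {
    static double interpolate(double t0, double r0,
                              double t1, double r1, double t) {
        return r0 + (r1 - r0) * (t - t0) / (t1 - t0);
    }
}
```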
Applications
Risk Service
Trade Analytics & Reporting
1. Impala provides full SQL access to all content in Hadoop
2. Dashboards and Reporting frameworks deliver periodic information to consumers
3. Breadth of data discovery / ad-hoc analysis tools can be brought to bear on all data in Hadoop
Non-functional notes:
• Lower query frequency
• Full SQL query flexibility
• Most queries / analysis yield value by accessing large volumes of data (e.g. all events in the last 30 days – or 30 months)
Applications
Impala
Dashboards Reports
Ad-hoc Analysis
The Key To Success: It is One System
MongoDB
Hadoop
App(s)
MapReduce