analytics with mongodb aggregation framework and hadoop connector

48
Analytics with MongoDB alone and with Hadoop Connector Solution Architect, MongoDB Henrik Ingo @h_ingo

Upload: henrik-ingo

Post on 26-Jan-2015

129 views

Category:

Technology


0 download

DESCRIPTION

What are the Big Data tools in and around MongoDB.

TRANSCRIPT

Page 1: Analytics with MongoDB Aggregation Framework and Hadoop Connector

Analytics with MongoDBalone and with Hadoop Connector

Solution Architect, MongoDB

Henrik Ingo

@h_ingo

Page 2: Analytics with MongoDB Aggregation Framework and Hadoop Connector

The Science in Data Science

• Collect data

• Explore the data, use visualization

• Use math

• Make predictions

• Test predictions– Collect even more data

• Repeat...

Page 3: Analytics with MongoDB Aggregation Framework and Hadoop Connector

Why MongoDB?

When MongoDB?

Page 4: Analytics with MongoDB Aggregation Framework and Hadoop Connector

5 NoSQL categories

Key Value Wide Column Document

Graph Map Reduce

Redis Cassandra

Neo4j Hadoop

Page 5: Analytics with MongoDB Aggregation Framework and Hadoop Connector

MongoDB and Enterprise IT Stack

EDWHadoop

Ma

na

ge

me

nt

& M

on

ito

rin

gS

ecu

rity &

Au

ditin

g

RDBMS

CRM, ERP, Collaboration, Mobile, BI

OS & Virtualization, Compute, Storage, Network

RDBMS

Applications

Infrastructure

Data Management

Online Data Offline Data

Page 6: Analytics with MongoDB Aggregation Framework and Hadoop Connector

How do we do it with MongoDB?

Page 7: Analytics with MongoDB Aggregation Framework and Hadoop Connector

Collect data

Page 8: Analytics with MongoDB Aggregation Framework and Hadoop Connector

Exponential Data Growth

http://www.worldwidewebsize.com/

Page 9: Analytics with MongoDB Aggregation Framework and Hadoop Connector

Volume Velocity Variety

Page 10: Analytics with MongoDB Aggregation Framework and Hadoop Connector

Volume Velocity Variety

Data Sources

Asynchronous writes

Upserts avoid unnecessary reads

Writes buffered in RAM and flushed to disk in

bulk

Data Sources

Data Sources

Data Sources

Spread writes over multiple shards

Page 11: Analytics with MongoDB Aggregation Framework and Hadoop Connector

RDBMSMongoDB

{

_id : ObjectId("4c4ba5e5e8aabf3"),

employee_name: "Dunham, Justin",

department : "Marketing",

title : "Product Manager, Web",

report_up: "Neray, Graham",

pay_band: “C",

benefits : [

{ type : "Health",

plan : "PPO Plus" },

{ type : "Dental",

plan : "Standard" }

]

}

Volume Velocity Variety

Page 12: Analytics with MongoDB Aggregation Framework and Hadoop Connector

Visualization

Page 13: Analytics with MongoDB Aggregation Framework and Hadoop Connector

Visualization

d3js.org, …

Page 14: Analytics with MongoDB Aggregation Framework and Hadoop Connector

Use math

Page 15: Analytics with MongoDB Aggregation Framework and Hadoop Connector

Data Processing in MongoDB

• Pre-aggregated documents

• Aggregation Framework

• Map/Reduce

• Hadoop Connector

Page 16: Analytics with MongoDB Aggregation Framework and Hadoop Connector

Pre-aggregated documents

Design Pattern

Page 17: Analytics with MongoDB Aggregation Framework and Hadoop Connector

Pre-Aggregation

Data for

URL /

Date

{_id: "20101010/site-1/apache_pb.gif",metadata: {

date: ISODate("2000-10-10T00:00:00Z"),site: "site-1",page: "/apache_pb.gif" },

daily: 5468426,hourly: {

"0": 227850,"1": 210231,..."23": 20457 },

minute: {"0": 3612,"1": 3241,..."1439": 2819 }

}

Page 18: Analytics with MongoDB Aggregation Framework and Hadoop Connector

Pre-Aggregation

Data for

URL /

Date

query = { '_id': "20101010/site-1/apache_pb.gif" }

update = { '$inc': {'hourly.12' : 1,'minute.739': 1 } }

db.stats.daily.update(query, update, upsert=True)

Page 19: Analytics with MongoDB Aggregation Framework and Hadoop Connector

Aggregation framework

Page 20: Analytics with MongoDB Aggregation Framework and Hadoop Connector

Dynamic Queries

Find all logs for a

URL

db.logs.find( { ‘path’ : ‘/index.html’ } )

Find all logs for a

time range

db.logs.find( {‘time’ : {

‘$gte’: new Date(2013, 0),‘$lt’: new Date(2013, s1) }

} )

Find all logs for a

host over a range of

dates

db.logs.find( { ‘host’ : ‘127.0.0.1’,‘time’ : {

‘$gte’: new Date(2013, 0),‘$lt’: new Date(2013, 1) }

} )

Page 21: Analytics with MongoDB Aggregation Framework and Hadoop Connector

Aggregation Framework

Requests

per day by

URL

db.logs.aggregate( [{ '$match': {

'time': {'$gte': new Date(2013, 0),'$lt': new Date(2013, 1) } } },

{ '$project': {'path': 1,'date': {

'y': { '$year': '$time' },'m': { '$month': '$time' },'d': { '$dayOfMonth': '$time' } } } },

{ '$group': {'_id': {'p': '$path','y': '$date.y','m': '$date.m','d': '$date.d' },

'hits': { '$sum': 1 } } },])

Page 22: Analytics with MongoDB Aggregation Framework and Hadoop Connector

Aggregation Framework

{‘ok’: 1, ‘result’: [ { '_id': {'p':’/index.html’,'y': 2013,'m': 1,'d': 1 }, 'hits’: 124 },{ '_id': {'p':’/index.html’,'y': 2013,'m': 1,'d': 2 }, 'hits’: 245 },{ '_id': {'p':’/index.html’,'y': 2013,'m': 1,'d': 3 }, 'hits’: 322 },{ '_id': {'p':’/index.html’,'y': 2013,'m': 1,'d': 4 }, 'hits’: 175 },{ '_id': {'p':’/index.html’,'y': 2013,'m': 1,'d': 5 }, 'hits’: 94 }

]}

Page 23: Analytics with MongoDB Aggregation Framework and Hadoop Connector

Aggregation Framework Benefits

• Real-time

• Simple yet powerful interface

• Scale-out

• Declared in JSON, executes in C++

• Runs inside MongoDB on local data

Page 24: Analytics with MongoDB Aggregation Framework and Hadoop Connector

Map Reduce in MongoDB

Page 25: Analytics with MongoDB Aggregation Framework and Hadoop Connector

MongoDB Map/Reduce

Page 26: Analytics with MongoDB Aggregation Framework and Hadoop Connector

Map Reduce – Map PhaseGenerate hourly

rollups from log

data

var map = function() {

var key = {

p: this.path,

d: new Date(

this.ts.getFullYear(),

this.ts.getMonth(),

this.ts.getDate(),

this.ts.getHours(),

0, 0, 0) };

emit( key, { hits: 1 } );

}

Page 27: Analytics with MongoDB Aggregation Framework and Hadoop Connector

Map Reduce – Reduce PhaseGenerate hourly

rollups from log

data

var reduce = function(key, values) {

var r = { hits: 0 };

values.forEach(function(v) {

r.hits += v.hits;

});

return r;

}

)

Page 28: Analytics with MongoDB Aggregation Framework and Hadoop Connector

Map Reduce - Execution

query = { 'ts': {'$gte': new Date(2013, 0, 1),'$lte': new Date(2013, 0, 31) } }

db.logs.mapReduce( map, reduce, { ‘query’: query,‘out’: {

‘reduce’ : ‘stats.monthly’ }} )

Page 29: Analytics with MongoDB Aggregation Framework and Hadoop Connector

MongoDB Map/Reduce Benefits

• Runs inside MongoDB

• Sharding supported

• JavaScript– Pro: functionality, expressiveness

– Con: overhead

• Input can be a collection or query!

• Output directly to document or collection

• Easy, when you don’t want overhead of Hadoop

Page 30: Analytics with MongoDB Aggregation Framework and Hadoop Connector

Hadoop Connector

Page 31: Analytics with MongoDB Aggregation Framework and Hadoop Connector

MongoDB with Hadoop

Page 32: Analytics with MongoDB Aggregation Framework and Hadoop Connector

MongoDB with Hadoop

Page 33: Analytics with MongoDB Aggregation Framework and Hadoop Connector

MongoDB

MongoDB with Hadoop

Page 34: Analytics with MongoDB Aggregation Framework and Hadoop Connector

How it works

• Adapter examines MongoDB input collection and calculates a set of splits from data

• Each split is assigned to a Hadoop node

• In parallel hadoop pulls data from splits on MongoDB (or BSON) and starts processing locally

• Hadoop merges results and streams output back to MongoDB (or BSON) output collection

Page 35: Analytics with MongoDB Aggregation Framework and Hadoop Connector

mongo.job.input.format=com.mongodb.hadoop.MongoInputFormat

mongo.input.uri=mongodb://my-db:27017/enron.messages

mongo.job.input.format=com.mongodb.hadoop.BSONFileInputFormat

mapred.input.dir= file:///tmp/messages.bson

mapred.input.dir= hdfs:///tmp/messages.bson

mapred.input.dir= s3:///tmp/messages.bson

(or BSON)Read From MongoDB

Page 36: Analytics with MongoDB Aggregation Framework and Hadoop Connector

mongo.job.output.format=com.mongodb.hadoop.MongoOutputFormat

mongo.output.uri=mongodb://my-db:27017/enron.results_out

mongo.job.output.format=com.mongodb.hadoop.BSONFileOutputFormat

mapred.output.dir= file:///tmp/results.bson

mapred.output.dir= hdfs:///tmp/results.bson

mapred.output.dir= s3:///tmp/results.bson

(or BSON)Write To MongoDB

Page 37: Analytics with MongoDB Aggregation Framework and Hadoop Connector

{

"_id" : ObjectId("4f2ad4c4d1e2d3f15a000000"),

"body" : "Here is our forecast\n\n ",

"filename" : "1.",

"headers" : {

"From" : "[email protected]",

"Subject" : "Forecast Info",

"X-bcc" : "",

"To" : "[email protected]",

"X-Origin" : "Allen-P",

"X-From" : "Phillip K Allen",

"Date" : "Mon, 14 May 2001 16:39:00 -0700 (PDT)",

"X-To" : "Tim Belden ",

"Message-ID" : "<18782981.1075855378110.JavaMail.evans@thyme>",

"Content-Type" : "text/plain; charset=us-ascii",

"Mime-Version" : "1.0"

}

}

Document Example

Page 38: Analytics with MongoDB Aggregation Framework and Hadoop Connector

Graph Sketch

Page 39: Analytics with MongoDB Aggregation Framework and Hadoop Connector

{"_id": {"t":"[email protected]", "f":"[email protected]"}, "count" : 14}

{"_id": {"t":"[email protected]", "f":"[email protected]"}, "count" : 9}

{"_id": {"t":"[email protected]", "f":"[email protected]"}, "count" : 99}

{"_id": {"t":"[email protected]", "f":"[email protected]"}, "count" : 48}

{"_id": {"t":"[email protected]", "f":"[email protected]"}, "count" : 20}

Receiver Sender Pairs

Page 40: Analytics with MongoDB Aggregation Framework and Hadoop Connector

@Override

public void map(NullWritable key, BSONObject val, final Context context){

BSONObject headers = (BSONObject)val.get("headers"); if(headers.containsKey("From") && headers.containsKey("To")){

String from = (String)headers.get("From"); String to = (String) headers.get("To"); String[] recips = to.split(",");

for(int i=0;i<recips.length;i++){

String recip = recips[i].trim();

context.write(new MailPair(from, recip), new IntWritable(1));

}

}

}

Map Phase – each document get’s through mapper function

Page 41: Analytics with MongoDB Aggregation Framework and Hadoop Connector

public void reduce(final MailPair pKey, final Iterable<IntWritable> pValues, final Context pContext ){

int sum = 0;

for ( final IntWritable value : pValues ){

sum += value.get();

}

BSONObject outDoc = new BasicDBObjectBuilder().start()

.add( "f" , pKey.from)

.add( "t" , pKey.to )

.get();BSONWritable pkeyOut = new BSONWritable(outDoc);

pContext.write( pkeyOut, new IntWritable(sum) ); }

Reduce Phase – output Maps are grouped by key and passed to Reducer

Page 42: Analytics with MongoDB Aggregation Framework and Hadoop Connector

mongos> db.streaming.output.find({"_id.t": /^kenneth.lay/})

{ "_id" : { "t" : "[email protected]",

"f" : "[email protected]" }, "count" : 1 }

{ "_id" : { "t" : "[email protected]",

"f" : "[email protected]" }, "count" : 1 }

{ "_id" : { "t" : "[email protected]",

"f" : "[email protected]" }, "count" : 2 }

{ "_id" : { "t" : "[email protected]",

"f" : "[email protected]" }, "count" : 2 }

{ "_id" : { "t" : "[email protected]",

"f" : "[email protected]" }, "count" : 4 }

{ "_id" : { "t" : "[email protected]",

"f" : "[email protected]" }, "count" : 1 }

{ "_id" : { "t" : "[email protected]",

"f" : "[email protected]" }, "count" : 1 }

Query Data

Page 43: Analytics with MongoDB Aggregation Framework and Hadoop Connector

Hadoop Connector Benefits

• Full multi-core parallelism to process MongoDB data

• mongo.input.query

• Full integration w/ Hadoop and JVM ecosystem

• Mahout, et.al.

• Can be used on Amazon Elastic MapReduce

• Read and write backup files to local, HDFS and S3

• Vanilla Java MapReduce, Hadoop Streaming, Pig, Hive

Page 44: Analytics with MongoDB Aggregation Framework and Hadoop Connector

Make predictions & test

Page 45: Analytics with MongoDB Aggregation Framework and Hadoop Connector

A/B testing

• Hey, it looks like teenage girls clicked a lot on that ad with a pink background...

• Hypothesis: Given otherwise the same ad, teenage girls are more likely to click on ads with pink

backgrounds than white

• Test 50-50 pink vs white ads

• Collect click stream stats in MongoDB or Hadoop

• Analyze results

Page 46: Analytics with MongoDB Aggregation Framework and Hadoop Connector

Recommendations – social filtering

• ”Customers who bought this book also bought”

• Computed offline / nightly

• As easy as it sounds! google it: Amazon item-to-item algorithm

Page 47: Analytics with MongoDB Aggregation Framework and Hadoop Connector

Personalization

• ”Even if you are a teenage girl, you seem to be 60% more likely to click on blue ads than pink.”

• User specific recommendations a hybrid of offline & online recommendations

• User profile in MongoDB

• May even be updated real time

Page 48: Analytics with MongoDB Aggregation Framework and Hadoop Connector

Solution Architect, MongoDB

Henrik Ingo

@h_ingo

Questions?