webinar: data processing and aggregation options

62
Data Processing and Aggregation Consulting Engineer, MongoDB Achille Brighton

Upload: mongodb

Post on 10-May-2015

12.227 views

Category:

Technology


2 download

DESCRIPTION

MongoDB scales easily to store mass volumes of data. However, when it comes to making sense of it all what options do you have? In this talk, we'll take a look at 3 different ways of aggregating your data with MongoDB, and determine the reasons why you might choose one way over another. No matter what your big data needs are, you will find out how MongoDB the big data store is evolving to help make sense of your data.

TRANSCRIPT

Page 1: Webinar: Data Processing and Aggregation Options

Data Processing and Aggregation

Consulting Engineer, MongoDB

Achille Brighton

Page 2: Webinar: Data Processing and Aggregation Options

Big Data

Page 3: Webinar: Data Processing and Aggregation Options

Exponential Data Growth

0

200

400

600

800

1000

1200

2000 2002 2004 2006 2008

Billions of URLs indexed by Google

Page 4: Webinar: Data Processing and Aggregation Options

For over a decade

Big Data == Custom Software

Page 5: Webinar: Data Processing and Aggregation Options

In the past few years Open source software has emerged enabling the rest of us to handle Big Data

Page 6: Webinar: Data Processing and Aggregation Options

How MongoDB Meets Our Requirements

•  MongoDB is an operational database

•  MongoDB provides high performance for storage and retrieval at large scale

•  MongoDB has a robust query interface permitting intelligent operations

•  MongoDB is not a data processing engine, but provides processing functionality

Page 7: Webinar: Data Processing and Aggregation Options

MongoDB data processing options

Page 8: Webinar: Data Processing and Aggregation Options

Getting Example Data

Page 9: Webinar: Data Processing and Aggregation Options

The “hello world” of MapReduce is counting words in a paragraph of text.

Let’s try something a little more interesting…

Page 10: Webinar: Data Processing and Aggregation Options

What is the most popular pub name?

Page 11: Webinar: Data Processing and Aggregation Options

#!/usr/bin/env python # Data Source # http://www.overpass-api.de/api/xapi?*[amenity=pub][bbox=-10.5,49.78,1.78,59] import re import sys from imposm.parser import OSMParser import pymongo class Handler(object): def nodes(self, nodes): if not nodes: return docs = [] for node in nodes: osm_id, doc, (lon, lat) = node if "name" not in doc: node_points[osm_id] = (lon, lat) continue doc["name"] = doc["name"].title().lstrip("The ").replace("And", "&") doc["_id"] = osm_id doc["location"] = {"type": "Point", "coordinates": [lon, lat]} docs.append(doc) collection.insert(docs)

Open Street Map Data

Page 12: Webinar: Data Processing and Aggregation Options

{ "_id" : 451152, "amenity" : "pub", "name" : "The Dignity", "addr:housenumber" : "363", "addr:street" : "Regents Park Road", "addr:city" : "London", "addr:postcode" : "N3 1DH", "toilets" : "yes", "toilets:access" : "customers", "location" : { "type" : "Point", "coordinates" : [-0.1945732, 51.6008172] } }

Example Pub Data

Page 13: Webinar: Data Processing and Aggregation Options

MongoDB MapReduce • 

MongoDB

map

reduce

finalize

Page 14: Webinar: Data Processing and Aggregation Options

MongoDB MapReduce • 

Page 15: Webinar: Data Processing and Aggregation Options

Map Function

> var map = function() {

emit(this.name, 1);

MongoDB

map

reduce finalize

Page 16: Webinar: Data Processing and Aggregation Options

Reduce Function

> var reduce = function (key, values) {

var sum = 0;

values.forEach( function (val) {sum += val;} );

return sum;

}

MongoDB

map

reduce finalize

Page 17: Webinar: Data Processing and Aggregation Options

Results

> db.pub_names.find().sort({value: -1}).limit(10) { "_id" : "The Red Lion", "value" : 407 } { "_id" : "The Royal Oak", "value" : 328 } { "_id" : "The Crown", "value" : 242 } { "_id" : "The White Hart", "value" : 214 } { "_id" : "The White Horse", "value" : 200 } { "_id" : "The New Inn", "value" : 187 } { "_id" : "The Plough", "value" : 185 } { "_id" : "The Rose & Crown", "value" : 164 } { "_id" : "The Wheatsheaf", "value" : 147 } { "_id" : "The Swan", "value" : 140 }

Page 18: Webinar: Data Processing and Aggregation Options
Page 19: Webinar: Data Processing and Aggregation Options

> db.pubs.mapReduce(map, reduce, { out: "pub_names", query: { location: { $within: { $centerSphere: [[-0.12, 51.516], 2 / 3959] } }} }) { "result" : "pub_names", "timeMillis" : 116, "counts" : { "input" : 643, "emit" : 643, "reduce" : 54, "output" : 537 }, "ok" : 1, }

Pub Names in the Center of London

Page 20: Webinar: Data Processing and Aggregation Options

> db.pub_names.find().sort({value: -1}).limit(10) { "_id" : "All Bar One", "value" : 11 } { "_id" : "The Slug & Lettuce", "value" : 7 } { "_id" : "The Coach & Horses", "value" : 6 } { "_id" : "The Green Man", "value" : 5 } { "_id" : "The Kings Arms", "value" : 5 } { "_id" : "The Red Lion", "value" : 5 } { "_id" : "Corney & Barrow", "value" : 4 } { "_id" : "O'Neills", "value" : 4 } { "_id" : "Pitcher & Piano", "value" : 4 } { "_id" : "The Crown", "value" : 4 }

Results

Page 21: Webinar: Data Processing and Aggregation Options

Double Checking

Page 22: Webinar: Data Processing and Aggregation Options

MongoDB MapReduce

•  Real-time

•  Output directly to document or collection

•  Runs inside MongoDB on local data

− Adds load to your DB

− In Javascript – debugging can be a challenge

− Translating in and out of C++

Page 23: Webinar: Data Processing and Aggregation Options

Aggregation Framework

Page 24: Webinar: Data Processing and Aggregation Options

Aggregation Framework

•  Declared in JSON, executes in C++

Data Processing in MongoDB

Page 25: Webinar: Data Processing and Aggregation Options

Aggregation Framework

•  Declared in JSON, executes in C++ •  Flexible, functional, and simple

Data Processing in MongoDB

Page 26: Webinar: Data Processing and Aggregation Options

Aggregation Framework

•  Declared in JSON, executes in C++ •  Flexible, functional, and simple •  Plays nice with sharding

Data Processing in MongoDB

Page 27: Webinar: Data Processing and Aggregation Options

Pipeline

ps ax | grep mongod | head 1

Piping command line operations

Data Processing in MongoDB

Page 28: Webinar: Data Processing and Aggregation Options

Pipeline

$match $group | $sort |

Piping aggregation operations

Stream of documents Result document

Data Processing in MongoDB

Page 29: Webinar: Data Processing and Aggregation Options

Pipeline Operators

• $match • $project • $group • $unwind

Data Processing in MongoDB

• $sort • $limit • $skip • $geoNear

Page 30: Webinar: Data Processing and Aggregation Options

$match

•  Filter documents

•  Uses existing query syntax

•  If using $geoNear it has to be first in pipeline

•  $where is not supported

Page 31: Webinar: Data Processing and Aggregation Options

Matching Field Values { "_id" : 271421, "amenity" : "pub", "name" : "Sir Walter Tyrrell", "location" : { "type" : "Point", "coordinates" : [ -1.6192422, 50.9131996 ] } } { "_id" : 271466, "amenity" : "pub", "name" : "The Red Lion", "location" : { "type" : "Point", "coordinates" : [ -1.5494749, 50.7837119 ] }

Matching Field Values

{ "$match": { "name": "The Red Lion"

}}

{ "_id" : 271466, "amenity" : "pub", "name" : "The Red Lion", "location" : { "type" : "Point", "coordinates" : [ -1.5494749, 50.7837119 ]} }

Page 32: Webinar: Data Processing and Aggregation Options

$project

•  Reshape documents

•  Include, exclude or rename fields

•  Inject computed fields

•  Create sub-document fields

Page 33: Webinar: Data Processing and Aggregation Options

Including and Excluding Fields {

"_id" : 271466,

"amenity" : "pub",

"name" : "The Red Lion",

"location" : {

"type" : "Point",

"coordinates" : [

-1.5494749,

50.7837119

]

}

}

{ “$project”: { “_id”: 0, “amenity”: 1, “name”: 1, }}

{ “amenity” : “pub”, “name” : “The Red Lion” }

Page 34: Webinar: Data Processing and Aggregation Options

Reformatting Documents {

"_id" : 271466,

"amenity" : "pub",

"name" : "The Red Lion",

"location" : {

"type" : "Point",

"coordinates" : [

-1.5494749,

50.7837119

]

}

}

{ “$project”: { “_id”: 0, “name”: 1, “meta”: {

“type”: “$amenity”} }}

{ “name” : “The Red Lion” “meta” : {

“type” : “pub” }}

Page 35: Webinar: Data Processing and Aggregation Options

$group

•  Group documents by an ID

•  Field reference, object, constant

•  Other output fields are computed $max, $min, $avg, $sum $addToSet, $push $first, $last

•  Processes all data in memory

Page 36: Webinar: Data Processing and Aggregation Options

Summating fields

{ title: "The Great Gatsby", pages: 218, language: "English" } { title: "War and Peace", pages: 1440, language: "Russian” } { title: "Atlas Shrugged", pages: 1088, language: "English" }

{ $group: { _id: "$language", numTitles: { $sum: 1 }, sumPages: { $sum: "$pages" } }} { _id: "Russian", numTitles: 1, sumPages: 1440 } { _id: "English", numTitles: 2, sumPages: 1306 }

Page 37: Webinar: Data Processing and Aggregation Options

Add To Set { title: "The Great Gatsby", pages: 218, language: "English" } { title: "War and Peace", pages: 1440, language: "Russian" } { title: "Atlas Shrugged", pages: 1088, language: "English" }

{ $group: { _id: "$language", titles: { $addToSet: "$title" } }} { _id: "Russian", titles: [ "War and Peace" ] } { _id: "English", titles: [ "Atlas Shrugged", "The Great Gatsby" ] }

Page 38: Webinar: Data Processing and Aggregation Options

Expanding Arrays

{ title: "The Great Gatsby", ISBN: "9781857150193", subjects: [ "Long Island", "New York", "1920s" ] }

{ $unwind: "$subjects" } { title: "The Great Gatsby", ISBN: "9781857150193", subjects: "Long Island" } { title: "The Great Gatsby", ISBN: "9781857150193", subjects: "New York" } { title: "The Great Gatsby", ISBN: "9781857150193", subjects: "1920s" }

Page 39: Webinar: Data Processing and Aggregation Options

Back to the pub!

•  http://www.offwestend.com/index.php/theatres/pastshows/71

Page 40: Webinar: Data Processing and Aggregation Options

Popular Pub Names >var popular_pub_names = [

{ $match : location: { $within: { $centerSphere: [[-0.12, 51.516], 2 / 3959]}}}

},

{ $group : { _id: “$name” value: {$sum: 1} }

},

{ $sort : {value: -1} },

{ $limit : 10 }

Page 41: Webinar: Data Processing and Aggregation Options

> db.pubs.aggregate(popular_pub_names)

{

"result" : [ { "_id" : "All Bar One", "value" : 11 } { "_id" : "The Slug & Lettuce", "value" : 7 } { "_id" : "The Coach & Horses", "value" : 6 } { "_id" : "The Green Man", "value" : 5 } { "_id" : "The Kings Arms", "value" : 5 } { "_id" : "The Red Lion", "value" : 5 } { "_id" : "Corney & Barrow", "value" : 4 } { "_id" : "O'Neills", "value" : 4 } { "_id" : "Pitcher & Piano", "value" : 4 } { "_id" : "The Crown", "value" : 4 }

],

"ok" : 1

}

Results

Page 42: Webinar: Data Processing and Aggregation Options

Aggregation Framework Benefits

•  Real-time

•  Simple yet powerful interface

•  Declared in JSON, executes in C++

•  Runs inside MongoDB on local data

− Adds load to your DB

− Limited Operators

− Data output is limited

Page 43: Webinar: Data Processing and Aggregation Options

Analyzing MongoDB Data in External Systems

Page 44: Webinar: Data Processing and Aggregation Options

MongoDB with Hadoop • 

MongoDB

Page 45: Webinar: Data Processing and Aggregation Options

Hadoop MongoDB Connector

•  MongoDB or BSON files as input/output

•  Source data can be filtered with queries

•  Hadoop Streaming support –  For jobs written in Python, Ruby, Node.js

•  Supports Hadoop tools such as Pig and Hive

Page 46: Webinar: Data Processing and Aggregation Options

#!/usr/bin/env python from pymongo_hadoop import BSONMapper def mapper(documents): bounds = get_bounds() # ~2 mile polygon for doc in documents: geo = get_geo(doc["location"]) # Convert the geo type if not geo: continue if bounds.intersects(geo): yield {'_id': doc['name'], 'count': 1} BSONMapper(mapper) print >> sys.stderr, "Done Mapping."

Map Pub Names in Python

Page 47: Webinar: Data Processing and Aggregation Options

#!/usr/bin/env python

from pymongo_hadoop import BSONReducer

def reducer(key, values):

_count = 0

for v in values: _count += v['count']

return {'_id': key, 'value': _count}

BSONReducer(reducer)

Reduce Pub Names in Python

Page 48: Webinar: Data Processing and Aggregation Options

hadoop jar target/mongo-hadoop-streaming-assembly-1.1.0-rc0.jar \

-mapper examples/pub/map.py \

-reducer examples/pub/reduce.py \

-mongo mongodb://127.0.0.1/demo.pubs \

-outputURI mongodb://127.0.0.1/demo.pub_names

Execute MapReduce

Page 49: Webinar: Data Processing and Aggregation Options

> db.pub_names.find().sort({value: -1}).limit(10) { "_id" : "All Bar One", "value" : 11 } { "_id" : "The Slug & Lettuce", "value" : 7 } { "_id" : "The Coach & Horses", "value" : 6 } { "_id" : "The Kings Arms", "value" : 5 } { "_id" : "Corney & Barrow", "value" : 4 } { "_id" : "O'Neills", "value" : 4 } { "_id" : "Pitcher & Piano", "value" : 4 } { "_id" : "The Crown", "value" : 4 } { "_id" : "The George", "value" : 4 } { "_id" : "The Green Man", "value" : 4 }

Popular Pub Names Nearby

Page 50: Webinar: Data Processing and Aggregation Options

MongoDB with Hadoop • 

MongoDB warehouse

Page 51: Webinar: Data Processing and Aggregation Options

MongoDB with Hadoop • 

MongoDB ETL

Page 52: Webinar: Data Processing and Aggregation Options

Limitations

•  Batch processing

•  Requires synchronization between data store and processor

•  Adds complexity to infrastructure

Page 53: Webinar: Data Processing and Aggregation Options

Advantages

•  Processing decoupled from data store

•  Parallel processing

•  Leverage existing infrastructure

•  Java has rich set of data processing libraries –  And other languages if using Hadoop Streaming

Page 54: Webinar: Data Processing and Aggregation Options

Storm

Page 55: Webinar: Data Processing and Aggregation Options

Storm

Page 56: Webinar: Data Processing and Aggregation Options

Storm MongoDB connector

•  Spout for MongoDB oplog or capped collections –  Filtering capabilities –  Threaded and non-blocking

•  Output to new or existing documents –  Insert/update bolt

Page 57: Webinar: Data Processing and Aggregation Options

Aggregating MongoDB’s Data Processing Options

Page 58: Webinar: Data Processing and Aggregation Options

Data Processing with MongoDB

•  Process in MongoDB using Map/Reduce

•  Process in MongoDB using Aggregation Framework

•  Also: Storing pre-aggregated data –  An exercise in schema design

•  Process outside MongoDB using Hadoop and other external tools

Page 59: Webinar: Data Processing and Aggregation Options

External Tools

Page 60: Webinar: Data Processing and Aggregation Options

Questions?

Page 61: Webinar: Data Processing and Aggregation Options

References

•  Map Reduce docs –  http://docs.mongodb.org/manual/core/map-reduce/

•  Aggregation Framework –  Examples

http://docs.mongodb.org/manual/applications/aggregation –  SQL Comparison

http://docs.mongodb.org/manual/reference/sql-aggregation-comparison/

•  Multi Threaded Map Reduce: http://edgystuff.tumblr.com/post/54709368492/how-to-speed-up-mongodb-map-reduce-by-20x

Page 62: Webinar: Data Processing and Aggregation Options

Thanks!

Consulting Engineer, MongoDB

Achille Brighton