MongoDB & Hadoop

Posted on 30-Jun-2015

TRANSCRIPT

Paris

MongoDB & Hadoop

Tugdual Grall, Technical Evangelist
tug@mongodb.com / @tgrall

Agenda

Evolving Data Landscape

MongoDB & Hadoop Use Cases

MongoDB Connector Features

Demo

Evolving Data Landscape

• Terabyte and petabyte datasets
• Data warehousing
• Advanced analytics

Hadoop

“The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.”

http://hadoop.apache.org


Enterprise IT Stack


Operational vs. Analytical: Enrichment

Applications, Interactions / Warehouse, Analytics

Operational: MongoDB

First-Level Analytics, Internet of Things, Social, Mobile Apps, Product/Asset Catalog, Security & Fraud, Single View, Customer Data Management, Churn Analysis, Risk Modeling, Sentiment Analysis, Trade Surveillance, Recommender, Warehouse & ETL, Ad Targeting, Predictive Analytics

Analytical: Hadoop

(Same use-case diagram as above.)

Operational & Analytical: Lifecycle

(Same use-case diagram as above.)

MongoDB & Hadoop Use Cases

Commerce

Applications powered by MongoDB:
• Products & inventory
• Recommended products
• Customer profile
• Session management

Analysis powered by Hadoop:
• Elastic pricing
• Recommendation models
• Predictive analytics
• Clickstream history

Connected via the MongoDB Connector for Hadoop.

Insurance

Applications powered by MongoDB:
• Customer profiles
• Insurance policies
• Session data
• Call center data

Analysis powered by Hadoop:
• Customer action analysis
• Churn analysis
• Churn prediction
• Policy rates

Connected via the MongoDB Connector for Hadoop.

Fraud Detection

(Architecture diagram: payment systems and 3rd-party data sources feed a nightly analysis in Hadoop through the MongoDB Connector for Hadoop; results are written to a MongoDB results cache, which the fraud detection application accesses as query only.)

MongoDB Connector for Hadoop


Connector Overview

DATA

• Read/write MongoDB
• Read/write BSON

TOOLS

• MapReduce
• Pig
• Hive
• Spark

PLATFORMS

• Apache Hadoop
• Cloudera CDH
• Hortonworks HDP
• MapR
• Amazon EMR

‹#›

Connector Features and Functionality

• Computes splits to read data: single node, replica sets, sharded clusters

• Mappings for Pig and Hive: MongoDB as a standard data source/destination

• Support for:
  - Filtering data with MongoDB queries
  - Authentication
  - Reading from replica set tags
  - Appending to existing collections
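The split computation above is what lets many mappers read one collection in parallel: the connector divides the data into non-overlapping ranges, one per input split. A minimal sketch of the idea in Python (hypothetical helper; the real connector splits along shard chunks or index bounds rather than a plain integer range):

```python
def compute_splits(min_id, max_id, n_splits):
    """Divide an integer _id range into n_splits contiguous,
    non-overlapping (lower, upper) bounds, one range per mapper."""
    step = (max_id - min_id) // n_splits
    bounds = [min_id + i * step for i in range(n_splits)] + [max_id]
    return list(zip(bounds[:-1], bounds[1:]))

# Each mapper would then issue a range query such as
# {"_id": {"$gte": lower, "$lt": upper}} against its own split.
print(compute_splits(0, 100, 4))  # → [(0, 25), (25, 50), (50, 75), (75, 100)]
```

Splitting on a range rather than round-robin keeps each mapper's reads sequential on the index.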


MapReduce Configuration

• MongoDB input/output:
  mongo.job.input.format = com.mongodb.hadoop.MongoInputFormat
  mongo.input.uri = mongodb://mydb:27017/db1.collection1
  mongo.job.output.format = com.mongodb.hadoop.MongoOutputFormat
  mongo.output.uri = mongodb://mydb:27017/db1.collection2

• BSON input/output:
  mongo.job.input.format = com.mongodb.hadoop.BSONFileInputFormat
  mapred.input.dir = hdfs:///tmp/database.bson
  mongo.job.output.format = com.mongodb.hadoop.BSONFileOutputFormat
  mapred.output.dir = hdfs:///tmp/output.bson


Pig Mappings

• Input: BSONLoader and MongoLoader
  data = LOAD 'mongodb://mydb:27017/db.collection'
         USING com.mongodb.hadoop.pig.MongoLoader;

• Output: BSONStorage and MongoInsertStorage
  STORE records INTO 'hdfs:///output.bson'
        USING com.mongodb.hadoop.pig.BSONStorage;


Hive Support

• Access collections as Hive tables
• Use with MongoStorageHandler or BSONStorageHandler

CREATE TABLE mongo_users (id int, name string, age int)
STORED BY "com.mongodb.hadoop.hive.MongoStorageHandler"
WITH SERDEPROPERTIES("mongo.columns.mapping" = "_id,name,age")
TBLPROPERTIES("mongo.uri" = "mongodb://host:27017/test.users");


Spark

• Use with MapReduce input/output formats

• Create Configuration objects with input/output formats and data URI

• Load/save data using SparkContext Hadoop file API
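Concretely, those Configuration objects carry the same key/value pairs shown on the MapReduce slide. A sketch in Python, with the PySpark calls left as comments since they require a live SparkContext and the connector jars on the classpath (URIs follow the earlier slides; treat the exact wiring as an illustration, not the workshop's exact code):

```python
# Input and output configuration for the MongoDB Hadoop formats.
input_config = {"mongo.input.uri": "mongodb://127.0.0.1:27017/movielens.ratings"}
output_config = {"mongo.output.uri": "mongodb://127.0.0.1:27017/movielens.predictions"}

# With a live SparkContext `sc`, loading and saving would look like:
# rdd = sc.newAPIHadoopRDD(
#     inputFormatClass="com.mongodb.hadoop.MongoInputFormat",
#     keyClass="org.apache.hadoop.io.Text",
#     valueClass="org.apache.hadoop.io.MapWritable",
#     conf=input_config)
# ...transform rdd...
# rdd.saveAsNewAPIHadoopFile(
#     path="file:///unused-by-mongo-output",
#     outputFormatClass="com.mongodb.hadoop.MongoOutputFormat",
#     keyClass="org.apache.hadoop.io.Text",
#     valueClass="org.apache.hadoop.io.MapWritable",
#     conf=output_config)
```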


Data Movement

Dynamic queries to MongoDB vs. BSON snapshots in HDFS:

• Dynamic queries always see the most recent data, but put read load on the operational database.
• Snapshots move the processing load to Hadoop; taking the snapshot adds only predictable load to MongoDB.

Demo: Recommendation Platform


MovieWeb


MovieWeb Web Application

• Browse:
  - Top movies by ratings count
  - Top genres by movie count

• Log in to:
  - See my ratings
  - Rate movies

• Recommendations:
  - Movies you may like


MovieWeb Components

• MovieLens dataset: 10M ratings, 10K movies, 70K users
  http://grouplens.org/datasets/movielens/

• Python web app to browse movies and recommendations (Flask, PyMongo)

• Spark app computes recommendations (MLlib collaborative filtering)

• Predicted ratings are exposed in the web app via a new predictions collection
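MLlib's collaborative filter is based on matrix factorization: each user and each movie gets a low-dimensional latent vector, and a predicted rating is their dot product. A toy, pure-Python version of the idea (SGD in place of MLlib's ALS, a tiny hand-made ratings list; all names here are hypothetical, not the demo's code):

```python
import random

def train(ratings, n_factors=2, epochs=1000, lr=0.01, reg=0.02, seed=1):
    """Fit user/movie latent factors so dot(P[u], Q[m]) approximates rating r."""
    rng = random.Random(seed)
    P = {u: [rng.uniform(-0.1, 0.1) for _ in range(n_factors)]
         for u, _, _ in ratings}
    Q = {m: [rng.uniform(-0.1, 0.1) for _ in range(n_factors)]
         for _, m, _ in ratings}
    for _ in range(epochs):
        for u, m, r in ratings:
            err = r - sum(a * b for a, b in zip(P[u], Q[m]))
            for f in range(n_factors):
                pu, qm = P[u][f], Q[m][f]
                P[u][f] += lr * (err * qm - reg * pu)  # gradient step + L2 reg
                Q[m][f] += lr * (err * pu - reg * qm)
    return P, Q

def predict(P, Q, user, movie):
    return sum(a * b for a, b in zip(P[user], Q[movie]))

ratings = [("u1", "m1", 5.0), ("u1", "m2", 1.0),
           ("u2", "m1", 4.0), ("u2", "m2", 1.0),
           ("u3", "m1", 5.0)]  # u3 has never rated m2
P, Q = train(ratings)
# predict(P, Q, "u3", "m2") now estimates the unseen pairing
```

The demo does the same thing at scale: MLlib's ALS learns the factors on YARN, then every unseen pairing is scored and written back to MongoDB.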


Spark Recommender

• Apache Hadoop (2.3): HDFS & YARN

• Spark (1.0): executes within YARN; assign executor resources

• Data: from HDFS and MongoDB; to MongoDB


MovieWeb Workflow

1. Snapshot the database as BSON
2. Store the BSON in HDFS
3. Read the BSON into the Spark app
4. Train the model from existing ratings
5. Create user-movie pairings
6. Predict ratings for all pairings
7. Write predictions to a MongoDB collection
8. Web application exposes the recommendations
9. Repeat the process
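The "create user-movie pairings" step of the workflow amounts to a cartesian product of users and movies minus the pairs that already have ratings. A minimal sketch (hypothetical helper name):

```python
def unseen_pairings(ratings, users, movies):
    """All (user, movie) pairs with no existing rating, i.e. the
    pairs the model still needs to score."""
    rated = {(u, m) for u, m, _ in ratings}
    return [(u, m) for u in users for m in movies if (u, m) not in rated]

ratings = [("u1", "m1", 5.0), ("u2", "m2", 3.0)]
pairs = unseen_pairings(ratings, ["u1", "u2"], ["m1", "m2"])
print(pairs)  # → [('u1', 'm2'), ('u2', 'm1')]
```

At MovieLens scale (70K users x 10K movies) this product is large, which is exactly why the scoring runs as a distributed Spark job rather than in the web app.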


Execution

$ spark-submit --master local \
    --driver-memory 2G --executor-memory 2G \
    --jars mongo-hadoop-core.jar,mongo-java-driver.jar \
    --class com.mongodb.workshop.SparkExercise \
    ./target/spark-1.0-SNAPSHOT.jar \
    hdfs://localhost:9000 \
    mongodb://127.0.0.1:27017/movielens \
    predictions

Should I use MongoDB or Hadoop?


Business First!

(Same use-case diagram as above.)

Decide the what and the why before the how.


The right tool for the task

• Dataset size
• Data processing complexity
• Continuous improvement

V1.0


The right tool for the task

• Dataset size
• Data processing complexity
• Continuous improvement

V2.0


Resources / Questions

• MongoDB Connector for Hadoop: http://github.com/mongodb/mongo-hadoop

• Getting Started with MongoDB and Hadoop: http://docs.mongodb.org/ecosystem/tutorial/getting-started-with-hadoop/

• MongoDB-Spark Demo: https://github.com/crcsmnky/mongodb-hadoop-workshop
