MongoDB et Hadoop

Paris Tugdual Grall Technical Evangelist [email protected] @tgrall

Post on 30-Jun-2015

TRANSCRIPT

Page 1: MongoDB et Hadoop

Paris

Tugdual Grall
Technical Evangelist
[email protected]
@tgrall

Page 2: MongoDB et Hadoop

MongoDB & Hadoop

Tugdual Grall
Technical Evangelist
[email protected]
@tgrall

Page 3: MongoDB et Hadoop

Agenda

Evolving Data Landscape

MongoDB & Hadoop Use Cases

MongoDB Connector Features

Demo

Page 4: MongoDB et Hadoop

Evolving Data Landscape

Page 5: MongoDB et Hadoop

Hadoop

“The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.”

http://hadoop.apache.org

• Terabyte and Petabyte datasets
• Data warehousing
• Advanced analytics

Page 6: MongoDB et Hadoop

Enterprise IT Stack

Page 7: MongoDB et Hadoop

Operational vs. Analytical: Enrichment

Warehouse, Analytics

Applications, Interactions

Page 8: MongoDB et Hadoop

Operational: MongoDB

First-Level Analytics

Internet of Things

Social

Mobile Apps

Product/Asset Catalog

Security & Fraud

Single View

Customer Data Management

Churn Analysis

Risk Modeling

Sentiment Analysis

Trade Surveillance

Recommender

Warehouse & ETL

Ad Targeting

Predictive Analytics

Page 9: MongoDB et Hadoop

Analytical: Hadoop

First-Level Analytics

Internet of Things

Social

Mobile Apps

Product/Asset Catalog

Security & Fraud

Single View

Customer Data Management

Churn Analysis

Risk Modeling

Sentiment Analysis

Trade Surveillance

Recommender

Warehouse & ETL

Ad Targeting

Predictive Analytics

Page 10: MongoDB et Hadoop

Operational & Analytical: Lifecycle

First-Level Analytics

Internet of Things

Social

Mobile Apps

Product/Asset Catalog

Security & Fraud

Single View

Customer Data Management

Churn Analysis

Risk Modeling

Sentiment Analysis

Trade Surveillance

Recommender

Warehouse & ETL

Ad Targeting

Predictive Analytics

Page 11: MongoDB et Hadoop

MongoDB & Hadoop Use Cases

Page 12: MongoDB et Hadoop

Commerce

Applications powered by MongoDB

Analysis powered by Hadoop

Products & Inventory

Recommended products

Customer profile

Session management

Elastic pricing

Recommendation models

Predictive analytics

Clickstream history

MongoDB Connector for Hadoop

Page 13: MongoDB et Hadoop

Insurance

Applications powered by MongoDB

Analysis powered by Hadoop

Customer profiles

Insurance policies

Session data

Call center data

Customer action analysis

Churn analysis

Churn prediction

Policy rates

MongoDB Connector for Hadoop

Page 14: MongoDB et Hadoop

Fraud Detection

MongoDB Connector for Hadoop

Payments

Fraud Detection

3rd Party Data Sources

Nightly Analysis

Results Cache

Query Only

Page 15: MongoDB et Hadoop

MongoDB Connector for Hadoop

Page 16: MongoDB et Hadoop

Connector Overview

DATA

• Read/Write MongoDB
• Read/Write BSON

TOOLS

• MapReduce
• Pig
• Hive
• Spark

PLATFORMS

• Apache Hadoop
• Cloudera CDH
• Hortonworks HDP
• MapR
• Amazon EMR

Page 17: MongoDB et Hadoop

Connector Features and Functionality

• Computes splits to read data
- Single Node, Replica Sets, Sharded Clusters

• Mappings for Pig and Hive
- MongoDB as a standard data source/destination

• Support for
- Filtering data with MongoDB queries
- Authentication
- Reading from Replica Set tags
- Appending to existing collections
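As a rough illustration of the filtering, authentication, and replica-set-tag features listed on this slide, the sketch below assembles a connector input configuration in Python. The property names `mongo.input.uri` and `mongo.input.query` are connector settings, and `readPreference` / `readPreferenceTags` are standard MongoDB connection-string options; the assumption here is that the connector picks the tags up from the URI. The host, credentials, tag, and field names are placeholders invented for this example.

```python
import json
from urllib.parse import quote_plus


def build_input_conf(host, database, collection, user=None, password=None,
                     query=None, tags=None):
    """Return connector input properties as a plain dict (illustrative only)."""
    credentials = ""
    if user is not None and password is not None:
        # Percent-encode credentials so characters such as '@' stay valid in the URI.
        credentials = "{}:{}@".format(quote_plus(user), quote_plus(password))

    uri = "mongodb://{}{}/{}.{}".format(credentials, host, database, collection)
    if tags:
        # Connection-string options asking for reads from tagged secondaries.
        tag_set = ",".join("{}:{}".format(k, v) for k, v in sorted(tags.items()))
        uri += "?readPreference=secondary&readPreferenceTags=" + tag_set

    conf = {"mongo.input.uri": uri}
    if query is not None:
        # The filter is a JSON document, written like a normal find() query.
        conf["mongo.input.query"] = json.dumps(query, sort_keys=True)
    return conf


conf = build_input_conf(
    "mydb:27017", "db1", "collection1",
    user="analyst", password="p@ss",   # hypothetical credentials
    query={"rating": {"$gte": 4}},     # hypothetical field name
    tags={"use": "analytics"},         # hypothetical replica set tag
)
```

The resulting dictionary could then be handed to a Hadoop or Spark job configuration; pushing the filter into `mongo.input.query` means only matching documents leave the database.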

Page 18: MongoDB et Hadoop

MapReduce Configuration

• MongoDB input/output
    mongo.job.input.format = com.mongodb.hadoop.MongoInputFormat
    mongo.input.uri = mongodb://mydb:27017/db1.collection1
    mongo.job.output.format = com.mongodb.hadoop.MongoOutputFormat
    mongo.output.uri = mongodb://mydb:27017/db1.collection2

• BSON input/output
    mongo.job.input.format = com.mongodb.hadoop.BSONFileInputFormat
    mapred.input.dir = hdfs:///tmp/database.bson
    mongo.job.output.format = com.mongodb.hadoop.BSONFileOutputFormat
    mapred.output.dir = hdfs:///tmp/output.bson
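One way to pass such properties to a job is as `-D key=value` generic options on the `hadoop jar` command line. The Python sketch below builds that argument list from the MongoDB input/output settings on this slide. It assumes the job's driver class goes through Hadoop's `ToolRunner` (so that `-D` options are parsed); the jar name and class name are hypothetical.

```python
# MongoDB input/output properties, copied from the slide.
MONGO_IO = {
    "mongo.job.input.format": "com.mongodb.hadoop.MongoInputFormat",
    "mongo.input.uri": "mongodb://mydb:27017/db1.collection1",
    "mongo.job.output.format": "com.mongodb.hadoop.MongoOutputFormat",
    "mongo.output.uri": "mongodb://mydb:27017/db1.collection2",
}


def as_generic_options(conf):
    """Turn a property dict into ['-D', 'key=value', ...] in a stable order."""
    args = []
    for key in sorted(conf):
        args.extend(["-D", "{}={}".format(key, conf[key])])
    return args


def hadoop_command(jar, main_class, conf):
    """Build an argv list for `hadoop jar`; nothing is executed here."""
    return ["hadoop", "jar", jar, main_class] + as_generic_options(conf)


# Hypothetical jar and driver class, for illustration only.
command = hadoop_command("my-job.jar", "com.example.MyJob", MONGO_IO)
```

The list form can be passed to `subprocess.run(command, check=True)` without shell quoting concerns. The same properties could equally live in an XML configuration file.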

Page 19: MongoDB et Hadoop

Pig Mappings

• Input: BSONLoader and MongoLoader
    data = LOAD 'mongodb://mydb:27017/db.collection' USING com.mongodb.hadoop.pig.MongoLoader;

• Output: BSONStorage and MongoInsertStorage
    STORE records INTO 'hdfs:///output.bson' USING com.mongodb.hadoop.pig.BSONStorage;

Page 20: MongoDB et Hadoop

Hive Support

• Access collections as Hive tables
• Use with MongoStorageHandler or BSONStorageHandler

CREATE TABLE mongo_users (id int, name string, age int)
STORED BY "com.mongodb.hadoop.hive.MongoStorageHandler"
WITH SERDEPROPERTIES ("mongo.columns.mapping" = "_id,name,age")
TBLPROPERTIES ("mongo.uri" = "mongodb://host:27017/test.users");

Page 21: MongoDB et Hadoop

Spark

• Use with MapReduce input/output formats

• Create Configuration objects with input/output formats and data URI

• Load/save data using SparkContext Hadoop file API
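The three points on this slide can be sketched in PySpark as follows. The helpers take an existing `SparkContext` or RDD and call `newAPIHadoopRDD` / `saveAsNewAPIHadoopFile` with the connector's input and output format classes. The key and value classes shown are an assumption that matches common connector examples; depending on the Spark and connector versions, converters may also be needed. The connector and driver jars must be on the Spark classpath.

```python
def load_collection(sc, uri):
    """Read a MongoDB collection as an RDD of (key, document) pairs.

    `sc` is an existing pyspark SparkContext; `uri` names db.collection.
    """
    return sc.newAPIHadoopRDD(
        inputFormatClass="com.mongodb.hadoop.MongoInputFormat",
        keyClass="org.apache.hadoop.io.Text",
        valueClass="org.apache.hadoop.io.MapWritable",
        conf={"mongo.input.uri": uri},
    )


def save_collection(rdd, uri):
    """Write an RDD of (key, document) pairs to a MongoDB collection."""
    rdd.saveAsNewAPIHadoopFile(
        # The file API requires a path; the output URI decides the destination.
        path="file:///this-is-unused",
        outputFormatClass="com.mongodb.hadoop.MongoOutputFormat",
        keyClass="org.apache.hadoop.io.Text",
        valueClass="org.apache.hadoop.io.MapWritable",
        conf={"mongo.output.uri": uri},
    )


# Usage (requires pyspark plus the connector jars; collection name is assumed):
#   from pyspark import SparkContext
#   sc = SparkContext(appName="mongo-hadoop-sketch")
#   ratings = load_collection(sc, "mongodb://127.0.0.1:27017/movielens.ratings")
```

Keeping the Hadoop format classes and URIs in small helpers makes it easy to swap the MongoDB formats for the BSON file formats when reading snapshots from HDFS.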

Page 22: MongoDB et Hadoop

Data Movement

Dynamic queries to MongoDB vs. BSON snapshots in HDFS

Dynamic queries with most recent data

Puts load on operational database

Snapshots move load to Hadoop

Snapshots add predictable load to MongoDB
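A snapshot is commonly taken with `mongodump` and then copied into HDFS. The sketch below only builds the two command lines, so it can be inspected before anything runs. It assumes `mongodump` and the `hdfs` CLI are installed; the database, collection, and directory names are placeholders.

```python
import shlex


def snapshot_commands(database, collection, dump_dir, hdfs_dir):
    """Return the dump and upload commands as argv lists (nothing is run)."""
    # mongodump writes <out>/<database>/<collection>.bson
    bson_file = "{}/{}/{}.bson".format(dump_dir, database, collection)
    dump = ["mongodump", "--db", database,
            "--collection", collection, "--out", dump_dir]
    put = ["hdfs", "dfs", "-put", bson_file, hdfs_dir]
    return [dump, put]


# Placeholder names for illustration.
commands = snapshot_commands("movielens", "ratings", "/tmp/dump", "/data/snapshots")
printable = [" ".join(shlex.quote(arg) for arg in cmd) for cmd in commands]
```

Each list can be executed with `subprocess.run(cmd, check=True)`. Running the dump on a schedule, ideally against a secondary, is what makes the load on MongoDB predictable.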

Page 23: MongoDB et Hadoop

Demo : Recommendation Platform

Page 24: MongoDB et Hadoop

Movie Web

Page 25: MongoDB et Hadoop

MovieWeb Web Application

• Browse
- Top movies by ratings count
- Top genres by movie count

• Log in to
- See My Ratings
- Rate movies

• Recommendations
- Movies You May Like
- Recommendations
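The two "Browse" views map naturally onto MongoDB queries. The sketch below builds aggregation pipelines for them as plain Python data. It is not taken from the demo code: the field names `ratings` (a per-movie ratings count) and `genres` (an array of genre names) are assumptions about the document shape.

```python
def top_movies_pipeline(limit=10):
    """Movies ordered by their (assumed) precomputed ratings count."""
    return [
        {"$sort": {"ratings": -1}},
        {"$limit": limit},
    ]


def top_genres_pipeline(limit=10):
    """Genres ordered by how many movies carry them."""
    return [
        {"$unwind": "$genres"},                                # one document per genre
        {"$group": {"_id": "$genres", "count": {"$sum": 1}}},  # count movies per genre
        {"$sort": {"count": -1}},
        {"$limit": limit},
    ]


# Usage with PyMongo (collection name is assumed):
#   for doc in db.movies.aggregate(top_genres_pipeline(5)):
#       print(doc["_id"], doc["count"])
```

Both views are answered by the operational database directly, which is the kind of first-level analytics the earlier slides place on the MongoDB side.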

Page 26: MongoDB et Hadoop

MovieWeb Components

• MovieLens dataset
– 10M ratings, 10K movies, 70K users
– http://grouplens.org/datasets/movielens/

• Python web app to browse movies, recommendations
– Flask, PyMongo

• Spark app computes recommendations
– MLlib collaborative filter

• Predicted ratings are exposed in web app
– New predictions collection

Page 27: MongoDB et Hadoop

Spark Recommender

• Apache Hadoop (2.3)
- HDFS & YARN
- Top genres by movie count

• Spark (1.0)
- Execute within YARN
- Assign executor resources

• Data
- From HDFS, MongoDB
- To MongoDB

Page 28: MongoDB et Hadoop

MovieWeb Workflow

Snapshot db as BSON

Store BSON in HDFS

Read BSON into Spark App

Train Model from existing ratings

Create user movie pairing

Predict ratings for all pairings

Write Prediction to MongoDB collection

Web Application exposes recommendations

Repeat Process
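The "create user movie pairing" step in this workflow can be illustrated in plain Python. The sketch assumes that only pairs without an existing rating need a prediction, which is a simplification the slides do not state. The demo itself does this at scale inside Spark; this small version just shows the idea.

```python
from itertools import product


def unrated_pairs(ratings):
    """Return (user, movie) pairs that have no rating yet.

    `ratings` is an iterable of (user_id, movie_id, rating) tuples.
    """
    rated = {(user, movie) for user, movie, _ in ratings}
    users = sorted({user for user, _ in rated})
    movies = sorted({movie for _, movie in rated})
    # Every user crossed with every movie, minus what is already rated.
    return [pair for pair in product(users, movies) if pair not in rated]


ratings = [(1, 10, 4.0), (1, 20, 3.5), (2, 10, 5.0)]
pairs = unrated_pairs(ratings)  # [(2, 20)]
```

A trained model would then predict a rating for each remaining pair, and those predictions are what get written back to the MongoDB collection.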

Page 29: MongoDB et Hadoop

Execution

$ spark-submit --master local \
    --driver-memory 2G --executor-memory 2G \
    --jars mongo-hadoop-core.jar,mongo-java-driver.jar \
    --class com.mongodb.workshop.SparkExercise \
    ./target/spark-1.0-SNAPSHOT.jar \
    hdfs://localhost:9000 \
    mongodb://127.0.0.1:27017/movielens \
    predictions

Page 30: MongoDB et Hadoop

Should I use MongoDB or Hadoop?

Page 31: MongoDB et Hadoop

Business First!

First-Level Analytics

Internet of Things

Social

Mobile Apps

Product/Asset Catalog

Security & Fraud

Single View

Customer Data Management

Churn Analysis

Risk Modeling

Sentiment Analysis

Trade Surveillance

Recommender

Warehouse & ETL

Ad Targeting

Predictive Analytics

What/Why How

Page 32: MongoDB et Hadoop

The right tool for the task

• Dataset size
• Data processing complexity
• Continuous improvement

V1.0

Page 33: MongoDB et Hadoop

The right tool for the task

• Dataset size
• Data processing complexity
• Continuous improvement

V2.0

Page 34: MongoDB et Hadoop

Resources / Questions

• MongoDB Connector for Hadoop
- http://github.com/mongodb/mongo-hadoop

• Getting Started with MongoDB and Hadoop
- http://docs.mongodb.org/ecosystem/tutorial/getting-started-with-hadoop/

• MongoDB-Spark Demo
- https://github.com/crcsmnky/mongodb-hadoop-workshop

Page 35: MongoDB et Hadoop

MongoDB & Hadoop

Tugdual Grall
Technical Evangelist
[email protected]
@tgrall