MongoDB and Hadoop: Driving Business Insights

DESCRIPTION

MongoDB and Hadoop can work together to solve big data problems facing today's enterprises. We will take an in-depth look at how the two technologies complement and enrich each other with complex analyses and greater intelligence. We will take a deep dive into the MongoDB Connector for Hadoop and how it can be applied to enable new business insights with MapReduce, Pig, and Hive, and demo a Spark application to drive product recommendations.

TRANSCRIPT

Page 1: MongoDB and Hadoop: Driving Business Insights

MongoDB and Hadoop: Driving Business Insights

Sandeep Parikh

Senior Solutions Architect, MongoDB

@crcsmnky

#mongodb #mongodbdays #hadoop

Page 2: MongoDB and Hadoop: Driving Business Insights

Agenda

• Introduction

• Use Cases

• Components

• Connector

• Demo

• Questions

Page 3: MongoDB and Hadoop: Driving Business Insights

Introduction

Page 4: MongoDB and Hadoop: Driving Business Insights

Hadoop

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.

• Terabyte and petabyte datasets

• Data warehousing

• Advanced analytics

Page 5: MongoDB and Hadoop: Driving Business Insights

Enterprise IT Stack

[Diagram: the enterprise IT stack. Applications (CRM, ERP, Collaboration, Mobile, BI) sit atop a data management layer split into operational systems (RDBMS) and analytical systems (RDBMS, EDW), all running on infrastructure (OS & virtualization, compute, storage, network), with Management & Monitoring and Security & Auditing spanning every layer.]

Page 6: MongoDB and Hadoop: Driving Business Insights

Operational vs. Analytical: Enrichment

[Diagram: the operational side (applications, interactions) and the analytical side (warehouse, analytics) enrich one another.]

Page 7: MongoDB and Hadoop: Driving Business Insights

Operational: MongoDB

[Diagram: a spectrum of workloads – First-level Analytics, Product/Asset Catalogs, Security & Fraud, Internet of Things, Mobile Apps, Customer Data Mgmt, Single View, Social, Churn Analysis, Recommender, Warehouse & ETL, Risk Modeling, Trade Surveillance, Predictive Analytics, Ad Targeting, Sentiment Analysis – with the operational workloads served by MongoDB highlighted.]

Page 8: MongoDB and Hadoop: Driving Business Insights

Analytical: Hadoop

[Diagram: the same workload spectrum, with the analytical workloads served by Hadoop highlighted.]

Page 9: MongoDB and Hadoop: Driving Business Insights

Operational vs. Analytical: Lifecycle

[Diagram: the same workload spectrum, illustrating how workloads move between the operational and analytical sides over their lifecycle.]

Page 10: MongoDB and Hadoop: Driving Business Insights

Use Cases

Page 11: MongoDB and Hadoop: Driving Business Insights

Commerce

Applications powered by MongoDB:
• Products & inventory
• Recommended products
• Customer profile
• Session management

Analysis powered by Hadoop:
• Elastic pricing
• Recommendation models
• Predictive analytics
• Clickstream history

Data moves between the two via the MongoDB Connector for Hadoop.

Page 12: MongoDB and Hadoop: Driving Business Insights

Insurance

Applications powered by MongoDB:
• Customer profiles
• Insurance policies
• Session data
• Call center data

Analysis powered by Hadoop:
• Customer action analysis
• Churn analysis
• Churn prediction
• Policy rates

Data moves between the two via the MongoDB Connector for Hadoop.

Page 13: MongoDB and Hadoop: Driving Business Insights

Fraud Detection

[Diagram: online payments processing writes payment data to MongoDB; the MongoDB Connector for Hadoop feeds nightly analysis in Hadoop, where fraud models are built and enriched with 3rd-party data sources; results are written back to a MongoDB results cache, which the fraud detection service reads query-only.]

Page 14: MongoDB and Hadoop: Driving Business Insights

Components

Page 15: MongoDB and Hadoop: Driving Business Insights

Overview

[Diagram: the Hadoop stack – HDFS at the base, YARN for resource management, with MapReduce, Pig, Hive, and Spark running on top.]

Page 16: MongoDB and Hadoop: Driving Business Insights

HDFS and YARN

• Hadoop Distributed File System
– Distributed file system that stores data on commodity machines in a Hadoop cluster

• YARN
– Resource management platform responsible for managing and scheduling compute resources in a Hadoop cluster

Page 17: MongoDB and Hadoop: Driving Business Insights

MapReduce

• Parallel, distributed computation across a Hadoop cluster

• Process and/or generate large datasets

• Simple model for individual tasks:

Map(k1, v1) → list(k2, v2)
Reduce(k2, list(v2)) → list(v3)

Page 18: MongoDB and Hadoop: Driving Business Insights

Pig

• High-level platform for creating MapReduce jobs

• Pig Latin abstracts away Java MapReduce code into an easier-to-use notation

• Executed as a series of MapReduce applications

• Supports user-defined functions (UDFs)

Page 19: MongoDB and Hadoop: Driving Business Insights

Hive

• Data warehouse infrastructure built on top of Hadoop

• Provides data summarization, query, and analysis

• HiveQL is a subset of SQL

• Support for user-defined functions (UDFs)

Page 20: MongoDB and Hadoop: Driving Business Insights

Spark

Spark is a fast and powerful engine for processing Hadoop data. It is designed to perform both general data processing (similar to MapReduce) and new workloads like streaming, interactive queries, and machine learning.

• Powerful built-in transformations and actions
– Transformations: map, reduceByKey, union, distinct, sample, intersection, and more
– Actions: foreach, count, collect, take, and many more
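A minimal sketch of this transformation/action split using the Spark 1.x Java API (the class name, input data, and local master are illustrative, not from the deck):

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class TransformExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("TransformExample").setMaster("local");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<String> words = sc.parallelize(Arrays.asList("spark", "hadoop", "spark"));

        // Transformations are lazy: build (word, 1) pairs, then sum per key
        JavaPairRDD<String, Integer> counts = words
            .mapToPair(word -> new Tuple2<>(word, 1))
            .reduceByKey((a, b) -> a + b);

        // Actions trigger execution: count and collect pull results to the driver
        System.out.println(counts.count());
        counts.collect().forEach(System.out::println);

        sc.stop();
    }
}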

Page 21: MongoDB and Hadoop: Driving Business Insights

MongoDB Connector for Hadoop

Page 22: MongoDB and Hadoop: Driving Business Insights

Connector Overview

• Data: read/write MongoDB, read/write BSON

• Tools: MapReduce, Pig, Hive, Spark

• Platforms: Apache Hadoop, Cloudera CDH, Hortonworks HDP, Amazon EMR

Page 23: MongoDB and Hadoop: Driving Business Insights

Features and Functionality

• MongoDB and BSON
– Input and Output formats

• Computes splits to read data

• Support for
– Filtering data with MongoDB queries (see the sketch after this list)
– Authentication
– Reading directly from shard Primaries
– ReadPreferences and Replica Set tags
– Appending to existing collections
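Filtering is driven by configuration: the connector reads a JSON query from the mongo.input.query key. A minimal sketch, where the collection URI and the "year" field are illustrative:

import org.apache.hadoop.conf.Configuration;

public class FilteredInput {
    public static Configuration build() {
        Configuration conf = new Configuration();
        // Read from this collection...
        conf.set("mongo.input.uri", "mongodb://mydb:27017/db1.collection1");
        // ...but only documents matching this MongoDB query
        conf.set("mongo.input.query", "{\"year\": {\"$gte\": 2010}}");
        return conf;
    }
}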

Page 24: MongoDB and Hadoop: Driving Business Insights

MapReduce Configuration

• MongoDB input
– mongo.job.input.format = com.mongodb.hadoop.MongoInputFormat
– mongo.input.uri = mongodb://mydb:27017/db1.collection1

• MongoDB output
– mongo.job.output.format = com.mongodb.hadoop.MongoOutputFormat
– mongo.output.uri = mongodb://mydb:27017/db1.collection2

• BSON input/output
– mongo.job.input.format = com.mongodb.hadoop.BSONFileInputFormat
– mapred.input.dir = hdfs:///tmp/database.bson
– mongo.job.output.format = com.mongodb.hadoop.BSONFileOutputFormat
– mapred.output.dir = hdfs:///tmp/output.bson
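A sketch of a driver wiring these keys into a Hadoop job (new MapReduce API; the Map and Reduce classes are the examples on the next two slides, and the job name and exact wiring are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import com.mongodb.hadoop.MongoInputFormat;
import com.mongodb.hadoop.MongoOutputFormat;
import com.mongodb.hadoop.io.BSONWritable;

public class GenreCountJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("mongo.input.uri", "mongodb://mydb:27017/db1.collection1");
        conf.set("mongo.output.uri", "mongodb://mydb:27017/db1.collection2");

        Job job = Job.getInstance(conf, "genre-count");
        job.setJarByClass(GenreCountJob.class);
        job.setMapperClass(Map.class);        // Mapper Example, next slide
        job.setReducerClass(Reduce.class);    // Reducer Example, the slide after
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(BSONWritable.class);
        job.setInputFormatClass(MongoInputFormat.class);
        job.setOutputFormatClass(MongoOutputFormat.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}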

Page 25: MongoDB and Hadoop: Driving Business Insights

Mapper Example

public class Map extends Mapper<Object, BSONObject, Text, IntWritable> {
    public void map(Object key, BSONObject doc, Context context)
            throws IOException, InterruptedException {
        // Emit (genre, 1) for each entry in the document's "genres" array
        List<String> genres = (List<String>) doc.get("genres");
        for (String genre : genres) {
            context.write(new Text(genre), new IntWritable(1));
        }
    }
}

Example input documents:

{ _id: ObjectId(…), title: "Toy Story", genres: ["Animation", "Children"] }
{ _id: ObjectId(…), title: "Goldeneye", genres: ["Action", "Crime", "Thriller"] }
{ _id: ObjectId(…), title: "Jumanji", genres: ["Adventure", "Children", "Fantasy"] }

Page 26: MongoDB and Hadoop: Driving Business Insights

Reducer Example

Example output documents:

{ _id: ObjectId(…), genre: "Action", count: 1370 }
{ _id: ObjectId(…), genre: "Adventure", count: 957 }
{ _id: ObjectId(…), genre: "Animation", count: 258 }

public class Reduce extends Reducer<Text, IntWritable, NullWritable, BSONWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Sum the 1s emitted by the mapper for this genre
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        // Package the result as a MongoDB document
        DBObject object = new BasicDBObjectBuilder().start()
            .add("genre", key.toString())
            .add("count", sum)
            .get();
        context.write(NullWritable.get(), new BSONWritable(object));
    }
}

Page 27: MongoDB and Hadoop: Driving Business Insights

Pig – Mappings

Read:
– BSONLoader and MongoLoader

data = LOAD 'mongodb://mydb:27017/db.collection'
    USING com.mongodb.hadoop.pig.MongoLoader;

– Maps schema, _id, datatypes

Insert:
– BSONStorage and MongoInsertStorage

STORE records INTO 'hdfs:///output.bson'
    USING com.mongodb.hadoop.pig.BSONStorage;

– Maps output id, schema

Update:
– MongoUpdateStorage
– Specify query, update operations, schema, update options

Page 28: MongoDB and Hadoop: Driving Business Insights

Pig Specifics

• Fixed or dynamic schema with Loader

• Types auto-mapped
– Embedded documents → Map
– Arrays → Tuple

• Supply an alias for "_id"
– "_id" is not a legal Pig variable name

Page 29: MongoDB and Hadoop: Driving Business Insights

Hive – Tables

CREATE TABLE mongo_users (id int, name string, age int)
STORED BY "com.mongodb.hadoop.hive.MongoStorageHandler"
WITH SERDEPROPERTIES("mongo.columns.mapping" = "_id,name,age")
TBLPROPERTIES("mongo.uri" = "mongodb://host:27017/test.users")

• Access collections as Hive tables

• Use with MongoStorageHandler or BSONStorageHandler

Page 30: MongoDB and Hadoop: Driving Business Insights

Hive Particulars

• Queries are not (currently) pushed down to MongoDB

• WHERE predicates are evaluated after reading data from MongoDB

• Types auto-mapped
– Embedded documents (mixed types) → STRUCT
– Embedded documents (single type) → MAP
– Arrays → ARRAY
– ObjectId → STRUCT

• Use EXTERNAL when creating tables; otherwise dropping the Hive table drops the underlying collection

Page 31: MongoDB and Hadoop: Driving Business Insights

Spark Usage

• Use with MapReduce input/output formats

• Create Configuration objects with input/output formats and data URI

• Load/save data using SparkContext Hadoop file or RDD APIs

Page 32: MongoDB and Hadoop: Driving Business Insights

Spark Input Example

// Reading BSON files from HDFS
Configuration bsonDataConfig = new Configuration();
bsonDataConfig.set("mongo.job.input.format",
    "com.mongodb.hadoop.BSONFileInputFormat");

JavaPairRDD<Object, BSONObject> bsonData = sc.newAPIHadoopFile(
    "hdfs://namenode:9000/data/test/foo.bson",
    BSONFileInputFormat.class, Object.class,
    BSONObject.class, bsonDataConfig);

// Reading directly from a live MongoDB collection
Configuration inputDataConfig = new Configuration();
inputDataConfig.set("mongo.job.input.format",
    "com.mongodb.hadoop.MongoInputFormat");
inputDataConfig.set("mongo.input.uri", "mongodb://127.0.0.1/test.foo");

JavaPairRDD<Object, BSONObject> inputData = sc.newAPIHadoopRDD(
    inputDataConfig, MongoInputFormat.class, Object.class,
    BSONObject.class);
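Writing results back uses the same pattern in reverse; a sketch that reuses the bsonData RDD from above with an illustrative output collection (the path argument is required by the API but ignored by MongoOutputFormat):

Configuration outputConfig = new Configuration();
outputConfig.set("mongo.output.uri", "mongodb://127.0.0.1/test.results");

bsonData.saveAsNewAPIHadoopFile(
    "file:///unused",                 // ignored by MongoOutputFormat
    Object.class, BSONObject.class,
    MongoOutputFormat.class, outputConfig);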

Page 33: MongoDB and Hadoop: Driving Business Insights

Data Movement

Dynamic queries to MongoDB vs. BSON snapshots in HDFS:

• Dynamic queries always see the most recent data, but put analytical load on the operational database.

• Snapshots move that load to Hadoop and add only predictable load to MongoDB.

Page 34: MongoDB and Hadoop: Driving Business Insights

Demo

Page 35: MongoDB and Hadoop: Driving Business Insights

MovieWeb

Page 36: MongoDB and Hadoop: Driving Business Insights

MovieWeb Components

• MovieLens dataset
– 10M ratings, 10K movies, 70K users

• Python web app to browse movies, recommendations
– Flask, PyMongo

• Spark app computes recommendations (sketched after this list)
– MLlib collaborative filtering

• Predicted ratings are exposed in web app
– New predictions collection
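The recommendation step itself is compact with MLlib; a sketch, assuming ratings have already been read into a JavaRDD<Rating> via the connector (the rank and iteration counts are illustrative, not necessarily the demo's settings):

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.recommendation.ALS;
import org.apache.spark.mllib.recommendation.MatrixFactorizationModel;
import org.apache.spark.mllib.recommendation.Rating;

public class Recommend {
    public static MatrixFactorizationModel train(JavaRDD<Rating> ratings) {
        int rank = 10;        // number of latent factors
        int iterations = 10;  // ALS sweeps over the data
        // Train a collaborative-filtering model from existing user ratings;
        // model.predict(...) then scores user-movie pairings
        return ALS.train(ratings.rdd(), rank, iterations);
    }
}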

Page 37: MongoDB and Hadoop: Driving Business Insights

MovieWeb Web Application

• Browse
– Top movies by ratings count
– Top genres by movie count

• Log in to
– See My Ratings
– Rate movies

• What's missing?
– Movies You May Like
– Recommendations

Page 38: MongoDB and Hadoop: Driving Business Insights

Spark Recommender

• Apache Hadoop 2.3.0
– HDFS

• Spark 1.0
– Execute locally
– Assign executor resources

• Data
– From HDFS
– To MongoDB

Page 39: MongoDB and Hadoop: Driving Business Insights

MovieWeb Workflow

1. Snapshot database as BSON
2. Store BSON in HDFS
3. Read BSON into Spark app
4. Train model from existing ratings
5. Create user-movie pairings
6. Predict ratings for all pairings
7. Write predictions to MongoDB collection
8. Web application exposes recommendations
9. Repeat the process

Page 40: MongoDB and Hadoop: Driving Business Insights

Execution

$ bin/spark-submit \
    --master local \
    --class com.mongodb.hadoop.demo.Recommender \
    --jars mongo-java-2.12.3.jar,mongo-hadoop-core-1.3.0.jar \
    --driver-memory 2G \
    --executor-memory 1G \
    demo-1.0.jar [insert job args here]

Page 41: MongoDB and Hadoop: Driving Business Insights

Questions?

• MongoDB Connector for Hadoop
– http://github.com/mongodb/mongo-hadoop

• Getting Started with MongoDB and Hadoop
– http://docs.mongodb.org/ecosystem/tutorial/getting-started-with-hadoop/

• MongoDB-Spark Demo
– http://github.com/crcsmnky/mongodb-spark-demo

Page 42: MongoDB and Hadoop: Driving Business Insights

Thank You

Sandeep Parikh

Senior Solutions Architect, MongoDB

@crcsmnky

#mongodb #mongodbdays #hadoop