MongoDB and Hadoop: Driving Business Insights
Sandeep Parikh
Senior Solutions Architect, MongoDB
@crcsmnky
#mongodb #mongodbdays #hadoop
Agenda
• Introduction
• Use Cases
• Components
• Connector
• Demo
• Questions
Introduction
Hadoop
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.
• Terabyte- and petabyte-scale datasets
• Data warehousing
• Advanced analytics
Enterprise IT Stack
[Diagram: the enterprise IT stack. Applications (CRM, ERP, Collaboration, Mobile, BI) sit on a Data Management layer split into Operational (RDBMS) and Analytical (RDBMS, EDW) sides, all running on Infrastructure (OS & Virtualization, Compute, Storage, Network), with Management & Monitoring and Security & Auditing spanning every layer.]
Operational vs. Analytical: Enrichment
[Diagram: workloads arranged on a spectrum from operational (applications, interactions) to analytical (warehouse, analytics): First-level Analytics, Product/Asset Catalogs, Security & Fraud, Internet of Things, Mobile Apps, Customer Data Mgmt, Single View, Social, Churn Analysis, Recommender, Warehouse & ETL, Risk Modeling, Trade Surveillance, Predictive Analytics, Ad Targeting, Sentiment Analysis.]
Operational: MongoDB
[Diagram: the same spectrum, with the operational workloads served by MongoDB.]
Analytical: Hadoop
[Diagram: the same spectrum, with the analytical workloads served by Hadoop.]
Operational vs. Analytical: Lifecycle
[Diagram: the same spectrum, showing data cycling between the operational and analytical sides.]
Use Cases
Commerce
Applications powered by MongoDB:
• Products & Inventory
• Recommended products
• Customer profile
• Session management
Analysis powered by Hadoop:
• Elastic pricing
• Recommendation models
• Predictive analytics
• Clickstream history
MongoDB Connector for Hadoop
Insurance
Applications powered by MongoDB:
• Customer profiles
• Insurance policies
• Session data
• Call center data
Analysis powered by Hadoop:
• Customer action analysis
• Churn analysis
• Churn prediction
• Policy rates
MongoDB Connector for Hadoop
Fraud Detection
[Diagram: an online payments processing application backed by MongoDB; the MongoDB Connector for Hadoop feeds payment data, together with 3rd party data sources, into nightly fraud-modeling analysis in Hadoop, and the resulting scores populate a results cache in MongoDB that the fraud detection service reads (query only).]
Components
Overview
[Diagram: the Hadoop component stack, with HDFS and YARN at the base and MapReduce, Pig, Hive, and Spark running on top.]
HDFS and YARN
• Hadoop Distributed File System (HDFS)
– Distributed file system that stores data on commodity machines in a Hadoop cluster
• YARN
– Resource management platform responsible for managing and scheduling compute resources in a Hadoop cluster
MapReduce
• Parallel, distributed computation across a Hadoop cluster
• Process and/or generate large datasets
• Simple programming model for individual tasks
Map(k1, v1) → list(k2, v2)
Reduce(k2, list(v2)) → list(v3)
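As a concrete illustration of this model, here is a classic word-count job written against the standard Hadoop MapReduce API (a minimal sketch, not from the deck; class names and HDFS paths are illustrative). The map step emits (word, 1) for every word, and the reduce step sums the counts per word.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // Map(k1, v1) → list(k2, v2): for each input line, emit (word, 1)
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String word : value.toString().split("\\s+")) {
                context.write(new Text(word), new IntWritable(1));
            }
        }
    }

    // Reduce(k2, list(v2)) → list(v3): sum the counts for each word
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Illustrative HDFS paths for input text and job output
        FileInputFormat.addInputPath(job, new Path("hdfs:///tmp/input"));
        FileOutputFormat.setOutputPath(job, new Path("hdfs:///tmp/output"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}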
Pig
• High-level platform for creating MapReduce programs
• Pig Latin abstracts Java MapReduce into an easier-to-use notation
• Executed as a series of MapReduce applications
• Supports user-defined functions (UDFs)
Hive
• Data warehouse infrastructure built on top of Hadoop
• Provides data summarization, query, and analysis
• HiveQL is a SQL-like query language
• Support for user-defined functions (UDFs)
Spark
Spark is a fast and powerful engine for processing Hadoop data. It is designed to perform both general data processing (similar to MapReduce) and newer workloads like streaming, interactive queries, and machine learning.
• Powerful built-in transformations and actions
– Transformations: map, reduceByKey, union, distinct, sample, intersection, and more
– Actions: foreach, count, collect, take, and many more
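A minimal sketch of this transformation/action style using Spark's Java API (the genre data, class name, and use of Java 8 lambdas are illustrative assumptions, not part of the deck). reduceByKey is a lazy transformation; collect is the action that triggers execution.

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class TransformationsSketch {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
            new SparkConf().setMaster("local").setAppName("transformations-sketch"));

        JavaRDD<String> genres = sc.parallelize(
            Arrays.asList("Action", "Action", "Comedy", "Drama", "Comedy", "Action"));

        // Transformations are lazy: tag each genre with a count of 1, then sum per genre
        JavaPairRDD<String, Integer> counts = genres
            .mapToPair(g -> new Tuple2<>(g, 1))
            .reduceByKey((a, b) -> a + b);

        // Actions trigger execution: collect the results to the driver and print them
        counts.collect().forEach(t -> System.out.println(t._1() + ": " + t._2()));

        sc.stop();
    }
}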
MongoDB Connector for Hadoop
Data: read/write MongoDB, read/write BSON
Tools: MapReduce, Pig, Hive, Spark
Platforms: Apache Hadoop, Cloudera CDH, Hortonworks HDP, Amazon EMR
Connector Overview
Features and Functionality
• MongoDB and BSON
– Input and Output formats
• Computes splits to read data
• Support for
– Filtering data with MongoDB queries (see the sketch after this list)
– Authentication
– Reading directly from shard primaries
– Read preferences and replica set tags
– Appending to existing collections
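As an illustration of the query-filtering support, a small sketch that sets the connector's mongo.input.query property on a Hadoop Configuration (the URI, query, and wrapper class are hypothetical examples):

import org.apache.hadoop.conf.Configuration;

public class FilteredInputConfig {
    public static Configuration build() {
        Configuration conf = new Configuration();
        // Read from this collection...
        conf.set("mongo.input.uri", "mongodb://mydb:27017/db1.collection1");
        // ...but only documents matching this query; the filter is applied by MongoDB
        conf.set("mongo.input.query", "{ \"genres\": \"Action\" }");
        return conf;
    }
}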
MapReduce Configuration
• MongoDB input
– mongo.job.input.format = com.mongodb.hadoop.MongoInputFormat
– mongo.input.uri = mongodb://mydb:27017/db1.collection1
• MongoDB output
– mongo.job.output.format = com.mongodb.hadoop.MongoOutputFormat
– mongo.output.uri = mongodb://mydb:27017/db1.collection2
• BSON input/output
– mongo.job.input.format = com.mongodb.hadoop.BSONFileInputFormat
– mapred.input.dir = hdfs:///tmp/database.bson
– mongo.job.output.format = com.mongodb.hadoop.BSONFileOutputFormat
– mapred.output.dir = hdfs:///tmp/output.bson
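The BSON variant can be wired up the same way; a minimal sketch that applies the properties above to a Hadoop Configuration for a job that reads and writes BSON dumps on HDFS (the paths and wrapper class are illustrative):

import org.apache.hadoop.conf.Configuration;

public class BsonJobConfig {
    public static Configuration build() {
        Configuration conf = new Configuration();
        // Read BSON dumps from HDFS instead of a live MongoDB deployment
        conf.set("mongo.job.input.format", "com.mongodb.hadoop.BSONFileInputFormat");
        conf.set("mapred.input.dir", "hdfs:///tmp/database.bson");
        // Write the job output back to HDFS as BSON
        conf.set("mongo.job.output.format", "com.mongodb.hadoop.BSONFileOutputFormat");
        conf.set("mapred.output.dir", "hdfs:///tmp/output.bson");
        return conf;
    }
}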
Mapper Example
public class Map extends Mapper<Object, BSONObject, Text, IntWritable> {
    public void map(Object key, BSONObject doc, Context context)
            throws IOException, InterruptedException {
        // Emit (genre, 1) for every genre listed on the input movie document
        List<String> genres = (List<String>) doc.get("genres");
        for (String genre : genres) {
            context.write(new Text(genre), new IntWritable(1));
        }
    }
}
{ _id: ObjectId(…), title: "Toy Story", genres: ["Animation", "Children"] }
{ _id: ObjectId(…), title: "Goldeneye", genres: ["Action", "Crime", "Thriller"] }
{ _id: ObjectId(…), title: "Jumanji", genres: ["Adventure", "Children", "Fantasy"] }
Reducer Example
{ _id: ObjectId(…), genre: "Action", count: 1370 }
{ _id: ObjectId(…), genre: "Adventure", count: 957 }
{ _id: ObjectId(…), genre: "Animation", count: 258 }
public class Reduce extends Reducer<Text, IntWritable, NullWritable, BSONWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Sum the 1s emitted by the mapper for this genre
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        // Build a BSON document { genre: ..., count: ... } and write it out
        DBObject object = new BasicDBObjectBuilder().start()
            .add("genre", key.toString())
            .add("count", sum)
            .get();
        BSONWritable doc = new BSONWritable(object);
        context.write(NullWritable.get(), doc);
    }
}
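A driver that ties the Map and Reduce classes above together might look like the following sketch (not from the deck). It uses the connector's MongoInputFormat and MongoOutputFormat together with the mongo.input.uri and mongo.output.uri properties from the configuration slide; the job name and URIs are illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

import com.mongodb.hadoop.MongoInputFormat;
import com.mongodb.hadoop.MongoOutputFormat;
import com.mongodb.hadoop.io.BSONWritable;

public class GenreCountJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Connector properties from the MapReduce Configuration slide
        conf.set("mongo.input.uri", "mongodb://mydb:27017/db1.collection1");
        conf.set("mongo.output.uri", "mongodb://mydb:27017/db1.collection2");

        Job job = Job.getInstance(conf, "genre count");
        job.setJarByClass(GenreCountJob.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);

        // Mapper emits (Text, IntWritable); reducer writes (NullWritable, BSONWritable)
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(BSONWritable.class);

        // Read from and write to MongoDB through the connector
        job.setInputFormatClass(MongoInputFormat.class);
        job.setOutputFormatClass(MongoOutputFormat.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}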
Pig – Mappings
Read:
– BSONLoader and MongoLoader
  data = LOAD 'mongodb://mydb:27017/db.collection' USING com.mongodb.hadoop.pig.MongoLoader
– Map schema, _id, datatypes
Insert:
– BSONStorage and MongoInsertStorage
  STORE records INTO 'hdfs:///output.bson' USING com.mongodb.hadoop.pig.BSONStorage
– Map output id, schema
Update:
– MongoUpdateStorage
– Specify query, update operations, schema, update options
Pig Specifics
• Fixed or dynamic schema with Loader
• Types auto-mapped
– Embedded documents → Map
– Arrays → Tuple
• Supply an alias for "_id"
– not a legal Pig variable name
Hive – Tables
CREATE TABLE mongo_users (id int, name string, age int)
STORED BY "com.mongodb.hadoop.hive.MongoStorageHandler"
WITH SERDEPROPERTIES("mongo.columns.mapping" = "_id,name,age")
TBLPROPERTIES("mongo.uri" = "mongodb://host:27017/test.users")
• Access collections as Hive tables
• Use with MongoStorageHandler or BSONStorageHandler
Hive Particulars
• Queries are not (currently) pushed down to MongoDB
• WHERE predicates are evaluated after reading data from MongoDB
• Types auto-mapped
– Embedded documents (mixed types) → STRUCT
– Embedded documents (single type) → MAP
– Arrays → ARRAY
– ObjectId → STRUCT
• Use EXTERNAL when creating tables; otherwise dropping the Hive table drops the underlying collection
Spark Usage
• Use with MapReduce input/output formats
• Create Configuration objects with input/output formats and data URI
• Load/save data using SparkContext Hadoop file or RDD APIs
Spark Input Example
Configuration bsonDataConfig = new Configuration();
bsonDataConfig.set("mongo.job.input.format", "BSONFileInputFormat.class");

// Read static BSON files (e.g. a mongodump snapshot) stored in HDFS
JavaPairRDD<Object, BSONObject> bsonData = sc.newAPIHadoopFile(
    "hdfs://namenode:9000/data/test/foo.bson",
    BSONFileInputFormat.class, Object.class,
    BSONObject.class, bsonDataConfig);

Configuration inputDataConfig = new Configuration();
inputDataConfig.set("mongo.job.input.format", "MongoInputFormat.class");
inputDataConfig.set("mongo.input.uri", "mongodb://127.0.0.1/test.foo");

// Read directly from a live MongoDB collection
JavaPairRDD<Object, BSONObject> inputData = sc.newAPIHadoopRDD(
    inputDataConfig, MongoInputFormat.class, Object.class,
    BSONObject.class);
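The same pattern works in the other direction; the following sketch (not shown in the deck, and assuming the inputData RDD and imports from the example above plus the connector's MongoOutputFormat) saves the pairs back to MongoDB. The output URI is illustrative, and the path argument is effectively unused because the connector writes to the URI rather than to a file.

Configuration outputConfig = new Configuration();
outputConfig.set("mongo.output.uri", "mongodb://127.0.0.1/test.results");

// Save the (key, BSONObject) pairs to the collection named in mongo.output.uri
inputData.saveAsNewAPIHadoopFile(
    "file:///this-is-unused",
    Object.class, BSONObject.class,
    MongoOutputFormat.class, outputConfig);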
Data Movement
Dynamic queries to MongoDB vs. BSON snapshots in HDFS
• MongoDB (dynamic queries)
– Dynamic queries run against the most recent data
– Puts load on the operational database
• BSON (snapshots)
– Snapshots move load to Hadoop
– Snapshots add predictable load to MongoDB
Demo
MovieWeb
MovieWeb Components
• MovieLens dataset
– 10M ratings, 10K movies, 70K users
• Python web app to browse movies and recommendations
– Flask, PyMongo
• Spark app computes recommendations (see the sketch after this list)
– MLlib collaborative filtering
• Predicted ratings are exposed in the web app
– New predictions collection
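The collaborative-filtering step can be sketched with Spark MLlib's ALS API in Java (a minimal, self-contained sketch with toy data, not the demo's actual code; in the demo the ratings come from the BSON snapshot and the predictions are written back to a MongoDB collection):

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.recommendation.ALS;
import org.apache.spark.mllib.recommendation.MatrixFactorizationModel;
import org.apache.spark.mllib.recommendation.Rating;

public class RecommenderSketch {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
            new SparkConf().setMaster("local").setAppName("recommender-sketch"));

        // Toy ratings: Rating(userId, movieId, rating)
        JavaRDD<Rating> ratings = sc.parallelize(Arrays.asList(
            new Rating(1, 100, 5.0), new Rating(1, 101, 3.0),
            new Rating(2, 100, 4.0), new Rating(2, 102, 1.0)));

        // Train a collaborative-filtering model; rank, iterations, and lambda are illustrative
        MatrixFactorizationModel model = ALS.train(ratings.rdd(), 10, 10, 0.01);

        // Predict how user 2 would rate movie 101; the demo does this for all
        // user-movie pairings and stores the results in a predictions collection
        double predicted = model.predict(2, 101);
        System.out.println("predicted rating: " + predicted);

        sc.stop();
    }
}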
MovieWeb Web Application
• Browse
– Top movies by ratings count
– Top genres by movie count
• Log in to
– See My Ratings
– Rate movies
• What's missing?
– Movies You May Like
– Recommendations
Spark Recommender
• Apache Hadoop 2.3.0
– HDFS
• Spark 1.0
– Execute locally
– Assign executor resources
• Data
– From HDFS
– To MongoDB
MovieWeb Workflow
1. Snapshot database as BSON
2. Store BSON in HDFS
3. Read BSON into Spark app
4. Train model from existing ratings
5. Create user-movie pairings
6. Predict ratings for all pairings
7. Write predictions to MongoDB collection
8. Web application exposes recommendations
9. Repeat the process
Execution
$ bin/spark-submit \
    --master local \
    --class com.mongodb.hadoop.demo.Recommender \
    --jars mongo-java-2.12.3.jar,mongo-hadoop-core-1.3.0.jar \
    --driver-memory 2G \
    --executor-memory 1G \
    demo-1.0.jar [insert job args here]
Questions?
• MongoDB Connector for Hadoop
– http://github.com/mongodb/mongo-hadoop
• Getting Started with MongoDB and Hadoop
– http://docs.mongodb.org/ecosystem/tutorial/getting-started-with-hadoop/
• MongoDB-Spark Demo
– http://github.com/crcsmnky/mongodb-spark-demo
Thank You
Sandeep Parikh
Senior Solutions Architect, MongoDB
@crcsmnky
#mongodb #mongodbdays #hadoop