MongoDB + Pig on Hadoop (MongoSV 2012)
DESCRIPTION
Slides from Mortar co-founder Jeremy Karn's presentation at MongoSV 2012. Learn to process Mongo data with Hadoop, specifically with Apache Pig. The presentation covers the steps needed to read JSON from Mongo into Pig, process it in parallel on Hadoop with sophisticated functions, and write the results back to Mongo. The talk demonstrates these concepts with Mortar, which has contributed to the Mongo Hadoop connector, extending it to work with Pig.
TRANSCRIPT
![Page 1: MongoDB + Pig on Hadoop (MongoSV 2012)](https://reader034.vdocuments.us/reader034/viewer/2022051412/54c671f64a7959f67d8b45df/html5/thumbnails/1.jpg)
MongoDB + Pig
Jeremy Karn - co-founder, Mortar
![Page 2: MongoDB + Pig on Hadoop (MongoSV 2012)](https://reader034.vdocuments.us/reader034/viewer/2022051412/54c671f64a7959f67d8b45df/html5/thumbnails/2.jpg)
Overview: Of This Session
Intro to Hadoop
Intro to Pig
Why MongoDB + Pig?
Demo: load Pig
Demo: processing data with Pig
Demo: store data from Pig to MongoDB
![Page 3: MongoDB + Pig on Hadoop (MongoSV 2012)](https://reader034.vdocuments.us/reader034/viewer/2022051412/54c671f64a7959f67d8b45df/html5/thumbnails/3.jpg)
Hadoop: Rapid Overview
MapReduce programming model from Google (Jeff Dean and Sanjay Ghemawat)
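The MapReduce model named on this slide can be sketched in a few lines of plain Python. This is only an illustration of the map, shuffle, and reduce phases on a single machine, not Hadoop code; the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) and the word-count task are assumptions chosen for the example.

```python
from collections import defaultdict


def map_phase(record):
    """Emit (key, value) pairs for one input record: here, (word, 1)."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values seen for one key into a single result."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle, and reduce in sequence over an iterable of records."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = run_job(["pig reads data", "mongodb stores data"])
print(counts)
```

On a real cluster the map and reduce calls run in parallel on many machines, and the shuffle moves data across the network; the structure of the computation is the same.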
![Page 4: MongoDB + Pig on Hadoop (MongoSV 2012)](https://reader034.vdocuments.us/reader034/viewer/2022051412/54c671f64a7959f67d8b45df/html5/thumbnails/4.jpg)
Hadoop: Rapid Overview
![Page 5: MongoDB + Pig on Hadoop (MongoSV 2012)](https://reader034.vdocuments.us/reader034/viewer/2022051412/54c671f64a7959f67d8b45df/html5/thumbnails/5.jpg)
Hadoop: Rapid Overview
Hadoop implements MapReduce in Java (Doug Cutting)
Incubated at Yahoo
Indexing, spam detection, more
![Page 6: MongoDB + Pig on Hadoop (MongoSV 2012)](https://reader034.vdocuments.us/reader034/viewer/2022051412/54c671f64a7959f67d8b45df/html5/thumbnails/6.jpg)
Hadoop: Strengths
Scalable
Open source
Lots of momentum
Very broadly applicable
![Page 7: MongoDB + Pig on Hadoop (MongoSV 2012)](https://reader034.vdocuments.us/reader034/viewer/2022051412/54c671f64a7959f67d8b45df/html5/thumbnails/7.jpg)
Social Graph
![Page 8: MongoDB + Pig on Hadoop (MongoSV 2012)](https://reader034.vdocuments.us/reader034/viewer/2022051412/54c671f64a7959f67d8b45df/html5/thumbnails/8.jpg)
Predict
![Page 9: MongoDB + Pig on Hadoop (MongoSV 2012)](https://reader034.vdocuments.us/reader034/viewer/2022051412/54c671f64a7959f67d8b45df/html5/thumbnails/9.jpg)
Detect
![Page 10: MongoDB + Pig on Hadoop (MongoSV 2012)](https://reader034.vdocuments.us/reader034/viewer/2022051412/54c671f64a7959f67d8b45df/html5/thumbnails/10.jpg)
Genetics
![Page 11: MongoDB + Pig on Hadoop (MongoSV 2012)](https://reader034.vdocuments.us/reader034/viewer/2022051412/54c671f64a7959f67d8b45df/html5/thumbnails/11.jpg)
Hadoop: Problems
Difficult
Batch only (...or it was)
![Page 12: MongoDB + Pig on Hadoop (MongoSV 2012)](https://reader034.vdocuments.us/reader034/viewer/2022051412/54c671f64a7959f67d8b45df/html5/thumbnails/12.jpg)
Hadoop: Future
YARN
MapReduce optional
Generic management + distributed apps
Impala
![Page 13: MongoDB + Pig on Hadoop (MongoSV 2012)](https://reader034.vdocuments.us/reader034/viewer/2022051412/54c671f64a7959f67d8b45df/html5/thumbnails/13.jpg)
Alternatives to Hadoop: MongoDB Native MapReduce
Write MapReduce in JavaScript
• JavaScript is not fast
• Has limited data types
• Hard to use complex analytic libs
Adds load to data store
![Page 14: MongoDB + Pig on Hadoop (MongoSV 2012)](https://reader034.vdocuments.us/reader034/viewer/2022051412/54c671f64a7959f67d8b45df/html5/thumbnails/14.jpg)
Alternatives to Hadoop: MongoDB Native MapReduce
Hadoop has libs for
• Machine learning
• ETL
Can access any JVM analytic libs
And many organizations already use Hadoop
![Page 15: MongoDB + Pig on Hadoop (MongoSV 2012)](https://reader034.vdocuments.us/reader034/viewer/2022051412/54c671f64a7959f67d8b45df/html5/thumbnails/15.jpg)
Alternatives to Hadoop: MongoDB Aggregation Framework
Great when
• Doing SQL-style aggregation
• Not requiring external data libs
• Users are willing to learn the framework
![Page 16: MongoDB + Pig on Hadoop (MongoSV 2012)](https://reader034.vdocuments.us/reader034/viewer/2022051412/54c671f64a7959f67d8b45df/html5/thumbnails/16.jpg)
Alternatives to Hadoop: MongoDB Aggregation Framework
But you may want Hadoop when
• Doing sophisticated aggregation
• Requiring external data libs
• Users are unwilling to learn the framework
• Needing to transfer workload off the datastore
![Page 17: MongoDB + Pig on Hadoop (MongoSV 2012)](https://reader034.vdocuments.us/reader034/viewer/2022051412/54c671f64a7959f67d8b45df/html5/thumbnails/17.jpg)
Pig: On Hadoop
Less code
Expressive code
![Page 18: MongoDB + Pig on Hadoop (MongoSV 2012)](https://reader034.vdocuments.us/reader034/viewer/2022051412/54c671f64a7959f67d8b45df/html5/thumbnails/18.jpg)
Pig: Brief, Expressive, Like Procedural SQL
(thanks: Twitter Hadoop World presentation)
![Page 19: MongoDB + Pig on Hadoop (MongoSV 2012)](https://reader034.vdocuments.us/reader034/viewer/2022051412/54c671f64a7959f67d8b45df/html5/thumbnails/19.jpg)
The Same Script, in MapReduce: For Serious
![Page 20: MongoDB + Pig on Hadoop (MongoSV 2012)](https://reader034.vdocuments.us/reader034/viewer/2022051412/54c671f64a7959f67d8b45df/html5/thumbnails/20.jpg)
Pig: On Hadoop
Less code
Expressive code
Compiles to MR
Insulates from API
Popular (LinkedIn, Twitter, Salesforce, Yahoo, Stanford)
![Page 21: MongoDB + Pig on Hadoop (MongoSV 2012)](https://reader034.vdocuments.us/reader034/viewer/2022051412/54c671f64a7959f67d8b45df/html5/thumbnails/21.jpg)
MongoDB + Pig: Motivations
Data storage and data processing are often separate concerns
Hadoop is built for scalable processing of large datasets
![Page 22: MongoDB + Pig on Hadoop (MongoSV 2012)](https://reader034.vdocuments.us/reader034/viewer/2022051412/54c671f64a7959f67d8b45df/html5/thumbnails/22.jpg)
MongoDB, Pig: Similar Stance
Poly-structured data
• MongoDB: stores data, regardless of structure
• Pig: reads data, regardless of structure
(got its name because pigs are omnivorous)
![Page 23: MongoDB + Pig on Hadoop (MongoSV 2012)](https://reader034.vdocuments.us/reader034/viewer/2022051412/54c671f64a7959f67d8b45df/html5/thumbnails/23.jpg)
MongoDB, Pig: JSON-Pig Data Type Mapping

| JSON | Pig |
|---|---|
| string | chararray |
| integer | int |
| boolean | boolean |
| double | double |
| array | bag |
| object | map/tuple |
| null | null |
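The JSON-to-Pig mapping on this slide can be expressed as a small lookup function. The sketch below is plain Python, not connector code; the name `pig_type` and the sample document are assumptions made for illustration, and a JSON object is reported as `map` even though the slide lists both `map` and `tuple` as possible targets.

```python
import json


def pig_type(value):
    """Return the Pig type name the slide's table gives for a parsed JSON value."""
    if value is None:
        return "null"
    if isinstance(value, bool):  # checked before int: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "int"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "chararray"
    if isinstance(value, list):
        return "bag"
    if isinstance(value, dict):
        return "map"  # assumption: the slide allows map or tuple here
    raise TypeError(f"not a JSON value: {value!r}")


# Hypothetical document, used only to exercise each row of the table.
document = json.loads(
    '{"name": "a", "n": 3, "ok": true, "score": 1.5,'
    ' "tags": ["x"], "headers": {"k": "v"}, "gone": null}'
)
schema = {field: pig_type(value) for field, value in document.items()}
print(schema)
```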
![Page 24: MongoDB + Pig on Hadoop (MongoSV 2012)](https://reader034.vdocuments.us/reader034/viewer/2022051412/54c671f64a7959f67d8b45df/html5/thumbnails/24.jpg)
MongoDB, Pig: MongoDB-Pig Data Type Mapping

| MongoDB | Pig |
|---|---|
| date | datetime |
| object id | chararray |
| binary data | bytearray |
| regexp | chararray |
| code | chararray |
![Page 25: MongoDB + Pig on Hadoop (MongoSV 2012)](https://reader034.vdocuments.us/reader034/viewer/2022051412/54c671f64a7959f67d8b45df/html5/thumbnails/25.jpg)
Mortar: Fast Intro
Open-source code-based dev framework for data, built on Hadoop and Pig
Inspired by Rails
Self-contained, organized, executable projects
![Page 26: MongoDB + Pig on Hadoop (MongoSV 2012)](https://reader034.vdocuments.us/reader034/viewer/2022051412/54c671f64a7959f67d8b45df/html5/thumbnails/26.jpg)
> gem install mortar
![Page 27: MongoDB + Pig on Hadoop (MongoSV 2012)](https://reader034.vdocuments.us/reader034/viewer/2022051412/54c671f64a7959f67d8b45df/html5/thumbnails/27.jpg)
> mortar new my_project
![Page 28: MongoDB + Pig on Hadoop (MongoSV 2012)](https://reader034.vdocuments.us/reader034/viewer/2022051412/54c671f64a7959f67d8b45df/html5/thumbnails/28.jpg)
![Page 29: MongoDB + Pig on Hadoop (MongoSV 2012)](https://reader034.vdocuments.us/reader034/viewer/2022051412/54c671f64a7959f67d8b45df/html5/thumbnails/29.jpg)
Mortar: Fast Intro
Our service hosts and executes mortar projects
![Page 30: MongoDB + Pig on Hadoop (MongoSV 2012)](https://reader034.vdocuments.us/reader034/viewer/2022051412/54c671f64a7959f67d8b45df/html5/thumbnails/30.jpg)
> mortar jobs:run your_pigscript --clustersize 5
![Page 31: MongoDB + Pig on Hadoop (MongoSV 2012)](https://reader034.vdocuments.us/reader034/viewer/2022051412/54c671f64a7959f67d8b45df/html5/thumbnails/31.jpg)
![Page 32: MongoDB + Pig on Hadoop (MongoSV 2012)](https://reader034.vdocuments.us/reader034/viewer/2022051412/54c671f64a7959f67d8b45df/html5/thumbnails/32.jpg)
Mortar: Fast Intro
Browser-only interface, great for demonstrating Hadoop
![Page 33: MongoDB + Pig on Hadoop (MongoSV 2012)](https://reader034.vdocuments.us/reader034/viewer/2022051412/54c671f64a7959f67d8b45df/html5/thumbnails/33.jpg)
MongoDB, Pig: Loading Data
One requirement:
• Must specify top-level fields to load from the MongoDB collection.
Optional:
• Specify a subset of embedded fields
• Data type for any/all fields
![Page 34: MongoDB + Pig on Hadoop (MongoSV 2012)](https://reader034.vdocuments.us/reader034/viewer/2022051412/54c671f64a7959f67d8b45df/html5/thumbnails/34.jpg)
MongoDB, Pig: Loading Data - Enron Data

    {
      "body": "the ... person...",
      "subFolder": "notes_inbox",
      "mailbox": "bass-e",
      "filename": "450.",
      "headers": {
        "From": "[email protected]",
        "To": "[email protected]",
        "Subject": "Subject",
        "Date": "Mon, 14 May 2001 16:39:00 -0700 (PDT)"
      }
    }
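The loading rules (name the top-level fields, optionally pick embedded fields and declare types) can be pictured as a projection over one document. The sketch below is plain Python and does not use the connector's loader; the function `project` and the schema layout (a type to cast with, `None` for "no declared type", or a nested dict for embedded fields) are assumptions for illustration. The sample omits the address headers because they are elided in the source.

```python
def project(document, schema):
    """Keep only the fields named in schema, recursing into embedded documents."""
    row = {}
    for name, spec in schema.items():
        value = document.get(name)  # a missing field loads as None
        if isinstance(spec, dict) and isinstance(value, dict):
            row[name] = project(value, spec)  # subset of embedded fields
        elif callable(spec) and value is not None:
            row[name] = spec(value)  # declared data type: cast the value
        else:
            row[name] = value  # no declared type: keep as stored
    return row


enron_doc = {
    "body": "the ... person...",
    "subFolder": "notes_inbox",
    "mailbox": "bass-e",
    "filename": "450.",
    "headers": {
        "Subject": "Subject",
        "Date": "Mon, 14 May 2001 16:39:00 -0700 (PDT)",
    },
}

# Load two top-level fields plus one embedded field; "body" is left behind.
schema = {"mailbox": str, "subFolder": None, "headers": {"Date": str}}
row = project(enron_doc, schema)
print(row)
```

Naming the fields up front is what lets a loader hand Pig a fixed set of columns even though documents in the collection may differ in shape.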
![Page 35: MongoDB + Pig on Hadoop (MongoSV 2012)](https://reader034.vdocuments.us/reader034/viewer/2022051412/54c671f64a7959f67d8b45df/html5/thumbnails/35.jpg)
MongoDB, Pig: Script Demo
![Page 36: MongoDB + Pig on Hadoop (MongoSV 2012)](https://reader034.vdocuments.us/reader034/viewer/2022051412/54c671f64a7959f67d8b45df/html5/thumbnails/36.jpg)
MongoDB, Pig: Store Statement
The MongoStorage function takes an optional list of arguments of two types:
• A single set of keys to base updating on. This has three options: None, update, or multi.
• Multiple indexes to ensure, in the same format as db.col.ensureIndex().
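One way to read the three update options is sketched below against an in-memory list standing in for a collection. This is plain Python, not the MongoStorage implementation: the function `store`, its parameters, and in particular the choice to insert a row when no existing document matches the keys are assumptions, and index creation is not modelled.

```python
def store(collection, rows, mode=None, keys=()):
    """Write rows into collection (a list of dicts) under one of three modes.

    mode=None     : insert every row as a new document
    mode="update" : update the first document matching the keys
    mode="multi"  : update every document matching the keys
    """
    if mode not in (None, "update", "multi"):
        raise ValueError(f"unknown mode: {mode!r}")
    for row in rows:
        if mode is None:
            collection.append(dict(row))
            continue
        matches = [
            doc for doc in collection
            if all(doc.get(key) == row.get(key) for key in keys)
        ]
        if not matches:
            collection.append(dict(row))  # assumption: insert when nothing matches
        elif mode == "update":
            matches[0].update(row)
        else:  # mode == "multi"
            for doc in matches:
                doc.update(row)


inbox = [{"mailbox": "bass-e", "count": 1}, {"mailbox": "bass-e", "count": 2}]
store(inbox, [{"mailbox": "bass-e", "count": 9}], mode="update", keys=["mailbox"])
print(inbox)  # only the first matching document has changed
```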
![Page 37: MongoDB + Pig on Hadoop (MongoSV 2012)](https://reader034.vdocuments.us/reader034/viewer/2022051412/54c671f64a7959f67d8b45df/html5/thumbnails/37.jpg)
Pig: Illustrate
Auto-select dataset
Exercise every execution path
Step-by-step execution
![Page 38: MongoDB + Pig on Hadoop (MongoSV 2012)](https://reader034.vdocuments.us/reader034/viewer/2022051412/54c671f64a7959f67d8b45df/html5/thumbnails/38.jpg)
Pig: Why Illustrate
Write correct code quickly
Understand others’ code
Test every execution path, every step
![Page 39: MongoDB + Pig on Hadoop (MongoSV 2012)](https://reader034.vdocuments.us/reader034/viewer/2022051412/54c671f64a7959f67d8b45df/html5/thumbnails/39.jpg)
Pig: User-Defined Functions (UDF)
Pig is like procedural SQL
UDFs for rich data manipulation
UDFs: written in Java-based (JVM) languages
We made Pig work with CPython (NumPy, etc.)
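A Python UDF is typically a small function that takes one field value and returns another, annotated with the schema of its result. The sketch below shows that shape with a local stand-in decorator so it runs without Pig installed; the stand-in `outputSchema`, the attribute it sets, and the `email_domain` function are assumptions for illustration, not the runtime's actual implementation.

```python
def outputSchema(schema):
    """Local stand-in for the schema decorator a Pig Python UDF would import.

    It only records the schema string on the function (hypothetical attribute).
    """
    def decorate(func):
        func.output_schema = schema
        return func
    return decorate


@outputSchema("domain:chararray")
def email_domain(address):
    """Return the lower-cased domain of an email address, or None if absent."""
    if address is None or "@" not in address:
        return None  # Pig UDFs commonly pass nulls through
    return address.rsplit("@", 1)[1].lower()


print(email_domain("Someone@Example.COM"))
```

In a Pig script such a function would be registered and then applied per row inside a FOREACH, which is where libraries available to CPython become usable on Hadoop data.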
![Page 40: MongoDB + Pig on Hadoop (MongoSV 2012)](https://reader034.vdocuments.us/reader034/viewer/2022051412/54c671f64a7959f67d8b45df/html5/thumbnails/40.jpg)
MongoDB + Pig: Without Mortar
Get the mongo-hadoop connector: http://github.com/mongodb/mongo-hadoop
![Page 41: MongoDB + Pig on Hadoop (MongoSV 2012)](https://reader034.vdocuments.us/reader034/viewer/2022051412/54c671f64a7959f67d8b45df/html5/thumbnails/41.jpg)
MongoDB + Pig: Summary
Hadoop and friends are maturing
MongoDB and Pig are philosophically aligned
Reading and writing to Pig is straightforward
Once in Pig (Hadoop)
• massive batch calcs / analytics possible
• work is offloaded
• external libraries available