MongoDB + Pig on Hadoop (MongoSV 2012)
![Page 1: MongoDB + Pig on Hadoop (MongoSV 2012)](https://reader034.vdocuments.us/reader034/viewer/2022051412/54c671f64a7959f67d8b45df/html5/thumbnails/1.jpg)
MongoDB + Pig
Jeremy Karn - co-founder, Mortar
![Page 2: MongoDB + Pig on Hadoop (MongoSV 2012)](https://reader034.vdocuments.us/reader034/viewer/2022051412/54c671f64a7959f67d8b45df/html5/thumbnails/2.jpg)
Overview of this session
Intro to Hadoop
Intro to Pig
Why MongoDB + Pig?
Demo: load Pig
Demo: processing data with Pig
Demo: store data from Pig to MongoDB
![Page 3: MongoDB + Pig on Hadoop (MongoSV 2012)](https://reader034.vdocuments.us/reader034/viewer/2022051412/54c671f64a7959f67d8b45df/html5/thumbnails/3.jpg)
Hadoop: rapid overview
MapReduce programming model from Google (Jeff Dean and Sanjay Ghemawat)
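
To make the programming model concrete, here is a minimal single-process sketch of a word count in the MapReduce style. It is not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for illustration, and a real cluster would run the map and reduce steps in parallel across machines.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values seen for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["pig eats data", "Pig stores data"]))
```

The user only writes the map and reduce functions; grouping by key is the part the framework takes care of.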
![Page 4: MongoDB + Pig on Hadoop (MongoSV 2012)](https://reader034.vdocuments.us/reader034/viewer/2022051412/54c671f64a7959f67d8b45df/html5/thumbnails/4.jpg)
Hadoop: rapid overview
![Page 5: MongoDB + Pig on Hadoop (MongoSV 2012)](https://reader034.vdocuments.us/reader034/viewer/2022051412/54c671f64a7959f67d8b45df/html5/thumbnails/5.jpg)
Hadoop: rapid overview
Hadoop implements MapReduce (Java) (Doug Cutting)
Incubated at Yahoo
Indexing, spam detection, more
![Page 6: MongoDB + Pig on Hadoop (MongoSV 2012)](https://reader034.vdocuments.us/reader034/viewer/2022051412/54c671f64a7959f67d8b45df/html5/thumbnails/6.jpg)
Hadoop: strengths
Scalable
Open source
Lots of momentum
Very broadly applicable
![Page 7: MongoDB + Pig on Hadoop (MongoSV 2012)](https://reader034.vdocuments.us/reader034/viewer/2022051412/54c671f64a7959f67d8b45df/html5/thumbnails/7.jpg)
Social Graph
![Page 8: MongoDB + Pig on Hadoop (MongoSV 2012)](https://reader034.vdocuments.us/reader034/viewer/2022051412/54c671f64a7959f67d8b45df/html5/thumbnails/8.jpg)
Predict
![Page 9: MongoDB + Pig on Hadoop (MongoSV 2012)](https://reader034.vdocuments.us/reader034/viewer/2022051412/54c671f64a7959f67d8b45df/html5/thumbnails/9.jpg)
Detect
![Page 10: MongoDB + Pig on Hadoop (MongoSV 2012)](https://reader034.vdocuments.us/reader034/viewer/2022051412/54c671f64a7959f67d8b45df/html5/thumbnails/10.jpg)
Genetics
![Page 11: MongoDB + Pig on Hadoop (MongoSV 2012)](https://reader034.vdocuments.us/reader034/viewer/2022051412/54c671f64a7959f67d8b45df/html5/thumbnails/11.jpg)
Hadoop: problems
Difficult
Batch only (...or it was)
![Page 12: MongoDB + Pig on Hadoop (MongoSV 2012)](https://reader034.vdocuments.us/reader034/viewer/2022051412/54c671f64a7959f67d8b45df/html5/thumbnails/12.jpg)
Hadoop: future
YARN
MapReduce optional
Generic management + distributed apps
Impala
![Page 13: MongoDB + Pig on Hadoop (MongoSV 2012)](https://reader034.vdocuments.us/reader034/viewer/2022051412/54c671f64a7959f67d8b45df/html5/thumbnails/13.jpg)
Alternatives to Hadoop: MongoDB native MapReduce
Write MapReduce in JavaScript
• JavaScript is not fast
• Has limited data types
• Hard to use complex analytic libs
Adds load to data store
![Page 14: MongoDB + Pig on Hadoop (MongoSV 2012)](https://reader034.vdocuments.us/reader034/viewer/2022051412/54c671f64a7959f67d8b45df/html5/thumbnails/14.jpg)
Alternatives to Hadoop: MongoDB native MapReduce
Hadoop has libs for
• Machine learning
• ETL
• Can access any JVM analytic libs
And many organizations already use Hadoop
![Page 15: MongoDB + Pig on Hadoop (MongoSV 2012)](https://reader034.vdocuments.us/reader034/viewer/2022051412/54c671f64a7959f67d8b45df/html5/thumbnails/15.jpg)
Alternatives to Hadoop: MongoDB aggregation framework
Great when
• Doing SQL-style aggregation
• Do not require external data libs
• Users will learn framework
![Page 16: MongoDB + Pig on Hadoop (MongoSV 2012)](https://reader034.vdocuments.us/reader034/viewer/2022051412/54c671f64a7959f67d8b45df/html5/thumbnails/16.jpg)
Alternatives to Hadoop: MongoDB aggregation framework
But you may want Hadoop when
• Doing sophisticated aggregation
• Require external data libs
• Users unwilling to learn framework
• Need to transfer workload off datastore
![Page 17: MongoDB + Pig on Hadoop (MongoSV 2012)](https://reader034.vdocuments.us/reader034/viewer/2022051412/54c671f64a7959f67d8b45df/html5/thumbnails/17.jpg)
Pig on Hadoop
Less code
Expressive code
![Page 18: MongoDB + Pig on Hadoop (MongoSV 2012)](https://reader034.vdocuments.us/reader034/viewer/2022051412/54c671f64a7959f67d8b45df/html5/thumbnails/18.jpg)
Pig: brief, expressive, like procedural SQL
(thanks: Twitter Hadoop World presentation)
![Page 19: MongoDB + Pig on Hadoop (MongoSV 2012)](https://reader034.vdocuments.us/reader034/viewer/2022051412/54c671f64a7959f67d8b45df/html5/thumbnails/19.jpg)
The Same Script, In MapReduce: for serious
![Page 20: MongoDB + Pig on Hadoop (MongoSV 2012)](https://reader034.vdocuments.us/reader034/viewer/2022051412/54c671f64a7959f67d8b45df/html5/thumbnails/20.jpg)
Pig on Hadoop
Less code
Expressive code
Compiles to MR
Insulates from API
Popular (LinkedIn, Twitter, Salesforce, Yahoo, Stanford)
![Page 21: MongoDB + Pig on Hadoop (MongoSV 2012)](https://reader034.vdocuments.us/reader034/viewer/2022051412/54c671f64a7959f67d8b45df/html5/thumbnails/21.jpg)
MongoDB + Pig: motivations
Data storage and data processing are often separate concerns
Hadoop is built for scalable processing of large datasets
![Page 22: MongoDB + Pig on Hadoop (MongoSV 2012)](https://reader034.vdocuments.us/reader034/viewer/2022051412/54c671f64a7959f67d8b45df/html5/thumbnails/22.jpg)
MongoDB, Pig: similar stance
Poly-structured data
• MongoDB: stores data, regardless of structure
• Pig: reads data, regardless of structure (got its name because pigs are omnivorous)
![Page 23: MongoDB + Pig on Hadoop (MongoSV 2012)](https://reader034.vdocuments.us/reader034/viewer/2022051412/54c671f64a7959f67d8b45df/html5/thumbnails/23.jpg)
MongoDB, Pig: JSON-Pig data type mapping

| JSON | Pig |
| --- | --- |
| string | chararray |
| integer | int |
| boolean | boolean |
| double | double |
| array | bag |
| object | map/tuple |
| null | null |
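
The JSON-to-Pig mapping can be written down as a small lookup. The sketch below is only an illustration of the table, not the connector's actual conversion code: `pig_type_for` is a hypothetical helper, and it reports `map` for every JSON object even though the table allows either a map or a tuple.

```python
import json


def pig_type_for(value):
    """Return the Pig type name the mapping table assigns to a decoded JSON value."""
    if value is None:
        return "null"
    # bool must be tested before int, because bool is a subclass of int in Python.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "int"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "chararray"
    if isinstance(value, list):
        return "bag"
    if isinstance(value, dict):
        # Assumption: objects are reported as maps; the table also allows tuples.
        return "map"
    raise TypeError(f"no Pig type listed for {type(value).__name__}")


if __name__ == "__main__":
    document = json.loads('{"name": "pig", "legs": 4, "omnivore": true, "tags": ["farm"]}')
    for field, value in document.items():
        print(field, "->", pig_type_for(value))
```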
![Page 24: MongoDB + Pig on Hadoop (MongoSV 2012)](https://reader034.vdocuments.us/reader034/viewer/2022051412/54c671f64a7959f67d8b45df/html5/thumbnails/24.jpg)
MongoDB, Pig: MongoDB-Pig data type mapping

| MongoDB | Pig |
| --- | --- |
| date | datetime |
| object id | chararray |
| binary data | bytearray |
| regexp | chararray |
| code | chararray |
![Page 25: MongoDB + Pig on Hadoop (MongoSV 2012)](https://reader034.vdocuments.us/reader034/viewer/2022051412/54c671f64a7959f67d8b45df/html5/thumbnails/25.jpg)
Mortar: fast intro
Open-source code-based dev framework for data, built on Hadoop and Pig
Inspired by Rails
Self-contained, organized, executable projects
![Page 26: MongoDB + Pig on Hadoop (MongoSV 2012)](https://reader034.vdocuments.us/reader034/viewer/2022051412/54c671f64a7959f67d8b45df/html5/thumbnails/26.jpg)
> gem install mortar
![Page 27: MongoDB + Pig on Hadoop (MongoSV 2012)](https://reader034.vdocuments.us/reader034/viewer/2022051412/54c671f64a7959f67d8b45df/html5/thumbnails/27.jpg)
> mortar new my_project
![Page 28: MongoDB + Pig on Hadoop (MongoSV 2012)](https://reader034.vdocuments.us/reader034/viewer/2022051412/54c671f64a7959f67d8b45df/html5/thumbnails/28.jpg)
![Page 29: MongoDB + Pig on Hadoop (MongoSV 2012)](https://reader034.vdocuments.us/reader034/viewer/2022051412/54c671f64a7959f67d8b45df/html5/thumbnails/29.jpg)
Mortar: fast intro
Our service hosts and executes mortar projects
![Page 30: MongoDB + Pig on Hadoop (MongoSV 2012)](https://reader034.vdocuments.us/reader034/viewer/2022051412/54c671f64a7959f67d8b45df/html5/thumbnails/30.jpg)
> mortar jobs:run your_pigscript --clustersize 5
![Page 31: MongoDB + Pig on Hadoop (MongoSV 2012)](https://reader034.vdocuments.us/reader034/viewer/2022051412/54c671f64a7959f67d8b45df/html5/thumbnails/31.jpg)
![Page 32: MongoDB + Pig on Hadoop (MongoSV 2012)](https://reader034.vdocuments.us/reader034/viewer/2022051412/54c671f64a7959f67d8b45df/html5/thumbnails/32.jpg)
Mortar: fast intro
Browser-only interface, great for demonstrating Hadoop
![Page 33: MongoDB + Pig on Hadoop (MongoSV 2012)](https://reader034.vdocuments.us/reader034/viewer/2022051412/54c671f64a7959f67d8b45df/html5/thumbnails/33.jpg)
MongoDB, Pig: loading data
One requirement:
• Must specify top level fields to load from the MongoDB collection.
Optional:
• Specify a subset of embedded fields
• Data type for any/all fields
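
The loading rules amount to a projection: name the top-level fields you want, and optionally narrow an embedded document to some of its keys. The sketch below models that in plain Python. It is not the connector's API: `project` and its parameters are hypothetical, the sample values are invented, and the optional per-field data types are left out.

```python
def project(document, top_level_fields, embedded_fields=None):
    """Keep only the named top-level fields, optionally narrowing embedded documents.

    embedded_fields maps a top-level field name to the embedded keys to keep.
    Fields missing from the document come back as None.
    """
    embedded_fields = embedded_fields or {}
    row = {}
    for field in top_level_fields:
        value = document.get(field)
        wanted = embedded_fields.get(field)
        if wanted is not None and isinstance(value, dict):
            value = {key: value.get(key) for key in wanted}
        row[field] = value
    return row


if __name__ == "__main__":
    # Hypothetical document; the values are invented for illustration.
    message = {
        "body": "hello",
        "mailbox": "inbox-a",
        "headers": {"From": "alice@example.com", "To": "bob@example.com", "Subject": "hi"},
    }
    print(project(message, ["mailbox", "headers"], {"headers": ["From", "Subject"]}))
```

Fields that are not listed (here `body`) never reach the script, which keeps the loaded relation narrow.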
![Page 34: MongoDB + Pig on Hadoop (MongoSV 2012)](https://reader034.vdocuments.us/reader034/viewer/2022051412/54c671f64a7959f67d8b45df/html5/thumbnails/34.jpg)
MongoDB, Pig: loading data - Enron data

    {
      "body": "the ... person...",
      "subFolder": "notes_inbox",
      "mailbox": "bass-e",
      "filename": "450.",
      "headers": {
        "From": "[email protected]",
        "To": "[email protected]",
        "Subject": "Subject",
        "Date": "Mon, 14 May 2001 16:39:00 -0700 (PDT)"
      }
    }
![Page 35: MongoDB + Pig on Hadoop (MongoSV 2012)](https://reader034.vdocuments.us/reader034/viewer/2022051412/54c671f64a7959f67d8b45df/html5/thumbnails/35.jpg)
MongoDB, Pig: script demo
![Page 36: MongoDB + Pig on Hadoop (MongoSV 2012)](https://reader034.vdocuments.us/reader034/viewer/2022051412/54c671f64a7959f67d8b45df/html5/thumbnails/36.jpg)
MongoDB, Pig: STORE statement
The MongoStorage function takes an optional list of arguments of two types:
• A single set of keys to base updating on. This has three options: None, update, or multi.
• Multiple indexes to ensure, in the same format as db.col.ensureIndex().
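
One way to picture the three update options is with an in-memory list standing in for the collection. This is a sketch under stated assumptions, not MongoStorage itself: `store` is a hypothetical function, and it assumes that records with no match are inserted, that `update` changes only the first matching document, and that `multi` changes every match. Index creation is not modeled.

```python
def store(collection, records, mode=None, keys=()):
    """Write records into a list of dicts, mimicking the None / update / multi options.

    mode=None     always appends the record.
    mode="update" merges the record into the first document matching on keys.
    mode="multi"  merges the record into every document matching on keys.
    Assumption: when nothing matches, the record is inserted.
    """
    if mode not in (None, "update", "multi"):
        raise ValueError(f"unknown mode: {mode!r}")
    for record in records:
        if mode is None:
            collection.append(dict(record))
            continue
        matches = [doc for doc in collection
                   if all(doc.get(key) == record.get(key) for key in keys)]
        if not matches:
            collection.append(dict(record))
        elif mode == "update":
            matches[0].update(record)
        else:
            for doc in matches:
                doc.update(record)
    return collection


if __name__ == "__main__":
    counts = [{"mailbox": "bass-e", "count": 1}, {"mailbox": "bass-e", "count": 2}]
    store(counts, [{"mailbox": "bass-e", "count": 9}], mode="update", keys=["mailbox"])
    print(counts)
```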
![Page 37: MongoDB + Pig on Hadoop (MongoSV 2012)](https://reader034.vdocuments.us/reader034/viewer/2022051412/54c671f64a7959f67d8b45df/html5/thumbnails/37.jpg)
Pig: ILLUSTRATE
Auto-select dataset
Exercise every execution path
Step-by-step execution
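
The idea behind these three points can be sketched for a tiny filter-then-transform pipeline. This is a loose illustration, not how Pig implements ILLUSTRATE: the hypothetical `illustrate` function picks one record that passes the filter and one that fails it, so both paths are exercised, and then reports the data after each step.

```python
def illustrate(records, predicate, transform):
    """Pick a small sample that hits both filter outcomes and show every step's output."""
    passing = next((r for r in records if predicate(r)), None)
    failing = next((r for r in records if not predicate(r)), None)
    sample = [r for r in (passing, failing) if r is not None]
    after_filter = [r for r in sample if predicate(r)]
    after_transform = [transform(r) for r in after_filter]
    return {
        "sample": sample,
        "after_filter": after_filter,
        "after_transform": after_transform,
    }


if __name__ == "__main__":
    steps = illustrate([1, 2, 3, 4], lambda n: n % 2 == 0, lambda n: n * 10)
    for name, rows in steps.items():
        print(name, rows)
```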
![Page 38: MongoDB + Pig on Hadoop (MongoSV 2012)](https://reader034.vdocuments.us/reader034/viewer/2022051412/54c671f64a7959f67d8b45df/html5/thumbnails/38.jpg)
Pig: why ILLUSTRATE
Write correct code quickly
Understand others’ code
Test every execution path, every step
![Page 39: MongoDB + Pig on Hadoop (MongoSV 2012)](https://reader034.vdocuments.us/reader034/viewer/2022051412/54c671f64a7959f67d8b45df/html5/thumbnails/39.jpg)
Pig: user-defined functions (UDF)
Pig is like procedural SQL
UDFs for rich data manipulation
UDFs: Java-based language
We made Pig work with CPython (NumPy, etc)
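
A UDF is just a function applied to field values, plus a declaration of what it returns. The sketch below shows the shape of a Python UDF that extracts the domain from an email address. The `output_schema` decorator is a local stand-in defined here so the example runs on its own; it is not the decorator Pig or Mortar provides, and `email_domain` is a hypothetical function.

```python
def output_schema(schema):
    """Stand-in decorator that records the Pig schema of a UDF's return value."""
    def decorate(func):
        func.output_schema = schema
        return func
    return decorate


@output_schema("domain:chararray")
def email_domain(address):
    """Return the lower-cased domain of an email address, or None if there is none."""
    if address is None or "@" not in address:
        return None
    return address.rsplit("@", 1)[1].lower()


if __name__ == "__main__":
    print(email_domain("Alice@Example.COM"))
    print(email_domain.output_schema)
```

Returning None for missing or malformed input mirrors how null values flow through a Pig script instead of failing the whole job.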
![Page 40: MongoDB + Pig on Hadoop (MongoSV 2012)](https://reader034.vdocuments.us/reader034/viewer/2022051412/54c671f64a7959f67d8b45df/html5/thumbnails/40.jpg)
MongoDB + Pig: without Mortar
Get the mongo-hadoop connector: http://github.com/mongodb/mongo-hadoop
![Page 41: MongoDB + Pig on Hadoop (MongoSV 2012)](https://reader034.vdocuments.us/reader034/viewer/2022051412/54c671f64a7959f67d8b45df/html5/thumbnails/41.jpg)
MongoDB + Pig: summary
Hadoop and friends are maturing
MongoDB and Pig are philosophically aligned
Reading and writing to Pig is straightforward
Once in Pig (Hadoop)
• massive batch calcs / analytics possible
• work is offloaded
• external libraries available