MongoDB, Hadoop & Humongous Data
Post on 12-Sep-2014
MongoDB Hadoop
& humongous data
Tuesday, December 11, 12
Talking about:
What is Humongous Data
Humongous Data & You
MongoDB & Data Processing
Future of Humongous Data
What is humongous data?
2000
Google Inc. today announced it has released the largest search engine on the Internet. Google’s new index, comprising more than 1 billion URLs
2008
Our indexing system for processing links indicates that we now count 1 trillion unique URLs (and the number of individual web pages out there is growing by several billion pages per day).
An unprecedented amount of data is being created and is accessible
Data Growth (Millions of URLs)
2000: 1 · 2001: 4 · 2002: 10 · 2003: 24 · 2004: 55 · 2005: 120 · 2006: 250 · 2007: 500 · 2008: 1,000
Truly Exponential Growth
It is hard for people to grasp.
As a BBC reporter recently put it: "Your current PC is more powerful than the computer they had on board the first flight to the moon."
Moore’s Law
Applies to more than just CPUs
Boiled down: things double at regular intervals
It’s exponential growth... and it applies to big data
How BIG is it?
2008
2007 2006 2005 2004 2003 2002 2001
Why all this talk about BIG Data now?
In the past few years, open source software has emerged that enables ‘us’ to handle BIG Data
The Big Data Story
Is actually two stories
Doers & Tellers talking about different things
http://www.slideshare.net/siliconangle/trendconnect-big-data-report-september
Tellers
Doers
Doers talk a lot more about actual solutions
They know it’s a two-sided story:
Processing
Storage
Takeaways
MongoDB and Hadoop
MongoDB for storage & operations
Hadoop for processing & analytics
MongoDB & Data Processing
Applications have complex needs
MongoDB ideal operational database
MongoDB ideal for BIG data
Not a data processing engine, but provides processing functionality
Many options for Processing Data
• Process in MongoDB using Map Reduce
• Process in MongoDB using the Aggregation Framework
• Process outside MongoDB (using Hadoop)
MongoDB Map Reduce

Data → Map() emits (k,v) → Sort(k) → Group(k) → Reduce(k,values) → (k,v) → Finalize(k,v) → (k,v) → MongoDB

map iterates on documents (the document is $this)
1 at a time per shard
Input matches output
Can run multiple times
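The flow above can be sketched in plain Python. This is a conceptual model only (the sample documents and function names are hypothetical; real MongoDB map reduce runs JavaScript inside the server):

```python
from itertools import groupby

# Hypothetical documents, shaped like the tweets used later in the demo.
docs = [
    {'entities': {'hashtags': [{'text': 'mongodb'}, {'text': 'hadoop'}]}},
    {'entities': {'hashtags': [{'text': 'mongodb'}]}},
]

def map_fn(doc):
    # emit(k, v) once per hashtag, like the JavaScript map's emit()
    for tag in doc['entities']['hashtags']:
        yield tag['text'], 1

def reduce_fn(key, values):
    # Reduce(k, values) -> (k, v); input shape matches output shape
    return key, sum(values)

# Sort(k) then Group(k), as in the diagram
pairs = sorted(kv for doc in docs for kv in map_fn(doc))
results = [reduce_fn(k, [v for _, v in grp])
           for k, grp in groupby(pairs, key=lambda kv: kv[0])]

print(results)  # [('hadoop', 1), ('mongodb', 2)]
```

Because input and output shapes match, the reduce step can safely run multiple times over partial results, which is what makes it shardable.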
MongoDB Map Reduce
MongoDB map reduce is quite capable... but with limits:
- JavaScript is not the best language for processing map reduce
- JavaScript has limited external data processing libraries
- Adds load to the data store
MongoDB Aggregation
Most uses of MongoDB Map Reduce were for aggregation
The Aggregation Framework is optimized for aggregate queries
Real-time aggregation, similar to SQL GROUP BY
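As a rough sketch of what a $group stage with $sum: 1 does, here is a pure-Python analogy (hypothetical sample documents; the real pipeline runs server-side inside MongoDB):

```python
from collections import Counter

# Hypothetical tweet-like documents
docs = [
    {'entities': {'hashtags': [{'text': 'Arsenal'}, {'text': 'RT'}]}},
    {'entities': {'hashtags': [{'text': 'Arsenal'}]}},
]

# $unwind: one record per hashtag; $group with {$sum: 1}: count per tag
counts = Counter(tag['text']
                 for doc in docs
                 for tag in doc['entities']['hashtags'])

# $sort: {count: -1}
top = counts.most_common()
print(top)  # [('Arsenal', 2), ('RT', 1)]
```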
MongoDB & Hadoop

MongoDB (single server or sharded cluster)
↓ InputFormat creates a list of Input Splits (same as Mongo's shard chunks, 64MB)
↓ RecordReader, one per split (runs on the same thread as map)
↓ Map(k1, v1, ctx) → ctx.write(k2, v2) (many map operations, 1 at a time per input split)
↓ Combiner(k2, values2) → k2, v3
↓ Partitioner(k2)
↓ Sort(k2)
↓ Reduce(k2, values3) on reducer threads (runs once per key) → kf, vf
↓ OutputFormat
MongoDB
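The split → map → combine → partition → sort → reduce flow can be imitated in miniature. A conceptual pure-Python sketch (hypothetical data; Hadoop actually distributes these stages across processes and machines):

```python
from itertools import groupby

# Two hypothetical input splits (e.g. chunks of a MongoDB collection)
splits = [["a", "b", "a"], ["b", "b", "c"]]

def map_fn(word):
    # Map(k1, v1, ctx): ctx.write(k2, v2)
    return word, 1

def combine(pairs):
    # Combiner: pre-reduce within a single split to cut shuffle traffic
    pairs = sorted(pairs)
    return [(k, sum(v for _, v in g))
            for k, g in groupby(pairs, lambda p: p[0])]

def partition(key, n):
    # Partitioner(k2): route each key to one reducer
    return hash(key) % n

# Map + combine per split, then route pairs to 2 reducer "threads"
n_reducers = 2
buckets = [[] for _ in range(n_reducers)]
for split in splits:
    for k, v in combine(map_fn(w) for w in split):
        buckets[partition(k, n_reducers)].append((k, v))

# Sort(k2), then Reduce(k2, values3) once per key
results = {}
for bucket in buckets:
    for k, grp in groupby(sorted(bucket), lambda p: p[0]):
        results[k] = sum(v for _, v in grp)

# results == {'a': 2, 'b': 3, 'c': 1}
```

Since all pairs for a given key hash to the same partition, each key is reduced exactly once, no matter how the splits were divided.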
DEMO TIME
DEMO
Install the Hadoop MongoDB plugin
Import tweets from Twitter
Write a mapper in Python using Hadoop streaming
Write a reducer in Python using Hadoop streaming
Call myself a data scientist
Installing Mongo-Hadoop

hadoop_version='0.23'
hadoop_path="/usr/local/Cellar/hadoop/$hadoop_version.0/libexec/lib"

git clone git://github.com/mongodb/mongo-hadoop.git
cd mongo-hadoop
sed -i '' "s/default/$hadoop_version/g" build.sbt
cd streaming
./build.sh

https://gist.github.com/1887726
Grokking Twitter

curl \
  https://stream.twitter.com/1/statuses/sample.json \
  -u <login>:<password> \
  | mongoimport -d test -c live

... let it run for about 2 hours
DEMO 1
Map Hashtags in Python

#!/usr/bin/env python

import sys
sys.path.append(".")

from pymongo_hadoop import BSONMapper

def mapper(documents):
    for doc in documents:
        for hashtag in doc['entities']['hashtags']:
            yield {'_id': hashtag['text'], 'count': 1}

BSONMapper(mapper)
print >> sys.stderr, "Done Mapping."
Reduce Hashtags in Python

#!/usr/bin/env python

import sys
sys.path.append(".")

from pymongo_hadoop import BSONReducer

def reducer(key, values):
    print >> sys.stderr, "Hashtag %s" % key.encode('utf8')
    _count = 0
    for v in values:
        _count += v['count']
    return {'_id': key.encode('utf8'), 'count': _count}

BSONReducer(reducer)
All together

hadoop jar target/mongo-hadoop-streaming-assembly-1.0.0-rc0.jar \
  -mapper examples/twitter/twit_hashtag_map.py \
  -reducer examples/twitter/twit_hashtag_reduce.py \
  -inputURI mongodb://127.0.0.1/test.live \
  -outputURI mongodb://127.0.0.1/test.twit_reduction \
  -file examples/twitter/twit_hashtag_map.py \
  -file examples/twitter/twit_hashtag_reduce.py
Popular Hash Tags

db.twit_hashtags.find().sort({'count': -1})

{ "_id" : "YouKnowYoureInLoveIf", "count" : 287 }
{ "_id" : "teamfollowback", "count" : 200 }
{ "_id" : "RT", "count" : 150 }
{ "_id" : "Arsenal", "count" : 148 }
{ "_id" : "milars", "count" : 145 }
{ "_id" : "sanremo", "count" : 145 }
{ "_id" : "LoseMyNumberIf", "count" : 139 }
{ "_id" : "RelationshipsShould", "count" : 137 }
{ "_id" : "oomf", "count" : 117 }
{ "_id" : "TeamFollowBack", "count" : 105 }
{ "_id" : "WhyDoPeopleThink", "count" : 102 }
{ "_id" : "np", "count" : 100 }
DEMO 2
Aggregation in Mongo 2.1

db.live.aggregate(
  { $unwind : "$entities.hashtags" },
  { $match : { "entities.hashtags.text" : { $exists : true } } },
  { $group : { _id : "$entities.hashtags.text", count : { $sum : 1 } } },
  { $sort : { count : -1 } },
  { $limit : 10 }
)
Popular Hash Tags

db.twit_hashtags.aggregate(a)

{
  "result" : [
    { "_id" : "YouKnowYoureInLoveIf", "count" : 287 },
    { "_id" : "teamfollowback", "count" : 200 },
    { "_id" : "RT", "count" : 150 },
    { "_id" : "Arsenal", "count" : 148 },
    { "_id" : "milars", "count" : 145 },
    { "_id" : "sanremo", "count" : 145 },
    { "_id" : "LoseMyNumberIf", "count" : 139 },
    { "_id" : "RelationshipsShould", "count" : 137 }
  ],
  "ok" : 1
}
Future of Humongous Data
What is BIG?
BIG today is normal tomorrow
Data Growth (Millions of URLs)
2000: 1 · 2001: 4 · 2002: 10 · 2003: 24 · 2004: 55 · 2005: 120 · 2006: 250 · 2007: 500 · 2008: 1,000 · 2009: 2,150 · 2010: 4,400 · 2011: 9,000
2012
Generating over 250 million tweets per day
MongoDB enables us to scale with the redefinition of BIG.
New processing tools like Hadoop & Storm are enabling us to process the new BIG.
Hadoop is our first step
MongoDB is committed to working with the best data tools, including Hadoop, Storm, Disco, Spark & more
Questions?

http://spf13.com
http://github.com/spf13
@spf13

download at github.com/mongodb/mongo-hadoop