megadata with python and hadoop
DESCRIPTION
July 2010 Triangle Hadoop Users Group presentationTRANSCRIPT
Processing MegadataWith Python and Hadoop
July 2010 TriHUGRyan Cox
www.asciiarmor.com
0029029070999991901010106004+64333+023450FM-12+000599999V0202701N015919999999N0000001N9-00781+99999102001ADDGF1089919999999999999999990029029070999991901010113004+64333+023450FM-12+000599999V0202901N008219999999N0000001N9-00721+99999102001ADDGF1049919999999999999999990029029070999991901010120004+64333+023450FM-12+000599999V0209991C000019999999N0000001N9-00941+99999102001ADDGF1089919999999999999999990029029070999991901010206004+64333+023450FM-12+000599999V0201801N008219999999N0000001N9-00611+99999101831ADDGF1089919999999999999999990029029070999991901010213004+64333+023450FM-12+000599999V0201801N009819999999N0000001N9-00561+99999101761ADDGF1089919999999999999999990029029070999991901010220004+64333+023450FM-12+000599999V0201801N009819999999N0000001N9-00281+99999101751ADDGF1089919999999999999999990029029070999991901010306004+64333+023450FM-12+000599999V0202001N009819999999N0000001N9-00671+99999101701ADDGF1069919999999999999999990029029070999991901010313004+64333+023450FM-12+000599999V0202301N011819999999N0000001N9-00331+99999101741ADDGF1089919999999999999999990029029070999991901010320004+64333+023450FM-12+000599999V0202301N011819999999N0000001N9-00281+99999101741ADDGF1089919999999999999999990029029070999991901010406004+64333+023450FM-12+000599999V0209991C000019999999N0000001N9-00331+99999102311ADDGF1089919999999999999999990029029070999991901010413004+64333+023450FM-12+000599999V0202301N008219999999N0000001N9-00441+99999102261ADDGF1089919999999999999999990029029070999991901010420004+64333+023450FM-12+000599999V0202001N011819999999N0000001N9-00391+99999102231ADDGF1089919999999999999999990029029070999991901010506004+64333+023450FM-12+000599999V0202701N004119999999N0000001N9+00001+99999101821ADDGF1049919999999999999999990029029070999991901010513004+64333+023450FM-12+000599999V0202701N002119999999N0000001N9+00061+99999102591ADDGF1049919999999999999999990029029070999991901010520004+64333+023450FM-12+000599999V0202301N004119999999N0000001N9+00001+99999102671ADDGF1049919999999999999999990029029070999991901010606004+64333+023450FM-12+000599999V0202701N006219999999N0000001N9+00061+99999102751ADDGF1039919999999999999999990029029070999991901010613004+64333+023450FM-12+000599999V0202701N006219999999N0000001N9+00061+99999102981ADDGF1009919999999999999999990029029070999991901010620004+64333+023450FM-12+000599999V0203201N002119999999N0000001N9-00111+99999103191ADDGF1009919999999999999999990029029070999991901010706004+64333+023450FM-12+000599999V0209991C000019999999N0000001N9-00331+99999103341ADDGF1009919999999999999999990029029070999991901010713004+64333+023450FM-12+000599999V0209991C000019999999N0000001N9-00501+99999103321ADDGF1009919999999999999999990029029070999991901010720004+64333+023450FM-12+000599999V0202001N009819999999N0000001N9-00441+99999103321ADDGF1009919999999999999999990029029070999991901010806004+64333+023450FM-12+000599999V0202301N009819999999N0000001N9-00281+99999103221ADDGF1089919999999999999999990029029070999991901010813004+64333+023450FM-12+000599999V0202301N011819999999N0000001N9-00331+99999103201ADDGF1089919999999999999999990035029070999991901010820004+64333+023450FM-12+000599999V0202301N013919999999N0000001N9-00331+99999102991ADDGF108991999999999999999999MW17010029029070999991901010906004+64333+023450FM-12+000599999V0209991C000019999999N0000001N9-00501+99999102871ADDGF1089919999999999999999990029029070999991901010913004+64333+023450FM-12+000599999V0209991C000019999999N0000001N9-00331+99999102661ADDGF1089919999999999999999990029029070999991901010920004+64333+023450FM-12+000599999V0201801N009819999999N0000001N9-00281+99999102391ADDGF1089919999999999999999990029029070999991901011006004+64333+023450FM-12+000599999V0202301N009819999999N0000001N9-00441+99999101601ADDGF1009919999999999999999990029029070999991901011013004+64333+023450FM-12+000599999V0202301N011819999999N0000001N9-00441+99999101481ADDGF1009919999999999999999990029029070999991901011020004+64333+023450FM-12+000599999V0202301N013919999999N0000001N9-00441+99999101381ADDGF1009919999999999999999990029029070999991901011106004+64333+023450FM-12+000599999V0202501N006219999999N0000001N9-00391+99999101061ADDGF1009919999999999999999990029029070999991901011113004+64333+023450FM-12+000599999V0202701N008219999999N0000001N9-00501+99999101141ADDGF1009919999999999999999990029029070999991901011120004+64333+023450FM-12+000599999V0209991C000019999999N0000001N9-00441+99999101261ADDGF1009919999999999999999990029029070999991901011206004+64333+023450FM-12+000599999V0202701N004119999999N0000001N9-00391+99999101311ADDGF1049919999999999999999990029029070999991901011213004+64333+023450FM-12+000599999V0203201N008219999999N0000001N9-00331+99999102071ADDGF1039919999999999999999990029029070999991901011220004+64333+023450FM-12+000599999V0202901N009819999999N0000001N9-00221+99999102191ADDGF1009919999999999999999990029029070999991901011306004+64333+023450FM-12+000599999V0203201N004119999999N0000001N9+00001+99999101661ADDGF1009919999999999999999990029029070999991901011313004+64333+023450FM-12+000599999V0203201N008219999999N0000001N9-00061+99999102351ADDGF1009919999999999999999990029029070999991901011320004+64333+023450FM-12+000599999V0203201N004119999999N0000001N9-00171+99999102321ADDGF1009919999999999999999990029029070999991901011406004+64333+023450FM-12+000599999V0209991C000019999999N0000001N9-00441+99999102721ADDGF1009919999999999999999990029029070999991901011413004+64333+023450FM-12+000599999V0202301N009819999999N0000001N9-00391+99999102551ADDGF1009919999999999999999990029029070999991901011420004+64333+023450FM-12+000599999V0202301N011819999999N0000001N9-00331+99999102261ADDGF1009919999999999999999990029029070999991901011506004+64333+023450FM-12+000599999V0202301N013919999999N0000001N9-00061+99999101831ADDGF1089919999999999999999990029029070999991901011513004+64333+023450FM-12+000599999V0202301N013919999999N0000001N9+00171+99999101541ADDGF1089919999999999999999990035029070999991901011520004+64333+023450FM-12+000599999V0202301N015919999999N0000001N9+00221+99999101321ADDGF108991999999999999999999MW1721
~130 GB NCDC CLIMATE DATASET1901
19041907
19101913
19161919
19221925
19281931
19341937
19401943
19461949
19521955
19581961
19641967
19701973
19761979
19821985
19881991
19941997
0.0
0.5
1.0
1.5
2.0
2.5
3.0
3.5
4.0
4.5
5.0
How can we make this scale?( and do more interesting things )
high_temp = 0 for line in open('1901'): line = line.strip() (year, temp, quality) = (line[15:19], line[87:92], line[92:93])
if (temp != "+9999" and quality in "01459"): high_temp = max(high_temp,float(temp))
print high_temp
JEFFREY DEAN – GOOGLE - 2004
“Our abstraction is in-spired by the map and reduce primitives present in Lisp and many other functional languages. We realized that most of our computations involved applying a map operation to each logical record in our input in order to compute a set of intermediate key/value pairs, and then applying a reduce operation to all the values that shared the same key, in order to combine the derived data appropriately. Our use of a functional model with user-specified map and reduce operations allows us to parallelize large computations easily and to use re-execution as the primary mechanism for fault tolerance.”
def mapper(line): line = line.strip() (year, temp, quality) = (line[15:19], line[87:92], line[92:93]) if (temp != "+9999" and quality in "01459"): return float(temp) return None
output = map(mapper,open('1901')) print reduce(max,output)
MAPREDUCE IN PURE PYTHON
HADOOP STREAMING
for line in sys.stdin: val = line.strip() (year, temp, q) = (val[15:19], val[87:92], val[92:93]) if (temp != "+9999" and re.match("[01459]", q)): print "%s\t%s" % (year, temp)
(last_key, max_val) = (None, 0)for line in sys.stdin: (key, val) = line.strip().split("\t") if last_key and last_key != key: print "%s\t%s" % (last_key, max_val) (last_key, max_val) = (key, int(val)) else: (last_key, max_val) = (key, max(max_val, int(val)))
if last_key: print "%s\t%s" % (last_key, max_val)
map
per.p
yre
duer
.py
cat dataFile | mapper.py | sort | reducer.py
DUMBO
DUMBO
def mapper(key,value): line = value.strip() (year, temp, quality) = (line[15:19], line[87:92], line[92:93]) if (temp != "+9999" and quality in "01459"): yield year, int(temp)
def reducer(key,values): yield key,max(values)
if __name__ == "__main__": import dumbo dumbo.run(mapper,reducer,reducer)
DUMBO
• Ability to pass around Python objects• Job / Iteration Abstraction• Counter / Status Abstraction• Simplified Joining mechanism• Ability to use non-Java combiners• Built-in library of mappers / reducers• Excellent way to model MR algorithms
ELASTIC MAP REDUCE
Amazon Elastic MapReduce is a web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data. It utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3).
CLI – API – Web Console
ELASTIC MAP REDUCE
ELASTIC MAP REDUCE
Use Hadoop’s Job Tracker or Amazon’s ‘Debugger’
ELASTIC MAP REDUCECloudWatch metrics
ELASTIC MAP REDUCECloudWatch metrics
QUIZ: HOW WOULD YOU DO THIS?
MAP REDUCE ALGORITHMS ARE DIFFERENT
BFS(G, s) // G is the graph and s is the starting node for each vertex u V [G] - {s}∈ do color[u] ← WHITE // color of vertex u d[u] ← ∞ // distance from source s to vertex u π[u] ← NIL // predecessor of u color[s] ← GRAY d[s] ← 0 π[s] ← NIL Q ← Ø // Q is a FIFO - queue ENQUEUE(Q, s) while Q ≠ Ø // iterates as long as there are gray vertices. do u ← DEQUEUE(Q) for each v Adj[u]∈ do if color[v] = WHITE // discover the undiscovered adjacent vertices then color[v] ← GRAY // enqueued whenever painted gray d[v] ← d[u] + 1 π[v] ← u ENQUEUE(Q, v) color[u] ← BLACK // painted black whenever dequeued
MAP REDUCE ELSEWHERE
> m = function() { emit(this.user_id, 1); } > r = function(k,vals) { return 1; } > res = db.events.mapReduce(m, r, { query : {type:'sale'} }); > db[res.result].find().limit(2) { "_id" : 8321073716060 , "value" : 1 } { "_id" : 7921232311289 , "value" : 1 }
> {ok, [R]} = Client:mapred([{<<"groceries">>, <<"mine">>}, {<<"groceries">>, <<"yours">>}], [{'map', {'qfun', Count}, 'none', false}, {'reduce', {'qfun', Merge}, 'none', true}]).
Mon
goD
BRi
ak
LEARN MORE
Definitive Guide Hadoophttp://www.hadoopbook.com
Dumbohttp://dumbotics.com/http://github.com/klbostee/dumbo/
Elastic Map Reduce http://aws.amazon.com/
Botohttp://github.com/boto
Getting Started Slideshttp://www.slideshare.net/pacoid/getting-started-on-hadoop
DEMO