Hadoop and beyond: power tools for data mining
Mark Levy, 13 March 2013
Cloud Computing Module, Birkbeck/UCL
Hadoop and beyond

Outline:
• the data I work with
• Hadoop without Java
• Map-Reduce unfriendly algorithms
• Hadoop without Map-Reduce
• alternatives in the cloud
• alternatives on your laptop
NB
• all software mentioned is Open Source
• won't cover key-value stores
• I don't use all of these tools
Last.fm: scrobbling
Last.fm: tagging
Last.fm: personalised radio
Last.fm: recommendations
Last.fm datasets

Core datasets:
• 45M users, many active
• 60M artists
• 100M audio fingerprints
• 600M tracks (hmm...)
• 19M physical recordings
• 3M distinct tags
• 2.5M <user,item,tag> taggings per month
• 1B <user,time,track> scrobbles per month
• full user-track graph has ~50B edges (more often work with ~500M edges)
Problem Scenario 1

Need Hadoop, don't want Java:
• need to build prototypes, fast
• need to do interactive data analysis
• want terse, highly readable code:
  • improve maintainability
  • improve correctness
Hadoop without Java

Some options:
• Hive (Facebook)
• Pig (Yahoo!)
• Cascading (ok it's still Java...)
• Scalding (Twitter)
• Hadoop streaming (various)

not to mention 11 more listed here: http://blog.matthewrathbone.com/2013/01/05/a-quick-guide-to-hadoop-map-reduce-frameworks.html
Apache Hive

SQL access to data on Hadoop

pros:
• minimal learning curve
• interactive shell
• easy to check correctness of code

cons:
• can be inefficient
• hard to fix when it is
Word count in Hive

```sql
CREATE TABLE input (line STRING);
LOAD DATA LOCAL INPATH '/input' OVERWRITE INTO TABLE input;

SELECT word, COUNT(*)
FROM input LATERAL VIEW explode(split(line, ' ')) wTable AS word
GROUP BY word;
```

[but would you use SQL to count words?]
Apache Pig

High-level scripting language for Hadoop

pros:
• more primitive operations than Hive (and UDFs)
• more flexible than Hive
• interactive shell

cons:
• steeper learning curve than Hive
• tempting to write longer programs, but no code modularity beyond functions
Word count in Pig

```pig
A = load '/input';
B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;
C = filter B by word matches '\\w+';
D = group C by word;
E = foreach D generate COUNT(C), group;
store E into '/output/wordcount';
```

[ apply operations to "relations" (bags of tuples) ]
Cascading

Java data pipelining for Hadoop

pros:
• as flexible as Pig
• uses a real programming language
• ideal for longer workflows

cons:
• new concepts to learn ("pipe", "sink", "tap", ...)
• still verbose (full word count example code > 150 lines)
Word count in Cascading

```java
Scheme sourceScheme = new TextLine(new Fields("line"));
Tap source = new Hfs(sourceScheme, "/input");

Scheme sinkScheme = new TextLine(new Fields("word", "count"));
Tap sink = new Hfs(sinkScheme, "/output/wordcount", SinkMode.REPLACE);

Pipe assembly = new Pipe("wordcount");
String regex = "(?<!\\pL)(?=\\pL)[^ ]*(?<=\\pL)(?!\\pL)";
Function function = new RegexGenerator(new Fields("word"), regex);
assembly = new Each(assembly, new Fields("line"), function);
assembly = new GroupBy(assembly, new Fields("word"));
Aggregator count = new Count(new Fields("count"));
assembly = new Every(assembly, count);

Properties properties = new Properties();
FlowConnector.setApplicationJarClass(properties, Main.class);
FlowConnector flowConnector = new FlowConnector(properties);
Flow flow = flowConnector.connect("word-count", source, sink, assembly);
flow.complete();
```
Scalding

Scala data pipelining for Hadoop

pros:
• as flexible as Pig
• uses a real programming language
• much terser than Java

cons:
• community still small (but in use at Twitter)
• ???
Word count in Scalding

```scala
import com.twitter.scalding._

class WordCountJob(args : Args) extends Job(args) {
  TextLine(args("input"))
    .flatMap('line -> 'word){ line : String => line.split("""\s+""") }
    .groupBy('word){ _.size }
    .write(Tsv(args("output")))
}
```

[and a one-liner to run it]
Hadoop streaming

Map-reduce in any language, e.g. the Dumbo wrapper for Python

pros:
• use your favourite language for map-reduce
• easy to mix local and cloud processing

cons:
• limited community
• limited functionality beyond map-reduce
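Without any wrapper, the streaming contract is just lines on stdin and tab-separated key/value lines on stdout. A minimal word count sketch in Python (the function names and single-file layout here are my own, not from the talk):

```python
import sys
from itertools import groupby

def map_lines(lines):
    """Mapper: emit (word, 1) for each whitespace-separated token."""
    for line in lines:
        for word in line.split():
            yield word, 1

def reduce_pairs(pairs):
    """Reducer: sum counts per word; pairs must arrive sorted by word,
    which Hadoop's shuffle/sort phase guarantees."""
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

def run_mapper():
    # the streaming contract: read stdin, write "key\tvalue" lines
    for word, count in map_lines(sys.stdin):
        sys.stdout.write("%s\t%d\n" % (word, count))

def run_reducer():
    pairs = (line.rstrip("\n").split("\t") for line in sys.stdin)
    for word, total in reduce_pairs((w, int(c)) for w, c in pairs):
        sys.stdout.write("%s\t%d\n" % (word, total))
```

Hadoop would then run it along the lines of `hadoop jar hadoop-streaming.jar -input /input -output /output/wordcount -mapper mapper.py -reducer reducer.py`, sorting the mapper output by key before the reducer sees it.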
Word count in Dumbo

```python
def map(key, text):  # ignore key
    for word in text.split():
        yield word, 1

def reduce(word, counts):
    yield word, sum(counts)

import dumbo
dumbo.run(map, reduce, combiner=reduce)
```

[and a one-liner to run it]
Problem Scenario 1b

Need Hadoop, don't want Java:
• drive native code in parallel

E.g. audio analysis for:
• beat locations, bpm
• key estimation
• chord sequence estimation
• energy
• music/speech?
• ...
Audio Analysis

Problem:
• millions of audio tracks on own dfs
• long-running C++ analysis code
• depends on numerous libraries
• verbose output
Audio Analysis

Solution:
• bash + Dumbo Hadoop streaming

Outline:
• build C++ code
• zip up binary and libs
• send zipfile and some track IDs to each machine
• extract and run binary in map task with subprocess.Popen()
Audio Analysis

```python
class AnalysisMapper:
    def __init__(self):
        extract("analyzer.tar.bz2", "bin")

    def map(self, key, trackID):
        file = fetch_audio_file(trackID)
        proc = subprocess.Popen(
            ["bin/analyzer", file],
            stdout=subprocess.PIPE)
        (out, err) = proc.communicate()
        yield trackID, out
```
Problem Scenario 2

Map-reduce unfriendly computation:
• iterative algorithms on the same data
• huge mapper output ("map-increase")
• curse of the slowest reducer
Graph Recommendations

Random walk on the user-item graph:

[figure: user-item graph with random walks from user U to track t]

Many short routes from U to t ⇒ recommend!
Graph Recommendations

random walk is equivalent to
• Label Propagation (Baluja et al., 2008)
• belongs to a family of algorithms that are easy to code in map-reduce
Label Propagation

User-track graph, edge weights = scrobbles:

[figure: users U, V, W, X connected to tracks a-f, with scrobble counts as edge weights]
Label Propagation

User nodes are labelled with scrobbled tracks:

[figure: the same graph with normalised label distributions at each user node, e.g. U: (a,0.2)(b,0.4)(c,0.4), V: (b,0.5)(d,0.5), W: (b,0.2)(d,0.3)(e,0.5), X: (a,0.3)(d,0.3)(e,0.4)]
Label Propagation

Propagate, accumulate, normalise:

[figure: a track node receives 1 x (b,0.5),(d,0.5) and 3 x (b,0.2),(d,0.3),(e,0.5), which accumulate to (b,0.37),(d,0.47),(e,0.17); on the next iteration e will propagate to user V]
Label Propagation

After some iterations:
• labels at item nodes = similar items
• new labels at user nodes = recommendations
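To make the propagate-accumulate-normalise loop concrete, here is a toy single-machine sketch; the graph, edge weights and starting labels below are invented for illustration, not Last.fm data:

```python
from collections import defaultdict

def propagate(labels, adjacency):
    """One label-propagation step: each node sends its label
    distribution, scaled by edge weight, to every neighbour;
    each receiving node then sums and renormalises."""
    incoming = defaultdict(lambda: defaultdict(float))
    for node, neighbours in adjacency.items():
        for neighbour, weight in neighbours:
            for label, prob in labels.get(node, {}).items():
                incoming[neighbour][label] += weight * prob
    new_labels = {}
    for node, dist in incoming.items():
        total = sum(dist.values())
        new_labels[node] = {l: w / total for l, w in dist.items()}
    return new_labels

# tiny invented graph: users U and V, tracks a, b, c, weights = scrobbles
adjacency = {
    "U": [("a", 2), ("b", 4)],
    "V": [("b", 1), ("c", 1)],
    "a": [("U", 2)],
    "b": [("U", 4), ("V", 1)],
    "c": [("V", 1)],
}
# user nodes start with their own scrobbles as a normalised distribution
labels = {"U": {"a": 2 / 6, "b": 4 / 6}, "V": {"b": 0.5, "c": 0.5}}

step1 = propagate(labels, adjacency)  # labels arrive at track nodes
step2 = propagate(step1, adjacency)   # and flow back to user nodes
```

After the second step, U's distribution includes track c, a track U never scrobbled; labels like that are the recommendation signal, and labels accumulated at track nodes give track-track similarity.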
Map-Reduce Graph Algorithms

general approach, assuming:
• no global state
• state at a node is recomputed from scratch from incoming messages on each iteration

other examples:
• breadth-first search
• page rank
Map-Reduce Graph Algorithms

inputs:
• adjacency lists, state at each node

output:
• updated state at each node

[figure: node U with weighted edges to tracks a, b, c; its adjacency list is stored as U,[(a,2),(b,4),(c,4)]]
Label Propagation

```python
class PropagatingMapper:
    def map(self, nodeID, value):
        # value holds label-weight pairs
        # and the adjacency list for the node
        labels, adj_list = value
        for node, weight in adj_list:
            # send a "stripe" of label-weight
            # pairs to each neighbouring node
            msg = [(label, prob * weight)
                   for label, prob in labels]
            yield node, msg
```
Label Propagation

```python
class Reducer:
    def reduce(self, nodeID, msgs):
        # accumulate
        labels = defaultdict(lambda: 0)
        for msg in msgs:
            for label, w in msg:
                labels[label] += w
        # normalise, prune
        normalise(labels, MAX_LABELS_PER_NODE)
        yield nodeID, labels
```
Label Propagation

Not map-reduce friendly:
• send the graph over the network on every iteration
• huge mapper output:
  • mappers soon send MAX_LABELS_PER_NODE updates along every edge
• some reducers receive huge input:
  • too slow if the reducer streams the data, OOM otherwise
• NB can't partition real graphs to avoid this:
  • many natural graphs are scale-free, e.g. in the AltaVista web graph the top 1% of nodes are adjacent to 53% of edges
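A back-of-envelope calculation shows why the shuffle explodes; the edge count is the ~500M working subset quoted earlier, the other constants are my own illustrative assumptions:

```python
# Illustrative only: mapper output per label-propagation iteration.
# Once every node carries a full label stripe, each mapper emits
# MAX_LABELS_PER_NODE (label, weight) pairs along every edge.
EDGES = 500_000_000          # working-subset user-track graph
MAX_LABELS_PER_NODE = 100    # assumed pruning threshold
BYTES_PER_PAIR = 16          # assumed 8-byte label id + 8-byte weight

pairs_per_iteration = EDGES * MAX_LABELS_PER_NODE   # 50 billion pairs
bytes_per_iteration = pairs_per_iteration * BYTES_PER_PAIR
tib_per_iteration = bytes_per_iteration / 2**40     # ~0.73 TiB shuffled
```

And that is per iteration, every iteration, with the skew concentrated on whichever reducers own the high-degree nodes.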
Problem Scenario 2b

Map-reduce unfriendly computation:
• shared memory

Examples:
• almost all machine learning:
  • split training examples between machines
  • all machines need to read/write many shared parameter values
Hadoop without map-reduce

Graph processing:
• Apache Giraph (Facebook)

Hadoop YARN:
• Knitting Boar, Iterative Reduce
  http://www.cloudera.com/content/cloudera/en/resources/library/hadoopworld/strata-hadoop-world-2012-knitting-boar_slide_deck.html
• ???
Alternatives in the cloud

Graph processing:
• GraphLab (CMU)

Task-specific:
• Yahoo! LDA

General:
• HPCC
• Spark (Berkeley)
Spark and Shark

In-memory cluster computing

pros:
• fast!! (Shark claims up to 100x faster than Hive)
• code in Scala, Java or Python
• can run on Hadoop YARN or Apache Mesos
• ideal for iterative algorithms, nearline analytics
• includes a Pregel clone & stream processing

cons:
• hardware requirements???
GraphLab

Distributed graph processing

pros:
• vertex-centric programming model
• handles true web-scale graphs
• many toolkits already:
  • collaborative filtering, topic modelling, graphical models, machine vision, graph analysis

cons:
• new applications require non-trivial C++ coding
Word count in Spark

```scala
val file = spark.textFile("hdfs://input")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://output/wordcount")
```
Logistic regression in Spark

```scala
val points = spark.textFile(...).map(parsePoint).cache()
var w = Vector.random(D)  // current separating plane
for (i <- 1 to ITERATIONS) {
  val gradient = points.map(p =>
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}
println("Final separating plane: " + w)
```

[ points remain in memory for all iterations ]
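The expression inside the map step is just the gradient of the per-example logistic loss; for labels y in {-1, +1}:

```latex
\ell(w) = \log\bigl(1 + e^{-y\, w \cdot x}\bigr), \qquad
\nabla \ell(w)
  = \frac{-y\, x\, e^{-y\, w \cdot x}}{1 + e^{-y\, w \cdot x}}
  = \left(\frac{1}{1 + e^{-y\, w \cdot x}} - 1\right) y\, x
```

The rightmost form is term-by-term what the Scala code computes before summing over all points.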
Alternatives on your laptop

Graph processing:
• GraphChi (CMU)

Machine learning:
• sofia-ml (Google)
• vowpal wabbit (Yahoo!, Microsoft)
GraphChi

Graph processing on your laptop

pros:
• still handles graphs with billions of edges
• graph structure can be modified at runtime
• Java/Scala ports under active development
• some toolkits available:
  • collaborative filtering, graph analysis

cons:
• existing C++ toolkit code is hard to extend
vowpal wabbit

classification, regression, LDA, bandits, ...

pros:
• handles huge ("terafeature") training datasets
• very fast
• state-of-the-art algorithms
• can run in distributed mode on Hadoop streaming

cons:
• hard-core documentation
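For a flavour of what using it looks like, a sketch of the input format and standard command-line flags (the file names here are hypothetical):

```text
# train.txt: one example per line, "label | feature[:value] ..."
1 | height:1.5 weight:2.0
-1 | height:0.3 weight:0.1

# learn a logistic model, then predict on held-out data:
vw train.txt --loss_function logistic -f model.vw
vw test.txt -t -i model.vw -p predictions.txt
```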
Take homes

Think before you use Hadoop:
• use your laptop for most problems
• use a graph framework for graph data

Keep your Hadoop code simple:
• if you're just querying data, use Hive
• if not, use a workflow framework

Check out the competition:
• Spark and HPCC look impressive