Learning with Hadoop – A Case Study on MapReduce-based Data Mining
Evan Xiang, HKUST
Outline
- Hadoop Basics
- Case Study
  - Word Count
  - Pairwise Similarity
  - PageRank
  - K-Means Clustering
  - Matrix Factorization
  - Cluster Coefficient
- Resource Entries to ML labs
- Advanced Topics
- Q&A
Introduction to Hadoop

Hadoop MapReduce is a Java-based software framework for easily writing applications that process vast amounts of data (multi-terabyte data sets) in parallel on large clusters (thousands of nodes) of commodity hardware, in a reliable, fault-tolerant manner.
Hadoop Cluster Architecture

[Diagram: a Client talks to the job submission node, which hosts the JobTracker and the HDFS master (NameNode); each slave node runs a TaskTracker and a DataNode.]

From Jimmy Lin's slides
Hadoop HDFS
Hadoop Cluster Rack Awareness
Hadoop Development Cycle

1. Scp data to the cluster
2. Move data into HDFS
3. Develop code locally
4. Submit the MapReduce job (4a. go back to Step 3)
5. Move data out of HDFS
6. Scp data from the cluster

From Jimmy Lin's slides
Divide and Conquer

[Diagram: the "Work" is partitioned into w1, w2, w3, each handled by a "worker" to produce r1, r2, r3, which are combined into the "Result".]

From Jimmy Lin's slides
High-level MapReduce pipeline
Detailed Hadoop MapReduce data flow
Word Count with MapReduce

Doc 1: "one fish, two fish"  → Map → (one, 1), (two, 1), (fish, 2)
Doc 2: "red fish, blue fish" → Map → (red, 1), (blue, 1), (fish, 2)
Doc 3: "cat in the hat"      → Map → (cat, 1), (hat, 1)

Shuffle and Sort: aggregate values by keys

Reduce → (fish, 4), (one, 1), (two, 1), (red, 1), (blue, 1), (cat, 1), (hat, 1)

From Jimmy Lin's slides
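To make the data flow above concrete, here is a minimal sketch of the word count job written against the standard Hadoop MapReduce Java API (this follows the classic WordCount example rather than code from the talk; input and output paths come from the command line).

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: emit (word, 1) for every token of the input line.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce: sum the counts for each word (also usable as a combiner).
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);  // partial sums computed map-side
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Reusing the reducer as a combiner computes partial sums on the map side and cuts down the data moved during the shuffle.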
Calculating Document Pairwise Similarity

Trivial solution: load each vector O(N) times; load each term O(df_t^2) times.
Goal: a scalable and efficient solution for large collections.

From Jimmy Lin's slides
15
Better Solution
Load weights for each term onceEach term contributes o(dft
2) partial scores
Each term contributes only if appears in
From Jimmy Lin’s slides
Decomposition

- Map: for each term, load its weights once and generate its O(df_t^2) partial scores (a term contributes only if it appears in both documents)
- Reduce: sum the partial scores for each document pair

From Jimmy Lin's slides
Standard Indexing

[Diagram: (a) Map: each doc is tokenized; (b) Shuffle: values are grouped by term; (c) Reduce: the grouped postings are combined into per-term posting lists.]

From Jimmy Lin's slides
Inverted Indexing with MapReduce

Doc 1: "one fish, two fish"  → Map → (one, (1,1)), (two, (1,1)), (fish, (1,2))
Doc 2: "red fish, blue fish" → Map → (red, (2,1)), (blue, (2,1)), (fish, (2,2))
Doc 3: "cat in the hat"      → Map → (cat, (3,1)), (hat, (3,1))

Shuffle and Sort: aggregate values by keys

Reduce → posting lists (term → [(doc, tf), ...]): fish → [(1,2), (2,2)]; one → [(1,1)]; two → [(1,1)]; red → [(2,1)]; blue → [(2,1)]; cat → [(3,1)]; hat → [(3,1)]

From Jimmy Lin's slides
Indexing (3-doc toy collection)

Documents: "Clinton Obama Clinton", "Clinton Cheney", "Clinton Barack Obama"

Index (term → term frequency in each document that contains it):
Clinton → 2, 1, 1
Obama → 1, 1
Barack → 1
Cheney → 1

From Jimmy Lin's slides
Pairwise Similarity

(a) Generate pairs: each term's posting list yields a partial score for every pair of documents containing that term.
(b) Group pairs: the partial scores are shuffled so that all scores for the same document pair meet at one reducer.
(c) Sum pairs: the reducer adds them up; for the toy collection above, e.g., sim("Clinton Obama Clinton", "Clinton Barack Obama") = 2·1 (Clinton) + 1·1 (Obama) = 3.

How to deal with the long list?

From Jimmy Lin's slides
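As a concrete illustration of the generate/group/sum decomposition, here is a sketch of the second job, assuming the inverted index from the previous job was written as text lines of the form "term <TAB> docId:weight docId:weight ..." (that input format and the class names are illustrative assumptions, not the talk's code).

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PairwiseSimilarity {

  // Map: for each term's posting list, emit a partial score w_i * w_j
  // for every pair of documents that contain the term.
  public static class PairMapper extends Mapper<Object, Text, Text, DoubleWritable> {
    private final Text pair = new Text();
    private final DoubleWritable partial = new DoubleWritable();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] fields = value.toString().split("\t");
      if (fields.length < 2) return;
      String[] postings = fields[1].trim().split("\\s+");
      for (int i = 0; i < postings.length; i++) {
        String[] di = postings[i].split(":");
        for (int j = i + 1; j < postings.length; j++) {
          String[] dj = postings[j].split(":");
          // Order the pair so both orientations land on the same reducer key.
          String a = di[0], b = dj[0];
          pair.set(a.compareTo(b) < 0 ? a + "," + b : b + "," + a);
          partial.set(Double.parseDouble(di[1]) * Double.parseDouble(dj[1]));
          context.write(pair, partial);
        }
      }
    }
  }

  // Reduce (and combine): sum the partial scores for each document pair.
  public static class SumReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
    @Override
    protected void reduce(Text key, Iterable<DoubleWritable> values, Context context)
        throws IOException, InterruptedException {
      double sum = 0.0;
      for (DoubleWritable v : values) sum += v.get();
      context.write(key, new DoubleWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "pairwise similarity");
    job.setJarByClass(PairwiseSimilarity.class);
    job.setMapperClass(PairMapper.class);
    job.setCombinerClass(SumReducer.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(DoubleWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

As for the "long list" question: one common remedy is a document-frequency cut, dropping the highest-df terms before generating pairs, since they account for most of the O(df_t^2) partial scores while contributing little to discrimination.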
PageRank

- PageRank – an information propagation model
- Requires intensive access to each node's neighborhood (adjacency) list
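For reference, the standard PageRank of a node n with random-jump probability α is (this is the textbook definition, not a formula taken from the slides):

\[ P(n) = \frac{\alpha}{|V|} + (1 - \alpha) \sum_{m \in L(n)} \frac{P(m)}{C(m)} \]

where |V| is the number of nodes, L(n) is the set of nodes linking to n, and C(m) is the out-degree of m. The MapReduce formulation below computes the summation term; the random-jump part can be folded in afterwards.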
PageRank with MapReduce

Map input (node → adjacency list): n1 [n2, n4]; n2 [n3, n5]; n3 [n4]; n4 [n5]; n5 [n1, n2, n3]
Map: each node sends an equal share of its current rank to its out-neighbors (n1 → n2, n4; n2 → n3, n5; n3 → n4; n4 → n5; n5 → n1, n2, n3) and also re-emits its adjacency list.
Shuffle and Sort: aggregate values by keys
Reduce: each node sums the rank contributions it received and re-attaches its adjacency list: n1 [n2, n4]; n2 [n3, n5]; n3 [n4]; n4 [n5]; n5 [n1, n2, n3]

How to maintain the graph structure? Pass the adjacency lists through the shuffle alongside the rank contributions, as above.

From Jimmy Lin's slides
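A simplified sketch of one PageRank iteration along these lines, assuming input lines of the form "nodeId <TAB> rank <TAB> neighbor1,neighbor2,..."; dangling-node mass and the random-jump factor are left out for brevity, and all names and formats are illustrative.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PageRankIteration {

  // Map: send an equal share of this node's rank to each out-neighbor,
  // and re-emit the adjacency list so the reducer can rebuild the graph.
  public static class PRMapper extends Mapper<Object, Text, Text, Text> {
    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] parts = value.toString().split("\t");
      String node = parts[0];
      double rank = Double.parseDouble(parts[1]);
      String adjacency = parts.length > 2 ? parts[2] : "";

      context.write(new Text(node), new Text("GRAPH\t" + adjacency));
      if (!adjacency.isEmpty()) {
        String[] neighbors = adjacency.split(",");
        double share = rank / neighbors.length;
        for (String neighbor : neighbors) {
          context.write(new Text(neighbor), new Text("MASS\t" + share));
        }
      }
    }
  }

  // Reduce: sum the incoming rank mass and re-attach the adjacency list.
  public static class PRReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      double newRank = 0.0;
      String adjacency = "";
      for (Text v : values) {
        String[] parts = v.toString().split("\t", 2);
        if (parts[0].equals("GRAPH")) {
          adjacency = parts.length > 1 ? parts[1] : "";
        } else {
          newRank += Double.parseDouble(parts[1]);
        }
      }
      context.write(key, new Text(newRank + "\t" + adjacency));
    }
  }

  public static void main(String[] args) throws Exception {
    // One iteration; chain the job (output -> next input) until the ranks converge.
    Job job = Job.getInstance(new Configuration(), "pagerank iteration");
    job.setJarByClass(PageRankIteration.class);
    job.setMapperClass(PRMapper.class);
    job.setReducerClass(PRReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}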
K-Means Clustering
K-Means Clustering with MapReduce

- Each mapper loads a subset of the data samples and assigns each sample to its nearest centroid.
- Each mapper needs to keep a copy of the current centroids.
- [Diagram: Mapper_i-1, Mapper_i, Mapper_i+1 emit samples keyed by the cluster (1–4) they were assigned to; the keyed samples are shuffled to Reducer_i-1, Reducer_i, Reducer_i+1, which recompute the centroids.]
- How to set the initial centroids is very important! Usually we set the centroids using Canopy Clustering.
[McCallum, Nigam and Ungar: "Efficient Clustering of High Dimensional Data Sets with Application to Reference Matching", SIGKDD 2000]
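A sketch of one K-Means iteration under the scheme above. The current centroids are handed to every mapper through the job Configuration as a semicolon-separated string (one possible way to give each mapper its copy; the distributed cache would work as well). Class names, the property name, and the file formats are illustrative assumptions.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class KMeansIteration {

  // Parse "x1,x2,..." into a vector of doubles.
  static double[] parseVector(String s) {
    String[] f = s.split(",");
    double[] v = new double[f.length];
    for (int i = 0; i < f.length; i++) v[i] = Double.parseDouble(f[i]);
    return v;
  }

  // Map: assign each sample to its nearest centroid; every mapper keeps its own copy of the centroids.
  public static class AssignMapper extends Mapper<Object, Text, IntWritable, Text> {
    private double[][] centroids;

    @Override
    protected void setup(Context context) {
      // Current centroids, e.g. "1.0,2.0;3.5,0.1;...", set by the driver.
      String[] cs = context.getConfiguration().get("kmeans.centroids").split(";");
      centroids = new double[cs.length][];
      for (int i = 0; i < cs.length; i++) centroids[i] = parseVector(cs[i]);
    }

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      double[] point = parseVector(value.toString());
      int best = 0;
      double bestDist = Double.MAX_VALUE;
      for (int c = 0; c < centroids.length; c++) {
        double d = 0.0;
        for (int i = 0; i < point.length; i++) {
          double diff = point[i] - centroids[c][i];
          d += diff * diff;
        }
        if (d < bestDist) { bestDist = d; best = c; }
      }
      context.write(new IntWritable(best), value);  // (nearest centroid id, sample)
    }
  }

  // Reduce: average all samples assigned to a centroid to get its new position.
  public static class RecomputeReducer extends Reducer<IntWritable, Text, IntWritable, Text> {
    @Override
    protected void reduce(IntWritable key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      double[] sum = null;
      long count = 0;
      for (Text v : values) {
        double[] p = parseVector(v.toString());
        if (sum == null) sum = new double[p.length];
        for (int i = 0; i < p.length; i++) sum[i] += p[i];
        count++;
      }
      StringBuilder sb = new StringBuilder();
      for (int i = 0; i < sum.length; i++) {
        if (i > 0) sb.append(',');
        sb.append(sum[i] / count);
      }
      context.write(key, new Text(sb.toString()));
    }
  }

  public static void main(String[] args) throws Exception {
    // args: <input points> <output dir> <initial centroids "x,y;x,y;...">, e.g. from Canopy Clustering.
    Configuration conf = new Configuration();
    conf.set("kmeans.centroids", args[2]);
    Job job = Job.getInstance(conf, "kmeans iteration");
    job.setJarByClass(KMeansIteration.class);
    job.setMapperClass(AssignMapper.class);
    job.setReducerClass(RecomputeReducer.class);
    job.setOutputKeyClass(IntWritable.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The driver re-reads the reducer output, sets the new centroids, and submits the next iteration until the assignments stop changing.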
Matrix Factorization for Link Prediction

In this task, we observe a sparse matrix X ∈ R^{m×n} with entries x_ij. Let R = {(i, j, r) : r = x_ij, x_ij ≠ 0} denote the set of observed links in the system. In order to predict the unobserved links in X, we model the users and the items by a user factor matrix U ∈ R^{k×m} and an item factor matrix V ∈ R^{k×n}. The goal is to approximate the link matrix X by the product of the factor matrices U and V, which can be learned by minimizing the reconstruction error over the observed links.
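A standard regularized squared-error form of this objective (assumed here; the exact regularization used in the talk may differ) is

\[ \min_{U,V} \sum_{(i,j) \in R} \bigl( x_{ij} - \mathbf{u}_i^{\top} \mathbf{v}_j \bigr)^2 + \lambda \bigl( \|U\|_F^2 + \|V\|_F^2 \bigr), \]

where u_i and v_j are the i-th and j-th columns of U and V, and λ is a regularization weight.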
Solving Matrix Factorization via Alternating Least Squares (ALS)

Given X and V, update U. Similarly, given X and U, we can alternately update V.
[Diagram: for each user i, the observed entries in row i of X (an m×n matrix) select the corresponding columns of V (a k×n matrix); these are accumulated into a k×k matrix A and a k-dimensional vector b, from which u_i is solved.]
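In the standard ALS closed form, fixing V, each user factor u_i is obtained from the items the user has linked to:

\[ A_i = \sum_{j:(i,j) \in R} \mathbf{v}_j \mathbf{v}_j^{\top} + \lambda I_k, \qquad \mathbf{b}_i = \sum_{j:(i,j) \in R} x_{ij}\, \mathbf{v}_j, \qquad \mathbf{u}_i = A_i^{-1} \mathbf{b}_i, \]

and symmetrically for each item factor v_j with U fixed.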
MapReduce for ALS

Stage 1
- Mapper: group the rating data in X using the key for item j, and group the features in V using the key for item j.
- Reducer: align the ratings and the features for item j, and make a copy of V_j for each observed x_ij (each rating for item j is paired with the features V_j and keyed by its user i).

Stage 2
- Mapper: group the rating data (now carrying the item features V_j) using the key for user i.
- Reducer: standard ALS: calculate A and b from user i's (x_ij, V_j) pairs, and update U_i.
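A sketch of the Stage-2 computation in plain Java, so the linear algebra is visible: the reducer for user i receives all (x_ij, V_j) pairs produced by Stage 1 and applies the standard ALS update. Wiring this into a Hadoop Reducer follows the same pattern as the earlier examples; the class names and the naive Gaussian-elimination solver are illustrative.

import java.util.List;

public class AlsUserUpdate {

  // A rating x_ij paired with the item factor vector V_j copied to it in Stage 1.
  public static class RatingWithItemFactor {
    final double rating;
    final double[] vj;
    public RatingWithItemFactor(double rating, double[] vj) {
      this.rating = rating;
      this.vj = vj;
    }
  }

  // Compute u_i = (sum_j V_j V_j^T + lambda*I)^{-1} (sum_j x_ij V_j) for one user.
  public static double[] updateUser(List<RatingWithItemFactor> obs, int k, double lambda) {
    double[][] A = new double[k][k];
    double[] b = new double[k];
    for (RatingWithItemFactor o : obs) {
      for (int p = 0; p < k; p++) {
        b[p] += o.rating * o.vj[p];
        for (int q = 0; q < k; q++) A[p][q] += o.vj[p] * o.vj[q];
      }
    }
    for (int p = 0; p < k; p++) A[p][p] += lambda;  // regularization term lambda * I
    return solve(A, b);                             // u_i = A^{-1} b
  }

  // Naive Gaussian elimination with partial pivoting; solves A x = b for small k.
  static double[] solve(double[][] A, double[] b) {
    int n = b.length;
    for (int col = 0; col < n; col++) {
      int pivot = col;
      for (int r = col + 1; r < n; r++)
        if (Math.abs(A[r][col]) > Math.abs(A[pivot][col])) pivot = r;
      double[] tmpRow = A[col]; A[col] = A[pivot]; A[pivot] = tmpRow;
      double tmpVal = b[col]; b[col] = b[pivot]; b[pivot] = tmpVal;
      for (int r = col + 1; r < n; r++) {
        double f = A[r][col] / A[col][col];
        for (int c = col; c < n; c++) A[r][c] -= f * A[col][c];
        b[r] -= f * b[col];
      }
    }
    double[] x = new double[n];
    for (int r = n - 1; r >= 0; r--) {
      double s = b[r];
      for (int c = r + 1; c < n; c++) s -= A[r][c] * x[c];
      x[r] = s / A[r][r];
    }
    return x;
  }
}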
Cluster Coefficient

In graph mining, the clustering coefficient is a measure of the degree to which nodes in a graph tend to cluster together. The local clustering coefficient of a vertex quantifies how close its neighbors are to forming a clique (complete graph), and it is used to determine whether a graph is a small-world network.

How to maintain the Tier-2 neighbors?
[D. J. Watts and Steven Strogatz (June 1998). "Collective dynamics of 'small-world' networks". Nature 393 (6684): 440–442]
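In Watts and Strogatz's definition, the local clustering coefficient of vertex v_i with neighborhood N_i and degree k_i = |N_i| is

\[ C_i = \frac{2\, \bigl| \{ e_{jk} : v_j, v_k \in N_i,\ e_{jk} \in E \} \bigr|}{k_i (k_i - 1)}, \]

i.e. the number of edges that actually exist among the neighbors divided by the number that could exist.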
Cluster Coefficient with MapReduce

[Diagram: a two-stage pipeline of Mapper_i / Reducer_i jobs; the second stage calculates the cluster coefficient.]

A BFS-based method needs three stages, but actually we only need two!
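One way to realize the two-stage scheme (a sketch, not the talk's code): Stage 1 is a standard job that builds each node's adjacency list; in Stage 2 every node ships its own list to each of its neighbors, so a node's reducer sees the Tier-2 neighborhood and can count closed triangles locally. The input format and class names below are illustrative assumptions, and the driver is wired up like the earlier jobs.

import java.io.IOException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class ClusterCoefficientStage2 {

  // Map: input lines "nodeId <TAB> neighbor1,neighbor2,..." from Stage 1.
  public static class ShipMapper extends Mapper<Object, Text, Text, Text> {
    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] parts = value.toString().split("\t");
      if (parts.length < 2) return;
      String node = parts[0];
      String neighbors = parts[1];
      // Keep the node's own neighbor list...
      context.write(new Text(node), new Text("OWN\t" + neighbors));
      // ...and ship it to every neighbor, tagged with its owner (the Tier-2 information).
      for (String n : neighbors.split(",")) {
        context.write(new Text(n), new Text("NBR\t" + node + "\t" + neighbors));
      }
    }
  }

  // Reduce: intersect the neighbors' lists with the node's own list to count triangles.
  public static class CoefficientReducer extends Reducer<Text, Text, Text, DoubleWritable> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      Set<String> own = new HashSet<>();
      List<String[]> shipped = new ArrayList<>();
      for (Text v : values) {
        String[] parts = v.toString().split("\t");
        if (parts[0].equals("OWN")) {
          for (String n : parts[1].split(",")) own.add(n);
        } else {
          shipped.add(parts);  // parts[1] = owner, parts[2] = the owner's neighbors
        }
      }
      long closed = 0;  // counts every triangle through this node twice
      for (String[] nbr : shipped) {
        if (!own.contains(nbr[1])) continue;
        for (String w : nbr[2].split(",")) {
          if (own.contains(w) && !w.equals(key.toString())) closed++;
        }
      }
      long k = own.size();
      double coefficient = (k > 1) ? (double) closed / (k * (k - 1)) : 0.0;
      context.write(key, new DoubleWritable(coefficient));
    }
  }
}

Each triangle through a node is seen twice in the reducer (once via each of the two neighbors involved), so dividing by k(k-1) rather than k(k-1)/2 already gives the coefficient.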
Resource Entries to ML labs

- Mahout: Apache's scalable machine learning libraries
- Jimmy Lin's Lab: iSchool at the University of Maryland
- Jimeng Sun & Yan Rong's Collections: IBM TJ Watson Research Center
- Edward Chang & Yi Wang: Google Beijing
Advanced Topics in Machine Learning with MapReduce

- Probabilistic graphical models
- Gradient-based optimization methods
- Graph mining
- Others…
Some Advanced Tips

- Design your algorithm in a divide-and-conquer manner
- Make your functional units loosely dependent
- Carefully manage your memory and disk storage

Discussions…
Q&A

- Why not MPI? Hadoop is cheap in everything… D.P.T.H…
- What are the advantages of Hadoop? Scalability!
- How do you guarantee model equivalence? Guarantee equivalent/comparable functional logic.
- How can you beat a "large memory" solution? Clever use of sequential disk access.