![Page 1: Learning with Hadoop – A case study on MapReduce based Data Mining Evan Xiang, HKUST 1](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d1a5503460f949ef3fd/html5/thumbnails/1.jpg)
Learning with Hadoop – A Case Study on MapReduce-based Data Mining
Evan Xiang, HKUST
1
![Page 2: Learning with Hadoop – A case study on MapReduce based Data Mining Evan Xiang, HKUST 1](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d1a5503460f949ef3fd/html5/thumbnails/2.jpg)
Outline
Hadoop Basics
Case Study
  Word Count
  Pairwise Similarity
  PageRank
  K-Means Clustering
  Matrix Factorization
  Cluster Coefficient
Resource Entries to ML labs
Advanced Topics
Q&A
2
![Page 3: Learning with Hadoop – A case study on MapReduce based Data Mining Evan Xiang, HKUST 1](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d1a5503460f949ef3fd/html5/thumbnails/3.jpg)
Introduction to Hadoop
Hadoop Map/Reduce is a Java-based software framework for easily writing applications that process vast amounts of data (multi-terabyte data sets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.
3
![Page 4: Learning with Hadoop – A case study on MapReduce based Data Mining Evan Xiang, HKUST 1](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d1a5503460f949ef3fd/html5/thumbnails/4.jpg)
Hadoop Cluster Architecture
Client
Job submission node: JobTracker
HDFS master: NameNode
Slave nodes (three shown): each runs a TaskTracker and a DataNode
From Jimmy Lin’s slides
![Page 5: Learning with Hadoop – A case study on MapReduce based Data Mining Evan Xiang, HKUST 1](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d1a5503460f949ef3fd/html5/thumbnails/5.jpg)
Hadoop HDFS
5
![Page 6: Learning with Hadoop – A case study on MapReduce based Data Mining Evan Xiang, HKUST 1](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d1a5503460f949ef3fd/html5/thumbnails/6.jpg)
Hadoop Cluster Rack Awareness
6
![Page 7: Learning with Hadoop – A case study on MapReduce based Data Mining Evan Xiang, HKUST 1](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d1a5503460f949ef3fd/html5/thumbnails/7.jpg)
Hadoop Development Cycle (between You and the Hadoop Cluster)
1. Scp data to cluster
2. Move data into HDFS
3. Develop code locally
4. Submit MapReduce job
4a. Go back to Step 3
5. Move data out of HDFS
6. Scp data from cluster
From Jimmy Lin’s slides
![Page 8: Learning with Hadoop – A case study on MapReduce based Data Mining Evan Xiang, HKUST 1](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d1a5503460f949ef3fd/html5/thumbnails/8.jpg)
Divide and Conquer
Partition: the “Work” is split into w1, w2, w3, one piece per “worker”
Each “worker” produces a partial result: r1, r2, r3
Combine: r1, r2, r3 are merged into the “Result”
From Jimmy Lin’s slides
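The partition / work / combine pattern above can be sketched in a few lines of Python. This is only an illustration: the function names `partition`, `worker` and `combine` are hypothetical, summing numbers stands in for real work, and a thread pool stands in for a cluster.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(work, n_parts):
    """Split the work list into n_parts slices (w1, w2, ...)."""
    return [work[i::n_parts] for i in range(n_parts)]


def worker(chunk):
    """Each worker computes a partial result (r_i); here, a partial sum."""
    return sum(chunk)


def combine(partials):
    """Merge the partial results into the final result."""
    return sum(partials)


def divide_and_conquer(work, n_workers=3):
    chunks = partition(work, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(worker, chunks))
    return combine(partials)


result = divide_and_conquer(list(range(1, 101)))
print(result)
```

The workers never talk to each other; all coordination happens in the partition and combine steps, which is the property MapReduce relies on.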
![Page 9: Learning with Hadoop – A case study on MapReduce based Data Mining Evan Xiang, HKUST 1](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d1a5503460f949ef3fd/html5/thumbnails/9.jpg)
High-level MapReduce pipeline
9
![Page 10: Learning with Hadoop – A case study on MapReduce based Data Mining Evan Xiang, HKUST 1](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d1a5503460f949ef3fd/html5/thumbnails/10.jpg)
Detailed Hadoop MapReduce data flow
10
![Page 12: Learning with Hadoop – A case study on MapReduce based Data Mining Evan Xiang, HKUST 1](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d1a5503460f949ef3fd/html5/thumbnails/12.jpg)
Word Count with MapReduce
Map
Doc 1 “one fish, two fish”: one 1, two 1, fish 2
Doc 2 “red fish, blue fish”: red 1, blue 1, fish 2
Doc 3 “cat in the hat”: cat 1, hat 1
Shuffle and Sort: aggregate values by keys
Reduce
fish 4, one 1, two 1, red 1, blue 1, cat 1, hat 1
From Jimmy Lin’s slides
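A minimal in-memory sketch of the three word-count phases, written in plain Python rather than against the Hadoop API. The helper names (`map_word_count`, `shuffle`, `reduce_word_count`) are hypothetical. Unlike the figure, this sketch emits one pair per token instead of per-document counts, and it keeps every word, including “in” and “the”.

```python
import re
from collections import defaultdict


def map_word_count(doc_id, text):
    """Map: emit (word, 1) for every token in one document."""
    for word in re.findall(r"[a-z]+", text.lower()):
        yield word, 1


def shuffle(pairs):
    """Shuffle and sort: group all values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_word_count(word, counts):
    """Reduce: sum the counts collected for one word."""
    return word, sum(counts)


docs = {1: "one fish, two fish", 2: "red fish, blue fish", 3: "cat in the hat"}
mapped = [pair for doc_id, text in docs.items()
          for pair in map_word_count(doc_id, text)]
word_counts = dict(reduce_word_count(w, c) for w, c in shuffle(mapped).items())
print(word_counts)
```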
![Page 14: Learning with Hadoop – A case study on MapReduce based Data Mining Evan Xiang, HKUST 1](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d1a5503460f949ef3fd/html5/thumbnails/14.jpg)
14
Calculating document pairwise similarity
Trivial Solution
load each vector $O(N)$ times
load each term $O(df_t^2)$ times
Goal
scalable and efficient solution for large collections
From Jimmy Lin’s slides
![Page 15: Learning with Hadoop – A case study on MapReduce based Data Mining Evan Xiang, HKUST 1](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d1a5503460f949ef3fd/html5/thumbnails/15.jpg)
15
Better Solution
Load weights for each term once
Each term contributes $O(df_t^2)$ partial scores
Each term contributes only if it appears in both documents of a pair
From Jimmy Lin’s slides
![Page 16: Learning with Hadoop – A case study on MapReduce based Data Mining Evan Xiang, HKUST 1](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d1a5503460f949ef3fd/html5/thumbnails/16.jpg)
16
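Decomposition
Load weights for each term once
Each term contributes $O(df_t^2)$ partial scores
Each term contributes only if it appears in both documents of a pair
map: compute the partial scores contributed by each term
reduce: sum the partial scores for each document pair
From Jimmy Lin’s slides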
![Page 17: Learning with Hadoop – A case study on MapReduce based Data Mining Evan Xiang, HKUST 1](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d1a5503460f949ef3fd/html5/thumbnails/17.jpg)
17
Standard Indexing
(a) Map: tokenize each doc
(b) Shuffle: group values by terms
(c) Reduce: combine into posting lists
From Jimmy Lin’s slides
![Page 18: Learning with Hadoop – A case study on MapReduce based Data Mining Evan Xiang, HKUST 1](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d1a5503460f949ef3fd/html5/thumbnails/18.jpg)
Inverted Indexing with MapReduce
Map (each posting is: doc id, term frequency)
Doc 1 “one fish, two fish”: one (1, 1); two (1, 1); fish (1, 2)
Doc 2 “red fish, blue fish”: red (2, 1); blue (2, 1); fish (2, 2)
Doc 3 “cat in the hat”: cat (3, 1); hat (3, 1)
Shuffle and Sort: aggregate values by keys
Reduce
fish: (1, 2), (2, 2)
one: (1, 1)
two: (1, 1)
red: (2, 1)
cat: (3, 1)
blue: (2, 1)
hat: (3, 1)
From Jimmy Lin’s slides
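The same toy documents can be indexed with a small plain-Python sketch. It is not Hadoop code; `map_index` and `reduce_index` are hypothetical names, and the grouping dictionary plays the role of the shuffle phase. Stop words are not removed here, so “in” and “the” also get postings.

```python
import re
from collections import Counter, defaultdict


def map_index(doc_id, text):
    """Map: emit (term, (doc_id, term_frequency)) for each distinct term."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    for term, tf in counts.items():
        yield term, (doc_id, tf)


def reduce_index(term, postings):
    """Reduce: sort the postings of one term by document id."""
    return term, sorted(postings)


docs = {1: "one fish, two fish", 2: "red fish, blue fish", 3: "cat in the hat"}

grouped = defaultdict(list)  # shuffle: group postings by term
for doc_id, text in docs.items():
    for term, posting in map_index(doc_id, text):
        grouped[term].append(posting)

index = dict(reduce_index(t, p) for t, p in sorted(grouped.items()))
print(index["fish"])
```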
![Page 19: Learning with Hadoop – A case study on MapReduce based Data Mining Evan Xiang, HKUST 1](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d1a5503460f949ef3fd/html5/thumbnails/19.jpg)
19
Indexing (3-doc toy collection)
Documents
d1: Clinton Obama Clinton
d2: Clinton Cheney
d3: Clinton Barack Obama
Index (term: document, term frequency)
Clinton: (d1, 2), (d2, 1), (d3, 1)
Barack: (d3, 1)
Cheney: (d2, 1)
Obama: (d1, 1), (d3, 1)
From Jimmy Lin’s slides
![Page 20: Learning with Hadoop – A case study on MapReduce based Data Mining Evan Xiang, HKUST 1](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d1a5503460f949ef3fd/html5/thumbnails/20.jpg)
20
Pairwise Similarity
(a) Generate pairs (b) Group pairs (c) Sum pairs
[Figure: the postings of Clinton, Barack, Cheney and Obama are expanded into document pairs, the pairs are grouped, and their partial scores are summed.]
How to deal with the long list?
From Jimmy Lin’s slides
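The three steps (generate pairs, group pairs, sum pairs) can be tried on the three-document toy collection with the sketch below. Two assumptions are made for illustration: the documents are numbered 1 to 3 in the order they are listed, and raw term frequency is used as the term weight. The names `map_pairs` and `reduce_pairs` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Postings: term -> list of (doc_id, weight); weight = term frequency (assumed).
index = {
    "clinton": [(1, 2), (2, 1), (3, 1)],
    "barack": [(3, 1)],
    "cheney": [(2, 1)],
    "obama": [(1, 1), (3, 1)],
}


def map_pairs(term, postings):
    """(a) Generate pairs: one partial score per document pair sharing the term."""
    for (d1, w1), (d2, w2) in combinations(sorted(postings), 2):
        yield (d1, d2), w1 * w2


def reduce_pairs(pair, partial_scores):
    """(c) Sum pairs: add up the partial scores of one document pair."""
    return pair, sum(partial_scores)


grouped = defaultdict(list)  # (b) Group pairs
for term, postings in index.items():
    for pair, score in map_pairs(term, postings):
        grouped[pair].append(score)

similarity = dict(reduce_pairs(p, s) for p, s in sorted(grouped.items()))
print(similarity)
```

A posting list of length n produces n(n-1)/2 pairs, which is why a very frequent term such as “Clinton” dominates the cost.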
![Page 22: Learning with Hadoop – A case study on MapReduce based Data Mining Evan Xiang, HKUST 1](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d1a5503460f949ef3fd/html5/thumbnails/22.jpg)
PageRank
PageRank – an information propagation model
Intensive access of neighborhood list
22
![Page 23: Learning with Hadoop – A case study on MapReduce based Data Mining Evan Xiang, HKUST 1](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d1a5503460f949ef3fd/html5/thumbnails/23.jpg)
PageRank with MapReduce
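Input (adjacency lists): n1 [n2, n4], n2 [n3, n5], n3 [n4], n4 [n5], n5 [n1, n2, n3]
Map: each node sends its contribution to its neighbors: n1 → n2, n4; n2 → n3, n5; n3 → n4; n4 → n5; n5 → n1, n2, n3
Shuffle: contributions are grouped by destination node (n1 to n5)
Reduce: each node is written out again with its adjacency list: n1 [n2, n4], n2 [n3, n5], n3 [n4], n4 [n5], n5 [n1, n2, n3]
How to maintain the graph structure?
From Jimmy Lin’s slides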
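One common answer to the question above is to let the mapper emit the adjacency list itself alongside the rank contributions, so the reducer can rebuild the node. The sketch below illustrates that idea in plain Python on the five-node graph. The damping factor of 0.85, the tagged-tuple message format and all function names are assumptions, and dangling nodes are not handled.

```python
from collections import defaultdict

DAMPING = 0.85  # assumed value; the slide does not give one

graph = {
    "n1": ["n2", "n4"],
    "n2": ["n3", "n5"],
    "n3": ["n4"],
    "n4": ["n5"],
    "n5": ["n1", "n2", "n3"],
}


def map_pagerank(node, rank, neighbors):
    """Map: pass the adjacency list along, then spread rank to the neighbors."""
    yield node, ("graph", neighbors)
    for neighbor in neighbors:
        yield neighbor, ("mass", rank / len(neighbors))


def reduce_pagerank(node, values, n_nodes):
    """Reduce: recover the adjacency list and sum the incoming rank mass."""
    neighbors, mass = [], 0.0
    for kind, payload in values:
        if kind == "graph":
            neighbors = payload
        else:
            mass += payload
    rank = (1 - DAMPING) / n_nodes + DAMPING * mass
    return node, rank, neighbors


def pagerank_iteration(ranks, graph):
    grouped = defaultdict(list)  # shuffle: group messages by destination node
    for node, neighbors in graph.items():
        for key, value in map_pagerank(node, ranks[node], neighbors):
            grouped[key].append(value)
    new_ranks, new_graph = {}, {}
    for node, values in grouped.items():
        _, new_ranks[node], new_graph[node] = reduce_pagerank(node, values, len(graph))
    return new_ranks, new_graph


ranks = {node: 1 / len(graph) for node in graph}
for _ in range(20):
    ranks, graph = pagerank_iteration(ranks, graph)
print(ranks)
```

Because the output of one iteration has the same shape as its input, the job can simply be chained with itself until the ranks stop changing.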
![Page 25: Learning with Hadoop – A case study on MapReduce based Data Mining Evan Xiang, HKUST 1](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d1a5503460f949ef3fd/html5/thumbnails/25.jpg)
K-Means Clustering
25
![Page 26: Learning with Hadoop – A case study on MapReduce based Data Mining Evan Xiang, HKUST 1](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d1a5503460f949ef3fd/html5/thumbnails/26.jpg)
K-Means Clustering with MapReduce
26
Mapper_i-1, Mapper_i, Mapper_i+1 → Reducer_i-1, Reducer_i, Reducer_i+1
Each Mapper loads a set of data samples and assigns each sample to its nearest centroid.
Each Mapper needs to keep a copy of the centroids.
How to set the initial centroids is very important! Usually we set the centroids using Canopy Clustering.
[McCallum, Nigam and Ungar: "Efficient Clustering of High Dimensional Data Sets with Application to Reference Matching", SIGKDD 2000]
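A rough sketch of one K-Means iteration in this style is shown below. The mapper side follows the description above; the reducer side (averaging the samples assigned to each centroid) is the standard K-Means update and is assumed here. The initial centroids are hand-picked rather than produced by Canopy Clustering, and all names are hypothetical.

```python
from collections import defaultdict


def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda c: sum((p - q) ** 2 for p, q in zip(point, centroids[c])),
    )


def map_assign(points, centroids):
    """Map: each mapper holds a copy of the centroids and labels its samples."""
    for point in points:
        yield nearest(point, centroids), point


def reduce_update(cluster_points):
    """Reduce: the new centroid is the mean of the samples assigned to it."""
    n = len(cluster_points)
    return tuple(sum(dim) / n for dim in zip(*cluster_points))


def kmeans_iteration(splits, centroids):
    grouped = defaultdict(list)
    for split in splits:  # one split per mapper
        for cluster, point in map_assign(split, centroids):
            grouped[cluster].append(point)
    # Keep the old centroid when no sample was assigned to it.
    return [
        reduce_update(grouped[c]) if grouped[c] else centroids[c]
        for c in range(len(centroids))
    ]


splits = [[(0.0, 0.0), (0.0, 1.0)], [(10.0, 10.0), (10.0, 11.0)]]
centroids = [(1.0, 1.0), (9.0, 9.0)]  # hand-picked, not Canopy Clustering
for _ in range(5):
    centroids = kmeans_iteration(splits, centroids)
print(centroids)
```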
![Page 28: Learning with Hadoop – A case study on MapReduce based Data Mining Evan Xiang, HKUST 1](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d1a5503460f949ef3fd/html5/thumbnails/28.jpg)
Matrix Factorization for Link Prediction
In this task, we observe a sparse matrix $X \in \mathbb{R}^{m \times n}$ with entries $x_{ij}$. Let $R = \{(i, j, r) : r = x_{ij},\ x_{ij} \neq 0\}$ denote the set of observed links in the system. In order to predict the unobserved links in $X$, we model the users and the items by a user factor matrix $U \in \mathbb{R}^{k \times m}$ and an item factor matrix $V \in \mathbb{R}^{k \times n}$. The goal is to approximate the link matrix $X$ by multiplying the factor matrices $U$ and $V$, which can be learnt by minimizing:
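One commonly used objective of this kind is the regularized squared error over the observed entries. The exact form below, including the regularization weight $\lambda$, is an assumption and is not taken from the slide; $u_i$ and $v_j$ denote column $i$ of $U$ and column $j$ of $V$.

```latex
\min_{U, V} \sum_{(i, j, r) \in R} \left( r - u_i^{\top} v_j \right)^2
  + \lambda \left( \lVert U \rVert_F^2 + \lVert V \rVert_F^2 \right)
```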
28
![Page 29: Learning with Hadoop – A case study on MapReduce based Data Mining Evan Xiang, HKUST 1](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d1a5503460f949ef3fd/html5/thumbnails/29.jpg)
Solving Matrix Factorization via Alternating Least Squares
Given X and V, updating U:
[Figure: row i of X (m × n) and V (k × n) are used to form A (k × k) and b (length k), from which u_i is computed.]
Similarly, given X and U, we can alternately update V
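A minimal sketch of the per-user update, assuming the usual regularized normal equations: A is the sum of $v_j v_j^{\top}$ over the items user i has rated plus $\lambda I$, b is the sum of $x_{ij} v_j$, and $u_i$ solves $A u_i = b$. The slide names A and b but their exact definitions here are assumed, as are the function names and the regularization weight `lam`. Pure Python is used to keep the example dependency-free.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x


def update_user(ratings_i, V, lam=0.1):
    """Recompute u_i from user i's observed ratings {j: x_ij} and item factors V[j]."""
    k = len(next(iter(V.values())))
    A = [[lam if p == q else 0.0 for q in range(k)] for p in range(k)]
    b = [0.0] * k
    for j, x_ij in ratings_i.items():
        v = V[j]
        for p in range(k):
            b[p] += x_ij * v[p]
            for q in range(k):
                A[p][q] += v[p] * v[q]
    return solve(A, b)


V = {0: [1.0, 0.0], 1: [0.0, 1.0]}  # toy item factors v_j with k = 2
u_i = update_user({0: 2.0, 1: 3.0}, V, lam=0.0)
print(u_i)
```

Each user's update only needs that user's ratings and the matching item factors, which is what makes the computation easy to distribute.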
![Page 30: Learning with Hadoop – A case study on MapReduce based Data Mining Evan Xiang, HKUST 1](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d1a5503460f949ef3fd/html5/thumbnails/30.jpg)
MapReduce for ALS
30
Stage 1
Mapper_i: group rating data in X for item j; group features in V for item j
Reducer_i: align ratings and features for item j, and make a copy of V_j for each observed x_ij
Output records: (i-1, V_j), (i, V_j), (i+1, V_j)
Stage 2
Mapper_i: group rating data in X for user i, e.g. (i, V_j), (i, V_{j+2}), (i+1, V_j)
Reducer_i: standard ALS: calculate A and b, and update U_i
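The two stages amount to a reduce-side join followed by a regrouping. The sketch below mimics that data flow in memory, under the assumption that every rated item has a feature vector. The record layout and the names `stage1` and `stage2` are hypothetical; the A and b computation itself is left out.

```python
from collections import defaultdict

ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0)]  # (user i, item j, x_ij)
V = {0: [1.0, 0.0], 1: [0.0, 1.0]}                 # item j -> feature vector V_j


def stage1(ratings, V):
    """Join ratings with item features, keyed by item j."""
    by_item = defaultdict(lambda: {"ratings": [], "features": None})
    for i, j, x_ij in ratings:        # map: key each rating by item j
        by_item[j]["ratings"].append((i, x_ij))
    for j, v_j in V.items():          # map: key each feature vector by item j
        by_item[j]["features"] = v_j
    for j, group in by_item.items():  # reduce: copy V_j next to each observed x_ij
        for i, x_ij in group["ratings"]:
            yield i, (j, x_ij, list(group["features"]))


def stage2(stage1_output):
    """Group the joined records by user i, ready for the A and b computation."""
    by_user = defaultdict(list)
    for i, record in stage1_output:
        by_user[i].append(record)
    return dict(by_user)


per_user = stage2(stage1(ratings, V))
print(per_user)
```

Copying V_j once per observed rating costs extra I/O, but it means the Stage 2 reducer for user i has everything it needs locally.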
![Page 32: Learning with Hadoop – A case study on MapReduce based Data Mining Evan Xiang, HKUST 1](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d1a5503460f949ef3fd/html5/thumbnails/32.jpg)
Cluster Coefficient
32
In graph mining, a clustering coefficient is a measure of the degree to which nodes in a graph tend to cluster together. The local clustering coefficient of a vertex quantifies how close its neighbors are to being a clique (complete graph); it is used to determine whether a graph is a small-world network.
How to maintain the Tier-2 neighbors?
[D. J. Watts and Steven Strogatz: "Collective dynamics of 'small-world' networks", Nature 393 (6684): 440–442, June 1998]
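For an undirected graph, the local clustering coefficient described above is usually written as follows, where $k_i$ is the number of neighbors of vertex $i$ and $e_i$ is the number of edges that exist among those neighbors. The symbols are introduced here for illustration and do not appear on the slide.

```latex
C_i = \frac{2\, e_i}{k_i \,(k_i - 1)}, \qquad k_i \ge 2
```

A vertex whose neighbors form a clique has $e_i = k_i (k_i - 1) / 2$ and therefore $C_i = 1$.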
![Page 33: Learning with Hadoop – A case study on MapReduce based Data Mining Evan Xiang, HKUST 1](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d1a5503460f949ef3fd/html5/thumbnails/33.jpg)
Cluster Coefficient with MapReduce
33
Stage 1: Mapper_i → Reducer_i
Stage 2: Mapper_i → Reducer_i, calculate the cluster coefficient
A BFS-based method needs three stages, but actually we only need two!
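The slide does not spell out what each of the two stages does, so the sketch below is only one possible layout, assumed for illustration. In Stage 1 every node sends its neighbor list to each of its neighbors, which gives each node a view of its Tier-2 neighbors; Stage 2 turns the resulting counts into coefficients. The toy graph and the function names are hypothetical, and nodes without neighbors are not emitted.

```python
from collections import defaultdict

# Undirected toy graph as adjacency sets (assumed data).
graph = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}


def stage1(graph):
    """Send each node's neighbor list to its neighbors, then count the
    edges that exist among the neighbors of each node."""
    inbox = defaultdict(list)
    for u, neighbors in graph.items():  # map
        for v in neighbors:
            inbox[v].append((u, neighbors))
    for v, received in inbox.items():   # reduce
        own = {u for u, _ in received}  # v's own neighbors
        # Each edge among neighbors is seen from both endpoints, hence // 2.
        links = sum(len(nbrs & own) for _, nbrs in received) // 2
        yield v, (len(own), links)


def stage2(stage1_output):
    """Calculate the local clustering coefficient of each node."""
    result = {}
    for v, (degree, links) in stage1_output:
        possible = degree * (degree - 1) / 2
        result[v] = links / possible if possible else 0.0
    return result


coefficients = stage2(stage1(graph))
print(coefficients)
```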
![Page 34: Learning with Hadoop – A case study on MapReduce based Data Mining Evan Xiang, HKUST 1](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d1a5503460f949ef3fd/html5/thumbnails/34.jpg)
Resource Entries to ML labs
Mahout: Apache’s scalable machine learning libraries
Jimmy Lin’s Lab: iSchool at the University of Maryland
Jimeng Sun & Yan Rong’s Collections: IBM TJ Watson Research Center
Edward Chang & Yi Wang: Google Beijing
34
![Page 35: Learning with Hadoop – A case study on MapReduce based Data Mining Evan Xiang, HKUST 1](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d1a5503460f949ef3fd/html5/thumbnails/35.jpg)
Advanced Topics in Machine Learning with MapReduce
35
Probabilistic Graphical models
Gradient based optimization methods
Graph Mining
Others…
![Page 36: Learning with Hadoop – A case study on MapReduce based Data Mining Evan Xiang, HKUST 1](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d1a5503460f949ef3fd/html5/thumbnails/36.jpg)
Some Advanced Tips
Design your algorithm in a divide-and-conquer manner
Make your functional units loosely dependent
Carefully manage your memory and disk storage
Discussions…
36
![Page 38: Learning with Hadoop – A case study on MapReduce based Data Mining Evan Xiang, HKUST 1](https://reader036.vdocuments.us/reader036/viewer/2022062421/56649d1a5503460f949ef3fd/html5/thumbnails/38.jpg)
Q&A
Why not MPI? Hadoop is cheap in everything… D.P.T.H…
What are the advantages of Hadoop? Scalability!
How do you guarantee model equivalence? Guarantee equivalent/comparable function logic.
How can you beat a “large memory” solution? Clever use of sequential disk access.
38