![Page 1: Machine Learning: Think Big and Parallel - Day 2 · Unsupervised Learning | day2 Clustering: k-means, Spectral Clustering Dimensionality Reduction: PCA, Matrix Factorization for Recommender](https://reader034.vdocuments.us/reader034/viewer/2022050304/5f6cb53f0ccc94670c730763/html5/thumbnails/1.jpg)
page.1
Machine Learning: Think Big and Parallel
Day 2
Inderjit S. Dhillon
Dept of Computer Science
UT Austin
CS395T: Topics in Multicore Programming
Oct 3, 2013
![Page 2: Machine Learning: Think Big and Parallel - Day 2 · Unsupervised Learning | day2 Clustering: k-means, Spectral Clustering Dimensionality Reduction: PCA, Matrix Factorization for Recommender](https://reader034.vdocuments.us/reader034/viewer/2022050304/5f6cb53f0ccc94670c730763/html5/thumbnails/2.jpg)
page.2
Outline
Scikit-learn: Machine Learning in Python
Supervised Learning — day1
Regression: Least Squares, Lasso
Classification: kNN, SVM
Unsupervised Learning — day2
Clustering: k-means, Spectral Clustering
Dimensionality Reduction: PCA, Matrix Factorization for Recommender Systems
![Page 3: Machine Learning: Think Big and Parallel - Day 2 · Unsupervised Learning | day2 Clustering: k-means, Spectral Clustering Dimensionality Reduction: PCA, Matrix Factorization for Recommender](https://reader034.vdocuments.us/reader034/viewer/2022050304/5f6cb53f0ccc94670c730763/html5/thumbnails/3.jpg)
page.3
Clustering
![Page 4: Machine Learning: Think Big and Parallel - Day 2 · Unsupervised Learning | day2 Clustering: k-means, Spectral Clustering Dimensionality Reduction: PCA, Matrix Factorization for Recommender](https://reader034.vdocuments.us/reader034/viewer/2022050304/5f6cb53f0ccc94670c730763/html5/thumbnails/4.jpg)
page.4
Clustering
![Page 5: Machine Learning: Think Big and Parallel - Day 2 · Unsupervised Learning | day2 Clustering: k-means, Spectral Clustering Dimensionality Reduction: PCA, Matrix Factorization for Recommender](https://reader034.vdocuments.us/reader034/viewer/2022050304/5f6cb53f0ccc94670c730763/html5/thumbnails/5.jpg)
page.5
Clustering:
k-means Clustering
![Page 6: Machine Learning: Think Big and Parallel - Day 2 · Unsupervised Learning | day2 Clustering: k-means, Spectral Clustering Dimensionality Reduction: PCA, Matrix Factorization for Recommender](https://reader034.vdocuments.us/reader034/viewer/2022050304/5f6cb53f0ccc94670c730763/html5/thumbnails/6.jpg)
page.6
Clustering
Goal is to group “similar” instances together
Given data points $x_i \in \mathbb{R}^d$, $i = 1, 2, \ldots, N$
But no labels – unsupervised learning
Useful for exploratory data analysis
![Page 8: Machine Learning: Think Big and Parallel - Day 2 · Unsupervised Learning | day2 Clustering: k-means, Spectral Clustering Dimensionality Reduction: PCA, Matrix Factorization for Recommender](https://reader034.vdocuments.us/reader034/viewer/2022050304/5f6cb53f0ccc94670c730763/html5/thumbnails/8.jpg)
page.8
Clustering
Need a measure of similarity (or distance) between two points x and y
Popular distance metrics:
Squared Euclidean distance: $d(x, y) = \|x - y\|_2^2$
Cosine similarity: $d(x, y) = (x^T y) / (\|x\| \, \|y\|)$
Manhattan distance: $d(x, y) = \|x - y\|_1$
Clustering results are crucially dependent on the distance metric
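As a minimal sketch, the three measures listed on this slide can be written with NumPy as follows. The function names and the two example vectors are illustrative assumptions, not part of the slides.

```python
import numpy as np

def squared_euclidean(x, y):
    """Squared Euclidean distance ||x - y||_2^2."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(diff @ diff)

def cosine_similarity(x, y):
    """Cosine similarity x^T y / (||x|| ||y||); larger means more similar."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

def manhattan(x, y):
    """Manhattan distance ||x - y||_1."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.abs(diff).sum())

# Hypothetical example vectors (orthogonal to each other).
x = np.array([1.0, 0.0])
y = np.array([0.0, 2.0])
d_euclid = squared_euclidean(x, y)
s_cosine = cosine_similarity(x, y)
d_manhattan = manhattan(x, y)
```

Note that cosine similarity grows with similarity while the other two shrink, so a clustering routine has to treat it with the opposite sign convention.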
![Page 9: Machine Learning: Think Big and Parallel - Day 2 · Unsupervised Learning | day2 Clustering: k-means, Spectral Clustering Dimensionality Reduction: PCA, Matrix Factorization for Recommender](https://reader034.vdocuments.us/reader034/viewer/2022050304/5f6cb53f0ccc94670c730763/html5/thumbnails/9.jpg)
page.9
k-means Clustering
Find k clusters that minimize the objective:
$$J = \sum_{i=1}^{k} \sum_{x \in C_i} \|x - m_i\|_2^2$$
$C_i$: the set of points in cluster i
$m_i$: the mean (center) of cluster i
Objective is non-convex and the problem is NP-hard in general
Note: for k = 1, $J = \sum_{x} \|x - m\|_2^2$ ⇒ solution is $m^* = \frac{1}{N} \sum_{x} x$
![Page 10: Machine Learning: Think Big and Parallel - Day 2 · Unsupervised Learning | day2 Clustering: k-means, Spectral Clustering Dimensionality Reduction: PCA, Matrix Factorization for Recommender](https://reader034.vdocuments.us/reader034/viewer/2022050304/5f6cb53f0ccc94670c730763/html5/thumbnails/10.jpg)
page.10
k-means Algorithm (Batch)
Input: data points $x \in \mathbb{R}^d$, number of clusters k
Output: cluster assignment $C_i$ of data points, $i = 1, 2, \ldots, k$
1: Randomly partition the data into k clusters
2: while not converged do
3: Compute mean of each cluster i: $m_i = \frac{1}{n_i} \sum_{x \in C_i} x$
4: For each x, find its new cluster index: $\pi(x) = \arg\min_{1 \le i \le k} \|x - m_i\|_2^2$
5: Update clusters: $C_i = \{x \mid \pi(x) = i\}$
6: end while
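The batch algorithm above could be sketched in NumPy roughly as follows. The function name `kmeans_batch`, the stopping rule (assignments stop changing), the handling of empty clusters, and the toy data are assumptions made for illustration.

```python
import numpy as np

def kmeans_batch(X, k, max_iter=100, seed=0):
    """Batch k-means following steps 1-6 of the slide (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Step 1: random partition; shuffling the pattern 0,1,...,k-1,0,1,...
    # keeps every cluster non-empty as long as n >= k.
    labels = rng.permutation(np.arange(n) % k)
    means = np.zeros((k, d))
    for _ in range(max_iter):
        # Step 3: mean of each cluster (an emptied cluster keeps its old mean).
        for i in range(k):
            members = X[labels == i]
            if len(members) > 0:
                means[i] = members.mean(axis=0)
        # Step 4: nearest mean under squared Euclidean distance.
        dists = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 5 and convergence check: stop when no assignment changes.
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels, means

# Hypothetical toy data: two well-separated groups of three points.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
labels, means = kmeans_batch(X, k=2)
```

On data this well separated the result does not depend on the random partition, which will generally not hold for harder inputs.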
![Page 11: Machine Learning: Think Big and Parallel - Day 2 · Unsupervised Learning | day2 Clustering: k-means, Spectral Clustering Dimensionality Reduction: PCA, Matrix Factorization for Recommender](https://reader034.vdocuments.us/reader034/viewer/2022050304/5f6cb53f0ccc94670c730763/html5/thumbnails/11.jpg)
page.11
k-means Clustering
![Page 12: Machine Learning: Think Big and Parallel - Day 2 · Unsupervised Learning | day2 Clustering: k-means, Spectral Clustering Dimensionality Reduction: PCA, Matrix Factorization for Recommender](https://reader034.vdocuments.us/reader034/viewer/2022050304/5f6cb53f0ccc94670c730763/html5/thumbnails/12.jpg)
page.12
Convergence of k-means
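Let the objective at the t-th iteration be $J^{(t)} = \sum_{i=1}^{k} \sum_{x \in C_i^{(t)}} \|x - m_i^{(t)}\|^2$
$$
\begin{aligned}
J^{(t)} &= \sum_{i=1}^{k} \sum_{x \in C_i^{(t)}} \|x - m_i^{(t)}\|^2 \\
&\ge \sum_{i=1}^{k} \sum_{x \in C_i^{(t)}} \|x - m_{\pi(x)}^{(t)}\|^2
 = \sum_{i=1}^{k} \sum_{x \in C_i^{(t+1)}} \|x - m_i^{(t)}\|^2 \\
&\ge \sum_{i=1}^{k} \sum_{x \in C_i^{(t+1)}} \|x - m_i^{(t+1)}\|^2 = J^{(t+1)}
\end{aligned}
$$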
Each step never increases the objective, and the objective is bounded below by 0, so the algorithm is guaranteed to converge
But not necessarily to the global minimum
![Page 13: Machine Learning: Think Big and Parallel - Day 2 · Unsupervised Learning | day2 Clustering: k-means, Spectral Clustering Dimensionality Reduction: PCA, Matrix Factorization for Recommender](https://reader034.vdocuments.us/reader034/viewer/2022050304/5f6cb53f0ccc94670c730763/html5/thumbnails/13.jpg)
page.13
k-means Algorithm (Online)
Input: data points $x \in \mathbb{R}^d$, number of clusters k
Output: cluster assignment $C_i$ of data points, $i = 1, 2, \ldots, k$
1: Initialize means $m_i$ and $n_i = 0$, $i = 1, 2, \ldots, k$
2: while not converged do
3: Pick a data point x and determine cluster $\pi(x) = \arg\min_{1 \le i \le k} \|x - m_i\|_2^2$
4: Update mean $m_{\pi(x)}$: $n_{\pi(x)} = n_{\pi(x)} + 1$ and $m_{\pi(x)} = m_{\pi(x)} + \frac{1}{n_{\pi(x)}} (x - m_{\pi(x)})$
5: end while
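A minimal sketch of the online update, assuming the points are simply visited in order for a fixed number of passes (the slide leaves both the picking rule and the convergence test open). The name `kmeans_online` and the toy data are illustrative.

```python
import numpy as np

def kmeans_online(X, init_means, n_passes=1):
    """Online k-means: update one mean per data point (illustrative sketch)."""
    means = np.array(init_means, dtype=float)
    counts = np.zeros(len(means), dtype=int)   # n_i = 0 initially
    assign = np.zeros(len(X), dtype=int)
    for _ in range(n_passes):
        for idx, x in enumerate(X):
            # Step 3: nearest mean under squared Euclidean distance.
            c = int(((means - x) ** 2).sum(axis=1).argmin())
            # Step 4: running-average update of the chosen mean.
            counts[c] += 1
            means[c] += (x - means[c]) / counts[c]
            assign[idx] = c
    return assign, means, counts

# Hypothetical 1-D example with two obvious groups.
X = np.array([[0.0], [10.0], [1.0], [11.0]])
assign, means, counts = kmeans_online(X, init_means=[[0.0], [10.0]])
```

Because $n_i$ starts at 0, the first point assigned to a cluster replaces its initial mean entirely; afterwards each mean is the running average of the points it has absorbed.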
![Page 14: Machine Learning: Think Big and Parallel - Day 2 · Unsupervised Learning | day2 Clustering: k-means, Spectral Clustering Dimensionality Reduction: PCA, Matrix Factorization for Recommender](https://reader034.vdocuments.us/reader034/viewer/2022050304/5f6cb53f0ccc94670c730763/html5/thumbnails/14.jpg)
page.14
k-means with Bregman Divergences
Bregman divergences:
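$$d_\Phi(x, y) = \Phi(x) - \Phi(y) - \langle x - y, \nabla \Phi(y) \rangle,$$
where $\Phi$ is strictly convex & differentiable
Examples of $d_\Phi(x, y)$:
Squared Euclidean distance: $\|x - y\|_2^2$
KL-divergence: $\sum_i x_i \log\left(\frac{x_i}{y_i}\right)$
Itakura-Saito distance: $\sum_i \left(\frac{x_i}{y_i} - \log\left(\frac{x_i}{y_i}\right) - 1\right)$
For Bregman divergences, the arithmetic mean is the best predictor:
$$\frac{1}{N} \sum_{i=1}^{N} x_i = \arg\min_{c} \sum_{i=1}^{N} d_\Phi(x_i, c)$$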
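The definition and the mean-as-best-predictor property can be checked numerically. The sketch below assumes two choices of $\Phi$: the squared norm, which gives the squared Euclidean distance, and the negative entropy $\sum_i x_i \log x_i$, which gives the KL-divergence plus the extra terms $-\sum_i x_i + \sum_i y_i$ when the vectors are not normalized. All function names and the random data are illustrative.

```python
import numpy as np

def bregman(phi, grad_phi, x, y):
    """d_Phi(x, y) = Phi(x) - Phi(y) - <x - y, grad Phi(y)>."""
    return float(phi(x) - phi(y) - np.dot(x - y, grad_phi(y)))

def sq_norm(x):
    return float(np.dot(x, x))

def sq_norm_grad(x):
    return 2.0 * x

def neg_entropy(x):
    return float(np.sum(x * np.log(x)))

def neg_entropy_grad(x):
    return np.log(x) + 1.0

# Phi(x) = ||x||^2 recovers the squared Euclidean distance.
x = np.array([1.0, 2.0])
y = np.array([3.0, 0.5])
d_sq = bregman(sq_norm, sq_norm_grad, x, y)

# Compare the total divergence at the mean with nearby candidate centers.
rng = np.random.default_rng(0)
points = rng.random((20, 3)) + 0.5      # strictly positive so log is defined
mean = points.mean(axis=0)

def total_divergence(c):
    return sum(bregman(neg_entropy, neg_entropy_grad, p, c) for p in points)

at_mean = total_divergence(mean)
perturbed = [total_divergence(mean + rng.uniform(-0.2, 0.2, size=3))
             for _ in range(50)]
mean_is_best = all(at_mean <= value + 1e-12 for value in perturbed)
```

This is only a spot check against random perturbations, not a proof, but it illustrates why the k-means mean update carries over unchanged to any Bregman divergence.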
![Page 15: Machine Learning: Think Big and Parallel - Day 2 · Unsupervised Learning | day2 Clustering: k-means, Spectral Clustering Dimensionality Reduction: PCA, Matrix Factorization for Recommender](https://reader034.vdocuments.us/reader034/viewer/2022050304/5f6cb53f0ccc94670c730763/html5/thumbnails/15.jpg)
page.15
Clustering:
Spectral Clustering
![Page 16: Machine Learning: Think Big and Parallel - Day 2 · Unsupervised Learning | day2 Clustering: k-means, Spectral Clustering Dimensionality Reduction: PCA, Matrix Factorization for Recommender](https://reader034.vdocuments.us/reader034/viewer/2022050304/5f6cb53f0ccc94670c730763/html5/thumbnails/16.jpg)
page.16
Spectral Clustering
Given:
Number of clusters k
Graph $G = (V, E)$
Set of nodes: $V = \{1, \ldots, n\}$
Set of edges: $E = \{e_{ij} \mid i, j \in V\}$, where $e_{ij}$ is the similarity between nodes i and j
Weighted adjacency matrix $W \in \mathbb{R}^{n \times n}$:
$$W_{ij} = \begin{cases} e_{ij}, & \text{if there is an edge between nodes } i \text{ and } j \\ 0, & \text{otherwise} \end{cases}$$
W is symmetric if G is an undirected graph
Degree matrix: a diagonal matrix D where $D_{ii} = \sum_{j=1}^{n} W_{ij}$
![Page 17: Machine Learning: Think Big and Parallel - Day 2 · Unsupervised Learning | day2 Clustering: k-means, Spectral Clustering Dimensionality Reduction: PCA, Matrix Factorization for Recommender](https://reader034.vdocuments.us/reader034/viewer/2022050304/5f6cb53f0ccc94670c730763/html5/thumbnails/17.jpg)
page.17
Spectral Clustering
Goal:
Partition V into k disjoint clusters: $V_1, \ldots, V_k$
Within-cluster: large weights
Between-cluster: small weights
An ideal but trivial case: G has exactly k connected components
![Page 18: Machine Learning: Think Big and Parallel - Day 2 · Unsupervised Learning | day2 Clustering: k-means, Spectral Clustering Dimensionality Reduction: PCA, Matrix Factorization for Recommender](https://reader034.vdocuments.us/reader034/viewer/2022050304/5f6cb53f0ccc94670c730763/html5/thumbnails/18.jpg)
page.18
Graph Cut
Small cut between clusters
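$$\mathrm{cut}(A, B) = \frac{1}{2} \sum_{i \in A,\, j \in B} W_{ij}$$
Balance of cluster sizes $|V_i|$
Objective:
$$\mathrm{RatioCut}(V_1, \ldots, V_k) = \sum_{i=1}^{k} \frac{\mathrm{cut}(V_i, V \setminus V_i)}{|V_i|}$$
Goal: minimize $\mathrm{RatioCut}(V_1, \ldots, V_k)$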
![Page 19: Machine Learning: Think Big and Parallel - Day 2 · Unsupervised Learning | day2 Clustering: k-means, Spectral Clustering Dimensionality Reduction: PCA, Matrix Factorization for Recommender](https://reader034.vdocuments.us/reader034/viewer/2022050304/5f6cb53f0ccc94670c730763/html5/thumbnails/19.jpg)
page.19
Graph Laplacian
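Laplacian: $L = D - W$
L: symmetric and positive semi-definite
Eigenvalues: $0 \le \lambda_1 \le \lambda_2 \le \cdots \le \lambda_n$
Number of connected components in G = number of zero eigenvalues of L
For all $f \in \mathbb{R}^n$,
$$f^T L f = \frac{1}{2} \sum_{i,j=1}^{n} W_{ij} (f_i - f_j)^2$$
Most importantly,
$$\mathrm{RatioCut}(V_1, \ldots, V_k) = \mathrm{trace}(F^T L F)$$
for a special $F = [f_1, \ldots, f_k]$, where
$$F_{ij} = \begin{cases} 1/\sqrt{|V_j|}, & \text{if } i \in V_j \\ 0, & \text{otherwise} \end{cases}$$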
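Two of the Laplacian properties listed here are easy to confirm numerically. The sketch below uses a hypothetical 5-node graph with two connected components (a triangle and a single weighted edge); the graph and the test vector `f` are assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical graph: nodes 0-2 form a triangle, nodes 3-4 share one edge.
W = np.array([
    [0.0, 1.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 2.0],
    [0.0, 0.0, 0.0, 2.0, 0.0],
])
D = np.diag(W.sum(axis=1))      # degree matrix
L = D - W                       # graph Laplacian

# Number of (numerically) zero eigenvalues = number of connected components.
eigvals = np.linalg.eigvalsh(L)
num_zero = int(np.sum(np.abs(eigvals) < 1e-10))

# Quadratic form identity f^T L f = 1/2 * sum_ij W_ij (f_i - f_j)^2.
f = np.array([1.0, -2.0, 0.5, 3.0, -1.0])
quad_form = float(f @ L @ f)
pairwise = float(0.5 * np.sum(W * (f[:, None] - f[None, :]) ** 2))
```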
![Page 20: Machine Learning: Think Big and Parallel - Day 2 · Unsupervised Learning | day2 Clustering: k-means, Spectral Clustering Dimensionality Reduction: PCA, Matrix Factorization for Recommender](https://reader034.vdocuments.us/reader034/viewer/2022050304/5f6cb53f0ccc94670c730763/html5/thumbnails/20.jpg)
page.20
Relaxation of Cut Minimization
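In general, minimizing RatioCut is NP-hard!
However, based on
$$\mathrm{RatioCut}(V_1, \ldots, V_k) = \mathrm{trace}(F^T L F),$$
we have the following relaxation:
Solve
$$F^* = \arg\min_{F \in \mathbb{R}^{n \times k},\ F^T F = I_k} \mathrm{trace}(F^T L F),$$
whose columns are exactly the first k eigenvectors of L, i.e. those with the k smallest eigenvalues (the constraint $F^T F = I_k$ is satisfied by the special cluster-indicator F and rules out the trivial solution F = 0)
Recover $V_1, \ldots, V_k$ from $F^*$ by distance-based clustering algorithms (e.g. k-means)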
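As a rough numerical illustration of the relaxation, the sketch below computes the first k eigenvectors of L with `numpy.linalg.eigh` and compares the resulting trace against random matrices with orthonormal columns. The orthonormality constraint and the small example graph are assumptions of this sketch.

```python
import numpy as np

# Hypothetical weighted graph: two tight groups joined by one weak edge.
W = np.array([
    [0.0, 1.0, 1.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 1.0, 0.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.1, 0.0, 0.0],
    [0.0, 0.0, 0.1, 0.0, 1.0, 1.0],
    [0.0, 0.0, 0.0, 1.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 1.0, 1.0, 0.0],
])
L = np.diag(W.sum(axis=1)) - W
k = 2

# eigh returns eigenvalues in ascending order, so the first k columns
# are the eigenvectors with the k smallest eigenvalues.
eigvals, eigvecs = np.linalg.eigh(L)
F_star = eigvecs[:, :k]
best = float(np.trace(F_star.T @ L @ F_star))

# Any other matrix with orthonormal columns should do no better.
rng = np.random.default_rng(0)
others = []
for trial in range(100):
    Q, _ = np.linalg.qr(rng.normal(size=(W.shape[0], k)))
    others.append(float(np.trace(Q.T @ L @ Q)))
```

The rows of `F_star` are then the points one would hand to k-means in the final recovery step.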
![Page 21: Machine Learning: Think Big and Parallel - Day 2 · Unsupervised Learning | day2 Clustering: k-means, Spectral Clustering Dimensionality Reduction: PCA, Matrix Factorization for Recommender](https://reader034.vdocuments.us/reader034/viewer/2022050304/5f6cb53f0ccc94670c730763/html5/thumbnails/21.jpg)
page.21
Spectral Clustering vs. k-means
Clustering data points xi ∈ Rd , i = 1, . . . ,N
First construct kernel matrix, e.g. Gaussian kernel:
$$W_{ij} = K(x_i, x_j) = e^{-\|x_i - x_j\|^2 / 2\sigma}$$
k-means algorithm can only find linear decision boundaries
Spectral clustering allows us to find non-convex boundaries
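A small end-to-end sketch for k = 2: build the Gaussian kernel matrix with the denominator $2\sigma$ exactly as written on the slide (many references use $2\sigma^2$ instead), form $L = D - W$, and split the points by the sign of the second eigenvector. Using the sign instead of running k-means on the eigenvector rows is a simplification assumed here; the data and the name `gaussian_kernel_matrix` are illustrative.

```python
import numpy as np

def gaussian_kernel_matrix(X, sigma):
    """W_ij = exp(-||x_i - x_j||^2 / (2 * sigma)), following the slide."""
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dists / (2.0 * sigma))

# Hypothetical 1-D data: two groups of three points each.
X = np.array([[0.0], [0.2], [0.4], [3.0], [3.2], [3.4]])
W = gaussian_kernel_matrix(X, sigma=1.0)
L = np.diag(W.sum(axis=1)) - W      # self-similarities cancel in D - W

eigvals, eigvecs = np.linalg.eigh(L)
second = eigvecs[:, 1]              # eigenvector of the 2nd smallest eigenvalue
labels = (second > 0).astype(int)   # two-way split by sign (simplification)
```

Which group receives label 0 depends on the arbitrary sign of the eigenvector, so only the grouping itself is meaningful.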
![Page 22: Machine Learning: Think Big and Parallel - Day 2 · Unsupervised Learning | day2 Clustering: k-means, Spectral Clustering Dimensionality Reduction: PCA, Matrix Factorization for Recommender](https://reader034.vdocuments.us/reader034/viewer/2022050304/5f6cb53f0ccc94670c730763/html5/thumbnails/22.jpg)
page.22
Variants of Graph Laplacian
Normalized Laplacian:
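$$L = I_n - D^{-1/2} W D^{-1/2}$$
$$\mathrm{NormalizedCut}(V_1, \ldots, V_k) = \sum_{i=1}^{k} \frac{\mathrm{cut}(V_i, V \setminus V_i)}{\mathrm{vol}(V_i)}, \quad \text{where } \mathrm{vol}(V_i) = \sum_{j \in V_i} D_{jj}$$
Signed Laplacian:
$L = D - W$, where $D_{ii} = \sum_{j=1}^{n} |W_{ij}|$
Handle “signed” similarity graphs with both positive and negative edge weights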
![Page 23: Machine Learning: Think Big and Parallel - Day 2 · Unsupervised Learning | day2 Clustering: k-means, Spectral Clustering Dimensionality Reduction: PCA, Matrix Factorization for Recommender](https://reader034.vdocuments.us/reader034/viewer/2022050304/5f6cb53f0ccc94670c730763/html5/thumbnails/23.jpg)
page.23
Dimensionality Reduction
![Page 24: Machine Learning: Think Big and Parallel - Day 2 · Unsupervised Learning | day2 Clustering: k-means, Spectral Clustering Dimensionality Reduction: PCA, Matrix Factorization for Recommender](https://reader034.vdocuments.us/reader034/viewer/2022050304/5f6cb53f0ccc94670c730763/html5/thumbnails/24.jpg)
page.24
Dimensionality Reduction
![Page 25: Machine Learning: Think Big and Parallel - Day 2 · Unsupervised Learning | day2 Clustering: k-means, Spectral Clustering Dimensionality Reduction: PCA, Matrix Factorization for Recommender](https://reader034.vdocuments.us/reader034/viewer/2022050304/5f6cb53f0ccc94670c730763/html5/thumbnails/25.jpg)
page.25
Dimensionality Reduction:
Principal Component Analysis
![Page 26: Machine Learning: Think Big and Parallel - Day 2 · Unsupervised Learning | day2 Clustering: k-means, Spectral Clustering Dimensionality Reduction: PCA, Matrix Factorization for Recommender](https://reader034.vdocuments.us/reader034/viewer/2022050304/5f6cb53f0ccc94670c730763/html5/thumbnails/26.jpg)
page.26
Principal Component Analysis
N observations: $x_i \in \mathbb{R}^D$, $i = 1, \ldots, N$
Goal:
Project data onto a space with dimension $M < D$
Maximize the variance of the projected data
Example:
![Page 27: Machine Learning: Think Big and Parallel - Day 2 · Unsupervised Learning | day2 Clustering: k-means, Spectral Clustering Dimensionality Reduction: PCA, Matrix Factorization for Recommender](https://reader034.vdocuments.us/reader034/viewer/2022050304/5f6cb53f0ccc94670c730763/html5/thumbnails/27.jpg)
page.27
PCA: Projection to one dimensional space (M = 1)
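Empirical mean and covariance of $x_n$:
$$\bar{x} = \frac{1}{N} \sum_{n=1}^{N} x_n$$
$$S = \frac{1}{N} \sum_{n=1}^{N} (x_n - \bar{x})(x_n - \bar{x})^T$$
$w$: the direction of the space
$\|w\|_2 = 1$, as the length is not important.
$\mathrm{Proj}_w(x_n) = w^T x_n, \ \forall n = 1, \ldots, N$
$\mathrm{Proj}_w(\bar{x}) = w^T \bar{x}$
The variance of $\mathrm{Proj}_w(x_n)$:
$$\frac{1}{N} \sum_{n=1}^{N} \left(w^T x_n - w^T \bar{x}\right)^2 \equiv w^T S w.$$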
![Page 28: Machine Learning: Think Big and Parallel - Day 2 · Unsupervised Learning | day2 Clustering: k-means, Spectral Clustering Dimensionality Reduction: PCA, Matrix Factorization for Recommender](https://reader034.vdocuments.us/reader034/viewer/2022050304/5f6cb53f0ccc94670c730763/html5/thumbnails/28.jpg)
page.28
PCA: Projection to one dimensional space (M = 1)
Goal: maximize the variance of the projected data Projw (xn):
$$\arg\max_{w_1 : \|w_1\| = 1} w_1^T S w_1$$
Lagrangian: $L(w_1, \lambda_1) = w_1^T S w_1 + \lambda_1 \left(1 - w_1^T w_1\right)$
$\nabla L(w_1, \lambda_1) = 0$ implies that $S w_1^* = \lambda_1 w_1^*$.
$w_1^*$ is the eigenvector of S corresponding to the largest eigenvalue $\lambda_1^*$, also called the first principal component.
In general, the k-th principal component $w_k^*$ is the eigenvector of S corresponding to the k-th largest eigenvalue $\lambda_k^*$.
Dimension reduction:
$W = [w_1^*, \ldots, w_M^*]$: formed by M principal components.
$\mathrm{Proj}_W(x) = W^T x$: the projected vector in M dimensional space.
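The derivation above translates into a few lines of NumPy. This sketch assumes observations are stored as rows of `X` (so the projection $W^T x$ becomes `X @ W`) and uses synthetic Gaussian data; both choices are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 200 observations in D = 3 with very different spreads.
X = rng.normal(size=(200, 3)) * np.array([3.0, 1.0, 0.2])

x_bar = X.mean(axis=0)
Xc = X - x_bar
S = Xc.T @ Xc / X.shape[0]              # covariance with the 1/N convention

# eigh returns ascending eigenvalues; reorder so the largest comes first.
eigvals, eigvecs = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

M = 2
W = eigvecs[:, :M]                      # first M principal components
Z = X @ W                               # row n holds W^T x_n
proj_var = Z.var(axis=0)                # np.var also uses 1/N by default
```

The variance of the data projected onto the k-th component equals the k-th largest eigenvalue, which is the identity $w^T S w = \lambda$ from the Lagrangian condition.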
![Page 29: Machine Learning: Think Big and Parallel - Day 2 · Unsupervised Learning | day2 Clustering: k-means, Spectral Clustering Dimensionality Reduction: PCA, Matrix Factorization for Recommender](https://reader034.vdocuments.us/reader034/viewer/2022050304/5f6cb53f0ccc94670c730763/html5/thumbnails/29.jpg)
page.29
PCA: An Example
A set of digit images
The mean vector $\bar{x}$ and the first 4 principal components:
![Page 30: Machine Learning: Think Big and Parallel - Day 2 · Unsupervised Learning | day2 Clustering: k-means, Spectral Clustering Dimensionality Reduction: PCA, Matrix Factorization for Recommender](https://reader034.vdocuments.us/reader034/viewer/2022050304/5f6cb53f0ccc94670c730763/html5/thumbnails/30.jpg)
page.30
PCA: An Example
Various M:
Eigenvalue Spectrum:
![Page 31: Machine Learning: Think Big and Parallel - Day 2 · Unsupervised Learning | day2 Clustering: k-means, Spectral Clustering Dimensionality Reduction: PCA, Matrix Factorization for Recommender](https://reader034.vdocuments.us/reader034/viewer/2022050304/5f6cb53f0ccc94670c730763/html5/thumbnails/31.jpg)
page.31
Dimensionality Reduction:
Matrix Factorization
![Page 32: Machine Learning: Think Big and Parallel - Day 2 · Unsupervised Learning | day2 Clustering: k-means, Spectral Clustering Dimensionality Reduction: PCA, Matrix Factorization for Recommender](https://reader034.vdocuments.us/reader034/viewer/2022050304/5f6cb53f0ccc94670c730763/html5/thumbnails/32.jpg)
page.32
Matrix Factorization
Matrix Factorization
A motivating example: recommender systems
Problem Formulation
Latent Feature Space
Existing Methods
![Page 33: Machine Learning: Think Big and Parallel - Day 2 · Unsupervised Learning | day2 Clustering: k-means, Spectral Clustering Dimensionality Reduction: PCA, Matrix Factorization for Recommender](https://reader034.vdocuments.us/reader034/viewer/2022050304/5f6cb53f0ccc94670c730763/html5/thumbnails/33.jpg)
page.33
Recommender Systems
![Page 34: Machine Learning: Think Big and Parallel - Day 2 · Unsupervised Learning | day2 Clustering: k-means, Spectral Clustering Dimensionality Reduction: PCA, Matrix Factorization for Recommender](https://reader034.vdocuments.us/reader034/viewer/2022050304/5f6cb53f0ccc94670c730763/html5/thumbnails/34.jpg)
page.34
Matrix Factorization Approach $A \approx W H^T$
![Page 35: Machine Learning: Think Big and Parallel - Day 2 · Unsupervised Learning | day2 Clustering: k-means, Spectral Clustering Dimensionality Reduction: PCA, Matrix Factorization for Recommender](https://reader034.vdocuments.us/reader034/viewer/2022050304/5f6cb53f0ccc94670c730763/html5/thumbnails/35.jpg)
page.35
Matrix Factorization Approach $A \approx W H^T$
![Page 36: Machine Learning: Think Big and Parallel - Day 2 · Unsupervised Learning | day2 Clustering: k-means, Spectral Clustering Dimensionality Reduction: PCA, Matrix Factorization for Recommender](https://reader034.vdocuments.us/reader034/viewer/2022050304/5f6cb53f0ccc94670c730763/html5/thumbnails/36.jpg)
page.36
Matrix Factorization Approach
$$\min_{W \in \mathbb{R}^{m \times k},\ H \in \mathbb{R}^{n \times k}} \sum_{(i,j) \in \Omega} \left(A_{ij} - w_i^T h_j\right)^2 + \lambda \left(\|W\|_F^2 + \|H\|_F^2\right),$$
$\Omega = \{(i, j) \mid A_{ij} \text{ is observed}\}$
Regularization terms to avoid over-fitting
Matrix factorization maps users/items to latent feature space $\mathbb{R}^k$
the i-th user ⇒ i-th row of W, $w_i^T$,
the j-th item ⇒ j-th row of H, $h_j^T$.
$w_i^T h_j$: measures the interaction between the i-th user and the j-th item.
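A minimal sketch of evaluating this objective, assuming the observed set $\Omega$ is encoded as a 0/1 `mask` array of the same shape as A. The function name `mf_objective` and the tiny example are illustrative.

```python
import numpy as np

def mf_objective(A, mask, W, H, lam):
    """Squared error over observed entries plus Frobenius regularization."""
    resid = (A - W @ H.T) * mask          # unobserved entries contribute 0
    fit = float((resid ** 2).sum())
    reg = float(lam * ((W ** 2).sum() + (H ** 2).sum()))
    return fit + reg

# Hypothetical 2x2 ratings with only the diagonal observed, and k = 1.
A = np.array([[1.0, 0.0],
              [0.0, 2.0]])
mask = np.array([[1.0, 0.0],
                 [0.0, 1.0]])
W = np.array([[1.0], [2.0]])              # one latent feature per user
H = np.array([[1.0], [1.0]])              # one latent feature per item
value = mf_objective(A, mask, W, H, lam=0.1)
```

Here both observed entries are fit exactly, so the value reduces to the regularization term $0.1 \cdot (1 + 4 + 1 + 1) = 0.7$.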
![Page 37: Machine Learning: Think Big and Parallel - Day 2 · Unsupervised Learning | day2 Clustering: k-means, Spectral Clustering Dimensionality Reduction: PCA, Matrix Factorization for Recommender](https://reader034.vdocuments.us/reader034/viewer/2022050304/5f6cb53f0ccc94670c730763/html5/thumbnails/37.jpg)
page.37
Latent Feature Space
![Page 38: Machine Learning: Think Big and Parallel - Day 2 · Unsupervised Learning | day2 Clustering: k-means, Spectral Clustering Dimensionality Reduction: PCA, Matrix Factorization for Recommender](https://reader034.vdocuments.us/reader034/viewer/2022050304/5f6cb53f0ccc94670c730763/html5/thumbnails/38.jpg)
page.38
Latent Feature Space
![Page 39: Machine Learning: Think Big and Parallel - Day 2 · Unsupervised Learning | day2 Clustering: k-means, Spectral Clustering Dimensionality Reduction: PCA, Matrix Factorization for Recommender](https://reader034.vdocuments.us/reader034/viewer/2022050304/5f6cb53f0ccc94670c730763/html5/thumbnails/39.jpg)
page.39
Other Factorizations
Nonnegative Matrix Factorization
$$\min_{W, H} \|A - W H^T\|_F^2 + \lambda \|W\|_F^2 + \lambda \|H\|_F^2$$
Each entry is positive
A is either fully or partially observed
Goal: find the nonnegative latent factors
![Page 40: Machine Learning: Think Big and Parallel - Day 2 · Unsupervised Learning | day2 Clustering: k-means, Spectral Clustering Dimensionality Reduction: PCA, Matrix Factorization for Recommender](https://reader034.vdocuments.us/reader034/viewer/2022050304/5f6cb53f0ccc94670c730763/html5/thumbnails/40.jpg)
page.40
Existing Methods
![Page 41: Machine Learning: Think Big and Parallel - Day 2 · Unsupervised Learning | day2 Clustering: k-means, Spectral Clustering Dimensionality Reduction: PCA, Matrix Factorization for Recommender](https://reader034.vdocuments.us/reader034/viewer/2022050304/5f6cb53f0ccc94670c730763/html5/thumbnails/41.jpg)
page.41
ALS: Alternating Least Squares
Fix either H or W and optimize the other:
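LS sub-problem:
$$\min_{w_i \in \mathbb{R}^k} \sum_{j \in \Omega_i} \left(A_{ij} - w_i^T h_j\right)^2 + \lambda \|w_i\|^2$$
It has a closed-form solution.
An iteration: update W/H once
Time per iteration: $O(|\Omega| k^2 + (m + n) k^3)$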
[Figure: 3×3 example of A with entries $A_{11}, \ldots, A_{33}$, the rows $w_1^T, w_2^T, w_3^T$ of W, and $H^T$]
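The closed-form solution of each least-squares sub-problem is the usual ridge-regression normal equation, $w_i = \left(\sum_{j \in \Omega_i} h_j h_j^T + \lambda I\right)^{-1} \sum_{j \in \Omega_i} A_{ij} h_j$. The sketch below assumes that form, a 0/1 `mask` encoding of $\Omega$, and synthetic data; the function names are illustrative.

```python
import numpy as np

def objective(A, mask, W, H, lam):
    resid = (A - W @ H.T) * mask
    return float((resid ** 2).sum() + lam * ((W ** 2).sum() + (H ** 2).sum()))

def als_update_rows(A, mask, W, H, lam):
    """Solve the ridge sub-problem for every row of W with H held fixed."""
    k = H.shape[1]
    W_new = np.empty_like(W)
    for i in range(A.shape[0]):
        obs = mask[i] > 0                 # Omega_i: observed items for row i
        H_obs = H[obs]
        gram = H_obs.T @ H_obs + lam * np.eye(k)
        W_new[i] = np.linalg.solve(gram, H_obs.T @ A[i, obs])
    return W_new

rng = np.random.default_rng(0)
m, n, k, lam = 6, 5, 2, 0.1
A = rng.random((m, k)) @ rng.random((n, k)).T      # synthetic rank-2 matrix
mask = (rng.random((m, n)) < 0.7).astype(float)
W, H = rng.random((m, k)), rng.random((n, k))

obj0 = objective(A, mask, W, H, lam)
W = als_update_rows(A, mask, W, H, lam)            # fix H, update W
obj1 = objective(A, mask, W, H, lam)
H = als_update_rows(A.T, mask.T, H, W, lam)        # fix W, update H
obj2 = objective(A, mask, W, H, lam)
```

Updating H reuses the same routine on the transposed problem. Each half-step solves its sub-problem exactly, so the objective cannot increase.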
![Page 42: Machine Learning: Think Big and Parallel - Day 2 · Unsupervised Learning | day2 Clustering: k-means, Spectral Clustering Dimensionality Reduction: PCA, Matrix Factorization for Recommender](https://reader034.vdocuments.us/reader034/viewer/2022050304/5f6cb53f0ccc94670c730763/html5/thumbnails/42.jpg)
page.42
SGM: Stochastic Gradient Method
SGM update: pick (i , j) ∈ Ω
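$R_{ij} \leftarrow A_{ij} - w_i^T h_j$
$w_i \leftarrow w_i - \eta (\lambda w_i - R_{ij} h_j),$
$h_j \leftarrow h_j - \eta (\lambda h_j - R_{ij} w_i).$
[Figure: 3×3 example of A with entries $A_{11}, \ldots, A_{33}$, the rows $w_1^T, w_2^T, w_3^T$ of W, and the columns $h_1, h_2, h_3$ of $H^T$]
An iteration: $|\Omega|$ updates
Time per iteration: $O(|\Omega| k)$,
better than $O(|\Omega| k^2)$ for ALS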
Convergence is sensitive to the learning rate η.
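A sketch of one SGM iteration ($|\Omega|$ updates in random order). It assumes that the $h_j$ update uses the value of $w_i$ from before its own update, a point the slide leaves open; the function names, step size, and synthetic data are illustrative.

```python
import numpy as np

def sgm_epoch(A, omega, W, H, eta, lam, rng):
    """One pass over the observed entries in random order (updates in place)."""
    for idx in rng.permutation(len(omega)):
        i, j = omega[idx]
        r = A[i, j] - W[i] @ H[j]              # R_ij
        w_old = W[i].copy()                    # assumption: use the old w_i for h_j
        W[i] -= eta * (lam * W[i] - r * H[j])
        H[j] -= eta * (lam * H[j] - r * w_old)

def observed_error(A, omega, W, H):
    return float(sum((A[i, j] - W[i] @ H[j]) ** 2 for i, j in omega))

rng = np.random.default_rng(0)
m, n, k = 6, 5, 2
A = rng.random((m, k)) @ rng.random((n, k)).T  # synthetic rank-2 matrix
omega = [(i, j) for i in range(m) for j in range(n)]   # all entries observed
W, H = rng.random((m, k)), rng.random((n, k))

initial_err = observed_error(A, omega, W, H)
for epoch in range(100):
    sgm_epoch(A, omega, W, H, eta=0.05, lam=0.01, rng=rng)
final_err = observed_error(A, omega, W, H)
```

With a step size that is too large the same loop can diverge, which is the sensitivity to $\eta$ noted on the slide.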
![Page 43: Machine Learning: Think Big and Parallel - Day 2 · Unsupervised Learning | day2 Clustering: k-means, Spectral Clustering Dimensionality Reduction: PCA, Matrix Factorization for Recommender](https://reader034.vdocuments.us/reader034/viewer/2022050304/5f6cb53f0ccc94670c730763/html5/thumbnails/43.jpg)
page.43
Coordinate Descent
Update a variable at a time:
$$w_{it} \leftarrow \frac{\sum_{j \in \Omega_i} \left(A_{ij} - w_i^T h_j + w_{it} h_{jt}\right) h_{jt}}{\lambda + \sum_{j \in \Omega_i} h_{jt}^2}.$$
Subproblem is just a single-variate quadratic problem
$\Omega_i = \{j : (i, j) \in \Omega\}$
Can be done in $O(|\Omega_i|)$
Update Sequence:
Item/user-wise update:
pick a user i or an item j
update the i-th row of W or the j-th row of H
Feature-wise update:
pick a feature index $t \in \{1, \ldots, k\}$
update the t-th columns of W and H alternately
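The single-variable update can be sketched directly from the formula. For brevity this version recomputes $w_i^T h_j$ for the whole row, which costs $O(|\Omega_i| k)$; reaching $O(|\Omega_i|)$ requires keeping the residuals up to date between updates, which is omitted here. The names and synthetic data are assumptions.

```python
import numpy as np

def cd_update_wit(A, mask, W, H, i, t, lam):
    """Return the minimizer of the one-variable sub-problem in w_it."""
    obs = np.flatnonzero(mask[i])                  # Omega_i
    # A_ij - w_i^T h_j + w_it h_jt: residual with w_it's contribution removed.
    partial = A[i, obs] - H[obs] @ W[i] + W[i, t] * H[obs, t]
    return float(partial @ H[obs, t] / (lam + H[obs, t] @ H[obs, t]))

rng = np.random.default_rng(1)
m, n, k, lam = 4, 6, 3, 0.1
A = rng.random((m, n))
mask = (rng.random((m, n)) < 0.8).astype(float)
mask[0, 0] = 1.0                                   # ensure row 0 has an observation
W, H = rng.random((m, k)), rng.random((n, k))
i, t = 0, 1

def row_objective(z):
    """Part of the objective that depends on w_it, evaluated at w_it = z."""
    w = W[i].copy()
    w[t] = z
    obs = np.flatnonzero(mask[i])
    return float(((A[i, obs] - H[obs] @ w) ** 2).sum() + lam * z ** 2)

z_star = cd_update_wit(A, mask, W, H, i, t, lam)
```

Because the sub-problem is a one-dimensional quadratic, moving away from `z_star` in either direction can only increase `row_objective`.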
![Page 44: Machine Learning: Think Big and Parallel - Day 2 · Unsupervised Learning | day2 Clustering: k-means, Spectral Clustering Dimensionality Reduction: PCA, Matrix Factorization for Recommender](https://reader034.vdocuments.us/reader034/viewer/2022050304/5f6cb53f0ccc94670c730763/html5/thumbnails/44.jpg)
page.44
Thoughts on Parallelization
![Page 45: Machine Learning: Think Big and Parallel - Day 2 · Unsupervised Learning | day2 Clustering: k-means, Spectral Clustering Dimensionality Reduction: PCA, Matrix Factorization for Recommender](https://reader034.vdocuments.us/reader034/viewer/2022050304/5f6cb53f0ccc94670c730763/html5/thumbnails/45.jpg)
page.45
List of Methods in Scikit-learn
Regression:
Linear, Ridge, Lasso, Elastic Net, Bayesian Regression, Support Vector Regression, ...
Classification:
kNN, SVM, Perceptron, Logistic Regression, Naive Bayes, Decision Trees, Random Forest, AdaBoost, ...
Clustering:
k-means, Spectral Clustering, Affinity Propagation, Mean-Shift, DBSCAN, Hierarchical Clustering, ...
Dimensionality Reduction:
(kernel/sparse) PCA, MF, NMF, Truncated SVD (LSA), Dictionary Learning, Factor Analysis, Independent Component Analysis, ...
![Page 46: Machine Learning: Think Big and Parallel - Day 2 · Unsupervised Learning | day2 Clustering: k-means, Spectral Clustering Dimensionality Reduction: PCA, Matrix Factorization for Recommender](https://reader034.vdocuments.us/reader034/viewer/2022050304/5f6cb53f0ccc94670c730763/html5/thumbnails/46.jpg)
page.46
Potential Projects
Goal: A fully parallelized version of Scikit-learn
Regression:
parallel solvers for Lasso/Ridge
Classification:
parallel solvers for SVM, Logistic Regression
Clustering:
parallel k-means
Dimensionality Reduction:
parallel MF/NMF for recommender system
![Page 47: Machine Learning: Think Big and Parallel - Day 2 · Unsupervised Learning | day2 Clustering: k-means, Spectral Clustering Dimensionality Reduction: PCA, Matrix Factorization for Recommender](https://reader034.vdocuments.us/reader034/viewer/2022050304/5f6cb53f0ccc94670c730763/html5/thumbnails/47.jpg)
page.47
Example: Parallel Matrix Factorization for Recommender Systems
![Page 48: Machine Learning: Think Big and Parallel - Day 2 · Unsupervised Learning | day2 Clustering: k-means, Spectral Clustering Dimensionality Reduction: PCA, Matrix Factorization for Recommender](https://reader034.vdocuments.us/reader034/viewer/2022050304/5f6cb53f0ccc94670c730763/html5/thumbnails/48.jpg)
page.48
DSGD: Distributed SGM
[Figure: A partitioned into 3×3 blocks $A_{11}, \ldots, A_{33}$, with rows $w_1^T, w_2^T, w_3^T$, columns $h_1, h_2, h_3$, and processors $P_1, P_2, P_3$]
![Page 49: Machine Learning: Think Big and Parallel - Day 2 · Unsupervised Learning | day2 Clustering: k-means, Spectral Clustering Dimensionality Reduction: PCA, Matrix Factorization for Recommender](https://reader034.vdocuments.us/reader034/viewer/2022050304/5f6cb53f0ccc94670c730763/html5/thumbnails/49.jpg)
page.49
DSGD: Distributed SGM
[Figure: A partitioned into 3×3 blocks $A_{11}, \ldots, A_{33}$, with rows $w_1^T, w_2^T, w_3^T$, columns $h_1, h_2, h_3$, and processors $P_1, P_2, P_3$]
![Page 50: Machine Learning: Think Big and Parallel - Day 2 · Unsupervised Learning | day2 Clustering: k-means, Spectral Clustering Dimensionality Reduction: PCA, Matrix Factorization for Recommender](https://reader034.vdocuments.us/reader034/viewer/2022050304/5f6cb53f0ccc94670c730763/html5/thumbnails/50.jpg)
page.50
DSGD: Distributed SGM
[Figure: A partitioned into 3×3 blocks $A_{11}, \ldots, A_{33}$, with rows $w_1^T, w_2^T, w_3^T$, columns $h_1, h_2, h_3$, and processors $P_1, P_2, P_3$]
![Page 51: Machine Learning: Think Big and Parallel - Day 2 · Unsupervised Learning | day2 Clustering: k-means, Spectral Clustering Dimensionality Reduction: PCA, Matrix Factorization for Recommender](https://reader034.vdocuments.us/reader034/viewer/2022050304/5f6cb53f0ccc94670c730763/html5/thumbnails/51.jpg)
page.51
Parallel Coordinate Descent
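Feature-wise Update: CCD++
Rank-one decomposition (here $w_t$ and $h_t$ denote the t-th columns of W and H):
$$W H^T = [\cdots\ w_t\ \cdots][\cdots\ h_t\ \cdots]^T = \sum_{t=1}^{k} w_t h_t^T$$
CCD++: picks a latent feature t and updates $(w_t, h_t)$
$$\min_{u \in \mathbb{R}^m,\ v \in \mathbb{R}^n} \sum_{(i,j) \in \Omega} \left(R_{ij} - u_i v_j\right)^2 + \lambda \left(\|u\|^2 + \|v\|^2\right).$$
$R_{ij} = A_{ij} - w_i^T h_j$
$R_{ij} \leftarrow R_{ij} + w_{ti} h_{tj}, \ \forall (i, j) \in \Omega$
$(u^*, v^*)$ is a rank-one approximation of R
Apply the CCD iteration T times to obtain $(u^*, v^*)$
CCD: item/user-wise update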
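The rank-one sub-problem can be sketched as T rounds of alternating closed-form updates, $u_i = \sum_{j \in \Omega_i} R_{ij} v_j / (\lambda + \sum_{j \in \Omega_i} v_j^2)$ and the symmetric expression for $v_j$. These formulas follow from the one-variable update applied to a rank-one model; the function names, the 0/1 `mask` encoding of $\Omega$, and the random data are assumptions.

```python
import numpy as np

def rank_one_objective(R, mask, u, v, lam):
    resid = (R - np.outer(u, v)) * mask
    return float((resid ** 2).sum() + lam * (u @ u + v @ v))

def rank_one_ccd(R, mask, u, v, lam, T):
    """Alternate exact updates of u and v for T rounds (illustrative sketch)."""
    u, v = u.copy(), v.copy()
    Rm = R * mask                                  # zero out unobserved entries
    for _ in range(T):
        # Every u_i depends only on v, so all of u is updated at once.
        u = (Rm @ v) / (lam + mask @ (v ** 2))
        # Likewise every v_j depends only on the freshly updated u.
        v = (Rm.T @ u) / (lam + mask.T @ (u ** 2))
    return u, v

rng = np.random.default_rng(0)
m, n, lam = 6, 5, 0.1
R = rng.random((m, n))                             # stand-in for the residual matrix
mask = (rng.random((m, n)) < 0.7).astype(float)
u0, v0 = rng.random(m), rng.random(n)

obj_before = rank_one_objective(R, mask, u0, v0, lam)
u_star, v_star = rank_one_ccd(R, mask, u0, v0, lam, T=3)
obj_after = rank_one_objective(R, mask, u_star, v_star, lam)
```

In the full method the pair `(u_star, v_star)` would be written back into the t-th columns of W and H and the residual adjusted before moving on to the next feature.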
![Page 52: Machine Learning: Think Big and Parallel - Day 2 · Unsupervised Learning | day2 Clustering: k-means, Spectral Clustering Dimensionality Reduction: PCA, Matrix Factorization for Recommender](https://reader034.vdocuments.us/reader034/viewer/2022050304/5f6cb53f0ccc94670c730763/html5/thumbnails/52.jpg)
page.52
Feature-wise Update: CCD++
When T = 2
Cycle through k feature dimensions
$O\left(\frac{2T}{T+1}\right)$ faster than CCD
netflix with k = 40
![Page 65: Machine Learning: Think Big and Parallel - Day 2 · Unsupervised Learning | day2 Clustering: k-means, Spectral Clustering Dimensionality Reduction: PCA, Matrix Factorization for Recommender](https://reader034.vdocuments.us/reader034/viewer/2022050304/5f6cb53f0ccc94670c730763/html5/thumbnails/65.jpg)
page.65
Problems of Different Scales
W ,H, and R fit in the memory of a single computer
Multi-core systems are an appropriate framework.
All cores share the same memory space.
Latest variables are always available to access.
W ,H or R exceeds memory capacity of one computer
Can still run on one computer, but leads to disk swap.
Distributed systems are appropriate.
Matrices are stored in memory of the distributed system ⇒ only local data can be accessed fast.
Require communication to access latest variables.
![Page 66: Machine Learning: Think Big and Parallel - Day 2 · Unsupervised Learning | day2 Clustering: k-means, Spectral Clustering Dimensionality Reduction: PCA, Matrix Factorization for Recommender](https://reader034.vdocuments.us/reader034/viewer/2022050304/5f6cb53f0ccc94670c730763/html5/thumbnails/66.jpg)
page.66
Parallelization of CCD++
Key: to parallelize CCD to obtain (u∗, v∗).
Fact: each ui can be updated independently.
Partition u and v into p sub-vectors.
$u \Rightarrow u^1, \ldots, u^r, \ldots, u^p$
$v \Rightarrow v^1, \ldots, v^r, \ldots, v^p$
Run in parallel: the r-th core $C_r$:
computes $(u^*)^r$ and $(v^*)^r$
updates $w_t^r$ and $h_t^r$
See the paper Yu et al, 2013 for more details.
[Figure: R partitioned into 3×3 blocks $R_{11}, \ldots, R_{33}$, with u along the rows, $v^1, v^2, v^3$ along the columns, and cores $C_1, C_2, C_3$]
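Because each $u_i$ depends only on v and on row i of R, the update of u splits cleanly across workers. The sketch below partitions the rows into p blocks and hands each block to a thread from `concurrent.futures`. The threads are only meant to illustrate the partitioning, not to reproduce the multi-core implementation from the lecture, and all names are assumptions.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def update_u_block(R, mask, v, lam, rows):
    """Closed-form update of the entries of u that belong to one block."""
    Rb, Mb = R[rows], mask[rows]
    return ((Rb * Mb) @ v) / (lam + Mb @ (v ** 2))

def parallel_update_u(R, mask, v, lam, p):
    """Split u into p sub-vectors and update each one independently."""
    blocks = np.array_split(np.arange(R.shape[0]), p)
    with ThreadPoolExecutor(max_workers=p) as pool:
        parts = list(pool.map(
            lambda rows: update_u_block(R, mask, v, lam, rows), blocks))
    return np.concatenate(parts)

rng = np.random.default_rng(0)
m, n, lam = 9, 7, 0.1
R = rng.random((m, n))
mask = (rng.random((m, n)) < 0.7).astype(float)
v = rng.random(n)

u_serial = update_u_block(R, mask, v, lam, np.arange(m))
u_parallel = parallel_update_u(R, mask, v, lam, p=3)
```

The parallel result matches the serial one exactly because no block reads anything another block writes; the same argument applies to v once u is fixed.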
![Page 67: Machine Learning: Think Big and Parallel - Day 2 · Unsupervised Learning | day2 Clustering: k-means, Spectral Clustering Dimensionality Reduction: PCA, Matrix Factorization for Recommender](https://reader034.vdocuments.us/reader034/viewer/2022050304/5f6cb53f0ccc94670c730763/html5/thumbnails/67.jpg)
page.67
CCD++ on Distributed Systems
W ,H,R are distributed over the memory of different computers.
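[Figure: R is split so that computer $C_r$ stores the r-th block row and the r-th block column of R, e.g. $C_1$ holds $R_{11}, R_{12}, R_{13}, R_{21}, R_{31}$; W and H are split into blocks $W^1, W^2, W^3$ and $H^1, H^2, H^3$]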
![Page 68: Machine Learning: Think Big and Parallel - Day 2 · Unsupervised Learning | day2 Clustering: k-means, Spectral Clustering Dimensionality Reduction: PCA, Matrix Factorization for Recommender](https://reader034.vdocuments.us/reader034/viewer/2022050304/5f6cb53f0ccc94670c730763/html5/thumbnails/68.jpg)
page.68
CCD++ on Distributed Systems
Distributed update: computer Cr :
obtains $(u^r, v^r)$ using CCD:
computes $u^r$ and broadcasts it
computes $v^r$ and broadcasts it
updates $(w_t^r, h_t^r) \leftarrow (u^r, v^r)$
[Figure: R partitioned into 3×3 blocks $R_{11}, \ldots, R_{33}$, with u along the rows, $v^1, v^2, v^3$ along the columns, and computers $C_1, C_2, C_3$]
![Page 69: Machine Learning: Think Big and Parallel - Day 2 · Unsupervised Learning | day2 Clustering: k-means, Spectral Clustering Dimensionality Reduction: PCA, Matrix Factorization for Recommender](https://reader034.vdocuments.us/reader034/viewer/2022050304/5f6cb53f0ccc94670c730763/html5/thumbnails/69.jpg)
page.69
References
[1] R. Gemulla, P. J. Haas, E. Nijkamp, and Y. Sismanis. Large-Scale Matrix Factorization with Distributed Stochastic Gradient Descent. KDD, 2011.
[2] F. Niu, B. Recht, C. Re, and S. J. Wright. Hogwild: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent. NIPS, 2011.
[3] Y. Zhuang, W.-S. Chin, Y.-C. Juan, and C.-J. Lin. A Fast Parallel SGD for Matrix Factorization in Shared Memory Systems. RecSys, 2013.
[4] H.-F. Yu, C.-J. Hsieh, S. Si, and I. Dhillon. Parallel Matrix Factorization for Recommender Systems. KAIS, 2013.