Scalable Task-Parallel SGD on Matrix Factorization in Multicore Architectures
Yusuke Nishioka, Kenjiro Taura
Graduate School of Information Science and Technology, The University of Tokyo
ParLearning15, 2015/05/29
Agenda
● Introduction
● Matrix Factorization
● Related Work
● Scalability problem of FPSGD
● Proposal
● Evaluation
● Conclusion & Future Work
Introduction
Recommendation is an important technique, especially in e-commerce services (e.g. Amazon, Netflix, …). Service providers use recommendation to help users find items they will like. (screenshot: https://amazon.com)
Recommendation
There are two approaches to recommendation: content filtering and collaborative filtering.
● Content filtering is based on profile information about users and items.
● Collaborative filtering is based on correlations between users and items.
Collaborative filtering has been gaining popularity: it gives accurate predictions with less data.
Collaborative Filtering
[Figure: a rating table of users A–D against four movies (Saving Private Ryan, Pirates of the Caribbean, Beauty and the Beast, Letters From Iwo Jima); only some cells are filled, on a scale from 1 (bad) to 5 (great).]
Each entry in the table denotes a rating: 5 means great, 1 means terrible.
Collaborative Filtering
[Figure: the same rating table.]
Both user A and user C seem to like war movies.
Collaborative Filtering
[Figure: the same rating table, with the missing war-movie cells for users A and C highlighted.]
Both user A and user C seem to like war movies, so why not recommend them other war movies (predicted rating 4–5)?
Collaborative Filtering
[Figure: the same rating table.]
Both user B and user D seem to like Disney movies.
Collaborative Filtering
[Figure: the same rating table, with the missing Disney-movie cells for users B and D highlighted.]
Both user B and user D seem to like Disney movies, so why not recommend them other Disney movies (predicted rating 4–5)?
Contribution
Our contribution is to:
● propose an alternative approach for parallel matrix factorization
● analyze the scalability problem of the past work
● achieve better scalability than other methods
Matrix factorization
One collaborative filtering algorithm is matrix factorization (MF). It:
● infers user–item relations from the ratings which users have given to items
● decomposes the rating matrix into two low-rank latent matrices
What's matrix factorization?
[Figure: the user–item rating table (users A–D, items A–C) viewed as a sparse matrix R of m users × n items.]
What's matrix factorization?
The rating matrix R is modeled as the matrix product of $P^\top$ and $Q$:

$$R \simeq P^\top Q, \qquad R \in \mathbb{R}^{m \times n},\; P \in \mathbb{R}^{k \times m},\; Q \in \mathbb{R}^{k \times n}$$

where $m$ is the number of users, $n$ the number of items, and $k$ the dimension of the latent matrices.
What's matrix factorization?
Each entry $r_{u,v}$ in the rating matrix R is modeled as the inner product of the corresponding row and column:

$$\hat{r}_{u,v} \simeq p_u^\top q_v$$

[Figure: entry $r_{u,v}$ of R against row $p_u^\top$ of $P^\top$ and column $q_v$ of $Q$.]
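As a minimal sketch (ours, not the authors' code), predicting one rating is just a k-dimensional dot product between a user factor and an item factor:

```cpp
#include <vector>

// Predicted rating: r̂(u,v) = p_uᵀ q_v, where p_u and q_v are the
// k-dimensional latent vectors of user u and item v.
double predict(const std::vector<double>& p_u,
               const std::vector<double>& q_v) {
    double r_hat = 0.0;
    for (std::size_t f = 0; f < p_u.size(); ++f)
        r_hat += p_u[f] * q_v[f];
    return r_hat;
}
```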
Optimization problem of MF
The objective is to reduce the error between the predicted and real values of the existing ratings. The objective function is:

$$\min_{P,Q} \sum_{(u,v)\in R} \left( r_{u,v} - p_u^\top q_v \right)^2 + \lambda_P \|P\|_F^2 + \lambda_Q \|Q\|_F^2$$

where $r_{u,v}$ is an existing rating, $p_u^\top q_v$ the predicted rating, $\lambda_P, \lambda_Q$ the regularization coefficients, and $\|\cdot\|_F$ the Frobenius norm.
Gradient Descent Method
One approach to this optimization problem is the gradient descent (GD) method: modify P and Q in the direction opposite to the gradients of the objective function.

$$p_u \Leftarrow p_u + \gamma \Big( \sum_{(u,v)\in R} (r_{u,v} - p_u^\top q_v)\, q_v - \lambda_P \sum_{u=1}^{m} p_u \Big)$$
$$q_v \Leftarrow q_v + \gamma \Big( \sum_{(u,v)\in R} (r_{u,v} - p_u^\top q_v)\, p_u - \lambda_Q \sum_{v=1}^{n} q_v \Big)$$

GD needs to calculate the gradients in a batch, which leads to slow convergence.
Stochastic Gradient Descent Method
The stochastic gradient descent (SGD) method updates the corresponding $p_u$ and $q_v$ in an online manner, giving faster convergence with low memory consumption. $p_u$ and $q_v$ are updated as follows:

$$p_u \Leftarrow p_u + \gamma \left( (r_{u,v} - p_u^\top q_v)\, q_v - \lambda_P p_u \right)$$
$$q_v \Leftarrow q_v + \gamma \left( (r_{u,v} - p_u^\top q_v)\, p_u - \lambda_Q q_v \right)$$

However, SGD is inherently sequential and difficult to parallelize over multi-core CPUs.
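A direct transcription of the two update formulas into code, as a sketch (our illustration, not the paper's implementation; gamma is the learning rate γ):

```cpp
#include <vector>

// One SGD step for a single observed rating r_uv:
//   p_u ⇐ p_u + γ((r_uv − p_uᵀq_v) q_v − λ_P p_u)
//   q_v ⇐ q_v + γ((r_uv − p_uᵀq_v) p_u − λ_Q q_v)
void sgd_update(std::vector<double>& p_u, std::vector<double>& q_v,
                double r_uv, double gamma,
                double lambda_p, double lambda_q) {
    double err = r_uv;
    for (std::size_t f = 0; f < p_u.size(); ++f)
        err -= p_u[f] * q_v[f];            // err = r_uv − p_uᵀ q_v
    for (std::size_t f = 0; f < p_u.size(); ++f) {
        double pu = p_u[f], qv = q_v[f];   // read old values for both updates
        p_u[f] += gamma * (err * qv - lambda_p * pu);
        q_v[f] += gamma * (err * pu - lambda_q * qv);
    }
}
```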
GD vs SGD

| | GD | SGD |
|---|---|---|
| strong point | comparably easy to parallelize | fast convergence |
| weak point | slow convergence | difficult to parallelize |

[Figure: pseudocode of GD and SGD.]
Related work
We review some parallel SGD algorithms for solving matrix factorization:
● HogWild!
● DSGD
● FPSGD
HogWild! [Niu et al. 2011]
HogWild! is a lock-free approach to parallelizing MF: workers process ratings in R independently. Convergence is guaranteed in spite of the possibility that updates may be overwritten by other workers.
[Figure: three workers update R concurrently; some updates are overwritten by other workers.]
Problems of HogWild!
Ratings are selected at random from the whole of R, so workers cannot take advantage of hardware prefetching.
[Figure: random accesses scattered over R, $P^\top$, and Q.]
DSGD [Gemulla et al. 2011]
DSGD divides R into t × t blocks and assigns them to t workers. All workers move on to the next block simultaneously (synchronous parallelism).
Problems of DSGD
● Ratings are selected at random within a block, so workers cannot take advantage of hardware prefetching.
● The parallelism is synchronous, so performance depends on the slowest worker.
[Figure: four workers separated by synchronization barriers between block-processing phases.]
FPSGD [Zhuang et al. 2013]
FPSGD is a state-of-the-art parallel algorithm for MF. It aims to overcome the bottlenecks of HogWild! and DSGD by introducing two techniques:
● conflict-free scheduling
● the partial random method
Conflict-free scheduling
Take a specific example: 6 × 6 blocks and 4 workers T0–T3. First, the scheduler assigns each worker a block which shares neither a row nor a column with the blocks of the other workers.
[Figure: a 6 × 6 grid of blocks of R with T0–T3 placed on mutually non-conflicting blocks.]
Conflict-free scheduling
The rows and columns being processed by other workers are blocked, so that no two workers update the same row or column. When worker T0 finishes processing its block, the scheduler assigns it one of the blocks marked with a circle.
[Figure: the same grid; the free candidate blocks for T0 are marked with circles.]
Conflict-free scheduling
Worker T0 moves from block (0,0) to block (4,5).
[Figure: the same grid with T0 relocated to block (4,5).]
Conflict-free scheduling
In conflict-free scheduling, the rating matrix must be divided into at least (t+1) × (t+1) blocks, where t is the number of workers, so that the scheduler can always assign a free block. Thanks to this scheduling, workers can process blocks asynchronously.
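The scheduling invariant itself is easy to state in code; a hedged sketch of the free-block test (our illustration, not FPSGD's actual scheduler):

```cpp
#include <vector>

// Block (i, j) is free iff no running worker currently holds
// block-row i or block-column j; get_job marks them busy and
// put_job unmarks them.
bool is_free(const std::vector<bool>& busy_row,
             const std::vector<bool>& busy_col,
             int i, int j) {
    return !busy_row[i] && !busy_col[j];
}
```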
Partial random method
● Select blocks randomly.
● Select ratings within a block in order.
Partial random method
Ordered rating selection means at least one of the corresponding row or column is loaded into the cache sequentially, so workers can utilize hardware prefetching.
[Figure: ratings inside a block visited in order 1–4, streaming through $P^\top$ or Q.]
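For instance, if the ratings of a block are stored sorted by user index, the worker reuses each $p_u$ while sweeping sequentially through the block, an access pattern the prefetcher can follow (a sketch under that storage assumption, reusing the sgd_update sketch from above):

```cpp
#include <vector>

struct Rating { int u, v; double r; };

// From the SGD sketch above.
void sgd_update(std::vector<double>& p_u, std::vector<double>& q_v,
                double r_uv, double gamma, double lambda_p, double lambda_q);

// `block` is assumed sorted by u: p[x.u] stays hot in cache and the
// sequential sweep over the ratings is hardware-prefetch friendly.
void process_block(const std::vector<Rating>& block,
                   std::vector<std::vector<double>>& p,
                   std::vector<std::vector<double>>& q,
                   double gamma, double lambda_p, double lambda_q) {
    for (const Rating& x : block)
        sgd_update(p[x.u], q[x.v], x.r, gamma, lambda_p, lambda_q);
}
```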
Summary of related work
(Block selection and rating selection together determine cache utilization.)

| | parallel | block selection | rating selection |
|---|---|---|---|
| HogWild! | async | – | random |
| DSGD | synchronous | ordered | random |
| FPSGD | async | random | ordered |
Scalability of FPSGD
FPSGD has a scalability problem: it cannot scale to higher core counts. The main reasons for the limited scalability are:
● the locking problem
● poor data locality across blocks
Locking problem
In the get_job function, the scheduler takes a free block and marks its row and column as “being processed”. In the put_job function, the worker returns the processed block and its row and column are unmarked as “free”. Both functions acquire the same single lock, leading to poor scalability.
[Figure: worker T0 takes a block and its row and column are marked “being processed”.]
Poor data locality across blocks
Workers select blocks randomly, so there is little opportunity to reuse P and Q across blocks.
Poor data locality across blocks
With $n_r$ ratings, $n_u$ users, $n_i$ items, $b \times b$ blocks, and word size $d$, the memory traffic for processing all blocks once is:

$$T_{\mathrm{fpsgd}} = \left( \frac{n_r}{b^2} + \frac{k\, n_u}{b} + \frac{k\, n_i}{b} \right) \times b^2 \times d = d\,\big(n_r + b k (n_u + n_i)\big)$$
Proposal
We propose dcMF, a scalable divide-and-conquer method for MF using the task parallel model.
Task parallel model
In the task parallel model, parallelism is expressed via two operations:
● create_task: create a task
● sync_tasks: wait for the calling task's child tasks to finish
Example using task parallelism (fibonacci)
Regard each recursive function call as a task; tasks are automatically distributed among workers (a stand-in sketch follows).
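The slide's code is not preserved in this transcript; a minimal stand-in using C++ std::async in place of create_task/sync_tasks (MassiveThreads offers analogous primitives, but this is not its actual API):

```cpp
#include <future>

// fib(n-1) runs as a child task while the caller computes fib(n-2);
// fut.get() plays the role of sync_tasks.
long fib(int n) {
    if (n < 2) return n;
    auto fut = std::async(std::launch::async, fib, n - 1);  // create_task
    long rest = fib(n - 2);
    return fut.get() + rest;                                // sync_tasks
}
```

This is fine for small n; a real task-parallel runtime such as MassiveThreads multiplexes many lightweight tasks onto a fixed set of workers instead of spawning a thread per call.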
Overview of dcMF
In each recursion:
● divide the block into 2-by-2 sub-blocks
● process the two sub-blocks on one diagonal (in parallel)
● then process the two sub-blocks on the other diagonal (in parallel)
Execution flow
The tasks created along the recursion form a tree structure. Tasks located in different rows and columns can be processed in parallel.
Overview of dcMF
The runtime system automatically distributes tasks to workers.
[Figure: a task tree over R; running tasks are assigned to worker 1 and worker 2 while the remaining tasks wait.]
Overview of dcMF
Divide the block into 2-by-2 sub-blocks.
[Figure: R divided into four sub-blocks.]
Overview of dcMF
Tasks for the two sub-blocks on one diagonal are created.
[Figure: tasks created for the sub-blocks on one diagonal.]
Overview of dcMF
Then tasks for the two sub-blocks on the other diagonal are created.
[Figure: tasks created for the sub-blocks on the other diagonal.]
Overview of dcMF
When a block gets small enough, $p_u$ and $q_v$ are updated directly.
[Figure: the recursion bottoms out at small blocks, where the SGD updates run.]
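Putting the recursion together, a hedged sketch of the divide-and-conquer step (std::async again stands in for create_task/sync_tasks; small_enough and run_sgd_on are hypothetical helpers for the base case):

```cpp
#include <future>

struct Block { int row0, row1, col0, col1; };  // half-open index ranges over R

bool small_enough(const Block& b);  // base-case test (hypothetical helper)
void run_sgd_on(const Block& b);    // sequential SGD over the block's ratings

void dcmf(Block b) {
    if (small_enough(b)) { run_sgd_on(b); return; }  // update p_u, q_v here
    int rm = (b.row0 + b.row1) / 2, cm = (b.col0 + b.col1) / 2;
    Block b00{b.row0, rm, b.col0, cm}, b11{rm, b.row1, cm, b.col1};
    Block b01{b.row0, rm, cm, b.col1}, b10{rm, b.row1, b.col0, cm};
    // First diagonal: b00 and b11 share no rows or columns, so they can
    // run in parallel without conflicting updates to P and Q.
    auto t1 = std::async(std::launch::async, dcmf, b00);  // create_task
    dcmf(b11);
    t1.get();                                             // sync_tasks
    // Second diagonal: b01 and b10, likewise conflict-free.
    auto t2 = std::async(std::launch::async, dcmf, b01);
    dcmf(b10);
    t2.get();
}
```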
Advantages of the task parallel model
There are two advantages in dcMF:
● it gets rid of locking time: task parallel systems can handle task migration without a centralized data structure
● it reduces cache miss counts
Reducing cache miss counts
Task parallel systems split the task tree near its root and assign the subtrees to workers.
[Figure: the task tree over R split into four subtrees, one per worker.]
Reducing cache miss counts
If no load balancing occurs, each worker processes t blocks of size (1/t) × (1/t).
[Figure: R divided among workers 1–4.]
Reducing cache miss counts
Consider the blocks around a leaf: all of these blocks are processed by the same worker.
[Figure: the worker loads the corresponding parts of $P^\top$ and Q onto its cache.]
Reducing cache miss counts
After processing the blocks on one diagonal, the corresponding $p_u$ and $q_v$ are already on the cache.
[Figure: the factors loaded for the first diagonal remain cached.]
Reducing cache miss counts
Processing the remaining blocks does not require access to main memory.
[Figure: all the needed parts of $P^\top$ and Q are already on the cache.]
Advantages of the task parallel model
With $t$ threads, the memory traffic for processing all blocks once is:

$$T_{\mathrm{dcmf}} = d\,\big(n_r + t k (n_u + n_i)\big) \qquad \big(\text{vs. } T_{\mathrm{fpsgd}} = d\,(n_r + b k (n_u + n_i))\big)$$

The total traffic of dcMF is less than that of FPSGD whenever $b > t$:

$$T_{\mathrm{fpsgd}} - T_{\mathrm{dcmf}} = d\,\big(n_r + b k (n_u + n_i)\big) - d\,\big(n_r + t k (n_u + n_i)\big) = d k (b - t)(n_u + n_i) > 0$$
Comparison with related work

| | parallel | block selection | rating selection |
|---|---|---|---|
| HogWild! | async | – | random |
| DSGD | synchronous | ordered | random |
| FPSGD | async | random | ordered |
| dcMF | async | partially ordered | ordered |
Evaluation environment
CPU: four AMD Opteron 8354 2.50 GHz sockets
● each socket has 2 NUMA nodes; each NUMA node has 4 modules; each module has 2 cores; the total core count is 64
● L3 cache (6 MB) is shared by 4 modules; L2 cache (2 MB) is shared by 2 cores; L1 cache (16 KB) is private to each core
Evaluation environment
● Implementation: we used the same code base as FPSGD
● Task parallel library: MassiveThreads
Evaluation dataset
We use three datasets: MovieLens10M, Netflix, and Yahoo!Music. We use the same parameter values as FPSGD for a fair comparison.
Experiments
We conducted three comparative experiments:
● scalability
● convergence speed, using RMSE (root mean square error) as the convergence metric:

$$\mathrm{RMSE} = \sqrt{ \frac{1}{n_r} \sum_{(u,v)\in R} (r_{u,v} - \hat{r}_{u,v})^2 }$$

● L2 fill/writeback counts, using the PAPI native event L2_CACHE_FILL_WRITEBACK
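Counting a PAPI native event around a region of interest looks roughly like this (a sketch of PAPI's generic C API, not the authors' measurement harness):

```cpp
#include <papi.h>

// Count L2_CACHE_FILL_WRITEBACK events raised while `work` runs.
long long count_l2_fill_writeback(void (*work)()) {
    int es = PAPI_NULL;
    long long value = 0;
    PAPI_library_init(PAPI_VER_CURRENT);
    PAPI_create_eventset(&es);
    PAPI_add_named_event(es, "L2_CACHE_FILL_WRITEBACK");  // native event
    PAPI_start(es);
    work();                       // e.g., one SGD pass over all blocks
    PAPI_stop(es, &value);
    return value;
}
```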
FPSGD++
We implemented FPSGD++ to overcome the scalability problem of FPSGD: it sets multiple locks on both rows and columns instead of FPSGD's single lock.
Pseudocode of FPSGD++
Workers can enter the same scope of the procedures concurrently; getting a block no longer requires the single giant lock.
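The pseudocode image is lost in this transcript; the idea, sketched with per-row and per-column try-locks in place of the single giant lock (our illustration of the scheme, not the paper's exact code; the block-grid size is a hypothetical constant):

```cpp
#include <mutex>
#include <vector>

constexpr int NUM_BLOCK_ROWS = 8, NUM_BLOCK_COLS = 8;  // hypothetical sizes
std::vector<std::mutex> row_lock(NUM_BLOCK_ROWS);      // constructed with a fixed
std::vector<std::mutex> col_lock(NUM_BLOCK_COLS);      // size: std::mutex can't move

// Try to claim block (i, j). Many workers may run this concurrently,
// contending only on the row/column they actually touch.
bool try_get_block(int i, int j) {
    if (!row_lock[i].try_lock()) return false;
    if (!col_lock[j].try_lock()) { row_lock[i].unlock(); return false; }
    return true;                  // caller now owns row i and column j
}

void put_block(int i, int j) {    // release after processing the block
    col_lock[j].unlock();
    row_lock[i].unlock();
}
```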
Scalability comparison
● dcMF scales up to 64 cores.
● FPSGD does not scale at high core counts, especially with MovieLens10M and Netflix.
● FPSGD++ scales better than FPSGD.
[Figure: speedup vs. core count for (a) MovieLens10M, (b) Netflix, (c) Yahoo!Music.]
Scalability comparison
The lock waiting time of FPSGD grows with the core count; this long locking time is the main cause of its poor scalability.
[Figure: lock waiting time vs. core count for (a) MovieLens10M, (b) Netflix, (c) Yahoo!Music.]
Convergence speed comparison
dcMF converges to a given RMSE faster than FPSGD.
L2 fill/writeback counts
The measured counts of dcMF with MovieLens10M are lower than those of FPSGD, but none of the measured counts fit the estimates.
[Figure: L2 fill/writeback counts for (a) MovieLens10M, (b) Netflix, (c) Yahoo!Music.]
Conclusion
We proposed dcMF, which parallelizes matrix factorization using the task parallel model, and showed that dcMF surpasses FPSGD in terms of:
● scalability
● convergence speed
Our implementation of dcMF is available at https://github.com/xxthermidorxx/cpmf
Future Work
● Find out why there is no significant difference in L2_CACHE_FILL_WRITEBACK for dcMF with Netflix.
● Investigate the effect of the number of blocks: theoretically, smaller blocks may contribute to the performance of dcMF.
● Find a way to apply dcMF in a distributed environment.
Thank you for listening!