thread cluster memory schedulingomutlu/pub/kim_micro10_talk.pdf · 2010. 12. 22. · conclusion 42...
TRANSCRIPT
![Page 1: Thread Cluster Memory Schedulingomutlu/pub/kim_micro10_talk.pdf · 2010. 12. 22. · Conclusion 42 •No previous memory scheduling algorithm provides both high system throughput](https://reader035.vdocuments.us/reader035/viewer/2022071417/611461d8991e8103e76babc4/html5/thumbnails/1.jpg)
Thread Cluster Memory Scheduling:
Exploiting Differences in Memory Access Behavior
Yoongu KimMichael PapamichaelOnur MutluMor Harchol-Balter
![Page 2: Thread Cluster Memory Schedulingomutlu/pub/kim_micro10_talk.pdf · 2010. 12. 22. · Conclusion 42 •No previous memory scheduling algorithm provides both high system throughput](https://reader035.vdocuments.us/reader035/viewer/2022071417/611461d8991e8103e76babc4/html5/thumbnails/2.jpg)
Motivation
• Memory is a shared resource
• Threads’ requests contend for memory
– Degradation in single thread performance
– Can even lead to starvation
• How to schedule memory requests to increase both system throughput and fairness?
2
Core Core
Core CoreMemory
![Page 3: Thread Cluster Memory Schedulingomutlu/pub/kim_micro10_talk.pdf · 2010. 12. 22. · Conclusion 42 •No previous memory scheduling algorithm provides both high system throughput](https://reader035.vdocuments.us/reader035/viewer/2022071417/611461d8991e8103e76babc4/html5/thumbnails/3.jpg)
1
3
5
7
9
11
13
15
17
8 8.2 8.4 8.6 8.8 9
Max
imu
m S
low
do
wn
Weighted Speedup
FRFCFS
STFM
PAR-BS
ATLAS
Previous Scheduling Algorithms are Biased
3
System throughput bias
Fairness bias
No previous memory scheduling algorithm provides both the best fairness and system throughput
Better system throughput
Bet
ter
fair
ne
ss
![Page 4: Thread Cluster Memory Schedulingomutlu/pub/kim_micro10_talk.pdf · 2010. 12. 22. · Conclusion 42 •No previous memory scheduling algorithm provides both high system throughput](https://reader035.vdocuments.us/reader035/viewer/2022071417/611461d8991e8103e76babc4/html5/thumbnails/4.jpg)
Take turns accessing memory
Why do Previous Algorithms Fail?
4
Fairness biased approach
thread C
thread B
thread A
less memory intensive
higherpriority
Prioritize less memory-intensive threads
Throughput biased approach
Good for throughput
starvation unfairness
thread C thread Bthread A
Does not starve
not prioritized reduced throughput
Single policy for all threads is insufficient
![Page 5: Thread Cluster Memory Schedulingomutlu/pub/kim_micro10_talk.pdf · 2010. 12. 22. · Conclusion 42 •No previous memory scheduling algorithm provides both high system throughput](https://reader035.vdocuments.us/reader035/viewer/2022071417/611461d8991e8103e76babc4/html5/thumbnails/5.jpg)
thread
thread
thread
thread
Insight: Achieving Best of Both Worlds
5
thread
higherpriority
thread
thread
thread
Prioritize memory-non-intensive threads
For Throughput
Unfairness caused by memory-intensive being prioritized over each other
• Shuffle threads
Memory-intensive threads have different vulnerability to interference
• Shuffle asymmetrically
For Fairness
![Page 6: Thread Cluster Memory Schedulingomutlu/pub/kim_micro10_talk.pdf · 2010. 12. 22. · Conclusion 42 •No previous memory scheduling algorithm provides both high system throughput](https://reader035.vdocuments.us/reader035/viewer/2022071417/611461d8991e8103e76babc4/html5/thumbnails/6.jpg)
OutlineMotivation & Insights
Overview
Algorithm
Bringing it All Together
Evaluation
Conclusion
6
![Page 7: Thread Cluster Memory Schedulingomutlu/pub/kim_micro10_talk.pdf · 2010. 12. 22. · Conclusion 42 •No previous memory scheduling algorithm provides both high system throughput](https://reader035.vdocuments.us/reader035/viewer/2022071417/611461d8991e8103e76babc4/html5/thumbnails/7.jpg)
Overview: Thread Cluster Memory Scheduling
1. Group threads into two clusters2. Prioritize non-intensive cluster3. Different policies for each cluster
7
thread
Threads in the system
thread
thread
thread
thread
thread
thread
Non-intensive cluster
Intensive cluster
thread
thread
thread
Memory-non-intensive
Memory-intensive
Prioritized
higherpriority
higherpriority
Throughput
Fairness
![Page 8: Thread Cluster Memory Schedulingomutlu/pub/kim_micro10_talk.pdf · 2010. 12. 22. · Conclusion 42 •No previous memory scheduling algorithm provides both high system throughput](https://reader035.vdocuments.us/reader035/viewer/2022071417/611461d8991e8103e76babc4/html5/thumbnails/8.jpg)
OutlineMotivation & Insights
Overview
Algorithm
Bringing it All Together
Evaluation
Conclusion
8
![Page 9: Thread Cluster Memory Schedulingomutlu/pub/kim_micro10_talk.pdf · 2010. 12. 22. · Conclusion 42 •No previous memory scheduling algorithm provides both high system throughput](https://reader035.vdocuments.us/reader035/viewer/2022071417/611461d8991e8103e76babc4/html5/thumbnails/9.jpg)
TCM Outline
9
1. Clustering
![Page 10: Thread Cluster Memory Schedulingomutlu/pub/kim_micro10_talk.pdf · 2010. 12. 22. · Conclusion 42 •No previous memory scheduling algorithm provides both high system throughput](https://reader035.vdocuments.us/reader035/viewer/2022071417/611461d8991e8103e76babc4/html5/thumbnails/10.jpg)
Clustering Threads
Step1 Sort threads by MPKI (misses per kiloinstruction)
10
thre
ad
thre
ad
thre
ad
thre
ad
thre
ad
thre
ad
higher MPKI
Tα < 10%
ClusterThreshold
Intensive clusterαT
Non-intensivecluster
T = Total memory bandwidth usage
Step2 Memory bandwidth usage αT divides clusters
![Page 11: Thread Cluster Memory Schedulingomutlu/pub/kim_micro10_talk.pdf · 2010. 12. 22. · Conclusion 42 •No previous memory scheduling algorithm provides both high system throughput](https://reader035.vdocuments.us/reader035/viewer/2022071417/611461d8991e8103e76babc4/html5/thumbnails/11.jpg)
TCM Outline
11
1. Clustering
2. Between Clusters
![Page 12: Thread Cluster Memory Schedulingomutlu/pub/kim_micro10_talk.pdf · 2010. 12. 22. · Conclusion 42 •No previous memory scheduling algorithm provides both high system throughput](https://reader035.vdocuments.us/reader035/viewer/2022071417/611461d8991e8103e76babc4/html5/thumbnails/12.jpg)
Prioritize non-intensive cluster
• Increases system throughput
– Non-intensive threads have greater potential for making progress
• Does not degrade fairness
– Non-intensive threads are “light”
– Rarely interfere with intensive threads
Prioritization Between Clusters
12
>priority
![Page 13: Thread Cluster Memory Schedulingomutlu/pub/kim_micro10_talk.pdf · 2010. 12. 22. · Conclusion 42 •No previous memory scheduling algorithm provides both high system throughput](https://reader035.vdocuments.us/reader035/viewer/2022071417/611461d8991e8103e76babc4/html5/thumbnails/13.jpg)
TCM Outline
13
1. Clustering
2. Between Clusters
3. Non-Intensive Cluster
Throughput
![Page 14: Thread Cluster Memory Schedulingomutlu/pub/kim_micro10_talk.pdf · 2010. 12. 22. · Conclusion 42 •No previous memory scheduling algorithm provides both high system throughput](https://reader035.vdocuments.us/reader035/viewer/2022071417/611461d8991e8103e76babc4/html5/thumbnails/14.jpg)
Prioritize threads according to MPKI
• Increases system throughput
– Least intensive thread has the greatest potential for making progress in the processor
Non-Intensive Cluster
14
thread
thread
thread
thread
higherpriority lowest MPKI
highest MPKI
![Page 15: Thread Cluster Memory Schedulingomutlu/pub/kim_micro10_talk.pdf · 2010. 12. 22. · Conclusion 42 •No previous memory scheduling algorithm provides both high system throughput](https://reader035.vdocuments.us/reader035/viewer/2022071417/611461d8991e8103e76babc4/html5/thumbnails/15.jpg)
TCM Outline
15
1. Clustering
2. Between Clusters
3. Non-Intensive Cluster
4. Intensive Cluster
Throughput
Fairness
![Page 16: Thread Cluster Memory Schedulingomutlu/pub/kim_micro10_talk.pdf · 2010. 12. 22. · Conclusion 42 •No previous memory scheduling algorithm provides both high system throughput](https://reader035.vdocuments.us/reader035/viewer/2022071417/611461d8991e8103e76babc4/html5/thumbnails/16.jpg)
Periodically shuffle the priority of threads
• Is treating all threads equally good enough?
• BUT: Equal turns ≠ Same slowdown
Intensive Cluster
16
Increases fairness
Most prioritizedhigherpriority
thread
thread
thread
![Page 17: Thread Cluster Memory Schedulingomutlu/pub/kim_micro10_talk.pdf · 2010. 12. 22. · Conclusion 42 •No previous memory scheduling algorithm provides both high system throughput](https://reader035.vdocuments.us/reader035/viewer/2022071417/611461d8991e8103e76babc4/html5/thumbnails/17.jpg)
02468
101214
random-access streamingSl
ow
do
wn
Case Study: A Tale of Two ThreadsCase Study: Two intensive threads contending
1. random-access
2. streaming
17
Prioritize random-access Prioritize streaming
random-access thread is more easily slowed down
02468
101214
random-access streaming
Slo
wd
ow
n
7xprioritized
1x
11x
prioritized1x
Which is slowed down more easily?
![Page 18: Thread Cluster Memory Schedulingomutlu/pub/kim_micro10_talk.pdf · 2010. 12. 22. · Conclusion 42 •No previous memory scheduling algorithm provides both high system throughput](https://reader035.vdocuments.us/reader035/viewer/2022071417/611461d8991e8103e76babc4/html5/thumbnails/18.jpg)
Why are Threads Different?
18
Bank 1 Bank 2 Bank 3 Bank 4 Memory
rows
![Page 19: Thread Cluster Memory Schedulingomutlu/pub/kim_micro10_talk.pdf · 2010. 12. 22. · Conclusion 42 •No previous memory scheduling algorithm provides both high system throughput](https://reader035.vdocuments.us/reader035/viewer/2022071417/611461d8991e8103e76babc4/html5/thumbnails/19.jpg)
Why are Threads Different?
19
random-access
Bank 1 Bank 2 Bank 3 Bank 4 Memory
rows
•All requests parallel•High bank-level parallelism
activated row
reqreq
reqreq
![Page 20: Thread Cluster Memory Schedulingomutlu/pub/kim_micro10_talk.pdf · 2010. 12. 22. · Conclusion 42 •No previous memory scheduling algorithm provides both high system throughput](https://reader035.vdocuments.us/reader035/viewer/2022071417/611461d8991e8103e76babc4/html5/thumbnails/20.jpg)
Why are Threads Different?
20
streaming
Bank 1 Bank 2 Bank 3 Bank 4 Memory
rows
req
req
reqreq
•All requests Same row•High row-buffer locality
random-access
•All requests parallel•High bank-level parallelism
activated row
![Page 21: Thread Cluster Memory Schedulingomutlu/pub/kim_micro10_talk.pdf · 2010. 12. 22. · Conclusion 42 •No previous memory scheduling algorithm provides both high system throughput](https://reader035.vdocuments.us/reader035/viewer/2022071417/611461d8991e8103e76babc4/html5/thumbnails/21.jpg)
Why are Threads Different?
21
random-access streaming
Bank 1 Bank 2 Bank 3 Bank 4 Memory
rows
•All requests parallel•High bank-level parallelism
•All requests Same row•High row-buffer locality
stuck
Vulnerable to interference
reqreq
req
req
req
req
reqreq
![Page 22: Thread Cluster Memory Schedulingomutlu/pub/kim_micro10_talk.pdf · 2010. 12. 22. · Conclusion 42 •No previous memory scheduling algorithm provides both high system throughput](https://reader035.vdocuments.us/reader035/viewer/2022071417/611461d8991e8103e76babc4/html5/thumbnails/22.jpg)
TCM Outline
22
1. Clustering
2. Between Clusters
3. Non-Intensive Cluster
4. Intensive Cluster
Fairness
Throughput
![Page 23: Thread Cluster Memory Schedulingomutlu/pub/kim_micro10_talk.pdf · 2010. 12. 22. · Conclusion 42 •No previous memory scheduling algorithm provides both high system throughput](https://reader035.vdocuments.us/reader035/viewer/2022071417/611461d8991e8103e76babc4/html5/thumbnails/23.jpg)
Niceness
How to quantify difference between threads?
23
Vulnerability to interference
Bank-level parallelism
Causes interference
Row-buffer locality
+ Niceness -
NicenessHigh Low
![Page 24: Thread Cluster Memory Schedulingomutlu/pub/kim_micro10_talk.pdf · 2010. 12. 22. · Conclusion 42 •No previous memory scheduling algorithm provides both high system throughput](https://reader035.vdocuments.us/reader035/viewer/2022071417/611461d8991e8103e76babc4/html5/thumbnails/24.jpg)
Shuffling: Round-Robin vs. Niceness-Aware
1. Round-Robin shuffling
2. Niceness-Aware shuffling
24
What can go wrong?
![Page 25: Thread Cluster Memory Schedulingomutlu/pub/kim_micro10_talk.pdf · 2010. 12. 22. · Conclusion 42 •No previous memory scheduling algorithm provides both high system throughput](https://reader035.vdocuments.us/reader035/viewer/2022071417/611461d8991e8103e76babc4/html5/thumbnails/25.jpg)
Shuffling: Round-Robin vs. Niceness-Aware
1. Round-Robin shuffling
2. Niceness-Aware shuffling
25
Most prioritized
ShuffleInterval
Priority
Time
Nice thread
Least nice thread
What can go wrong?
A
B
C
D
D A B C D
A
B
D
C
B
C
A
D
C
D
B
A
D
A
C
B
GOOD: Each thread prioritized once
![Page 26: Thread Cluster Memory Schedulingomutlu/pub/kim_micro10_talk.pdf · 2010. 12. 22. · Conclusion 42 •No previous memory scheduling algorithm provides both high system throughput](https://reader035.vdocuments.us/reader035/viewer/2022071417/611461d8991e8103e76babc4/html5/thumbnails/26.jpg)
Shuffling: Round-Robin vs. Niceness-Aware
1. Round-Robin shuffling
2. Niceness-Aware shuffling
26
Most prioritized
ShuffleInterval
Priority
Time
Nice thread
Least nice thread
What can go wrong?
A
B
C
D
D A B C D
A
B
D
C
B
C
A
D
C
D
B
A
D
A
C
B
BAD: Nice threads receive lots of interference
GOOD: Each thread prioritized once
![Page 27: Thread Cluster Memory Schedulingomutlu/pub/kim_micro10_talk.pdf · 2010. 12. 22. · Conclusion 42 •No previous memory scheduling algorithm provides both high system throughput](https://reader035.vdocuments.us/reader035/viewer/2022071417/611461d8991e8103e76babc4/html5/thumbnails/27.jpg)
Shuffling: Round-Robin vs. Niceness-Aware
1. Round-Robin shuffling
2. Niceness-Aware shuffling
27
![Page 28: Thread Cluster Memory Schedulingomutlu/pub/kim_micro10_talk.pdf · 2010. 12. 22. · Conclusion 42 •No previous memory scheduling algorithm provides both high system throughput](https://reader035.vdocuments.us/reader035/viewer/2022071417/611461d8991e8103e76babc4/html5/thumbnails/28.jpg)
Shuffling: Round-Robin vs. Niceness-Aware
1. Round-Robin shuffling
2. Niceness-Aware shuffling
28
Most prioritized
ShuffleInterval
Priority
Time
Nice thread
Least nice threadA
B
C
D
D C B A D
D
A
C
B
B
A
C
D
A
D
B
C
D
A
C
B
GOOD: Each thread prioritized once
![Page 29: Thread Cluster Memory Schedulingomutlu/pub/kim_micro10_talk.pdf · 2010. 12. 22. · Conclusion 42 •No previous memory scheduling algorithm provides both high system throughput](https://reader035.vdocuments.us/reader035/viewer/2022071417/611461d8991e8103e76babc4/html5/thumbnails/29.jpg)
Shuffling: Round-Robin vs. Niceness-Aware
1. Round-Robin shuffling
2. Niceness-Aware shuffling
29
Most prioritized
ShuffleInterval
Priority
Time
Nice thread
Least nice threadA
B
C
D
D C B A D
D
A
C
B
B
A
C
D
A
D
B
C
D
A
C
B
GOOD: Each thread prioritized once
GOOD: Least nice thread stays mostly deprioritized
![Page 30: Thread Cluster Memory Schedulingomutlu/pub/kim_micro10_talk.pdf · 2010. 12. 22. · Conclusion 42 •No previous memory scheduling algorithm provides both high system throughput](https://reader035.vdocuments.us/reader035/viewer/2022071417/611461d8991e8103e76babc4/html5/thumbnails/30.jpg)
TCM Outline
30
1. Clustering
2. Between Clusters
3. Non-Intensive Cluster
4. Intensive Cluster
Fairness
Throughput
![Page 31: Thread Cluster Memory Schedulingomutlu/pub/kim_micro10_talk.pdf · 2010. 12. 22. · Conclusion 42 •No previous memory scheduling algorithm provides both high system throughput](https://reader035.vdocuments.us/reader035/viewer/2022071417/611461d8991e8103e76babc4/html5/thumbnails/31.jpg)
OutlineMotivation & Insights
Overview
Algorithm
Bringing it All Together
Evaluation
Conclusion
31
![Page 32: Thread Cluster Memory Schedulingomutlu/pub/kim_micro10_talk.pdf · 2010. 12. 22. · Conclusion 42 •No previous memory scheduling algorithm provides both high system throughput](https://reader035.vdocuments.us/reader035/viewer/2022071417/611461d8991e8103e76babc4/html5/thumbnails/32.jpg)
Quantum-Based Operation
32
Time
Previous quantum (~1M cycles)
During quantum:• Monitor thread behavior
1. Memory intensity2. Bank-level parallelism3. Row-buffer locality
Beginning of quantum:• Perform clustering• Compute niceness of
intensive threads
Current quantum(~1M cycles)
Shuffle interval(~1K cycles)
![Page 33: Thread Cluster Memory Schedulingomutlu/pub/kim_micro10_talk.pdf · 2010. 12. 22. · Conclusion 42 •No previous memory scheduling algorithm provides both high system throughput](https://reader035.vdocuments.us/reader035/viewer/2022071417/611461d8991e8103e76babc4/html5/thumbnails/33.jpg)
TCM Scheduling Algorithm
1. Highest-rank: Requests from higher ranked threads prioritized
• Non-Intensive cluster > Intensive cluster
• Non-Intensive cluster: lower intensity higher rank
• Intensive cluster: rank shuffling
2.Row-hit: Row-buffer hit requests are prioritized
3.Oldest: Older requests are prioritized
33
![Page 34: Thread Cluster Memory Schedulingomutlu/pub/kim_micro10_talk.pdf · 2010. 12. 22. · Conclusion 42 •No previous memory scheduling algorithm provides both high system throughput](https://reader035.vdocuments.us/reader035/viewer/2022071417/611461d8991e8103e76babc4/html5/thumbnails/34.jpg)
Implementation Costs
Required storage at memory controller (24 cores)
• No computation is on the critical path
34
Thread memory behavior Storage
MPKI ~0.2kb
Bank-level parallelism ~0.6kb
Row-buffer locality ~2.9kb
Total < 4kbits
![Page 35: Thread Cluster Memory Schedulingomutlu/pub/kim_micro10_talk.pdf · 2010. 12. 22. · Conclusion 42 •No previous memory scheduling algorithm provides both high system throughput](https://reader035.vdocuments.us/reader035/viewer/2022071417/611461d8991e8103e76babc4/html5/thumbnails/35.jpg)
OutlineMotivation & Insights
Overview
Algorithm
Bringing it All Together
Evaluation
Conclusion
35
Fairness
Throughput
![Page 36: Thread Cluster Memory Schedulingomutlu/pub/kim_micro10_talk.pdf · 2010. 12. 22. · Conclusion 42 •No previous memory scheduling algorithm provides both high system throughput](https://reader035.vdocuments.us/reader035/viewer/2022071417/611461d8991e8103e76babc4/html5/thumbnails/36.jpg)
Metrics & Methodology
• MetricsSystem throughput
36
i
alone
i
shared
i
IPC
IPCSpeedupWeighted
shared
i
alone
iiIPC
IPCSlowdownMaximum max
Unfairness
• Methodology– Core model
• 4 GHz processor, 128-entry instruction window
• 512 KB/core L2 cache
– Memory model: DDR2
– 96 multiprogrammed SPEC CPU2006 workloads
![Page 37: Thread Cluster Memory Schedulingomutlu/pub/kim_micro10_talk.pdf · 2010. 12. 22. · Conclusion 42 •No previous memory scheduling algorithm provides both high system throughput](https://reader035.vdocuments.us/reader035/viewer/2022071417/611461d8991e8103e76babc4/html5/thumbnails/37.jpg)
Previous Work
FRFCFS [Rixner et al., ISCA00]: Prioritizes row-buffer hits
– Thread-oblivious Low throughput & Low fairness
STFM [Mutlu et al., MICRO07]: Equalizes thread slowdowns
– Non-intensive threads not prioritized Low throughput
PAR-BS [Mutlu et al., ISCA08]: Prioritizes oldest batch of requests while preserving bank-level parallelism
– Non-intensive threads not always prioritized Low throughput
ATLAS [Kim et al., HPCA10]: Prioritizes threads with less memory service
– Most intensive thread starves Low fairness
37
![Page 38: Thread Cluster Memory Schedulingomutlu/pub/kim_micro10_talk.pdf · 2010. 12. 22. · Conclusion 42 •No previous memory scheduling algorithm provides both high system throughput](https://reader035.vdocuments.us/reader035/viewer/2022071417/611461d8991e8103e76babc4/html5/thumbnails/38.jpg)
Results: Fairness vs. Throughput
FRFCFS
STFM
PAR-BS
ATLAS
TCM
4
6
8
10
12
14
16
7.5 8 8.5 9 9.5 10
Max
imu
m S
low
do
wn
Weighted Speedup
38
Better system throughput
Bet
ter
fair
ne
ss
5%
39%
8%
5%
TCM provides best fairness and system throughput
Averaged over 96 workloads
![Page 39: Thread Cluster Memory Schedulingomutlu/pub/kim_micro10_talk.pdf · 2010. 12. 22. · Conclusion 42 •No previous memory scheduling algorithm provides both high system throughput](https://reader035.vdocuments.us/reader035/viewer/2022071417/611461d8991e8103e76babc4/html5/thumbnails/39.jpg)
Results: Fairness-Throughput Tradeoff
FRFCFS
2
4
6
8
10
12
12 13 14 15 16
Max
imu
m S
low
do
wn
Weighted Speedup
39
When configuration parameter is varied…
Adjusting ClusterThreshold
TCM allows robust fairness-throughput tradeoff
STFM
PAR-BS
ATLAS
TCM
Better system throughput
Bet
ter
fair
ne
ss
![Page 40: Thread Cluster Memory Schedulingomutlu/pub/kim_micro10_talk.pdf · 2010. 12. 22. · Conclusion 42 •No previous memory scheduling algorithm provides both high system throughput](https://reader035.vdocuments.us/reader035/viewer/2022071417/611461d8991e8103e76babc4/html5/thumbnails/40.jpg)
Operating System Support
• ClusterThreshold is a tunable knob
– OS can trade off between fairness and throughput
• Enforcing thread weights
– OS assigns weights to threads
– TCM enforces thread weights within each cluster
40
![Page 41: Thread Cluster Memory Schedulingomutlu/pub/kim_micro10_talk.pdf · 2010. 12. 22. · Conclusion 42 •No previous memory scheduling algorithm provides both high system throughput](https://reader035.vdocuments.us/reader035/viewer/2022071417/611461d8991e8103e76babc4/html5/thumbnails/41.jpg)
OutlineMotivation & Insights
Overview
Algorithm
Bringing it All Together
Evaluation
Conclusion
41
Fairness
Throughput
![Page 42: Thread Cluster Memory Schedulingomutlu/pub/kim_micro10_talk.pdf · 2010. 12. 22. · Conclusion 42 •No previous memory scheduling algorithm provides both high system throughput](https://reader035.vdocuments.us/reader035/viewer/2022071417/611461d8991e8103e76babc4/html5/thumbnails/42.jpg)
Conclusion
42
• No previous memory scheduling algorithm provides both high system throughput and fairness
– Problem: They use a single policy for all threads
• TCM groups threads into two clusters
1. Prioritize non-intensive cluster throughput
2. Shuffle priorities in intensive cluster fairness
3. Shuffling should favor nice threads fairness
• TCM provides the best system throughput and fairness
![Page 43: Thread Cluster Memory Schedulingomutlu/pub/kim_micro10_talk.pdf · 2010. 12. 22. · Conclusion 42 •No previous memory scheduling algorithm provides both high system throughput](https://reader035.vdocuments.us/reader035/viewer/2022071417/611461d8991e8103e76babc4/html5/thumbnails/43.jpg)
THANK YOU
43
![Page 44: Thread Cluster Memory Schedulingomutlu/pub/kim_micro10_talk.pdf · 2010. 12. 22. · Conclusion 42 •No previous memory scheduling algorithm provides both high system throughput](https://reader035.vdocuments.us/reader035/viewer/2022071417/611461d8991e8103e76babc4/html5/thumbnails/44.jpg)
Thread Cluster Memory Scheduling:
Exploiting Differences in Memory Access Behavior
Yoongu KimMichael PapamichaelOnur MutluMor Harchol-Balter
![Page 45: Thread Cluster Memory Schedulingomutlu/pub/kim_micro10_talk.pdf · 2010. 12. 22. · Conclusion 42 •No previous memory scheduling algorithm provides both high system throughput](https://reader035.vdocuments.us/reader035/viewer/2022071417/611461d8991e8103e76babc4/html5/thumbnails/45.jpg)
Thread Weight Support
• Even if heaviest weighted thread happens to be the most intensive thread…
– Not prioritized over the least intensive thread
45
![Page 46: Thread Cluster Memory Schedulingomutlu/pub/kim_micro10_talk.pdf · 2010. 12. 22. · Conclusion 42 •No previous memory scheduling algorithm provides both high system throughput](https://reader035.vdocuments.us/reader035/viewer/2022071417/611461d8991e8103e76babc4/html5/thumbnails/46.jpg)
Harmonic Speedup
46
Better system throughput
Bet
ter
fair
ne
ss
![Page 47: Thread Cluster Memory Schedulingomutlu/pub/kim_micro10_talk.pdf · 2010. 12. 22. · Conclusion 42 •No previous memory scheduling algorithm provides both high system throughput](https://reader035.vdocuments.us/reader035/viewer/2022071417/611461d8991e8103e76babc4/html5/thumbnails/47.jpg)
Shuffling Algorithm Comparison
• Niceness-Aware shuffling
– Average of maximum slowdown is lower
– Variance of maximum slowdown is lower
47
Shuffling Algorithm
Round-Robin Niceness-Aware
E(Maximum Slowdown) 5.58 4.84
VAR(Maximum Slowdown) 1.61 0.85
![Page 48: Thread Cluster Memory Schedulingomutlu/pub/kim_micro10_talk.pdf · 2010. 12. 22. · Conclusion 42 •No previous memory scheduling algorithm provides both high system throughput](https://reader035.vdocuments.us/reader035/viewer/2022071417/611461d8991e8103e76babc4/html5/thumbnails/48.jpg)
Sensitivity Results
48
ShuffleInterval (cycles)
500 600 700 800
System Throughput 14.2 14.3 14.2 14.7
Maximum Slowdown 6.0 5.4 5.9 5.5
Number of Cores
4 8 16 24 32
System Throughput(compared to ATLAS)
0% 3% 2% 1% 1%
Maximum Slowdown(compared to ATLAS)
-4% -30% -29% -30% -41%