Stall-Time Fair Memory Access Scheduling
Onur Mutlu and Thomas Moscibroda
Computer Architecture Group, Microsoft Research
Multi-Core Systems
Figure: a multi-core chip with four cores (CORE 0 to CORE 3), each with an L2 cache, connected through a DRAM memory controller to a shared DRAM memory system of DRAM Banks 0 through 7. The shared DRAM memory system is the source of unfairness.
DRAM Bank Operation
Figure: a DRAM bank organized as rows and columns, with a row decoder, a column decoder, and a row buffer.
- Access (Row 0, Column 0): the row buffer is empty, so Row 0 is loaded into it.
- Access (Row 0, Column 1): HIT, Row 0 is already in the row buffer.
- Access (Row 0, Column 9): HIT.
- Access (Row 1, Column 0): CONFLICT! Row 1 must replace Row 0 in the row buffer.
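A minimal sketch of this row-buffer behavior (not the paper's simulator; the `DramBank` class is illustrative, the latencies come from the talk's methodology slide, and an access to a closed bank is treated as a conflict here for simplicity):

```python
ROW_HIT_NS = 35       # row-hit round-trip latency (methodology slide)
ROW_CONFLICT_NS = 70  # row-conflict latency (methodology slide)

class DramBank:
    def __init__(self):
        self.open_row = None  # row currently held in the row buffer

    def access(self, row):
        """Return (status, latency_ns) for an access to `row`."""
        if self.open_row == row:
            return "hit", ROW_HIT_NS
        self.open_row = row   # close the old row, open the new one
        return "conflict", ROW_CONFLICT_NS

bank = DramBank()
# replay the slide's sequence: (0,0), (0,1), (0,9), (1,0)
results = [bank.access(r) for r in [0, 0, 0, 1]]
```

The sequence reproduces the slide's HIT, HIT, CONFLICT pattern once Row 0 is open.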
DRAM Controllers
- A row-conflict memory access takes significantly longer than a row-hit access.
- Current controllers take advantage of the row buffer.
- Commonly used scheduling policy (FR-FCFS) [Rixner, ISCA'00]:
  (1) Row-hit (column) first: service row-hit memory accesses first
  (2) Oldest-first: then service older accesses first
- This scheduling policy aims to maximize DRAM throughput, but it is unfair when multiple threads share the DRAM system.
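The FR-FCFS selection rule can be sketched as follows, assuming each request is an `(arrival_time, thread, row)` tuple (the tuple layout and names are illustrative, not from the paper):

```python
def fr_fcfs_pick(requests, open_row):
    """Row-hit first, then oldest-first (smallest arrival time)."""
    hits = [r for r in requests if r[2] == open_row]  # row-hit requests
    pool = hits if hits else requests                 # fall back to all
    return min(pool, key=lambda r: r[0])              # oldest in the pool

# A younger row-hit from T0 bypasses an older row-conflict from T1:
reqs = [(0, "T1", 5), (1, "T0", 0), (2, "T0", 0)]
picked = fr_fcfs_pick(reqs, open_row=0)
```

With Row 0 open, `picked` is T0's request from time 1, even though T1's request arrived earlier; with no row open that matches, the oldest request wins.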
Outline
- The Problem: Unfair DRAM Scheduling
- Stall-Time Fair Memory Scheduling: fairness definition, algorithm, implementation, system software support
- Experimental Evaluation
- Conclusions
The Problem
- Multiple threads share the DRAM controller.
- DRAM controllers are designed to maximize DRAM throughput.
- DRAM scheduling policies are thread-unaware and unfair:
  - Row-hit first: unfairly prioritizes threads with high row-buffer locality, i.e., streaming threads that keep accessing the same row.
  - Oldest-first: unfairly prioritizes memory-intensive threads.
The Problem
Figure: a request buffer in front of a DRAM bank whose row buffer holds Row 0. T0 (a streaming thread) has many requests to Row 0; T1 (a non-streaming thread) has requests to Rows 5, 16, and 111. Under row-hit-first scheduling, T0's row hits keep bypassing T1's requests.
Row size: 8 KB, cache block size: 64 B, so 128 requests of T0 are serviced before T1.
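The arithmetic behind the 128-request figure: a streaming thread touching an 8 KB row with 64 B cache blocks can issue one row-hit request per block before the row is exhausted.

```python
row_size = 8 * 1024   # 8 KB row buffer, in bytes
block = 64            # 64 B cache block

# number of row-hit requests the streaming thread can issue
# before a single waiting row-conflict request gets served
row_hits_available = row_size // block
```

This yields 128, matching the slide's claim.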
Consequences of Unfairness in DRAM
- Vulnerability to denial of service [Moscibroda & Mutlu, Usenix Security'07]
- System throughput loss
- Priority inversion at the system/OS level
- Poor performance predictability
Figure: memory slowdowns of four co-running applications (chart values: 1.05, 1.85, 4.72, 7.74), even though DRAM is the only shared resource.
Outline
- The Problem: Unfair DRAM Scheduling
- Stall-Time Fair Memory Scheduling: fairness definition, algorithm, implementation, system software support
- Experimental Evaluation
- Conclusions
Fairness in Shared DRAM Systems
- A thread's DRAM performance depends on its inherent:
  - Row-buffer locality
  - Bank parallelism
- Interference between threads can destroy either or both.
- A fair DRAM scheduler should take into account all factors affecting each thread's DRAM performance, not solely bandwidth or solely request latency.
- Observation: a thread's performance degradation due to interference in DRAM is mainly characterized by the extra memory-related stall-time caused by contention with other threads.
Stall-Time Fairness in Shared DRAM Systems
- A DRAM system is fair if it slows down equal-priority threads equally, compared to when each thread runs alone on the same system.
- This fairness notion is similar to those used for SMT [Cazorla, IEEE Micro'04][Luo, ISPASS'01], SoEMT [Gabor, Micro'06], and shared caches [Kim, PACT'04].
- Tshared: DRAM-related stall-time when the thread is running with other threads
- Talone: DRAM-related stall-time when the thread is running alone
- Memory-slowdown = Tshared / Talone
- The goal of the Stall-Time Fair Memory scheduler (STFM) is to equalize memory-slowdown across all threads, without sacrificing performance. It considers the inherent DRAM performance of each thread.
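The definitions on this slide translate directly into code; the helper names below are illustrative, and unfairness is computed as the deck later defines it (MAX slowdown / MIN slowdown):

```python
def memory_slowdown(t_shared, t_alone):
    """Memory-slowdown = Tshared / Talone for one thread."""
    return t_shared / t_alone

def unfairness(slowdowns):
    """MAX slowdown / MIN slowdown across threads."""
    return max(slowdowns) / min(slowdowns)

# two threads: one heavily slowed by contention, one barely affected
s = [memory_slowdown(ts, ta) for ts, ta in [(300, 100), (110, 100)]]
u = unfairness(s)
```

Here one thread is slowed 3.0x and the other 1.1x, so unfairness is about 2.7; a perfectly stall-time-fair system would drive this ratio to 1.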
Outline
- The Problem: Unfair DRAM Scheduling
- Stall-Time Fair Memory Scheduling: fairness definition, algorithm, implementation, system software support
- Experimental Evaluation
- Conclusions
STFM Scheduling Algorithm (1)
- During each time interval, for each thread, the DRAM controller:
  - Tracks Tshared
  - Estimates Talone
- At the beginning of a scheduling cycle, the DRAM controller:
  - Computes Slowdown = Tshared / Talone for each thread with an outstanding legal request
  - Computes unfairness = MAX Slowdown / MIN Slowdown
- If unfairness < α: use the DRAM-throughput-oriented baseline scheduling policy: (1) row-hit first, (2) oldest-first
STFM Scheduling Algorithm (2)
- If unfairness ≥ α: use the fairness-oriented scheduling policy: (1) requests from the thread with MAX Slowdown first, (2) row-hit first, (3) oldest-first
- Maximizes DRAM throughput if it cannot improve fairness.
- Does NOT waste useful bandwidth to improve fairness: if a request does not interfere with any other, it is scheduled.
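The two-mode decision on these slides can be sketched as follows. This is an illustrative reconstruction, not the paper's implementation: requests are assumed to be `(arrival_time, thread, row)` tuples, `slowdown` maps each thread to its Tshared/Talone ratio, and `alpha` stands for the unfairness threshold.

```python
def stfm_pick(requests, open_row, slowdown, alpha):
    """Baseline FR-FCFS when fair enough, else most-slowed thread first."""
    sd = {t: slowdown[t] for t in {r[1] for r in requests}}
    unfairness = max(sd.values()) / min(sd.values())
    if unfairness >= alpha:
        # fairness mode: restrict to requests of the most-slowed thread
        worst = max(sd, key=sd.get)
        requests = [r for r in requests if r[1] == worst]
    hits = [r for r in requests if r[2] == open_row]       # row-hit first
    return min(hits if hits else requests, key=lambda r: r[0])  # oldest

reqs = [(0, "T0", 0), (1, "T1", 5)]
# unfairness 1.15/1.02 exceeds alpha, so T1's row-conflict request wins
fair_pick = stfm_pick(reqs, 0, {"T0": 1.02, "T1": 1.15}, alpha=1.10)
# unfairness 1.05/1.02 is below alpha, so the baseline row-hit wins
base_pick = stfm_pick(reqs, 0, {"T0": 1.02, "T1": 1.05}, alpha=1.10)
```

The same request buffer yields different winners depending only on the measured slowdowns, which is exactly the behavior the two slides describe.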
How Does STFM Prevent Unfairness?
Figure: the earlier request-buffer example replayed under STFM, with T0's row-hit requests to Row 0 and T1's requests to Rows 5, 16, and 111. T0's and T1's slowdowns (initially 1.00) and the resulting unfairness are updated after each scheduled request; whenever unfairness exceeds the threshold, STFM services one of T1's requests ahead of T0's row hits. The slowdown values shown in the animation stay between 1.00 and 1.14 for both threads.
Outline
- The Problem: Unfair DRAM Scheduling
- Stall-Time Fair Memory Scheduling: fairness definition, algorithm, implementation, system software support
- Experimental Evaluation
- Conclusions
Implementation
- Tracking Tshared: relatively easy. The processor increments a counter whenever the thread cannot commit instructions because the oldest instruction requires DRAM access.
- Estimating Talone: more involved, because the thread is not running alone, so it is difficult to estimate directly.
- Observation: Talone = Tshared - Tinterference
- So STFM estimates Tinterference, the extra stall-time due to interference.
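The observation rearranged as code; trivial, but it is the identity the whole estimator rests on: Talone cannot be measured while other threads run, so STFM tracks Tshared directly and accumulates an estimate of Tinterference.

```python
def estimate_t_alone(t_shared, t_interference):
    """Talone is inferred, not measured: Talone = Tshared - Tinterference."""
    return t_shared - t_interference

# e.g. 500 cycles of measured stall, 120 attributed to interference
t_alone = estimate_t_alone(500.0, 120.0)
```

Everything that follows on the next two slides is about how to accumulate the `t_interference` term.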
Estimating Tinterference (1)
- When a DRAM request from thread C is scheduled, thread C itself can incur extra stall time:
  - The request's row-buffer hit status might be affected by interference.
  - Estimate the row that would have been in the row buffer if the thread were running alone.
  - Estimate the extra bank access latency the request incurs.
- Tinterference(C) += Extra Bank Access Latency / (# Banks Servicing C's Requests)
- The extra latency is amortized across the outstanding accesses of thread C (memory-level parallelism).
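The update above as a sketch; parameter names are illustrative. Dividing by the number of banks servicing C's requests captures the amortization: with memory-level parallelism, extra latency in one bank overlaps with useful work in others.

```python
def update_tinterference_own(t_interference, extra_bank_latency,
                             banks_servicing_c):
    """Charge thread C for its own extra bank latency, amortized
    over the banks concurrently servicing C's requests."""
    return t_interference + extra_bank_latency / banks_servicing_c

# 140 cycles of extra latency, but two banks are servicing C in parallel
t_int = update_tinterference_own(0.0, 140.0, 2)
```

With two banks in flight, only half the raw extra latency is counted as interference-induced stall.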
Estimating Tinterference (2)
- When a DRAM request from thread C is scheduled, any other thread C' with outstanding requests incurs extra stall time:
  - Interference on the DRAM data bus:
    Tinterference(C') += Bus Transfer Latency of Scheduled Request
  - Interference in the DRAM bank (see paper):
    Tinterference(C') += Bank Access Latency of Scheduled Request / (# Banks Needed by C''s Requests × K)
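Both updates as sketches. The grouping of the constant K with the bank-parallelism term follows the slide's layout and is an assumption here; the paper gives the exact form.

```python
def bus_interference(t_int, bus_transfer_latency):
    """C' stalls for the full bus transfer of the scheduled request."""
    return t_int + bus_transfer_latency

def bank_interference(t_int, bank_access_latency, banks_needed_by_cprime, k):
    """Bank delay charged to C', scaled down by C''s bank parallelism
    and the constant K (assumed grouping, see paper)."""
    return t_int + bank_access_latency / (banks_needed_by_cprime * k)

t1 = bus_interference(10.0, 5.0)
t2 = bank_interference(0.0, 140.0, 2, 2)
```

As on the previous slide, a thread C' that keeps many banks busy absorbs bank-level interference more gracefully, so less stall time is charged to it.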
Hardware Cost
- <2 KB storage cost for an 8-core system with a 128-entry memory request buffer
- Arithmetic operations approximated: fixed-point arithmetic, divisions using lookup tables
- Not on the critical path: the scheduler makes a decision only every DRAM cycle
- More details in the paper
Outline
- The Problem: Unfair DRAM Scheduling
- Stall-Time Fair Memory Scheduling: fairness definition, algorithm, implementation, system software support
- Experimental Evaluation
- Conclusions
Support for System Software
- Supporting system-level thread weights/priorities:
  - Thread weights are communicated to the memory controller.
  - Larger-weight threads should be slowed down less.
  - Each thread's slowdown is scaled by its weight, and the weighted slowdown is used for scheduling, favoring threads with larger weights.
  - The OS can choose thread weights to satisfy QoS requirements.
- α: maximum tolerable unfairness, set by system software
  - Don't need fairness? Set α large.
  - Need strict fairness? Set α close to 1.
  - Other values of α trade off fairness and throughput.
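A sketch of weight-based scaling. The slide says only that each thread's slowdown "is scaled by its weight", so the simple multiplicative form below is an assumption for illustration; the paper defines the exact scaling.

```python
def weighted_slowdown(slowdown, weight):
    """Assumed scaling: larger weight inflates the measured slowdown,
    so the thread reaches MAX-slowdown status (and priority) sooner."""
    return slowdown * weight

# T0 is actually more slowed, but T1 carries a larger OS-assigned weight
sd = {"T0": 1.30, "T1": 1.10}
w = {"T0": 1, "T1": 4}
prioritized = max(sd, key=lambda t: weighted_slowdown(sd[t], w[t]))
```

Under this scaling the scheduler's fairness mode would service T1 first, which is the intended effect: equal slowdowns per unit of weight rather than equal raw slowdowns.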
Outline
- The Problem: Unfair DRAM Scheduling
- Stall-Time Fair Memory Scheduling: fairness definition, algorithm, implementation, system software support
- Experimental Evaluation
- Conclusions
Evaluation Methodology
- 2-, 4-, 8-, and 16-core systems
- x86 processor model based on Intel Pentium M: 4 GHz, 128-entry instruction window, 512 KB private L2 cache per core
- Detailed DRAM model based on Micron DDR2-800: 128-entry memory request buffer, 8 banks, 2 KB row buffer, row-hit round-trip latency 35 ns (140 cycles), row-conflict latency 70 ns (280 cycles)
- Benchmarks: SPEC CPU2006 and some Windows desktop applications; 256, 32, and 3 benchmark combinations for the 4-, 8-, and 16-core experiments
Comparison with Related Work
- Baseline FR-FCFS [Rixner et al., ISCA'00]: unfairly penalizes non-intensive threads with low row-buffer locality.
- FCFS: low DRAM throughput; unfairly penalizes non-intensive threads.
- FR-FCFS+Cap: static cap on how many younger row-hits can bypass older accesses; unfairly penalizes non-intensive threads.
- Network Fair Queueing (NFQ) [Nesbit et al., Micro'06]: per-thread virtual-time based scheduling.
  - A thread's private virtual time increases when its request is scheduled.
  - Prioritizes requests from the thread with the earliest virtual time.
  - Equalizes bandwidth across equal-priority threads, but does not consider the inherent performance of each thread.
  - Unfairly prioritizes threads with non-bursty access patterns (idleness problem).
  - Unfairly penalizes threads with unbalanced bank usage (in paper).
Idleness/Burstiness Problem in Fair Queueing
Figure: Thread 1's virtual time increases even though no other thread needs DRAM. Afterwards, only Thread 2 is serviced in interval [t1,t2], only Thread 3 in [t2,t3], and only Thread 4 in [t3,t4], because each has a smaller virtual time than Thread 1's.
The non-bursty thread suffers a large performance loss even though it fairly utilized DRAM when no other thread needed it.
Unfairness on 4-, 8-, 16-Core Systems
Unfairness = MAX Memory Slowdown / MIN Memory Slowdown
Figure: unfairness of FR-FCFS, FCFS, FR-FCFS+Cap, NFQ, and STFM on 4-, 8-, and 16-core systems (y-axis from 1 to 6). STFM reduces unfairness over the best previous scheduler by factors of 1.26X to 1.81X (chart annotations: 1.27X, 1.81X, 1.26X).
System Performance
Figure: normalized weighted speedup of FR-FCFS, FCFS, FR-FCFS+Cap, NFQ, and STFM on 4-, 8-, and 16-core systems. STFM improves system performance by 5.8%, 4.1%, and 4.6% (chart annotations).
Hmean-Speedup (Throughput-Fairness Balance)
Figure: normalized harmonic-mean speedup of FR-FCFS, FCFS, FR-FCFS+Cap, NFQ, and STFM on 4-, 8-, and 16-core systems. STFM improves hmean-speedup by 10.8%, 9.5%, and 11.2% (chart annotations).
Outline
- The Problem: Unfair DRAM Scheduling
- Stall-Time Fair Memory Scheduling: fairness definition, algorithm, implementation, system software support
- Experimental Evaluation
- Conclusions
Conclusions
- A new definition of DRAM fairness: stall-time fairness.
  - Equal-priority threads should experience equal memory-related slowdowns.
  - Takes into account the inherent memory performance of threads.
- A new DRAM scheduling algorithm enforces this definition:
  - A flexible and configurable fairness substrate
  - Supports system-level thread priorities/weights and QoS policies
- Results across a wide range of workloads and systems show:
  - Improving DRAM fairness also improves system throughput.
  - STFM provides better fairness and system performance than previously proposed DRAM schedulers.
Thank you. Questions?
Backup
Structure of the STFM Controller
Comparison Using NFQ QoS Metrics
- Nesbit et al. [MICRO'06] proposed the following target for quality of service: a thread that is allocated 1/Nth of the memory system bandwidth will run no slower than the same thread on a private memory system running at 1/Nth of the frequency of the shared physical memory system.
- Baseline: the same thread with memory bandwidth scaled down by N.
- We compared different DRAM schedulers' effectiveness using this metric:
  - Number of violations of the above QoS target
  - Harmonic mean of IPC normalized to the above baseline
Violations of the NFQ QoS Target
Figure: percentage of workloads for which the QoS objective is NOT satisfied (y-axis 0% to 60%), for FR-FCFS, FCFS, FR-FCFS+Cap, NFQ, and STFM on 4-, 8-, and 16-core systems.
Hmean Normalized IPC Using the NFQ Baseline
Figure: harmonic mean of IPC normalized to Nesbit's baseline for FR-FCFS, FCFS, FR-FCFS+Cap, NFQ, and STFM on 4-, 8-, and 16-core systems. Chart annotations: 10.3%, 9.1%, 7.8% and 7.3%, 5.9%, 5.1%.
Shortcomings of the NFQ QoS Target
- Low baseline (easily achievable target) for equal-priority threads: with N equal-priority threads, a thread only needs to do better than on a system with 1/Nth of the memory bandwidth. This target is usually very easy to achieve, especially when N is large.
- Unachievable target in some cases: consider two threads always accessing the same bank in an interleaved fashion; there is too much interference.
- Baseline performance is very difficult to determine in a real system: memory frequency cannot be scaled arbitrarily, and not knowing baseline performance makes it difficult to set thread priorities (how much bandwidth to assign to each thread).
A Case Study
Figure: memory slowdown of mcf, libquantum, GemsFDTD, and astar under FR-FCFS, FCFS, FR-FCFS+Cap, NFQ, and STFM (y-axis 0 to 8). Unfairness: 7.28, 2.07, 2.08, 1.87, and 1.27, respectively.
Windows Desktop Workloads
Enforcing Thread Weights
Effect of α
Effect of Banks and Row Buffer Size