![Page 1: Memory Management in NUMA Multicore Systems: Trapped between Cache Contention and Interconnect Overhead Zoltan Majo and Thomas R. Gross Department of Computer](https://reader034.vdocuments.us/reader034/viewer/2022051400/551c288c550346a34f8b5f29/html5/thumbnails/1.jpg)
1
Memory Management inNUMA Multicore Systems:Trapped between Cache Contention and Interconnect Overhead
Zoltan Majo and Thomas R. Gross
Department of Computer ScienceETH Zurich
![Page 2: Memory Management in NUMA Multicore Systems: Trapped between Cache Contention and Interconnect Overhead Zoltan Majo and Thomas R. Gross Department of Computer](https://reader034.vdocuments.us/reader034/viewer/2022051400/551c288c550346a34f8b5f29/html5/thumbnails/2.jpg)
NUMA multicores
[Figure: two-processor NUMA system (Processor 0 and Processor 1, cores 0 to 3); each processor has a shared cache, a memory controller (MC), an interconnect link (IC), and local DRAM memory.]
![Page 3: Memory Management in NUMA Multicore Systems: Trapped between Cache Contention and Interconnect Overhead Zoltan Majo and Thomas R. Gross Department of Computer](https://reader034.vdocuments.us/reader034/viewer/2022051400/551c288c550346a34f8b5f29/html5/thumbnails/3.jpg)
NUMA multicores
Two problems:
• NUMA: interconnect overhead
[Figure: processes A and B and their memory MA and MB on the two-processor NUMA system (Processor 0 and Processor 1, connected through the IC links).]
![Page 4: Memory Management in NUMA Multicore Systems: Trapped between Cache Contention and Interconnect Overhead Zoltan Majo and Thomas R. Gross Department of Computer](https://reader034.vdocuments.us/reader034/viewer/2022051400/551c288c550346a34f8b5f29/html5/thumbnails/4.jpg)
NUMA multicores
Two problems:
• NUMA: interconnect overhead
• multicore: cache contention
[Figure: processes A and B and their memory MA and MB on the two-processor NUMA system, with the shared cache highlighted.]
![Page 5: Memory Management in NUMA Multicore Systems: Trapped between Cache Contention and Interconnect Overhead Zoltan Majo and Thomas R. Gross Department of Computer](https://reader034.vdocuments.us/reader034/viewer/2022051400/551c288c550346a34f8b5f29/html5/thumbnails/5.jpg)
Outline
• NUMA: experimental evaluation
• Scheduling
– N-MASS
– N-MASS evaluation
![Page 6: Memory Management in NUMA Multicore Systems: Trapped between Cache Contention and Interconnect Overhead Zoltan Majo and Thomas R. Gross Department of Computer](https://reader034.vdocuments.us/reader034/viewer/2022051400/551c288c550346a34f8b5f29/html5/thumbnails/6.jpg)
Multi-clone experiments
• Memory behavior of unrelated programs
• Intel Xeon E5520
• 4 clones of soplex (SPEC CPU2006)
– local clone
– remote clone
[Figure: two-processor system (Processor 0 and Processor 1, cores 0 to 7) running the clones (C) with their memory (M).]
![Page 7: Memory Management in NUMA Multicore Systems: Trapped between Cache Contention and Interconnect Overhead Zoltan Majo and Thomas R. Gross Department of Computer](https://reader034.vdocuments.us/reader034/viewer/2022051400/551c288c550346a34f8b5f29/html5/thumbnails/7.jpg)
[Figure: five schedules of the four clones (labeled 1 to 5), with local bandwidth of 100%, 80%, 57%, 32%, and 0%.]
![Page 8: Memory Management in NUMA Multicore Systems: Trapped between Cache Contention and Interconnect Overhead Zoltan Majo and Thomas R. Gross Department of Computer](https://reader034.vdocuments.us/reader034/viewer/2022051400/551c288c550346a34f8b5f29/html5/thumbnails/8.jpg)
Performance of schedules
• Which is the best schedule?
• Baseline: single-program execution mode
[Figure: baseline configuration with a single clone (C) and its memory (M).]
![Page 9: Memory Management in NUMA Multicore Systems: Trapped between Cache Contention and Interconnect Overhead Zoltan Majo and Thomas R. Gross Department of Computer](https://reader034.vdocuments.us/reader034/viewer/2022051400/551c288c550346a34f8b5f29/html5/thumbnails/9.jpg)
[Figure: execution time as slowdown relative to baseline (y-axis, 1.0 to 2.4) versus local memory bandwidth (x-axis, 0% to 100%), for local clones, remote clones, and the average.]
![Page 10: Memory Management in NUMA Multicore Systems: Trapped between Cache Contention and Interconnect Overhead Zoltan Majo and Thomas R. Gross Department of Computer](https://reader034.vdocuments.us/reader034/viewer/2022051400/551c288c550346a34f8b5f29/html5/thumbnails/10.jpg)
Outline
• NUMA: experimental evaluation
• Scheduling
– N-MASS
– N-MASS evaluation
![Page 11: Memory Management in NUMA Multicore Systems: Trapped between Cache Contention and Interconnect Overhead Zoltan Majo and Thomas R. Gross Department of Computer](https://reader034.vdocuments.us/reader034/viewer/2022051400/551c288c550346a34f8b5f29/html5/thumbnails/11.jpg)
N-MASS (NUMA-Multicore-Aware Scheduling Scheme)
Two steps:
– Step 1: maximum-local mapping
– Step 2: cache-aware refinement
![Page 12: Memory Management in NUMA Multicore Systems: Trapped between Cache Contention and Interconnect Overhead Zoltan Majo and Thomas R. Gross Department of Computer](https://reader034.vdocuments.us/reader034/viewer/2022051400/551c288c550346a34f8b5f29/html5/thumbnails/12.jpg)
Step 1: Maximum-local mapping
[Figure: processes A, B, C, D and their memory MA, MB, MC, MD on the two-processor system (Processor 0 and Processor 1, cores 0 to 7).]
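A minimal sketch of Step 1, assuming that each process's memory sits entirely on one processor (as the Implementation slide states). The function name `maximum_local_mapping` and the input format are illustrative and not taken from N-MASS; the sketch also ignores how many cores each processor has.

```python
def maximum_local_mapping(memory_home):
    """Map each process to the processor that holds its memory.

    memory_home: dict of process name -> id of the processor holding
    its memory (hypothetical input format).
    Returns a dict of processor id -> sorted list of processes.
    """
    mapping = {}
    for process, processor in memory_home.items():
        mapping.setdefault(processor, []).append(process)
    return {p: sorted(procs) for p, procs in mapping.items()}


# Made-up example: memory of A, B, C on Processor 0, memory of D on Processor 1.
print(maximum_local_mapping({"A": 0, "B": 0, "C": 0, "D": 1}))
```

With this mapping every memory access is local, but the caches can end up unevenly loaded, which is what Step 2 addresses.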
![Page 13: Memory Management in NUMA Multicore Systems: Trapped between Cache Contention and Interconnect Overhead Zoltan Majo and Thomas R. Gross Department of Computer](https://reader034.vdocuments.us/reader034/viewer/2022051400/551c288c550346a34f8b5f29/html5/thumbnails/13.jpg)
Default OS scheduling
[Figure: processes A, B, C, D and their memory MA, MB, MC, MD on the two-processor system (Processor 0 and Processor 1, cores 0 to 7) under default OS scheduling.]
![Page 14: Memory Management in NUMA Multicore Systems: Trapped between Cache Contention and Interconnect Overhead Zoltan Majo and Thomas R. Gross Department of Computer](https://reader034.vdocuments.us/reader034/viewer/2022051400/551c288c550346a34f8b5f29/html5/thumbnails/14.jpg)
N-MASS (NUMA-Multicore-Aware Scheduling Scheme)
Two steps:
– Step 1: maximum-local mapping
– Step 2: cache-aware refinement
![Page 15: Memory Management in NUMA Multicore Systems: Trapped between Cache Contention and Interconnect Overhead Zoltan Majo and Thomas R. Gross Department of Computer](https://reader034.vdocuments.us/reader034/viewer/2022051400/551c288c550346a34f8b5f29/html5/thumbnails/15.jpg)
Step 2: Cache-aware refinement
In an SMP:
[Figure: processes A, B, C, D and their memory MA, MB, MC, MD on the two-processor system (Processor 0 and Processor 1, cores 0 to 7).]
![Page 16: Memory Management in NUMA Multicore Systems: Trapped between Cache Contention and Interconnect Overhead Zoltan Majo and Thomas R. Gross Department of Computer](https://reader034.vdocuments.us/reader034/viewer/2022051400/551c288c550346a34f8b5f29/html5/thumbnails/16.jpg)
Step 2: Cache-aware refinement
In an SMP:
[Figure: processes A, B, C, D and their memory MA, MB, MC, MD on the two-processor system (cores 0 to 7), with MA highlighted.]
![Page 17: Memory Management in NUMA Multicore Systems: Trapped between Cache Contention and Interconnect Overhead Zoltan Majo and Thomas R. Gross Department of Computer](https://reader034.vdocuments.us/reader034/viewer/2022051400/551c288c550346a34f8b5f29/html5/thumbnails/17.jpg)
Step 2: Cache-aware refinement
In an SMP:
[Figure: processes A, B, C, D and their memory MA, MB, MC, MD on the two-processor system (cores 0 to 7), next to a bar chart of the performance degradation of A, B, C, and D; one annotation is labeled "NUMA penalty".]
![Page 18: Memory Management in NUMA Multicore Systems: Trapped between Cache Contention and Interconnect Overhead Zoltan Majo and Thomas R. Gross Department of Computer](https://reader034.vdocuments.us/reader034/viewer/2022051400/551c288c550346a34f8b5f29/html5/thumbnails/18.jpg)
Step 2: Cache-aware refinement
In a NUMA:
[Figure: processes A, B, C, D and their memory MA, MB, MC, MD on the two-processor system (Processor 0 and Processor 1, cores 0 to 7).]
![Page 19: Memory Management in NUMA Multicore Systems: Trapped between Cache Contention and Interconnect Overhead Zoltan Majo and Thomas R. Gross Department of Computer](https://reader034.vdocuments.us/reader034/viewer/2022051400/551c288c550346a34f8b5f29/html5/thumbnails/19.jpg)
Step 2: Cache-aware refinement
In a NUMA:
[Figure: processes A, B, C, D and their memory MA, MB, MC, MD on the two-processor system (cores 0 to 7), with a changed process-to-processor placement.]
![Page 20: Memory Management in NUMA Multicore Systems: Trapped between Cache Contention and Interconnect Overhead Zoltan Majo and Thomas R. Gross Department of Computer](https://reader034.vdocuments.us/reader034/viewer/2022051400/551c288c550346a34f8b5f29/html5/thumbnails/20.jpg)
Step 2: Cache-aware refinement
In a NUMA:
[Figure: processes A, B, C, D and their memory MA, MB, MC, MD on the two-processor system (cores 0 to 7), next to a bar chart of the performance degradation of A, B, C, and D, annotated with "NUMA penalty" and "NUMA allowance".]
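The slides do not spell out the refinement procedure, so the following is only a stand-in sketch of the trade-off they illustrate: moving a process away from its memory relieves one cache but makes the process pay the NUMA penalty. It assumes two processors, a cache pressure of MPKI for local and MPKI x NUMA penalty for remote processes (the definition on the Performance factors slide), and a simple greedy rule. The names `pressure` and `refine`, the greedy rule, and all numbers are hypothetical and not taken from N-MASS.

```python
def pressure(procs, placement, processor):
    """Total cache pressure on `processor`: MPKI for local processes,
    MPKI x NUMA penalty for processes running away from their memory."""
    total = 0.0
    for name, (mpki, penalty, home) in procs.items():
        if placement[name] == processor:
            total += mpki if home == processor else mpki * penalty
    return total


def refine(procs):
    """Greedy refinement on a two-processor system (illustrative only).

    procs: dict of name -> (MPKI, NUMA penalty, processor holding the memory).
    Starts from the maximum-local mapping and keeps moving one process off
    the more loaded processor while that lowers the larger of the two
    cache pressures.
    """
    placement = {name: home for name, (_, _, home) in procs.items()}
    while True:
        loads = [pressure(procs, placement, p) for p in (0, 1)]
        busy = 0 if loads[0] >= loads[1] else 1
        best, best_max = None, max(loads)
        for name in procs:
            if placement[name] != busy:
                continue
            trial = dict(placement)
            trial[name] = 1 - busy
            trial_max = max(pressure(procs, trial, p) for p in (0, 1))
            if trial_max < best_max:
                best, best_max = name, trial_max
        if best is None:
            return placement
        placement[best] = 1 - busy


# Made-up numbers; all memory on Processor 0 (allocation map 0000).
procs = {"A": (30, 1.5, 0), "B": (20, 1.25, 0), "C": (10, 1.25, 0), "D": (4, 1.0, 0)}
print(refine(procs))
```

In this made-up example the process with the highest NUMA penalty (A) stays next to its memory, while processes with a lower penalty are moved to relieve the cache of Processor 0.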
![Page 21: Memory Management in NUMA Multicore Systems: Trapped between Cache Contention and Interconnect Overhead Zoltan Majo and Thomas R. Gross Department of Computer](https://reader034.vdocuments.us/reader034/viewer/2022051400/551c288c550346a34f8b5f29/html5/thumbnails/21.jpg)
Performance factors
Two factors cause performance degradation:
1. NUMA penalty: slowdown due to remote memory access
2. cache pressure
– local processes: misses / KINST (MPKI)
– remote processes: MPKI x NUMA penalty
[Figure: NUMA penalty (y-axis, 1.0 to 1.5) of the SPEC programs (x-axis, programs 1 to 28).]
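The cache-pressure definition above can be written as a small function. This is a sketch with made-up numbers; the name `cache_pressure` and its parameters are illustrative.

```python
def cache_pressure(mpki, numa_penalty, is_remote):
    """Cache pressure of one process as defined on the slide:
    MPKI if the process runs next to its memory (local),
    MPKI x NUMA penalty if it runs on the other processor (remote)."""
    return mpki * numa_penalty if is_remote else mpki


# Hypothetical process with 20 misses per 1000 instructions and a NUMA penalty of 1.25.
print(cache_pressure(20, 1.25, is_remote=False))  # local execution
print(cache_pressure(20, 1.25, is_remote=True))   # remote execution
```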
![Page 22: Memory Management in NUMA Multicore Systems: Trapped between Cache Contention and Interconnect Overhead Zoltan Majo and Thomas R. Gross Department of Computer](https://reader034.vdocuments.us/reader034/viewer/2022051400/551c288c550346a34f8b5f29/html5/thumbnails/22.jpg)
Implementation
• User-mode extension to the Linux scheduler
• Performance metrics
– hardware performance counter feedback
– NUMA penalty
  • perfect information from program traces
  • estimate based on MPKI
• All memory for a process is allocated on one processor
![Page 23: Memory Management in NUMA Multicore Systems: Trapped between Cache Contention and Interconnect Overhead Zoltan Majo and Thomas R. Gross Department of Computer](https://reader034.vdocuments.us/reader034/viewer/2022051400/551c288c550346a34f8b5f29/html5/thumbnails/23.jpg)
Outline
• NUMA: experimental evaluation
• Scheduling
– N-MASS
– N-MASS evaluation
![Page 24: Memory Management in NUMA Multicore Systems: Trapped between Cache Contention and Interconnect Overhead Zoltan Majo and Thomas R. Gross Department of Computer](https://reader034.vdocuments.us/reader034/viewer/2022051400/551c288c550346a34f8b5f29/html5/thumbnails/24.jpg)
Workloads
• SPEC CPU2006 subset
• 11 multi-program workloads (WL1 to WL11)
– 4-program workloads (WL1 to WL9)
– 8-program workloads (WL10, WL11)
[Figure: NUMA penalty (y-axis, 0.9 to 1.5) versus MPKI (x-axis, logarithmic, 0.00001 to 100, from CPU-bound to memory-bound), distinguishing used programs from not used programs.]
![Page 25: Memory Management in NUMA Multicore Systems: Trapped between Cache Contention and Interconnect Overhead Zoltan Majo and Thomas R. Gross Department of Computer](https://reader034.vdocuments.us/reader034/viewer/2022051400/551c288c550346a34f8b5f29/html5/thumbnails/25.jpg)
Memory allocation setup
• Where the memory of each process is allocated influences performance
• Controlled setup: memory allocation maps
![Page 26: Memory Management in NUMA Multicore Systems: Trapped between Cache Contention and Interconnect Overhead Zoltan Majo and Thomas R. Gross Department of Computer](https://reader034.vdocuments.us/reader034/viewer/2022051400/551c288c550346a34f8b5f29/html5/thumbnails/26.jpg)
Memory allocation maps
[Figure: processes A, B, C, D and their memory MA, MB, MC, MD on Processor 0 and Processor 1. Allocation map: 0000.]
![Page 27: Memory Management in NUMA Multicore Systems: Trapped between Cache Contention and Interconnect Overhead Zoltan Majo and Thomas R. Gross Department of Computer](https://reader034.vdocuments.us/reader034/viewer/2022051400/551c288c550346a34f8b5f29/html5/thumbnails/27.jpg)
Memory allocation maps
[Figure: processes A, B, C, D with the placement of their memory MA, MB, MC, MD on Processor 0 and Processor 1 for two allocation maps: 0000 and 0011.]
![Page 28: Memory Management in NUMA Multicore Systems: Trapped between Cache Contention and Interconnect Overhead Zoltan Majo and Thomas R. Gross Department of Computer](https://reader034.vdocuments.us/reader034/viewer/2022051400/551c288c550346a34f8b5f29/html5/thumbnails/28.jpg)
Memory allocation maps
[Figure: processes A, B, C, D with the placement of their memory MA, MB, MC, MD on Processor 0 and Processor 1 for two allocation maps: 0000 (unbalanced) and 0011 (balanced).]
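A sketch of how such an allocation map could be decoded. It assumes that the i-th digit names the processor that holds the memory of the i-th process (A, B, C, D), and that a map counts as balanced when both processors hold the memory of equally many processes. The slides only show 0000 as unbalanced and 0011 as balanced, so both rules and the function names are assumptions.

```python
def memory_homes(allocation_map, processes="ABCD"):
    """Decode an allocation map such as "0011" into process -> processor id."""
    return {proc: int(digit) for proc, digit in zip(processes, allocation_map)}


def is_balanced(allocation_map):
    """Assumed rule: balanced if both processors hold equally many memories."""
    return allocation_map.count("0") == allocation_map.count("1")


for amap in ("0000", "0011"):
    print(amap, memory_homes(amap), "balanced" if is_balanced(amap) else "unbalanced")
```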
![Page 29: Memory Management in NUMA Multicore Systems: Trapped between Cache Contention and Interconnect Overhead Zoltan Majo and Thomas R. Gross Department of Computer](https://reader034.vdocuments.us/reader034/viewer/2022051400/551c288c550346a34f8b5f29/html5/thumbnails/29.jpg)
Evaluation
• Baseline: Linux average
– the Linux scheduler is non-deterministic
– average performance degradation in all possible cases
• N-MASS with perfect NUMA penalty information
![Page 30: Memory Management in NUMA Multicore Systems: Trapped between Cache Contention and Interconnect Overhead Zoltan Majo and Thomas R. Gross Department of Computer](https://reader034.vdocuments.us/reader034/viewer/2022051400/551c288c550346a34f8b5f29/html5/thumbnails/30.jpg)
WL9: Linux average
[Figure: average slowdown relative to single-program mode (y-axis, 1.0 to 1.6) per allocation map (0000, 1000, 0100, 0010, 0001, 1100, 1010, 1001) for the Linux average.]
![Page 31: Memory Management in NUMA Multicore Systems: Trapped between Cache Contention and Interconnect Overhead Zoltan Majo and Thomas R. Gross Department of Computer](https://reader034.vdocuments.us/reader034/viewer/2022051400/551c288c550346a34f8b5f29/html5/thumbnails/31.jpg)
WL9: N-MASS
[Figure: average slowdown relative to single-program mode (y-axis, 1.0 to 1.6) per allocation map (0000, 1000, 0100, 0010, 0001, 1100, 1010, 1001) for the Linux average and N-MASS.]
![Page 32: Memory Management in NUMA Multicore Systems: Trapped between Cache Contention and Interconnect Overhead Zoltan Majo and Thomas R. Gross Department of Computer](https://reader034.vdocuments.us/reader034/viewer/2022051400/551c288c550346a34f8b5f29/html5/thumbnails/32.jpg)
WL1: Linux average and N-MASS
[Figure: average slowdown relative to single-program mode (y-axis, 1.0 to 1.6) per allocation map (0000, 1000, 0100, 0010, 0001, 1100, 1010, 1001) for the Linux average and N-MASS.]
![Page 33: Memory Management in NUMA Multicore Systems: Trapped between Cache Contention and Interconnect Overhead Zoltan Majo and Thomas R. Gross Department of Computer](https://reader034.vdocuments.us/reader034/viewer/2022051400/551c288c550346a34f8b5f29/html5/thumbnails/33.jpg)
N-MASS performance
• N-MASS reduces performance degradation by up to 22%
• Which factor is more important: interconnect overhead or cache contention?
• Compare:
– maximum-local
– N-MASS (maximum-local + cache refinement step)
![Page 34: Memory Management in NUMA Multicore Systems: Trapped between Cache Contention and Interconnect Overhead Zoltan Majo and Thomas R. Gross Department of Computer](https://reader034.vdocuments.us/reader034/viewer/2022051400/551c288c550346a34f8b5f29/html5/thumbnails/34.jpg)
Data-locality vs. cache balancing (WL9)
[Figure: performance improvement relative to Linux average (y-axis, -10% to 25%) per allocation map (0000, 1000, 0100, 0010, 0001, 1100, 1010, 1001), comparing maximum-local with N-MASS (maximum-local + cache refinement step).]
![Page 35: Memory Management in NUMA Multicore Systems: Trapped between Cache Contention and Interconnect Overhead Zoltan Majo and Thomas R. Gross Department of Computer](https://reader034.vdocuments.us/reader034/viewer/2022051400/551c288c550346a34f8b5f29/html5/thumbnails/35.jpg)
Data-locality vs. cache balancing (WL1)
[Figure: performance improvement relative to Linux average (y-axis, -10% to 25%) per allocation map (0000, 1000, 0100, 0010, 0001, 1100, 1010, 1001), comparing maximum-local with N-MASS (maximum-local + cache refinement step).]
![Page 36: Memory Management in NUMA Multicore Systems: Trapped between Cache Contention and Interconnect Overhead Zoltan Majo and Thomas R. Gross Department of Computer](https://reader034.vdocuments.us/reader034/viewer/2022051400/551c288c550346a34f8b5f29/html5/thumbnails/36.jpg)
Data locality vs. cache balancing
• Data locality is more important than cache balancing
• Cache balancing gives performance benefits mostly with unbalanced allocation maps
• What if information about the NUMA penalty is not available?
![Page 37: Memory Management in NUMA Multicore Systems: Trapped between Cache Contention and Interconnect Overhead Zoltan Majo and Thomas R. Gross Department of Computer](https://reader034.vdocuments.us/reader034/viewer/2022051400/551c288c550346a34f8b5f29/html5/thumbnails/37.jpg)
Estimating NUMA penalty
• NUMA penalty is not directly measurable
• Estimate: fit linear regression onto MPKI data
[Figure: NUMA penalty (y-axis, 1.0 to 1.5) versus MPKI (x-axis, 0 to 50).]
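A minimal sketch of the regression-based estimate, using ordinary least squares on (MPKI, NUMA penalty) pairs. The data points below are made up for illustration and are not measurements from the talk; the function names are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def estimate_numa_penalty(mpki, slope, intercept):
    """Estimate the NUMA penalty of a program from its measured MPKI."""
    return intercept + slope * mpki


# Made-up training data: (MPKI, NUMA penalty) of programs with known penalty.
mpki_samples = [0, 10, 20, 30, 40]
penalty_samples = [1.0, 1.1, 1.2, 1.3, 1.4]

slope, intercept = fit_line(mpki_samples, penalty_samples)
print(round(estimate_numa_penalty(25, slope, intercept), 3))
```

At run time only MPKI has to be read from the hardware performance counters; the fitted line then stands in for the penalty that cannot be measured directly.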
![Page 38: Memory Management in NUMA Multicore Systems: Trapped between Cache Contention and Interconnect Overhead Zoltan Majo and Thomas R. Gross Department of Computer](https://reader034.vdocuments.us/reader034/viewer/2022051400/551c288c550346a34f8b5f29/html5/thumbnails/38.jpg)
Estimate-based N-MASS: performance
[Figure: performance improvement relative to Linux average (y-axis, -2% to 8%) for workloads WL1 to WL11, comparing maximum-local, N-MASS, and estimate-based N-MASS.]
![Page 39: Memory Management in NUMA Multicore Systems: Trapped between Cache Contention and Interconnect Overhead Zoltan Majo and Thomas R. Gross Department of Computer](https://reader034.vdocuments.us/reader034/viewer/2022051400/551c288c550346a34f8b5f29/html5/thumbnails/39.jpg)
Conclusions
• N-MASS: NUMA-multicore-aware scheduler
• Data locality optimizations are more beneficial than cache contention avoidance
• Better performance metrics are needed for scheduling
![Page 40: Memory Management in NUMA Multicore Systems: Trapped between Cache Contention and Interconnect Overhead Zoltan Majo and Thomas R. Gross Department of Computer](https://reader034.vdocuments.us/reader034/viewer/2022051400/551c288c550346a34f8b5f29/html5/thumbnails/40.jpg)
Thank you! Questions?