Memory System Performance in a NUMA Multicore Multiprocessor
Zoltan Majo and Thomas R. Gross
Department of Computer Science, ETH Zurich
Summary
• NUMA multicore systems are unfair to local memory accesses
• Local execution sometimes suboptimal
Outline
• NUMA multicores: how it happened
• Experimental evaluation: Intel Nehalem
• Bandwidth sharing model
• The next generation: Intel Westmere
NUMA multicores: how it happened
First generation: SMP
[Diagram: cores 0-7, bus controllers (BusC), Northbridge with memory controller (MC), DRAM memory]
NUMA multicores: how it happened
Next generation: NUMA
[Diagram: cores 0-3 and 4-7 with bus controllers (BusC), Northbridge, memory controllers (MC), interconnects (IC), and DRAM memory]
NUMA multicores: how it happened
Next generation: NUMA
[Diagram: cores 0-3 and 4-7, each group with its own memory controller (MC), interconnect (IC), and DRAM memory]
Bandwidth sharing
• Frequent scenario: bandwidth shared between cores
• Sharing model for the Intel Nehalem
[Diagram: cores 0-3 and 4-7, each group with its own MC, IC, and DRAM memory]
Outline
• NUMA multicores: how it happened
• Experimental evaluation: Intel Nehalem
• Bandwidth sharing model
• The next generation: Intel Westmere
Evaluation system
Intel Nehalem E5520
2 x 4 cores
8 MB level 3 cache
12 GB DDR3 RAM
5.86 GT/s QPI
[Diagram: Processor 0 and Processor 1 (cores 0-7), each with a Level 3 cache, Global Queue, memory controller (MC), QPI link, and DRAM memory]
Bandwidth sharing: local accesses
[Diagram: evaluation system (Processor 0, Processor 1, Level 3 caches, Global Queues, MC, QPI, DRAM memory) with the cores issuing local accesses highlighted]
Bandwidth sharing: remote accesses
[Diagram: evaluation system (Processor 0, Processor 1, Level 3 caches, Global Queues, MC, QPI, DRAM memory) with the cores issuing remote accesses highlighted]
Bandwidth sharing: combined accesses
[Diagram: evaluation system (Processor 0, Processor 1, Level 3 caches, Global Queues, MC, QPI, DRAM memory) with the cores issuing local and remote accesses highlighted]
Global Queue
• Mechanism to arbitrate between different types of memory accesses
• We look at fairness of the Global Queue:
– local memory accesses
– remote memory accesses
– combined memory accesses
Benchmark program
• STREAM triad
    for (i = 0; i < SIZE; i++) {
        a[i] = b[i] + SCALAR * c[i];
    }
• Multiple co-executing triad clones
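A minimal, self-contained sketch of the triad kernel and of how a pass over the arrays could be converted into a bandwidth figure. The element type `double`, the helper names `triad_bytes` and `bandwidth_gbs`, and the convention of counting three array accesses per element are assumptions made for illustration; the slides do not state how bandwidth was computed.

```c
#include <stddef.h>

/* STREAM triad kernel: a[i] = b[i] + scalar * c[i] for i in [0, n). */
void triad(double *a, const double *b, const double *c, double scalar, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        a[i] = b[i] + scalar * c[i];
    }
}

/* Bytes moved by one triad pass, counting two reads (b, c) and one
 * write (a) per element. Assumption: write-allocate traffic is ignored. */
size_t triad_bytes(size_t n)
{
    return 3 * n * sizeof(double);
}

/* Bandwidth in GB/s (10^9 bytes per second) for a pass that moved
 * `bytes` bytes in `seconds` seconds. Hypothetical helper. */
double bandwidth_gbs(size_t bytes, double seconds)
{
    return (double)bytes / seconds / 1e9;
}
```

To approximate a clone, one would time repeated calls to `triad` on arrays much larger than the level 3 cache and pass the elapsed time to `bandwidth_gbs`.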
Multi-clone experiments
• All memory allocated on Processor 0
• Local clones run on Processor 0; remote clones run on Processor 1
• Example benchmark configurations: (2L, 0R), (0L, 3R), (2L, 3R)
GQ fairness: local accesses
[Plot: total bandwidth in GB/s]
[Diagram: evaluation system (Processor 0, Processor 1, caches, GQ, IMC, QPI, DRAM) with clones (C) placed on cores]
GQ fairness: remote accesses
[Plot: total bandwidth in GB/s]
[Diagram: evaluation system (Processor 0, Processor 1, caches, GQ, IMC, QPI, DRAM) with clones (C) placed on cores]
Global Queue fairness
• Global Queue is fair when there are only local or only remote accesses in the system
• What about combined accesses?
GQ fairness: combined accesses
Execute clones in all possible configurations
[Grid of configurations: # local clones (0-4) by # remote clones (0-4); highlighted example: (2L, 3R)]
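The configuration grid can be enumerated mechanically. The sketch below lists every (xL, yR) pair with 0 to 4 local and 0 to 4 remote clones; the type `clone_config`, the function `enumerate_configs`, and the decision to skip the empty (0L, 0R) case are assumptions for illustration, not part of the original experiment scripts.

```c
#include <stddef.h>

/* One clone per core, four cores per processor on the Nehalem system. */
#define MAX_CLONES_PER_PROCESSOR 4

/* Hypothetical record for one benchmark configuration (xL, yR). */
struct clone_config {
    int local;   /* clones on the processor that owns the memory */
    int remote;  /* clones on the other processor */
};

/* Fills out[] with every (local, remote) pair in the 5 x 5 grid except
 * the empty (0L, 0R) case and returns the number of entries written.
 * The caller must provide room for at least 24 entries. */
size_t enumerate_configs(struct clone_config *out)
{
    size_t count = 0;
    for (int remote = 0; remote <= MAX_CLONES_PER_PROCESSOR; remote++) {
        for (int local = 0; local <= MAX_CLONES_PER_PROCESSOR; local++) {
            if (local == 0 && remote == 0) {
                continue;  /* nothing to run (assumption: skipped) */
            }
            out[count].local = local;
            out[count].remote = remote;
            count++;
        }
    }
    return count;
}
```

Each returned entry would then correspond to one run of the co-executing triad clones, with total bandwidth recorded per configuration.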
GQ fairness: combined accesses
[Plot: total bandwidth in GB/s]
Combined accesses
[Plot: total bandwidth in GB/s]
Combined accesses
• In configuration (4L, 1R) remote clone gets 30% more bandwidth than a local clone
• Remote execution can be better than local
Outline
• NUMA multicores: how it happened
• Experimental evaluation: Intel Nehalem
• Bandwidth sharing model
• The next generation: Intel Westmere
Bandwidth sharing model
$$\text{bandwidth}_{\text{total}} = \beta \cdot \text{bandwidth}_{\text{local}} + (1 - \beta) \cdot \text{bandwidth}_{\text{remote}}$$
[Diagram: evaluation system (Level 3 caches, Global Queues, IMC, QPI, DRAM memory) with clones (C)]
Sharing factor (β)
• Characterizes the fairness of the Global Queue
• Dependence of sharing factor on contention?
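One way to read the sharing model is as a weighted combination of local and remote bandwidth, with the sharing factor β as the weight of the local term. The sketch below evaluates that reading and inverts it to estimate β from three measured bandwidths. The exact form of the formula, the symbol β, and both function names are assumptions, since the formula on the slide is only partly recoverable.

```c
/* Assumed model: bw_total = beta * bw_local + (1 - beta) * bw_remote,
 * where beta in [0, 1] weights the local term. */
double model_total_bandwidth(double beta, double bw_local, double bw_remote)
{
    return beta * bw_local + (1.0 - beta) * bw_remote;
}

/* Solves the assumed model for beta given measured bandwidths.
 * Returns -1.0 when bw_local equals bw_remote, because beta cannot be
 * determined from the model in that case. */
double estimate_sharing_factor(double bw_total, double bw_local, double bw_remote)
{
    double span = bw_local - bw_remote;
    if (span == 0.0) {
        return -1.0;
    }
    return (bw_total - bw_remote) / span;
}
```

Under this reading, a β close to 1 means the total is dominated by the local term, and a smaller β shifts the weight toward the remote term.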
Contention affects sharing factor
[Diagram: clones (C) accessing DRAM directly and over QPI, with additional contender clones]
Contention affects sharing factor
[Plot: sharing factor (β)]
Combined accesses
[Plot: total bandwidth in GB/s]
Contention affects sharing factor
• Sharing factor decreases with contention
• With local contention remote execution becomes more favorable
Outline
• NUMA multicores: how it happened
• Experimental evaluation: Intel Nehalem
• Bandwidth sharing model
• The next generation: Intel Westmere
The next generation
Intel Westmere X5680
2 x 6 cores
12 MB level 3 cache
144 GB DDR3 RAM
6.4 GT/s QPI
[Diagram: two six-core processors (cores 0-B), each with a Level 3 cache, Global Queue, IMC, QPI link, and DRAM memory]
The next generation
[Plot: total bandwidth in GB/s]
Conclusions
• Optimizing for data locality can be suboptimal
• Applications:
– OS scheduling (see ISMM’11 paper)
– data placement and computation scheduling
Thank you! Questions?