TRANSCRIPT
Performance Optimizations for NUMA-Multicore Systems
Zoltán Majó
Department of Computer Science, ETH Zurich, Switzerland
2
About me
ETH Zurich: research assistant (research: performance optimizations; assistant: lectures)
TUCN: network engineer at the Student Communications Center; assistant at the Department of Computer Science
3
Computing
Unlimited need for performance
4
Performance optimizations
One goal: make programs run fast
Idea: pick a good algorithm
Reduce the number of operations executed
Example: sorting
5
Sorting
[Chart: number of operations vs. input size (n) for n^2 and n*log(n); x-axis: input size (n), y-axis: execution time [T]]
6
Sorting
[Chart: number of operations vs. input size (n); the n^2 curve corresponds to bubble sort]
7
Sorting
[Chart: number of operations vs. input size (n); n^2 = bubble sort, n*log(n) = quicksort]
8
Sorting
[Chart: number of operations vs. input size (n); at the largest input size shown, quicksort (n*log(n)) is 11X faster than bubble sort (n^2)]
9
Sorting
We picked a good algorithm, so the work is done
Are we really done?
Make sure our algorithm runs fast
Operations take time
We assumed 1 operation = 1 time unit T
10
Quicksort performance
[Chart: execution time [T] vs. input size (n) for quicksort with 1 op = 1 T]
11
Quicksort performance
[Chart: execution time [T] vs. input size (n) for quicksort with 1 op = 1 T and 1 op = 2 T]
12
Quicksort performance
[Chart: execution time [T] vs. input size (n) for quicksort with 1 op = 1 T, 2 T, and 4 T]
13
Quicksort performance
[Chart: execution time [T] vs. input size (n) for quicksort with 1 op = 1 T, 2 T, 4 T, and 8 T]
14
Quicksort performance
[Chart: execution time [T] vs. input size (n); bubble sort with 1 op = 1 T is 32% faster than quicksort with 1 op = 8 T]
15
Latency of operations
The best algorithm is not enough
Operations are executed on hardware
[Diagram: CPU pipeline. Stage 1: dispatch operation; Stage 2: execute operation; Stage 3: retire operation]
16
Latency of operations
The best algorithm is not enough
Operations are executed on hardware
Hardware must be used efficiently
[Diagram: CPU pipeline. Stage 1: dispatch operation; Stage 2: execute operation; Stage 3: retire operation]
17
Outline
Introduction: performance optimizations
Cache-aware programming
Scheduling on multicore processors
Using run-time feedback
Data locality optimizations on NUMA-multicores
Conclusions
ETH scholarship
18
Memory accesses
[Diagram: CPU connected to RAM; 230 cycles access latency]
19
Memory accesses
[Diagram: CPU connected to RAM; 230 cycles access latency]
Total access latency = ?
Total access latency = 16 x 230 cycles = 3680 cycles
20
Caching
[Diagram: CPU connected to RAM; 230 cycles access latency]
21
Caching
[Diagram: cache between CPU and RAM; cache access 30 cycles, RAM access 200 cycles; data moves between RAM and cache in blocks]
22
Caching
[Diagram: cache between CPU and RAM; cache access 30 cycles, RAM access 200 cycles; data moves between RAM and cache in blocks]
23
Caching
[Diagram: cache between CPU and RAM; cache access 30 cycles, RAM access 200 cycles; data moves between RAM and cache in blocks]
24
Caching
[Diagram: cache between CPU and RAM; cache access 30 cycles, RAM access 200 cycles; data moves between RAM and cache in blocks]
25
Caching
[Diagram: cache between CPU and RAM; cache access 30 cycles, RAM access 200 cycles; data moves between RAM and cache in blocks]
26
Hits and misses
[Diagram: cache between CPU and RAM; cache access 30 cycles, RAM access 200 cycles]
Cache miss: data not in cache = 230 cycles
Cache hit: data in cache = 30 cycles
27
Total access latency
[Diagram: cache between CPU and RAM; cache access 30 cycles, RAM access 200 cycles]
Total access latency = ?
Total access latency = 4 misses + 12 hits = 4 x 230 cycles + 12 x 30 cycles = 1280 cycles
28
Benefits of caching
Comparison:
Architecture w/o cache: T = 230 cycles
Architecture w/ cache: Tavg = 80 cycles → 2.9X improvement
Do caches always help?
Can you think of an access pattern with bad cache usage?
29
Caching
[Diagram: cache between CPU and RAM; cache access 30 cycles, RAM access 200 cycles; an access pattern that strides across blocks misses on every access]
30
Cache-aware programming
Today's example: matrix-matrix multiplication (MMM)
Number of operations: n^3
Compare naïve and optimized implementations with the same number of operations
31
MMM: naïve implementation

for (i=0; i<N; i++)
  for (j=0; j<N; j++) {
    sum = 0.0;
    for (k=0; k<N; k++)
      sum += A[i][k]*B[k][j];
    C[i][j] = sum;
  }

[Diagram: C = A x B; element C[i][j] combines row i of A with column j of B]
32
MMM
[Diagram: cache between CPU and RAM; cache access 30 cycles, RAM access 200 cycles; C, A, B in RAM]
[Table: cache hits out of 4 total accesses for A[][] and B[][], filled in during the animation]
33
MMM
[Diagram: a block of A[][] is loaded into the cache; subsequent accesses to it hit]
[Table: A[][]: 3 hits out of 4 accesses so far]
34
MMM
[Diagram: accesses to the cached block of A[][] hit]
[Table: A[][]: 3 hits out of 4 accesses so far]
35
MMM
[Diagram: accesses to the cached block of A[][] hit]
[Table: A[][]: 3 hits out of 4 accesses so far]
36
MMM
[Diagram: blocks of A[][] and B[][] loaded into the cache]
[Table: A[][]: 3/4 cache hits; B[][]: 0/4 cache hits]
37
MMM: cache performance
Hit rate:
Accesses to A[][]: 3/4 = 75%
Accesses to B[][]: 0/4 = 0%
All accesses: 38%
Can we do better?
38
Cache-friendly MMM

Cache-unfriendly MMM (ijk):

for (i=0; i<N; i++)
  for (j=0; j<N; j++) {
    sum = 0.0;
    for (k=0; k<N; k++)
      sum += A[i][k]*B[k][j];
    C[i][j] = sum;
  }

Cache-friendly MMM (ikj):

for (i=0; i<N; i++)
  for (k=0; k<N; k++) {
    r = A[i][k];
    for (j=0; j<N; j++)
      C[i][j] += r*B[k][j];
  }

[Diagram: C = A x B; in ikj, row i of C is updated using A[i][k] and row k of B]
39
MMM
[Diagram: cache between CPU and RAM; blocks of B[][] and C[][] loaded into the cache]
[Table: C[][]: 3/4 cache hits; B[][]: 3/4 cache hits]
40
Cache-friendly MMM
Cache-unfriendly MMM (ijk)
A[][]: 3/4 = 75% hit rate
B[][]: 0/4 = 0% hit rate
All accesses: 38% hit rate
Cache-friendly MMM (ikj)
C[][]: 3/4 = 75% hit rate
B[][]: 3/4 = 75% hit rate
All accesses: 75% hit rate
Better performance due to cache-friendliness?
41
Performance of MMM
[Chart: execution time [s] (log scale) vs. matrix size (512 to 8192) for ijk (cache-unfriendly) and ikj (cache-friendly)]
42
Performance of MMM
[Chart: execution time [s] (log scale) vs. matrix size; ikj is up to 20X faster than ijk]
43
Cache-aware programming
Two versions of MMM: ijk and ikj
Same number of operations (~n^3); ikj performs 20X better than ijk
Good performance depends on two aspects: a good algorithm and an implementation that takes hardware into account
Hardware offers many possibilities for inefficiencies; we consider only the memory system in this lecture
44
Outline
Introduction: performance optimizations
Cache-aware programming
Scheduling on multicore processors
Using run-time feedback
Data locality optimizations on NUMA-multicores
Conclusions
ETH scholarship
45
Cache-based architecture
[Diagram: CPU core with L1 cache (10 cycles access latency) and L2 cache (20 cycles access latency), connected through a bus controller and memory controller to RAM (200 cycles access latency)]
46
Multi-core multiprocessor
[Diagram: two processor packages, each with four cores; each core has a private L1 cache, pairs of cores share an L2 cache; each package has a bus controller and memory controller connected to RAM]
47
Experiment
Performance of a well-optimized program: soplex from SPEC CPU 2006
Multicore-multiprocessor systems are parallel: multiple programs run on the system simultaneously
Contender program: milc from SPEC CPU 2006
Examine 4 execution scenarios
48
Execution scenarios
[Diagram: two processors (Processor 0 and Processor 1), each with four cores, per-core L1 caches, shared L2 caches, bus and memory controllers, and RAM; soplex and milc are placed on cores]
49
Execution scenarios
[Diagram: the same two-processor system; soplex and milc placed on different core combinations]
50
Performance with sharing: soplex
[Chart: soplex execution time relative to solo execution (scale 0.0 to 2.0)]
51
Performance with sharing: soplex
[Chart: soplex execution time relative to solo execution (scale 0.0 to 2.0)]
52
Performance with sharing: soplex
[Chart: soplex execution time relative to solo execution in each of the 4 scenarios]
53
Resource sharing
Significant slowdowns due to resource sharing
Why is resource sharing so bad?
Example: cache sharing
54
Cache sharing
[Diagram: soplex and milc run on two cores that share a cache]
55
Cache sharing
[Diagram: soplex and milc compete for space in the shared cache]
56
Resource sharing
Does resource sharing affect all programs?
So far we considered the performance of soplex under contention
Let us consider a different program: namd
57
Performance with sharing
[Chart: execution time of soplex and namd relative to solo execution (scale 0.0 to 2.0)]
58
Performance with sharing
[Chart: execution time of soplex and namd relative to solo execution; namd slows down much less than soplex]
59
Resource sharing
Significant slowdown for some programs: soplex is affected significantly, namd is affected less
What do we do about it?
Scheduling can help
Example workload: 4 instances of soplex and 4 instances of namd
60
Execution scenarios
[Diagram: two-processor system; all four soplex instances on one processor and all four namd instances on the other]
61
Execution scenarios
[Diagram: two-processor system; soplex and namd instances mixed, two of each on every processor]
62
Challenges for a scheduler
Programs have different behaviors (soplex vs. namd)
Behavior not known ahead-of-time
Behavior changes over time
63
Single-phased program
64
Program with multiple phases
65
Outline
Introduction: performance optimizations
Cache-aware programming
Scheduling on multicore processors
Using run-time feedback
Data locality optimizations on NUMA-multicores
Conclusions
ETH scholarship
66
Hardware performance counters
Special registers:
Programmable to monitor a given hardware event (e.g., cache misses)
Low-level information about hardware-software interaction
Low overhead due to hardware implementation
In the past: undocumented feature
Since Intel Pentium: publicly available description
Debugging tools: Intel VTune, Intel PTU, AMD CodeAnalyst
67
Programming performance counters
Model-specific registers:
Access: RDMSR, WRMSR, and RDPMC instructions
Ring 0 instructions (available only in kernel mode)
perf_events interface:
Standard Linux interface since Linux 2.6.31
UNIX philosophy: performance counters are files
Simple API:
Set up counters: perf_event_open()
Read counters as files
68
Example: monitoring cache misses

int main() {
  int pid = fork();
  if (pid == 0) {
    exit(execl("./my_program", "./my_program", NULL));
  } else {
    int status; uint64_t value;
    int fd = perf_event_open(...);
    waitpid(pid, &status, 0);
    read(fd, &value, sizeof(uint64_t));
    printf("Cache misses: %" PRIu64 "\n", value);
  }
}
69
perf_event_open()
Looks simple:

int sys_perf_event_open(
  struct perf_event_attr *hw_event_uptr,
  pid_t pid,
  int cpu,
  int group_fd,
  unsigned long flags
);

struct perf_event_attr {
  __u32 type;
  __u32 size;
  __u64 config;
  union {
    __u64 sample_period;
    __u64 sample_freq;
  };
  __u64 sample_type;
  __u64 read_format;
  __u64 inherit;
  __u64 pinned;
  __u64 exclusive;
  __u64 exclude_user;
  __u64 exclude_kernel;
  __u64 exclude_hv;
  __u64 exclude_idle;
  __u64 mmap;
  ...
70
libpfm
Open-source helper library between the user program and perf_events:
(1) user program passes an event name to libpfm
(2) libpfm sets up perf_event_attr
(3) user program calls perf_event_open()
(4) user program reads the results
71
Example: measure cache misses for MMM
Determine microarchitecture
  Intel Xeon E5520: Nehalem microarchitecture
Look up event needed
  Source: Intel Architectures Software Developer's Manual
72
Software Developer’s Manual
73
Example: measure cache misses for MMM
Determine microarchitecture
  Intel Xeon E5520: Nehalem microarchitecture
Look up event needed
  Source: Intel Architectures Software Developer's Manual
  Event name: OFFCORE_RESPONSE_0:ANY_REQUEST:ANY_DRAM
74
MMM cache misses
[Chart: number of cache misses (log scale) vs. matrix size (512 to 8192) for ijk (cache-unfriendly) and ikj (cache-friendly); ijk incurs up to 30X more cache misses than ikj]
75
Single-phased program
Set up performance counters at the start; read performance counters at the end
76
Program with multiple phases
Set up performance counters, then repeatedly get samples as the program runs
77
Membus: multicore scheduler
1. Dynamically determine program behavior
   Measure the number of loads/stores that cause memory traffic
   Hardware performance counters in sampling mode
2. Determine optimal placement based on the measurements
78
Evaluation
Workload with 8 processes:
lbm, soplex, gromacs, hmmer from SPEC CPU 2006
Two instances of each program
Experimental results
79
Evaluation
[Chart: execution time relative to solo execution (0.0 to 3.0) for lbm, soplex, gromacs, hmmer, and the average, under default Linux and Membus]
80
Evaluation
[Chart: execution time relative to solo execution under default Linux and Membus]
![Page 81: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland](https://reader036.vdocuments.us/reader036/viewer/2022062408/56649e235503460f94b116f3/html5/thumbnails/81.jpg)
81
Evaluation
[Chart: execution time relative to solo execution (0.0–3.0) for lbm, soplex, gromacs, hmmer, and the average; Default Linux vs. Membus; annotated with improvements of 16% and 8%]
![Page 82: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland](https://reader036.vdocuments.us/reader036/viewer/2022062408/56649e235503460f94b116f3/html5/thumbnails/82.jpg)
82
Summary: multicore processors
Resource sharing critical for performance
Membus: a scheduler that reduces resource sharing
Question: why wasn’t Membus able to improve more?
![Page 83: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland](https://reader036.vdocuments.us/reader036/viewer/2022062408/56649e235503460f94b116f3/html5/thumbnails/83.jpg)
83
Memory controller sharing

[Diagram: two quad-core processors (Processor 0 and Processor 1); each core has a private L1 cache, pairs of cores share an L2 cache, and each processor connects through a bus controller and memory controller to RAM; two instances each of soplex and namd are spread across the cores]
![Page 84: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland](https://reader036.vdocuments.us/reader036/viewer/2022062408/56649e235503460f94b116f3/html5/thumbnails/84.jpg)
84
Non-uniform memory architecture

[Diagram: the same two quad-core processors, now with a separate RAM module attached to each]
![Page 85: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland](https://reader036.vdocuments.us/reader036/viewer/2022062408/56649e235503460f94b116f3/html5/thumbnails/85.jpg)
85
Non-uniform memory architecture

[Diagram: each processor now has an on-chip memory controller with its own local RAM; the two processors are connected by an interconnect]
![Page 86: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland](https://reader036.vdocuments.us/reader036/viewer/2022062408/56649e235503460f94b116f3/html5/thumbnails/86.jpg)
86
Outline
Introduction: performance optimizations
Cache-aware programming
Scheduling on multicore processors
Using run-time feedback
Data locality optimizations on NUMA-multicores
Conclusions
ETH scholarship
![Page 87: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland](https://reader036.vdocuments.us/reader036/viewer/2022062408/56649e235503460f94b116f3/html5/thumbnails/87.jpg)
87
Non-uniform memory architecture
[Diagram: Processor 0 (cores 0–3) and Processor 1 (cores 4–7), each with a memory controller (MC) and interconnect (IC) attached to local DRAM]
![Page 88: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland](https://reader036.vdocuments.us/reader036/viewer/2022062408/56649e235503460f94b116f3/html5/thumbnails/88.jpg)
88
Non-uniform memory architecture
Local memory accesses: bandwidth 10.1 GB/s, latency 190 cycles

[Diagram: thread T on Processor 0 accesses data held in its local DRAM]
All data based on experimental evaluation of Intel Xeon 5500 (Hackenberg [MICRO ’09], Molka [PACT ‘09])
![Page 89: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland](https://reader036.vdocuments.us/reader036/viewer/2022062408/56649e235503460f94b116f3/html5/thumbnails/89.jpg)
89
Non-uniform memory architecture
Local memory accesses: bandwidth 10.1 GB/s, latency 190 cycles

Remote memory accesses: bandwidth 6.3 GB/s, latency 310 cycles

[Diagram: thread T on Processor 0 accesses data held in Processor 1's DRAM via the interconnect]
Key to good performance: data locality
All data based on experimental evaluation of Intel Xeon 5500 (Hackenberg [MICRO ’09], Molka [PACT ‘09])
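A back-of-the-envelope model of why locality matters, using the latency numbers above (it ignores bandwidth saturation and overlap of outstanding misses):

```python
# Average memory-access latency as a function of the fraction of
# accesses that go to remote DRAM, using the Intel Xeon 5500
# numbers from the slide: 190 cycles local, 310 cycles remote.
LOCAL_LAT, REMOTE_LAT = 190, 310

def avg_latency(remote_fraction):
    """Weighted average of local and remote access latency."""
    return (1 - remote_fraction) * LOCAL_LAT + remote_fraction * REMOTE_LAT

for f in (0.0, 0.25, 0.5):
    print(f"{f:.0%} remote -> {avg_latency(f):.0f} cycles")
```

Even a 50% remote fraction already adds more than 30% to the average access latency, which matches the motivation for the data locality optimizations that follow.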
![Page 90: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland](https://reader036.vdocuments.us/reader036/viewer/2022062408/56649e235503460f94b116f3/html5/thumbnails/90.jpg)
90
Data locality in multithreaded programs
[Chart: remote memory references as a fraction of total memory references (0–60%) for the NAS Parallel Benchmarks cg.B, lu.C, ft.B, ep.C, bt.B, sp.B, is.B, and mg.C]
![Page 92: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland](https://reader036.vdocuments.us/reader036/viewer/2022062408/56649e235503460f94b116f3/html5/thumbnails/92.jpg)
92
First-touch page placement policy
[Diagram: threads T0 (Processor 0) and T1 (Processor 1); T0 touches page P0 first (R/W), so P0 is placed in Processor 0's DRAM]
![Page 93: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland](https://reader036.vdocuments.us/reader036/viewer/2022062408/56649e235503460f94b116f3/html5/thumbnails/93.jpg)
93
First-touch page placement policy
[Diagram: T1 touches page P1 first (R/W), so P1 is placed in Processor 1's DRAM; P0 remains in Processor 0's DRAM]
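The policy on this and the previous slide can be mimicked in a few lines (a sketch; `thread_to_node` is an assumed fixed thread-to-processor mapping):

```python
def first_touch(accesses, thread_to_node):
    """Place each page in the DRAM of the node whose thread touches
    it first. `accesses` is an ordered list of (thread, page) pairs;
    later touches do not move the page."""
    placement = {}
    for thread, page in accesses:
        placement.setdefault(page, thread_to_node[thread])
    return placement

# T0 runs on processor 0, T1 on processor 1.
# T0 touches P0 first and T1 touches P1 first; afterwards both
# threads access both pages.
order = [("T0", "P0"), ("T1", "P1"), ("T1", "P0"), ("T0", "P1")]
print(first_touch(order, {"T0": 0, "T1": 1}))
# {'P0': 0, 'P1': 1} -- P0 is remote for T1, P1 is remote for T0
```

The example shows the weakness discussed next: first touch fixes placement forever, even when most later accesses come from the other processor.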
![Page 94: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland](https://reader036.vdocuments.us/reader036/viewer/2022062408/56649e235503460f94b116f3/html5/thumbnails/94.jpg)
94
Automatic page placement
First-touch page placement: often a high number of remote accesses

Data address profiling: enables profile-based page placement, supported by hardware performance counters on many architectures
![Page 95: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland](https://reader036.vdocuments.us/reader036/viewer/2022062408/56649e235503460f94b116f3/html5/thumbnails/95.jpg)
95
Profile-based page placement (based on the work of Marathe et al. [JPDC 2010, PPoPP 2006])
[Diagram: the profile shows P0 accessed 1000 times by T0 and P1 accessed 3000 times by T1; each page is placed in the DRAM local to the thread that accesses it most]
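The placement decision can be sketched as follows, assuming the per-thread access counts from the profile above (a simplified model; the system of Marathe et al. works from hardware-counter samples):

```python
def profile_placement(profile, thread_to_node):
    """Place each page on the node that issued the most accesses to
    it, aggregating the per-thread counts of the profile per node."""
    placement = {}
    for page, counts in profile.items():
        per_node = {}
        for thread, n in counts.items():
            node = thread_to_node[thread]
            per_node[node] = per_node.get(node, 0) + n
        placement[page] = max(per_node, key=per_node.get)
    return placement

# Profile from the slide: P0 accessed 1000 times by T0,
# P1 accessed 3000 times by T1.
profile = {"P0": {"T0": 1000}, "P1": {"T1": 3000}}
print(profile_placement(profile, {"T0": 0, "T1": 1}))
# {'P0': 0, 'P1': 1} -- each page lands on its dominant accessor
```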
![Page 96: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland](https://reader036.vdocuments.us/reader036/viewer/2022062408/56649e235503460f94b116f3/html5/thumbnails/96.jpg)
96
Automatic page placement
Compare first-touch and profile-based page placement. Machine: 2-processor, 8-core Intel Xeon E5520. Subset of NAS PB: programs with a high fraction of remote accesses; 8 threads with fixed thread-to-core mapping
![Page 97: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland](https://reader036.vdocuments.us/reader036/viewer/2022062408/56649e235503460f94b116f3/html5/thumbnails/97.jpg)
97
Profile-based page placement
[Chart: performance improvement of profile-based placement over first-touch (0–25%) for cg.B, lu.C, bt.B, ft.B, sp.B]
![Page 99: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland](https://reader036.vdocuments.us/reader036/viewer/2022062408/56649e235503460f94b116f3/html5/thumbnails/99.jpg)
99
Profile-based page placement
Performance improvement over first-touch in some cases; no performance improvement in many cases
Why?
![Page 100: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland](https://reader036.vdocuments.us/reader036/viewer/2022062408/56649e235503460f94b116f3/html5/thumbnails/100.jpg)
100
Inter-processor data sharing
[Diagram: the profile shows P0 accessed 1000 times by T0, P1 accessed 3000 times by T1, and P2 accessed 4000 times by T0 and 5000 times by T1; P2 is inter-processor shared, so neither placement is local to all of its accesses]
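Classifying inter-processor shared pages from such a profile can be sketched as follows (the 25% minority threshold is an assumption for illustration, not a number from the talk):

```python
def classify_shared(profile, thread_to_node, threshold=0.25):
    """A page counts as inter-processor shared when more than one
    node accesses it and the minority node contributes more than
    `threshold` of the total accesses."""
    shared = set()
    for page, counts in profile.items():
        per_node = {}
        for thread, n in counts.items():
            node = thread_to_node[thread]
            per_node[node] = per_node.get(node, 0) + n
        total = sum(per_node.values())
        if len(per_node) > 1 and min(per_node.values()) / total > threshold:
            shared.add(page)
    return shared

# Profile from the slide: P2 is accessed heavily from both sides.
profile = {
    "P0": {"T0": 1000},
    "P1": {"T1": 3000},
    "P2": {"T0": 4000, "T1": 5000},
}
print(classify_shared(profile, {"T0": 0, "T1": 1}))  # {'P2'}
```

For a page like P2 no single placement helps, which is why profile-based placement alone stalls on programs with a large shared heap.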
![Page 102: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland](https://reader036.vdocuments.us/reader036/viewer/2022062408/56649e235503460f94b116f3/html5/thumbnails/102.jpg)
102
Inter-processor data sharing
[Chart: inter-processor shared heap relative to total heap (0–60%) for cg.B, lu.C, bt.B, ft.B, sp.B]
![Page 104: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland](https://reader036.vdocuments.us/reader036/viewer/2022062408/56649e235503460f94b116f3/html5/thumbnails/104.jpg)
104
Inter-processor data sharing
[Chart: inter-processor shared heap relative to total heap (0–60%, left axis) overlaid with performance improvement over first-touch (0–30%, right axis) for cg.B, lu.C, bt.B, ft.B, sp.B]
![Page 106: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland](https://reader036.vdocuments.us/reader036/viewer/2022062408/56649e235503460f94b116f3/html5/thumbnails/106.jpg)
106
Automatic page placement
Profile-based page placement is often ineffective; the reason is inter-processor data sharing

Inter-processor data sharing is a program property

We propose program transformations (no time for details now; see the results)
![Page 107: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland](https://reader036.vdocuments.us/reader036/viewer/2022062408/56649e235503460f94b116f3/html5/thumbnails/107.jpg)
107
Evaluation

[Chart: performance improvement over first-touch (0–25%) for cg.B, lu.C, bt.B, ft.B, sp.B; profile-based allocation vs. program transformations]
![Page 109: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland](https://reader036.vdocuments.us/reader036/viewer/2022062408/56649e235503460f94b116f3/html5/thumbnails/109.jpg)
109
Conclusions
Performance optimizations: good algorithm + hardware awareness; example: cache-aware matrix multiplication

Hardware awareness: resource sharing in multicore processors, data placement in non-uniform memory architectures
A lot remains to be done...
...and you can be part of it!
![Page 110: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland](https://reader036.vdocuments.us/reader036/viewer/2022062408/56649e235503460f94b116f3/html5/thumbnails/110.jpg)
110
ETH scholarship for master's students...

...to work on their master thesis in the Laboratory of Software Technology

Prof. Thomas R. Gross: PhD Stanford University, MIPS project, supervisor John L. Hennessy; Carnegie Mellon: Warp, iWarp, Fx projects

ETH offers you: a monthly scholarship of CHF 1500–1700 (EUR 1200–1400), assistance with finding housing, and a thesis topic
![Page 111: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland](https://reader036.vdocuments.us/reader036/viewer/2022062408/56649e235503460f94b116f3/html5/thumbnails/111.jpg)
111
Possible Topics
Michael Pradel: Automatic bug finding
Luca Della Toffola: Performance optimizations for Java
Me: Hardware-aware performance optimizations
![Page 112: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland](https://reader036.vdocuments.us/reader036/viewer/2022062408/56649e235503460f94b116f3/html5/thumbnails/112.jpg)
OO code positioning

[Diagram: call graph over methods A–E; in memory the methods are laid out in the order A B C D E; the cache holds A, B, C]
![Page 113: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland](https://reader036.vdocuments.us/reader036/viewer/2022062408/56649e235503460f94b116f3/html5/thumbnails/113.jpg)
Profiling the hot path

[Diagram: the same call graph over methods A–E, with the hot path identified by profiling]
![Page 114: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland](https://reader036.vdocuments.us/reader036/viewer/2022062408/56649e235503460f94b116f3/html5/thumbnails/114.jpg)
[Diagram: with memory layout A B C D E, the cache holds A, B, C; a call along the hot path misses in the cache]
![Page 115: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland](https://reader036.vdocuments.us/reader036/viewer/2022062408/56649e235503460f94b116f3/html5/thumbnails/115.jpg)
[Diagram: after reordering memory to A B E D C, the cache holds A, B, E and the hot-path call hits]

• JVM
• No profiling
• Constructors
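A minimal sketch of the positioning idea: lay the hot-path methods out contiguously so they share cache lines (method names from the slides; sizes and addresses abstracted away):

```python
def layout(methods, hot_path):
    """Emit a memory order that places the hot-path methods first
    and contiguously; the remaining (cold) methods follow in their
    original order."""
    hot = [m for m in hot_path if m in methods]
    cold = [m for m in methods if m not in hot_path]
    return hot + cold

# Methods A-E with profiled hot path A -> B -> E:
methods = ["A", "B", "C", "D", "E"]
print(layout(methods, hot_path=["A", "B", "E"]))
# ['A', 'B', 'E', 'C', 'D'] -- A, B, E are now adjacent in memory
```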
![Page 116: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland](https://reader036.vdocuments.us/reader036/viewer/2022062408/56649e235503460f94b116f3/html5/thumbnails/116.jpg)
• Linked list traversal
• Looking for the youngest/oldest person

[Diagram: a linked list of Person objects, each with fields next, name, surname, and age, terminated by null]
![Page 117: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland](https://reader036.vdocuments.us/reader036/viewer/2022062408/56649e235503460f94b116f3/html5/thumbnails/117.jpg)
Cache

[Diagram: cache lines filled with complete Person objects (next, name, surname, age)]
![Page 119: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland](https://reader036.vdocuments.us/reader036/viewer/2022062408/56649e235503460f94b116f3/html5/thumbnails/119.jpg)
Cache

[Diagram: cache lines now hold only the hot fields (next, age), so more list nodes fit in the cache]
![Page 120: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland](https://reader036.vdocuments.us/reader036/viewer/2022062408/56649e235503460f94b116f3/html5/thumbnails/120.jpg)
[Diagram: class A with fields a1–a5 and access counts a1: 10, a2: 100, a3: 1000, a4: 30, a5: 2000; profiling identifies a3 and a5 as hot fields; splitting yields class A (a3, a5) and class A$Cold (a1, a2, a4)]

Profiling → Splitting (based on # of field accesses)

• Jikes RVM
• Splitting strategies
• Garbage collection optimizations
• Allocation optimizations
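The payoff of such splitting can be estimated by counting objects per cache line (the 8-byte field sizes and 64-byte line are assumptions for illustration; the slides do not give them):

```python
LINE = 64  # bytes per cache line (typical on x86)

def objects_per_line(field_sizes):
    """How many objects of the given field layout fit in one line."""
    return LINE // sum(field_sizes)

# Full Person object: next, name, surname, age (8 bytes each)
full = objects_per_line([8, 8, 8, 8])
# After splitting, the traversal touches only the hot part: next, age
hot = objects_per_line([8, 8])
print(full, hot)  # 2 full objects vs. 4 hot parts per cache line
```

Under these assumptions, splitting doubles the number of list nodes a traversal can inspect per cache line fetched.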
![Page 121: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland](https://reader036.vdocuments.us/reader036/viewer/2022062408/56649e235503460f94b116f3/html5/thumbnails/121.jpg)
121
If interested and motivated

Apply @ Prof. Rodica Potolea until August 2012

Come to Zurich: start in February 2013, work 4–6 months on the thesis

If you have questions: send e-mail to me ([email protected]) or talk to Prof. Rodica Potolea
![Page 122: Performance Optimizations for NUMA-Multicore Systems Zoltán Majó Department of Computer Science ETH Zurich, Switzerland](https://reader036.vdocuments.us/reader036/viewer/2022062408/56649e235503460f94b116f3/html5/thumbnails/122.jpg)
122
Thank you for your attention!