erlangen regional computing center performance engineering … · erlangen regional computing...
TRANSCRIPT
![Page 1: ERLANGEN REGIONAL COMPUTING CENTER Performance Engineering … · ERLANGEN REGIONAL COMPUTING CENTER Holger Stengel, Jan Treibig, Georg Hager, and Gerhard Wellein Erlangen Regional](https://reader033.vdocuments.us/reader033/viewer/2022042310/5ed78f2067b53e06555d2158/html5/thumbnails/1.jpg)
ERLANGEN REGIONAL
COMPUTING CENTER
Holger Stengel, Jan Treibig, Georg Hager, and Gerhard Wellein
Erlangen Regional Computing Center & Dept. of Computer Science
Friedrich-Alexander-University Erlangen-Nuremberg, Germany
Performance Engineering for Stencil Updates
on Modern Processors
Partially funded by DFG Priority Programme1648
![Page 2: ERLANGEN REGIONAL COMPUTING CENTER Performance Engineering … · ERLANGEN REGIONAL COMPUTING CENTER Holger Stengel, Jan Treibig, Georg Hager, and Gerhard Wellein Erlangen Regional](https://reader033.vdocuments.us/reader033/viewer/2022042310/5ed78f2067b53e06555d2158/html5/thumbnails/2.jpg)
2
• There are so many
• potential stencil structures (2D/3D, long-/short-range,…),
• text book optimizations (register / spatial / temporal blocking,…),
• parameters for optimization (blocking / unrolling factor,…),
• parameters for execution (OpenMP schedule, clock speed, #cores).
• Basic questions addressed by ECM model
• What is the bottleneck of my stencil implementation?
Choose appropriate optimization technique
• What is the next bottleneck I will hit and “how far” is it from the first?
This yields the performance potential of the optimization
• Impact of processor frequency and #cores used on performance
Motivation
Performance Engineering for Stencil Kernels
![Page 3: ERLANGEN REGIONAL COMPUTING CENTER Performance Engineering … · ERLANGEN REGIONAL COMPUTING CENTER Holger Stengel, Jan Treibig, Georg Hager, and Gerhard Wellein Erlangen Regional](https://reader033.vdocuments.us/reader033/viewer/2022042310/5ed78f2067b53e06555d2158/html5/thumbnails/3.jpg)
3
Agenda
ECM model – basics
STREAM like kernel
Why does a single core not get full BW?
Case study 1: 5-pt 2D stencil
Case study 2: Long range 3D stencil
Case study 3: 3D stencil with DIVide
Summary
Performance Engineering for Stencil Kernels
![Page 4: ERLANGEN REGIONAL COMPUTING CENTER Performance Engineering … · ERLANGEN REGIONAL COMPUTING CENTER Holger Stengel, Jan Treibig, Georg Hager, and Gerhard Wellein Erlangen Regional](https://reader033.vdocuments.us/reader033/viewer/2022042310/5ed78f2067b53e06555d2158/html5/thumbnails/4.jpg)
4
Execution-Cache-Memory (ECM) model
Assumptions
Single core execution time is determined by
In-core execution time &
Data delay in memory hierarchy
Socket scaling in socket is linear until
relevant shared bottleneck is hit
Insights
Single-core performance & socket scaling
Relevant bottlenecks & performance impact
Input
Similar as for roofline (e.g. STREAM)
Data transfer times of all memory levels
Registers
L1
L2 / L3
Arithmetic
Memory
”d
ata
de
lay”
”c
ore
ti
me”
4.3cy/CL (40GB/s @2.7GHz)
2cy/CL
2cy/CL
Performance Engineering for Stencil Kernels
![Page 5: ERLANGEN REGIONAL COMPUTING CENTER Performance Engineering … · ERLANGEN REGIONAL COMPUTING CENTER Holger Stengel, Jan Treibig, Georg Hager, and Gerhard Wellein Erlangen Regional](https://reader033.vdocuments.us/reader033/viewer/2022042310/5ed78f2067b53e06555d2158/html5/thumbnails/5.jpg)
5
REPEAT[ A(:) = B(:) + C(:) * D(:)] @ double precision
Analysis: Sandy Bridge core w/ AVX (unit of work: 1 Cache Line)
Example: Schönauer Vector Triad in L2 cache
1 LD/cy + 0.5 ST/cy
Registers
L1
L2
32 B/cy (2 cy/CL)
Machine characteristics:
Arithmetic:
1 ADD/cy+ 1 MULT/cy
Triad analysis (per CL):
Registers
L1
L2
6 cy/CL
10 cy/CL
Arithmetic:
AVX: 2 cy/CL
Roofline prediction: 16/10 F/cy
LD LD LD WA ST
Measurement: ≈ 16 /17 F/cy
Performance Engineering for Stencil Kernels
Timeline:
LD
LD
ST/2
LD
ST/2
LD
LD
ST/2
LD
ST/2
ADD
MULT
ADD
MULT
16 Flops/CL
(AVX)
![Page 6: ERLANGEN REGIONAL COMPUTING CENTER Performance Engineering … · ERLANGEN REGIONAL COMPUTING CENTER Holger Stengel, Jan Treibig, Georg Hager, and Gerhard Wellein Erlangen Regional](https://reader033.vdocuments.us/reader033/viewer/2022042310/5ed78f2067b53e06555d2158/html5/thumbnails/6.jpg)
6
Data
in Model
Measure
ment
L1 6 cy 6.04 cy
L2 16 cy 17.2 cy
L3 26 cy 26.3 cy
MEM 50 cy 52.3 cy
Example: ECM model for Schönauer Vector Triad A(:)=B(:)+C(:)*D(:) with AVX
LOAD/STORE CL
transfer
Write-allocate
CL transfer
Performance Engineering for Stencil Kernels
Full socket
stream BW
LD
6 c
y L2
-L1
1
0cy
L3
-L2
1
0cy
M
-L3
2
4cy
ST 4cy
Hard
ware
specific
ation
Time [cycles]
for single Cache Line Update
![Page 7: ERLANGEN REGIONAL COMPUTING CENTER Performance Engineering … · ERLANGEN REGIONAL COMPUTING CENTER Holger Stengel, Jan Treibig, Georg Hager, and Gerhard Wellein Erlangen Regional](https://reader033.vdocuments.us/reader033/viewer/2022042310/5ed78f2067b53e06555d2158/html5/thumbnails/7.jpg)
7
Identify relevant bandwidth (shared) bottlenecks
L3 cache
Memory interface
Scale single-thread performance until first bottleneck is hit:
Multicore scaling in the ECM model
. . . Sandy Bridge: Scalable L3
Bottleneck (Proof): Memory
P(t)=min(t*P0,Proof), with Proof=min(Pmax,l *bS)
Performance Engineering for Stencil Kernels
![Page 8: ERLANGEN REGIONAL COMPUTING CENTER Performance Engineering … · ERLANGEN REGIONAL COMPUTING CENTER Holger Stengel, Jan Treibig, Georg Hager, and Gerhard Wellein Erlangen Regional](https://reader033.vdocuments.us/reader033/viewer/2022042310/5ed78f2067b53e06555d2158/html5/thumbnails/8.jpg)
8
Saturation point (# cores) well predicted
Measurement: scaling not perfect
Single core performance accurately
predicted
ECM model works well for this
architecture and this benchmark!
ECM prediction vs. measurements for A(:)=B(:)+C(:)*D(:)
Performance Engineering for Stencil Kernels
Intel Sandy Bridge EP
Only „measurement“: full socket
STREAM BW
![Page 9: ERLANGEN REGIONAL COMPUTING CENTER Performance Engineering … · ERLANGEN REGIONAL COMPUTING CENTER Holger Stengel, Jan Treibig, Georg Hager, and Gerhard Wellein Erlangen Regional](https://reader033.vdocuments.us/reader033/viewer/2022042310/5ed78f2067b53e06555d2158/html5/thumbnails/9.jpg)
9
Case study 1
ECM modelling for stencil structures:
2D Jacobi (double precision) with SSE2 on SNB
Performance Engineering for Stencil Kernels
![Page 10: ERLANGEN REGIONAL COMPUTING CENTER Performance Engineering … · ERLANGEN REGIONAL COMPUTING CENTER Holger Stengel, Jan Treibig, Georg Hager, and Gerhard Wellein Erlangen Regional](https://reader033.vdocuments.us/reader033/viewer/2022042310/5ed78f2067b53e06555d2158/html5/thumbnails/10.jpg)
10
Case study 1: 2D Jacobi in DP with SSE2 on SNB
Instruction count
- 13 LOAD
- 4 STORE
- 12 ADD
- 4 MUL
4-way unrolling
8 LUP / iteration
Performance Engineering for Stencil Kernels
![Page 11: ERLANGEN REGIONAL COMPUTING CENTER Performance Engineering … · ERLANGEN REGIONAL COMPUTING CENTER Holger Stengel, Jan Treibig, Georg Hager, and Gerhard Wellein Erlangen Regional](https://reader033.vdocuments.us/reader033/viewer/2022042310/5ed78f2067b53e06555d2158/html5/thumbnails/11.jpg)
11
Case study 1: 2D Jacobi in DP with SSE2 on SNB
Code characteristics
(SSE instructions per CL)
- 13 LOAD
- 4 STORE
- 12 ADD
- 4 MUL
Machine characteristics
(SSE instructions per cycle)
- 2 LOAD || (1 LOAD + 1 STORE)
- 1 ADD
- 1 MUL
- Clock speed: 3.5 GHz
LD LD LD LD 2
LD
2
LD
2
LD
2
LD
L
ST ST ST ST
+ + + + + + + + + + + +
* * * *
Core execution
12 cycles
Bottleneck
ADD
Port utilization in cycles
CP
U P
ort
s
Performance Engineering for Stencil Kernels
![Page 12: ERLANGEN REGIONAL COMPUTING CENTER Performance Engineering … · ERLANGEN REGIONAL COMPUTING CENTER Holger Stengel, Jan Treibig, Georg Hager, and Gerhard Wellein Erlangen Regional](https://reader033.vdocuments.us/reader033/viewer/2022042310/5ed78f2067b53e06555d2158/html5/thumbnails/12.jpg)
12
Case study 1: 2D Jacobi in DP with SSE2 on SNB
Situation 1: Data set fits into L1 cache
ECM prediction:
(8 LUP / 12 cy) * 3.5 GHz = 2.3 GLUP/s
Measurement: 2.2 GLUP/s
Situation 2: Data set fits into L2 cache
3 transfer streams from L2 to L1
(LD T0 + LD/ST T1)
ECM prediction:
(8 LUP / (12+6) cy) * 3.5 GHz = 1.5 GLUP/s
Measurement: 1.9 GLUP/s What‘s
wrong??
12 cy
6 cy
Performance Engineering for Stencil Kernels
![Page 13: ERLANGEN REGIONAL COMPUTING CENTER Performance Engineering … · ERLANGEN REGIONAL COMPUTING CENTER Holger Stengel, Jan Treibig, Georg Hager, and Gerhard Wellein Erlangen Regional](https://reader033.vdocuments.us/reader033/viewer/2022042310/5ed78f2067b53e06555d2158/html5/thumbnails/13.jpg)
13
Case study 1: 2D Jacobi in DP with SSE2 on SNB
LD LD LD LD 2LD 2LD 2LD 2LD L
ST ST ST ST
+ + + + + + + + + + + +
* * * * core execution: 12 cycles
ECM prediction:
(8 LUP / (8.5+6) cy) * 3.5 GHz =1.9 GLUP/s
Measurement: 1.9 GLUP/s
Register L1: 8.5 cycles
8.5 cy
6 cy
“If the model fails, we learn something”
Performance Engineering for Stencil Kernels
L1 L2: 6 cycles
L1 / L2 transfer
overlaps with arithmetic
No concurrent data transfer between memory levels
Data transfer may overlap with other in-core instructions
![Page 14: ERLANGEN REGIONAL COMPUTING CENTER Performance Engineering … · ERLANGEN REGIONAL COMPUTING CENTER Holger Stengel, Jan Treibig, Georg Hager, and Gerhard Wellein Erlangen Regional](https://reader033.vdocuments.us/reader033/viewer/2022042310/5ed78f2067b53e06555d2158/html5/thumbnails/14.jpg)
14
Case study 2: Long range 3D stencil
ECM modelling for stencil structures: Long range 3D 25-pt stencil
Background:
Three-dimensional Seismic simulations in oil industry
Finite-Difference-Time-Domain (FDTD) discretization
Stencil operator is
Isotropic (symmetric),
With constant coefficients,
2nd order in time
8th order in space
Performance Engineering for Stencil Kernels
Innermost loop shown only
Collaboration with
D. Keyes & T. Malas
(KAUST)
![Page 15: ERLANGEN REGIONAL COMPUTING CENTER Performance Engineering … · ERLANGEN REGIONAL COMPUTING CENTER Holger Stengel, Jan Treibig, Georg Hager, and Gerhard Wellein Erlangen Regional](https://reader033.vdocuments.us/reader033/viewer/2022042310/5ed78f2067b53e06555d2158/html5/thumbnails/15.jpg)
15
Case study 2: Long range 3D stencil
Data delay below L1 assuming
spatial blocking easy to calculate: 65 cy/CL
Data transfer L1 – register for V depends
on compiler
How to calculate / estimate core
execution time and actual load / stores?
Use Intel Architecture Code Analyzer (IACA))
Performance Engineering for Stencil Kernels
Core execution time
LD cycles: L1 Reg.
Core bottleneck
V?
![Page 16: ERLANGEN REGIONAL COMPUTING CENTER Performance Engineering … · ERLANGEN REGIONAL COMPUTING CENTER Holger Stengel, Jan Treibig, Georg Hager, and Gerhard Wellein Erlangen Regional](https://reader033.vdocuments.us/reader033/viewer/2022042310/5ed78f2067b53e06555d2158/html5/thumbnails/16.jpg)
16
Case study 2: Long range 3D stencil
Core execution (Non-LD/ST cycles) Data delay
L1-R
EG (
Load
) 6
4cy
L2
-L1
2
4cy
L3
-L2
2
4cy
M
-L3
1
7cy
12
9c
y
FrontEnd stalls
overlap:
2*(34.1-32) =4.2cy
AD
D
52
cy
MU
LT
39
cy
Re
g-R
eg
tran
sfe
rs
48
cy
Stores 4cy
Single-core performance (ECM)
2.7GHz / (129cy / 16LUP) = 335 MLUP/s
Measurement: 320 MLUP/s
IACA throughput
34.1cy/8LUP
1 CL 16 LUP (sp)
Performance Engineering for Stencil Kernels
![Page 17: ERLANGEN REGIONAL COMPUTING CENTER Performance Engineering … · ERLANGEN REGIONAL COMPUTING CENTER Holger Stengel, Jan Treibig, Georg Hager, and Gerhard Wellein Erlangen Regional](https://reader033.vdocuments.us/reader033/viewer/2022042310/5ed78f2067b53e06555d2158/html5/thumbnails/17.jpg)
17
Case study 3: 3D stencil with DIVide
ECM modelling for stencil structures: Impact of complex arithmetic
Background:
Three-dimensional Seismic simulations
Unelastic Wave Propagation Code
(http://hpgeoc.sdsc.edu/personnel.html)
Finite-Difference-Time-Domain (FDTD) discretization
Performance Engineering for Stencil Kernels
Collaboration with
O. Schenk
(USI Lugano)
DIV operation
![Page 18: ERLANGEN REGIONAL COMPUTING CENTER Performance Engineering … · ERLANGEN REGIONAL COMPUTING CENTER Holger Stengel, Jan Treibig, Georg Hager, and Gerhard Wellein Erlangen Regional](https://reader033.vdocuments.us/reader033/viewer/2022042310/5ed78f2067b53e06555d2158/html5/thumbnails/18.jpg)
18
Case study 3: 3D stencil with DIV
Data delay
L1-R
EG
38
cy
L2-L
1
20
cy
L3-L
2
20
cy
MEM
-L3
2
6cy
Double Precision
Core execution
DIV
ide
r
84
cy
10
4cy
Single Precision
Core execution
FrontEnd
stalls
Single-core performance (SP)
ECM (104cy ): 415 MLUP/s
Measurement: 406 MLUP/s
Single-core performance (DP)
ECM (104cy ): 208 MLUP/s
Measurement: 179 MLUP/s
Compiler
avoids
DIV instruction
DIV becomes
bottleneck for
temporal
blocking!
Performance Engineering for Stencil Kernels
![Page 19: ERLANGEN REGIONAL COMPUTING CENTER Performance Engineering … · ERLANGEN REGIONAL COMPUTING CENTER Holger Stengel, Jan Treibig, Georg Hager, and Gerhard Wellein Erlangen Regional](https://reader033.vdocuments.us/reader033/viewer/2022042310/5ed78f2067b53e06555d2158/html5/thumbnails/19.jpg)
19
Conclusions
ECM allows good estimate/prediction of single core performance
and socket scalability for “streaming kernels”
ECM reveals computational/hardware bottlenecks and quantifies
their impact on performance
ECM allows to quantify impact of clock speed or vectorization
ECM could be integrated into autotuning frameworks or DSLs
Work is funded by
Performance Engineering for Stencil Kernels
DFG Priority Programme1648 Bavarian Network for HPC
![Page 20: ERLANGEN REGIONAL COMPUTING CENTER Performance Engineering … · ERLANGEN REGIONAL COMPUTING CENTER Holger Stengel, Jan Treibig, Georg Hager, and Gerhard Wellein Erlangen Regional](https://reader033.vdocuments.us/reader033/viewer/2022042310/5ed78f2067b53e06555d2158/html5/thumbnails/20.jpg)
Performance Engineering for Stencil Kernels
High Performance Computing
is Computing at a
Bottleneck!*
*Georg Hager
![Page 21: ERLANGEN REGIONAL COMPUTING CENTER Performance Engineering … · ERLANGEN REGIONAL COMPUTING CENTER Holger Stengel, Jan Treibig, Georg Hager, and Gerhard Wellein Erlangen Regional](https://reader033.vdocuments.us/reader033/viewer/2022042310/5ed78f2067b53e06555d2158/html5/thumbnails/21.jpg)
21
References – ECM model basics & applications
1. G. Hager, J. Treibig, J. Habich, and G. Wellein: Exploring performance and power
properties of modern multicore chips via simple machine models. (accepted)
Concurrency and Computation: Practice and Experience,
DOI: 10.1002/cpe.3180 (2013). Preprint: arXiv:1208.2908
2. M. Wittmann, G. Hager, T. Zeiser, J. Treibig, and G. Wellein: Chip-level and multi-
node analysis of energy-optimized lattice-Boltzmann CFD simulations. Submitted.
Preprint: arXiv:1304.7664
3. J. Treibig, G. Hager, and G. Wellein: Performance patterns and hardware metrics
on modern multicore processors: Best practices for performance engineering.
Proc. 5th Workshop on Productivity and Performance (PROPER 2012) at Euro-Par
2012. Euro-Par 2012: Parallel Processing Workshops, LNCS 7640, 451-460
(2013), Springer, ISBN 978-3-642-36948-3. DOI: 10.1007/978-3-642-36949-0_50.
Preprint: arXiv:1206.3738
Performance Engineering for Stencil Kernels