1

CAMA 2013 - D. Fey and G. Wellein
Computer Architecture for Medical Applications
Pipelining & Single Instruction Multiple Data – two driving factors of single core performance
Gerhard Wellein, Department for Computer Science and Erlangen Regional Computing Center
Dietmar Fey, Department for Computer Science
30. April 2013
2

A different view on computer architecture

3

From high level code to macro-/microcode execution
sum=0.d0
do i=1, N
  sum=sum + A(i)
enddo
…

A(i) (incl. LD), sum in register xmm1
i (loop counter), N
ADD: add the 1st argument to the 2nd argument and store the result in the 2nd argument

Compiler → Execution → ADD execution unit
4

How does high level code interact with execution units?

Many hardware execution units:
  – LOAD (STORE) operands from L1 cache (register) to register (memory)
  – Floating Point (FP) MULTIPLY and ADD
  – Various integer units
Execution units may work in parallel → “superscalar” processor

Two important concepts at hardware level: Pipelining + SIMD

sum=0.d0
do i=1, N
  sum=sum + A(i)
enddo
…
5

Microprocessors – Pipelining

6
Introduction: Moore’s law
1965: G. Moore claimed #transistors on processor chip doubles every 12-24 months
Intel Nehalem EX: 2.3 billion; NVIDIA Fermi: 3 billion
7

[Figure: processor clock frequency in MHz (log scale, 0.1–10000) plotted by year]
Introduction: Moore’s law – faster cycles and beyond

• Moore’s law → transistors are getting smaller → run them faster
• Faster clock speed → reduce the complexity of instruction execution → pipelining of instructions

Intel x86 clock speed

Increasing transistor count and clock speed allows / requires architectural changes:
• Pipelining
• Superscalarity
• SIMD / vector ops
• Multi-core / threading
• Complex on-chip caches
8

Pipelining of arithmetic/functional units

• Idea:
  – Split a complex instruction into several simple / fast steps (stages)
  – Each step takes the same amount of time, e.g. a single cycle
  – Execute different steps of different instructions at the same time (in parallel)
• Allows for shorter cycle times (simpler logic circuits), e.g.:
  – a floating point multiplication takes 5 cycles, but
  – the processor can work on 5 different multiplications simultaneously
  – one result per cycle after the pipeline is full
• Drawbacks:
  – The pipeline must be filled → startup times (#instructions >> pipeline stages)
  – Efficient use of pipelines requires a large number of independent instructions → instruction level parallelism
  – Requires complex instruction scheduling by the compiler/hardware → software pipelining / out-of-order execution
• Pipelining is widely used in modern computer architectures
9

Interlude: Possible stages for Multiply

Real numbers can be represented by sign, mantissa and exponent in a “normalized” representation, e.g. s * 0.m * 10^e with
  – sign s = {-1, 1}
  – mantissa m, which does not contain 0 in the leading digit
  – exponent e, some positive or negative integer

Multiply two real numbers, r1 * r2 = r3, with r1 = s1 * 0.m1 * 10^e1 and r2 = s2 * 0.m2 * 10^e2:
  s1 * 0.m1 * 10^e1 * s2 * 0.m2 * 10^e2
  = (s1*s2) * (0.m1 * 0.m2) * 10^(e1+e2)
  Normalize the result: s3 * 0.m3 * 10^e3
10
5-stage Multiplication-Pipeline: A(i)=B(i)*C(i) ; i=1,...,N
[Pipeline diagram: in cycle i, the operand pair B(i), C(i) enters stage 1 and advances one stage per cycle. Stages: 1. Separate mantissa / exponent, 2. Multiply mantissas, 3. Add exponents, 4. Normalize result, 5. Insert sign. The first result A(1) leaves the pipe in cycle 5, A(2) in cycle 6, …, A(N) in cycle N+4.]
First result is available after 5 cycles (= latency of the pipeline)! After that, one result is completed in each cycle.

11
5-stage Multiplication-Pipeline: A(i)=B(i)*C(i) ; i=1,...,N
Wind-up/-down phases: Empty pipeline stages
12

Pipelining: Speed-Up and Throughput

• Assume a general m-stage pipe, i.e. the pipeline depth is m. Speed-up of pipelined vs. non-pipelined execution at the same clock speed:
    T_seq / T_pipe = (m*N) / (N+m) ≈ m for large N (N >> m)
• Throughput of pipelined execution (= average results per cycle) when executing N instructions in a pipeline with m stages:
    N / T_pipe(N) = N / (N+m) = 1 / (1 + m/N)
• Throughput for large N: N / T_pipe(N) → 1
• Number of independent operations (N_C) required to achieve T_p results per cycle:
    T_p = 1 / (1 + m/N_C)  →  N_C = T_p * m / (1 - T_p)
    T_p = 0.5 → N_C = m
13

Throughput as function of pipeline stages

[Figure: throughput vs. number of independent operations for several pipeline depths (m = #pipeline stages); the 90% pipeline efficiency level is marked]
14

Software pipelining

• Example:

Fortran code:
do i=1,N
  a(i) = a(i) * c
end do

Simple pseudo code:
loop: load a[i]
      mult a[i] = c, a[i]
      store a[i]
      branch.loop

Latencies:
  load a[i]            – load operand to register (4 cycles)
  mult a[i] = c, a[i]  – multiply a(i) with c (2 cycles); a[i], c in registers
  store a[i]           – write back result from register to mem./cache (2 cycles)
  branch.loop          – increase loop counter as long as i <= N (0 cycles)

Optimized pseudo code:
loop: load a[i+6]
      mult a[i+2] = c, a[i+2]
      store a[i]
      branch.loop

Assumption: instructions block execution if their operands are not available.
15

Software pipelining

Naive instruction issue (a[i]=a[i]*c; N=12): each iteration is executed serially –
  load a[1] → mult a[1]=c,a[1] → store a[1] → load a[2] → mult a[2]=c,a[2] → store a[2] → load a[3] → …
With 4+2+2 cycles of latency per iteration this takes T = 96 cycles.

Optimized (software pipelined) instruction issue:
  Prolog: load a[1] … load a[5]; the first mult a[1]=c,a[1] issues alongside load a[6], the first store a[1] alongside load a[8]
  Kernel (steady state): one load, one mult and one store issue in every cycle
  Epilog: the remaining mults and stores drain (… store a[11], store a[12])
Total: T = 19 cycles.
16
Efficient use of Pipelining
• Software pipelining can be done by the compiler, but efficient reordering of the instructions requires deep insight into application (data dependencies) and processor (latencies of functional units)
• Re-ordering of instructions can also be done at runtime by out-of-order (OOO) execution
• (Potential) dependencies within loop body may prevent efficient software pipelining or OOO execution, e.g.:
Dependency:
do i=2,N
  a(i) = a(i-1) * c
end do

No dependency:
do i=1,N
  a(i) = a(i) * c
end do

Pseudo-dependency:
do i=1,N-1
  a(i) = a(i+1) * c
end do
17

Pipelining: Data dependencies

Naive instruction issue (a[i]=a[i-1]*c; N=12):
  load a[1] → mult a[2]=c,a[1] → store a[2] → load a[2] → mult a[3]=c,a[2] → store a[3] → … → T = 96 cycles

Optimized instruction issue (T = 26 cycles):
  Prolog: load a[1]; mult a[2]=c,a[1]
  Kernel: each mult needs the result of the previous mult as input, so a new mult can only issue every 2 cycles (the MULT latency); the stores overlap:
    mult a[3]=c,a[2] | store a[2]
    mult a[4]=c,a[3] | store a[3]
    …
    mult a[12]=c,a[11] | store a[11]; store a[12]
Length of MULT pipeline determines throughput
20

Fill the pipeline with independent recursive streams…

Intel Sandy Bridge (desktop): 4 cores; 3.5 GHz; SMT
MULT pipeline depth: 5 stages → 1 F / 5 cycles for a recursive update
→ 5 independent updates on a single core!

[Diagram: the 5-stage MULT pipe filled with the five independent streams E(1)*s, D(1)*s, C(1)*s, B(2)*s, A(2)*s]

Thread 0:
do i=1,N
  A(i)=A(i-1)*s
  B(i)=B(i-1)*s
  C(i)=C(i-1)*s
  D(i)=D(i-1)*s
  E(i)=E(i-1)*s
enddo
21

Pipelining: Beyond multiplication

• Typical number of pipeline stages: 2–5 for the hardware pipelines on modern CPUs
• x86 processors (AMD, Intel): 1 MULT & ADD unit per processor core
• No hardware for div / sqrt / exp / sin … → expensive instructions
• Other instructions are also pipelined, e.g. LOAD operand to register (4 cycles)
• “FP costs” in cycles per instruction for the Intel Core2 architecture:

Operation        | y=a+y (y=a*y) | y=a/y | y=sqrt(y) | y=sin(y)
Latency          | 3 (5)         | 32    | 29        | >100
Throughput       | 1 (1)         | 31    | 28        | >100
Cycles/Operation | 0.5*          | 15.5* | 14*       | >100
22

Pipelining: Potential problems (1)

• Hidden data dependencies:
  – C/C++ allows “pointer aliasing”, e.g. A = &C[-1]; B = &C[-2] → C[i] = C[i-1] + C[i-2] → dependency!
  – The compiler cannot resolve potential pointer aliasing conflicts on its own!
  – If no pointer aliasing occurs, tell the compiler, e.g.
    • use the -fno-alias switch of the Intel compiler
    • pass arguments as (double *restrict A, …) (C99 standard only)

void scale_shift(double *A, double *B, double *C, int n) {
  for(int i=0; i<n; ++i)
    C[i] = A[i] + B[i];
}
23

Pipelining: Potential problems (2)

• Simple subroutine/function calls within a loop:

do i=1, N
  call elementprod(A(i), B(i), psum)
  C(i)=psum
enddo
…
function elementprod(a, b, psum)
…
psum=a*b

→ Inline subroutines! (can be done by the compiler…)

do i=1, N
  psum=A(i)*B(i)
  C(i)=psum
enddo
…
24

Pipelining: Potential problems (3a)

Can we use pipelining, or does this cost us 8*3 cycles (assuming a 3-stage ADD pipeline)?
25

Pipelining: Potential problems (3b)

More general – “reduction operations”?

Benchmark: run the above assembly language kernel with N = 32, 64, 128, …, 4096 on a processor with
  – 3.5 GHz clock speed → ClSp = 3500 Mcycle/s
  – 1 pipelined ADD unit (latency 3 cycles)
  – 1 pipelined LOAD unit (latency 4 cycles)

sum=0.d0
do i=1, N
  sum=sum + A(i)
enddo
…

A(i) (incl. LD), sum in register xmm1; i (loop counter), N
ADD: add the 1st argument to the 2nd argument and store the result in the 2nd argument
→ 1 cycle per iteration (after 7 iterations)
26

Pipelining: Potential problems (4)

Expected performance: Throughput * ClockSpeed
Throughput: N/T(N) = N/(L+N)
Assumption: L is the total latency of one iteration, and one result per cycle is delivered after pipeline startup. Total runtime: L+N cycles.
Total latency: L = 4 cycles (LOAD) + 3 cycles (ADD) = 7 cycles
Performance for N iterations: 3500 Mcycle/s * (N/(L+N)) iterations/cycle
Maximum performance (for large N): 3500 Mcycle/s * 1 iteration/cycle = 3500 Miterations/s

[Diagram: the LOAD pipe holds A(i), A(i-1), A(i-2), A(i-3) while the ADD pipe holds A(i-4), A(i-5), A(i-6)]
27

Pipelining: Potential problems (5)

Why?

[Diagram: s = s+A(1) occupies the ADD pipe; A(2) must wait until it has completed before s = s+A(2) can start]

Dependency on sum → the next instruction needs to wait for the completion of the previous one → only 1 out of 3 stages active → 3 cycles per iteration

sum=0.d0
do i=1, N
  sum=sum + A(i)
enddo
…

Throughput here: N/T(N) = N/(L+3*N)
28

Pipelining: Potential problems (6)

Increase pipeline utilization by “loop unrolling”:

sum1=0.d0
sum2=0.d0
do i=1, N, 2
  sum1=sum1+A(i)
  sum2=sum2+A(i+1)
enddo
sum=sum1+sum2

“2-way Modulo Variable Expansion” (N is even)
→ 2 out of 3 pipeline stages can be filled
→ 2 results every 3 cycles → 1.5 cycles/iteration
29

Pipelining: Potential problems (7)

• 4-way Modulo Variable Expansion (MVE) to get the best performance (in principle 3-way should do as well)
• The sum is split up into 4 independent partial sums
• The compiler can do that, if it is allowed to…
• Computer floating point arithmetic is not associative!
• If you require a binary exact result (-fp-model strict), the compiler is not allowed to do this transformation
• L = (7 + 3*3) cycles (prev. slide)
Nr=4*(N/4)
sum1=0.d0
sum2=0.d0
sum3=0.d0
sum4=0.d0
do i=1, Nr, 4
  sum1=sum1+A(i)
  sum2=sum2+A(i+1)
  sum3=sum3+A(i+2)
  sum4=sum4+A(i+3)
enddo
do i=Nr+1, N   ! remainder loop
  sum1=sum1+A(i)
enddo
sum=sum1+sum2+sum3+sum4

“4-way MVE”
30

Pipelining: The Instruction pipeline

• Besides the arithmetic/functional units, instruction execution itself is also pipelined; one instruction performs at least 3 steps:
  1. Fetch instruction from L1I
  2. Decode instruction
  3. Execute instruction

Hardware pipelining on the processor (all units can run concurrently):
  Cycle 1: fetch instruction 1
  Cycle 2: decode instruction 1 | fetch instruction 2
  Cycle 3: execute instruction 1 | decode instruction 2 | fetch instruction 3
  Cycle 4: execute instruction 2 | decode instruction 3 | fetch instruction 4
  …

→ Branches can stall this pipeline! (speculative execution, predication)
→ Each unit is pipelined itself (cf. Execute = multiply pipeline)
31

Pipelining: The Instruction pipeline

• Problem: unpredictable branches to other instructions
  Cycle 1: fetch instruction 1
  Cycle 2: decode instruction 1 | fetch instruction 2
  Cycle 3: execute instruction 1 | decode instruction 2 | fetch instruction 3
  Cycle 4: execute instruction 2 | decode instruction 3 | …

Assume: the result of an instruction determines which instruction is fetched next!
32

Microprocessors – Superscalar

33
Superscalar Processors
• Superscalar processors provide additional hardware (i.e. transistors) to execute multiple instructions per cycle!
• Parallel hardware components / pipelines are available to
  – fetch / decode / issue multiple instructions per cycle (typically 3–6 per cycle)
  – perform multiple integer / address calculations per cycle (e.g. 6 integer units on Itanium2)
  – load (store) multiple operands (results) from (to) cache per cycle (typically one load AND one store per cycle)
  – perform multiple floating point instructions per cycle (typically 2 floating point instructions per cycle, e.g. 1 MULT + 1 ADD)
• On superscalar RISC processors, out-of-order (OOO) execution hardware is available to optimize the usage of the parallel hardware
34

Superscalar Processors – Instruction Level Parallelism

Multiple units enable the use of Instruction Level Parallelism (ILP): the instruction stream is “parallelized” on the fly
Issuing m concurrent instructions per cycle: m-way superscalar
Modern processors are 3- to 6-way superscalar & can perform 2 or 4 floating point operations per cycle
[Diagram: 4-way “superscalar” execution – four fetch/decode/execute pipelines run side by side; in every cycle four instructions are fetched from L1I (instructions 1–4, then 5–8, 9–12, 13–16) and four are decoded and executed in parallel]
35

Superscalar Processors – ILP in action

Complex register management not shown (R4 contains A(i-4))
2-way superscalar: 1 LOAD instruction + 1 ADD instruction completed per cycle
Often cited metrics for superscalar processors:
  Instructions Per Cycle: IPC = 2 above
  Cycles Per Instruction: CPI = 0.5 above
[Diagram: register set with the LOAD pipe delivering A(i)→R0, A(i-1)→R1, A(i-2)→R2, A(i-3)→R3 while the ADD pipe executes R11=R11+R4, R12=R12+R5, R13=R13+R6]

sum1=0.d0 ! reg. R11
sum2=0.d0 ! reg. R12
sum3=0.d0 ! reg. R13
do i=1, N, 3
  sum1=sum1+A(i)
  sum2=sum2+A(i+1)
  sum3=sum3+A(i+2)
enddo
sum=sum1+sum2+sum3

“3-way Modulo Variable Expansion” (N is a multiple of 3)
36

Superscalar processor – Intel Nehalem design

Decode & issue a max. of 4 instructions per cycle: IPC = 4 → min. CPI = 0.25 cycles/instruction
Parallel units: FP ADD & FP MULT (work in parallel), LOAD + STORE (work in parallel)
Max. FP performance: 1 ADD + 1 MULT instruction per cycle
  Max. performance:           A(i) = r0 + r1 * B(i)
  1/2 of max. FP performance: A(i) = r1 * B(i)
  1/3 of max. FP performance: A(i) = A(i) + B(i) * C(i)
37

Microprocessors – Single Instruction Multiple Data (SIMD) processing

Basic idea: apply the same instruction to multiple operands in parallel

38
SIMD-processing – Basics
Single Instruction Multiple Data (SIMD) instructions allow the concurrent execution of the same operation on “wide” registers.

x86_64 SIMD instruction sets:
  – SSE: register width = 128 bit → 2 double (4 single) precision FP operands
  – AVX: register width = 256 bit → 4 double (8 single) precision FP operands
“Scalar” (non-SIMD) execution: 1 single/double operand, i.e. only the lower 64 bit (32 bit) of the registers are used.

Integer operands:
  – SSE can be configured very flexibly: 1 x 128 bit, …, 16 x 8 bit
  – AVX: no support for using the 256 bit register width for integer operations

SIMD execution → vector execution: if the compiler has vectorized a loop, SIMD instructions are used.

39
SIMD-processing – Basics
Example: adding two registers holding double precision floating point operands, using a 256 bit register (AVX).

Scalar execution: R2 ← ADD[R0,R1] uses only the lower 64 bit of each register: C[0] = A[0] + B[0]

SIMD execution: R2 ← V64ADD[R0,R1] adds all four 64 bit entries at once:
  C[0] = A[0] + B[0]
  C[1] = A[1] + B[1]
  C[2] = A[2] + B[2]
  C[3] = A[3] + B[3]

If 128 bit SIMD instructions (SSE) are executed, only half of the register width is used.
40
SIMD-processing – Basics
Steps (done by the compiler) for “SIMD processing”:

Original loop:
for(int i=0; i<n; i++)
  C[i] = A[i] + B[i];

“Loop unrolling” (plus remainder loop handling):
for(int i=0; i<n; i+=4){
  C[i]  = A[i]   + B[i];
  C[i+1]= A[i+1] + B[i+1];
  C[i+2]= A[i+2] + B[i+2];
  C[i+3]= A[i+3] + B[i+3];
}
//remainder loop handling

“Pseudo-assembler”:
LABEL1:
  VLOAD R0 ← A[i]       ! load 256 bits starting at the address of A[i] into register R0
  VLOAD R1 ← B[i]
  V64ADD[R0,R1] → R2    ! add the corresponding 64 bit entries of R0 and R1, store the 4 results in R2
  VSTORE R2 → C[i]      ! store R2 (256 bit) to the address starting at C[i]
  i ← i+4
  i<(n-4)? JMP LABEL1
//remainder loop handling
41
SIMD-processing – Basics
No SIMD processing for loops with data dependencies:

for(int i=0; i<n; i++)
  A[i] = A[i-1]*s;

“Pointer aliasing” may prevent the compiler from SIMD processing:

void scale_shift(double *A, double *B, double *C, int n) {
  for(int i=0; i<n; ++i)
    C[i] = A[i] + B[i];
}

C/C++ allows A = &C[-1] and B = &C[-2] → C[i] = C[i-1] + C[i-2]: dependency → no SIMD processing
If no pointer aliasing occurs, tell the compiler, e.g. use the -fno-alias switch of the Intel compiler → SIMD processing
42
SIMD-processing – Basics

SIMD processing of a vector sum:

double s=0.0;
for(int i=0; i<n; i++)
  s = s + A[i];

The data dependency on s must be resolved for SIMD processing (assume AVX). The compiler does the transformation (Modulo Variable Expansion) – if the programmer allows it to do so (e.g. use -O3 instead of -O1):

s0=0.0; s1=0.0; s2=0.0; s3=0.0;
for(int i=0; i<n; i+=4){
  s0 = s0 + A[i];
  s1 = s1 + A[i+1];
  s2 = s2 + A[i+2];
  s3 = s3 + A[i+3];
}
//remainder
s = s0+s1+s2+s3;

In registers: R0 ← (0.d0, 0.d0, 0.d0, 0.d0); the loop body becomes … V64ADD(R0,R1) → R0 …; a final “horizontal” ADD sums up the 4 64-bit entries of R0.
43

SIMD-processing: What about pipelining?!

R0 ← (0.0, 0.0, 0.0, 0.0)
do i=1, N, 4
  VLOAD A(i:i+3) → R1
  V64ADD(R0,R1) → R0
enddo
sum ← HorizontalADD(R0)

→ Need another MVE step to fill the pipeline stages:

R0 ← (0.0, 0.0, 0.0, 0.0)
R1 ← (0.0, 0.0, 0.0, 0.0)
R2 ← (0.0, 0.0, 0.0, 0.0)
do i=1, N, 12
  LOAD A(i:i+3) → R3
  LOAD A(i+4:i+7) → R4
  LOAD A(i+8:i+11) → R5
  V64ADD(R0,R3) → R0   ! “vertical add”
  V64ADD(R1,R4) → R1
  V64ADD(R2,R5) → R2
enddo
…
V64ADD(R0,R1) → R0
V64ADD(R0,R2) → R0
sum ← HorizontalADD(R0)   ! “horizontal add”
44

SIMD-processing: What about pipelining?!

[Figure: performance of the vectorized vs. scalar vector sum (double precision) over N]
  – Unrolling factor of the vectorized code: 1 AVX iteration performs 4 successive i-iterations
  – Performance: 4x higher than the “scalar” version
  – Start-up phase is much longer…
45

Compiler generated AVX code (loop body)

Baseline version (“scalar”): no pipelining, no SIMD → 3 cycles / iteration
Compiler generated “AVX version” (-O3 -xAVX):
  – SIMD processing: vaddpd %ymm8 → 4 dp operands (4-way unrolling)
  – Pipelining: 8-way MVE of the SIMD code
  → 0.25 cycles / iteration (32-way unrolling in total)
SIMD processing – Vector sum (double precision) – 1 core

SIMD: most impact if data is close to the core – other bottlenecks stay the same!

[Figure: performance vs. location of the input data (A[]) in the memory hierarchy, with the peak marked; “Scalar”: code execution in the core is slower than any data transfer; “Plain”: no SIMD but 4-way MVE; AVX/SIMD: full benefit only if data is in the L1 cache]
47

Data parallel SIMD processing

• Requires independent vector-like operations (“data parallel”)
• The compiler is required to generate “vectorized” code → check the compiler output
• Check for the use of “packed SIMD” instructions at runtime (likwid) or in the assembly code
• Packed SIMD may impose alignment constraints, e.g. 16-byte alignment for efficient load/store on Intel Core2 architectures
• Check also for SIMD LOAD / STORE instructions
• Use of packed SIMD instructions reduces the overall number of instructions (typical theoretical max. of 4 instructions / cycle) → SIMD code may improve performance but reduce CPI!
48

Data parallel SIMD processing: Boosting performance

• Putting it all together: modern x86_64 based Intel / AMD processor
  – One FP MULTIPLY and one FP ADD pipeline can run in parallel, each with a throughput of one FP instruction/cycle (FMA units on AMD Interlagos) → maximum 2 FP instructions/cycle
  – Each pipeline operates on 128 (256) bit registers for packed SSE (AVX) instructions → 2 (4) double precision FP operations per SSE (AVX) instruction → 4 (8) FP operations / cycle (1 MULT & 1 ADD on 2 (4) operands)

Peak performance of a 3 GHz CPU (core):
  SSE: 12 GFlop/s or AVX: 24 GFlop/s (double precision)
  SSE: 24 GFlop/s or AVX: 48 GFlop/s (single precision)
BUT for “scalar” code: 6 GFlop/s (double and single precision)!
49
Maximum floating point (FP) performance:

  P_core = F * S * n

  F = FP instructions per cycle: 2 (1 MULT and 1 ADD)
  S = FP operations per instruction: 4 (dp) / 8 (sp)  (256 bit SIMD registers – “AVX”)
  n = clock speed: ~2.5 GHz

→ P = 20 GF/s (dp) / 40 GF/s (sp)

There is no single driving force for single core performance!

Scalar (non-SIMD) execution: S = 1 FP op/instruction (dp/sp) → P = 5 GF/s (dp/sp)
50

SIMD registers: floating point (FP) data and beyond

Possible data types in an SSE register (AVX only applies to FP data; not to scale):
Integer (SSE):        16 x 8 bit | 8 x 16 bit | 4 x 32 bit | 2 x 64 bit | 1 x 128 bit
Floating point (SSE): 4 x 32 bit | 2 x 64 bit
Floating point (AVX): 8 x 32 bit | 4 x 64 bit
51
Rules for vectorizable loops / SIMD processing:
1. Countable
2. Single entry and single exit
3. Straight line code
4. No function calls (exception: intrinsic math functions)

Better performance with:
5. Simple inner loops with unit stride
6. Minimized indirect addressing
7. Aligned data structures (SSE: 16 bytes, AVX: 32 bytes)
8. In C, use of the restrict keyword for pointers to rule out aliasing

Obstacles for vectorization:
  – Non-contiguous memory access
  – Data dependencies
52
How to leverage vectorization / SIMD (complexity and efficiency increase down the list):
1. The compiler does it for you (aliasing, alignment, language)
2. Source code directives (pragmas) to ease the compiler’s job
3. Alternative programming models for compute kernels (OpenCL, ispc)
4. Intrinsics (restricted to C/C++)
5. Implement directly in assembler
53
Vectorization and the Intel compiler:
  – The Intel compiler will try to use SIMD instructions when enabled to do so (“poor man’s vector computing”)
  – The compiler can emit messages about vectorized loops (not by default):
      plain.c(11): (col. 9) remark: LOOP WAS VECTORIZED.
  – Use the option -vec_report3 to get full compiler output about which loops were vectorized, which were not, and why (data dependencies!)
  – Some obstructions will prevent the compiler from applying vectorization even if it is possible (e.g. -fp-model strict for the vector sum, or pointer aliasing)
  – You can use source code directives to provide more information to the compiler
54
Vectorization compiler options:
  – The compiler will vectorize starting with -O2
  – To enable specific SIMD extensions, use the -x option, e.g. -xSSE2 → vectorize for SSE2-capable machines. Available SIMD extensions: SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, AVX
  – -xAVX on Sandy Bridge processors
  – Recommended option: -xHost will optimize for the architecture you compile on
  – Compiling for AMD Opteron: use plain -O3, as the -x options may involve CPU type checks at runtime!
55
Vectorization source code directives (pragmas):
Fine-grained control of loop vectorization. Use the !DEC$ (Fortran) or #pragma (C/C++) sentinel to start a compiler directive:
  #pragma vector always      – vectorize even if it seems inefficient (hint!)
  #pragma novector           – do not vectorize, even if possible
  #pragma vector nontemporal – use NT stores when allowed (i.e. alignment conditions are met)
  #pragma vector aligned     – specifies that all array accesses are aligned to 16-byte boundaries (DANGEROUS! You must not lie about this!)
56
User mandated vectorization (pragmas):
  – Starting with Intel Compiler 12.0, the simd pragma is available: #pragma simd enforces vectorization where the other pragmas fail
  – Prerequisites: countable loop, innermost loop, must conform to the for-loop style of OpenMP worksharing constructs
  – There are additional clauses: reduction, vectorlength, private → refer to the compiler manual for further details
  – NOTE: with #pragma simd the compiler may generate incorrect code if the loop violates the vectorization rules!

#pragma simd reduction(+:x)
for (int i=0; i<n; i++) {
  x = x + A[i];
}
57
Basic approach to check the instruction code:
  – Get the assembler code (Intel compiler): icc -S -O3 -xHost triad.c -o triad.s
  – Disassemble the executable: objdump -d ./cacheBench | less

Things to check for:
  – Is the code vectorized? Search for the pd/ps suffix: mulpd, addpd, vaddpd, vmulpd
  – Is the data loaded with 16-byte moves? movapd, movaps, vmovupd
  – For memory-bound code: search for nontemporal stores: movntpd, movntps

The x86 ISA is documented in:
  – Intel Software Development Manual (SDM) 2A and 2B
  – AMD64 Architecture Programmer's Manual Vol. 1–5
58
Some basics of the x86-64 ISA:
  – 16 general purpose registers (64 bit): rax, rbx, rcx, rdx, rsi, rdi, rsp, rbp, r8–r15; alias with the eight 32 bit registers: eax, ebx, ecx, edx, esi, edi, esp, ebp
  – Floating point SIMD registers: xmm0–xmm15 SSE (128 bit), alias with the 256 bit registers ymm0–ymm15 AVX (256 bit)
  – SIMD instructions are distinguished by:
      AVX (VEX) prefix: v
      Operation: mul, add, mov
      Modifier: nontemporal (nt), unaligned (u), aligned (a), high (h)
      Data range: packed (p), scalar (s)
      Data type: single (s), double (d)
59
Some basic single core optimizations – warnings first
“Premature optimization is the root of all evil.” – Donald E. Knuth
“Parallel performance is easy, single node/core performance is difficult.” – Bill Gropp
60
Single core: Common sense optimizations (1)
Do less work! Reducing the work to be done is never a bad idea!

Before:
logical :: flag
flag = .false.
do i=1,N
  if(complex_func(A(i)) < THRESHOLD) then   ! check if at
    flag=.true.                             ! least one
  endif                                     ! is true
enddo

After:
logical :: flag
flag = .false.
do i=1,N
  if(complex_func(A(i)) < THRESHOLD) then   ! check if at
    flag=.true.                             ! least one
    EXIT                                    ! is true and
  endif                                     ! EXIT the do loop
enddo
61
Single core: Common sense optimizations (2)
Avoid expensive operations!
  – FP MULT & FP ADD are the fastest way to compute
  – Avoid DIV / SQRT / SIN / COS / TAN, … → table lookup
  – Avoid a one-to-one implementation of the algorithm, e.g. A = A + B**2 (A, B float):
      A = A + B**2.0  (V1)
      A = A + B * B   (V2)

(V1) is not a good idea: B**2.0 → exp{2.0*ln(B)}
  1. Computing exp & ln is very expensive
  2. B < 0 ?!

Most useful if data is close to the CPU!
62
Single core: Common sense optimizations (3)
Shrink the working set! Working on small data sets reduces the data transfer volume and increases the probability of cache hits.
Analyze whether appropriate data types are used, e.g. if 4 different particle species have to be distinguished:

integer spec(1:N)     ! spec(i) = {0,1,2,3} → sizeof(spec(1:N)) = 4*N byte

OR use a 1-byte integer datatype:
integer*1 spec(1:N)   ! sizeof(spec(1:N)) = N byte

OR use 2 bits for each species:
integer*1 spec(1:N/4) ! sizeof(spec(1:N)) = N/4 byte

Strongly depends on the application!
63
Single core: Common sense optimizations (4a)
Eliminate common subexpressions! This often reduces the MFLOP/s rate but improves the runtime.
In principle the compiler should do the job, but do not rely on it:
  – Associativity rules may prevent the compiler from doing so
  – The compiler may not recognize the opportunity (limited scope)
64
Single core: Common sense optimizations (4b)
Replace expensive functions by table lookup
do iter = …
  …
  do i=1,n
    …
    edelz = iL(i)+iR(i)+iU(i)+iO(i)+iS(i)+iN(i)
    BF = 0.5d0*(1.d0+tanh(edelz))
    …
  enddo
  …
enddo

Entries iL, …, iN ∈ {-1, 0, 1} → edelz ∈ {-6, -5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5, 6}
→ Compute all 13 potential values of tanh(edelz) beforehand and store them in a table with 13 entries: tanh_table(-6:6)
65
Single core: Common sense optimizations (4c)
Replace expensive functions by table lookup
30. April 2013 CAMA 2013 - D. Fey and G. Wellein
do i=-6,6 tanh_table(i) = tanh(i)enddodo iter = … … … do i=1,n … edelz=iL(i)+iR(i)+iU(i)+iO(i)+iS(i)+iN(i) BF= 0.5d0*(1.d0+ tanh_table(edelz)) … … enddo … …enddo
66
Single core: Common sense optimizations (5)
Avoid branches! Support the compiler in understanding and optimizing your code!
A code change may enable vectorization, SIMD & other optimizations.
BTW: software pipelining is also much easier, and no branch prediction is required.