1

CAMA 2013 - D. Fey and G. Wellein
Computer Architecture for Medical Applications
Pipelining & Single Instruction Multiple Data – two driving factors of single core performance
Gerhard Wellein, Department for Computer Science and Erlangen Regional Computing Center
Dietmar Fey, Department for Computer Science
30. April 2013
2

A different view on computer architecture

3

From high level code to macro-/microcode execution
sum=0.d0
do i=1, N
  sum=sum + A(i)
enddo
…

A(i) (incl. LD), sum in register xmm1
i (loop counter), N
ADD: add the 1st argument to the 2nd argument and store the result in the 2nd argument

Compiler → Execution → ADD execution unit
4

How does high level code interact with execution units?

Many hardware execution units:
  – LOAD (STORE) operands from L1 cache (register) to register (memory)
  – Floating Point (FP) MULTIPLY and ADD
  – Various integer units
Execution units may work in parallel → “superscalar” processor

Two important concepts at hardware level: Pipelining + SIMD

sum=0.d0
do i=1, N
  sum=sum + A(i)
enddo
…
5

Microprocessors – Pipelining

6
Introduction: Moore’s law
1965: G. Moore claimed #transistors on processor chip doubles every 12-24 months
Intel Nehalem EX: 2.3 billion; NVIDIA Fermi: 3 billion
7

[Figure: processor clock frequency in MHz (log scale, 0.1–10000) plotted by year]
Introduction: Moore’s law – faster cycles and beyond

• Moore’s law → transistors are getting smaller → run them faster
• Faster clock speed → reduce the complexity of instruction execution → pipelining of instructions

Intel x86 clock speed

Increasing transistor count and clock speed allows / requires architectural changes:
• Pipelining
• Superscalarity
• SIMD / vector ops
• Multi-core / threading
• Complex on-chip caches
8

Pipelining of arithmetic/functional units

• Idea:
  – Split a complex instruction into several simple / fast steps (stages)
  – Each step takes the same amount of time, e.g. a single cycle
  – Execute different steps of different instructions at the same time (in parallel)
• Allows for shorter cycle times (simpler logic circuits), e.g.:
  – a floating point multiplication takes 5 cycles, but
  – the processor can work on 5 different multiplications simultaneously
  – one result per cycle after the pipeline is full
• Drawbacks:
  – The pipeline must be filled → startup times (#instructions >> pipeline stages)
  – Efficient use of pipelines requires a large number of independent instructions → instruction level parallelism
  – Requires complex instruction scheduling by the compiler/hardware → software pipelining / out-of-order execution
• Pipelining is widely used in modern computer architectures
9

Interlude: Possible stages for Multiply

Real numbers can be represented by sign, mantissa and exponent in a “normalized” representation, e.g. s * 0.m * 10^e with
  – sign s = {-1, 1}
  – mantissa m, which does not contain 0 in the leading digit
  – exponent e, some positive or negative integer

Multiply two real numbers, r1 * r2 = r3, with r1 = s1 * 0.m1 * 10^e1 and r2 = s2 * 0.m2 * 10^e2:
  s1 * 0.m1 * 10^e1 * s2 * 0.m2 * 10^e2
  = (s1*s2) * (0.m1 * 0.m2) * 10^(e1+e2)
  Normalize the result: s3 * 0.m3 * 10^e3
10
5-stage Multiplication-Pipeline: A(i)=B(i)*C(i) ; i=1,...,N
[Pipeline diagram: in cycle i, the operand pair B(i), C(i) enters stage 1 and advances one stage per cycle. Stages: 1. Separate mantissa / exponent, 2. Multiply mantissas, 3. Add exponents, 4. Normalize result, 5. Insert sign. The first result A(1) leaves the pipe in cycle 5, A(2) in cycle 6, …, A(N) in cycle N+4.]
First result is available after 5 cycles (= latency of the pipeline)! After that, one result is completed in each cycle.

11
5-stage Multiplication-Pipeline: A(i)=B(i)*C(i) ; i=1,...,N
Wind-up/-down phases: Empty pipeline stages
12

Pipelining: Speed-Up and Throughput

• Assume a general m-stage pipe, i.e. the pipeline depth is m. Speed-up of pipelined vs. non-pipelined execution at the same clock speed:
    T_seq / T_pipe = (m*N) / (N+m) ≈ m for large N (N >> m)
• Throughput of pipelined execution (= average results per cycle) when executing N instructions in a pipeline with m stages:
    N / T_pipe(N) = N / (N+m) = 1 / (1 + m/N)
• Throughput for large N: N / T_pipe(N) → 1
• Number of independent operations (N_C) required to achieve T_p results per cycle:
    T_p = 1 / (1 + m/N_C)  →  N_C = T_p * m / (1 - T_p)
    T_p = 0.5 → N_C = m
13

Throughput as function of pipeline stages

[Figure: throughput vs. number of independent operations for several pipeline depths (m = #pipeline stages); the 90% pipeline efficiency level is marked]
14

Software pipelining

• Example:

Fortran code:
do i=1,N
  a(i) = a(i) * c
end do

Simple pseudo code:
loop: load a[i]
      mult a[i] = c, a[i]
      store a[i]
      branch.loop

Latencies:
  load a[i]            – load operand to register (4 cycles)
  mult a[i] = c, a[i]  – multiply a(i) with c (2 cycles); a[i], c in registers
  store a[i]           – write back result from register to mem./cache (2 cycles)
  branch.loop          – increase loop counter as long as i <= N (0 cycles)

Optimized pseudo code:
loop: load a[i+6]
      mult a[i+2] = c, a[i+2]
      store a[i]
      branch.loop

Assumption: instructions block execution if their operands are not available.
15

Software pipelining

Naive instruction issue (a[i]=a[i]*c; N=12): each iteration is executed serially –
  load a[1] → mult a[1]=c,a[1] → store a[1] → load a[2] → mult a[2]=c,a[2] → store a[2] → load a[3] → …
With 4+2+2 cycles of latency per iteration this takes T = 96 cycles.

Optimized (software pipelined) instruction issue:
  Prolog: load a[1] … load a[5]; the first mult a[1]=c,a[1] issues alongside load a[6], the first store a[1] alongside load a[8]
  Kernel (steady state): one load, one mult and one store issue in every cycle
  Epilog: the remaining mults and stores drain (… store a[11], store a[12])
Total: T = 19 cycles.
16
Efficient use of Pipelining
• Software pipelining can be done by the compiler, but efficient reordering of the instructions requires deep insight into application (data dependencies) and processor (latencies of functional units)
• Re-ordering of instructions can also be done at runtime by out-of-order (OOO) execution
• (Potential) dependencies within loop body may prevent efficient software pipelining or OOO execution, e.g.:
Dependency:
do i=2,N
  a(i) = a(i-1) * c
end do

No dependency:
do i=1,N
  a(i) = a(i) * c
end do

Pseudo-dependency:
do i=1,N-1
  a(i) = a(i+1) * c
end do
17

Pipelining: Data dependencies

Naive instruction issue (a[i]=a[i-1]*c; N=12):
  load a[1] → mult a[2]=c,a[1] → store a[2] → load a[2] → mult a[3]=c,a[2] → store a[3] → … → T = 96 cycles

Optimized instruction issue (T = 26 cycles):
  Prolog: load a[1]; mult a[2]=c,a[1]
  Kernel: each mult needs the result of the previous mult as input, so a new mult can only issue every 2 cycles (the MULT latency); the stores overlap:
    mult a[3]=c,a[2] | store a[2]
    mult a[4]=c,a[3] | store a[3]
    …
    mult a[12]=c,a[11] | store a[11]; store a[12]
Length of MULT pipeline determines throughput
20

Fill the pipeline with independent recursive streams…

Intel Sandy Bridge (desktop): 4 cores; 3.5 GHz; SMT
MULT pipeline depth: 5 stages → 1 F / 5 cycles for a recursive update
→ 5 independent updates on a single core!

[Diagram: the 5-stage MULT pipe filled with the five independent streams E(1)*s, D(1)*s, C(1)*s, B(2)*s, A(2)*s]

Thread 0:
do i=1,N
  A(i)=A(i-1)*s
  B(i)=B(i-1)*s
  C(i)=C(i-1)*s
  D(i)=D(i-1)*s
  E(i)=E(i-1)*s
enddo
21

Pipelining: Beyond multiplication

• Typical number of pipeline stages: 2–5 for the hardware pipelines on modern CPUs
• x86 processors (AMD, Intel): 1 MULT & ADD unit per processor core
• No hardware for div / sqrt / exp / sin … → expensive instructions
• Other instructions are also pipelined, e.g. LOAD operand to register (4 cycles)
• “FP costs” in cycles per instruction for the Intel Core2 architecture:

Operation        | y=a+y (y=a*y) | y=a/y | y=sqrt(y) | y=sin(y)
Latency          | 3 (5)         | 32    | 29        | >100
Throughput       | 1 (1)         | 31    | 28        | >100
Cycles/Operation | 0.5*          | 15.5* | 14*       | >100
22

Pipelining: Potential problems (1)

• Hidden data dependencies:
  – C/C++ allows “pointer aliasing”, e.g. A = &C[-1]; B = &C[-2] → C[i] = C[i-1] + C[i-2] → dependency!
  – The compiler cannot resolve potential pointer aliasing conflicts on its own!
  – If no pointer aliasing occurs, tell the compiler, e.g.
    • use the -fno-alias switch of the Intel compiler
    • pass arguments as (double *restrict A, …) (C99 standard only)

void scale_shift(double *A, double *B, double *C, int n) {
  for(int i=0; i<n; ++i)
    C[i] = A[i] + B[i];
}
23

Pipelining: Potential problems (2)

• Simple subroutine/function calls within a loop:

do i=1, N
  call elementprod(A(i), B(i), psum)
  C(i)=psum
enddo
…
function elementprod(a, b, psum)
…
psum=a*b

→ Inline subroutines! (can be done by the compiler…)

do i=1, N
  psum=A(i)*B(i)
  C(i)=psum
enddo
…
24

Pipelining: Potential problems (3a)

Can we use pipelining, or does this cost us 8*3 cycles (assuming a 3-stage ADD pipeline)?
25

Pipelining: Potential problems (3b)

More general – “reduction operations”?

Benchmark: run the above assembly language kernel with N = 32, 64, 128, …, 4096 on a processor with
  – 3.5 GHz clock speed → ClSp = 3500 Mcycle/s
  – 1 pipelined ADD unit (latency 3 cycles)
  – 1 pipelined LOAD unit (latency 4 cycles)

sum=0.d0
do i=1, N
  sum=sum + A(i)
enddo
…

A(i) (incl. LD), sum in register xmm1; i (loop counter), N
ADD: add the 1st argument to the 2nd argument and store the result in the 2nd argument
→ 1 cycle per iteration (after 7 iterations)
26

Pipelining: Potential problems (4)

Expected performance: Throughput * ClockSpeed
Throughput: N/T(N) = N/(L+N)
Assumption: L is the total latency of one iteration, and one result per cycle is delivered after pipeline startup. Total runtime: L+N cycles.
Total latency: L = 4 cycles (LOAD) + 3 cycles (ADD) = 7 cycles
Performance for N iterations: 3500 Mcycle/s * (N/(L+N)) iterations/cycle
Maximum performance (for large N): 3500 Mcycle/s * 1 iteration/cycle = 3500 Miterations/s

[Diagram: the LOAD pipe holds A(i), A(i-1), A(i-2), A(i-3) while the ADD pipe holds A(i-4), A(i-5), A(i-6)]
27

Pipelining: Potential problems (5)

Why?

[Diagram: s = s+A(1) occupies the ADD pipe; A(2) must wait until it has completed before s = s+A(2) can start]

Dependency on sum → the next instruction needs to wait for the completion of the previous one → only 1 out of 3 stages active → 3 cycles per iteration

sum=0.d0
do i=1, N
  sum=sum + A(i)
enddo
…

Throughput here: N/T(N) = N/(L+3*N)
28

Pipelining: Potential problems (6)

Increase pipeline utilization by “loop unrolling”:

sum1=0.d0
sum2=0.d0
do i=1, N, 2
  sum1=sum1+A(i)
  sum2=sum2+A(i+1)
enddo
sum=sum1+sum2

“2-way Modulo Variable Expansion” (N is even)
→ 2 out of 3 pipeline stages can be filled
→ 2 results every 3 cycles → 1.5 cycles/iteration
29

Pipelining: Potential problems (7)

• 4-way Modulo Variable Expansion (MVE) to get the best performance (in principle 3-way should do as well)
• The sum is split up into 4 independent partial sums
• The compiler can do that, if it is allowed to…
• Computer floating point arithmetic is not associative!
• If you require a binary exact result (-fp-model strict), the compiler is not allowed to do this transformation
• L = (7 + 3*3) cycles (prev. slide)
Nr=4*(N/4)
sum1=0.d0
sum2=0.d0
sum3=0.d0
sum4=0.d0
do i=1, Nr, 4
  sum1=sum1+A(i)
  sum2=sum2+A(i+1)
  sum3=sum3+A(i+2)
  sum4=sum4+A(i+3)
enddo
do i=Nr+1, N   ! remainder loop
  sum1=sum1+A(i)
enddo
sum=sum1+sum2+sum3+sum4

“4-way MVE”
30

Pipelining: The Instruction pipeline

• Besides the arithmetic/functional units, instruction execution itself is also pipelined; one instruction performs at least 3 steps:
  1. Fetch instruction from L1I
  2. Decode instruction
  3. Execute instruction

Hardware pipelining on the processor (all units can run concurrently):
  Cycle 1: fetch instruction 1
  Cycle 2: decode instruction 1 | fetch instruction 2
  Cycle 3: execute instruction 1 | decode instruction 2 | fetch instruction 3
  Cycle 4: execute instruction 2 | decode instruction 3 | fetch instruction 4
  …

→ Branches can stall this pipeline! (speculative execution, predication)
→ Each unit is pipelined itself (cf. Execute = multiply pipeline)
31

Pipelining: The Instruction pipeline

• Problem: unpredictable branches to other instructions
  Cycle 1: fetch instruction 1
  Cycle 2: decode instruction 1 | fetch instruction 2
  Cycle 3: execute instruction 1 | decode instruction 2 | fetch instruction 3
  Cycle 4: execute instruction 2 | decode instruction 3 | …

Assume: the result of an instruction determines which instruction is fetched next!
32

Microprocessors – Superscalar

33
Superscalar Processors
• Superscalar processors provide additional hardware (i.e. transistors) to execute multiple instructions per cycle!
• Parallel hardware components / pipelines are available to
  – fetch / decode / issue multiple instructions per cycle (typically 3–6 per cycle)
  – perform multiple integer / address calculations per cycle (e.g. 6 integer units on Itanium2)
  – load (store) multiple operands (results) from (to) cache per cycle (typically one load AND one store per cycle)
  – perform multiple floating point instructions per cycle (typically 2 floating point instructions per cycle, e.g. 1 MULT + 1 ADD)
• On superscalar RISC processors, out-of-order (OOO) execution hardware is available to optimize the usage of the parallel hardware
34

Superscalar Processors – Instruction Level Parallelism

Multiple units enable the use of Instruction Level Parallelism (ILP): the instruction stream is “parallelized” on the fly
Issuing m concurrent instructions per cycle: m-way superscalar
Modern processors are 3- to 6-way superscalar & can perform 2 or 4 floating point operations per cycle
[Diagram: 4-way “superscalar” execution – four fetch/decode/execute pipelines run side by side; in every cycle four instructions are fetched from L1I (instructions 1–4, then 5–8, 9–12, 13–16) and four are decoded and executed in parallel]
35

Superscalar Processors – ILP in action

Complex register management not shown (R4 contains A(i-4))
2-way superscalar: 1 LOAD instruction + 1 ADD instruction completed per cycle
Often cited metrics for superscalar processors:
  Instructions Per Cycle: IPC = 2 above
  Cycles Per Instruction: CPI = 0.5 above
[Diagram: register set with the LOAD pipe delivering A(i)→R0, A(i-1)→R1, A(i-2)→R2, A(i-3)→R3 while the ADD pipe executes R11=R11+R4, R12=R12+R5, R13=R13+R6]

sum1=0.d0 ! reg. R11
sum2=0.d0 ! reg. R12
sum3=0.d0 ! reg. R13
do i=1, N, 3
  sum1=sum1+A(i)
  sum2=sum2+A(i+1)
  sum3=sum3+A(i+2)
enddo
sum=sum1+sum2+sum3

“3-way Modulo Variable Expansion” (N is a multiple of 3)
36

Superscalar processor – Intel Nehalem design

Decode & issue a max. of 4 instructions per cycle: IPC = 4 → min. CPI = 0.25 cycles/instruction
Parallel units: FP ADD & FP MULT (work in parallel), LOAD + STORE (work in parallel)
Max. FP performance: 1 ADD + 1 MULT instruction per cycle
  Max. performance:           A(i) = r0 + r1 * B(i)
  1/2 of max. FP performance: A(i) = r1 * B(i)
  1/3 of max. FP performance: A(i) = A(i) + B(i) * C(i)
37

Microprocessors – Single Instruction Multiple Data (SIMD) processing

Basic idea: apply the same instruction to multiple operands in parallel

38
SIMD-processing – Basics
Single Instruction Multiple Data (SIMD) instructions allow the concurrent execution of the same operation on “wide” registers.

x86_64 SIMD instruction sets:
  – SSE: register width = 128 bit → 2 double (4 single) precision FP operands
  – AVX: register width = 256 bit → 4 double (8 single) precision FP operands
“Scalar” (non-SIMD) execution: 1 single/double operand, i.e. only the lower 64 bit (32 bit) of the registers are used.

Integer operands:
  – SSE can be configured very flexibly: 1 x 128 bit, …, 16 x 8 bit
  – AVX: no support for using the 256 bit register width for integer operations

SIMD execution → vector execution: if the compiler has vectorized a loop, SIMD instructions are used.

39
SIMD-processing – Basics
Example: adding two registers holding double precision floating point operands, using a 256 bit register (AVX).

Scalar execution: R2 ← ADD[R0,R1] uses only the lower 64 bit of each register: C[0] = A[0] + B[0]

SIMD execution: R2 ← V64ADD[R0,R1] adds all four 64 bit entries at once:
  C[0] = A[0] + B[0]
  C[1] = A[1] + B[1]
  C[2] = A[2] + B[2]
  C[3] = A[3] + B[3]

If 128 bit SIMD instructions (SSE) are executed, only half of the register width is used.
40
SIMD-processing – Basics
Steps (done by the compiler) for “SIMD processing”:

Original loop:
for(int i=0; i<n; i++)
  C[i] = A[i] + B[i];

“Loop unrolling” (plus remainder loop handling):
for(int i=0; i<n; i+=4){
  C[i]  = A[i]   + B[i];
  C[i+1]= A[i+1] + B[i+1];
  C[i+2]= A[i+2] + B[i+2];
  C[i+3]= A[i+3] + B[i+3];
}
//remainder loop handling

“Pseudo-assembler”:
LABEL1:
  VLOAD R0 ← A[i]       ! load 256 bits starting at the address of A[i] into register R0
  VLOAD R1 ← B[i]
  V64ADD[R0,R1] → R2    ! add the corresponding 64 bit entries of R0 and R1, store the 4 results in R2
  VSTORE R2 → C[i]      ! store R2 (256 bit) to the address starting at C[i]
  i ← i+4
  i<(n-4)? JMP LABEL1
//remainder loop handling
41
SIMD-processing – Basics
No SIMD processing for loops with data dependencies:

for(int i=0; i<n; i++)
  A[i] = A[i-1]*s;

“Pointer aliasing” may prevent the compiler from SIMD processing:

void scale_shift(double *A, double *B, double *C, int n) {
  for(int i=0; i<n; ++i)
    C[i] = A[i] + B[i];
}

C/C++ allows A = &C[-1] and B = &C[-2] → C[i] = C[i-1] + C[i-2]: dependency → no SIMD processing
If no pointer aliasing occurs, tell the compiler, e.g. use the -fno-alias switch of the Intel compiler → SIMD processing
42
SIMD-processing – Basics

SIMD processing of a vector sum:

double s=0.0;
for(int i=0; i<n; i++)
  s = s + A[i];

The data dependency on s must be resolved for SIMD processing (assume AVX). The compiler does the transformation (Modulo Variable Expansion) – if the programmer allows it to do so (e.g. use -O3 instead of -O1):

s0=0.0; s1=0.0; s2=0.0; s3=0.0;
for(int i=0; i<n; i+=4){
  s0 = s0 + A[i];
  s1 = s1 + A[i+1];
  s2 = s2 + A[i+2];
  s3 = s3 + A[i+3];
}
//remainder
s = s0+s1+s2+s3;

In registers: R0 ← (0.d0, 0.d0, 0.d0, 0.d0); the loop body becomes … V64ADD(R0,R1) → R0 …; a final “horizontal” ADD sums up the 4 64-bit entries of R0.
43

SIMD-processing: What about pipelining?!

R0 ← (0.0, 0.0, 0.0, 0.0)
do i=1, N, 4
  VLOAD A(i:i+3) → R1
  V64ADD(R0,R1) → R0
enddo
sum ← HorizontalADD(R0)

→ Need another MVE step to fill the pipeline stages:

R0 ← (0.0, 0.0, 0.0, 0.0)
R1 ← (0.0, 0.0, 0.0, 0.0)
R2 ← (0.0, 0.0, 0.0, 0.0)
do i=1, N, 12
  LOAD A(i:i+3) → R3
  LOAD A(i+4:i+7) → R4
  LOAD A(i+8:i+11) → R5
  V64ADD(R0,R3) → R0   ! “vertical add”
  V64ADD(R1,R4) → R1
  V64ADD(R2,R5) → R2
enddo
…
V64ADD(R0,R1) → R0
V64ADD(R0,R2) → R0
sum ← HorizontalADD(R0)   ! “horizontal add”
44

SIMD-processing: What about pipelining?!

[Figure: performance of the vectorized vs. scalar vector sum (double precision) over N]
  – Unrolling factor of the vectorized code: 1 AVX iteration performs 4 successive i-iterations
  – Performance: 4x higher than the “scalar” version
  – Start-up phase is much longer…
45

Compiler generated AVX code (loop body)

Baseline version (“scalar”): no pipelining, no SIMD → 3 cycles / iteration
Compiler generated “AVX version” (-O3 -xAVX):
  – SIMD processing: vaddpd %ymm8 → 4 dp operands (4-way unrolling)
  – Pipelining: 8-way MVE of the SIMD code
  → 0.25 cycles / iteration (32-way unrolling in total)
SIMD processing – Vector sum (double precision) – 1 core

SIMD: most impact if data is close to the core – other bottlenecks stay the same!

[Figure: performance vs. location of the input data (A[]) in the memory hierarchy, with the peak marked; “Scalar”: code execution in the core is slower than any data transfer; “Plain”: no SIMD but 4-way MVE; AVX/SIMD: full benefit only if data is in the L1 cache]
47

Data parallel SIMD processing

• Requires independent vector-like operations (“data parallel”)
• The compiler is required to generate “vectorized” code → check the compiler output
• Check for the use of “packed SIMD” instructions at runtime (likwid) or in the assembly code
• Packed SIMD may impose alignment constraints, e.g. 16-byte alignment for efficient load/store on Intel Core2 architectures
• Check also for SIMD LOAD / STORE instructions
• Use of packed SIMD instructions reduces the overall number of instructions (typical theoretical max. of 4 instructions / cycle) → SIMD code may improve performance but reduce CPI!
48

Data parallel SIMD processing: Boosting performance

• Putting it all together: modern x86_64 based Intel / AMD processor
  – One FP MULTIPLY and one FP ADD pipeline can run in parallel, each with a throughput of one FP instruction/cycle (FMA units on AMD Interlagos) → maximum 2 FP instructions/cycle
  – Each pipeline operates on 128 (256) bit registers for packed SSE (AVX) instructions → 2 (4) double precision FP operations per SSE (AVX) instruction → 4 (8) FP operations / cycle (1 MULT & 1 ADD on 2 (4) operands)

Peak performance of a 3 GHz CPU (core):
  SSE: 12 GFlop/s or AVX: 24 GFlop/s (double precision)
  SSE: 24 GFlop/s or AVX: 48 GFlop/s (single precision)
BUT for “scalar” code: 6 GFlop/s (double and single precision)!
49
Maximum floating point (FP) performance:

  P_core = F * S * n

  F = FP instructions per cycle: 2 (1 MULT and 1 ADD)
  S = FP operations per instruction: 4 (dp) / 8 (sp)  (256 bit SIMD registers – “AVX”)
  n = clock speed: ~2.5 GHz

→ P = 20 GF/s (dp) / 40 GF/s (sp)

There is no single driving force for single core performance!

Scalar (non-SIMD) execution: S = 1 FP op/instruction (dp/sp) → P = 5 GF/s (dp/sp)
50

SIMD registers: floating point (FP) data and beyond

Possible data types in an SSE register (AVX only applies to FP data; not to scale):
Integer (SSE):        16 x 8 bit | 8 x 16 bit | 4 x 32 bit | 2 x 64 bit | 1 x 128 bit
Floating point (SSE): 4 x 32 bit | 2 x 64 bit
Floating point (AVX): 8 x 32 bit | 4 x 64 bit
51
Rules for vectorizable loops / SIMD processing:
1. Countable
2. Single entry and single exit
3. Straight line code
4. No function calls (exception: intrinsic math functions)

Better performance with:
5. Simple inner loops with unit stride
6. Minimized indirect addressing
7. Aligned data structures (SSE: 16 bytes, AVX: 32 bytes)
8. In C, use of the restrict keyword for pointers to rule out aliasing

Obstacles for vectorization:
  – Non-contiguous memory access
  – Data dependencies
52
How to leverage vectorization / SIMD (complexity and efficiency increase down the list):
1. The compiler does it for you (aliasing, alignment, language)
2. Source code directives (pragmas) to ease the compiler’s job
3. Alternative programming models for compute kernels (OpenCL, ispc)
4. Intrinsics (restricted to C/C++)
5. Implement directly in assembler
53
Vectorization and the Intel compiler:
  – The Intel compiler will try to use SIMD instructions when enabled to do so (“poor man’s vector computing”)
  – The compiler can emit messages about vectorized loops (not by default):
      plain.c(11): (col. 9) remark: LOOP WAS VECTORIZED.
  – Use the option -vec_report3 to get full compiler output about which loops were vectorized, which were not, and why (data dependencies!)
  – Some obstructions will prevent the compiler from applying vectorization even if it is possible (e.g. -fp-model strict for the vector sum, or pointer aliasing)
  – You can use source code directives to provide more information to the compiler
54
Vectorization compiler options:
  – The compiler will vectorize starting with -O2
  – To enable specific SIMD extensions, use the -x option, e.g. -xSSE2 → vectorize for SSE2-capable machines. Available SIMD extensions: SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, AVX
  – -xAVX on Sandy Bridge processors
  – Recommended option: -xHost will optimize for the architecture you compile on
  – Compiling for AMD Opteron: use plain -O3, as the -x options may involve CPU type checks at runtime!
55
Vectorization source code directives (pragmas):
Fine-grained control of loop vectorization. Use the !DEC$ (Fortran) or #pragma (C/C++) sentinel to start a compiler directive:
  #pragma vector always      – vectorize even if it seems inefficient (hint!)
  #pragma novector           – do not vectorize, even if possible
  #pragma vector nontemporal – use NT stores when allowed (i.e. alignment conditions are met)
  #pragma vector aligned     – specifies that all array accesses are aligned to 16-byte boundaries (DANGEROUS! You must not lie about this!)
56
User mandated vectorization (pragmas):
  – Starting with Intel Compiler 12.0, the simd pragma is available: #pragma simd enforces vectorization where the other pragmas fail
  – Prerequisites: countable loop, innermost loop, must conform to the for-loop style of OpenMP worksharing constructs
  – There are additional clauses: reduction, vectorlength, private → refer to the compiler manual for further details
  – NOTE: with #pragma simd the compiler may generate incorrect code if the loop violates the vectorization rules!

#pragma simd reduction(+:x)
for (int i=0; i<n; i++) {
  x = x + A[i];
}
57
Basic approach to check the instruction code:
  – Get the assembler code (Intel compiler): icc -S -O3 -xHost triad.c -o triad.s
  – Disassemble the executable: objdump -d ./cacheBench | less

Things to check for:
  – Is the code vectorized? Search for the pd/ps suffix: mulpd, addpd, vaddpd, vmulpd
  – Is the data loaded with 16-byte moves? movapd, movaps, vmovupd
  – For memory-bound code: search for nontemporal stores: movntpd, movntps

The x86 ISA is documented in:
  – Intel Software Development Manual (SDM) 2A and 2B
  – AMD64 Architecture Programmer's Manual Vol. 1–5
58
Some basics of the x86-64 ISA:
  – 16 general purpose registers (64 bit): rax, rbx, rcx, rdx, rsi, rdi, rsp, rbp, r8–r15; alias with the eight 32 bit registers: eax, ebx, ecx, edx, esi, edi, esp, ebp
  – Floating point SIMD registers: xmm0–xmm15 SSE (128 bit), alias with the 256 bit registers ymm0–ymm15 AVX (256 bit)
  – SIMD instructions are distinguished by:
      AVX (VEX) prefix: v
      Operation: mul, add, mov
      Modifier: nontemporal (nt), unaligned (u), aligned (a), high (h)
      Data range: packed (p), scalar (s)
      Data type: single (s), double (d)
59
Some basic single core optimizations – warnings first
“Premature optimization is the root of all evil.” – Donald E. Knuth
“Parallel performance is easy, single node/core performance is difficult.” – Bill Gropp
60
Single core: Common sense optimizations (1)
Do less work! Reducing the work to be done is never a bad idea!

Before:
logical :: flag
flag = .false.
do i=1,N
  if(complex_func(A(i)) < THRESHOLD) then   ! check if at
    flag=.true.                             ! least one
  endif                                     ! is true
enddo

After:
logical :: flag
flag = .false.
do i=1,N
  if(complex_func(A(i)) < THRESHOLD) then   ! check if at
    flag=.true.                             ! least one
    EXIT                                    ! is true and
  endif                                     ! EXIT the do loop
enddo
61
Single core: Common sense optimizations (2)
Avoid expensive operations!
  – FP MULT & FP ADD are the fastest way to compute
  – Avoid DIV / SQRT / SIN / COS / TAN, … → table lookup
  – Avoid a one-to-one implementation of the algorithm, e.g. A = A + B**2 (A, B float):
      A = A + B**2.0  (V1)
      A = A + B * B   (V2)

(V1) is not a good idea: B**2.0 → exp{2.0*ln(B)}
  1. Computing exp & ln is very expensive
  2. B < 0 ?!

Most useful if data is close to the CPU!
62
Single core: Common sense optimizations (3)
Shrink the working set! Working on small data sets reduces the data transfer volume and increases the probability of cache hits.
Analyze whether appropriate data types are used, e.g. if 4 different particle species have to be distinguished:

integer spec(1:N)     ! spec(i) = {0,1,2,3} → sizeof(spec(1:N)) = 4*N byte

OR use a 1-byte integer datatype:
integer*1 spec(1:N)   ! sizeof(spec(1:N)) = N byte

OR use 2 bits for each species:
integer*1 spec(1:N/4) ! sizeof(spec(1:N)) = N/4 byte

Strongly depends on the application!
63
Single core: Common sense optimizations (4a)
Eliminate common subexpressions! This often reduces the MFLOP/s rate but improves the runtime.
In principle the compiler should do the job, but do not rely on it:
  – Associativity rules may prevent the compiler from doing so
  – The compiler may not recognize the opportunity (limited scope)
64
Single core: Common sense optimizations (4b)
Replace expensive functions by table lookup
do iter = …
  …
  do i=1,n
    …
    edelz = iL(i)+iR(i)+iU(i)+iO(i)+iS(i)+iN(i)
    BF = 0.5d0*(1.d0+tanh(edelz))
    …
  enddo
  …
enddo

Entries iL, …, iN ∈ {-1, 0, 1} → edelz ∈ {-6, -5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5, 6}
→ Compute all 13 potential values of tanh(edelz) beforehand and store them in a table with 13 entries: tanh_table(-6:6)
65
Single core: Common sense optimizations (4c)
Replace expensive functions by table lookup
30. April 2013 CAMA 2013 - D. Fey and G. Wellein
do i=-6,6 tanh_table(i) = tanh(i)enddodo iter = … … … do i=1,n … edelz=iL(i)+iR(i)+iU(i)+iO(i)+iS(i)+iN(i) BF= 0.5d0*(1.d0+ tanh_table(edelz)) … … enddo … …enddo
66
Single core: Common sense optimizations (5)
Avoid branches! Support the compiler in understanding and optimizing your code!
A code change may enable vectorization, SIMD & other optimizations.
BTW: software pipelining is also much easier, and no branch prediction is required.