Optimization and tuning techniques of lattice QCD for Blue Gene
Jun Doi
Tokyo Research Laboratory
IBM Japan
Agenda
Part I:
– Optimization of lattice QCD program using double FPU instructions
Part II:
– Parallelization of lattice QCD and optimization of communication
Part I:
Optimization of lattice QCD program using double FPU instructions
Optimization of lattice QCD for Blue Gene
Our lattice QCD program
– Wilson’s method
– Original program is written in C++
Optimization using double FPU instructions
– We used inline assembly to optimize complex arithmetic
• We have to schedule the instructions ourselves instead of relying on the compiler
Wilson-Dirac operator
For each of the 4 spin components, the 3 color components are mixed by multiplying with the 3x3 gauge matrix, for the 8 directions x+, x-, y+, y-, z+, z- , t+ and t-
ψ : 4x3 spinor • U : 3x3 gauge matrix • U† : Hermitian conjugate of U • (1+γ), (1-γ) : 4x4 projector matrices
D ψ(x,y,z,t) =
    U_x(x,y,z,t) (1-γ_x) ψ(x+1,y,z,t) + U_x†(x-1,y,z,t) (1+γ_x) ψ(x-1,y,z,t)
  + U_y(x,y,z,t) (1-γ_y) ψ(x,y+1,z,t) + U_y†(x,y-1,z,t) (1+γ_y) ψ(x,y-1,z,t)
  + U_z(x,y,z,t) (1-γ_z) ψ(x,y,z+1,t) + U_z†(x,y,z-1,t) (1+γ_z) ψ(x,y,z-1,t)
  + U_t(x,y,z,t) (1-γ_t) ψ(x,y,z,t+1) + U_t†(x,y,z,t-1) (1+γ_t) ψ(x,y,z,t-1)
  = uxp(x,y,z,t) + uxm(x,y,z,t) + uyp(x,y,z,t) + uym(x,y,z,t)
  + uzp(x,y,z,t) + uzm(x,y,z,t) + utp(x,y,z,t) + utm(x,y,z,t)
(Figure: lattice site (x,y,z,t), its neighbors (x-1,y,z,t), (x+1,y,z,t), (x,y-1,z,t), (x,y+1,z,t), and the links Ux(x,y,z,t), Ux(x-1,y,z,t), Uy(x,y,z,t), Uy(x,y-1,z,t) connecting them.)
uxp : Part of Wilson-Dirac operator for X plus direction
uxp(x) = U_x(x) (1 - γ_x) ψ(x+1)

Projector (1 - γ_x):
  [  1   0   0   i ]
  [  0   1   i   0 ]
  [  0  -i   1   0 ]
  [ -i   0   0   1 ]

Multiplying by this symmetric projector, we can merge the 4 spinor components into 2 spinors to calculate uxp:
I.   Merging 4 spinors into 2:
       C = ψ1(x+1) + i ψ4(x+1)
       D = ψ2(x+1) + i ψ3(x+1)
II.  Multiplying the 2 spinors and the gauge matrix:
       A = U_x(x) C
       B = U_x(x) D
III. Adding to the 4 spinors:
       uxp(x) = ( A, B, -iB, -iA )
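The three steps written out in plain C/C++ with explicit real and imaginary parts instead of double FPU instructions. This is a minimal sketch only: the cplx type, the helper names and the exact sign convention of the projector are illustrative assumptions, not the original code.

typedef struct { double re, im; } cplx;

static cplx cadd (cplx a, cplx b) { cplx r = { a.re + b.re, a.im + b.im }; return r; }
static cplx ciadd(cplx a, cplx b) { cplx r = { a.re - b.im, a.im + b.re }; return r; } /* a + i*b */
static cplx cisub(cplx a, cplx b) { cplx r = { a.re + b.im, a.im - b.re }; return r; } /* a - i*b */
static cplx cmul (cplx a, cplx b) { cplx r = { a.re*b.re - a.im*b.im, a.re*b.im + a.im*b.re }; return r; }

void uxp_site(const cplx psi_in[4][3], const cplx u[3][3], cplx result[4][3])
{
    cplx C[3], D[3], A[3], B[3];
    for (int c = 0; c < 3; c++) {                  /* I.  merge 4 spinors into 2      */
        C[c] = ciadd(psi_in[0][c], psi_in[3][c]);  /*     C = psi1 + i*psi4           */
        D[c] = ciadd(psi_in[1][c], psi_in[2][c]);  /*     D = psi2 + i*psi3           */
    }
    for (int r = 0; r < 3; r++) {                  /* II. multiply 3x3 gauge matrix   */
        A[r] = cmul(u[r][0], C[0]);
        B[r] = cmul(u[r][0], D[0]);
        for (int c = 1; c < 3; c++) {
            A[r] = cadd(A[r], cmul(u[r][c], C[c]));
            B[r] = cadd(B[r], cmul(u[r][c], D[c]));
        }
    }
    for (int c = 0; c < 3; c++) {                  /* III. add back to the 4x3 spinor */
        result[0][c] = cadd (result[0][c], A[c]);
        result[1][c] = cadd (result[1][c], B[c]);
        result[2][c] = cisub(result[2][c], B[c]);  /* psi3 += -i*B */
        result[3][c] = cisub(result[3][c], A[c]);  /* psi4 += -i*A */
    }
}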
Floating point register usage for u?p, u?m
– 4x3 spinor to accumulate the result = 12 registers
– 4x3 spinor of the neighboring lattice point (input) = 12 registers
– 3x3 gauge matrix = 9 registers
– That already makes 33 registers, and additional registers are needed for constant values, so registers are reused:
  FP0 to FP11 : 12 registers to accumulate the result (always loaded, shared by all directions)
  FP12 to FP17: 6 registers for the gauge matrix (the last 3 elements are loaded after the first 3 have been multiplied)
  FP18 to FP29: 12 registers for the input: 6 for the 2 merged spinors and 6 to save the result of the gauge-matrix multiply
  FP30, FP31  : 2 registers for other values (constants)
Step I : Merging 4 spinors into 2 spinors
Input: the 4x3 spinor of the neighboring lattice point, ψ1(x+1) ... ψ4(x+1):
  spinor 1 R/G/B in FR18-FR20, spinor 2 R/G/B in FR21-FR23,
  spinor 3 R/G/B in FR24-FR26, spinor 4 R/G/B in FR27-FR29
Merged 2 spinors (overwriting FR18-FR23):
  C = ψ1(x+1) + i ψ4(x+1)  ->  C R/G/B in FR18-FR20
  D = ψ2(x+1) + i ψ3(x+1)  ->  D R/G/B in FR21-FR23
The merge is the complex operation v = x + i * y:
  Re(v) = Re(x) - Im(y), Im(v) = Im(x) + Re(y)
double unit[2] = {1,1};
LFPDX 31,unit
…
LFPDX 18,spinor1_R
…
LFPDX 27,spinor4_R
…
FXCXNPMA 18,31,27,18
FXCXNPMA instruction
The double FPU has a primary register file (FR0~FR31) and a secondary register file (FR32~FR63), each with its own multiply and add pipeline; for a complex number the primary register holds the real part and the secondary register the imaginary part. For the example below, FXCXNPMA 18,31,27,18 computes
  FR18 = FR18 - FR63 x FR59   (primary:   Re(x) - 1 * Im(y))
  FR50 = FR50 + FR63 x FR27   (secondary: Im(x) + 1 * Re(y))
where FR50, FR59 and FR63 are the secondary halves of FR18, FR27 and FR31, so with unit = {1,1} in FR31 the result is v = x + i * y.
double unit[2] = {1,1};
LFPDX 31,unit
…
LFPDX 18,spinor1_R
…
LFPDX 27,spinor4_R
…
FXCXNPMA 18,31,27,18
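Written out as plain C/C++ on (real, imaginary) pairs, the behaviour described above looks as follows. This is only a model for reading the assembly listings, not the hardware specification.

typedef struct { double p, s; } fpr;   /* p = primary half (real), s = secondary half (imaginary) */

/* Model of FXCXNPMA T,A,C,B as described above:
     primary(T)   = primary(B)   - secondary(A) * secondary(C)
     secondary(T) = secondary(B) + secondary(A) * primary(C)
   With A = {1,1} this yields T = B + i*C, the merge used in step I. */
fpr fxcxnpma_model(fpr a, fpr c, fpr b)
{
    fpr t;
    t.p = b.p - a.s * c.s;
    t.s = b.s + a.s * c.p;
    return t;
}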
Step II : Multiplying spinor and 3x3 gauge matrix (for +directions)
y[0] = u[0][0] * x[0] + u[0][1] * x[1] + u[0][2] * x[2];
y[1] = u[1][0] * x[0] + u[1][1] * x[1] + u[1][2] * x[2];
y[2] = u[2][0] * x[0] + u[2][1] * x[1] + u[2][2] * x[2];
u[0][0] * x[0] : Multiplying 2 complex numbers
FXPMUL (y[0],u[0][0],x[0])
FXCXNPMA (y[0],u[0][0],x[0],y[0])
+ u[0][1] * x[1] + u[0][2] * x[2]; Using FMA instructions
FXCPMADD (y[0],u[0][1],x[1],y[0])
FXCXNPMA (y[0],u[0][1],x[1],y[0])
FXCPMADD (y[0],u[0][2],x[2],y[0])
FXCXNPMA (y[0],u[0][2],x[2],y[0])
re(y[0])=re(u[0][0])*re(x[0])
im(y[0])=re(u[0][0])*im(x[0])
re(y[0])+=-im(u[0][0])*im(x[0])
im(y[0])+=im(u[0][0])*re(x[0])
x: input spinor
y: output spinor
u : 3x3 gauge matrix
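The same y[0] computation written as scalar C/C++, one statement per half of each double FPU instruction; a sketch for reference only, with u_row standing for one row of the gauge matrix.

typedef struct { double re, im; } cplx;

/* One row of y = u * x, mirroring the instruction order above:
   FXPMUL/FXCPMADD produce the re(u)*x terms, FXCXNPMA adds the im(u)*x terms. */
cplx row_times_vector(const cplx u_row[3], const cplx x[3])
{
    cplx y;
    y.re = u_row[0].re * x[0].re;            /* FXPMUL   */
    y.im = u_row[0].re * x[0].im;
    y.re -= u_row[0].im * x[0].im;           /* FXCXNPMA */
    y.im += u_row[0].im * x[0].re;
    for (int k = 1; k < 3; k++) {
        y.re += u_row[k].re * x[k].re;       /* FXCPMADD */
        y.im += u_row[k].re * x[k].im;
        y.re -= u_row[k].im * x[k].im;       /* FXCXNPMA */
        y.im += u_row[k].im * x[k].re;
    }
    return y;
}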
FXPMUL and FXCPMADD instruction
As before, the primary half (FR0~FR31) of each register pair holds the real part and the secondary half (FR32~FR63) the imaginary part, and each half has its own multiply and add pipeline.
FXPMUL 5,3,4 computes
  FR5  = FR3 x FR4    (primary)
  FR37 = FR3 x FR36   (secondary)
i.e. the primary part of A (FR3) is multiplied into both halves of C (FR4/FR36).
FXCPMADD 10,0,1,5 computes
  FR10 = FR0 x FR1  + FR5    (primary)
  FR42 = FR0 x FR33 + FR37   (secondary)
i.e. the same multiplication with the accumulator B (FR5/FR37) added.
Multiplying Hermitian gauge matrix (for -directions)
y[0] = ~u[0][0] * x[0] + ~u[1][0] * x[1] + ~u[2][0] * x[2];
y[1] = ~u[0][1] * x[0] + ~u[1][1] * x[1] + ~u[2][1] * x[2];
y[2] = ~u[0][2] * x[0] + ~u[1][2] * x[1] + ~u[2][2] * x[2];
~u[0][0] * x[0] : Multiplying by the complex conjugate
FXPMUL (y[0],u[0][0],x[0])
FXCXNSMA (y[0],u[0][0],x[0],y[0])
+ ~u[1][0] * x[1] + ~u[2][0] * x[2]; Using FMA instructions
FXCPMADD (y[0],u[1][0],x[1],y[0])
FXCXNSMA (y[0],u[1][0],x[1],y[0])
FXCPMADD (y[0],u[2][0],x[2],y[0])
FXCXNSMA (y[0],u[2][0],x[2],y[0])
re(y[0])=re(u[0][0])*re(x[0])
im(y[0])=re(u[0][0])*im(x[0])
re(y[0])+=im(u[0][0])*im(x[0])
im(y[0])+=-im(u[0][0])*re(x[0])
Multiplying by the Hermitian matrix is done as shown above.
x: input spinor
y: output spinor
~u : complex conjugate of u
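As a scalar C/C++ sketch of one output row for the - directions (u_col is one column of u, whose conjugate is used); names are illustrative, not from the original program.

typedef struct { double re, im; } cplx;

/* One row of y = u† * x, mirroring FXPMUL / FXCPMADD / FXCXNSMA:
   conjugating u only flips the sign of the im(u) terms compared with the + direction. */
cplx conj_col_times_vector(const cplx u_col[3], const cplx x[3])
{
    cplx y = { 0.0, 0.0 };
    for (int k = 0; k < 3; k++) {
        y.re += u_col[k].re * x[k].re;       /* FXPMUL (k = 0) / FXCPMADD */
        y.im += u_col[k].re * x[k].im;
        y.re += u_col[k].im * x[k].im;       /* FXCXNSMA                  */
        y.im -= u_col[k].im * x[k].re;
    }
    return y;
}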
Optimization of instruction pipeline
y[0] = u[0][0] * x[0] + u[0][1] * x[1] + u[0][2] * x[2];
y[1] = u[1][0] * x[0] + u[1][1] * x[1] + u[1][2] * x[2];
y[2] = u[2][0] * x[0] + u[2][1] * x[1] + u[2][2] * x[2];
FXPMUL (y[0],u[0][0],x[0])
FXPMUL (y[1],u[1][0],x[0])
FXPMUL (y[2],u[2][0],x[0])
FXCXNPMA (y[0],u[0][0],x[0],y[0])
FXCXNPMA (y[1],u[1][0],x[0],y[1])
FXCXNPMA (y[2],u[2][0],x[0],y[2])
FXCPMADD (y[0],u[0][1],x[1],y[0])
...
Only 3 cycles separate each result from the instruction that uses it: the pipeline stalls
FXPMUL (yc[0],u[0][0],xa[0])
FXPMUL (yd[0],u[0][0],xb[0])
FXPMUL (yc[1],u[1][0],xa[0])
FXPMUL (yd[1],u[1][0],xb[0])
FXPMUL (yc[2],u[2][0],xa[0])
FXPMUL (yd[2],u[2][0],xb[0])
FXCXNPMA (yc[0],u[0][0],xa[0],yc[0])
FXCXNPMA (yd[0],u[0][0],xb[0],yd[0])
FXCXNPMA (yc[1],u[1][0],xa[0],yc[1])
FXCXNPMA (yd[1],u[1][0],xb[0],yd[1])
FXCXNPMA (yc[2],u[2][0],xa[0],yc[2])
FXCXNPMA (yd[2],u[2][0],xb[0],yd[2])
FXCPMADD (yc[0],u[0][1],xa[1],yc[0])
...
6 cycles separate each result from the instruction that uses it: the pipeline does not stall
Calculation order: instead of 1 multiplication at a time, the 2 multiplications Yc = u*Xa and Yd = u*Xb for 2 spinors are interleaved.
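The same interleaving idea in plain C/C++: the two output rows are accumulated in alternating statements, which is what the hand-scheduled assembly above does with registers. Names are illustrative only.

typedef struct { double re, im; } cplx;

/* Accumulate yc = u_row * xa and yd = u_row * xb in alternating statements,
   giving two independent dependency chains so no result is needed by the
   immediately following operation. */
void two_rows_interleaved(const cplx u_row[3], const cplx xa[3], const cplx xb[3],
                          cplx *yc, cplx *yd)
{
    cplx c = { 0.0, 0.0 }, d = { 0.0, 0.0 };
    for (int k = 0; k < 3; k++) {
        c.re += u_row[k].re * xa[k].re;   d.re += u_row[k].re * xb[k].re;
        c.im += u_row[k].re * xa[k].im;   d.im += u_row[k].re * xb[k].im;
        c.re -= u_row[k].im * xa[k].im;   d.re -= u_row[k].im * xb[k].im;
        c.im += u_row[k].im * xa[k].re;   d.im += u_row[k].im * xb[k].re;
    }
    *yc = c;
    *yd = d;
}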
Loading the gauge matrix into registers

For the minus directions (uxm, uym, uzm, utm operators):
  y[0] = ~u[0][0] * x[0] + ~u[1][0] * x[1] + ~u[2][0] * x[2]
  y[1] = ~u[0][1] * x[0] + ~u[1][1] * x[1] + ~u[2][1] * x[2]
  y[2] = ~u[0][2] * x[0] + ~u[1][2] * x[1] + ~u[2][2] * x[2]
  Calculation order: y = ~u[0]*x[0]; y += ~u[1]*x[1]; y += ~u[2]*x[2]
  This follows the order of the matrix in the array, so the matrix can be loaded sequentially, reusing FR12-FR14 for the last row:
  lfpdux u,12 //u[0][0]   lfpdux u,13 //u[0][1]   lfpdux u,14 //u[0][2]
  lfpdux u,15 //u[1][0]   lfpdux u,16 //u[1][1]   lfpdux u,17 //u[1][2]
  lfpdux u,12 //u[2][0]   lfpdux u,13 //u[2][1]   lfpdux u,14 //u[2][2]

For the plus directions (uxp, uyp, uzp, utp operators):
  y[0] = u[0][0] * x[0] + u[0][1] * x[1] + u[0][2] * x[2]
  y[1] = u[1][0] * x[0] + u[1][1] * x[1] + u[1][2] * x[2]
  y[2] = u[2][0] * x[0] + u[2][1] * x[1] + u[2][2] * x[2]
  Calculation order: y = u*x[0]; y += u*x[1]; y += u*x[2]
  This consumes the matrix column by column; to still load the matrix sequentially, 2 additional temporary registers (FR30, FR31) are used:
  lfpdux u,14 //u[0][0]   lfpdux u,15 //u[0][1]   lfpdux u,12 //u[0][2]
  lfpdux u,17 //u[1][0]   lfpdux u,16 //u[1][1]   lfpdux u,13 //u[1][2]
  lfpdux u,30 //u[2][0]   lfpdux u,31 //u[2][1]   lfpdux u,14 //u[2][2]
Step III: Adding result to 4x3 spinor
The result A (R/G/B in FR24-FR26) and B (R/G/B in FR27-FR29) is added to the 4 spinor components ψ1 ... ψ4 held in FR0-FR2, FR3-FR5, FR6-FR8 and FR9-FR11, as (A, B, -iB, -iA).

For spinor 1 and 2: v = v + w
  Re(v) = Re(v) + Re(w), Im(v) = Im(v) + Im(w)
  LFPDX 0,spinor1_R
  LFPDX 24,A_R
  FPADD 0,0,24

For spinor 3 and 4: v = v - i * w (FXCXNSMA to subtract)
  Re(v) = Re(v) + Im(w), Im(v) = Im(v) - Re(w)
  double unit[2] = {1,1};
  LFPDX 31,unit
  LFPDX 9,spinor4_R
  LFPDX 24,A_R
  FXCXNSMA 9,31,24,9
Multiplying κ into the 4x3 spinor

D ψ(x,y,z,t) = κ [ U_x(x,y,z,t)(1-γ_x)ψ(x+1,y,z,t) + U_x†(x-1,y,z,t)(1+γ_x)ψ(x-1,y,z,t) + ... (same for y, z and t) ]
             = κ uxp(x,y,z,t) + κ uxm(x,y,z,t) + κ uyp(x,y,z,t) + κ uym(x,y,z,t)
             + κ uzp(x,y,z,t) + κ uzm(x,y,z,t) + κ utp(x,y,z,t) + κ utm(x,y,z,t)

Multiplying κ into every u?p, u?m operator:
– This change increases the amount of calculation, but does not increase the number of double FPU instructions
– It allows out-of-order calculation of the operators
Adding the result to the 4x3 spinor with the multiplication by κ folded in

The result A (FR24-FR26) and B (FR27-FR29) is added to the 4 spinors in FR0-FR11 as (κA, κB, -iκB, -iκA).

For spinor 1 and 2: v = v + κ * w
  Re(v) = Re(v) + κ*Re(w), Im(v) = Im(v) + κ*Im(w)
  double kappa[2] = {κ, κ};
  LFPDX 31,kappa
  LFPDX 0,spinor1_R
  LFPDX 24,A_R
  FXCPMADD 0,31,24,0
  (instead of FPADD 0,0,24 in the version without κ)

For spinor 3 and 4: v = v - κ * i * w
  Re(v) = Re(v) + κ*Im(w), Im(v) = Im(v) - κ*Re(w)
  LFPDX 9,spinor4_R
  LFPDX 24,A_R
  FXCXNSMA 9,31,24,9
  (the same instruction as without κ, but with {κ,κ} in FR31 instead of unit = {1,1})
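In scalar C/C++ terms the change amounts to the following two accumulation helpers; a sketch with names that are not from the original program.

typedef struct { double re, im; } cplx;

/* Without kappa:  v += w        (FPADD, or FXCXNSMA with unit = {1,1})
   With kappa:     v += kappa*w  (FXCPMADD, or FXCXNSMA with {kappa,kappa})
   Either way it is one fused multiply-add per element, so folding kappa in
   costs no extra double FPU instructions. */
void add_scaled(cplx *v, cplx w, double kappa)          /* v += kappa * w      */
{
    v->re += kappa * w.re;
    v->im += kappa * w.im;
}

void add_scaled_minus_i(cplx *v, cplx w, double kappa)  /* v += -i * kappa * w */
{
    v->re += kappa * w.im;
    v->im -= kappa * w.re;
}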
Part II:
Parallelization of lattice QCD and optimization of communication
Parallelization of lattice QCD and optimization of communication
Decreasing communication time as much as we can
– Limit data exchange to neighboring nodes on the torus
  • Shortest path, and the data exchanges never conflict
– Map the 4D lattice onto the torus network
MPI is rich enough to do such limited communication,
– but the overhead of calling an MPI function is too large under these restricted conditions
We used the torus packet HW directly
– Very low latency to send and receive data
– We can send/receive directly from/to the double FPU registers
  • We do not need buffers in memory
– We can overlap the sends in 6 directions with local computation
  • We can hide the communication time
Parallelization of lattice QCD
(Figure: the X-Y plane of the lattice, solved on 1 CPU vs. divided among CPUs; the division is mapped onto the physical topology of the network to avoid conflicts in the data exchange.)
Mapping the lattice onto the torus network of Blue Gene
How to divide the 4D lattice onto the 3D torus network:
– In virtual node mode, communication between the 2 CPUs in the same compute node acts as a 4th torus dimension
We mapped x, y, z, t of the 4D lattice onto T, X, Y, Z of this virtual 4D torus
– The fastest communication is between the 2 CPUs (CPU0, CPU1) in a compute node
– x of the lattice is the innermost index of the spinor and gauge arrays
(Figure: lattice of QCD (x, y, z, t) mapped onto the torus network of Blue Gene (T, X, Y, Z))
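A sketch of the index arithmetic implied by this mapping; the local sizes, the function name and the ordering are assumptions for illustration, not the original code.

typedef struct { int t, x, y, z; } TorusCoord;   /* virtual 4D torus coordinate */

/* Hypothetical sketch of the mapping: lattice x,y,z,t -> torus T,X,Y,Z,
   where T is the pair of CPUs inside one compute node.  LX..LT are the
   local lattice sizes per CPU. */
TorusCoord site_to_cpu(int gx, int gy, int gz, int gt,
                       int LX, int LY, int LZ, int LT)
{
    TorusCoord n;
    n.t = gx / LX;   /* lattice x -> T: split between the 2 CPUs of a node */
    n.x = gy / LY;   /* lattice y -> torus X */
    n.y = gz / LZ;   /* lattice z -> torus Y */
    n.z = gt / LT;   /* lattice t -> torus Z */
    return n;
}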
Data exchange by torus packet network
A packet consists of a 16-byte header (destination, size, etc.) followed by the payload in 16-byte units; the packet size is a multiple of 32 bytes, up to 256 bytes including the header.
Sending data: store the data to the memory-mapped send FIFO (X+, X-, Y+, ... , Z-) with 16-byte parallel stores directly from the double FPU registers; this sends it to the neighboring node.
Receiving data: load the data from the receive FIFO into the double FPU registers with 16-byte parallel loads.
The 6 FIFOs are independent and can transfer data at the same time.
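As a rough C/C++ picture of the packet layout just described; the field names are illustrative, not the hardware definition.

/* Illustrative layout only -- the real header format is defined by the
   torus hardware.  A packet is a 16-byte header followed by the payload
   in 16-byte units; the total size is a multiple of 32 bytes, at most
   256 bytes including the header. */
typedef struct {
    unsigned char header[16];   /* destination, size, etc. */
    double        payload[30];  /* up to 240 bytes = 15 chunks of 16 bytes */
} TorusPacket;                  /* 256 bytes at the maximum packet size */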
Exchanging 2 spinors between neighboring nodes
Sending 2 spinors in the + direction for uym:
1. Merge the 4 spinors into 2 (C, D in FR18-FR23).
2. Store the data to the send FIFO (X+) as if it were part of memory; it is sent directly from the registers. The size of 2 spinors is 2 x 3 x 16 = 96 bytes, so 1 packet is big enough.
3. On the neighboring node, wait until all the data has been received in the 1 KB FIFO buffer, then load it from the receive FIFO (X-) into FR18-FR23.
4. Multiply the gauge matrix and add to the 4x3 spinor.
Exchanging 2 spinors between neighboring nodes
Sending 2 spinors in the - direction for uyp:
1. Merge the 4 spinors into 2 (C, D in FR18-FR23).
2. Store the data to the send FIFO (X-) directly from the registers.
3. On the neighboring node, wait until all the data has been received in the 1 KB FIFO buffer, then load it from the receive FIFO (X+).
4. Multiply the gauge matrix (result A, B in FR24-FR29) and add to the 4x3 spinor.
Exchanging 2 spinors between the 2 CPUs
Sending 2 spinors in the + direction for uxm:
1. On the sending CPU (e.g. CPU0), merge the 4 spinors into 2 (C, D in FR18-FR23) and store the data to shared memory.
2. Pass the lockbox barrier after all the data has been stored; the other CPU (CPU1) waits at the barrier until then.
3. On the other CPU, load the data from shared memory, multiply the gauge matrix and add to the 4x3 spinor.
Shared memory is not a FIFO, so the 2 CPUs are synchronized to make sure all the data has been written to shared memory.
(For safety, it is better to synchronize also before the send.)
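A sketch of this protocol in C/C++; lockbox_barrier and the buffer names are stand-ins for illustration, not the actual API.

extern void lockbox_barrier(void);   /* hypothetical name for the lockbox barrier */

/* Sketch of the exchange through shared memory between CPU0 and CPU1.
   Each CPU stores its boundary data, both pass the barrier, then each
   reads the partner's data. */
void exchange_with_other_cpu(double *shared_out, const double *my_data,
                             const double *shared_in, double *partner_data, int n)
{
    lockbox_barrier();                    /* for safety: the shared buffer is free to reuse */
    for (int i = 0; i < n; i++)
        shared_out[i] = my_data[i];       /* store all data into shared memory */
    lockbox_barrier();                    /* pass only after all data is stored */
    for (int i = 0; i < n; i++)
        partner_data[i] = shared_in[i];   /* now the partner's data is complete */
}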
Overlapping data exchange and computations
Torus packet HW can send in the 6 directions independently
– After storing to a FIFO, the data transfer is non-blocking
  • We can overlap it with computation or with sending in another direction
But the 6 send FIFOs are shared by the 2 CPUs in a compute node
– 2 sets of 3 FIFOs, (X+, Y+, Z+) and (X-, Y-, Z-), are assigned one to each CPU
Loop for CPU0: SEND_X+, SEND_Y+, SEND_Z+ -> local computations and data exchange between the 2 CPUs -> RECV_X-, RECV_Y-, RECV_Z-
Loop for CPU1: SEND_X-, SEND_Y-, SEND_Z- -> local computations and data exchange between the 2 CPUs -> RECV_X+, RECV_Y+, RECV_Z+
The actual time to transfer data between compute nodes can be hidden if there is enough CPU work between the send and the receive.
Special communication API for lattice QCD
Limitation
– Only for exchanging data between neighboring nodes and 2 CPUs in compute node
What we can do with API
– API function to prepare packet header
– API macros to send/receive data from/to the FPU registers
  • Communication between nodes (X, Y, Z) and between the 2 CPUs (T) is handled in the same way
  • These macros are used with inline assembly to optimize the instruction pipeline together with the computations
– API function for the internal barrier between the 2 CPUs
– API functions to send/receive through a user buffer
  • These functions are used if we do not want to use inline assembly
Comparison of API for QCD and MPI
Comparison of bandwidth for ping-pong communication (x axis: data size [B], 100 to 10000; y axis: bandwidth [MB/sec]; curves: MPI vs. API for QCD)
– Between 2 neighboring compute nodes: the API for QCD is about 10 times faster than MPI; with 1 packet = 256 bytes it reaches about 40 MB/sec.
– Between the 2 CPUs in a compute node: the API for QCD is 2-3 times faster than MPI.
Sending data using API macros
#define BGLNET_WORK_REG 30
#define BGLNET_HEADER_REG 30
  (tells the API which register may be used to load the packet header)

BGLNet_Send_WaitReady(BGLNET_X_PLUS,fifo,6);   // waits until the send FIFO is ready and sets the FIFO address into "fifo"
loop for several times:
  (calculate something to send)
  BGLNet_Send_Enqueue_Header(fifo);              // loads and sends the packet header
  BGLNet_Send_Enqueue(fifo,24);  FPADD(21, 3, 6);   // sends data from register 24
  BGLNet_Send_Enqueue(fifo,25);  FPSUB(18, 0, 9);
  BGLNet_Send_Enqueue(fifo,26);  FPSUB(19, 1,10);
  BGLNet_Send_Enqueue(fifo,27);  FPSUB(20, 2,11);
  BGLNet_Send_Enqueue(fifo,28);  FPADD(22, 4, 7);
  BGLNet_Send_Enqueue(fifo,29);  FPADD(23, 5, 8);
  BGLNet_Send_Packet(fifo);                      // sends the additional final 16 bytes of the packet
end of loop

Computation can be interleaved between the enqueue macros, just as loads/stores are interleaved with computation to optimize the instruction pipeline.
Receiving data using API macros

#define BGLNET_WORK_REG 30
#define BGLNET_HEADER_REG 30
  (tells the API which register may be used to receive the packet header)

BGLNet_Recv_WaitReady(BGLNET_X_MINUS,fifo,Nx*8);   // waits until all data is received in the FIFO buffer and sets the FIFO address into "fifo"
loop for Nx times:
  BGLNet_Recv_Dequeue_Header(fifo);  FXCSMADD( 0,31,24, 0);   // receives the packet header
  BGLNet_Recv_Dequeue(fifo,12);      FXCSMADD( 1,31,25, 1);   // receives data into register 12
  BGLNet_Recv_Dequeue(fifo,13);      FXCSMADD( 2,31,26, 2);
  BGLNet_Recv_Dequeue(fifo,14);      FXCSMADD( 3,31,27, 3);
  BGLNet_Recv_Dequeue(fifo,15);      FXCSMADD( 4,31,28, 4);
  BGLNet_Recv_Dequeue(fifo,16);      FXCSMADD( 5,31,29, 5);
  BGLNet_Recv_Dequeue(fifo,17);      FXCPMADD( 6,31,27, 6);
  BGLNet_Recv_Packet(fifo);                                   // receives the additional final 16 bytes of the packet
end of loop
Sending data between 2 CPUs
BGLNet_Send_WaitReady(BGLNET_T_PLUS,fifo,6);   // does not wait; sets the address of shared memory into the pointer "fifo"
(calculate something to send)
BGLNet_Send_Enqueue(fifo,0);      FXNMSUB(21,24,31,21);   // sends data from register 0; no packet header is needed
BGLNet_Send_Enqueue(fifo + 1,1);  FXNMSUB(18,27,31,18);
BGLNet_Send_Enqueue(fifo + 2,2);  FXNMSUB(22,25,31,22);
BGLNet_Send_Enqueue(fifo + 3,3);  FXNMSUB(23,26,31,23);
BGLNet_Send_Enqueue(fifo + 4,4);  FXNMSUB(19,28,31,19);
BGLNet_Send_Enqueue(fifo + 5,5);  FXNMSUB(20,29,31,20);
BGLNet_InternalBarrier();

Shared memory is not a FIFO, so the address has to be advanced (fifo + 1, fifo + 2, ...) to store the next data.
BGLNet_InternalBarrier() makes sure all data has been stored in shared memory; the receiver also calls it before receiving the data.
Simpler way to access torus packet HW
2 API functions for send and receive can be used without inline assembly.

Send data from a user buffer:
  void BGLNet_Send(void* pData,int dir,int size);
  Copies size bytes from the user buffer pData to the send FIFO (for dir = X, Y or Z) or to shared memory (for dir = T).
  This function returns as soon as all the data has been copied to the FIFO (non-blocking send).
  It can send up to 32 spinors.

Receive data into a user buffer:
  void BGLNet_Recv(void* pData,int dir,int size);
  Receives size bytes into the user buffer pData from the receive FIFO (dir = X, Y or Z) or from shared memory (dir = T).
  This function waits until all the data can be received.
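A usage sketch of the buffered API. BGLNet_Send, BGLNet_Recv and the direction constants are the ones shown on the slides above; whether the dir argument takes exactly these constants, as well as the buffer sizes and the overlap with local work, are assumptions for illustration.

/* Declarations as given on the slides above. */
void BGLNet_Send(void* pData, int dir, int size);
void BGLNet_Recv(void* pData, int dir, int size);

/* Illustrative halo exchange of nspinors merged spinor pairs (96 bytes each)
   with the +X neighbor, using the buffered API instead of inline assembly. */
double send_buf[32 * 12];   /* up to 32 spinors, 2 x 3 complex = 12 doubles each */
double recv_buf[32 * 12];

void exchange_x(int nspinors)
{
    int bytes = nspinors * 96;
    BGLNet_Send(send_buf, BGLNET_X_PLUS, bytes);   /* returns once copied to the FIFO  */
    /* ... local computation can proceed here ... */
    BGLNet_Recv(recv_buf, BGLNET_X_MINUS, bytes);  /* waits until all data has arrived */
}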
Optimization result of our lattice QCD program
Sustained performance as a percentage of peak performance:

Strong scaling:
  global lattice size                            16x16x16x32            24x24x24x48
  1/2 rack (2.8 TFLOPS), 8x8x8x2 = 1024 CPUs     24.33% (0.68 TFLOPS)   29.23% (0.82 TFLOPS)
  1 rack (5.6 TFLOPS), 8x8x16x2 = 2048 CPUs      22.78% (1.28 TFLOPS)   28.57% (1.60 TFLOPS)

Weak scaling:
  global lattice size                            8x8x8x16               12x12x12x24
  1 node card, 4x4x2x2 = 64 CPUs                 25.45%                 29.84%

For comparison: inline assembly with MPI (using a buffer to send everything at once), 1/2 rack, 24x24x24x48: 17.88%
Easier ways to optimize for performance
(all measured on 1/2 rack with the 24x24x24x48 lattice)
  XLC 8.0 with MPI, original source code:                                           6.66%
  XLC 8.0 with MPI, adding alignx() to tell the compiler about 16-byte alignment:   7.42%
  XLC 8.0 with API for QCD, non-overlapped communication:                           7.95%
  XLC 8.0 with API for QCD, overlapping the 3 sends for X, Y, Z:                    9.19%
– Built-in functions can be used to get double FPU instructions (much easier than inline assembly)
– Overlapping communication and computation using the API for QCD gets more performance
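For reference, the alignment hint mentioned above is given to XL C/C++ with the __alignx built-in; the loop below only illustrates its use and is not the QCD kernel.

void scale_spinor(double *spinor, double kappa, int n)
{
    __alignx(16, spinor);          /* tell XLC the pointer is 16-byte aligned, so it   */
    for (int i = 0; i < n; i++)    /* can use paired (double FPU) loads and stores     */
        spinor[i] *= kappa;
}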
Summary
Optimization using double FPU instructions
– We used inline assembly to optimize complex arithmetic
Parallelization of lattice QCD
– Mapping the 4D lattice onto a virtual 4D torus network to limit communication to neighboring compute nodes
Optimization of communication
– We used torus packet HW directly
• We developed an API for QCD to make it easier to use