Optimization and tuning techniques of lattice QCD for Blue Gene
Jun Doi
Tokyo Research Laboratory
IBM Japan
Agenda
Part I:
– Optimization of lattice QCD program using double FPU instructions
Part II:
– Parallelization of lattice QCD and optimization of communication
Part I:
Optimization of lattice QCD program using double FPU instructions
Optimization of lattice QCD for Blue Gene
Our lattice QCD program
– Wilson’s method
– Original program is written in C++
Optimization using double FPU instructions
– We used inline assembly to optimize complex arithmetic
• We have to schedule the instructions ourselves instead of relying on the compiler
Wilson-Dirac operator
For each of the 4 spin components, the 3 color components are mixed by multiplying with the 3x3 gauge matrix, for the 8 directions x+, x-, y+, y-, z+, z- , t+ and t-
ψ : 4x3 spinor • U : 3x3 gauge matrix • U† : Hermitian conjugate of U • (1+γ), (1-γ) : 4x4 projector matrices
D ψ(x,y,z,t) =
    U_x(x,y,z,t) (1-γ_x) ψ(x+1,y,z,t) + U_x†(x-1,y,z,t) (1+γ_x) ψ(x-1,y,z,t)
  + U_y(x,y,z,t) (1-γ_y) ψ(x,y+1,z,t) + U_y†(x,y-1,z,t) (1+γ_y) ψ(x,y-1,z,t)
  + U_z(x,y,z,t) (1-γ_z) ψ(x,y,z+1,t) + U_z†(x,y,z-1,t) (1+γ_z) ψ(x,y,z-1,t)
  + U_t(x,y,z,t) (1-γ_t) ψ(x,y,z,t+1) + U_t†(x,y,z,t-1) (1+γ_t) ψ(x,y,z,t-1)
  = uxp(x,y,z,t) + uxm(x,y,z,t) + uyp(x,y,z,t) + uym(x,y,z,t)
  + uzp(x,y,z,t) + uzm(x,y,z,t) + utp(x,y,z,t) + utm(x,y,z,t)
(Figure: lattice site (x,y,z,t), its neighbors (x-1,y,z,t), (x+1,y,z,t), (x,y-1,z,t), (x,y+1,z,t), and the links Ux(x,y,z,t), Ux(x-1,y,z,t), Uy(x,y,z,t), Uy(x,y-1,z,t) connecting them.)
uxp : Part of Wilson-Dirac operator for X plus direction
uxp(x) = U_x(x) (1 - γ_x) ψ(x+1)

Projector (1 - γ_x):
  [  1   0   0   i ]
  [  0   1   i   0 ]
  [  0  -i   1   0 ]
  [ -i   0   0   1 ]

Multiplying by this symmetric projector, we can merge the 4 spinor components into 2 spinors to calculate uxp:
I.   Merging 4 spinors into 2:
       C = ψ1(x+1) + i ψ4(x+1)
       D = ψ2(x+1) + i ψ3(x+1)
II.  Multiplying the 2 spinors and the gauge matrix:
       A = U_x(x) C
       B = U_x(x) D
III. Adding to the 4 spinors:
       uxp(x) = ( A, B, -iB, -iA )
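The three steps written out in plain C/C++ with explicit real and imaginary parts instead of double FPU instructions. This is a minimal sketch only: the cplx type, the helper names and the exact sign convention of the projector are illustrative assumptions, not the original code.

typedef struct { double re, im; } cplx;

static cplx cadd (cplx a, cplx b) { cplx r = { a.re + b.re, a.im + b.im }; return r; }
static cplx ciadd(cplx a, cplx b) { cplx r = { a.re - b.im, a.im + b.re }; return r; } /* a + i*b */
static cplx cisub(cplx a, cplx b) { cplx r = { a.re + b.im, a.im - b.re }; return r; } /* a - i*b */
static cplx cmul (cplx a, cplx b) { cplx r = { a.re*b.re - a.im*b.im, a.re*b.im + a.im*b.re }; return r; }

void uxp_site(const cplx psi_in[4][3], const cplx u[3][3], cplx result[4][3])
{
    cplx C[3], D[3], A[3], B[3];
    for (int c = 0; c < 3; c++) {                  /* I.  merge 4 spinors into 2      */
        C[c] = ciadd(psi_in[0][c], psi_in[3][c]);  /*     C = psi1 + i*psi4           */
        D[c] = ciadd(psi_in[1][c], psi_in[2][c]);  /*     D = psi2 + i*psi3           */
    }
    for (int r = 0; r < 3; r++) {                  /* II. multiply 3x3 gauge matrix   */
        A[r] = cmul(u[r][0], C[0]);
        B[r] = cmul(u[r][0], D[0]);
        for (int c = 1; c < 3; c++) {
            A[r] = cadd(A[r], cmul(u[r][c], C[c]));
            B[r] = cadd(B[r], cmul(u[r][c], D[c]));
        }
    }
    for (int c = 0; c < 3; c++) {                  /* III. add back to the 4x3 spinor */
        result[0][c] = cadd (result[0][c], A[c]);
        result[1][c] = cadd (result[1][c], B[c]);
        result[2][c] = cisub(result[2][c], B[c]);  /* psi3 += -i*B */
        result[3][c] = cisub(result[3][c], A[c]);  /* psi4 += -i*A */
    }
}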
Floating point register usage for u?p, u?m
– 4x3 spinor to accumulate the result = 12 registers
– 4x3 spinor of the neighboring lattice point (input) = 12 registers
– 3x3 gauge matrix = 9 registers
– That already makes 33 registers, and additional registers are needed for constant values, so registers are reused:
  FP0 to FP11 : 12 registers to accumulate the result (always loaded, shared by all directions)
  FP12 to FP17: 6 registers for the gauge matrix (the last 3 elements are loaded after the first 3 have been multiplied)
  FP18 to FP29: 12 registers for the input: 6 for the 2 merged spinors and 6 to save the result of the gauge-matrix multiply
  FP30, FP31  : 2 registers for other values (constants)
Step I : Merging 4 spinors into 2 spinors
Input: the 4x3 spinor of the neighboring lattice point, ψ1(x+1) ... ψ4(x+1):
  spinor 1 R/G/B in FR18-FR20, spinor 2 R/G/B in FR21-FR23,
  spinor 3 R/G/B in FR24-FR26, spinor 4 R/G/B in FR27-FR29
Merged 2 spinors (overwriting FR18-FR23):
  C = ψ1(x+1) + i ψ4(x+1)  ->  C R/G/B in FR18-FR20
  D = ψ2(x+1) + i ψ3(x+1)  ->  D R/G/B in FR21-FR23
The merge is the complex operation v = x + i * y:
  Re(v) = Re(x) - Im(y), Im(v) = Im(x) + Re(y)
double unit[2] = {1,1};
LFPDX 31,unit
…
LFPDX 18,spinor1_R
…
LFPDX 27,spinor4_R
…
FXCXNPMA 18,31,27,18
FXCXNPMA instruction
The double FPU has a primary register file (FR0~FR31) and a secondary register file (FR32~FR63), each with its own multiply and add pipeline; for a complex number the primary register holds the real part and the secondary register the imaginary part. For the example below, FXCXNPMA 18,31,27,18 computes
  FR18 = FR18 - FR63 x FR59   (primary:   Re(x) - 1 * Im(y))
  FR50 = FR50 + FR63 x FR27   (secondary: Im(x) + 1 * Re(y))
where FR50, FR59 and FR63 are the secondary halves of FR18, FR27 and FR31, so with unit = {1,1} in FR31 the result is v = x + i * y.
double unit[2] = {1,1};
LFPDX 31,unit
…
LFPDX 18,spinor1_R
…
LFPDX 27,spinor4_R
…
FXCXNPMA 18,31,27,18
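Written out as plain C/C++ on (real, imaginary) pairs, the behaviour described above looks as follows. This is only a model for reading the assembly listings, not the hardware specification.

typedef struct { double p, s; } fpr;   /* p = primary half (real), s = secondary half (imaginary) */

/* Model of FXCXNPMA T,A,C,B as described above:
     primary(T)   = primary(B)   - secondary(A) * secondary(C)
     secondary(T) = secondary(B) + secondary(A) * primary(C)
   With A = {1,1} this yields T = B + i*C, the merge used in step I. */
fpr fxcxnpma_model(fpr a, fpr c, fpr b)
{
    fpr t;
    t.p = b.p - a.s * c.s;
    t.s = b.s + a.s * c.p;
    return t;
}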
Step II : Multiplying spinor and 3x3 gauge matrix (for +directions)
y[0] = u[0][0] * x[0] + u[0][1] * x[1] + u[0][2] * x[2];
y[1] = u[1][0] * x[0] + u[1][1] * x[1] + u[1][2] * x[2];
y[2] = u[2][0] * x[0] + u[2][1] * x[1] + u[2][2] * x[2];
u[0][0] * x[0] : Multiplying 2 complex numbers
FXPMUL (y[0],u[0][0],x[0])
FXCXNPMA (y[0],u[0][0],x[0],y[0])
+ u[0][1] * x[1] + u[0][2] * x[2]; Using FMA instructions
FXCPMADD (y[0],u[0][1],x[1],y[0])
FXCXNPMA (y[0],u[0][1],x[1],y[0])
FXCPMADD (y[0],u[0][2],x[2],y[0])
FXCXNPMA (y[0],u[0][2],x[2],y[0])
re(y[0])=re(u[0][0])*re(x[0])
im(y[0])=re(u[0][0])*im(x[0])
re(y[0])+=-im(u[0][0])*im(x[0])
im(y[0])+=im(u[0][0])*re(x[0])
x: input spinor
y: output spinor
u : 3x3 gauge matrix
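The same y[0] computation written as scalar C/C++, one statement per half of each double FPU instruction; a sketch for reference only, with u_row standing for one row of the gauge matrix.

typedef struct { double re, im; } cplx;

/* One row of y = u * x, mirroring the instruction order above:
   FXPMUL/FXCPMADD produce the re(u)*x terms, FXCXNPMA adds the im(u)*x terms. */
cplx row_times_vector(const cplx u_row[3], const cplx x[3])
{
    cplx y;
    y.re = u_row[0].re * x[0].re;            /* FXPMUL   */
    y.im = u_row[0].re * x[0].im;
    y.re -= u_row[0].im * x[0].im;           /* FXCXNPMA */
    y.im += u_row[0].im * x[0].re;
    for (int k = 1; k < 3; k++) {
        y.re += u_row[k].re * x[k].re;       /* FXCPMADD */
        y.im += u_row[k].re * x[k].im;
        y.re -= u_row[k].im * x[k].im;       /* FXCXNPMA */
        y.im += u_row[k].im * x[k].re;
    }
    return y;
}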
FXPMUL and FXCPMADD instruction
As before, the primary half (FR0~FR31) of each register pair holds the real part and the secondary half (FR32~FR63) the imaginary part, and each half has its own multiply and add pipeline.
FXPMUL 5,3,4 computes
  FR5  = FR3 x FR4    (primary)
  FR37 = FR3 x FR36   (secondary)
i.e. the primary part of A (FR3) is multiplied into both halves of C (FR4/FR36).
FXCPMADD 10,0,1,5 computes
  FR10 = FR0 x FR1  + FR5    (primary)
  FR42 = FR0 x FR33 + FR37   (secondary)
i.e. the same multiplication with the accumulator B (FR5/FR37) added.
Multiplying Hermitian gauge matrix (for -directions)
y[0] = ~u[0][0] * x[0] + ~u[1][0] * x[1] + ~u[2][0] * x[2];
y[1] = ~u[0][1] * x[0] + ~u[1][1] * x[1] + ~u[2][1] * x[2];
y[2] = ~u[0][2] * x[0] + ~u[1][2] * x[1] + ~u[2][2] * x[2];
~u[0][0] * x[0] : Multiplying by the complex conjugate
FXPMUL (y[0],u[0][0],x[0])
FXCXNSMA (y[0],u[0][0],x[0],y[0])
+ ~u[1][0] * x[1] + ~u[2][0] * x[2]; Using FMA instructions
FXCPMADD (y[0],u[1][0],x[1],y[0])
FXCXNSMA (y[0],u[1][0],x[1],y[0])
FXCPMADD (y[0],u[2][0],x[2],y[0])
FXCXNSMA (y[0],u[2][0],x[2],y[0])
re(y[0])=re(u[0][0])*re(x[0])
im(y[0])=re(u[0][0])*im(x[0])
re(y[0])+=im(u[0][0])*im(x[0])
im(y[0])+=-im(u[0][0])*re(x[0])
Multiplying by the Hermitian matrix is done as shown above.
x: input spinor
y: output spinor
~u : complex conjugate of u
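As a scalar C/C++ sketch of one output row for the - directions (u_col is one column of u, whose conjugate is used); names are illustrative, not from the original program.

typedef struct { double re, im; } cplx;

/* One row of y = u† * x, mirroring FXPMUL / FXCPMADD / FXCXNSMA:
   conjugating u only flips the sign of the im(u) terms compared with the + direction. */
cplx conj_col_times_vector(const cplx u_col[3], const cplx x[3])
{
    cplx y = { 0.0, 0.0 };
    for (int k = 0; k < 3; k++) {
        y.re += u_col[k].re * x[k].re;       /* FXPMUL (k = 0) / FXCPMADD */
        y.im += u_col[k].re * x[k].im;
        y.re += u_col[k].im * x[k].im;       /* FXCXNSMA                  */
        y.im -= u_col[k].im * x[k].re;
    }
    return y;
}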
Optimization of instruction pipeline
y[0] = u[0][0] * x[0] + u[0][1] * x[1] + u[0][2] * x[2];
y[1] = u[1][0] * x[0] + u[1][1] * x[1] + u[1][2] * x[2];
y[2] = u[2][0] * x[0] + u[2][1] * x[1] + u[2][2] * x[2];
FXPMUL (y[0],u[0][0],x[0])
FXPMUL (y[1],u[1][0],x[0])
FXPMUL (y[2],u[2][0],x[0])
FXCXNPMA (y[0],u[0][0],x[0],y[0])
FXCXNPMA (y[1],u[1][0],x[0],y[1])
FXCXNPMA (y[2],u[2][0],x[0],y[2])
FXCPMADD (y[0],u[0][1],x[1],y[0])
...
Only 3 cycles separate each result from the instruction that uses it: the pipeline stalls
FXPMUL (yc[0],u[0][0],xa[0])
FXPMUL (yd[0],u[0][0],xb[0])
FXPMUL (yc[1],u[1][0],xa[0])
FXPMUL (yd[1],u[1][0],xb[0])
FXPMUL (yc[2],u[2][0],xa[0])
FXPMUL (yd[2],u[2][0],xb[0])
FXCXNPMA (yc[0],u[0][0],xa[0],yc[0])
FXCXNPMA (yd[0],u[0][0],xb[0],yd[0])
FXCXNPMA (yc[1],u[1][0],xa[0],yc[1])
FXCXNPMA (yd[1],u[1][0],xb[0],yd[1])
FXCXNPMA (yc[2],u[2][0],xa[0],yc[2])
FXCXNPMA (yd[2],u[2][0],xb[0],yd[2])
FXCPMADD (yc[0],u[0][1],xa[1],yc[0])
...
6 cycles separate each result from the instruction that uses it: the pipeline does not stall
Calculation order: instead of 1 multiplication at a time, the 2 multiplications Yc = u*Xa and Yd = u*Xb for 2 spinors are interleaved.
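The same interleaving idea in plain C/C++: the two output rows are accumulated in alternating statements, which is what the hand-scheduled assembly above does with registers. Names are illustrative only.

typedef struct { double re, im; } cplx;

/* Accumulate yc = u_row * xa and yd = u_row * xb in alternating statements,
   giving two independent dependency chains so no result is needed by the
   immediately following operation. */
void two_rows_interleaved(const cplx u_row[3], const cplx xa[3], const cplx xb[3],
                          cplx *yc, cplx *yd)
{
    cplx c = { 0.0, 0.0 }, d = { 0.0, 0.0 };
    for (int k = 0; k < 3; k++) {
        c.re += u_row[k].re * xa[k].re;   d.re += u_row[k].re * xb[k].re;
        c.im += u_row[k].re * xa[k].im;   d.im += u_row[k].re * xb[k].im;
        c.re -= u_row[k].im * xa[k].im;   d.re -= u_row[k].im * xb[k].im;
        c.im += u_row[k].im * xa[k].re;   d.im += u_row[k].im * xb[k].re;
    }
    *yc = c;
    *yd = d;
}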
Loading the gauge matrix into registers

For the minus directions (uxm, uym, uzm, utm operators):
  y[0] = ~u[0][0] * x[0] + ~u[1][0] * x[1] + ~u[2][0] * x[2]
  y[1] = ~u[0][1] * x[0] + ~u[1][1] * x[1] + ~u[2][1] * x[2]
  y[2] = ~u[0][2] * x[0] + ~u[1][2] * x[1] + ~u[2][2] * x[2]
  Calculation order: y = ~u[0]*x[0]; y += ~u[1]*x[1]; y += ~u[2]*x[2]
  This follows the order of the matrix in the array, so the matrix can be loaded sequentially, reusing FR12-FR14 for the last row:
  lfpdux u,12 //u[0][0]   lfpdux u,13 //u[0][1]   lfpdux u,14 //u[0][2]
  lfpdux u,15 //u[1][0]   lfpdux u,16 //u[1][1]   lfpdux u,17 //u[1][2]
  lfpdux u,12 //u[2][0]   lfpdux u,13 //u[2][1]   lfpdux u,14 //u[2][2]

For the plus directions (uxp, uyp, uzp, utp operators):
  y[0] = u[0][0] * x[0] + u[0][1] * x[1] + u[0][2] * x[2]
  y[1] = u[1][0] * x[0] + u[1][1] * x[1] + u[1][2] * x[2]
  y[2] = u[2][0] * x[0] + u[2][1] * x[1] + u[2][2] * x[2]
  Calculation order: y = u*x[0]; y += u*x[1]; y += u*x[2]
  This consumes the matrix column by column; to still load the matrix sequentially, 2 additional temporary registers (FR30, FR31) are used:
  lfpdux u,14 //u[0][0]   lfpdux u,15 //u[0][1]   lfpdux u,12 //u[0][2]
  lfpdux u,17 //u[1][0]   lfpdux u,16 //u[1][1]   lfpdux u,13 //u[1][2]
  lfpdux u,30 //u[2][0]   lfpdux u,31 //u[2][1]   lfpdux u,14 //u[2][2]
Step III: Adding result to 4x3 spinor
The result A (R/G/B in FR24-FR26) and B (R/G/B in FR27-FR29) is added to the 4 spinor components ψ1 ... ψ4 held in FR0-FR2, FR3-FR5, FR6-FR8 and FR9-FR11, as (A, B, -iB, -iA).

For spinor 1 and 2: v = v + w
  Re(v) = Re(v) + Re(w), Im(v) = Im(v) + Im(w)
  LFPDX 0,spinor1_R
  LFPDX 24,A_R
  FPADD 0,0,24

For spinor 3 and 4: v = v - i * w (FXCXNSMA to subtract)
  Re(v) = Re(v) + Im(w), Im(v) = Im(v) - Re(w)
  double unit[2] = {1,1};
  LFPDX 31,unit
  LFPDX 9,spinor4_R
  LFPDX 24,A_R
  FXCXNSMA 9,31,24,9
Multiplying κ into the 4x3 spinor

D ψ(x,y,z,t) = κ [ U_x(x,y,z,t)(1-γ_x)ψ(x+1,y,z,t) + U_x†(x-1,y,z,t)(1+γ_x)ψ(x-1,y,z,t) + ... (same for y, z and t) ]
             = κ uxp(x,y,z,t) + κ uxm(x,y,z,t) + κ uyp(x,y,z,t) + κ uym(x,y,z,t)
             + κ uzp(x,y,z,t) + κ uzm(x,y,z,t) + κ utp(x,y,z,t) + κ utm(x,y,z,t)

Multiplying κ into every u?p, u?m operator:
– This change increases the amount of calculation, but does not increase the number of double FPU instructions
– It allows out-of-order calculation of the operators
Adding the result to the 4x3 spinor with the multiplication by κ folded in

The result A (FR24-FR26) and B (FR27-FR29) is added to the 4 spinors in FR0-FR11 as (κA, κB, -iκB, -iκA).

For spinor 1 and 2: v = v + κ * w
  Re(v) = Re(v) + κ*Re(w), Im(v) = Im(v) + κ*Im(w)
  double kappa[2] = {κ, κ};
  LFPDX 31,kappa
  LFPDX 0,spinor1_R
  LFPDX 24,A_R
  FXCPMADD 0,31,24,0
  (instead of FPADD 0,0,24 in the version without κ)

For spinor 3 and 4: v = v - κ * i * w
  Re(v) = Re(v) + κ*Im(w), Im(v) = Im(v) - κ*Re(w)
  LFPDX 9,spinor4_R
  LFPDX 24,A_R
  FXCXNSMA 9,31,24,9
  (the same instruction as without κ, but with {κ,κ} in FR31 instead of unit = {1,1})
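In scalar C/C++ terms the change amounts to the following two accumulation helpers; a sketch with names that are not from the original program.

typedef struct { double re, im; } cplx;

/* Without kappa:  v += w        (FPADD, or FXCXNSMA with unit = {1,1})
   With kappa:     v += kappa*w  (FXCPMADD, or FXCXNSMA with {kappa,kappa})
   Either way it is one fused multiply-add per element, so folding kappa in
   costs no extra double FPU instructions. */
void add_scaled(cplx *v, cplx w, double kappa)          /* v += kappa * w      */
{
    v->re += kappa * w.re;
    v->im += kappa * w.im;
}

void add_scaled_minus_i(cplx *v, cplx w, double kappa)  /* v += -i * kappa * w */
{
    v->re += kappa * w.im;
    v->im -= kappa * w.re;
}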
Part II:
Parallelization of lattice QCD and optimization of communication
Parallelization of lattice QCD and optimization of communication
Decreasing communication time as much as we can
– Limit data exchange to neighboring nodes on the torus
  • Shortest path, and the data exchanges never conflict
– Map the 4D lattice onto the torus network
MPI is rich enough to do such limited communication,
– but the overhead of calling an MPI function is too large under these restricted conditions
We used the torus packet HW directly
– Very low latency to send and receive data
– We can send/receive directly from/to the double FPU registers
  • We do not need buffers in memory
– We can overlap the sends in 6 directions with local computation
  • We can hide the communication time
Parallelization of lattice QCD
(Figure: the X-Y plane of the lattice, solved on 1 CPU vs. divided among CPUs; the division is mapped onto the physical topology of the network to avoid conflicts in the data exchange.)
Mapping the lattice onto the torus network of Blue Gene
How to divide the 4D lattice onto the 3D torus network:
– In virtual node mode, communication between the 2 CPUs in the same compute node acts as a 4th torus dimension
We mapped x, y, z, t of the 4D lattice onto T, X, Y, Z of this virtual 4D torus
– The fastest communication is between the 2 CPUs (CPU0, CPU1) in a compute node
– x of the lattice is the innermost index of the spinor and gauge arrays
(Figure: lattice of QCD (x, y, z, t) mapped onto the torus network of Blue Gene (T, X, Y, Z))
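A sketch of the index arithmetic implied by this mapping; the local sizes, the function name and the ordering are assumptions for illustration, not the original code.

typedef struct { int t, x, y, z; } TorusCoord;   /* virtual 4D torus coordinate */

/* Hypothetical sketch of the mapping: lattice x,y,z,t -> torus T,X,Y,Z,
   where T is the pair of CPUs inside one compute node.  LX..LT are the
   local lattice sizes per CPU. */
TorusCoord site_to_cpu(int gx, int gy, int gz, int gt,
                       int LX, int LY, int LZ, int LT)
{
    TorusCoord n;
    n.t = gx / LX;   /* lattice x -> T: split between the 2 CPUs of a node */
    n.x = gy / LY;   /* lattice y -> torus X */
    n.y = gz / LZ;   /* lattice z -> torus Y */
    n.z = gt / LT;   /* lattice t -> torus Z */
    return n;
}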
Data exchange by torus packet network
A packet consists of a 16-byte header (destination, size, etc.) followed by the payload in 16-byte units; the packet size is a multiple of 32 bytes, up to 256 bytes including the header.
Sending data: store the data to the memory-mapped send FIFO (X+, X-, Y+, ... , Z-) with 16-byte parallel stores directly from the double FPU registers; this sends it to the neighboring node.
Receiving data: load the data from the receive FIFO into the double FPU registers with 16-byte parallel loads.
The 6 FIFOs are independent and can transfer data at the same time.
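As a rough C/C++ picture of the packet layout just described; the field names are illustrative, not the hardware definition.

/* Illustrative layout only -- the real header format is defined by the
   torus hardware.  A packet is a 16-byte header followed by the payload
   in 16-byte units; the total size is a multiple of 32 bytes, at most
   256 bytes including the header. */
typedef struct {
    unsigned char header[16];   /* destination, size, etc. */
    double        payload[30];  /* up to 240 bytes = 15 chunks of 16 bytes */
} TorusPacket;                  /* 256 bytes at the maximum packet size */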
Exchanging 2 spinors between neighboring nodes
Sending 2 spinors in the + direction for uym:
1. Merge the 4 spinors into 2 (C, D in FR18-FR23).
2. Store the data to the send FIFO (X+) as if it were part of memory; it is sent directly from the registers. The size of 2 spinors is 2 x 3 x 16 = 96 bytes, so 1 packet is big enough.
3. On the neighboring node, wait until all the data has been received in the 1 KB FIFO buffer, then load it from the receive FIFO (X-) into FR18-FR23.
4. Multiply the gauge matrix and add to the 4x3 spinor.
Exchanging 2 spinors between neighboring nodes
Sending 2 spinors in the - direction for uyp:
1. Merge the 4 spinors into 2 (C, D in FR18-FR23).
2. Store the data to the send FIFO (X-) directly from the registers.
3. On the neighboring node, wait until all the data has been received in the 1 KB FIFO buffer, then load it from the receive FIFO (X+).
4. Multiply the gauge matrix (result A, B in FR24-FR29) and add to the 4x3 spinor.
Exchanging 2 spinors between the 2 CPUs
Sending 2 spinors in the + direction for uxm:
1. On the sending CPU (e.g. CPU0), merge the 4 spinors into 2 (C, D in FR18-FR23) and store the data to shared memory.
2. Pass the lockbox barrier after all the data has been stored; the other CPU (CPU1) waits at the barrier until then.
3. On the other CPU, load the data from shared memory, multiply the gauge matrix and add to the 4x3 spinor.
Shared memory is not a FIFO, so the 2 CPUs are synchronized to make sure all the data has been written to shared memory.
(For safety, it is better to synchronize also before the send.)
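A sketch of this protocol in C/C++; lockbox_barrier and the buffer names are stand-ins for illustration, not the actual API.

extern void lockbox_barrier(void);   /* hypothetical name for the lockbox barrier */

/* Sketch of the exchange through shared memory between CPU0 and CPU1.
   Each CPU stores its boundary data, both pass the barrier, then each
   reads the partner's data. */
void exchange_with_other_cpu(double *shared_out, const double *my_data,
                             const double *shared_in, double *partner_data, int n)
{
    lockbox_barrier();                    /* for safety: the shared buffer is free to reuse */
    for (int i = 0; i < n; i++)
        shared_out[i] = my_data[i];       /* store all data into shared memory */
    lockbox_barrier();                    /* pass only after all data is stored */
    for (int i = 0; i < n; i++)
        partner_data[i] = shared_in[i];   /* now the partner's data is complete */
}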
Overlapping data exchange and computations
Torus packet HW can send in the 6 directions independently
– After storing to a FIFO, the data transfer is non-blocking
  • We can overlap it with computation or with sending in another direction
But the 6 send FIFOs are shared by the 2 CPUs in a compute node
– 2 sets of 3 FIFOs, (X+, Y+, Z+) and (X-, Y-, Z-), are assigned one to each CPU
Loop for CPU0: SEND_X+, SEND_Y+, SEND_Z+ -> local computations and data exchange between the 2 CPUs -> RECV_X-, RECV_Y-, RECV_Z-
Loop for CPU1: SEND_X-, SEND_Y-, SEND_Z- -> local computations and data exchange between the 2 CPUs -> RECV_X+, RECV_Y+, RECV_Z+
The actual time to transfer data between compute nodes can be hidden if there is enough CPU work between the send and the receive.
Special communication API for lattice QCD
Limitation
– Only for exchanging data between neighboring nodes and 2 CPUs in compute node
What we can do with API
– API function to prepare packet header
– API macros to send/receive data from/to the FPU registers
  • Communication between nodes (X, Y, Z) and between the 2 CPUs (T) is handled in the same way
  • These macros are used with inline assembly to optimize the instruction pipeline together with the computations
– API function for the internal barrier between the 2 CPUs
– API functions to send/receive through a user buffer
  • These functions are used if we do not want to use inline assembly
Comparison of API for QCD and MPI
Comparison of bandwidth for ping-pong communication (x axis: data size [B], 100 to 10000; y axis: bandwidth [MB/sec]; curves: MPI vs. API for QCD)
– Between 2 neighboring compute nodes: the API for QCD is about 10 times faster than MPI; with 1 packet = 256 bytes it reaches about 40 MB/sec.
– Between the 2 CPUs in a compute node: the API for QCD is 2-3 times faster than MPI.
Sending data using API macros
#define BGLNET_WORK_REG 30
#define BGLNET_HEADER_REG 30
  (tells the API which register may be used to load the packet header)

BGLNet_Send_WaitReady(BGLNET_X_PLUS,fifo,6);   // waits until the send FIFO is ready and sets the FIFO address into "fifo"
loop for several times:
  (calculate something to send)
  BGLNet_Send_Enqueue_Header(fifo);              // loads and sends the packet header
  BGLNet_Send_Enqueue(fifo,24);  FPADD(21, 3, 6);   // sends data from register 24
  BGLNet_Send_Enqueue(fifo,25);  FPSUB(18, 0, 9);
  BGLNet_Send_Enqueue(fifo,26);  FPSUB(19, 1,10);
  BGLNet_Send_Enqueue(fifo,27);  FPSUB(20, 2,11);
  BGLNet_Send_Enqueue(fifo,28);  FPADD(22, 4, 7);
  BGLNet_Send_Enqueue(fifo,29);  FPADD(23, 5, 8);
  BGLNet_Send_Packet(fifo);                      // sends the additional final 16 bytes of the packet
end of loop

Computation can be interleaved between the enqueue macros, just as loads/stores are interleaved with computation to optimize the instruction pipeline.
Receiving data using API macros

#define BGLNET_WORK_REG 30
#define BGLNET_HEADER_REG 30
  (tells the API which register may be used to receive the packet header)

BGLNet_Recv_WaitReady(BGLNET_X_MINUS,fifo,Nx*8);   // waits until all data is received in the FIFO buffer and sets the FIFO address into "fifo"
loop for Nx times:
  BGLNet_Recv_Dequeue_Header(fifo);  FXCSMADD( 0,31,24, 0);   // receives the packet header
  BGLNet_Recv_Dequeue(fifo,12);      FXCSMADD( 1,31,25, 1);   // receives data into register 12
  BGLNet_Recv_Dequeue(fifo,13);      FXCSMADD( 2,31,26, 2);
  BGLNet_Recv_Dequeue(fifo,14);      FXCSMADD( 3,31,27, 3);
  BGLNet_Recv_Dequeue(fifo,15);      FXCSMADD( 4,31,28, 4);
  BGLNet_Recv_Dequeue(fifo,16);      FXCSMADD( 5,31,29, 5);
  BGLNet_Recv_Dequeue(fifo,17);      FXCPMADD( 6,31,27, 6);
  BGLNet_Recv_Packet(fifo);                                   // receives the additional final 16 bytes of the packet
end of loop
Sending data between 2 CPUs
BGLNet_Send_WaitReady(BGLNET_T_PLUS,fifo,6);   // does not wait; sets the address of shared memory into the pointer "fifo"
(calculate something to send)
BGLNet_Send_Enqueue(fifo,0);      FXNMSUB(21,24,31,21);   // sends data from register 0; no packet header is needed
BGLNet_Send_Enqueue(fifo + 1,1);  FXNMSUB(18,27,31,18);
BGLNet_Send_Enqueue(fifo + 2,2);  FXNMSUB(22,25,31,22);
BGLNet_Send_Enqueue(fifo + 3,3);  FXNMSUB(23,26,31,23);
BGLNet_Send_Enqueue(fifo + 4,4);  FXNMSUB(19,28,31,19);
BGLNet_Send_Enqueue(fifo + 5,5);  FXNMSUB(20,29,31,20);
BGLNet_InternalBarrier();

Shared memory is not a FIFO, so the address has to be advanced (fifo + 1, fifo + 2, ...) to store the next data.
BGLNet_InternalBarrier() makes sure all data has been stored in shared memory; the receiver also calls it before receiving the data.
Simpler way to access torus packet HW
2 API functions for send and receive can be used without inline assembly.

Send data from a user buffer:
  void BGLNet_Send(void* pData,int dir,int size);
  Copies size bytes from the user buffer pData to the send FIFO (for dir = X, Y or Z) or to shared memory (for dir = T).
  This function returns as soon as all the data has been copied to the FIFO (non-blocking send).
  It can send up to 32 spinors.

Receive data into a user buffer:
  void BGLNet_Recv(void* pData,int dir,int size);
  Receives size bytes into the user buffer pData from the receive FIFO (dir = X, Y or Z) or from shared memory (dir = T).
  This function waits until all the data can be received.
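A usage sketch of the buffered API. BGLNet_Send, BGLNet_Recv and the direction constants are the ones shown on the slides above; whether the dir argument takes exactly these constants, as well as the buffer sizes and the overlap with local work, are assumptions for illustration.

/* Declarations as given on the slides above. */
void BGLNet_Send(void* pData, int dir, int size);
void BGLNet_Recv(void* pData, int dir, int size);

/* Illustrative halo exchange of nspinors merged spinor pairs (96 bytes each)
   with the +X neighbor, using the buffered API instead of inline assembly. */
double send_buf[32 * 12];   /* up to 32 spinors, 2 x 3 complex = 12 doubles each */
double recv_buf[32 * 12];

void exchange_x(int nspinors)
{
    int bytes = nspinors * 96;
    BGLNet_Send(send_buf, BGLNET_X_PLUS, bytes);   /* returns once copied to the FIFO  */
    /* ... local computation can proceed here ... */
    BGLNet_Recv(recv_buf, BGLNET_X_MINUS, bytes);  /* waits until all data has arrived */
}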
Optimization result of our lattice QCD program
Sustained performance as a percentage of peak performance:

Strong scaling:
  global lattice size                            16x16x16x32            24x24x24x48
  1/2 rack (2.8 TFLOPS), 8x8x8x2 = 1024 CPUs     24.33% (0.68 TFLOPS)   29.23% (0.82 TFLOPS)
  1 rack (5.6 TFLOPS), 8x8x16x2 = 2048 CPUs      22.78% (1.28 TFLOPS)   28.57% (1.60 TFLOPS)

Weak scaling:
  global lattice size                            8x8x8x16               12x12x12x24
  1 node card, 4x4x2x2 = 64 CPUs                 25.45%                 29.84%

For comparison: inline assembly with MPI (using a buffer to send everything at once), 1/2 rack, 24x24x24x48: 17.88%
Easier ways to optimize for performance
(all measured on 1/2 rack with the 24x24x24x48 lattice)
  XLC 8.0 with MPI, original source code:                                           6.66%
  XLC 8.0 with MPI, adding alignx() to tell the compiler about 16-byte alignment:   7.42%
  XLC 8.0 with API for QCD, non-overlapped communication:                           7.95%
  XLC 8.0 with API for QCD, overlapping the 3 sends for X, Y, Z:                    9.19%
– Built-in functions can be used to get double FPU instructions (much easier than inline assembly)
– Overlapping communication and computation using the API for QCD gets more performance
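For reference, the alignment hint mentioned above is given to XL C/C++ with the __alignx built-in; the loop below only illustrates its use and is not the QCD kernel.

void scale_spinor(double *spinor, double kappa, int n)
{
    __alignx(16, spinor);          /* tell XLC the pointer is 16-byte aligned, so it   */
    for (int i = 0; i < n; i++)    /* can use paired (double FPU) loads and stores     */
        spinor[i] *= kappa;
}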
Summary
Optimization using double FPU instructions
– We used inline assembly to optimize complex arithmetic
Parallelization of lattice QCD
– Mapping the 4D lattice onto a virtual 4D torus network to limit communication to neighboring compute nodes
Optimization of communication
– We used torus packet HW directly
• We developed an API for QCD to make it easier to use