1
Using HPS Switch on Bassi
Jonathan Carter
User Services Group Lead
NERSC User Group Meeting
June 12, 2006
2
IBM Switch Evolution
3
IBM Switch Evolution
Year  Name                 Peak BW                                  Latency   Processor
1996  SP Switch            300 MB/s per node (2x150 MB/s channel)   20-35 us  Power2/Power3
2000  SP Switch2 (Colony)  2 GB/s per node (2x500 MB/s per port)    ~17 us    Power3/Power4
2003  HPS (Federation)     2 GB/s per port                          5-14 us   Power4/Power5
4
HPS Switch Configuration
5
Bassi Switch Configuration
B0101 B0201 B0301 B0401 B0501 B0601 B0701 B0801 B0901 B1001 B1101 B1201
B0102 B0202 B0302 B0402 B0502 B0602 B0702 B0802 B0902 B1002 B1102 B1202
B0103 B0203 B0303 B0403 B0503 B0603 B0703 B0803 B0903 B1003 B1103 B1203
B2904 B0304 B0404 B0504 B0704 B0804 B0904 B1004 B1104 B1204
B0205 B0305 B0405 B0505 B0705 B8905 B0905 B1005 B1105 B1205
B0206 B0306 B0406 B0506 B0706 B0806 B0906 B1006 B1106 B1206
B0207 B0307 B0407 B0507 B0707 B8907 B0907 B1007 B1107 B1207
B2908 B0308 B0408 B0508 B0708 B0808 B0908 B1008 B1108 B1208
B0209 B0309 B0409 B0709 B0809 B0909 B1009 B1109 B1209
B0210 B0310 B0410 B0710 B0810 B0910 B1010 B1110 B1210
B0211 B0311 B0411 B0711 B0811 B0911 B1011 B1111 B1211
B0212 B0312 B0412 B0712 B0812 B0912 B1012 B1112 B1212
6
IBM Software
• Parallel Environment (PE 4.2.2), which contains poe and MPI, remains unchanged
• Parallel System Support Package (PSSP 3.5.0), which contains LAPI, has been absorbed into the Reliable Scalable Clustering Technology (RSCT 2.4.2) software stack
7
IBM Software
• MPI 4.2.2
  – Uses LAPI as reliable transport layer
  – Uses threads, not signals, for asynchronous activities
• Binary compatible
• New performance characteristics
  – Eager
  – Bulk transfer
  – Collectives
8
IBM Software Stack
[Diagram: software stack. Components shown: HPS, SMA3+ Adapter, HAL, LAPI, IF_LS, IP, MPI, Application, ESSL, PESSL, GPFS, Sockets, VSD, TCP, UDP]
9
Communication Modes
• FIFO mode
  – Chopped into 2KB chunks on host, copied by CPU
• Remote Direct Memory Access (RDMA)
  – CPU offload
  – One I/O bus crossing
[Diagram: data paths between CPU, user buffer, and adapter, labeled FIFO, RDMA, and DMA]
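The FIFO-mode chunking described on this slide can be illustrated with a small calculation. This is only a sketch: the 2KB chunk size comes from the slide, while the function name `fifo_chunks` and the treatment of a short final chunk are assumptions made for illustration, not a description of the adapter's actual packet format.

```python
CHUNK_BYTES = 2 * 1024  # 2KB chunk size quoted on the slide


def fifo_chunks(message_bytes, chunk_bytes=CHUNK_BYTES):
    """Return the chunk sizes a message would be split into (hypothetical helper).

    Assumes every chunk is full except possibly the last one.
    """
    if message_bytes < 0:
        raise ValueError("message_bytes must be non-negative")
    full, rest = divmod(message_bytes, chunk_bytes)
    return [chunk_bytes] * full + ([rest] if rest else [])


if __name__ == "__main__":
    sizes = fifo_chunks(5000)
    print(len(sizes), sizes)  # 3 [2048, 2048, 904]
```

Under this model the number of CPU copies grows linearly with message size, which is the cost that RDMA is meant to offload.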
10
RDMA (Bulk transfer)
• Overlap of communication and computation possible
  – Asynchronous-messaging applications
  – One-sided communications
• Reduce CPU work
  – Offload fragmentation and reassembly
  – Minimize packet arrival interrupts
• Reduce memory subsystem load
  – Zero copy transport
• Striping across adapters
11
RDMA vs. Packet
[Plot: PingPong bandwidth (MB/s, axis 0 to 3500) vs. MSG Size (axis 0 to 8.4E+06); two series, both labeled PingPong in the legend]
12
MPI Transfer Protocols
• Eager: send data immediately; store in remote buffer
  – No synchronization
  – Only one message sent
  – Uses memory for buffering (less for application)
• Rendezvous: send message header; wait for recv to be posted; send data
  – No data copy may be required
  – No memory required for buffering (more for application)
  – More messages required
  – Synchronization (standard send blocks until recv posted)
[Diagram: messages between P0 and P1. Eager: data, ack. Rendezvous: req, ack, data, ack.]
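The difference between the two protocols can be sketched as a toy trace of the messages exchanged between P0 and P1. This is a simplified model and not the MPI library's implementation: the function name `transfer_trace`, the message directions, and the rule that a message at or below the eager limit goes eager are assumptions for illustration. The 32KB value is the upper end of the MP_EAGER_LIMIT range quoted later in the talk.

```python
def transfer_trace(message_bytes, eager_limit):
    """Return (protocol, list of messages exchanged) for one send (toy model).

    Assumption: messages no larger than eager_limit use the eager protocol,
    everything else uses rendezvous.
    """
    if message_bytes <= eager_limit:
        # Data is sent immediately and buffered at the receiver if needed.
        return "eager", ["P0->P1 data", "P1->P0 ack"]
    # Header first; data moves only after the receive has been posted.
    return "rendezvous", [
        "P0->P1 req",
        "P1->P0 ack",
        "P0->P1 data",
        "P1->P0 ack",
    ]


if __name__ == "__main__":
    eager_limit = 32 * 1024  # 32KB, upper end of the range in the talk
    for size in (1024, 1024 * 1024):
        protocol, steps = transfer_trace(size, eager_limit)
        print(size, protocol, len(steps))
```

The extra round trip in the rendezvous branch is the synchronization cost the slide refers to; the eager branch avoids it at the price of receiver-side buffer memory.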
13
Eager vs. Rendezvous
[Plot: Time (us, axis 0 to 120) vs. MSG Size (axis 0 to 1.3E+05); series: Eager, Rendezvous]
14
Latency
System    Intra (us)  Inter (us)
Seaborg   10.5        24.5
Jacquard  0.6         4.7
Bassi     1.1         4.5
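To put the latency table in relative terms, the following sketch computes each system's latency as a multiple of Bassi's. The numbers are copied from the table; the helper name `ratio_to_bassi` is hypothetical.

```python
# Latencies in microseconds from the table: (intranode, internode)
LATENCY_US = {
    "Seaborg": (10.5, 24.5),
    "Jacquard": (0.6, 4.7),
    "Bassi": (1.1, 4.5),
}


def ratio_to_bassi(system):
    """Return (intranode, internode) latency of `system` divided by Bassi's."""
    intra, inter = LATENCY_US[system]
    bassi_intra, bassi_inter = LATENCY_US["Bassi"]
    return intra / bassi_intra, inter / bassi_inter


if __name__ == "__main__":
    for name in LATENCY_US:
        intra, inter = ratio_to_bassi(name)
        print(f"{name}: intra {intra:.2f}x, inter {inter:.2f}x of Bassi")
```

By these figures Seaborg's internode latency is roughly five times Bassi's, while Jacquard and Bassi are nearly equal between nodes.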
15
Internode Comparison
[Plot: internode bandwidth (MB/s, axis 0 to 3500) vs. MSG Size (axis 0 to 8.4E+06); series: bassi, seaborg, jacquard]
16
Internode Comparison
[Plot: internode bandwidth for small messages (MB/s, axis 0 to 400) vs. MSG Size (axis 0 to 4.1E+03); series: bassi, seaborg, jacquard]
17
Intranode Comparison
[Plot: intranode bandwidth (MB/s, axis 0 to 8000) vs. MSG Size (axis 0 to 2.1E+06); series: bassi, seaborg, jacquard]
18
Intranode Comparison
[Plot: intranode bandwidth for small messages (MB/s, axis 0 to 1400) vs. MSG Size (axis 0 to 4.1E+03); series: bassi, seaborg, jacquard]
19
Packed-node Comparison
[Plot: packed-node bandwidth (MB/s, axis 0 to 600) vs. MSG Size (axis 0 to 8.4E+06); series: bassi, seaborg, jacquard]
20
Packed-node Comparison
[Plot: packed-node bandwidth for small messages (MB/s, axis 0 to 300) vs. MSG Size (axis 0 to 4.1E+03); series: bassi, seaborg, jacquard]
21
POE environment variables
• MP_SINGLE_THREAD
  – Set to Yes for slight latency decrease, set to No for MPI I/O and OpenMP, etc.
• MP_USE_BULK_XFER
  – Default to Yes
• MP_BULK_MIN_MSG_SIZE
  – Default to ~150KB
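One way to manage these settings from a job-launch script is to build the environment in one place. The sketch below only assembles a dictionary of the three variables named on this slide. The helper name `poe_tuning_env`, the lowercase `yes`/`no` spelling, and passing the bulk-transfer threshold as a plain byte count are assumptions; check the PE Operation and Use manual for the accepted value formats.

```python
import os


def poe_tuning_env(single_thread=False, use_bulk_xfer=True,
                   bulk_min_msg_size=None, base=None):
    """Return a copy of the environment with POE tuning variables set.

    Hypothetical helper: variable names come from the slide, value
    spellings are assumptions.
    """
    env = dict(os.environ if base is None else base)
    # "yes" trades thread safety for slightly lower latency; keep "no"
    # for codes that use MPI I/O or OpenMP.
    env["MP_SINGLE_THREAD"] = "yes" if single_thread else "no"
    env["MP_USE_BULK_XFER"] = "yes" if use_bulk_xfer else "no"
    if bulk_min_msg_size is not None:
        # Assumed to be a byte count; leave unset to keep the default.
        env["MP_BULK_MIN_MSG_SIZE"] = str(bulk_min_msg_size)
    return env


if __name__ == "__main__":
    env = poe_tuning_env(single_thread=True, bulk_min_msg_size=64 * 1024, base={})
    for key in sorted(env):
        print(f"{key}={env[key]}")
```

The resulting dictionary could be passed as the `env` argument of `subprocess.run` when starting a job, or printed as `export` lines for a batch script.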
22
POE environment variables
• MP_BUFFER_MEM
  – Default is 64MB
• MP_EAGER_LIMIT
  – Varies from 32KB to 1KB depending on job size; can be increased in conjunction with MP_BUFFER_MEM
• LAPI parameters for apps with many blocking sends of small messages:
  – MP_REXMIT_BUF_SIZE
    • Default is 128 bytes
  – MP_REXMIT_BUF_CNT
    • Default is 128 buffers
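The coupling between MP_EAGER_LIMIT and MP_BUFFER_MEM can be made concrete with some rough arithmetic. The sketch below is a simplified model, not IBM's actual buffer accounting: it assumes the eager buffer is shared evenly among all other tasks and asks how many maximum-size eager messages from each peer would fit. The helper name `eager_slots_per_peer` and the task counts are hypothetical.

```python
KB = 1024
MB = 1024 * KB


def eager_slots_per_peer(buffer_mem, eager_limit, ntasks):
    """Maximum-size eager messages per sending peer that fit in buffer_mem.

    Simplified model: the buffer is divided evenly among ntasks - 1 peers.
    """
    peers = ntasks - 1
    if peers <= 0 or eager_limit <= 0:
        raise ValueError("need at least two tasks and a positive eager limit")
    return buffer_mem // (eager_limit * peers)


if __name__ == "__main__":
    # Default 64MB buffer with the two ends of the eager-limit range.
    print(eager_slots_per_peer(64 * MB, 32 * KB, 256))   # 8
    print(eager_slots_per_peer(64 * MB, 1 * KB, 4096))   # 16
```

In this model, raising the eager limit at a fixed task count shrinks the number of messages that can be buffered per peer, which is one way to see why the two variables are adjusted together.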
23
IBM Documentation
• RSCT for AIX 5L LAPI Programming Guide (SA22-7936-03)
  – LAPI programming
• Parallel Environment for AIX 5L V4.2.2 Operation and Use, Vol 1 (SA22-7948-04)
  – Running jobs
• Parallel Environment for AIX 5L V4.2.2 Operation and Use, Vol 2 (SA22-7949-04)
  – Performance tools
• Parallel Environment for AIX 5L V4.2.2 MPI Programming Guide (SA22-7945-04)
  – IBM MPI implementation