Post on 14-Jan-2016
Gemini System Interconnect
Cray Inc. Hot Interconnects 1
Bob Alverson, Duncan Roweth, Larry Kaplan
Cray Inc.
Overview
Network Interface
Router
Reliability, Availability, and Serviceability Features
Software Stack
Performance
Agenda
Integrated NIC and Router
External HSS Monitoring
Supports 2 Nodes per ASIC
Advanced Resiliency Features
Hardware Global Address Support
Advanced NIC designed to efficiently support:
MPI
One-sided MPI
SHMEM
UPC, Coarray Fortran
Cray XE6 With Gemini Network
Cray XE6 Chassis Topology
[Figure: Cray XE6 chassis topology diagrams with X, Y, and Z dimension labels]
Fast Memory Access (FMA) – fine grain remote PUT/GET
Block Transfer Engine (BTE) – offload for long transfers
Completion Queue (CQ) – client notification
Atomic Memory Op (AMO) – fetch&add, etc.
Gemini Network Interface
[Figure: Gemini NIC block diagram. Labeled blocks: HT3 Cave, LB Ring, LB, LM, NL, FMA, CQ, NPT, RMT, HARB, ORB, RAT, NAT, BTE, CLM, AMO, TARB, SSID, Router Tiles. Labeled signals: ht p req, ht np req, ht treq p, ht treq np, ht trsp, ht irsp, ht pireq, ht npireq, net req, net rsp, net rsp headers, vc0, vc1.]
Single-sided
Processor stores become remote PUT or GET
FMA descriptors hold state to help determine destination node and memory location
FMA PUT for short messages
Uncached processor store to Gemini window translated directly to network packet
FMA GET allows reverse direction data transfer of 1 to 64 bytes
Fast Memory Access
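To make the FMA PUT path above more concrete, here is a minimal Python sketch of the idea that a descriptor holds the destination state and a store into the window becomes a network PUT. It is a toy model, not Gemini's actual programming interface: the names `FmaDescriptor`, `PutRequest`, `store`, `dest_node`, and `remote_base` are all hypothetical, and the 64-byte cap is borrowed from the maximum Put request size given on the packet format slide.

```python
from dataclasses import dataclass

MAX_PUT_BYTES = 64  # assumption: one store maps to one packet of at most 64 bytes


@dataclass(frozen=True)
class PutRequest:
    """Hypothetical stand-in for the network packet produced by a store."""
    dest_node: int
    remote_addr: int
    data: bytes


@dataclass
class FmaDescriptor:
    """Hypothetical descriptor: state that selects the remote node and memory."""
    dest_node: int     # destination node held in the descriptor
    remote_base: int   # base address of the remote memory region

    def store(self, window_offset: int, data: bytes) -> PutRequest:
        """Translate a store at an offset in the window into a PUT request."""
        if not 1 <= len(data) <= MAX_PUT_BYTES:
            raise ValueError("store must carry 1 to 64 bytes")
        if window_offset < 0:
            raise ValueError("window offset must be non-negative")
        return PutRequest(self.dest_node, self.remote_base + window_offset, data)


desc = FmaDescriptor(dest_node=7, remote_base=0x1000)
req = desc.store(0x40, b"\x01" * 8)
```

The point of the sketch is that the processor supplies only an offset and data; everything needed to address the remote side comes from descriptor state set up beforehand.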
Driver managed
BTE PUT for long messages
DMA transfer to offload data movement from processor
BTE SEND for IP traffic, etc.
Send message to remote node
Single receive queue for all sources
Upper level protocol covers lost messages
BTE GET support for simplified data transfers
In lieu of involving remote side for PUT
Block Transfer Engine
Hardware remote atomic memory operations in the NIC
Add, Compare & Swap, Logical Operations
Executed at the node with the memory
AMO cache for hot locations
Up to 64 locations with AMOs in process
Global operations support
Barriers
Counters
Collectives (reductions, global sum)
Atomic Memory Operations (AMOs)
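The semantics of the operations listed above can be sketched in a few lines of Python. This is only an illustration of fetch-and-add and compare-and-swap applied at the node that owns the memory; the class `MemoryNode` and its methods are hypothetical names, and the sketch does not model the AMO cache or the network.

```python
class MemoryNode:
    """Toy model of the node that owns the memory and applies each AMO."""

    def __init__(self):
        self.mem = {}  # address -> integer value, defaulting to 0

    def fetch_add(self, addr: int, value: int) -> int:
        """Add value to the word at addr and return the previous contents."""
        old = self.mem.get(addr, 0)
        self.mem[addr] = old + value
        return old

    def compare_swap(self, addr: int, expected: int, new: int) -> int:
        """Store new only if the word equals expected; return the old value."""
        old = self.mem.get(addr, 0)
        if old == expected:
            self.mem[addr] = new
        return old


owner = MemoryNode()
COUNTER = 0x100  # hypothetical address of a shared counter

# Four requesters each take a ticket from the same hot location.
tickets = [owner.fetch_add(COUNTER, 1) for _ in range(4)]
swapped = owner.compare_swap(COUNTER, expected=4, new=0)
```

Because every update is applied at one place, each requester sees a distinct previous value, which is what makes counters and barriers built on AMOs work.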
6x8 tile matrix
Input queue to one of 6 subswitches
Route to one of 8 output buffers
Hashed routing preserves order to cachelines
Adaptive routing
Router
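One way to see why hashed routing can preserve order to cachelines: if the path is chosen as a pure function of the cacheline address, every packet for that cacheline follows the same path and cannot overtake its predecessors, while different cachelines still spread across paths. The Python sketch below illustrates this under stated assumptions. The slides do not give the hash Gemini uses, so a simple multiplicative hash stands in for it, and the 64-byte cacheline size and the names `select_path` and `CACHELINE_BYTES` are assumptions for illustration.

```python
CACHELINE_BYTES = 64      # assumption: 64-byte cachelines
HASH_MULT = 2654435761    # stand-in multiplicative hash constant, not Gemini's


def select_path(addr: int, num_paths: int) -> int:
    """Pick a path index that depends only on the cacheline containing addr."""
    line = addr // CACHELINE_BYTES
    hashed = (line * HASH_MULT) & 0xFFFFFFFF
    return (hashed >> 16) % num_paths


# Two addresses in the same cacheline always share a path.
same_line = select_path(0x1000, 4) == select_path(0x103F, 4)

# Different cachelines are spread over the available paths.
paths_used = {select_path(line * CACHELINE_BYTES, 4) for line in range(64)}
```

Adaptive routing, by contrast, may pick a different path for each packet, which is why the slides pair it with reordering at the receiver.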
Adaptive routing
Route around stalled or down links
If a link goes down, adaptive routing mask updated in hardware to exclude it
OS traffic uses adaptive routing only, recovers from finite loss of packets
Quiesce and re-route to repair deterministic routes
Congestion feedback to allow routing around bottlenecks
Potential for improved performance on difficult traffic patterns such as transpose
Packets reordered in receive buffer (DRAM)
Separate notification (completion event) when all stored
24 bit flit
Maximum size packet is 7+24+1=32 flits
Put request of 64 bytes
Minimum is 2 flits
Put response
Network Packet Format
[Figure: three bit-layout diagrams of 24-bit phits (bits 23 to 0); every phit carries p and c bits.
General Network Packet Format: phit 0, phit 1, phit 2, …, last phit. The last phit carries reserved (R) bits, CRC-16, and ok; other phits carry payload with optional hash bits.
Network Request Packet Format: phits 0 to 6.
Data Payload (up to 24 phits): phit n, (phit n+1), (phit n+2), carrying data[19:0], data[41:20], and data[63:42].
Field labels recovered from the diagrams: h, a, r=0, v, v=0, vc, destination[15:0], source[15:0], SrcID, DstID, cmd[5:0], F, carmt, b, vm, ra, dt, pt, BTE, ptag[7:0], MDH[11:0], packetID[11:0], SSID[7:0], size, mask[15:0], address[23:6], address[37:24], addr[39:38], addr[45:40], reserved.]
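The "7+24+1=32" figure can be checked against the diagrams: the request format shows seven header phits (phit 0 to phit 6), each 64-bit data word is split across three phits (data[63:42], data[41:20], data[19:0]), and one trailing phit carries the CRC. The Python sketch below works through that arithmetic. It treats "flit" and "phit" as the same 24-bit unit, which is an assumption, as is the extension of the formula to payloads smaller than 64 bytes; only the 64-byte case is confirmed by the slide. The function names are hypothetical.

```python
PHIT_BITS = 24
HEADER_PHITS = 7     # phit 0 to phit 6 of the request format
CRC_PHITS = 1        # last phit carrying CRC-16
PHITS_PER_WORD = 3   # data[63:42], data[41:20], data[19:0]
WORD_BYTES = 8


def request_phits(payload_bytes: int) -> int:
    """Phits in a Put request carrying payload_bytes of whole 64-bit words."""
    if payload_bytes % WORD_BYTES or not 8 <= payload_bytes <= 64:
        raise ValueError("payload must be 8 to 64 bytes in 8-byte words")
    words = payload_bytes // WORD_BYTES
    return HEADER_PHITS + words * PHITS_PER_WORD + CRC_PHITS


def wire_efficiency(payload_bytes: int) -> float:
    """Fraction of the bits on the wire that are user payload."""
    return payload_bytes * 8 / (request_phits(payload_bytes) * PHIT_BITS)


max_phits = request_phits(64)        # 7 + 24 + 1
max_efficiency = wire_efficiency(64)  # 512 payload bits out of 768 wire bits
```

Under these assumptions a full 64-byte Put request spends two thirds of its bits on payload, before counting the 2-flit Put response.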
Automatic link-level retries
HT3 support including automatic retries and improved CRC
Most internal data structures are at least parity protected
The longer the occupancy of data at a location, the stronger the protection
Errors reported as precisely as possible
Payload errors reported directly to user
Control errors often cannot be associated with a particular transaction
In all cases OS or HSS can be notified of the error
Router errors included
Reported at the point of error
Endpoint(s) (user) see a timeout
RAS Features
Gemini Software Stack
[Figure: Gemini software stack diagram. Labeled components: MPICH, MPICH2, SHMEM, PGAS, DMAPP, User level Gemini Network Interface (uGNI), Kernel level GNI (kGNI), Lustre Network Driver (LND), IP over Gemini Fabric (IPoGIF), GNI Core, Gemini Hardware Abstraction Layer (GHAL), GART Resource Management (GRM), Linux Core, Cray COW solution, MRT-size page support, Registration Cache support. Paths are labeled "IOCTL or System Call" and "Direct Access".]
Latency
Bandwidth
Atomic operations
Performance
Gemini expanded to HT3 at up to 5.2 GT/s
Expect to sustain greater than 6 GB/s user data injection
Network bandwidth is limited by XT packaging
Link speed from 3.125 to 6.25 Gbit/sec
In some cases, double wide X & Z links also offer increased bandwidth
Gemini relies on user level threads
MPI processing limits to 2M messages/sec per thread
Scales beyond 10M msg/sec per NIC
Injection BW and Message Rate
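A back-of-the-envelope reading of the figures above: at 2M messages/sec per thread, reaching 10M msg/sec per NIC takes at least five threads, and at that message rate the 6 GB/s injection figure corresponds to an average of about 600 bytes per message. The Python sketch below does this arithmetic. It assumes perfectly linear scaling across threads, which the slides do not claim, and the function names are hypothetical.

```python
import math


def threads_needed(target_rate: float, per_thread_rate: float) -> int:
    """Threads required to reach target_rate, assuming linear scaling."""
    return math.ceil(target_rate / per_thread_rate)


def bytes_per_message(bandwidth_bytes_per_s: float, message_rate: float) -> float:
    """Average message size at which a message rate fills a given bandwidth."""
    return bandwidth_bytes_per_s / message_rate


min_threads = threads_needed(10_000_000, 2_000_000)
avg_size = bytes_per_message(6_000_000_000, 10_000_000)
```

Messages smaller than that average size would be limited by message rate rather than by injection bandwidth, under the same assumptions.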
One-way PUT in 750 ns
Waiting for Ack in only 1.1 us
Remote GET increases to 1.4 us
Latency vs. Transfer Size
[Figure: time in microseconds (0.0 to 2.5) vs. transfer size in bytes (8 to 1024). Series: PUT, ping-pong; PUT, at source; GET]
Peak bandwidth reached with small transfers
Multiple threads reach peak with still smaller transfers
Bandwidth vs. Transfer Size
[Figure: bandwidth in Mbytes/sec (0 to 7000) vs. transfer size in bytes (8 to 64K). Series: PPN=1, PPN=2, PPN=4]
Hot location reaches 100 Mupdates/sec
Random locations (GUPS) still over 45 Mupdates/sec
Atomic Memory Operation Rate
[Figure: AMO rate in millions (0 to 120) vs. number of processes (0 to 1024). Series: 1 AMO, 8192 AMOs]
Gemini provides low latency and high performance for fine-grain operations
Gemini has features to scale in performance and reliability to large system sizes
Questions?
Conclusion