1
Opportunities and Challenges of Modern Communication
Architectures: Case Study with QsNet
CAC Workshop
Santa Fe, NM, 2004
Sameer Kumar* and Laxmikant V. Kale
Parallel Programming Laboratory
University of Illinois at Urbana-Champaign
2
Outline
• Processor virtualization
• QsNet
• Opportunities
• Performance evaluation of QsNet
• Challenges of QsNet
• Summary
3
Processor Virtualization
Basic idea of processor virtualization
• User specifies interaction between objects (VPs)
• RTS maps VPs onto physical processors
• Typically, # virtual processors > # processors
• Embodied in Charm++ and AMPI
[Figure: user view vs. system implementation]
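To make the mapping step concrete, here is a minimal sketch of a runtime placing virtual processors on physical processors. The round-robin policy and the name `map_vps_round_robin` are assumptions chosen for illustration; the slides do not say which placement policy the Charm++ or AMPI runtime uses.

```python
from collections import defaultdict


def map_vps_round_robin(num_vps, num_procs):
    """Assign VP ids 0..num_vps-1 to physical processors in round-robin order.

    Hypothetical policy for illustration only; a real runtime system may
    place and migrate objects differently.
    """
    if num_procs <= 0:
        raise ValueError("need at least one physical processor")
    placement = defaultdict(list)
    for vp in range(num_vps):
        placement[vp % num_procs].append(vp)
    return dict(placement)


# Virtualization ratio of 4: eight VPs on two physical processors.
placement = map_vps_round_robin(num_vps=8, num_procs=2)
print(placement)
```

With more VPs than processors, each processor holds several objects, so it has other work available whenever one object is waiting on a message.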
4
QsNet
• Popular interconnect from Quadrics
• Several parallel systems in the Top500 use QsNet
  • Pittsburgh's Lemieux (6 TF)
  • ASCI-Q (20 TF)
• Elite network
• Elan adaptor
5
Elite Network
• 320 MB/s each way after protocol overhead
• Reliable fat-tree network
  • Multiple routes provide fault tolerance
• Adaptive wormhole routing
• 35 ns per hop
6
Elan Network Adaptor
Features
• Low latency (4.5 μs for MPI)
• High bandwidth (320 MB/s/node)
Components
• SPARC processor
• DMA engine
• 64 MB RAM
• On-chip cache
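As a rough illustration of what these two figures imply, the sketch below estimates the one-way transfer time with a simple latency-plus-bandwidth model. The model itself is an assumption, not something the slides state, and it ignores contention, protocol switches, and the bandwidth limits discussed later; `estimated_transfer_us` is a hypothetical helper.

```python
LATENCY_US = 4.5        # MPI latency quoted on the slide, in microseconds
BANDWIDTH_MB_S = 320.0  # per-node bandwidth quoted on the slide


def estimated_transfer_us(message_bytes):
    """Estimate transfer time as latency + size / bandwidth.

    Assumes 1 MB = 10**6 bytes, so 1 MB/s equals 1 byte per microsecond.
    """
    if message_bytes < 0:
        raise ValueError("message size must be non-negative")
    return LATENCY_US + message_bytes / BANDWIDTH_MB_S


for size in (16, 256, 4096):
    print(size, "bytes:", round(estimated_transfer_us(size), 2), "us")
```

Under this model small messages are dominated by the fixed latency, which is why per-message costs matter so much for short messages.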
7
Low CPU Overhead
[Figure: latency and CPU overhead (μs) vs. message size (bytes), for message sizes from 16 to 4096 bytes]
CPU Overhead is small and does not change much with the message size
8
Traditional Message Passing
[Figure: timeline of processors P0 and P1 showing send overhead, receive overhead, and idle time]
Traditional message passing does not utilize the low CPU overhead of Elan
9
Adaptive Overlap
[Figure: timeline of processors P0 and P1, each running virtual processors VP0 and VP1, showing send overhead and receive overhead]
Processor Virtualization takes full advantage of the low CPU overhead of Elan
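The effect can be sketched with a toy timing model. Everything in it is an assumption made for illustration: the processor's work is split evenly among its VPs, each VP sends one message after computing and needs the reply before its next iteration, and the CPU cost of sending and receiving is taken as zero. `iteration_time` is a hypothetical helper, not part of Charm++ or AMPI.

```python
def iteration_time(compute_us, network_us, vps_per_proc):
    """Toy estimate of the time one processor needs per iteration.

    The first VP finishes computing at compute_us / vps_per_proc and its
    reply arrives network_us later. Meanwhile the processor keeps busy with
    the remaining VPs until compute_us. The next iteration can start once
    both the computation and the first reply are done.
    """
    if vps_per_proc < 1:
        raise ValueError("need at least one VP per processor")
    per_vp = compute_us / vps_per_proc
    return max(compute_us, per_vp + network_us)


# One VP per processor: the processor idles for the whole network delay.
print(iteration_time(100.0, 20.0, vps_per_proc=1))
# Eight VPs per processor: the delay is hidden behind other VPs' work.
print(iteration_time(100.0, 20.0, vps_per_proc=8))
```

In this model the network delay disappears from the iteration time once the other VPs supply enough computation to cover it, which only pays off when the CPU overhead per message is small.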
10
Benefit of Adaptive Overlap
Problem setup: 3D stencil calculation of size 240³ run on Lemieux.
The plot compares AMPI with virtualization ratios of 1 and 8.
[Figure: execution time (sec) vs. number of processors, log-log axes, for AMPI(1) and AMPI(8)]
11
Charm++ Message Driven Execution
[Figure: components of message-driven execution: Scheduler, Handler, Pump, Garbage Collection, Send, Tport Send, Post Receives, Receive Message]
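The general idea of message-driven execution can be sketched as a scheduler loop that picks the next available message and runs the handler registered for it. This is a simplified, single-process illustration; the `Scheduler` class and its methods are hypothetical and are not the Charm++ or Converse API.

```python
from collections import deque


class Scheduler:
    """Minimal message-driven scheduler: messages wait in a queue and each
    one triggers the handler registered under its name."""

    def __init__(self):
        self.queue = deque()
        self.handlers = {}

    def register(self, name, handler):
        self.handlers[name] = handler

    def enqueue(self, name, payload):
        self.queue.append((name, payload))

    def run(self):
        """Process messages until the queue is empty; return how many ran."""
        handled = 0
        while self.queue:
            name, payload = self.queue.popleft()
            self.handlers[name](self, payload)
            handled += 1
        return handled


log = []


def ping(sched, n):
    log.append(("ping", n))
    if n > 0:
        sched.enqueue("pong", n - 1)  # a handler may send further messages


def pong(sched, n):
    log.append(("pong", n))
    if n > 0:
        sched.enqueue("ping", n - 1)


sched = Scheduler()
sched.register("ping", ping)
sched.register("pong", pong)
sched.enqueue("ping", 2)
count = sched.run()
print(count, log)
```

No handler ever blocks waiting for a specific message, so the processor always works on whatever has already arrived.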
12
NAMD: A Production MD System
• Written in Charm++
• Fully featured program
• NIH-funded development
• Distributed free of charge (5000+ downloads so far)
• Binaries and source code
• Installed at NSF centers
• Large published simulations (e.g., aquaporin simulation featured in keynote)
14
QsNet Challenge: Latency
[Figure: short-message latency (μs) vs. number of receives posted (1, 5, 9, 17, 33), for MPI and Converse]
Applications need to post receives for messages of different sizes
15
Latency Bottlenecks
• Slow NIC processor with a 100 MHz clock
• Cache size is only 8 KB
  • Traversing a large loop flushes it

Cache Misses vs. Number of Receives Posted
Receives posted    Cache misses
1                  86017
5                  92475
9                  103037
13                 174060
17                 1008003
16
Managing Latency: Message Combining
• Organize processors in a 2D (virtual) mesh
• Phase 1: processors send messages to row neighbors ($\sqrt{P} - 1$ messages)
• A message from (x1,y1) to (x2,y2) goes via (x1,y2)
• Phase 2: processors send messages to column neighbors ($\sqrt{P} - 1$ messages)
• $2(\sqrt{P} - 1)$ messages instead of $P - 1$
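The two-phase routing can be checked with a small simulation of an all-to-all exchange on a square mesh. This is a sketch under stated assumptions: P is a perfect square, every processor has one item for every processor, and `mesh_all_to_all` is a hypothetical helper rather than the Charm++ communication library.

```python
import math


def mesh_all_to_all(P):
    """Simulate all-to-all exchange on a sqrt(P) x sqrt(P) virtual mesh.

    Returns (sent, delivered): the number of combined messages each
    processor sends, and the (source, destination) items each one receives.
    """
    side = math.isqrt(P)
    if side * side != P:
        raise ValueError("this sketch assumes P is a perfect square")
    coords = [(x, y) for x in range(side) for y in range(side)]
    sent = {c: 0 for c in coords}

    # Phase 1: (x1, y1) sends one combined message to each row neighbor
    # (x1, y2), carrying every item destined for column y2.
    at_intermediate = {c: [] for c in coords}
    for (x1, y1) in coords:
        for y2 in range(side):
            items = [((x1, y1), (x2, y2)) for x2 in range(side)]
            if y2 != y1:
                sent[(x1, y1)] += 1  # local data needs no message
            at_intermediate[(x1, y2)].extend(items)

    # Phase 2: (x1, y2) forwards one combined message to each column
    # neighbor (x2, y2), carrying the items addressed to it.
    delivered = {c: [] for c in coords}
    for (x1, y2) in coords:
        for x2 in range(side):
            items = [it for it in at_intermediate[(x1, y2)] if it[1] == (x2, y2)]
            if x2 != x1:
                sent[(x1, y2)] += 1
            delivered[(x2, y2)].extend(items)
    return sent, delivered


sent, delivered = mesh_all_to_all(16)
print(max(sent.values()), "messages per processor instead of", 16 - 1)
```

Each processor sends fewer but larger messages, which trades extra data movement for a lower per-message cost.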
17
NAMD PME Performance
[Figure: step time vs. number of processors (256, 512, 1024) for the Mesh, Direct, and Native MPI strategies]
Performance of NAMD with the ATPase molecule. The PME step in NAMD involves a 192 × 144 processor collective operation with 900-byte messages.
18
QsNet Challenge: Bandwidth
Node bandwidth    MB/s
One Way           290
Two Way           128
PCI/DMA contention restricts bandwidth on Alpha servers
QsNet network bandwidth: 320 MB/s
19
Improving Bandwidth
Node bandwidth (MB/s) for different placements of source and destination
           Main-Main    Elan-Main    Elan-Elan
One Way    290          305          319
Two Way    128          305          319
Sending messages from Elan memory is faster
20
QsNet Challenge: Stretched Handlers
• Stretched sends (green superscripts in the timeline)
• Similar stretches observed in the middle of entry methods
[Figure: NAMD timeline, processors vs. time, showing force compute and integrate phases]
21
Stretching Solution
Stretched sends
• Elan Isend blocked while the rendezvous for the previous Isend to any destination had not been acknowledged
• Solved the problem by working closely with Quadrics and obtaining a patch
• With the patch, Isend only blocks on the rendezvous of the previous message to the same destination
22
Stretching Solution (contd.)
Stretches in the middle of entry methods
• Caused by OS daemons
• Using blocking receives minimized these stretches
• Daemons can be scheduled when the processor is idle
24
NAMD Performance on Lemieux
[Figure: NAMD step time (ms) and performance (GFLOPS) vs. number of processors, from 1000 to 3000]
25
Summary
• QsNet is an excellent network
• NIC co-processor is ideal for message-driven execution
• Programming guidelines
  • Send messages from Elan memory
  • Post a limited number of receives, and post them before the sends
  • Use blocking receives to avoid stretching