1

Opportunities and Challenges of Modern Communication Architectures: Case Study with QsNet

CAC Workshop

Santa Fe, NM, 2004

Sameer Kumar* and Laxmikant V. Kale

Parallel Programming Laboratory

University of Illinois at Urbana-Champaign

2

Outline

• Processor virtualization
• QsNet
• Opportunities
• Performance evaluation of QsNet
• Challenges of QsNet
• Summary

3

Processor Virtualization

• Basic idea of processor virtualization
  • User specifies interaction between objects (virtual processors, VPs)
  • Runtime system (RTS) maps VPs onto physical processors
  • Typically, # virtual processors > # physical processors
• Embodied in Charm++ and AMPI

[Figure: user view of interacting VPs alongside the system implementation on physical processors]
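The mapping step can be illustrated with a minimal sketch. It assumes a plain round-robin placement, which is only one possible policy; the function name `map_vps_round_robin` is hypothetical, and the sketch does not reproduce how the Charm++ runtime actually places or migrates objects.

```python
from collections import defaultdict
from typing import Dict, List


def map_vps_round_robin(num_vps: int, num_procs: int) -> Dict[int, List[int]]:
    """Assign virtual processor i to physical processor i % num_procs.

    Round-robin is an illustrative assumption, not the RTS's own policy.
    """
    if num_procs <= 0:
        raise ValueError("need at least one physical processor")
    placement: Dict[int, List[int]] = defaultdict(list)
    for vp in range(num_vps):
        placement[vp % num_procs].append(vp)
    return dict(placement)


if __name__ == "__main__":
    # Virtualization ratio of 4: eight VPs on two physical processors.
    for proc, vps in sorted(map_vps_round_robin(8, 2).items()):
        print(f"processor {proc} hosts VPs {vps}")
```

With more VPs than processors, every processor hosts several VPs, which is what gives the runtime something else to schedule while one VP waits for a message.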

4

QsNet

• Popular interconnect from Quadrics
• Several parallel systems in the Top500 use QsNet
  • Pittsburgh’s Lemieux (6 TF)
  • ASCI-Q (20 TF)
• Elite network
• Elan adaptor

5

Elite Network

• 320 MB/s each way after protocol
• Reliable fat-tree network
• Multiple routes provide fault tolerance
• Adaptive wormhole routing
• 35 ns per hop

6

Elan Network Adaptor

• Features
  • Low latency (4.5 μs for MPI)
  • High bandwidth (320 MB/s per node)
• Components
  • SPARC processor
  • DMA engine
  • 64 MB RAM
  • On-chip cache

7

Low CPU Overhead

[Figure: latency and CPU overhead (μs, 0 to 35) versus message size (16 to 4096 bytes)]

CPU overhead is small and does not change much with the message size

8

Traditional Message Passing

[Figure: timeline for processors P0 and P1 showing send overhead, receive overhead, and idle time]

Traditional message passing does not utilize the low CPU overhead of Elan

9

Adaptive Overlap

[Figure: timeline for processors P0 and P1, each running VP0 and VP1, showing send overhead and receive overhead]

Processor virtualization takes full advantage of the low CPU overhead of Elan
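The effect can be reasoned about with a back-of-the-envelope model. The sketch below is an assumption-laden toy, not a measurement: it supposes that each VP computes for an equal share of the step, pays a fixed CPU overhead per message, and then waits one network latency for its reply, during which other VPs on the same processor may run. The function `step_time` and all input values are hypothetical.

```python
def step_time(compute: float, latency: float, overhead: float, vps: int) -> float:
    """Estimate the time of one step on one processor hosting `vps` VPs.

    compute  -- total computation per step on the processor
    latency  -- network latency each VP waits for after sending
    overhead -- CPU time spent per message (send plus receive)
    All values share one time unit; the model is illustrative only.
    """
    if vps <= 0:
        raise ValueError("need at least one virtual processor")
    per_vp = compute / vps + overhead  # CPU time a single VP needs per step
    busy = vps * per_vp                # CPU work if waiting is fully hidden
    critical = per_vp + latency        # one VP's compute plus its own wait
    return max(busy, critical)


if __name__ == "__main__":
    for ratio in (1, 2, 8):
        print(ratio, step_time(compute=100.0, latency=20.0, overhead=2.0, vps=ratio))
```

In this model a single VP leaves the processor idle for the whole latency, while several VPs hide the wait at the cost of extra per-message overhead. That trade-off pays off only when the CPU overhead per message is small, which is the property of Elan the slides emphasize.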

10

Benefit of Adaptive Overlap

Problem setup: 3D stencil calculation of size 240^3 run on Lemieux.

Shows AMPI with virtualization ratios of 1 and 8.

[Figure: execution time (sec, log scale from 0.001 to 1) versus number of processors (1 to 1000) for AMPI(1) and AMPI(8)]

11

Charm++ Message Driven Execution

[Figure: the scheduler invokes handlers, the pump, and garbage collection; handlers send through Tport Send; the pump posts receives and receives messages]
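The scheduler-and-handler part of this picture can be sketched in a few lines. The sketch is a single-process toy under stated assumptions: the `Scheduler` class, its methods, and the ping/pong handlers are hypothetical names, `send` is a local enqueue rather than a Tport call, and the pump and garbage collection stages are not modeled.

```python
from collections import deque
from typing import Callable, Deque, Dict, List, Tuple

Handler = Callable[["Scheduler", object], None]


class Scheduler:
    """Toy message-driven scheduler: each message names the handler that consumes it."""

    def __init__(self) -> None:
        self.queue: Deque[Tuple[str, object]] = deque()
        self.handlers: Dict[str, Handler] = {}

    def register(self, name: str, handler: Handler) -> None:
        self.handlers[name] = handler

    def send(self, name: str, payload: object) -> None:
        # A real runtime would hand the message to the network layer;
        # here it is simply appended to the local queue.
        self.queue.append((name, payload))

    def run(self) -> int:
        """Deliver messages until the queue is empty; return the number delivered."""
        delivered = 0
        while self.queue:
            name, payload = self.queue.popleft()
            self.handlers[name](self, payload)
            delivered += 1
        return delivered


def run_ping_pong(rounds: int) -> List[str]:
    """Run `rounds` ping/pong exchanges and return the order in which handlers ran."""
    log: List[str] = []

    def ping(sched: Scheduler, n: object) -> None:
        log.append(f"ping {n}")
        sched.send("pong", n)

    def pong(sched: Scheduler, n: object) -> None:
        log.append(f"pong {n}")
        if int(n) + 1 < rounds:
            sched.send("ping", int(n) + 1)

    sched = Scheduler()
    sched.register("ping", ping)
    sched.register("pong", pong)
    sched.send("ping", 0)
    sched.run()
    return log


if __name__ == "__main__":
    print(run_ping_pong(2))
```

No handler ever blocks waiting for a particular message: control returns to the scheduler, which runs whatever message is available next.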

12

NAMD: A Production MD System

• Written in Charm++
• Fully featured program
• NIH-funded development
• Distributed free of charge (5000+ downloads so far)
• Binaries and source code
• Installed at NSF centers
• Large published simulations (e.g., aquaporin simulation featured in keynote)

13

Scaling NAMD

• Several QsNet challenges had to be overcome to scale NAMD

14

QsNet Challenge: Latency

[Figure: short message latency (μs, 0 to 20) versus number of receives posted (1 to 33), for MPI and Converse]

Applications need to post receives for messages of different sizes

15

Latency Bottlenecks

• Slow NIC processor with a 100 MHz clock
• Cache size is only 8 KB
  • Traversing a large loop flushes it

Cache misses vs. number of receives posted

Receives posted   Cache misses
1                 86017
5                 92475
9                 103037
13                174060
17                1008003

16

Managing Latency: Message Combining

• Organize processors in a 2D (virtual) mesh
• Phase 1: Processors send messages to row neighbors (√P - 1 messages)
• Phase 2: Processors send messages to column neighbors (√P - 1 messages)
• Message from (x1,y1) to (x2,y2) goes via (x1,y2)
• 2*(√P - 1) messages per processor instead of P - 1
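The two-phase scheme can be checked with a small simulation. This is a minimal sketch, assuming P is a perfect square and that processor p sits at row p // √P and column p % √P; the function name `all_to_all_via_mesh` is hypothetical, and the code counts messages rather than moving real data.

```python
import math
from typing import Dict, List, Set, Tuple


def all_to_all_via_mesh(num_procs: int) -> Tuple[List[int], Dict[int, Set[int]]]:
    """Simulate an all-to-all exchange routed through a 2D virtual mesh.

    Returns (messages sent by each processor, sources whose data reached each
    destination). Assumes num_procs is a perfect square.
    """
    side = math.isqrt(num_procs)
    if side * side != num_procs:
        raise ValueError("this sketch assumes P is a perfect square")

    sent = [0] * num_procs
    staged: Dict[int, List[Tuple[int, int]]] = {p: [] for p in range(num_procs)}

    # Phase 1: each processor sends one combined message to every row neighbor.
    # The message to (x1, y2) carries all data destined for column y2.
    for src in range(num_procs):
        x1, _ = divmod(src, side)
        for y2 in range(side):
            mid = x1 * side + y2
            bundle = [(src, x2 * side + y2) for x2 in range(side)]
            staged[mid].extend((s, d) for s, d in bundle if d != s)
            if mid != src:
                sent[src] += 1

    # Phase 2: each intermediate sends one combined message to every column neighbor.
    received: Dict[int, Set[int]] = {p: set() for p in range(num_procs)}
    for mid in range(num_procs):
        _, y2 = divmod(mid, side)
        for x2 in range(side):
            dst = x2 * side + y2
            received[dst] |= {s for s, d in staged[mid] if d == dst}
            if dst != mid:
                sent[mid] += 1

    return sent, received


if __name__ == "__main__":
    sent, received = all_to_all_via_mesh(16)
    print("messages per processor:", set(sent), "instead of", 16 - 1)
```

Each processor sends fewer but larger messages, so the per-message cost is paid 2*(√P - 1) times rather than P - 1 times. The price is that every payload crosses the network twice.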

17

NAMD PME Performance

[Figure: step time (0 to 140) versus number of processors (256, 512, 1024) for the Mesh, Direct, and Native MPI strategies]

Performance of NAMD with the ATPase molecule. The PME step in NAMD involves a 192 x 144 processor collective operation with 900-byte messages.

18

QsNet Challenge: Bandwidth

           MB/s
One Way    290
Two Way    128

• PCI/DMA contention restricts bandwidth on Alpha servers
• QsNet network bandwidth: 320 MB/s

19

Improving Bandwidth

Node bandwidth (MB/s) for different placements of source and destination

           Main-Main   Elan-Main   Elan-Elan
One Way    290         305         319
Two Way    128         305         319

Sending messages from Elan memory is faster

20

QsNet Challenge: Stretched Handlers

• Stretched sends (green superscripts in the timeline)
• Similar stretches observed in the middle of entry methods

[Figure: NAMD timeline (processors versus time) showing the force compute and integrate phases]

21

Stretching Solution

• Stretched sends
  • Elan Isend blocked when the rendezvous for the previous Isend, to any destination, had not been acknowledged
  • Solved the problem by working closely with Quadrics and obtaining a patch
  • With the patch, Isend blocks only on the rendezvous of the previous message to the same destination

22

Stretching Solution Contd.

• Stretches in the middle of entry methods
  • Caused by OS daemons
  • Using blocking receives minimized these stretches
  • Daemons can be scheduled when the processor is idle

23

NAMD With Blocking Receives

[Figure: NAMD timeline (processors versus time) with blocking receives]

24

NAMD Performance on Lemieux

[Figure: NAMD step time (ms, 0 to 30) and performance (GFLOPS, 0 to 1200) versus number of processors (1000 to 3000)]

25

Summary

• QsNet is an excellent network
• NIC co-processor is ideal for message-driven execution
• Programming guidelines
  • Send messages from Elan memory
  • Post a limited number of receives, and post them before the sends
  • Use blocking receives to avoid stretching

26

Future Work

• One-sided communication
  • Barrier?
• Persistent one-sided communication
  • Reserve buffers on destination
