
Opportunities and Challenges of Modern Communication Architectures: Case Study with QsNet

CAC Workshop

Santa Fe, NM, 2004

Sameer Kumar* and Laxmikant V. Kale

Parallel Programming Laboratory

University of Illinois at Urbana-Champaign

Outline

• Processor virtualization
• QsNet
• Opportunities
• Performance evaluation of QsNet
• Challenges of QsNet
• Summary

Processor Virtualization

Basic idea of processor virtualization:

• User specifies interaction between objects (virtual processors, VPs)
• Runtime system (RTS) maps VPs onto physical processors
• Typically, # virtual processors > # processors

Embodied in Charm++ and AMPI
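As a rough illustration of the mapping step, the sketch below places virtual processors on physical processors round-robin. The function name `map_vps` and the round-robin policy are assumptions made for this example; the slides do not say which placement strategy the Charm++ runtime uses.

```python
def map_vps(num_vps, num_procs):
    """Assign each virtual processor (VP) to a physical processor.

    Round-robin placement is an assumption made for this sketch; a real
    runtime system may place VPs differently.
    """
    if num_procs <= 0:
        raise ValueError("need at least one physical processor")
    placement = {proc: [] for proc in range(num_procs)}
    for vp in range(num_vps):
        placement[vp % num_procs].append(vp)
    return placement


if __name__ == "__main__":
    # Virtualization ratio of 8: 32 VPs on 4 physical processors.
    for proc, vps in map_vps(32, 4).items():
        print(proc, vps)
```

With more VPs than processors, each physical processor hosts several VPs, which is what lets the runtime switch to another VP while one is waiting for a message.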

[Figure: User View versus System implementation]

QsNet

• Popular interconnect from Quadrics
• Several parallel systems in the Top500 use QsNet
  • Pittsburgh’s Lemieux (6 TF)
  • ASCI-Q (20 TF)
• Elite network
• Elan adaptor

Elite Network

• 320 MB/s each way after protocol overhead
• Reliable fat-tree network
  • Multiple routes provide fault tolerance
• Adaptive wormhole routing
• 35 ns per hop

Elan Network Adaptor

Features
• Low latency (4.5 μs for MPI)
• High bandwidth (320 MB/s/node)

Components
• SPARC processor
• DMA engine
• 64 MB RAM
• On-chip cache

Low CPU Overhead

[Figure: latency and CPU overhead versus message size (bytes), for message sizes from 16 to 4096 bytes]

CPU overhead is small and does not change much with the message size.

Traditional Message Passing

[Figure: timeline showing send overhead, receive overhead, and idle time]

Traditional message passing does not utilize the low CPU overhead of Elan.

Adaptive Overlap

[Figure: timelines for VP0 and VP1 showing send overhead and receive overhead]

Processor virtualization takes full advantage of the low CPU overhead of Elan.
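A toy model can make the overlap argument concrete. The sketch below is not taken from the slides: it assumes one processor with a fixed amount of computation per iteration, split evenly across its VPs, and a fixed wait for each VP's message during which the CPU is free. The name `iteration_time` and all parameters are hypothetical.

```python
def iteration_time(compute, wait, ratio):
    """Steady-state time per iteration on one processor (toy model).

    compute: total computation per iteration on this processor
    wait:    time a VP waits for its message (CPU is free meanwhile)
    ratio:   virtualization ratio (number of VPs on the processor)

    Assumption: while one VP waits, the other VPs on the same processor
    run their share of the computation.
    """
    if ratio < 1:
        raise ValueError("virtualization ratio must be at least 1")
    per_vp = compute / ratio
    # The processor is busy for `compute`; a VP cannot start its next
    # iteration before its own work plus its message wait has elapsed.
    return max(compute, per_vp + wait)


if __name__ == "__main__":
    for ratio in (1, 8):
        print(ratio, iteration_time(compute=8.0, wait=2.0, ratio=ratio))
```

In this model, a ratio of 1 pays the full wait on every iteration, while a higher ratio hides the wait behind the other VPs' computation as long as there is enough work to cover it.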

Benefit of Adaptive Overlap

Problem setup: 3D stencil calculation of size 240^3 run on Lemieux. The plot shows AMPI with virtualization ratios of 1 and 8.

[Figure: AMPI(1) and AMPI(8) curves; axis ticks at 1, 10, 100, 1000]

Charm++ Message Driven Execution

[Figure: scheduler loop with the stages Handler, Scheduler, Pump, Garbage Collection, Tport Send, Post Receives, Receive Message]
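The idea of message-driven execution, where a scheduler picks up an arrived message and invokes its handler, can be sketched as follows. This is a minimal single-process illustration, not the Charm++ or Converse implementation; the class `Scheduler` and its methods are hypothetical names.

```python
from collections import deque


class Scheduler:
    """Toy message-driven scheduler: take a message, run its handler."""

    def __init__(self):
        self.queue = deque()
        self.handlers = {}

    def register(self, tag, handler):
        # Associate a message tag with the function that processes it.
        self.handlers[tag] = handler

    def deliver(self, tag, payload):
        # Stands in for a message arriving from the network.
        self.queue.append((tag, payload))

    def run(self):
        # Handlers may enqueue further messages; loop until none remain.
        processed = 0
        while self.queue:
            tag, payload = self.queue.popleft()
            self.handlers[tag](payload)
            processed += 1
        return processed


if __name__ == "__main__":
    sched = Scheduler()

    def countdown(n):
        print("handling", n)
        if n > 0:
            sched.deliver("countdown", n - 1)

    sched.register("countdown", countdown)
    sched.deliver("countdown", 2)
    sched.run()
```

No handler ever blocks waiting for a specific message; the processor always works on whatever has already arrived.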

NAMD: A Production MD System

• Written in Charm++
• Fully featured program
• NIH-funded development
• Distributed free of charge (5000+ downloads so far)
• Binaries and source code
• Installed at NSF centers
• Large published simulations (e.g., aquaporin simulation featured in keynote)

Scaling NAMD

Several QsNet challenges had to be overcome to scale NAMD.

QsNet Challenge: Latency

[Figure: latency versus number of receives posted (1, 5, 9, 17, 33), comparing MPI and Converse]

Applications need to post receives for messages of different sizes.

Latency Bottlenecks

• Slow NIC processor with a 100 MHz clock
• Cache size is only 8 KB
  • Traversing a large loop flushes it

Cache Misses vs Number of Receives Posted

Receives posted    Cache misses
1                  86017
5                  92475
9                  103037
13                 174060
17                 1008003

Managing Latency: Message Combining

• Organize processors in a 2D (virtual) mesh
• Phase 1: Processors send messages to row neighbors (√P - 1 messages)
• A message from (x1,y1) to (x2,y2) goes via (x1,y2)
• Phase 2: Processors send messages to column neighbors (√P - 1 messages)
• 2(√P - 1) messages instead of P - 1
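The two-phase routing can be sketched in a few lines. The example assumes a square mesh with √P processors per side and a row-major rank numbering (rank = x * side + y); both are assumptions for illustration, and the function names are hypothetical.

```python
import math


def mesh_route(src, dst, side):
    """Return the intermediate rank for a message in a side x side mesh.

    Ranks are numbered row-major: rank = x * side + y, with x the row and
    y the column (a numbering assumed for this sketch).
    """
    x1, _ = divmod(src, side)
    _, y2 = divmod(dst, side)
    # (x1, y2): same row as the source, same column as the destination.
    return x1 * side + y2


def messages_per_processor(num_procs):
    """Messages each processor sends: direct vs. two-phase combining."""
    side = math.isqrt(num_procs)
    if side * side != num_procs:
        raise ValueError("sketch assumes a perfect-square processor count")
    direct = num_procs - 1      # one message to every other processor
    combined = 2 * (side - 1)   # row phase plus column phase
    return direct, combined


if __name__ == "__main__":
    for p in (256, 1024):
        print(p, messages_per_processor(p))
```

The combined messages are larger, since each one carries data for a whole column, but far fewer of them are sent, which helps when per-message cost dominates.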

NAMD PME Performance

[Figure: performance on 256, 512, and 1024 processors; legend: Direct, Native MPI]

Performance of NAMD with the ATPase molecule. The PME step in NAMD involves a 192 × 144 processor collective operation with 900-byte messages.

QsNet Challenge: Bandwidth

• QsNet network bandwidth: 320 MB/s
• PCI/DMA contention restricts bandwidth on Alpha servers

Node bandwidth (MB/s)
One Way    290
Two Way    128

Improving Bandwidth

Node bandwidth (MB/s) for different placements of source and destination

           Main-Main   Elan-Main   Elan-Elan
One Way    290         305         319
Two Way    128         305         319

Sending messages from Elan memory is faster.
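To put the table in perspective, the short sketch below expresses each measured node bandwidth as a fraction of the 320 MB/s network bandwidth quoted in the slides. The numbers are copied from the table; the names `NODE_MBPS` and `link_utilization` are hypothetical.

```python
LINK_MBPS = 320  # QsNet network bandwidth quoted in the slides

# Node bandwidth (MB/s) by buffer placement and traffic direction.
NODE_MBPS = {
    ("Main-Main", "one-way"): 290,
    ("Main-Main", "two-way"): 128,
    ("Elan-Main", "one-way"): 305,
    ("Elan-Main", "two-way"): 305,
    ("Elan-Elan", "one-way"): 319,
    ("Elan-Elan", "two-way"): 319,
}


def link_utilization(mbps, link=LINK_MBPS):
    """Fraction of the network bandwidth that a measurement achieves."""
    return mbps / link


if __name__ == "__main__":
    for (placement, direction), mbps in NODE_MBPS.items():
        share = link_utilization(mbps)
        print(f"{placement:10s} {direction:8s} {mbps:4d} MB/s  {share:.0%}")
```

The two-way Main-Main case stands out: it is the only configuration that falls well below the network bandwidth.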

QsNet Challenge: Stretched Handlers

[Figure: NAMD Timeline; labels: Stretched Sends, Green superscripts, Force compute, Integrate]

Similar stretches were observed in the middle of entry methods.

Stretching Solution

Stretched sends:
• Elan Isend blocked when the rendezvous for the previous Isend to any destination had not been acknowledged
• Solved the problem by working closely with Quadrics and obtaining a patch
• With the patch, Isend only blocks on the rendezvous of the previous message to the same destination

Stretching Solution Contd.

Stretches in the middle of entry methods:
• Caused by OS daemons
• Using blocking receives minimized these stretches
• Daemons can be scheduled when the processor is idle

NAMD With Blocking Receives

[Figure: timeline labeled Blocking Receives]

NAMD Performance on Lemieux

[Figure: NAMD step time (ms) and performance (GF) versus number of processors; axis ticks from 1000 to 3000]

Summary

• QsNet is an excellent network
• NIC co-processor is ideal for message-driven execution
• Programming guidelines
  • Send messages from Elan memory
  • Post a limited number of receives, and post them before the sends
  • Use blocking receives to avoid stretching

Future Work

• One-sided communication
  • Barrier?
• Persistent one-sided communication
  • Reserve buffers on destination
