
UPC Research Activities at UF

Presentation for UPC Workshop '04

Alan D. George, Hung-Hsun Su, Burton C. Gordon, Bryan Golden, Adam Leko

HCS Research Laboratory, University of Florida


Outline

FY04 research activities: Objectives, Overview, Design, Experimental Setup, Results, Conclusions

New activities for FY05: Introduction, Approach, Conclusions


Research Objective (FY04)

Goal: Extend Berkeley UPC support to SCI by developing a new SCI conduit. Compare UPC performance on platforms with various interconnects using existing and new benchmarks.

[Figure: Berkeley UPC software stack. UPC code goes through the UPC code translator to produce translator-generated C code, which runs on the Berkeley UPC runtime system over the GASNet communication system and the network hardware; the layers are annotated in the original figure as platform-independent, network-independent, compiler-independent, and language-independent.]
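To make the division of labor in this stack concrete, below is a minimal UPC example (illustrative only, not taken from the slides; array names and sizes are invented). The translator rewrites the shared-array accesses into Berkeley UPC runtime calls, and GASNet carries any resulting remote reads and writes over whichever conduit is configured (SCI, Elan, VAPI, GM, or MPI).

/* Minimal UPC vector-sum sketch (illustrative; names and sizes invented).
 * Shared accesses below are rewritten by the translator into runtime
 * calls; GASNet moves the remote ones over the active conduit. */
#include <upc_relaxed.h>
#include <stdio.h>

#define ELEMS_PER_THREAD 1024

shared int v[ELEMS_PER_THREAD * THREADS];   /* cyclically distributed      */
shared int partial[THREADS];                /* one partial sum per thread  */

int main(void) {
    int i, sum = 0;

    partial[MYTHREAD] = 0;
    upc_forall (i = 0; i < ELEMS_PER_THREAD * THREADS; i++; &v[i])
        v[i] = i;                           /* each thread fills its own elements */
    upc_barrier;

    upc_forall (i = 0; i < ELEMS_PER_THREAD * THREADS; i++; &v[i])
        partial[MYTHREAD] += v[i];          /* purely local accumulation   */
    upc_barrier;

    if (MYTHREAD == 0) {
        for (i = 0; i < THREADS; i++)
            sum += partial[i];              /* remote reads serviced by the conduit */
        printf("sum = %d\n", sum);
    }
    return 0;
}

With Berkeley UPC such a program is compiled with upcc and launched with upcrun, independent of which GASNet conduit sits underneath.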


GASNet SCI Conduit - Design

[Figure: GASNet SCI conduit design. Each node exports control segments (N total), command segments (N*N total), and payload segments (N total) into SCI space, along with local DMA queues; a message slot carries the AM header, a medium or long AM payload, and message-ready flags. The receive side runs a polling loop: check for newly arrived messages and enqueue them all; if the first-message pointer is not NULL, dequeue the first message, extract its information, and handle it, processing AM or control replies and waiting for completion as needed, before exiting the poll, while other processing continues between polls. The queue is constructed by placing a next pointer in the header that points to the next available message.]
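The polling flow above can be summarized in code. The fragment below is illustrative only: the structures and helper names (check_for_new_message, handle_message, process_reply) are hypothetical stand-ins, not the actual conduit source or the GASNet API, but the queue is chained the way the figure notes, through a next pointer stored in each message header.

/* Illustrative sketch of the receive-side queue and polling loop;
 * all names are hypothetical, not the real conduit implementation. */
#include <stddef.h>

typedef struct am_msg {
    int            ready;      /* message-ready flag set by the sender   */
    int            is_reply;   /* AM/control reply vs. request           */
    struct am_msg *next;       /* next available message (queue link)    */
    /* ... AM header fields, medium/long payload descriptors ...         */
} am_msg_t;

static am_msg_t *queue_head = NULL;
static am_msg_t *queue_tail = NULL;

/* Stand-ins for scanning the command segments and running AM handlers. */
static am_msg_t *check_for_new_message(void) { return NULL; }
static void      handle_message(am_msg_t *m) { (void)m; }
static void      process_reply(am_msg_t *m)  { (void)m; }

static void enqueue(am_msg_t *m)             /* "Enqueue all new messages" */
{
    m->next = NULL;
    if (queue_tail) queue_tail->next = m;
    else            queue_head = m;
    queue_tail = m;
}

void poll_once(void)                         /* one pass of the polling loop */
{
    am_msg_t *m;

    while ((m = check_for_new_message()) != NULL)  /* new message available? */
        enqueue(m);

    if (queue_head == NULL)                  /* 1st message pointer == NULL? */
        return;                              /* exit polling                 */

    m = queue_head;                          /* dequeue 1st message          */
    queue_head = m->next;
    if (queue_head == NULL) queue_tail = NULL;

    if (m->is_reply)
        process_reply(m);                    /* AM reply or control reply    */
    else
        handle_message(m);                   /* handle message               */
}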


Experimental Testbed

Elan, VAPI (Xeon), MPI, and SCI conduits
Nodes: dual 2.4 GHz Intel Xeons, 1 GB DDR PC2100 (DDR266) RAM, Intel SE7501BR2 server motherboard with E7501 chipset.
SCI: 667 MB/s (300 MB/s sustained) Dolphin SCI D337 (2D/3D) NICs, using PCI 64/66, 4x2 torus.
Quadrics: 528 MB/s (340 MB/s sustained) Elan3, using PCI-X, in two nodes with a QM-S16 16-port switch.
InfiniBand: 4x (10 Gb/s, 800 MB/s sustained) Infiniserv HCAs, using PCI-X 100, InfiniIO 2000 8-port switch from Infinicon.
Software: RedHat 9.0 with gcc V3.3.2; SCI uses the MP-MPICH beta from RWTH Aachen University, Germany; Berkeley UPC runtime system 1.1.

VAPI (Opteron)
Nodes: dual AMD Opteron 240, 1 GB DDR PC2700 (DDR333) RAM, Tyan Thunder K8S server motherboard.
InfiniBand: same as in VAPI (Xeon).

GM (Myrinet) conduit (c/o access to cluster at MTU)
Nodes*: dual 2.0 GHz Intel Xeons, 2 GB DDR PC2100 (DDR266) RAM.
Myrinet*: 250 MB/s Myrinet 2000, using PCI-X, on 8 nodes connected with a 16-port M3F-SW16 switch.
Software: RedHat 7.3 with Intel C compiler V7.1; Berkeley UPC runtime system 1.1.

ES80 AlphaServer (Marvel)
Four 1 GHz EV7 Alpha processors, 8 GB RD1600 RAM, proprietary inter-processor connections.
Software: Tru64 5.1B Unix, HP UPC V2.1 compiler.

* via testbed made available courtesy of Michigan Tech


IS (Class A) from NAS Benchmark

[Chart: IS (Class A) execution time in seconds for GM, Elan, GigE MPI, VAPI (Xeon), SCI MPI, SCI, and Marvel with 1, 2, 4, and 8 threads.]

IS (Integer Sort) involves lots of fine-grain communication and little computation. Poor performance in the GASNet communication system does not necessarily indicate poor performance in a UPC application; a sketch of the access pattern follows.
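As a rough illustration of that access pattern, the sketch below (assumed code, not the NAS reference implementation; names and sizes are invented) shows the kind of one-word remote reads a key-histogram step generates: each bucket owner walks the whole shared key array, so almost every read is a tiny remote operation.

/* Fine-grain access sketch: each thread counts only the buckets it owns,
 * but must read the entire shared key array to do so, mostly one remote
 * word at a time. Not the NAS IS code; purely illustrative. */
#include <upc_relaxed.h>
#include <stdlib.h>

#define KEYS_PER_THREAD    4096
#define BUCKETS_PER_THREAD 256

shared int keys[KEYS_PER_THREAD * THREADS];
shared int hist[BUCKETS_PER_THREAD * THREADS];   /* cyclic: bucket b lives on thread b % THREADS */

int main(void) {
    int i;

    srand(MYTHREAD + 1);
    upc_forall (i = 0; i < KEYS_PER_THREAD * THREADS; i++; &keys[i])
        keys[i] = rand() % (BUCKETS_PER_THREAD * THREADS);
    upc_forall (i = 0; i < BUCKETS_PER_THREAD * THREADS; i++; &hist[i])
        hist[i] = 0;
    upc_barrier;

    for (i = 0; i < KEYS_PER_THREAD * THREADS; i++) {
        int k = keys[i];                     /* fine-grain (usually remote) read        */
        if (k % THREADS == MYTHREAD)
            hist[k] += 1;                    /* local write: only the owner counts it   */
    }
    upc_barrier;
    return 0;
}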


DES Differential Attack Simulator

Uses the S-DES (8-bit key) cipher (integer-based). Creates the basic components used in differential cryptanalysis: S-boxes, difference pair tables (DPT), and differential distribution tables (DDT). Bandwidth-intensive application, designed for a high cache miss rate, so it is very costly in terms of memory access.
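For concreteness, the sketch below shows how a differential distribution table is typically built from an S-box lookup table. The 4-bit S-box values here are placeholders, not the benchmark's tables; in the actual simulator these structures live in shared memory, which is what drives the bandwidth and cache-miss cost described above.

/* DDT construction sketch: entry [dx][dy] counts inputs x with
 * S(x) ^ S(x ^ dx) == dy. The S-box below is a placeholder. */
#include <stdio.h>

#define IN_BITS  4
#define IN_SIZE  (1 << IN_BITS)

static const unsigned char sbox[IN_SIZE] = {
    0xE, 0x4, 0xD, 0x1, 0x2, 0xF, 0xB, 0x8,
    0x3, 0xA, 0x6, 0xC, 0x5, 0x9, 0x0, 0x7
};

int main(void) {
    static int ddt[IN_SIZE][IN_SIZE] = { { 0 } };
    int x, dx;

    for (dx = 0; dx < IN_SIZE; dx++)
        for (x = 0; x < IN_SIZE; x++)
            ddt[dx][sbox[x] ^ sbox[x ^ dx]]++;   /* tally output difference */

    for (dx = 0; dx < IN_SIZE; dx++) {           /* print the table         */
        for (x = 0; x < IN_SIZE; x++)
            printf("%3d", ddt[dx][x]);
        printf("\n");
    }
    return 0;
}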

[Chart: DES differential attack simulator execution time in milliseconds for GM, Elan, GigE MPI, VAPI (Xeon), SCI MPI, SCI, and Marvel, run sequentially and with 1, 2, and 4 threads.]


DES Analysis

With an increasing number of nodes, bandwidth and NIC response time become more important; interconnects with high bandwidth and fast response times perform best.
Marvel shows near-perfect linear speedup, but integer processing time is an issue.
VAPI shows constant speedup.
Elan shows near-linear speedup from 1 to 2 nodes, but more nodes are needed in the testbed for better analysis.
GM does not begin to show any speedup until 4 nodes, and then only minimally.
The SCI conduit performs well for high-bandwidth programs but has the same speedup problem as GM.
The MPI conduit is clearly inadequate for high-bandwidth programs.


Differential Cryptanalysis for CAMEL Cipher

Uses 1024-bit S-boxes. Given a key, encrypts data, then tries to guess the key based solely on the encrypted data using a differential attack. Has three main phases:
1. Compute the optimal difference pair based on the S-box (not very CPU-intensive).
2. Perform the main differential attack (extremely CPU-intensive): obtain a list of candidate keys and check them all by brute force in combination with the optimal difference pair computed earlier.
3. Analyze the data from the differential attack (not very CPU-intensive).
The workload is computationally intensive with independent processes, plus several synchronization points; a structural sketch follows.
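The attack structure maps naturally onto UPC threads. The sketch below is structural only, with MAINKEYLOOP taken from the chart that follows; score_candidate_key is a stand-in stub, not the CAMEL code, and the real phases are far more involved.

/* Structural sketch of the parallel key search (hypothetical names). */
#include <upc_relaxed.h>
#include <stdio.h>

#define MAINKEYLOOP 256                      /* run parameter from the chart below */

shared int best_score[THREADS];
shared int best_key[THREADS];

/* Stand-in scorer; the real attack rates a candidate key against the
 * observed ciphertext differences. */
static int score_candidate_key(int key) { return (key * 31) % 101; }

int main(void) {
    int k, t;

    best_score[MYTHREAD] = -1;
    best_key[MYTHREAD]   = -1;

    /* Phase 1 (cheap): compute the optimal difference pair (omitted here). */
    upc_barrier;

    /* Phase 2 (expensive): candidate keys are independent, so they are
     * dealt out round-robin across threads by the affinity expression.    */
    upc_forall (k = 0; k < MAINKEYLOOP; k++; k) {
        int s = score_candidate_key(k);
        if (s > best_score[MYTHREAD]) {
            best_score[MYTHREAD] = s;
            best_key[MYTHREAD]   = k;
        }
    }
    upc_barrier;                             /* synchronization point       */

    /* Phase 3 (cheap): thread 0 gathers and analyzes per-thread results.  */
    if (MYTHREAD == 0) {
        int best = 0;
        for (t = 1; t < THREADS; t++)
            if (best_score[t] > best_score[best]) best = t;
        printf("best key guess: %d\n", best_key[best]);
    }
    return 0;
}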

[Chart: CAMEL differential attack execution time in seconds for SCI (Xeon), VAPI (Opteron), and Marvel with 1, 2, 4, 8, and 16 threads. Parameters: MAINKEYLOOP = 256, NUMPAIRS = 400,000, initial key 12345.]


CAMEL Analysis

Marvel: attained almost perfect speedup; synchronization cost is very low.
Berkeley UPC: speedup decreases as the number of threads increases, because the cost of synchronization grows with the thread count. Run time varied greatly as the number of threads increased, making it hard to get consistent timing readings. Performance is still decent at 32 threads (76.25% efficiency with VAPI). Performance is more sensitive to data affinity.


Conclusions (FY04)

SCI conduit: A functional, optimized version is available. Although limited by the current driver from the vendor, it achieves performance comparable to other conduits. Enhancements to resolve the driver limitations are being investigated in close collaboration with Dolphin: supporting access to all virtual memory on a remote node and minimizing transfer setup overhead. A paper was accepted by the 2004 IEEE Workshop on High-Speed Local Networks.

Performance comparison: Marvel provides better compiler warnings and better speedup. The Berkeley UPC system is a promising tool for COTS clusters, with performance on par with HP UPC; VAPI and Elan are initially found to be the strongest conduits.


Introduction to New Activity (FY05)

UPC Performance Analysis Tool (PAT)

Motivations: A UPC program does not yield the expected performance. Why? Due to the complexity of parallel computing, this is difficult to determine without tools for performance analysis, which is discouraging for users new and old; there are few options for shared-memory computing in the UPC and SHMEM communities.

Goals: Identify the important performance "factors" in UPC computing. Develop a framework for a performance analysis tool, either as a new tool or as an extension or redesign of existing non-UPC tools. Design with both performance and user productivity in mind. Attract new UPC users and support improved performance.


Approach

[Figure: PAT development flow across the application, language, compiler, middleware, and hardware layers. A literature study and intuition (a survey of existing literature plus ideas from past experience) yield a relevant factor list; a tool study (a survey of existing tools, with a list of features as the end result) combines with it to give a measurable factor list of features, analyses, and factors that can actually be measured. Experimental Study I runs preliminary experiments designed to test the validity of each factor, separating an irrelevant factor list (factors shown not to be applicable to performance) from an updated list of factors shown to affect program performance. Experimental Study II adds experiments to understand the degree of sensitivity, conditions, etc. of each factor, producing a major factor list (factors with a significant effect on program performance) and a minor factor list (factors without a significant effect), which feed the performance analysis tool.]

Define layers to divide the workload. Conduct the existing-tool study and the performance-layers study in parallel to minimize development time and maximize the usefulness of the PAT. Develop a model to predict optimal performance.


Conclusions (FY05)

PAT development cannot be successful without UPC developer and user input. Develop a UPC user pool to obtain that input: What kind of information is important? Is there familiarity with any existing PAT? Preferences, if any, and why? What is their past experience with program optimization?

Supporting each UPC compiler successfully requires extensive knowledge of how it works: compilation strategies, optimization techniques, and the list of current and future platforms.

Propose the idea of a standard set of performance measurements for all UPC platforms and implementations: computation (local and remote) and communication; see the sketch after this list.

Develop a repository of known performance bottleneck issues.
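As one example of what such a standard measurement could look like, the sketch below (assumed code, not an existing UF or Berkeley tool) times local versus remote one-word reads of shared data from thread 0; the timer choice and repetition count are arbitrary.

/* Local vs. remote read timing sketch (illustrative only). */
#include <upc_relaxed.h>
#include <stdio.h>
#include <sys/time.h>

#define REPS 100000

shared int data[THREADS];        /* element i has affinity to thread i */

static double now_sec(void) {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

int main(void) {
    int i;
    volatile int sink = 0;
    double t0, t1, t2;

    data[MYTHREAD] = MYTHREAD;
    upc_barrier;

    if (MYTHREAD == 0 && THREADS > 1) {
        t0 = now_sec();
        for (i = 0; i < REPS; i++) sink += data[0];            /* local reads  */
        t1 = now_sec();
        for (i = 0; i < REPS; i++) sink += data[THREADS - 1];  /* remote reads */
        t2 = now_sec();
        printf("local:  %.3f usec/read\n", (t1 - t0) * 1e6 / REPS);
        printf("remote: %.3f usec/read\n", (t2 - t1) * 1e6 / REPS);
    }
    upc_barrier;
    return 0;
}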


Comments and Questions?


Appendix – Sample User Survey

Are you currently using any performance analysis tools? If so, which ones? Why?

What features do you think are most important in deciding which tool to use?

What kind of information is most important for you when determining performance bottlenecks?

Which platforms do you target most? Which compiler(s)?

From past experience, what coding practices typically lead to most of the performance bottlenecks (for example, bad data affinity to a node)?


Appendix - GASNet Latency on Conduits

[Chart: GASNet round-trip latency in microseconds vs. message size from 1 byte to 1 KB for the GM, Elan, VAPI, MPI SCI, and HCS SCI conduits, put and get operations.]


Appendix - GASNet Throughput on Conduits

[Chart: GASNet throughput in MB/s vs. message size from 128 bytes to 256 KB for the GM, Elan, VAPI, MPI SCI, and HCS SCI conduits, put and get operations.]