NATIONAL ENERGY RESEARCH SCIENTIFIC COMPUTING CENTER
Comparison of Communication and I/O of the Cray T3E and IBM SP
Jonathan Carter
NERSC User Services
Overview
• Node Characteristics
• Interconnect Characteristics
• MPI Performance
• I/O Configuration
• I/O Performance
T3E Architecture
• Distributed memory; single-CPU processing elements
[Diagram: processing elements, each a single CPU with local memory, connected by the interconnect]
T3E Communication Network
• Processing Elements (PE) are connected by a 3D torus.
T3E Communication Network
• The peak bandwidth of the torus is about 600 Mbyte/sec bidirectional
• Sustainable bandwidth is about 480 Mbyte/sec bidirectional
• Latency is 1 µs
• The SHMEM API gives a latency of 1 µs and a bandwidth of 350 Mbyte/sec bidirectional
SP Architecture
• Cluster of SMP nodes
[Diagram: SMP nodes, each with multiple CPUs sharing one memory, connected by the interconnect]
SP Communication Network
• Nodes are connected via adapters to the SP Switch. The switch is built from boards, each of which links 16 nodes; boards are linked together to form larger networks.
[Diagram: 16 nodes attached to one switch board; boards linked to form the larger network]
SP Communication Network
• The peak bandwidth of adapter and switch is 300 Mbyte/sec bidirectional
• Latency of the switch is about 2 µs
• Sustainable bandwidth is about 185 Mbyte/sec bidirectional
MPI Performance
                      T3E   SP (intra-node)   SP (inter-node)
Latency (µs)           12        10                 22
Bandwidth (Mbyte/s)   270       300                150

Intra-node is with 1 MPI process per node; with 2 MPI processes per node (the typical case) the bandwidth is halved.
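Figures like these can be reproduced with a simple ping-pong test. Below is a minimal sketch in C under MPI; the 1 Mbyte message size and repetition count are arbitrary choices (use a message of a few bytes in the same loop to estimate latency instead). Run with exactly 2 processes.

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        int rank, i, nreps = 1000, nbytes = 1 << 20;  /* 1 Mbyte messages */
        char *buf;
        double t0, t1, oneway;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        buf = malloc(nbytes);

        MPI_Barrier(MPI_COMM_WORLD);
        t0 = MPI_Wtime();
        for (i = 0; i < nreps; i++) {
            if (rank == 0) {            /* send, then wait for the echo */
                MPI_Send(buf, nbytes, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, nbytes, MPI_BYTE, 1, 0, MPI_COMM_WORLD, &status);
            } else if (rank == 1) {     /* echo everything back */
                MPI_Recv(buf, nbytes, MPI_BYTE, 0, 0, MPI_COMM_WORLD, &status);
                MPI_Send(buf, nbytes, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
            }
        }
        t1 = MPI_Wtime();

        if (rank == 0) {
            oneway = (t1 - t0) / nreps / 2;   /* half the round-trip time */
            printf("one-way time %g s, bandwidth %g Mbyte/s\n",
                   oneway, nbytes / oneway / 1e6);
        }
        free(buf);
        MPI_Finalize();
        return 0;
    }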
MPI Performance
MPI_Reduce (sum)
[Chart: time in µs vs. number of processes (16, 32, 64, 128) for 256-byte and 1024-byte messages on the T3E and SP]
MPI Performance
MPI_Bcast
[Chart: time in µs vs. number of processes (16, 32, 64, 128) for 256-byte and 1024-byte messages on the T3E and SP]
T3E I/O Configuration
• PEs do not have local disk
• All PEs access all filesystems equivalently
• The path for (optimal) I/O generally looks like:
– PE to I/O node via the torus
– I/O node to Fibre Channel Node (FCN) via the Gigaring
– FCN to disk array via a Fibre Channel loop
• In some cases data on an application (APP) PE must be transferred to a system buffer on an OS PE, then out to an FCN
T3E I/O Configuration
[Diagram: I/O nodes and Fibre Channel Nodes (FCNs) connected by the Gigaring, with disk arrays attached to the FCNs]
SP I/O Configuration
• Nodes have local disk: a single SCSI disk holds all local filesystems, which is not optimal
• All nodes access General Parallel File System (GPFS) filesystems equivalently
• The path for GPFS I/O looks like:
– node to GPFS node via IP over the switch
– GPFS node to disk array via SSA loop
SP I/O Configuration
[Diagram: nodes attached to switches; GPFS nodes sit behind the switches with disk arrays attached]
T3E Filesystems
• /usr/tmp
– fast
– subject to a 14-day purge, not backed up
– check quota with quota -s /usr/tmp (usually 75 Gbyte and 6000 inodes)
• $TMPDIR
– fast
– purged at the end of the job or session
– shares its quota with /usr/tmp
• $HOME
– slower
– permanent, backed up
– check quota with quota (usually 2 Gbyte and 3500 inodes)
SP Filesystems
• /scratch and $SCRATCH
– global
– fast (GPFS)
– subject to a 14-day purge (or purged at session end for $SCRATCH), not backed up
– check quota with myquota (usually 100 Gbyte and 6000 inodes)
• $TMPDIR
– local (created in /scr), only 2 Gbyte total
– slower
– purged at the end of the job or session
• $HOME
– global
– slower (GPFS)
– permanent, not backed up yet
– check quota with myquota (usually 4 Gbyte and 5000 inodes)
Types of I/O
• A bewildering number of choices on both machines:
– Standard language I/O: Fortran or C (ANSI or POSIX)
– Vendor extensions to language I/O
– MPI I/O
– Cray FFIO library (can be used from Fortran or C)
– IBM MIO library (requires code changes)
Standard Language I/O
• Fortran direct access is slightly more efficient than sequential access on both the T3E (see the comments on FFIO later) and the SP. It also allows files to be transferred between machines.
• C language I/O (fopen, fwrite, etc.) is inefficient on both machines.
• POSIX standard I/O (open, read, etc.) can be efficient on the T3E but requires care (see the comments on FFIO later); it works well on the SP. A sketch of the large-request style follows.
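As an illustration of that large-request style, here is a minimal C sketch; the file name data.out and the 8 Mbyte request size are arbitrary choices.

    /* Sketch: POSIX I/O with one large request rather than many small
     * ones; the file name and request size are illustrative. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void)
    {
        size_t nbytes = 8UL << 20;        /* 8 Mbyte in one call */
        char *buf = malloc(nbytes);
        int fd = open("data.out", O_WRONLY | O_CREAT | O_TRUNC, 0644);

        if (fd < 0 || buf == NULL) {
            perror("setup");
            return 1;
        }
        /* One large write amortizes per-call overhead on both machines. */
        if (write(fd, buf, nbytes) != (ssize_t)nbytes)
            perror("write");
        close(fd);
        free(buf);
        return 0;
    }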
Vendor Extensions to Language I/O
• Cray has a number of I/O routines (aqopen, etc.) which are legacies from the PVP systems. Non-portable.
• IBM has extended Fortran syntax to provide asynchronous I/O. Non-portable.
MPI I/O
• Part of MPI-2
• Interface for high-performance parallel I/O:
– data partitioning
– collective I/O
– asynchronous I/O
– portability and interoperability between the T3E and SP
• Different subsets are implemented on the T3E and SP (a sketch of basic usage follows)
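A minimal sketch of the explicit-offset, collective style in C, assuming the implementation provides MPI_File_open and MPI_File_write_at_all (check which routines your system supports); the file name shared.dat and the 1 Mbyte block size are illustrative.

    /* Sketch: collective MPI I/O write; each rank writes a disjoint
     * block at an explicit offset. Names and sizes are illustrative. */
    #include <mpi.h>
    #include <stdlib.h>

    #define BLOCK (1 << 20)               /* 1 Mbyte per process */

    int main(int argc, char **argv)
    {
        int rank;
        MPI_File fh;
        MPI_Offset offset;
        char *buf = malloc(BLOCK);        /* contents left arbitrary here */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_File_open(MPI_COMM_WORLD, "shared.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY,
                      MPI_INFO_NULL, &fh);
        offset = (MPI_Offset)rank * BLOCK;   /* disjoint block per rank */
        MPI_File_write_at_all(fh, offset, buf, BLOCK, MPI_BYTE,
                              MPI_STATUS_IGNORE);
        MPI_File_close(&fh);
        free(buf);
        MPI_Finalize();
        return 0;
    }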
Summary of access routines for T3E
(The last two columns form the Coordination axis: non-collective vs. collective.)

Positioning   Synchronism     Non-collective          Collective
Explicit      blocking        READ_AT                 READ_AT_ALL
              non-blocking    IREAD_AT (+ WAIT)       READ_AT_ALL_BEGIN / READ_AT_ALL_END
Individual    blocking        READ                    READ_ALL
              non-blocking    IREAD (+ WAIT)          READ_ALL_BEGIN / READ_ALL_END
Shared        blocking        READ_SHARED             READ_ORDERED
              non-blocking    IREAD_SHARED (+ WAIT)   READ_ORDERED_BEGIN / READ_ORDERED_END
Summary of access routines for SP
(The routine matrix is the same as for the T3E; as noted above, the implemented subset differs.)

Positioning   Synchronism     Non-collective          Collective
Explicit      blocking        READ_AT                 READ_AT_ALL
              non-blocking    IREAD_AT (+ WAIT)       READ_AT_ALL_BEGIN / READ_AT_ALL_END
Individual    blocking        READ                    READ_ALL
              non-blocking    IREAD (+ WAIT)          READ_ALL_BEGIN / READ_ALL_END
Shared        blocking        READ_SHARED             READ_ORDERED
              non-blocking    IREAD_SHARED (+ WAIT)   READ_ORDERED_BEGIN / READ_ORDERED_END
Cray FFIO library
• FFIO is a set of I/O layers tuned for different I/O characteristics
• Buffering of data (configurable size)
• Caching of data (configurable size)
• Available to regular Fortran I/O without reprogramming
• Available for C through POSIX-like calls, e.g. ffopen and ffwrite (sketched below)
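A rough sketch of how the C interface might be used, assuming ffopen/ffwrite/ffclose mirror the POSIX open/write/close calls; consult the Application Programmer's I/O Guide for the exact signatures.

    /* Rough sketch of the POSIX-like FFIO C calls named above, under
     * the assumption that they mirror open/write/close; the function
     * name and arguments here are illustrative. */
    #include <ffio.h>
    #include <fcntl.h>

    void dump(const char *name, const void *buf, long nbytes)
    {
        int fd = ffopen(name, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd >= 0) {
            ffwrite(fd, buf, nbytes);   /* FFIO layer chosen via assign -F */
            ffclose(fd);
        }
    }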
FFIO - The assign command
• Controls program behavior at runtime
• The assign command controls:
– which FFIO layer is active
– striping across multiple partitions
– lots more
• Scope of assign (examples below):
– file name
– Fortran unit number
– file type (e.g. all sequential unformatted files)
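Some illustrative assign invocations; the layer parameters, file name, and unit number are arbitrary choices, so see man assign for the authoritative syntax:

    assign -F bufa:256:2 f:data.out    # bufa layer for the file data.out
    assign -F global:64:4 u:10         # global layer for Fortran unit 10
    assign -F bufa:128:2 g:su          # all sequential unformatted files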
IBM MIO library
• User interface based on POSIX I/O routines, so requires program modification
• Useful trace module to collect statistics
• Not much experience with using on GPFS filesystem
• Coming soon
I/O Strategies - Exclusive access files
• Each process reads and writes a separate file
– Language I/O
• On the T3E, increase language I/O performance with the FFIO library (for example, specify a large buffer with the bufa layer); for Fortran direct access the default buffer is only the larger of the record length and 32 Kbytes
• On the SP, read/write large amounts of data per request
– MPI I/O
• read/write large amounts of data per request (a sketch follows)
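A minimal C sketch of the one-file-per-process strategy, using POSIX I/O with large requests; the name pattern out.NNNN and the 4 Mbyte request size are arbitrary choices.

    /* Sketch: each MPI process writes its own file, named by rank,
     * with large POSIX requests. Names and sizes are illustrative. */
    #include <mpi.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define CHUNK (4 << 20)               /* 4 Mbyte per request */

    int main(int argc, char **argv)
    {
        int rank, fd;
        char name[64];
        char *buf = malloc(CHUNK);

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        snprintf(name, sizeof(name), "out.%04d", rank);  /* one file per process */
        fd = open(name, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd >= 0) {
            write(fd, buf, CHUNK);        /* few large requests, not many small ones */
            close(fd);
        }
        free(buf);
        MPI_Finalize();
        return 0;
    }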
bufa FFIO layer Overview
• bufa is an asynchronous buffering layer
• performs read-ahead, write-behind
• specify buffer size with -F bufa:bs:nbufs where bs is the buffer size in units of 4Kbyte blocks, and nbufs is the number of buffers
• buffer space increases your application's memory requirements (sizing example below)
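For example (the unit number and sizes are arbitrary choices), assign -F bufa:256:4 u:11 requests four buffers of 256 × 4 Kbyte = 1 Mbyte each, adding about 4 Mbyte to the program's memory image.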
I/O Strategies - Shared files
• All PEs read and write the same file simultaneously
– Language I/O (requires the FFIO library global layer on the T3E)
– MPI I/O
– On the T3E, language I/O with the FFIO library global layer and Cray extensions for additional flexibility
Positioning with a shared file
• Positioning of a read or write is your responsibility
• File pointers are private
• Fortran
– use a direct access file and read/write(rec=num)
– use the Cray T3E extensions setpos and getpos to position the file pointer (not portable)
• C
– use ffseek
• MPI I/O
– an MPI I/O file view generally takes care of this; positioning routines are also available (see the sketch below)
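A minimal C sketch of the file-view approach: once each process's view starts at its own displacement, reads need no explicit positioning. The file name shared.dat and the 64 Kbyte chunk size are illustrative assumptions.

    /* Sketch: per-rank file views remove the need for explicit seeks;
     * names and sizes are illustrative. */
    #include <mpi.h>
    #include <stdlib.h>

    #define CHUNK 65536                   /* 64 Kbyte per process */

    int main(int argc, char **argv)
    {
        int rank;
        MPI_File fh;
        char *buf = malloc(CHUNK);

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_File_open(MPI_COMM_WORLD, "shared.dat", MPI_MODE_RDONLY,
                      MPI_INFO_NULL, &fh);
        /* This rank's view starts at its own displacement in the file. */
        MPI_File_set_view(fh, (MPI_Offset)rank * CHUNK, MPI_BYTE, MPI_BYTE,
                          "native", MPI_INFO_NULL);
        MPI_File_read_all(fh, buf, CHUNK, MPI_BYTE, MPI_STATUS_IGNORE);
        MPI_File_close(&fh);
        free(buf);
        MPI_Finalize();
        return 0;
    }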
global FFIO layer Overview
• global is a caching and buffering layer which enables multiple PEs to read and write to the same file
• if one PE has already read the data, an additional read request from another PE will result in a remote memory copy
• file open is a synchronizing event
• By default, all PEs must open a global file; this can be changed by calling GLIO_GROUP_MPI(comm)
• specify buffer size with -F global:bs:nbufs where bs is the buffer size in units of 4Kbyte blocks, and nbufs is the number of buffers per PE
GPFS and shared files
• On the T3E the global FFIO layer takes care of updates to a file from multiple PEs by tracking the state of the file across all PEs.
• On the SP, GPFS implements a safe update scheme via tokens and a token manager.
– If two processes access the same block of a GPFS file (256 Kbytes), a negotiation between the nodes and the token manager determines the order of updates. This can slow down I/O considerably.
– MPI I/O merges requests from different processes to alleviate this problem (see the sketch below).
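One way to sidestep the negotiation entirely is to keep each process's region of the shared file aligned to the 256 Kbyte block size, so that no two processes ever touch the same block. A hedged C sketch; the file name gpfs.dat and the region size are illustrative.

    /* Sketch: block-aligned regions in a shared GPFS file, so region
     * boundaries coincide with 256-Kbyte block boundaries and no two
     * ranks touch the same block. Names and sizes are illustrative. */
    #include <mpi.h>
    #include <stdlib.h>

    #define GPFS_BLOCK (256 * 1024)
    #define NBLOCKS    16                 /* 4 Mbyte region per process */

    int main(int argc, char **argv)
    {
        int rank;
        int region = GPFS_BLOCK * NBLOCKS;
        MPI_File fh;
        char *buf = malloc(region);

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_File_open(MPI_COMM_WORLD, "gpfs.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY,
                      MPI_INFO_NULL, &fh);
        /* Offsets are whole multiples of the GPFS block size, so these
         * writes require no token negotiation between nodes. */
        MPI_File_write_at_all(fh, (MPI_Offset)rank * region, buf, region,
                              MPI_BYTE, MPI_STATUS_IGNORE);
        MPI_File_close(&fh);
        free(buf);
        MPI_Finalize();
        return 0;
    }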
I/O Performance Comparison
• Each process writes a 200 Mbyte file; 2 processes per node on the SP.
[Chart: aggregate rate in Mbyte/sec vs. number of processes (16, 32, 64) for T3E write, T3E read, SP write, and SP read]
Further Information
• I/O on the T3E Tutorial by Richard Gerber at http://home.nersc.gov/training/tutorials
• Cray Publication - Application Programmer’s I/O Guide
• Cray Publication - Cray T3E Fortran Optimization Guide
• man assign
• XL Fortran User’s Guide