HPC Components for CCA
Manoj Krishnan and Jarek Nieplocha
Computational Sciences and Mathematics Division
Pacific Northwest National Laboratory
HPC Components
- Distributed Arrays Component: Global Arrays (GA)
- Parallel I/O Component: Disk Resident Arrays (DRA)
- One-sided Communication Component: Remote Memory Access (RMA) communication via the Aggregate Remote Memory Copy Interface (ARMCI)
Distributed Array Component
Based on Global Arrays (GA). Core capabilities:
- Dense arrays, 1-7 dimensions
- Global rather than per-task view of data structures
- User control over data distribution: regular and irregular
- Physically distributed dense array: a single, shared data structure with global indexing (e.g., A(4,3) rather than buf(7) on task 2)

Ports:
- GAClassicPort: 36+98 (direct+indirect) GA methods
- GADADFPort: distributed array descriptors (DAD) and templates, as proposed by the Data Working Group of the CCA Forum
- LinearAlgebraPort (LA): manipulating vectors, matrices, and linear solvers (for TAO)

[Diagram: GA Classic component exposing the GA, LA, and DAD ports]
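The global rather than per-task view can be illustrated with a small model of a 1-D block distribution: given a global index, compute which task owns it and where it sits in that task's local buffer. The helper names below are hypothetical; the Global Arrays library performs this bookkeeping internally (and for arrays of up to 7 dimensions).

```python
# Minimal sketch of a 1-D block distribution, illustrating how a global
# index such as A(i) maps to (owner task, local buffer offset).
# Function names are illustrative, not part of the GA API.

def block_bounds(n, ntasks, rank):
    """Return the [lo, hi) global range owned by `rank` for n elements."""
    base, extra = divmod(n, ntasks)
    lo = rank * base + min(rank, extra)
    hi = lo + base + (1 if rank < extra else 0)
    return lo, hi

def owner_and_offset(n, ntasks, i):
    """Map global index i to (owner rank, offset in that rank's buffer)."""
    for rank in range(ntasks):
        lo, hi = block_bounds(n, ntasks, rank)
        if lo <= i < hi:
            return rank, i - lo
    raise IndexError(i)

# A 10-element array over 3 tasks: global index 7 lives on task 2, slot 0.
print(owner_and_offset(10, 3, 7))  # → (2, 0)
```

The point of the global view is that application code only ever uses the global index; the owner/offset translation stays hidden inside the library.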
Distributed Arrays
- Data locality and distribution
- Ease of programming
- High performance: achieves 5.2 GFLOP/s per CPU out of a 6 GFLOP/s peak
[Figure: Parallel matrix multiplication, 16000x16000. Aggregate GFLOP/s (0-6000) vs. number of processors (10-10000) for SRUMMA and PBLAS/ScaLAPACK.]
Example: 1-D transpose (invert the data globally, e.g., the sequence 0 1 2 3 4 5), MPI vs. GA.

With MPI:
- Invert the data locally
- Identify where (which process ranks) to send the data
- Find the number of MPI_Recv's to post
- Manipulate the global indices for each Recv (identify where each piece of data fits locally)
- Do the actual data transfer

With GA:
- Invert the data locally
- Do a GA_Put
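The GA side of the comparison above can be sketched with a toy model: each task inverts only its local block, then issues a single one-sided put into the mirrored global position, with no receive matching on the target. The list standing in for the global array and the `ga_put` helper are illustrative stand-ins, not the real GA_Put.

```python
# Sketch of the GA-style 1-D global inversion from the slide:
# invert each local block, then do one put into the mirrored range.

def ga_put(global_arr, lo, data):
    """Stand-in for a one-sided put: write `data` at global offset `lo`."""
    global_arr[lo:lo + len(data)] = data

def invert_globally(global_arr, blocks):
    """`blocks` is a list of (lo, local_data) pairs, one per task."""
    n = len(global_arr)
    for lo, local in blocks:
        inverted = local[::-1]                 # step 1: invert locally
        new_lo = n - (lo + len(local))         # mirrored global position
        ga_put(global_arr, new_lo, inverted)   # step 2: a single put

arr = [None] * 6
# Two tasks own [0, 1, 2] and [3, 4, 5].
invert_globally(arr, [(0, [0, 1, 2]), (3, [3, 4, 5])])
print(arr)  # → [5, 4, 3, 2, 1, 0]
```

Compare this with the MPI column: the rank identification, Recv posting, and index manipulation all collapse into the one-line offset computation because the put targets a global address space.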
Parallel I/O Component
Based on Disk Resident Arrays (DRA):
- High-level API for transferring data between N-dimensional arrays stored on disk and distributed arrays stored in memory
- Uses parallel or local filesystems and hides filesystem issues
- Scalable performance utilizing the local disks of a cluster: the more nodes used, the more disks available and the higher the aggregate bandwidth

Use when:
- arrays are too big to store in core
- checkpoint/restart
- out-of-core solvers

Development:
- Ohio State collaboration (P. Sadayappan)
- Non-collective I/O
- Data reorganization/layout
- Recent paper at LACSI

[Diagram: data transfer between an array in memory and an array on disk(s)]
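The disk-resident-array idea can be sketched as a single-process model: the array lives in a file, and only the sections currently needed are transferred between disk and in-core buffers. The class and method names below are illustrative; real DRA operates on distributed N-dimensional arrays over parallel or local filesystems.

```python
# Toy model of a disk-resident 1-D array of float64 elements:
# sections move between disk and in-core buffers on demand.
import os
import struct
import tempfile

ITEM = struct.Struct("d")  # one float64 per element

class DiskArray:
    def __init__(self, path, n):
        self.path, self.n = path, n
        with open(path, "wb") as f:          # preallocate the file
            f.write(b"\0" * (n * ITEM.size))

    def write_section(self, lo, values):
        """Transfer an in-core buffer to disk at global offset `lo`."""
        with open(self.path, "r+b") as f:
            f.seek(lo * ITEM.size)
            f.write(b"".join(ITEM.pack(v) for v in values))

    def read_section(self, lo, count):
        """Bring a disk section back into core."""
        with open(self.path, "rb") as f:
            f.seek(lo * ITEM.size)
            raw = f.read(count * ITEM.size)
        return [ITEM.unpack_from(raw, i * ITEM.size)[0] for i in range(count)]

path = os.path.join(tempfile.mkdtemp(), "dra.dat")
d = DiskArray(path, 1000)                # notionally too big to keep in core
d.write_section(10, [1.0, 2.0, 3.0])
print(d.read_section(10, 3))             # → [1.0, 2.0, 3.0]
```

This is the access pattern behind the out-of-core use cases listed above: solvers and checkpoint/restart touch disk sections through the same array-style interface they use for in-memory data.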
Communication Component

Based on ARMCI, the Aggregate Remote Memory Copy Interface:
- Used in Global Arrays, the Rice Co-Array Fortran compiler, Ames GPSHMEM, and Co-Array Python
- Vendor supported (Cray XD1; IBM porting to BG/L)
- One-sided communication (put/get model): Remote Memory Access (RMA)
- The CCA component offers language interoperability; previously only a C interface existed in ARMCI
[Diagram: any component uses a Comm Driver that plugs in ARMCI-Elan (Quadrics), ARMCI-GM (Myrinet), ARMCI-VAPI (InfiniBand), or ARMCI-Sockets (Ethernet); plug-and-play for network drivers using CCA. In the one-sided RMA model, P0 puts buffer A directly into buffer B on P1.]
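The put/get model can be sketched with a toy in-memory version: the origin process alone specifies the source buffer, the destination rank, and the destination offset, and the target takes no action (no matching receive). The names loosely mirror ARMCI_Put/ARMCI_Get but are stand-ins, not the real C API.

```python
# Toy model of one-sided put/get (RMA): each "process" exposes a flat
# registered memory region, and the origin side drives every transfer.

class Rma:
    def __init__(self, nprocs, size):
        # One registered memory region per simulated process.
        self.mem = [bytearray(size) for _ in range(nprocs)]

    def put(self, src: bytes, dst_rank: int, dst_off: int) -> None:
        """Origin writes directly into the target's memory;
        the target posts no receive."""
        self.mem[dst_rank][dst_off:dst_off + len(src)] = src

    def get(self, src_rank: int, src_off: int, nbytes: int) -> bytes:
        """Origin reads directly from the target's memory."""
        return bytes(self.mem[src_rank][src_off:src_off + nbytes])

rma = Rma(nprocs=2, size=16)
rma.put(b"A", dst_rank=1, dst_off=3)   # P0 puts buffer A into P1
print(rma.get(1, 3, 1))                # → b'A'
```

The contrast with two-sided message passing is the point of the diagram above: in RMA there is no rendezvous between sender and receiver, which is what lets GA layer a shared global index space over distributed memory.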
Processor Group Issues in Distributed Array Management
- Access to data in components running on different processor groups
- Identifying the rank of processes/threads and group naming in component interfaces
- Data movement and reorganization: an instance of the MxN problem revisited
- For component interoperability, we would like support from the framework for identifying and naming processes/groups
- Distributed and parallel environments; hybrid threads/processes and MPI/PVM issues

[Diagram: Comp A (using MPI) and Comp B (using GA) running on a CCA Framework]
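The MxN problem mentioned above can be sketched concretely: an array block-distributed over M processes in one component must be redistributed over N processes in another, and each required transfer is the intersection of a source block with a destination block. The function names are hypothetical; this only computes the communication schedule, not the transfers themselves.

```python
# Sketch of the MxN redistribution problem: compute which global ranges
# must move from each source rank (M processes) to each destination
# rank (N processes) when the block distribution changes.

def block(n, nprocs, rank):
    """[lo, hi) global range owned by `rank` in a block distribution."""
    base, extra = divmod(n, nprocs)
    lo = rank * base + min(rank, extra)
    return lo, lo + base + (1 if rank < extra else 0)

def mxn_schedule(n, m, nprocs_b):
    """List of (src_rank, dst_rank, lo, hi) transfers covering the array."""
    moves = []
    for s in range(m):
        slo, shi = block(n, m, s)
        for d in range(nprocs_b):
            dlo, dhi = block(n, nprocs_b, d)
            lo, hi = max(slo, dlo), min(shi, dhi)
            if lo < hi:                      # nonempty overlap => a message
                moves.append((s, d, lo, hi))
    return moves

# 12 elements redistributed from 3 processes to 2: each destination
# assembles its block from the overlapping source blocks.
print(mxn_schedule(12, 3, 2))
# → [(0, 0, 0, 4), (1, 0, 4, 6), (1, 1, 6, 8), (2, 1, 8, 12)]
```

This is exactly where framework support for group naming matters: the schedule is only computable once both components agree on how ranks in group A and group B are identified.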