HPC Components for CCA
Manoj Krishnan and Jarek Nieplocha
Computational Sciences and Mathematics Division
Pacific Northwest National Laboratory
HPC Components
- Distributed Arrays Component: Global Arrays (GA)
- Parallel I/O Component: Disk Resident Arrays (DRA)
- One-sided Communication Component: Remote Memory Access (RMA) communication via the Aggregate Remote Memory Copy Interface (ARMCI)
Distributed Array Component
Based on Global Arrays (GA). Core capabilities:
- Dense arrays, 1-7 dimensions
- Global rather than per-task view of data structures
- User control over data distribution: regular and irregular
- Physically distributed dense array: a single, shared data structure with global indexing (e.g., A(4,3) rather than buf(7) on task 2)

Ports:
- GAClassicPort: 36+98 (direct+indirect) GA methods
- GADADFPort: distributed array descriptors (DAD) and templates, as proposed by the Data Working Group of the CCA Forum
- LinearAlgebraPort (LA): manipulating vectors, matrices, and linear solvers (for TAO)

[Diagram: GA Classic component exposing the GA, LA, and DAD ports]
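The global rather than per-task view can be illustrated with a small model of a 1-D block distribution: given a global index, compute which task owns it and where it sits in that task's local buffer. The helper names below are hypothetical; the Global Arrays library performs this bookkeeping internally (and for arrays of up to 7 dimensions).

```python
# Minimal sketch of a 1-D block distribution, illustrating how a global
# index such as A(i) maps to (owner task, local buffer offset).
# Function names are illustrative, not part of the GA API.

def block_bounds(n, ntasks, rank):
    """Return the [lo, hi) global range owned by `rank` for n elements."""
    base, extra = divmod(n, ntasks)
    lo = rank * base + min(rank, extra)
    hi = lo + base + (1 if rank < extra else 0)
    return lo, hi

def owner_and_offset(n, ntasks, i):
    """Map global index i to (owner rank, offset in that rank's buffer)."""
    for rank in range(ntasks):
        lo, hi = block_bounds(n, ntasks, rank)
        if lo <= i < hi:
            return rank, i - lo
    raise IndexError(i)

# A 10-element array over 3 tasks: global index 7 lives on task 2, slot 0.
print(owner_and_offset(10, 3, 7))  # → (2, 0)
```

The point of the global view is that application code only ever uses the global index; the owner/offset translation stays hidden inside the library.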
Distributed Arrays
- Data locality and distribution
- Ease of programming
- High performance: achieves 5.2 GFLOP/s per CPU out of a 6 GFLOP/s peak
[Figure: Parallel matrix multiplication, 16000x16000. Aggregate GFLOP/s (0-6000) vs. number of processors (10-10000) for SRUMMA and PBLAS/ScaLAPACK.]
Example: 1-D transpose (invert the data globally, e.g., the sequence 0 1 2 3 4 5), MPI vs. GA.

With MPI:
- Invert the data locally
- Identify where (which process ranks) to send the data
- Find the number of MPI_Recv's to post
- Manipulate the global indices for each Recv (identify where each piece of data fits locally)
- Do the actual data transfer

With GA:
- Invert the data locally
- Do a GA_Put
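The GA side of the comparison above can be sketched with a toy model: each task inverts only its local block, then issues a single one-sided put into the mirrored global position, with no receive matching on the target. The list standing in for the global array and the `ga_put` helper are illustrative stand-ins, not the real GA_Put.

```python
# Sketch of the GA-style 1-D global inversion from the slide:
# invert each local block, then do one put into the mirrored range.

def ga_put(global_arr, lo, data):
    """Stand-in for a one-sided put: write `data` at global offset `lo`."""
    global_arr[lo:lo + len(data)] = data

def invert_globally(global_arr, blocks):
    """`blocks` is a list of (lo, local_data) pairs, one per task."""
    n = len(global_arr)
    for lo, local in blocks:
        inverted = local[::-1]                 # step 1: invert locally
        new_lo = n - (lo + len(local))         # mirrored global position
        ga_put(global_arr, new_lo, inverted)   # step 2: a single put

arr = [None] * 6
# Two tasks own [0, 1, 2] and [3, 4, 5].
invert_globally(arr, [(0, [0, 1, 2]), (3, [3, 4, 5])])
print(arr)  # → [5, 4, 3, 2, 1, 0]
```

Compare this with the MPI column: the rank identification, Recv posting, and index manipulation all collapse into the one-line offset computation because the put targets a global address space.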
Parallel I/O Component
Based on Disk Resident Arrays (DRA):
- High-level API for transferring data between N-dimensional arrays stored on disk and distributed arrays stored in memory
- Uses parallel or local filesystems and hides filesystem issues
- Scalable performance utilizing the local disks of a cluster: the more nodes used, the more disks available and the higher the aggregate bandwidth

Use when:
- arrays are too big to store in core
- checkpoint/restart
- out-of-core solvers

Development:
- Ohio State collaboration (P. Sadayappan)
- Non-collective I/O
- Data reorganization/layout
- Recent paper at LACSI

[Diagram: data transfer between an array in memory and an array on disk(s)]
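The disk-resident-array idea can be sketched as a single-process model: the array lives in a file, and only the sections currently needed are transferred between disk and in-core buffers. The class and method names below are illustrative; real DRA operates on distributed N-dimensional arrays over parallel or local filesystems.

```python
# Toy model of a disk-resident 1-D array of float64 elements:
# sections move between disk and in-core buffers on demand.
import os
import struct
import tempfile

ITEM = struct.Struct("d")  # one float64 per element

class DiskArray:
    def __init__(self, path, n):
        self.path, self.n = path, n
        with open(path, "wb") as f:          # preallocate the file
            f.write(b"\0" * (n * ITEM.size))

    def write_section(self, lo, values):
        """Transfer an in-core buffer to disk at global offset `lo`."""
        with open(self.path, "r+b") as f:
            f.seek(lo * ITEM.size)
            f.write(b"".join(ITEM.pack(v) for v in values))

    def read_section(self, lo, count):
        """Bring a disk section back into core."""
        with open(self.path, "rb") as f:
            f.seek(lo * ITEM.size)
            raw = f.read(count * ITEM.size)
        return [ITEM.unpack_from(raw, i * ITEM.size)[0] for i in range(count)]

path = os.path.join(tempfile.mkdtemp(), "dra.dat")
d = DiskArray(path, 1000)                # notionally too big to keep in core
d.write_section(10, [1.0, 2.0, 3.0])
print(d.read_section(10, 3))             # → [1.0, 2.0, 3.0]
```

This is the access pattern behind the out-of-core use cases listed above: solvers and checkpoint/restart touch disk sections through the same array-style interface they use for in-memory data.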
Communication Component

Based on ARMCI, the Aggregate Remote Memory Copy Interface:
- Used in Global Arrays, the Rice Co-Array Fortran compiler, Ames GPSHMEM, and Co-Array Python
- Vendor supported (Cray XD1; IBM porting to BG/L)
- One-sided communication (put/get model): Remote Memory Access (RMA)
- The CCA component offers language interoperability; previously only a C interface existed in ARMCI
[Diagram: any component uses a Comm Driver that plugs in ARMCI-Elan (Quadrics), ARMCI-GM (Myrinet), ARMCI-VAPI (InfiniBand), or ARMCI-Sockets (Ethernet); plug-and-play for network drivers using CCA. In the one-sided RMA model, P0 puts buffer A directly into buffer B on P1.]
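The put/get model can be sketched with a toy in-memory version: the origin process alone specifies the source buffer, the destination rank, and the destination offset, and the target takes no action (no matching receive). The names loosely mirror ARMCI_Put/ARMCI_Get but are stand-ins, not the real C API.

```python
# Toy model of one-sided put/get (RMA): each "process" exposes a flat
# registered memory region, and the origin side drives every transfer.

class Rma:
    def __init__(self, nprocs, size):
        # One registered memory region per simulated process.
        self.mem = [bytearray(size) for _ in range(nprocs)]

    def put(self, src: bytes, dst_rank: int, dst_off: int) -> None:
        """Origin writes directly into the target's memory;
        the target posts no receive."""
        self.mem[dst_rank][dst_off:dst_off + len(src)] = src

    def get(self, src_rank: int, src_off: int, nbytes: int) -> bytes:
        """Origin reads directly from the target's memory."""
        return bytes(self.mem[src_rank][src_off:src_off + nbytes])

rma = Rma(nprocs=2, size=16)
rma.put(b"A", dst_rank=1, dst_off=3)   # P0 puts buffer A into P1
print(rma.get(1, 3, 1))                # → b'A'
```

The contrast with two-sided message passing is the point of the diagram above: in RMA there is no rendezvous between sender and receiver, which is what lets GA layer a shared global index space over distributed memory.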
Processor Group Issues in Distributed Array Management
- Access to data in components running on different processor groups
- Identifying the rank of processes/threads and group naming in component interfaces
- Data movement and reorganization: an instance of the MxN problem revisited
- For component interoperability, we would like support from the framework for identifying and naming processes/groups
- Distributed and parallel environments; hybrid threads/processes and MPI/PVM issues

[Diagram: Comp A (using MPI) and Comp B (using GA) running on a CCA Framework]
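The MxN problem mentioned above can be sketched concretely: an array block-distributed over M processes in one component must be redistributed over N processes in another, and each required transfer is the intersection of a source block with a destination block. The function names are hypothetical; this only computes the communication schedule, not the transfers themselves.

```python
# Sketch of the MxN redistribution problem: compute which global ranges
# must move from each source rank (M processes) to each destination
# rank (N processes) when the block distribution changes.

def block(n, nprocs, rank):
    """[lo, hi) global range owned by `rank` in a block distribution."""
    base, extra = divmod(n, nprocs)
    lo = rank * base + min(rank, extra)
    return lo, lo + base + (1 if rank < extra else 0)

def mxn_schedule(n, m, nprocs_b):
    """List of (src_rank, dst_rank, lo, hi) transfers covering the array."""
    moves = []
    for s in range(m):
        slo, shi = block(n, m, s)
        for d in range(nprocs_b):
            dlo, dhi = block(n, nprocs_b, d)
            lo, hi = max(slo, dlo), min(shi, dhi)
            if lo < hi:                      # nonempty overlap => a message
                moves.append((s, d, lo, hi))
    return moves

# 12 elements redistributed from 3 processes to 2: each destination
# assembles its block from the overlapping source blocks.
print(mxn_schedule(12, 3, 2))
# → [(0, 0, 0, 4), (1, 0, 4, 6), (1, 1, 6, 8), (2, 1, 8, 12)]
```

This is exactly where framework support for group naming matters: the schedule is only computable once both components agree on how ranks in group A and group B are identified.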