TRANSCRIPT
1 Qualifying Exam, Wei Chen
Unified Parallel C (UPC) and the Berkeley UPC Compiler
Wei Chen, the Berkeley UPC Group
3/11/07
2 Wei Chen, UPC talk
Parallel Programming
• Most parallel programs are written using either:
  • Message passing with a SPMD model
    • Usually for scientific applications with C++/Fortran
    • Scales easily: user-controlled data layout
    • Hard to use: send/receive matching, message packing/unpacking
  • Shared memory with OpenMP/pthreads/Java
    • Usually for non-scientific applications
    • Easier to program: direct reads and writes to shared data
    • Hard to scale: (mostly) limited to SMPs, no concept of locality
• PGAS: an alternative hybrid model
3 Partitioned Global Address Space
• The PGAS model uses a global address space abstraction
  • Shared memory is partitioned by processor
  • User-controlled data layout (global pointers and distributed arrays)
• One-sided communication:
  • Uses RDMA support for reads/writes of shared variables
  • Much faster than message passing for small/medium-size messages
• Hybrid model works for both SMPs and clusters
• Languages: Titanium, Co-Array Fortran, UPC
[Figure: the partitioned global address space. A shared segment holds X[0], X[1], ..., X[P]; each thread's private segment holds its own ptr into the shared space.]
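The layout in the figure maps directly onto UPC declarations. A minimal sketch (it needs a UPC compiler, e.g. Berkeley upcc, to build, so it is not verified here):

```c
#include <upc.h>

shared int X[THREADS];  /* one element in each thread's shared partition */
shared int *ptr;        /* a pointer-to-shared; each thread holds a private copy */

int main() {
  ptr = &X[MYTHREAD];   /* point at the element with affinity to this thread */
  *ptr = MYTHREAD;      /* one-sided write: no matching receive required */
  upc_barrier;          /* synchronize before anyone reads remote elements */
  return 0;
}
```

The `shared` qualifier on `X` places it in the global address space, while `ptr` itself lives in private memory, exactly as drawn in the figure.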
4 Unified Parallel C
• A SPMD parallel extension of C
• PGAS: adds a shared qualifier to the type system
• Several kinds of shared array distributions
• Fine-grained and bulk communication
• Commercial compilers from Cray/HP/IBM
• Open-source compiler from Berkeley UPC

Vector Addition in UPC

#define N 100*THREADS
shared int v1[N], v2[N], sum[N];  // cyclic layout
int main() {
  for (int i = 0; i < N; i++)
    if (MYTHREAD == i % THREADS)  // SPMD: each thread takes its own elements
      sum[i] = v1[i] + v2[i];
  return 0;
}
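The manual affinity test above is the classic SPMD idiom; UPC 1.2 also provides `upc_forall`, whose fourth clause is an affinity expression that assigns each iteration to a thread. A sketch of the same loop (requires a UPC compiler, so unverified here):

```c
#include <upc_relaxed.h>

#define N (100 * THREADS)
shared int v1[N], v2[N], sum[N];  /* default cyclic layout */

int main() {
  /* iteration i executes on the thread that owns sum[i] */
  upc_forall (int i = 0; i < N; i++; &sum[i])
    sum[i] = v1[i] + v2[i];
  return 0;
}
```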
5 Overview of the Berkeley UPC Compiler
Two goals: portability and high performance.
[Figure: the compiler stack, from UPC code down to network hardware]
• UPC-to-C Translator (platform-independent): lowers UPC code into ISO C code
• Translator-generated C code
• Berkeley UPC Runtime System (network-independent): shared memory management and pointer operations
• GASNet Communication System (compiler-independent): uniform get/put interface over the underlying networks
• Network Hardware (language-independent)
6 UPC-to-C Translator
Pipeline: preprocessed UPC source → (parsing) → WHIRL with shared types → (optimizer) → optimized WHIRL → (lowering) → WHIRL with runtime calls → (WHIRL2C) → ISO C code → backend C compiler
• Based on Open64
  • Extended with shared types
  • Reuses the existing analysis framework
  • Adds UPC-specific optimizations
• Portable translation
  • High-level IR
  • Config file for platform-dependent information
• Code generation re-includes library headers and converts shared memory operations into runtime calls
7 Optimization Framework
• Combination of language, compiler, and runtime support
  • Transparent to the user
  • Performance portable
• Short-term goal: effective on different cluster networks
• Long-term goal: code designed for SMPs gets good performance on clusters
Three classes of accesses are targeted:
• Regular array accesses (e.g., A[i][j][k]): loop framework for message vectorization and strip mining
• Irregular pointer accesses (e.g., p->x->y): PRE framework with split-phase accesses and coalescing
• Bulk communication (e.g., upc_memget(dst, src, size)): runtime framework for overlapping nonblocking communication
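As one example of the runtime overlap framework, Berkeley UPC exposes nonblocking bulk-copy extensions alongside the standard blocking `upc_memget`. The sketch below assumes the `bupc_memget_async`/`bupc_waitsync` extension API (the header name and exact signatures are from memory, so treat them as assumptions) and needs a UPC compiler to build:

```c
#include <upc.h>
#include <bupc_ext.h>     /* assumed header for the Berkeley extensions */

#define M 1024
shared [M] double grid[THREADS][M];
double local[M];

int main() {
  int peer = (MYTHREAD + 1) % THREADS;
  /* split-phase: initiate the bulk transfer ... */
  bupc_handle_t h = bupc_memget_async(local, &grid[peer][0], M * sizeof(double));
  /* ... overlap independent computation with the communication ... */
  bupc_waitsync(h);       /* ... then complete it before using local[] */
  return 0;
}
```

Splitting the initiation from the completion is exactly the transformation the PRE framework applies automatically to fine-grained accesses.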
8 Application Performance – LU Decomposition
[Figure: LU performance comparison: % of peak performance (10% to 90%) for MPI/HPL vs. UPC/LU on Cray X1 (64 procs), Cray X1 (128/124 procs), SGI Altix (32 procs), and 2.2 GHz Opteron (64 procs)]
• UPC performance is comparable to MPI/HPL (Linpack) with less than half the code size
• Uses lightweight multithreading atop SPMD to tolerate latency
• Highly adaptable to different problem and machine sizes
9 Application Performance – 3D FFT
• The one-sided UPC approach sends more, smaller messages
  • Same total volume of data, but sent earlier and more often
  • Aggressively overlaps the transpose with the second 1-D FFT
• The same approach is less effective in MPI due to its higher per-message cost
• Consistently outperforms MPI-based implementations, by as much as 2X
[Figure: 3D FFT performance in MFLOPS per processor (up is good)]
10 Current Status
• Public release v2.4 in November 2006
  • Fully compliant with the UPC 1.2 specification
  • Communication optimizations
  • Extensions for performance and programmability
• Support from laptops to supercomputers
  • OS: UNIX (Linux, BSD, AIX, Solaris, etc.), Mac OS, Cygwin
  • Arch: x86, Itanium, Opteron, Alpha, PPC, SPARC, Cray X1, NEC SX-6, Blue Gene, etc.
  • Network: SMP, Myrinet, Quadrics, InfiniBand, IBM LAPI, MPI, Ethernet, SHMEM, etc.
• Give us a try at http://upc.lbl.gov
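For reference, a typical Berkeley UPC build-and-run session looks roughly like the following; the flag spellings are from memory, so check `upcc --help` on your installation:

```shell
# Compile a UPC program with the Berkeley UPC driver (upcc),
# then launch it with 4 UPC threads via upcrun.
upcc -o vadd vadd.upc
upcrun -n 4 ./vadd
```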
11 Summary
• UPC is designed to be consistent with C
  • Exposes memory layout
  • Flexible communication through pointers and arrays
  • Gives users more control to achieve high performance
• The Berkeley UPC compiler provides an open-source and portable implementation
• Hand-optimized UPC programs match, and often beat, MPI's performance
• Research goal: productive users + an efficient compiler