
Page 1

Konstantin Berlin 1, Jun Huan 2, Mary Jacob 3,

Garima Kochhar 3, Jan Prins 2, Bill Pugh 1,

P. Sadayappan 3, Jaime Spacco 1, Chau-Wen Tseng 1

1 University of Maryland, College Park
2 University of North Carolina, Chapel Hill
3 Ohio State University

Evaluating the Impact of Programming Language Features on the Performance of Parallel Applications on Cluster Architectures

Page 2

Motivation

• Irregular, fine-grain remote accesses
– Several important applications
– Message passing (MPI) is inefficient

• Language support for fine-grain remote accesses?
– Less programmer effort than MPI
– How efficient is it on clusters?

Page 3

Contributions

• Experimental evaluation of language features
• Observations on programmability & performance
• Suggestions for efficient programming style
• Predictions on impact of architectural trends

The findings are not a surprise, but we quantify the penalties of these language features on challenging applications

Page 4

Outline

• Introduction
• Evaluation
– Parallel paradigms
– Fine-grain applications
– Performance
• Observations & recommendations
• Impact of architecture trends
• Related work

Page 5

Parallel Paradigms

• Shared-memory
– Pthreads, Java threads, OpenMP, HPF
– Remote accesses same as normal accesses

• Distributed-memory
– MPI, SHMEM
– Remote accesses through explicit (aggregated) messages
– User distributes data, translates addresses (see the sketch below)

• Distributed-memory with special remote accesses
– Library to copy remote array sections (Global Arrays)
– Extra processor dimension for arrays (Co-Array Fortran)
– Global pointers (UPC)
– Compiler / run-time system converts accesses to messages
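The "user distributes data, translates addresses" burden is the key practical difference between the paradigms. As a minimal illustration (mine, not from the talk), the sketch below assumes a block distribution of a table over PROCS processes; under message passing the programmer writes this bookkeeping by hand, whereas UPC or Global Arrays hide it behind global pointers or library calls.

/* Minimal sketch (not from the talk) of the index bookkeeping a programmer
   does by hand under message passing, assuming a block distribution of
   TABSIZE elements over PROCS processes. */
#define TABSIZE (1L << 22)
#define PROCS   32
#define BLOCK   (TABSIZE / PROCS)

int  owner(long gi)       { return (int)(gi / BLOCK); }  /* which process holds element gi */
long local_index(long gi) { return gi % BLOCK; }         /* offset within that process's block */

/* A remote update then becomes "send (local_index(gi), value) to owner(gi)"
   rather than a direct write to table[gi]. */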

Page 6

Global Arrays

• Characteristics
– Provides illusion of shared multidimensional arrays
– Library routines
• Copy rectangular-shaped data in & out of global arrays
• Scatter / gather / accumulate operations on a global array
– Designed to be more restrictive but easier to use than MPI

• Example

NGA_Access(g_a, lo, hi, &table, &ld);       /* get a direct pointer to the locally held section */

for (j = 0; j < PROCS; j++) {
    for (i = 0; i < counts[j]; i++) {       /* apply the updates received from process j */
        table[index - lo[0]] ^= stable[copy[i] >> (64-LSTSIZE)];
    }
}

NGA_Release_update(g_a, lo, hi);            /* release the section, marking it as modified */

Page 7

UPC

• Characteristics
– Provides illusion of shared one-dimensional arrays
– Language features
• Global pointers to cyclically distributed arrays
• Explicit one-sided messages (upc_memput(), upc_memget())

– Compilers translate global pointers, generate communication

• Example

shared unsigned int table[TABSIZE];      /* cyclically distributed across threads */

for (i = 0; i < NUM_UPDATES/THREADS; i++) {
    long ran = random();
    table[ran & (TABSIZE-1)] ^= stable[ran >> (64-LSTSIZE)];   /* possibly remote update */
}
upc_barrier;

Page 8

UPC

• Most flexible method for arbitrary remote references
• Supported by many vendors
• Can cast global pointers to local pointers
– Efficiently access local portions of global array

• Can program using a hybrid paradigm (see the sketch below)
– Global pointers for fine-grain accesses
– upc_memput(), upc_memget() for coarse-grain accesses
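A minimal sketch of this hybrid style follows (my illustration, not code from the study). It assumes a blocked shared array; the cast to a plain C pointer is legal because the block has affinity to the calling thread, and one upc_memget() replaces many individual shared-pointer dereferences.

/* Hybrid-style UPC sketch (illustration only, not code from the study). */
#include <upc.h>

#define BLOCK 1024
shared [BLOCK] double A[BLOCK*THREADS];    /* one contiguous block per thread */

void hybrid_example(void)
{
    int i;

    /* Local portion: the cast is legal because this block has affinity to
       MYTHREAD; plain pointers let the compiler fully optimize the loop. */
    double *mine = (double *) &A[MYTHREAD * BLOCK];
    for (i = 0; i < BLOCK; i++)
        mine[i] = 2.0 * mine[i];

    upc_barrier;

    /* Coarse-grain remote access: one bulk copy of a neighbor's block
       instead of BLOCK fine-grain shared reads. */
    double buf[BLOCK];
    int next = (MYTHREAD + 1) % THREADS;
    upc_memget(buf, &A[next * BLOCK], BLOCK * sizeof(double));
}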

Page 9

Target Applications

• Parallel applications
– Most standard benchmarks are easy
• Coarse-grain parallelism
• Regular memory access patterns

• Applications with irregular, fine-grain parallelism
– Irregular table access
– Irregular dynamic access
– Integer sort

Page 10

Options for Fine-grain Parallelism

• Implement fine-grain algorithm
– Low user effort, inefficient

• Implement coarse-grain algorithm
– High user effort, efficient

• Implement hybrid algorithm
– Most code uses fine-grain remote accesses
– Performance-critical sections use coarse-grain algorithm
– Reduced user effort at the cost of some performance

• How much performance is lost on clusters?

Page 11

Experimental Evaluation

• Cluster: Compaq AlphaServer SC (ORNL)
– 64 nodes, each a 4-way Alpha EV67 SMP with 2 GB memory
– Single Quadrics adapter per node

• SMP: Sun SunFire 6800 (UMD)
– 24 UltraSPARC III processors, 24 GB memory total
– Crossbar interconnect

Page 12

Irregular Table Update

• Applications
– Parallel databases, giant histogram / hash table

• Characteristics
– Irregular parallel accesses to a large distributed table
– Bucket version (aggregated non-local accesses) possible; see the MPI sketch after the example

• Example

for (i = 0; i < NUM_UPDATES; i++) {
    ran = random();
    table[ran & (TABSIZE-1)] ^= stable[ran >> (64-LSTSIZE)];
}
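The "bucket" variant mentioned above aggregates updates by destination before communicating. The MPI sketch below is my own illustration of that idea, not the authors' benchmark code; it assumes a block-distributed table with local_size elements per process and, for brevity, XORs the key itself rather than a value looked up in stable[].

/* Bucket-version sketch (illustration only): sort updates by owning process,
   exchange them in bulk with MPI_Alltoallv, then apply them locally. */
#include <mpi.h>
#include <stdlib.h>

void bucket_update(unsigned long *local_table, long local_size,
                   unsigned long *updates, int nupdates,
                   int nprocs, MPI_Comm comm)
{
    int *scount = calloc(nprocs, sizeof(int));
    int *rcount = malloc(nprocs * sizeof(int));
    int *sdisp  = malloc(nprocs * sizeof(int));
    int *rdisp  = malloc(nprocs * sizeof(int));
    int p, i;

    /* Count updates destined for each process (block distribution assumed). */
    for (i = 0; i < nupdates; i++)
        scount[updates[i] / local_size]++;

    /* Exchange counts and compute displacements. */
    MPI_Alltoall(scount, 1, MPI_INT, rcount, 1, MPI_INT, comm);
    sdisp[0] = rdisp[0] = 0;
    for (p = 1; p < nprocs; p++) {
        sdisp[p] = sdisp[p-1] + scount[p-1];
        rdisp[p] = rdisp[p-1] + rcount[p-1];
    }

    /* Pack updates into per-destination buckets. */
    unsigned long *sbuf = malloc(nupdates * sizeof(unsigned long));
    int *pos = calloc(nprocs, sizeof(int));
    for (i = 0; i < nupdates; i++) {
        p = updates[i] / local_size;
        sbuf[sdisp[p] + pos[p]++] = updates[i];
    }

    /* One aggregated exchange instead of one message per update. */
    int nrecv = rdisp[nprocs-1] + rcount[nprocs-1];
    unsigned long *rbuf = malloc(nrecv * sizeof(unsigned long));
    MPI_Alltoallv(sbuf, scount, sdisp, MPI_UNSIGNED_LONG,
                  rbuf, rcount, rdisp, MPI_UNSIGNED_LONG, comm);

    /* Apply received updates to the locally owned table section
       (the real benchmark XORs a value looked up in stable[]). */
    for (i = 0; i < nrecv; i++)
        local_table[rbuf[i] % local_size] ^= rbuf[i];

    free(scount); free(rcount); free(sdisp); free(rdisp);
    free(sbuf); free(rbuf); free(pos);
}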

Page 13

Table Update (AlphaServer, 2^22 table)

[Chart: updates / msec / processor (log scale) vs. number of processors (1-32) for MPI, UPC (bucket), UPC, and Global Arrays]

• UPC / Global Arrays fine-grain accesses inefficient (100x)
• Hybrid coarse-grain (bucket) version closer to MPI

Page 14

Table Update (Sun SMP, 2^25 table)

[Chart: updates / msec / processor vs. number of processors (1-23) for Java, C / OpenMP, C / Pthreads, and UPC]

• UPC fine-grain accesses inefficient even on SMP

Page 15

Irregular Dynamic Accesses

• Applications
– NAS CG (sparse conjugate gradient)

• Characteristics
– Irregular parallel accesses to sparse data structures
– Limited aggregation of non-local accesses

• Example (NAS CG)

for (j = 0; j < n; j++) {
    sum = 0.0;
    for (k = rowstr[j]; k < rowstr[j+1]; k++)
        sum = sum + a[k] * v[colidx[k]];
    w[j] = sum;
}
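For comparison, a fine-grain UPC version of the same loop (my sketch, not the authors' code) simply declares the source vector shared; each v[colidx[k]] whose element has affinity to another thread then becomes an individual remote read, which is why this style pays a large penalty on the cluster.

/* Fine-grain UPC sketch (illustration only); assumes the static THREADS
   translation environment and N set to the matrix order. */
#include <upc.h>

#define N 75000                      /* e.g. NAS CG Class B order (assumed here) */
shared double v[N];                  /* cyclically distributed source vector */

void spmv_rows(double *a, int *colidx, int *rowstr,
               double *w, int row_lo, int row_hi)
{
    int j, k;
    for (j = row_lo; j < row_hi; j++) {          /* rows assigned to this thread */
        double sum = 0.0;
        for (k = rowstr[j]; k < rowstr[j+1]; k++)
            sum = sum + a[k] * v[colidx[k]];     /* possibly a remote read per element */
        w[j] = sum;
    }
}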

Page 16

NAS Conjugate Gradient (AlphaServer, Class B)

[Chart: MFLOPS / processor vs. number of processors (1-32) for MPI, OpenMP, UPC (MPI), and UPC (OpenMP)]

• UPC fine-grain accesses inefficient (4x)
• Hybrid coarse-grain version slightly closer to MPI

Page 17

Integer Sort

• Applications
– NAS IS (integer sort)

• Characteristics
– Parallel sort of a large list of integers
– Non-local accesses can be aggregated

• Example (NAS IS)

for (i = 0; i < NUM_KEYS; i++) {            /* sort local keys into buckets */
    key = key_array[i];
    key_buff1[bucket_ptrs[key >> shift]++] = key;
}

upc_reduced_sum(…);                         /* get bucket size totals */

for (i = 0; i < THREADS; i++) {             /* get my bucket from every proc */
    upc_memget(…);
}

Page 18

NAS Integer Sort (AlphaServer, 128K keys)

[Chart: efficiency vs. number of processors (1-32) for MPI and UPC]

• UPC fine-grain accesses inefficient
• Coarse-grain version closer to MPI (2-3x)

Page 19

UPC Microbenchmarks

• Compare memory access costs
– Quantify software overhead

• Private
– Local memory, local pointer

• Shared-local
– Local memory, global pointer

• Shared-same-node
– Non-local memory (but on the same SMP node)

• Shared-remote
– Non-local memory on a different node; see the sketch below
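The sketch below (my reconstruction, not the benchmark source) shows how the four cases differ only in the kind of pointer being dereferenced; the actual microbenchmark times a read-modify-write of a double through each one.

/* UPC access-type sketch (illustration only). */
#include <upc.h>

#define BLOCK 1024
shared [BLOCK] double S[BLOCK*THREADS];   /* one block of the shared array per thread */
double P[BLOCK];                           /* private per-thread array */

void touch_each_kind(void)
{
    double                *priv = &P[0];                      /* private */
    shared [BLOCK] double *loc  = &S[MYTHREAD * BLOCK];       /* shared-local */
    shared [BLOCK] double *rem  = &S[((MYTHREAD + 1) % THREADS) * BLOCK];
    /* rem is shared-same-node or shared-remote, depending on where the
       owning thread is placed. */

    *priv += 1.0;   /* ordinary load/store                           */
    *loc  += 1.0;   /* global pointer, but no communication          */
    *rem  += 1.0;   /* may require a network round trip on a cluster */
}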

Page 20

UPC Microbenchmarks

• Architectures
– Compaq AlphaServer SC, v1.7 compiler (ORNL)
– Compaq AlphaServer Marvel, v2.1 compiler (Florida)
– Sun SunFire 6800 (UMD)
– AMD Athlon PC cluster (OSU)
– Cray T3E (MTU)
– SGI Origin 2000 (UNC)

Page 21

UPC Point-wise Data Access Costs (read-modify-write double)

[Chart: time per operation (ns, log scale) for Private, Shared-local, Shared-same-node, and Shared-remote accesses on the Alpha SC (v1.7 compiler), Alpha Marvel (v2.1 compiler), Sun SunFire 6800, AMD Athlon PC cluster, Cray T3E, and SGI Origin 2000]

• Global pointers significantly slower
• Improvement with newer UPC compilers

Page 22

Observations

• Fine-grain programming model is seductive
– Fine-grain access to shared data
– Simple, clean, easy to program

• Not a good reflection of clusters
– Efficient fine-grain communication not supported in hardware
– Architectural trend towards clusters, away from the Cray T3E

Page 23

Observations

• Programming model encourages poor performance
– Easy to write simple fine-grain parallel programs
– Poor performance on clusters
– Can code around this, often at the cost of complicating your model or changing your algorithm

• Dubious that compiler techniques will solve this problem
– Parallel algorithms with block data movement needed for clusters
– Compilers cannot robustly transform fine-grained code into efficient block-parallel algorithms

Page 24

Observations

• Hybrid programming model is easy to use
– Fine-grain shared data access is easy to program
– Use coarse-grain message passing for performance
– Faster code development, prototyping
– Resulting code cleaner, more maintainable

• Must avoid degrading local computations
– Allow the compiler to fully optimize code
– Usually not achieved in fine-grain programming
– Strength of using explicit messages (MPI)

Page 25

Recommendations

• Irregular coarse-grain algorithms
– For peak cluster performance, use message passing
– For quicker development, use the hybrid paradigm

• Use fine-grain remote accesses sparingly
– Exploit existing code / libraries where possible

• Irregular fine-grain algorithms
– Execute smaller problems on large SMPs
– Must develop coarse-grain alternatives for clusters

• Fine-grain programming on clusters is still just a dream
– Though compilers can help for regular access patterns

Page 26

Impact of Architecture Trends

• Trends
– Faster cluster interconnects (Quadrics, InfiniBand)
– Larger memories
– Processor / memory integration
– Multithreading

• Raw performance improving
– Faster networks (lower latency, higher bandwidth)
– Absolute performance will improve

• But same performance limitations!
– Avoid small messages
– Avoid software communication overhead
– Avoid penalizing local computation

Page 27

Related Work

• Parallel paradigms
– Many studies
– PMODELs (DOE / NSF) project

• UPC benchmarking
– T. El-Ghazawi et al. (GWU)
• Good performance on NAS benchmarks
• Mostly relies on upc_memput(), upc_memget()
– K. Yelick et al. (Berkeley)
• UPC compiler targeting GASNet
• Compiler attempts to aggregate remote accesses

Page 28

End of Talk