

Improving Performance of Communication Through Loop Scheduling in UPC

Michail Alvanos (a,d,1,*), Gabriel Tanase (b,2), Montse Farreras (e,6), Ettore Tiotto (c,4), José Nelson Amaral (f,1), Xavier Martorell (e,5)

a Barcelona Supercomputing Center, C/ Jordi Girona 1-3, Campus Nord UPC, 08034 Barcelona, Spain

b IBM Thomas J. Watson Research Center, 1101 Kitchawan Rd, Yorktown Heights, NY 10598, United States

c IBM Toronto Laboratory, Toronto, Canada

d IBM Canada CAS Research, Markham, Ontario, Canada

e Department of Computer Architecture, Universitat Politècnica de Catalunya, C/ Jordi Girona 1-3, 08034 Barcelona, Spain

f Department of Computing Science, University of Alberta, Athabasca Hall (ATH) 342, Edmonton, Alberta T6G 2E8, Canada

Abstract

Partitioned Global Address Space (PGAS) languages appeared to address programmer productivity on large-scale parallel machines. The main goal of a PGAS language is to provide the ease of use of the shared-memory programming model with the performance of MPI. Unified Parallel C programs containing all-to-all communication can suffer from shared access conflicts on the same node. Manual or compiler code optimization is required to avoid oversubscription of the nodes. The downside of manual code transformations is the increased program complexity, which hinders programmer productivity.

This paper explores loop scheduling algorithms and presents a compiler optimization that schedules the loop iterations to provide better network utilization and avoid node oversubscription. The compiler schedules the loop iterations to spread the communication uniformly across nodes. The evaluation shows mixed results: (i) a slowdown of up to 10% in the Sobel benchmark, (ii) performance gains from 3% up to 25% for the NAS FT and bucket-sort benchmarks, and (iii) up to 3.4X speedup for the microbenchmarks.

* Corresponding author
1 Email: [email protected]
2 Email: [email protected]
3 Email: [email protected]
4 Email: [email protected]
5 Email: [email protected]
6 Email: [email protected]



Keywords: Unified Parallel C, Partitioned Global Address Space, One-Sided Communication, Performance Evaluation, Loop Scheduling

1. Introduction

Partitioned global address space languages [1, 2, 3, 4, 5, 6, 7] promise simple means for developing applications that can run on parallel systems without sacrificing performance. These languages extend existing languages with constructs to express parallelism and data distribution. They provide a shared-memory-like programming model, where the address space is partitioned and the programmer has control over the data layout.

One of the characteristics of partitioned global address space (PGAS) languages is the transparency provided to the programmer when accessing shared memory. Accesses to shared memory lead to the automatic creation of additional runtime calls and possible network communication. A number of previous research efforts proposed different ways to improve performance, including prefetching [8], static coalescing [9, 10], and software caches [11].

Furthermore, high-radix network topologies are becoming a common approach [12, 13] to address the latency wall of modern supercomputers. A high-radix network provides low latency through a low hop count and delivers good performance on different traffic patterns. To effectively avoid network congestion, the programmer or the runtime must spread non-uniform traffic evenly over the different links. However, an important category of UPC applications, those that require all-to-all communication, may create communication hotspots without the programmer being aware of it. Although the UPC language provides collective operations, users often avoid them in favor of the simplicity of the language.

This paper explores possible loop scheduling schemes and proposes a loop transformation to improve the performance of the network communication. The paper explores three approaches of manual loop scheduling for loops with fine-grained communication and three approaches for loops with coarse-grained communication.


In the automatic loop transformation, the compiler transforms the loops to spread the communication between different network links and produce a more uniform traffic pattern. The implementation uses the XLUPC compiler platform [14] and the evaluation uses the IBM Power 775 [12]. Our contributions are:

• We propose and evaluate different loop scheduling schemes that increase the performance of applications by decreasing potential contention in the network. The performance improvement depends on the scheduling algorithm and the access type.

• We present a compiler loop transformation that transforms the loop to improve the performance of the all-to-all communication pattern. The compiler provides performance comparable to manually optimized loops.

The rest of this paper is organized as follows. Section 2 provides an overview of the Unified Parallel C language and introduces the contention problem. Section 3 presents the loop scheduling optimization. Section 4 presents the methodology used. Section 5 presents the evaluation and Section 6 presents the related work. Section 7 presents the conclusions.

2. Background

Partitioned Global Address Space (PGAS) programming languages provide a uniform programming model for local, shared, and distributed memory hardware. The programmer sees a single coherent shared address space, where variables may be directly read and written by any thread, but each variable is physically associated with a single thread. PGAS languages, such as Unified Parallel C [1], Co-Array Fortran [2], Fortress [3], Chapel [4], X10 [5], Global Arrays [15], and Titanium [6], extend existing languages with constructs to express parallelism and data distribution. The programmer can access each variable from different processes by using regular reads and writes. The downside of this approach is that the programmer is not always aware of the locality of data and can use remote accesses that lead to performance degradation.


2.1. Unified Parallel C

The Unified Parallel C (UPC) language [1] is an example of the PGAS programming model. The language is an extension of the C programming language [16] designed for high-performance computing on large-scale parallel machines. UPC uses a Single Program Multiple Data (SPMD) model of computation in which the amount of parallelism is fixed at program startup time. The UPC language can be mapped to distributed-memory machines, shared-memory machines, or hybrid systems, i.e., clusters of shared-memory machines.

Listing 1 presents the computation kernel of a parallel vector addition. The benchmark adds the contents of three vectors (A, B, and D) into the vector C. The programmer declares all vectors as shared arrays. Shared arrays can be accessed from all UPC threads. In this example, the programmer does not specify the layout qualifier (blocking factor), so the compiler assumes that the blocking factor is one. The construct upc_forall distributes loop iterations among the UPC threads. The fourth expression in the upc_forall construct is the affinity expression: it specifies that the owner thread of the specified element executes the ith loop iteration. The UPC compiler transforms the loop into a simple for loop with the proper bound limits and step size (Listing 2).

#define N 16384
shared int A[N+1], B[N+1], C[N], D[N];

upc_forall(i=0; i<N; i++; i)
    C[i] = A[i+1] + B[i+1] + D[i];

Listing 1: Example of a parallel upc_forall loop.

#define N 16384
shared int A[N+1], B[N+1], C[N], D[N];

for (i=MYTHREAD; i < N; i += THREADS){
    C[i] = A[i+1] + B[i+1] + D[i];
}

Listing 2: Example of a transformed parallel upc_forall loop.

2.1.1. Memory Allocation and Data Blocking

In UPC there are two different types of memory allocation: local memory allocations performed using malloc, which are currently outside of the tracking capability of the runtime, and shared memory allocated using UPC-specific constructs such as upc_all_alloc(), upc_alloc(), and upc_global_alloc(), or statically with the keyword shared.


Figure 1: Different blocking allocation schemes: "Perfect blocking" with BF = ARRAY_SIZE/THREADS or [*] (a), default blocking with BF = [1] or empty (b), and single UPC thread affinity blocking with BF = [0] or [] (c).

Shared arrays allocated using upc_all_alloc() are allocated symmetrically, at the same virtual memory address, on all locations. The user can declare shared arrays allocated in one thread's address space using upc_alloc().
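As a point of reference, the following is a minimal sketch of the two dynamic allocation styles using the standard UPC library calls; the array size, element type, and blocking factor are illustrative and assume the static THREADS environment used throughout the paper's listings.

#include <upc.h>

#define N 16384

/* Collective, symmetric allocation: every thread gets one block of
   N/THREADS doubles, at the same virtual address on all threads. */
shared [N/THREADS] double *A;

/* Allocation with affinity to the calling thread only. */
shared [] double *mine;

void allocate_shared(void)
{
    A    = (shared [N/THREADS] double *)
               upc_all_alloc(THREADS, (N/THREADS) * sizeof(double));
    mine = (shared [] double *)
               upc_alloc((N/THREADS) * sizeof(double));
}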

One of the key characteristics of Partitioned Global Address Space programming models is that they introduce some complexity (partitioning), because it was very difficult to obtain performance from the Global Address Space (GAS) languages of the 90s, which were usually implemented as software DSMs. Unified Parallel C supports three different blocking schemes, and the blocking factor plays a key role in the performance of the application. The programmer declares the shared array by setting the blocking factor at the array declaration or through the dynamic memory allocation. For example, the programmer declares an array in blocked form using the statement shared [*] int D[N]; or shared [N/THREADS] int D[N];. Figure 1 presents the different blocking schemes, and Figure 1(a) presents the most commonly used blocking scheme in applications.
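For reference, the three blocking schemes of Figure 1 correspond to the following declarations (the array size and element type are illustrative):

#define N 16384

/* (a) "Perfect" blocking: N/THREADS consecutive elements per thread. */
shared [*] int A[N];          /* equivalent to shared [N/THREADS] int A[N]; */

/* (b) Default blocking factor of one: elements distributed round-robin. */
shared int B[N];              /* equivalent to shared [1] int B[N]; */

/* (c) Indefinite blocking factor: all elements have affinity to thread 0. */
shared [] int C[N];           /* equivalent to shared [0] int C[N]; */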

2.1.2. Sources of Overhead

The compiler translates the shared accesses into runtime calls that fetch or modify the requested data. Each runtime call may imply communication, creating fine-grained communication that leads to poor performance. Previous work on eliminating fine-grained accesses [10, 17, 18] has improved the performance of this type of code by orders of magnitude. In this example, the compiler privatizes the accesses C[i] and D[i] (Listing 3). The compiler does not privatize the A[i+1] and B[i+1] accesses because these elements belong to other UPC threads.



local_ptr_C = _local_addr(C);
local_ptr_D = _local_addr(D);

for (i=MYTHREAD; i < N; i += THREADS){
    tmp0 = __xlupc_deref(&A[i+1]);
    tmp1 = __xlupc_deref(&B[i+1]);
    *(local_ptr_C + OFFSET(i)) = tmp0 +
        tmp1 + *(local_ptr_D + OFFSET(i));
}

Listing 3: Final form of a transformed parallel upc_forall loop.

In addition, loops that contain all-to-all or reduction communication can overwhelm the nodes and create network congestion. The impact of hotspot creation is even higher in high-radix networks such as the PERCS interconnect [19]. Listing 4 presents an example of a naive reduction code executed by all the UPC threads. In this example, all the UPC threads execute this part of the code, creating network hotspots. The array is allocated in blocked form, thus the first N/THREADS elements belong to the first UPC thread.

#define N 16384
shared [*] int A[N+1], B[N+1], C[N], D[N];

long long int calc(){
    long long int sum = 0;
    int i;
    for (i=0; i<N; i++)   // Executed by all UPC threads
        sum += A[i];
    return sum;
}

Listing 4: Example of reduction in UPC.
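To anticipate the scheduling discussed in Section 3, the following is a minimal sketch, assuming the declarations of Listing 4, <stdint.h>, and that N is divisible by THREADS, of how skewing the iteration start point spreads the reads of this reduction across the owning threads:

long long int calc_skewed(){
    long long int sum = 0;
    uint64_t i, block = N/THREADS;
    for (i=0; i<N; i++){
        /* Each UPC thread starts at its own block, so concurrent
           iterations target different owners instead of thread 0. */
        uint64_t idx = ( i + MYTHREAD*block ) % N;
        sum += A[idx];
    }
    return sum;
}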

2.2. XL UPC Framework

The experimental prototype for the code transformations described in this paper is built on top of the XLUPC compiler framework [14]. The XLUPC compiler has three main components: (i) the Front End (FE) transforms the UPC source code to an intermediate representation (W-Code);


Figure 2: The P775 system. A node (octant) consists of four POWER7 chips and a hub chip. A drawer contains up to eight compute nodes, and a supernode contains up to four drawers. Within a drawer, nodes are connected by LL links (24 GB/s per link, bidirectional); within a supernode, by LR links (5 GB/s per link, bidirectional); supernodes are connected by optical D links (10 GB/s per link, bidirectional).

(ii) the Toronto Portable Optimizer (TPO), a high-level optimizer, performs machine-independent optimizations; and (iii) a low-level optimizer performs machine-dependent optimizations. The FE translates the UPC source into an extended W-Code that annotates all the shared variables and other constructs with their UPC semantics. TPO contains optimizations for UPC and other languages, including C and C++. TPO performs a number of UPC-specific optimizations, such as locality analysis, parallel loop-nest optimizations, and the loop scheduling optimization presented in this paper. The XLUPC compiler uses the IBM PGAS runtime [20]. The runtime supports shared-memory multiprocessors using the Pthreads library and the Parallel Active Messaging Interface (PAMI) [21]. The runtime exposes to the compiler an Application Program Interface for managing shared data and synchronization.

2.3. IBM Power 775 Overview

The P775 [12] system employs a hierarchical design allowing highly scalable deployments of up to 512K processor cores. We include in Figure 2 a basic diagram of the system to help visualize the various networks employed. The compute node of the P775 consists of four POWER7 (P7) [22] CPUs and a hub chip [19], all managed by a single OS instance. The POWER7 processor has a 32 KByte instruction cache and a 32 KByte L1 data cache per core, 256 KBytes of second-level cache per core, and a 32 MByte third-level cache shared per chip.


Each core is equipped with four SMT threads and 12 execution units. The hub provides the network connectivity between the four P7 CPUs participating in the cache coherence protocol. Additionally, the hub acts as a switch supporting communication with other hubs in the system. There is no additional communication hardware present in the system (no switches). Each compute node (octant) has a total of 32 cores, 128 threads, and up to 512 GB of memory. The peak performance of a compute node is 0.98 Tflops/s.

A large P775 system is organized, at a higher level, in drawers consisting of 8 octants (256 cores) connected in an all-to-all fashion, for a total of 7.86 Tflops/s. The links between any two nodes of a drawer are referred to as Llocal (LL) links, with a peak bandwidth of 24 GBytes/s in each direction. A supernode consists of four drawers (1024 cores, 31.4 Tflops/s). Within a supernode, each pair of octants is connected with an Lremote (LR) link with 5 GBytes/s in each direction. A full P775 system may contain up to 512 supernodes (524288 cores) with a peak performance of 16 Pflops/s. Between each pair of supernodes multiple optical D-links are used, each D-link having a peak bandwidth of 10 GBytes/s in each direction. The machine has a partial all-to-all topology where any two compute nodes are at most three hops away. The large bandwidth of the all-to-all topology enabled various P775 systems to be top ranked in the HPC Challenge benchmarks (e.g., GUPS, FFT) [23] and Graph 500 [24].
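The peak-performance figures quoted above follow from the hierarchy (the per-node value of 0.98 Tflops/s is rounded, so the products are approximate):

  1 drawer    = 8 octants × 0.98 Tflops/s ≈ 7.86 Tflops/s
  1 supernode = 4 drawers × 7.86 Tflops/s ≈ 31.4 Tflops/s
  full system = 512 supernodes × 31.4 Tflops/s ≈ 16 Pflops/s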

3. Loop scheduling

The all-to-all communication pattern and the concurrent access of shared data allocated to one UPC thread often stress the interconnection network. The all-to-all communication pattern is one of the most important metrics for the evaluation of the bandwidth of the system. In this case, each thread communicates with all other UPC threads to exchange data. Moreover, this pattern shows up in a large number of scientific applications, including FFT, Sort, and Graph 500. Thus, by scheduling the loop iterations to improve the network's efficiency, the communication overhead can be significantly decreased. This section examines four different approaches to schedule loop iterations for either coarse-grained or fine-grained communication. The transformations assume that the programmer allocates the shared arrays in blocked fashion.


Furthermore, we assume that the loop upper bound is equal to the number of elements of the shared array. This assumption is not always true, but it simplifies the presentation of the algorithms. Finally, this section presents a solution for automatic loop transformation.

3.1. Approaches

The core idea is to schedule the accesses so that the threads do not access the same shared data at the same time. The programmer manually transforms the loop to increase the performance. We categorize the loop scheduling transformations in four categories:

• Skew loops to start the iteration from a different point of the loop iteration space. Listing 5 presents the modified code. This is possible in the UPC language by using the MYTHREAD keyword in the equation that calculates the induction variable. The new induction variable of the loop is calculated by:

  NEW_IV = (IV + MYTHREAD × Block) % UB,   where Block = SIZE_OF_ARRAY / THREADS

  and UB is the upper bound of the loop.

• Skew loops plus: the 'plus' is to uniformly distribute the communication among nodes. Each UPC thread starts from a different block of the shared array and also from a different position inside the block. The new induction variable is calculated by:

  NEW_IV = (IV + MYTHREAD × Block + MYTHREAD × (Block / THREADS)) % UB

  Figure 3 shows the differences between the two skewed versions; see also the index-computation sketch at the end of this list. The 'plus' version accesses elements from other UPC threads in diagonal form. This approach is expected to achieve better performance than the baseline because it spreads the communication more uniformly than the 'simple' skewed version.

• Strided accesses: Each thread starts from a different block. The loop increases the induction variable by a constant number: the number of UPC threads per node plus one. This approach requires that the upper bound of the loop is not divisible by the constant (the stride) [25, 26, 27].


The new induction variable is calculated by the following equation:

  NEW_IV = (IV × STRIDE + MYTHREAD) % UB

To ensure the non-divisibility of the loop upper bound we can use a simple algorithm:

  STRIDE = THREADS + 1;
  while (UB % STRIDE == 0) STRIDE++;

For example, on the Power 775 architecture, when running with 32 UPC threads per node and assuming that the upper bound of the loop is 2048, the new induction variable is calculated by:

  NEW_IV = (IV × 33 + MYTHREAD) % UB

shared [M/THREADS] double K[M];
shared [M/THREADS] double L[M];

double fine_grain_get_skew(){
    uint64_t i=0, block=M/THREADS;
    double res = 0;

    for (i=0; i<M; i++){
        uint64_t idx = ( i + MYTHREAD*block ) % M;
        res += K[idx];
    }
    return res;
}

Listing 5: Example of loop modifications when code contains fine-grained accesses.

• Random shuffled: the loop uses a look-up array that contains the thread numbers randomly shuffled. This approach works only when the upper bound of the loop is equal to the number of threads. This loop creates an all-to-all communication pattern through the network. This optimization is applicable to loops with both fine-grained and coarse-grained communication. There are two downsides when applying this approach to loops with fine-grained communication.


Figure 3: Different schemes of accessing a shared array. The shared object is allocated in blocked form. Each row represents data residing in one UPC thread and each box is an array element. The different access types are: (a) baseline: all UPC threads access the same data; (b) 'Skewed': each UPC thread accesses elements from a different UPC thread; (c) 'Skewed plus': each UPC thread accesses elements from a different thread and from a different point inside the block.

First, it requires SIZE_OF_ARRAY × NUM_THREADS × sizeof(uint64_t) bytes of memory, which cannot be allocated for large arrays. The second drawback is that the runtime (or the compiler) has to shuffle the array each time the upper bound of the loop changes, wasting valuable program execution time.
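The following is a minimal sketch of the index computations used by the skewed, skewed plus, and strided schemes described above, with Block, STRIDE, and UB as defined in the corresponding items; the helper names are illustrative, MYTHREAD and THREADS are the UPC built-ins, and uint64_t comes from <stdint.h>.

#include <stdint.h>

/* Baseline: every thread walks the array in the same order. */
uint64_t idx_baseline(uint64_t iv) { return iv; }

/* Skewed: each thread starts at its own block. */
uint64_t idx_skew(uint64_t iv, uint64_t block, uint64_t ub) {
    return (iv + MYTHREAD * block) % ub;
}

/* Skewed plus: additionally offset the starting point inside the block. */
uint64_t idx_skew_plus(uint64_t iv, uint64_t block, uint64_t ub) {
    return (iv + MYTHREAD * block + MYTHREAD * (block / THREADS)) % ub;
}

/* Strided: a different owner on every iteration; the caller picks a
   stride (e.g., UPC threads per node + 1) that does not divide ub. */
uint64_t idx_stride(uint64_t iv, uint64_t stride, uint64_t ub) {
    return (iv * stride + MYTHREAD) % ub;
}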

3.2. Compiler-assisted loop transformation

The idea of the compiler-assisted loop transformation is to conceal this complexity and make efficient use of the network straightforward for the programmer. First, the compiler collects normalized loops that contain shared references and have no loop-carried dependencies. Next, the compiler checks whether the upper bound of the loop is greater than or equal to the number of UPC threads, and whether the loop is not a upc_forall loop.

The compiler categorizes the loops in two groups based on the loop upper bound and the shared access type, and applies the transformation depending on the loop category. Figure 4 presents the compiler algorithm; a sketch of this decision logic appears at the end of this subsection. The compiler makes the decision based on the usage of upc_memget and upc_memput calls and the value of the loop upper bound. This type of code is more likely to contain all-to-all communication, thus making it an ideal target for the random shuffled solution. The compiler categorizes the loops in:

• Loops that have coarse-grained transfers and whose upper bound is the number of UPC threads. In this case, the runtime returns to the program a look-up table whose size is equal to the number of UPC threads.


The contents of the look-up table are the randomly shuffled values for the induction variable. We use this technique only for loops whose upper bound is equal to the number of threads; thus, the shuffled values range from 0 up to THREADS-1. To improve the performance of this approach, the runtime creates the shuffled array at the initialization phase. Next, the compiler replaces the induction variable inside the body of the loop with the value returned from the look-up table. Listing 6 presents the final form of the transformed loop. The all-to-all communication pattern belongs to this category.

shared [N/THREADS] double X[N];
shared [N/THREADS] double Y[N];

void memget_threads_rand(){
    uint64_t i=0, block = N/THREADS;
    double *lptr = (double *) &X[block*MYTHREAD];
    uint64_t *tshuffle = __random_thread_array();
    for ( i=0; i<THREADS; i++ ){
        uint64_t idx = tshuffle[i];
        upc_memget( lptr, &Y[idx*block], block*sizeof(double));
    }
}

Listing 6: Compiler-transformed loop with coarse-grained access.
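The paper does not show the implementation of __random_thread_array(); the following is a minimal sketch of how the runtime could build the shuffled thread table once at initialization using a Fisher-Yates shuffle. The lazy-initialization scheme and the rand()-based source of randomness are assumptions, not the IBM PGAS runtime code.

#include <stdint.h>
#include <stdlib.h>

static uint64_t *__thread_shuffle = NULL;

/* Return a (cached) random permutation of 0 .. THREADS-1. */
uint64_t *__random_thread_array(void)
{
    if (__thread_shuffle == NULL) {
        uint64_t i;
        __thread_shuffle = malloc(THREADS * sizeof(uint64_t));
        for (i = 0; i < (uint64_t)THREADS; i++)
            __thread_shuffle[i] = i;
        for (i = THREADS - 1; i > 0; i--) {   /* Fisher-Yates shuffle */
            uint64_t j = (uint64_t)rand() % (i + 1);
            uint64_t tmp = __thread_shuffle[i];
            __thread_shuffle[i] = __thread_shuffle[j];
            __thread_shuffle[j] = tmp;
        }
    }
    return __thread_shuffle;
}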

shared [N/THREADS] double X[N];

void memget_threads(){
    uint64_t i=0, block = N/THREADS;
    for ( i=0; i<N; i++ ){
        uint64_t idx = ( i + MYTHREAD*block ) % N;
        ... = X[idx];
    }
}

Listing 7: Compiler-transformed loop with fine-grained access. We assume the upper bound to be equal to the size of the array.

• Loops that contain fine-grained communication, or coarse-grained transfers where the upper bound of the loop is different from the number of UPC threads.


Figure 4: Automatic compiler loop scheduling. A candidate loop that contains upc_mem* calls and whose upper bound equals THREADS uses a shuffled array with the number of threads; otherwise the compiler skews the loop iterations.

In this case, the compiler skews the iterations in such a way that each UPC thread starts executing from a different point in the shared array. The compiler uses the simple 'skew' form to avoid the creation of additional runtime calls. Finally, the compiler replaces the occurrences of the induction variable inside the loop body. Listing 7 illustrates the final form of a loop containing fine-grained communication.
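The following is a compact restatement of the decision of Figure 4, roughly as the compiler pass could encode it; the function and type names are illustrative and not the XLUPC internals.

/* A candidate loop is normalized, contains shared references, has no
   loop-carried dependences, has an upper bound >= THREADS, and is not
   a upc_forall loop. */
typedef enum { SCHED_SHUFFLE, SCHED_SKEW } loop_sched_t;

loop_sched_t choose_schedule(int has_upc_mem_calls, long ub, long nthreads)
{
    if (has_upc_mem_calls && ub == nthreads)
        return SCHED_SHUFFLE;  /* coarse-grained all-to-all: shuffled thread table */
    return SCHED_SKEW;         /* otherwise: skew the iteration start points */
}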

4. Methodology

The evaluation uses two different Power 775 systems. The first one has 1024 nodes and is used to evaluate the manual code modifications. The second system contains 64 nodes, allowing runs with up to 2048 UPC threads, and is used to evaluate the automatic compiler transformations.

We use one process per UPC thread and schedule one UPC thread per POWER7 core. The UPC threads are grouped in blocks of 32 per node and each UPC thread is bound to its own core. The experimental evaluation runs each benchmark five times. The results presented in this evaluation are the average of the execution time over the five runs. In all experiments the execution time variation is less than 3%. All benchmarks are compiled using the '-qarch=pwr7 -qtune=pwr7 -O3 -qprefetch=aggressive' compiler flags. The evaluation varies the data set size with the number of UPC threads (weak scaling).


4.1. Benchmarks and Datasets

The evaluation uses three microbenchmarks and four applications:

Microbenchmark: The microbenchmark is a loop that accesses a shared array of structures. There are three variations of this microbenchmark. In all versions each UPC thread executes the same code in the loop. In the upc_memput microbenchmark the loop contains coarse-grained upc_memput calls. The fine-grained get variation contains shared reads and the fine-grained put variation contains shared writes. Listing 8 presents the code used in the upc_memput benchmark.

#define N (1llu<<32)
shared [N/THREADS] double X[N];
shared [N/THREADS] double Y[N];

void memput_threads(){
    uint64_t i=0; uint64_t block = N/THREADS;
    double *lptr = (double *) &X[block*MYTHREAD];

    for (i=0; i<THREADS; i++){
        uint64_t idx = ( i );
        upc_memput( &Y[idx*block], lptr, block*sizeof(double));
    }
}

Listing 8: Example of a upc_memput loop in UPC.

Sobel: The Sobel benchmark computes an approximation of the gradient of the image intensity function, performing a nine-point stencil operation. In the UPC version [28] the image is represented as a one-dimensional shared array of rows and the outer loop is a parallel upc_forall loop. Listing 9 presents the kernel of the Sobel benchmark.

Gravitational fish: The gravitational UPC fish benchmark emulates fish movements based on gravity. The benchmark is an N-body gravity simulation using parallel ordinary differential equations [29]. There are three loops in the benchmark that access shared data: one for the computation of the acceleration between fishes, one for data exchange, and one for the calculation of the new positions.

Bucket-sort: The benchmark sorts an array of 16-byte records using the bucket-sort [30] algorithm. Each node generates its share of the records. Each thread uses a (17/16) × 2 GB buffer to hold records received from other threads, which are destined to be sorted on this thread.


Once this thread has generated all its share of the records, it distributes the remainder of each bucket to the corresponding thread. Once this thread has received all its appropriate data from each of the other threads, it performs a sort on the local data.
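The following is an illustrative sketch of the exchange step described above; it is not the benchmark's actual code, and the buffer layout, sizes, and names are assumptions (as in the paper's listings, a static THREADS environment is assumed). Each thread pushes the remainder of every bucket into its own slot of the owner's receive block with upc_memput, visiting destinations in a skewed order.

#include <upc.h>
#include <stddef.h>

#define REC_SIZE   16                      /* 16-byte records            */
#define SLOT_RECS  (1u << 20)              /* records per sender slot    */
#define SLOT_BYTES (SLOT_RECS * REC_SIZE)

/* Receiver t owns block t; inside it, sender s writes into slot s. */
shared [THREADS*SLOT_BYTES] char recv_buf[THREADS*THREADS*SLOT_BYTES];

char   bucket[THREADS][SLOT_BYTES];        /* locally generated records  */
size_t bucket_bytes[THREADS];              /* valid bytes per bucket     */

void exchange_buckets(void)
{
    int i;
    for (i = 0; i < THREADS; i++) {
        /* Skewed destination order so all threads do not hit owner 0 first. */
        int dest = (i + MYTHREAD) % THREADS;
        upc_memput(&recv_buf[(size_t)dest*THREADS*SLOT_BYTES
                             + (size_t)MYTHREAD*SLOT_BYTES],
                   bucket[dest], bucket_bytes[dest]);
    }
    upc_barrier;
}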

FT: The benchmark is part of the NAS Parallel Benchmarks (NPB) [31] suite. The benchmark solves a three-dimensional partial differential equation (PDE) using the fast Fourier transform (FFT). The benchmark creates an all-to-all communication pattern over the network.

typedef struct { uint8_t r[IMGSZ]; } RowOfBytes;
shared RowOfBytes orig[IMGSZ];
shared RowOfBytes edge[IMGSZ];

void Sobel_upc(void){
    int i,j,d1,d2; double magn;
    upc_forall(i=1; i<IMGSZ-1; i++; &edge[i].r[0]){
        for (j=1; j<IMGSZ-1; j++){
            d1  = ((int) orig[i-1].r[j+1] - orig[i-1].r[j-1]);
            d1 += ((int) orig[ i ].r[j+1] - orig[ i ].r[j-1])<<1;
            d1 += ((int) orig[i+1].r[j+1] - orig[i+1].r[j-1]);

            d2  = ((int) orig[i-1].r[j-1] - orig[i+1].r[j-1]);
            d2 += ((int) orig[i-1].r[ j ] - orig[i+1].r[ j ])<<1;
            d2 += ((int) orig[i-1].r[j+1] - orig[i+1].r[j+1]);

            magn = sqrt((double)(d1*d1 + d2*d2));
            edge[i].r[j] = (uint8_t) ((magn > 255) ? 255 : magn);
        }
    }
}

Listing 9: UPC version of Sobel.

5. Experimental results

This experimental evaluation assesses the effectiveness of the loop transformation by presenting: (1) a study of manual code transformations on microbenchmarks, using a large number of UPC threads, to help understand the maximum speedup that can be achieved and the potential performance bottlenecks; and (2) the performance of compiler-transformed microbenchmarks and real applications.


Figure 5: Effect of loop scheduling policies on performance for upc_memput. The plot shows bandwidth (GB/s) versus the number of UPC threads (32 to 32768) for the baseline, skewed, strided, and random versions.


5.1. Manual code transformations study

Figure 5 presents the results for the coarse-grained microbenchmark. The drop in performance when going from 512 to 1024 UPC threads is due to the HUB link limits: the microbenchmark uses two supernodes when running with 1024 UPC threads, so a portion of the communication between the UPC threads uses the (remote) D-links. Furthermore, the strided version has lower performance than the random shuffled version. The traffic randomization balances the use of global links, reducing contention. In contrast, the strided version creates less randomized, more predictable traffic by targeting a different node on each loop iteration. Overall, performance for a coarse-grained communication pattern is better when using random allocation. Note that there is no skewed plus version because the upc_memget and upc_memput calls always start from the beginning of the block.

Figure 6 illustrates the performance of the fine-grained microbenchmarks. The skewed and skewed plus versions have similar performance in the fine-grained category; thus, using a different starting point inside the block has no real impact on performance. The strided version has worse performance than the skewed version for the fine-grained get. On the other hand, the strided version has better performance for the fine-grained put. This occurs because the runtime issues in-order remote stores that target the same remote node.


Figure 6: Effect of loop scheduling policies on performance for fine-grained get (a) and fine-grained put (b). Each plot shows bandwidth (GB/s) versus the number of UPC threads (32 to 8192) for the baseline, skewed, skewed plus, and strided versions.

Moreover, the performance of the fine-grained put is an order of magnitude higher than that of the fine-grained get. This behaviour is especially noticeable in the strided version. The main reason is that the runtime allows overlapping of store/put operations when the destination node is different. On the other hand, for read/get operations the runtime blocks and waits for the transfers to finish.

Overall, for loops that contain coarse-grained memget/memput transfers, the compiler should use the random shuffled schedule. On the other hand, the compiler should use the skewing transformation for loops with fine-grained communication.

5.2. Compiler-assisted loop transformation

This section compares the automatic compiler transformation with the manual transformation. The main difference is that the manual approach avoids the additional overhead of inserted runtime calls. Figure 7 compares the compiler-transformed and hand-optimized code. While the performance of the manual and compiler-transformed fine-grained microbenchmarks is similar, the compiler transformation achieves slightly lower performance than the hand-optimized benchmark because of the inserted runtime calls.

Figure 8 presents the application results. There are three different patterns in the applications. In the first category are applications that have a performance gain compared with the version without the scheduling (baseline), such as the NAS FT benchmark. This benchmark achieves from 3% up to 15% performance gain, due to its all-to-all communication pattern.


Figure 7: Comparison of compiler-transformed and hand-optimized code: upc_memput (a), fine-grained get (b), and fine-grained put (c). Each plot shows bandwidth versus the number of UPC threads for the baseline, compiler-transformed, and hand-optimized versions.

Figure 8: Comparison of baseline and compiler-transformed code for fish (a), Sobel (b), and NAS FT (c). Each plot shows throughput versus the number of UPC threads (32 to 1024).

Moreover, the performance of the gravitational fish benchmark is almost identical, and the transformation reveals minimal performance gains. On the other hand, the performance of the Sobel benchmark decreases by up to 20% compared with the baseline version, because of poor cache locality. Table 1 presents the cache misses for the Sobel benchmark at different cache levels, measured using hardware counters. The main reason for the bad cache locality is that Sobel uses an array of structs and each struct contains an array. The transformation 'skews' the iterations of the loop that accesses the inner array, causing bad cache locality.

Figure 9 presents the results for bucket-sort with the local sort enabled and disabled. There are minor differences between the baseline and the transformed version when the local sort is enabled.


Cache level    Baseline    Schedule
Level 1        0.14%       0.19%
Level 2        0.19%       24.49%
Level 3        0.32%       28.84%

Table 1: Cache misses for the Sobel benchmark using 256 UPC threads. Results are the average from each of the 256 cores.

Figure 9: Comparison of baseline and compiler-transformed code for bucket-sort (a) and bucket-sort with only the communication pattern (b), in million records/s for 32 to 2048 UPC threads.

However, the transformed version has better results, up to 25% performance gain, than the baseline version when only the communication part is used. In Figure 9(b) the performance for 32 UPC threads is 40% worse than the baseline because of the overhead of the additional runtime calls. The effectiveness of the loop transformation is limited when running with fewer than 32 UPC threads.

5.3. Summary

The evaluation indicates that the compiler transformation is an effective technique for increasing the network performance of UPC programs. The results show a performance gain from 5% up to 25% compared to the baseline versions of the applications. Moreover, the microbenchmark results show even higher performance gains of up to 3.4X. On the other hand, the loop transformation has a negative effect on cache locality in some benchmarks, such as Sobel.


6. Related Work

Overall, memory optimization is a widely researched topic [32, 33]. Traditional loop transformations focus on increasing cache performance and bandwidth [34, 35]. Researchers also use loop scheduling techniques to improve the performance of Non-Uniform Memory Access (NUMA) machines [36, 37] and heterogeneous machines [38, 39].

The Crystallizing Fortran Project compiler [40] employs different algorithms [41] to change the data layout of the application to improve performance in distributed environments. In contrast, our approach modifies the order of loop iterations and not the data layout of the application.

Other runtime implementations [42, 14] use the techniques presented in this paper to schedule the communication internally in the runtime. However, the programmer must use the runtime calls to make efficient use of the interconnect. In contrast, our approach does not require the user to explicitly use runtime calls to improve the network performance.

Recent efforts on loop transformations focus on reducing memory bank conflicts, especially in GPUs [43] and embedded systems [44, 32]. In a similar way, our approach exploits loop transformations to increase the network efficiency by distributing the shared accesses across the UPC threads. In contrast with most of the previous work, our approach employs skewed and random distributions to improve the performance.

The Dragonfly interconnect [45] uses randomized routing to reduce network hotspots. However, this randomized routing requires additional hardware support. We move the complexity of loop scheduling to software. Our compiler transformation can offer comparable performance by randomizing the source and destination pairs involved in communication.

Other approaches to produce a uniform random traffic pattern include randomized task placement [46] and adaptive routing [27]. The authors in [27] randomize the routing at runtime to achieve better performance on different traffic patterns. Randomized task placement [46] can increase the amount of randomized traffic and avoid hotspots. However, other researchers showed [47] that, despite the improved results, the performance of randomized uniform traffic is still far from ideal.

7. Conclusions and Future Work

This paper presents an optimization to increase the performance of different communication patterns. The optimization schedules the loop iterations to increase the network performance.


The paper evaluates different approaches to schedule the loop iterations to optimize the network traffic and to avoid the creation of network hotspots. Moreover, the paper presents a compiler optimization that automatically transforms the loops to improve network performance. The compiler transformation achieves performance that is similar to the manually modified versions. Even though the communication efficiency increases, there is still room for further improvement of the compiler transformations. Nevertheless, this paper makes a significant contribution to improving the performance of programs that contain access patterns that are problematic for high-radix networks, including all-to-all communication.

Acknowledgments

The researchers at Universitat Politècnica de Catalunya and Barcelona Supercomputing Center are supported by the IBM Centers for Advanced Studies Fellowship (CAS2012-069), the Spanish Ministry of Science and Innovation (TIN2007-60625, TIN2012-34557, and CSD2007-00050), the European Commission in the context of the HiPEAC3 Network of Excellence (FP7/ICT 287759), and the Generalitat de Catalunya (2009-SGR-980). IBM researchers are supported by the Defense Advanced Research Projects Agency under its Agreement No. HR0011-07-9-0002. The researchers at the University of Alberta are supported by the NSERC Collaborative Research and Development (CRD) program of Canada. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the funding agencies.

References

[1] UPC Consortium, UPC Specifications, v1.2, Tech. rep., Lawrence Berkeley National Lab Tech Report LBNL-59208.

[2] R. Numrich, J. Reid, Co-Array Fortran for parallel programming, Tech. rep. (1998).

[3] E. Allen, D. Chase, J. Hallett, V. Luchangco, J.-W. Maessen, S. Ryu, G. L. Steele Jr., S. Tobin-Hochstadt, The Fortress Language Specification Version 1.0, http://labs.oracle.com/projects/plrg/Publications/fortress.1.0.pdf (March 2008).


[4] Cray Inc., Chapel Language Specification Version 0.8, http://chapel.cray.com/spec/spec-0.8.pdf (April 2011).

[5] P. Charles, C. Grothoff, V. Saraswat, C. Donawa, A. Kielstra, K. Ebcioglu, C. von Praun, V. Sarkar, X10: An Object-oriented Approach to Non-uniform Cluster Computing, 40 (10) (2005) 519–538.

[6] K. A. Yelick, L. Semenzato, G. Pike, C. Miyamoto, B. Liblit, A. Krishnamurthy, P. N. Hilfinger, S. L. Graham, D. Gay, P. Colella, A. Aiken, Titanium: A High-performance Java Dialect, Concurrency - Practice and Experience 10 (11-13) (1998) 825–836.

[7] J. Lee, M. Sato, Implementation and Performance Evaluation of XcalableMP: A Parallel Programming Language for Distributed Memory Systems, in: Parallel Processing Workshops (ICPPW), 2010 39th International Conference on, 2010, pp. 413–420. doi:10.1109/ICPPW.2010.62.

[8] M. Alvanos, M. Farreras, E. Tiotto, X. Martorell, Automatic Communication Coalescing for Irregular Computations in UPC Language, in: Conference of the Center for Advanced Studies, CASCON '12.

[9] W.-Y. Chen, C. Iancu, K. Yelick, Communication Optimizations for Fine-Grained UPC Applications, in: 14th International Conference on Parallel Architectures and Compilation Techniques, 2005.

[10] C. Barton, G. Almasi, M. Farreras, J. N. Amaral, A Unified Parallel C compiler that implements automatic communication coalescing, in: 14th Workshop on Compilers for Parallel Computing, 2009.

[11] Z. Zhang, J. Savant, S. Seidel, A UPC Runtime System Based on MPI and POSIX Threads, in: Parallel, Distributed, and Network-Based Processing, Euromicro Conference on.

[12] R. Rajamony, L. Arimilli, K. Gildea, PERCS: The IBM POWER7-IH high-performance computing system, IBM Journal of Research and Development 55 (3) (2011) 3:1.


[13] G. Faanes, A. Bataineh, D. Roweth, T. Court, E. Froese, B. Alverson, T. Johnson, J. Kopnick, M. Higgins, J. Reinhard, Cray Cascade: a Scalable HPC System Based on a Dragonfly Network, in: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, SC '12, 2012.

[14] G. Tanase, G. Almasi, E. Tiotto, M. Alvanos, A. Ly, B. Dalton, Performance Analysis of the IBM XL UPC on the PERCS Architecture, Tech. rep., RC25360 (2013).

[15] J. Nieplocha, R. J. Harrison, R. J. Littlefield, Global Arrays: A nonuniform memory access programming model for high-performance computers, The Journal of Supercomputing 10 (2) (1996) 169–189.

[16] ISO/IEC JTC1 SC22 WG14, ISO/IEC 9899:TC2 Programming Languages - C, Tech. rep., http://www.open-std.org/JTC1/SC22/WG14/www/docs/n1124.pdf (May 2005).

[17] C. Barton, C. Cascaval, G. Almasi, Y. Zheng, M. Farreras, S. Chatterjee, J. N. Amaral, Shared memory programming for large scale machines, in: Programming Language Design and Implementation (PLDI), 2006, pp. 108–117.

[18] M. Alvanos, M. Farreras, E. Tiotto, J. N. Amaral, X. Martorell, Improving Communication in PGAS Environments: Static and Dynamic Coalescing in UPC, in: Proceedings of the 27th Annual International Conference on Supercomputing, ICS '13.

[19] B. Arimilli, R. Arimilli, V. Chung, S. Clark, W. Denzel, B. Drerup, T. Hoefler, J. Joyner, J. Lewis, J. Li, N. Ni, R. Rajamony, The PERCS High-Performance Interconnect, in: High-Performance Interconnects, Symposium on, 2010, pp. 75–82.

[20] G. I. Tanase, G. Almasi, H. Xue, C. Archer, Composable, Non-blocking Collective Operations on Power7 IH, in: Proceedings of the 26th ACM International Conference on Supercomputing, ICS '12, pp. 215–224.

[21] IBM Redbooks, PAMI Programming Guide, 2011, http://publib.boulder.ibm.com/epubs/pdf/a2322730.pdf.


[22] R. Kalla, B. Sinharoy, W. Starke, M. Floyd, Power7: IBM's Next-Generation Server Processor, IEEE Micro 30 (2) (2010) 7–15.

[23] HPCC, HPC Challenge Benchmark Results, http://icl.cs.utk.edu/hpcc/ (March 2013).

[24] The Graph 500 List, http://www.graph500.org/results_june_2012 (June 2012).

[25] Das, Sarkar, Conflict-free data access of arrays and trees in parallel memory systems, in: Proceedings of the 1994 6th IEEE Symposium on Parallel and Distributed Processing, SPDP '94, IEEE Computer Society, Washington, DC, USA, 1994, pp. 377–384.

[26] D. L. Erickson, Conflict-free access to rectangular subarrays in parallel memory modules, Ph.D. thesis, Waterloo, Ontario, Canada, UMI Order No. GAXNN-81075 (1993).

[27] M. Garcia, E. Vallejo, R. Beivide, M. Odriozola, C. Camarero, M. Valero, G. Rodriguez, J. Labarta, C. Minkenberg, On-the-fly Adaptive Routing in High-Radix Hierarchical Networks, in: Parallel Processing (ICPP), 2012 41st International Conference on, IEEE, 2012, pp. 279–288.

[28] T. El-Ghazawi, F. Cantonnet, UPC performance and potential: a NPB experimental study, in: Proceedings of the 2002 ACM/IEEE Conference on Supercomputing, Supercomputing '02, pp. 1–26.

[29] S. Aarseth, Gravitational N-Body Simulations: Tools and Algorithms, Cambridge Monographs on Mathematical Physics, Cambridge University Press, Cambridge, U.K.; New York, U.S.A., 2003.

[30] T. H. Cormen, C. E. Leiserson, R. L. Rivest, C. Stein, Introduction to Algorithms, MIT Press, 2001, ISBN 0262032937.

[31] H. Jin, R. Hood, P. Mehrotra, A practical study of UPC using the NAS Parallel Benchmarks, in: Proceedings of the Third Conference on Partitioned Global Address Space Programming Models, PGAS '09, 2009, pp. 8:1–8:7.


[32] P. R. Panda, F. Catthoor, N. D. Dutt, K. Danckaert, E. Brockmeyer, C. Kulkarni, A. Vandercappelle, P. G. Kjeldsberg, Data and Memory Optimization Techniques for Embedded Systems, ACM Transactions on Design Automation of Electronic Systems 6 (2) (2001) 149–206.

[33] P. Marchal, J. I. Gomez, F. Catthoor, Optimizing the Memory Bandwidth with Loop Fusion, in: Proceedings of the 2nd IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis, CODES+ISSS '04, 2004, pp. 188–193.

[34] J. R. Allen, K. Kennedy, Automatic Loop Interchange, in: Proceedings of the 1984 Symposium on Compiler Construction, pp. 233–246.

[35] M. E. Wolf, D. E. Maydan, D.-K. Chen, Combining loop transformations considering caches and scheduling, in: Proceedings of the 29th Annual ACM/IEEE International Symposium on Microarchitecture, MICRO 29, 1996, pp. 274–286.

[36] E. P. Markatos, T. J. LeBlanc, Using processor affinity in loop scheduling on shared-memory multiprocessors, IEEE Transactions on Parallel and Distributed Systems 5 (4) (1994) 379–400.

[37] H. Li, S. Tandri, M. Stumm, K. C. Sevcik, Locality and loop scheduling on NUMA multiprocessors, in: Parallel Processing, 1993. ICPP 1993. International Conference on, Vol. 2, IEEE, 1993, pp. 140–147.

[38] M. Cierniak, W. Li, M. J. Zaki, Loop scheduling for heterogeneity, in: High Performance Distributed Computing, 1995, Proceedings of the Fourth IEEE International Symposium on, IEEE, 1995, pp. 78–85.

[39] A. T. Chronopoulos, R. Andonie, M. Benche, D. Grosu, A class of loop self-scheduling for heterogeneous clusters, in: Proceedings of the 2001 IEEE International Conference on Cluster Computing, Vol. 291, 2001.

[40] M. Chen, Y.-i. Choo, J. Li, Compiling parallel programs by optimizing performance, The Journal of Supercomputing 2 (2) (1988) 171–207.

[41] J. Li, M. Chen, Compiling Communication-Efficient Programs for Massively Parallel Machines, IEEE Transactions on Parallel and Distributed Systems (1991) 361–376.


[42] R. Rajamony, M. Stephenson, E. Speight, The Power 775 Architecture at Scale, Tech. rep., RC25366 (2013).

[43] M. M. Baskaran, U. Bondhugula, S. Krishnamoorthy, J. Ramanujam, A. Rountev, P. Sadayappan, A Compiler Framework for Optimization of Affine Loop Nests for GPGPUs, in: Proceedings of the 22nd Annual International Conference on Supercomputing, ICS '08, pp. 225–234.

[44] Q. Zhang, Q. Li, Y. Dai, C.-C. Kuo, Reducing Memory Bank Conflict for Embedded Multimedia Systems, in: Multimedia and Expo (ICME '04), IEEE International Conference on, pp. 471–474.

[45] J. Kim, W. J. Dally, S. Scott, D. Abts, Technology-driven, highly-scalable dragonfly topology, in: Proceedings of the 35th Annual International Symposium on Computer Architecture, ISCA '08, pp. 77–88.

[46] A. Bhatele, N. Jain, W. D. Gropp, L. V. Kale, Avoiding hot-spots on two-level direct networks, in: High Performance Computing, Networking, Storage and Analysis (SC), 2011 International Conference for, IEEE, 2011, pp. 1–11.

[47] A. Jokanovic, B. Prisacari, G. Rodriguez, C. Minkenberg, Randomizing task placement does not randomize traffic (enough), in: Proceedings of the 2013 Interconnection Network Architecture: On-Chip, Multi-Chip, IMA-OCMC '13, ACM, New York, NY, USA, 2013, pp. 9–12.
