Performance Evaluation of OpenMP Applications on Virtualized Multicore Machines
Jie Tao (1), [email protected]
Karl Fuerlinger (2), [email protected]
Holger Marten (1), [email protected]
(1) Steinbuch Center for Computing, Karlsruhe Institute of Technology (KIT), Germany
(2) MNM-Team, Department of Computer Science, LMU München, Germany
Outline
Introduction
– Virtualization and the impact on performance
Experimental Setup
– NAS Parallel Benchmarks, SPEC OpenMP, microbenchmarks
Study of SP (NAS Parallel Benchmarks)
– Initial performance
– Analysis using ompP
– Optimization results and microbenchmark study
Conclusions
Virtualization
Running multiple OSs on the same hardware
Concepts
– Hypervisor (Xen, KVM, VMware)
– Full virtualization vs. para-virtualization
Adopted for
– Server consolidation
– Cloud computing: on-demand resource provisioning
Performance impact
[Diagram: a native stack (application running on an operating system on the hardware of the host machine) next to a virtualized stack, where a hypervisor on the hardware hosts guest OSs in VM 1 to VM 4.]
Performance Impact of Virtualization
Has been studied before; see, e.g., Keith Jackson et al., "Performance of HPC Applications on the Amazon Web Services Cloud"
Here: The performance impact of virtualization on OpenMP applications
Experimental Setup
Benchmarks
– NAS OpenMP (class A)
– SPEC OpenMP (reference dataset)
– EPCC OpenMP Microbenchmarks
Host machine
– AMD Opteron 2376 ("Shanghai"), 2.3 GHz, 2-socket quad-core
– Scientific Linux
– Virtualized with Xen
Virtual machines
– Hypervisor: Xen
– OS: Debian, Linux kernel 2.6.26
– Compiler: gcc 4.3.2
– #cores: 1-8
– Memory: 4 GB
NAS Parallel Benchmarks
NAS Parallel Benchmarks (2)
SPEC OpenMP Benchmarks
SPEC OpenMP Benchmarks (2)
Execution time of NAS SP
What is going on here?
OpenMP Performance Analysis with ompP
ompP: OpenMP profiling tool
– Based on source code instrumentation
– Independent of the compiler and runtime used
– Supports HW counters through PAPI
– Uses the source code instrumenter Opari from the KOJAK/Scalasca toolset
– Available for download (GPL): http://www.ompp-tool.com
[Diagram: ompP workflow. The source code is automatically instrumented at OpenMP constructs (manual region instrumentation is also possible, sketched below), linked with the ompP library into an executable, and executed on the parallel machine; settings such as HW counters and the output format are passed via environment variables, yielding a profiling report.]
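Besides the automatic instrumentation of OpenMP constructs, arbitrary source regions can be profiled as well. A minimal sketch, assuming Opari's POMP user-instrumentation pragmas (#pragma pomp inst begin/end) and a made-up region name init_phase; the file compiles as-is, since an unprocessed unknown pragma is simply ignored, and the region only shows up in the report after the source has been run through the ompP/Opari instrumenter:

#include <stdio.h>
#include <omp.h>

/* Hypothetical example of manual region instrumentation: the "pomp inst"
 * pragmas are Opari's user-region directives, and "init_phase" is an
 * invented region name for this sketch. */
int main(void)
{
    static double a[1000];

#pragma pomp inst begin(init_phase)    /* user-defined region starts here */
    for (int i = 0; i < 1000; i++)
        a[i] = 0.5 * i;
#pragma pomp inst end(init_phase)      /* ...and ends here */

#pragma omp parallel                   /* instrumented automatically by Opari */
    printf("thread %d of %d, a[10]=%g\n",
           omp_get_thread_num(), omp_get_num_threads(), a[10]);

    return 0;
}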
Source to Source Instrumentation with Opari
Preprocessor instrumentation
– Example: instrumenting OpenMP constructs with Opari
– Preprocessor operation
– Example: instrumentation of a parallel region
Original source code:

#pragma omp parallel
{
  /* user code in parallel region */
}

Modified (instrumented) source code after the Opari preprocessor (the POMP calls are the instrumentation added by Opari):

POMP_Parallel_fork [master]
#pragma omp parallel
{
  POMP_Parallel_begin [team]
  /* user code in parallel region */
  POMP_Barrier_enter [team]
  #pragma omp barrier
  POMP_Barrier_exit [team]
  POMP_Parallel_end [team]
}
POMP_Parallel_join [master]
ompP’s Profiling Data
Example code section and performance profile:
Code:
#pragma omp parallel
{
  #pragma omp critical
  {
    sleep(1.0);
  }
}
Profile:
R00002 main.c (34-37) (default) CRITICAL
 TID      execT      execC      bodyT     enterT      exitT
   0       3.00          1       1.00       2.00       0.00
   1       1.00          1       1.00       0.00       0.00
   2       2.00          1       1.00       1.00       0.00
   3       4.00          1       1.00       3.00       0.00
 SUM      10.01          4       4.00       6.00       0.00
Components:
– Source code location and type of region
– Timing data and execution counts, depending on the particular construct
– One line per thread, last line sums over all threads
– Hardware counter data (if PAPI is available and HW counters are selected)
– Data is “exact” (measured, not based on sampling)
ompP Overhead Analysis (1)
Certain timing categories reported by ompP can be classified as overheads:
– Example: enterT in a critical section: threads wait to enter the critical section (synchronization overhead); see the worked example after this list.
Four overhead categories are defined in ompP:
– Imbalance: waiting time incurred due to an imbalanced amount of work in a worksharing or parallel region
– Synchronization: overhead that arises due to threads having to synchronize their activity, e.g. a barrier call
– Limited Parallelism: idle threads due to not enough parallelism being exposed by the program
– Thread management: overhead for the creation and destruction of threads, and for signaling critical sections and locks as available
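As a worked example of the synchronization category, using the critical-section profile shown earlier: the four threads execute the 1-second critical body one after another, so they wait roughly 0, 1, 2 and 3 seconds before entering, and

\[
\mathrm{enterT_{SUM}} = 2.00 + 0.00 + 1.00 + 3.00 = 6.00\ \mathrm{s},
\qquad
\frac{\mathrm{enterT_{SUM}}}{\mathrm{execT_{SUM}}} = \frac{6.00}{10.01} \approx 60\%
\]

of the time spent in that region is classified as synchronization overhead.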
ompP Overhead Analysis (2)
S: Synchronization overhead
M: Thread management overhead
I: Imbalance overhead
L: Limited Parallelism overhead
Overhead Analysis for the NAS Benchmarks
Benchmark     Total      Overhead (%)      Synch     Imbal    Limpar   Mgmt
BT-host     1253.71      81.23 ( 6.48)      0.00     80.87      0.00   0.36
BT-full     1294.55     148.48 (11.47)      0.00    148.47      0.00   0.01
BT-para     1400.50     163.66 (11.65)      0.00    163.64      0.00   0.02
FT-host       72.27      25.62 (35.44)      0.01      1.06     24.43   0.12
FT-full       75.02      25.97 (34.53)      0.01      1.04     24.85   0.07
FT-para       88.67      32.22 (36.34)      0.00      6.45     25.73   0.04
CG-host       14.36       1.55 ( 8.95)      0.00      0.95      0.19   0.41
CG-full       17.64       4.87 (23.59)      0.00      3.46      1.37   0.04
CG-para       24.05       6.37 (26.49)      0.00      5.27      1.08   0.02
EP-host       92.27       1.08 ( 1.17)      0.00      0.93      0.00   0.15
EP-full       89.66       1.24 ( 1.37)      0.00      0.75      0.00   0.49
EP-para      133.76      29.60 (22.13)      0.00     29.32      0.00   0.27
SP-host     4994.76    1652.66 (33.03)      0.11   1651.95      0.00   0.60
SP-full    16466.47   14315.84 (86.89)      1.45  14314.36      0.00   0.03
SP-para     6816.17    5302.04 (77.68)      2.74   5299.29      0.00   0.01

(Times in seconds; "host" = native host, "full" = fully virtualized, "para" = para-virtualized.)
OpenMP Constructs in the NAS Parallel Benchmarks
Benchmark   Parallel   Loop   Single   Barrier   Critical   Master
BT                 2     54        0         0          0        2
FT                 2      6        5         1          1        1
CG                 2     22       12         0          0        2
EP                 1      1        0         0          1        1
SP                 2     69        0         3          0        2
ompP Profile for SP
ompP profiling report for the loop region in sp.c (lines 898-906), para-virtualized run:

 TID      execT      execC     bodyT   exitBarT
   0     310.60    1541444     11.24     289.41
   1     310.50    1541444     11.22     289.35
   2     310.44    1541444     11.33     289.12
   3     310.26    1541444     11.22     289.14
   4     310.85    1541444     11.26     289.68
   5     310.82    1541444     11.24     289.62
   6     311.10    1541444     11.17     289.99
   7     311.14    1541444     10.92     290.48
 SUM    2485.71   12331552     89.60    2316.76

For comparison, the exitBarT column of the same region on the native host: 39.35, 38.85, 35.47, 38.77, 38.03, 37.11, 38.91 and 38.92 s per thread, SUM 305.41 s.
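Reading the table above, the exit barrier accounts for almost all of the time spent in this loop region,

\[
\frac{\mathrm{exitBarT_{SUM}}}{\mathrm{execT_{SUM}}} = \frac{2316.76}{2485.71} \approx 93\%,
\]

while the loop bodies themselves contribute only 89.60 s in total.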
exitBarT in Parallel Loops
Opari transforms the implicit barrier at the end of a worksharing loop into an explicit barrier (see the sketch below).
Worst-case load imbalance scenario:
[Diagram: per-thread timeline with Loop_enter, Barrier_enter, Barrier_exit and Loop_exit events, illustrating threads waiting in the exit barrier.]
Thread i can induce at most $t_i$ seconds of exitBarT in each other thread, so for every thread j:
$\mathrm{exitBarT}_j \le \sum_{i \ne j} t_i$
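A minimal sketch of this transformation (simplified; the real Opari output inserts POMP_* runtime calls with region descriptors rather than comments): the implicit barrier of the worksharing loop is suppressed with nowait and re-created as an explicit, instrumented barrier, so the waiting time becomes attributable to exitBarT.

#include <stdio.h>

#define N 1000
static double a[N];

/* Sketch of Opari's rewriting of a worksharing loop: the implicit barrier
 * of "omp for" is removed with "nowait" and re-inserted as an explicit
 * barrier, whose duration can then be measured per thread. */
static void instrumented_loop(void)
{
#pragma omp parallel
    {
#pragma omp for nowait            /* worksharing loop, implicit barrier removed */
        for (int i = 0; i < N; i++)
            a[i] = 0.5 * i;

        /* POMP_Barrier_enter(): ompP would take a timestamp here */
#pragma omp barrier               /* explicit barrier inserted by Opari */
        /* POMP_Barrier_exit(): ...and here; the difference is exitBarT */
    }
}

int main(void)
{
    instrumented_loop();
    printf("a[10] = %f\n", a[10]);
    return 0;
}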
Each thread's exitBarT should therefore be at most about 80 seconds: the combined bodyT of the other seven threads is roughly 89.60 s minus its own ~11 s, i.e. about 78 s. Instead, the profile above shows a barrier that takes a really long time: around 290 s of exitBarT per thread, 2316.76 s summed over all threads.
Optimization
Move parallelization to outermost loop
Original code (parallelization of the innermost i loops):
for (j = 1; j <= grid_points[1]-2; j++) {
  for (k = 1; k <= grid_points[2]-2; k++) {
#pragma omp for
    for (i = 0; i <= grid_points[0]-1; i++) {
      ru1 = c3c4*rho_i[i][j][k];
      cv[i] = us[i][j][k];
      rhon[i] = max(dx2+con43*ru1, max(dx5+c1c5*ru1, max(dxmax+ru1, dx1)));
    }
#pragma omp for
    for (i = 1; i <= grid_points[0]-2; i++) {
      lhs[0][i][j][k] = 0.0;
      lhs[1][i][j][k] = - dttx2 * cv[i-1] - dttx1 * rhon[i-1];
      lhs[2][i][j][k] = 1.0 + c2dttx1 * rhon[i];
      lhs[3][i][j][k] = dttx2 * cv[i+1] - dttx1 * rhon[i+1];
      lhs[4][i][j][k] = 0.0;
    }
  }
}
Optimized code (parallelization moved to the outermost j loop):
#pragma omp for
for (j = 1; j <= grid_points[1]-2; j++) {
  for (k = 1; k <= grid_points[2]-2; k++) {
    for (i = 0; i <= grid_points[0]-1; i++) {
      ru1 = c3c4*rho_i[i][j][k];
      cv[i] = us[i][j][k];
      rhon[i] = max(dx2+con43*ru1, max(dx5+c1c5*ru1, max(dxmax+ru1, dx1)));
    }
    for (i = 1; i <= grid_points[0]-2; i++) {
      lhs[0][i][j][k] = 0.0;
      lhs[1][i][j][k] = - dttx2 * cv[i-1] - dttx1 * rhon[i-1];
      lhs[2][i][j][k] = 1.0 + c2dttx1 * rhon[i];
      lhs[3][i][j][k] = dttx2 * cv[i+1] - dttx1 * rhon[i+1];
      lhs[4][i][j][k] = 0.0;
    }
  }
}
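A rough consistency check, assuming the profiled region (sp.c, lines 898-906) is one of the inner worksharing loops above and a class A run (64^3 grid, 400 time steps plus one warm-up call): each inner #pragma omp for is then executed once per (j, k) pair and per call,

\[
62 \times 62 \times 401 = 1\,541\,444,
\]

which matches the execC column of the profile, and every one of these executions ends in a barrier. With the parallelization moved to the outermost j loop there is only one worksharing construct, and one barrier, per call.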
Optimization Results
EPCC Microbenchmarks
There is significant overhead in fine-grained constructs related to thread scheduling and reduction operations
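As an illustration of what these microbenchmarks measure, here is a minimal sketch in the spirit of the EPCC syncbench (not the actual EPCC code; REPS, delay() and the output format are made up for this sketch): the time of a loop containing the construct under test is compared against a reference loop without it, and the difference is divided by the number of repetitions.

#include <omp.h>
#include <stdio.h>

#define REPS 10000              /* repetition count, chosen arbitrarily here */

/* Small busy loop standing in for EPCC's delay() workload. */
static void delay(void)
{
    for (volatile int k = 0; k < 100; k++)
        ;
}

int main(void)
{
    /* Reference: the workload alone, without any OpenMP synchronization. */
    double t0 = omp_get_wtime();
    for (int r = 0; r < REPS; r++)
        delay();
    double ref = omp_get_wtime() - t0;

    /* Test: the same workload, but every iteration ends in a barrier
     * executed by all threads of a parallel region. */
    double t1 = omp_get_wtime();
#pragma omp parallel
    for (int r = 0; r < REPS; r++) {
        delay();
#pragma omp barrier
    }
    double bar = omp_get_wtime() - t1;

    /* Per-construct overhead, in the spirit of the EPCC syncbench. */
    printf("barrier overhead: %.3f microseconds\n",
           (bar - ref) / REPS * 1.0e6);
    return 0;
}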
Conclusion and Future Work
Virtualization introduces application-dependent overheads
– Following good-practice advice (outermost, coarse-grained parallelization) becomes even more important
– Hypercalls are very expensive
Future work
– Investigate this behavior with Xen tracing tools
– Other OpenMP runtimes
– Busy waiting vs. yielding
– Virtualization-aware runtime