
Performance Evaluation of OpenMP Applications on Virtualized Multicore Machines

Jie Tao (1), jie.tao@kit.edu
Karl Fuerlinger (2)
Holger Marten (1)

1: Steinbuch Center for Computing, Karlsruhe Institute of Technology (KIT), Germany
2: MNM-Team, Department of Computer Science, LMU München, Germany


Outline

Introduction
– Virtualization and the impact on performance

Experimental Setup
– NAS parallel benchmarks, SPEC OpenMP, microbenchmarks

Study of SP (NAS Parallel Benchmarks)
– Initial performance
– Analysis using ompP
– Optimization results and microbenchmark study

Conclusions


Virtualization

Running multiple OSs on the same hardware

Concepts
– Hypervisor (Xen, KVM, VMware)
– Full virtualization vs. para-virtualization

Adopted for
– Server consolidation
– Cloud Computing: on-demand resource provision

Performance impact

[Figure: a conventional stack (hardware, operating system, application) next to a virtualized host machine, where a hypervisor runs four guest OSs as VM 1 to VM 4]


Performance Impact of Virtualization

Has been studied before, e.g., Keith Jackson et al., “Performance of HPC Applications on the Amazon Web Services Cloud”

Here: The performance impact of virtualization on OpenMP applications


Experimental Setup

Benchmarks
– NAS OpenMP (size A)
– SPEC OpenMP (reference dataset)
– EPCC OpenMP Microbenchmarks

Host machine
– AMD Opteron 2376 (“Shanghai”), 2.3 GHz, 2 sockets, quad-core
– Scientific Linux
– Virtualized with Xen

Virtual machines
– Hypervisor: Xen
– OS: Debian (Linux kernel 2.6.26)
– Compiler: gcc 4.3.2
– #cores: 1-8
– Memory: 4 GB


NAS Parallel Benchmarks


NAS Parallel Benchmarks (2)


SPEC OpenMP Benchmarks


SPEC OpenMP Benchmarks (2)


Execution time of NAS SP

What is going on here?


OpenMP Performance Analysis with ompP

ompP: OpenMP profiling tool
– Based on source code instrumentation
– Independent of the compiler and runtime used
– Supports HW counters through PAPI
– Uses source code instrumenter Opari from the KOJAK/Scalasca toolset
– Available for download (GPL): http://www.ompp-tool.com

[Figure: ompP workflow. Source code → automatic instrumentation of OpenMP constructs, manual region instrumentation → executable (linked with the ompP library) → execution on parallel machine, controlled by settings (env. vars: HW counters, output format, …) → profiling report]


Source to Source Instrumentation with Opari

Preprocessor Instrumentation
– Example: Instrumenting OpenMP constructs with Opari
– Preprocessor operation: original source code → pre-processor → modified (instrumented) source code
– Example: Instrumentation of a parallel region

Original source code:

    #pragma omp parallel
    {
      /* user code in parallel region */
    }

Modified (instrumented) source code (the POMP_* calls are the instrumentation added by Opari):

    POMP_Parallel_fork [master]
    #pragma omp parallel
    {
      POMP_Parallel_begin [team]
      /* user code in parallel region */
      POMP_Barrier_enter [team]
      #pragma omp barrier
      POMP_Barrier_exit [team]
      POMP_Parallel_end [team]
    }
    POMP_Parallel_join [master]
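The instrumented region can be mimicked with counters in place of the real hooks. This is only a sketch: `struct event_counts`, `count_event` and `run_instrumented_region` are hypothetical names, and the real POMP hooks take region descriptors and are supplied by the ompP library. It shows which events fire once (master) and which fire once per team member.

```c
#include <assert.h>

/* Hypothetical stand-ins for the POMP hooks: one counter per event type. */
struct event_counts {
    int fork, begin, barrier_enter, barrier_exit, end, join;
};

static void count_event(int *counter)
{
    /* Atomic update so team-wide events are counted safely under OpenMP;
     * the pragma is ignored when compiled without OpenMP support. */
    #pragma omp atomic
    (*counter)++;
}

/* Runs one "instrumented" parallel region and reports how often each
 * hook position was reached. */
struct event_counts run_instrumented_region(void)
{
    struct event_counts ev = {0, 0, 0, 0, 0, 0};

    count_event(&ev.fork);                /* POMP_Parallel_fork  [master] */
    #pragma omp parallel
    {
        count_event(&ev.begin);           /* POMP_Parallel_begin [team] */
        /* user code in parallel region */
        count_event(&ev.barrier_enter);   /* POMP_Barrier_enter  [team] */
        #pragma omp barrier
        count_event(&ev.barrier_exit);    /* POMP_Barrier_exit   [team] */
        count_event(&ev.end);             /* POMP_Parallel_end   [team] */
    }
    count_event(&ev.join);                /* POMP_Parallel_join  [master] */
    return ev;
}
```

With OpenMP enabled, the [team] counters equal the team size, while fork and join stay at 1.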


ompP’s Profiling Data

Example code section and performance profile:

Code:

    #pragma omp parallel
    {
      #pragma omp critical
      {
        sleep(1.0);
      }
    }

Profile:

    R00002 main.c (34-37) (default) CRITICAL
     TID   execT   execC   bodyT   enterT   exitT
       0    3.00       1    1.00     2.00    0.00
       1    1.00       1    1.00     0.00    0.00
       2    2.00       1    1.00     1.00    0.00
       3    4.00       1    1.00     3.00    0.00
     SUM   10.01       4    4.00     6.00    0.00

Components:
– Source code location and type of region

– Timing data and execution counts, depending on the particular construct

– One line per thread, last line sums over all threads

– Hardware counter data (if PAPI is available and HW counters are selected)

– Data is “exact” (measured, not based on sampling)
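The numbers in the example profile follow from simple arithmetic. The sketch below is a model, not part of ompP, and `model_critical` and `sum_times` are hypothetical names. It assumes all threads arrive at the critical section at the same moment and enter one after another, so the thread in position pos waits pos × body seconds.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model: per-thread timings of a critical section whose body
 * takes body_s seconds, with threads entering in the order entry_order. */
void model_critical(const int *entry_order, size_t nthreads, double body_s,
                    double *enterT, double *execT)
{
    assert(entry_order != NULL && enterT != NULL && execT != NULL);
    for (size_t pos = 0; pos < nthreads; pos++) {
        int tid = entry_order[pos];
        assert(tid >= 0 && (size_t)tid < nthreads);
        enterT[tid] = (double)pos * body_s;  /* wait for earlier threads */
        execT[tid]  = enterT[tid] + body_s;  /* waiting time plus own body */
    }
}

/* Sum over all threads, like the SUM line of the report. */
double sum_times(const double *t, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += t[i];
    return s;
}
```

With entry order 1, 2, 0, 3 and a 1-second body, the model gives enterT of 2, 0, 1, 3 seconds for threads 0 to 3 and a summed execT of 10 seconds; the measured SUM of 10.01 includes small real-world overheads.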


ompP Overhead Analysis (1)

Certain timing categories reported by ompP can be classified as overheads:

– Example: enterT in a critical section: Threads wait to enter the critical section (synchronization overhead).

Four overhead categories are defined in ompP:

– Imbalance: waiting time incurred due to an imbalanced amount of work in a worksharing or parallel region

– Synchronization: overhead that arises because threads have to synchronize their activity, e.g., a barrier call

– Limited Parallelism: threads idle because the program does not expose enough parallelism

– Thread management: overhead for the creation and destruction of threads, and for signaling that critical sections and locks are available


ompP Overhead Analysis (2)

S: Synchronization overhead

M: Thread management overhead

I: Imbalance overhead

L: Limited Parallelism overhead


Overhead Analysis for the NAS Benchmarks

              Total    Overhead (%)        Synch      Imbal   Limpar   Mgmt
    BT-host   1253.71     81.23 (06.48)     0.00      80.87     0.00   0.36
    BT-full   1294.55    148.48 (11.47)     0.00     148.47     0.00   0.01
    BT-para   1400.50    163.66 (11.65)     0.00     163.64     0.00   0.02

    FT-host     72.27     25.62 (35.44)     0.01       1.06    24.43   0.12
    FT-full     75.02     25.97 (34.53)     0.01       1.04    24.85   0.07
    FT-para     88.67     32.22 (36.34)     0.00       6.45    25.73   0.04

    CG-host     14.36      1.55 (08.95)     0.00       0.95     0.19   0.41
    CG-full     17.64      4.87 (23.59)     0.00       3.46     1.37   0.04
    CG-para     24.05      6.37 (26.49)     0.00       5.27     1.08   0.02

    EP-host     92.27      1.08 (01.17)     0.00       0.93     0.00   0.15
    EP-full     89.66      1.24 (01.37)     0.00       0.75     0.00   0.49
    EP-para    133.76     29.60 (22.13)     0.00      29.32     0.00   0.27

    SP-host   4994.76   1652.66 (33.03)     0.11    1651.95     0.00   0.60
    SP-full  16466.47  14315.84 (86.89)     1.45   14314.36     0.00   0.03
    SP-para   6816.17   5302.04 (77.68)     2.74    5299.29     0.00   0.01
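How the Overhead column relates to the four categories can be checked with a few lines of arithmetic. This is a sketch with hypothetical names (`struct overheads`, `total_overhead`, `overhead_percent`). It assumes the overhead is the sum of the four categories and the percentage is overhead divided by the Total column; BT-host reproduces this way, but not every row of the extracted table does, so treat the percentage formula as an assumption.

```c
#include <assert.h>

/* Hypothetical field names mirroring the table columns. */
struct overheads {
    double synch, imbal, limpar, mgmt;
};

/* Overhead column: sum of the four ompP overhead categories. */
double total_overhead(struct overheads o)
{
    return o.synch + o.imbal + o.limpar + o.mgmt;
}

/* Assumed definition of the value in parentheses: overhead as a
 * percentage of the Total column. */
double overhead_percent(double overhead, double total)
{
    assert(total > 0.0);
    return 100.0 * overhead / total;
}
```

For BT-host, 0.00 + 80.87 + 0.00 + 0.36 gives 81.23, and 81.23 out of 1253.71 is about 6.48 %.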


OpenMP Constructs in the NAS Parallel Benchmarks

          Parallel   Loop   Single   Barrier   Critical   Master
    BT           2     54        0         0          0        2
    FT           2      6        5         1          1        1
    CG           2     22       12         0          0        2
    EP           1      1        0         0          1        1
    SP           2     69        0         3          0        2


ompP Profile for SP

ompP Profiling Report for sp.c (lines 898-906) (para-virtualized)

     TID     execT      execC   bodyT   exitBarT   exitBarT (native host)
       0    310.60    1541444   11.24     289.41    38.92
       1    310.50    1541444   11.22     289.35    38.91
       2    310.44    1541444   11.3      289.12    37.11
       3    310.26    1541444   11.22     289.14    38.03
       4    310.85    1541444   11.26     289.68    38.77
       5    310.82    1541444   11.24     289.62    35.47
       6    311.10    1541444   11.17     289.99    38.85
       7    311.14    1541444   10.92     290.48    39.35
     SUM   2485.71   12331552   89.60    2316.76   305.41


exitBarT in Parallel Loops

Opari transforms the implicit barrier into an explicit barrier

Worst-case load imbalance scenario:

[Figure: thread timeline with the events Loop_enter, Loop_exit, Barrier_enter, Barrier_exit; exitBarT = time between Barrier_enter and Barrier_exit]

Thread i, spending t_i seconds in the loop, can induce at most t_i seconds of exitBarT time in each other thread


Annotations on the SP profile (para-virtualized, same table as on the previous slide):
– exitBarT should be max. ~80 seconds
– Measured exitBarT is ~290 seconds per thread (SUM 2316.76): a barrier that takes a really long time
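The ~80 second limit can be reproduced from the bodyT column. The sketch below assumes one reading of the worst-case argument: each other thread i can keep a given thread waiting for at most its own bodyT, so the bound is the sum of bodyT over all other threads. `exitbar_bound` is a hypothetical helper name.

```c
#include <assert.h>
#include <stddef.h>

/* Upper bound on the exitBarT of thread `tid`, assuming every other
 * thread i can induce at most bodyT[i] seconds of barrier waiting. */
double exitbar_bound(const double *bodyT, size_t nthreads, size_t tid)
{
    assert(bodyT != NULL && tid < nthreads);
    double bound = 0.0;
    for (size_t i = 0; i < nthreads; i++)
        if (i != tid)
            bound += bodyT[i];
    return bound;
}
```

With the eight bodyT values from the profile (10.92, 11.26, 11.24, 11.17, 11.24, 11.22, 11.3, 11.22), the bound for the thread with bodyT 10.92 is about 78.65 seconds, far below its measured exitBarT of 290.48 seconds.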


Optimization

Move parallelization to outermost loop

Original (worksharing on the inner i loops):

    for (j = 1; j <= grid_points[1]-2; j++) {
      for (k = 1; k <= grid_points[2]-2; k++) {
    #pragma omp for
        for (i = 0; i <= grid_points[0]-1; i++) {
          ru1 = c3c4*rho_i[i][j][k];
          cv[i] = us[i][j][k];
          rhon[i] = max(dx2+con43*ru1,
                        max(dx5+c1c5*ru1,
                            max(dxmax+ru1,
                                dx1)));
        }
    #pragma omp for
        for (i = 1; i <= grid_points[0]-2; i++) {
          lhs[0][i][j][k] = 0.0;
          lhs[1][i][j][k] = - dttx2 * cv[i-1] - dttx1 * rhon[i-1];
          lhs[2][i][j][k] = 1.0 + c2dttx1 * rhon[i];
          lhs[3][i][j][k] = dttx2 * cv[i+1] - dttx1 * rhon[i+1];
          lhs[4][i][j][k] = 0.0;
        }
      }
    }

Optimized (worksharing on the outermost j loop):

    #pragma omp for
    for (j = 1; j <= grid_points[1]-2; j++) {
      for (k = 1; k <= grid_points[2]-2; k++) {
        for (i = 0; i <= grid_points[0]-1; i++) {
          ru1 = c3c4*rho_i[i][j][k];
          cv[i] = us[i][j][k];
          rhon[i] = max(dx2+con43*ru1,
                        max(dx5+c1c5*ru1,
                            max(dxmax+ru1,
                                dx1)));
        }
        for (i = 1; i <= grid_points[0]-2; i++) {
          lhs[0][i][j][k] = 0.0;
          lhs[1][i][j][k] = - dttx2 * cv[i-1] - dttx1 * rhon[i-1];
          lhs[2][i][j][k] = 1.0 + c2dttx1 * rhon[i];
          lhs[3][i][j][k] = dttx2 * cv[i+1] - dttx1 * rhon[i+1];
          lhs[4][i][j][k] = 0.0;
        }
      }
    }
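Why the move matters can be estimated by counting how often a worksharing loop, and with it an implicit barrier, is executed. The sketch below is a counting model with hypothetical function names. It assumes a 64-point grid per dimension (so the j and k loops run 62 iterations each) and 401 calls of the enclosing routine; these values are assumptions, chosen because they reproduce the execC of 1541444 per thread reported by ompP, assuming the profiled region is one such inner loop.

```c
#include <assert.h>

/* Executions of inner-loop worksharing constructs: each of the
 * loops_per_iter "omp for" loops runs once per (j, k) pair per call,
 * and each execution ends in an implicit barrier. */
long inner_loop_barriers(long calls, long nj, long nk, long loops_per_iter)
{
    assert(calls >= 0 && nj >= 0 && nk >= 0 && loops_per_iter >= 0);
    return calls * nj * nk * loops_per_iter;
}

/* With the "omp for" on the outermost loop there is one worksharing
 * construct, and so one implicit barrier, per call. */
long outer_loop_barriers(long calls)
{
    assert(calls >= 0);
    return calls;
}
```

Under these assumptions a single inner loop is executed 401 × 62 × 62 = 1541444 times, while the outermost-loop version reaches its barrier only 401 times.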


Optimization Results


EPCC Microbenchmarks

There is significant overhead in fine-grained constructs related to thread scheduling and reduction operations


Conclusion and Future Work

Virtualization introduces application-dependent overheads
– Following good-practice advice (outermost, coarse-grained parallelization) is even more important
– Hypercalls are very expensive

Future work
– Investigate this behavior with Xen tracing tools

– Other OpenMP runtimes

– Busy wait vs. yielding

– Virtualization aware runtime