Performance Evaluation of OpenMP Applications on Virtualized Multicore Machines
Jie Tao (1), [email protected]
Karl Fuerlinger (2), [email protected]
Holger Marten (1), [email protected]
(1) Steinbuch Center for Computing, Karlsruhe Institute of Technology (KIT), Germany
(2) MNM-Team, Department of Computer Science, LMU München, Germany
Outline
Introduction
– Virtualization and the impact on performance
Experimental Setup
– NAS Parallel Benchmarks, SPEC OpenMP, microbenchmarks
Study of SP (NAS Parallel Benchmarks)
– Initial performance
– Analysis using ompP
– Optimization results and microbenchmark study
Conclusions
Virtualization
Running multiple OSs on the same hardware
Concepts
– Hypervisor (Xen, KVM, VMware)
– Full virtualization vs. para-virtualization
Adopted for
– Server consolidation
– Cloud computing: on-demand resource provisioning
Performance impact
[Diagram: a native stack (application running on an operating system on the hardware of the host machine) next to a virtualized stack, where a hypervisor on the hardware hosts guest OSs in VM 1 to VM 4.]
Performance Impact of Virtualization
Has been studied before; see, e.g., Keith Jackson et al., "Performance of HPC Applications on the Amazon Web Services Cloud"
Here: The performance impact of virtualization on OpenMP applications
Experimental Setup
Benchmarks
– NAS OpenMP (class A)
– SPEC OpenMP (reference dataset)
– EPCC OpenMP Microbenchmarks
Host machine
– AMD Opteron 2376 ("Shanghai"), 2.3 GHz, 2-socket quad-core
– Scientific Linux
– Virtualized with Xen
Virtual machines
– Hypervisor: Xen
– OS: Debian, Linux kernel 2.6.26
– Compiler: gcc 4.3.2
– #cores: 1-8
– Memory: 4 GB
NAS Parallel Benchmarks
NAS Parallel Benchmarks (2)
SPEC OpenMP Benchmarks
SPEC OpenMP Benchmarks (2)
Execution time of NAS SP
What is going on here?
OpenMP Performance Analysis with ompP
ompP: OpenMP profiling tool
– Based on source code instrumentation
– Independent of the compiler and runtime used
– Supports HW counters through PAPI
– Uses the source code instrumenter Opari from the KOJAK/Scalasca toolset
– Available for download (GPL): http://www.ompp-tool.com
[Diagram: ompP workflow. The source code is automatically instrumented at OpenMP constructs (manual region instrumentation is also possible, sketched below), linked with the ompP library into an executable, and executed on the parallel machine; settings such as HW counters and the output format are passed via environment variables, yielding a profiling report.]
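Besides the automatic instrumentation of OpenMP constructs, arbitrary source regions can be profiled as well. A minimal sketch, assuming Opari's POMP user-instrumentation pragmas (#pragma pomp inst begin/end) and a made-up region name init_phase; the file compiles as-is, since an unprocessed unknown pragma is simply ignored, and the region only shows up in the report after the source has been run through the ompP/Opari instrumenter:

#include <stdio.h>
#include <omp.h>

/* Hypothetical example of manual region instrumentation: the "pomp inst"
 * pragmas are Opari's user-region directives, and "init_phase" is an
 * invented region name for this sketch. */
int main(void)
{
    static double a[1000];

#pragma pomp inst begin(init_phase)    /* user-defined region starts here */
    for (int i = 0; i < 1000; i++)
        a[i] = 0.5 * i;
#pragma pomp inst end(init_phase)      /* ...and ends here */

#pragma omp parallel                   /* instrumented automatically by Opari */
    printf("thread %d of %d, a[10]=%g\n",
           omp_get_thread_num(), omp_get_num_threads(), a[10]);

    return 0;
}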
Source to Source Instrumentation with Opari
Preprocessor instrumentation
– Example: instrumenting OpenMP constructs with Opari
– Preprocessor operation
– Example: instrumentation of a parallel region
Original source code:

#pragma omp parallel
{
  /* user code in parallel region */
}

Modified (instrumented) source code after the Opari preprocessor (the POMP calls are the instrumentation added by Opari):

POMP_Parallel_fork [master]
#pragma omp parallel
{
  POMP_Parallel_begin [team]
  /* user code in parallel region */
  POMP_Barrier_enter [team]
  #pragma omp barrier
  POMP_Barrier_exit [team]
  POMP_Parallel_end [team]
}
POMP_Parallel_join [master]
ompP’s Profiling Data
Example code section and performance profile:
Code:
#pragma omp parallel
{
  #pragma omp critical
  {
    sleep(1.0);
  }
}
Profile:
R00002 main.c (34-37) (default) CRITICAL
 TID      execT      execC      bodyT     enterT      exitT
   0       3.00          1       1.00       2.00       0.00
   1       1.00          1       1.00       0.00       0.00
   2       2.00          1       1.00       1.00       0.00
   3       4.00          1       1.00       3.00       0.00
 SUM      10.01          4       4.00       6.00       0.00
Components:
– Source code location and type of region
– Timing data and execution counts, depending on the particular construct
– One line per thread, last line sums over all threads
– Hardware counter data (if PAPI is available and HW counters are selected)
– Data is “exact” (measured, not based on sampling)
ompP Overhead Analysis (1)
Certain timing categories reported by ompP can be classified as overheads:
– Example: enterT in a critical section: threads wait to enter the critical section (synchronization overhead); see the worked example after this list.
Four overhead categories are defined in ompP:
– Imbalance: waiting time incurred due to an imbalanced amount of work in a worksharing or parallel region
– Synchronization: overhead that arises due to threads having to synchronize their activity, e.g. a barrier call
– Limited Parallelism: idle threads due to not enough parallelism being exposed by the program
– Thread management: overhead for the creation and destruction of threads, and for signaling critical sections and locks as available
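As a worked example of the synchronization category, using the critical-section profile shown earlier: the four threads execute the 1-second critical body one after another, so they wait roughly 0, 1, 2 and 3 seconds before entering, and

\[
\mathrm{enterT_{SUM}} = 2.00 + 0.00 + 1.00 + 3.00 = 6.00\ \mathrm{s},
\qquad
\frac{\mathrm{enterT_{SUM}}}{\mathrm{execT_{SUM}}} = \frac{6.00}{10.01} \approx 60\%
\]

of the time spent in that region is classified as synchronization overhead.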
ompP Overhead Analysis (2)
S: Synchronization overhead
M: Thread management overhead
I: Imbalance overhead
L: Limited Parallelism overhead
Overhead Analysis for the NAS Benchmarks
Benchmark     Total      Overhead (%)      Synch     Imbal    Limpar   Mgmt
BT-host     1253.71      81.23 ( 6.48)      0.00     80.87      0.00   0.36
BT-full     1294.55     148.48 (11.47)      0.00    148.47      0.00   0.01
BT-para     1400.50     163.66 (11.65)      0.00    163.64      0.00   0.02
FT-host       72.27      25.62 (35.44)      0.01      1.06     24.43   0.12
FT-full       75.02      25.97 (34.53)      0.01      1.04     24.85   0.07
FT-para       88.67      32.22 (36.34)      0.00      6.45     25.73   0.04
CG-host       14.36       1.55 ( 8.95)      0.00      0.95      0.19   0.41
CG-full       17.64       4.87 (23.59)      0.00      3.46      1.37   0.04
CG-para       24.05       6.37 (26.49)      0.00      5.27      1.08   0.02
EP-host       92.27       1.08 ( 1.17)      0.00      0.93      0.00   0.15
EP-full       89.66       1.24 ( 1.37)      0.00      0.75      0.00   0.49
EP-para      133.76      29.60 (22.13)      0.00     29.32      0.00   0.27
SP-host     4994.76    1652.66 (33.03)      0.11   1651.95      0.00   0.60
SP-full    16466.47   14315.84 (86.89)      1.45  14314.36      0.00   0.03
SP-para     6816.17    5302.04 (77.68)      2.74   5299.29      0.00   0.01

(Times in seconds; "host" = native host, "full" = fully virtualized, "para" = para-virtualized.)
OpenMP Constructs in the NAS Parallel Benchmarks
Benchmark   Parallel   Loop   Single   Barrier   Critical   Master
BT                 2     54        0         0          0        2
FT                 2      6        5         1          1        1
CG                 2     22       12         0          0        2
EP                 1      1        0         0          1        1
SP                 2     69        0         3          0        2
ompP Profile for SP
ompP profiling report for the loop region in sp.c (lines 898-906), para-virtualized run:

 TID      execT      execC     bodyT   exitBarT
   0     310.60    1541444     11.24     289.41
   1     310.50    1541444     11.22     289.35
   2     310.44    1541444     11.33     289.12
   3     310.26    1541444     11.22     289.14
   4     310.85    1541444     11.26     289.68
   5     310.82    1541444     11.24     289.62
   6     311.10    1541444     11.17     289.99
   7     311.14    1541444     10.92     290.48
 SUM    2485.71   12331552     89.60    2316.76

For comparison, the exitBarT column of the same region on the native host: 39.35, 38.85, 35.47, 38.77, 38.03, 37.11, 38.91 and 38.92 s per thread, SUM 305.41 s.
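Reading the table above, the exit barrier accounts for almost all of the time spent in this loop region,

\[
\frac{\mathrm{exitBarT_{SUM}}}{\mathrm{execT_{SUM}}} = \frac{2316.76}{2485.71} \approx 93\%,
\]

while the loop bodies themselves contribute only 89.60 s in total.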
exitBarT in Parallel Loops
Opari transforms the implicit barrier at the end of a worksharing loop into an explicit barrier (see the sketch below).
Worst-case load imbalance scenario:
[Diagram: per-thread timeline with Loop_enter, Barrier_enter, Barrier_exit and Loop_exit events, illustrating threads waiting in the exit barrier.]
Thread i can induce at most $t_i$ seconds of exitBarT in each other thread, so for every thread j:
$\mathrm{exitBarT}_j \le \sum_{i \ne j} t_i$
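A minimal sketch of this transformation (simplified; the real Opari output inserts POMP_* runtime calls with region descriptors rather than comments): the implicit barrier of the worksharing loop is suppressed with nowait and re-created as an explicit, instrumented barrier, so the waiting time becomes attributable to exitBarT.

#include <stdio.h>

#define N 1000
static double a[N];

/* Sketch of Opari's rewriting of a worksharing loop: the implicit barrier
 * of "omp for" is removed with "nowait" and re-inserted as an explicit
 * barrier, whose duration can then be measured per thread. */
static void instrumented_loop(void)
{
#pragma omp parallel
    {
#pragma omp for nowait            /* worksharing loop, implicit barrier removed */
        for (int i = 0; i < N; i++)
            a[i] = 0.5 * i;

        /* POMP_Barrier_enter(): ompP would take a timestamp here */
#pragma omp barrier               /* explicit barrier inserted by Opari */
        /* POMP_Barrier_exit(): ...and here; the difference is exitBarT */
    }
}

int main(void)
{
    instrumented_loop();
    printf("a[10] = %f\n", a[10]);
    return 0;
}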
Each thread's exitBarT should therefore be at most about 80 seconds: the combined bodyT of the other seven threads is roughly 89.60 s minus its own ~11 s, i.e. about 78 s. Instead, the profile above shows a barrier that takes a really long time: around 290 s of exitBarT per thread, 2316.76 s summed over all threads.
Optimization
Move parallelization to outermost loop
Original code (parallelization of the innermost i loops):
for (j = 1; j <= grid_points[1]-2; j++) {
  for (k = 1; k <= grid_points[2]-2; k++) {
#pragma omp for
    for (i = 0; i <= grid_points[0]-1; i++) {
      ru1 = c3c4*rho_i[i][j][k];
      cv[i] = us[i][j][k];
      rhon[i] = max(dx2+con43*ru1, max(dx5+c1c5*ru1, max(dxmax+ru1, dx1)));
    }
#pragma omp for
    for (i = 1; i <= grid_points[0]-2; i++) {
      lhs[0][i][j][k] = 0.0;
      lhs[1][i][j][k] = - dttx2 * cv[i-1] - dttx1 * rhon[i-1];
      lhs[2][i][j][k] = 1.0 + c2dttx1 * rhon[i];
      lhs[3][i][j][k] = dttx2 * cv[i+1] - dttx1 * rhon[i+1];
      lhs[4][i][j][k] = 0.0;
    }
  }
}
Optimized code (parallelization moved to the outermost j loop):
#pragma omp for
for (j = 1; j <= grid_points[1]-2; j++) {
  for (k = 1; k <= grid_points[2]-2; k++) {
    for (i = 0; i <= grid_points[0]-1; i++) {
      ru1 = c3c4*rho_i[i][j][k];
      cv[i] = us[i][j][k];
      rhon[i] = max(dx2+con43*ru1, max(dx5+c1c5*ru1, max(dxmax+ru1, dx1)));
    }
    for (i = 1; i <= grid_points[0]-2; i++) {
      lhs[0][i][j][k] = 0.0;
      lhs[1][i][j][k] = - dttx2 * cv[i-1] - dttx1 * rhon[i-1];
      lhs[2][i][j][k] = 1.0 + c2dttx1 * rhon[i];
      lhs[3][i][j][k] = dttx2 * cv[i+1] - dttx1 * rhon[i+1];
      lhs[4][i][j][k] = 0.0;
    }
  }
}
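A rough consistency check, assuming the profiled region (sp.c, lines 898-906) is one of the inner worksharing loops above and a class A run (64^3 grid, 400 time steps plus one warm-up call): each inner #pragma omp for is then executed once per (j, k) pair and per call,

\[
62 \times 62 \times 401 = 1\,541\,444,
\]

which matches the execC column of the profile, and every one of these executions ends in a barrier. With the parallelization moved to the outermost j loop there is only one worksharing construct, and one barrier, per call.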
Optimization Results
EPCC Microbenchmarks
There is significant overhead in fine-grained constructs related to thread scheduling and reduction operations
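As an illustration of what these microbenchmarks measure, here is a minimal sketch in the spirit of the EPCC syncbench (not the actual EPCC code; REPS, delay() and the output format are made up for this sketch): the time of a loop containing the construct under test is compared against a reference loop without it, and the difference is divided by the number of repetitions.

#include <omp.h>
#include <stdio.h>

#define REPS 10000              /* repetition count, chosen arbitrarily here */

/* Small busy loop standing in for EPCC's delay() workload. */
static void delay(void)
{
    for (volatile int k = 0; k < 100; k++)
        ;
}

int main(void)
{
    /* Reference: the workload alone, without any OpenMP synchronization. */
    double t0 = omp_get_wtime();
    for (int r = 0; r < REPS; r++)
        delay();
    double ref = omp_get_wtime() - t0;

    /* Test: the same workload, but every iteration ends in a barrier
     * executed by all threads of a parallel region. */
    double t1 = omp_get_wtime();
#pragma omp parallel
    for (int r = 0; r < REPS; r++) {
        delay();
#pragma omp barrier
    }
    double bar = omp_get_wtime() - t1;

    /* Per-construct overhead, in the spirit of the EPCC syncbench. */
    printf("barrier overhead: %.3f microseconds\n",
           (bar - ref) / REPS * 1.0e6);
    return 0;
}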
Conclusion and Future Work
Virtualization introduces application-dependent overheads
– Following good-practice advice (outermost, coarse-grained parallelization) becomes even more important
– Hypercalls are very expensive
Future work
– Investigate this behavior with Xen tracing tools
– Other OpenMP runtimes
– Busy waiting vs. yielding
– Virtualization-aware runtime