a new approach for performance analysis of …• a hybrid performance tool for openmp programs –...

57
1 A New Approach for Performance Analysis of OpenMP Programs Xu Liu, John Mellor-Crummey, Mike Fagan Rice University http://hpctoolkit.org

Upload: others

Post on 07-Jul-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: A New Approach for Performance Analysis of …• A hybrid performance tool for OpenMP programs – quantify performance losses – pinpoint problematic code for optimization – low

1

A New Approach for Performance Analysis of OpenMP Programs

Xu Liu, John Mellor-Crummey, Mike FaganRice University

http://hpctoolkit.org

Page 2: A New Approach for Performance Analysis of …• A hybrid performance tool for OpenMP programs – quantify performance losses – pinpoint problematic code for optimization – low

Motivation

• Multicore is used everywhere– cell phones– laptops– supercomputers

• Threads per node grow rapidly– IBM Blue Gene/Q: 64 threads– IBM Power 7: 128 threads– Intel MIC: ~200 threads

• Taking advantage of massive threads is important

2

Page 3: A New Approach for Performance Analysis of …• A hybrid performance tool for OpenMP programs – quantify performance losses – pinpoint problematic code for optimization – low

Programming model for multi-threading

• Low level– pthread

• High level– language based: Cilk– library based: Intel thread building block (TBB)– compiler based: OpenMP

3

Page 4: A New Approach for Performance Analysis of …• A hybrid performance tool for OpenMP programs – quantify performance losses – pinpoint problematic code for optimization – low

Programming model for multi-threading

• Low level– pthread

• High level– language based: Cilk– library based: Intel thread building block (TBB)– compiler based: OpenMP

3

• standardized• portable

Page 5: A New Approach for Performance Analysis of …• A hybrid performance tool for OpenMP programs – quantify performance losses – pinpoint problematic code for optimization – low

Programming model for multi-threading

• Low level– pthread

• High level– language based: Cilk– library based: Intel thread building block (TBB)– compiler based: OpenMP

3

• standardized• portable

OpenMP is the most popular programming model for multi-threading

Page 6: A New Approach for Performance Analysis of …• A hybrid performance tool for OpenMP programs – quantify performance losses – pinpoint problematic code for optimization – low

OpenMP’s fork-join parallelism

4

Fork

Join

Fork

Join

Fork

Join

parallel region

parallel region

parallel region

Page 7: A New Approach for Performance Analysis of …• A hybrid performance tool for OpenMP programs – quantify performance losses – pinpoint problematic code for optimization – low

Challenges in programming with OpenMP

• Performance problems in OpenMP programs– insufficient parallelism– serialization– load imbalance

5

Fork

Join

Fork

Join

Fork

Join

sync

Page 8: A New Approach for Performance Analysis of …• A hybrid performance tool for OpenMP programs – quantify performance losses – pinpoint problematic code for optimization – low

Challenges in programming with OpenMP

• Performance problems in OpenMP programs– insufficient parallelism– serialization– load imbalance

5

Fork

Join

Fork

Join

Fork

Join

sync

Page 9: A New Approach for Performance Analysis of …• A hybrid performance tool for OpenMP programs – quantify performance losses – pinpoint problematic code for optimization – low

Challenges in programming with OpenMP

• Performance problems in OpenMP programs– insufficient parallelism– serialization– load imbalance

5

Fork

Join

Fork

Join

Fork

Join

sync

Page 10: A New Approach for Performance Analysis of …• A hybrid performance tool for OpenMP programs – quantify performance losses – pinpoint problematic code for optimization – low

Challenges in programming with OpenMP

• Performance problems in OpenMP programs– insufficient parallelism– serialization– load imbalance

5

Fork

Join

Fork

Join

Fork

Join

sync

Page 11: A New Approach for Performance Analysis of …• A hybrid performance tool for OpenMP programs – quantify performance losses – pinpoint problematic code for optimization – low

Challenges in programming with OpenMP

• Performance problems in OpenMP programs– insufficient parallelism– serialization– load imbalance

5

Performance tool is needed

Fork

Join

Fork

Join

Fork

Join

sync

Page 12: A New Approach for Performance Analysis of …• A hybrid performance tool for OpenMP programs – quantify performance losses – pinpoint problematic code for optimization – low

Previous work

• Instrumentation based tools: TAU, Scalasca– high measurement overhead

• Sampling based tools: Intel Vtune– no specific OpenMP support

• Hybrid tools: Sun Studio/Oracle Analyzer API for OpenMP– support low-overhead, sampling-based measurement– use large disk space for measurement data– insufficient support for statically-linked applications– insufficient mechanisms to blame root causes of performance losses

• attributes waiting to “symptoms” rather than “causes”

6

Page 13: A New Approach for Performance Analysis of …• A hybrid performance tool for OpenMP programs – quantify performance losses – pinpoint problematic code for optimization – low

Previous work

• Instrumentation based tools: TAU, Scalasca– high measurement overhead

• Sampling based tools: Intel Vtune– no specific OpenMP support

• Hybrid tools: Sun Studio/Oracle Analyzer API for OpenMP– support low-overhead, sampling-based measurement– use large disk space for measurement data– insufficient support for statically-linked applications– insufficient mechanisms to blame root causes of performance losses

• attributes waiting to “symptoms” rather than “causes”

6

Goal: build a tool to overcome all the shortcomings in existing tools

Page 14: A New Approach for Performance Analysis of …• A hybrid performance tool for OpenMP programs – quantify performance losses – pinpoint problematic code for optimization – low

Our approach

• A hybrid performance tool for OpenMP programs– quantify performance losses– pinpoint problematic code for optimization– low measurement overhead– support both dynamically and statically linked binaries

• With the feedback from our tool, we were able to get 1.1x-5.7x speedup for four well-known benchmarks

7

Page 15: A New Approach for Performance Analysis of …• A hybrid performance tool for OpenMP programs – quantify performance losses – pinpoint problematic code for optimization – low

Overview of our tool

• OpenMP tools API– minimum set of support to tools– low overhead

• HPCToolkit– a leading sampling-based performance tool for parallel programs– call path profiler + static binary analyzer + GUI– on top of OpenMP tools API

8

Page 16: A New Approach for Performance Analysis of …• A hybrid performance tool for OpenMP programs – quantify performance losses – pinpoint problematic code for optimization – low

OpenMP tools API• Maintain states• Provide callbacks

9

Fork

Join

Fork

Join

Fork

Join

Page 17: A New Approach for Performance Analysis of …• A hybrid performance tool for OpenMP programs – quantify performance losses – pinpoint problematic code for optimization – low

OpenMP tools API• Maintain states• Provide callbacks

9

working

Fork

Join

Fork

Join

Fork

Join

Page 18: A New Approach for Performance Analysis of …• A hybrid performance tool for OpenMP programs – quantify performance losses – pinpoint problematic code for optimization – low

OpenMP tools API• Maintain states• Provide callbacks

9

working

Fork

Join

Fork

Join

Fork

Join

idle

Page 19: A New Approach for Performance Analysis of …• A hybrid performance tool for OpenMP programs – quantify performance losses – pinpoint problematic code for optimization – low

OpenMP tools API• Maintain states• Provide callbacks

9

working

Fork

Join

Fork

Join

Fork

Join

idle overhead

Page 20: A New Approach for Performance Analysis of …• A hybrid performance tool for OpenMP programs – quantify performance losses – pinpoint problematic code for optimization – low

OpenMP tools API• Maintain states• Provide callbacks

9

working

Fork

Join

Fork

Join

Fork

Join

idle overhead

entrance callback

Page 21: A New Approach for Performance Analysis of …• A hybrid performance tool for OpenMP programs – quantify performance losses – pinpoint problematic code for optimization – low

OpenMP tools API• Maintain states• Provide callbacks

9

working

Fork

Join

Fork

Join

Fork

Join

idle overhead

entrance callback

exit callback

Page 22: A New Approach for Performance Analysis of …• A hybrid performance tool for OpenMP programs – quantify performance losses – pinpoint problematic code for optimization – low

OpenMP tools API• Maintain states• Provide callbacks

9

working

Fork

Join

Fork

Join

Fork

Join

idle overhead

entrance callback

exit callback

state callback

Page 23: A New Approach for Performance Analysis of …• A hybrid performance tool for OpenMP programs – quantify performance losses – pinpoint problematic code for optimization – low

OpenMP tools API• Maintain states• Provide callbacks

9

working

Fork

Join

Fork

Join

Fork

Join

idle overhead

entrance callback

exit callback

state callback

• Low overhead• Try to make it into

OpenMP standard

Page 24: A New Approach for Performance Analysis of …• A hybrid performance tool for OpenMP programs – quantify performance losses – pinpoint problematic code for optimization – low

Enhancement of HPCToolkit

• Measure and attribute costs to full user-program calling contexts– unified view of calling contexts across all threads

• Shift blame for idleness from symptoms to causes• Support both OpenMP profiling and tracing

10

Page 25: A New Approach for Performance Analysis of …• A hybrid performance tool for OpenMP programs – quantify performance losses – pinpoint problematic code for optimization – low

Problem: separate views for different threads Worker threads don’t know the full user-level context for work

11

Page 26: A New Approach for Performance Analysis of …• A hybrid performance tool for OpenMP programs – quantify performance losses – pinpoint problematic code for optimization – low

Problem: separate views for different threads Worker threads don’t know the full user-level context for work

11

main→fn.0→fn.1→fn.2

Page 27: A New Approach for Performance Analysis of …• A hybrid performance tool for OpenMP programs – quantify performance losses – pinpoint problematic code for optimization – low

Call stack snapshot in OpenMP threads

12

regions in gray have

distributed calling contexts

Page 28: A New Approach for Performance Analysis of …• A hybrid performance tool for OpenMP programs – quantify performance losses – pinpoint problematic code for optimization – low

Call stack snapshot in OpenMP threads

12

regions in gray have

distributed calling contexts

Page 29: A New Approach for Performance Analysis of …• A hybrid performance tool for OpenMP programs – quantify performance losses – pinpoint problematic code for optimization – low

Call stack snapshot in OpenMP threads

12

regions in gray have

distributed calling contexts

Page 30: A New Approach for Performance Analysis of …• A hybrid performance tool for OpenMP programs – quantify performance losses – pinpoint problematic code for optimization – low

Call stack snapshot in OpenMP threads

12

regions in gray have

distributed calling contexts

Page 31: A New Approach for Performance Analysis of …• A hybrid performance tool for OpenMP programs – quantify performance losses – pinpoint problematic code for optimization – low

Call stack snapshot in OpenMP threads

12

regions in gray have

distributed calling contexts

Page 32: A New Approach for Performance Analysis of …• A hybrid performance tool for OpenMP programs – quantify performance losses – pinpoint problematic code for optimization – low

Call stack snapshot in OpenMP threads

12

regions in gray have

distributed calling contexts

Page 33: A New Approach for Performance Analysis of …• A hybrid performance tool for OpenMP programs – quantify performance losses – pinpoint problematic code for optimization – low

Call stack snapshot in OpenMP threads

12

regions in gray have

distributed calling contexts

too tiny

Page 34: A New Approach for Performance Analysis of …• A hybrid performance tool for OpenMP programs – quantify performance losses – pinpoint problematic code for optimization – low

Call stack snapshot in OpenMP threads

12

regions in gray have

distributed calling contexts

too tiny

Online deferred context resolution

Page 35: A New Approach for Performance Analysis of …• A hybrid performance tool for OpenMP programs – quantify performance losses – pinpoint problematic code for optimization – low

Results of deferred context construction

13

main._omp_fn.*:outlined functions

correspond to <parallel region>

Page 36: A New Approach for Performance Analysis of …• A hybrid performance tool for OpenMP programs – quantify performance losses – pinpoint problematic code for optimization – low

Enhancement of HPCToolkit

• Measure and attribute costs to full user-program calling contexts• Shift blame for idleness from symptoms to causes

– undirected blame: load imbalance, serialization– directed blame: lock and critical section contention

• Support both OpenMP profiling and tracing

14

Page 37: A New Approach for Performance Analysis of …• A hybrid performance tool for OpenMP programs – quantify performance losses – pinpoint problematic code for optimization – low

Problem: meaningless hotspots

15

hotspot is do_wait,but don’t know why

Page 38: A New Approach for Performance Analysis of …• A hybrid performance tool for OpenMP programs – quantify performance losses – pinpoint problematic code for optimization – low

Undirected blame

• do_wait is the symptom• Causes are working threads

– e.g. last thread has more work than other threads• Blaming do_wait to working threads

16

Join

Fork

do_wait

do_wait

do_wait

do_wait

Page 39: A New Approach for Performance Analysis of …• A hybrid performance tool for OpenMP programs – quantify performance losses – pinpoint problematic code for optimization – low

Undirected blame

• do_wait is the symptom• Causes are working threads

– e.g. last thread has more work than other threads• Blaming do_wait to working threads

16

Join

Fork

do_wait

do_wait

do_wait

do_wait

Page 40: A New Approach for Performance Analysis of …• A hybrid performance tool for OpenMP programs – quantify performance losses – pinpoint problematic code for optimization – low

Undirected blame

• do_wait is the symptom• Causes are working threads

– e.g. last thread has more work than other threads• Blaming do_wait to working threads

16

Join

Fork

do_wait

do_wait

do_wait

do_wait

2 idle threads

3 working threads

Page 41: A New Approach for Performance Analysis of …• A hybrid performance tool for OpenMP programs – quantify performance losses – pinpoint problematic code for optimization – low

Undirected blame

• do_wait is the symptom• Causes are working threads

– e.g. last thread has more work than other threads• Blaming do_wait to working threads

16

Join

Fork

do_wait

do_wait

do_wait

do_wait

2 idle threads

3 working threads

take 2/3 idleness

Page 42: A New Approach for Performance Analysis of …• A hybrid performance tool for OpenMP programs – quantify performance losses – pinpoint problematic code for optimization – low

Code-centric view: hypre_BoomerAMGRelax

17

20% idleness and 80% work in this parallel

region

Page 43: A New Approach for Performance Analysis of …• A hybrid performance tool for OpenMP programs – quantify performance losses – pinpoint problematic code for optimization – low

Serial Code in AMG2006 8 PE, 8 Threads

18

7/8 threads are idle: sequential

code

Page 44: A New Approach for Performance Analysis of …• A hybrid performance tool for OpenMP programs – quantify performance losses – pinpoint problematic code for optimization – low

LULESH running on 48 cores

19

free in both LULESH benchmark and libgomp

causes high idleness

4x speedup after changing to

Google’s tcmalloc

Page 45: A New Approach for Performance Analysis of …• A hybrid performance tool for OpenMP programs – quantify performance losses – pinpoint problematic code for optimization – low

Directed blame

• Thread waiting at lock is the symptom• Cause is the lock holder• Blame lock waiting to lock holder

20

Join

Fork

lockwait

acquire lock release lock

Page 46: A New Approach for Performance Analysis of …• A hybrid performance tool for OpenMP programs – quantify performance losses – pinpoint problematic code for optimization – low

Directed blame

• Thread waiting at lock is the symptom• Cause is the lock holder• Blame lock waiting to lock holder

20

Join

Fork

lockwait

acquire lock release lock

accumulate samples indexed

by the lock address

Page 47: A New Approach for Performance Analysis of …• A hybrid performance tool for OpenMP programs – quantify performance losses – pinpoint problematic code for optimization – low

Directed blame

• Thread waiting at lock is the symptom• Cause is the lock holder• Blame lock waiting to lock holder

20

Join

Fork

lockwait

acquire lock release lock

accumulate samples indexed

by the lock address

lock holder takes these samples at

lock release

Page 48: A New Approach for Performance Analysis of …• A hybrid performance tool for OpenMP programs – quantify performance losses – pinpoint problematic code for optimization – low

Directed blame

• Thread waiting at lock is the symptom• Cause is the lock holder• Blame lock waiting to lock holder

20

Join

Fork

lockwait

acquire lock release lock

accumulate samples indexed

by the lock address

lock holder takes these samples at

lock release

Page 49: A New Approach for Performance Analysis of …• A hybrid performance tool for OpenMP programs – quantify performance losses – pinpoint problematic code for optimization – low

Lock blame shifting for NAS UA

21

• Lots of locks• 8.4% of execution

time waiting for locks • 34% of lock waiting

due to locks acquired at highlighted call site

Page 50: A New Approach for Performance Analysis of …• A hybrid performance tool for OpenMP programs – quantify performance losses – pinpoint problematic code for optimization – low

Reducing lock contention for NAS UA

22

• Use omp_test_lock• Defer the lock

acquisition to the next iteration

• Eliminate most lock contention time

Page 51: A New Approach for Performance Analysis of …• A hybrid performance tool for OpenMP programs – quantify performance losses – pinpoint problematic code for optimization – low

Enhancement of HPCToolkit

• Measure and attribute costs to full user-program calling contexts• Shift blame for idleness from symptoms to causes• Support both OpenMP profiling and tracing

23

Page 52: A New Approach for Performance Analysis of …• A hybrid performance tool for OpenMP programs – quantify performance losses – pinpoint problematic code for optimization – low

Tracing: AMG 2006 (solver phase)

24

time line

proc

ess/

thre

ad ra

nk

Page 53: A New Approach for Performance Analysis of …• A hybrid performance tool for OpenMP programs – quantify performance losses – pinpoint problematic code for optimization – low

Tool performance

• Measurement overhead– OpenMP tools API– HPCToolkit scalability

• Insightful optimization feedbacks

25

Page 54: A New Approach for Performance Analysis of …• A hybrid performance tool for OpenMP programs – quantify performance losses – pinpoint problematic code for optimization – low

Benchmarks

• Four case studies– AMG2006

• one of Sequoia benchmarks from LLNL• running on 48 threads

– LULESH• an application benchmark from LLNL• uses work-sharing parallel regions without nesting and tasking• 48 threads

– BT-MZ.B• BT in multi-zone NPB with workload B• uses nested parallel regions without tasking• 32 threads: 4 for outer region and 8 for inner region

– HEALTH• a benchmark in Barcelona tasking benchmarks• uses tasking: more than 17 million tasks• 8 threads, using medium input

26

Page 55: A New Approach for Performance Analysis of …• A hybrid performance tool for OpenMP programs – quantify performance losses – pinpoint problematic code for optimization – low

Profiling overhead

• OpenMP tools API implemented in GNU OpenMP (GOMP)• HPCToolkit takes 200 samples/s/thread

27

Benchmarks GOMP GOMP + tools API

GOMP+tools API + profiling

AMG2006 54.02s +0.15% +5.0%

LULESH 402.34s +0.05% +3.6%

NAS BT-MZ 32.10s +0.16% +6.6%

HEALTH 71.74s +0.64% +3.5%

Page 56: A New Approach for Performance Analysis of …• A hybrid performance tool for OpenMP programs – quantify performance losses – pinpoint problematic code for optimization – low

Performance gains

28

Benchmarks Problem Optimization Improvement

AMG2006 high idleness in parallel regions

use dynamic scheduling for load balance

1.12x

LULESH serialization in free

switch to tcmalloc 3.93x

NAS BT-MZ parallelize insignificant work

eliminate necessary parallelism

1.09x

HEALTHhigh lock

contention for task queue

coarsen task granularity 5.65x

Page 57: A New Approach for Performance Analysis of …• A hybrid performance tool for OpenMP programs – quantify performance losses – pinpoint problematic code for optimization – low

Summary

• OpenMP tools API -- we are involved in designing it– incurs very little runtime overhead– supports efficient sampling-based measurement tools– supports root cause analysis of performance bottlenecks– try to make it into OpenMP standard

• HPCToolkit -- we developed it– provides low-overhead profiling and tracing– supports all OpenMP features– delivers insights into performance of OpenMP codes

• Now enhanced HPCToolkit works on IBM Blue Gene/Q

29