intel threading tools - polaris.cs.uiuc.edupolaris.cs.uiuc.edu/~padua/cs498dhp/threading.pdf ·...

63
-1- Copyright © 2005 Intel Corporation. All Rights Reserved. Intel ® Threading Tools Paul Petersen, Intel

Upload: others

Post on 25-Jul-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Intel Threading Tools - polaris.cs.uiuc.edupolaris.cs.uiuc.edu/~padua/cs498DHP/threading.pdf · -10-Copyright © 2005 Intel Corporation. All Rights Reserved. #include

- 1 -

Copyright © 2005 Intel Corporation. All Rights Reserved.

Intel® Threading Tools

Paul Petersen, Intel

Page 2: Intel Threading Tools - polaris.cs.uiuc.edupolaris.cs.uiuc.edu/~padua/cs498DHP/threading.pdf · -10-Copyright © 2005 Intel Corporation. All Rights Reserved. #include

- 2 -

Copyright © 2005 Intel Corporation. All Rights Reserved.

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL® PRODUCTS. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY RELATING TO SALE AND/OR USE OF INTEL PRODUCTS, INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT, OR OTHER INTELLECTUAL PROPERTY RIGHT.

Intel may make changes to specifications, product descriptions, and plans at any time, without notice.

All dates provided are subject to change without notice.

* Other names and brands may be claimed as the property of others.

Copyright © 2005, Intel Corporation.

Page 3: Intel Threading Tools - polaris.cs.uiuc.edupolaris.cs.uiuc.edu/~padua/cs498DHP/threading.pdf · -10-Copyright © 2005 Intel Corporation. All Rights Reserved. #include

- 3 -

Copyright © 2005 Intel Corporation. All Rights Reserved.

Threading?“You lock too much. You don't lock enough. The program crashes. You try to debug the program.

Debugging a threaded program has a lot in common with medieval torture methods, IMHO.”

random quote found via GOOGLE search

Page 4: Intel Threading Tools - polaris.cs.uiuc.edupolaris.cs.uiuc.edu/~padua/cs498DHP/threading.pdf · -10-Copyright © 2005 Intel Corporation. All Rights Reserved. #include

- 4 -

Copyright © 2005 Intel Corporation. All Rights Reserved.

Example: Prime Number Generation

33 3355 3,3,5577 3,5,3,5,7799 33

1111 3,5,7,9,3,5,7,9,11111313 3,5,7,9,11,3,5,7,9,11,13131515 331717 3,5,7,9,11,13,15,3,5,7,9,11,13,15,1717

22

#include <stdio.h>const long N = 18;long primes[N], number_of_primes = 0;main(){

printf( "Determining primes from 1-%d \n", N );primes[ number_of_primes++ ] = 2; // special casefor (long number = 3; number <= N; number += 2 ) {

long factor = 3;while ( number % factor ) factor += 2;if ( factor == number )

primes[ number_of_primes++ ] = number;}printf( "Found %d primes\n", number_of_primes );

}

Page 5: Intel Threading Tools - polaris.cs.uiuc.edupolaris.cs.uiuc.edu/~padua/cs498DHP/threading.pdf · -10-Copyright © 2005 Intel Corporation. All Rights Reserved. #include

- 5 -

Copyright © 2005 Intel Corporation. All Rights Reserved.

Example: Prime Number Kernel

33 3355 3,3,5577 3,5,3,5,7799 33

1111 3,5,7,9,3,5,7,9,11111313 3,5,7,9,11,3,5,7,9,11,13131515 331717 3,5,7,9,11,13,15,3,5,7,9,11,13,15,1717

22 for ( long number = 3; number <= N; number += 2 ) {

long factor = 3;while ( number % factor ) factor += 2;if ( factor == number ) primes[ number_of_primes++ ] = number;

}

for ( long number = 3; number <= N; number += 2 ) {

long factor = 3;while ( number % factor ) factor += 2;if ( factor == number ) primes[ number_of_primes++ ] = number;

}

Page 6: Intel Threading Tools - polaris.cs.uiuc.edupolaris.cs.uiuc.edu/~padua/cs498DHP/threading.pdf · -10-Copyright © 2005 Intel Corporation. All Rights Reserved. #include

- 6 -

Copyright © 2005 Intel Corporation. All Rights Reserved.

#include <stdio.h>const long N = 100000;long primes[N], number_of_primes = 0;main(){printf( "Determining primes from 1-%d \n", N );primes[ number_of_primes++ ] = 2; // special case

for ( long number = 3; number <= N; number += 2 ) {long factor = 3;while ( number % factor ) factor += 2;if ( factor == number ) primes[ number_of_primes++ ] = number;

}printf( "Found %d primes\n", number_of_primes );

}

#include <stdio.h>const long N = 100000;long primes[N], number_of_primes = 0;main(){printf( "Determining primes from 1-%d \n", N );primes[ number_of_primes++ ] = 2; // special case

for ( long number = 3; number <= N; number += 2 ) {long factor = 3;while ( number % factor ) factor += 2;if ( factor == number ) primes[ number_of_primes++ ] = number;

}printf( "Found %d primes\n", number_of_primes );

}

#include <stdio.h>const long N = 100000;long primes[N], number_of_primes = 0;main(){printf( "Determining primes from 1-%d \n", N );primes[ number_of_primes++ ] = 2; // special case

#pragma omp parallel forfor ( long number = 3; number <= N; number += 2 ) {long factor = 3;while ( number % factor ) factor += 2;if ( factor == number ) primes[ number_of_primes++ ] = number;

}printf( "Found %d primes\n", number_of_primes );

}

#include <stdio.h>const long N = 100000;long primes[N], number_of_primes = 0;main(){printf( "Determining primes from 1-%d \n", N );primes[ number_of_primes++ ] = 2; // special case

#pragma omp parallel forfor ( long number = 3; number <= N; number += 2 ) {long factor = 3;while ( number % factor ) factor += 2;if ( factor == number ) primes[ number_of_primes++ ] = number;

}printf( "Found %d primes\n", number_of_primes );

}

Example: Not Quite Right

Page 7: Intel Threading Tools - polaris.cs.uiuc.edupolaris.cs.uiuc.edu/~padua/cs498DHP/threading.pdf · -10-Copyright © 2005 Intel Corporation. All Rights Reserved. #include

- 7 -

Copyright © 2005 Intel Corporation. All Rights Reserved.

Intel® Threading Tools

Intel® Thread CheckerPinpoint notorious threading bugs like data races, stalls and deadlocks

http://www.intel.com/software/products/threading/tcwin

Intel® Thread Profiler• Visualize your threads over time• Identify synchronization objects that cause delays

http://www.intel.com/software/products/threading/tp

Page 8: Intel Threading Tools - polaris.cs.uiuc.edupolaris.cs.uiuc.edu/~padua/cs498DHP/threading.pdf · -10-Copyright © 2005 Intel Corporation. All Rights Reserved. #include

- 8 -

Copyright © 2005 Intel Corporation. All Rights Reserved.

IntelIntel®® Thread CheckerThread CheckerVTuneVTune™™ Performance AnalyzerPerformance Analyzer

Data Flows

Binary InstrumentationPrimes.exe Primes.exe

(Instrumented)

Runtime Data

Collector

threadchecker.thr(results)

Win32* threads, POSIX* threads, OpenMP*

*Third party marks and brands are the property of their respective owners

+DLLs (Instrumented)

Page 9: Intel Threading Tools - polaris.cs.uiuc.edupolaris.cs.uiuc.edu/~padua/cs498DHP/threading.pdf · -10-Copyright © 2005 Intel Corporation. All Rights Reserved. #include

- 9 -

Copyright © 2005 Intel Corporation. All Rights Reserved.

Intel® Thread Checker

Page 10: Intel Threading Tools - polaris.cs.uiuc.edupolaris.cs.uiuc.edu/~padua/cs498DHP/threading.pdf · -10-Copyright © 2005 Intel Corporation. All Rights Reserved. #include

- 10 -

Copyright © 2005 Intel Corporation. All Rights Reserved.

#include <stdio.h>const long N = 100000;long primes[N], number_of_primes = 0;main(){printf( "Determining primes from 1-%d \n", N );primes[ number_of_primes++ ] = 2; // special case

#pragma omp parallel forfor ( long number = 3; number <= N; number += 2 ) {long factor = 3;while ( number % factor ) factor += 2;if ( factor == number )

primes[ number_of_primes++ ] = number;}printf( "Found %d primes\n", number_of_primes );

}

#include <stdio.h>const long N = 100000;long primes[N], number_of_primes = 0;main(){printf( "Determining primes from 1-%d \n", N );primes[ number_of_primes++ ] = 2; // special case

#pragma omp parallel forfor ( long number = 3; number <= N; number += 2 ) {long factor = 3;while ( number % factor ) factor += 2;if ( factor == number )

primes[ number_of_primes++ ] = number;}printf( "Found %d primes\n", number_of_primes );

}

#include <stdio.h>const long N = 100000;long primes[N], number_of_primes = 0;main(){printf( "Determining primes from 1-%d \n", N );primes[ number_of_primes++ ] = 2; // special case

#pragma omp parallel forfor ( long number = 3; number <= N; number += 2 ) {long factor = 3;while ( number % factor ) factor += 2;if ( factor == number )

#pragma omp criticalprimes[ number_of_primes++ ] = number;

}printf( "Found %d primes\n", number_of_primes );

}

#include <stdio.h>const long N = 100000;long primes[N], number_of_primes = 0;main(){printf( "Determining primes from 1-%d \n", N );primes[ number_of_primes++ ] = 2; // special case

#pragma omp parallel forfor ( long number = 3; number <= N; number += 2 ) {long factor = 3;while ( number % factor ) factor += 2;if ( factor == number )

#pragma omp criticalprimes[ number_of_primes++ ] = number;

}printf( "Found %d primes\n", number_of_primes );

}

Example: A Bit Better

Page 11: Intel Threading Tools - polaris.cs.uiuc.edupolaris.cs.uiuc.edu/~padua/cs498DHP/threading.pdf · -10-Copyright © 2005 Intel Corporation. All Rights Reserved. #include

- 11 -

Copyright © 2005 Intel Corporation. All Rights Reserved.

Fundamental Concepts

• Thread• Synchronization Instruction • Segment• Precedence Relation between Segments• Parallel Segments• Vector Clock of a Segment• Data Race

Page 12: Intel Threading Tools - polaris.cs.uiuc.edupolaris.cs.uiuc.edu/~padua/cs498DHP/threading.pdf · -10-Copyright © 2005 Intel Corporation. All Rights Reserved. #include

- 12 -

Copyright © 2005 Intel Corporation. All Rights Reserved.

• Thread – An Independent Sequence of Instructions

• Sync Instruction– Creates a thread– Destroys a thread– Posts information about host thread– Brings information posted by another thread to

host thread• Segments

– Parts of a thread separated by sync instructions

Definitions

Page 13: Intel Threading Tools - polaris.cs.uiuc.edupolaris.cs.uiuc.edu/~padua/cs498DHP/threading.pdf · -10-Copyright © 2005 Intel Corporation. All Rights Reserved. #include

- 13 -

Copyright © 2005 Intel Corporation. All Rights Reserved.

Precedence Relation between Segments

Partial Order

S11 < S12 < S13

S21 < S22 < S23

S31 < S32

S21 < S12 < S23

S11 < S23

T1T2 T3

S11

S12

S13

S21

S22

S23

S31

S32

Page 14: Intel Threading Tools - polaris.cs.uiuc.edupolaris.cs.uiuc.edu/~padua/cs498DHP/threading.pdf · -10-Copyright © 2005 Intel Corporation. All Rights Reserved. #include

- 14 -

Copyright © 2005 Intel Corporation. All Rights Reserved.

Parallel Segments

Two segments S and S´are parallel, if S < S´ and S´ < S are both false.

Examples: S11 and S21

S11 and S22

S11 and S32

S22 and S31.

T1T2 T3

S11

S12

S13

S21

S22

S23

S31

S32

Page 15: Intel Threading Tools - polaris.cs.uiuc.edupolaris.cs.uiuc.edu/~padua/cs498DHP/threading.pdf · -10-Copyright © 2005 Intel Corporation. All Rights Reserved. #include

- 15 -

Copyright © 2005 Intel Corporation. All Rights Reserved.

Vector Clock of a Segment• Vector clocks represent numerically the

precedence relation between segments.• Vector Clock VS of a segment S is a function on

threads with values in {0, 1, 2,…}.

• Take a segment S on a thread T. Take a thread T´ known to S. Then VS(T´) is one more than the number of segments S´ on T´ such that S´ < S.

• Vector clock of a segment is computed from the vector clocks of its immediate predecessors.

Page 16: Intel Threading Tools - polaris.cs.uiuc.edupolaris.cs.uiuc.edu/~padua/cs498DHP/threading.pdf · -10-Copyright © 2005 Intel Corporation. All Rights Reserved. #include

- 16 -

Copyright © 2005 Intel Corporation. All Rights Reserved.

Conditions in Terms of Vector Clocks

Let S denote a segment on a thread T and S´ a segment on a thread T´.

S < S´ iff VS(T) < VS´(T).

S and S´ are parallel iff VS(T) ≥ VS´(T)VS´(T´) ≥ VS(T´).

T1T2 T3

(1,0,0)

(2,2,0)

(3,2,0)

(0,1,0)

(0,2,0)

(3,3,0)

(0,0,1)

(0,0,2)

(3,4,3)

Page 17: Intel Threading Tools - polaris.cs.uiuc.edupolaris.cs.uiuc.edu/~padua/cs498DHP/threading.pdf · -10-Copyright © 2005 Intel Corporation. All Rights Reserved. #include

- 17 -

Copyright © 2005 Intel Corporation. All Rights Reserved.

Data Race

• There is a data race between two segments S and S´ if– The segments are parallel;– There is a location x in shared memory

that both segments access, and at least one of the accesses is a “write.”

Page 18: Intel Threading Tools - polaris.cs.uiuc.edupolaris.cs.uiuc.edu/~padua/cs498DHP/threading.pdf · -10-Copyright © 2005 Intel Corporation. All Rights Reserved. #include

- 18 -

Copyright © 2005 Intel Corporation. All Rights Reserved.

Data Race Detection• Basic Result:

x: A location in shared memoryS: A segment that has accessed xS´: A segment that accesses x after S.There is a race in x, if

- VS(T) ≥ VS´(T)- x is written by either S, or S´, or both.

Page 19: Intel Threading Tools - polaris.cs.uiuc.edupolaris.cs.uiuc.edu/~padua/cs498DHP/threading.pdf · -10-Copyright © 2005 Intel Corporation. All Rights Reserved. #include

- 19 -

Copyright © 2005 Intel Corporation. All Rights Reserved.

Summary : Thread Checker

• The Intel® Thread Checker provides a safety net for developers– Increases confidence in developers ability to

debug threaded software.• The techniques used by the Thread

Checker allow analysis of “production”applications.

• Industry direction moving to multi-core solutions makes confidence tools even more critical to success

Page 20: Intel Threading Tools - polaris.cs.uiuc.edupolaris.cs.uiuc.edu/~padua/cs498DHP/threading.pdf · -10-Copyright © 2005 Intel Corporation. All Rights Reserved. #include

- 20 -

Copyright © 2005 Intel Corporation. All Rights Reserved.

Intel® Thread Profiler

• The Thread Profiler consists contains two modes of analysis– OpenMP analysis

• Single-level fork-join parallelism– Native Threading analysis

• Arbitrary concurrency

Page 21: Intel Threading Tools - polaris.cs.uiuc.edupolaris.cs.uiuc.edu/~padua/cs498DHP/threading.pdf · -10-Copyright © 2005 Intel Corporation. All Rights Reserved. #include

- 21 -

Copyright © 2005 Intel Corporation. All Rights Reserved.

Thread Profiler: OpenMP

Page 22: Intel Threading Tools - polaris.cs.uiuc.edupolaris.cs.uiuc.edu/~padua/cs498DHP/threading.pdf · -10-Copyright © 2005 Intel Corporation. All Rights Reserved. #include

- 22 -

Copyright © 2005 Intel Corporation. All Rights Reserved.

Thread Profiler: Native threads• Thread Profiler motivation• Understand Threading Behavior• Optimize turnaround performance

– Serial Optimizations:• learn where to perform optimizations on code that

will most likely affect program run-time– Parallel Optimizations:

• Target concurrency level to fully utilize processing resources

• Thread blocking time is not always bad

A B C D

Page 23: Intel Threading Tools - polaris.cs.uiuc.edupolaris.cs.uiuc.edu/~padua/cs498DHP/threading.pdf · -10-Copyright © 2005 Intel Corporation. All Rights Reserved. #include

- 23 -

Copyright © 2005 Intel Corporation. All Rights Reserved.

Product Status

• 1.0 – OpenMP Profiler only• 2.0 (2003)– Windows Threads• 2.1 (2004)– Linux pthreads (via RDC)• 2.2 (2005)– 32-bit app support on em64t /

incremental improvements

Page 24: Intel Threading Tools - polaris.cs.uiuc.edupolaris.cs.uiuc.edu/~padua/cs498DHP/threading.pdf · -10-Copyright © 2005 Intel Corporation. All Rights Reserved. #include

- 24 -

Copyright © 2005 Intel Corporation. All Rights Reserved.

Tool Components

RuntimeLibrary

Binary or CompilerInstrumentation

User App

InstrumentedApp

ProfileData

Thread Profiler

Iterative Optimization

Process

Page 25: Intel Threading Tools - polaris.cs.uiuc.edupolaris.cs.uiuc.edu/~padua/cs498DHP/threading.pdf · -10-Copyright © 2005 Intel Corporation. All Rights Reserved. #include

- 25 -

Copyright © 2005 Intel Corporation. All Rights Reserved.

Analyzer Concepts

Page 26: Intel Threading Tools - polaris.cs.uiuc.edupolaris.cs.uiuc.edu/~padua/cs498DHP/threading.pdf · -10-Copyright © 2005 Intel Corporation. All Rights Reserved. #include

- 26 -

Copyright © 2005 Intel Corporation. All Rights Reserved.

System APIs– Thread and Process Control APIs

• Create, Terminate, Suspend, Resume, Exit– Synchronization APIs

• Mutexes, Critical Sections, Locks, Semaphores, Thread Pools, Timers, Messages, APCs, Events, Condition vars

– Blocking APIs• Sleeping, Timeouts• I/O: Files, Pipes, Ports, Messages, Network,

Sockets• User I/O: Standard, GUI, Dialog Boxes

Page 27: Intel Threading Tools - polaris.cs.uiuc.edupolaris.cs.uiuc.edu/~padua/cs498DHP/threading.pdf · -10-Copyright © 2005 Intel Corporation. All Rights Reserved. #include

- 27 -

Copyright © 2005 Intel Corporation. All Rights Reserved.

User Synchronization• Thread Profiler provides an API for users to

instrument user synchronization.

__itt_notify_sync_prepare( &spin );while( wait for spin ) {

if( timeout ) {__itt_notify_sync_cancel( &spin );return;

}}__itt_notify_sync_acquired( &spin );

do stuff;

__itt_notify_sync_releasing( &spin );release spin;

Page 28: Intel Threading Tools - polaris.cs.uiuc.edupolaris.cs.uiuc.edu/~padua/cs498DHP/threading.pdf · -10-Copyright © 2005 Intel Corporation. All Rights Reserved. #include

- 28 -

Copyright © 2005 Intel Corporation. All Rights Reserved.

Instrumentation• Usually low run-time overhead because only select events are instrumented• Target overhead of less than 2x slowdown with reasonable synchronization • common case is much less

T1

T2

T3

E1 E2 E3 E4 E5 E6 E7 E8 E9 E10 E11 E12E0

RECORDED EVENTSCreate Thread (Fork)Thread EntryWait for synchronization object or eventAcquire synchronization object or event

Thread Exit

Release or signal synchronization object or eventWait for external eventReceive external event

wait for T2 & T3

wait for lock L

release lock L

wait for external event

Time

acquire lock L

acquire lock L

release lock L

Page 29: Intel Threading Tools - polaris.cs.uiuc.edupolaris.cs.uiuc.edu/~padua/cs498DHP/threading.pdf · -10-Copyright © 2005 Intel Corporation. All Rights Reserved. #include

- 29 -

Copyright © 2005 Intel Corporation. All Rights Reserved.

Uncontended Events

• i.e. Thread acquires lock that no one else holds. It continues without any pause.

• Thread Profiler will not consider these transactions.

T1

T2

T3

E1 E2 E3 E4 E5 E6 E7 E8 E9 E10 E11 E12E0

wait for T2 & T3

wait for lock L

release lock L

wait for external event

acquire lock L

acquire lock L

release lock L

Page 30: Intel Threading Tools - polaris.cs.uiuc.edupolaris.cs.uiuc.edu/~padua/cs498DHP/threading.pdf · -10-Copyright © 2005 Intel Corporation. All Rights Reserved. #include

- 30 -

Copyright © 2005 Intel Corporation. All Rights Reserved.

Contended Transitions

• i.e. Thread attempts to acquire lock that another thread holds. It must wait to acquire it before continuing

T1

T2

T3

E1 E2 E3 E4 E5 E6 E7 E8 E9 E10 E11 E12E0

wait for T2 & T3

wait for lock L

release lock L

wait for external event

acquire lock L

Page 31: Intel Threading Tools - polaris.cs.uiuc.edupolaris.cs.uiuc.edu/~padua/cs498DHP/threading.pdf · -10-Copyright © 2005 Intel Corporation. All Rights Reserved. #include

- 31 -

Copyright © 2005 Intel Corporation. All Rights Reserved.

Thread Blocking

• Thread waits for external event such as user I/O, file operation, another process, &c.

T1

T2

T3

E1 E2 E3 E4 E5 E6 E7 E8 E9 E10 E11 E12E0

wait for T2 & T3

wait for lock L

release lock L

wait for external event

acquire lock L

Page 32: Intel Threading Tools - polaris.cs.uiuc.edupolaris.cs.uiuc.edu/~padua/cs498DHP/threading.pdf · -10-Copyright © 2005 Intel Corporation. All Rights Reserved. #include

- 32 -

Copyright © 2005 Intel Corporation. All Rights Reserved.

• Flow splits when a thread creates a new thread or signals (unblocks) another thread

• Flow ends when a thread waits for another thread or terminates

Execution Flows

T1

T2

T3

E1 E2 E3 E4 E5 E6 E7 E8 E9 E10 E11 E12E0

wait for T2 & T3

wait for lock L

release lock L

wait for external event

Page 33: Intel Threading Tools - polaris.cs.uiuc.edupolaris.cs.uiuc.edu/~padua/cs498DHP/threading.pdf · -10-Copyright © 2005 Intel Corporation. All Rights Reserved. #include

- 33 -

Copyright © 2005 Intel Corporation. All Rights Reserved.

• The continuous flow to target location is the critical path

• Default target is the program termination• Where to focus optimization energy• Next, we will “color” the critical path to offer more

information

Critical Path Analysis

T1

T2

T3

E1 E2 E3 E4 E5 E6 E7 E8 E9 E10 E11 E12E0

wait for T2 & T3

wait for lock L

release lock L

wait for external event

Page 34: Intel Threading Tools - polaris.cs.uiuc.edupolaris.cs.uiuc.edu/~padua/cs498DHP/threading.pdf · -10-Copyright © 2005 Intel Corporation. All Rights Reserved. #include

- 34 -

Copyright © 2005 Intel Corporation. All Rights Reserved.

Transition Overhead Time• Indicate the time spent between one

thread signaling and another thread receiving the signal. This is marked as overhead time.

T1

T2

T3

wait for T2 & T3

wait for lock L

release lock L

E1 E2 E3 E4 E5 E6 E7 E8 E9 E10 E11 E12E0

Page 35: Intel Threading Tools - polaris.cs.uiuc.edupolaris.cs.uiuc.edu/~padua/cs498DHP/threading.pdf · -10-Copyright © 2005 Intel Corporation. All Rights Reserved. #include

- 35 -

Copyright © 2005 Intel Corporation. All Rights Reserved.

Processor Utilization• Measure processor

utilization so user can see how parallel their program really is

• Concurrency is the number of threads that are active (not waiting, sleeping, blocked, &c.) at a given time

1 3 2 1 1 2 1 2 1 0 11

idle (0)

serial (1)

parallel (2)

oversubscribed (3+)

Concurrency

(example reflects 2 processor machine)

T1

T2

T3

wait for T2 & T3

wait for lock L

release lock L

Page 36: Intel Threading Tools - polaris.cs.uiuc.edupolaris.cs.uiuc.edu/~padua/cs498DHP/threading.pdf · -10-Copyright © 2005 Intel Corporation. All Rights Reserved. #include

- 36 -

Copyright © 2005 Intel Corporation. All Rights Reserved.

Characterize parallel behavior of Critical Path• Associate spans of time with the objects that caused the CP transitions• This provides which objects/barriers cause parallel scalability problems that

are directly affecting execution time to critical path target.

T1

T2

T3

E1 E2 E3 E4 E5 E6 E7 E8 E9 E10 E11 E12E0

wait for T2 & T3

wait for lock L

release lock L

wait for external event

Cruise Time: Time a thread does not delay the next thread on the CP

Overhead (transition) Time: Threading synchronization or OS scheduling overhead

Blocking Time: Time a thread spends waiting for an external event or blocking while still on the CP (includes timeouts)

Impact Time: Time a thread on the CP delays next thread from getting on the CP

Lock L

Join T3

Event E

Page 37: Intel Threading Tools - polaris.cs.uiuc.edupolaris.cs.uiuc.edu/~padua/cs498DHP/threading.pdf · -10-Copyright © 2005 Intel Corporation. All Rights Reserved. #include

- 37 -

Copyright © 2005 Intel Corporation. All Rights Reserved.

Combine concepts for Profile• Start with the CP• Mark overhead• Break it down by

Utilization• Further

categorize by behavior

T1

T2

T3

E1 E2 E3 E4 E5 E6 E7 E8 E9 E10 E11 E12E0

wait for T2 & T3

wait for lock L

release lock L

wait for external event

Lock L

Join T3

Event E

idle (0)

serial (1)

parallel (2)

oversubscribed (3+)

(example reflects 2 processor machine)

Page 38: Intel Threading Tools - polaris.cs.uiuc.edupolaris.cs.uiuc.edu/~padua/cs498DHP/threading.pdf · -10-Copyright © 2005 Intel Corporation. All Rights Reserved. #include

- 38 -

Copyright © 2005 Intel Corporation. All Rights Reserved.

Analyze and Optimize Code• Serial Optimizations

– Serial optimizations along the critical path should affect execution time.• Parallel Optimizations

– Red/Orange: Look to locations where the concurrency level is less than optimal– Cruise, Blocking, Impact: Try increasing the level of parallelism– Blocking time: Try reducing wait for external object– Impact time: This represents parallel imbalance or synchronization object contention

• Analyze benefit of increasing number of processors

T1

T2

T3

E1 E2 E3 E4 E5 E6 E7 E8 E9 E10 E11 E12E0

wait for T2 & T3

wait for lock L

release lock L

wait for external event

Lock L

Join T3

Event E

Page 39: Intel Threading Tools - polaris.cs.uiuc.edupolaris.cs.uiuc.edu/~padua/cs498DHP/threading.pdf · -10-Copyright © 2005 Intel Corporation. All Rights Reserved. #include

- 39 -

Copyright © 2005 Intel Corporation. All Rights Reserved.

Case Study

Page 40: Intel Threading Tools - polaris.cs.uiuc.edupolaris.cs.uiuc.edu/~padua/cs498DHP/threading.pdf · -10-Copyright © 2005 Intel Corporation. All Rights Reserved. #include

- 40 -

Copyright © 2005 Intel Corporation. All Rights Reserved.

Case Study: simple data decomposition

• Master thread does:– initial work– creates 4 worker threads and 1 verifier thread– waits for them to finish

• Worker threads work in parallel• Verifier thread verifies each work segment

Page 41: Intel Threading Tools - polaris.cs.uiuc.edupolaris.cs.uiuc.edu/~padua/cs498DHP/threading.pdf · -10-Copyright © 2005 Intel Corporation. All Rights Reserved. #include

- 41 -

Copyright © 2005 Intel Corporation. All Rights Reserved.

Parallel OptimizationsStep 1

Understand Threading Behavior

Page 42: Intel Threading Tools - polaris.cs.uiuc.edupolaris.cs.uiuc.edu/~padua/cs498DHP/threading.pdf · -10-Copyright © 2005 Intel Corporation. All Rights Reserved. #include

- 42 -

Copyright © 2005 Intel Corporation. All Rights Reserved.

Critical Paths View• Able to compare multiple

runs of the same or different applications

• Colors quickly give general idea of program behavior

Page 43: Intel Threading Tools - polaris.cs.uiuc.edupolaris.cs.uiuc.edu/~padua/cs498DHP/threading.pdf · -10-Copyright © 2005 Intel Corporation. All Rights Reserved. #include

- 43 -

Copyright © 2005 Intel Corporation. All Rights Reserved.

Timeline View• Gain an

overview perspective on thread behavior and CP flow

• Notice master thread forking 4 worker threads and 1 verifier thread

• Notice imbalance in top example

Page 44: Intel Threading Tools - polaris.cs.uiuc.edupolaris.cs.uiuc.edu/~padua/cs498DHP/threading.pdf · -10-Copyright © 2005 Intel Corporation. All Rights Reserved. #include

- 44 -

Copyright © 2005 Intel Corporation. All Rights Reserved.

View all transitions• By default, 2.2 will only record transitions on the

critical path.• Set “TP_PREVIEW” environment variable to

view details of all contended transitions.

Page 45: Intel Threading Tools - polaris.cs.uiuc.edupolaris.cs.uiuc.edu/~padua/cs498DHP/threading.pdf · -10-Copyright © 2005 Intel Corporation. All Rights Reserved. #include

- 45 -

Copyright © 2005 Intel Corporation. All Rights Reserved.

Step 2Select Data set

Page 46: Intel Threading Tools - polaris.cs.uiuc.edupolaris.cs.uiuc.edu/~padua/cs498DHP/threading.pdf · -10-Copyright © 2005 Intel Corporation. All Rights Reserved. #include

- 46 -

Copyright © 2005 Intel Corporation. All Rights Reserved.

Change Critical Path Target

• If the end of the application is not target of interest, then enable technology preview features with “TP_PREVIEW=1”

• This allows you to specify an arbitrary location for the critical path target

Page 47: Intel Threading Tools - polaris.cs.uiuc.edupolaris.cs.uiuc.edu/~padua/cs498DHP/threading.pdf · -10-Copyright © 2005 Intel Corporation. All Rights Reserved. #include

- 47 -

Copyright © 2005 Intel Corporation. All Rights Reserved.

Time-based Filtering• Select time region of interest. Focus on

processing time, or at least ignore startup time.

Page 48: Intel Threading Tools - polaris.cs.uiuc.edupolaris.cs.uiuc.edu/~padua/cs498DHP/threading.pdf · -10-Copyright © 2005 Intel Corporation. All Rights Reserved. #include

- 48 -

Copyright © 2005 Intel Corporation. All Rights Reserved.

Step 3Analyze Results

Page 49: Intel Threading Tools - polaris.cs.uiuc.edupolaris.cs.uiuc.edu/~padua/cs498DHP/threading.pdf · -10-Copyright © 2005 Intel Corporation. All Rights Reserved. #include

- 49 -

Copyright © 2005 Intel Corporation. All Rights Reserved.

Profile View

• The Critical Path is represented as a stacked histogram

• Data may be divided up in different ways– Concurrency– Time thread is on CP– Objects causing

contention/imbalance for CP

– Source locations of waiting for the CP

Page 50: Intel Threading Tools - polaris.cs.uiuc.edupolaris.cs.uiuc.edu/~padua/cs498DHP/threading.pdf · -10-Copyright © 2005 Intel Corporation. All Rights Reserved. #include

- 50 -

Copyright © 2005 Intel Corporation. All Rights Reserved.

Concurrency Profile

• Understand how parallel your application really is, how well is it utilizing the machine• Impact time means there is contention or imbalance preventing a higher concurrency level• Focus on underutilized time. Demonstrates OK vs. Bad impact time

Impact here is

OK

Impact here is

preventing parallelism.Let’s look closer…

(example reflects 4 processor machine)

Page 51: Intel Threading Tools - polaris.cs.uiuc.edupolaris.cs.uiuc.edu/~padua/cs498DHP/threading.pdf · -10-Copyright © 2005 Intel Corporation. All Rights Reserved. #include

- 51 -

Copyright © 2005 Intel Corporation. All Rights Reserved.

Threads Profile

• Focus on undersubscribed time• See which threads are on the Critical Path preventing parallelism• Halos represent total thread lifetime and active thread lifetime

Undersubscribedimbalance

fixedThread waiting

Thread active

Undersubscribedimbalance

Fixed!

What can we do next to improve

parallelism?

Page 52: Intel Threading Tools - polaris.cs.uiuc.edupolaris.cs.uiuc.edu/~padua/cs498DHP/threading.pdf · -10-Copyright © 2005 Intel Corporation. All Rights Reserved. #include

- 52 -

Copyright © 2005 Intel Corporation. All Rights Reserved.

Threads Profile• Focus on

undersubscribed time• This startup time

could have been filtered out of results

• Let’s assume we can’t easily parallelize this work

Initial master thread startup

Page 53: Intel Threading Tools - polaris.cs.uiuc.edupolaris.cs.uiuc.edu/~padua/cs498DHP/threading.pdf · -10-Copyright © 2005 Intel Corporation. All Rights Reserved. #include

- 53 -

Copyright © 2005 Intel Corporation. All Rights Reserved.

Threads Profile• Focus on

undersubscribed time• Verifier thread still

has serial time• Let’s filter and group

the bar by synchronization object

Let’s filter and group this by

object

Page 54: Intel Threading Tools - polaris.cs.uiuc.edupolaris.cs.uiuc.edu/~padua/cs498DHP/threading.pdf · -10-Copyright © 2005 Intel Corporation. All Rights Reserved. #include

- 54 -

Copyright © 2005 Intel Corporation. All Rights Reserved.

Object Profile• See which

synchronization object are causing imbalance or contention

• Now let’s view the source code for one of these objects

Let’s show transition

source view

Tooltips provide object

information

Page 55: Intel Threading Tools - polaris.cs.uiuc.edupolaris.cs.uiuc.edu/~padua/cs498DHP/threading.pdf · -10-Copyright © 2005 Intel Corporation. All Rights Reserved. #include

- 55 -

Copyright © 2005 Intel Corporation. All Rights Reserved.

Step 4Modify Application

Page 56: Intel Threading Tools - polaris.cs.uiuc.edupolaris.cs.uiuc.edu/~padua/cs498DHP/threading.pdf · -10-Copyright © 2005 Intel Corporation. All Rights Reserved. #include

- 56 -

Copyright © 2005 Intel Corporation. All Rights Reserved.

Transition Source Locations

• Current thread transitions out to the next thread on the critical path.

T1

T2

T3

E1 E2 E3 E4 E5 E6 E7 E8 E9 E10 E11 E12E0

wait for T2 & T3

wait for lock L

release lock L

wait for external event

acquire lock L

• next thread receives the signal.

Page 57: Intel Threading Tools - polaris.cs.uiuc.edupolaris.cs.uiuc.edu/~padua/cs498DHP/threading.pdf · -10-Copyright © 2005 Intel Corporation. All Rights Reserved. #include

- 57 -

Copyright © 2005 Intel Corporation. All Rights Reserved.

Source View

• View previous, in, out, and next callsites associated with this transition.• Traverse the callstack for each one• Directly view the source code to see where code optimizations should be made.

Page 58: Intel Threading Tools - polaris.cs.uiuc.edupolaris.cs.uiuc.edu/~padua/cs498DHP/threading.pdf · -10-Copyright © 2005 Intel Corporation. All Rights Reserved. #include

- 58 -

Copyright © 2005 Intel Corporation. All Rights Reserved.

Step 5REPEAT

Page 59: Intel Threading Tools - polaris.cs.uiuc.edupolaris.cs.uiuc.edu/~padua/cs498DHP/threading.pdf · -10-Copyright © 2005 Intel Corporation. All Rights Reserved. #include

- 59 -

Copyright © 2005 Intel Corporation. All Rights Reserved.

Verify and Compare New Results

• Critical Paths View• Able to compare

multiple runs of the same or different applications

• Colors give general idea of program behavior

Page 60: Intel Threading Tools - polaris.cs.uiuc.edupolaris.cs.uiuc.edu/~padua/cs498DHP/threading.pdf · -10-Copyright © 2005 Intel Corporation. All Rights Reserved. #include

- 60 -

Copyright © 2005 Intel Corporation. All Rights Reserved.

Threads Profile

• Serial impact time has become parallel time• Total bar heights (thus total run-time) is less

ImbalanceChecker thread serial

portion

Verifier thread has

been parallelized

Page 61: Intel Threading Tools - polaris.cs.uiuc.edupolaris.cs.uiuc.edu/~padua/cs498DHP/threading.pdf · -10-Copyright © 2005 Intel Corporation. All Rights Reserved. #include

- 61 -

Copyright © 2005 Intel Corporation. All Rights Reserved.

Serial Optimizations

Page 62: Intel Threading Tools - polaris.cs.uiuc.edupolaris.cs.uiuc.edu/~padua/cs498DHP/threading.pdf · -10-Copyright © 2005 Intel Corporation. All Rights Reserved. #include

- 62 -

Copyright © 2005 Intel Corporation. All Rights Reserved.

Sampling Integration• Use Critical Path to focus serial optimization

energy on code that will more likely affect program run time.1. Enable both TP and Sampling collectors in VTune2. From TP Timeline view select a region of time:

• View sampling for specified region(s)• View sampling for specified region(s), but just samples on

Critical Path

• Also useful to gain context of what threads are doing at source locations other than contended transitions.

Page 63: Intel Threading Tools - polaris.cs.uiuc.edupolaris.cs.uiuc.edu/~padua/cs498DHP/threading.pdf · -10-Copyright © 2005 Intel Corporation. All Rights Reserved. #include

- 63 -

Copyright © 2005 Intel Corporation. All Rights Reserved.

Summary : Thread Profiler• Use to profile and optimize threaded

behavior in an iterative process• Thread Profiler collects utilization and

critical path; correlates with synchronization objects, threads, and source

• GUI allows you to slice and dice profile data and view the corresponding source code