TRANSCRIPT
06 April 2006
Parallel Performance Wizard:
Analysis Module
Professor Alan D. George, Principal Investigator
Mr. Hung-Hsun Su, Sr. Research Assistant
Mr. Adam Leko, Sr. Research Assistant
Mr. Bryan Golden, Research Assistant
Mr. Hans Sherburne, Research Assistant
Mr. Max Billingsley, Research Assistant
Mr. Josh Hartman, Undergraduate Volunteer
HCS Research Laboratory
University of Florida
Outline
Introduction
Single run analysis
Multiple run analysis
Conclusions
Demo
Q&A
Introduction
Analysis Module
Goals of the A module: bottleneck detection, primitive bottleneck resolution, and code transformation (future).
To reduce the complexity of analysis, the idea of a program block is used (similar to the BSP model). A block is the region between two adjacent global synchronization points (GSPs); more specifically, each block starts when the first node completes the 1st GSP (i.e., exits sync. wait) and ends when the last node enters the 2nd GSP (i.e., calls sync. notify). See the sketch at the end of this slide.
I + M modules: gathering useful event data
P module: displaying data in an intuitive way to the user
A module: bottleneck detection and resolution
[Diagram: event data from the I+M modules feeds the A module's bottleneck detection, which drives bottleneck resolution and code transformation; results are passed to the Presentation Module.]
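As a minimal UPC sketch (illustrative code, not taken from the tool), upc_barrier calls act as the GSPs that delimit program blocks:

    #include <upc.h>

    shared int data[THREADS];

    int main(void) {
        /* Startup (S), then block B:1: local work plus a shared write */
        data[MYTHREAD] = MYTHREAD;
        upc_barrier;                                  /* GSP:1 */
        /* Block B:2: each thread reads its neighbor's element */
        int left = data[(MYTHREAD + THREADS - 1) % THREADS];
        upc_barrier;                                  /* GSP:2 */
        /* Block B:3, then termination (T) */
        return left >= 0 ? 0 : 1;
    }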
Parallel Program: Regions
Using the definition of block, a parallel program (P) can be divided logically into:
Startup (S), Block #1 (B:1), GSP #1 (GSP:1), ..., Block #M-1 (B:M-1), GSP #M-1 (GSP:M-1), Block #M (B:M), Termination (T)
where M = number of blocks.
[Timeline figure: node IDs 0-2 vs. time, showing regions S, B:1, GSP:1, B:2, GSP:2, B:3, T. Legend: sync. region (GSP), program block region (B), startup/termination region (S/T), program start/end, sync. operation.]
Parallel Program: Time
Time (S) & Time (T):
- Ideally the same for systems with an equal number of nodes
- System-dependent (compiler, network, etc.); not much users can do to shorten the time
Time (GSP:1) .. Time (GSP:M-1):
- Ideally the same for systems with an equal number of nodes
- System-dependent; not much users can do to shorten the time
- Possible redundancy
Time (B:1) .. Time (B:M):
- Varies greatly depending upon local processing (computation & I/O), remote data transfer, and group synchronization (also point-to-point) operations
- User actions greatly influence the time
- Possible redundancy
Time (P) = Time (S) + Time (B:1) + Time (GSP:1) + ... + Time (GSP:M-1) + Time (B:M) + Time (T)
Single Run Analysis
Optimal Implementation (1)
Assumption: the system architecture (environment, system size, etc.) is fixed.
To classify performance bottlenecks, we start off with a definition of an ideal situation and then characterize each bottleneck as a deviation from that ideal case.
Unfortunately, it is nearly impossible to define the absolute ideal program. The absolute best algorithm? The best algorithm for solving the problem on a particular system environment? (The best algorithm in theory is not necessarily an optimal solution on a particular system.)
However, if we fix the algorithm, it is possible to define an optimal implementation for that algorithm.
Definition: a code version is an optimal implementation on a particular system environment if it executes with the shortest Time (P) when compared with other versions that implement the same algorithm.
Optimal Implementation (2)
To obtain the shortest Time (P)*:
- Smallest M (M = number of blocks): no global synchronization redundancy
- Shortest Time (GSP:1), ..., Time (GSP:M-1): each synchronization takes minimal time to complete (no sync. delay)
- Shortest Time (B:1), ..., Time (B:M):
  - No local redundancy (computation & I/O); all local operations take minimal time to complete (no local delay)
  - No remote redundancy (data transfer & group synchronization); all data transfers take minimal time to complete (no transfer delay); the number of data transfers is minimal (good data locality)
*All variables independent of each other
Min (Time (P)) = Min (Time (S) + Time (T)) + Min (Time (B:1) + Time (GSP:1) + ... + Time (B:M))
              = Const. + Min (Σ(X=1..M) Time (B:X) + Σ(Y=1..M-1) Time (GSP:Y))
[Timeline figure: node IDs 0-2. Legend: local computation, write/send, read/receive, transfer overhead.]
Global Synchronization Redundancy (goal: minimize the number of blocks M)
Detect possibly redundant global synchronizations.
Effect of a global synchronization: all nodes see all shared variables having the same value after the synchronization.
Definition: a global synchronization is redundant if there is no read/write of the same variable across the two adjacent program blocks separated by the global synchronization point.
Detection: check for the existence of reads/writes to the same variable from different nodes between adjacent blocks.
Resolution: highlight possible global synchronization redundancy points.
Mode: tracing
Roadblocks: tracking local accesses to shared variables; variable aliasing, e.g.:

    shared int sh_x;
    int *x = (int *) &sh_x;   /* private alias to the shared variable */
    *x = 2;                   /* local write that bypasses shared-access instrumentation */

A concrete example follows the figure below.
[Timeline figure: node IDs 0-2 illustrating a global synchronization redundancy.]
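For concreteness, a hypothetical UPC fragment in which the second barrier meets the redundancy definition above (no node touches the same shared variable in the two blocks it separates), while the first barrier is required:

    #include <upc.h>

    shared int a[THREADS];
    shared int b[THREADS];

    int main(void) {
        a[MYTHREAD] = 2 * MYTHREAD;              /* B:1 writes a[] */
        upc_barrier;                             /* GSP:1 needed: a[] is accessed on both sides */
        int t = a[(MYTHREAD + 1) % THREADS];     /* B:2 reads a[] */
        upc_barrier;                             /* GSP:2 is a redundancy candidate: B:2 and B:3 share no variable */
        b[MYTHREAD] = t;                         /* B:3 writes only b[] */
        return 0;
    }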
Local Redundancy (reduce Time (B:X))
Computation:
- Most good compilers remove computation redundancies as part of the sequential optimization process
- Too expensive for our tool to perform
- Detection: N/A
I/O:
- Difficult to determine whether an I/O operation is redundant (requires checking the content of the I/O operation)
- Even if an operation is redundant, it might be desirable (e.g., displaying some information on the screen)
- Not practical for our tool to perform
- Detection: N/A
[Timeline figure: node IDs 0-2 illustrating a local redundancy.]
Remote Redundancy: Group Synchronization (reduce Time (B:X))
Similar to the global synchronization case, except it applies to a sub-group of the nodes (including point-to-point synchronization such as locks).
Additional roadblocks (on top of those for global synchronization): consistency constraints; overlapping group synchronizations.
Too expensive and complex to include in the first release.
Detection: N/A
[Timeline figure: node IDs 0-2 illustrating a group synchronization redundancy.]
Remote Redundancy: Data Transfer (1) (reduce Time (B:X))
Deals with possible transfer redundancies.
Within a single program block, for operations originating from the same node:
- Read-Read: removable if no write operation exists for all nodes
- Read-Write, Write-Read: not removable
- Write-Write: removable if no read operation exists for all nodes
Within a single program block, operations originating from different nodes: not removable.
Across adjacent program blocks, for operations originating from the same node:
- Read-Read: removable if no write operation exists for all nodes in both program blocks
- Read-Write, Write-Read: not removable
- Write-Write: removable if no read operation exists for all nodes in both program blocks
Across adjacent program blocks, operations originating from different nodes: not removable.
Combined with global synchronization redundancy checking, only the single-program-block case is needed (GSP check, then transfer check).
Remote Redundancy: Data Transfer (2)
Detection: group the operations by variable, and for each node with only reads (or only writes), check whether any other node performs a write (or read) on the same variable in the same block; a simplified sketch follows the figure below.
Resolution: highlight possible redundant data-transfer operations.
Mode: tracing
Roadblocks: tracking local accesses to shared variables; variable aliasing.
[Figure: four two-node examples of removable pairs: same node, same block R-R redundancy (R(X) twice); same node, same block W-W redundancy (W(X) twice); same node, adjacent blocks R-R redundancy; same node, adjacent blocks W-W redundancy.]
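A simplified C sketch of the single-block check, assuming a trace record format invented here for illustration (the slides do not show the tool's actual trace format):

    #include <stdio.h>

    typedef enum { READ, WRITE } OpType;

    typedef struct {       /* hypothetical trace record */
        int    node;       /* node that issued the operation */
        int    block;      /* program block the operation falls in */
        int    var;        /* id of the shared variable touched */
        OpType type;
    } TraceOp;

    /* A same-node, same-block R-R (or W-W) pair on a variable is a
       removal candidate when no node performs the conflicting access
       (write or read, respectively) on that variable in the block. */
    static int redundant_pair(const TraceOp *t, int n, int i, int j) {
        if (t[i].node != t[j].node || t[i].block != t[j].block ||
            t[i].var != t[j].var || t[i].type != t[j].type)
            return 0;
        OpType conflicting = (t[i].type == READ) ? WRITE : READ;
        for (int k = 0; k < n; k++)
            if (t[k].block == t[i].block && t[k].var == t[i].var &&
                t[k].type == conflicting)
                return 0;
        return 1;
    }

    int main(void) {
        TraceOp trace[] = {   /* node 1 reads var 0 twice; no one writes it */
            {1, 0, 0, READ}, {1, 0, 0, READ}, {2, 0, 1, WRITE},
        };
        int n = (int)(sizeof trace / sizeof trace[0]);
        for (int i = 0; i < n; i++)
            for (int j = i + 1; j < n; j++)
                if (redundant_pair(trace, n, i, j))
                    printf("candidate redundancy: node %d, block %d, var %d\n",
                           trace[i].node, trace[i].block, trace[i].var);
        return 0;
    }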
Global Synchronization Delay (1) (reduce Time (GSP:Y))
One or more nodes take much longer to exit the global synchronization point for that program block.
The delay is most likely due to network congestion or work-sharing delay; there is no direct way for the user to alleviate this behavior.
Detection: compare the actual synchronization time to the expected synchronization time (a small sketch follows).
- Tracing: each global synchronization
- Profiling: two possibilities: during execution (each global synchronization) or after execution (average of all global synchronizations)
Resolution: N/A
Mode: tracing & profiling
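A minimal C sketch of that comparison (the 50% tolerance and the microsecond units are assumptions for illustration; the slides do not specify them):

    /* Flag a global synchronization as delayed when its measured time
       exceeds the expected time by a tolerance factor. */
    int sync_delayed(double actual_us, double expected_us) {
        const double tolerance = 1.5;   /* 50% slack, illustrative */
        return actual_us > expected_us * tolerance;
    }

    /* Profiling mode, after execution: take the expected time as the
       average over all observed global synchronizations. */
    double expected_from_profile(const double *times, int n) {
        double sum = 0.0;
        for (int i = 0; i < n; i++)
            sum += times[i];
        return sum / n;
    }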
Global Synchronization Delay (2)
[Timeline figure: node IDs 0-2 illustrating a global synchronization delay.]
Local Delay (reduce Time (B:X))
Computation and I/O delay due to context switching, cache misses, resource contention, etc.
Detection: use hardware counters as an indication, e.g., hardware interrupt count, L2 cache miss count, cycles stalled waiting for memory access, cycles stalled waiting for a resource, etc. (see the sketch below).
Resolution: N/A
Mode: tracing & profiling
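One common way to read such counters is the PAPI library; this is a hedged sketch using PAPI's classic high-level API (the slides do not name the counter library the tool uses):

    #include <stdio.h>
    #include <papi.h>

    int main(void) {
        /* L2 total cache misses, and cycles stalled on any resource */
        int events[2] = { PAPI_L2_TCM, PAPI_RES_STL };
        long long counts[2];

        if (PAPI_start_counters(events, 2) != PAPI_OK)
            return 1;

        /* ... body of program block B:X ... */

        if (PAPI_stop_counters(counts, 2) != PAPI_OK)
            return 1;
        printf("L2 misses: %lld, stalled cycles: %lld\n",
               counts[0], counts[1]);
        return 0;
    }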
Data Transfer Delay (1)
A data transfer took longer than expected.
Possible causes: network delay/work-sharing delay; waiting on data synchronization (to preserve consistency); multiple small transfers (when a bulk transfer is possible).
Detection: compare the actual time to the expected value (obtained using a script file) for that transfer size.
Resolution:
- Suggest an alternate order of data-transfer operations that leads to minimal delay (2nd cause, tracing)
- Determine whether a bulk transfer is possible (3rd cause, tracing; see the sketch below)
Mode: tracing
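To illustrate the third cause and its resolution, a hypothetical UPC fragment contrasting element-wise remote writes with a single upc_memput (the array, sizes, and thread roles are invented; run with at least two threads):

    #include <upc.h>
    #define N 1024

    shared [] double buf[N];   /* indefinite block size: all of buf[] has affinity to thread 0 */
    double local[N];

    int main(void) {
        for (int i = 0; i < N; i++)
            local[i] = i;

        if (MYTHREAD == 1) {
            for (int i = 0; i < N; i++)   /* N small remote puts, one per element ... */
                buf[i] = local[i];
            upc_memput(buf, local, N * sizeof(double));   /* ... vs. one bulk transfer */
        }
        upc_barrier;
        return 0;
    }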
Data Transfer Delay (2)
[Timeline figure: node IDs 0-2 illustrating a transfer delay.]
Poor Data Locality (reduce Time (B:X))
Slowdown of program execution due to poor distribution of shared variables, causing excessive remote data accesses.
Detection: track the number of local and remote accesses, calculate the ratio, and compare it to a pre-defined threshold (a counting sketch follows the figure below).
Resolution: calculate the optimal distribution that leads to the smallest amount of remote access (i.e., the best local/remote ratio) for the entire program.
Mode: tracing & profiling
Roadblocks: tracking local accesses is expensive; variable aliasing; determining the threshold value.
[Timeline figure: node IDs 0-2 illustrating poor data locality.]
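A hedged UPC sketch of the counting step (block size, array size, and access pattern are invented; assumes a static thread environment): with a cyclic access pattern over a block-distributed array, most accesses land on other threads, driving the local/remote ratio down:

    #include <stdio.h>
    #include <upc.h>
    #define N 16

    shared [4] int a[N];   /* block size 4: elements 0-3 on thread 0, 4-7 on thread 1, ... */

    int main(void) {
        int local = 0, remote = 0;
        for (int i = MYTHREAD; i < N; i += THREADS) {   /* cyclic access pattern */
            if (upc_threadof(&a[i]) == MYTHREAD)
                local++;                                /* element has affinity to this thread */
            else
                remote++;                               /* remote access */
        }
        printf("thread %d: %d local, %d remote\n", MYTHREAD, local, remote);
        return 0;
    }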
General Load Imbalance
One or more nodes are idle for a period of time, i.e., one or more nodes take longer to complete a block than the others.
Identifiable with the help of the Timeline view.
A generalized bottleneck caused by one or more of the cases previously described: global synchronization delay, local redundancy, local delay, remote redundancy, remote delay.
Detection: maybe
Mode: tracing
[Timeline figure: node IDs 0-2 illustrating a general load imbalance.]
Multiple Runs Analysis
Speedup
Execution-time comparison of the program running on different numbers of nodes.
Several variations:
- Direct time comparison between actual runs
- Scalability factor calculation: calculates the expected performance at higher numbers of nodes
Comparison possible at various levels: program, block, function (top 5, or those occupying x% of total time).
Mode: tracing & profiling
[Figure: side-by-side block decomposition (Block 1, Block 2) of runs with N = 1, 2, and 4 nodes for block-level comparison.]
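As a worked illustration with made-up numbers (not measurements from the tool):

    Speedup(N) = Time(P, 1 node) / Time(P, N nodes)
    e.g., Time(P) = 100 s on 1 node and 30 s on 4 nodes:
    Speedup(4) = 100 / 30 ≈ 3.3, efficiency = 3.3 / 4 ≈ 83%

A scalability factor fitted to such points is what allows the expected performance at higher node counts to be estimated.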
Conclusions
Summary
The concept of the program block simplifies the task of bottleneck detection.
Most of the single-run bottlenecks characterized will be detected (except local redundancy, local delay, and group synchronization redundancy).
Appropriate single-run bottleneck resolution strategies will be developed: data-transfer reordering, bulk-transfer grouping, and optimal data distribution calculation.
Scalability comparisons are part of the multiple-run analysis.
Missing cases?
Future Work
Refine bottleneck characterizations as needed
Test and refine detection strategies
Test and refine resolution strategies
Find code transformation techniques
Extend to other programming languages
Demo
Optimal Data Distribution (1)
Goal: find the data distribution pattern for a shared array which leads to the smallest amount of remote access for that particular array.
Multiple versions tried:
Brute force, free:
- Iterate through all possible combinations of data distribution, with no consideration of block size (UPC) / array size (SHMEM), and find the one with the overall smallest amount of remote access
- Pro: the optimal distribution can be found
- Con: time complexity of O(N^K) (polynomial in N, exponential in K), where N = number of nodes and K = number of elements in the array; significant effort is needed to transform the code into one that uses the resulting distribution
Brute force, block restricted:
- Same as brute force, free, except that the number of elements that can be allocated to each node is fixed (currently the same as in the original distribution)
- Pro: a sub-optimal distribution can be found
- Con: still O(N^K)-type complexity (although faster than brute force, free); still requires significant effort to transform the code, though easier than brute force, free
Optimal Data Distribution (2)
Multiple versions tried (cont.):
Max first, free:
- Heuristic approach: each element is assigned to the node which accesses it the most often (see the sketch below)
- Pro: the optimal distribution can be found; complexity is only O(N)
- Con: some effort needed to transform the code
Max first, block restricted:
- Same as max first, free, except that the number of elements that can be allocated to each node is fixed (currently the same as in the original distribution)
- Pro: complexity is only O(N)
- Con: the resulting distribution is often not optimal; some effort needed to transform the code (less than max first, free)
Optimal block size:
- Attempt to find the optimal block size (in UPC) that leads to the smallest amount of remote access (can also be extended to cover SHMEM)
- Brute force + heuristic approach: iterate through all possible block sizes; for each block size, calculate the number of elements that do not reside on the node that uses them the most often
- Pro: very easy for the user to modify their code
- Con: the resulting distribution is often not optimal; complexity is O(N log N) with the current method
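A minimal C sketch of the max first, free heuristic described above (the access counts are fabricated; the block-restricted variant would additionally enforce a per-node capacity limit):

    #include <stdio.h>
    #define NODES 4
    #define ELEMS 8

    int main(void) {
        /* access[e][n] = number of accesses to element e by node n */
        int access[ELEMS][NODES] = {
            {9, 1, 0, 0}, {0, 7, 2, 0}, {1, 1, 8, 0}, {0, 0, 2, 6},
            {5, 5, 0, 0}, {0, 0, 1, 9}, {3, 0, 0, 4}, {8, 0, 0, 1},
        };
        for (int e = 0; e < ELEMS; e++) {
            int best = 0;                  /* assign e to its most frequent accessor */
            for (int n = 1; n < NODES; n++)
                if (access[e][n] > access[e][best])
                    best = n;
            printf("element %d -> node %d\n", e, best);
        }
        return 0;
    }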
Optimal Data Distribution (3)
[Figure: element-to-node maps for the original distribution, max first free, max first block restricted #1, max first block restricted #2, and optimal block size (3 in this case). *The color of a square indicates which node the element physically resides on (0 = blue, 1 = red, 2 = green, 3 = black). **The shade of a square indicates which node accesses the element most often (0 = none, 1 = slanted, 2 = cross, 3 = vertical).]
Optimal Data Distribution (4)
Approach                        Time       Accuracy           Applicability
Brute force, free               Very slow  Very high          SHMEM
Brute force, block restricted   Slow       High               SHMEM & UPC
Max first, free                 Fast       High - very high?  SHMEM
Max first, block restricted     Fast       Average - high     SHMEM & UPC
Optimal block size              Average    Average            SHMEM & UPC
Open issue: how to deal with bulk transfers?
Future plan: devise a faster, more accurate algorithm.
Q & A