TRANSCRIPT
06 April 2006
Parallel Performance Wizard:
Analysis Module
Professor Alan D. George, Principal Investigator
Mr. Hung-Hsun Su, Sr. Research Assistant
Mr. Adam Leko, Sr. Research Assistant
Mr. Bryan Golden, Research Assistant
Mr. Hans Sherburne, Research Assistant
Mr. Max Billingsley, Research Assistant
Mr. Josh Hartman, Undergraduate Volunteer
HCS Research Laboratory
University of Florida
Outline
Introduction
Single run analysis
Multiple run analysis
Conclusions
Demo
Q&A
Introduction
Analysis Module
Goals of the A module: bottleneck detection, primitive bottleneck resolution, and code transformation (future).
To reduce the complexity of analysis, the idea of a program block is used (similar to the BSP model). A block is the region between two adjacent global synchronization points (GSPs); more specifically, each block starts when the first node completes the 1st GSP (i.e., exits sync. wait) and ends when the last node enters the 2nd GSP (i.e., calls sync. notify). See the sketch at the end of this slide.
I + M modules: gathering useful event data
P module: displaying data in an intuitive way to the user
A module: bottleneck detection and resolution
[Diagram: event data from the I+M modules feeds the A module's bottleneck detection, which drives bottleneck resolution and code transformation; results are passed to the Presentation Module.]
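As a minimal UPC sketch (illustrative code, not taken from the tool), upc_barrier calls act as the GSPs that delimit program blocks:

    #include <upc.h>

    shared int data[THREADS];

    int main(void) {
        /* Startup (S), then block B:1: local work plus a shared write */
        data[MYTHREAD] = MYTHREAD;
        upc_barrier;                                  /* GSP:1 */
        /* Block B:2: each thread reads its neighbor's element */
        int left = data[(MYTHREAD + THREADS - 1) % THREADS];
        upc_barrier;                                  /* GSP:2 */
        /* Block B:3, then termination (T) */
        return left >= 0 ? 0 : 1;
    }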
Parallel Program: Regions
Using the definition of block, a parallel program (P) can be divided logically into:
Startup (S), Block #1 (B:1), GSP #1 (GSP:1), ..., Block #M-1 (B:M-1), GSP #M-1 (GSP:M-1), Block #M (B:M), Termination (T)
where M = number of blocks.
[Timeline figure: node IDs 0-2 vs. time, showing regions S, B:1, GSP:1, B:2, GSP:2, B:3, T. Legend: sync. region (GSP), program block region (B), startup/termination region (S/T), program start/end, sync. operation.]
Parallel Program: Time
Time (S) & Time (T):
- Ideally the same for systems with an equal number of nodes
- System-dependent (compiler, network, etc.); not much users can do to shorten the time
Time (GSP:1) .. Time (GSP:M-1):
- Ideally the same for systems with an equal number of nodes
- System-dependent; not much users can do to shorten the time
- Possible redundancy
Time (B:1) .. Time (B:M):
- Varies greatly depending upon local processing (computation & I/O), remote data transfer, and group synchronization (also point-to-point) operations
- User actions greatly influence the time
- Possible redundancy
Time (P) = Time (S) + Time (B:1) + Time (GSP:1) + ... + Time (GSP:M-1) + Time (B:M) + Time (T)
Single Run Analysis
Optimal Implementation (1)
Assumption: the system architecture (environment, system size, etc.) is fixed.
To classify performance bottlenecks, we start off with a definition of an ideal situation and then characterize each bottleneck as a deviation from that ideal case.
Unfortunately, it is nearly impossible to define the absolute ideal program. The absolute best algorithm? The best algorithm for solving the problem on a particular system environment? (The best algorithm in theory is not necessarily an optimal solution on a particular system.)
However, if we fix the algorithm, it is possible to define an optimal implementation for that algorithm.
Definition: a code version is an optimal implementation on a particular system environment if it executes with the shortest Time (P) when compared with other versions that implement the same algorithm.
Optimal Implementation (2)
To obtain the shortest Time (P)*:
- Smallest M (M = number of blocks): no global synchronization redundancy
- Shortest Time (GSP:1), ..., Time (GSP:M-1): each synchronization takes minimal time to complete (no sync. delay)
- Shortest Time (B:1), ..., Time (B:M):
  - No local redundancy (computation & I/O); all local operations take minimal time to complete (no local delay)
  - No remote redundancy (data transfer & group synchronization); all data transfers take minimal time to complete (no transfer delay); the number of data transfers is minimal (good data locality)
*All variables independent of each other
Min (Time (P)) = Min (Time (S) + Time (T)) + Min (Time (B:1) + Time (GSP:1) + ... + Time (B:M))
              = Const. + Min (Σ(X=1..M) Time (B:X) + Σ(Y=1..M-1) Time (GSP:Y))
[Timeline figure: node IDs 0-2. Legend: local computation, write/send, read/receive, transfer overhead.]
Global Synchronization Redundancy (goal: minimize the number of blocks M)
Detect possibly redundant global synchronizations.
Effect of a global synchronization: all nodes see all shared variables having the same value after the synchronization.
Definition: a global synchronization is redundant if there is no read/write of the same variable across the two adjacent program blocks separated by the global synchronization point.
Detection: check for the existence of reads/writes to the same variable from different nodes between adjacent blocks.
Resolution: highlight possible global synchronization redundancy points.
Mode: tracing
Roadblocks: tracking local accesses to shared variables; variable aliasing, e.g.:

    shared int sh_x;
    int *x = (int *) &sh_x;   /* private alias to the shared variable */
    *x = 2;                   /* local write that bypasses shared-access instrumentation */

A concrete example follows the figure below.
[Timeline figure: node IDs 0-2 illustrating a global synchronization redundancy.]
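For concreteness, a hypothetical UPC fragment in which the second barrier meets the redundancy definition above (no node touches the same shared variable in the two blocks it separates), while the first barrier is required:

    #include <upc.h>

    shared int a[THREADS];
    shared int b[THREADS];

    int main(void) {
        a[MYTHREAD] = 2 * MYTHREAD;              /* B:1 writes a[] */
        upc_barrier;                             /* GSP:1 needed: a[] is accessed on both sides */
        int t = a[(MYTHREAD + 1) % THREADS];     /* B:2 reads a[] */
        upc_barrier;                             /* GSP:2 is a redundancy candidate: B:2 and B:3 share no variable */
        b[MYTHREAD] = t;                         /* B:3 writes only b[] */
        return 0;
    }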
Local Redundancy (reduce Time (B:X))
Computation:
- Most good compilers remove computation redundancies as part of the sequential optimization process
- Too expensive for our tool to perform
- Detection: N/A
I/O:
- Difficult to determine whether an I/O operation is redundant (requires checking the content of the I/O operation)
- Even if an operation is redundant, it might be desirable (e.g., displaying some information on the screen)
- Not practical for our tool to perform
- Detection: N/A
[Timeline figure: node IDs 0-2 illustrating a local redundancy.]
Remote Redundancy: Group Synchronization (reduce Time (B:X))
Similar to the global synchronization case, except it applies to a sub-group of the nodes (including point-to-point synchronization such as locks).
Additional roadblocks (on top of those for global synchronization): consistency constraints; overlapping group synchronizations.
Too expensive and complex to include in the first release.
Detection: N/A
[Timeline figure: node IDs 0-2 illustrating a group synchronization redundancy.]
Remote Redundancy: Data Transfer (1) (reduce Time (B:X))
Deals with possible transfer redundancies.
Within a single program block, for operations originating from the same node:
- Read-Read: removable if no write operation exists for all nodes
- Read-Write, Write-Read: not removable
- Write-Write: removable if no read operation exists for all nodes
Within a single program block, operations originating from different nodes: not removable.
Across adjacent program blocks, for operations originating from the same node:
- Read-Read: removable if no write operation exists for all nodes in both program blocks
- Read-Write, Write-Read: not removable
- Write-Write: removable if no read operation exists for all nodes in both program blocks
Across adjacent program blocks, operations originating from different nodes: not removable.
Combined with global synchronization redundancy checking, only the single-program-block case is needed (GSP check, then transfer check).
Remote Redundancy: Data Transfer (2)
Detection: group the operations by variable, and for each node with only reads (or only writes), check whether any other node performs a write (or read) on the same variable in the same block; a simplified sketch follows the figure below.
Resolution: highlight possible redundant data-transfer operations.
Mode: tracing
Roadblocks: tracking local accesses to shared variables; variable aliasing.
[Figure: four two-node examples of removable pairs: same node, same block R-R redundancy (R(X) twice); same node, same block W-W redundancy (W(X) twice); same node, adjacent blocks R-R redundancy; same node, adjacent blocks W-W redundancy.]
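A simplified C sketch of the single-block check, assuming a trace record format invented here for illustration (the slides do not show the tool's actual trace format):

    #include <stdio.h>

    typedef enum { READ, WRITE } OpType;

    typedef struct {       /* hypothetical trace record */
        int    node;       /* node that issued the operation */
        int    block;      /* program block the operation falls in */
        int    var;        /* id of the shared variable touched */
        OpType type;
    } TraceOp;

    /* A same-node, same-block R-R (or W-W) pair on a variable is a
       removal candidate when no node performs the conflicting access
       (write or read, respectively) on that variable in the block. */
    static int redundant_pair(const TraceOp *t, int n, int i, int j) {
        if (t[i].node != t[j].node || t[i].block != t[j].block ||
            t[i].var != t[j].var || t[i].type != t[j].type)
            return 0;
        OpType conflicting = (t[i].type == READ) ? WRITE : READ;
        for (int k = 0; k < n; k++)
            if (t[k].block == t[i].block && t[k].var == t[i].var &&
                t[k].type == conflicting)
                return 0;
        return 1;
    }

    int main(void) {
        TraceOp trace[] = {   /* node 1 reads var 0 twice; no one writes it */
            {1, 0, 0, READ}, {1, 0, 0, READ}, {2, 0, 1, WRITE},
        };
        int n = (int)(sizeof trace / sizeof trace[0]);
        for (int i = 0; i < n; i++)
            for (int j = i + 1; j < n; j++)
                if (redundant_pair(trace, n, i, j))
                    printf("candidate redundancy: node %d, block %d, var %d\n",
                           trace[i].node, trace[i].block, trace[i].var);
        return 0;
    }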
Global Synchronization Delay (1) (reduce Time (GSP:Y))
One or more nodes take much longer to exit the global synchronization point for that program block.
The delay is most likely due to network congestion or work-sharing delay; there is no direct way for the user to alleviate this behavior.
Detection: compare the actual synchronization time to the expected synchronization time (a small sketch follows).
- Tracing: each global synchronization
- Profiling: two possibilities: during execution (each global synchronization) or after execution (average of all global synchronizations)
Resolution: N/A
Mode: tracing & profiling
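A minimal C sketch of that comparison (the 50% tolerance and the microsecond units are assumptions for illustration; the slides do not specify them):

    /* Flag a global synchronization as delayed when its measured time
       exceeds the expected time by a tolerance factor. */
    int sync_delayed(double actual_us, double expected_us) {
        const double tolerance = 1.5;   /* 50% slack, illustrative */
        return actual_us > expected_us * tolerance;
    }

    /* Profiling mode, after execution: take the expected time as the
       average over all observed global synchronizations. */
    double expected_from_profile(const double *times, int n) {
        double sum = 0.0;
        for (int i = 0; i < n; i++)
            sum += times[i];
        return sum / n;
    }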
Global Synchronization Delay (2)
[Timeline figure: node IDs 0-2 illustrating a global synchronization delay.]
Local Delay (reduce Time (B:X))
Computation and I/O delay due to context switching, cache misses, resource contention, etc.
Detection: use hardware counters as an indication, e.g., hardware interrupt count, L2 cache miss count, cycles stalled waiting for memory access, cycles stalled waiting for a resource, etc. (see the sketch below).
Resolution: N/A
Mode: tracing & profiling
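One common way to read such counters is the PAPI library; this is a hedged sketch using PAPI's classic high-level API (the slides do not name the counter library the tool uses):

    #include <stdio.h>
    #include <papi.h>

    int main(void) {
        /* L2 total cache misses, and cycles stalled on any resource */
        int events[2] = { PAPI_L2_TCM, PAPI_RES_STL };
        long long counts[2];

        if (PAPI_start_counters(events, 2) != PAPI_OK)
            return 1;

        /* ... body of program block B:X ... */

        if (PAPI_stop_counters(counts, 2) != PAPI_OK)
            return 1;
        printf("L2 misses: %lld, stalled cycles: %lld\n",
               counts[0], counts[1]);
        return 0;
    }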
Data Transfer Delay (1)
A data transfer took longer than expected.
Possible causes: network delay/work-sharing delay; waiting on data synchronization (to preserve consistency); multiple small transfers (when a bulk transfer is possible).
Detection: compare the actual time to the expected value (obtained using a script file) for that transfer size.
Resolution:
- Suggest an alternate order of data-transfer operations that leads to minimal delay (2nd cause, tracing)
- Determine whether a bulk transfer is possible (3rd cause, tracing; see the sketch below)
Mode: tracing
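To illustrate the third cause and its resolution, a hypothetical UPC fragment contrasting element-wise remote writes with a single upc_memput (the array, sizes, and thread roles are invented; run with at least two threads):

    #include <upc.h>
    #define N 1024

    shared [] double buf[N];   /* indefinite block size: all of buf[] has affinity to thread 0 */
    double local[N];

    int main(void) {
        for (int i = 0; i < N; i++)
            local[i] = i;

        if (MYTHREAD == 1) {
            for (int i = 0; i < N; i++)   /* N small remote puts, one per element ... */
                buf[i] = local[i];
            upc_memput(buf, local, N * sizeof(double));   /* ... vs. one bulk transfer */
        }
        upc_barrier;
        return 0;
    }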
Data Transfer Delay (2)
[Timeline figure: node IDs 0-2 illustrating a transfer delay.]
Poor Data Locality (reduce Time (B:X))
Slowdown of program execution due to poor distribution of shared variables, causing excessive remote data accesses.
Detection: track the number of local and remote accesses, calculate the ratio, and compare it to a pre-defined threshold (a counting sketch follows the figure below).
Resolution: calculate the optimal distribution that leads to the smallest amount of remote access (i.e., the best local/remote ratio) for the entire program.
Mode: tracing & profiling
Roadblocks: tracking local accesses is expensive; variable aliasing; determining the threshold value.
[Timeline figure: node IDs 0-2 illustrating poor data locality.]
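A hedged UPC sketch of the counting step (block size, array size, and access pattern are invented; assumes a static thread environment): with a cyclic access pattern over a block-distributed array, most accesses land on other threads, driving the local/remote ratio down:

    #include <stdio.h>
    #include <upc.h>
    #define N 16

    shared [4] int a[N];   /* block size 4: elements 0-3 on thread 0, 4-7 on thread 1, ... */

    int main(void) {
        int local = 0, remote = 0;
        for (int i = MYTHREAD; i < N; i += THREADS) {   /* cyclic access pattern */
            if (upc_threadof(&a[i]) == MYTHREAD)
                local++;                                /* element has affinity to this thread */
            else
                remote++;                               /* remote access */
        }
        printf("thread %d: %d local, %d remote\n", MYTHREAD, local, remote);
        return 0;
    }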
General Load Imbalance
One or more nodes are idle for a period of time, i.e., one or more nodes take longer to complete a block than the others.
Identifiable with the help of the Timeline view.
A generalized bottleneck caused by one or more of the cases previously described: global synchronization delay, local redundancy, local delay, remote redundancy, remote delay.
Detection: maybe
Mode: tracing
[Timeline figure: node IDs 0-2 illustrating a general load imbalance.]
Multiple Runs Analysis
Speedup
Execution-time comparison of the program running on different numbers of nodes.
Several variations:
- Direct time comparison between actual runs
- Scalability factor calculation: calculates the expected performance at higher numbers of nodes
Comparison possible at various levels: program, block, function (top 5, or those occupying x% of total time).
Mode: tracing & profiling
[Figure: side-by-side block decomposition (Block 1, Block 2) of runs with N = 1, 2, and 4 nodes for block-level comparison.]
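As a worked illustration with made-up numbers (not measurements from the tool):

    Speedup(N) = Time(P, 1 node) / Time(P, N nodes)
    e.g., Time(P) = 100 s on 1 node and 30 s on 4 nodes:
    Speedup(4) = 100 / 30 ≈ 3.3, efficiency = 3.3 / 4 ≈ 83%

A scalability factor fitted to such points is what allows the expected performance at higher node counts to be estimated.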
Conclusions
Summary
The concept of the program block simplifies the task of bottleneck detection.
Most of the single-run bottlenecks characterized will be detected (except local redundancy, local delay, and group synchronization redundancy).
Appropriate single-run bottleneck resolution strategies will be developed: data-transfer reordering, bulk-transfer grouping, and optimal data distribution calculation.
Scalability comparisons are part of the multiple-run analysis.
Missing cases?
Future Work
Refine bottleneck characterizations as needed
Test and refine detection strategies
Test and refine resolution strategies
Find code transformation techniques
Extend to other programming languages
Demo
Optimal Data Distribution (1)
Goal: find the data distribution pattern for a shared array which leads to the smallest amount of remote access for that particular array.
Multiple versions tried:
Brute force, free:
- Iterate through all possible combinations of data distribution, with no consideration of block size (UPC) / array size (SHMEM), and find the one with the overall smallest amount of remote access
- Pro: the optimal distribution can be found
- Con: time complexity of O(N^K) (polynomial in N, exponential in K), where N = number of nodes and K = number of elements in the array; significant effort is needed to transform the code into one that uses the resulting distribution
Brute force, block restricted:
- Same as brute force, free, except that the number of elements that can be allocated to each node is fixed (currently the same as in the original distribution)
- Pro: a sub-optimal distribution can be found
- Con: still O(N^K)-type complexity (although faster than brute force, free); still requires significant effort to transform the code, though easier than brute force, free
Optimal Data Distribution (2)
Multiple versions tried (cont.):
Max first, free:
- Heuristic approach: each element is assigned to the node which accesses it the most often (see the sketch below)
- Pro: the optimal distribution can be found; complexity is only O(N)
- Con: some effort needed to transform the code
Max first, block restricted:
- Same as max first, free, except that the number of elements that can be allocated to each node is fixed (currently the same as in the original distribution)
- Pro: complexity is only O(N)
- Con: the resulting distribution is often not optimal; some effort needed to transform the code (less than max first, free)
Optimal block size:
- Attempt to find the optimal block size (in UPC) that leads to the smallest amount of remote access (can also be extended to cover SHMEM)
- Brute force + heuristic approach: iterate through all possible block sizes; for each block size, calculate the number of elements that do not reside on the node that uses them the most often
- Pro: very easy for the user to modify their code
- Con: the resulting distribution is often not optimal; complexity is O(N log N) with the current method
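A minimal C sketch of the max first, free heuristic described above (the access counts are fabricated; the block-restricted variant would additionally enforce a per-node capacity limit):

    #include <stdio.h>
    #define NODES 4
    #define ELEMS 8

    int main(void) {
        /* access[e][n] = number of accesses to element e by node n */
        int access[ELEMS][NODES] = {
            {9, 1, 0, 0}, {0, 7, 2, 0}, {1, 1, 8, 0}, {0, 0, 2, 6},
            {5, 5, 0, 0}, {0, 0, 1, 9}, {3, 0, 0, 4}, {8, 0, 0, 1},
        };
        for (int e = 0; e < ELEMS; e++) {
            int best = 0;                  /* assign e to its most frequent accessor */
            for (int n = 1; n < NODES; n++)
                if (access[e][n] > access[e][best])
                    best = n;
            printf("element %d -> node %d\n", e, best);
        }
        return 0;
    }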
Optimal Data Distribution (3)
[Figure: element-to-node maps for the original distribution, max first free, max first block restricted #1, max first block restricted #2, and optimal block size (3 in this case). *The color of a square indicates which node the element physically resides on (0 = blue, 1 = red, 2 = green, 3 = black). **The shade of a square indicates which node accesses the element most often (0 = none, 1 = slanted, 2 = cross, 3 = vertical).]
Optimal Data Distribution (4)
Approach                        Time       Accuracy           Applicability
Brute force, free               Very slow  Very high          SHMEM
Brute force, block restricted   Slow       High               SHMEM & UPC
Max first, free                 Fast       High - very high?  SHMEM
Max first, block restricted     Fast       Average - high     SHMEM & UPC
Optimal block size              Average    Average            SHMEM & UPC
Open issue: how to deal with bulk transfers?
Future plan: devise a faster, more accurate algorithm.
Q & A