Assisting Technologies for Program Parallelization
Chikayama/Taura Lab., Masakazu Hayatsu
2003/12/5

Agenda
• Introduction
• Difficulty of Program Parallelization
• Assistant Tools for Program Parallelization
  • SUIF Explorer
  • S-Check
  • Ursa Minor
• Conclusion

Introduction
• Popularization of parallel computers
  • Commercial computers with a very large number of processors
  • Low-end PCs with 2-4 processors
• Performance
  • Progress in uniprocessor speedup is becoming sluggish
⇒ The importance of parallel programs continues to increase

Difficulty of Program Parallelization
• Dependency
• Deadlock
• Data race
These problems must be avoided.
[Figure: A and B both write to X (values 100 and 1); the resulting value is uncertain: 100? 1?]
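
The data race on this slide can be illustrated with a minimal Python sketch. It replays the two possible orderings of the writes deterministically rather than using real threads. The function name `run_in_order` and the assignment of the values 100 and 1 to tasks A and B are assumptions made for illustration only.

```python
def run_in_order(order):
    """Apply the writes of tasks A and B to a shared X in the given order."""
    writes = {"A": 100, "B": 1}  # assumed: A writes 100, B writes 1
    x = None
    for task in order:
        x = writes[task]  # unsynchronized write: the last writer wins
    return x


print(run_in_order("AB"))  # A runs first, then B
print(run_in_order("BA"))  # B runs first, then A
```

Without synchronization, either ordering may occur at run time, so the final value of X is not determined by the program text alone.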

Automatic Parallelization
• Low performance
• Parallelization techniques are fragile
• Knowledge from outside the code is often required

    :
    for (i = 0; i < N; i++) {
        a[f(i)] = 0;  // A
        a[g(i)] = 1;  // B
    }
    :

[Figure: can this loop be parallelized? × / ?]
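
Whether the loop on this slide can run in parallel depends on what `f` and `g` return, which a compiler usually cannot know. The following Python sketch checks this dynamically for one concrete `N`; it is only an illustration, not a static analysis, and the helper name `iterations_independent` is hypothetical.

```python
def iterations_independent(f, g, n):
    """Return True if no array element is written by two different iterations."""
    owner = {}  # element index -> iteration that first wrote it
    for i in range(n):
        for idx in (f(i), g(i)):
            # setdefault records the first writer and returns it afterwards
            if owner.setdefault(idx, i) != i:
                return False  # a different iteration writes the same element
    return True


# Disjoint index sets per iteration: iterations can run in any order.
print(iterations_independent(lambda i: 2 * i, lambda i: 2 * i + 1, 8))

# Iteration i writes a[i+1], which iteration i+1 also writes: order matters.
print(iterations_independent(lambda i: i, lambda i: i + 1, 8))
```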

Development Process
[Flowchart: Design & Improve Model → Manually Optimizing Program → Run → Validity Check and Speedup Evaluation. ○ leads to Done; × leads to Finding Problems (data race, deadlock, …) and back to the model.]

Problem of Manual Parallelization

    (define (RayTracing ViewPoint Vscan nref energy rgb)
      (if (<= nref 4)
          (let ((crashed? (tracer ViewPoint Vscan))) ;crashed ?
            (if (and (not crashed?) (!= nref 0))
                (let* ((hl0 (fcsyn (f+ (f* (vector-ref Vscan 0) (vector-ref Light 0))
                                       (f* (vector-ref Vscan 1) (vector-ref Light 1))
                                       (f* (vector-ref Vscan 2) (vector-ref Light 2)))))
                       (hl (if (f< hl0 0.0) 0.0 hl0))
                       (ihl (f* hl hl hl energy (car beam))))
                  (begin
                    (vector-set! rgb 0 (f+ (vector-ref rgb 0) ihl))
                    (vector-set! rgb 1 (f+ (vector-ref rgb 1) ihl))
                    (vector-set! rgb 2 (f+ (vector-ref rgb 2) ihl)))))
            (if crashed?
                (let* ((P (cdr crashed?)) ;intersection point
                       (m (car crashed?)) ;crashed object
                       (NV (Get-NVector m Vscan P)))
                  (let* ((br (fcsyn (f+ (f* (vector-ref NV 0) (vector-ref Light 0))
                                        (f* (vector-ref NV 1) (vector-ref Light 1))
                                        (f* (vector-ref NV 2) (vector-ref Light 2)))))
                         (br1 (if (f< br 0.0) 0.0 br))
                         (bright (if (and (car sh) (Shadow-Check-One-Or-Matrix (car or-Net) P))
                                     0.0
                                     (f* (f+ br1 0.2) energy (vector-ref m 11)))))
                    (begin
                      (utexture m P)
                      (vector-set! rgb 0 (f+ (vector-ref rgb 0) (f* bright (vector-ref m 13))))
                      (vector-set! rgb 1 (f+ (vector-ref rgb 1) (f* bright (vector-ref m 14))))
                      (vector-set! rgb 2 (f+ (vector-ref rgb 2) (f* bright (vector-ref m 15))))
    ;; (the excerpt is cut off on the slide)

• The user must fully understand many lines of code
• The process is error-prone

Important Factors for an Assistant Tool
• Assist program parallelization
• Combine the benefits of the automatic and manual approaches
  • Automatic: can extract information by the numbers
  • Manual: can use high-level information
• Extract information, and highlight the important information

Extraction of Parallelism

    ;; quick : v: array to be sorted; left, right: range to sort
    (define (quick v left right)
      (if (>= left right)
          v
          (let ((new-left left)
                (new-right right)
                (pivot (vector-ref v (floor (/ (+ left right) 2)))))
            (do () ((> new-left new-right))
              (do () ((>= (vector-ref v new-left) pivot))
                (set! new-left (+ new-left 1)))
              (do () ((<= (vector-ref v new-right) pivot))
                (set! new-right (- new-right 1)))
              (if (<= new-left new-right)
                  (begin
                    (swap v new-left new-right)
                    (set! new-left (+ new-left 1))
                    (set! new-right (- new-right 1)))))
            (begin
              (quick v left new-right)
              (quick v new-left right)))))

    (quick #(4 5 3 1 4 0 5 6) 0 7)

Candidates for parallelization:
( 0R-05-01, 0R-05-02, 0R-05-03 ) ( 0R-0e-01, 0R-0e-02 ) ( 0R-0t-02, 0R-0t-03 ) ( 0R-0w-01, 0R-0w-02 )

Notice
• Different approaches
  • Our work: based on dependency analysis
  • Today's survey: based on profile data
• Profile data? Isn't it enough to know the execution time?

Difficulty in Tuning a Parallel Program (1/2)
• Coverage
  • Percentage of total execution time spent in the parallel regions
  • Amdahl's law
• Granularity
  • Average length of computation between synchronizations
  • Overhead of communication and synchronization
[Figure: parallel region; labels "10%" and "100"]
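
The role of coverage can be made concrete with Amdahl's law. The sketch below is a minimal Python illustration; the function name `amdahl_speedup` and the sample numbers (90% coverage, 100 processors) are chosen here for illustration and are not taken from the slide.

```python
def amdahl_speedup(coverage, processors):
    """Upper bound on speedup when `coverage` (0..1) of the run time is parallel."""
    serial = 1.0 - coverage
    return 1.0 / (serial + coverage / processors)


# With 90% coverage, even 100 processors give a speedup below 10.
print(amdahl_speedup(0.9, 100))
```

The bound ignores communication and synchronization overhead, which is why granularity has to be considered in addition to coverage.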

Difficulty in Tuning a Parallel Program (2/2)
• Critical path
  • The top resource-consuming code segment is not necessarily the one to tune
  • Heavy resource consumption alone does not imply a corresponding potential for improvement

Assistant Tools for Program Parallelization
• SUIF Explorer: coverage and granularity
• S-Check: effect of a change on overall performance
• Ursa Minor: experienced programmers' knowledge

SUIF Explorer [Liao, et al 1999]
• Objective
  • Identify the important loops
• Rules of thumb
  • Most of a program's execution time is spent on a small percentage of the code
  • Most of a program's execution time is spent on loops

The SUIF Explorer System
[Diagram: the sequential program passes through the Parallelizing Compiler, the Execution Analyzers, and the Parallelization Guru; the user works with the system through the Rivet Visualizer.]
1. Automatic parallelization
2. Collecting profile and dynamic dependences
3. Guidance for improving program performance

The Parallelization Guru (1/2)
• Parallelization guidance
  • Reports the coverage and granularity
  • Updates the information as new loops are parallelized
• A list of loops to parallelize
  • Sorted in order of execution time
  • Contains only loops that have no I/O and are not nested under a parallel loop
• Dependence information on each loop
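
The loop list described on this slide can be sketched in a few lines of Python. The `Loop` record, its field names, and the choice of descending order (most expensive loop first) are assumptions for illustration; they are not SUIF Explorer's actual data structures.

```python
from dataclasses import dataclass


@dataclass
class Loop:
    name: str
    exec_time: float           # profiled execution time in seconds
    has_io: bool               # loop body performs I/O
    under_parallel_loop: bool  # nested inside an already parallel loop


def candidate_loops(loops):
    """Drop loops with I/O or under a parallel loop, then sort by time spent."""
    eligible = [l for l in loops if not l.has_io and not l.under_parallel_loop]
    return sorted(eligible, key=lambda l: l.exec_time, reverse=True)


profile = [
    Loop("init", 0.4, False, False),
    Loop("solve", 12.5, False, False),
    Loop("dump", 3.0, True, False),
    Loop("inner", 9.0, False, True),
]
print([l.name for l in candidate_loops(profile)])
```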

The Parallelization Guru (2/2)
• User interaction
  • The user starts with the loop at the top of the list
  • If the loop has many dependences, the user may choose not to attempt it
  • Otherwise, the user determines, using a program slice,
    • whether a static dependence can be ignored
    • whether an array can be privatized
    • … etc.

Program Slice
• The set of statements that contribute to the value of a variable at a given point in the program
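
A minimal Python sketch of a backward slice follows. It assumes straight-line code with no branches, loops, or arrays, and represents each statement as a hypothetical `(target, used_variables)` pair; real slicers such as the one in SUIF Explorer handle far more than this.

```python
def backward_slice(statements, variable):
    """Indices of the statements that contribute to the final value of `variable`.

    `statements` is a list of (target, used_variables) pairs in program order.
    """
    needed = {variable}
    in_slice = []
    for index in range(len(statements) - 1, -1, -1):
        target, uses = statements[index]
        if target in needed:
            needed.discard(target)  # this statement defines the needed value
            needed.update(uses)     # its inputs are needed in turn
            in_slice.append(index)
    return sorted(in_slice)


program = [
    ("a", []),     # 0: a = input
    ("b", []),     # 1: b = input
    ("c", ["a"]),  # 2: c = a * 2
    ("d", ["b"]),  # 3: d = b + 1
    ("e", ["c"]),  # 4: e = c - 3
]
print(backward_slice(program, "e"))
```

Statements 1 and 3 do not appear in the slice for `e`, so the user can ignore them when reasoning about that value.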

The Parallelization Guru
• Comment
  • Performance data and dependency information are closely related ⇒ presenting them together cuts development cost
  • It is applicable only to loops

S-Check [Snelick 1997]
• Objective
  • Identify the parts of the program where changes will significantly improve overall performance
• Effect prediction
  • Determine the effect of changes in the code without actually making the changes

Sensitive Checker
• Insert "delay" into segments of a parallel program and calculate the sensitivity to perturbation
• Assumption
  • If a program code segment is highly sensitive to slight perturbations
    ⇒ comparable improvements to that segment will boost performance correspondingly

Program Model
• Code = transfer function
• Taylor expansion:

  $$R(X_1, X_2, \ldots, X_k) = I + \sum_{j=1}^{k} \beta_j X_j + \sum_{i<j}^{k} \beta_{i,j} X_i X_j + \cdots$$

• β_j: indicates how sensitive execution is
• β_{i,j}: interactions between code segments

[Workflow diagram]
1. Original parallel program

    while(x>y){ }           // A
    send(…);                // B
    …
    do_computation{…};      // C

2. Mark possible bottlenecks and insert delays (1: ON / 0: OFF)

    while(x>y){             // A
        delay(a);
    }
    delay(b); send(…);      // B
    …
    do_computation{delay(c); …};  // C

3. Generate and run numerous versions of the program

    … delay(1) … delay(1) … delay(0)
    … delay(0) … delay(1) … delay(0)
    … delay(1) … delay(0) … delay(0)
    …
    … delay(0) … delay(0) … delay(0)

4. Analyze results, solve for effects

    Effects   Source
    0.44      A
    4.54      B
    0.07      AB
    1.21      C
    0.02      BC
    0.34      AC
    0.00      ABC
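
The "generate, run, solve for effects" steps can be sketched as a two-level full factorial experiment. The Python code below uses the standard contrast calculation from factorial design, which is an assumption here and not necessarily S-Check's exact procedure; `fake_run` is a hypothetical stand-in for actually running an instrumented program, and its coefficients are made up.

```python
from itertools import product


def estimate_effects(run, segments):
    """Run every ON/OFF delay pattern and compute main and interaction effects."""
    k = len(segments)
    results = {p: run(dict(zip(segments, p))) for p in product((0, 1), repeat=k)}
    effects = {}
    for mask in product((0, 1), repeat=k):
        if not any(mask):
            continue  # the empty subset is the baseline, not an effect
        name = "".join(s for s, m in zip(segments, mask) if m)
        total = 0.0
        for pattern, time in results.items():
            sign = 1
            for on, m in zip(pattern, mask):
                if m:
                    sign *= 1 if on else -1  # +1 when delay is ON, -1 when OFF
            total += sign * time
        effects[name] = total / 2 ** (k - 1)
    return effects


def fake_run(delays):
    # Hypothetical run time: segment B reacts most strongly to its delay.
    return 10.0 + 0.5 * delays["A"] + 4.0 * delays["B"] + 1.0 * delays["C"]


effects = estimate_effects(fake_run, ["A", "B", "C"])
for source, effect in sorted(effects.items(), key=lambda e: -e[1]):
    print(f"{effect:5.2f}  {source}")
```

A full factorial needs 2^k runs for k marked segments, so the number of runs grows quickly with the number of test locations.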

User Interaction (1/3)
• Test code locations are selected manually or automatically
• Information provided by the profiler
  • Programming constructs (e.g. while, for)
  • Certain library function calls (e.g. barrier(), send())

User Interaction (2/3)
• Set the parameters
  • Delay perturbation patterns
  • Delay value
• Trade-off: amount of information vs. number of runs

User Interaction (3/3)
• Code with a higher effect is more likely to be a bottleneck
• Dependences are not dealt with

S-Check
• Comment
  • Identifies the program segments that are directly linked to performance
  • Knowledge about the program is required in order to mark possible bottlenecks
  • As code size grows, the sensitivity test takes longer
  • Dependence information is not available

Ursa Minor [Kim, et al. 2000]
• Objective
  × Stop at pointing to problematic code
  〇 Present possible causes and solutions
• Transfer knowledge from experienced programmers to novice programmers

Ursa Minor System
[Diagram: the user and the parallel program interact with three components and a database.]
• GUI Manager: Table View, Structure View
• Merlin Performance Adviser: analyzes problems, suggests solutions
• Database Manager: imports/exports data files from Polaris or other tools
• Database: static data, dynamic data; stores analyzed data, the Map file, etc.

Merlin Performance Advisor
• Knowledge database
  • Knowledge on diagnosis and solutions
  • Transfers programming experience from experts to new users (with a "MAP" file)
  • Performance model
  • Architecture
  • … etc.

Merlin
[Figure: Symptom ⇒ Diagnostic, Suggestions]

Advisor Map (1/2)
• Problem Domain: general performance problems from the viewpoint of programmers
• Diagnostics Domain: possible causes of these problems
• Solution Group: possible remedies

Advisor Map (2/2)

    Problem             Diagnostics                                            Solution
    poor speedup        speedup < 1                                            Serialization
                        # of stride-1 accesses < # of non stride-1 accesses    Loop Interchange
                        speedup < 2.5                                          Loop Fusion
    large # of stalls   :                                                      :
    :                   :                                                      :
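
The problem, diagnostics, and solution columns can be read as a small rule table. The Python sketch below encodes the three visible rows that way; the metric names (`speedup`, `stride1`, `non_stride1`), the grouping of all three rows under "poor speedup", and the `advise` helper are assumptions for illustration and do not reflect Merlin's actual MAP file format.

```python
# Each rule: (problem, diagnostic test on measured metrics, suggested solution)
ADVISOR_MAP = [
    ("poor speedup", lambda m: m["speedup"] < 1, "Serialization"),
    ("poor speedup", lambda m: m["stride1"] < m["non_stride1"], "Loop Interchange"),
    ("poor speedup", lambda m: m["speedup"] < 2.5, "Loop Fusion"),
]


def advise(problem, metrics):
    """Return the solutions whose diagnostics hold for the given metrics."""
    return [
        solution
        for rule_problem, test, solution in ADVISOR_MAP
        if rule_problem == problem and test(metrics)
    ]


loop_metrics = {"speedup": 2.0, "stride1": 10, "non_stride1": 3}
print(advise("poor speedup", loop_metrics))
```

Keeping the rules as data rather than code is what would let an experienced programmer extend the map without touching the tool itself.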

Expression Evaluator
• Basic spreadsheet operations
  • Numeric functions: NEG, ADD, SPDUP, PERCO, ARVG, etc.
  • Relational functions: EQ, NE, etc.
  • Query functions: PARALLEL, HASIO, HASCALL, HASDEP, etc.
  • Logical functions: AND, OR, etc.

Merlin
• Comment
  • The idea goes a step beyond merely indicating a bottleneck
  • Who writes the "MAP"?
  • The effectiveness of this technique depends on the quality of the MAP

Comparison
• SUIF Explorer vs. S-Check
  • No configuration, dependence information
  • Efficiency?
• These two vs. Ursa Minor
  • Practical
  • Not beginner-friendly

Conclusion
• Several approaches guide the user with smart information
• Future work
  • Integration: profiler and dependence analyzer
  • Portability: different architectures, OSes, performance