Post on 19-Jan-2016
Carnegie Mellon
Compiler Optimization of Scalar Value Communication Between Speculative
Threads
Antonia Zhai, Christopher B. Colohan,
J. Gregory Steffan and Todd C. Mowry
School of Computer Science, Carnegie Mellon University
Compiler Optimization of Scalar Value Communication…
- 2 - Zhai, Colohan, Steffan and Mowry
Motivation
Industry is delivering multithreaded processors.
Improving throughput on multithreaded processors is straightforward.
How can we use multithreaded processors to improve the performance of a single application?
We need parallel programs.
[Figure: IBM Power4 processor — two processor cores sharing a cache per die, 4 dies per module, 8 64-bit processors per unit]
Automatic Parallelization
• Finding independent threads in integer programs is limited by:
  – complex control flow
  – ambiguous data dependences
  – runtime inputs
• More fundamentally: parallelization is determined at compile time.
Thread-Level Speculation: detect data dependences at runtime.
Thread-Level Speculation (TLS)
[Figure: threads T1–T3 executing speculatively in parallel over time; a load in a later thread that conflicts with an earlier thread's store forces that thread to retry]
How do we communicate values between threads?
Speculation
while(1) {
    … = *q;
    . . .
    *p = …;
}
Speculation is efficient when the dependence occurs infrequently.
[Figure: threads Ti…Ti+3 each run one iteration; an occasional *p/*q conflict triggers a retry]
Synchronization
while(1) {
    … = a;
    a = …;
}
Synchronization is efficient when the dependence occurs frequently.
[Figure: wait(a) is inserted before the use of a and signal(a) after its definition; thread Ti+1 stalls at wait(a) until thread Ti signals]
Critical Forwarding Path in TLS
while(1) {
    wait(a);
    … = a;
    a = …;
    signal(a);
}
The region from wait(a) to signal(a) is the critical forwarding path.
Cost of Synchronization
The processors spend 43% of total execution time on synchronization
[Bar chart: normalized execution time (sync vs. other) for gcc, go, mcf, parser, perlbmk, twolf, vpr, and their average]
Outline
Compiler optimization
• Optimization opportunity
• Conservative instruction scheduling
• Aggressive instruction scheduling
Performance
Conclusions
Reducing the Critical Forwarding Path
Long Critical Path Short Critical Path
A shorter critical forwarding path means less execution time.
[Figure: thread timelines contrasting a long and a short critical forwarding path]
A Simplified Example from GCC

do {
    counter = 0;
    if (p->jmp)
        p = p->jmp;
    if (!p->real) {
        p = p->next;
        continue;
    }
    q = p;
    do {
        counter++;
        q = q->next;
    } while (q);
    p = p->next;
} while (p);
[Figure: the loop's control-flow graph, alongside the linked list of nodes with jmp, next, and real fields]
Insert Wait Before First Use of p
[Figure: wait(p) is inserted at the top of the loop, after counter=0 and before the first use of p in "p->jmp?"]
Insert Signal After Last Definition of p
[Figure: signal(p) is inserted after each "p = p->next" that is the last definition of p on its path to the end of the loop body]
Earlier Placement for Signals
[Figure: the signals could be moved even earlier, above the inner loop]
How can we systematically find these insertion points?
Outline
Compiler optimization
• Optimization opportunity
• Conservative instruction scheduling
  – How to compute the forwarding value
  – Where to compute the forwarding value
• Aggressive instruction scheduling
Performance
Conclusions
A Dataflow Algorithm
Given: a control-flow graph with entry and exit nodes.
For each node in the graph:
• Can we compute the forwarding value at this node?
• If so, how do we compute the forwarding value?
• Can we forward the value at an earlier node?
Moving Instructions Across Basic Blocks
[Figure: three cases (A), (B), (C) of hoisting "signal p", together with the "p = p->next" that computes its value, across basic-block boundaries]
Example
[Figure sequence (slides 20–34): the algorithm starts with "signal p" at the end node and hoists it backward through the graph one block at a time, replicating the instruction that computes the forwarded value ("p = p->next") so the signal can move above the definition. At a node with multiple successors, the code hoisted from the successors must be reconciled; where one successor path requires a different computation (the "p = p->jmp" path), its instructions are combined in as well.]
Finding the Earliest Insertion Point
Find at each node in the CFG:
• Can we compute the forwarding value at this node?
• If so, how do we compute the forwarding value?
• Can we forward the value at an earlier node?
Earliest analysis: a node is earliest if, on some execution path, no earlier node can compute the forwarding value.
The Earliest Insertion Point
[Figure sequence (slides 36–39): applying the earliest analysis to the example graph; the hoisted "signal p" copies survive only at nodes where, on some path, no earlier node could compute the forwarded value]
The Resulting Graph
[Figure: the optimized loop — wait(p) at the top; on each path, p1 = p->next is computed and signal(p1) issued as early as possible; the later assignments become p = p1]
Outline
Compiler optimization
• Optimization opportunity
• Conservative instruction scheduling
• Aggressive instruction scheduling
  – Speculate on control flow
  – Speculate on data dependence
Performance
Conclusions
We Cannot Compute the Forwarded Value When…
Control dependence | Ambiguous data dependence
[Figure: left — the forwarded value differs depending on whether "p = p->jmp" executes; right — an update(p) call may or may not redefine p before "p = p->next". Both cases call for speculation.]
Speculating on Control Dependences
[Figure: "p = p->next; signal(p)" is hoisted above the branch, assuming the rarely taken "p = p->jmp" path is not executed; if it is, violate(p) squashes the consumer and the corrected sequence "p = p->jmp; p = p->next; signal(p)" re-forwards the value]
Speculating on Data Dependence
[Figure: "p = load(addr); signal(p)" is hoisted above stores (store1, store2, store3) that may alias addr; detecting a conflicting store requires hardware support]
Hardware Support
[Figure: the speculative load marks its address — "p = load(0xADE8)" issues mark_load(0xADE8); later stores (store2(0x8438), store3(0x88E8)) are checked against the marked addresses; unmark_load(0xADE8) clears the mark once the value is safely forwarded]
Outline
Compiler optimization
• Optimization opportunity
• Conservative instruction scheduling
• Aggressive instruction scheduling
Performance
• Conservative instruction scheduling
• Aggressive instruction scheduling
• Hardware optimization for scalar value communication
• Hardware optimization for all value communication
Conclusions
Conservative Instruction Scheduling
[Bar chart: normalized execution time (sync / fail / other / busy) for gcc, go, mcf, parser, perlbmk, twolf, vpr, and the average; U = no instruction scheduling, A = conservative instruction scheduling]
Improves performance by 15%
Optimizing Induction Variable Only
[Bar chart: normalized execution time for gcc, go, mcf, parser, perlbmk, twolf, vpr, and the average; U = no instruction scheduling, I = induction-variable scheduling only, A = conservative instruction scheduling]
Induction-variable scheduling alone is responsible for a 10% performance improvement.
Benefits from Global Analysis
Multiscalar instruction scheduling [Vijaykumar, Thesis '98]:
• uses local analysis to schedule instructions across basic blocks
• does not allow scheduling of instructions across inner loops
[Bar chart: normalized execution time for gcc and go; M = Multiscalar scheduling, A = conservative scheduling]
Aggressively Scheduling Instructions Across Control Dependences
[Bar chart: normalized execution time for all benchmarks; A = conservative instruction scheduling, C = aggressive instruction scheduling (control)]
Has no performance improvement over conservative scheduling
Aggressively Scheduling Instructions Across Control and Data Dependence
[Bar chart: normalized execution time for all benchmarks; A = conservative instruction scheduling, C = aggressive instruction scheduling (control), D = aggressive instruction scheduling (control + data)]
Improves performance by 9% over conservative scheduling
Hardware Optimization Over Non-Optimized Code
[Bar chart: normalized execution time for all benchmarks; U = no instruction scheduling, E = no instruction scheduling + hardware optimization]
Improves performance by 4%
Hardware Optimization + Conservative Instruction Scheduling
[Bar chart: normalized execution time for all benchmarks; A = conservative instruction scheduling, F = conservative instruction scheduling + hardware optimization]
Improves performance by 2%
Hardware Optimization + Aggressive Instruction Scheduling
[Bar chart: normalized execution time for all benchmarks; D = aggressive instruction scheduling, O = aggressive instruction scheduling + hardware optimization]
Degrades performance slightly
Hardware optimization is less important with compiler optimization
Hardware Optimization for Communicating Both Scalar and Memory Values
Reduces the cost of violations by 27.6%, translating into a 4.7% performance improvement.
[Bar chart: normalized execution time (sync / fail / other / busy) for gcc, go, mcf, parser, perlbmk, twolf, vpr, compress, crafty, gap, gzip, ijpeg, m88k, vortex]
Impact of Hardware and Software Optimizations
6 benchmarks are improved by 6.2-28.5%, 4 others by 2.7-3.6%
[Bar chart: normalized execution time for gcc, go, mcf, parser, perlbmk, twolf, vpr, compress, crafty, gap, gzip, ijpeg, m88k, vortex]
Conclusions
• The critical forwarding path is an important bottleneck in TLS.
• Loop induction variables serialize parallel threads; they can be eliminated with our instruction-scheduling algorithm.
• Non-inductive scalars can benefit from conservative instruction scheduling.
• Aggressive instruction scheduling should be applied selectively:
  – speculating on control dependence alone is not very effective;
  – speculating on both control and data dependences can reduce synchronization significantly; GCC is the biggest beneficiary.
• Hardware optimization becomes less important as the compiler schedules instructions more aggressively.
The critical forwarding path is best addressed by the compiler.