carnegie mellon compiler optimization of scalar value communication between speculative threads...

57
Carnegie Mellon Compiler Optimization of Scalar Value Communication Between Speculative Threads Antonia Zhai, Christopher B. Colohan, J. Gregory Steffan and Todd C. Mowry School of Computer Science Carnegie Mellon University

Upload: silvia-glenn

Post on 19-Jan-2016

220 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Carnegie Mellon Compiler Optimization of Scalar Value Communication Between Speculative Threads Antonia Zhai, Christopher B. Colohan, J. Gregory Steffan

Carnegie Mellon

Compiler Optimization of Scalar Value Communication Between Speculative

Threads

Antonia Zhai, Christopher B. Colohan,

J. Gregory Steffan and Todd C. Mowry

School of Computer ScienceCarnegie Mellon University

Page 2: Carnegie Mellon Compiler Optimization of Scalar Value Communication Between Speculative Threads Antonia Zhai, Christopher B. Colohan, J. Gregory Steffan

Compiler Optimization of Scalar Value Communication…

- 2 - Zhai, Colohan, Steffan and Mowry

Carnegie Mellon

Motivation

Industry is delivering multithreaded processors

Improving throughput on multithreaded processors is straight forward

How can we use multithreaded processors to improve How can we use multithreaded processors to improve the performance of a single application?the performance of a single application?

We need parallel programs

Cache

Proc ProcIBM Power4 processor:

2 processor cores per die

4 dies per module

8 64-bit processors per unit

Page 3: Carnegie Mellon Compiler Optimization of Scalar Value Communication Between Speculative Threads Antonia Zhai, Christopher B. Colohan, J. Gregory Steffan

Compiler Optimization of Scalar Value Communication…

- 3 - Zhai, Colohan, Steffan and Mowry

Carnegie Mellon

Automatic Parallelization

• Finding independent threads from integer programs is Finding independent threads from integer programs is limited bylimited by

Complex control flow Complex control flow

Ambiguous data dependencesAmbiguous data dependences

Runtime inputsRuntime inputs

• More fundamentally

• Parallelization is determined at compile time

Thread-Level Speculation Detecting data dependences at runtime

Page 4: Carnegie Mellon Compiler Optimization of Scalar Value Communication Between Speculative Threads Antonia Zhai, Christopher B. Colohan, J. Gregory Steffan

Compiler Optimization of Scalar Value Communication…

- 4 - Zhai, Colohan, Steffan and Mowry

Carnegie Mellon

Thread-Level Speculation (TLS)

Retry

TLS

T1 T2 T3

Load

Thread1

Thread2

Thread3

Load

StoreTime

How do we communicate value between threads?

Page 5: Carnegie Mellon Compiler Optimization of Scalar Value Communication Between Speculative Threads Antonia Zhai, Christopher B. Colohan, J. Gregory Steffan

Compiler Optimization of Scalar Value Communication…

- 5 - Zhai, Colohan, Steffan and Mowry

Carnegie Mellon

Speculation

TiTi+3

while(1) { …= *q; . . . *p = …;}

Retry

Is efficient when the dependence occurs infrequently

Ti+1 Ti+2

…=*q

*p=…

…=*q

*p=…

…=*q

*p=…

…=*q

…=*q

Page 6: Carnegie Mellon Compiler Optimization of Scalar Value Communication Between Speculative Threads Antonia Zhai, Christopher B. Colohan, J. Gregory Steffan

Compiler Optimization of Scalar Value Communication…

- 6 - Zhai, Colohan, Steffan and Mowry

Carnegie Mellon

while(1) {

…= a;

a =…;

}

Synchronization

a=

=a

TiTi+1

Is efficient when the dependence occurs frequently

wait(stall)signal

wait(a);

signal(a);

Page 7: Carnegie Mellon Compiler Optimization of Scalar Value Communication Between Speculative Threads Antonia Zhai, Christopher B. Colohan, J. Gregory Steffan

Compiler Optimization of Scalar Value Communication…

- 7 - Zhai, Colohan, Steffan and Mowry

Carnegie Mellon

Critical Forwarding Path in TLS

while(1) { wait(a); …= a; a =…; signal(a);}

wait(a);

…= a;

a = …;

signal(a);

crit

ical

fo

rwar

di n

g p

ath

Page 8: Carnegie Mellon Compiler Optimization of Scalar Value Communication Between Speculative Threads Antonia Zhai, Christopher B. Colohan, J. Gregory Steffan

Compiler Optimization of Scalar Value Communication…

- 8 - Zhai, Colohan, Steffan and Mowry

Carnegie Mellon

Cost of Synchronization

The processors spend 43% of total execution time on synchronization

Nor

mal

ized

Exe

cutio

n Ti

me

0

50

100

gcc go mcf parser perlbmk twolf vpr AVERAGE

sync

other

Page 9: Carnegie Mellon Compiler Optimization of Scalar Value Communication Between Speculative Threads Antonia Zhai, Christopher B. Colohan, J. Gregory Steffan

Compiler Optimization of Scalar Value Communication…

- 9 - Zhai, Colohan, Steffan and Mowry

Carnegie Mellon

Outline

Compiler optimization Optimization opportunity• Conservative instruction scheduling• Aggressive instruction scheduling

Performance

Conclusions

Page 10: Carnegie Mellon Compiler Optimization of Scalar Value Communication Between Speculative Threads Antonia Zhai, Christopher B. Colohan, J. Gregory Steffan

Compiler Optimization of Scalar Value Communication…

- 10 - Zhai, Colohan, Steffan and Mowry

Carnegie Mellon

Reducing the Critical Forwarding Path

Long Critical Path Short Critical Path

shorter critical forwarding path less execution timeexecution tim

e

execution time

Page 11: Carnegie Mellon Compiler Optimization of Scalar Value Communication Between Speculative Threads Antonia Zhai, Christopher B. Colohan, J. Gregory Steffan

Compiler Optimization of Scalar Value Communication…

- 11 - Zhai, Colohan, Steffan and Mowry

Carnegie Mellon

A Simplified Example from GCCdo {

counter = 0;

if(p->jmp)

p = p->jmp;

if(!p->real) {

p = p->next;

continue;

}

q = p;

do {

counter++;

q = q->next;

} while(q);

p = p->next;

} while(p);

start

counter=0p->jmp?

p=p->jmp

p->real?

p=p->next

q = p

counter++;q=q->next;q?

p=p->next

end

jmpnext

P

real

jmpnextreal

jmpnextreal

jmpnextreal

Page 12: Carnegie Mellon Compiler Optimization of Scalar Value Communication Between Speculative Threads Antonia Zhai, Christopher B. Colohan, J. Gregory Steffan

Compiler Optimization of Scalar Value Communication…

- 12 - Zhai, Colohan, Steffan and Mowry

Carnegie Mellon

Insert Wait Before First Use of Pstart

counter=0p->jmp?

p=p->jmp

p->real?

p=p->next

q = p

counter++;q=q->next;q?

p=p->next

end

Page 13: Carnegie Mellon Compiler Optimization of Scalar Value Communication Between Speculative Threads Antonia Zhai, Christopher B. Colohan, J. Gregory Steffan

Compiler Optimization of Scalar Value Communication…

- 13 - Zhai, Colohan, Steffan and Mowry

Carnegie Mellon

Insert Wait Before First Use of Pstart

counter=0wait(p)p->jmp?

p=p->jmp

p->real?

p=p->next

q = p

counter++;q=q->next;q?

p=p->next

end

Page 14: Carnegie Mellon Compiler Optimization of Scalar Value Communication Between Speculative Threads Antonia Zhai, Christopher B. Colohan, J. Gregory Steffan

Compiler Optimization of Scalar Value Communication…

- 14 - Zhai, Colohan, Steffan and Mowry

Carnegie Mellon

Insert Signal After Last Definition of Pstart

counter=0wait(p)p->jmp?

p=p->jmp

p->real?

p=p->next

q = p

counter++;q=q->next;q?

p=p->next

end

Page 15: Carnegie Mellon Compiler Optimization of Scalar Value Communication Between Speculative Threads Antonia Zhai, Christopher B. Colohan, J. Gregory Steffan

Compiler Optimization of Scalar Value Communication…

- 15 - Zhai, Colohan, Steffan and Mowry

Carnegie Mellon

Insert Signal After Last Definition of Pstart

counter=0wait(p)p->jmp?

p=p->jmp

p->real?

p=p->nextsignal(p)

q = p

counter++;q=q->next;q?

p=p->nextsignal(p)

end

Page 16: Carnegie Mellon Compiler Optimization of Scalar Value Communication Between Speculative Threads Antonia Zhai, Christopher B. Colohan, J. Gregory Steffan

Compiler Optimization of Scalar Value Communication…

- 16 - Zhai, Colohan, Steffan and Mowry

Carnegie Mellon

Earlier Placement for Signalsstart

counter=0wait(p)p->jmp?

p=p->jmp

p->real?

p=p->nextsignal(p)

q = p

counter++;q=q->next;q?

p=p->nextsignal(p)

end

How can we systematicallyfind these insertionpoints?

Page 17: Carnegie Mellon Compiler Optimization of Scalar Value Communication Between Speculative Threads Antonia Zhai, Christopher B. Colohan, J. Gregory Steffan

Compiler Optimization of Scalar Value Communication…

- 17 - Zhai, Colohan, Steffan and Mowry

Carnegie Mellon

Outline

Compiler optimization Optimization opportunity Conservative instruction scheduling

How to compute the forwarding value Where to compute the forwarding value

• Aggressive instruction scheduling

Performance

Conclusions

Page 18: Carnegie Mellon Compiler Optimization of Scalar Value Communication Between Speculative Threads Antonia Zhai, Christopher B. Colohan, J. Gregory Steffan

Compiler Optimization of Scalar Value Communication…

- 18 - Zhai, Colohan, Steffan and Mowry

Carnegie Mellon

A Dataflow Algorithm

Given: Control flow graph entry

exit

For each node in the graph: Can we compute the forwarding value at this node? If so, how do we compute the forwarding value? Can we forward the value at an earlier node?

Page 19: Carnegie Mellon Compiler Optimization of Scalar Value Communication Between Speculative Threads Antonia Zhai, Christopher B. Colohan, J. Gregory Steffan

Compiler Optimization of Scalar Value Communication…

- 19 - Zhai, Colohan, Steffan and Mowry

Carnegie Mellon

Moving Instructions Across Basic Blocks

p=p->nextq = p

signal p

p=p->next

signal psignal p

signal p

end

signal p

endend

(A) (B) (C)

Page 20: Carnegie Mellon Compiler Optimization of Scalar Value Communication Between Speculative Threads Antonia Zhai, Christopher B. Colohan, J. Gregory Steffan

Compiler Optimization of Scalar Value Communication…

- 20 - Zhai, Colohan, Steffan and Mowry

Carnegie Mellon

Example: (1)start

counter=0p->jmp?

p=p->jmp

p->real?

p=p->next

q = p

counter++;q=q->next;q?

p=p->next

endsignal p

Page 21: Carnegie Mellon Compiler Optimization of Scalar Value Communication Between Speculative Threads Antonia Zhai, Christopher B. Colohan, J. Gregory Steffan

Compiler Optimization of Scalar Value Communication…

- 21 - Zhai, Colohan, Steffan and Mowry

Carnegie Mellon

Example (2)start

counter=0p->jmp?

p=p->jmp

p->real?

p=p->next

q = p

counter++;q=q->next;q?

p=p->next

endsignal p

signal p

p=p->next

Page 22: Carnegie Mellon Compiler Optimization of Scalar Value Communication Between Speculative Threads Antonia Zhai, Christopher B. Colohan, J. Gregory Steffan

Compiler Optimization of Scalar Value Communication…

- 22 - Zhai, Colohan, Steffan and Mowry

Carnegie Mellon

Example (3)start

counter=0p->jmp?

p=p->jmp

p->real?

p=p->next

q = p

counter++;q=q->next;q?

p=p->next

end

signal p

p=p->next

signal p

signal p

p=p->next

Page 23: Carnegie Mellon Compiler Optimization of Scalar Value Communication Between Speculative Threads Antonia Zhai, Christopher B. Colohan, J. Gregory Steffan

Compiler Optimization of Scalar Value Communication…

- 23 - Zhai, Colohan, Steffan and Mowry

Carnegie Mellon

Example (4)start

counter=0p->jmp?

p=p->jmp

p->real?

p=p->next

q = p

counter++;q=q->next;q?

p=p->next

end

signal p

p=p->next

signal p

signal p

p=p->next

signal p

p=p->next

Page 24: Carnegie Mellon Compiler Optimization of Scalar Value Communication Between Speculative Threads Antonia Zhai, Christopher B. Colohan, J. Gregory Steffan

Compiler Optimization of Scalar Value Communication…

- 24 - Zhai, Colohan, Steffan and Mowry

Carnegie Mellon

Example (5)start

counter=0p->jmp?

p=p->jmp

p->real?

p=p->next

q = p

counter++;q=q->next;q?

p=p->next

end

signal p

p=p->next

signal p

p=p->next

signal p

signal p

p=p->next

signal p

p=p->next

Page 25: Carnegie Mellon Compiler Optimization of Scalar Value Communication Between Speculative Threads Antonia Zhai, Christopher B. Colohan, J. Gregory Steffan

Compiler Optimization of Scalar Value Communication…

- 25 - Zhai, Colohan, Steffan and Mowry

Carnegie Mellon

Example (6)start

counter=0p->jmp?

p=p->jmp

p->real?

p=p->next

q = p

counter++;q=q->next;q?

p=p->next

end

signal p

p=p->next

signal p

p=p->next

signal p

p=p->next

signal p

signal p

p=p->next

signal p

p=p->next

Page 26: Carnegie Mellon Compiler Optimization of Scalar Value Communication Between Speculative Threads Antonia Zhai, Christopher B. Colohan, J. Gregory Steffan

Compiler Optimization of Scalar Value Communication…

- 26 - Zhai, Colohan, Steffan and Mowry

Carnegie Mellon

Example (7)start

counter=0p->jmp?

p=p->jmp

p->real?

p=p->next

q = p

counter++;q=q->next;q?

p=p->next

end

signal p

p=p->next

signal p

p=p->next

signal p

p=p->next

signal p

signal p

p=p->next

signal p

p=p->next

Page 27: Carnegie Mellon Compiler Optimization of Scalar Value Communication Between Speculative Threads Antonia Zhai, Christopher B. Colohan, J. Gregory Steffan

Compiler Optimization of Scalar Value Communication…

- 27 - Zhai, Colohan, Steffan and Mowry

Carnegie Mellon

Nodes with Multiple Successors

signal p

p=p->next

signal p

p=p->next

signal p

p=p->next

Page 28: Carnegie Mellon Compiler Optimization of Scalar Value Communication Between Speculative Threads Antonia Zhai, Christopher B. Colohan, J. Gregory Steffan

Compiler Optimization of Scalar Value Communication…

- 28 - Zhai, Colohan, Steffan and Mowry

Carnegie Mellon

Example (7)start

counter=0p->jmp?

p=p->jmp

p->real?

p=p->next

q = p

counter++;q=q->next;q?

p=p->next

end

signal p

p=p->next

signal p

p=p->next

signal p

p=p->nextsignal p

p=p->next

signal p

signal p

p=p->next

signal p

p=p->next

Page 29: Carnegie Mellon Compiler Optimization of Scalar Value Communication Between Speculative Threads Antonia Zhai, Christopher B. Colohan, J. Gregory Steffan

Compiler Optimization of Scalar Value Communication…

- 29 - Zhai, Colohan, Steffan and Mowry

Carnegie Mellon

Example (8)start

counter=0p->jmp?

p=p->jmp

p->real?

p=p->next

q = p

counter++;q=q->next;q?

p=p->next

end

signal p

p=p->next

signal p

p=p->next

signal p

p=p->nextsignal p

p=p->next

signal p

p=p->next

signal p

signal p

p=p->next

signal p

p=p->next

Page 30: Carnegie Mellon Compiler Optimization of Scalar Value Communication Between Speculative Threads Antonia Zhai, Christopher B. Colohan, J. Gregory Steffan

Compiler Optimization of Scalar Value Communication…

- 30 - Zhai, Colohan, Steffan and Mowry

Carnegie Mellon

Example (9)start

counter=0p->jmp?

p=p->jmp

p->real?

p=p->next

q = p

counter++;q=q->next;q?

p=p->next

end

signal p

p=p->next

signal p

p=p->next

signal p

p=p->nextsignal p

p=p->next

signal p

p=p->next

signal p

p=p->next

signal p

signal p

p=p->next

signal p

p=p->next

p=p->jmp

Page 31: Carnegie Mellon Compiler Optimization of Scalar Value Communication Between Speculative Threads Antonia Zhai, Christopher B. Colohan, J. Gregory Steffan

Compiler Optimization of Scalar Value Communication…

- 31 - Zhai, Colohan, Steffan and Mowry

Carnegie Mellon

Example (10)start

counter=0p->jmp?

p=p->jmp

p->real?

p=p->next

q = p

counter++;q=q->next;q?

p=p->next

end

signal p

p=p->next

signal p

p=p->next

signal p

p=p->nextsignal p

p=p->next

signal p

p=p->next

signal p

p=p->next

signal p

signal p

p=p->next

signal p

p=p->next

p=p->jmp

Page 32: Carnegie Mellon Compiler Optimization of Scalar Value Communication Between Speculative Threads Antonia Zhai, Christopher B. Colohan, J. Gregory Steffan

Compiler Optimization of Scalar Value Communication…

- 32 - Zhai, Colohan, Steffan and Mowry

Carnegie Mellon

Nodes with Multiple Successors

signal p

p=p->next

signal p

p=p->next

p=p->jmp

Page 33: Carnegie Mellon Compiler Optimization of Scalar Value Communication Between Speculative Threads Antonia Zhai, Christopher B. Colohan, J. Gregory Steffan

Compiler Optimization of Scalar Value Communication…

- 33 - Zhai, Colohan, Steffan and Mowry

Carnegie Mellon

Example (10)start

counter=0p->jmp?

p=p->jmp

p->real?

p=p->next

q = p

counter++;q=q->next;q?

p=p->next

end

signal p

p=p->next

signal p

p=p->next

signal p

p=p->nextsignal p

p=p->next

signal p

p=p->next

signal p

p=p->next

signal p

signal p

p=p->next

signal p

p=p->next

p=p->jmp

Page 34: Carnegie Mellon Compiler Optimization of Scalar Value Communication Between Speculative Threads Antonia Zhai, Christopher B. Colohan, J. Gregory Steffan

Compiler Optimization of Scalar Value Communication…

- 34 - Zhai, Colohan, Steffan and Mowry

Carnegie Mellon

Example (11)start

counter=0p->jmp?

p=p->jmp

p->real?

p=p->next

q = p

counter++;q=q->next;q?

p=p->next

end

signal p

p=p->next

signal p

p=p->next

signal p

p=p->nextsignal p

p=p->next

signal p

p=p->next

signal p

p=p->next

signal p

signal p

p=p->next

signal p

p=p->next

p=p->jmp

Page 35: Carnegie Mellon Compiler Optimization of Scalar Value Communication Between Speculative Threads Antonia Zhai, Christopher B. Colohan, J. Gregory Steffan

Compiler Optimization of Scalar Value Communication…

- 35 - Zhai, Colohan, Steffan and Mowry

Carnegie Mellon

Finding the Earliest Insertion Point

Find at each node in the CFG: Can we compute the forwarding value at this node? If so, how do we compute the forwarding value? Can we forward value at an earlier node?

Earliest analysis A node is earliest, if on some execution path, no earlier

node can compute the forwarding value

Page 36: Carnegie Mellon Compiler Optimization of Scalar Value Communication Between Speculative Threads Antonia Zhai, Christopher B. Colohan, J. Gregory Steffan

Compiler Optimization of Scalar Value Communication…

- 36 - Zhai, Colohan, Steffan and Mowry

Carnegie Mellon

The Earliest Insertion Point (1)start

counter=0p->jmp?

p=p->jmp

p->real?

p=p->next

q = p

counter++;q=q->next;q?

p=p->next

end

signal p

p=p->next

signal p

p=p->next

signal p

p=p->nextsignal p

p=p->next

signal p

p=p->next

signal p

p=p->next

signal p

signal p

p=p->next

signal p

p=p->next

p=p->jmp

Page 37: Carnegie Mellon Compiler Optimization of Scalar Value Communication Between Speculative Threads Antonia Zhai, Christopher B. Colohan, J. Gregory Steffan

Compiler Optimization of Scalar Value Communication…

- 37 - Zhai, Colohan, Steffan and Mowry

Carnegie Mellon

The Earliest Insertion Point (2)start

counter=0p->jmp?

p=p->jmp

p->real?

p=p->next

q = p

counter++;q=q->next;q?

p=p->next

end

signal p

p=p->next

signal p

p=p->next

signal p

p=p->nextsignal p

p=p->next

signal p

p=p->next

signal p

p=p->next

signal p

signal p

p=p->next

signal p

p=p->next

p=p->jmp

Page 38: Carnegie Mellon Compiler Optimization of Scalar Value Communication Between Speculative Threads Antonia Zhai, Christopher B. Colohan, J. Gregory Steffan

Compiler Optimization of Scalar Value Communication…

- 38 - Zhai, Colohan, Steffan and Mowry

Carnegie Mellon

The Earliest Insertion Point (3)start

counter=0p->jmp?

p=p->jmp

p->real?

p=p->next

q = p

counter++;q=q->next;q?

p=p->next

end

signal p

p=p->next

signal p

p=p->next

signal p

p=p->nextsignal p

p=p->next

signal p

p=p->next

signal p

p=p->next

signal p

signal p

p=p->next

signal p

p=p->next

p=p->jmp

Page 39: Carnegie Mellon Compiler Optimization of Scalar Value Communication Between Speculative Threads Antonia Zhai, Christopher B. Colohan, J. Gregory Steffan

Compiler Optimization of Scalar Value Communication…

- 39 - Zhai, Colohan, Steffan and Mowry

Carnegie Mellon

The Earliest Insertion Point (4)start

counter=0p->jmp?

p=p->jmp

p->real?

p=p->next

q = p

counter++;q=q->next;q?

p=p->next

end

signal p

p=p->next

signal p

p=p->next

signal p

p=p->nextsignal p

p=p->next

signal p

p=p->next

signal p

p=p->next

signal p

signal p

p=p->next

signal p

p=p->next

p=p->jmp

Page 40: Carnegie Mellon Compiler Optimization of Scalar Value Communication Between Speculative Threads Antonia Zhai, Christopher B. Colohan, J. Gregory Steffan

Compiler Optimization of Scalar Value Communication…

- 40 - Zhai, Colohan, Steffan and Mowry

Carnegie Mellon

The Resulting Graphstart

counter=0wait(p)p->jmp?

p=p->jmpp1=p->nextsignal(p1)

p->real?

p=p1

q = p

counter++;q=q->next;q?

p=p1

end

p1=p->nextsignal(p1)

Page 41: Carnegie Mellon Compiler Optimization of Scalar Value Communication Between Speculative Threads Antonia Zhai, Christopher B. Colohan, J. Gregory Steffan

Compiler Optimization of Scalar Value Communication…

- 41 - Zhai, Colohan, Steffan and Mowry

Carnegie Mellon

Outline

Compiler optimization Optimization opportunity Conservative instruction scheduling Aggressive instruction scheduling

Speculate on control flow Speculate on data dependence

Performance

Conclusions

Page 42: Carnegie Mellon Compiler Optimization of Scalar Value Communication Between Speculative Threads Antonia Zhai, Christopher B. Colohan, J. Gregory Steffan

Compiler Optimization of Scalar Value Communication…

- 42 - Zhai, Colohan, Steffan and Mowry

Carnegie Mellon

We Cannot Compute the Forwarded Value When…

Control Dependence Ambiguous Data Dependence

signal p

p=p->jmp

signal p

p=p->next

update(p)

signal p

p=p->next

Speculation

Page 43: Carnegie Mellon Compiler Optimization of Scalar Value Communication Between Speculative Threads Antonia Zhai, Christopher B. Colohan, J. Gregory Steffan

Compiler Optimization of Scalar Value Communication…

- 43 - Zhai, Colohan, Steffan and Mowry

Carnegie Mellon

Speculating on Control Dependences

violate(p)

signal p

p=p->next

p=p->jmp

signal p

p=p->next

signal p

p=p->next

p=p->nextsignal(p)

violate(p)

p=p->jmpp=p->nextsignal(p)

Page 44: Carnegie Mellon Compiler Optimization of Scalar Value Communication Between Speculative Threads Antonia Zhai, Christopher B. Colohan, J. Gregory Steffan

Compiler Optimization of Scalar Value Communication…

- 44 - Zhai, Colohan, Steffan and Mowry

Carnegie Mellon

Speculating on Data Dependence

update(p)

p=p->nextsignal(p)

p=load(add);

signal(p);

store1

store3

store2

p=load(add);

signal(p);

Hardware support

Page 45: Carnegie Mellon Compiler Optimization of Scalar Value Communication Between Speculative Threads Antonia Zhai, Christopher B. Colohan, J. Gregory Steffan

Compiler Optimization of Scalar Value Communication…

- 45 - Zhai, Colohan, Steffan and Mowry

Carnegie Mellon

Hardware Support

store1

store3

store2

p = load(0xADE8);

0xADE8

mark_load(0xADE8)

p = load(addr);mark_load(addr)

store2(0x8438)

store3(0x88E8)

unmark_load(addr)

unmark_load(0xADE8)

Page 46: Carnegie Mellon Compiler Optimization of Scalar Value Communication Between Speculative Threads Antonia Zhai, Christopher B. Colohan, J. Gregory Steffan

Compiler Optimization of Scalar Value Communication…

- 46 - Zhai, Colohan, Steffan and Mowry

Carnegie Mellon

Outline

Compiler optimization Optimization opportunity Conservative instruction scheduling Aggressive instruction scheduling

Performance Conservative instruction scheduling Aggressive instruction scheduling Hardware optimization for scalar value communication Hardware optimization for all value communication

Conclusions

Page 47: Carnegie Mellon Compiler Optimization of Scalar Value Communication Between Speculative Threads Antonia Zhai, Christopher B. Colohan, J. Gregory Steffan

Compiler Optimization of Scalar Value Communication…

- 47 - Zhai, Colohan, Steffan and Mowry

Carnegie Mellon

Conservative Instruction Scheduling

sync

failotherbusy

gcc go mcf parser perlbmk twolf vpr AVERAGE

Nor

mal

ized

Exe

cutio

n Ti

me

0

50

100

U=No Instruction Scheduling

A=Conservative Instruction Scheduling

Improves performance by 15%

Page 48: Carnegie Mellon Compiler Optimization of Scalar Value Communication Between Speculative Threads Antonia Zhai, Christopher B. Colohan, J. Gregory Steffan

Compiler Optimization of Scalar Value Communication…

- 48 - Zhai, Colohan, Steffan and Mowry

Carnegie Mellon

Optimizing Induction Variable Only

sync

failotherbusy

gcc go mcf parser perlbmk twolf vpr AVERAGE

Nor

mal

ized

Exe

cutio

n Ti

me

0

50

100

U=No Instruction Scheduling

I=Induction Variable Scheduling only

A=Conservative Instruction Scheduling

Is responsible for 10% performance improvement

Page 49: Carnegie Mellon Compiler Optimization of Scalar Value Communication Between Speculative Threads Antonia Zhai, Christopher B. Colohan, J. Gregory Steffan

Compiler Optimization of Scalar Value Communication…

- 49 - Zhai, Colohan, Steffan and Mowry

Carnegie Mellon

Benefits from Global Analysis

Multiscalar instruction scheduling[Vijaykumar, Thesis’98]

Uses local analysis to schedule instructions across basic blocks

Does not allow scheduling of instruction across inner loops

Nor

mal

ized

Exe

cutio

n Ti

me

0

50

100

gcc goM A M A

sync

failotherbusy

M=Multiscalar Scheduling

A=Conservative Scheduling

Page 50: Carnegie Mellon Compiler Optimization of Scalar Value Communication Between Speculative Threads Antonia Zhai, Christopher B. Colohan, J. Gregory Steffan

Compiler Optimization of Scalar Value Communication…

- 50 - Zhai, Colohan, Steffan and Mowry

Carnegie Mellon

Aggressively Scheduling Instructions Across Control Dependences

sync

failotherbusy

gcc go mcf parser perlbmk twolf vpr AVERAGE

Nor

mal

ized

Exe

cutio

n Ti

me

0

50

100

A=Conservative Instruction Scheduling

C=Aggressive Instruction Scheduling(Control)

Has no performance improvement over conservative scheduling

Page 51: Carnegie Mellon Compiler Optimization of Scalar Value Communication Between Speculative Threads Antonia Zhai, Christopher B. Colohan, J. Gregory Steffan

Compiler Optimization of Scalar Value Communication…

- 51 - Zhai, Colohan, Steffan and Mowry

Carnegie Mellon

Aggressively Scheduling Instructions Across Control and Data Dependence

gcc go mcf parser perlbmk twolf vpr AVERAGE

sync

failotherbusy

Nor

mal

ized

Exe

cutio

n Ti

me

0

50

100

A=Conservative Instruction Scheduling

C=Aggressive Instruction Scheduling(Control)

D=Aggressive Instruction Scheduling(Control + Data)

Improves performance by 9% over conservative scheduling

Page 52: Carnegie Mellon Compiler Optimization of Scalar Value Communication Between Speculative Threads Antonia Zhai, Christopher B. Colohan, J. Gregory Steffan

Compiler Optimization of Scalar Value Communication…

- 52 - Zhai, Colohan, Steffan and Mowry

Carnegie Mellon

Hardware Optimization Over Non-Optimized Code

sync

failotherbusy

gcc go mcf parser perlbmk twolf vpr AVERAGE

Nor

mal

ized

Exe

cutio

n Ti

me

0

50

100

U=No Instruction Scheduling

E=No Instruction Scheduling+Hardware Optimization

Improves performance by 4%

Page 53: Carnegie Mellon Compiler Optimization of Scalar Value Communication Between Speculative Threads Antonia Zhai, Christopher B. Colohan, J. Gregory Steffan

Compiler Optimization of Scalar Value Communication…

- 53 - Zhai, Colohan, Steffan and Mowry

Carnegie Mellon

Hardware Optimization + Conservative Instruction Scheduling

sync

failotherbusy

gcc go mcf parser perlbmk twolf vpr AVERAGE

Nor

mal

ized

Exe

cutio

n Ti

me

0

50

100

A=Conservative Instruction Scheduling

F=Conservative Instruction Scheduling+Hardware Optimization

Improves performance by 2%

Page 54: Carnegie Mellon Compiler Optimization of Scalar Value Communication Between Speculative Threads Antonia Zhai, Christopher B. Colohan, J. Gregory Steffan

Compiler Optimization of Scalar Value Communication…

- 54 - Zhai, Colohan, Steffan and Mowry

Carnegie Mellon

Hardware Optimization + Aggressive Instruction Scheduling

gcc go mcf parser perlbmk twolf vpr AVERAGE

Nor

mal

ized

Exe

cutio

n Ti

me

0

50

100 sync

failotherbusy

D=Aggressive Instruction Scheduling

O=Aggressive Instruction Scheduling+Hardware Optimization

Degrades performance slightly

Hardware optimization is less important with compiler optimization

Page 55: Carnegie Mellon Compiler Optimization of Scalar Value Communication Between Speculative Threads Antonia Zhai, Christopher B. Colohan, J. Gregory Steffan

Compiler Optimization of Scalar Value Communication…

- 55 - Zhai, Colohan, Steffan and Mowry

Carnegie Mellon

Hardware Optimization for Communicating Both Scalar and Memory Values

Reduces cost of violation by 27.6%,

translating into 4.7% performance improvement

Nor

mal

ized

Exe

cutio

n Ti

me

0

50

100

gcc go mcf parser perlbmk twolf vprcompress crafty gap gzip ijpeg m88k vortex

sync

failotherbusy

Page 56: Carnegie Mellon Compiler Optimization of Scalar Value Communication Between Speculative Threads Antonia Zhai, Christopher B. Colohan, J. Gregory Steffan

Compiler Optimization of Scalar Value Communication…

- 56 - Zhai, Colohan, Steffan and Mowry

Carnegie Mellon

Impact of Hardware and Software Optimizations

6 benchmarks are improved by 6.2-28.5%, 4 others by 2.7-3.6%

Nor

mal

ized

Exe

cutio

n Ti

me

0

50

100

gccgo mcf

parserperlbmk

twolf vprcompress

craftygap

gzipijpeg

m88kvortex

Page 57: Carnegie Mellon Compiler Optimization of Scalar Value Communication Between Speculative Threads Antonia Zhai, Christopher B. Colohan, J. Gregory Steffan

Compiler Optimization of Scalar Value Communication…

- 57 - Zhai, Colohan, Steffan and Mowry

Carnegie Mellon

Conclusions

Critical forwarding path is an important bottleneck in TLS

Loop induction variables serialize parallel threads Can be eliminated with our instruction scheduling algorithm

Non-inductive scalars can benefit from conservative instruction scheduling

Aggressive instruction scheduling should be applied selectively Speculating on control dependence alone is not very effective Speculating on both control and data dependences can reduce

synchronization significantly GCC is the biggest benefactor

Hardware optimization is less important as compiler schedules instructions more aggressively

Critical forwarding path is best addressed by the compiler