Datorteknik F1 bild 1
Instruction Level Parallelism
Scalar-processors – the model so far
SuperScalar – multiple execution units in parallel
VLIW – multiple instructions read in parallel
Datorteknik F1 bild 2
Scalar Processors
T = Nq * CPI * Ct
– T: the time to perform a task
– Nq: number of instructions, CPI: cycles per instruction, Ct: cycle time
Pipeline
– CPI = 1
– Ct determined by critical path
But:
– Floating point operations are slow in software
– Even in hardware (FPU) they take several cycles
WHY NOT USE SEVERAL FLOATING POINT UNITS?
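As a small illustration of the formula T = Nq * CPI * Ct, the sketch below evaluates it for two cases. The function name `execution_time` and all numbers are assumptions made up for this example, not figures from the lecture.

```python
def execution_time(nq: int, cpi: float, ct: float) -> float:
    """Return T = Nq * CPI * Ct (in seconds when ct is given in seconds)."""
    return nq * cpi * ct

# Hypothetical workload: one million instructions, 10 ns cycle time.
pipelined = execution_time(1_000_000, 1.0, 10e-9)  # ideal pipeline, CPI = 1
fp_heavy = execution_time(1_000_000, 2.5, 10e-9)   # assumed CPI when multi-cycle FP operations stall

print(f"ideal pipeline: {pipelined * 1e3:.1f} ms")
print(f"FP-heavy code:  {fp_heavy * 1e3:.1f} ms")
```

With Nq and Ct fixed, any increase in CPI caused by multi-cycle floating point operations shows up directly in T, which is the motivation for adding more floating point units.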
Datorteknik F1 bild 3
SuperScalar Processors
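[Figure: pipeline IF, DE, then parallel execution units ALU (1 cycle) and FPU 1 ... FPU n (m cycles), then DM, WB. Instructions are issued to the execution units and complete when they leave them.]
Each unit may take several cycles to finish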
Datorteknik F1 bild 4
Instruction vs. Machine Parallelism
Instruction Parallelism
– Average number of instructions that can be executed in parallel
– Depends on:
  • "true dependencies"
  • branches in relation to other instructions
Machine Parallelism
– The ability of the hardware to utilize instruction parallelism
– Depends on:
  • the number of instructions that can be fetched and executed each cycle (instruction memory bandwidth and instruction buffer (window), available resources)
  • the ability to spot instruction parallelism
Datorteknik F1 bild 5
Example 1
1) add $t0 $t1 $t2
2) addi $t0 $t0 1
3) sub $t3 $t1 $t2
4) subi $t3 $t3 1
– 2) is dependent on 1); 4) is dependent on 3)
– The pair 1), 2) is independent of the pair 3), 4)
– With instruction lookahead or "prefetch", the two pairs can be executed concurrently
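The dependencies among the four instructions of Example 1 can be found mechanically. The following is a minimal sketch, assuming each instruction is modeled as a destination register plus the registers it reads; the `Instr` type and the function `true_dependencies` are hypothetical names introduced here.

```python
from typing import NamedTuple

class Instr(NamedTuple):
    op: str
    dest: str      # register written
    srcs: tuple    # registers read

program = [
    Instr("add", "$t0", ("$t1", "$t2")),
    Instr("addi", "$t0", ("$t0",)),
    Instr("sub", "$t3", ("$t1", "$t2")),
    Instr("subi", "$t3", ("$t3",)),
]

def true_dependencies(prog):
    """Return pairs (i, j), 1-based: instruction j reads a register whose
    most recent earlier writer is instruction i (a "true dependency")."""
    deps = []
    last_writer = {}
    for j, ins in enumerate(prog, start=1):
        for src in ins.srcs:
            if src in last_writer:
                deps.append((last_writer[src], j))
        last_writer[ins.dest] = j
    return deps

deps = true_dependencies(program)
print(deps)  # [(1, 2), (3, 4)]
```

No pair links the first two instructions to the last two, so the two chains can run side by side.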
Datorteknik F1 bild 6
Issue & Completion
Out-of-order issue (starts "out of order")
– RAW hazards (read after write, true dependence)
– WAR hazards (write after read, antidependence)
Out-of-order completion (finishes "out of order")
– WAW hazards (write after write, output dependence: result overwritten)
Example, with 2 parallel execution units and a 4-stage pipeline:
1) add $t0 $t1 $t2
2) addi $t0 $t0 1
3) sub $t3 $t1 $t2
4) subi $t3 $t3 1
Issue: 1) and 3) first, then 2) and 4)
Completion: 1) and 3) first, then 2) and 4)
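The three hazard types can be told apart by comparing the registers two instructions read and write. The sketch below is one way to do that, assuming an instruction is represented as a pair (destination, set of sources); the function name `hazards` and this representation are assumptions for illustration.

```python
def hazards(first, second):
    """Classify hazards between two instructions, where `first` precedes
    `second` in program order. Each is (dest_register, set_of_source_registers)."""
    d1, s1 = first
    d2, s2 = second
    found = set()
    if d1 in s2:
        found.add("RAW")   # second reads what first writes
    if d2 in s1:
        found.add("WAR")   # second overwrites what first still has to read
    if d1 == d2:
        found.add("WAW")   # both write the same register
    return found

add_ = ("$t0", {"$t1", "$t2"})   # 1) add  $t0 $t1 $t2
addi = ("$t0", {"$t0"})          # 2) addi $t0 $t0 1
sub_ = ("$t3", {"$t1", "$t2"})   # 3) sub  $t3 $t1 $t2

print(sorted(hazards(add_, addi)))  # ['RAW', 'WAW']
print(sorted(hazards(add_, sub_)))  # []
```

Instructions 1) and 3) share no hazard, which is why they can be issued together.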
Datorteknik F1 bild 7
Tomasulo’s Algorithm
[Figure: pipeline IF, DE, then parallel execution units A, B, C, then DM, WB]
Program: mul $r1 2 3; mul $r2 $r1 4; mul $r2 5 6
Unit status: A IDLE, B IDLE, C IDLE
Datorteknik F1 bild 8
Instruction Issue
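[Figure: pipeline IF, DE, then parallel execution units A, B, C, then DM, WB]
Program: mul $r1 2 3; mul $r2 $r1 4; mul $r2 5 6
Unit A: mul $r1 2 3 (operands 2 and 3)
Register status: $r1 waits for A
Unit status: A BUSY, B IDLE, C IDLE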
Datorteknik F1 bild 9
Instruction Issue
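[Figure: pipeline IF, DE, then parallel execution units A, B, C, then DM, WB]
Program: mul $r1 2 3; mul $r2 $r1 4; mul $r2 5 6
Unit A: mul $r1 2 3 (operands 2 and 3)
Unit B: mul $r2 A 4 (operands A and 4; waits for the result of A)
Register status: $r1 waits for A, $r2 waits for B
Unit status: A BUSY, B WAIT, C IDLE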
Datorteknik F1 bild 10
Instruction Issue
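[Figure: pipeline IF, DE, then parallel execution units A, B, C, then DM, WB]
Program: mul $r1 2 3; mul $r2 $r1 4; mul $r2 5 6
Unit A: mul $r1 2 3 (operands 2 and 3)
Unit B: mul $r2 A 4 (operands A and 4)
Unit C: mul $r2 5 6
Register status: $r1 waits for A, $r2 waits for C (reg $r2 gets the newer value)
Unit status: A BUSY, B WAIT, C BUSY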
Datorteknik F1 bild 11
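Clock until A and C finish
[Figure: pipeline IF, DE, then parallel execution units A, B, C, then DM, WB]
Program: mul $r1 2 3; mul $r2 $r1 4; mul $r2 5 6
Units A (mul $r1 2 3) and C (mul $r2 5 6) have finished
Unit B: mul $r2 6 4 (the result of A, 6, has replaced the tag A)
Register status: $r1 = 6, $r2 = 30
Unit status: A IDLE, B BUSY, C IDLE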
Datorteknik F1 bild 12
Clock until B finishes
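[Figure: pipeline IF, DE, then parallel execution units A, B, C, then DM, WB]
Program: mul $r1 2 3; mul $r2 $r1 4; mul $r2 5 6
Unit B has finished mul $r2 6 4
Register status: $r1 = 6, $r2 = 30
Unit status: A IDLE, B IDLE, C IDLE
$r2 is NOT CHANGED! It was waiting for C, not for B, so the older result from B is not written.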
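The walkthrough on the preceding slides can be replayed in a few lines. This is a simplified sketch of the register-tagging idea only, not a full implementation of Tomasulo's algorithm: the `Unit` class, the `issue` and `finish` helpers, and the fixed assignment of instructions to units are assumptions made for this example.

```python
class Unit:
    def __init__(self, name):
        self.name = name
        self.ops = None    # operands: int values, or a unit name used as a tag
        self.dest = None   # destination register of the instruction in the unit

    @property
    def state(self):
        if self.ops is None:
            return "IDLE"
        return "WAIT" if any(isinstance(o, str) for o in self.ops) else "BUSY"

units = {n: Unit(n) for n in "ABC"}
reg_value = {"$r1": 0, "$r2": 0}
reg_tag = {"$r1": None, "$r2": None}  # unit that will produce the register, if any

def issue(unit_name, dest, a, b):
    """Issue 'mul dest a b'; a and b are immediates (int) or register names."""
    def read(x):
        if isinstance(x, int):
            return x
        # A pending register is represented by the tag of its producing unit.
        return reg_tag[x] if reg_tag[x] is not None else reg_value[x]
    u = units[unit_name]
    u.ops, u.dest = [read(a), read(b)], dest
    reg_tag[dest] = unit_name

def finish(unit_name):
    """Complete the multiplication in a unit and broadcast its result."""
    u = units[unit_name]
    assert u.state == "BUSY", "operands not ready"
    result = u.ops[0] * u.ops[1]
    for other in units.values():          # waiting units pick up the value
        if other.ops is not None:
            other.ops = [result if o == unit_name else o for o in other.ops]
    if reg_tag[u.dest] == unit_name:      # write only if still the newest producer
        reg_value[u.dest] = result
        reg_tag[u.dest] = None
    u.ops, u.dest = None, None
    return result

issue("A", "$r1", 2, 3)       # mul $r1 2 3
issue("B", "$r2", "$r1", 4)   # mul $r2 $r1 4, becomes mul $r2 A 4
issue("C", "$r2", 5, 6)       # mul $r2 5 6, $r2 now waits for C
states_after_issue = [units[n].state for n in "ABC"]
finish("A")
finish("C")
states_mid = [units[n].state for n in "ABC"]
b_result = finish("B")        # 24 is computed but not written to $r2
print(states_after_issue, states_mid, reg_value)
```

After the last step `$r1` holds 6 and `$r2` holds 30, matching the final slide: the result of unit B is dropped because `$r2` was re-tagged to unit C.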
Datorteknik F1 bild 13
SuperScalar Designs
3-8 times faster than scalar designs, depending on
– Instruction parallelism (upper bound)
– Machine parallelism
Pros
– Backward compatible (optimization is done at run time)
Cons
– Complex hardware implementation
– Not scalable (limited by instruction parallelism)
Datorteknik F1 bild 14
VLIW
Why not let the compiler do the work?
Use a Very Long Instruction Word (VLIW)
– Consisting of many instructions in parallel
– Each time we read one VLIW instruction, we actually issue all the instructions it contains
Datorteknik F1 bild 15
VLIW
[Figure: a 128-bit VLIW instruction holds four 32-bit instructions, each fed to its own IF DE EX DM WB pipeline. One part of the figure is marked "Usually the bottleneck".]
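The layout in the figure, one 128-bit word carrying four 32-bit instructions, can be sketched as bit packing. The slot order (slot 0 in the most significant bits) and the names `pack_vliw` and `unpack_vliw` are assumptions for illustration; the slides do not specify an encoding.

```python
MASK32 = 0xFFFFFFFF

def pack_vliw(slots):
    """Pack four 32-bit instruction words into one 128-bit integer.
    Assumed layout: slot 0 occupies the most significant 32 bits."""
    assert len(slots) == 4
    word = 0
    for s in slots:
        assert 0 <= s <= MASK32, "each slot must fit in 32 bits"
        word = (word << 32) | s
    return word

def unpack_vliw(word):
    """Split a 128-bit VLIW word back into its four 32-bit slots."""
    return [(word >> shift) & MASK32 for shift in (96, 64, 32, 0)]

# Hypothetical instruction encodings, chosen only to make the slots visible.
slots = [0x11111111, 0x22222222, 0x33333333, 0x44444444]
vliw = pack_vliw(slots)
print(hex(vliw))
print([hex(s) for s in unpack_vliw(vliw)])
```

One fetch of `vliw` delivers all four instructions at once, which is where the high instruction bandwidth of the approach comes from.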
Datorteknik F1 bild 16
VLIW
Let the compiler do the instruction issuing
– Let it take its time; we do this only once, so it can be ADVANCED
What if we change the architecture?
– Recompile the code
– Could be done the first time you load a program; only recompiled when the architecture has changed
We could also let the compiler know about
– Cache configuration: number of levels, line size, number of lines, replacement strategy, writeback/writethrough, etc.
Hot Research Area!
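To make "let the compiler do the instruction issuing" concrete, here is a minimal sketch of a greedy compile-time scheduler that groups instructions into VLIW bundles. It is a deliberately simple stand-in, not what a production compiler does: the function `schedule`, the tuple representation, and the conservative rule that any RAW, WAR or WAW conflict forces a later bundle are all assumptions of this example.

```python
def schedule(prog, width=4):
    """Place each instruction in the earliest bundle that comes after every
    earlier instruction it conflicts with and that still has a free slot.
    prog is a list of (text, dest_register, source_registers)."""
    bundles = []   # each bundle is a list of instruction indices
    placed = []    # placed[i] = index of the bundle holding instruction i
    for i, (_, dest, srcs) in enumerate(prog):
        earliest = 0
        for j in range(i):
            _, d_j, s_j = prog[j]
            # RAW, WAR or WAW conflict with an earlier instruction
            if d_j in srcs or dest in s_j or dest == d_j:
                earliest = max(earliest, placed[j] + 1)
        b = earliest
        while b < len(bundles) and len(bundles[b]) >= width:
            b += 1
        if b == len(bundles):
            bundles.append([])
        bundles[b].append(i)
        placed.append(b)
    return [[prog[i][0] for i in bundle] for bundle in bundles]

example = [
    ("add $t0 $t1 $t2", "$t0", ("$t1", "$t2")),
    ("addi $t0 $t0 1", "$t0", ("$t0",)),
    ("sub $t3 $t1 $t2", "$t3", ("$t1", "$t2")),
    ("subi $t3 $t3 1", "$t3", ("$t3",)),
]
bundles = schedule(example)
for n, bundle in enumerate(bundles):
    print(n, bundle)
```

For the four instructions of Example 1 this yields two bundles: `add` with `sub`, then `addi` with `subi`. Because the work happens once at compile time, a real compiler can afford far more elaborate analysis than this.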
Datorteknik F1 bild 17
VLIW
Pros
– We get high bandwidth to instruction memory
– Cheap compared to SuperScalar: not much extra hardware needed
– More parallelism: we spot parallelism at a higher level (C, MODULA, JAVA?) and can use advanced algorithms for optimization
– New architectures can be utilized by recompilation
Cons
– Software compatibility
– It has not "HIT THE MARKET" (yet).
Datorteknik F1 bild 18
4 State Branch Prediction
loop: A
      bne 100times loop
      B
      j loop
[Figure: state diagram with four states. States (1) and (2) predict branch; the other two states predict no branch. Transitions are labeled BRA and NO BRA.]
We always predict BRA (1) in the inner loop. When we exit, we fail once and go to (2).
Next time we still predict BRA (2) and go to (1).
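The behavior described for the inner loop can be checked with a small simulation. The sketch assumes the four states form a 2-bit saturating counter, which matches the description of states (1) and (2) but may differ from the exact transitions in the slide's diagram; `TwoBitPredictor` and `count_mispredictions` are names invented here.

```python
class TwoBitPredictor:
    """Assumed encoding: 3 = state (1), strongly predict branch;
    2 = state (2), weakly predict branch; 1 and 0 predict no branch."""

    def __init__(self, state=3):
        self.state = state

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)

def count_mispredictions(outer_iterations, inner_iterations):
    """Count wrong predictions for the inner-loop bne over several outer passes."""
    p = TwoBitPredictor()
    misses = 0
    for _ in range(outer_iterations):
        for k in range(inner_iterations):
            taken = k < inner_iterations - 1   # bne falls through only on the last pass
            if p.predict() != taken:
                misses += 1
            p.update(taken)
    return misses

misses = count_mispredictions(outer_iterations=5, inner_iterations=100)
print(misses)  # 5
```

Each exit from the inner loop costs one misprediction and moves the predictor to state (2); the first branch of the next pass is still predicted correctly and moves it back to state (1).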
Datorteknik F1 bild 19
Branch Prediction
The 4 states are stored in 2 bits in the instruction cache, together with the conditional branch instruction
We predict the branch
– We prefetch the predicted instructions
– We issue these before we know if the branch is taken!
When the prediction fails, we abort the issued instructions
Datorteknik F1 bild 20
Branch Prediction
[Figure: a loop ending in bne $r1 loop, predicted "branch taken"; instructions 1) 2) 3) lie on the predicted path]
Instructions 1), 2) and 3) are prefetched and may already have been issued by the time we know the value of $r1, since $r1 might be waiting for some unit to finish
In case of prediction failure we have to abort the issued instructions and start fetching 4), 5) and 6)
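The abort-and-refetch rule can be written down as a tiny function. This is only a sketch of the bookkeeping, assuming instructions are plain labels and that 1) 2) 3) lie on the predicted (taken) path while 4) 5) 6) lie on the other path; `resolve_branch` is a hypothetical name.

```python
def resolve_branch(predicted_taken, actually_taken, taken_path, fallthrough_path):
    """Return (to_run, aborted) once the branch outcome is known.
    to_run:  instructions on the correct path
    aborted: speculatively issued instructions that must be thrown away"""
    issued = taken_path if predicted_taken else fallthrough_path
    if predicted_taken == actually_taken:
        return list(issued), []            # prediction was right, nothing to undo
    correct = taken_path if actually_taken else fallthrough_path
    return list(correct), list(issued)     # abort the wrong path, fetch the right one

taken_path = ["1)", "2)", "3)"]
fallthrough_path = ["4)", "5)", "6)"]

hit = resolve_branch(True, True, taken_path, fallthrough_path)
miss = resolve_branch(True, False, taken_path, fallthrough_path)
print(hit)
print(miss)
```

On a correct prediction nothing is wasted; on a failure the three issued instructions are discarded and fetching restarts at 4).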
Datorteknik F1 bild 21
Multiple Branch Targets
[Figure: a loop ending in bne $r1 loop; instructions from both branch targets are prefetched]
Instructions 1), 2), 3), 4), 5) and 6) are prefetched and may already have been issued by the time we know the value of $r1, since $r1 might be waiting for some unit to finish
As soon as we know $r1 we abort the redundant instructions. VERY COMPLEX!!!