Datorteknik F1 bild 1 Instruction Level Parallelism Scalar-processors the model so far SuperScalar multiple execution units in parallel VLIW multiple instructions read in parallel

Upload: ahmad-simpson

Post on 19-Jan-2016


Datorteknik F1 bild 1

Instruction Level Parallelism

Scalar-processors – the model so far

SuperScalar – multiple execution units in parallel

VLIW – multiple instructions read in parallel


Datorteknik F1 bild 2

Scalar Processors

T = Nq * CPI * Ct
– T is the time to perform a task
– Nq = number of instructions, CPI = cycles per instruction, Ct = cycle time

Pipeline
– CPI = 1
– Ct determined by the critical path

But:
– Floating-point operations are slow in software
– Even in hardware (FPU) they take several cycles

WHY NOT USE SEVERAL FLOATING POINT UNITS?
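As a rough illustration of the formula T = Nq * CPI * Ct, the sketch below compares an ideal pipeline (CPI = 1) with one whose average CPI is raised by multi-cycle floating-point operations. The instruction count, the 10 ns cycle time and the CPI of 1.3 are made-up numbers chosen only for the example.

```python
def execution_time(nq, cpi, ct):
    """Return T = Nq * CPI * Ct, in the same time unit as ct."""
    return nq * cpi * ct

# Hypothetical workload: one million instructions on a 10 ns clock.
NQ = 1_000_000
CT = 10e-9  # seconds per cycle (assumed)

t_ideal = execution_time(NQ, cpi=1.0, ct=CT)    # ideal pipeline, CPI = 1
t_slow_fp = execution_time(NQ, cpi=1.3, ct=CT)  # assumed average CPI with slow FP ops

print(f"ideal pipeline: {t_ideal * 1e3:.1f} ms")
print(f"with slow FP:   {t_slow_fp * 1e3:.1f} ms")
```

Because Nq and Ct are unchanged, any increase in average CPI lengthens T by the same factor.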


Datorteknik F1 bild 3

SuperScalar Processors

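[Figure: superscalar pipeline. IF and DE feed several parallel execution units, an ALU (1 cycle) and PFU 1 ... PFU n (m cycles each), followed by DM and WB. "issue" is marked before the execution units and "completion" after them.]

Each unit may take several cycles to finish.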


Datorteknik F1 bild 4

Instruction vs. Machine Parallelism

Instruction Parallelism
– Average number of instructions that can be executed in parallel
– Depends on:
  • “true dependencies”
  • branches in relation to other instructions

Machine Parallelism
– The ability of the hardware to utilize instruction parallelism
– Depends on:
  • the number of instructions that can be fetched and executed each cycle (instruction memory bandwidth, instruction buffer (window), available resources)
  • the ability to spot instruction parallelism


Datorteknik F1 bild 5

Example 1

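1) add $t0 $t1 $t2
2) addi $t0 $t0 1
3) sub $t3 $t1 $t2
4) subi $t3 $t3 1

Instruction 2) is dependent on 1), and 4) is dependent on 3), but the pair 1)-2) is independent of the pair 3)-4). The two pairs can therefore be executed concurrently:

1) add $t0 $t1 $t2      3) sub $t3 $t1 $t2
2) addi $t0 $t0 1       4) subi $t3 $t3 1

Spotting this requires instruction lookahead, or “prefetch”.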
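The grouping in Example 1 can be reproduced with a small sketch that looks only at true (read-after-write) dependencies and assumes unlimited execution units. The tuple encoding of instructions and the function name are illustrative choices, not part of the slides.

```python
# Each instruction: (mnemonic, destination register, source registers)
PROGRAM = [
    ("add",  "$t0", ["$t1", "$t2"]),
    ("addi", "$t0", ["$t0"]),
    ("sub",  "$t3", ["$t1", "$t2"]),
    ("subi", "$t3", ["$t3"]),
]

def schedule_by_true_dependencies(program):
    """Group instruction numbers (1-based) by the earliest cycle they can start."""
    last_writer = {}  # register -> index of the most recent instruction writing it
    level = []        # earliest start cycle of each instruction
    for i, (_op, dest, sources) in enumerate(program):
        deps = [last_writer[r] for r in sources if r in last_writer]
        # One cycle after the latest producer, or cycle 0 if there is none.
        level.append(1 + max((level[d] for d in deps), default=-1))
        last_writer[dest] = i
    cycles = {}
    for i, lv in enumerate(level):
        cycles.setdefault(lv, []).append(i + 1)
    return [cycles[lv] for lv in sorted(cycles)]

groups = schedule_by_true_dependencies(PROGRAM)
print(groups)
```

Each inner list holds the instructions that could start in the same cycle; a real machine would also be limited by the number of units and the size of its instruction window.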


Datorteknik F1 bild 6

Issue & Completion

Out-of-order issue (instructions start “out of order”)
– RAW hazards (read after write, true dependence)
– WAR hazards (write after read, antidependence)

Out-of-order completion (instructions finish “out of order”)
– WAW hazards (write after write, output dependence: result overwritten)

Example with 2 parallel execution units and a 4-stage pipeline:

1) add $t0 $t1 $t2
2) addi $t0 $t0 1
3) sub $t3 $t1 $t2
4) subi $t3 $t3 1

Issue order: 1) and 3) first, then 2) and 4)
Completion order: 1) and 3) first, then 2) and 4)
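The three hazard types can be told apart by comparing the destination and source registers of two instructions. The following is a minimal sketch; the (destination, sources) encoding and the function name are assumptions made for illustration.

```python
def classify_hazards(first, second):
    """Return the hazards between two instructions, where `first` precedes
    `second` in program order. Each instruction is (dest, [sources])."""
    d1, s1 = first
    d2, s2 = second
    hazards = set()
    if d1 in s2:
        hazards.add("RAW")  # second reads what first writes (true dependence)
    if d2 in s1:
        hazards.add("WAR")  # second overwrites what first still has to read
    if d1 == d2:
        hazards.add("WAW")  # both write the same register
    return hazards

add_  = ("$t0", ["$t1", "$t2"])  # 1) add  $t0 $t1 $t2
addi_ = ("$t0", ["$t0"])         # 2) addi $t0 $t0 1
sub_  = ("$t3", ["$t1", "$t2"])  # 3) sub  $t3 $t1 $t2

print(classify_hazards(add_, addi_))  # hazards between 1) and 2)
print(classify_hazards(add_, sub_))   # hazards between 1) and 3)
```

An empty result means the two instructions can be issued and completed in any order relative to each other.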


Datorteknik F1 bild 7

Tomasulo’s Algorithm

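[Figure: pipeline with IF and DE feeding three parallel execution units A, B and C, followed by DM and WB.]

Instruction queue:
mul $r1 2 3
mul $r2 $r1 4
mul $r2 5 6

Unit status: A IDLE, B IDLE, C IDLE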


Datorteknik F1 bild 8

Instruction Issue

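[Figure: pipeline with IF and DE feeding execution units A, B and C, followed by DM and WB.]

Instruction queue:
mul $r1 2 3
mul $r2 $r1 4
mul $r2 5 6

Issued to unit A: mul $r1 2 3 (operands 2 and 3)

Register status: $r1 waits for A

Unit status: A BUSY, B IDLE, C IDLE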


Datorteknik F1 bild 9

Instruction Issue

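[Figure: pipeline with IF and DE feeding execution units A, B and C, followed by DM and WB.]

Instruction queue:
mul $r1 2 3
mul $r2 $r1 4
mul $r2 5 6

Unit A: mul $r1 2 3 (operands 2 and 3)
Issued to unit B: mul $r2 A 4 (operands A and 4; $r1 is replaced by the unit that will produce it)

Register status: $r1 waits for A, $r2 waits for B

Unit status: A BUSY, B WAIT, C IDLE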


Datorteknik F1 bild 10

Instruction Issue

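[Figure: pipeline with IF and DE feeding execution units A, B and C, followed by DM and WB.]

Instruction queue:
mul $r1 2 3
mul $r2 $r1 4
mul $r2 5 6

Unit A: mul $r1 2 3 (operands 2 and 3)
Unit B: mul $r2 A 4 (operands A and 4)
Issued to unit C: mul $r2 5 6

Register status: $r1 waits for A, $r2 waits for C (register $r2 gets the newer value)

Unit status: A BUSY, B WAIT, C BUSY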


Datorteknik F1 bild 11

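Clock until A and C finish

[Figure: pipeline with IF and DE feeding execution units A, B and C, followed by DM and WB.]

Instruction queue:
mul $r1 2 3
mul $r2 $r1 4
mul $r2 5 6

Unit A: mul $r1 2 3 (finished)
Unit B: mul $r2 6 4 (operands 6 and 4; the result of A has been filled in)
Unit C: mul $r2 5 6 (finished)

Register status: $r1 = 6, $r2 = 30

Unit status: A IDLE, B BUSY, C IDLE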


Datorteknik F1 bild 12

Clock until B finishes

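[Figure: pipeline with IF and DE feeding execution units A, B and C, followed by DM and WB.]

Instruction queue:
mul $r1 2 3
mul $r2 $r1 4
mul $r2 5 6

Unit B: mul $r2 6 4 (finished)

Register status: $r1 = 6, $r2 = 30 (NOT CHANGED! $r2 was waiting for C, the newer writer, so the result from B is not written to it)

Unit status: A IDLE, B IDLE, C IDLE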
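The walk-through above can be replayed with a very small sketch of the register-tagging idea. It is a simplification rather than a full implementation of Tomasulo's algorithm: there is one station per unit, no timing, and the data structures, function names and initial register values are all assumptions made for the example.

```python
regs = {"$r1": 0, "$r2": 0}  # register values (initial values assumed)
tags = {}                    # register -> unit that will produce its next value
stations = {}                # unit -> (destination register, [operand, operand])

def read_operand(x):
    """Return an immediate, a register value, or the unit the register waits for."""
    if isinstance(x, int):
        return x
    return tags[x] if x in tags else regs[x]

def issue(unit, dest, a, b):
    stations[unit] = (dest, [read_operand(a), read_operand(b)])
    tags[dest] = unit  # the newest writer of dest is now this unit

def finish(unit):
    dest, (a, b) = stations.pop(unit)
    assert isinstance(a, int) and isinstance(b, int), "operands not ready"
    result = a * b
    # Forward the result to stations that were waiting for this unit.
    for _dest, ops in stations.values():
        ops[:] = [result if op == unit else op for op in ops]
    # Write the register only if this unit is still its newest writer.
    if tags.get(dest) == unit:
        regs[dest] = result
        del tags[dest]
    return result

issue("A", "$r1", 2, 3)      # mul $r1 2 3
issue("B", "$r2", "$r1", 4)  # mul $r2 $r1 4  -> waits for A
issue("C", "$r2", 5, 6)      # mul $r2 5 6    -> $r2 now waits for C
finish("A")
finish("C")
b_result = finish("B")       # computed, but not written to $r2
print(regs, b_result)
```

The check in `finish` is what keeps $r2 at the value produced by C even though B completes last.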


Datorteknik F1 bild 13

SuperScalar Designs

3-8 times faster than scalar designs, depending on
– Instruction parallelism (upper bound)
– Machine parallelism

Pros
– Backward compatible (optimization is done at run time)

Cons
– Complex hardware implementation
– Not scalable (limited by instruction parallelism)


Datorteknik F1 bild 14

VLIW

Why not let the compiler do the work?

Use a Very Long Instruction Word (VLIW)
– Consisting of many instructions in parallel

– Each time we read one VLIW instruction we actually issue all instructions contained in the VLIW instruction
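To make "let the compiler do the work" concrete, here is a minimal sketch of how a compiler might pack independent instructions into VLIW instructions. The slot width of 4, the conservative conflict test and all names are assumptions for illustration; real VLIW compilers use far more elaborate scheduling.

```python
WIDTH = 4  # instruction slots per VLIW instruction (assumed)

def conflicts(earlier, later):
    """True if `later` must not share a VLIW instruction with `earlier`
    or be placed before it (RAW, WAR or WAW on some register)."""
    _, d1, s1 = earlier
    _, d2, s2 = later
    return d1 in s2 or d2 in s1 or d1 == d2

def pack(program, width=WIDTH):
    """Place each instruction in the earliest VLIW instruction that comes
    after everything it conflicts with; pad unused slots with nop."""
    bundles = []  # each entry: list of instruction indices
    placed = []   # bundle index chosen for each instruction
    for i, ins in enumerate(program):
        earliest = 0
        for j in range(i):
            if conflicts(program[j], ins):
                earliest = max(earliest, placed[j] + 1)
        b = earliest
        while b < len(bundles) and len(bundles[b]) >= width:
            b += 1  # bundle full, try the next one
        while len(bundles) <= b:
            bundles.append([])
        bundles[b].append(i)
        placed.append(b)
    return [[program[i][0] for i in bundle] + ["nop"] * (width - len(bundle))
            for bundle in bundles]

PROGRAM = [
    ("add",  "$t0", ["$t1", "$t2"]),
    ("addi", "$t0", ["$t0"]),
    ("sub",  "$t3", ["$t1", "$t2"]),
    ("subi", "$t3", ["$t3"]),
]
vliw_code = pack(PROGRAM)
for word in vliw_code:
    print(word)
```

The nop padding shows the cost of limited instruction parallelism: slots that cannot be filled are still fetched.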


Datorteknik F1 bild 15

VLIW

[Figure: a 128-bit VLIW instruction is split into four 32-bit instructions, each feeding its own IF DE EX DM WB pipeline. One part of the figure is annotated "Usually the bottleneck".]
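Splitting a 128-bit VLIW instruction into four 32-bit instructions is plain bit slicing, as the sketch below shows. Taking slot 0 from the most significant bits is an arbitrary assumption; the slides do not specify the slot order.

```python
def split_vliw(word, slots=4, slot_bits=32):
    """Split one VLIW instruction word into its per-pipeline instructions.
    Slot 0 is taken from the most significant bits (assumed ordering)."""
    mask = (1 << slot_bits) - 1
    return [(word >> (slot_bits * (slots - 1 - i))) & mask for i in range(slots)]

# Hypothetical 128-bit word with an easily recognisable pattern in each slot.
word = 0x11111111_22222222_33333333_44444444
parts = split_vliw(word)
print([hex(p) for p in parts])
```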


Datorteknik F1 bild 16

VLIW

Let the compiler do the instruction issuing
– Let it take its time: we do this only once, so it can be ADVANCED

What if we change the architecture?
– Recompile the code
– Could be done the first time you load a program; only recompiled when the architecture has changed

We could also let the compiler know about
– Cache configuration: number of levels, line size, number of lines, replacement strategy, writeback/writethrough etc.

Hot Research Area!


Datorteknik F1 bild 17

VLIW

Pros
– We get high bandwidth to instruction memory
– Cheap compared to SuperScalar: not much extra hardware needed
– More parallelism: we spot parallelism at a higher level (C, MODULA, JAVA?) and can use advanced algorithms for optimization
– New architectures can be utilized by recompilation

Cons
– Software compatibility
– It has not “HIT THE MARKET” (yet).


Datorteknik F1 bild 18

4 State Branch Prediction

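loop: A
      bne loop      (taken 100 times)
      B
      j loop

[Figure: state diagram with four states. Two states, numbered (1) and (2), are labelled "Predict branch" and two are labelled "Predict no branch". The transitions between states are labelled BRA (branch taken) and NO BRA (branch not taken).]

We always predict BRA, in state (1), in the inner loop. When we exit the loop we fail once and go to state (2).

The next time we still predict BRA, in state (2), and go back to state (1).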
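The behaviour described for the nested loop can be checked with a sketch of a 2-bit saturating counter, a common way to build a 4-state predictor. Whether the state diagram on the slide has exactly these transitions is an assumption; here state 3 plays the role of (1) and state 2 the role of (2).

```python
def count_mispredictions(outcomes, state=3):
    """Simulate a 2-bit saturating counter over a list of branch outcomes.
    States 2 and 3 predict taken, states 0 and 1 predict not taken."""
    misses = 0
    for taken in outcomes:
        predicted_taken = state >= 2
        if predicted_taken != taken:
            misses += 1
        # Move one step toward the observed outcome, saturating at 0 and 3.
        state = min(3, state + 1) if taken else max(0, state - 1)
    return misses

# Inner branch: taken 100 times, then not taken once; outer loop runs 3 times.
outcomes = ([True] * 100 + [False]) * 3
misses = count_mispredictions(outcomes)
print(misses, "mispredictions out of", len(outcomes))
```

With this scheme only the loop exit is mispredicted on each pass; re-entering the inner loop is still predicted as taken.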


Datorteknik F1 bild 19

Branch Prediction

The 4 states are stored in 2 bits in the instruction cache, together with the conditional branch instruction

We predict the branch
– We prefetch the predicted instructions
– We issue these before we know if the branch is taken!

When the prediction fails we abort the issued instructions


Datorteknik F1 bild 20

Branch Prediction

[Figure: a loop whose body starts at the label "loop" with instructions 1), 2) and 3) and ends with "bne $r1 loop", which is predicted as branch taken.]

Instructions 1), 2) and 3) are prefetched and may already be issued by the time we know the value of $r1, since $r1 might be waiting for some unit to finish

In case of prediction failure we have to abort the issued instructions and start fetching 4), 5) and 6)


Datorteknik F1 bild 21

Multiple Branch Targets

[Figure: a loop whose body starts at the label "loop" with instructions 1), 2) and 3) and ends with "bne $r1 loop".]

Instructions 1), 2), 3), 4), 5) and 6) are all prefetched and may already be issued by the time we know the value of $r1, since $r1 might be waiting for some unit to finish

As soon as we know $r1 we abort the redundant instructions. VERY COMPLEX!!!