computer architecture (cs-213) · • how much cycles does first instruction take, in a pipelined...

Computer Architecture (CS-213)

Main Objectives

• Hazards in Pipelining • Types of Hazards • Data Hazards • Forwarding • Stalling • Control Hazards

Revision

• Do you Know ? • Ideally speed-up (performance factor) from non-

pipelined to pipelined is what ? • What limits the Speed-up ? • How much cycles does first instruction take, in a

pipelined processor, to execute ? • How do we manage control signals in pipelining ?

Pipeline Hazards

• Data Hazards – an instruction uses the result of the previous instruction. A hazard occurs exactly when an instruction tries to read a register in its ID stage that an earlier instruction intends to write in its WB stage.

• Control Hazards – the location of an instruction depends on

previous instruction • Structural Hazards – two instructions need to access the same

resource

5

Data Hazards & forwarding

Data Forwarding

• Take the result from the earliest point that it exists in any of the pipeline state registers and forward it to the functional units (e.g., the ALU) that need it that cycle

• For ALU functional unit: the inputs can come from any pipeline register rather than just from ID/EX by – adding multiplexors to the inputs of the ALU – connecting the Rd write data in EX/MEM or MEM/WB

to either (or both) of the EX’s stage Rs and Rt ALU mux inputs

– adding the proper control hardware to control the new muxes

Data Forwarding Control Conditions 1. EX/MEM hazard: if (EX/MEM.RegWrite and (EX/MEM.RegisterRd != 0) and (EX/MEM.RegisterRd = ID/EX.RegisterRs)) ForwardA = 10 if (EX/MEM.RegWrite and (EX/MEM.RegisterRd != 0) and (EX/MEM.RegisterRd = ID/EX.RegisterRt)) ForwardB = 10

Forwards the result from the previous instr. to either input of the ALU

Forwards the result from the second previous instr. to either input of the ALU

2. MEM/WB hazard: if (MEM/WB.RegWrite and (MEM/WB.RegisterRd != 0) and (MEM/WB.RegisterRd = ID/EX.RegisterRs)) ForwardA = 01 if (MEM/WB.RegWrite and (MEM/WB.RegisterRd != 0) and (MEM/WB.RegisterRd = ID/EX.RegisterRt)) ForwardB = 01

Forwarding Illustration

I n s t r.

O r d e r

add $1,

sub $4,$1,$5

and $6,$7,$1

ALU IM Reg DM Reg

ALU IM Reg DM Reg

ALU IM Reg DM Reg

EX/MEM hazard forwarding

MEM/WB hazard forwarding

I n s t r.

O r d e r

add $1,$1,$2

ALU IM Reg DM Reg

add $1,$1,$3

add $1,$1,$4

ALU IM Reg DM Reg

ALU IM Reg DM Reg

• Another potential data hazard can occur when there is a conflict between the result of the WB stage instruction and the MEM stage instruction – which should be forwarded?

Complication

Corrected Data Forwarding Control Conditions

2. MEM/WB hazard: if (MEM/WB.RegWrite and (MEM/WB.RegisterRd != 0) and (EX/MEM.RegisterRd != ID/EX.RegisterRs) and (MEM/WB.RegisterRd = ID/EX.RegisterRs)) ForwardA = 01 if (MEM/WB.RegWrite and (MEM/WB.RegisterRd != 0) and (EX/MEM.RegisterRd != ID/EX.RegisterRt) and (MEM/WB.RegisterRd = ID/EX.RegisterRt)) ForwardB = 01

12

Avoiding/Detecting Data Hazards

• Add two NOP instructions between sub and add instructions

• Detect a hazard – Ex/Mem.RegisterRd = ID/Exe.RegisterRs – Ex/Mem.RegisterRd = ID/Exe.RegisterRt – Mem/WB.RegisterRd = ID/Exe.RegisterRs – Mem/WB.RegisterRd = ID/Exe.RegisterRt

13

Forwarding Data

14

Implementing the forward unit

15

Control values for forwarding multiplexers

Mux Control Source Explanation

ForwardA =00 ID/EX 1st ALU operand comes from register file

ForwardA= 10 EX/MEM 1st ALU operand is forwarded from prior ALU result

ForwardA= 01 MEM/WB 1st ALU operand is forwarded from data memory or an earlier ALU result

16

Control values for forwarding multiplexers

Mux Control Source Explanation

ForwardB =00 ID/EX 2nd ALU operand comes from register file

ForwardB= 10 EX/MEM 2nd ALU operand is forwarded from prior ALU result

ForwardB= 01 MEM/WB 2nd ALU operand is forwarded from data memory or an earlier ALU result

17

Modified Datapath for Forwarding

18

Load Word Hazard and Stall

19

Load Word Hazard and Stall

20

Hazard Detection Unit

21

Try producing fast code for a = b + c; d = e – f; assuming a, b, c, d ,e, and f in memory. Slow code: LW Rb,b LW Rc,c ADD Ra,Rb,Rc SW a,Ra LW Re,e LW Rf,f SUB Rd,Re,Rf SW d,Rd

Software Scheduling to Avoid Load Hazards

Fast code: LW Rb,b LW Rc,c LW Re,e ADD Ra,Rb,Rc LW Rf,f SW a,Ra SUB Rd,Re,Rf SW d,Rd

Control Hazards • Branch instructions can cause great performance loss • Branch instructions need two things:

– The result of branch: Taken or Not Taken – Branch target:

• PC + 4 Branch NOT taken • PC + 4 + immediate*4 Branch Taken

• Branch instruction is not detected until the ID stage – At which point a new instruction has already been fetched

• For our original pipeline: – Effective address is not calculated until EX stage – Branch condition get set in the EX/MEM register (EX/MEM.zero) – 3-cycle branch delay

M

IF/ID

Branch Delay – CC1 • Consider the pipelined execution of: beq $1, $3, 100 • During the first cycle, beq is fetched in the IF stage

Rs

Rt

Imm16 Extend

PC =

100

0

Registers

A d d

Dat

a_in

ALU result

ID EX beq $1, $3, 100 MEM

ID/EX EX/MEM

Reg

_dst

1004

Instruction Memory

Address

Instruction

+4

m u x

0

1 A L U m

u x

m u x

m u x

Writeback data

Zero

Op

Main Control M

W

E

W

PCSrc

M

Branch Delay – CC2 • During the second cycle, beq is decoded in the ID stage • The next_1 instruction is fetched in the IF stage

Extend

Registers

Rs

Rt

Imm16

PC =

100

4

A d d

Dat

a_in

ALU result

EX next_1 MEM

ID/EX EX/MEM

Reg

_dst

Instruction Memory

Address

Instruction

1008

+4

m u x

0

1 A L U m

u x

m u x

m u x

Writeback data

Zero

Main Control M

W

E

W

PCSrc

beq $1, $3, 100

IF/ID

beq

$3

$1

100

1004

Imm16

M

Branch Delay – CC3 • During the third cycle, beq is executed in the EX stage • The next_2 instruction is fetched in the IF stage

Extend

Registers

Rs

Rt PC =

100

8

Dat

a_in

ALU result

beq $1, $3, 100 next_2 MEM

EX/MEM

Reg

_dst

Instruction Memory

Address

Instruction

1012

+4

m u x

0

1

Writeback data

Zero

Main Control M

W

E

W

PCSrc

next_1

IF/ID

1008

ID/EX

1004

12

34

1234

10

0

A d d

A L U m

u x

m u x

m u x

Beq = 1

Imm16

Instruction Memory

Address

Instruction

1016

Branch Delay – CC4 • During the fourth cycle, beq reaches MEM stage • The next_3 instruction is fetched in the IF stage

Extend

Registers

Rs

Rt PC =

101

2

Dat

a_in

ALU result

next_1 next_3 beq $1, $3, 100

Reg

_dst

Writeback data

Main Control M

W

E

M

W

Zero = 1

Beq = 1

next_2

IF/ID

1012

ID/EX

1008

A d d

A L U m

u x

m u x

m u x

EX/MEM

1404

0

1

PCSrc

m u x

0

1

+4

Imm16

Instruction Memory

Address

Instruction

1408

Branch Delay – CC5 • During the fifth cycle, branch_target instruction is fetched • Next_1 thru next_3 should be converted into NOPs

Extend

Registers

Rs

Rt PC =

140

4

Dat

a_in

ALU result

next_2 branch_target next_1

Reg

_dst

Writeback data

Main Control M

W

E

M

W

Zero

next_3

IF/ID

1016

ID/EX

1012

A d d

A L U m

u x

m u x

m u x

EX/MEM

PCSrc

m u x

0

1

+4

3-Cycle Branch Delay

beq $1,$3,100

Next_1 // bubble

ALU

DM

IM Reg

Reg

cc1 cc2 cc3 cc4 cc5 cc6

IM Reg

IM Reg

IM Reg

Bubble

Bubble

Bubble

Next_2 // bubble

Next_3 // bubble

Branch_Target

Bubble

Bubble

Bubble

ALU

IM Reg

Bubble

Bubble

cc7

• Next_1 thru Next_3 will be fetched anyway • Pipeline should flush Next_1 thru Next_3 if branch is

taken • Otherwise, they can be executed normally

Reducing the Delay of Branches • Branch delay can be reduced from 3 cycles to just 1 cycle • Branch decision is moved from 4th into 2nd pipeline stage

– Branches can be determined earlier in the ID stage – Branch address calculation adder is moved to ID stage – A comparator in the ID stage to compare the two fetched registers

• To determine branch decision, whether the branch is taken or not

• Only one instruction that follows the branch will be fetched • If the branch is taken then only one instruction is flushed • We need a control signal IF.Flush to zero the IF/ID register

– This will convert the fetched instruction into a NOP, given we need to branch to target address.

PC Instruction memory

4

Registers

M u x

M u x

M u x

ALU

EX

M

WB

M

WB

WB

ID/EX

0

EX/MEM

MEM/WB

Data memory

M u x

Hazard detection

unit

Forwarding unit

IF.Flush

IF/ID

Sign extend

Control

M u x

=

Shift left 2

M u x

Reducing the Delay of Branches

Branch Hazard Alternatives • Always stall the pipeline until branch direction is known

– Next instruction is always flushed (turned into a NOP) • Predict Branch Not Taken

– Fetch successor instruction: PC+4 already calculated – Almost half of MIPS branches are not taken on average – Flush instructions in pipeline only if branch is actually taken

• Predict Branch Taken – Can predict backward branches in loops which are taken most

of the time – However, branch target address is determined in ID stage so we

Must reduce branch delay from 1 cycle to 0, but how? • Delayed Branch

– Define branch to take place AFTER the following instruction

• Define branch to take place after the next instruction • For a 1-cycle branch delay, we have one delay slot branch instruction branch delay slot – next instruction . . . branch target – if branch taken

• Compiler/assembler fills the branch delay slot By selecting a useful instruction, which must be executed irrelevant of the result of branch.

branch instruction (taken) IF ID EX MEM WB

branch delay slot (next instruction) IF ID EX MEM WB

branch target IF ID EX MEM WB

Delayed Branch

Scheduling the Branch Delay Slot

1. From an independent instruction before the branch

2. From a target instruction when branch is predicted taken

3. From fall through when branch is predicted not taken

add $t2,$t3,$t4

beq $s1, $s0

Delay Slot

beq $s1, $s0

add $t2,$t3,$t4

sub $t4,$t5,$t6

beq $s1, $s0

Delay Slot

beq $s1, $s0

sub $t4,$t5,$t6

Delay Slot

beq $s1, $s0

Sub $t4,$t5,$t6

From

Bef

ore

From

Tar

get

From

Fal

l Thr

ough

beq $s1, $s0

sub $t4,$t5,$t6

More on Delayed Branch • Scheduling delay slot with

– Independent instruction is the best choice • However, not always possible to find an independent instruction

– Target instruction is useful when branch is predicted taken • Such as in a loop branch (e.g., for loop) • Cancel branch delay instruction if branch is not taken

– Fall through is useful when branch is predicted not taken • Cancel branch delay instruction if branch is taken

• Disadvantages of delayed branch – Branch delay can increase to multiple cycles in deeper pipelines – Zero-delay branching + dynamic branch prediction are the most

appropriate solution to the problem

Zero-Delayed Branch • How can we achieve zero-delay for a taken branch …

– If the branch target address is computed in the IF stage ?

• Solution – Check the PC to see if the instruction being fetched is a branch – Store the branch target address in a table in the IF stage – Such a table is called the branch target buffer – If branch is predicted taken then

• Next PC = branch target fetched from target buffer

– Otherwise, if branch is predicted not taken then • Next PC = PC + 4

– Zero-delay is achieved because Next PC is determined in IF stage

Branch Target and Prediction Buffer

• The branch target buffer is implemented as a small cache – That stores the branch target address of taken branches

• We also have a branch prediction buffer, It determines whether we need to take the branch or not. – To store the prediction bits for branch instructions – The prediction bits are dynamically determined by the hardware

PC

mux

+4

Pred

ictio

n Bu

ffer

Branch Target Buffer

PC of Branch Target Address

Lookup

Dynamic Branch Prediction • Prediction of branches at runtime using prediction bits

– One or few prediction bits are associated with a branch instruction • Branch prediction buffer is a small memory

– Indexed by the lower portion of the address of branch instruction • The simplest scheme is to have 1 prediction bit per branch • We don’t know if the prediction bit is correct or not • If correct prediction …

– Continue normal execution – no wasted cycles • If incorrect prediction (misprediction) …

– Flush the instructions that were incorrectly fetched – wasted cycles – Update prediction bit and target address for future use

• Prediction is just a hint that is assumed to be correct • If incorrect then fetched instructions are flushed • 1-bit prediction scheme has a performance shortcoming

– A loop branch is almost always taken, except for last iteration – 1-bit scheme will predict incorrectly twice, rather than once – On the first and last loop iterations

• 2-bit prediction schemes are often used – A prediction must be wrong twice before it is changed – A loop branch is mispredicted only once on the last iteration

2-bit Prediction Scheme

Not Taken

Predict Taken

Predict Taken

Not Taken

Taken

Taken

Taken

Taken

Not Taken

Not Taken

Not Taken

Not Taken

computer architecture (cs-213) · • how much cycles does first instruction take, in a pipelined...

Documents