abstraction question general purpose processors have an abstraction layer fixed at the isa and have...

34
Abstraction Question General purpose processors have an abstraction layer fixed at the ISA and have little control over the compilers or code run on the machine Embedded processors tend to have entire control over the code run, compiler (if any), the ISA, and the hardware – so breaking the abstraction layer makes sense. Bottom line becomes who controls what elements of the design. Intel side note

Upload: jasmin-heath

Post on 12-Jan-2016

215 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Abstraction Question General purpose processors have an abstraction layer fixed at the ISA and have little control over the compilers or code run on the

Abstraction Question

• General purpose processors have an abstraction layer fixed at the ISA and have little control over the compilers or code run on the machine

• Embedded processors tend to have entire control over the code run, compiler (if any), the ISA, and the hardware – so breaking the abstraction layer makes sense.

• Bottom line becomes who controls what elements of the design.

Intel side note

Page 2: Abstraction Question General purpose processors have an abstraction layer fixed at the ISA and have little control over the compilers or code run on the

We’ve been doing mips – how’d intel do this for a CISC x86 ISA? Microops from Pentium Pro on

Page 3: Abstraction Question General purpose processors have an abstraction layer fixed at the ISA and have little control over the compilers or code run on the

Branch Hazardsor

“Which way did he go?”

Peer Instruction Lecture Materials for Computer Architecture by Dr. Leo Porter is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.

Read 5.1, 5.2

Page 4: Abstraction Question General purpose processors have an abstraction layer fixed at the ISA and have little control over the compilers or code run on the

Control Dependence

• Just as an instruction will be dependent on other instructions to provide its operands ( dependence), it will also be dependent on other instructions to determine whether it gets executed or not ( dependence or ___________ dependence).

• Control dependences are particularly critical with _____________ branches.

add $5, $3, $2

sub $6, $5, $2

beq $6, $7, somewhere

and $9, $6, $1

... somewhere: or $10, $5, $2

add $12, $11, $9

...

conditional

branchcontrol

data

Page 5: Abstraction Question General purpose processors have an abstraction layer fixed at the ISA and have little control over the compilers or code run on the

Dealing With Branch Hazards

• Hardware– stall until you know which direction

– reduce hazard through earlier computation of branch direction

– guess which direction assume not taken (easiest) more educated guess based on history (requires that you know it is a

branch before it is even decoded!)

• Hardware/Software– noops, or instructions that get executed either way (delayed

branch).

Page 6: Abstraction Question General purpose processors have an abstraction layer fixed at the ISA and have little control over the compilers or code run on the

Given our current pipeline – let’s assume we stall until we know the branch outcome. How many cycles will you lose per branch?

Stalling the pipeline

Selection

cycles

A 0

B 1

C 2

D 3

E 4

Page 7: Abstraction Question General purpose processors have an abstraction layer fixed at the ISA and have little control over the compilers or code run on the

Stalling for Branch Hazards

beq $4, $0, there

and $12, $2, $5

or ...

add ...

sw ...

IM Reg DM Reg

IM Reg

IM Reg

DM

IM Reg

DM Reg

IM Reg

DM Reg

CC1 CC2 CC3 CC4 CC5 CC6 CC7 CC8

Bubble BubbleBubble

Page 8: Abstraction Question General purpose processors have an abstraction layer fixed at the ISA and have little control over the compilers or code run on the

Stalling for Branch Hazards

• Seems wasteful, particularly when the branch isn’t taken.

• Makes all branches cost 4 cycles.

• What if we just assume ____________

Why?

branchesAren’t taken

Page 9: Abstraction Question General purpose processors have an abstraction layer fixed at the ISA and have little control over the compilers or code run on the

Assume Branch Not Taken

beq $4, $0, there

and $12, $2, $5

or ...

add ...

sw ...

IM Reg

DM Reg

IM Reg

IM Reg

DM

IM Reg DM Reg

IM Reg

DM Reg

CC1 CC2 CC3 CC4 CC5 CC6 CC7 CC8

• works pretty well when you’re right

Page 10: Abstraction Question General purpose processors have an abstraction layer fixed at the ISA and have little control over the compilers or code run on the

Assume Branch Not Taken

beq $4, $0, there

and $12, $2, $5

or ...

add ...

there: sub $12, $4, $2

IM Reg

IM Reg

IM

IM Reg

IM Reg

DM Reg

CC1 CC2 CC3 CC4 CC5 CC6 CC7 CC8

Flush

Flush

Flush

• same performance as stalling when you’re wrong

Wrong Path insts

Page 11: Abstraction Question General purpose processors have an abstraction layer fixed at the ISA and have little control over the compilers or code run on the

Let’s improve the pipeline so we move branch resolution to Decode.

How many cycles would we lose then on a taken branch?

Stalling the pipeline

Selection

cycles

A 0

B 1

C 2

D 3

E 4

Add drawing ofResolving in decode.

Page 12: Abstraction Question General purpose processors have an abstraction layer fixed at the ISA and have little control over the compilers or code run on the

The Pipeline with flushing for taken branches

• Notice the IF/ID flush line added.

Page 13: Abstraction Question General purpose processors have an abstraction layer fixed at the ISA and have little control over the compilers or code run on the

Branch Hazards – Assume Not Taken

• Great if most of your branches aren’t taken.

• What about loops which are taken 95% of the time?– we would like the option of assuming not taken for some branches,

and taken for others, depending on ???

Page 14: Abstraction Question General purpose processors have an abstraction layer fixed at the ISA and have little control over the compilers or code run on the

Branch Hazards – Predicting Taken

IM Reg

AL

U DM Reg

IM Reg

AL

U DM Reg

CC1 CC2 CC3 CC4 CC5 CC6 CC7 CC8

beq $2, $1, here

here: lw

Reading quiz

Required information to predict Taken:

1. An instruction is a branch before decode

2. The target of the branch

3. The outcome of the branch

Selection Required knowledge

A 2,3

B 1,2,3

C 1,2

D 2

E None of the above

Page 15: Abstraction Question General purpose processors have an abstraction layer fixed at the ISA and have little control over the compilers or code run on the

Branch Target Buffer

• Keeps track of the PCs of recently seen branches and their targets.

• Consult during Fetch (in parallel with Instruction Memory read) to determine:– Is this a branch?

– If so, what is the target

Page 16: Abstraction Question General purpose processors have an abstraction layer fixed at the ISA and have little control over the compilers or code run on the

Branch Hazards – Predict Taken

• Static policy:– Forward branches (if statements) predict not taken

– Backward branches (loops) predict taken

• Dynamic prediction (coming soon)

• First – Branch Delay Slots

Page 17: Abstraction Question General purpose processors have an abstraction layer fixed at the ISA and have little control over the compilers or code run on the

Eliminating the Branch Stall

• There’s no rule that says we have to see the effect of the branch immediately. Why not wait an extra instruction before branching?

• The original SPARC and MIPS processors each used a single branch delay slot to eliminate single-cycle stalls after branches.

• The instruction after a conditional branch is always executed in those machines, regardless of whether the branch is taken or not!

Page 18: Abstraction Question General purpose processors have an abstraction layer fixed at the ISA and have little control over the compilers or code run on the

Branch Delay Slot

beq $4, $0, there

and $12, $2, $5

there: or ...

add ...

sw ...

IM Reg

DM Reg

IM Reg

IM Reg

DM

IM Reg

DM Reg

IM Reg

DM Reg

CC1 CC2 CC3 CC4 CC5 CC6 CC7 CC8

Branch delay slot instruction (next instruction after a branch) is executed evenif the branch is taken.

Page 19: Abstraction Question General purpose processors have an abstraction layer fixed at the ISA and have little control over the compilers or code run on the

Filling the branch delay slot

1 add $5, $3, $7

2 add $9, $1, $3

3 sub $6, $1, $4

4 and $7, $8, $2

5 beq $6, $7, there

nop /* branch delay slot */

6 add $9, $1, $2

7 sub $2, $9, $5

...

there:

8 mult $2, $10, $11

No-R7 WAR

Safe, $1 and $3 are fine

No-R6

No-R7

Not safe ($9 on nt path)

Not safe (needs $9 not yet produced)

Not safe ($2 is used before overwritten)

E is the correct answer

Selection Safe instructions

A 1,2

B 2,6

C 6,8

D 1,2,7,8

E None of the above

* It is not safe to assume anything about the … code

Page 20: Abstraction Question General purpose processors have an abstraction layer fixed at the ISA and have little control over the compilers or code run on the

Filling the branch delay slot• The branch delay slot is only useful if you can find

something to put there.

• If you can’t find anything, you must put a noop to insure correctness.

Page 21: Abstraction Question General purpose processors have an abstraction layer fixed at the ISA and have little control over the compilers or code run on the

Branch Delay Slots

• This works great for this implementation of the architecture, but becomes a permanent part of the ISA.

• What about the MIPS R10000, which has a 5-cycle branch penalty, and executes 4 instructions per cycle???

• Bottom line: Exposed a detail of the hardware implementation to the ISA.

Page 22: Abstraction Question General purpose processors have an abstraction layer fixed at the ISA and have little control over the compilers or code run on the

Dynamic Branch Prediction

• Can we guess the outcome of branches?

• What should we base that guess on?

Page 23: Abstraction Question General purpose processors have an abstraction layer fixed at the ISA and have little control over the compilers or code run on the

Branch Prediction

101

program counter

for (i=0;i<10;i++) {......}

...

...add $i, $i, #1beq $i, #10, loop

Accuracy?

Too easily swayed

Page 24: Abstraction Question General purpose processors have an abstraction layer fixed at the ISA and have little control over the compilers or code run on the

Two-bit predictors give better loop prediction

for (i=0;i<10;i++) {......}

...

...add $i, $i, #1beq $i, #10, loop

Strongly Taken11

Weakly Taken10

Weakly Not Taken01

Strongly Not Taken00

Decr

em

ent

when

not

taken

Incr

em

ent

when t

aken

branch address

00

PHT

Better (less sway)Slower learning

Page 25: Abstraction Question General purpose processors have an abstraction layer fixed at the ISA and have little control over the compilers or code run on the

Suppose we have the following branch patterns for 3 branches (A, B, C). What is the accuracy of a 1-bit and 2-bit Branch History Table. Assume initial values of 1 (1-bit) and (10) 2-bit.

Strongly Taken11

Weakly Taken10

Weakly Not Taken01

Strongly Not Taken00

Decr

em

ent

when

not

taken

Incr

em

ent

when t

aken

1-bitA. T T T T N B. T T N T N

C. N T N T N

2-bitA. T T T T N B. T T N T N

C. N T N T N

Page 26: Abstraction Question General purpose processors have an abstraction layer fixed at the ISA and have little control over the compilers or code run on the

Modern Branch Prediction - Pentium 4

• Performance dependent on accurate Branch Prediction

• 20 Stage Pipeline – 3-way issue– 60 instructions in flight (12 branches)

– 17th stage is branch resolution

– ~17*3=51 instructions lost on mispredict

X

1 2 3 4 5 6 7 8 9 1011121314151617181920

Page 27: Abstraction Question General purpose processors have an abstraction layer fixed at the ISA and have little control over the compilers or code run on the

Branch Prediction

• Latest branch predictors are significantly more sophisticated, using more advanced correlating techniques, larger structures, and even AI techniques

• Use patterns of branches (local history) and recent other branch history (global history) to make predictions

Page 28: Abstraction Question General purpose processors have an abstraction layer fixed at the ISA and have little control over the compilers or code run on the

For a given program on our 5-stage MIPS pipeline processor:

20% of insts are loads, 50% of instructions following a load are arithmetic instructions depending on the load

20% of instructions are branches. Using dynamic branch prediction, we achieve 80% prediction accuracy.

What is the CPI of your program?Selection CPI

A 0.76

B 0.9

C 1.0

D 1.1

E 1.14

Putting it all together.

Page 29: Abstraction Question General purpose processors have an abstraction layer fixed at the ISA and have little control over the compilers or code run on the

Control Hazards -- Key Points

• Control (or branch) hazards arise because we must fetch the next instruction before we know if we are branching or where we are branching.

• Control hazards are detected in hardware.

• We can reduce the impact of control hazards through:– early detection of branch address and condition

– branch prediction

– branch delay slots

Page 30: Abstraction Question General purpose processors have an abstraction layer fixed at the ISA and have little control over the compilers or code run on the

Given our 5-stage MIPS pipeline – what is the steady state CPI for the following code? Assume the branch is taken thousands of times. Recall – a processor is in steady state when all stages are active.

Loop: lw r1, 0 (r2) add r2, r3, r4 sub r5, r1, r2 beq r5, $zero, Loop Selection CPI

A 1

B 1.25

C 1.5

D 1.75

E None of the above

Steady-State CPI = (#insts+#stalls+#flushed_insts) #insts

Page 31: Abstraction Question General purpose processors have an abstraction layer fixed at the ISA and have little control over the compilers or code run on the

IF = 200psID = 100psEX = 100psM = 200ps

WB = 100ps

Hardware engineers determine these to be the execution times per stage of the MIPS 5-stage pipeline processor. Consider splitting IF and M into 2 stages each. (So IF1 IF2 and M1 M2.) The most important code run by the company is (assume branch is taken most of the time):Loop: lw r1, 0 (r2) add r2, r3, r4 sub r5, r1, r2 beq r5, $zero, Loop

Selection CPI CT

A Increase Increase

B Increase Decrease

C Decrease Increase

D Decrease Decrease

E Increase No Change

What would be the impact of the new 7-stage pipeline compared to the original 5-stage MIPS pipeline.. Assume the pipeline has forwarding where available, predicts branch not taken, and resolves branches in ID.

Isomorphic

Page 32: Abstraction Question General purpose processors have an abstraction layer fixed at the ISA and have little control over the compilers or code run on the

IF = 200psID = 100psEX = 200psM = 200ps

WB = 100ps

Hardware engineers determine these to be the execution times per stage of the MIPS 5-stage pipeline processor. Consider splitting IF and M into 2 stages each. (So IF1 IF2 and M1 M2.) The most important code run by the company is (assume branch is taken most of the time):Loop: lw r1, 0 (r2) add r2, r3, r4 sub r5, r1, r2 beq r5, $zero, Loop

Selection CPI CT

A Increase Increase

B Increase Decrease

C Decrease Increase

D Decrease Decrease

E Increase No Change

What would be the impact of the new 7-stage pipeline compared to the original 5-stage MIPS pipeline.. Assume the pipeline has forwarding where available, predicts branch not taken, and resolves branches in ID.

Page 33: Abstraction Question General purpose processors have an abstraction layer fixed at the ISA and have little control over the compilers or code run on the

Loop: lw r1, 0 (r2) add r2, r3, r4 sub r5, r1, r2 beq r5, $zero, Loop

7-stage Pipeline For diagramming the code in the last slide. Drive home – more stages = more hazards

Page 34: Abstraction Question General purpose processors have an abstraction layer fixed at the ISA and have little control over the compilers or code run on the

Pipelining -- Key Points

• Pipelining focuses on improving instruction throughput, not individual instruction latency.

• Data hazards can be handled by hardware or software – but most modern processors have hardware support for stalling and forwarding.

• Control hazards can be handled by hardware or software – but most modern processors use Branch Target Buffers and advanced dynamic branch prediction to reduce the hazard.

• ET = IC*CPI*CT