lecture 08: risc-v pipeline implementation€¦ · lecture 08: risc-v pipeline implementation csce...

Lecture08:RISC-VPipelineImplementation

CSCE513ComputerArchitecture

DepartmentofComputerScienceandEngineeringYonghong Yan

[email protected]://passlab.github.io/CSCE513

1

Acknowledgement

• SlidesadaptedfromComputerScience152:ComputerArchitectureandEngineering,Spring2016byDr.GeorgeMichelogiannakis fromUCBerkeley

2

Review

• CPUperformancefactors– Instructioncount• DeterminedbyISAandcompiler

– CPIandCycletime• DeterminedbyCPUhardware

• Threegroupsofinstructions– Memoryreference:lw,sw– Arithmetic/logical:add,sub,and,or,slt– Controltransfer:jal,jalr,b*

• CPI– Single-cycle,CPI=1,andnormallylongercycle– 5stageunpipelined,CPI=5– 5stagepipelined,CPI=1

CPU Time = InstructionsProgram

* CyclesInstruction

*TimeCycle

3

Review:Unpipelined Datapath forRISC-V

4

0x4

RegWriteEn

AddAdd

clk

WBSelMemWrite

addr

wdata

rdataDataMemory

we

WASel Op2SelImmSelOpCode

clk

clk

addrinst

Inst.Memory

PC rd1

GPRs

rs1rs2

wawd rd2

we

ImmSelect

ALU

ALUControl

PCSelbrrindjabspc+4

Bcomp?BrLogic

Review:HardwiredControlTable

5

Opcode ImmSel Op2Sel FuncSel MemWr RFWen WBSel WASel PCSel

ALUALUiLWSWBEQtrueBEQfalseJJALJALR

Op2Sel=Reg /Imm WBSel =ALU/Mem /PCWASel =rd /X1 PCSel =pc+4/br /rind/jabs

* * * no yes rindPC rdjabs* * * no yes PC X1

jabs* * * no no * *pc+4SBType12 * * no no * *brSBType12 * * no no * *pc+4SType12 Imm + yes no * *

pc+4* Reg Func no yes ALU rdIType12 Imm Op pc+4no yes ALU rd

pc+4IType12 Imm + no yes Mem rd

AnIdealPipeline

• Allobjectsgothroughthesamestages• Nosharingofresourcesbetweenanytwostages• Propagationdelaythroughallpipelinestagesisequal• Theschedulingofanobjectenteringthepipelineisnotaffectedbytheobjectsinotherstages

6

stage1

stage2

stage3

stage4

TheseconditionsgenerallyholdforindustrialassemblylinesForlaundrypipeline,twoloadsdonotdependoneachother.

Butinstructionsdependoneachother!

TechnologyAssumptions

7

Thus,thefollowingtimingassumptionisreasonable

• Asmallamountofveryfastmemory(caches)backedupbyalarge,slowermemory

• FastALU(atleastforintegers)• Multiported Registerfiles(slower)

tIF ~=tID/RF ~=tEX ~=tMEM ~=tWB

A5-stagepipelinewillbe focusofourdetaileddesignSomecommercialdesignshaveover30pipelinestagestodoanintegeradd!

5-StagePipelinedExecution:ResourceUsageTheWholePipelineResourcesareUsedby5InstructionsinEveryCycle!

8

time t0 t1 t2 t3 t4 t5 t6 t7 ....Instruction1 IF1 ID1 EX1 ME1 WB1Instruction2 IF2 ID2 EX2 ME2 WB2Instruction3 IF3 ID3 EX3 ME3 WB3Instruction4 IF4 ID4 EX4 ME4 WB4Instruction5 IF5 ID5 EX5 ME5 WB5

Write-Back

(WB)- I1I-Fetch(IF)- I5

Execute(EX)- I3

Decode,Reg.Fetch(ID)- I4

Memory(ME)- I2

addr

wdata

rdataDataMemory

weALU

ImmSelect

0x4Add

addrrdata

Inst.Memory

rd1

GPRs

rs1rs2wawdrd2

we

IRPC

InstructionRegisterinEachStage

9

IRIR IR

PC A

BY

R

MD1 MD2

addrinst

InstMemory

0x4Add

IR

ImmSelect

ALUrd1

GPRs

rs1rs2

wawd rd2

we

wdata

addr

wdata

rdataDataMemory

we

• AninstructionReg (IR) ineachstagetocontaintheinstructioninthatstage

Instruction1

ConnectControlsfromInstructionRegister

10

IRIR IR

PC A

BY

R

MD1 MD2

addrinst

InstMemory

0x4Add

IR

ImmSelect

ALUrd1

GPRs

rs1rs2

wawd rd2

we

DataMemorywdata

addr

wdata

rdata

we

ImmSel Op2Sel

WBSelMemWrite

RegWriteEn

F D E M W

ALUControl

Instruction1

Instruction4Instruction3

Instruction2

ComparedWithControlLogicinUnpipelined

• Unpipelined:– Singlecontrollogicuses

theinstructionfromIF• Pipelined:– Distributedlogicsthat

uses instructionsfromIRsineachstage

11

InstructionsInteractWithEachOtherin Pipeline:DealingwithHazards

• Aninstructionmayneedaresourcebeingusedbyanotherinstructionà structuralhazard– Solution#1:Stallingnewerinstructiontillolderinstructionfinishes– Solution#2:Addingmorehardwaretodesign• E.g.,separatememoryintoI-memoryandD-memory

– Our5-stagepipelinehasnostructuralhazardsbydesign

• Aninstructiondependsonsomethingproducedbyanearlierone– Dependencemaybeforadatavalueorforusingsameregister(notthe

value)àdatahazard• SolutionsforRAWhazards:#1,interlocking(bubbledelay),and#2,forwarding

• WARandWRWhazards:notpossiblefor5-stagepipeline

– Dependencemaybeforthenextinstruction’saddressà controlhazard(branches,exceptions)• Solutions:#delay,prediction,etc

12

Read-After-Write(RAW)DataHazards

13

...x1 ¬ x0 + 10x4 ¬ x1 + 17...

x1 in GPRs contains stale value since the passing of value between two instructions has to go through GPRs (register file).

x1 ¬ …x4 ¬ x1 …

IRIR IR

PCA

B

Y

R

MD1 MD2

addrinst

InstMemory

0x4Add

IR

ImmSelect

ALUrd1

GPRs

rs1rs2

wawd rd2

we

wdata

addr

wdata

rdataData Memory

we

ToResolveDataHazards:#1,Interlocking,i.e.StallPipelinebyInsertingBubbles

14

stalled stages

timet0 t1 t2 t3 t4 t5 t6 t7 . . . .

IF I1 I2 I3 I3 I3 I3 I4 I5ID I1 I2 I2 I2 I2 I3 I4 I5EX I1 - - - I2 I3 I4 I5ME I1 - - - I2 I3 I4 I5WB I1 - - - I2 I3 I4 I5

timet0 t1 t2 t3 t4 t5 t6 t7 . . . .

(I1) x1 ¬ (x0) + 10IF1 ID1 EX1 ME1 WB1(I2) x4 ¬ (x1) + 17 IF2 ID2 ID2 ID2 ID2 EX2 ME2 WB2(I3) IF3 IF3 IF3 IF3 ID3 EX3 ME3 WB3(I4) IF4 ID4 EX4 ME4 WB4(I5) IF5 ID5 EX5 ME5 WB5

Resource Usage

- Þ pipeline bubble

InsertBubbleforInterlocking

15

IRIR IR

PCA

B

Y

R

MD1 MD2

addrinst

InstMemory

0x4Add

IR

ImmSelect

ALUrd1

GPRs

rs1rs2

wawd rd2

we

wdata

addr

wdata

rdataData Memory

we

bubble

...x1 ¬ x0 + 10x4 ¬ x1 + 17...

Stall Condition Bubble:Nop instruction,e.g.addx0,x0,x0BubbleareinsertedatIDandbyenteringEXEstage

InterlockControlLogic

16

IRIR IR

PCA

B

Y

R

MD1 MD2

addrinst

InstMemory

0x4Add

IR

ImmSelect

ALUrd1

GPRs

rs1rs2

wawd rd2

we

wdata

addr

wdata

rdataData Memory

we

bubble

Compare the source registers of the instruction in the decode stage with the destination register of the uncommittedinstructions.

stallCstall

ws

rs2rs1 ?

...x1 ¬ x0 + 10x4 ¬ x1 + 17...

InterlockControlLogicignoringjumps&branches

17

Shouldwealwaysstallif anrs fieldmatchessomerd?

IRIR IR

PC A

BY

R

MD1 MD2

addrinst

InstMemory

0x4Add

IR ALUrd1

GPRs

rs1rs2

wawd rd2

we

wdata

addr

wdata

rdataDataMemory

we

bubble

stallCstall

wsWB

rs1rs2 ?

weWB

re1 re2Cre

wsEXweMEM wsMEM

Cdest CdestweEX

noteveryinstructionwritesaregister=>wenoteveryinstructionreadsaregister=>re

ImmSelect

we:writeenable,1-biton/offws:writeselect,5-bitregisternumberre:readenable,1-biton/offrs:readselect,5-bitregisternumber

...x1 ¬ x0 + 10x4 ¬ x1 + 17...

Source&DestinationRegisters

18

ALUI/LW/JALRALU

SW/Bcond

func7 rs2 rs1 func3 rd opcode

immediate12 rs1 func3 rd opcode

imm rs2 rs1 func3 imm

Jump Offset[19:0]

opcode

rd opcode

source(s) destinationALU rd <=rs1func10rs2 rs1,rs2 rdALUI rd <=rs1opimm rs1 rdLW rd <=M[rs1+imm] rs1 rdSW M[rs1+imm]<=rs2 rs1,rs2 -Bcond rs1,rs2 rs1,rs2 -

true: PC<=PC+immfalse: PC<=PC+4

JAL x1<=PC,PC<=PC+imm - rdJALR rd <=PC,PC<=rs1+imm rs1 rd

DerivingtheStallSignal

19

Cdestws =rdwe=Caseopcode

ALU,ALUi,LW,JALR=>on... =>off

Crere1=Caseopcode

ALU,ALUi,LW,SW,Bcond,JALR=>onJAL=>off

re2=CaseopcodeALU,SW,Bcond =>on… =>off

Cstall stall=((rs1D==wsEX)&&weEX +(rs1D==wsMEM)&&weMEM +(rs1D==wsWB)&&weWB)&&re1D +((rs2D==wsEX)&&weEX +(rs2D==wsMEM)&&weMEM +(rs2D==wsWB)&&weWB)&&re2D

source(s) destinationALU rd <=rs1func10rs2 rs1,rs2 rdALUI rd <=rs1opimm rs1 rdLW rd <=M[rs1+imm] rs1 rdSW M[rs1+imm]<=rs2 rs1,rs2 -Bcond rs1,rs2 rs1,rs2 -

true: PC<=PC+immfalse: PC<=PC+4

JAL x1<=PC,PC<=PC+imm - rdJALR rd <=PC,PC<=rs1+imm rs1 rd

NoneedtheWBforinterlockcontrolsinceweonlyneedtodealwithhazardbetweenMEM-EXEandEXE-EXE.FortwoinstructionswhichareinWBandEXE,andhaveRAWhazard,thedependencyarehandledthroughtheregisterfile.

ToResolveDataHazards:#2,Forwarding(Bypassing)

20

Eachstallorkillintroducesabubbleinthepipeline=>CPI>1

time t0 t1 t2 t3 t4 t5 t6 t7 . . . .(I1) x1 ¬ x0 + 10 IF1 ID1 EX1 ME1 WB1(I2) x4 ¬ x1 + 17 IF2 ID2 ID2 ID2 ID2 EX2 ME2 WB2(I3) IF3 IF3 IF3 IF3 ID3 EX3 ME3(I4) stalled stages IF4 ID4 EX4(I5) IF5 ID5

time t0 t1 t2 t3 t4 t5 t6 t7 . . . .(I1) x1 ¬ x0 + 10 IF1 ID1 EX1 ME1 WB1(I2) x4 ¬ x1 + 17 IF2 ID2 EX2 ME2 WB2(I3) IF3 ID3 EX3 ME3 WB3(I4) IF4 ID4 EX4 ME4 WB4(I5) IF5 ID5 EX5 ME5 WB5

Anewdatapath,i.e.,abypass,cangetthedatafromtheoutputoftheALUtoitsinput

Review:HardwareSupportforForwarding,andDetectingRAWHazardswithPreviousand2nd PreviousInstructions

• Slide48oflecture05-06

21

AddingaBypass(ToBypassRegisterFiles)

22

ASrc

...(I1) x1<=x0+10(I2) x4<=x1+17

x4<=x1... x1<=...

IRIR IR

PC A

BY

R

MD1 MD2

addrinst

InstMemory

0x4Add

IR

ImmSelect

ALUrd1

GPRs

rs1rs2

wawd rd2

we

wdata

addr

wdata

rdataDataMemory

we

bubble

stall

D

E M W

Whendoesthisbypasshelp?x1<=M[x0+10]x4<=x1+17

JAL500x4<=x1+17

Yes No,LoadàEXE-Use No

TheBypassSignal:DerivingitfromtheStallSignal

23

ASrc =(rs1D==wsE)&&weE &&re1D

we=CaseopcodeALU,ALUi,LW,,JALJALR=>on...=>off

NobecauseonlyALUandALUi instructionscanbenefitfromthisbypass

Isthiscorrect?

SplitweE intotwocomponents:we-bypass,we-stall

stall=(((rs1D==wsE)&&weE +(rs1D==wsM)&&weM +(rs1D==wsW)&&weW)&&re1D+((rs2D==wsE)&&weE +(rs2D==wsM)&&weM +(rs2D==wsW)&&weW)&&re2D)

ws =rd

BypassandStallSignals

24

we-bypassE =CaseopcodeEALU,ALUi =>on... =>off

ASrc =(rs1D==wsE)&&we-bypassE &&re1D

SplitweE intotwocomponents:we-bypass,we-stall

stall= ((rs1D==wsE)&&we-stallE +(rs1D==wsM)&&weM +(rs1D==wsW)&&weW) &&re1D

+((rs2D==wsE)&&weE +(rs2D==wsM)&&weM +(rs2D==wsW)&&weW) &&re2D

we-stallE =CaseopcodeELW,JAL,JALR=>onJAL =>on... =>off

FullyBypassedDatapath

25

ASrcIRIR IR

PC A

BY

R

MD1 MD2

addrinst

InstMemory

0x4Add

IR ALU

ImmSelect

rd1

GPRs

rs1rs2

wawd rd2

we

wdata

addr

wdata

rdataDataMemory

we

bubble

stall

D

E M W

PCforJAL,...

BSrc

Istherestillaneedforthestallsignal? stall=(rs1D==wsE) &&(opcodeE==LWE)&&(wsE!=0 )&&re1D

+(rs2D==wsE)&&(opcodeE==LWE)&&(wsE!=0 )&&re2D

ControlHazards:BranchesandJumps

• JAL:unconditionaljumptoPC+immediate

• JALR:indirectjumptors1+immediate

• Branch:if(rs1conds rs2),branchtoPC+immediate

26

InfoforControlTransfer

27

Instruction Takenknown? Targetknown?

JALJALRB<cond.>

Twopiecesofinfo:1.TakenorNotTaken2.Targetaddress?

• JAL:unconditionaljumptoPC+immediate• JALR:indirectjumptors1+immediate• Branch:if(rs1conds rs2),branchtoPC+immediate

AfterInst.Decode

AfterInst.Decode AfterInst.Decode

AfterInst.Decode AfterReg.Fetch

After Execute

SpeculateNextAddressisPC+4

28

A jump instruction kills (not stalls)the following instruction

stall

How?

I2

I1

104

IR IR

PC addrinst

InstMemory

0x4Add

bubble

IR

E MAdd

Jump?

PCSrc (pc+4 / jabs / rind/ br)

I1 096 ADD I2 100 J 304I3 104 ADD

I4 304 ADD

kill

PipeliningJumps

29

I2

I1

104

stall

IR IR

PC addrinst

InstMemory

0x4Add

bubble

IR

E MAdd

Jump?

PCSrc (pc+4 / jabs / rind/ br)

IRSrcD = Case opcodeDJAL Þ bubble... Þ IM

To kill a fetched instruction -- Insert a mux before IR

Any interaction between stall and jump?

bubble

IRSrcD

I2 I1

304bubble

I1 096 ADD I2 100 J 304I3 104 ADD

I4 304 ADD

kill

JumpPipelineDiagrams

30

timet0 t1 t2 t3 t4 t5 t6 t7 . . . .

IF I1 I2 I3 I4 I5ID I1 I2 - I4 I5EX I1 I2 - I4 I5ME I1 I2 - I4 I5WB I1 I2 - I4 I5

timet0 t1 t2 t3 t4 t5 t6 t7 . . . .

(I1) 096: ADD IF1 ID1 EX1 ME1 WB1(I2) 100: J 304 IF2 ID2 EX2 ME2 WB2(I3) 104: ADD IF3 - - - -(I4) 304: ADD IF4 ID4 EX4 ME4 WB4

Resource Usage


PipeliningConditionalBranches

31

I1 096 ADDI2 100 BEQx1,x2+200I3 104 ADDI4 304 ADD

BEQ?

I2

I1

104

stall

IR IR

PC addrinst

InstMemory

0x4Add

bubble

IR

E MAdd

PCSrc (pc+4/jabs/rind/br)

bubble

IRSrcD

Branchconditionisnotknownuntiltheexecutestage

AYALU

Taken?


32

I1 096 ADDI2 100 BEQx1,x2+200I3 104 ADDI4 304 ADD

stall

IR IR

PC addrinst

InstMemory

0x4Add

bubble

IR

E MAdd


bubble

IRSrcD

AYALU

Taken?

Ifthebranchistaken:- Killthetwofollowinginstructions- TheinstructionatthedecodestageisnotvalidÞ stallsignalisnotvalid

I2 I1

108I3

Bcond?

?


33

I1: 096 ADDI2: 100 BEQx1,x2+200I3: 104 ADDI4: 304 ADD

stall

IR IR

PC addrinst

InstMemory

0x4Add

bubble

IR

E M


bubble AYALU

Taken?I2 I1

108I3

Bcond?

Jump?

IRSrcD

IRSrcE

Ifthebranchistaken- killthetwofollowinginstructions- theinstructionatthedecodestageisnotvalidÞ stallsignalisnotvalid

Add

PC

BranchPipelineDiagrams(resolvedinexecutestage)

34

timet0 t1 t2 t3 t4 t5 t6 t7 . . . .

IF I1 I2 I3 I4 I5ID I1 I2 I3 - I5EX I1 I2 - - I5ME I1 I2 - - I5WB I1 I2 - - I5

timet0 t1 t2 t3 t4 t5 t6 t7 . . . .

(I1) 096: ADD IF1 ID1 EX1 ME1 WB1(I2) 100: BEQ +200 IF2 ID2 EX2 ME2 WB2(I3) 104: ADD IF3 ID3 - - -(I4) 108: IF4 - - - -(I5) 304: ADD IF5 ID5 EX5 ME5 WB5

Resource Usage


UseSimplerBranches:E.g.OnlyCompareOneRegisterAgainstZeroinIDStage

35

timet0 t1 t2 t3 t4 t5 t6 t7 . . . .

IF I1 I2 I3 I4 I5ID I1 I2 - I4 I5EX I1 I2 - I4 I5ME I1 I2 - I4 I5WB I1 I2 - I4 I5

timet0 t1 t2 t3 t4 t5 t6 t7 . . . .

(I1) 096: ADD IF1 ID1 EX1 ME1 WB1(I2) 100: BEQZ +200 IF2 ID2 EX2 ME2 WB2(I3) 104: ADD IF3 - - - -(I4) 300: ADD IF4 ID4 EX4 ME4 WB4

Resource Usage


PipelinedMIPSDatapath

Adder

IF/ID

MemoryAccess

WriteBack

InstructionFetch

Instr. DecodeReg. Fetch

ExecuteAddr. Calc

ALU

Mem

ory

Reg File

MU

X

Data

Mem

ory

MU

X

SignExtend

Zero?

MEM

/WB

EX/M

EM

4

Adder

Next SEQ PC

RD RD RD

Next PC

Address

RS1

RS2

Imm

MU

X

ID/EX

(I1) 096: ADD(I2) 100: BEQZ +200 (I3) 104: ADD(I4) 300: ADD

36

ControlHazardDelaySummary

• JAL:unconditionaljumptoPC+immediate– 1cycledelayofpipeline• JALR:indirectjumptors1+immediate– 1cycledelay• Branch:if(rs1conds rs2),branchtoPC+immediate– 2cyclesdelay– 1cycledelayforsimplerbranch(BEQZ)withpipeline

improvement

37

ReducingControlFlowPenalty

• Softwaresolutions– Eliminatebranches- loopunrolling• Increasestherunlength

– Reduceresolutiontime- instructionscheduling• Computethebranchconditionasearlyaspossible(oflimitedvaluebecausebranchesoftenincriticalpaththroughcode)

• Hardwaresolutions– Findsomethingelsetodo- delayslots• Replacespipelinebubbleswithusefulwork(requiressoftwarecooperation)

– Speculate- branchprediction• Speculativeexecutionofinstructionsbeyondthebranch

38

AdditionalMaterials– BranchPrediction

39

BranchPrediction

• Motivation– Branchpenaltieslimitperformanceofdeeplypipelinedprocessors– Modernbranchpredictorshavehighaccuracy– (>95%)andcanreducebranchpenaltiessignificantly

• Requiredhardwaresupport:– Predictionstructures:• Branchhistorytables,branchtargetbuffers,etc.

• Mispredict recoverymechanisms:– Keepresultcomputationseparatefromcommit– Killinstructionsfollowingbranchinpipeline– Restorestatetothatfollowingbranch

40

StaticBranchPrediction

41

Overallprobabilityabranchistakenis~60-70%but:

ISAcanattachpreferreddirectionsemanticstobranches,e.g.,MotorolaMC88110

bne0 (preferredtaken) beq0 (nottaken)

backward90%

forward50%

WhatC++statementdoesthislooklike

WhatC++statementdoesthislooklike

DynamicBranchPredictionlearningbasedonpastbehavior

• Temporalcorrelation(time)– IfItellyouthatacertainbranchwastakenlasttime,doesthishelp?

– Thewayabranchresolvesmaybeagoodpredictorofthewayitwillresolveatthenextexecution

• Spatialcorrelation(space)– Severalbranchesmayresolveinahighlycorrelatedmanner– Forinstance,apreferredpathofexecution

42

DynamicBranchPrediction

• 1-bitpredictionscheme– Low-portionaddressasaddressforaone-bitflagforTakenor

NotTaken historically– Simple• 2-bitprediction– Misstwicetochange

43

BranchPredictionBits

• Assume2BPbitsperinstruction• Changethepredictionaftertwoconsecutivemistakes!

44

¬takewrongtaken

¬taken

taken

taken

taken¬takeright

takeright

takewrong

¬taken

¬taken¬taken

BPstate:(predict take/¬take)x(lastprediction right/wrong)

BranchHistoryTable

45

4K-entryBHT,2bits/entry,~80-90%correctpredictions

0 0FetchPC

Branch? TargetPC

+

I-Cache

Opcode offsetInstruction

kBHTIndex

2k-entryBHT,2bits/entry

Taken/¬Taken?

ExploitingSpatialCorrelationYehandPatt,1992

46

Iffirstconditionfalse,secondconditionalsofalse

Historyregister,H,recordsthedirectionofthelastNbranchesexecutedbytheprocessor

if (x[i] < 7) theny += 1;

if (x[i] < 5) thenc -= 4;

Two-LevelBranchPredictor

47

PentiumProusestheresultfromthelasttwobranchestoselectoneofthefoursetsofBHTbits(~95%correct)

0 0

kFetchPC

ShiftinTaken/¬Takenresultsofeachbranch

2-bitglobalbranchhistoryshiftregister

Taken/¬Taken?

SpeculatingBothDirections• Analternativetobranchpredictionistoexecutebothdirectionsofabranchspeculatively– resourcerequirementisproportionaltothenumberofconcurrent

speculativeexecutions– onlyhalftheresourcesengageinusefulworkwhenbothdirections

ofabranchareexecutedspeculatively– branchpredictiontakeslessresourcesthanspeculativeexecution

ofbothpaths

• Withaccuratebranchprediction,itismorecosteffectivetodedicateallresourcestothepredicteddirection!– Whatwouldyouchoosewith80%accuracy?

48

AreWeMissingSomething?

• Knowingwhetherabranchistakenornotisgreat,butwhatelsedoweneedtoknowaboutit?

Branchtargetaddress

49

BranchTargetBuffer

50

BPbitsarestoredwiththepredictedtargetaddress.

IFstage:If(BP=taken)thennPC=targetelsenPC=PC+4Later:checkprediction,ifwrongthenkilltheinstructionandupdateBTB&BPb elseupdateBPb

IMEM

PC

BranchTargetBuffer(2k entries)

k

BPbpredicted

target BP

target

AddressCollisions(MisPrediction)

51

Whatwillbefetchedaftertheinstructionat1028?BTBprediction =Correcttarget =

=>

Assumea128-entryBTB

BPbtargettake236

1028Add.....

132Jump+104

InstructionMemory

2361032

kill PC=236andfetch PC=1032

Isthisacommonoccurrence?

BTBisonlyforControlInstructions

• Isevenbranchpredictionfastenoughtoavoidbubbles?• WhendoweindextheBTB?– i.e.,whatstateisthebranchin,inordertoavoidbubbles?

• BTBcontainsusefulinformationforbranchandjumpinstructionsonly=> Donotupdateitforotherinstructions

• ForallotherinstructionsthenextPCisPC+4!

• Howtoachievethiseffectwithoutdecodingtheinstruction?

52

BranchTargetBuffer(BTB)

53

• KeepboththebranchPCandtargetPCintheBTB• PC+4isfetchedifmatchfails• Onlytaken branchesandjumpsheldinBTB• NextPCdeterminedbefore branchfetchedanddecoded

2k-entry direct-mapped BTB(can also be associative)

I-Cache PC

k

Valid

valid

EntryPC

=

match

predicted

target

targetPC

CombiningBTBandBHT

• BTBentriesareconsiderablymoreexpensivethanBHT,butcanredirectfetchesatearlierstageinpipelineandcanaccelerateindirectbranches(JR)

• BHTcanholdmanymoreentriesandismoreaccurate

54

A PCGeneration/MuxP InstructionFetchStage1F InstructionFetchStage2B BranchAddressCalc/BeginDecodeI CompleteDecodeJ SteerInstructionstoFunctionalunitsR RegisterFileReadE IntegerExecute

BTB

BHTBHTinlaterpipelinestagecorrectswhenBTBmissesapredictedtakenbranch

BTB/BHTonlyupdatedafterbranchresolvesinEstage

UsesofJumpRegister(JR)

• Switchstatements(jumptoaddressofmatchingcase)

• Dynamicfunctioncall(jumptorun-timefunctionaddress)

• Subroutinereturns(jumptoreturnaddress)

55

HowwelldoesBTBworkforeachofthesecases?

BTBworkswellifsamecaseusedrepeatedly

BTBworkswellifsamefunctionusuallycalled,(e.g.,inC++programming,whenobjectshavesametypeinvirtualfunctioncall)

BTBworkswellifusuallyreturntothesameplaceÞ Oftenonefunctioncalledfrommanydistinctcallsites!

SubroutineReturnStack

SmallstructuretoaccelerateJRforsubroutinereturns,typicallymuchmoreaccuratethanBTBs.

56

&fb()

&fc()

Pushcalladdresswhenfunctioncallexecuted

Popreturnaddresswhensubroutinereturndecoded

fa() { fb(); }fb() { fc(); }fc() { fd(); }

&fd() kentries(typicallyk=8-16)

lecture 08: risc-v pipeline implementation€¦ · lecture 08: risc-v pipeline implementation csce...

Documents