lecture 08: risc-v pipeline implementation€¦ · lecture 08: risc-v pipeline implementation csce...
TRANSCRIPT
Lecture08:RISC-VPipelineImplementation
CSCE513ComputerArchitecture
DepartmentofComputerScienceandEngineeringYonghong Yan
[email protected]://passlab.github.io/CSCE513
1
Acknowledgement
• SlidesadaptedfromComputerScience152:ComputerArchitectureandEngineering,Spring2016byDr.GeorgeMichelogiannakis fromUCBerkeley
2
Review
• CPUperformancefactors– Instructioncount• DeterminedbyISAandcompiler
– CPIandCycletime• DeterminedbyCPUhardware
• Threegroupsofinstructions– Memoryreference:lw,sw– Arithmetic/logical:add,sub,and,or,slt– Controltransfer:jal,jalr,b*
• CPI– Single-cycle,CPI=1,andnormallylongercycle– 5stageunpipelined,CPI=5– 5stagepipelined,CPI=1
CPU Time = InstructionsProgram
* CyclesInstruction
*TimeCycle
3
Review:Unpipelined Datapath forRISC-V
4
0x4
RegWriteEn
AddAdd
clk
WBSelMemWrite
addr
wdata
rdataDataMemory
we
WASel Op2SelImmSelOpCode
clk
clk
addrinst
Inst.Memory
PC rd1
GPRs
rs1rs2
wawd rd2
we
ImmSelect
ALU
ALUControl
PCSelbrrindjabspc+4
Bcomp?BrLogic
Review:HardwiredControlTable
5
Opcode ImmSel Op2Sel FuncSel MemWr RFWen WBSel WASel PCSel
ALUALUiLWSWBEQtrueBEQfalseJJALJALR
Op2Sel=Reg /Imm WBSel =ALU/Mem /PCWASel =rd /X1 PCSel =pc+4/br /rind/jabs
* * * no yes rindPC rdjabs* * * no yes PC X1
jabs* * * no no * *pc+4SBType12 * * no no * *brSBType12 * * no no * *pc+4SType12 Imm + yes no * *
pc+4* Reg Func no yes ALU rdIType12 Imm Op pc+4no yes ALU rd
pc+4IType12 Imm + no yes Mem rd
AnIdealPipeline
• Allobjectsgothroughthesamestages• Nosharingofresourcesbetweenanytwostages• Propagationdelaythroughallpipelinestagesisequal• Theschedulingofanobjectenteringthepipelineisnotaffectedbytheobjectsinotherstages
6
stage1
stage2
stage3
stage4
TheseconditionsgenerallyholdforindustrialassemblylinesForlaundrypipeline,twoloadsdonotdependoneachother.
Butinstructionsdependoneachother!
TechnologyAssumptions
7
Thus,thefollowingtimingassumptionisreasonable
• Asmallamountofveryfastmemory(caches)backedupbyalarge,slowermemory
• FastALU(atleastforintegers)• Multiported Registerfiles(slower)
tIF ~=tID/RF ~=tEX ~=tMEM ~=tWB
A5-stagepipelinewillbe focusofourdetaileddesignSomecommercialdesignshaveover30pipelinestagestodoanintegeradd!
5-StagePipelinedExecution:ResourceUsageTheWholePipelineResourcesareUsedby5InstructionsinEveryCycle!
8
time t0 t1 t2 t3 t4 t5 t6 t7 ....Instruction1 IF1 ID1 EX1 ME1 WB1Instruction2 IF2 ID2 EX2 ME2 WB2Instruction3 IF3 ID3 EX3 ME3 WB3Instruction4 IF4 ID4 EX4 ME4 WB4Instruction5 IF5 ID5 EX5 ME5 WB5
Write-Back
(WB)- I1I-Fetch(IF)- I5
Execute(EX)- I3
Decode,Reg.Fetch(ID)- I4
Memory(ME)- I2
addr
wdata
rdataDataMemory
weALU
ImmSelect
0x4Add
addrrdata
Inst.Memory
rd1
GPRs
rs1rs2wawdrd2
we
IRPC
InstructionRegisterinEachStage
9
IRIR IR
PC A
BY
R
MD1 MD2
addrinst
InstMemory
0x4Add
IR
ImmSelect
ALUrd1
GPRs
rs1rs2
wawd rd2
we
wdata
addr
wdata
rdataDataMemory
we
• AninstructionReg (IR) ineachstagetocontaintheinstructioninthatstage
Instruction1
ConnectControlsfromInstructionRegister
10
IRIR IR
PC A
BY
R
MD1 MD2
addrinst
InstMemory
0x4Add
IR
ImmSelect
ALUrd1
GPRs
rs1rs2
wawd rd2
we
DataMemorywdata
addr
wdata
rdata
we
ImmSel Op2Sel
WBSelMemWrite
RegWriteEn
F D E M W
ALUControl
Instruction1
Instruction4Instruction3
Instruction2
ComparedWithControlLogicinUnpipelined
• Unpipelined:– Singlecontrollogicuses
theinstructionfromIF• Pipelined:– Distributedlogicsthat
uses instructionsfromIRsineachstage
11
InstructionsInteractWithEachOtherin Pipeline:DealingwithHazards
• Aninstructionmayneedaresourcebeingusedbyanotherinstructionà structuralhazard– Solution#1:Stallingnewerinstructiontillolderinstructionfinishes– Solution#2:Addingmorehardwaretodesign• E.g.,separatememoryintoI-memoryandD-memory
– Our5-stagepipelinehasnostructuralhazardsbydesign
• Aninstructiondependsonsomethingproducedbyanearlierone– Dependencemaybeforadatavalueorforusingsameregister(notthe
value)àdatahazard• SolutionsforRAWhazards:#1,interlocking(bubbledelay),and#2,forwarding
• WARandWRWhazards:notpossiblefor5-stagepipeline
– Dependencemaybeforthenextinstruction’saddressà controlhazard(branches,exceptions)• Solutions:#delay,prediction,etc
12
Read-After-Write(RAW)DataHazards
13
...x1 ¬ x0 + 10x4 ¬ x1 + 17...
x1 in GPRs contains stale value since the passing of value between two instructions has to go through GPRs (register file).
x1 ¬ …x4 ¬ x1 …
IRIR IR
PCA
B
Y
R
MD1 MD2
addrinst
InstMemory
0x4Add
IR
ImmSelect
ALUrd1
GPRs
rs1rs2
wawd rd2
we
wdata
addr
wdata
rdataData Memory
we
ToResolveDataHazards:#1,Interlocking,i.e.StallPipelinebyInsertingBubbles
14
stalled stages
timet0 t1 t2 t3 t4 t5 t6 t7 . . . .
IF I1 I2 I3 I3 I3 I3 I4 I5ID I1 I2 I2 I2 I2 I3 I4 I5EX I1 - - - I2 I3 I4 I5ME I1 - - - I2 I3 I4 I5WB I1 - - - I2 I3 I4 I5
timet0 t1 t2 t3 t4 t5 t6 t7 . . . .
(I1) x1 ¬ (x0) + 10IF1 ID1 EX1 ME1 WB1(I2) x4 ¬ (x1) + 17 IF2 ID2 ID2 ID2 ID2 EX2 ME2 WB2(I3) IF3 IF3 IF3 IF3 ID3 EX3 ME3 WB3(I4) IF4 ID4 EX4 ME4 WB4(I5) IF5 ID5 EX5 ME5 WB5
Resource Usage
- Þ pipeline bubble
InsertBubbleforInterlocking
15
IRIR IR
PCA
B
Y
R
MD1 MD2
addrinst
InstMemory
0x4Add
IR
ImmSelect
ALUrd1
GPRs
rs1rs2
wawd rd2
we
wdata
addr
wdata
rdataData Memory
we
bubble
...x1 ¬ x0 + 10x4 ¬ x1 + 17...
Stall Condition Bubble:Nop instruction,e.g.addx0,x0,x0BubbleareinsertedatIDandbyenteringEXEstage
InterlockControlLogic
16
IRIR IR
PCA
B
Y
R
MD1 MD2
addrinst
InstMemory
0x4Add
IR
ImmSelect
ALUrd1
GPRs
rs1rs2
wawd rd2
we
wdata
addr
wdata
rdataData Memory
we
bubble
Compare the source registers of the instruction in the decode stage with the destination register of the uncommittedinstructions.
stallCstall
ws
rs2rs1 ?
...x1 ¬ x0 + 10x4 ¬ x1 + 17...
InterlockControlLogicignoringjumps&branches
17
Shouldwealwaysstallif anrs fieldmatchessomerd?
IRIR IR
PC A
BY
R
MD1 MD2
addrinst
InstMemory
0x4Add
IR ALUrd1
GPRs
rs1rs2
wawd rd2
we
wdata
addr
wdata
rdataDataMemory
we
bubble
stallCstall
wsWB
rs1rs2 ?
weWB
re1 re2Cre
wsEXweMEM wsMEM
Cdest CdestweEX
noteveryinstructionwritesaregister=>wenoteveryinstructionreadsaregister=>re
ImmSelect
we:writeenable,1-biton/offws:writeselect,5-bitregisternumberre:readenable,1-biton/offrs:readselect,5-bitregisternumber
...x1 ¬ x0 + 10x4 ¬ x1 + 17...
Source&DestinationRegisters
18
ALUI/LW/JALRALU
SW/Bcond
func7 rs2 rs1 func3 rd opcode
immediate12 rs1 func3 rd opcode
imm rs2 rs1 func3 imm
Jump Offset[19:0]
opcode
rd opcode
source(s) destinationALU rd <=rs1func10rs2 rs1,rs2 rdALUI rd <=rs1opimm rs1 rdLW rd <=M[rs1+imm] rs1 rdSW M[rs1+imm]<=rs2 rs1,rs2 -Bcond rs1,rs2 rs1,rs2 -
true: PC<=PC+immfalse: PC<=PC+4
JAL x1<=PC,PC<=PC+imm - rdJALR rd <=PC,PC<=rs1+imm rs1 rd
DerivingtheStallSignal
19
Cdestws =rdwe=Caseopcode
ALU,ALUi,LW,JALR=>on... =>off
Crere1=Caseopcode
ALU,ALUi,LW,SW,Bcond,JALR=>onJAL=>off
re2=CaseopcodeALU,SW,Bcond =>on… =>off
Cstall stall=((rs1D==wsEX)&&weEX +(rs1D==wsMEM)&&weMEM +(rs1D==wsWB)&&weWB)&&re1D +((rs2D==wsEX)&&weEX +(rs2D==wsMEM)&&weMEM +(rs2D==wsWB)&&weWB)&&re2D
source(s) destinationALU rd <=rs1func10rs2 rs1,rs2 rdALUI rd <=rs1opimm rs1 rdLW rd <=M[rs1+imm] rs1 rdSW M[rs1+imm]<=rs2 rs1,rs2 -Bcond rs1,rs2 rs1,rs2 -
true: PC<=PC+immfalse: PC<=PC+4
JAL x1<=PC,PC<=PC+imm - rdJALR rd <=PC,PC<=rs1+imm rs1 rd
NoneedtheWBforinterlockcontrolsinceweonlyneedtodealwithhazardbetweenMEM-EXEandEXE-EXE.FortwoinstructionswhichareinWBandEXE,andhaveRAWhazard,thedependencyarehandledthroughtheregisterfile.
ToResolveDataHazards:#2,Forwarding(Bypassing)
20
Eachstallorkillintroducesabubbleinthepipeline=>CPI>1
time t0 t1 t2 t3 t4 t5 t6 t7 . . . .(I1) x1 ¬ x0 + 10 IF1 ID1 EX1 ME1 WB1(I2) x4 ¬ x1 + 17 IF2 ID2 ID2 ID2 ID2 EX2 ME2 WB2(I3) IF3 IF3 IF3 IF3 ID3 EX3 ME3(I4) stalled stages IF4 ID4 EX4(I5) IF5 ID5
time t0 t1 t2 t3 t4 t5 t6 t7 . . . .(I1) x1 ¬ x0 + 10 IF1 ID1 EX1 ME1 WB1(I2) x4 ¬ x1 + 17 IF2 ID2 EX2 ME2 WB2(I3) IF3 ID3 EX3 ME3 WB3(I4) IF4 ID4 EX4 ME4 WB4(I5) IF5 ID5 EX5 ME5 WB5
Anewdatapath,i.e.,abypass,cangetthedatafromtheoutputoftheALUtoitsinput
Review:HardwareSupportforForwarding,andDetectingRAWHazardswithPreviousand2nd PreviousInstructions
• Slide48oflecture05-06
21
AddingaBypass(ToBypassRegisterFiles)
22
ASrc
...(I1) x1<=x0+10(I2) x4<=x1+17
x4<=x1... x1<=...
IRIR IR
PC A
BY
R
MD1 MD2
addrinst
InstMemory
0x4Add
IR
ImmSelect
ALUrd1
GPRs
rs1rs2
wawd rd2
we
wdata
addr
wdata
rdataDataMemory
we
bubble
stall
D
E M W
Whendoesthisbypasshelp?x1<=M[x0+10]x4<=x1+17
JAL500x4<=x1+17
Yes No,LoadàEXE-Use No
TheBypassSignal:DerivingitfromtheStallSignal
23
ASrc =(rs1D==wsE)&&weE &&re1D
we=CaseopcodeALU,ALUi,LW,,JALJALR=>on...=>off
NobecauseonlyALUandALUi instructionscanbenefitfromthisbypass
Isthiscorrect?
SplitweE intotwocomponents:we-bypass,we-stall
stall=(((rs1D==wsE)&&weE +(rs1D==wsM)&&weM +(rs1D==wsW)&&weW)&&re1D+((rs2D==wsE)&&weE +(rs2D==wsM)&&weM +(rs2D==wsW)&&weW)&&re2D)
ws =rd
BypassandStallSignals
24
we-bypassE =CaseopcodeEALU,ALUi =>on... =>off
ASrc =(rs1D==wsE)&&we-bypassE &&re1D
SplitweE intotwocomponents:we-bypass,we-stall
stall= ((rs1D==wsE)&&we-stallE +(rs1D==wsM)&&weM +(rs1D==wsW)&&weW) &&re1D
+((rs2D==wsE)&&weE +(rs2D==wsM)&&weM +(rs2D==wsW)&&weW) &&re2D
we-stallE =CaseopcodeELW,JAL,JALR=>onJAL =>on... =>off
FullyBypassedDatapath
25
ASrcIRIR IR
PC A
BY
R
MD1 MD2
addrinst
InstMemory
0x4Add
IR ALU
ImmSelect
rd1
GPRs
rs1rs2
wawd rd2
we
wdata
addr
wdata
rdataDataMemory
we
bubble
stall
D
E M W
PCforJAL,...
BSrc
Istherestillaneedforthestallsignal? stall=(rs1D==wsE) &&(opcodeE==LWE)&&(wsE!=0 )&&re1D
+(rs2D==wsE)&&(opcodeE==LWE)&&(wsE!=0 )&&re2D
ControlHazards:BranchesandJumps
• JAL:unconditionaljumptoPC+immediate
• JALR:indirectjumptors1+immediate
• Branch:if(rs1conds rs2),branchtoPC+immediate
26
InfoforControlTransfer
27
Instruction Takenknown? Targetknown?
JALJALRB<cond.>
Twopiecesofinfo:1.TakenorNotTaken2.Targetaddress?
• JAL:unconditionaljumptoPC+immediate• JALR:indirectjumptors1+immediate• Branch:if(rs1conds rs2),branchtoPC+immediate
AfterInst.Decode
AfterInst.Decode AfterInst.Decode
AfterInst.Decode AfterReg.Fetch
After Execute
SpeculateNextAddressisPC+4
28
A jump instruction kills (not stalls)the following instruction
stall
How?
I2
I1
104
IR IR
PC addrinst
InstMemory
0x4Add
bubble
IR
E MAdd
Jump?
PCSrc (pc+4 / jabs / rind/ br)
I1 096 ADD I2 100 J 304I3 104 ADD
I4 304 ADD
kill
PipeliningJumps
29
I2
I1
104
stall
IR IR
PC addrinst
InstMemory
0x4Add
bubble
IR
E MAdd
Jump?
PCSrc (pc+4 / jabs / rind/ br)
IRSrcD = Case opcodeDJAL Þ bubble... Þ IM
To kill a fetched instruction -- Insert a mux before IR
Any interaction between stall and jump?
bubble
IRSrcD
I2 I1
304bubble
I1 096 ADD I2 100 J 304I3 104 ADD
I4 304 ADD
kill
JumpPipelineDiagrams
30
timet0 t1 t2 t3 t4 t5 t6 t7 . . . .
IF I1 I2 I3 I4 I5ID I1 I2 - I4 I5EX I1 I2 - I4 I5ME I1 I2 - I4 I5WB I1 I2 - I4 I5
timet0 t1 t2 t3 t4 t5 t6 t7 . . . .
(I1) 096: ADD IF1 ID1 EX1 ME1 WB1(I2) 100: J 304 IF2 ID2 EX2 ME2 WB2(I3) 104: ADD IF3 - - - -(I4) 304: ADD IF4 ID4 EX4 ME4 WB4
Resource Usage
- Þ pipeline bubble
PipeliningConditionalBranches
31
I1 096 ADDI2 100 BEQx1,x2+200I3 104 ADDI4 304 ADD
BEQ?
I2
I1
104
stall
IR IR
PC addrinst
InstMemory
0x4Add
bubble
IR
E MAdd
PCSrc (pc+4/jabs/rind/br)
bubble
IRSrcD
Branchconditionisnotknownuntiltheexecutestage
AYALU
Taken?
PipeliningConditionalBranches
32
I1 096 ADDI2 100 BEQx1,x2+200I3 104 ADDI4 304 ADD
stall
IR IR
PC addrinst
InstMemory
0x4Add
bubble
IR
E MAdd
PCSrc (pc+4/jabs/rind/br)
bubble
IRSrcD
AYALU
Taken?
Ifthebranchistaken:- Killthetwofollowinginstructions- TheinstructionatthedecodestageisnotvalidÞ stallsignalisnotvalid
I2 I1
108I3
Bcond?
?
PipeliningConditionalBranches
33
I1: 096 ADDI2: 100 BEQx1,x2+200I3: 104 ADDI4: 304 ADD
stall
IR IR
PC addrinst
InstMemory
0x4Add
bubble
IR
E M
PCSrc (pc+4/jabs/rind/br)
bubble AYALU
Taken?I2 I1
108I3
Bcond?
Jump?
IRSrcD
IRSrcE
Ifthebranchistaken- killthetwofollowinginstructions- theinstructionatthedecodestageisnotvalidÞ stallsignalisnotvalid
Add
PC
BranchPipelineDiagrams(resolvedinexecutestage)
34
timet0 t1 t2 t3 t4 t5 t6 t7 . . . .
IF I1 I2 I3 I4 I5ID I1 I2 I3 - I5EX I1 I2 - - I5ME I1 I2 - - I5WB I1 I2 - - I5
timet0 t1 t2 t3 t4 t5 t6 t7 . . . .
(I1) 096: ADD IF1 ID1 EX1 ME1 WB1(I2) 100: BEQ +200 IF2 ID2 EX2 ME2 WB2(I3) 104: ADD IF3 ID3 - - -(I4) 108: IF4 - - - -(I5) 304: ADD IF5 ID5 EX5 ME5 WB5
Resource Usage
- Þ pipeline bubble
UseSimplerBranches:E.g.OnlyCompareOneRegisterAgainstZeroinIDStage
35
timet0 t1 t2 t3 t4 t5 t6 t7 . . . .
IF I1 I2 I3 I4 I5ID I1 I2 - I4 I5EX I1 I2 - I4 I5ME I1 I2 - I4 I5WB I1 I2 - I4 I5
timet0 t1 t2 t3 t4 t5 t6 t7 . . . .
(I1) 096: ADD IF1 ID1 EX1 ME1 WB1(I2) 100: BEQZ +200 IF2 ID2 EX2 ME2 WB2(I3) 104: ADD IF3 - - - -(I4) 300: ADD IF4 ID4 EX4 ME4 WB4
Resource Usage
- Þ pipeline bubble
PipelinedMIPSDatapath
Adder
IF/ID
MemoryAccess
WriteBack
InstructionFetch
Instr. DecodeReg. Fetch
ExecuteAddr. Calc
ALU
Mem
ory
Reg File
MU
X
Data
Mem
ory
MU
X
SignExtend
Zero?
MEM
/WB
EX/M
EM
4
Adder
Next SEQ PC
RD RD RD
Next PC
Address
RS1
RS2
Imm
MU
X
ID/EX
(I1) 096: ADD(I2) 100: BEQZ +200 (I3) 104: ADD(I4) 300: ADD
36
ControlHazardDelaySummary
• JAL:unconditionaljumptoPC+immediate– 1cycledelayofpipeline• JALR:indirectjumptors1+immediate– 1cycledelay• Branch:if(rs1conds rs2),branchtoPC+immediate– 2cyclesdelay– 1cycledelayforsimplerbranch(BEQZ)withpipeline
improvement
37
ReducingControlFlowPenalty
• Softwaresolutions– Eliminatebranches- loopunrolling• Increasestherunlength
– Reduceresolutiontime- instructionscheduling• Computethebranchconditionasearlyaspossible(oflimitedvaluebecausebranchesoftenincriticalpaththroughcode)
• Hardwaresolutions– Findsomethingelsetodo- delayslots• Replacespipelinebubbleswithusefulwork(requiressoftwarecooperation)
– Speculate- branchprediction• Speculativeexecutionofinstructionsbeyondthebranch
38
AdditionalMaterials– BranchPrediction
39
BranchPrediction
• Motivation– Branchpenaltieslimitperformanceofdeeplypipelinedprocessors– Modernbranchpredictorshavehighaccuracy– (>95%)andcanreducebranchpenaltiessignificantly
• Requiredhardwaresupport:– Predictionstructures:• Branchhistorytables,branchtargetbuffers,etc.
• Mispredict recoverymechanisms:– Keepresultcomputationseparatefromcommit– Killinstructionsfollowingbranchinpipeline– Restorestatetothatfollowingbranch
40
StaticBranchPrediction
41
Overallprobabilityabranchistakenis~60-70%but:
ISAcanattachpreferreddirectionsemanticstobranches,e.g.,MotorolaMC88110
bne0 (preferredtaken) beq0 (nottaken)
backward90%
forward50%
WhatC++statementdoesthislooklike
WhatC++statementdoesthislooklike
DynamicBranchPredictionlearningbasedonpastbehavior
• Temporalcorrelation(time)– IfItellyouthatacertainbranchwastakenlasttime,doesthishelp?
– Thewayabranchresolvesmaybeagoodpredictorofthewayitwillresolveatthenextexecution
• Spatialcorrelation(space)– Severalbranchesmayresolveinahighlycorrelatedmanner– Forinstance,apreferredpathofexecution
42
DynamicBranchPrediction
• 1-bitpredictionscheme– Low-portionaddressasaddressforaone-bitflagforTakenor
NotTaken historically– Simple• 2-bitprediction– Misstwicetochange
43
BranchPredictionBits
• Assume2BPbitsperinstruction• Changethepredictionaftertwoconsecutivemistakes!
44
¬takewrongtaken
¬taken
taken
taken
taken¬takeright
takeright
takewrong
¬taken
¬taken¬taken
BPstate:(predict take/¬take)x(lastprediction right/wrong)
BranchHistoryTable
45
4K-entryBHT,2bits/entry,~80-90%correctpredictions
0 0FetchPC
Branch? TargetPC
+
I-Cache
Opcode offsetInstruction
kBHTIndex
2k-entryBHT,2bits/entry
Taken/¬Taken?
ExploitingSpatialCorrelationYehandPatt,1992
46
Iffirstconditionfalse,secondconditionalsofalse
Historyregister,H,recordsthedirectionofthelastNbranchesexecutedbytheprocessor
if (x[i] < 7) theny += 1;
if (x[i] < 5) thenc -= 4;
Two-LevelBranchPredictor
47
PentiumProusestheresultfromthelasttwobranchestoselectoneofthefoursetsofBHTbits(~95%correct)
0 0
kFetchPC
ShiftinTaken/¬Takenresultsofeachbranch
2-bitglobalbranchhistoryshiftregister
Taken/¬Taken?
SpeculatingBothDirections• Analternativetobranchpredictionistoexecutebothdirectionsofabranchspeculatively– resourcerequirementisproportionaltothenumberofconcurrent
speculativeexecutions– onlyhalftheresourcesengageinusefulworkwhenbothdirections
ofabranchareexecutedspeculatively– branchpredictiontakeslessresourcesthanspeculativeexecution
ofbothpaths
• Withaccuratebranchprediction,itismorecosteffectivetodedicateallresourcestothepredicteddirection!– Whatwouldyouchoosewith80%accuracy?
48
AreWeMissingSomething?
• Knowingwhetherabranchistakenornotisgreat,butwhatelsedoweneedtoknowaboutit?
Branchtargetaddress
49
BranchTargetBuffer
50
BPbitsarestoredwiththepredictedtargetaddress.
IFstage:If(BP=taken)thennPC=targetelsenPC=PC+4Later:checkprediction,ifwrongthenkilltheinstructionandupdateBTB&BPb elseupdateBPb
IMEM
PC
BranchTargetBuffer(2k entries)
k
BPbpredicted
target BP
target
AddressCollisions(MisPrediction)
51
Whatwillbefetchedaftertheinstructionat1028?BTBprediction =Correcttarget =
=>
Assumea128-entryBTB
BPbtargettake236
1028Add.....
132Jump+104
InstructionMemory
2361032
kill PC=236andfetch PC=1032
Isthisacommonoccurrence?
BTBisonlyforControlInstructions
• Isevenbranchpredictionfastenoughtoavoidbubbles?• WhendoweindextheBTB?– i.e.,whatstateisthebranchin,inordertoavoidbubbles?
• BTBcontainsusefulinformationforbranchandjumpinstructionsonly=> Donotupdateitforotherinstructions
• ForallotherinstructionsthenextPCisPC+4!
• Howtoachievethiseffectwithoutdecodingtheinstruction?
52
BranchTargetBuffer(BTB)
53
• KeepboththebranchPCandtargetPCintheBTB• PC+4isfetchedifmatchfails• Onlytaken branchesandjumpsheldinBTB• NextPCdeterminedbefore branchfetchedanddecoded
2k-entry direct-mapped BTB(can also be associative)
I-Cache PC
k
Valid
valid
EntryPC
=
match
predicted
target
targetPC
CombiningBTBandBHT
• BTBentriesareconsiderablymoreexpensivethanBHT,butcanredirectfetchesatearlierstageinpipelineandcanaccelerateindirectbranches(JR)
• BHTcanholdmanymoreentriesandismoreaccurate
54
A PCGeneration/MuxP InstructionFetchStage1F InstructionFetchStage2B BranchAddressCalc/BeginDecodeI CompleteDecodeJ SteerInstructionstoFunctionalunitsR RegisterFileReadE IntegerExecute
BTB
BHTBHTinlaterpipelinestagecorrectswhenBTBmissesapredictedtakenbranch
BTB/BHTonlyupdatedafterbranchresolvesinEstage
UsesofJumpRegister(JR)
• Switchstatements(jumptoaddressofmatchingcase)
• Dynamicfunctioncall(jumptorun-timefunctionaddress)
• Subroutinereturns(jumptoreturnaddress)
55
HowwelldoesBTBworkforeachofthesecases?
BTBworkswellifsamecaseusedrepeatedly
BTBworkswellifsamefunctionusuallycalled,(e.g.,inC++programming,whenobjectshavesametypeinvirtualfunctioncall)
BTBworkswellifusuallyreturntothesameplaceÞ Oftenonefunctioncalledfrommanydistinctcallsites!
SubroutineReturnStack
SmallstructuretoaccelerateJRforsubroutinereturns,typicallymuchmoreaccuratethanBTBs.
56
&fb()
&fc()
Pushcalladdresswhenfunctioncallexecuted
Popreturnaddresswhensubroutinereturndecoded
fa() { fb(); }fb() { fc(); }fc() { fd(); }
&fd() kentries(typicallyk=8-16)