lecture 08: risc-v single-cycle implementaon - passlab · – slides for risc-v single-cycle...
TRANSCRIPT
Lecture08:RISC-VSingle-CycleImplementa9on
CSE564ComputerArchitectureSummer2017
DepartmentofComputerScienceandEngineeringYonghongYan
[email protected]/~yan
1
Acknowledgements
• ThenotescoverAppendixCofthetextbook,butweuseRISC-VinsteadofMIPSISA– SlidesforgeneralRISCISAimplementaLonareadaptedfrom
Lectureslidesfor“ComputerOrganizaLonandDesign,FiRhEdiLon:TheHardware/SoRwareInterface”textbookforgeneralRISCISAimplementaLon
– SlidesforRISC-Vsingle-cycleimplementaLonareadaptedfromComputerScience152:ComputerArchitectureandEngineering,Spring2016byDr.GeorgeMichelogiannakisfromUCBerkeley
2
Introduc9on
• CPUperformancefactors– InstrucLoncount
• DeterminedbyISAandcompiler– CPIandCycleLme
• DeterminedbyCPUhardware
• Simplesubset,showsmostaspects– Memoryreference:lw,sw– ArithmeLc/logical:add,sub,and,or,slt– Controltransfer:beq,j
CPU Time = InstructionsProgram
* CyclesInstruction
*TimeCycle
Processor
Control
Datapath
ComponentsofaComputer
4
PC
Registers
ArithmeLc&LogicUnit(ALU)
MemoryInput
Output
Bytes
Enable?Read/Write
Address
WriteData
ReadData
Processor-MemoryInterface I/O-MemoryInterfaces
Program
Data
TheCPU
• Processor(CPU):theacLvepartofthecomputerthatdoesallthework(datamanipulaLonanddecision-making)
• Datapath:porLonoftheprocessorthatcontainshardwarenecessarytoperformoperaLonsrequiredbytheprocessor(thebrawn)
• Control:porLonoftheprocessor(alsoinhardware)thattellsthedatapathwhatneedstobedone(thebrain)
5
DatapathandControl
• DatapathdesignedtosupportdatatransfersrequiredbyinstrucLons
• Controllercausescorrecttransferstohappen
Controller opcode, funct
inst
ruct
ion
mem
ory
+4
rt rs rd
regi
ster
s ALU
Dat
a m
emor
y
imm
PC
6
Logic Design Basics
• Information encoded in binary – Low voltage = 0, High voltage = 1 – One wire per bit – Multi-bit data encoded on multi-wire buses
• Combinational circuit – Operate on data – Output is a function of input
• State (sequential) circuit – Store information
Combinational Circuits
• AND-gate – Y = A & B
Chapter 4 — The Processor — 8
A B Y
I0 I1 Y
M u x
S
n MulLplexern Y=S?I1:I0
A
B Y +
A
B
Y ALU
F
n Addern Y=A+B
n ArithmeLc/LogicUnitn Y=F(A,B)
Sequential Circuits
• Register: stores data in a circuit – Uses a clock signal to determine when to update the
stored value – Edge-triggered: update when Clk changes from 0 to 1
D
Clk
Q Clk
D
Q
Edge-TriggeredDFlipFlops
• ValueofDissampledonposi9veclockedge.
• Qoutputssampledvalueforrestofcycle.
D Q
CLK
D
Q
Sequential Circuits
• Register with write control – Only updates on clock edge when write control input is 1 – Used when stored value is required later
D
Clk
Q Write
Write
D
Q
Clk
Clocking Methodology • Combinational logic transforms data during clock
cycles – Between clock edges – Input from state elements, output to state element – Longest delay determines clock period
Singlecycledatapaths
Processorusessynchronouslogicdesign(a“clock”). f! T!
1 MHz! 1 μs!10 MHz! 100 ns!
100 MHz! 10 ns!1 GHz! 1 ns!
AllstateelementsactlikeposiLveedge-triggeredflipflops.
D Q
clk
Reset ?
HardwareElementsofCPU
• CombinaLonalcircuits– Mux,Decoder,ALU,...
• Synchronousstateelements– Flipflop,Register,Registerfile,SRAM,DRAM
Edge-triggered:Dataissampledattherisingedge
Clk
D
Q
Enff
Q
D
ClkEn
OpSelect-Add,Sub,...-And,Or,Xor,Not,...-GT,LT,EQ,Zero,...
Result
Comp?
A
B
ALU
Sel
OA0A1An-1
Mux...
lg(n)
A
Decode
r ...
O0O1On-1
lg(n)
RegisterFiles
• ReadsarecombinaLonal
15
ReadData1ReadSel1ReadSel2
WriteSel
Registerfile
2R+1W
ReadData2
WriteData
WEClock
rd1rs1
rs2
ws
wd
rd2
we
ff
Q0
D0
ClkEn
ff
Q1
D1
ff
Q2
D2
ff
Qn-1
Dn-1
...
...
...
register
RegisterFileImplementa9on
• RISC-VintegerinstrucLonshaveatmost2registersourceoperands
16
reg31
rd clk
reg1
wdata
we
rs1rdata1 rdata2
reg0
…
32
…
5 32 32
…
rs255
ASimpleMemoryModel
17
MAGICRAM
ReadData
WriteData
Address
WriteEnableClock
Readsandwritesarealwayscompletedinonecycle• aReadcanbedoneanyLme(i.e.combinaLonal)• aWriteisperformedattherisingclockedgeifitisenabled
=>thewriteaddressanddata mustbestableattheclockedge
Laterinthecoursewewillpresentamorerealis:cmodelofmemory
FiveStagesofInstruc9onExecu9on
• Stage1:InstrucLonFetch• Stage2:InstrucLonDecode• Stage3:ALU(ArithmeLc-LogicUnit)
• Stage4:MemoryAccess
• Stage5:RegisterWrite
18
StagesofExecu9ononDatapath
inst
ruct
ion
mem
ory
+4
rt rs rd
regi
ster
s
ALU
Dat
a m
emor
y
imm
1.InstrucLonFetch
2.Decode/Register
Read
3.Execute 4.Memory 5.RegisterWrite
PC
19
StagesofExecu9on(1/5)
• ThereisawidevarietyofinstrucLons:sowhatgeneralstepsdotheyhaveincommon?
• Stage1:InstrucLonFetch– The32-bitinstrucLonwordmustfirstbefetchedfrom
memory• thecache-memoryhierarchy
– also,thisiswhereweIncrementPC• PC=PC+4,topointtothenextinstrucLon:byteaddressingso+4
20
StagesofExecu9on(2/5)
• Stage2:InstrucLonDecode:gatherdatafromthefields(decodeallnecessaryinstrucLondata)1. readtheopcodetodetermineinstrucLontypeandfield
lengths2. readindatafromallnecessaryregisters
• foradd,readtworegisters• foraddi,readoneregister• forjal,noreadsnecessary
21
StagesofExecu9on(3/5)
• Stage3:ALU(ArithmeLc-LogicUnit):therealworkofmostinstrucLonsisdonehere– ALoperaLons:
• arithmeLc(+,-,*,/),shiRing,logic(&,|),comparisons(slt)
– loadsandstores• lw$t0,40($t1)• theaddressweareaccessinginmemory=thevaluein$t1PLUSthevalue40
• AddiLonisdoneinthisstage– CondiLonalbranch
• Comparisonisdoneinthisstage(onesoluLon)
22
StagesofExecu9on(4/5)
• Stage4:MemoryAccess:onlyloadandstoreinstrucLons– theothersremainidleduringthisstageorskipitalltogether– sincetheseinstrucLonshaveauniquestep,weneedthisextra
stagetoaccountforthem– asaresultofthecachesystem,thisstageisexpectedtobe
fast
23
StagesofExecu9on(5/5)
• Stage5:RegisterWrite– mostinstrucLonswritetheresultofsomecomputaLonintoa
register– examples:arithmeLc,logical,shiRs,loads,slt– whataboutstores,branches,jumps?
• don’twriteanythingintoaregisterattheend• theseremainidleduringthisfiRhstageorskipitalltogether
24
StagesofExecu9ononDatapath
inst
ruct
ion
mem
ory
+4
rt rs rd
regi
ster
s
ALU
Dat
a m
emor
y
imm
1.InstrucLonFetch
2.Decode/Register
Read
3.Execute 4.Memory 5.RegisterWrite
PC
25
Instruction Execution
• PC → instruction memory, fetch instruction • Register numbers → register file, read registers • Depending on instruction class
– Use ALU to calculate • Arithmetic result • Memory address for load/store • Branch condition and target address
– Access data memory for load/store – PC ← target address or PC + 4
CPU Components
Multiplexers
n Can’tjustjoinwirestogethern UsemulLplexers
Control Signals
Building a Datapath
• Datapath– Elementsthatprocessdataandaddresses
intheCPU• Registers,ALUs,mux’s,memories,…
• WewillbuildaRISCVdatapathincrementally– Refiningtheoverviewdesign
Instruction Fetch
32-bitregister
Incrementby4fornextinstrucLon
R-Format Instructions
• Read two register operands • Perform arithmetic/logical operation • Write register result
Load/Store Instructions
• Read register operands • Calculate address using 12-bit offset
– Use ALU, but sign-extend offset • Load: Read memory and update register • Store: Write register value to memory
Branch Instructions
• Read register operands • Compare operands
– Use ALU, subtract and check Zero output • Calculate target address
– Sign-extend displacement – Shift left 2 places (word displacement) – Add to PC + 4
• Already calculated by instruction fetch
Branch Instructions
Justre-routeswires
Sign-bitwirereplicated
Composing the Elements
• First-cut data path does an instruction in one clock cycle – Each datapath element can only do one function at a time – Hence, we need separate instruction and data memories
• Use multiplexers where alternate data sources are used for different instructions
R-Type/Load/Store Datapath
Full Datapath
ALU Control
• ALU used for – Load/Store: F = add – Branch: F = subtract – R-type: F depends on funct field
ALU control Function 0000 AND 0001 OR 0010 add 0110 subtract 0111 set-on-less-than 1100 NOR
ALU Control
• Assume 2-bit ALUOp derived from opcode – Combinational logic derives ALU control
opcode ALUOp Operation funct ALU function ALU control lw 00 load word XXXXXX add 0010
sw 00 store word XXXXXX add 0010 beq 01 branch equal XXXXXX subtract 0110 R-type 10 add 100000 add 0010
subtract 100010 subtract 0110 AND 100100 AND 0000 OR 100101 OR 0001
set-on-less-than 101010 set-on-less-than 0111
Datapath:Reg-RegALUInstruc9ons
41
RegWriteTiming?
0x4Add
clk
addrinst
Inst.Memory
PC
Inst<19:15>Inst<24:20>
Inst<11:7>
Inst<14:12>
OpCode
ALU
ALUControl
RegWriteEn
clk
rd1
GPRs
rs1rs2
wawd rd2
we
7 5 5 3 5 7 func7 rs2 rs1 func3 rd opcode rd ← (rs1) func (rs2) 31 25 24 20 19 15 14 12 11 7 6 0
Datapath:Reg-ImmALUInstruc9ons
42
ImmSelect
ImmSel
Inst<31:20>
OpCode
0x4Add
clk
addrinst
Inst.Memory
PCALU
RegWriteEn
clk
rd1
GPRs
rs1rs2
wawd rd2
weInst<19:15>
Inst<11:7>
Inst<14:12> ALUControl
12 5 3 5 7 immediate12 rs1 func3 rd opcode rd ← (rs1) op immediate 31 20 19 15 14 12 11 7 6 0
ConflictsinMergingDatapath
43
ImmSelect
ImmSelOpCode
0x4Add
clk
addrinst
Inst.Memory
PCALU
RegWrite
clk
rd1
GPRs
rs1rs2
wawd rd2
weInst<19:15>
Inst<11:7>
Inst<31:20>
Inst<14:12> ALUControl
Introduce muxes
Inst<24:20>
7 5 5 3 5 7 func7 rs2 rs1 func3 rd opcode rd ← (rs1) func (rs2) immediate12 rs1 func3 rd opcode rd ← (rs1) op immediate 31 20 19 15 14 12 11 7 6 0
DatapathforALUInstruc9ons
44
<14:12>
Op2SelReg/Imm
ImmSelect
ImmSelOpCode
0x4Add
clk
addrinst
Inst.Memory
PCALU
RegWriteEnclk
rd1
GPRs
rs1rs2
wawd rd2
we<19:15><24:20>
ALUControl
<11:7>
<6:0>
7 5 5 3 5 7 func7 rs2 rs1 func3 rd opcode rd ← (rs1) func (rs2) immediate12 rs1 func3 rd opcode rd ← (rs1) op immediate 31 20 19 15 14 12 11 7 6 0
Inst<31:20>
Load/StoreInstruc9ons
45
WBSelALU/Mem
rs1 is the base register rd is the destination of a Load, rs2 is the data source for a Store
Op2Sel
“base”
disp
ImmSelOpCode
ALUControl
ALU
0x4Add
clk
addrinst
Inst.Memory
PC
RegWriteEn
clk
rd1
GPRs
rs1rs2
wawd rd2
we
ImmSelect
clk
MemWrite
addr
wdata
rdataDataMemory
we
7 5 5 3 5 7 imm rs2 rs1 func3 imm opcode Store (rs1) + displacement immediate12 rs1 func3 rd opcode Load 31 20 19 15 14 12 11 7 6 0
RISC-VCondi9onalBranches
• Comparetwointegerregistersforequality(BEQ/BNE)orsignedmagnitude(BLT/BGE)orunsignedmagnitude(BLTU/BGEU)
• 12-bitimmediateencodesbranchtargetaddressasasignedoffsetfromPC,inunitsof16-bits(i.e.,shiRleRby1thenaddtoPC).
46
7
6 0opcode
5
11 7imm
3
14 12func3
5
19 15rs1
5
24 20rs2
7
31 25imm
BEQ/BNEBLT/BGEBLTU/BGEU
Condi9onalBranches(BEQ/BNE/BLT/BGE/BLTU/BGEU)
47
0x4
Add
PCSel
clk
WBSelMemWrite
addr
wdata
rdataDataMemory
we
Op2SelImmSelOpCode
Bcomp?
clk
clk
addrinst
Inst.Memory
PC rd1
GPRs
rs1rs2
wawd rd2
we
ImmSelect
ALU
ALUControl
Add
br
pc+4
RegWrEn
BrLogic
FullRISCV1StageDatapath
48
+4
Instruction Mem
RegFile
IType SignExtend
DecoderData Mem
ir[24:20]
branch
pc+4
pc_s
el
ir[31:20]
rs1
ALU
ControlSignals
wb_s
el
RegFile
rf_we
n
val
mem
_rw
PC
mem
_val
addrwdata
rdata
Inst
JumpTargGen
BranchTargGen
ir[19:15]
ir[31:25],ir[11:7]
PC+4jalr
rs2
BranchCondGen
br_eq?br_lt?
co-p
roce
ssor
(CSR
) reg
ister
s
ir[11
:7]
jump
ir[31:12]
Execute Stage
br_ltu?PC
addr
ir[31:12]
JumpRegTargGen
Op2Sel
Op1SelAluFun
data
wa
wd
en
addr data
UType
Note: for simplicity, the CSR File (control and status registers) and associated datapath is not shown
RISC-V Sodor 1-Stage
exception
SType SignExtend
ir[31:20]
PC
rs2rs1
rs2
HardwiredControlispureCombina9onalLogic
49
combinaLonallogic
opcode
Equal?
ImmSel
Op2Sel
FuncSel
MemWrite
WBSel
WASel
RegWriteEn
PCSel
ALUControl&ImmediateExtension
50
Inst<6:0>(Opcode)
DecodeMap
Inst<14:12>(Func3)
ALUop
0?
+
FuncSel(Func,Op,+,0?)
ImmSel(IType12,SType12,UType20)
HardwiredControlTable
51
Opcode ImmSel Op2Sel FuncSel MemWr RFWen WBSel WASel PCSel
ALU ALUi LW SW BEQtrue
BEQfalse
J JAL
JALR
Op2Sel=Reg/Imm WBSel=ALU/Mem/PCWASel=rd/X1 PCSel=pc+4/br/rind/jabs
* * * no yes rindPC rdjabs* * * no yes PC X1
jabs* * * no no * *pc+4SBType12 * * no no * *
brSBType12 * * no no * *
pc+4SType12 Imm + yes no * *
pc+4* Reg Func no yes ALU rdIType12 Imm Op pc+4no yes ALU rd
pc+4IType12 Imm + no yes Mem rd
Single-CycleHardwiredControl
clockperiodissufficientlylongforallofthefollowingstepstobe“completed”:1. InstrucLonfetch2. Decodeandregisterfetch3. ALUoperaLon4. Datafetchifrequired5. Registerwrite-backsetupLme
=>tC>tIFetch+tRFetch+tALU+tDMem+tRWB
Attherisingedgeofthefollowingclock,thePC,registerfileandmemoryareupdated
52
Implementa9oninReal
• Load-StoreRISCISAsdesignedforefficientpipelinedimplementaLons– InspiredbyearlierCraymachines(CDC6600/7600)
• RISC-VISAimplementedusingChiselhardwareconstrucLonlanguage– Chisel:h}ps://chisel.eecs.berkeley.edu/– Ge~ngstarted:
• h}ps://chisel.eecs.berkeley.edu/2.2.0/ge~ng-started.html– Checkresourcepageforslidesandotherinfo
53
Chiselinoneslides
• Module• IO• Wire• Reg• Mem
54
UCBRISC-VSodor
• h}ps://github.com/ucb-bar/riscv-sodor– Single-cycle:
• h}ps://github.com/ucb-bar/riscv-sodor/tree/master/src/rv32_1stage
55