designing a multicycle processor adapted from dave patterson (http.cs.berkeley.edu/~patterson)...
TRANSCRIPT
Designing a Multicycle Processor
Adapted from
Dave Patterson (http.cs.berkeley.edu/~patterson)
lecture slides
Dave Patterson serving as ACM president
Recap: Processor Design is a Process
° Bottom-up• assemble components in target technology to establish critical
timing
° Top-down• specify component behavior from high-level requirements
° Iterative refinement• establish partial solution, expand and improve
datapath control
processorInstruction SetArchitecture
=>
Reg. File Mux ALU Reg Mem Decoder Sequencer
Cells Gates
Recap: A Single Cycle Datapath
° We have everything except control signals (underlined)• How to generate the control signals?
32
ALUctr
Clk
busW
RegWr
32
32
busA
32
busB
55 5
Rw Ra Rb
32 32-bitRegisters
Rs
Rt
Rt
RdRegDst
Exten
der
Mu
x
Mux
3216imm16
ALUSrc
ExtOp
Mu
x
MemtoReg
Clk
Data InWrEn
32
Adr
DataMemory
32
MemWrA
LU
InstructionFetch Unit
Clk
Zero
Instruction<31:0>
0
1
0
1
01<
21:25>
<16:20>
<11:15>
<0:15>
Imm16RdRsRt
nPC_sel
The “Truth Table” for the Main Control
R-type ori lw sw beq jump
RegDst
ALUSrc
MemtoReg
RegWrite
MemWrite
Branch
Jump
ExtOp
ALUop (Symbolic)
1
0
0
1
0
0
0
x
“R-type”
0
1
0
1
0
0
0
0
Or
0
1
1
1
0
0
0
1
Add
x
1
x
0
1
0
0
1
Add
x
0
x
0
0
1
0
x
Subtract
x
x
x
0
0
0
1
x
xxx
op 00 0000 00 1101 10 0011 10 1011 00 0100 00 0010
ALUop <2> 1 0 0 0 0 x
ALUop <1> 0 1 0 0 0 x
ALUop <0> 0 0 0 0 1 x
MainControl
op
6
ALUControl(Local)
func
3
6
ALUop
ALUctr
3
RegDst
ALUSrc
:
Conditions
PLA Implementation of the Main Control
op<0>
op<5>. .op<5>. .<0>
op<5>. .<0>
op<5>. .<0>
op<5>. .<0>
op<5>. .<0>
R-type ori lw sw beq jumpRegWrite
ALUSrc
MemtoReg
MemWrite
Branch
Jump
RegDst
ExtOp
ALUop<2>
ALUop<1>
ALUop<0>
Systematic Generation of Control
° In our single-cycle processor, each instruction is realized by exactly one control command or “microinstruction”
• in general, the controller is a finite state machine
• microinstruction can also control sequencing
Control Logic / Store(PLA, ROM)
OPcode
Datapath
Inst
ruct
ion
Decode
Con
ditio
nsControlPoints
microinstruction
Behavioral models of Datapath Components
entity adder16 isgeneric (ccOut_delay : TIME := 12 ns; adderOut_delay: TIME := 12 ns);port(A, B: in vlbit_1d(15 downto 0); DOUT: out vlbit_1d(15 downto 0); CIN: in vlbit; COUT: out vlbit);end adder16;
architecture behavior of adder16 isbegin
adder16_process: process(A, B, CIN)
variable tmp : vlbit_1d(18 downto 0);variable adder_out : vlbit_1d(15 downto 0);variable carry: vlbit;
begintmp := addum (addum (A, B), CIN);
adder_out := tmp(15 downto 0);carry :=tmp(16);
COUT <= carry after ccOut_delay; DOUT <= adder_out after adderOut_delay; end process;end behavior;
16
1616
A B
DOUT
CinCout
addum adds two n-bit vectors to produce an n+1 bit vector
Note on VHDL variables
° Can only be used in processes
° Are declared before the begin keyword of a process statement
° Assignment operator is :=
° Can be assigned an initial value by adding the := and some constant expression after the type part of the declaration
° Do not have or cause events and are modified differently than signals
° In the following code, if x is a bit signal, then the process will count the number of rising and falling edges that occur on the signal x
count: process (x) variable changes : integer := -1; begin changes:=changes+1; end process;
Behavioral Specification of Control Logic
° Decode / Control-store address modeled by Case statement
° Each arm drives control signals for that operation• just like the microinstruction
• either can be symbolic
entity maincontrol isport(opcode: in vlbit_1d(5 downto 0); equal_cond: in vlbit;
extop out vlbit;ALUsrc out vlbit;ALUop out vlbit_1d(2 downto 0);MEMwr out vlbit;MemtoReg out vlbit;RegWr out vlbit;RegDst out vlbit;nPC out vlbit;
end maincontrol;
Abstract View of our single cycle processor
° looks like a FSM with PC as state
PC
Nex
t P
C
Reg
iste
rF
etch ALU Reg
. W
rt
Mem
Acc
ess
Dat
aM
emInst
ruct
ion
Fet
ch
Res
ult
Sto
re
AL
Uct
r
Reg
Dst
AL
US
rc
Ext
Op
Mem
WrEq
ual
nPC
_sel
Reg
Wr
Mem
Wr
Mem
Rd
MainControl
ALUcontrol
op
fun
Ext
What’s wrong with our CPI=1 processor?
° Long Cycle Time
° All instructions take as much time as the slowest
° Real memory is not so nice as our idealized memory• cannot always get the job done in one (short) cycle
PC Inst Memory mux ALU Data Mem mux
PC Reg FileInst Memory mux ALU mux
PC Inst Memory mux ALU Data Mem
PC Inst Memory cmp mux
Reg File
Reg File
Reg File
Arithmetic & Logical
Load
Store
Branch
Critical Path
setup
setup
It seems the timing of Branch has been deliberately shortened with the addition of a highly efficient comparator.
Memory Access Time
° Physics => fast memories are small (large memories are slow)
• question: register file vs. memory
° => Use a hierarchy of memories
Storage Array
selected word line
addressstorage cell
bit line
sense ampsaddressdecoder
CacheProcessor
1 cycle
proc
. bu
s
L2Cache
mem
. bu
s
2-3 cycles 20 - 50 cycles
memory
Reducing Cycle Time
° Cut combinational dependency graph and insert register / latch
° Do same work in two fast cycles, rather than one slow one
storage element
Acyclic CombinationalLogic
storage element
storage element
Acyclic CombinationalLogic (A)
storage element
storage element
Acyclic CombinationalLogic (B)
=>
Basic Limits on Cycle Time° Next address logic
• PC = (branch == 1) ? PC + 4 + offset : PC + 4
° Instruction Fetch• InstructionReg <= Mem[PC]
° Register Access• A <= R[rs]
° ALU operation• S <= A + B
PC
Nex
t P
C
Ope
rand
Fet
ch Exec Reg
. F
ile
Mem
Acc
ess
Dat
aM
em
Inst
ruct
ion
Fet
ch
Res
ult
Sto
re
AL
Uct
r
Reg
Dst
AL
US
rc
Ext
Op
Mem
Wr
nPC
_sel
Reg
Wr
Mem
Wr
Mem
Rd
Control
Partitioning the CPI=1 Datapath° Add registers between smallest steps
PC
Nex
t P
C
Ope
rand
Fet
ch Exec Reg
. F
ile
Mem
Acc
ess
Dat
aM
em
Inst
ruct
ion
Fet
ch
Res
ult
Sto
re
AL
Uct
r
Reg
Dst
AL
US
rc
Ext
Op
Mem
Wr
nPC
_sel
Reg
Wr
Mem
Wr
Mem
Rd
Example Multicycle Datapath
PC
Nex
t P
C
Ope
rand
Fet
ch
Ext
ALU Reg
. F
ile
Mem
Acc
ess
Dat
aM
em
Inst
ruct
ion
Fet
ch
Res
ult
Sto
re
AL
Uct
r
Reg
Dst
AL
US
rc
Ext
Op
nPC
_sel
Reg
Wr
Mem
Wr
Mem
Rd
IRA
B
S
M
RegFile
Mem
ToR
eg
Equ
al
Recall: Step-by-step Processor Design
Step 1: ISA => Logical Register Transfers
Step 2: Components of the Datapath
Step 3: RTL + Components => Datapath
Step 4: Datapath + Logical RTs => Physical RTs
Step 5: Physical RTs => Control
Step 4: R-type (add, sub, . . .)° Logical Register Transfer
° Physical Register Transfers
inst Logical Register Transfers
ADDU R[rd] <– R[rs] + R[rt]; PC <– PC + 4
inst Physical Register Transfers
IR <– MEM[PC]
ADDU A<– R[rs]; B <– R[rt]
S <– A + B
R[rd] <– S; PC <– PC + 4
Exe
c
Reg
. F
ile
Mem
Acc
ess
Dat
aM
em
A
B
S
M
Reg
File
Equ
al
PC
Nex
t P
C
IR
Inst
. M
em
Step 4:Logical immed° Logical Register Transfer
° Physical Register Transfers
inst Logical Register Transfers
ORi R[rt] <– R[rs] OR zx(Im16); PC <– PC + 4
inst Physical Register Transfers
IR <– MEM[pc]
ORi A<– R[rs]; B <– R[rt]
S <– A OR ZeroExt(Im16)
R[rt] <– S; PC <– PC + 4
Exe
c
Reg
. F
ile
Mem
Acc
ess
Dat
aM
em
A
B
S
M
Reg
File
Equ
al
PC
Nex
t P
C
IR
Inst
. M
em
Step 4 : Load° Logical Register Transfer
° Physical Register Transfers
inst Logical Register Transfers
LW R[rt] <– MEM(R[rs] + sx(Im16));
PC <– PC + 4
inst Physical Register Transfers
IR <– MEM[PC]
LW A<– R[rs]; B <– R[rt]
S <– A + SignEx(Im16)
M <– MEM[S]
R[rt] <– M; PC <– PC + 4
Exe
c
Reg
. F
ile
Mem
Acc
ess
Dat
aM
em
A
B
S
M
Reg
File
Equ
al
PC
Nex
t P
C
IR
Inst
. M
em
Step 4 : Store° Logical Register Transfer
° Physical Register Transfers
inst Logical Register Transfers
SW MEM(R[rs] + sx(Im16) <– R[rt];
PC <– PC + 4
inst Physical Register Transfers
IR <– MEM[PC]
SW A<– R[rs]; B <– R[rt]
S <– A + SignEx(Im16);
MEM[S] <– B PC <– PC + 4
Exe
c
Reg
. F
ile
Mem
Acc
ess
Dat
aM
em
A
B
S
M
Reg
File
Equ
al
PC
Nex
t P
C
IR
Inst
. M
em
Step 4 : Branch° Logical Register Transfer
° Physical Register Transfers
inst Logical Register Transfers
BEQ if R[rs] == R[rt]
then PC <= PC + 4 + sx(Im16) << 2
else PC <= PC + 4
inst Physical Register Transfers
IR <– MEM[PC]
A<– R[rs]; B <– R[rt]
BEQ|Eq PC <– PC + 4
inst Physical Register Transfers
IR <– MEM[PC]
A<– R[rs]; B <– R[rt]
BEQ|Eq PC <– PC + 4 + sx(Im16)<<2
Exe
c
Reg
. F
ile
Mem
Acc
ess
Dat
aM
em
A
B
S
M
Reg
File
Equ
al
PC
Nex
t P
C
IR
Inst
. M
em
Alternative datapath: Multiple Cycle Datapath
° Minimizes Hardware: 1 memory (Princeton organization), 1 adder (ALU)
IdealMemoryWrAdrDin
RAdr
32
32
32Dout
MemWr
32
AL
U
3232
ALUOp
ALUControl
Instru
ction R
eg
32
IRWr
32
Reg File
Ra
Rw
busW
Rb5
5
32busA
32busB
RegWr
Rs
Rt
Mu
x
0
1
Rt
Rd
PCWr
ALUSelA
Mux 01
RegDst
Mu
x
0
1
32
PC
MemtoReg
Extend
ExtOp
Mu
x
0
132
0
1
23
4
16Imm 32
<< 2
ALUSelB
Mu
x1
0
Target32
Zero
ZeroPCWrCond PCSel BrWr
32
IorD
AL
U O
ut
2
Note: On p. 323 of COD ed. 3, RAdr & WrAdr are combined into Address.
Seminal works of Mealy and Moore
° G.H. Mealy, A method for synthesizing sequential circuits, Bell Technical Journal, vol. 34, pp. 1045~1079, 1955.
° E.F. Moore, Gedanken-experiments on sequential machines, Automata Studies (C.E. Shannon & J. McCarthy, eds), pp.29~153, Princeton University Press, 1956.
Note: Gedanken = thought, mind
Aphorisms:
° "Anything on earth you want to do is play. Anything on earth you have to do is work. Play will never kill you, work will. I never worked a day in my life." - Dr. Leila Denmark, 105 (in 2004), USA's oldest practicing physician.
° “The purpose of computing is insight, not numbers” and “Anything a faculty member can learn, a student can easily.” - Richard Hamming, inventor of Hamming code.
Control Model° State specifies control points for Register Transfer
° Transfer occurs upon exiting state (same falling edge)
Control State
Next StateLogic
Output Logic
inputs (conditions)
outputs (control points)
State X
Register TransferControl Points
Depends on Input
Step 4 => Control Specification for multicycle proc
IR <= MEM[PC]
R-type
A <= R[rs]B <= R[rt]
S <= A fun B
R[rd] <= SPC <= PC + 4
S <= A or ZX
R[rt] <= SPC <= PC + 4
ORi
S <= A + SX
R[rt] <= MPC <= PC + 4
M <= MEM[S]
LW
S <= A + SX
MEM[S] <= BPC <= PC + 4
BEQ & EqualBEQ & ~Equal
PC <= PC + 4 PC <= PC + 4 +SX << 2
SW
“instruction fetch”
“decode / operand fetch”
Exe
cute
Mem
ory
read
/writ
e
Writ
e-ba
ck
Traditional FSM Controller
state
6
4
11nextstate
op
Equal
control signals
(current) state
op condnextstate control signals
Truth Table
datapath state
Step 5: datapath + state diagram => control
° Translate RTs (register transfers) into control points
° Assign states
° Then go build the controller
Mapping RTs to Control Signals
IR <= MEM[PC]
R-type
A <= R[rs]B <= R[rt]
S <= A fun B
R[rd] <= SPC <= PC + 4
S <= A or ZX
R[rt] <= SPC <= PC + 4
ORi
S <= A + SX
R[rt] <= MPC <= PC + 4
M <= MEM[S]
LW
S <= A + SX
MEM[S] <= BPC <= PC + 4
BEQ & EqualBEQ & ~Equal
PC <= PC + 4 PC <= PC + 4 +SX << 2
SW
“instruction fetch”
“decode / operand fetch”
Exe
cute
Mem
ory
read
/writ
e
Writ
e-ba
ck
imem_rd, IRen
Aen, Ben
ALUfun, Sen
RegDst, RegWr,PCen
Assigning States (State Assignment)
IR <= MEM[PC]
R-type
A <= R[rs]B <= R[rt]
S <= A fun B
R[rd] <= SPC <= PC + 4
S <= A or ZX
R[rt] <= SPC <= PC + 4
ORi
S <= A + SX
R[rt] <= MPC <= PC + 4
M <= MEM[S]
LW
S <= A + SX
MEM[S] <= BPC <= PC + 4
BEQ & EqualBEQ & ~Equal
PC <= PC + 4 PC <= PC + 4 +SX << 2
SW
“instruction fetch”
“decode / operand fetch”
Exe
cute
Mem
ory
read
/writ
e
Writ
e-ba
ck
0000
0001
0100
0101
0110 1000 1011 00110010
0111 1010
11001001
Detailed Control Specification
0000 ?????? ? 0001 10001 BEQ 0 0011 1 10001 BEQ 1 0010 1 10001 R-type x 0100 1 10001 orI x 0110 1 10001 LW x 1000 1 10001 SW x 1011 1 10010 xxxxxx x 0000 1 10011 xxxxxx x 0000 1 00100 xxxxxx x 0101 0 1 fun 10101 xxxxxx x 0000 1 0 0 1 10110 xxxxxx x 0111 0 0 or 10111 xxxxxx x 0000 1 0 0 1 01000 xxxxxx x 1001 1 0 add 11001 xxxxxx x 1010 1 0 11010 xxxxxx x 0000 1 0 1 1 01011 xxxxxx x 1100 1 0 add 11100 xxxxxx x 0000 1 0 0 1
State Op field Eq Next IR PC Ops Exec Mem Write-Backen en sel AenBenEx Sr ALU SenR W MenM-R Wr Dst
R:
ORi:
LW:
SW:
-all same in Moore machine
Performance Evaluation
° What is the average CPI?• state diagram gives CPI for each instruction type
• workload gives frequency of each type
Type CPIi for type Frequency CPIi x freqi
Arith/Logic 4 40% 1.6
Load 5 30% 1.5
Store 4 10% 0.4
branch 3 20% 0.6
Average CPI:4.1
Controller Design° The state diagrams that define the controller for an instruction s
et processor are highly structured
° Use this structure to construct a simple “microsequencer”
° Control reduces to programming this very simple device• microprogramming
sequencercontrol
datapath control
micro-PCsequencer
microinstruction
Example: Jump-Counter
op-codeMapping ROM
Counterzeroincload
0000i
i+1
i
Using a Jump Counter
IR <= MEM[PC]
R-type
A <= R[rs]B <= R[rt]
S <= A fun B
R[rd] <= SPC <= PC + 4
S <= A or ZX
R[rt] <= SPC <= PC + 4
ORi
S <= A + SX
R[rt] <= MPC <= PC + 4
M <= MEM[S]
LW
S <= A + SX
MEM[S] <= BPC <= PC + 4
BEQ & EqualBEQ & ~Equal
PC <= PC + 4 PC <= PC + SX || 00
SW
“instruction fetch”
“decode”
Exe
cute
Mem
ory
Writ
e-ba
ck
0000
0001
0100
0101
0110
0111
1000
1001
1010
0011 00101011
1100
inc
load inc
zero zero zero zero
zero zeroinc inc inc inc
inc
Our Microsequencer
op-code
Mapping ROM
Micro-PC
Z I Ldatapath control
taken
注意:此圖之控制記憶體已較 Traditional FSM Controller
投影片所示者縮小甚多,請注意
比較此二圖之差異。
5
Microprogram Control Specification
0000 ? inc 10001 1 inc0001 0 load0010 1 zero 1 10011 0 zero 1 00100 x inc 0 1 fun 10101 x zero 1 0 0 1 10110 x inc 0 0 or 10111 x zero 1 0 0 1 01000 x inc 1 0 add 11001 x inc 1 0 11010 x zero 1 0 1 1 01011 x inc 1 0 add 11100 x zero 1 0 0 1
µPC Taken Next IR PC Ops Exec Mem Write-Backen sel A B Ex Sr ALU S R W M M-R Wr Dst
R:
ORi:
LW:
SW:
BEQ
註:表中各暫存器 (IR 、 A、 B、 S、 M) 之載入致能訊號皆省略en。
Mapping ROM– mapping opcode to PC
R-type 000000 0100
BEQ 000100 0011
Ori 001101 0110
LW 100011 1000
SW 101011 1011
opcode PC
(input) (output)
註: Mapping ROM 輸出須配合 load 訊號存入 PC中
Example: Controlling Memory
PC
InstructionMemory
Inst. Reg
addr
Data (instruction)
IR_en
InstMem_rd
IM_wait
Controller handles non-ideal memory
IR <= MEM[PC]
R-type
A <= R[rs]B <= R[rt]
S <= A fun B
R[rd] <= SPC <= PC + 4
S <= A or ZX
R[rt] <= SPC <= PC + 4
ORi
S <= A + SX
R[rt] <= MPC <= PC + 4
M <= MEM[S]
LW
S <= A + SX
MEM[S] <= B
BEQ & EqualBEQ & ~Equal
PC <= PC + 4 PC <= PC + SX || 00
SW
“instruction fetch”
“decode / operand fetch”
Exe
cute
Mem
ory
Writ
e-ba
ck
~wait wait
~wait wait
PC <= PC + 4
~wait wait
Really Simple Time-State Control
inst
ruct
ion
fet
chde
code
Exe
cute
Mem
ory
IR <= MEM[PC]
R-type
A <= R[rs]B <= R[rt]
S <= A fun B
R[rd] <= SPC <= PC + 4
S <= A or ZX
R[rt] <= SPC <= PC + 4
ORi
S <= A + SX
R[rt] <= MPC <= PC + 4
M <= MEM[S]
LW
S <= A + SX
MEM[S] <= B
BEQ & EqualBEQ & ~Equal
PC <= PC + 4PC <= PC + SX || 00
SW
~wait wait
wait
PC <= PC + 4
wait
writ
e-ba
ck
Time-state Control Path° Local decode and control at each stage
Exe
c
Reg
. F
ile
Mem
Acc
ess
Dat
aM
em
A
B
S
M
Reg
File
Equ
al
PC
Nex
t P
C
IR
Inst
. M
em
Valid
IRex
Dcd
Ctr
l
IRm
em
Ex
Ctr
l
IRw
b
Mem
Ctr
l
WB
Ctr
l
Valid 位元指示 IR、 IRex 、 IRmem 、 IRwb 中之控制位元
是否已有效;此種方式常用於 pipeline 控制單元之設計。
Overview of Control
° Control may be designed using one of several initial representations. The choice of sequence control, and how logic is represented, can then be determined independently; the control can then be implemented with one of several methods using a structured logic technique.
Initial Representation Finite State Diagram Microprogram
Sequencing Control Explicit Next State Microprogram counter Function + Dispatch ROMs
(即Mapping ROMs)
Logic Representation Logic Equations Truth Tables
Implementation PLA ROM Technique
“hardwired control” “microprogrammed control”
Summary
° Disadvantages of the Single Cycle Processor
• Long cycle time
• Cycle time is too long for all instructions except the Load
° Multiple Cycle Processor:
• Divide the instructions into smaller steps
• Execute each step (instead of the entire instruction) in one cycle
° Partition datapath into equal-size chunks to minimize cycle time• ~10 levels of logic between latches
° Follow same 5-step method for designing “real” processors
Summary (cont’d)
° Control is specified by finite state diagram
° Specialize state-diagrams easily captured by microsequencer• simple increment & “branch” fields
• datapath control fields
° Control design reduces to Microprogramming
° Control is more complicated with:• complex instruction sets
• restricted datapaths (see the book)
° Simple Instruction set and powerful datapath => simple control• could try to reduce hardware (see the book)
• rather go for speed => many instructions at once (Pipeline) !