Multi-cycle CPU
Breaking up is hard to do….
MC - Not in the textbook – we’ll get into some detail in lecture and suggested hw problems
Reading 4.9 just p. 384-386 (stop at pipelined implementation)
Peer Instruction Lecture Materials for Computer Architecture by Dr. Leo Porter is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.
Single-Cycle CPU Summary
• Easy, particularly the control
• Which instruction takes the longest? By how much? Why is that a problem?
• ET = IC * CPI * CT
• What else can we do?
• When does a multi-cycle implementation make sense?
– e.g., 70% of instructions take 75 ns, 30% take 200 ns?
– suppose 20% overhead for extra latches
• Real machines have much more variable instruction latencies than this.
200 ns vs. (200*0.3 + 75*0.7)*1.2 = (60 + 52.5)*1.2 = 135 ns
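The comparison above can be checked with a few lines of Python. This is a minimal sketch that uses only the numbers from the slide (70% of instructions at 75 ns, 30% at 200 ns, 20% latch overhead); the variable and function names are illustrative, not part of the lecture.

```python
# Numbers come from the slide; names are illustrative only.
SINGLE_CYCLE_NS = 200.0                 # every instruction pays the worst-case latency
MIX = [(0.70, 75.0), (0.30, 200.0)]     # (fraction of instructions, latency in ns)
LATCH_OVERHEAD = 0.20                   # 20% overhead for the extra latches


def average_latency_ns(mix, overhead):
    """Weighted average instruction latency, inflated by the latch overhead."""
    base = sum(fraction * latency for fraction, latency in mix)
    return base * (1.0 + overhead)


multi_cycle_ns = average_latency_ns(MIX, LATCH_OVERHEAD)
print(f"single-cycle: {SINGLE_CYCLE_NS:.1f} ns per instruction")
print(f"multi-cycle:  {multi_cycle_ns:.1f} ns per instruction on average")
```

Under these numbers the multi-cycle design averages about 135 ns per instruction against 200 ns for single-cycle, so the split pays off despite the latch overhead.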
LAB!
You’ve been walking through history
• Someone needed to run a program
• Simple instructions were designed for very simple hardware (limited transistors)
• Someone wants to run a new program, but not create all new hardware
• More instructions added
• More transistors enable more complex hardware
• More complex instructions are desired as instruction memory is limited and costly
The story continues..
Why a Multiple Clock Cycle CPU?
• The problem => a single-cycle CPU has a cycle time long enough to complete the longest instruction in the machine
• The solution => break up execution into smaller tasks, each task taking a cycle, with different instructions requiring different numbers of cycles or tasks
• Other advantages => reuse of functional units (e.g., ALU, memory)
• ET = IC * CPI * CT
Breaking Execution Into Clock Cycles
• We will have five execution steps (not all instructions use all five)
– fetch
– decode & register fetch
– execute
– memory access
– write-back
Single Cycle vs. Multi-cycle
[Figure: timelines for lw, sw, add, and R-type on the single-cycle and multi-cycle designs, compared by CPI and CT]
Draw stages and how they get cut up
Cutting up Single Cycle
Draw how we’d most logically cut this up. Then point out: wait, if I cut the cycle time, how do I keep what I’ve done?
Breaking Execution Into Clock Cycles
• Introduces extra registers when:
– signal is computed in one clock cycle and used in another, AND
– the inputs to the functional block that outputs this signal can change before the signal is written into a state element.
• Significantly complicates control. Why?
• The goal is to balance the amount of work done each cycle.
Multicycle datapath
Intermediate latches.
One ALU
One memory (give hint about self-modifying code)
Multicycle datapath – Load word
Load word: write RTL below per cycle
Summary of execution steps (by instruction class: R-type, Memory, Branch)
• Instruction fetch (all classes)
– IR = Mem[PC]
– PC = PC + 4
• Instruction decode / register fetch (all classes)
– A = Reg[IR[25-21]]
– B = Reg[IR[20-16]]
– ALUout = PC + (sign-extend(IR[15-0]) << 2)
• Execution, address computation, branch completion
– R-type: ALUout = A op B
– Memory: ALUout = A + sign-extend(IR[15-0])
– Branch: if (A==B) then PC = ALUout
• Memory access or R-type completion
– R-type: Reg[IR[15-11]] = ALUout
– Memory: memory-data = Mem[ALUout] (load) or Mem[ALUout] = B (store)
• Write-back
– Memory (load): Reg[IR[20-16]] = memory-data
• We can use Register-Transfer-Language (RTL) to describe these steps
Talk through each – esp. the early branch computation
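To make the RTL concrete, here is a small Python sketch that walks a single lw through the five steps. It is an illustration only: memory is modeled as a dictionary keyed by byte address, and the helper names (`encode_lw`, `run_lw`, `sign_extend16`) are made up for this example rather than taken from the lecture.

```python
def sign_extend16(value):
    """Interpret a 16-bit field as a signed integer."""
    return value - 0x10000 if value & 0x8000 else value


def encode_lw(rt, rs, imm):
    """Build the 32-bit encoding of lw rt, imm(rs); the MIPS opcode for lw is 35."""
    return (35 << 26) | (rs << 21) | (rt << 16) | (imm & 0xFFFF)


def run_lw(reg, mem, pc):
    """Execute one lw in five steps, mirroring the RTL. Returns the new PC."""
    # Step 1: instruction fetch
    ir = mem[pc]
    pc = pc + 4
    # Step 2: decode / register fetch (branch target computed early, unused by lw)
    rs = (ir >> 21) & 0x1F              # IR[25-21]
    rt = (ir >> 16) & 0x1F              # IR[20-16]
    imm = sign_extend16(ir & 0xFFFF)    # IR[15-0]
    a = reg[rs]
    b = reg[rt]                         # read as in the RTL, but lw never uses B
    alu_out = pc + (imm << 2)
    # Step 3: address computation
    alu_out = a + imm
    # Step 4: memory access
    memory_data = mem[alu_out]
    # Step 5: write-back
    reg[rt] = memory_data
    return pc


registers = [0] * 32
registers[11] = 0x100                               # $t3 holds the base address
memory = {0: encode_lw(rt=10, rs=11, imm=4), 0x104: 42}
new_pc = run_lw(registers, memory, 0)
print(new_pc, registers[10])
```

Each commented step corresponds to one clock cycle in the multi-cycle design, so an lw takes five cycles. The early branch-target computation in step 2 is wasted work for lw, but it costs nothing because the ALU would otherwise sit idle in that cycle.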
Peer instruction: Why are the first two the same?
Selection Why are the first two stages always the same (best answer)?
A All instructions do the same thing at the start
B The instruction is not determined until after the 2nd cycle
C To decrease the complexity of the control logic
D Trick question – they aren’t always the same
E None of the above
Complete Multicycle Datapath
(don’t be intimidated – it all makes sense…)
Complete Multicycle Datapath – R-type
• 1st cycle (draw active path)
• 2nd cycle
• 3rd cycle
• 4th cycle
Which instruction does PCWrite stuck at 1 break?
A. lw
B. R-type
C. beq
D. Both A & B
E. A, B, & C
Multicycle Control
• Single-cycle control used combinational logic
• Multi-cycle control uses a Finite State Machine.
• FSM defines a succession of states, transitions between states (based on inputs), and outputs (based on state)
• First two states same for every instruction, next state depends on opcode
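The FSM idea can be sketched as a next-state function in Python. The state names below are hypothetical labels chosen for this example, and only four instruction classes (R-type, lw, sw, beq) are covered; the opcode values are the standard MIPS encodings. The sketch models state transitions only, not the control outputs.

```python
from enum import Enum, auto


class State(Enum):
    FETCH = auto()
    DECODE = auto()
    MEM_ADDR = auto()
    MEM_READ = auto()
    MEM_WRITE = auto()
    LOAD_WB = auto()
    EXECUTE = auto()
    RTYPE_DONE = auto()
    BRANCH_DONE = auto()


OP_RTYPE, OP_BEQ, OP_LW, OP_SW = 0x00, 0x04, 0x23, 0x2B

# After decode, the opcode selects the first instruction-specific state.
AFTER_DECODE = {
    OP_RTYPE: State.EXECUTE,
    OP_BEQ: State.BRANCH_DONE,
    OP_LW: State.MEM_ADDR,
    OP_SW: State.MEM_ADDR,
}


def next_state(state, opcode):
    """Return the state for the next cycle, given the current state and opcode."""
    if state is State.FETCH:
        return State.DECODE                 # same for every instruction
    if state is State.DECODE:
        return AFTER_DECODE[opcode]
    if state is State.MEM_ADDR:
        return State.MEM_READ if opcode == OP_LW else State.MEM_WRITE
    if state is State.MEM_READ:
        return State.LOAD_WB
    if state is State.EXECUTE:
        return State.RTYPE_DONE
    return State.FETCH                      # every final state returns to fetch


def cycles_for(opcode):
    """Count the cycles one instruction spends before the FSM is back at FETCH."""
    state, cycles = State.FETCH, 0
    while True:
        cycles += 1
        state = next_state(state, opcode)
        if state is State.FETCH:
            return cycles


for name, op in [("R-type", OP_RTYPE), ("beq", OP_BEQ), ("lw", OP_LW), ("sw", OP_SW)]:
    print(name, cycles_for(op))
```

In this model `next_state` ignores the opcode in FETCH and consults it for the first time when leaving DECODE, which matches the point that the first two states are shared by every instruction.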
IF = 200 ps, ID = 50 ps, EX = 100 ps, M = 200 ps, WB = 50 ps
Breaking a single cycle processor into stages, hardware engineers determine these to be the execution time per stage. The code below is the most commonly executed code by the company.
Loop: lw r1, 0(r2)
      add r2, r3, r4
      sub r5, r1, r2
      beq r5, $zero
Selection Good idea? Reason
A Yes CPI stays the same. CT decreases (factor of 4)
B Yes CPI increases (factor of 4). CT decreases (factor of 5)
C No CPI increases (factor of 4). CT decreases (factor of 3)
D No CPI decreases (factor of 5 ). CT increases (factor of 5)
E No CPI stays the same. CT stays the same. Complexity increases.
Your boss is interested in changing to the MIPS multi-cycle processor. He asks you whether or not this would be a good idea. You say?
Isomorphic
IF = 200 ps, ID = 200 ps, EX = 200 ps, M = 200 ps, WB = 200 ps
Breaking a single cycle processor into stages, hardware engineers determine these to be the execution time per stage. The code below is the most commonly executed code by the company.
Loop: lw r1, 0(r2)
      add r2, r3, r4
      sub r5, r1, r2
      beq r5, $zero
Selection Good idea? Reason
A Yes CPI stays the same. CT decreases (factor of 4)
B Yes CPI increases (factor of 4). CT decreases (factor of 5)
C No CPI increases (factor of 4). CT decreases (factor of 3)
D No CPI decreases (factor of 5 ). CT increases (factor of 5)
E No CPI stays the same. CT stays the same. Complexity increases.
Your boss is interested in changing to the MIPS multi-cycle processor. He asks you whether or not this would be a good idea. You say?
Balanced cycles explanation
Draw single-cycle wasted time
Draw multi-cycle potential wasted time (200, 50, 100, 200, 50)
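One way to work through both versions of the question is to compute ET = IC * CPI * CT for each design. The sketch below rests on assumptions that the slides do not state explicitly: the single-cycle CT is the sum of the stage latencies, the multi-cycle CT is the slowest stage, latch overhead is ignored, and cycle counts per instruction follow the execution-step summary (lw 5, add/sub 4, beq 3). All names are illustrative.

```python
# Cycles per instruction class on the multi-cycle design (from the step summary).
CYCLES = {"lw": 5, "sw": 4, "add": 4, "sub": 4, "beq": 3}
LOOP = ["lw", "add", "sub", "beq"]          # the loop body from the question


def compare(stage_ps, program):
    """Compare single-cycle and multi-cycle timing for one pass over program."""
    single_ct = sum(stage_ps)               # assumed: one long cycle covers all stages
    multi_ct = max(stage_ps)                # assumed: cycle set by the slowest stage
    total_cycles = sum(CYCLES[name] for name in program)
    return {
        "ct_decrease": single_ct / multi_ct,
        "cpi_increase": total_cycles / len(program),   # single-cycle CPI is 1
        "single_et_ps": len(program) * single_ct,
        "multi_et_ps": total_cycles * multi_ct,
    }


unbalanced = compare([200, 50, 100, 200, 50], LOOP)
balanced = compare([200, 200, 200, 200, 200], LOOP)
print("unbalanced:", unbalanced)
print("balanced:  ", balanced)
```

With the unbalanced stages, CT drops by a factor of 3 while CPI rises by a factor of 4, so one loop iteration goes from 2400 ps to 3200 ps. With balanced 200 ps stages, CT drops by a factor of 5 and the iteration goes from 4000 ps to 3200 ps, which is why balancing the work per cycle matters.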
Multi-cycle Questions
• How many cycles will it take to execute this code?
lw $t2, 0($t3)
lw $t3, 4($t3)
beq $t2, $t3, Label #assume not taken
add $t5, $t2, $t3
sw $t5, 8($t3)
Label: ...
Selection Number of Cycles
A 5
B 21
C 22
D 25
E None of the above
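A quick way to check a count like this is to add up the cycles each instruction needs. The sketch assumes the per-class cycle counts from the execution-step summary (lw 5, sw 4, R-type 4, beq 3) and that the branch is not taken, as the question states; the names are illustrative.

```python
# Cycles per instruction on the multi-cycle datapath (from the step summary).
CYCLES_PER_INSTRUCTION = {"lw": 5, "sw": 4, "add": 4, "beq": 3}

# The five instructions from the question, in program order (branch not taken).
PROGRAM = ["lw", "lw", "beq", "add", "sw"]

total_cycles = sum(CYCLES_PER_INSTRUCTION[name] for name in PROGRAM)
print(total_cycles)
```

The sum is 5 + 5 + 3 + 4 + 4 = 21 cycles.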
Multi-cycle Questions
• What is going on in cycle 8?
lw $t2, 0($t3)
lw $t3, 4($t3)
beq $t2, $t3, Label #assume not taken
add $t5, $t2, $t3
sw $t5, 8($t3)
Label: ...
Selection Activity in cycle 8
A PC=PC+4; IR=M[pc]
B A=R[t3]; B=R[t3]
C ALUOut=R[t3]+4
D R[t3]=M[ALUOut]
E None of the above
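To find what happens in a particular cycle, lay the instructions end to end and index into the resulting timeline. The sketch below assumes instructions execute strictly one after another (no overlap) and uses made-up step labels that follow the execution-step summary.

```python
# Step names per instruction class; labels are illustrative.
STEPS = {
    "lw": ["fetch", "decode", "address computation", "memory read", "write-back"],
    "sw": ["fetch", "decode", "address computation", "memory write"],
    "add": ["fetch", "decode", "execute", "R-type completion"],
    "beq": ["fetch", "decode", "branch completion"],
}

PROGRAM = [
    ("lw $t2, 0($t3)", "lw"),
    ("lw $t3, 4($t3)", "lw"),
    ("beq $t2, $t3, Label", "beq"),
    ("add $t5, $t2, $t3", "add"),
    ("sw $t5, 8($t3)", "sw"),
]


def activity_in_cycle(program, cycle):
    """Return (instruction text, step name) active in the given 1-based cycle."""
    current = 0
    for text, kind in program:
        for step in STEPS[kind]:
            current += 1
            if current == cycle:
                return text, step
    raise ValueError("cycle is past the end of the program")


print(activity_in_cycle(PROGRAM, 8))
```

Cycle 8 falls in the third step of the second lw, which is the address computation ALUOut = R[t3] + 4.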
Suppose you work on an embedded multi-cycle MIPS processor and your software team tells you that every program which executes has to go through memory and zero 1k bytes of data fairly often (averages 10% of ET). You realize you could just have a single instruction do this called zero1k (rs) which does: M[rs] = 0 … M[rs+1020] = 0. Your coworker thinks you are crazy. You reply?
Selection Crazy? Reason
A Yes The complexity of such an instruction combined with no performance gain is silly.
B Yes The complexity of such an instruction combined with minimal performance gain (<5%) is silly.
C No The minimal performance gains (<5%) rationalize this simple instruction.
D No The significant performance gains (>5%) rationalize this complex instruction.
E Maybe None of the above.
Remember to ask about single-cycle
Answer - D
Show code, then cycle analysis.
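A rough cycle analysis might look like the following Python sketch. Everything beyond the slide's own numbers (1k bytes, 10% of ET) is an assumption made for illustration: the software version is taken to be a three-instruction loop (sw, addi, bne) costing 4 + 4 + 3 cycles per word, zero1k is taken to need fetch and decode once plus two cycles per word (address computation and store), and the cycle time is assumed unchanged.

```python
WORDS = 1024 // 4                       # 1k bytes zeroed one 32-bit word at a time

# Assumed software loop per word: sw (4 cycles), addi (4 cycles), bne (3 cycles).
LOOP_CYCLES_PER_WORD = 4 + 4 + 3

# Assumed zero1k cost: fetch + decode once, then address + store for each word.
ZERO1K_CYCLES = 2 + 2 * WORDS

ZEROING_FRACTION = 0.10                 # from the slide: zeroing averages 10% of ET

loop_cycles = WORDS * LOOP_CYCLES_PER_WORD
zeroing_speedup = loop_cycles / ZERO1K_CYCLES

# Amdahl-style estimate: only the zeroing portion of ET gets faster.
new_et = (1.0 - ZEROING_FRACTION) + ZEROING_FRACTION / zeroing_speedup
overall_gain = 1.0 - new_et

print(f"loop: {loop_cycles} cycles, zero1k: {ZERO1K_CYCLES} cycles")
print(f"zeroing speedup: {zeroing_speedup:.2f}x, overall ET reduction: {overall_gain:.1%}")
```

Under these assumptions the zeroing portion runs roughly five times faster and total ET drops by about 8%, which sits above the 5% threshold in the answer choices. A different loop or a different zero1k implementation would shift the exact figure.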
• Implementation:
Finite State Machine for Control
ROM Implementation
• ROM = "Read Only Memory"
– values of memory locations are fixed ahead of time
• A ROM can be used to implement a truth table
– if the address is m bits, we can address 2^m entries in the ROM.
– our outputs are the bits of data that the address points to.
2^m is the "height", and n is the "width"
Example truth table (m = 3 address bits, n = 4 output bits):
Address   Data
0 0 0     0 0 1 1
0 0 1     1 1 0 0
0 1 0     1 1 0 0
0 1 1     1 0 0 0
1 0 0     0 0 0 0
1 0 1     0 0 0 1
1 1 0     0 1 1 0
1 1 1     0 1 1 1
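A ROM used as a truth table is just an indexed lookup. The Python sketch below models the example table from this slide (3 address bits, 4 output bits) as a list; the names `ROM` and `rom_read` are illustrative.

```python
M = 3   # address bits: the ROM has 2**M entries (its "height")
N = 4   # output bits per entry (its "width")

# Contents fixed ahead of time; index = address, value = output bits.
ROM = [
    0b0011,  # address 000
    0b1100,  # address 001
    0b1100,  # address 010
    0b1000,  # address 011
    0b0000,  # address 100
    0b0001,  # address 101
    0b0110,  # address 110
    0b0111,  # address 111
]


def rom_read(address):
    """Return the N output bits stored at the given M-bit address."""
    return ROM[address]


print(format(rom_read(0b110), "04b"))
```

For control, the address would be formed from the opcode and current-state bits, and the data would hold the control outputs and next-state bits.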
• How many inputs are there?
6 bits for opcode, 4 bits for state = 10 address lines (i.e., 2^10 = 1024 different addresses)
• How many outputs are there?
16 datapath-control outputs, 4 state bits = 20 outputs
• ROM is 2^10 x 20 = 20K bits (and a rather unusual size)
• Rather wasteful, since for lots of the entries, the outputs are the same
– i.e., opcode is often ignored
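The sizing arithmetic can be reproduced in a few lines of Python. The bit counts come from the slide; the variable names are illustrative.

```python
OPCODE_BITS = 6
STATE_BITS = 4
DATAPATH_CONTROL_OUTPUTS = 16

address_lines = OPCODE_BITS + STATE_BITS                    # ROM inputs
entries = 2 ** address_lines                                # ROM "height"
outputs_per_entry = DATAPATH_CONTROL_OUTPUTS + STATE_BITS   # ROM "width"
total_bits = entries * outputs_per_entry

print(address_lines, entries, outputs_per_entry, total_bits)
```

The total is 1024 x 20 = 20480 bits, that is, 20K bits. Most entries differ only in opcode bits that the control ignores in a given state, which is where the waste comes from.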
Multicycle CPU Key Points
• Performance gain achieved from variable-latency instructions (different instructions take different numbers of cycles)
• ET = IC * CPI * cycle time
• Required very few new state elements
• More, and more complex, control signals
• Control requires FSM