
Multi-cycle CPU

Breaking up is hard to do….

MC - Not in the textbook – we’ll get into some detail in lecture and suggested hw problems

Reading 4.9 just p. 384-386 (stop at pipelined implementation)

Peer Instruction Lecture Materials for Computer Architecture by Dr. Leo Porter is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.

http://creativecommons.org/choose/www.peerinstruction4cs.org

http://creativecommons.org/licenses/by-nc-sa/3.0/



Single-Cycle CPU Summary

• Easy, particularly the control

• Which instruction takes the longest? By how much? Why is that a problem?

• ET = IC * CPI * CT

• What else can we do?

• When does a multi-cycle implementation make sense?

– e.g., 70% of instructions take 75 ns, 30% take 200 ns?

– suppose 20% overhead for extra latches

• Real machines have much more variable instruction latencies than this.

200 ns vs. (200*0.3 + 75*0.7)*1.2 = (60 + 52.5)*1.2 = 135 ns
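The comparison above can be checked with a few lines of Python. This is a minimal sketch: the function name and argument layout are illustrative, and it assumes the latch overhead scales every instruction's time by the same factor.

```python
def multicycle_avg_latency_ns(mix, overhead):
    """Average time per instruction when each class takes only the time it needs.

    mix: list of (fraction_of_instructions, latency_ns) pairs.
    overhead: fractional slowdown from the extra latches (0.2 means 20%).
    """
    base = sum(fraction * latency for fraction, latency in mix)
    return base * (1 + overhead)


single_cycle_ns = 200  # every instruction pays for the slowest one
multi_cycle_ns = multicycle_avg_latency_ns([(0.7, 75), (0.3, 200)], overhead=0.2)
print(single_cycle_ns, round(multi_cycle_ns, 1))
```

With this mix the multi-cycle average comes out at about 135 ns per instruction against 200 ns for single-cycle, so the split pays off despite the overhead.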

You’ve been walking through history

• Someone needed to run a program



• Simple instructions were designed for very simple hardware (limited transistors)




• Someone wants to run a new program, but not create all new hardware





• More instructions added

LAB!





• More instructions added

• More transistors enable more complex hardware

• More complex instructions are desired as instruction memory is limited and costly

The story continues..

Why a Multiple Clock Cycle CPU?

• the problem => single-cycle CPU has a cycle time long enough to complete the longest instruction in the machine

• the solution => break up execution into smaller tasks, each task taking a cycle, different instructions requiring different numbers of cycles or tasks

• other advantages => reuse of functional units (e.g., ALU, memory)

• ET = IC * CPI * CT

Breaking Execution Into Clock Cycles

• We will have five execution steps (not all instructions use all five)

– fetch

– decode & register fetch

– execute

– memory access

– write-back

Single Cycle vs. Multi-cycle

(Comparison table completed in lecture. Labels: Single Cycle, Multi-cycle; lw, sw, add, r-type; CPI, CT)

Draw stages and how they get cut up

Cutting up Single Cycle

Draw how we’d most logically cut this up. Then point out: wait – if I cut the cycle time, how do I keep what I’ve done?

Breaking Execution Into Clock Cycles

• Introduces extra registers when:

– signal is computed in one clock cycle and used in another, AND

– the inputs to the functional block that outputs this signal can change before the signal is written into a state element.

• Significantly complicates control. Why?

• The goal is to balance the amount of work done each cycle.

Multicycle datapath

Intermediate latches. One ALU. One memory (give hint about self-modifying code)

Multicycle datapath – Load word

Load word: write RTL below per cycle

Summary of execution steps

Step: Instruction fetch (all instructions)

– IR = Mem[PC]

– PC = PC + 4

Step: Instruction decode / register fetch (all instructions)

– A = Reg[IR[25-21]]

– B = Reg[IR[20-16]]

– ALUout = PC + (sign-extend(IR[15-0]) << 2)

Step: Execution, address computation, branch completion

– R-type: ALUout = A op B

– Memory: ALUout = A + sign-extend(IR[15-0])

– Branch: if (A==B) then PC = ALUout

Step: Memory access or R-type completion

– R-type: Reg[IR[15-11]] = ALUout

– Memory: memory-data = Mem[ALUout] (load) or Mem[ALUout] = B (store)

Step: Write-back

– Memory (load): Reg[IR[20-16]] = memory-data

• We can use Register-Transfer Language (RTL) to describe these steps
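To make the RTL concrete, here is a small sketch that walks one lw through the five steps, one RTL line per cycle. The helper names and the dictionary-based register file and memory are hypothetical stand-ins for the datapath, and the speculative branch-target computation in the decode step is left out for brevity.

```python
def sign_extend16(value):
    """Interpret a 16-bit field as a signed integer."""
    return value - 0x10000 if value & 0x8000 else value


def run_lw(pc, regs, mem):
    """Step a single lw through the five cycles and return the updated PC.

    regs: dict register number -> value; mem: dict byte address -> word.
    """
    # Cycle 1, instruction fetch: IR = Mem[PC]; PC = PC + 4
    ir = mem[pc]
    pc = pc + 4
    # Cycle 2, decode/register fetch: A = Reg[IR[25-21]]; B = Reg[IR[20-16]]
    a = regs[(ir >> 21) & 0x1F]
    b = regs[(ir >> 16) & 0x1F]  # read every time, but unused by lw
    # Cycle 3, address computation: ALUout = A + sign-extend(IR[15-0])
    alu_out = a + sign_extend16(ir & 0xFFFF)
    # Cycle 4, memory access: memory-data = Mem[ALUout]
    memory_data = mem[alu_out]
    # Cycle 5, write-back: Reg[IR[20-16]] = memory-data
    regs[(ir >> 16) & 0x1F] = memory_data
    return pc


# lw $t2, 4($t3): opcode 0x23, rs = $t3 (register 11), rt = $t2 (register 10)
instruction = (0x23 << 26) | (11 << 21) | (10 << 16) | 4
regs = {10: 0, 11: 0x100}
mem = {0: instruction, 0x104: 1234}
new_pc = run_lw(0, regs, mem)
print(new_pc, regs[10])
```

Each local variable (ir, a, b, alu_out, memory_data) plays the role of one of the intermediate registers that hold a value from one cycle to the next.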

Talk through each – esp. the early branch computation


Peer instruction: Why are the first two the same?

Selection Why are the first two stages always the same (best answer)?

A All instructions do the same thing at the start

B The instruction is not determined until after the 2nd cycle

C To decrease the complexity of the control logic

D Trick question – they aren’t always the same

E None of the above

Complete Multicycle Datapath

(don’t be intimidated – it all makes sense…)


R-type – 1st cycle Draw active path


R-type – 2nd cycle


R-type – 3rd cycle


R-type – 4th cycle

Which instruction does PCWrite stuck at 1 break?

A. lw

B. R-type

C. beq

D. Both A & B

E. A, B, & C

Multicycle Control

• Single-cycle control used combinational logic

• Multi-cycle control uses a Finite State Machine.

• FSM defines a succession of states, transitions between states (based on inputs), and outputs (based on state)

• First two states same for every instruction, next state depends on opcode
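A minimal sketch of such a state machine's next-state function follows. The state names are illustrative rather than taken from the slides; the opcode values are the standard MIPS encodings for R-type, beq, lw, and sw. Only the next-state logic is modeled, not the control outputs.

```python
# MIPS opcodes for the instruction classes covered here.
OP_RTYPE, OP_BEQ, OP_LW, OP_SW = 0x00, 0x04, 0x23, 0x2B


def next_state(state, opcode):
    """Return the next control state for the current state and opcode."""
    if state == "fetch":
        return "decode"            # same for every instruction
    if state == "decode":          # first state where the opcode matters
        return {OP_RTYPE: "execute", OP_BEQ: "branch",
                OP_LW: "mem_addr", OP_SW: "mem_addr"}[opcode]
    if state == "execute":
        return "rtype_complete"
    if state == "mem_addr":
        return "mem_read" if opcode == OP_LW else "mem_write"
    if state == "mem_read":
        return "write_back"
    # branch, rtype_complete, mem_write, and write_back end the instruction
    return "fetch"


def cycles(opcode):
    """Count the states visited from fetch until control returns to fetch."""
    state, count = "fetch", 0
    while True:
        count += 1
        state = next_state(state, opcode)
        if state == "fetch":
            return count


print({name: cycles(op) for name, op in
       [("r-type", OP_RTYPE), ("beq", OP_BEQ), ("lw", OP_LW), ("sw", OP_SW)]})
```

Counting states per opcode recovers the cycle counts implied by the execution steps: 5 for lw, 4 for sw and R-type, 3 for beq.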

IF = 200 ps, ID = 50 ps, EX = 100 ps, M = 200 ps, WB = 50 ps

Breaking a single-cycle processor into stages, hardware engineers determine these to be the execution times per stage. The code below is the code the company executes most often.

Loop: lw r1, 0(r2)
      add r2, r3, r4
      sub r5, r1, r2
      beq r5, $zero

Selection Good idea? Reason

A Yes CPI stays the same. CT decreases (factor of 4)

B Yes CPI increases (factor of 4). CT decreases (factor of 5)

C No CPI increases (factor of 4). CT decreases (factor of 3)

D No CPI decreases (factor of 5 ). CT increases (factor of 5)

E No CPI stays the same. CT stays the same. Complexity increases.

Your boss is interested in changing to the MIPS multi-cycle processor. He asks you whether or not this would be a good idea. You say?

Isomorphic

IF = 200 ps, ID = 200 ps, EX = 200 ps, M = 200 ps, WB = 200 ps

Breaking a single-cycle processor into stages, hardware engineers determine these to be the execution times per stage. The code below is the code the company executes most often.

Loop: lw r1, 0(r2)
      add r2, r3, r4
      sub r5, r1, r2
      beq r5, $zero

Selection Good idea? Reason

A Yes CPI stays the same. CT decreases (factor of 4)

B Yes CPI increases (factor of 4). CT decreases (factor of 5)

C No CPI increases (factor of 4). CT decreases (factor of 3)

D No CPI decreases (factor of 5 ). CT increases (factor of 5)

E No CPI stays the same. CT stays the same. Complexity increases.

Your boss is interested in changing to the MIPS multi-cycle processor. He asks you whether or not this would be a good idea. You say?

Balanced cycles explanation. Draw single-cycle wasted time. Draw multi-cycle potential wasted time (200, 50, 100, 200, 50)
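A rough numeric check of the two stage-latency scenarios, as a sketch under stated assumptions: the single-cycle clock is the sum of the stage times, the multi-cycle clock is the slowest stage, latch overhead is ignored, and each instruction takes the number of cycles given by the execution steps (lw 5, sw 4, R-type 4, beq 3). The function and variable names are illustrative.

```python
CYCLES = {"lw": 5, "sw": 4, "r-type": 4, "beq": 3}  # steps used per instruction class


def compare(stage_ps, program):
    """Return (single-cycle time, multi-cycle time) in ps for one pass of program."""
    single_ct = sum(stage_ps)   # one long cycle covers every stage
    multi_ct = max(stage_ps)    # the cycle must fit the slowest stage
    single_time = len(program) * single_ct                   # CPI = 1
    multi_time = sum(CYCLES[i] for i in program) * multi_ct
    return single_time, multi_time


loop = ["lw", "r-type", "r-type", "beq"]  # lw, add, sub, beq
unbalanced = compare([200, 50, 100, 200, 50], loop)
balanced = compare([200, 200, 200, 200, 200], loop)
print(unbalanced, balanced)
```

In both cases the loop's CPI rises from 1 to 4. With unbalanced stages the cycle time only shrinks by a factor of 3 (600 ps to 200 ps), so multi-cycle loses; with balanced stages it shrinks by a factor of 5 (1000 ps to 200 ps), so multi-cycle wins.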

Multi-cycle Questions

• How many cycles will it take to execute this code?

lw $t2, 0($t3)
lw $t3, 4($t3)
beq $t2, $t3, Label #assume not taken
add $t5, $t2, $t3
sw $t5, 8($t3)
Label: ...

Selection Number of Cycles

A 5

B 21

C 22

D 25

E None of the above
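One way to count: add up the steps each instruction uses. The per-instruction cycle counts below come from the execution steps (lw 5, sw 4, R-type 4, beq 3); the dictionary itself is just an illustrative lookup.

```python
CYCLES = {"lw": 5, "sw": 4, "add": 4, "beq": 3}

# lw, lw, beq (not taken), add, sw
program = ["lw", "lw", "beq", "add", "sw"]
total = sum(CYCLES[op] for op in program)
print(total)
```

The sum is 5 + 5 + 3 + 4 + 4 = 21 cycles.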

Multi-cycle Questions

• What is going on in cycle 8?

lw $t2, 0($t3)
lw $t3, 4($t3)
beq $t2, $t3, Label #assume not taken
add $t5, $t2, $t3
sw $t5, 8($t3)
Label: ...

Selection Operation in cycle 8

A PC=PC+4; IR=M[pc]

B A=R[t3]; B=R[t3]

C ALUOut=R[t3]+4

D R[t3]=M[ALUOut]

E None of the above
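To find what happens in a given cycle, walk the program and subtract each instruction's cycle count until the target cycle falls inside one. This is a sketch; the step labels and the `locate` helper are illustrative names.

```python
STEPS = {
    "lw": ["fetch", "decode", "address computation", "memory read", "write-back"],
    "sw": ["fetch", "decode", "address computation", "memory write"],
    "add": ["fetch", "decode", "execute", "R-type completion"],
    "beq": ["fetch", "decode", "branch completion"],
}


def locate(program, cycle):
    """Return (instruction index, mnemonic, step name) active in a 1-based cycle."""
    for index, op in enumerate(program):
        length = len(STEPS[op])
        if cycle <= length:
            return index, op, STEPS[op][cycle - 1]
        cycle -= length
    raise ValueError("cycle is past the end of the program")


program = ["lw", "lw", "beq", "add", "sw"]
where = locate(program, 8)
print(where)
```

Cycle 8 is the third step of the second lw (lw $t3, 4($t3)), which is the address computation ALUOut = R[t3] + 4.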

Suppose you work on an embedded multi-cycle MIPS processor and your software team tells you that every program which executes has to go through memory and zero 1k bytes of data fairly often (averages 10% of ET). You realize you could just have a single instruction do this called zero1k (rs) which does:

M[rs] = 0 … M[rs+1020] = 0.

Your coworker thinks you are crazy. You reply?

Selection Crazy? Reason

A Yes The complexity of such an instruction combined with no performance gain is silly.

B Yes The complexity of such an instruction combined with minimal performance gain (<5%) is silly.

C No The minimal performance gains (<5%) rationalize this simple instruction.

D No The significant performance gains (>5%) rationalize this complex instruction.

E Maybe None of the above.

Remember to ask about single-cycle. Answer - D

Show code, then cycle analysis.
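One possible cycle analysis, as a sketch only. Both cost models are assumptions, not from the slides: the software version is taken to be a three-instruction loop per word (sw at 4 cycles, addi at 4 like an R-type, bne at 3 like beq), and zero1k is taken to need fetch and decode plus one cycle per stored word.

```python
WORDS = 1024 // 4  # 256 word stores cover M[rs] .. M[rs+1020]

# Assumed software loop, one word per iteration: sw (4) + addi (4) + bne (3)
loop_cycles = WORDS * (4 + 4 + 3)

# Assumed zero1k cost: fetch + decode, then one store cycle per word
zero1k_cycles = 2 + WORDS

fraction = 0.10  # share of execution time spent zeroing, from the question
new_time = (1 - fraction) + fraction * zero1k_cycles / loop_cycles
gain = 1 - new_time
print(loop_cycles, zero1k_cycles, round(gain * 100, 1))
```

Under these assumptions the zeroing portion shrinks from 2816 to 258 cycles, and total execution time drops by roughly 9%, which is consistent with answer D. Even if zero1k needed two cycles per word, the gain would stay above 5%.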

• Implementation:

Finite State Machine for Control

• ROM = "Read Only Memory"

– values of memory locations are fixed ahead of time

• A ROM can be used to implement a truth table

– if the address is m bits, we can address 2^m entries in the ROM.

– our outputs are the bits of data that the address points to.

2^m is the "height", and n is the "width"

ROM Implementation

m = 3 address bits, n = 4 data bits

Address  Data
0 0 0    0 0 1 1
0 0 1    1 1 0 0
0 1 0    1 1 0 0
0 1 1    1 0 0 0
1 0 0    0 0 0 0
1 0 1    0 0 0 1
1 1 0    0 1 1 0
1 1 1    0 1 1 1

• How many inputs are there?

– 6 bits for opcode, 4 bits for state = 10 address lines (i.e., 2^10 = 1024 different addresses)

• How many outputs are there?

– 16 datapath-control outputs, 4 state bits = 20 outputs

• ROM is 2^10 x 20 = 20K bits (and a rather unusual size)

• Rather wasteful, since for lots of the entries, the outputs are the same

– i.e., opcode is often ignored
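The ROM-as-truth-table idea and the size arithmetic can be sketched in a few lines. The list below holds the 8-entry example ROM (3 address bits, 4 data bits); `rom_read` and `rom_bits` are illustrative helper names.

```python
# Example ROM: the 3-bit address is the list index, the entry is the 4-bit data word.
ROM = [0b0011, 0b1100, 0b1100, 0b1000, 0b0000, 0b0001, 0b0110, 0b0111]


def rom_read(address):
    """Look up the data word stored at an address (a truth-table row)."""
    return ROM[address]


def rom_bits(address_bits, data_bits):
    """Total storage: 2^m entries (height) times n bits each (width)."""
    return (2 ** address_bits) * data_bits


example_bits = rom_bits(3, 4)
# Control ROM: 6 opcode bits + 4 state bits in, 16 control + 4 next-state bits out
control_rom_bits = rom_bits(6 + 4, 16 + 4)
print(format(rom_read(0b110), "04b"), example_bits, control_rom_bits)
```

The control ROM works out to 1024 x 20 = 20480 bits, the "20K bits" quoted above, even though many of those rows hold identical outputs.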

ROM Implementation

Multicycle CPU Key Points

• Performance gain achieved from variable-length instructions

• ET = IC * CPI * cycle time

• Required very few new state elements

• More, and more complex, control signals

• Control requires FSM
