1 ee457 discussion fall 2006 final review brandon franzke, maryam soltan, usc2006 and wei-jen hsu,...

1

EE457 Discussion, Fall 2006

Final Review

Brandon Franzke, Maryam Soltan, USC 2006

and Wei-Jen Hsu, USC 2005

2

Review Questions

• Question 2, Fall 2004 (Multi-cycle CPU)

• Question 3, Summer 2004 (Pipeline CPU)

• Question 1, Fall 2004 (Based on lab 7 pipeline)

• Carry Look Ahead Adder

• Question 5, Summer 2004 (CLA)

3

Example - Multicycle CPU

• Modifications to the 2nd Edition CU (state diagram) and DPU.
– Mr. Trojan already modified the DPU

• Notice: Standalone registers (MDR, ALUout) are fast even though the RegFile is not.
– Standalone = instantaneous
– Register File = ½ clock

• So we want to skip states 4 (lw) & 7 (r-type)
– implement “posted-write” (next page)

4

Posted-Write

• We needed states 4 & 7 because Register Writing takes ½ clock
– But we already have the data stored in MDR and ALUout for these states.
– Can we delay writing until the beginning of the next instruction? (state 0)
– What about control signals?

• This is a “Posted-Write”
– a write operation “posted” (scheduled) to occur later

5

Posted-Write Implementation

• Well, we just save the control signals for 1 extra clock with flip-flops!
– RegDst, RegWrite, MemToReg

• Now the signals are available for 1 extra clock
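The idea of holding the write controls for one extra clock can be sketched in a few lines of Python. This is only an illustrative model, not the slide's actual DPU: the class names, the trace, and the assumption that the CU asserts RegWrite during the last lw state (state 3) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class WriteControls:
    """The three control signals the slide says must be saved."""
    reg_dst: int = 0
    reg_write: int = 0
    mem_to_reg: int = 0


class PostedWriteFFs:
    """Hypothetical bank of edge-triggered FFs delaying the controls by one clock."""

    def __init__(self):
        self.q = WriteControls()  # outputs currently seen by the register file

    def clock_edge(self, d):
        # At the clock edge the FFs capture what the CU drove during this state.
        self.q = d


ffs = PostedWriteFFs()

# Assumed trace: the CU asserts the lw write controls in state 3 (memory read),
# then the next instruction starts in state 0 with the controls deasserted.
trace = [
    ("lw state 3", WriteControls(reg_dst=0, reg_write=1, mem_to_reg=1)),
    ("next instr state 0", WriteControls()),
]

seen = []
for state_name, cu_outputs in trace:
    # Record what the register file sees *during* this state (the FF outputs).
    seen.append((state_name, ffs.q.reg_write))
    ffs.clock_edge(cu_outputs)

print(seen)
```

In this model the register file sees RegWrite = 1 during state 0 of the following instruction, which is when MDR already holds the loaded data, so state 4 is no longer needed.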

6

Questions

• The DPU modifications are complete; modify the CU to implement the register posted-write.
– DPU and CU on the next pages

• What justification did Mr. Trojan give his boss for using positive-edge-triggered flip-flops?
– The design team says that positive-edge-triggered FFs cost extra. Can Mr. Trojan use negative-edge-triggered FFs instead?

9

When to load FF

• Ms. Bruin suggested a RegWrite_FF_Write as shown below. Comment on the design and its necessity.

10

Posted Write for sw

• Ms. Bruin was given another chance by the lead engineer. She tried to copy Mr. Trojan and suggested saving a clock in the sw instruction by skipping state 5 and adding the following two FFs. Advice?

11

Example – Pipeline CPU

A new 4-stage Pipeline

• MEM before EX

• No spurious stalls

• New R-Type instr. (e.g., addm, …)
– Use memory operand as a source operand

• Writing to RegFile takes very little time => No separate WB stage

• Memory: One read port

• Beq in EX stage

• EAC not possible => Revised lw and sw

12

New 4-stage Pipeline ….

• addm

• Investigate data dependencies and implement HDU and FU

• Avoid any spurious stalls. (Stall only when the instruction is really dependent.)

• No internal forwarding in memory
– Cannot write and read to/from memory simultaneously.

14

New 4-stage Pipeline …. (sw, lw)

• BEQ executes after the _____ stage, in the ____ stage.

• Where should we execute sw?

• Where should we execute lw?

beq rs,rt, Target;

sw rt, (rs); MEM[(rs)]<= (rt)

15

New 4-stage Pipeline …. (Hazard and stalling)

Regular pipeline 4-stage pipeline

Dependencies/RAW hazards for register operand

Dependencies/RAW hazards for register operand

Instruction to activate MemRead

Stalling instruction in ______ stage

Condition of stalling

16

New 4-stage Pipeline …. (Hazard and stalling…)

17

sw $1, ($2);
lw $4, ($2);
addm $8, ($2), $4;
subm $16, ($8), $4;

21

Lab7, modified

• Now implement SUB3 and SUB6 instructions (SUB3 in EX1 and EX2).
– still have NOP

• Optimize performance by performing SUB3 in EX1 or EX2 (i.e., minimize stalling)

• The new stalling policy:
– Never stall SUB3, and stall SUB6 iff it is dependent on the preceding instruction.
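The stalling policy above can be written as a small predicate. The sketch below is a hedged illustration only: the `Instr` record and its `op`, `src`, `dest` fields are hypothetical stand-ins, since the actual Lab 7 instruction format is not given on these slides.

```python
from collections import namedtuple

# Hypothetical instruction record: op is "SUB3", "SUB6" or "NOP";
# src and dest are register numbers (None for a NOP).
Instr = namedtuple("Instr", ["op", "src", "dest"])


def must_stall(curr, prev):
    """Stall iff curr is a SUB6 that reads the register written by prev."""
    if curr.op != "SUB6":
        return False  # SUB3 (and NOP) are never stalled
    if prev.op == "NOP":
        return False  # a NOP writes nothing, so there is no dependency
    return curr.src == prev.dest


prev = Instr("SUB3", src=1, dest=2)
print(must_stall(Instr("SUB6", src=2, dest=5), prev))  # dependent SUB6
print(must_stall(Instr("SUB3", src=2, dest=5), prev))  # dependent SUB3
print(must_stall(Instr("SUB6", src=4, dest=5), prev))  # independent SUB6
```

A dependent SUB3 avoids the stall because, per the slide, it can be postponed to EX2 and receive the forwarded value there.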

23

Logic Blocks

• Postponing logic
– assertions to perform SUB3 in EX1 or EX2
– prefer EX1 so data is available to forward.

• HDU
– Stall only dependent SUB6 instructions

• FU1 and FU2
– forwarding from EX2→EX1 and WB→EX2

24

Stall vs. Flushing

• When do you flush and when do you stall?
– How many instructions do you flush at a time?
– How many instructions in the pipe do you stall?
– Do flushing & stalling have anything in common?
– Which of them results in producing bubbles?
– Is the penalty due to flushing / stalling more severe in deeper pipelines? (say 7-10 stages)
– How do delay slots affect the penalty?

25

1-bit CLA adder

[Figure: 1-bit adder cell (+) with inputs A, B, Cin and outputs S, p, g]

• p: propagator => p = A+B (If either A or B is 1, Cin = 1 causes Cout = 1)

• g: generator => g = AB (If both A and B are 1, Cout = 1 for sure)

• p, g are generated in 1 gate delay after we have A, B.

• Note that Cin is not needed to produce p and g.

• S is generated in 2 gate delays after we get Cin (SOP).
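The p and g definitions above can be checked exhaustively with a short Python sketch. The function name `cla_cell` is a hypothetical label for the 1-bit cell; the carry-out expression g + p·Cin is the standard one implied by the two definitions.

```python
def cla_cell(a, b, cin):
    """1-bit CLA cell: returns (s, p, g) for single-bit inputs."""
    p = a | b        # propagate: p = A + B, needs no Cin
    g = a & b        # generate:  g = AB,    needs no Cin
    s = a ^ b ^ cin  # sum bit, the only output that waits for Cin
    return s, p, g


# Exhaustive check: Cout = g + p*Cin reproduces ordinary binary addition.
for a in (0, 1):
    for b in (0, 1):
        for cin in (0, 1):
            s, p, g = cla_cell(a, b, cin)
            cout = g | (p & cin)
            assert a + b + cin == s + 2 * cout
print("all 8 input combinations agree")
```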

26

4-bit CLA

[Figure: four 1-bit adder cells (+) with inputs A0 B0 … A3 B3 and C0, producing S0 … S3; each cell sends its p, g to the CLL (carry look-ahead logic), which returns C1, C2, C3]

• The CLL takes the p,g’s from all 4 bits and C0 as inputs to generate all C’s in 2 gate delays.
– C1 = g0 + p0C0
– C2 = g1 + p1g0 + p1p0C0
– C3 = g2 + p2g1 + p2p1g0 + p2p1p0C0
– C4 = g3 + p3g2 + p3p2g1 + p3p2p1g0 + p3p2p1p0C0

(Note: C4 looks complicated, but it is still a 2-level SOP expression.)
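As a sanity check on the four carry equations, the sketch below evaluates them directly and compares the result with carries computed one bit at a time, as a ripple-carry adder would. The helper names `to_bits`, `cll4`, and `ripple_carries` are hypothetical.

```python
def to_bits(x, n):
    """Return the n low-order bits of x, least significant bit first."""
    return [(x >> i) & 1 for i in range(n)]


def cll4(p, g, c0):
    """Carry look-ahead logic: the 2-level SOP equations for C1..C4."""
    c1 = g[0] | (p[0] & c0)
    c2 = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c0)
    c3 = g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0]) | (p[2] & p[1] & p[0] & c0)
    c4 = (g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
          | (p[3] & p[2] & p[1] & g[0])
          | (p[3] & p[2] & p[1] & p[0] & c0))
    return [c1, c2, c3, c4]


def ripple_carries(a_bits, b_bits, c0):
    """Reference model: each carry waits for the previous one."""
    carries, c = [], c0
    for a, b in zip(a_bits, b_bits):
        c = (a & b) | ((a | b) & c)
        carries.append(c)
    return carries


# Compare both models for every 4-bit A, B and both values of C0.
for a in range(16):
    for b in range(16):
        for c0 in (0, 1):
            a_bits, b_bits = to_bits(a, 4), to_bits(b, 4)
            p = [x | y for x, y in zip(a_bits, b_bits)]
            g = [x & y for x, y in zip(a_bits, b_bits)]
            assert cll4(p, g, c0) == ripple_carries(a_bits, b_bits, c0)
print("CLL equations match ripple carries for all 512 cases")
```

Each carry in `cll4` depends only on the p,g’s and C0, never on another carry, which is what allows all four to be produced in parallel.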

27

4-bit CLA

[Figure: 4-bit CLA block diagram: four 1-bit adder cells (+) with inputs A0 B0 … A3 B3 and C0, the CLL (carry look-ahead logic) receiving p0 g0 … p3 g3 and returning C1, C2, C3, and outputs S3 S2 S1 S0]

• Given the A,B’s, all p,g’s are generated in 1 gate delay in parallel.

• Given all p,g’s, all C’s are generated in 2 gate delays in parallel.

• Given all C’s, all S’s are generated in 2 gate delays in parallel.

• Key virtue of CLA: sequential operation in RCA is broken into parallel operation!!

28

16-bit CLA

• Same as before, all p,g’s are generated in parallel in 1 gate delay.

• Now, without an input carry, the first-tier CLLs cannot generate C’s yet. Instead they generate P,G’s (group propagator and group generator) in 2 gate delays.
– P => this group will propagate the input carry through the group: P = p0p1p2p3
– G => this group will generate an output carry: G = g3+p3g2+p3p2g1+p3p2p1g0

• The second-tier CLL takes the P,G’s from the first-tier CLLs and C0 to generate “seed C’s” for the first-tier CLLs in 2 gate delays. (Note that the logic for generating “seed C’s” from P,G’s is exactly the same as the logic for generating C’s from p,g’s!)

• With the seed C’s as input, the first-tier CLLs use Cin and the p,g’s to generate C’s in 2 gate delays.

• With all C’s in place, the S’s are calculated in 2 gate delays.

Therefore, the total is 1+2+2+2+2 = 9 gate delays to finish the whole thing!!
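The two-tier flow can be modelled functionally as follows. This is a minimal sketch under stated assumptions: the names `cla16`, `lookahead`, `group_pg`, and `STEP_DELAYS` are hypothetical, and `lookahead` is written as a loop for brevity, whereas the hardware would use the flattened 2-level SOP form of the same equations.

```python
def to_bits(x, n):
    """Return the n low-order bits of x, least significant bit first."""
    return [(x >> i) & 1 for i in range(n)]


def lookahead(p, g, cin):
    """Carries out of each of 4 positions; same logic for p,g and for P,G."""
    carries, c = [], cin
    for pi, gi in zip(p, g):
        c = gi | (pi & c)
        carries.append(c)
    return carries


def group_pg(p, g):
    """Group propagate and generate of one 4-bit group (no carry needed)."""
    big_p = p[0] & p[1] & p[2] & p[3]
    big_g = g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1]) | (p[3] & p[2] & p[1] & g[0])
    return big_p, big_g


# Gate delays per step, taken from the slide's own accounting.
STEP_DELAYS = {"p,g": 1, "P,G": 2, "seed C": 2, "C": 2, "S": 2}


def cla16(a, b, c0=0):
    """Return (16-bit sum, carry out) using the two-tier scheme."""
    a_bits, b_bits = to_bits(a, 16), to_bits(b, 16)
    p = [x | y for x, y in zip(a_bits, b_bits)]
    g = [x & y for x, y in zip(a_bits, b_bits)]
    groups = [(p[i:i + 4], g[i:i + 4]) for i in range(0, 16, 4)]
    pg = [group_pg(gp, gg) for gp, gg in groups]
    # Second tier: seed carries C4, C8, C12, C16 from the P,G's and C0.
    seeds = [c0] + lookahead([x for x, _ in pg], [y for _, y in pg], c0)
    carry_in = []
    for (gp, gg), seed in zip(groups, seeds):
        inner = lookahead(gp, gg, seed)       # first tier, now with a seed
        carry_in.extend([seed] + inner[:3])   # carry into each bit of the group
    s_bits = [x ^ y ^ c for x, y, c in zip(a_bits, b_bits, carry_in)]
    return sum(bit << i for i, bit in enumerate(s_bits)), seeds[4]


print(cla16(0x1234, 0x4321), sum(STEP_DELAYS.values()))
```

Summing `STEP_DELAYS` reproduces the 1+2+2+2+2 = 9 gate-delay total stated on the slide.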

29

Example - 64bit-CLA

• S39 takes longer to become valid.

• List the primary and intermediate signals involved in producing S39: (Backtracking: S39 = A39 ⊕ B39 ⊕ C39, S39 ← C39 ← C36 …)
– Do we need P39_36* and G39_36*?
– Primary inputs:
– Gate delay to generate p38_0, g38_0:
– Gate delay for second level P*, G*:
– Gate delay for second level P**, G**:
– Gate delay for C32:
– p38, p37, p36, and g38, g37, g36
– C32 → C36 → C39

Delay:

30

Other Topics

• Usually there is a question on non-linear pipelines.

• Please make sure that you are comfortable with cache and virtual memory organization.
