Post on 21-Dec-2015
1
EE457 Discussion, Fall 2006
Final Review
Brandon Franzke, Maryam Soltan, USC 2006
and Wei-Jen Hsu, USC 2005
2
Review Questions
• Question 2, Fall 2004 (Multi-cycle CPU)
• Question 3, Summer 2004 (Pipeline CPU)
• Question 1, Fall 2004 (Based on lab 7 pipeline)
• Carry Look Ahead Adder
• Question 5, Summer 2004 (CLA)
3
Example - Multicycle CPU
• Modifications to the 2nd Edition CU (state diagram) and DPU.
  – Mr. Trojan has already modified the DPU.
• Notice: standalone registers (MDR, ALUout) are fast even though the RegFile is not.
  – Standalone = instantaneous
  – Register File = ½ clock
• So we want to skip states 4 (lw) & 7 (R-type)
  – implement "posted-write" (next page)
4
Posted-Write
• We needed states 4 & 7 because writing the Register File takes ½ clock.
  – But we already have the data stored in MDR and ALUout for these states.
  – Can we delay writing until the beginning of the next instruction (state 0)?
  – What about control signals?
• This is a "Posted-Write"
  – a write operation "posted" (scheduled) to occur later
5
Posted-Write Implementation
• Well, we just save the control signals for 1 extra clock with flip-flops!
  – RegDst, RegWrite, MemToReg
• Now the signals are available for 1 extra clock.
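To get a feel for what the posted write buys, the sketch below counts clocks per instruction with and without states 4 and 7. The state sequences are an assumption (the usual numbering of the 2nd Edition multicycle state diagram, which is not reproduced in this transcript), and the instruction mix is hypothetical, chosen only for illustration.

```python
# Assumed state sequences for the multicycle CU (2nd Edition numbering;
# the state diagram is not shown in the transcript, so these are assumptions).
STATES = {
    "lw":     [0, 1, 2, 3, 4],
    "sw":     [0, 1, 2, 5],
    "r-type": [0, 1, 6, 7],
    "beq":    [0, 1, 8],
}

# States whose register write is posted to state 0 of the next instruction.
POSTED = {4, 7}


def clocks(instr, posted_write=False):
    """Number of clocks an instruction occupies in the state machine."""
    seq = STATES[instr]
    if posted_write:
        seq = [s for s in seq if s not in POSTED]
    return len(seq)


def average_cpi(mix, posted_write=False):
    """Weighted average clocks per instruction for a mix of fractions."""
    return sum(frac * clocks(i, posted_write) for i, frac in mix.items())


# Hypothetical instruction mix, for illustration only.
MIX = {"lw": 0.25, "sw": 0.10, "r-type": 0.50, "beq": 0.15}

if __name__ == "__main__":
    for name in STATES:
        print(name, clocks(name), "->", clocks(name, posted_write=True))
    print("CPI:", average_cpi(MIX), "->", average_cpi(MIX, posted_write=True))
```

Only lw and R-type get shorter in this model; sw and beq never write the register file, so they are unaffected.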
6
Questions
• DPU modifications are complete; modify the CU to implement the register posted-write.
  – DPU and CU on the next pages
• What justification did Mr. Trojan give his boss for using positive-edge-triggered flip-flops?
  – The design team says that positive-edge-triggered FFs cost extra. Can Mr. Trojan use negative-edge-triggered FFs instead?
7
8
9
When to load FF
• Ms. Bruin suggested a RegWrite_FF_Write as shown below. Comment on the design and its necessity.
10
Posted Write for sw
• Ms. Bruin was given another chance by the lead engineer. She tried to copy Mr. Trojan and suggested saving a clock in the sw instruction by skipping state 5 and adding the following 2 FFs. Advice?
11
Example – Pipeline CPU: A new 4-stage Pipeline
• MEM before EX
• No spurious stalls
• New R-type instructions (e.g., addm, ...)
  – Use a memory operand as a source operand
• Writing to the RegFile takes very little time => no separate WB stage
• Memory: one read port
• beq in EX stage
• EAC not possible => revised lw and sw
12
New 4-stage Pipeline ….
• addm
• Investigate data dependencies and implement HDU and FU
• Avoid any spurious stalls (stall only when really dependent).
• No internal forwarding in memory
  – Cannot write to and read from memory simultaneously.
13
14
New 4-stage Pipeline …. (sw, lw)
• beq executes after the _____ stage, in the ____ stage.
• Where should we execute sw?
• Where should we execute lw?
beq rs,rt, Target;
sw rt, (rs); MEM[(rs)]<= (rt)
15
New 4-stage Pipeline …. (Hazard and stalling)
Regular pipeline | 4-stage pipeline
Dependencies/RAW hazards for register operand
Dependencies/RAW hazards for register operand
Instruction to activate MemRead
Stalling instruction in ______ stage
Condition of stalling
16
New 4-stage Pipeline …. (Hazard and stalling…)
17
sw $1, ($2);
lw $4, ($2);
addm $8, ($2), $4;
subm $16, ($8), $4;
18
19
20
21
Lab7, modified
• Now implement SUB3 and SUB6 instructions (SUB3 in EX1 and EX2).
  – still have NOP
• Optimize performance by performing SUB3 in EX1 or EX2 (i.e., minimize stalling)
• The new stalling policy:
  – Never stall SUB3, and stall SUB6 iff it is dependent on the preceding instruction.
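The stalling policy can be written down as a small predicate. This is only a sketch: the `Instr` record, its field names, and the single-source-register format are hypothetical and are not taken from the lab 7 design files.

```python
from typing import NamedTuple, Optional


class Instr(NamedTuple):
    """Hypothetical instruction record; the field names are assumptions."""
    op: str              # "SUB3", "SUB6", or "NOP"
    src: Optional[int]   # source register number (None for NOP)
    dst: Optional[int]   # destination register number (None for NOP)


def depends_on(current, preceding):
    """True if `current` reads the register that `preceding` writes."""
    if current.op == "NOP" or preceding.op == "NOP":
        return False
    return current.src is not None and current.src == preceding.dst


def must_stall(current, preceding):
    """Policy from the slide: never stall SUB3; stall SUB6 iff it depends
    on the instruction immediately ahead of it."""
    return current.op == "SUB6" and depends_on(current, preceding)
```

A dependent SUB3 is never stalled in this model; presumably the postponing logic and the forwarding units cover that case instead.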
22
23
Logic Blocks
• Postponing logic
  – assertions to perform SUB3 in EX1 or EX2
  – prefer EX1 so data is available to forward.
• HDU
  – Stall only dependent SUB6 instructions
• FU1 and FU2
  – forwarding from EX2→EX1 and WB→EX2
24
Stall vs. Flushing
• When do you flush and when do you stall?
  – How many instructions do you flush at a time?
  – How many instructions in the pipe do you stall?
  – Do flushing & stalling have anything in common?
  – Which of them results in producing bubbles?
  – Is the penalty due to flushing / stalling more severe in deeper pipelines (say 7-10 stages)?
  – How do delay slots affect the penalty?
25
1-bit CLA adder
[Figure: 1-bit adder cell (+) with inputs A, B, Cin and outputs S, p, g]
• p: propagator => p = A+B (if either A or B is 1, Cin = 1 causes Cout = 1)
• g: generator => g = AB (if both A and B are 1, Cout = 1 for sure)
• p and g are generated in 1 gate delay after we have A and B.
• Note that Cin is not needed to produce p and g.
• S is generated in 2 gate delays after we get Cin (SOP).
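A minimal Python model of this cell, for checking the definitions by hand. The function names are made up for this sketch, and the sum is written with XOR for brevity, whereas the slide counts it as a 2-level SOP.

```python
def cla_cell(a, b, cin):
    """One bit of a CLA adder: returns (s, p, g) for 1-bit inputs."""
    p = a | b          # propagate: a carry-in would be passed along
    g = a & b          # generate: a carry-out is produced regardless of cin
    s = a ^ b ^ cin    # sum bit
    return s, p, g


def carry_out(p, g, cin):
    """Carry out of the cell, expressed only through p, g and cin."""
    return g | (p & cin)
```

Note that `p` and `g` depend only on `a` and `b`, which is what lets every cell produce them at once, before any carry is known.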
26
4-bit CLA
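[Figure: 4-bit CLA. Four 1-bit adder cells (+) with inputs A0 B0 through A3 B3 and carry-in C0; their p0 g0 through p3 g3 outputs feed the CLL (carry look-ahead logic), which returns C1, C2, C3; the cells output S3 S2 S1 S0]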
• The CLL takes p, g from all 4 bits and C0 as input to generate all C's in 2 gate delays.
• C1 = g0 + p0C0
• C2 = g1 + p1g0 + p1p0C0
• C3 = g2 + p2g1 + p2p1g0 + p2p1p0C0
• C4 = g3 + p3g2 + p3p2g1 + p3p2p1g0 + p3p2p1p0C0 (Note: C4 looks complicated, but it is still a 2-level SOP representation)
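The carry equations can be checked against a ripple-carry reference with a short Python sketch. The names `cll4` and `ripple_carries` are made up for this example; index 0 is the least significant bit.

```python
def cll4(p, g, c0):
    """4-bit carry look-ahead logic.

    p, g: lists of four 0/1 values, index 0 = least significant bit.
    Returns (c1, c2, c3, c4), each written as a flattened 2-level SOP.
    """
    c1 = g[0] | (p[0] & c0)
    c2 = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c0)
    c3 = g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0]) | (p[2] & p[1] & p[0] & c0)
    c4 = (g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
          | (p[3] & p[2] & p[1] & g[0]) | (p[3] & p[2] & p[1] & p[0] & c0))
    return c1, c2, c3, c4


def ripple_carries(a_bits, b_bits, c0):
    """Reference model: carries C1..C4 computed one bit at a time."""
    carries, c = [], c0
    for a, b in zip(a_bits, b_bits):
        c = (a & b) | ((a | b) & c)
        carries.append(c)
    return tuple(carries)
```

No carry in `cll4` is computed from another carry, so all four are available two gate delays after p, g and C0, while `ripple_carries` has to walk the bits in order.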
27
4-bit CLA
[Figure: 4-bit CLA. Four 1-bit adder cells (+) with inputs A0 B0 through A3 B3 and carry-in C0, the CLL (carry look-ahead logic) with p0 g0 through p3 g3 in and C1, C2, C3 out, and sum outputs S3 S2 S1 S0]
• Given the A's and B's, all p's and g's are generated in 1 gate delay in parallel.
• Given all p's and g's, all C's are generated in 2 gate delays in parallel.
• Given all C's, all S's are generated in 2 gate delays in parallel.
• Key virtue of CLA: the sequential operation of the RCA is broken into parallel operations!
28
16-bit CLA
• Same as before, the p's and g's are generated in parallel in 1 gate delay.
• Now, without an input carry, the first-tier CLLs cannot generate C's yet. Instead, they generate P and G (group propagator and group generator) in 2 gate delays.
  – P => this group will propagate the input carry: P = p0p1p2p3
  – G => this group will generate an output carry: G = g3 + p3g2 + p3p2g1 + p3p2p1g0
• The second-tier CLL takes the P's and G's from the first-tier CLLs and C0 to generate the "seed C's" for the first-tier CLLs in 2 gate delays. (Note that the logic for generating "seed C's" from P's and G's is exactly the same as the logic for generating C's from p's and g's!)
• With the seed C's as input, the first-tier CLLs use Cin and the p's and g's to generate C's in 2 gate delays.
• With all C's in place, the S's are calculated in 2 gate delays.
Therefore, in total, 1+2+2+2+2 = 9 gate delays to finish the whole thing!
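The two-tier scheme on this slide can be modeled in Python as a sanity check. This is a behavioral sketch only: the function names and the `GATE_DELAYS` table are made up for the example, and the delay comments simply restate the slide's own tally.

```python
def lookahead(p, g, c0):
    """Carries C1..C4 from four (p, g) pairs and a carry-in (2-level SOP)."""
    c1 = g[0] | (p[0] & c0)
    c2 = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c0)
    c3 = g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0]) | (p[2] & p[1] & p[0] & c0)
    c4 = (g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
          | (p[3] & p[2] & p[1] & g[0]) | (p[3] & p[2] & p[1] & p[0] & c0))
    return [c1, c2, c3, c4]


def group_pg(p, g):
    """Group propagate P and group generate G of one 4-bit block."""
    big_p = p[0] & p[1] & p[2] & p[3]
    big_g = (g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
             | (p[3] & p[2] & p[1] & g[0]))
    return big_p, big_g


def cla16(a, b, c0=0):
    """Add two 16-bit integers with a two-tier CLA; returns (sum, carry_out)."""
    abits = [(a >> i) & 1 for i in range(16)]
    bbits = [(b >> i) & 1 for i in range(16)]
    p = [x | y for x, y in zip(abits, bbits)]      # step 1: p, g (1 delay)
    g = [x & y for x, y in zip(abits, bbits)]
    groups = [group_pg(p[i:i + 4], g[i:i + 4]) for i in range(0, 16, 4)]
    big_p = [gp for gp, _ in groups]               # step 2: P, G (2 delays)
    big_g = [gg for _, gg in groups]
    # step 3: second-tier CLL gives seed carries C4, C8, C12, C16 (2 delays)
    seeds = [c0] + lookahead(big_p, big_g, c0)
    carries = []                                   # carry into each bit
    for k in range(4):                             # step 4: C's (2 delays)
        inner = lookahead(p[4 * k:4 * k + 4], g[4 * k:4 * k + 4], seeds[k])
        carries += [seeds[k]] + inner[:3]
    total = 0
    for i in range(16):                            # step 5: S's (2 delays)
        total |= (abits[i] ^ bbits[i] ^ carries[i]) << i
    return total, seeds[4]


# Delay tally as stated on the slide.
GATE_DELAYS = {"p,g": 1, "P,G": 2, "seed C's": 2, "C's": 2, "S's": 2}
```

The same `lookahead` function serves both tiers, which mirrors the slide's remark that seed C's come from P's and G's by exactly the same logic as C's from p's and g's.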
29
Example - 64bit-CLA
• S39 takes longer to become valid.
• List of primary and intermediate signals in producing S39 (backtracking: S39 = A39 ⊕ B39 ⊕ C39, S39 <- C39 <- C36 ...):
  – Do we need P39_36* and G39_36*?
  – Primary inputs:
  – Gate delay to generate p38_0, g38_0:
  – Gate delay for second level P*, G*:
  – Gate delay for second level P**, G**:
  – Gate delay C32:
  – p38, p37, p36, and g38, g37, g36
  – C32 → C36 → C39
Delay:
30
Other Topics
• Usually there is a question on non-linear pipelines.
• Please make sure that you are comfortable with cache and virtual memory organization.