EE457 MT - Spring 2021 1 / 12 C Copyright 2021 Gandhi Puvvada
Cover page
EE457 MT Exam (~24.5%), Spring 2021
Instructor: Gandhi Puvvada
Saturday, 3/27/2021, 05:00 PM - 08:00 PM (with overhead 04:45-08:20 PM PST) on Zoom
Viterbi School of Engineering University of Southern California
Ques#  Topic                               Page#        Time  Points  Score
1      Lab 7 Part 1 modified               2-6
2      SW (Store Word) - HDU Pseudo code   7
3      FIFO                                8-9
4      Cache                               9-10
5      Virtual Memory                      11
Total                                      Cover+11+1
Perfect Score
I have previously read the Viterbi Code of Integrity and other related material at the sites (a) https://viterbischool.usc.edu/academic-integrity/ (b) https://sjacs.usc.edu/students/academic-integrity/ and I will abide by these rules of conduct. I will neither seek help from others nor offer help to others in my exams.
_____________________________ <== Student’s signature, DEN D2L username: @usc.edu
1 ( points) min. Lab 7 Part 1 modified
This design is derived from Lab 7 Part 1 (the three-element adder).
We changed the design quite a bit, but there is no big complexity in this design. We added an additional SUB3 (Subtract 3) stage as EX2 and renamed the original EX2 stage as the EX3 stage. EX2 and EX3 each have a bypass path around them, so that two out of the four instructions below can bypass going through the subtracter in EX2 and/or the adder in EX3. Instructions have either two (X, Y) or three (X, Y, Z) source registers. All instructions add X and Y in EX1. Some instructions subtract 3 in EX2. Some instructions add Z in EX3.
AXY = Add X and Y;
AXYS3 = Add X to Y and then subtract 3;
AXYZ = Add X to Y and then Z;
AXYZS3 = Add X to Y and then subtract 3 and then add Z;
BXYZ = Branch to JJJJ if XY sum (XplusY) equals zero (this has one delay slot).
Instruction              Operation                       One-Hot Coded   Result ready in
NOP                                                      0 0 0 0 0
AXY    $R, $X, $Y;       ($R) <= ($X)+($Y)               1 0 0 0 0       EX2
AXYS3  $R, $X, $Y;       ($R) <= ($X)+($Y)-3             0 1 0 0 0       EX3
AXYZ   $R, $X, $Y, $Z;   ($R) <= ($X)+($Y)+($Z)          0 0 1 0 0       WB
AXYZS3 $R, $X, $Y, $Z;   ($R) <= ($X)+($Y)-3+($Z)        0 0 0 1 0       WB
BXYZ   $X, $Y, JJJJ;     (PC) <= JJJJ if (XplusY == 0)   0 0 0 0 1
Note: Here PC is 16 bits in size. JJJJ is 16 bits. This branch has one delay slot.
The 2-source instruction, AXY, has its result ready in EX2 itself and can help his junior in EX1, whereas the AXYS3 can help from EX3. AXYZ and AXYZS3 can only help from WB. Unlike in the Lab 7 Part 1, there is no overflow here, so there is no converting to a bubble because of overflow.
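The instruction table and the result-ready stages above can be restated as a small behavioral model. This is only a sketch for checking one's reading of the table: the names `ONE_HOT`, `READY_IN`, `use_sub3`, `use_add_z`, and `execute` are hypothetical and are not signals from the exam diagram, and the model ignores pipelining, forwarding, and the branch.

```python
# One-hot control bits in the column order AXY, AXYS3, AXYZ, AXYZS3, BXYZ (from the table).
ONE_HOT = {
    "NOP":    (0, 0, 0, 0, 0),
    "AXY":    (1, 0, 0, 0, 0),
    "AXYS3":  (0, 1, 0, 0, 0),
    "AXYZ":   (0, 0, 1, 0, 0),
    "AXYZS3": (0, 0, 0, 1, 0),
    "BXYZ":   (0, 0, 0, 0, 1),
}

# Stage in which each result is ready, as listed in the table.
READY_IN = {"AXY": "EX2", "AXYS3": "EX3", "AXYZ": "WB", "AXYZS3": "WB"}


def execute(op, x, y, z=0):
    """Compute the value an arithmetic instruction writes to $R.

    EX1 always adds X and Y; EX2 subtracts 3 only for the S3 variants;
    EX3 adds Z only for the three-source variants. The other instructions
    take the bypass path around that stage.
    """
    axy, axys3, axyz, axyzs3, bxyz = ONE_HOT[op]
    use_sub3 = bool(axys3 or axyzs3)    # hypothetical name: go through SUB3 in EX2
    use_add_z = bool(axyz or axyzs3)    # hypothetical name: go through the adder in EX3
    xpy = x + y                         # EX1
    r2 = xpy - 3 if use_sub3 else xpy   # EX2 or its bypass
    r3 = r2 + z if use_add_z else r2    # EX3 or its bypass
    return r3


if __name__ == "__main__":
    for op in ("AXY", "AXYS3", "AXYZ", "AXYZS3"):
        print(op, execute(op, 10, 5, 2), "ready in", READY_IN[op])
```

Running it with X = 10, Y = 5, Z = 2 shows how the same operands produce a different result for each mnemonic.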
Instead of having all comparison units in a comparison station in the ID stage, here we went back to the Lab 6 Part 4 method, where the needed register ID comparisons are done in the individual stages (though it amounts to replication of comparison units and, timing-wise, is not the best). There is an HDU in the ID stage and an FU in each of the three stages: EX1, EX2, and EX3. So you can write (EX2_ZA = WB_RA) in the pseudo code/gate-level design of the FU in EX2, like in Lab 6 Part 4. To facilitate the Lab 6 method, we carried the source register IDs (XA, YA, and ZA) through the stage registers so that you can tap them as needed for comparison with the destination register IDs (RAs) of their seniors. I carried all three source register IDs all the way up to the WB stage. You cross off portions of these three lines as appropriate.
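A comparison such as (EX2_ZA = WB_RA) is one register ID comparison unit qualified by the senior's write signal. The sketch below shows that single building block only. The function name `id_match` and its argument names are hypothetical, and it deliberately does not say which comparisons each FU or the HDU needs.

```python
def id_match(source_id, senior_ra, senior_write):
    """One comparison unit: does a writing senior target this source register?

    No special case is made for register 0, since the exam states that
    $0 is not a special register in this design.
    """
    return bool(senior_write) and source_id == senior_ra


if __name__ == "__main__":
    # Example in the spirit of (EX2_ZA = WB_RA): the instruction in EX2 reads
    # register 5 as Z, and the senior in WB writes register 5.
    ex2_za, wb_ra, wb_write = 5, 5, 1
    print(id_match(ex2_za, wb_ra, wb_write))
```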
HDU_ID   FU_EX1   FU_EX2   FU_EX3
Write a number in each of the 4 small boxes indicating the number of register ID comparison units on the next page and transcribe them here. $0 is not a special register here.
Now arrive at the bare minimum number of comparison units we would need if we followed our Lab 7 Part 1 method of pooling all comparison units in a comparison station in ID stage.
Comp station in ID stage: ____
[Block diagram: "Lab 7 modified" pipeline with stages IF, ID, EX1, EX2, EX3, WB. Recoverable labels: 16-bit PC with a +1 incrementer and a BTAddr input selected by BR1; INS-MEM; register file with read ports XA/XD, YA/YD, ZA/ZD and write port RA/RD/R-Write; enables EN_PC, EN_IFID, EN_IDEX1, EN_EX1EX2, EN_EX2EX3, EN_EX3WB; bubble-injecting signals BI_IF, BI_ID, BI_EX1, BI_EX2, BI_EX3; control bits AXY, AXYS3, AXYZ, AXYZS3, BXYZ carried through the stage registers; register IDs XA, YA, ZA, RA carried from ID to WB (ID_XA ... WB_RA); BTAddr and ZD carried through the stages; forwarding muxes X1Mux, Y1Mux, Z1Mux, X2Mux, Y2Mux, Z2Mux, X3Mux, Y3Mux, Z3Mux, XY3Mux with select lines FX1M, FY1M, FZ1M, FX2M, FY2M, FZ2M, FX3M, FY3M, FZ3M, FXY3M; adder in EX1 producing XPY (X plus Y) and "XplusY is Zero"; SUB3 unit in EX2 with R2Mux and SKIP2; adder in EX3 with R3Mux and SKIP3, operating on XPY (X plus Y) or XPY minus 3; STALL_D and STALL_DB; HDU_ID, FU_EX1, FU_EX2, FU_EX3; write signals ID_Write, EX1_Write, EX2_Write, EX3_Write, WB_WRITE; WB_RA, WB_RD.]
On this page:
1. Cross off unneeded/unwanted forwarding mux(es).
2. Complete forwarding paths to the remaining (surviving) forwarding muxes.
3. Cross off unneeded/unwanted portions of source address conveyance lines (lines with suffixes _XA, _YA, or _ZA) (for example, do you need WB_XA?).
4. Write the number of comparison units next to HDU_ID, FU_EX1, FU_EX2, FU_EX3.
5. Cross off unneeded BI (Bubble-Injecting) AND gates and complete the rest. Produce and complete the rest of the BI_ (Bubble Injecting) control signals here.
6. Cross off unneeded/unwanted portions of BTAddr (Branch Target Address) conveyance lines and the associated FFs. Complete the BTAddr path to the PC.
7. Complete the 6 enables for the PC, IF/ID, ID/EX1, EX1/EX2, EX2/EX3, EX3/WB.
8. Complete the two skip controls (SKIP2 and SKIP3).
On the next few pages:
9. Generate STALL_D (Stall for RAW dependency).
10. For the three FUs (forwarding units), draw the input (only input, no output) pins and generate one output per category.
BTAddr (Branch Target Address JJJJ) was also carried all the way up to the WB stage. You cross off portions of it as appropriate.
Read the incomplete block diagram on the previous page thoroughly. There are several loose ends to connect. Ten loose ends (2 related to the BTA mux, 6 related to EN pins of the PC and the 5 stage registers, and 2 Skips) are marked as . But, since you will decide which forwarding muxes will survive and which bubble-injecting AND gates will survive, I have not marked the loose ends associated with these items with . You may need a few gates sometimes (for example, to generate BR1).
1. In this modified Lab 7 design here, we _________ (do/don’t) need a WBFF on IF/ID stage register ________ (like/unlike) our 5-stage early branch design. The power-on active-low reset signal on the stage registers here _______ (resets/sets/either) _________ (A/B) A = the entire stage register though it is expensive, B = only the control signal portion. We declared one branch delay slot for the BXYZ here. This declaration __________________ (has an impact / does not have any impact) on whether to have or not have any WBFF(s) here.
2. The redundant muxes discussion in class is relevant in ________________ (A/B/both/neither). A = early-branch design, B = late-branch design. Depth of the pipeline (7-stage vs. 9-stage, for example) may have an effect on the number of redundant mux pairs ____ (T/F). Here we have a late branch. Hence, we __________ (expect/do not expect) to discuss redundant muxes.
3. If you are the compiler designer, and if you are going to write AXYZ instruction, and if one of the three registers is the destination register of its senior #1, would you use it for the X or the Y or the Z register of your AXYZ? ___________________________________________________________________________________________________________________________________________________________
4. A conflict between a successful branch and the HDU in the ID stage can occur in _______ (A/B/C/D). A = our 5-stage early branch design, B = our 5-stage late branch design, C = both, D = neither. How about in our current modified Lab 7 Part 1 with a late branch BXYZ? __________ (E/F) E = Conflict can occur and designer can fix it; F = Conflict cannot occur.
One student wanted to change our EX1 stage as shown below. You ________ (agree/disagree).
[Figure: two versions of the EX1 stage side by side, labeled "Our current design" and "One student wanted to change it like this".]
Explain: ________________________________________________________________________
You can add, delete, or modify the given partial design. Draw gates.
9. Generate STALL_D (Stall for RAW dependency).
[Partial design: STALL_D dependency logic in the ID stage, combining PS1 (Problematic Senior #1), PS2, and PS3 into STALL.]
10. For the three FUs (forwarding units), draw the input (only input, no output) pins and generate one output per category. This continues on the next few pages. Now let us do this =======>
[Partial design: FU_EX1 with input pins EX1_XA, EX1_..., EX1_... and three blank ___RA, ___Write pairs to be named; forwarding muxes X1Mux, Y1Mux, Z1Mux, X2Mux, Y2Mux, X3Mux, Y3Mux with select lines FX1M, FY1M, FZ1M, FX2M, FY2M, FX3M, FY3M.]
EX1 stage: Generate select lines for one of the surviving X_Muxes, one of the surviving Y_Muxes, and the only Z1Mux if it survives.
Let us continue with the remaining two forwarding units.
A long sequence of unrelated (not inter-dependent) AXYZS3 instructions run at the same maximum rate of CPI = 1 as a long sequence of unrelated (not inter-dependent) AXY instructions. ____ (T / F).
To show the worst case of a long sequence of inter-dependent instructions (with no branches), would you consider any of the four instructions (AXY, AXYS3, AXYZ, AXYZS3) or .... You will consider _______________________________________
Since addition and subtraction are associative and commutative (can be done in any order), is it _____ (A/D), A = Advantageous, D = Disadvantageous, for the AXYZS3 instruction if we switch the EX2 and EX3 stages (i.e., add Z first and subtract 3 at the end)?
Explain: ________________________________________________________________________
[Partial design: FU_EX2 with input pins EX2_..., EX2_..., EX2_... and three blank ___RA, ___Write pairs to be named; Z2Mux with select line FZ2M.]
Generate select line FZ2M if the only Z Mux in EX2 survives.
Note: Not every instruction uses "Z". But simpler logic is always favored!
[Partial design: FU_EX3 with input pins EX3_..., EX3_..., EX3_... and three blank ___RA, ___Write pairs to be named; Z3Mux with select line FZ3M and XY3Mux with select line FXY3M.]
Generate select lines FZ3M and/or FXY3M if one or two of the two muxes Z3Mux and/or XY3Mux survive!
2 ( points) min. SW receiving $Rt forwarding in MEM
Reproduced below from the Spring 2020 MT is the forwarding arrangement for the $Rt to SW in the MEM stage from his Senior #1 LW. This saves one clock in stalling SW in the ID stage.
Reproduced below is the HDU pseudo code. Revise it as needed to save the clock in stalling.
[Figure: EX, MEM, and WB stages of the 5-stage pipeline with the Forwarding Unit. Recoverable labels: control bits ALUSrc, ALUOp, RegDst, MemRead, MemWrite, MemtoReg, RegWrite; register fields rs, rt, rd and values (rs), (rt); ALU, ALUctrl, ALU_result, Store_data; Data memory; MEM_data, REG_data; WriteRegister_EX, WriteRegister_MEM, RegWrite_EX, MemRead_EX, MemRead_MEM; forwarding mux selects FW_RS_WB, FW_RS_MEM, FW_RT_WB, FW_RT_MEM; SW forwarding controls EX_FORW_SW and MEM_FORW_SW; "Tap off Choice 1".]
Example sequence:
lw $8, 40 ($2);
sw $8, 60 ($2);
HDU (Original Hazard Detection Unit in ID stage):
Note: Here ID/EX.WriteRegister refers to the WriteRegister after the mux governed by RegDst. We could replace it with ID/EX.WriteRegisterRt.
If [ ID/EX.MemRead
     and (ID/EX.WriteRegister =/= 0)
     and {(ID/EX.WriteRegister == IF/ID.ReadRegister_RS) or (ID/EX.WriteRegister == IF/ID.ReadRegister_RT)} ]
then make STALL_LW = 1
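The original HDU condition above can be written as a small executable predicate, which may help when checking which instruction pairs it stalls. This is a sketch of the original (unrevised) pseudo code only; the function and argument names are hypothetical stand-ins for the pipeline register fields.

```python
def stall_lw(id_ex_mem_read, id_ex_write_register, if_id_rs, if_id_rt):
    """Original HDU rule: stall when a load in EX feeds the instruction in ID."""
    return bool(
        id_ex_mem_read
        and id_ex_write_register != 0
        and (id_ex_write_register == if_id_rs
             or id_ex_write_register == if_id_rt)
    )


if __name__ == "__main__":
    # lw $8, 40($2) is in EX (writes $8); sw $8, 60($2) is in ID (rs = $2, rt = $8).
    print(stall_lw(id_ex_mem_read=1, id_ex_write_register=8, if_id_rs=2, if_id_rt=8))
```

With the lw/sw pair from the figure, the original rule asserts STALL_LW because the sw's rt matches the lw's destination; that is the stall the question asks you to reconsider.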
Your revision of the HDU pseudo code can potentially elongate the longest (critical) timing path in the HDU in ________ (A/B/C/D). A = late branch design, B = early branch design, C = both, D = neither.
3 ( points) min. FIFO
3.1 In the case of the single clock FIFOs, either of the two types, BINARY counters or GRAY code counters (for WP and RP), may be used. True / False
It will be expensive to use ___________ (GRAY code / BINARY) counters as we need to perform code conversion before performing WP - RP subtraction. So avoid them if not needed.
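The code conversion mentioned in 3.1 can be illustrated with the standard binary/Gray conversions. This is a minimal sketch with hypothetical function names; it shows why an extra conversion step sits in front of the WP - RP subtraction when the pointers are kept in Gray code.

```python
def binary_to_gray(b):
    """Standard binary-reflected Gray code of a non-negative integer."""
    return b ^ (b >> 1)


def gray_to_binary(g):
    """Inverse conversion: XOR together all right shifts of the Gray value."""
    b = 0
    while g:
        b ^= g
        g >>= 1
    return b


def depth_from_gray(wp_gray, rp_gray, pointer_bits):
    """Convert both pointers back to binary, then do the modulo subtraction."""
    modulus = 1 << pointer_bits
    return (gray_to_binary(wp_gray) - gray_to_binary(rp_gray)) % modulus


if __name__ == "__main__":
    wp, rp = 6, 3
    print(depth_from_gray(binary_to_gray(wp), binary_to_gray(rp), 4))
```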
3.2 Legend: A = single-clock FIFOs only, B = two-clock FIFOs only, C = both, D = neither
For a 512-location deep FIFO, we can use 9-bit counters for the WP and the RP in the case of _________ (A/B/C/D), and we need to use 10-bit counters for the WP and the RP in the case of _________ (A/B/C/D). A MOD-1024 (modulo 1024) subtraction produces the depth as 2 rather than -1022 when you are performing a 10-bit subtraction of WP - RP when WP is 00_0000_0001 and RP is 11_1111_1111. ________ (T/F). MOD subtraction is needed in _________ (A/B/C/D).
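The arithmetic in the MOD-1024 statement can be checked directly. The sketch below uses a hypothetical helper `fifo_depth` that wraps the pointer difference to the counter width, which is what a fixed-width hardware subtracter does by discarding the borrow.

```python
def fifo_depth(wp, rp, pointer_bits):
    """Depth as computed by a pointer_bits-wide (modulo 2**pointer_bits) subtracter."""
    modulus = 1 << pointer_bits
    return (wp - rp) % modulus


if __name__ == "__main__":
    wp = 0b00_0000_0001
    rp = 0b11_1111_1111
    print("plain subtraction:", wp - rp)
    print("MOD-1024 subtraction:", fifo_depth(wp, rp, 10))
```

The plain integer difference is 1 - 1023, while the 10-bit wrapped difference is the small positive count of items written since the read pointer's position.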
In the case of a 4K-location single-clock FIFO (4K = 2^12), you would use a JK FF to record AF (Almost Full) and AE (Almost Empty) and _______-bit wide counters for WP and RP and a MOD-_______ subtracter to perform WP-RP. In this case the FIFO is FULL when the JK FF reads AF (Almost Full) and the depth is found to be _________________ (state in binary) and the FIFO is EMPTY when the JK FF reads AE (Almost Empty) and the depth is found to be _________________ (state in binary).
In the case of a 4K-location 2-clock FIFO (4K = 2^12), you would use _______-bit wide counters for WP and RP and a MOD-_______ subtracter to perform WP-RP. In this case the FIFO is FULL when the depth is found to be _________________ (state in binary) and the FIFO is EMPTY when the depth is found to be _________________ (state in binary).
3.3 With a FIFO delinking the producer and the consumer (who run on different clocks), if the producer is fast and is able to produce data, the consumer will be able to consume from the FIFO on consecutive clocks in a burst-manner most of the time. _______ (T/F). The consumer needs ______________ (A/B/C) A = to look at the FULL flag besides the Empty flag, B = to look at the EMPTY flag only, C = to wait for the RP to be passed to the writer after every read.
3.4 In the absence of any FIFO or delinking buffer, the producer has to present the data and tell the consumer to "take it" by activating "TAKE" and then wait for the consumer to say "got it" (by activating "GOT"). This is not enough. Then the __________________ (producer/consumer) has to inactivate __________ ("TAKE" / "GOT") and wait for the __________________ (producer/consumer) to inactivate __________ ("TAKE" / "GOT"). This is called the complete handshake. With the overhead of double synchronization both ways, this takes ____________ (one / several) clock(s) per exchanging one data item.
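For reference, the conventional four-phase (return-to-zero) handshake that 3.4 describes can be traced step by step. This is a sketch of the standard protocol, not a transcription of the course's answer key; the function name and the trace format are hypothetical, and the synchronizer delays that make each step cost several clocks are not modeled.

```python
def four_phase_handshake(item):
    """Transfer one item using TAKE/GOT and return (received item, signal trace)."""
    take, got = 0, 0
    trace = []

    take = 1                 # producer presents the data and says "take it"
    trace.append(("producer activates TAKE", take, got))

    received = item          # consumer latches the data while TAKE is active
    got = 1                  # consumer says "got it"
    trace.append(("consumer activates GOT", take, got))

    take = 0                 # producer sees GOT and withdraws TAKE
    trace.append(("producer inactivates TAKE", take, got))

    got = 0                  # consumer sees TAKE low and withdraws GOT
    trace.append(("consumer inactivates GOT", take, got))

    return received, trace


if __name__ == "__main__":
    value, steps = four_phase_handshake(0xAB)
    for description, take, got in steps:
        print(f"{description:28s} TAKE={take} GOT={got}")
```

Each of the four transitions has to cross a clock domain and be synchronized before the other side reacts, which is the overhead the question refers to.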
3.5 DEAD LOCK: Note that the Producer and Consumer have other constraints to be able to Produce or Consume. Clock frequency is not the only thing. On and off, they may not be able to pay attention to the FIFO. That is why we need the FIFO to delink them.
For a 64-location 2-clock FIFO, if one tries to use 6-bit (instead of 7-bit) counters for WP and RP with two JK FFs and two MOD-64 subtracters (one set per clock domain), in spite of using the usual Gray-code counters and double synchronization both ways, dead-lock can easily occur in ______________ (A/B/C/D). A = two-clock FIFO with widely differing frequencies, B = two-clock FIFO with slightly differing frequencies, C = both, D = neither. In a dead-locked situation, it is possible that the consumer stops consuming from the FIFO because he thinks that the FIFO is running ____________ (FULL/EMPTY), while at the same time, the producer stops producing as he believes that the FIFO is running ____________ (FULL/EMPTY).
4 ( points) min. Cache
4.1 The following is the L1 cache for our USC80486 processor (32-bit address, 32-bit data, byte-addressable processor). Total cache size is 10 KB. Set-Associative Mapping with DoSA (Degree of Set Associativity) = 5. Block size = 4 32-bit words. Divide the address and complete all details below.
(CPU address bits) A31 A30 A29 A28 A27 A26 A25 A24 A23 A22 A21 A20 A19 A18 A17 A16 A15 A14 A13 A12 A11 A10 A9 A8 A7 A6 A5 A4 A3 A2 A1 A0
(Byte enables) /BE[3:0]
[Figure: one way of the cache. TAG RAM with a Valid bit ( ___ more like this); Comp unit, ___ bits wide, producing Hit/Miss; DATA RAM ( ___ more like this) with Address, Data_in, and Data_out, built from four byte-wide banks D[31:24], D[23:16], D[15:8], D[7:0] selected by /BE[3:0]. Size of one byte-wide bank: _____ x 8.]
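One way to check an address division for a set-associative cache is to compute the field widths from the stated parameters. The sketch below is hedged: the function `cache_fields` and its result keys are hypothetical, it assumes the number of sets comes out as a power of two, and it assumes each way's data RAM is split into one byte-wide bank per byte enable, as the figure suggests.

```python
def cache_fields(cache_bytes, ways, words_per_block, bytes_per_word=4, address_bits=32):
    """Derive BYTE/WORD/SET/TAG widths for a set-associative cache."""
    block_bytes = words_per_block * bytes_per_word
    blocks = cache_bytes // block_bytes
    sets = blocks // ways
    # The index field only works cleanly when these are powers of two.
    assert sets & (sets - 1) == 0, "number of sets must be a power of two"

    byte_bits = (bytes_per_word - 1).bit_length()
    word_bits = (words_per_block - 1).bit_length()
    set_bits = (sets - 1).bit_length()
    tag_bits = address_bits - set_bits - word_bits - byte_bits
    return {
        "sets": sets,
        "byte_bits": byte_bits,
        "word_bits": word_bits,
        "set_bits": set_bits,
        "tag_bits": tag_bits,
        # Assumption: one byte-wide bank holds one byte of every word in one way.
        "rows_per_byte_bank": sets * words_per_block,
    }


if __name__ == "__main__":
    # Parameters stated in 4.1: 10 KB total, DoSA = 5, 4 words per block.
    print(cache_fields(cache_bytes=10 * 1024, ways=5, words_per_block=4))
```

The non-power-of-two total size is absorbed by the degree of set associativity, so the set index still uses a whole number of address bits.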
It is not difficult to get an A in EE457. You need to aspire for it, work for it, and seek help from the 457 teaching team on whatever you do not understand. We are eager to help you. The final topics (exceptions, branch prediction, out-of-order execution, chip multi-threading, chip multiprocessing, cache coherency, locks and mutual exclusion) are interesting and challenging too. They are the focus of 70% of the final exam. Best wishes! - Gandhi, Kartik, Gengyu, Arvind, Medha, Yunfei, Sanket, Fangqing, and Lin
4.2 Given below is a diagram similar to the one in your class notes. Divide the 13-bit address from the CPU into TAG, SET, WORD, and BYTE fields based on the information provided on the diagram. Fill in all boxes.
[Figure: cache organization with three blocks (Block 0, Block 1, Block 2). Address from the CPU: A12 A11 A10 A9 A8 A7 A6 A5 A4 A3 A2 A1 A0, with byte enables BE3-BE0. Each block has a TAG RAM holding 4 (1+6)-bit tags (Valid bit plus 6-bit tag), a 6-bit Comparator, and a DATA RAM made of four byte-wide banks (D31-D24, D23-D16, D15-D8, D7-D0, enabled by BE3, BE2, BE1, BE0) with Addr, Data-In, and Data-Out. RAM rows are indexed 00, 01, 10, 11, and the address inputs are blank boxes of the form A[ ] to be filled in.]
Degree of Set Associativity =
Number of Sets =
Total size of the cache =
For an address of 0_1010_1010_1010, you (the CCU) go to TAG RAM(s) (circle all applicable):
A. Block 0 TAG RAM
B. Block 1 TAG RAM
C. Block 2 TAG RAM
D. None of them
5 ( points) min. Virtual Memory
5.1 We said that a 1-level page table (for one process) wasn't that huge, as shown on the side, but when we consider several processes running together on the CPU, the cumulative consumption of space to hold all those page tables is excessive. Hence we went for a multi-level page table. What is the size (in kilobytes (KB) or megabytes (MB)) of the 1-level page table (for one process) shown on the side? ___________________________________________
The principle of locality makes the 2-level (or multi-level) page table attractive. If our applications and our OS try to deliberately break the "principle of locality" and our OS allocates Virtual address space in a very distributed manner, then the 2-level page table shown below in the next diagram ends up building all 1024 2nd-level tables (pink tables) and we in fact consume ________ KB more space :(
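The size comparison in 5.1 is simple arithmetic once the entry size is fixed. The sketch below is an illustration under stated assumptions: a 32-bit virtual address with a 20-bit VPN split 10+10 (as in this problem) and, as an assumption not confirmed by the transcript because the side figure is missing, 4-byte page table entries. All names are hypothetical.

```python
VA_BITS = 32
PAGE_OFFSET_BITS = 12   # 20-bit VPN, as used in this problem
PTE_BYTES = 4           # ASSUMPTION: entry size is not stated in the transcript


def one_level_table_bytes():
    """One flat table with an entry for every virtual page."""
    entries = 1 << (VA_BITS - PAGE_OFFSET_BITS)
    return entries * PTE_BYTES


def two_level_table_bytes(second_level_tables_built, l1_bits=10, l2_bits=10):
    """First-level table plus however many second-level tables were built."""
    first_level = (1 << l1_bits) * PTE_BYTES
    second_level = second_level_tables_built * (1 << l2_bits) * PTE_BYTES
    return first_level + second_level


if __name__ == "__main__":
    flat = one_level_table_bytes()
    worst = two_level_table_bytes(1024)
    print("1-level table:", flat // 1024, "KB")
    print("2-level, all 1024 second-level tables:", worst // 1024, "KB")
    print("extra space:", (worst - flat) // 1024, "KB")
```

Under these assumptions the worst case of the 2-level scheme costs exactly one first-level table more than the flat table; with a different entry size the numbers scale accordingly.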
5.2 It was observed that most of our applications use smaller clusters of Virtual pages and we were mostly using the lower 32 locations of the 1024 locations in most of the 2nd-level (pink) page tables shown in the diagram. This means that most virtual addresses have their VPNs with 5 zeros as shown below.
DDDD_DDDD_DD00_000P_PPPP
For the sake of our question, let us assume that those 5 bits are always 5 zeros. Choose one of the 2 choices of a 3-level page table as your choice and tell us if you achieved any saving in terms of cumulative space consumed for holding all level tables for one process.
A. Divide 20-bit VPN as 5+5+10: VA[31:27], VA[26:22], VA[21:12]
B. Divide 20-bit VPN as 10+5+5: VA[31:22], VA[21:17], VA[16:12]
____________________________________________________________________________
____________________________________________________________________________
____________________________________________________________________________
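[Figure: 2-level page table. PTBR locates the 1st-level table, indexed by VA[31:22]; the 2nd-level (pink) tables are indexed by VA[21:12], with the lower 32 locations highlighted.]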