ee457 final exam (~30% to 36%) - university of southern

May 9, 2018 8:49 am EE457 Final - Spring 2018 1 / 13 C Copyright 2018 Gandhi Puvvada

EE457 Final Exam (~30% to 36%)Closed-book Closed-notes Exam; No cheat sheets;

Ordinary calculators may be used but not the smart phone with calculators. Verilog Guides are not needed and are not allowed.Smart phones, tablets (and any kind of computing/Internet devices) are not allowed.

This is a Crowdmark exam. Please do not write on margins or on backside.

Spring 2018Instructor: Gandhi Puvvada

Wednesday, 5/9/2018 (A 3-hour exam) 04:30 PM - 07:30 PM (180 min) in SGM124Please do not write your student ID

Ques# Topic Page# Time Points Score

1 Tomasulo 2-5 50 min 91

2 Arithmetic 6 20 min 52

3 Virtual Memory 7 20 min 34

4 Cache 8 8 min 15

5 Miscellaneous advanced topics 8-9 30 min 70

6 Lab 6 Early Branch, EX and MEM stages merged

10-12 40 min 80

Total Cover+11+ Blank = 13

168 min. 342

Perfect Score 330


1 ( 18+4+2+16+10+10+4+7.5+3.5+6+1 = 91 points) 50 min. Tomasulo

I-Cache

Register Status Table

Integer / Branch

D-CacheDiv Mul

TAG FIFO

Instruc. Queue

Reg

. File

Int.

Que

ue

L/S

Que

ue

Div

Que

ue

Mul

t. Q

ueue

CDB

Issue Unit

Dispatch

Load Buffer

IoI-OoW-OoC with RST

IoI-OoE-IoC_with_ROB

I-Cache

Br. Pred. Buffer

lw mult

Integer / Branch Div Mul

ROB

InstructionPrefetch Queue

Reg

. File

Int.

Que

ue

L/S

Que

ue

Div

Que

ue

Mul

t. Q

ueue

CDB

Issue Unit

D-Cache

Dispatch

1 mult2 Completed3 lw4 Completed

StoreAddr. Buffer D-Cache

L/S Buffer

Current Head

Current TailWP

Addr.Adder

No store buffer

for EE457

Stores hit in cache

always for EE457


You know the above two designs very well. The top is the OoC design and bottom is the IoC design.

1.1 OoC design components: RF, RST, and TAG FIFO

1.2 Between the Data Cache and the CDB, it is Load Buffer in the case of the OoC and LS Buffer in the case of the IoC. Explain: _____________________________________________________________________________________________________________________________________________________________________________

1.3 Exceptions can be supported in the ________ (OoC/IoC/both) designs.

1.4 We said a few times,"we wish we could "UNDO" like in word processing, and get back to LION

in $8 from the current value of ANT in the above diagram of RST".Explain the context and discuss. Is this needed in the OoC design or the IoC design?

18 pts $0$1$2

$8

$31

$0$1$2

$8

$31

RST RF

LION TIGER CAT ANT

ANT

In our Summer 2008 EE560 project of the OoC design, we had a RST of 32x7, a RFof 32x32 and a Tag FIFO of 64x6. Some choices may be random but not all of them.Are there any relations between these sizes (depths and widths)? Is there a valid bit in any of these? Who performs an associative search on whom and when? How many of what size comparators are involved in that associative search?In your analysis, consider scaling this to a new bigger OoC design for a 64-bit processor with a RF of 64 x 64 and a 128 deep TAG FIFO.

0 1 2

63

TAG FIFO

DOG

Current design: RST of 32x7 , a RF of 32x32 and a Tag FIFO of 64x6 Scaled up design: RST of ____ , a RF of 64x64 and a Tag FIFO of 128x___

D_Cache

Load Buffer

D_Cache

LS Buffer 4 pts

2 pts

LION TIGER CAT ANT$8 ANT

16 pts


1.5 In the case of Tomasulo #3 (the more current IoC design of EE560, introduced in the last lecture), _________ (FRAT / RRAT) avoids the expensive associative prioritized search in ROB (by holding the latest mapping between each architectural register and the locations in the physical register file). FRAT stands for Front-end RAT (Register Alias Table) associated with the dispatch process, where as RRAT stands for the ________ RAT. In case of branch misprediction, ______ (FRAT / RRAT) needs to be reverted to its state before the dispatch of the mispredicted branch. We take snap shots (check-points) of the ______ (FRAT / RRAT) and store them. Check-points are taken for _________________ (low-confidence/high-confidence) branches so if it fails, we restore ______ (FRAT / RRAT) by copying the checkpoint to it. But if a _______ _____________ (low-confidence / high-confidence) branch (for whom we did not create a check-point) got mispredicted, we jump to the nearest check-point and then walk though the ROB by rewinding/fast-forwarding the ______________ (dispatch / graduation) process to the mispredicted branch. This is similar to a _______ (DVD /VHS tape) where we jump to the nearest chapter point and then rewind or fast forward to the scene of interest.

1.6 Initially (on power-on), a FIFO between a producer and a consumer is _______ (empty/full).ROB in the IoC design is initially _______ (empty/full). TAG FIFO in the OoC design is initially _______ (empty/full). Do you use full or empty to describe the state of the RST anytime? Y / NDo you use the word FULL for a direct-mapped cache? How about Fully Associative TLB? ____________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________

1.7 RAS (Return Address Stack) _______ (is / is not) used in the OoC design because __________________________________________________________________________________________________________________________________________________________________

1.8 You use TOSP+1 to _________ (push /pop) return address.You use TOSP to _________ (push /pop) return address.If recursive routines are very rarely used, (i) would you still make the RAS a circular stack? _____ (Y / N), (ii) can you remove the UseWhenEmpty register? _____ (Y / N). Since the RAS holds return addresses, what is the width of this stack array? ________.

1.9 The branch prediction in the IoC design is considered equivalent to predicting from the ________ (IF stage / ID stage) because ________________________________________________________________________________________________________________________________

1.10 If ROB is full in the clock when a branch on CDB announces that it was mispredicted, (circle all applicable)i) WP and RP must be pointing to the same location in the ROBii) The branch instruction can potentially be the junior-most instruction in the ROB, pointed to by _________ (RP, RP-1, RP+1, WP, WP-1, WP+1)iii) The branch instruction can potentially be the senior-most instruction in the ROB, pointed to by _________ (RP, RP-1, RP+1, WP, WP-1, WP+1)

10 pts

10 pts

4 pts

TOSPTOSP+1

0123

RAS

UseWhenEmpty

7.5 pts

3.5 pts

6 pts


1.11 In the figure below, RT15 means ROB_Tag = 15 and also the location 15 in ROB.

1.12 CDB: Mr. Bruin said that he removed the CDB Register and his design worked perfectly in simulation. Mr. Trojan asked if he is performing behavioral simulation or post-implementation delay-back-annotated simulation. Mr. Bruin did not understand the question. Please comment._________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________

1.13 We know that the instruction bearing ROB_Tag 17 (RT17) can be a junior or a senior to the mispredicted branch instruction bearing ROB_Tag 13 (RT13). Out of the three choices below each of the scenarios, select the right choice. Write a brief explanation of your choice.

12 pts

RD_Data

Like in Q#1.1, if we move to a bigger 64-bit processor whoseregister file is 64x64 instead of our current 32x32, do you im-mediately see that the ___________ (depth/width/both) of theROB increasing substantially? Explain.

ROB

$8RT12RT13RT14RT15

RD_ID Reg_Write CE

RT00

RT31

CE = Completed Execution

1 1$8 1 0

$8 1 0RT16RT17

$8 1 1

6 pts

16 pts

ROB Scenario #1

RT12RT13RT14RT15

RT00

RT31

RT16RT17

RT23

RT05

Mispredicted branch

Instruction in question

Instr RT17 is a juniorInstr RT17 is a seniorImpossible scenario

WP

RP

ROB Scenario #2

RT12RT13RT14RT15

RT00

RT31

RT16RT17

RT23

RT05

Mispredicted branch



RP

WP

ROB Scenario #3

RT12RT13RT14RT15

RT00

RT31

RT16RT17

RT23

RT05

Mispredicted branch



WPRP

ROB Scenario #4

RT12RT13RT14RT15

RT00

RT31

RT16RT17

RT23

RT05

Mispredicted branch



RPWP

Scenario #1

Scenario #2

Scenario #3

Scenario #4


2 ( 12+12+12+16 = 52 points) 20 min. Arithmetic (based on Lab 3 SLT, SLTU and Spring 2012 and Fall 2012 questions)

2.1 We are give two 4-bit subtracters with C3, C4 and R3 brought out as shown on the side.

On the upper-part, treat X and Y as unsigned numbers, and produce XhsY (X higher or same as Y) (opposite of Less Than Unsigned (SLTU) of Lab 3).

On the lower-part, treat X and Y as signed numbers, and produce XgeY (X greater or equal as Y) (opposite of Less Than Signed (SLT) of Lab 3).

2.2 Based on the Fall 2012 question on overflows:Treating X, Y and R as signed numbers,produce R = 0-X-Y (i.e. zero minus X minus Y) using two subtracters. Also produce overall overflow.

2.3 Unsigned series of additions and subtractions: R = A+B-C+D-EIn the above expression there are equal number of additions and subtractions. Hence there are _________ (equal/unequal) number of opportunities to overflow and underflow, which can potentially cancel each other. You are given two half- adders and a 2-bit comparator to complete the design and produce RR (Result_Right) signal.Does it matter we do the additions and subtractions alternatively? Y / N

4 invertersA

B

Adde

r

Cout

S

C4

Cin

Cout_1

C3

RY

X

?

R3

4 invertersA

B

Adde

r

Cout

S

C4

Cin

Cout_1

C3R

Y

X

?

R3

XhsY

XgeY

Unsigned

Signed

12 pts

12 pts

4 invertersA

B

Adde

r

Cout

S

C4

Cin

Cout_1

C3

IRX

0

?

IR3 4 inverters

A

B

Adde

r

Cout

S

C4Cin

Cout_1C3

RY

IR

?

R3

IR = Intermediate Result

Signed

12 pts

A

BS

Cin

Cout

4 inv

A

BS

Cin

Cout

A

BS

Cin

Cout

4 inv

A

BS

Cin

Cout

R

Adder

Adder

Adder

Adder

O1 O2

U1 U2

Unsigned

? ? ? ?

16 pts


3 ( 8+8+18 = 34 points) 20 min. Virtual Memory:

In a system with 32-bit Virtual Address VA[31:0] and 32-bit physical address PA[31:0], the TLB uses set associative mapping with 256 sets, each set with 5 entries (Degree of associativity = 5). The PPFN is 5 hex digits long and has values in the rang 00000 hex to FFFFF hex. The same range applies to VPN also which is divided into 2 hex digits of Set field and 3 hex digits of TAG field. Divide the Virtual address and physical address below as appropriate.

Size of Virtual Page (in Kilobytes): _________________. Size of Physical Page Frame (in Kilobytes): ____________________.

When the TLB is full, is it possible to have (A) repetitions of Virtual TAGs in the 5 locations of any row (say the top-most row) Yes / No(B) repetitions of Virtual TAGs across rows (say in the first left-most column of 256 TAGs) Yes / No(C) repetitions of PPFNs in the 5 locations of any row (say the top-most row) Yes / No (D) repetitions of PPFNs across rows (say in the first left-most column of 256 PPFNs) Yes / No

Currently the TLB has only three entries as indicated by the three valid bits shown above. Fill-inthe three entries with Virtual TAGs and PPFNs of your choice in such a way that (a) the three virtual pages are as close to each other as possible (preferably adjacent to each other, if possible) and (b) the three physical page frames are also as close to each other as possible (preferably adjacent to each other, if possible). Write down the 3 + 3 = 6 page numbers in hex below.

VA31 VA[1:0] PA31 PA[1:0]8 pts

8 pts

The three VPNs inincrementing order:_____________________________________________

The three PPFNs inincrementing order:_____________________________________________

18 pts


4 ( 15 points) 8 min. Cache

Consider a 160KB cache in a 32-bit physical address 64-bit data byte addressable system, block size = 4 64-bit words, set associative mapping, degree of set associativity = 5Complete the details of one TAG RAM and one Data RAM and state how many such items.

5 ( 6+3+9+4+4+8+8+4+3+12+6+3 = 70 points) 30 min. Miscellaneous advanced topics

5.1 The ideal CPI (clocks per instruction) of our OoO processor (IoI-OoE-IoC) is __________.The ideal CPI (clocks per instruction) of a simple 5-stage pipeline (IoI-IoE-IoC) is ______.The ideal CPI (clocks per instruction) of a 2-way Super-scalar machine is _____________.The ideal CPI (clocks per instruction) of a super-pipelined 10-stage (IoI-IoE-IoC) is ______.The ideal CPI (clocks per instruction) of a 4-threaded single core machine is ___________.The ideal CPI (clocks per instruction) of a 1-threaded 4-core machine is ______________.

5.2 Branch direction prediction becomes more important in __________ (deeper / shallow) pipelines.Branch direction prediction becomes more important in _____________ (out-of-order / in-order) executing pipelines.

5.3 In a 2-bit branch direction predictor, the four states on the right are for ______ (A / B) (A = the whole 1K-deep BPB, B = for each entry of the 1024 entries in the 1K-deep BPB). In implementation the two bits in an entry are either incremented or decremented, ________________________________ (at most one entry in one clock, one entry in every clock, all entries in one clock). Hence we need ___________________________________________ (just one incrementer/decrementer unit / 1024 incrementer/decrementer units). State change is done when a branch ____________________________________ (is predicted / reaches CDB and announces predicted right or mispredicted. SN to N state transition takes place when the branch is __________________(predicted right / mispredicted). N to SN state transition takes place when the branch is __________________(predicted right / mispredicted).

5.4 In the MPI (Miss rate per instruction) calculations, the “per instruction” refers to ___________ (all / only memory accessing) instructions. This facilitates quickly figure out the impact of the misses on _____________ (write the metric we calculated using this).

15 pts

A19 A18 A17 A16A31 A30 A29 A28 A27 A26 A25 A24 A23 A22 A21 A20 A3 A2 A1 A0A15 A14 A13 A12 A11 A10 A9 A8 A7 A6 A5 A4

(Byte enables)

(CPU address bits)

D[7

:0]

D[1

5:8]

D[2

3:16

]

D[3

1:24

]

DATA RAM( ___ more like this)

Address

Data_in

Data_out

1Valid Com

p un

it __

- bi

ts w

ide

Hit/Miss

( ___ more like this)TAG RAM with Comp. unit

1

Size

of o

neB

yte-

wid

e B

ank

____

_ x

8

/BE[7:0] 8

D[3

9:32

]

D[4

7:40

]

D[5

5:48

]

D[6

3:56

]

(CPU address bits)

(CPU address bits)

TAG RAM size: ____ x ____

6 pts

3 pts

SN N T ST

9 pts

4 pts


5.5 Mutual Exclusion using atomic instruction pair like the LL and SC is needed in (circle all right answers)(i) single-core multi-threaded CPU (ii) multi-core multi-threaded CPU (iii) multi-core single-threaded CPU (iv) single-core single-threaded CPU but executing multiple interacting processes in a time-shared fashion.

5.6 CMT: In a 4-threaded core, circle the common resources below:PC, IFQ (Instr. Fetch Queue), I-TLB, I-cache, RF (Register file), TS (Thread Selection), IEU (Integer Execution Unit), FPU (Floating-point Execution unit), SB (Store Buffer), D-TLB, D-cache, SCU (Snoopy Control Unit for Cache Coherency)

5.7 Non-blocking caches _____________ (reduce/increase) processor stalls due to cache misses. They buffer the misses and ___________ (continue/wait) to serve other access requests from _____________ (same thread/different threads). MSHR, standing for ________________________________, buffers the missed cache access until it is served. Non-blocking cache is helpful in _______________________________ (instruction cache only/data cache only/both). Non-blocking caches have two controllers, such as CCU and SCU, so that one attends to the new requests while the other attends to the misses. T / F

5.8 Intel HTT stands for H_________ T___________ T___________ . Its generic name is ____ (A/B/C) where A = fine-grain multi-threading where all instructions dispatched in a clock are of the same thread but threads are switched on every clock, B = coarse-grain multi-threading where all instructions dispatched in a clock are of the same thread but threads are suspended for a cause such as cache miss, C = Simultaneous multi-threading where instructions dispatched in a clock can be any mix of the threads.

5.9 If you want to use one mega Data TLB instead of 4 individual TLBs for the 4 threads in a 4-threaded core, you will add to the entries of the TLB ______ (A / B). A = 2-bit thread ID , B = ASN (address space number = Process ID)

5.10 Complete the improved state diagram of EE560 on the side. Add R/FMM or R/-- as appropriate. I have removed all state transition conditions for the arrows converging on to or diverging from the two O states

5.11 Out of the 6 states above, LL can be performed in state(s) ________________ and SC in state(s) __________.$6 in LL $6, 200($0) is ______ (A/B/C), $7 in SW $7, 200($0) is ______ (A/B/C), $8 in SC $8, 200($0) is ______ (A/B/C), where A = a source register, B = a destination register, C = a source as well as a destination register.

5.12 SC is used to ____________________ (to obtain / to release / to obtain or release) a lock.

4 pts

8 pts

8 pts

4 pts

3 pts

12 pts

6 pts

3 pts


6 ( 40 + 40 = 80 points) 40 min. Lab 6 Early Branch Design -- EX and MEM stages combined:

Nothing needs to be done on this page. Complete your design on the next two pages. On the next page, I have mainly removed the EX/MEM stage register. Remove unneeded connections and unneeded items of the original 5-stage pipeline on the next page. Revise FU_Br, FU, HDU_Br, HDU as needed. On the further next page, complete the new EX stage stalling, Memory address, MR and MW.

Consider the left two instructions and compare them with the right two instructions.

lw $8, 0($2); lw $8, 40($2);sw $8, 0($2); sw $8, 60($2);

Let us say, for the sake of this problem, that we observed that many programs use lw and sw instructions with 0 offset (for their address calculation) as shown on the left quite often.

So we proposed to merge the EX and MEM stages into one stage called new EX. Now lw $8, 0($2) on the left side bypasses the ALU, and uses the raw $rs ($2 here) as the memory address. So the left-hand instructions require only 1 clock in the new EX stage. Hence instructions dependent on lw $8, 0($2) do not need to stall for one clock. This is a gain.

But instruction on the right need two clocks in the new EX stage for the address calculation to complete before you access memory. This is a loss. We use our standard mechanism of stalling a stage for exactly one clock (from our Lab 7 Part 3 SP2) to provide an extra clock for the right side instructions in the new EX stage.

Certainly, in the case of sw $8, 60($2), we should not convey MemWrite control signal in the first clock of the two clock process to the Data Memory before the address 60($2) is fully calculated by the ALU. Otherwise you would be corrupting the Data Memory. It is also important in the case of lw $8, 40($2), not to convey MemRead control signal in the first clock to avoid unwanted side-effects in the Data Cache. Make sure that the MR and MW are inactivate at all other times (other than the needed times).

We should convey the right (proper) address to the memory. We convey either the contents of Rs bypassing the ALU or the effective address as calculated by the ALU. Hence a mux is added in the address path of the Memory.

Offset0

D QCLKCLRCLK

RESET_B

2nd

EX

MEWB

ALUSrcALUOpRegDst

MemReadMemWrite

EN

Offset0SW_LW_0

SW_LW_N0

MR

MW

Offset is 0

Offset is not 0

I0[0]

I0[1]

I1[0]

I1[1]

Y[0]

Y[1]

S

ALU

ALUSrc

Dat

am

emor

y

@

W

R

1

0

1

0

May 9, 2018 8:49 am

EE457 Final - Spring 2018 11 / 13C

Copyright 2018 G

andhi Puvvada

Hazarddetection

unit

04

Inst

ruct

ion

mem

ory

PC

+

r1

r2

R1

R2w

W

opco

ders

rtrd

shift

func

t

Reg

iste

rs

Co

ntr

ol(PC

)

(rs)

(rt)

ALU

rtrd

ALUctrlSign

ext.

EX

MEWB

ALUSrcALUOpRegDst

ALUSrc

Reg

Dst

ALUOp

RegWrite_EX

Dat

am

emor

y

@

W

R

Mem

Rea

d

Mem

Writ

e

Reg

Writ

e

IF.Flush

WR

WB

MEM

_dat

aR

EG_d

ata

Reg

Writ

e

MemtoReg

+

=

func

ts_ext

ShiftLeft 2

Zero

Forwarding Unit

Early Branch with incomplete modifications to merge EX & MEM stages

IF/IDIF-Stage

ID/EXID-Stage EX_Stage

EX/WBWB-Stage

rs

MemRead_EXMemRead_MEM

Writ

eReg

iste

r_EX

FU_BrFW

_RS_

WB

FW_R

S_M

EM

FW_R

T_W

B

FW_R

T_M

EM

FW_R

T

FW_R

S

Writ

eReg

iste

r_M

EM

WriteRegister_MEMHDU_Br

STALL_BEQSTALL_LW

STALL

Bra

nch

01

0

1

1

0

0

1

11

11

1

00

00

0

0

0

1

Bra

nch

1

fowarding_mux_control

?

Offset0

D QCLKCLRCLK

RESET_B

2ndRemove unneeded connections and unneeded items. Revise FU_Br, FU,HDU_Br, HDU.

40 pts

May 9, 2018 8:49 am

EE457 Final - Spring 2018 12 / 13C

Copyright 2018 G

andhi Puvvada

04

Inst

ruct

ion

mem

ory

PC

+

r1

r2

R1

R2w

W

opco

ders

rtrd

shift

func

t

Reg

iste

rs

Co

ntr

ol(PC

)

(rs)

(rt)

ALU

rtrd

Signext.

EX

MEWB

ALUSrcALUOpRegDst

ALUSrc

Reg

Dst

Dat

am

emor

y

@

W

R

MemReadMemWrite

IF.Flush

WR

WB

MEM

_dat

aR

EG_d

ata

Reg

Writ

e

MemtoReg

+

=

func

ts_ext

ShiftLeft 2

Zero

Early Branch with incomplete modifications to merge EX & MEM stages

IF/IDIF-Stage

ID/EXID-Stage EX_Stage

EX/WBWB-Stage

rs

Bra

nch

01

0

1

1

0

0

0

1

1

Offset is 0

D QCLKCLRCLK

RESET_B

2nd

STALL

EN

EN

RESET_B

RESET_B

RESET_B

RESET_B

Offset0SW_LW_0

SW_LW_N0

MR

MW

Offset is 0

Offset is not 0

I0[0]

I0[1]

I1[0]

I1[1]

Y[0]

Y[1]

S

1

0

Complete the rest of the design.New EX stage stalling, Memory address, MR, MW.

40 pts


Blank page: Please write your name and email. Tear it off and use it for rough work. Do not submit.Student’s Last Name:____________________ email: __________________

We enjoyed teaching this course! Hope you liked it! Hope to see some of you in EE560. Grades will be out in a week. Enjoy your summer break! Happy Holidays!!! Gandhi, TAs: Fangzhou, Chao, Mentors: Rui, Pravin HW Graders: Navtej, Rupam Lab graders: Ujwala, Aashish

ee457 final exam (~30% to 36%) - university of southern

Documents