TRANSCRIPT
Student # (use if pages get separated)
ECE552 Computer Architecture Pg 1 of 18 Fall2015
UNIVERSITY OF TORONTO FACULTY OF APPLIED SCIENCE AND ENGINEERING
Final EXAMINATION, October 2015
ECE552F – COMPUTER ARCHITECTURE Exam Type: D Duration: 2:30
Examiner – A. Moshovos
Instructions
This is a type D exam. You are allowed to use any printed/hand-written material including your course notes. You
may use a University approved calculator.
Last Name (Print Clearly): _
First Name:
Student Number:
Question Marks Awarded
1 15
2 19
3 10
4 9
5 7
6 10
7 15
8 5
9 10
Total 100
General Instructions: State your assumptions. Show your work. Comment your code. Solutions that are judged
significantly inefficient will lose some marks. The exam is printed on two sides of the page.
Make your answers clear.
There are 9 questions and a total of 100 marks. There are 9 pieces of paper in the exam,
this one included, printed both sides. The page numbering is 1-18.
Reservation Stations:
Inst  Dest PR  Source1 PR  Source2 PR
L0    p32      —           —
L1    p33      p32         —
L2    p34      p32         —
L3    p35      p34         p33
L4    p36      p32         —
L5    —        p36         —
L1’   p37      p36         —
L2’   p38      p36         —
L3’   p39      p38         p37
L4’   p40      p36         —
L5’   —        p40         —
Question 1 – Register Renaming and OOO Execution [12 Marks]
A processor uses register renaming where architectural registers are mapped to physical
registers. Initially, the free list of physical registers contains, in order, P32, P33, …, P127,
and the relevant Register Alias Table (RAT) contents are:
Architectural Register  Physical Register
R8                      P8
R9                      P9
R10                     P10
The following code is to be renamed:
L0: addi r8, r0, 0x200 # r8 = 0 + 0x200
L1: ldw r9, 0x2000(r8) # r9 = mem[r8 + 0x2000]
L2: ldw r10, 0x3000(r8) # r10 = mem[r8 + 0x3000]
L3: add r10, r10, r9 # r10 = r10 + r9
L4: addi r8, r8, -4 # r8 = r8 - 4
L5: bne r8, r0, L1 # if (r8 != 0) goto L1
Fill in the tables below representing the reservation stations and the Reorder Buffer
(ROB) entries after all instructions for the first two loop iterations have been
renamed. Assume that no instruction executes before all instructions have been
renamed. The inst fields are already filled in. The remaining fields refer to the
appropriate architectural (AR) or physical registers (PR). If the instruction does not use a
particular register, write N/A.
L1’ is the first instruction of the second iteration.
ROB:
Inst  Dest AR  Dest. Old PR
L0    r8       p8
L1    r9       p9
L2    r10      p10
L3    r10      p34
L4    r8       p32
L5    —        —
L1’   r9       p33
L2’   r10      p35
L3’   r10      p38
L4’   r8       p36
L5’   —        —
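The renaming above can be replayed mechanically. The sketch below (the helper names are mine, not the exam's) pops physical registers off the free list and records the old mapping exactly as the tables do; note the second iteration re-enters the loop at L1:

```python
def rename(instrs, rat, free_list):
    """instrs: (dest_arch_or_None, [src_archs]).
    Returns rows of (dest AR, new dest PR, old dest PR, source PRs)."""
    rob = []
    for dest, srcs in instrs:
        src_prs = [rat[s] for s in srcs]          # sources read the current mapping
        if dest is None:
            rob.append((None, None, None, src_prs))
        else:
            old_pr = rat[dest]                    # kept in the ROB for recovery
            rat[dest] = free_list.pop(0)          # next free physical register
            rob.append((dest, rat[dest], old_pr, src_prs))
    return rob

rat = {"r8": "p8", "r9": "p9", "r10": "p10"}
free = [f"p{i}" for i in range(32, 128)]
body = [("r8", []),             # L0: addi r8, r0, 0x200 (r0 is not renamed)
        ("r9", ["r8"]),         # L1: ldw r9, 0x2000(r8)
        ("r10", ["r8"]),        # L2: ldw r10, 0x3000(r8)
        ("r10", ["r10", "r9"]), # L3: add r10, r10, r9
        ("r8", ["r8"]),         # L4: addi r8, r8, -4
        (None, ["r8"])]         # L5: bne r8, r0, L1
rob = rename(body + body[1:], rat, free)          # second iteration starts at L1
```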
Question 2 – Virtual Memory [19 Marks]
a. [3 Marks] A Virtually-Indexed Physically-Tagged Data cache is 64KB and uses 64B
blocks. The processor has a 4GB address space and uses 8KB pages. What should the
cache associativity be in order to avoid having to search through multiple sets to correctly
identify synonyms? Explain your answer.
Associativity required = 64KB/8KB = 8
With an associativity of 8, only the address within a page (bits 0..12) is necessary to identify the set. Breakdown: bits 0..5 for the byte inside the line, bits 6..12 for the set. There are 128 sets each with 8 ways, for a total of 1024 blocks.
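The 64KB/8KB arithmetic can be double-checked directly (the variable names below are mine, not the exam's):

```python
# Geometry check: with enough ways, the set index fits inside the page offset.
cache_bytes, block_bytes, page_bytes = 64 * 1024, 64, 8 * 1024

ways = cache_bytes // page_bytes            # associativity needed: 8
sets = cache_bytes // (block_bytes * ways)  # 128 sets
index_bits = sets.bit_length() - 1          # 7 set bits
# 6 block-offset bits + 7 set bits = 13 bits = the 8KB page offset (bits 0..12).
```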
b. A processor uses a 96-entry, 3-way set-associative TLB with LRU replacement and
8KB pages. The following code executes on the processor:
for (i = 0; i < 16k; i++)
c[i] = a[i] + b[i];
When each loop iteration executes, it reads in order a[i] and b[i], and finally writes c[i].
Assume the array elements are each 4 bytes long. Also assume that the arrays a, b, and c
start at 0x10000, 0x20000, and 0x30000 respectively.
b.i) [6 Marks] Fill in the table below with the accesses that the TLB will observe for
the first two loop iterations. Assume that the TLB is empty and that we start filling in
from way 0, then way 1, and finally way 2.
Requested Address  TLB Set  Set in Binary  Hit (Way)  Miss  Page Table Accessed (on TLB miss)
0x10000            8        b01000                    X     X
0x20000            16       b10000                    X     X
0x30000            24       b11000                    X     X
0x10004            8        b01000         0
0x20004            16       b10000         0
0x30004            24       b11000         0
b.ii) [4 Marks] Assuming that initially none of the data pages are allocated in memory,
when all loop iterations have executed, report the following:
Total TLB accesses: 48k
Total TLB misses: 24
Total Page Table Accesses: 24
Total Page Faults: 24
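A direct simulation of the loop's TLB behaviour reproduces these totals (a minimal sketch with assumed names). Each 64KB array spans 8 pages, and the 24 pages map to distinct sets (page number mod 32), so nothing is ever evicted:

```python
# 96-entry, 3-way TLB -> 32 sets; LRU kept as list order (front = LRU victim).
PAGE, SETS, WAYS = 8 * 1024, 32, 3
tlb = {}                                  # set index -> pages in LRU order
accesses = misses = 0
A, B, C = 0x10000, 0x20000, 0x30000
for i in range(16 * 1024):
    for addr in (A + 4 * i, B + 4 * i, C + 4 * i):   # read a[i], b[i]; write c[i]
        accesses += 1
        page = addr // PAGE
        entries = tlb.setdefault(page % SETS, [])
        if page in entries:
            entries.remove(page)          # refresh LRU position on a hit
        else:
            misses += 1
            if len(entries) == WAYS:
                entries.pop(0)            # evict the LRU way
        entries.append(page)
# Each miss costs one page-table access; each of the 24 pages faults once.
```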
c. A processor implementation uses 43-bit physical addresses and 8KB pages. The flat
page table uses 4-byte entries. The design team is considering a hierarchical page table
organization with three levels, each using 10 bits of the incoming 43-bit address:
10 10 10 13
L1 L2 L3 Offset
The L1 and L2 level table entries are also 4B long each, and L3 uses the same entry
format as the flat table.
i) [3 Marks] What is the size in bytes of each subtable (table chunk) at each level?
L1 subtable size = 2^10 × 4 = 4096 bytes
L2 subtable size = 2^10 × 4 = 4096 bytes
L3 subtable size = 2^10 × 4 = 4096 bytes
ii) [2 Marks] What is the maximum size in bytes that the hierarchical page table can
have?
L1: 1 subtable, 4K
L2: 1024 subtables, 4096K
L3: 1024^2 subtables, 4096M
Total: 4,299,165,696 bytes
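The totals for parts (i) and (ii) can be checked with a few lines (assuming every subtable is fully populated, which is what "maximum size" asks for):

```python
# Worst-case hierarchical page table size with 10+10+10-bit indexing, 4B entries.
entry = 4
sub = (2 ** 10) * entry      # 4096 B per subtable at every level
l1 = 1 * sub                 # one root table
l2 = (2 ** 10) * sub         # up to 1024 L2 subtables
l3 = (2 ** 20) * sub         # up to 1024 * 1024 L3 subtables
total = l1 + l2 + l3         # a little over 4 GB in the worst case
```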
Question 3 – Caches [7 Marks]
Assume a 4GB byte-addressable space
A. [5 marks] Due to an error in the design, the direct-mapped cache ended up using the
following set index function, where A is the address of the incoming access:

Set = (A7 OR A31) A6 A5 A4

This scheme ORs bits A7 and A31 of the incoming address, concatenating the result with
bits A6 through A4. The resulting 4-bit number is used as the set index. The cache uses
A31…A8 for the tags.
Will this cache be able to correctly distinguish among different addresses and thus
correctly identify hits and misses? Check one:
The cache will correctly identify hits and misses:
The cache will INCORRECTLY identify hits or misses: X
If you answered no, fill in the table below with an example where this cache incorrectly
reports a hit or a miss on a block. For all accesses report the address, the tag, the set, and
whether the access correctly (check mark in the corresponding column) or incorrectly hits
or misses. For incorrect hits report the address that is currently in the block, and for
incorrect misses report where in the cache the block actually is.
Address     Tag       Set  Correct Hit  Correct Miss  Incorrect Hit (what block is in the cache)  Incorrect Miss (where the block is in the cache)
0x80000000  0x800000  0x8               X
0x80000080  0x800000  0x8                             X (the cache holds the block of 0x80000000)
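The faulty index function can be written out to confirm the collision (helper names are mine). Both addresses land in set 8 with the same A31…A8 tag, even though they differ in bit 7, so the second access falsely hits:

```python
# The broken set-index function from part A: (A7 OR A31) concatenated with A6..A4.
def set_index(addr):
    bit = lambda a, i: (a >> i) & 1
    return ((bit(addr, 7) | bit(addr, 31)) << 3) | ((addr >> 4) & 0b111)

def tag(addr):
    return addr >> 8                     # A31..A8

a, b = 0x80000000, 0x80000080            # differ only in bit 7
# Bit 7 is ORed away in the index and is not part of the tag, so it is lost.
```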
B. [2 Marks] Can we use LRU replacement with a skewed associative cache with two
ways? Explain your answer.
We can, by adding a global last-use timestamp to each block, or some other
structure that remembers the ordering across all blocks (not really practical).
Question 4 – Prefetching [9 Marks]
A stride prefetcher uses a PC-indexed scheme and has 16 entries. The following three
pieces of code execute on the processor:

(a) for (i = 0; i < N; i++)
        A[i] = B[i+2] + C[i-1];

(b) for (i = 0; i < N; i++) {
        j = 2 * i;
        A[j] = B[j] + C[j];
    }

(c) while (L != NULL) {
        if (L->key == X) return L;
        L = L->next;
    }
[3 Marks] Fill in the table below with a checkmark per row indicating how likely it is
for the prefetcher to successfully prefetch data for these codes. “MAY work” means that
while in general, the prefetcher is not expected to be successful, there are cases, no matter
how rare, where the prefetcher will be successful.
Code  Prefetcher WILL Work  Prefetcher MAY Work  Prefetcher WON'T Work
(a)   X
(b)   X
(c)                         X
[6 Marks] Now explain your answers.
(a) Sequential accesses to 3 arrays.
(b) Stride-2 accesses to the arrays.
(c) Can work if the data is arranged in such a way that the addresses of L and L->next always
differ by a fixed amount (for example, if the nodes sit in a single, contiguous array and L->next
always points to the next element in this array).
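A minimal PC-indexed stride detector, sketched below (the table layout and confidence rule are assumptions of this sketch, not the exam's design), illustrates why (a) and (b) work: after seeing the same non-zero stride twice for a PC, it starts issuing prefetches. For code (b) with 4-byte elements the stride is a constant 8 bytes:

```python
# One table entry per load PC: [last address, last stride, confident?].
class StridePrefetcher:
    def __init__(self):
        self.table = {}

    def access(self, pc, addr):
        """Record an access; return a prefetch address, or None."""
        e = self.table.get(pc)
        if e is None:
            self.table[pc] = [addr, 0, False]
            return None
        stride = addr - e[0]
        e[2] = (stride == e[1] and stride != 0)   # same stride twice -> confident
        e[0], e[1] = addr, stride
        return addr + stride if e[2] else None

p = StridePrefetcher()
# Code (b): A[j] with j = 2*i and 4-byte elements -> constant 8-byte stride.
hits = [p.access(0x400, 0x1000 + 8 * i) for i in range(4)]
```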
Bimodal (2 Marks)
Actual Direction  State Before  State After  Prediction
1                 01            10           0
0                 10            01           1
1                 01            10           0
0                 10            01           1
1                 01            10           0
0                 10            01           1
Question 5 – Branch Prediction [10 Marks]
Illustrate the worst possible single static branch execution sequence for the following
branch predictors: 1) last-outcome, 2) bimodal (two bit saturating counter), and 3) single
bit global history with two bit saturating counters (2nd level table). “Single static branch”
means that all branch prediction accesses are done by a single branch that executes
multiple times. Use 0 for not-taken and 1 for taken. The last outcome predictor is
initialized to 0 and all counters of the other two predictors are initialized to 01 (weakly
not taken). The last outcome and bimodal predictors have a single entry. The 2-level
predictor has a 1-bit GHR and a 2-entry second level table. The GHR value is used as the
index for the 2nd-level table. Assume that initially GHR = 0.
2-Level (4 Marks)
Actual Direction  GHR Before  GHR After  Entry Index  Entry Before  Entry After  Prediction
1                 0           1          0            01            10           0
1                 1           1          1            01            10           0
0                 1           0          1            10            01           1
0                 0           0          0            10            01           1
1                 0           1          0            01            10           0
1                 1           1          1            01            10           0
0                 1           0          1            10            01           1
0                 0           0          0            10            01           1
Last Outcome (2 Marks)
Actual Direction  State Before  State After  Prediction
1                 0             1            0
0                 1             0            1
1                 0             1            0
0                 1             0            1
1                 0             1            0
0                 1             0            1
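The two single-entry predictors can be replayed in a few lines (the simulation scaffolding is mine; the update rules are the standard ones). Both mispredict every branch on the alternating worst-case sequence:

```python
def last_outcome(seq, state=0):
    """Predict the previous outcome; count mispredictions."""
    wrong = 0
    for actual in seq:
        wrong += (state != actual)
        state = actual
    return wrong

def bimodal(seq, ctr=0b01):
    """2-bit saturating counter starting at weakly not-taken (01)."""
    wrong = 0
    for actual in seq:
        wrong += ((ctr >> 1) != actual)            # predict from the high bit
        ctr = min(ctr + 1, 3) if actual else max(ctr - 1, 0)
    return wrong

assert last_outcome([1, 0, 1, 0, 1, 0]) == 6       # alternating: always wrong
assert bimodal([1, 0, 1, 0, 1, 0]) == 6            # 01 <-> 10: always wrong
```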
Question 6 – Performance Impact of Instruction Set Changes [10 Marks]
A team member has observed that an important application often executes the following
sequence of instructions:
I1        beq r1, r0, Else  # if r1 == 0 goto Else
I2 Then:  inst1             # some non-branch instruction
I3        br After          # goto After (unconditional)
I4 Else:  inst2             # some non-branch instruction
I5 After:
Assuming that perfect branch prediction were possible, the baseline processor would
achieve 1 IPC on this application.
Unfortunately, the branch at I1 is not biased and follows either direction with 50%
probability. The team member suggests introducing predicated execution, where the code
can instead be implemented as follows:
I1 p = cmpeq r1, r0 # p = (r1 == 0)
I2 p: inst1 # some non-branch instruction
I3 !p: inst2 # some non-branch instruction
If implemented, the processor would need three cycles to execute the predicated code
segment. Recall that “p: inst1” means execute inst1 but write its results only if p is true.
A. [5 Marks] If perfect branch prediction was possible, and conditional branches are
20% of all instructions executed (does not include I3), is it possible to improve
performance with predication? Show and explain your derivations.
No. IPC = 1 in both cases, and assuming the same frequency, the number of instructions is greater
in the predicated case:
Originally, a taken branch does 1 extra instruction, and a non-taken branch does 2.
With predication, in both cases 2 extra instructions per branch are done.
If 20% are conditional branches and 50% of these are taken, then 10% more instructions will be
executed (1.1x slowdown).
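The 10% figure, step by step (per 100 dynamic instructions; the bookkeeping names are mine):

```python
# Extra instructions beyond the branch/compare itself, per conditional branch.
branches = 20                      # 20% of 100 instructions are conditional branches
extra_orig = 0.5 * 1 + 0.5 * 2     # taken path: 1 extra inst; not-taken path: 2
extra_pred = 2                     # inst1 and inst2 both always execute
growth = branches * (extra_pred - extra_orig)   # extra instructions with predication
```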
B. [5 Marks] If mispredictions are equally likely on both paths following I1 and all
conditional branches can be converted to predicated form: what is the minimum
misprediction penalty that is needed for predication to improve performance? Show and
explain your derivations.

As seen before, 10% more instructions are executed, so the break-even point is a CPI of 1.1.
20% are conditional branches and 50% of these are mispredicted, so 10% of all the instructions are
mispredicted branches.
The break-even penalty is 1 excess cycle. At that point, predication is as fast as the original
processor. With penalties of 2 cycles or more, predication will be faster.
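The break-even penalty works out as follows (again per 100 original instructions, with assumed variable names):

```python
# Per 100 original instructions.
mispredicted = 100 * 0.20 * 0.50    # 10 mispredicted branches
pred_cycles = 110                   # predication: 10% more instructions at 1 IPC
# Original time: 100 cycles + penalty * mispredicted; set equal to pred_cycles.
penalty = (pred_cycles - 100) / mispredicted
```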
Question 7 – OOO Execution [15 Marks]
Our processor uses MIPS R10K-style dynamic scheduling and has the following pipeline stages:

FETCH (F)           Fetch the instruction
DECODE/RENAME (D)   Attempt to rename and insert in the scheduler
SCHEDULE (S)        Wait for operands to become ready
REGFILE READ (R)    Read input register values
EXECUTE (X)         As in the 5-stage pipeline
MEMORY (M)          As in the 5-stage pipeline
COMPLETE (W)        Speculative writeback
COMMIT/RETIRE (C)   Commit the instruction
Dependent instructions can issue back-to-back since the scheduler knows the expected
latency of each operation. The conventional reservation station design indirectly links producers
with consumers through physical registers. A producer does not directly know who is
waiting for its value. Instead, a producer has to broadcast its destination physical register
to all other reservation stations which they compare with their source physical registers.
This is an expensive operation in terms of latency and energy. A team member is
suggesting an alternate design in hopes of improving latency and energy consumption. In
their proposal, producers directly link with consumers. Specifically, the reservation
stations take the following form:
Inst DST SRC1 SRC2 CONS1 CONS2
Where “inst” is the instruction opcode, DST, SRC1, and SRC2 are the physical register
names for a destination and for up to two source registers. Finally, CONSx are the
reservation station indexes for up to two consuming instructions.
An instruction stalls at the rename stage (stalling all preceding stages) when it cannot link
into the producer’s reservation station (both CONS fields are occupied). It is allowed to
proceed when the producer leaves the reservation station. A producer can leave the
reservation station at the Complete stage, so a consumer waiting on such a producer can
Decode in the same cycle as the producer’s Complete. Only one instruction can be at R,
X, M, W, and C in each cycle.
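The rename-time stall rule can be sketched as follows (the data-structure names are mine; this is an illustration, not the exam's design). With the code of part A, the first load's two CONS slots go to the two following loads, so a third consumer of r8 stalls:

```python
# rs: reservation-station index -> set of CONS slots already claimed (max 2).
def try_link(rs, producer_of, srcs):
    """Try to rename: claim a CONS slot in every in-flight producer's RS.
    Returns False (stall) if any needed producer has both slots occupied."""
    needed = [producer_of[s] for s in srcs if s in producer_of]
    if any(len(rs[p]) == 2 for p in needed):
        return False                      # both CONS fields occupied -> stall
    for p in needed:
        rs[p].add(len(rs[p]))             # claim CONS1, then CONS2
    return True

rs = {1: set()}
producer_of = {"r8": 1}                   # RS 1 holds the first ldw, producing r8
assert try_link(rs, producer_of, ["r8"])  # the second ldw takes CONS1
assert try_link(rs, producer_of, ["r8"])  # the third ldw takes CONS2
assert not try_link(rs, producer_of, ["r8"])  # a third consumer must stall
```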
A. [11 Marks] Show how the following instruction sequence will execute with this new
scheduler. Assume that the processor can fetch, rename and schedule a single instruction
per cycle.
Cycle: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24

ldw r8, 0(r8)    F D S R X M W C
ldw r9, 4(r8)    F D S R X M W C
ldw r7, 8(r8)    F D S R X M W C
add r7, r7, r9   F D S R X M W C
stw r7, 12(r8)   F D S R X M W C
ldw r8, 16(r8)   F D S R X M W C

[The transcript does not preserve which cycle column each stage letter falls in.]
B. [2 Marks] The conventional Register Alias Table maps architectural registers to
physical registers. How will it have to change to allow consumers to find the
corresponding reservation station of a producer?

It will need a new field which points to the corresponding reservation station (and is
cleared when an instruction commits).
C. [2 Marks] What additional information will we need in the ROB to recover the RAT
from branch mispredictions?
No new information is needed.
Question 8 – Coherence [5 Marks]
A 3-processor snoop coherence system uses a bus and an MSI coherence protocol. Each
processor has a private, direct-mapped data cache with a single 16B block. Show what
would be the cache contents for the three processors when the following sequence of
accesses takes place. The first action shows an example of how to fill in the table.
Initially, all caches are empty, their blocks are in the invalid state (I), and the contents are
not relevant (---). P1 performs a read on address 0x10. This results in a miss which brings
the block containing 0x10…0x1F into P1’s cache, which stores it in the shared state (S).
Request         P1 Cache  P2 Cache  P3 Cache  Explanation (states shown are before the request)
P1: READ 0x10   I: ---    I: ---    I: ---    Miss, read 0x10 block and cache in S
P2: READ 0x10   S: 0x10   I: ---    I: ---    Miss, read 0x10 block and cache in S
P3: READ 0x10   S: 0x10   S: 0x10   I: ---    Miss, read 0x10 block and cache in S
P2: WRITE 0x10  S: 0x10   S: 0x10   S: 0x10   Promote to M, invalidate others
P3: READ 0x10   I: ---    M: 0x10   I: ---    Miss, demote P2 to S
P1: READ 0x12   I: ---    S: 0x10   S: 0x10   Miss (same block)
P3: READ 0x14   S: 0x10   S: 0x10   S: 0x10   Hit (same block)
P2: READ 0x20   S: 0x10   S: 0x10   S: 0x10   Miss, bring in block 0x20 to P2 as S
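The table can be replayed with a tiny MSI model (single-block caches; the encoding below is mine, and write-backs to memory are omitted). The final cache contents after all eight requests match the table's last row plus the effect of the final read:

```python
BLOCK = 16
caches = {1: ("I", None), 2: ("I", None), 3: ("I", None)}  # (state, block base)

def access(p, addr, write):
    base = addr - addr % BLOCK
    state, held = caches[p]
    hit = (held == base and state != "I")
    if write:
        for q, (s, t) in caches.items():      # invalidate every other copy
            if q != p and t == base and s != "I":
                caches[q] = ("I", None)
        caches[p] = ("M", base)
    else:
        for q, (s, t) in caches.items():      # a dirty remote copy is demoted to S
            if q != p and t == base and s == "M":
                caches[q] = ("S", base)
        caches[p] = (("M" if hit and state == "M" else "S"), base)
    return hit

for p, addr, w in [(1, 0x10, 0), (2, 0x10, 0), (3, 0x10, 0), (2, 0x10, 1),
                   (3, 0x10, 0), (1, 0x12, 0), (3, 0x14, 0), (2, 0x20, 0)]:
    access(p, addr, w)
```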
Question 9 – Consistency [10 Marks]
A two-node shared memory system uses a clustered approach where each node has a
processor and an associated portion of main memory. The two nodes are connected by a
link. Any processor can access any memory location, whether it is in its local memory or
in the other processor’s memory (remote). The latency for accessing local memory is 5
cycles, while accessing remote memory requires an additional 10 cycles to traverse the
link (in each direction). The memory system and communication link are pipelined, and a
new memory reference can be initiated every cycle. The processors use out-of-order
execution and allow other operations to proceed while there are outstanding memory
requests (issued but not completed). They also use speculative execution and predict
branches.
[Figure: P1 attached to MEM1 and P2 attached to MEM2, each over a 5-cycle-latency
connection; the link between the two nodes has a 10-cycle latency.]
Consider first that the processors execute the following code, where A resides in MEM1
and B resides in MEM2, and initially A = B = 0:

P1: A = 1        P2: B = 2
    … = B            … = A
The following time diagram shows how these accesses may proceed in our system:

P1: A = 1   issued @0, arrives at MEM1 @0, completes @5
P1: … = B   issued @1, arrives at MEM2 @11, completes @16 (value = 2), returns to P1 @26
P2: B = 2   issued @0, arrives at MEM2 @0, completes @5
P2: … = A   issued @1, arrives at MEM1 @11, completes @16 (value = 1), returns to P2 @26
For example, at time 0, processor P1 issues a write to A while P2 issues a write to B.
Both are local memory writes and arrive at their memories immediately. It takes the
memories 5 cycles to complete the requests, so at cycle 5 the writes are complete and the
processors are notified in the same cycle. At cycle 1, P1 issues a read for B; it takes 10
cycles to arrive at MEM2 (cycle 11), another 5 cycles to complete there (cycle 16), and
another 10 cycles to return to P1 at cycle 26.
If the read for B had arrived any time before cycle 5, it would have read the old value of B
and not the one written by P2. That is, a write becomes visible to any access once it
completes in memory.
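The timing rule used throughout this question reduces to simple arithmetic (function names are mine; the constants come from the problem statement):

```python
LOCAL, LINK = 5, 10   # memory service latency; one-way link latency

def complete_at(issue, remote):
    """Cycle at which the memory finishes servicing the request."""
    return issue + (LINK if remote else 0) + LOCAL

def value_back_at(issue, remote):
    """Cycle at which a read's value reaches the issuing processor."""
    return complete_at(issue, remote) + (LINK if remote else 0)
```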
Now consider the following program, where initially flag = Data = 0 and where Data
resides in MEM2 and flag in MEM1:

P1: Data = 200        P2: while (flag == 0);
    flag = 1              … = Data
A. [5 Marks] If the processors issue (not complete) all operations in program order, show
a possible timing of events such that P2 reads flag = 1 and Data = 0:

P1: Data = 200   issued @0, arrives at MEM2 @10, completes @15, completion notice reaches P1 @25
P1: flag = 1     issued @1, completes @6 (local)
P2: read flag    issued @0, arrives at MEM1 @10, completes @15 (reads flag = 1), returns to P2 @25
P2: read Data    issued @1 (local, issued while the flag read is still outstanding), completes @6, reads Data = 0

P2 therefore observes flag = 1 and Data = 0: its Data read completes @6, before the
remote write of Data even arrives at MEM2 @10.
B. [5 Marks] If the processors issue and complete writes in program order but allow
reads to issue out-of-order, show a sequence of events where P2 reads flag = 1 and Data
= 0:

P1: Data = 200   issued @0, arrives at MEM2 @10, completes @15, completion notice reaches P1 @25
P1: flag = 1     issued @26 (only after the Data write is known to have completed, keeping writes in order), completes @31 (local)
P2: read Data    issued @0 (out of order, ahead of the flag read), completes @5, reads Data = 0
P2: read flag    issued @22 (one iteration of the spin loop), arrives at MEM1 @32, completes @37 (reads flag = 1), returns to P2 @47

P2 therefore observes flag = 1 and Data = 0.