TRANSCRIPT
Student # (use if pages get separated)
ECE552 Computer Architecture Pg 1 of 18 Fall2015
UNIVERSITY OF TORONTO FACULTY OF APPLIED SCIENCE AND ENGINEERING
Final EXAMINATION, October 2015
ECE552F – COMPUTER ARCHITECTURE Exam Type: D Duration: 2:30
Examiner – A. Moshovos
Instructions
This is a type D exam. You are allowed to use any printed/hand-written material including your course notes. You
may use a University approved calculator.
Last Name (Print Clearly): _
First Name:
Student Number:
Question Marks Awarded
1 15
2 19
3 10
4 9
5 7
6 10
7 15
8 5
9 10
Total 100
General Instructions: State your assumptions. Show your work. Comment your code. Solutions that are judged
significantly inefficient will lose some marks. The exam is printed on two sides of the page.
Make your answers clear.
There are 9 questions and a total of 100 marks. There are 9 pieces of paper in the exam,
this one included, printed both sides. The page numbering is 1-18.
Reservation Stations:
Inst  Dest PR  Source1 PR  Source2 PR
L0    p32      —           —
L1    p33      p32         —
L2    p34      p32         —
L3    p35      p34         p33
L4    p36      p32         —
L5    —        p36         —
L1’   p37      p36         —
L2’   p38      p36         —
L3’   p39      p38         p37
L4’   p40      p36         —
L5’   —        p40         —
Question 1 – Register Renaming and OOO Execution [12 Marks]
A processor uses register renaming where architectural registers are mapped to physical
registers. Initially, the free list of physical registers contains, in order, P32, P33, …, P127,
and the relevant Register Alias Table (RAT) contents are:
Architectural Register  Physical Register
R8                      P8
R9                      P9
R10                     P10
The following code is to be renamed:
L0: addi r8, r0, 0x200 # r8 = 0 + 0x200
L1: ldw r9, 0x2000(r8) # r9 = mem[r8 + 0x2000]
L2: ldw r10, 0x3000(r8) # r10 = mem[r8 + 0x3000]
L3: add r10, r10, r9 # r10 = r10 + r9
L4: addi r8, r8, -4 # r8 = r8 - 4
L5: bne r8, r0, L1 # if (r8 != 0) goto L1
Fill in the tables below representing the reservation stations and the Reorder Buffer
(ROB) entries after all instructions for the first two loop iterations have been
renamed. Assume that no instruction executes before all instructions have been
renamed. The inst fields are already filled in. The remaining fields refer to the
appropriate architectural (AR) or physical registers (PR). If the instruction does not use a
particular register, write N/A.
L1’ is the first instruction of the second iteration.
ROB:
Inst  Dest AR  Dest. Old PR
L0    r8       p8
L1    r9       p9
L2    r10      p10
L3    r10      p34
L4    r8       p32
L5    —        —
L1’   r9       p33
L2’   r10      p35
L3’   r10      p38
L4’   r8       p36
L5’   —        —
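The renaming above can be replayed mechanically. The sketch below (the helper names are mine, not the exam's) pops physical registers off the free list and records the old mapping exactly as the tables do; note the second iteration re-enters the loop at L1:

```python
def rename(instrs, rat, free_list):
    """instrs: (dest_arch_or_None, [src_archs]).
    Returns rows of (dest AR, new dest PR, old dest PR, source PRs)."""
    rob = []
    for dest, srcs in instrs:
        src_prs = [rat[s] for s in srcs]          # sources read the current mapping
        if dest is None:
            rob.append((None, None, None, src_prs))
        else:
            old_pr = rat[dest]                    # kept in the ROB for recovery
            rat[dest] = free_list.pop(0)          # next free physical register
            rob.append((dest, rat[dest], old_pr, src_prs))
    return rob

rat = {"r8": "p8", "r9": "p9", "r10": "p10"}
free = [f"p{i}" for i in range(32, 128)]
body = [("r8", []),             # L0: addi r8, r0, 0x200 (r0 is not renamed)
        ("r9", ["r8"]),         # L1: ldw r9, 0x2000(r8)
        ("r10", ["r8"]),        # L2: ldw r10, 0x3000(r8)
        ("r10", ["r10", "r9"]), # L3: add r10, r10, r9
        ("r8", ["r8"]),         # L4: addi r8, r8, -4
        (None, ["r8"])]         # L5: bne r8, r0, L1
rob = rename(body + body[1:], rat, free)          # second iteration starts at L1
```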
Question 2 – Virtual Memory [19 Marks]
a. [3 Marks] A Virtually-Indexed Physically-Tagged Data cache is 64KB and uses 64B
blocks. The processor has a 4GB address space and uses 8KB pages. What should the
cache associativity be in order to avoid having to search through multiple sets to correctly
identify synonyms? Explain your answer.
Associativity required = 64KB/8KB = 8
With an associativity of 8, only the address within a page (bits 0..12) is necessary to identify the set. Breakdown: bits 0..5 for the byte inside the line, bits 6..12 for the set. There are 128 sets each with 8 ways, for a total of 1024 blocks.
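The 64KB/8KB arithmetic can be double-checked directly (the variable names below are mine, not the exam's):

```python
# Geometry check: with enough ways, the set index fits inside the page offset.
cache_bytes, block_bytes, page_bytes = 64 * 1024, 64, 8 * 1024

ways = cache_bytes // page_bytes            # associativity needed: 8
sets = cache_bytes // (block_bytes * ways)  # 128 sets
index_bits = sets.bit_length() - 1          # 7 set bits
# 6 block-offset bits + 7 set bits = 13 bits = the 8KB page offset (bits 0..12).
```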
b. A processor uses a 96-entry, 3-way set-associative TLB with LRU replacement and
8KB pages. The following code executes on the processor:
for (i = 0; i < 16k; i++)
c[i] = a[i] + b[i];
When each loop iteration executes, it reads in order a[i] and b[i], and finally writes c[i].
Assume the array elements are each 4 bytes long. Also assume that the arrays a, b, and c
start at 0x10000, 0x20000, and 0x30000 respectively.
b.i) [6 Marks] Fill in the table below with the accesses that the TLB will observe for
the first two loop iterations. Assume that the TLB is empty and that we start filling in
from way 0, then way 1, and finally way 2.
Requested Address  TLB Set  Set in Binary  Hit (Way)  Miss  Page Table Accessed (on TLB miss)
0x10000            8        b01000                    X     X
0x20000            16       b10000                    X     X
0x30000            24       b11000                    X     X
0x10004            8        b01000         0
0x20004            16       b10000         0
0x30004            24       b11000         0
b.ii) [4 Marks] Assuming that initially none of the data pages are allocated in memory,
when all loop iterations have executed, report the following:
Total TLB accesses: 48k
Total TLB misses: 24
Total Page Table Accesses: 24
Total Page Faults: 24
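A direct simulation of the loop's TLB behaviour reproduces these totals (a minimal sketch with assumed names). Each 64KB array spans 8 pages, and the 24 pages map to distinct sets (page number mod 32), so nothing is ever evicted:

```python
# 96-entry, 3-way TLB -> 32 sets; LRU kept as list order (front = LRU victim).
PAGE, SETS, WAYS = 8 * 1024, 32, 3
tlb = {}                                  # set index -> pages in LRU order
accesses = misses = 0
A, B, C = 0x10000, 0x20000, 0x30000
for i in range(16 * 1024):
    for addr in (A + 4 * i, B + 4 * i, C + 4 * i):   # read a[i], b[i]; write c[i]
        accesses += 1
        page = addr // PAGE
        entries = tlb.setdefault(page % SETS, [])
        if page in entries:
            entries.remove(page)          # refresh LRU position on a hit
        else:
            misses += 1
            if len(entries) == WAYS:
                entries.pop(0)            # evict the LRU way
        entries.append(page)
# Each miss costs one page-table access; each of the 24 pages faults once.
```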
c. A processor implementation uses 43-bit physical addresses and 8KB pages. The flat
page table uses 4-byte entries. The design team is considering a hierarchical page table
organization with three levels, each using 10 bits of the incoming 43-bit address:
10 10 10 13
L1 L2 L3 Offset
The L1 and L2 level table entries are also 4B long each, and L3 uses the same entry
format as the flat table.
i) [3 Marks] What is the size in bytes of each subtable (table chunk) at each level?
L1 subtable size = 2^10 × 4 = 4096 bytes
L2 subtable size = 2^10 × 4 = 4096 bytes
L3 subtable size = 2^10 × 4 = 4096 bytes
ii) [2 Marks] What is the maximum size in bytes that the hierarchical page table can
have?
L1: 1 subtable, 4K
L2: 1024 subtables, 4096K
L3: 1024^2 subtables, 4096M
Total: 4,299,165,696 bytes
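The totals for parts (i) and (ii) can be checked with a few lines (assuming every subtable is fully populated, which is what "maximum size" asks for):

```python
# Worst-case hierarchical page table size with 10+10+10-bit indexing, 4B entries.
entry = 4
sub = (2 ** 10) * entry      # 4096 B per subtable at every level
l1 = 1 * sub                 # one root table
l2 = (2 ** 10) * sub         # up to 1024 L2 subtables
l3 = (2 ** 20) * sub         # up to 1024 * 1024 L3 subtables
total = l1 + l2 + l3         # a little over 4 GB in the worst case
```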
Question 3 – Caches [7 Marks]
Assume a 4GB byte-addressable space
A. [5 marks] Due to an error in the design, the direct-mapped cache ended up using the
following set index function, where A is the address of the incoming access:

Set = (A7 OR A31) A6 A5 A4

This scheme ORs bits A7 and A31 of the incoming address, concatenating the result with
bits A6 through A4. The resulting 4-bit number is used as the set index. The cache uses
A31…A8 for the tags.
Will this cache be able to correctly distinguish among different addresses and thus
correctly identify hits and misses? Check one:
The cache will correctly identify hits and misses:
The cache will INCORRECTLY identify hits or misses: X
If you answered no, fill in the table below with an example where this cache incorrectly
reports a hit or a miss on a block. For all accesses report the address, the tag, the set, and
whether the access correctly (check mark in the corresponding column) or incorrectly hits
or misses. For incorrect hits report the address that is currently in the block, and for
incorrect misses report where in the cache the block actually is.
Address     Tag       Set  Correct Hit  Correct Miss  Incorrect Hit (what block is in the cache)  Incorrect Miss (where the block is in the cache)
0x80000000  0x800000  0x8               X
0x80000080  0x800000  0x8                             X (the cache holds the block of 0x80000000)
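The faulty index function can be written out to confirm the collision (helper names are mine). Both addresses land in set 8 with the same A31…A8 tag, even though they differ in bit 7, so the second access falsely hits:

```python
# The broken set-index function from part A: (A7 OR A31) concatenated with A6..A4.
def set_index(addr):
    bit = lambda a, i: (a >> i) & 1
    return ((bit(addr, 7) | bit(addr, 31)) << 3) | ((addr >> 4) & 0b111)

def tag(addr):
    return addr >> 8                     # A31..A8

a, b = 0x80000000, 0x80000080            # differ only in bit 7
# Bit 7 is ORed away in the index and is not part of the tag, so it is lost.
```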
B. [2 Marks] Can we use LRU replacement with a skewed associative cache with two
ways? Explain your answer.
We can, by adding a global last-use timestamp to each block, or some other
structure that remembers the ordering across all blocks (not really practical).
Question 4 – Prefetching [9 Marks]
A stride prefetcher uses a PC-indexed scheme and has 16 entries. The following three
pieces of code execute on the processor:

(a) for (i = 0; i < N; i++)
        A[i] = B[i+2] + C[i-1];

(b) for (i = 0; i < N; i++) {
        j = 2 * i;
        A[j] = B[j] + C[j];
    }

(c) while (L != NULL) {
        if (L->key == X) return L;
        L = L->next;
    }
[3 Marks] Fill in the table below with a checkmark per row indicating how likely it is
for the prefetcher to successfully prefetch data for these codes. “MAY work” means that
while in general, the prefetcher is not expected to be successful, there are cases, no matter
how rare, where the prefetcher will be successful.
Code  Prefetcher WILL Work  Prefetcher MAY Work  Prefetcher WON'T Work
(a)   X
(b)   X
(c)                         X
[6 Marks] Now explain your answers.
(a) Sequential accesses to 3 arrays.
(b) Stride-2 accesses to the arrays.
(c) Can work if the data is arranged in such a way that the addresses of L and L->next always
differ by a fixed amount (for example, if the nodes sit in a single, contiguous array and L->next
always points to the next element in this array).
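A minimal PC-indexed stride detector, sketched below (the table layout and confidence rule are assumptions of this sketch, not the exam's design), illustrates why (a) and (b) work: after seeing the same non-zero stride twice for a PC, it starts issuing prefetches. For code (b) with 4-byte elements the stride is a constant 8 bytes:

```python
# One table entry per load PC: [last address, last stride, confident?].
class StridePrefetcher:
    def __init__(self):
        self.table = {}

    def access(self, pc, addr):
        """Record an access; return a prefetch address, or None."""
        e = self.table.get(pc)
        if e is None:
            self.table[pc] = [addr, 0, False]
            return None
        stride = addr - e[0]
        e[2] = (stride == e[1] and stride != 0)   # same stride twice -> confident
        e[0], e[1] = addr, stride
        return addr + stride if e[2] else None

p = StridePrefetcher()
# Code (b): A[j] with j = 2*i and 4-byte elements -> constant 8-byte stride.
hits = [p.access(0x400, 0x1000 + 8 * i) for i in range(4)]
```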
Bimodal (2 Marks)
Actual Direction  State Before  State After  Prediction
1                 01            10           0
0                 10            01           1
1                 01            10           0
0                 10            01           1
1                 01            10           0
0                 10            01           1
Question 5 – Branch Prediction [10 Marks]
Illustrate the worst possible single static branch execution sequence for the following
branch predictors: 1) last-outcome, 2) bimodal (two bit saturating counter), and 3) single
bit global history with two bit saturating counters (2nd level table). “Single static branch”
means that all branch prediction accesses are done by a single branch that executes
multiple times. Use 0 for not-taken and 1 for taken. The last outcome predictor is
initialized to 0 and all counters of the other two predictors are initialized to 01 (weakly
not taken). The last outcome and bimodal predictors have a single entry. The 2-level
predictor has a 1-bit GHR and a 2-entry second level table. The GHR value is used as the
index for the 2nd-level table. Assume that initially GHR = 0.
2-Level (4 Marks)
Actual Direction  GHR Before  GHR After  Entry Index  Entry Before  Entry After  Prediction
1                 0           1          0            01            10           0
1                 1           1          1            01            10           0
0                 1           0          1            10            01           1
0                 0           0          0            10            01           1
1                 0           1          0            01            10           0
1                 1           1          1            01            10           0
0                 1           0          1            10            01           1
0                 0           0          0            10            01           1
Last Outcome (2 Marks)
Actual Direction  State Before  State After  Prediction
1                 0             1            0
0                 1             0            1
1                 0             1            0
0                 1             0            1
1                 0             1            0
0                 1             0            1
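The two single-entry predictors can be replayed in a few lines (the simulation scaffolding is mine; the update rules are the standard ones). Both mispredict every branch on the alternating worst-case sequence:

```python
def last_outcome(seq, state=0):
    """Predict the previous outcome; count mispredictions."""
    wrong = 0
    for actual in seq:
        wrong += (state != actual)
        state = actual
    return wrong

def bimodal(seq, ctr=0b01):
    """2-bit saturating counter starting at weakly not-taken (01)."""
    wrong = 0
    for actual in seq:
        wrong += ((ctr >> 1) != actual)            # predict from the high bit
        ctr = min(ctr + 1, 3) if actual else max(ctr - 1, 0)
    return wrong

assert last_outcome([1, 0, 1, 0, 1, 0]) == 6       # alternating: always wrong
assert bimodal([1, 0, 1, 0, 1, 0]) == 6            # 01 <-> 10: always wrong
```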
Question 6 – Performance Impact of Instruction Set Changes [10 Marks]
A team member has observed that an important application often executes the following
sequence of instructions:
I1        beq r1, r0, Else  # if r1 == 0 goto Else
I2 Then:  inst1             # some non-branch instruction
I3        br After          # goto After (unconditional)
I4 Else:  inst2             # some non-branch instruction
I5 After:
Assuming that perfect branch prediction were possible, the baseline processor would
achieve 1 IPC on this application.
Unfortunately, the branch at I1 is not biased and follows either direction with 50%
probability. The team member suggests introducing predicated execution, where the code
can instead be implemented as follows:
I1 p = cmpeq r1, r0 # p = (r1 == 0)
I2 p: inst1 # some non-branch instruction
I3 !p: inst2 # some non-branch instruction
If implemented, the processor would need three cycles to execute the predicated code
segment. Recall that “p: inst1” means execute inst1 but write its results only if p is true.
A. [5 Marks] If perfect branch prediction was possible, and conditional branches are
20% of all instructions executed (does not include I3), is it possible to improve
performance with predication? Show and explain your derivations.
No. IPC = 1 in both cases, and assuming the same frequency, the number of instructions is greater
in the predicated case:
Originally, a taken branch does 1 extra instruction, and a non-taken branch does 2.
With predication, in both cases 2 extra instructions per branch are done.
If 20% are conditional branches and 50% of these are taken, then 10% more instructions will be
executed (1.1x slowdown).
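The 10% figure, step by step (per 100 dynamic instructions; the bookkeeping names are mine):

```python
# Extra instructions beyond the branch/compare itself, per conditional branch.
branches = 20                      # 20% of 100 instructions are conditional branches
extra_orig = 0.5 * 1 + 0.5 * 2     # taken path: 1 extra inst; not-taken path: 2
extra_pred = 2                     # inst1 and inst2 both always execute
growth = branches * (extra_pred - extra_orig)   # extra instructions with predication
```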
B. [5 Marks] If mispredictions are equally likely on both paths following I1 and all
conditional branches can be converted to predicated form: what is the minimum
misprediction penalty that is needed for predication to improve performance? Show and
explain your derivations.

As seen before, 10% more instructions are executed, so the break-even point is a CPI of 1.1.
20% are conditional branches and 50% of these are mispredicted, so 10% of all the instructions are
mispredicted branches.
The break-even penalty is 1 excess cycle. At that point, predication is as fast as the original
processor. With penalties of 2 cycles or more, predication will be faster.
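The break-even penalty works out as follows (again per 100 original instructions, with assumed variable names):

```python
# Per 100 original instructions.
mispredicted = 100 * 0.20 * 0.50    # 10 mispredicted branches
pred_cycles = 110                   # predication: 10% more instructions at 1 IPC
# Original time: 100 cycles + penalty * mispredicted; set equal to pred_cycles.
penalty = (pred_cycles - 100) / mispredicted
```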
Question 7 – OOO Execution [15 Marks]
Our processor uses MIPS R10K-style dynamic scheduling and has the following pipeline stages:

FETCH (F)           Fetch the instruction
DECODE/RENAME (D)   Attempt to rename and insert in the scheduler
SCHEDULE (S)        Wait for operands to become ready
REGFILE READ (R)    Read input register values
EXECUTE (X)         As in the 5-stage pipeline
MEMORY (M)          As in the 5-stage pipeline
COMPLETE (W)        Speculative writeback
COMMIT/RETIRE (C)   Commit the instruction
Dependent instructions can issue back-to-back since the scheduler knows the expected
latency of each operation. The conventional reservation station design indirectly links producers
with consumers through physical registers. A producer does not directly know who is
waiting for its value. Instead, a producer has to broadcast its destination physical register
to all other reservation stations which they compare with their source physical registers.
This is an expensive operation in terms of latency and energy. A team member is
suggesting an alternate design in hopes of improving latency and energy consumption. In
their proposal, producers directly link with consumers. Specifically, the reservation
stations take the following form:
Inst DST SRC1 SRC2 CONS1 CONS2
Where “inst” is the instruction opcode, DST, SRC1, and SRC2 are the physical register
names for a destination and for up to two source registers. Finally, CONSx are the
reservation station indexes for up to two consuming instructions.
An instruction stalls at the rename stage (stalling all preceding stages) when it cannot link
into the producer’s reservation station (both CONS fields are occupied). It is allowed to
proceed when the producer leaves the reservation station. A producer can leave the
reservation station at the Complete stage, so a consumer waiting on such a producer can
Decode in the same cycle as the producer’s Complete. Only one instruction can be at R,
X, M, W, and C in each cycle.
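The rename-time stall rule can be sketched as follows (the data-structure names are mine; this is an illustration, not the exam's design). With the code of part A, the first load's two CONS slots go to the two following loads, so a third consumer of r8 stalls:

```python
# rs: reservation-station index -> set of CONS slots already claimed (max 2).
def try_link(rs, producer_of, srcs):
    """Try to rename: claim a CONS slot in every in-flight producer's RS.
    Returns False (stall) if any needed producer has both slots occupied."""
    needed = [producer_of[s] for s in srcs if s in producer_of]
    if any(len(rs[p]) == 2 for p in needed):
        return False                      # both CONS fields occupied -> stall
    for p in needed:
        rs[p].add(len(rs[p]))             # claim CONS1, then CONS2
    return True

rs = {1: set()}
producer_of = {"r8": 1}                   # RS 1 holds the first ldw, producing r8
assert try_link(rs, producer_of, ["r8"])  # the second ldw takes CONS1
assert try_link(rs, producer_of, ["r8"])  # the third ldw takes CONS2
assert not try_link(rs, producer_of, ["r8"])  # a third consumer must stall
```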
A. [11 Marks] Show how the following instruction sequence will execute with this new
scheduler. Assume that the processor can fetch, rename and schedule a single instruction
per cycle.
Cycle: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24

ldw r8, 0(r8)    F D S R X M W C
ldw r9, 4(r8)    F D S R X M W C
ldw r7, 8(r8)    F D S R X M W C
add r7, r7, r9   F D S R X M W C
stw r7, 12(r8)   F D S R X M W C
ldw r8, 16(r8)   F D S R X M W C

[The transcript does not preserve which cycle column each stage letter falls in.]
B. [2 Marks] The conventional Register Alias Table maps architectural registers to
physical registers. How will it have to change to allow consumers to find the
corresponding reservation station of a producer?

It will need a new field which points to the corresponding reservation station (and is
cleared when an instruction commits).
C. [2 Marks] What additional information will we need in the ROB to recover the RAT
from branch mispredictions?
No new information is needed.
Question 8 – Coherence [5 Marks]
A 3-processor snoop coherence system uses a bus and an MSI coherence protocol. Each
processor has a private, direct-mapped data cache with a single 16B block. Show what
would be the cache contents for the three processors when the following sequence of
accesses takes place. The first action shows an example of how to fill in the table.
Initially, all caches are empty, their blocks are in the invalid state (I), and the contents are
not relevant (---). P1 performs a read on address 0x10. This results in a miss which brings
the block containing 0x10…0x1F into P1’s cache, which stores it in the shared state (S).
Request         P1 Cache  P2 Cache  P3 Cache  Explanation (states shown are before the request)
P1: READ 0x10   I: ---    I: ---    I: ---    Miss, read 0x10 block and cache in S
P2: READ 0x10   S: 0x10   I: ---    I: ---    Miss, read 0x10 block and cache in S
P3: READ 0x10   S: 0x10   S: 0x10   I: ---    Miss, read 0x10 block and cache in S
P2: WRITE 0x10  S: 0x10   S: 0x10   S: 0x10   Promote to M, invalidate others
P3: READ 0x10   I: ---    M: 0x10   I: ---    Miss, demote P2 to S
P1: READ 0x12   I: ---    S: 0x10   S: 0x10   Miss (same block)
P3: READ 0x14   S: 0x10   S: 0x10   S: 0x10   Hit (same block)
P2: READ 0x20   S: 0x10   S: 0x10   S: 0x10   Miss, bring in block 0x20 to P2 as S
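The table can be replayed with a tiny MSI model (single-block caches; the encoding below is mine, and write-backs to memory are omitted). The final cache contents after all eight requests match the table's last row plus the effect of the final read:

```python
BLOCK = 16
caches = {1: ("I", None), 2: ("I", None), 3: ("I", None)}  # (state, block base)

def access(p, addr, write):
    base = addr - addr % BLOCK
    state, held = caches[p]
    hit = (held == base and state != "I")
    if write:
        for q, (s, t) in caches.items():      # invalidate every other copy
            if q != p and t == base and s != "I":
                caches[q] = ("I", None)
        caches[p] = ("M", base)
    else:
        for q, (s, t) in caches.items():      # a dirty remote copy is demoted to S
            if q != p and t == base and s == "M":
                caches[q] = ("S", base)
        caches[p] = (("M" if hit and state == "M" else "S"), base)
    return hit

for p, addr, w in [(1, 0x10, 0), (2, 0x10, 0), (3, 0x10, 0), (2, 0x10, 1),
                   (3, 0x10, 0), (1, 0x12, 0), (3, 0x14, 0), (2, 0x20, 0)]:
    access(p, addr, w)
```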
Question 9 – Consistency [10 Marks]
A two-node shared memory system uses a clustered approach where each node has a
processor and an associated portion of main memory. The two nodes are connected by a
link. Any processor can access any memory location, whether it is in its local memory or
in the other processor’s memory (remote). The latency for accessing local memory is 5
cycles, while accessing remote memory requires an additional 10 cycles to traverse the
link (in each direction). The memory system and communication link are pipelined, and a
new memory reference can be initiated every cycle. The processors use out-of-order
execution and allow other operations to proceed while there are outstanding memory
requests (issued but not completed). They also use speculative execution and predict
branches.
[Figure: P1 attached to MEM1 and P2 attached to MEM2, each over a 5-cycle-latency
connection; the link between the two nodes has a 10-cycle latency.]
Consider first that the processors execute the following code, where A resides in MEM1
and B resides in MEM2, and initially A = B = 0:

P1: A = 1        P2: B = 2
    … = B            … = A
The following time diagram shows how these accesses may proceed in our system:

P1: A = 1   issued @0, arrives at MEM1 @0, completes @5
P1: … = B   issued @1, arrives at MEM2 @11, completes @16 (value = 2), returns to P1 @26
P2: B = 2   issued @0, arrives at MEM2 @0, completes @5
P2: … = A   issued @1, arrives at MEM1 @11, completes @16 (value = 1), returns to P2 @26
For example, at time 0, processor P1 issues a write to A while P2 issues a write to B.
Both are local memory writes and arrive at their memories immediately. It takes the
memories 5 cycles to complete the requests, so at cycle 5 the writes are complete and the
processors are notified in the same cycle. At cycle 1, P1 issues a read for B; it takes 10
cycles to arrive at MEM2 (cycle 11), another 5 cycles to complete there (cycle 16), and
another 10 cycles to return to P1 at cycle 26.
If the read for B had arrived any time before cycle 5, it would have read the old value of B
and not the one written by P2. That is, a write becomes visible to any access once it
completes in memory.
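The timing rule used throughout this question reduces to simple arithmetic (function names are mine; the constants come from the problem statement):

```python
LOCAL, LINK = 5, 10   # memory service latency; one-way link latency

def complete_at(issue, remote):
    """Cycle at which the memory finishes servicing the request."""
    return issue + (LINK if remote else 0) + LOCAL

def value_back_at(issue, remote):
    """Cycle at which a read's value reaches the issuing processor."""
    return complete_at(issue, remote) + (LINK if remote else 0)
```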
Now consider the following program, where initially flag = Data = 0 and where Data
resides in MEM2 and flag in MEM1:

P1: Data = 200        P2: while (flag == 0);
    flag = 1              … = Data
A. [5 Marks] If the processors issue (not complete) all operations in program order, show
a possible timing of events such that P2 reads flag = 1 and Data = 0:

P1: Data = 200   issued @0, arrives at MEM2 @10, completes @15, completion notice reaches P1 @25
P1: flag = 1     issued @1, completes @6 (local)
P2: read flag    issued @0, arrives at MEM1 @10, completes @15 (reads flag = 1), returns to P2 @25
P2: read Data    issued @1 (local, issued while the flag read is still outstanding), completes @6, reads Data = 0

P2 therefore observes flag = 1 and Data = 0: its Data read completes @6, before the
remote write of Data even arrives at MEM2 @10.
B. [5 Marks] If the processors issue and complete writes in program order but allow
reads to issue out-of-order, show a sequence of events where P2 reads flag = 1 and Data
= 0:

P1: Data = 200   issued @0, arrives at MEM2 @10, completes @15, completion notice reaches P1 @25
P1: flag = 1     issued @26 (only after the Data write is known to have completed, keeping writes in order), completes @31 (local)
P2: read Data    issued @0 (out of order, ahead of the flag read), completes @5, reads Data = 0
P2: read flag    issued @22 (one iteration of the spin loop), arrives at MEM1 @32, completes @37 (reads flag = 1), returns to P2 @47

P2 therefore observes flag = 1 and Data = 0.