ece243 final exam condensed notes - exams.skule.ca

ECE243

Final Exam

Condensed Notes (ARM/C/memory)

Caution:

This document is NOT organized in lecture order.

It is based on Professor Rose's and Professor Brown's lectures in the 2020S semester.

It does not include any content on Verilog or the Lab 1/2 processors.

Lab-related content and test revision content are not included here.

The contents of the lectures can vary every year.

It is written as a revision document, so do NOT regard it as an alternative to the lectures.

The lectures are very important and valuable, so DO NOT skip any lecture.


DATA REPRESENTATION

• Binary
  o Data is represented with the digits 0 and 1
  o (1001) in binary is equivalent to 8*1+4*0+2*0+1*1=9

• Decimal

o Data is represented in digits between 0 and 9

• 2's Complement

  o Represents negative numbers
  o To negate: flip the digits and add 1!
  o (1011) in 2's complement is the negative of (0101) in binary
    ▪ 0101 (5) -> 1010 -> 1011 (-5)
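The flip-and-add-1 rule can be sketched in C. This is only an illustration: the function name `twos_complement_negate` and the `bits` parameter are made up for this example, and `bits` is assumed to be between 1 and 31.

```c
#include <stdint.h>

/* Negate a value in an n-bit two's complement system:
 * flip the bits, add 1, and keep only the low n bits.
 * `bits` is assumed to be between 1 and 31. */
uint32_t twos_complement_negate(uint32_t value, unsigned bits)
{
    uint32_t mask = (1u << bits) - 1u;   /* e.g. 0xF for 4 bits */
    return ((~value) + 1u) & mask;
}
```

With 4 bits, negating 0101 (5) gives 1011, and negating 1011 gives 0101 back.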

• Floating point

  o NOT covered in 2020S
  o IEEE Single precision (IEEE754)
  o http://cse.hcmut.edu.vn/~hungnq/courses/501120/docthem/Single%20precision%20floating-point%20format%20-%20Wikipedia.pdf

Field: Sign | Exponent (8 bits) | Fraction (23 bits)
Bits:  31   | 30-23             | 22-0

  o Value = $(-1)^{sign} \left(1 + \sum_{i=1}^{23} b_{-i} \, 2^{-i}\right) \cdot 2^{e-127}$
    ▪ Where e is the biased exponent, sign is the sign bit, and $b_{-i}$ is the fraction bit that is i bits away from the exponent
      • $b_{-1}$ is bit 22 of the data, which is 1 bit away from the exponent

  o (12.375) in decimal = 12 + 0.375 = (1100) in binary + (0.011) in binary = (1100.011) in binary
    ▪ IEEE754 requires (1.XXXXXXX) x 2^e format
    ▪ (1.100011) x 2^3
      • Exponent is 3 -> biased form (3 + 127 = 130) -> 1000 0010
      • Fraction is 100011
      • 32-bit binary32 -> 0 - 10000010 - 10001100000000000000000
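The 12.375 example can be checked by pulling the three fields out of a `float`. This is a small sketch that assumes the C compiler stores `float` as IEEE754 binary32 (true on the DE1-SoC ARM and on most PCs); the helper names are made up for this example.

```c
#include <stdint.h>
#include <string.h>

/* Copy the raw 32 bits of a float into an integer (assumes binary32). */
uint32_t float_to_bits(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return bits;
}

uint32_t sign_field(uint32_t bits)     { return bits >> 31; }           /* bit 31     */
uint32_t exponent_field(uint32_t bits) { return (bits >> 23) & 0xFFu; } /* bits 30-23 */
uint32_t fraction_field(uint32_t bits) { return bits & 0x7FFFFFu; }     /* bits 22-0  */
```

For 12.375 the fields are sign 0, exponent 130 (1000 0010), and fraction 100011 followed by zeros, which is 0x41460000 as one word.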

• Hexadecimal

o Data is represented in digits between 0-9, A-F o (1FB3) in hexadecimal is equivalent to 1*16^3+15*16^2+11*16+3*1=8,115

Decimal:     0 1 2 3 4 5 6 7 8 9 (10) (11) (12) (13) (14) (15)
Hexadecimal: 0 1 2 3 4 5 6 7 8 9  A    B    C    D    E    F

• ASCII

  o Numerical codes for letters, numbers, and symbols from keyboard input
  o One ASCII character = one byte
  o Little-endian -> the least significant byte is stored at the lowest address, so memory is read starting from the least significant byte
    ▪ Data .word 0x44454546 //44 = 'D' 45 = 'E' 46 = 'F' -> reads as "FEED" in memory because of little endian
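The "FEED" example can be reproduced with shifts, so it does not depend on the endianness of the machine running it. The function name `word_to_little_endian_string` is hypothetical; it only models how a little-endian memory lays out the four bytes of a word.

```c
#include <stdint.h>

/* Write the four bytes of `word` into out[0..3] in little-endian memory
 * order (least significant byte at the lowest address), then terminate
 * the string so it can be printed. */
void word_to_little_endian_string(uint32_t word, char out[5])
{
    for (int i = 0; i < 4; i++) {
        out[i] = (char)((word >> (8 * i)) & 0xFFu);
    }
    out[4] = '\0';
}
```

Calling it with 0x44454546 fills the buffer with 'F', 'E', 'E', 'D'.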



INTRODUCTION TO ARM ASSEMBLY

• Address
  o 32-bit
    ▪ 4 x 8-bit = 4 bytes = one (32-bit) word
      • Every 8 bits (one byte) gets its own address
    ▪ We can access 32 bits in a single memory access
  o 2^32 memory locations -> 2^32 ≈ 4.3E9 (4GB)
    ▪ DE1-SoC has 1GByte DDR3 memory = 256M words
  o Memory address in DE1-SoC
    ▪ Main memory - 0 ~ 3FFF FFFF (hexadecimal)
      • Leave some for I/O devices
    ▪ Memory-mapped I/O
      • LEDR9-0 - FF20 0000
      • HEX3-0 - FF20 0020
      • HEX5-4 - FF20 0030
      • SW9-0 - FF20 0040
      • KEY3-0 - FF20 0050
      • A9 Timer - FFFE C600

• ARM Registers

  o 16 registers (R0-R15)
    ▪ R0-R12 are used to store/process data
      • General purpose registers
    ▪ R13-R15 have special purposes
    ▪ R13 → Stack pointer (SP)
      • Contains address of current item at the top of stack
      • Stack grows downwards towards smaller addresses
    ▪ R14 → Link register (LR)
      • Used to know where to return after subroutine/function/procedure/method is done
      • MOV PC,LR to return back to the calling routine after subroutine is done
    ▪ R15 → Program counter (PC)
      • The address of next instruction to be executed
  o CPSR - "Current Program Status Register"
• Subroutines
  o Serves the same role as a function (C) or method (C++)
  o Performs 'certain work' when called by main routine
  o BL calls (branches to) the subroutine
    ▪ LR gets the address of the first instruction after BL
    ▪ PC gets the address of the subroutine
  o MOV PC, LR to return
    ▪ Return to the first instruction after BL (or wherever LR says)

MAIN:   MOV R0, #10
        BL MY_SUB       //branch into MY_SUB
                        //LR gets the address of the instruction after BL (first line of NEXT)
                        //PC gets MY_SUB
NEXT:   MOV R1, #2
        …
        B END
END:    B END

MY_SUB: MOV R1, R0
        ADD R1, R0
        MOV PC, LR      //executes next code after BL MY_SUB (the first line of NEXT)

  o Subroutine doesn't know which registers were being used by "calling" code
    ▪ In general, it should "save" those registers that it wants to use (somewhere) and "restore" them to their original values afterwards
    ▪ We will save/restore registers on the system 'stack'
      • A data structure for saving and restoring
      • Push an item onto a stack
      • Pop an item off stack

• Stack operation - refer to ARM ASSEMBLY INSTRUCTIONS below
  o Every program must initialize stack pointer
    ▪ MOV SP, #0x200000
  o To 'push' a single (4-byte) word onto the stack,
    ▪ Decrease stack pointer by 4 (SUB SP,#4)
    ▪ Store value (say it is in R1) at that location (STR R1,[SP])
    ▪ To make it easier,
      • STR R1,[SP,#-4]!
      • PUSH {R1}
  o To pop the item on top of the stack off and put it into R2
    ▪ Load item on top of stack (SP) into R2 (LDR R2,[SP])
    ▪ Add 4 to stack pointer (ADD SP,#4)
    ▪ To make it easier,
      • LDR R2,[SP],#4
      • POP {R2}
  o Analogy - how we put and remove a 'stack' of books
    ▪ We put a book at the top of the stack (PUSH)
    ▪ We remove the book at the top of the stack (POP)
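The push/pop steps above can be modelled in C with an array standing in for memory. This is a sketch only: `stack_model`, `STACK_WORDS`, and the function names are made up, the stack pointer is a word index rather than a byte address, and there are no overflow/underflow checks.

```c
#include <stdint.h>

#define STACK_WORDS 16

/* A tiny model of the ARM stack: sp indexes the item on top of the stack,
 * and the stack grows toward smaller indices (smaller addresses). */
typedef struct {
    uint32_t mem[STACK_WORDS];
    int sp;                      /* word index of the current top of stack */
} stack_model;

/* Empty stack: sp points just past the highest word. */
void stack_init(stack_model *s)
{
    s->sp = STACK_WORDS;
}

void stack_push(stack_model *s, uint32_t value)
{
    s->sp -= 1;                  /* SUB SP, #4   (one word)  */
    s->mem[s->sp] = value;       /* STR value, [SP]          */
}

uint32_t stack_pop(stack_model *s)
{
    uint32_t value = s->mem[s->sp];  /* LDR value, [SP] */
    s->sp += 1;                      /* ADD SP, #4      */
    return value;
}
```

Pushing 1 then 2 and popping twice returns 2 then 1, like the pile of books.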

• ARM convention for sending information to/from subroutines
  o Parameters passed to subroutine should be placed in registers R0-R3
    ▪ "Caller saved register" -> caller refers to code calling subroutine
    ▪ It is okay to change value of R0-R3
      • Don't have to save/restore
    ▪ If we need to use more than 4 parameters -> use stack
  o It is NOT okay to change R4 - R12 within subroutine
    ▪ "Callee saved register" -> callee refers to the subroutine
    ▪ If they have to be changed, they need to be saved and restored
  o A single result from subroutine is put into R0
    ▪ Use stack if more results need to be returned

Routine    | R0-R3 | R4-R12                                   | PUSH/POP used                                                  | PUSH/POP not used
MAIN       | O     | O                                        | R0-R12 can be used                                             | R0-R12 can be used
SUBROUTINE | O     | X - need to be saved/restored to be used | PUSH {R4-R12} (R4-R12 can then be used freely), POP {R4-R12}   | R0-R3 can be used, R4-R12 CANNOT be used
MAIN       | O     | O                                        | R0-R12 can be used                                             | R0-R12 can be used


ARM ASSEMBLY INSTRUCTION

• How each instruction is shaped
  o Assembler knows how to create the encoding (Op2) for any given constant -> gives an error if it can't

• Move instruction (MOV)

MOV Rd, Operand2 //Rd <- Operand2

  o Rd - Destination register
  o Operand2 - very flexible
    ▪ Register (R0-R15)
    ▪ Immediate (Constant number)
    ▪ Other values…
  o Each kind of MOV is encoded in 32 bits

• Load instruction (LDR)

  o Look/read from memory & put into register
  o Load 32-bit number from memory

Address mode            | Usage            | Meaning
Index                   | LDR R2, [R1]     | R2 <- Data from memory at address [R1]
Index + offset          | LDR R2, [R1,#4]  | R2 <- Data from memory at address [R1+4]
Pre-index mode          | LDR R2, [R1,#4]! | R2 <- Data from memory at address [R1+4], then update R1<-R1+4
Post-index mode         | LDR R2, [R1],#4  | R2 <- Data from memory at address [R1], then update R1<-R1+4
Conditional instruction | LDRNE R2, [R1]   | R2 <- Data from memory at address [R1], if and only if Z=0 (not equal to zero)

  o R1 is called "base" register
  o Can also have offset in register

LDR R2,[R1,R3] //R2 <- Data from memory at address [R1+R3]

• Manipulating bytes
  o Put -B after instruction
    ▪ LDRB, STRB, …
    ▪ 1 byte = 8 bits, can hold -128 to 127 (signed integer)

• Store instruction (STR)
  o Store/write to memory instruction
    ▪ STRB - store byte
    ▪ STR - store word

STR R2,[R1,#4] //Store value of R2 in the memory at address [R1+4]

• Condition code flags

Flag | 0                            | 1
Z    | Result not 0                 | Result = 0
C    | Carryout is 0 (no carryout)  | Carryout is 1
N    | Result is positive (or zero) | Result is negative
V    | Result is not overflowed     | Result is overflowed

  o Condition code flags are affected when there is -S after the instruction
    ▪ SUBS, ADDS, MOVS, ANDS…
  o Condition code flags affect branch instructions
    ▪ BEQ, BNE, BLT, ….
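How the four flags come out of an addition can be sketched in C. This models only what ADDS would do to N, Z, C, and V; the `flags` struct and `adds_flags` name are made up for this example.

```c
#include <stdint.h>

typedef struct {
    int n;  /* 1 if result is negative (bit 31 set)  */
    int z;  /* 1 if result is zero                   */
    int c;  /* 1 if there is a carry out of bit 31   */
    int v;  /* 1 if signed overflow happened         */
} flags;

/* Flags that an ADDS of a and b would produce. */
flags adds_flags(uint32_t a, uint32_t b)
{
    uint32_t r = a + b;          /* 32-bit result, wraps around like the ALU */
    flags f;
    f.n = (int)(r >> 31);
    f.z = (r == 0);
    f.c = (r < a);               /* wraparound means a carry out occurred */
    /* overflow: a and b have the same sign, but the result's sign differs */
    f.v = (int)(((~(a ^ b)) & (a ^ r)) >> 31);
    return f;
}
```

For example, 0x7FFFFFFF + 1 sets N and V (largest positive number overflows into a negative), while 0xFFFFFFFF + 1 sets Z and C.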

• Branch instruction (B)

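  o Branch to another instruction (label) in the program
  o Condition code bits make lots of possible branch conditions
    ▪ "result" is the result of the previous flag-setting instruction (e.g. the subtraction done by CMP); BLT/BLE/BGE/BGT treat it as a signed number

Instruction | Branch if
BEQ         | Z=1
BNE         | Z=0
BLT         | result<0
BLE         | result≤0
BGE         | result≥0
BGT         | result>0
BL          | Branch to subroutine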

• Compare instruction (CMP)

  o Subtraction operation that does not change any register, but changes the condition codes
    ▪ Does not require the suffix (-S) to modify condition flags

CMP R0,R1 //Compare R0 and R1 and set appropriate condition codes

CMP R0,#4 //Compare R0 and 4 and set appropriate condition codes

• PUSH/POP - refer to "stack operation" above

  o Used to save and restore registers in subroutines
  o PUSH - push word(s) onto stack
    ▪ PUSH {R0-R5}
  o POP - pop word(s) from stack
    ▪ POP {R0-R5}


• Arithmetic instructions
  o Addition instruction (ADD)

ADD Rd, Rn, Operand2 // Rd <- Rn + Operand2

    ▪ Rn can be omitted if Rd and Rn are same
    ▪ ADDS to affect condition codes

  o Subtraction instruction (SUB)

SUB Rd, Rn, Operand2 // Rd <- Rn - Operand2

  o Multiplication instruction (MUL)

MUL Rd, Rn1, Rn2 // Rd <- Rn1 * Rn2

  o Multiply accumulate (MLA)

MLA Rd, Rn1, Rn2, Rn3 // Rd <- Rn1 * Rn2 + Rn3

  o Floating point computation is handled separately!
    ▪ Not covered in this course
  o There is NO integer divide instruction

• Logic instructions

o Logical AND instruction (AND)

AND Rd, Rn, Op2 //Rd <- Rn * Op2 (AND of each bit)

o Logical OR instruction (ORR)

ORR Rd, Rn, Op2 //Rd <- Rn + Op2 (OR of each bit)

o Exclusive OR instruction (EOR)

EOR Rd, Rn, Op2 //Rd <- Rn ⊕ Op2 (XOR of each bit)


• Shift/Rotate instructions
  o Used to move bits into other bit positions
  o Logical shift right (LSR) instruction

LSR Rd, Rn, Op2 //Rd <- Shift register Rn right by Op2 (or divide by 2^Op2)

o Arithmetic shift right (ASR) instruction

ASR Rd, Rn, Op2 //Similar to LSR but most significant bit stays as before the shift

o Logical shift left (LSL) instruction

LSL Rd, Rn, Op2 //shift left by Op2 bits (multiply by 2^Op2)

o Rotate Right (ROR) instruction

ROR Rd, Rn, Op2 //rotate right by Op2 bits
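The four shift/rotate operations can be mimicked in C on 32-bit values. This is a sketch with made-up function names; the shift amount `n` is assumed to be between 0 and 31, and ASR is written out by hand because right-shifting a negative signed value is implementation-defined in C.

```c
#include <stdint.h>

/* LSR: zeros are shifted in from the left. */
uint32_t lsr32(uint32_t x, unsigned n) { return x >> n; }

/* LSL: zeros are shifted in from the right. */
uint32_t lsl32(uint32_t x, unsigned n) { return x << n; }

/* ASR: copies of the original most significant bit are shifted in. */
uint32_t asr32(uint32_t x, unsigned n)
{
    uint32_t shifted = x >> n;
    if ((x & 0x80000000u) && n > 0) {
        shifted |= ~(0xFFFFFFFFu >> n);   /* fill the top n bits with 1s */
    }
    return shifted;
}

/* ROR: bits that fall off the right end re-enter on the left. */
uint32_t ror32(uint32_t x, unsigned n)
{
    n &= 31u;
    if (n == 0) {
        return x;
    }
    return (x >> n) | (x << (32u - n));
}
```

For example, ASR of 0xFFFFFFF8 (-8) by 1 gives 0xFFFFFFFC (-4), while LSR of the same value would give 0x7FFFFFFC.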

INPUT/OUTPUT

• Memory mapped Input/Output
  o Load (LDR) - load data from memory
  o Store (STR) - store data to memory
  o Idea: 'special' memory addresses are used to transmit data from/to input/output devices
    ▪ Input devices (SW/KEY) -> LDR from input
    ▪ Output devices (LEDR/HEX) -> STR to output

LDR R0, =0xFF200000 //base address (LEDR)

LDR R1, [R0, #0x40] //load from switches

STR R1, [R0] //store to LEDR


• Polling VS Interrupt-driven synchronization of Input & Output
  o Polling - processor "asks" device if it has new input, or has finished transmitting previous output, so it (processor) can proceed
    ▪ Asks over and over again -> inefficient!
  o Interrupt-driven - processor stops what it is doing & runs different code to "service the interrupt"
• Parallel-port interface
  o Each memory address (register) of the port serves a different purpose
  o Example - KEY parallel port
    ▪ Data register (FF200050)
      • Determines if button is pressed
    ▪ Edgecapture (FF20005C)
      • When button is released, corresponding bit in edgecapture register becomes 1
      • Remains that way until program resets it
        • Write/store 1 into that bit

  o Measuring time
    ▪ In hardware, we used CLOCK_50 and a down counter (1 second = 50 million counts)
    ▪ In ARM assembly, we set up the 'A9 timer'
      • Clocked at 200 MHz
      • 4 parallel-port registers
    ▪ Software-controlled hardware timer

1. Put the value you want to count down from -> into the LOAD register
   a. 200 million for 1 second of time
2. Turn on the E bit in control register, simultaneously turn A on (set A and E as 1)
   a. Causes Load Register to be copied into the counter and to start counting down to zero
3. When counter reaches zero -> F bit of interrupt status register becomes 1
4. To reset F bit into zero -> write a 1 into it (like edgecapture register)
5. To have counter reload & do it again -> set A bit in control register
   a. Important to reset F when timer reaches zero -> before starting new timer

_start: LDR R8, =0xFFFEC600   //Base address of A9 private timer
        LDR R3, =2000000      //value to put into LOAD register
                              //(2 million counts = 0.01 s at 200 MHz; use 200000000 for 1 s)
        STR R3, [R8]          //put into LOAD register
        MOV R3, #0b011        //A = 1, E = 1
        STR R3, [R8,#8]       //turn on A and E bits in control register
WAIT:   LDR R3, [R8, #0xC]    //read interrupt status register
        ANDS R3, #1           //isolate bit 0 (F)
        BEQ WAIT              //keep waiting while R3 is 0
                              //falls through when timer hit zero
        STR R3, [R8, #0xC]    //writing 1 resets F bit to zero
        B WAIT
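The same polling sequence can be sketched in C. The function names and the `TIMER_*` word offsets are made up for this example (they follow the register layout used above: LOAD at +0, control at +8, interrupt status at +0xC), and the timer base is passed in as a pointer so the code can also be tried against a plain array off the board.

```c
#include <stdint.h>

/* Word offsets from the timer base (byte offsets 0x0, 0x8, 0xC). */
#define TIMER_LOAD    0
#define TIMER_CONTROL 2
#define TIMER_STATUS  3

/* Steps 1 and 2: load the count, then set A (auto-reload) and E (enable). */
void timer_start(volatile uint32_t *timer, uint32_t counts)
{
    timer[TIMER_LOAD] = counts;
    timer[TIMER_CONTROL] = 0x3;      /* bit 1 = A, bit 0 = E */
}

/* Steps 3 and 4: poll the F bit, then write 1 to it to clear it. */
void timer_wait(volatile uint32_t *timer)
{
    while ((timer[TIMER_STATUS] & 0x1u) == 0) {
        /* keep polling until the counter reaches zero */
    }
    timer[TIMER_STATUS] = 0x1;
}
```

On the DE1-SoC this would be called as `timer_start((volatile uint32_t *)0xFFFEC600, 200000000)` for a 1 second period, followed by `timer_wait` on the same pointer.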

• Interrupt

  o Method to respond to external hardware signals and other events
  o Difference between subroutine and interrupt
    ▪ Subroutine is 'caused' by the program calling it
    ▪ Interrupt can be caused at any time, whenever the event happens
  o Possible issues
    ▪ Processor needs to know where to begin executing after interrupt occurs
      • "Exception vector table"
    ▪ Processor needs to know where to return when interrupt service routine is done
      • Return address gets put into a special link register
    ▪ Need to save and restore main program's (CPSR) condition codes
      • CPSR gets saved to SPSR
    ▪ Needs to save/restore all the registers that might be "clobbered" by interrupt service routine
    ▪ ARM processor 'modes' solve some of those issues


  o ARM processor modes - Refer to "ARM Registers" above

Mode                         | CPSR(4-0) | Details
Supervisor                   | 10011     | Privilege to change CPSR mode, use MSR
User                         | 10000     | General user code
Interrupt Request (IRQ)      | 10010     | Entered when there is an interrupt
Fast interrupt request (FIQ) | 10001     |
Abort                        | 10111     |
Undefined                    | 11011     |

    ▪ Supervisor mode - has privilege to change CPSR mode, use MSR
      • MSR - instruction that can change CPSR

MOV R0, #0b10000 //user mode code

MSR CPSR_C, R0 //CPSR(7-0) <- R0(7-0)

    ▪ User mode - cannot change CPSR mode
    ▪ IRQ mode - automatically entered when interrupt happens
      • Has new, different hardware registers for R13 (SP) and R14 (LR)

    ▪ Start main program in user mode -> interrupt request happens

1. CPSR copied into newly available (in IRQ mode) SPSR
   • SPSR - "Saved Program Status Register", saves the CPSR of user mode
2. CPSR is changed to set processor into IRQ mode
3. The PC is copied into new LR
   • We know where to return after interrupt service routine is over
4. Need to change program counter (PC) to be the address of the interrupt service routine
   • Given in Exception vector table (interrupt is one kind of exception)

  o Sequence of events involved in handling an IRQ

i. Device raises an IRQ
ii. Processor interrupts the program currently being executed
iii. Device is informed that its request has been recognized and the device deactivates the request signal
iv. The requested action is performed
v. Interrupt is enabled and the interrupted program is resumed


  o Exception vector tables
    ▪ IRQ exception vector at 0x18
      • Goes into IRQ mode
    ▪ Power on/reset at 0x0
      • Goes into supervisor mode
    ▪ System call - SVC instruction at 0x8
      • Goes into supervisor mode
    ▪ The scope of this course is not using user mode (yet)
    ▪ In monitor program -> must select "Exceptions" in linker section of new project
    ▪ Exception vector table code:

.section .vectors, "ax"

B _start //reset vector (0x0)

B SERVICE_UND //undefined instruction vector (0x4)

B SERVICE_SVC //software interrupt vector (0x8)

B SERVICE_ABT_INST //aborted prefetch vector (0xC)

B SERVICE_ABT_DATA //aborted data vector (0x10)

.word 0 //unused vector (0x14)

B SERVICE_IRQ //IRQ interrupt vector (0x18)

B SERVICE_FIQ //FIQ interrupt vector (0x1C)

o IRQ handler - handles all possible sources of interrupts

IRQ_HANDLER: PUSH {R0-R12, LR} //push (save all registers onto the stack)

//code figures out which interrupt occurred -> i.e) Key, timer, …

//ex) if interrupt was key

BL KEY_ISR //go to specific interrupt service routine

//ex) if interrupt was timer

BL TIMER_ISR

//Before you finish -> restore the registers

POP {R0-R12, LR}

//go back to main program that was executing when interrupt request happened

//1) PC <- LR

//2) CPSR <- SPSR (restores condition codes and mode back into CPSR)

//return from interrupt instructions

SUBS PC,LR,#4 //PC <- LR-4

//SUBS into PC causes SPSR -> CPSR transfer


  o 3 places to enable/disable interrupts -> processor, GIC, device itself
    • Set IRQ mode -> Device request -> GIC generates interrupt signal -> Interrupt service routine executed
    • Must enable/configure interrupts in all three places

i. In processor, assume that processor starts in supervisor mode
   • Has privilege to change CPSR
   • 'I' bit in CPSR
     • 0 -> interrupt is enabled (respond to IRQ)
     • 1 -> interrupt is disabled (ignore IRQ)
   • Enable interrupt -> use MSR instruction

MSR CPSR_C, #0b01010011 //lower 8 bits of CPSR

   • We are not dealing with F&T bits in this course.

ii. Enable KEYs (pushbutton) to cause an interrupt
   • Turn on a bit in interrupt mask register, to enable corresponding bit in edgecapture register to request an interrupt to GIC
   • Interrupt service routine must
     a. Read edge capture register to determine which key is pressed (should respond to 1)
     b. When done, the routine MUST reset that bit in edge capture register

iii. Enable/configure interrupts in the GIC
   • Too complicated -> use BL CONFIG_GIC
   • Need 2 memory-mapped registers in the GIC to deal with multiple interrupts
     • 0xFFFEC10C - ICCIAR (Interrupt Acknowledge Register)
     • 0xFFFEC110 - ICCEOIR (End of Interrupt Register)
   • Each possible source of an interrupt from a device has its own numerical 10-bit ID number (in decimal)
     • KEY = 73
     • A9_private timer = 29
     • Interval timer = 72
   • Bits 9-0 of ICCIAR tell you the ID of the device causing the interrupt
   • Bits 9-0 of ICCEOIR should be written with the ID of the device that you want to acknowledge, and have GIC turn off IRQ
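The values written to CPSR_C in these notes (0b11010010, 0b11010011, 0b01010011) are just the I bit, the F bit, and the 5-bit mode packed into one byte. A small C sketch of that packing follows; the enum and function names are made up, and the T bit (bit 5) is always left at 0 here.

```c
#include <stdint.h>

/* Mode field values, CPSR(4-0), from the processor modes table. */
enum {
    MODE_USR = 0x10,   /* 10000 */
    MODE_FIQ = 0x11,   /* 10001 */
    MODE_IRQ = 0x12,   /* 10010 */
    MODE_SVC = 0x13    /* 10011 */
};

/* Build the low byte of CPSR: bit 7 = I (1 disables IRQ),
 * bit 6 = F (1 disables FIQ), bit 5 = T (left 0), bits 4-0 = mode. */
uint8_t cpsr_control_byte(int irq_disabled, int fiq_disabled, unsigned mode)
{
    unsigned i_bit = irq_disabled ? 1u : 0u;
    unsigned f_bit = fiq_disabled ? 1u : 0u;
    return (uint8_t)((i_bit << 7) | (f_bit << 6) | (mode & 0x1Fu));
}
```

Supervisor mode with IRQ enabled comes out as 0x53, which is the 0b01010011 used with MSR above.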


o Steps required in interrupts

1. Set up exception vector table

2. Initialize stack pointers (both supervisor mode and IRQ mode)

MSR CPSR_C,#0b11010010 //IRQ mode, disabled interrupts

LDR SP, =0x20000 //sets IRQ banked SP

MSR CPSR_C, #0b11010011 //supervisor

LDR SP, =0x40000 //set supervisor SP -> different from IRQ's SP

3. Configure GIC -> BL CONFIG_GIC (code provided below)

.global CONFIG_GIC

CONFIG_GIC:

PUSH {LR}

/* Configure the A9 Private Timer interrupt, FPGA KEYs, and FPGA Timer */

/* CONFIG_INTERRUPT (int_ID (R0), CPU_target (R1)); */

MOV R0, #MPCORE_PRIV_TIMER_IRQ

MOV R1, #CPU0

BL CONFIG_INTERRUPT

MOV R0, #INTERVAL_TIMER_IRQ

MOV R1, #CPU0

BL CONFIG_INTERRUPT

MOV R0, #KEYS_IRQ

MOV R1, #CPU0

BL CONFIG_INTERRUPT

/* configure the GIC CPU interface */

LDR R0, =0xFFFEC100 // base address of CPU interface

/* Set Interrupt Priority Mask Register (ICCPMR) */

LDR R1, =0xFFFF // enable interrupts of all priorities levels

STR R1, [R0, #0x04]

/* Set the enable bit in the CPU Interface Control Register (ICCICR). This bit

* allows interrupts to be forwarded to the CPU(s) */

MOV R1, #1

STR R1, [R0]

/* Set the enable bit in the Distributor Control Register (ICDDCR). This bit

* allows the distributor to forward interrupts to the CPU interface(s) */

LDR R0, =0xFFFED000

STR R1, [R0]

POP {PC}

4. Enable interrupts in device

5. Turn on processor interrupts -> set bit I in CPSR as zero

6. Write IRQ_HANDLER that figures out which device is causing interrupt & calls the appropriate service routine
   • Make sure to turn off GIC, device interrupt for each instance of an interrupt


(Interrupt example)

.section .vectors, "ax"

B _start // reset vector

.word 0 // undefined instruction vector

.word 0 // software interrupt vector

.word 0 // aborted prefetch vector

.word 0 // aborted data vector

.word 0 // unused vector

B IRQ_HANDLER // IRQ interrupt vector

.word 0 // FIQ interrupt vector

/* ********************************************************************************

* This program demonstrates use of interrupts with assembly language code.

* The program responds to interrupts from the pushbutton KEY port in the FPGA.

********************************************************************************/

.text

.global _start

_start:

/* Set up stack pointers for IRQ and SVC processor modes */

MOV R1, #0b11010010 // interrupts masked, MODE = IRQ

MSR CPSR_c, R1 // change to IRQ mode

LDR SP, =0x20000 // set IRQ stack pointer

/* Change to SVC (supervisor) mode with interrupts disabled */

MOV R1, #0b11010011 // interrupts masked, MODE = SVC

MSR CPSR, R1 // change to supervisor mode

LDR SP, =0x40000 // set SVC stack

BL CONFIG_GIC // configure the ARM generic interrupt controller

// write to the pushbutton KEY interrupt mask register

LDR R0, =0xFF200050 // pushbutton KEY base address

MOV R1, #0xF // set interrupt mask bits

STR R1, [R0, #0x8] // interrupt mask register is (base + 8)

// enable IRQ interrupts in the processor

MOV R0, #0b01010011 // IRQ unmasked, MODE = SVC

MSR CPSR, R0

MAIN_LOOP:

AND R0, R1, R2 // code doesn't do anything useful

EOR R3, R4, R5

ORR R6, R7, R8

AND R8, R7, R6

EOR R5, R4, R3

ORR R2, R1, R0

B MAIN_LOOP // main program simply repeats the loop

.text

IRQ_HANDLER:

PUSH {R0-R5, LR}

/* Read the ICCIAR from the CPU interface */


LDR R4, =0xFFFEC100

LDR R5, [R4, #0xC] // read from ICCIAR

CHECK_KEYS: CMP R5, #73

UNEXPECTED: BNE UNEXPECTED // if not recognized, stop here

BL KEY_ISR // pass R0 as a parameter to KEY_ISR

EXIT_IRQ:

/* Write to the End of Interrupt Register (ICCEOIR) */

STR R5, [R4, #0x10] // write to ICCEOIR

POP {R0-R5, LR}

SUBS PC, LR, #4

.global KEY_ISR

KEY_ISR: LDR R0, =0xFF200050 // base address of pushbutton KEY port

MOV R2, #0xF

STR R2, [R0, #0xC] // clear the interrupt

MOV PC, LR // return

// Configure the Generic Interrupt Controller (GIC)

/* Interrupt controller (GIC) CPU interface(s) */

.equ MPCORE_GIC_CPUIF, 0xFFFEC100 /* PERIPH_BASE + 0x100 */

.equ ICCICR, 0x00 /* CPU interface control register */

.equ ICCPMR, 0x04 /* interrupt priority mask register */

.equ ICCIAR, 0x0C /* interrupt acknowledge register */

.equ ICCEOIR, 0x10 /* end of interrupt register */

/* Interrupt controller (GIC) distributor interface(s) */

.equ MPCORE_GIC_DIST, 0xFFFED000 /* PERIPH_BASE + 0x1000 */

.equ ICDDCR, 0x00 /* distributor control register */

.equ ICDISER, 0x100 /* interrupt set-enable registers */

.equ ICDICER, 0x180 /* interrupt clear-enable registers */

.equ ICDIPTR, 0x800 /* interrupt processor targets registers */

.equ ICDICFR, 0xC00 /* interrupt configuration registers */

.global CONFIG_GIC

CONFIG_GIC:

PUSH {LR}

/* To configure the FPGA KEYS interrupt (ID 73):

* 1. set the target to cpu0 in the ICDIPTRn register

* 2. enable the interrupt in the ICDISERn register */

/* CONFIG_INTERRUPT (int_ID (R0), CPU_target (R1)); */

MOV R0, #73 // KEY port (interrupt ID = 73)

MOV R1, #1 // this field is a bit-mask; bit 0 targets cpu0

BL CONFIG_INTERRUPT

/* configure the GIC CPU interface */

LDR R0, =MPCORE_GIC_CPUIF // base address of CPU interface

/* Set Interrupt Priority Mask Register (ICCPMR) */

LDR R1, =0xFFFF // enable interrupts of all priorities levels

STR R1, [R0, #ICCPMR]

/* Set the enable bit in the CPU Interface Control Register (ICCICR). This bit


* allows interrupts to be forwarded to the CPU(s) */

MOV R1, #1

STR R1, [R0]

/* Set the enable bit in the Distributor Control Register (ICDDCR). This bit

* allows the distributor to forward interrupts to the CPU interface(s) */

LDR R0, =MPCORE_GIC_DIST

STR R1, [R0]

POP {PC}

/*

* Configure registers in the GIC for an individual interrupt ID

* We configure only the Interrupt Set Enable Registers (ICDISERn) and Interrupt

* Processor Target Registers (ICDIPTRn). The default (reset) values are used for

* other registers in the GIC

* Arguments: R0 = interrupt ID, N

* R1 = CPU target

*/

CONFIG_INTERRUPT:

PUSH {R4-R5, LR}

/* Configure Interrupt Set-Enable Registers (ICDISERn).

* reg_offset = integer_div(N / 32) * 4

* value = 1 << (N mod 32) */

LSR R4, R0, #3 // calculate reg_offset

BIC R4, R4, #3 // R4 = reg_offset

LDR R2, =MPCORE_GIC_DIST+ICDISER

ADD R4, R2, R4 // R4 = address of ICDISER

AND R2, R0, #0x1F // N mod 32

MOV R5, #1 // enable

LSL R2, R5, R2 // R2 = value

/* now that we have the register address (R4) and value (R2), we need to set the

* correct bit in the GIC register */

LDR R3, [R4] // read current register value

ORR R3, R3, R2 // set the enable bit

STR R3, [R4] // store the new register value

/* Configure Interrupt Processor Targets Register (ICDIPTRn)

* reg_offset = integer_div(N / 4) * 4

* index = N mod 4 */

BIC R4, R0, #3 // R4 = reg_offset

LDR R2, =MPCORE_GIC_DIST+ICDIPTR

ADD R4, R2, R4 // R4 = word address of ICDIPTR

AND R2, R0, #0x3 // N mod 4

ADD R4, R2, R4 // R4 = byte address in ICDIPTR

//now that we have the register address (R4) and value (R2), write to (only) the appropriate byte

STRB R1, [R4]

POP {R4-R5, PC}

.end
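The address arithmetic inside CONFIG_INTERRUPT can be written out in C to check it by hand. This sketch only computes the addresses and bit value; it does not touch the GIC. The macro and function names are made up, and the constants are the distributor base and register offsets from the .equ lines above.

```c
#include <stdint.h>

#define GIC_DIST_BASE 0xFFFED000u   /* MPCORE_GIC_DIST */
#define ICDISER_OFF   0x100u        /* interrupt set-enable registers */
#define ICDIPTR_OFF   0x800u        /* interrupt processor targets registers */

/* Address of the ICDISERn word that holds the enable bit for interrupt `id`:
 * reg_offset = integer_div(id / 32) * 4 */
uint32_t icdiser_address(uint32_t id)
{
    return GIC_DIST_BASE + ICDISER_OFF + (id / 32u) * 4u;
}

/* Bit to set inside that word: 1 << (id mod 32) */
uint32_t icdiser_bit(uint32_t id)
{
    return 1u << (id % 32u);
}

/* Byte address inside ICDIPTRn for interrupt `id`:
 * word offset integer_div(id / 4) * 4 plus byte index id mod 4, i.e. base + id */
uint32_t icdiptr_byte_address(uint32_t id)
{
    return GIC_DIST_BASE + ICDIPTR_OFF + (id / 4u) * 4u + (id % 4u);
}
```

For the KEY port (ID 73) this gives set-enable word 0xFFFED108 with bit 9, and target byte 0xFFFED849.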


C-language Programming

• Introduction
  o Engineers avoid assembly…
  o _start is NOT main
  o Code placed at _start by our 'compiler' is called the "Common startup code", which
    ▪ Initializes stack pointer
    ▪ Initializes variables such as int z=5
    ▪ Sets uninitialized variables to zero
  o main is your program!

• Reading/Writing I/O devices in C

o Reading switches and copy to LEDs

ARM Assembly:

_start: LDR R1, =0xFF200000 //base address (LEDR)

LOOP: LDR R3,[R1, #0x40] //read switches

STR R3, [R1] //store (write) into LEDs

B LOOP

C code:

volatile int * LED_ptr = (int *) 0xFF200000;

volatile int * SW_ptr = (int *) 0xFF200040;

int main(){

int value;

while(1){

value = *SW_ptr;

*LED_ptr = value;

}

}

  o Why are we using 'volatile'?
    ▪ The C compiler will not 'optimize' away accesses to a variable declared volatile
      • Without volatile, the optimizer may read the address 'once' and reuse the old value instead of reading it every time through the loop
      • volatile ensures the variable is really accessed, using its address, every time the code says so
    ▪ So, always use volatile for I/O devices!
  o *ptr is "pointer dereferencing"
    ▪ How we read/write specific addresses
    ▪ ptr+1 //adds 1 to the address if pointer type is char, adds 4 if pointer type is int
      • int is word, char is byte
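The ptr+1 rule can be checked directly by measuring how many bytes the address moves. The function names below are made up for this sketch; on the ARM system in these notes `sizeof(int)` is 4.

```c
#include <stddef.h>

/* Number of bytes the address moves when 1 is added to an int pointer. */
size_t int_pointer_step(void)
{
    int words[2] = {0, 0};
    int *p = words;
    return (size_t)((char *)(p + 1) - (char *)p);
}

/* Number of bytes the address moves when 1 is added to a char pointer. */
size_t char_pointer_step(void)
{
    char bytes[2] = {0, 0};
    char *p = bytes;
    return (size_t)((p + 1) - p);
}
```

This is why `pixel_ctrl_ptr + 1` on an `int *` at 0xFF203020 refers to address 0xFF203024, not 0xFF203021.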


Drawing graphics on VGA Display with C code!

• Recall from ECE241 -> VGA display
  o Drew simple pictures using hardware
  o Used VGA adapter module
    ▪ VGA adapter -> hardware wrapped around memory
    ▪ Frame buffer - chunk of memory where software must write the colour of all of the pixels that you want drawn on the screen
  o DE1-SoC has 3 kinds of memory
    ▪ DDR3 1GByte - 0x00000000 - 0x3FFFFFFF
    ▪ SDRAM 64MByte - 0xC0000000 - 0xC3FFFFFF
    ▪ On-chip memory 256KB - 0xC8000000 - 0xC803FFFF
    ▪ Only SDRAM and On-chip memory have frame buffer
  o The video output port continuously reads the frame buffer from one of the memories -> draws what is there onto the actual screen
    ▪ When the content in memory (data) is changed, it changes what is on the screen (operates independently of the processor)
      • "Direct memory access (DMA)"

• Drawing pictures -> we need to write into 'frame buffer' to change picture on the screen
  o Each pixel is 16 bits
    ▪ Order of colours is from most significant bit to least significant bit

Colour: Red    | Green  | Blue
Bits:   15-11  | 10-5   | 4-0
Width:  5 bits | 6 bits | 5 bits
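Packing the three colour fields into one 16-bit pixel is a pair of shifts and ORs. A minimal sketch follows; the function name `rgb565` is made up, and out-of-range inputs are simply masked to the field width.

```c
#include <stdint.h>

/* Pack red (0-31), green (0-63), and blue (0-31) into one 16-bit pixel:
 * red in bits 15-11, green in bits 10-5, blue in bits 4-0. */
uint16_t rgb565(unsigned red, unsigned green, unsigned blue)
{
    return (uint16_t)(((red & 0x1Fu) << 11) |
                      ((green & 0x3Fu) << 5) |
                      (blue & 0x1Fu));
}
```

Pure red is 0xF800, pure green is 0x07E0, pure blue is 0x001F, and white is 0xFFFF.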

• Convert x & y coordinates to corresponding memory address
  o 0 ≤ x ≤ 319, 0 ≤ y ≤ 239
    ▪ x needs 9 bits (2^9 = 512)
    ▪ y needs 8 bits (2^8 = 256)
  o We need 'base' address of entire frame buffer

Field: Base    | Y       | X     | 0
Bits:  31 - 18 | 17 - 10 | 9 - 1 | 0

  o Compute address of 2 bytes (1 pixel) at screen coordinate (x,y) in C:
    ▪ Address = base + (y ≪ 10) + (x ≪ 1)
      • y ≪ a means 'shift y left by a bits' (written y << a in C)
    ▪ Default base address for frame buffer is 0xC8000000 (on-chip memory)

X Y Address

0 0 C8000000

1 0 C8000002

0 1 C8000400

319 239 C803BE7E
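The rows of the table can be reproduced with a one-line C function. This sketch only computes the address; the name `pixel_address` is made up, and x and y are assumed to be within 0-319 and 0-239.

```c
#include <stdint.h>

/* Address of the 16-bit pixel at screen coordinate (x, y):
 * y goes into bits 17-10, x into bits 9-1, and bit 0 stays 0. */
uint32_t pixel_address(uint32_t base, unsigned x, unsigned y)
{
    return base + ((uint32_t)y << 10) + ((uint32_t)x << 1);
}
```

On the board, a pixel would then be written by storing a 16-bit colour through a `volatile short *` pointing at that address.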


• Bresenham's Algorithm
  o Drawing a line on the VGA screen
    ▪ Not possible to draw a perfect line on a grid of pixels! So, we use Bresenham's algorithm to choose which pixels to fill
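A common integer-only form of Bresenham's algorithm is sketched below. It is not necessarily identical to the version shown in lecture. The small `canvas` array and the `plot` function are stand-ins made up for this example; on the board, `plot` would write a colour into the frame buffer instead.

```c
#include <stdlib.h>

#define CANVAS_W 16
#define CANVAS_H 16

/* Hypothetical software canvas standing in for the frame buffer. */
unsigned char canvas[CANVAS_H][CANVAS_W];

/* Mark one pixel, ignoring coordinates outside the canvas. */
void plot(int x, int y)
{
    if (x >= 0 && x < CANVAS_W && y >= 0 && y < CANVAS_H) {
        canvas[y][x] = 1;
    }
}

static void swap_int(int *a, int *b)
{
    int t = *a;
    *a = *b;
    *b = t;
}

void draw_line(int x0, int y0, int x1, int y1)
{
    int is_steep = abs(y1 - y0) > abs(x1 - x0);
    if (is_steep) {              /* steep line: step along y instead of x */
        swap_int(&x0, &y0);
        swap_int(&x1, &y1);
    }
    if (x0 > x1) {               /* always step from left to right */
        swap_int(&x0, &x1);
        swap_int(&y0, &y1);
    }

    int deltax = x1 - x0;
    int deltay = abs(y1 - y0);
    int error = -(deltax / 2);   /* running error term, integers only */
    int y = y0;
    int y_step = (y0 < y1) ? 1 : -1;

    for (int x = x0; x <= x1; x++) {
        if (is_steep) {
            plot(y, x);          /* undo the earlier x/y swap */
        } else {
            plot(x, y);
        }
        error += deltay;
        if (error >= 0) {        /* line has drifted far enough: move one row */
            y += y_step;
            error -= deltax;
        }
    }
}
```

Calling `draw_line(0, 0, 3, 3)` marks the four diagonal pixels (0,0) to (3,3), and a vertical line such as `draw_line(5, 2, 5, 6)` goes through the "steep" path.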

• Drawing animation
  o Processor writes values into frame buffer
  o Video out system reads from frame buffer → draws what it finds there onto the screen
    ▪ Happens simultaneously → sometimes causes weird artifacts
    ▪ We use 'double buffering'

• Double buffering
  o Have one frame buffer that processor draws into
  o Different buffer that video output reads from to draw onto the display
  o Two buffers
    ▪ front buffer - being rendered on the display currently
    ▪ back buffer - buffer written into by the processor
  o Switch them at appropriate time
  o Swap those buffers when
    ▪ Processor has finished the frame it wanted to draw
    ▪ Video out system has finished one complete redraw of the screen (S bit in status register becomes 0 once this is done)
  o VGA standard → frames are drawn at 60 frames per second
    ▪ Maximum swap → 60 times per second
  o When double buffering is used, the second buffer is set as 0xC0000000 (SDRAM)
  o To tell video port that 'processor finished writing into back buffer' → write '1' into front buffer register at FF203020
    ▪ Doesn't change front buffer register → it is a signal to the port
    ▪ Writing 1 to front buffer register sets S bit in status register to 1
  o Does not cause 'swap' of front & back buffer yet
    ▪ Video port waits until front buffer rendering is done
    ▪ When done → buffers are swapped and S bit in status register becomes 0
      • Must poll the status bit to see when to start drawing the new frame into the back buffer


int main(void){

volatile int * pixel_ctrl_ptr = (int *)0xFF203020;

// declare other variables(not shown)

// initialize location and direction of rectangles(not shown)

/* set front pixel buffer to start of FPGA On-chip memory */

*(pixel_ctrl_ptr + 1) = 0xC8000000; // first store the address in the back buffer

/* now, swap the front/back buffers, to set the front buffer location */

wait_for_vsync();

/* initialize a pointer to the pixel buffer, used by drawing functions */

pixel_buffer_start = *pixel_ctrl_ptr;

clear_screen(); // pixel_buffer_start points to the pixel buffer

/* set back pixel buffer to start of SDRAM memory */

*(pixel_ctrl_ptr + 1) = 0xC0000000; //set back buffer to SDRAM

pixel_buffer_start = *(pixel_ctrl_ptr + 1); // we draw on the back buffer

clear_screen();

while (1){

//code for drawing shapes

wait_for_vsync(); // swap front and back buffers on VGA vertical sync

pixel_buffer_start = *(pixel_ctrl_ptr + 1); // new back buffer

}

}

• Synchronize with VGA timing
  o Both Front and Back buffer have default address of C8000000
  o 0 ≤ x ≤ 319, 0 ≤ y ≤ 239
  o Status register
    ▪ S bit - useful to know when DMA controller has finished transferring the last pixel (319,239)

void wait_for_vsync(){

volatile int * pixel_ctrl_ptr = (int *)0xFF203020; //pointer to DMA controller

register int status;

*pixel_ctrl_ptr =1;

//when writing 1 to buffer register in DMA controller → does not change content of register

//serves as request to synchronize with VGA timing → set S as 1 in status register

status = *(pixel_ctrl_ptr + 3);

//wait until S becomes zero → when all frame buffer is drawn

while((status & 0x01) !=0){

status = *(pixel_ctrl_ptr + 3);

}

}


Computer Memory

• Connection between ARM processor and memory
  o cache? not there yet

• Memory read - processor memory read operation triggered by LOAD instruction (LDR), e.g. LDR R0, [R1]
  o Processor puts address [R1] onto ADDR
  o Sets W=0, which enables read from memory
  o Some cycles later, memory places the data from that memory location onto data wires
    ▪ Processor puts it into R0

• Memory write - processor memory write operation triggered by STORE instruction (STR), e.g. STR R3, [R2]
  o Processor puts R2 onto address bus ADDR
  o Processor puts R3 onto data bus
  o Sets W=1, which enables write into memory
  o Within a clock cycle or two, memory is changed as desired

• How does memory block itself work?
  o We want memory to be:
    ▪ Speed - fast → want to execute programs and hardware outputs faster!
    ▪ Cost - cheap → more memory can be used if it is cheap
    ▪ Energy consumption - low → need to organize and design memory carefully

• Terminologies in memory
  o 256 words, 16 bits each → 256x16 memory
    ▪ How many address wires? → 8 (2^8 = 256)
  o Let's make a smaller memory for simplicity → 16 word x 8 bits memory
    ▪ 4 address wires → 2^4 = 16


Building memory cell • To build this memory, we need to start with low-level building block

o Making 1 bit ▪ To make it cheap ▪ Cross-couple NOT gates

o 2 stable states

▪ x=0 or x=1 ▪ Stay that way as long as power is

on

• Connect an NMOS transistor to the left of the X node → when turned on, it works as a short circuit
  o When W=1, B is connected to X (through NMOS transistor)
    ▪ W means "word", NOT memory block input W
  o How do we change this bit? How do we "write" into it?
    ▪ Turn on W (=1) → then force B to be the value we want in the bit, and leave it that way long enough for the change to propagate through P and Q
      • Turn on W and wait enough time to write into gates
  o How to read from this?
    ▪ W=1 → just turn on W and look at X as seen on B
      • Turn on W and read from B
    ▪ When W=0, the signal from B is blocked → the state of X and Y is kept (saved)

• Every bit is connected to a "word" line and a "bit" line
  o Each bit line carries both Data in and Data out

• To build 16 word x 8 bit (16x8) memory → 128 bits!


  o To read word 0, turn on word line 0 → bit lines 0~7 will have value of word 0
  o Use a 4-to-16 decoder to turn on the right word line as a function of the address wires
    ▪ 4-to-16 decoder → convert address lines A[3:0] into 16 word lines (0-15)
  o If a memory read is happening (R/W=0), then the operation described above happens
    ▪ Bit line i is connected to DataOUT_i
  o If it is a memory write (R/W=1), then bit line i is connected to DataIN_i
    ▪ Tristate buffers are used to connect and disconnect here
      • Analog circuit, which we don't look at here
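The read/write behaviour of the 16x8 memory can be modelled in software. This is a minimal sketch, not the real circuit: `Mem16x8`, `decode_4to16` and `mem_access` are hypothetical names, and the tristate buffers are reduced to a simple `if`.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_WORDS 16

/* Hypothetical software model of the 16x8 memory: one byte per word line. */
typedef struct {
    uint8_t cell[NUM_WORDS];
} Mem16x8;

/* 4-to-16 decoder: returns a one-hot value with only bit `addr` set,
 * i.e. exactly one of the 16 word lines is turned on. */
uint16_t decode_4to16(uint8_t addr)
{
    assert(addr < NUM_WORDS);
    return (uint16_t)(1u << addr);
}

/* One memory access. rw = 0 reads (bit lines drive DataOUT),
 * rw = 1 writes (DataIN drives the bit lines).
 * Returns the contents of the selected word after the access. */
uint8_t mem_access(Mem16x8 *m, uint8_t addr, int rw, uint8_t data_in)
{
    uint16_t word_lines = decode_4to16(addr);
    uint8_t data_out = 0;
    for (int w = 0; w < NUM_WORDS; w++) {
        if (word_lines & (1u << w)) {   /* only the selected row responds */
            if (rw) {
                m->cell[w] = data_in;   /* write: DataIN onto the bit lines */
            }
            data_out = m->cell[w];      /* read: bit lines onto DataOUT */
        }
    }
    return data_out;
}
```

Writing 0xA7 to address 5 and then reading address 5 returns 0xA7, while the other 15 words are untouched because their word lines stay off.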

• Consider 2Mx8 memory: 2,097,152 words of size 8 bits
  o If done the same way → 2 million rows, awfully long
  o Want to make it square → there are 16M bits
    ▪ 4096x4096
  o Address word lines → we need a decoder that has 4096 outputs
    ▪ 12 address inputs → 2^12 = 4,096
  o 2 million addresses → 2^21 = 2,097,152 = 2M → 21 address bits (A20-A0)
  o 12 of them (A20-A9) are used for the decoder
    ▪ Leaves A8-A0
    ▪ How do I use those 9 bits to complete a read operation?
      • Multiplexer? 8-bit wide 512-to-1 MUX (4096 bits per row / 8 bits per word = 512 words per row)
    ▪ For writing, an 8-bit wide 1-to-512 demultiplexer (selected by a 9-to-512 decoder)
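The address split for the 2Mx8 memory can be sketched as below. The function names are hypothetical, and `first_bit_line` additionally assumes the 512 words of a row sit side by side in column order, which the notes do not state.

```c
#include <assert.h>
#include <stdint.h>

/* Field widths from the 2Mx8 example: 21 address bits = 12 row + 9 column. */
#define ROW_BITS 12
#define COL_BITS 9

/* Row select: A20-A9 feed the 12-to-4096 decoder. */
uint32_t row_of(uint32_t addr)
{
    assert(addr < (1u << (ROW_BITS + COL_BITS)));
    return addr >> COL_BITS;
}

/* Column select: A8-A0 pick one 8-bit word out of the 512 in the row. */
uint32_t col_of(uint32_t addr)
{
    assert(addr < (1u << (ROW_BITS + COL_BITS)));
    return addr & ((1u << COL_BITS) - 1u);
}

/* First bit line (0..4095) of the selected word, assuming the words of a
 * row are laid out contiguously (an assumption made for illustration). */
uint32_t first_bit_line(uint32_t addr)
{
    return col_of(addr) * 8u;
}
```

The highest address 0x1FFFFF lands in row 4095, column 511, and address 512 is the first word of row 1.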


Memory Types

• The bits built in "Building memory cell" are called "static memory"
  o They remember their value as long as the power is on
  o The memory is called "Static Random Access Memory" (SRAM)
    ▪ Random → You can select any (random) address
  o Static RAM bits need at least 5 transistors → 2 each for the inverters (NOT), 1 for the access transistor
    ▪ If we use fewer transistors, we get more memory!

• "Dynamic" RAM cell uses a single transistor (plus a capacitor)
  o Called "Dynamic Random Access Memory" (DRAM)
    ▪ Charge capacitor for a "1"
    ▪ Discharge capacitor for a "0"
    ▪ Forgets its value (charge leaks away in about 100ms) unless the cell is recharged
    ▪ A controller therefore recharges (refreshes) the memory cells periodically, about every 60ms

• Flash memory

o Used in USB sticks and Solid State Disks (SSD) o Remember data AFTER the power is turned off! o Use 'special' kind of transistor → "Floating gate transistor"

[Figure: MOSFET vs. floating gate transistor]
  MOSFET: high voltage on gate → connects source and drain
  Floating gate transistor: floating gate is isolated from regular gate and channel


  o There is a way to "inject" a charge onto the floating gate that stays there permanently, until something explicitly removes it
    ▪ Quantum tunneling - electrons get stuck in the floating gate
    ▪ The charge prevents the transistor from working normally
      • Field from gate doesn't work
    ▪ Transistor is either "Working" or "Not working" → two states
      • Working - allows current to flow
      • Not working - current does not flow
    ▪ The two states are used to store 1 bit
    ▪ Non-volatile memory, which stays when power is off

Specifics on DDR3 memory in DE1-SoC (This is not tested in 2020S)

• DE1-SoC has 1GB DDR3, which consists of two 512MB memory chips
  o Each chip is 256M words x 16 bits = 512 MBytes
  o How many address inputs are needed?
    ▪ 256M words → 2^28 → 28 address lines?
  o We only have 15 address inputs!
    ▪ Row address and column address are each input separately, using the same address signals
      • A14-0 address inputs
    ▪ BA2-0 signals are also address inputs, used to select a bank
    ▪ 15 address inputs → 32K rows in memory
  o There are 1K 16-bit columns in each bank
    ▪ For reading, column select multiplexer (16-bit wide 1K-to-1 MUX) chooses one of the 1K 16-bit columns
    ▪ For writing to the memory, there should be a 10-to-1K decoder
      • Each decoder output would enable the 16 tri-state buffers connected to one column
  o 256M x 16 = 4Gbit storage cells
  o 4Gbit / 32K rows = 128K bits/row
  o 128K bits / 16 bits per column = 8K columns
  o 8K columns / 8 banks = 1K columns per bank
  o Address lines → 15 (row) + 10 (col) + 3 (bank) = 28


• Bank-row-column (BRC) address connections for DDR memory

  Address bits | 27-25 | 24-10 | 9-0
  Field        | Bank  | Row   | Column

• Row-bank-column (RBC)

  Address bits | 27-13 | 12-10 | 9-0
  Field        | Row   | Bank  | Column
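The two address layouts can be compared with a short sketch. The struct and function names (`DdrAddr`, `split_brc`, `split_rbc`) are hypothetical; only the bit positions come from the tables above.

```c
#include <assert.h>
#include <stdint.h>

/* The three fields of a 28-bit DDR word address. */
typedef struct {
    uint32_t bank;  /* 3 bits  */
    uint32_t row;   /* 15 bits */
    uint32_t col;   /* 10 bits */
} DdrAddr;

/* Bank-row-column: bank = A27-25, row = A24-10, column = A9-0. */
DdrAddr split_brc(uint32_t a)
{
    assert(a < (1u << 28));
    DdrAddr d;
    d.bank = (a >> 25) & 0x7u;
    d.row  = (a >> 10) & 0x7FFFu;
    d.col  = a & 0x3FFu;
    return d;
}

/* Row-bank-column: row = A27-13, bank = A12-10, column = A9-0. */
DdrAddr split_rbc(uint32_t a)
{
    assert(a < (1u << 28));
    DdrAddr d;
    d.row  = (a >> 13) & 0x7FFFu;
    d.bank = (a >> 10) & 0x7u;
    d.col  = a & 0x3FFu;
    return d;
}
```

With address 0x400 (just past the first 1K columns), BRC moves to the next row of the same bank, while RBC moves to the next bank.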

• Refresh - every storage cell has to be read/written at regular periods to keep the memory cells (capacitors) from losing their stored charge (voltage)
  o Each bank can be refreshed independently, and in parallel

• DDR controller - a DDR controller is required between ARM and the DDR memory
  o Given a 28-bit address A27-0 from the ARM processor, the DDR port will:
    ▪ Activate a bank using BA2-0
    ▪ Apply row address on DDR A14-0, and then pulse the "RAS" input (pulse it low)
      • RAS → Row address strobe
    ▪ Apply column address on DDR A9-0, and pulse the "CAS" input (pulse it low)
      • CAS → Column address strobe
    ▪ For read → DDR memory provides a word to the DDR port, which provides it to the ARM processor
    ▪ For write → the word provided by the ARM processor is then sent to the DDR memory by the DDR port
  o In DDR memory, it can take many cycles to read a word.
    ▪ However, successive words (which would be in successive columns) can be provided quickly
    ▪ Why is it useful for a memory to provide multiple words in successive clock cycles?
      • To fill a cache line.


Introduction to Cache memory

• Computer doesn't have one simple block of memory, because:
  o Big memories are slow - takes a long time to read and write
    ▪ Because distance is big → signal takes significant time to travel
      • Speed of light = 30cm/ns
  o Small memories are fast for the opposite reason
    ▪ But we need big memories
  o Analogy - humans have different speeds and kinds of memory

• Modern main memory takes about 300 processor cycles per memory access
  o Processor can request memory accesses (read OR write) far faster than this memory can possibly supply!
    ▪ Main memory is slow, processor is fast
    ▪ Insert cache memory - smaller, faster - in between the main memory and the processor
      • Add logic that keeps the memory the processor is currently accessing in the cache

• Between fast processor & slow main memory, we build a special circuit called "cache memory"
  o Keep memory locations we are currently using inside → access them quickly
  o When not using it → throw it out and bring something else in
    ▪ We DO reuse memory - a loop repeats the same code over and over again
      • If it is in the cache, we can get to it quickly
    ▪ Locality of reference - keep the local region in the cache → program can operate fast
      • If you put some content into cache memory, you can run it faster without reaching main memory
      • Before we hit main memory, we go through 3 levels of cache memories
  o Add "logic" that keeps the memory locations the processor is currently using in the cache → better overall performance
  o If the processor ONLY uses a limited range of main memory at once → processor works more quickly
    ▪ Processor accesses have "locality of reference"
      • Processor stays in a local area of memory for a long time (few microseconds)
    ▪ Makes sense for loops (for) / data (small matrix)
  o Two kinds of locality in memory access/reference
    ▪ Spatial locality - processor accesses words that are 'near' to each other in memory addresses
    ▪ Temporal locality - reuse the same sets of data (data words) over and over
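Spatial locality can be made concrete by counting how often a traversal jumps to a different memory block. The sketch below is an illustration only: `block_switches` is a hypothetical helper, the matrix is assumed to be stored row by row (as C does), and a block size of 4 words is assumed.

```c
#include <assert.h>

#define WORDS_PER_BLOCK 4   /* assumption: 4 words per memory block */

/* Walk a rows x cols matrix of words stored row by row and count how many
 * accesses land in a different memory block than the previous access.
 * A low count means good spatial locality. */
unsigned block_switches(unsigned rows, unsigned cols, int column_major)
{
    assert(rows > 0 && cols > 0);
    unsigned switches = 0;
    unsigned prev_block = 0;
    unsigned total = rows * cols;
    for (unsigned k = 0; k < total; k++) {
        /* k-th element visited, in row order or in column order */
        unsigned r = column_major ? k % rows : k / cols;
        unsigned c = column_major ? k / rows : k % cols;
        unsigned block = (r * cols + c) / WORDS_PER_BLOCK;
        if (k == 0 || block != prev_block) {
            switches++;
        }
        prev_block = block;
    }
    return switches;
}
```

For a 4x8 matrix, walking row by row touches a new block only 8 times in 32 accesses, while walking column by column changes block on every one of the 32 accesses.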

• General operation of cache memory

1. Processor requests data from memory at address 'A' (read)

2. Cache looks inside itself to see if it holds memory location A

a. Cache doesn't have memory location A → "cache miss"

i. Ask main memory to put it into the cache → not only A but a few more words

b. Cache has memory location A → "cache hit"

i. Quickly send data over to processor (faster than main memory)

Hit rate of cache is typically higher than 90%


• Cache memory organization
  o Cache memory → organized in rows and columns
  o Row → "Cache line" - contains several words of memory
    ▪ Words, NOT bytes
  o Consider a very small cache that has 4 cache lines in it, and 4 words/line
  o Think about main memory - organized into blocks of words, the same size as cache lines
  o The other words in a block are more likely to be accessed once one of them is accessed

Q) When processor and cache memory are connected,

1. Where do we put any given main memory block in the cache?

a. How many different lines can a specific main memory block go into?

(# of main memory blocks) ≫ (# cache lines)

• Just one line → "Direct mapped cache"
  ▪ Each memory block is allowed to go into only one specific line in cache
  ▪ Easy to figure out if specific block (set of words) is in the cache
• Several, or all lines → "Associative cache"
  ▪ Memory block can go into more than one cache line
    • All lines in extreme case

2. How do we know if the block (4 words) that processor wants is already in the cache or not?

a. Is it a "hit" or "miss"?

• Tag bits in each cache line say which main memory block is resident in that cache line
  ▪ Tag → a function of address of the block
• Valid (V) bit must be equal to 1
  ▪ Becomes 1 → there is a valid block in line
  ▪ Zero as default

3. What happens when the processor wants to write into memory location that is in the cache?

a. Do we change cache memory and main memory? Ultimately yes, but when?

• Change both immediately, called "write through" cache
  ▪ Send write to both main memory and cache
• Can wait until finished with cache line, and only write to main memory if cache line is changed, called "write back" cache
  ▪ "Dirty bit" (D) is set if any part (word) of line is written

4. Suppose cache is full → all lines occupied and need to bring a new block in

a. Which line containing memory block gets kicked out?

• Direct mapped cache - answer is clear → only one choice to evict (one cache line)
• Associative cache - has choices of which main memory block to evict
  ▪ "Replacement policy" says which one to choose


Direct mapped caches

• Each main memory block (j) can only go into 1 cache line
• For a 4 line, 4 words/line cache, main memory block j can only reside in cache line j mod 4 (j%4)
  o Remainder after division by 4
  o 4 refers to # of lines in the cache memory

  MM Block   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8
  Cache line | 0 | 1 | 2 | 3 | 0 | 1 | 2 | 3 | 0
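The block-to-line mapping in the table is a single modulo operation. A tiny sketch, with `line_of_block` as a hypothetical name:

```c
#include <assert.h>

/* Cache line that main memory block j maps to in a direct mapped cache
 * with `num_lines` lines (4 in the example above). */
unsigned line_of_block(unsigned j, unsigned num_lines)
{
    assert(num_lines > 0);
    return j % num_lines;
}
```

When the number of lines is a power of two, `j % num_lines` is just the low bits of j (for 4 lines, `j & 3`), which is why the line number can be read directly out of the address.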

• How do we know which main memory block is resident within a given line of cache (when we are accessing a particular memory location)?
  o Tag bits, associated with each cache line, tell us which main memory block that line contains
  o Tag comes from main memory address of block/word

• Consider binary address (word address) of main memory blocks 0 to 8

  Block # | Binary address
  0       | 00…0000XX
  1       | 00…0001XX
  2       | 00…0010XX
  3       | 00…0011XX
  4       | 00…0100XX
  5       | 00…0101XX
  6       | 00…0110XX
  7       | 00…0111XX
  8       | 00…1000XX

• How do you know which main memory block is in the cache?
  o Tag bits will let you know → 00…00
  o These bits uniquely identify which block in main memory is placed in a specific cache line

  Tag bits | Direct mapped line | Word # in block
  00…00    | 00                 | XX

• Memory request from processor to the following word address A

  Tag bits (remaining bits) | Cache line (2 bits) | Word (2 bits)
  A_T                       | A_L                 | A_W

  o This address is sent to cache
  o Cache looks at the tag for line A_L
  o There is a hit in the cache when:
    ▪ Tag(A_L) = A_T //is tag of line A_L equal to A_T?
    ▪ V(A_L) = 1 //is valid bit 1?
      • Initially all valid bits are zero, until cache lines are loaded
  o If Tag doesn't match OR V=0, there's a miss in the cache
    ▪ If this happens, we have to evict the current valid block (if any) out of this cache line
    ▪ After eviction → load the main memory block containing A into the cache, and set Tag(A_L) = A_T, V(A_L) = 1, D(A_L) = 0
    ▪ "Eviction" - if this is a "write back" cache,
      • We check D(A_L), the "Dirty bit", which indicates if the line was changed after it entered the cache (because of a write)
      • If so, must write those changes back to main memory (VS. write through cache)
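The hit test, eviction and write-back steps can be put together in a small software model. This is a sketch under stated assumptions, not the real hardware: the type and function names are hypothetical, main memory is shrunk to 64 words, and the cache is the 4 line, 4 words/line direct mapped example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define LINES 4
#define WORDS_PER_LINE 4
#define MEM_WORDS 64            /* hypothetical tiny main memory */

typedef struct {
    bool valid, dirty;          /* V and D bits */
    uint32_t tag;
    uint32_t word[WORDS_PER_LINE];
} CacheLine;

typedef struct {
    CacheLine line[LINES];
    uint32_t mem[MEM_WORDS];
    unsigned writebacks;        /* counts dirty evictions */
} MemSystem;

/* Return the line holding word address `addr`, loading it on a miss.
 * Address layout: | tag | line (2 bits) | word (2 bits) | */
static CacheLine *get_line(MemSystem *s, uint32_t addr)
{
    assert(addr < MEM_WORDS);
    uint32_t l = (addr >> 2) & 0x3u;
    uint32_t tag = addr >> 4;
    CacheLine *ln = &s->line[l];
    if (ln->valid && ln->tag == tag) {
        return ln;                              /* hit */
    }
    if (ln->valid && ln->dirty) {               /* evict: write back if dirty */
        uint32_t old_base = (ln->tag << 4) | (l << 2);
        for (int i = 0; i < WORDS_PER_LINE; i++) {
            s->mem[old_base + i] = ln->word[i];
        }
        s->writebacks++;
    }
    uint32_t base = addr & ~0x3u;               /* first word of the new block */
    for (int i = 0; i < WORDS_PER_LINE; i++) {
        ln->word[i] = s->mem[base + i];
    }
    ln->valid = true;
    ln->dirty = false;
    ln->tag = tag;
    return ln;
}

uint32_t cache_load(MemSystem *s, uint32_t addr)
{
    return get_line(s, addr)->word[addr & 0x3u];
}

void cache_store(MemSystem *s, uint32_t addr, uint32_t value)
{
    CacheLine *ln = get_line(s, addr);
    ln->word[addr & 0x3u] = value;
    ln->dirty = true;           /* write back: main memory is updated later */
}
```

Word addresses 5 and 21 both map to line 1 (tags 0 and 1), so storing to 5 and then loading 21 forces a dirty eviction, and only then does main memory see the new value.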

• Example cache - different → 4 lines, 2 words per line

  V | Tag   | Word 1 | Word 0 | Line #
  1 | 00…00 | 0xCA   | 0xFE   | 0
  1 | 00…01 | 0x0E   | 0xAD   | 1
  1 | 00…00 | 0xBE   | 0xEF   | 2
  0 | 00…01 | 0xFE   | 0xED   | 3

  o Processor address reads:

  Tag   | Line | Word | What returns back to processor
  0….01 | 01   | 0    | 0xAD
  0….01 | 11   | 0    | Miss → Line 3 is invalid (V=0)
  0….00 | 00   | 1    | 0xCA
  0….01 | 10   | 1    | Miss → Tag in line 2 does not match

  o # bits in L = 2 → 4 lines
  o # bits in W = 1 → 2 words per line
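The four reads in this example can be replayed with a short lookup routine. It is a sketch with hypothetical names (`Line`, `cache_read`, `make_addr`); the address layout is tag, then 2 line bits, then 1 word bit, as stated above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define LINES 4
#define WORDS_PER_LINE 2

typedef struct {
    bool     valid;
    uint32_t tag;
    uint32_t word[WORDS_PER_LINE];  /* word[0] = Word 0, word[1] = Word 1 */
} Line;

/* Build a word address from its three fields: | tag | line (2) | word (1) | */
uint32_t make_addr(uint32_t tag, uint32_t line, uint32_t word)
{
    assert(line < LINES && word < WORDS_PER_LINE);
    return (tag << 3) | (line << 1) | word;
}

/* Read lookup only: returns true on a hit and stores the word in *out.
 * A miss (V=0 or tag mismatch) returns false and leaves *out untouched. */
bool cache_read(const Line cache[LINES], uint32_t word_addr, uint32_t *out)
{
    uint32_t w    = word_addr & 0x1u;
    uint32_t line = (word_addr >> 1) & 0x3u;
    uint32_t tag  = word_addr >> 3;
    if (cache[line].valid && cache[line].tag == tag) {
        *out = cache[line].word[w];
        return true;
    }
    return false;
}
```

Filling a `Line` array with the table contents and calling `cache_read` with the four addresses gives 0xAD, a miss, 0xCA and a miss, in that order.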

• Pros and cons of direct-mapped cache
  o Good - easy to determine if main memory block is in the cache
  o Bad - main memory blocks compete for just 1 cache line
    ▪ Better to have more than 1 line for each main memory block to go into in the cache
      • "Associative cache"


Fully Associative Cache

• Harder to detect "hit" now!
  o Have to check every tag in the cache to see if the main memory block we want is in the cache
• We need "associative memory" → a.k.a "content addressable memory" (CAM)
  o Give it data & it tells us if that data is in the memory & if so, which line/row
  o In this case, the input to the associative memory is T (A_T) → Tag in address
    ▪ Output
      • Whether T matches any of the tags in the cache
      • The line number that matches, if so


• How do we build an associative memory?
  o Need a comparator digital circuit
    ▪ Comparator - circuit to see if two n-bit numbers are equal
  o So, the circuit for an n-entry associative memory, for an associative cache, uses one comparator per entry
• Expensive (one comparator for every line) but fast
  o Lots of hardware for fully associative cache
• Can compromise between direct mapped and fully associative cache
  o Where we can put a main memory block into 1 of M different lines in a cache (M<all)
  o "M-way set associative cache"
  o Needs less hardware than fully associative, but has some of the benefit!
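The associative (CAM) lookup can be sketched in software as a search over all tags. The names `TagEntry` and `cam_lookup` are hypothetical, and the loop is sequential only because this is software; in hardware every comparator works at the same time.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool     valid;
    uint32_t tag;
} TagEntry;

/* Associative lookup: compare tag t against every entry.
 * Returns the matching line number, or -1 if no valid entry matches. */
int cam_lookup(const TagEntry tags[], int n, uint32_t t)
{
    assert(n > 0);
    for (int i = 0; i < n; i++) {
        if (tags[i].valid && tags[i].tag == t) {  /* one comparator per line */
            return i;
        }
    }
    return -1;
}
```

The cost of the real circuit grows with the number of lines, since each line needs its own comparator; that is the hardware the M-way compromise cuts down.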


M-way Set Associative Cache

• Compromise: each main memory block can go into M cache lines
  o Reduces conflicts, and hardware is cheaper than fully associative

• Details of M-way set associative caches
  o Through an example based on ARM
  o Include "byte addressable" → every byte gets an address
  o Continue to also talk about words = 32 bits (4 bytes = 1 word)
  o "Set" is a group of lines; each set contains 4 cache lines (when M=4)

ex) ARM processor has 32KB cache, 32 bytes (8 words) per line

• 4 way set associative cache (M=4)
• 4 cache lines in each set → a main memory block can be put into one of four possible lines in a specific set
  o Need to figure out how the main memory address tells us where to look in the cache to see if the MM block is there
  o Also the TAG to look for in the cache

Q) How many lines in the cache?

32KB/32 bytes per line = 1024 lines

Q) 32KB cache → how many possible sets are in this cache?

32KB cache/(32 bytes per line) = 1024 lines

1024 lines/(4 lines per set) = 256 sets → 8 bits to specify, as 2^8 = 256

• 8 words per line o # of bits to specify word = 3 bits to say which word (2^3=8)

• Now, consider the address sent from processor to memory system.
  o Recall main memory is grouped into blocks of size 8 words (= size of a line)
  o Block 0, 1, …, 125M in main memory → block j
  o Which set in the cache does a main memory block go into?
    ▪ It goes into set j%256 (j mod 256)
  o Here is the way to look at the main memory address emitted by processor

• Into binary

  Tag bits (19) | Set bits (8) | Word bits (3) | Byte bits (2) | Meaning (address of a block)
  000….00000    | 00000000     | XXX           | XX            | Set 0 in cache
  …             | …            | …             | …             |
  …             | …            | …             | …             |
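The 19/8/3/2 split of a 32-bit byte address can be written out as shifts and masks. This is a sketch; `CacheAddr` and `split_address` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Fields of a 32-bit byte address for the 32KB, 4-way, 32 bytes/line cache. */
typedef struct {
    uint32_t tag;   /* 19 bits: A31-A13 */
    uint32_t set;   /*  8 bits: A12-A5  */
    uint32_t word;  /*  3 bits: A4-A2   */
    uint32_t byte;  /*  2 bits: A1-A0   */
} CacheAddr;

CacheAddr split_address(uint32_t a)
{
    CacheAddr f;
    f.byte = a & 0x3u;
    f.word = (a >> 2) & 0x7u;
    f.set  = (a >> 5) & 0xFFu;
    f.tag  = a >> 13;
    /* Block number j = a / 32, and the set is j mod 256, as stated above. */
    assert(f.set == (a >> 5) % 256u);
    return f;
}
```

For example, byte address 0x1FFC is the last word of set 255 with tag 0, and the next block (0x2000) wraps back to set 0 with tag 1.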

• Way to show 4-way set associative cache of size 32KB, 8 words/line


• When address A is emitted by processor, say for a memory read, this cache goes to set S as given above
  o Checks to see if the Tag in any of the 4 lines in that set matches T in the address
  o We have to check 4 of the tags in the set, a 'smaller' associative search than is needed in a fully associative cache

• Other questions arise:
  o We have a choice of which of the 4 lines in a set to put a given memory block into
    ▪ If there is an empty line → its V bit equals zero → just put it there
  o If none of the 4 lines in the set are empty → we must evict one of them
    ▪ Which one? → the one least likely to be used again, if possible
    ▪ The way in which we choose which one is called the "replacement policy" of the cache
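The set lookup and the "put it in an empty line" rule can be sketched as follows. All names (`Way`, `Set`, `set_lookup`, `set_fill`) are hypothetical, only tags and valid bits are modelled, and eviction is left out on purpose.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SETS 256
#define WAYS 4

typedef struct { bool valid; uint32_t tag; } Way;
typedef struct { Way way[WAYS]; } Set;

/* Byte address fields: tag = A31-A13, set = A12-A5. */
static uint32_t set_index(uint32_t addr) { return (addr >> 5) & 0xFFu; }
static uint32_t tag_of(uint32_t addr)    { return addr >> 13; }

/* Search the 4 lines of the selected set. Returns the way on a hit, -1 on a miss. */
int set_lookup(const Set *cache, uint32_t addr)
{
    assert(cache != 0);
    const Set *s = &cache[set_index(addr)];
    for (int w = 0; w < WAYS; w++) {
        if (s->way[w].valid && s->way[w].tag == tag_of(addr)) {
            return w;
        }
    }
    return -1;
}

/* Place the block in an empty line (V=0) of its set.
 * Returns the way used, or -1 if the set is full and an eviction is needed. */
int set_fill(Set *cache, uint32_t addr)
{
    assert(cache != 0);
    Set *s = &cache[set_index(addr)];
    for (int w = 0; w < WAYS; w++) {
        if (!s->way[w].valid) {
            s->way[w].valid = true;
            s->way[w].tag = tag_of(addr);
            return w;
        }
    }
    return -1;
}
```

Byte addresses 0x0000, 0x2000, 0x4000 and 0x6000 all map to set 0 and fill its four lines; a fifth block for set 0 (0x8000) finds no empty line, which is where the replacement policy comes in.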

• Replacement policy

1. Kick out the 'oldest' → the one that has been there the longest

a. What if the oldest one is the variable we use all the time? NOPE

2. Least Recently Used policy (LRU)

a. Kick out the line that was last used the longest time ago

b. Each line gets an extra 2-bit counter

i. When a line is used, its counter is reset to zero

ii. And every other counter is incremented by 1

c. Evict the line whose counter is highest (=3)

3. It turns out, choosing a random line to evict works as well as LRU
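The 2-bit counter scheme for LRU can be sketched for one 4-line set. `LruSet`, `lru_touch` and `lru_victim` are hypothetical names, and two details are assumptions not stated above: counters stop at 3 instead of wrapping around, and ties are broken by taking the lowest-numbered line.

```c
#include <assert.h>
#include <stdint.h>

#define WAYS 4
#define MAX_AGE 3u              /* largest value of a 2-bit counter */

typedef struct {
    uint8_t age[WAYS];          /* one 2-bit counter per line in the set */
} LruSet;

/* A line was used: reset its counter, increment all the others.
 * Counters saturate at 3 (assumption: they do not wrap back to 0). */
void lru_touch(LruSet *s, int used)
{
    assert(used >= 0 && used < WAYS);
    for (int i = 0; i < WAYS; i++) {
        if (i == used) {
            s->age[i] = 0;
        } else if (s->age[i] < MAX_AGE) {
            s->age[i]++;
        }
    }
}

/* Line to evict: the one with the highest counter (first one on a tie). */
int lru_victim(const LruSet *s)
{
    int victim = 0;
    for (int i = 1; i < WAYS; i++) {
        if (s->age[i] > s->age[victim]) {
            victim = i;
        }
    }
    return victim;
}
```

After using lines 0, 1, 2, 3 in that order, the counters are 3, 2, 1, 0 and line 0 is the victim; using line 0 again makes line 1 the least recently used.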

Cache memory performance

• Processor 'emits' requests to access memory
  o Request 'hits' in the cache → takes 'C' processor cycles to satisfy request
    ▪ Real world → C ≈ 4 cycles
  o Request 'misses' in the cache (memory location not in cache)
    ▪ Takes 'M' cycles to satisfy request
    ▪ Around 300 cycles → "Miss penalty"
  o Miss hurts! We care a lot about speed performance of processor

• Performance of processor based on cache hit rate (h)
  o Hit rate of cache (h) = (# of memory accesses that 'hit')/(Total # of memory accesses)
    ▪ h = (# hits)/((# hits) + (# misses))
  o Miss rate = 1-h = (# of memory accesses that 'miss')/(Total # of memory accesses)
    ▪ 1-h = (# misses)/((# hits) + (# misses))
  o Given C and M, we can compute average # of cycles it takes to do memory access
    ▪ Average Access Cycles (AAC) = h*C + (1-h)*M

ex) Hit rate 90% (h=0.9), with C=4, M=300

AAC = 0.9*4 + 0.1*300 = 33.6 cycles

  h   | .85 | .90 | .95 | .98 | .99
  AAC | 48  | 34  | 19  | 9.9 | 7.0

→ as h approaches 1, AAC drops significantly!
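The AAC formula is easy to evaluate in code for other hit rates. A minimal sketch, with `average_access_cycles` as a hypothetical function name:

```c
#include <assert.h>

/* Average Access Cycles: AAC = h*C + (1-h)*M
 * h = hit rate (0..1), c = cycles for a hit, m = cycles for a miss. */
double average_access_cycles(double h, double c, double m)
{
    assert(h >= 0.0 && h <= 1.0);
    return h * c + (1.0 - h) * m;
}
```

With C=4 and M=300, h=0.90 gives 33.6 cycles and h=0.99 gives 6.96 cycles, which round to the 34 and 7.0 shown in the table.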

• Reduce miss penalty by adding another layer of cache!
  o Level 2 (L2) cache will be bigger than the first cache, and a little slower (but not as slow as main memory)
  o Modern systems have L1-L3 caches
  o Instruction fetches are different in memory access patterns from data fetches
    ▪ Level 1 cache is split into two parts
      • L1 data cache
      • L1 instruction cache
