TRANSCRIPT
SYNAR (Systems Networking and Architecture Group)
CMPT 886: Computer Architecture Primer
Dr. Alexandra Fedorova, School of Computing Science
SFU
Outline
• Caches
• Branch prediction
• Out-of-order execution
• Instruction Level Parallelism
Caches
• Level 1 / Level 2 / Level 3
• Instruction/Data or unified
Direct-Mapped Cache
Line size = 32 bytes
Cache eviction
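In a direct-mapped cache each address maps to exactly one line, so two addresses that share an index evict one another. A minimal sketch of the address split, keeping the slide's 32-byte lines and assuming a hypothetical 32 KB cache (1024 lines):

```python
LINE_SIZE = 32    # bytes per line, as on the slide
NUM_LINES = 1024  # assumption: a 32 KB direct-mapped cache

def decompose(addr):
    """Split an address into (tag, index, offset) fields."""
    offset = addr % LINE_SIZE                # byte within the line
    index = (addr // LINE_SIZE) % NUM_LINES  # which cache line it maps to
    tag = addr // (LINE_SIZE * NUM_LINES)    # identifies the memory block
    return tag, index, offset

# Addresses 0x0000 and 0x8000 share index 0 but differ in tag,
# so loading one evicts the other: a conflict miss.
```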
Set-Associative Cache
• 4-way set-associative cache
• The data can go into any of the four locations
• When the entire set is full, which line should we replace?
• LRU – least recently used (LRU stack)
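The LRU stack can be sketched as a small class modelling one set of the 4-way cache (a sketch for illustration, not a full cache model):

```python
class LRUSet:
    """One set of a 4-way set-associative cache with LRU replacement."""

    def __init__(self, ways=4):
        self.ways = ways
        self.stack = []  # LRU stack: front = most recently used tag

    def access(self, tag):
        """Return True on a hit; on a miss, insert tag, evicting the LRU line."""
        if tag in self.stack:
            self.stack.remove(tag)        # hit: move tag to the top
            self.stack.insert(0, tag)
            return True
        if len(self.stack) == self.ways:  # set full: drop least recently used
            self.stack.pop()
        self.stack.insert(0, tag)
        return False
```

For example, after misses on A, B, C, D fill the set, touching A again and then loading E evicts B, the least recently used line.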
Cache Hit/Miss
• Cache hit – the data is found in the cache
• Cache miss – the data is not in the cache
• Miss rate:
– misses per instruction
– misses per cycle
– misses per access (also miss ratio)
• Hit rate:
– the opposite (1 − miss rate, for the same denominator)
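With some hypothetical counts for one run, the three denominators give different-looking numbers for the same cache behaviour:

```python
# Hypothetical counts for one program run (assumed for illustration)
instructions = 1000
cycles = 1500
accesses = 300   # loads and stores that reached the cache
misses = 30

misses_per_instruction = misses / instructions  # 0.03
misses_per_cycle = misses / cycles              # 0.02
miss_ratio = misses / accesses                  # 0.10 (misses per access)
hit_rate = 1 - miss_ratio                       # 0.90
```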
Cache Miss Latency
• How long you have to wait if you miss in the cache
• Miss in L1 → go to L2 (~20 cycles)
• Miss in L2 → go to memory (~300 cycles, if there is no L3)
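These latencies combine into an average memory access time. A sketch using the slide's ~20 and ~300 cycle figures; the hit time and miss rates are assumptions for illustration:

```python
# Latencies from the slide; hit time and miss rates are assumed
L1_HIT = 2         # cycles to hit in L1 (assumption)
L2_LATENCY = 20    # cycles paid on an L1 miss
MEM_LATENCY = 300  # cycles paid on an L2 miss (no L3)

l1_miss_rate = 0.05  # assumed fraction of accesses that miss in L1
l2_miss_rate = 0.20  # assumed fraction of those that also miss in L2

amat = L1_HIT + l1_miss_rate * (L2_LATENCY + l2_miss_rate * MEM_LATENCY)
# 2 + 0.05 * (20 + 0.20 * 300) = 2 + 0.05 * 80 = 6 cycles on average
```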
Writing in Cache
• Write through – write directly to memory
• Write back – write to memory later, when the line is evicted
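The difference can be sketched with a toy backing store and a dirty bit; the names here are hypothetical, chosen only to illustrate the two policies:

```python
memory = {}  # toy backing store: address -> value

class CacheLine:
    def __init__(self, addr, value):
        self.addr, self.value = addr, value
        self.dirty = False  # only meaningful under write-back

def write_through(line, value):
    """Write through: update the cache line and memory immediately."""
    line.value = value
    memory[line.addr] = value

def write_back(line, value):
    """Write back: update only the cache line and mark it dirty."""
    line.value = value
    line.dirty = True

def evict(line):
    """On eviction, a dirty write-back line must be flushed to memory."""
    if line.dirty:
        memory[line.addr] = line.value
        line.dirty = False
```

Write-back defers the memory traffic: repeated writes to the same line cost one memory update at eviction instead of one per write.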
Caches on Multiprocessor Systems
[Figure: processors with private caches connected by a bus to shared memory. © Herlihy-Shavit 2007]
Processor Issues Load Request
[Figure: a processor issues a load; the data travels over the bus from memory into its cache. © Herlihy-Shavit 2007]
Another Processor Issues Load Request
[Figure: a second processor announces "I want data" on the bus; the first cache responds "I got data" and supplies it, so both caches now hold copies. © Herlihy-Shavit 2007]
Processor Modifies Data

[Figure: one processor writes its cached copy; the copies in the other caches and in memory are now invalid. © Herlihy-Shavit 2007]
Send Invalidation Message to Others
[Figure: the writing cache broadcasts "Invalidate!" on the bus; the other caches lose read permission and discard their copies. Memory need not be updated now: this cache can provide valid data. © Herlihy-Shavit 2007]
Processor Asks for Data
[Figure: another processor announces "I want data"; the cache holding the modified copy supplies it over the bus. © Herlihy-Shavit 2007]
Shared Caches
• Filled on demand
• No control over cache shares
• An aggressive thread can grab a large cache share and hurt others
[Figure: a shared cache whose lines are mostly occupied by Thread 1, leaving Thread 2 only a small share.]
Outline
• Caches
• Branch prediction
• Out-of-order execution
• Instruction Level Parallelism
Branching and CPU Pipeline
[Figure: the stages of a CPU pipeline.]
Branching Hurts Pipelining
Branch Prediction
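Why prediction matters can be seen with a classic illustration: the same code runs much faster on real hardware when its branch outcomes follow a pattern the predictor can learn. The function and data below are hypothetical, chosen only to make the branch behaviour visible:

```python
import random

def count_big(xs, threshold=128):
    """Count elements at or above a threshold (hypothetical example)."""
    count = 0
    for x in xs:
        if x >= threshold:  # the branch a hardware predictor must guess
            count += 1
    return count

data = [random.randrange(256) for _ in range(10_000)]
# On real hardware (the effect is invisible in interpreted Python),
# count_big(sorted(data)) runs much faster than count_big(data):
# sorted input makes the branch a long run of "not taken" followed by
# "taken", which predictors learn, while random input is mispredicted
# roughly half the time, stalling the pipeline on each mistake.
```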
Outline
• Caches
• Branch prediction
• Out-of-order execution
• Instruction Level Parallelism
Out-of-order Execution
• Modern CPUs are super-scalar
• They can issue more than one instruction per clock cycle
• If consecutive instructions depend on each other, instruction-level parallelism is limited
• To keep the processor going at full speed, issue instructions out of order
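The dependence point can be illustrated with a reduction: a single accumulator forms a serial chain of adds, while several independent accumulators expose work a super-scalar core can issue in parallel. A sketch (the speedup appears on real hardware, not in interpreted Python):

```python
def sum_serial(xs):
    """One accumulator: every add depends on the previous one."""
    total = 0
    for x in xs:
        total += x  # serial dependence chain limits ILP
    return total

def sum_unrolled(xs):
    """Four independent accumulators: the adds within one iteration do not
    depend on each other, so they can issue in the same cycle."""
    a = b = c = d = 0
    n = len(xs) - len(xs) % 4
    for i in range(0, n, 4):
        a += xs[i]
        b += xs[i + 1]
        c += xs[i + 2]
        d += xs[i + 3]
    for x in xs[n:]:  # leftover elements
        a += x
    return a + b + c + d
```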
Speculative Execution
• Out-of-order execution is limited to basic blocks
• To go beyond basic blocks, use speculative execution
Outline
• Caches
• Branch prediction
• Out-of-order execution
• Instruction Level Parallelism
Instruction-Level Parallelism
• Many programs fail to keep the processor busy:
– Code with lots of loads
– Code with frequent and unpredictable branches
• CPU cycles are wasted: power is consumed, but no useful work is done
• Running multiple threads on the chip helps with this