
Page 1: Memory/Storage Architecture Lab Computer Architecture Memory Hierarchy

Memory/Storage Architecture Lab

Computer Architecture

Memory Hierarchy

Page 2: Memory/Storage Architecture Lab Computer Architecture Memory Hierarchy


Technology Trends

[Chart: Trac and Tcac by year, '80–'07 (vertical scale 0–300)]

Year   Capacity   $/GB
1980   64Kbit     $1500000
1983   256Kbit    $500000
1985   1Mbit      $200000
1989   4Mbit      $50000
1992   16Mbit     $15000
1996   64Mbit     $10000
1998   128Mbit    $4000
2000   256Mbit    $1000
2004   512Mbit    $250
2007   1Gbit      $50

Page 3: Memory/Storage Architecture Lab Computer Architecture Memory Hierarchy


Memory Hierarchy

Ideally one would desire an indefinitely large memory capacity such that any particular … word would be immediately available … We are … forced to recognize the possibility of constructing a hierarchy of memories, each of which has greater capacity than the preceding but which is less quickly accessible.

Burks, Goldstine, and von Neumann, 1946


[Diagram: the CPU sits above Level 1, Level 2, …, Level n of the memory hierarchy; going down the levels, the size of the memory at each level increases and cost decreases, while speed and bandwidth increase toward the CPU.]

Page 4: Memory/Storage Architecture Lab Computer Architecture Memory Hierarchy


Memory Technology (Big Picture)

[Diagram: a processor (control + datapath) backed by a chain of memories. The memory closest to the processor is the fastest, smallest, and highest cost; each memory farther away is slower, bigger, and lower cost.]

Page 5: Memory/Storage Architecture Lab Computer Architecture Memory Hierarchy


Memory Technology (Real-world Realization)

[Diagram: processor with registers and on-chip caches, backed by off-chip level caches (SRAM), main memory (DRAM), and secondary storage (disk).]

              Register   Cache      Main Memory   Disk Memory
Speed         <1ns       <5ns       50ns~70ns     5ms~20ms
Size          100B       KB→MB      MB→GB         GB→TB
Management    Compiler   Hardware   OS            OS

Page 6: Memory/Storage Architecture Lab Computer Architecture Memory Hierarchy


Memory Hierarchy

An optimization resulting from a perfect match between memory technology and two types of program locality

Temporal locality (locality in time)
− If an item is referenced, it will tend to be referenced again soon.

Spatial locality (locality in space)
− If an item is referenced, items whose addresses are close by will tend to be referenced soon.
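As a concrete illustration (a minimal C sketch, not part of the original slides), a simple array-summing loop exhibits both kinds of locality:

#include <stdio.h>

#define N 1024

int main(void) {
    static int a[N];          /* assume the array was filled elsewhere */
    int sum = 0;              /* `sum` is reused every iteration: temporal locality */
    for (int i = 0; i < N; i++)
        sum += a[i];          /* a[i] is accessed sequentially, so neighboring
                                 elements share a cache block: spatial locality */
    printf("sum = %d\n", sum);
    return 0;
}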

Goal: To provide a “virtual” memory technology (an illusion) that has the access time of the highest-level memory with the size and cost of the lowest-level memory

Page 7: Memory/Storage Architecture Lab Computer Architecture Memory Hierarchy


Temporal and Spatial Localities

Source: Glass & Cao (1997 ACM SIGMETRICS)

Page 8: Memory/Storage Architecture Lab Computer Architecture Memory Hierarchy


Memory Hierarchy Terminology

Hit – Accessed data is found in the upper level
− Hit rate = fraction of accesses found in the upper level
− Hit time = time to access the upper level

Miss – Accessed data is found only in the lower level
− The processor waits until the data is fetched from the next level, then restarts/continues the access
− Miss rate = 1 – (hit rate)
− Miss penalty = time to get the block from the lower level + time to replace it in the upper level

Hit time << miss penalty, so the average memory access time << worst-case access time
Average memory access time = hit time + miss rate × miss penalty

Data are transferred between levels in units of blocks
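For example (assumed numbers, not from the slides): with a hit time of 1 ns, a miss rate of 5%, and a miss penalty of 100 ns, the average memory access time is 1 + 0.05 × 100 = 6 ns – much closer to the 1 ns hit time than to the 101 ns worst case.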

Page 9: Memory/Storage Architecture Lab Computer Architecture Memory Hierarchy


(CPU) Cache

Upper level: SRAM (small, fast, expensive)
Lower level: DRAM (large, slow, cheap)

Goal: To provide a “virtual” memory technology that has the access time of SRAM with the size and cost of DRAM

Additional benefits
− Reduces the memory bandwidth consumed by the processor, leaving more memory bandwidth for I/O
− No need to change the ISA

Page 10: Memory/Storage Architecture Lab Computer Architecture Memory Hierarchy


Direct-mapped Cache

Each memory block is mapped to a single cache block
The mapped cache block is determined by (memory block address) mod (number of cache blocks)
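For example, with the 1024-block cache used on the next slides, the memory block holding bytes 8188…8191 is memory block 2047, which maps to cache block 2047 mod 1024 = 1023.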

Page 11: Memory/Storage Architecture Lab Computer Architecture Memory Hierarchy


Direct-Mapped Cache Example

Consider a direct-mapped cache with a block size of 4 bytes and a total capacity of 4KB
Assume 1 word per block:
− The 2 lowest address bits specify the byte within a block
− The next 10 address bits specify the block’s index within the cache
− The 20 highest address bits are the unique tag for this memory block
− The valid bit specifies whether the block is an accurate copy of memory

Exploits temporal locality
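A minimal C sketch (not from the slides) of how this 4KB, 4-byte-block cache splits an address into the three fields, assuming 32-bit addresses:

#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint32_t addr   = 8188;                  /* one of the example addresses used below */
    uint32_t offset = addr & 0x3;            /* bits [1:0]  – byte within the block */
    uint32_t index  = (addr >> 2) & 0x3FF;   /* bits [11:2] – block index, 0..1023  */
    uint32_t tag    = addr >> 12;            /* bits [31:12] – 20-bit tag           */
    printf("offset=%u index=%u tag=%u\n", offset, index, tag);  /* offset=0 index=1023 tag=1 */
    return 0;
}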

Page 12: Memory/Storage Architecture Lab Computer Architecture Memory Hierarchy


On cache read

On a cache hit, the CPU proceeds normally
On a cache miss (handled completely by hardware)
− Stall the CPU pipeline
− Fetch the missed block from the next level of the hierarchy
− Instruction cache miss: restart the instruction fetch
− Data cache miss: complete the data access

Page 13: Memory/Storage Architecture Lab Computer Architecture Memory Hierarchy


On cache write

Write-through
− Always write the data into both the cache and main memory
− Simple, but slow and increases memory traffic (requires a write buffer)

Write-back
− Write the data into the cache only and update main memory when a dirty block is replaced (requires a dirty bit and possibly a write buffer)
− Fast, but complex to implement and causes a consistency problem
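A minimal C sketch (illustrative names and structure, not the slides' design) contrasting the two policies for a single 4-byte cache block:

#include <stdint.h>
#include <string.h>
#include <stdio.h>

struct block { int valid, dirty; uint32_t tag; uint8_t data[4]; };
static uint8_t memory[64 * 1024];                /* toy main memory */

static void write_hit(struct block *b, uint32_t addr, uint8_t byte, int write_through) {
    b->data[addr & 0x3] = byte;                  /* both policies update the cached copy */
    if (write_through)
        memory[addr] = byte;                     /* write-through: memory updated immediately */
    else
        b->dirty = 1;                            /* write-back: mark dirty, update memory later */
}

static void evict(struct block *b, uint32_t block_addr) {
    if (b->dirty)                                /* write-back: flush the dirty block on replacement */
        memcpy(&memory[block_addr], b->data, sizeof b->data);
    b->valid = b->dirty = 0;
}

int main(void) {
    struct block b = { .valid = 1 };
    write_hit(&b, 8, 0xAB, 0);                   /* write-back style write */
    evict(&b, 8);                                /* the dirty data reaches memory only here */
    printf("memory[8] = 0x%02X\n", memory[8]);   /* prints 0xAB */
    return 0;
}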

Page 14: Memory/Storage Architecture Lab Computer Architecture Memory Hierarchy


Write allocation

What should happen on a write miss?

Alternatives for write-through
− Allocate on miss: fetch the block
− Write around: don’t fetch the block
  (since programs often write a whole block before reading it, e.g., on initialization)

For write-back
− Usually fetch the block

Page 15: Memory/Storage Architecture Lab Computer Architecture Memory Hierarchy


Memory Reference Sequence

Look at the following sequence of memory references for the previous direct-mapped cache

0, 4, 8188, 0, 16384, 0

Index   Valid   Tag    Data
0       0       XXXX   XXXX
1       0       XXXX   XXXX
2       0       XXXX   XXXX
3       0       XXXX   XXXX
…       …
1021    0       XXXX   XXXX
1022    0       XXXX   XXXX
1023    0       XXXX   XXXX

Cache initially empty

Page 16: Memory/Storage Architecture Lab Computer Architecture Memory Hierarchy


After Reference 1

Look at the following sequence of memory references for the previous direct-mapped cache

0, 4, 8188, 0, 16384, 0
Address = 00000000000000000000 0000000000 00 (tag | index | byte offset)

Index   Valid   Tag                    Data
0       1       00000000000000000000   Memory bytes 0…3 (copy)
1       0       XXXX                   XXXX
2       0       XXXX                   XXXX
3       0       XXXX                   XXXX
…       …
1021    0       XXXX                   XXXX
1022    0       XXXX                   XXXX
1023    0       XXXX                   XXXX

Cache miss – place block at index 0

Page 17: Memory/Storage Architecture Lab Computer Architecture Memory Hierarchy


After Reference 2

Look at the following sequence of memory references for the previous direct-mapped cache

0, 4, 8188, 0, 16384, 0
Address = 00000000000000000000 0000000001 00

Index   Valid   Tag                    Data
0       1       00000000000000000000   Memory bytes 0…3 (copy)
1       1       00000000000000000000   Memory bytes 4…7 (copy)
2       0       XXXX                   XXXX
3       0       XXXX                   XXXX
…       …
1021    0       XXXX                   XXXX
1022    0       XXXX                   XXXX
1023    0       XXXX                   XXXX

Cache miss – place block at index 1

Page 18: Memory/Storage Architecture Lab Computer Architecture Memory Hierarchy


After Reference 3

Look at the following sequence of memory references for the previous direct-mapped cache

0, 4, 8188, 0, 16384, 0
Address = 00000000000000000001 1111111111 00

Index   Valid   Tag                    Data
0       1       00000000000000000000   Memory bytes 0…3 (copy)
1       1       00000000000000000000   Memory bytes 4…7 (copy)
2       0       XXXX                   XXXX
3       0       XXXX                   XXXX
…       …
1021    0       XXXX                   XXXX
1022    0       XXXX                   XXXX
1023    1       00000000000000000001   Memory bytes 8188…8191 (copy)

Cache miss – place block at index 1023

Page 19: Memory/Storage Architecture Lab Computer Architecture Memory Hierarchy


After Reference 4

Look at the following sequence of memory references for the previous direct-mapped cache

0, 4, 8188, 0, 16384, 0
Address = 00000000000000000000 0000000000 00

Index   Valid   Tag                    Data
0       1       00000000000000000000   Memory bytes 0…3 (copy)
1       1       00000000000000000000   Memory bytes 4…7 (copy)
2       0       XXXX                   XXXX
3       0       XXXX                   XXXX
…       …
1021    0       XXXX                   XXXX
1022    0       XXXX                   XXXX
1023    1       00000000000000000001   Memory bytes 8188…8191 (copy)

Cache hit to the block at index 0

Page 20: Memory/Storage Architecture Lab Computer Architecture Memory Hierarchy


After Reference 5

Look at the following sequence of memory references for the previous direct-mapped cache

0, 4, 8188, 0, 16384, 0
Address = 00000000000000000100 0000000000 00 [same index!]

Index   Valid   Tag                    Data
0       1       00000000000000000100   Memory bytes 16384…16387 (copy)
1       1       00000000000000000000   Memory bytes 4…7 (copy)
2       0       XXXX                   XXXX
3       0       XXXX                   XXXX
…       …
1021    0       XXXX                   XXXX
1022    0       XXXX                   XXXX
1023    1       00000000000000000001   Memory bytes 8188…8191 (copy)

Cache miss – replace the block at index 0

Page 21: Memory/Storage Architecture Lab Computer Architecture Memory Hierarchy


After Reference 6

Look at the following sequence of memory references for the previous direct-mapped cache

0, 4, 8188, 0, 16384, 0
Address = 00000000000000000000 0000000000 00 [same index!]

Index   Valid   Tag                    Data
0       1       00000000000000000000   Memory bytes 0…3 (copy)
1       1       00000000000000000000   Memory bytes 4…7 (copy)
2       0       XXXX                   XXXX
3       0       XXXX                   XXXX
…       …
1021    0       XXXX                   XXXX
1022    0       XXXX                   XXXX
1023    1       00000000000000000001   Memory bytes 8188…8191 (copy)

Cache miss again – replace the block at index 0
Total: 1 hit and 5 misses
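The whole sequence can be checked with a small C sketch (not from the slides) that models the 4KB direct-mapped cache with 4-byte blocks:

#include <stdint.h>
#include <stdio.h>

#define BLOCKS 1024

int main(void) {
    struct { int valid; uint32_t tag; } cache[BLOCKS] = {{0}};
    uint32_t refs[] = {0, 4, 8188, 0, 16384, 0};
    int hits = 0, misses = 0;

    for (int i = 0; i < 6; i++) {
        uint32_t index = (refs[i] >> 2) % BLOCKS;      /* 10-bit block index */
        uint32_t tag   = refs[i] >> 12;                /* 20-bit tag */
        if (cache[index].valid && cache[index].tag == tag) {
            hits++;
        } else {
            misses++;                                  /* fetch (or replace) the block */
            cache[index].valid = 1;
            cache[index].tag   = tag;
        }
    }
    printf("hits=%d misses=%d\n", hits, misses);       /* prints hits=1 misses=5 */
    return 0;
}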

Page 22: Memory/Storage Architecture Lab Computer Architecture Memory Hierarchy


Exploiting Spatial Locality: block size larger than one word

16 KB direct-mapped cache with 256 64-byte (16-word) blocks
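With this geometry (assuming 32-bit addresses, as before), an address splits into a 6-bit byte offset within the 64-byte block, an 8-bit index selecting one of the 256 blocks, and an 18-bit tag.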

Page 23: Memory/Storage Architecture Lab Computer Architecture Memory Hierarchy


Miss Rate vs. Block Size

Page 24: Memory/Storage Architecture Lab Computer Architecture Memory Hierarchy


Set-Associative Caches

Allow multiple entries per index to improve hit rates

n-way set-associative caches allow up to n conflicting references to be cached
− n is the number of cache blocks in each set
− n comparisons are needed to search all blocks in the set in parallel
− When there is a conflict, which block is replaced? (This was easy for direct-mapped caches – there's only one entry!)

Fully-associative caches
− a single (very large!) set allows a memory location to be placed in any cache block

Direct-mapped caches are essentially 1-way set-associative caches

For a fixed cache capacity, higher associativity leads to higher hit rates, because more combinations of memory blocks can be present in the cache

Set associativity optimizes cache contents, but at what cost?
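A minimal C sketch (assumed parameters, not from the slides) of an n-way set-associative lookup – compute the set index, then compare the tag against every way in that set:

#include <stdint.h>
#include <stdio.h>

#define WAYS        2
#define SETS        256
#define BLOCK_BYTES 8

struct line { int valid; uint32_t tag; };
static struct line cache[SETS][WAYS];

static int lookup(uint32_t addr) {
    uint32_t set = (addr / BLOCK_BYTES) % SETS;        /* which set the block maps to */
    uint32_t tag = (addr / BLOCK_BYTES) / SETS;        /* identifies the block within the set */
    for (int w = 0; w < WAYS; w++)                     /* hardware does these compares in parallel */
        if (cache[set][w].valid && cache[set][w].tag == tag)
            return 1;                                  /* hit */
    return 0;                                          /* miss */
}

int main(void) {
    printf("%s\n", lookup(0) ? "hit" : "miss");        /* cold cache: prints "miss" */
    return 0;
}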

Page 25: Memory/Storage Architecture Lab Computer Architecture Memory Hierarchy


Cache Organization Spectrum

Page 26: Memory/Storage Architecture Lab Computer Architecture Memory Hierarchy


Implementation of Set Associative Cache

Page 27: Memory/Storage Architecture Lab Computer Architecture Memory Hierarchy


Cache Organization Example

One-way set associative (direct mapped): 8 blocks (0–7), each holding one tag/data pair
Two-way set associative: 4 sets (0–3), each holding two tag/data pairs
Four-way set associative: 2 sets (0–1), each holding four tag/data pairs
Eight-way set associative (fully associative): 1 set holding eight tag/data pairs

Page 28: Memory/Storage Architecture Lab Computer Architecture Memory Hierarchy


Cache Block Replacement Policy

Direct-mapped caches
− No replacement policy is needed, since each memory block can be placed in only one cache block

N-way set-associative caches
− Each memory block can be placed in any of the n cache blocks in the mapped set
− A Least Recently Used (LRU) replacement policy is typically used to select the block to be replaced among the blocks in the mapped set
− LRU replaces the block that has not been used for the longest time
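For a 2-way set, the LRU state can be a single bit per set naming the way to evict next – a hedged C sketch (not the slides' implementation):

#include <stdio.h>

#define SETS 256

static int lru[SETS];                         /* lru[s] = way in set s to evict on the next miss */

static void touch(int set, int way) { lru[set] = 1 - way; }   /* the other way becomes LRU */
static int  victim(int set)         { return lru[set]; }

int main(void) {
    touch(0, 0);                              /* way 0 of set 0 is used ... */
    touch(0, 1);                              /* ... then way 1 is used     */
    printf("victim in set 0: way %d\n", victim(0));           /* prints way 0 */
    return 0;
}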

Page 29: Memory/Storage Architecture Lab Computer Architecture Memory Hierarchy


Miss Rate vs. Set Associativity

Page 30: Memory/Storage Architecture Lab Computer Architecture Memory Hierarchy


Memory Reference Sequence

Look again at the following sequence of memory references for a 2-way set-associative cache with a block size of two words (8 bytes)

0, 4, 8188, 0, 16384, 0

This sequence had 5 misses and 1 hit for the direct-mapped cache with the same capacity

Set    Valid   Tag    Data
0      0       XXXX   XXXX
       0       XXXX   XXXX
1      0       XXXX   XXXX
       0       XXXX   XXXX
…      …
255    0       XXXX   XXXX
       0       XXXX   XXXX

Cache initially empty

Page 31: Memory/Storage Architecture Lab Computer Architecture Memory Hierarchy


After Reference 1

Look again at the following sequence of memory references for a 2-way set-associative cache with a block size of two words (8 bytes)

0, 4, 8188, 0, 16384, 0
Address = 000000000000000000000 00000000 000 (tag | set index | byte offset)

Set    Valid   Tag                     Data
0      1       000000000000000000000   Memory bytes 0…7 (copy)
       0       XXXX                    XXXX
1      0       XXXX                    XXXX
       0       XXXX                    XXXX
…      …
255    0       XXXX                    XXXX
       0       XXXX                    XXXX

Cache miss – place in the first block of set 0

Page 32: Memory/Storage Architecture Lab Computer Architecture Memory Hierarchy


After Reference 2

Look again at the following sequence of memory references for a 2-way set-associative cache with a block size of two words (8 bytes)

0, 4, 8188, 0, 16384, 0
Address = 000000000000000000000 00000000 100

Set    Valid   Tag                     Data
0      1       000000000000000000000   Memory bytes 0…7 (copy)
       0       XXXX                    XXXX
1      0       XXXX                    XXXX
       0       XXXX                    XXXX
…      …
255    0       XXXX                    XXXX
       0       XXXX                    XXXX

Cache hit to the first block of set 0

Page 33: Memory/Storage Architecture Lab Computer Architecture Memory Hierarchy


After Reference 3

Look again at the following sequence of memory references for a 2-way set-associative cache with a block size of two words (8 bytes)

0, 4, 8188, 0, 16384, 0
Address = 000000000000000000011 11111111 100

Set    Valid   Tag                     Data
0      1       000000000000000000000   Memory bytes 0…7 (copy)
       0       XXXX                    XXXX
1      0       XXXX                    XXXX
       0       XXXX                    XXXX
…      …
255    1       000000000000000000011   Memory bytes 8184…8191 (copy)
       0       XXXX                    XXXX

Cache miss – place in the first block of set 255

Page 34: Memory/Storage Architecture Lab Computer Architecture Memory Hierarchy


After Reference 4

Look again at the following sequence of memory references for a 2-way set-associative cache with a block size of two words (8 bytes)

0, 4, 8188, 0, 16384, 0
Address = 000000000000000000000 00000000 000

Set    Valid   Tag                     Data
0      1       000000000000000000000   Memory bytes 0…7 (copy)
       0       XXXX                    XXXX
1      0       XXXX                    XXXX
       0       XXXX                    XXXX
…      …
255    1       000000000000000000011   Memory bytes 8184…8191 (copy)
       0       XXXX                    XXXX

Cache hit to the first block of set 0

Page 35: Memory/Storage Architecture Lab Computer Architecture Memory Hierarchy


After Reference 5

Look again at the following sequence of memory references for a 2-way set-associative cache with a block size of two words (8 bytes)

0, 4, 8188, 0, 16384, 0
Address = 000000000000000001000 00000000 000

Set    Valid   Tag                     Data
0      1       000000000000000000000   Memory bytes 0…7 (copy)
       1       000000000000000001000   Memory bytes 16384…16391 (copy)
1      0       XXXX                    XXXX
       0       XXXX                    XXXX
…      …
255    1       000000000000000000011   Memory bytes 8184…8191 (copy)
       0       XXXX                    XXXX

Cache miss – place in the second block of set 0

Page 36: Memory/Storage Architecture Lab Computer Architecture Memory Hierarchy


After Reference 6

Look again at the following sequence of memory references for a 2-way set-associative cache with a block size of two words (8 bytes)

0, 4, 8188, 0, 16384, 0
Address = 000000000000000000000 00000000 000

Set    Valid   Tag                     Data
0      1       000000000000000000000   Memory bytes 0…7 (copy)
       1       000000000000000001000   Memory bytes 16384…16391 (copy)
1      0       XXXX                    XXXX
       0       XXXX                    XXXX
…      …
255    1       000000000000000000011   Memory bytes 8184…8191 (copy)
       0       XXXX                    XXXX

Cache hit to the first block of set 0
Total: 3 hits and 3 misses
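To check the bookkeeping: with this cache, an address decomposes as set = (address / 8) mod 256 and tag = address / 2048, so 0 and 4 share one block (set 0, tag 0), 8188 falls in set 255 with tag 3, and 16384 falls in set 0 with tag 8. The reference to 4 now hits because it shares the 8-byte block with 0, and the final reference to 0 now hits because 16384 went into the second way instead of evicting it – hence 3 hits and 3 misses instead of 1 and 5.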

Page 37: Memory/Storage Architecture Lab Computer Architecture Memory Hierarchy


Improving Cache Performance

Cache performance is determined by:
Average memory access time = hit time + (miss rate × miss penalty)

Decrease hit time
− Make the cache smaller, but the miss rate increases
− Use direct mapping, but the miss rate increases

Decrease miss rate
− Make the cache larger, but this can increase hit time
− Add associativity, but this can increase hit time
− Increase the block size, but this increases the miss penalty

Decrease miss penalty
− Reduce the transfer-time component of the miss penalty
− Add another level of cache
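For example (assumed numbers, not from the slides): with a 1-cycle hit time, a 5% miss rate, and a 100-cycle miss penalty, the average memory access time is 1 + 0.05 × 100 = 6 cycles. If an added second-level cache has a 10-cycle hit time and catches 80% of the first-level misses, the effective miss penalty drops to 10 + 0.2 × 100 = 30 cycles, and the average access time falls to 1 + 0.05 × 30 = 2.5 cycles.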

Page 38: Memory/Storage Architecture Lab Computer Architecture Memory Hierarchy


Current Cache Organizations

L1 caches (per core)
− Intel Nehalem: L1 I-cache: 32KB, 64-byte blocks, 4-way, approx LRU replacement, hit time n/a; L1 D-cache: 32KB, 64-byte blocks, 8-way, approx LRU replacement, write-back/allocate, hit time n/a
− AMD Opteron X4: L1 I-cache: 32KB, 64-byte blocks, 2-way, LRU replacement, hit time 3 cycles; L1 D-cache: 32KB, 64-byte blocks, 2-way, LRU replacement, write-back/allocate, hit time 9 cycles

L2 unified cache (per core)
− Intel Nehalem: 256KB, 64-byte blocks, 8-way, approx LRU replacement, write-back/allocate, hit time n/a
− AMD Opteron X4: 512KB, 64-byte blocks, 16-way, approx LRU replacement, write-back/allocate, hit time n/a

L3 unified cache (shared)
− Intel Nehalem: 8MB, 64-byte blocks, 16-way, replacement n/a, write-back/allocate, hit time n/a
− AMD Opteron X4: 2MB, 64-byte blocks, 32-way, replace block shared by fewest cores, write-back/allocate, hit time 32 cycles

n/a: data not available

Page 39: Memory/Storage Architecture Lab Computer Architecture Memory Hierarchy


Cache Coherence Problem

Suppose two CPU cores share a physical address space, with write-through caches:

Time step   Event                  CPU A’s cache   CPU B’s cache   Memory
0                                                                  0
1           CPU A reads X          0                               0
2           CPU B reads X          0               0               0
3           CPU A writes 1 to X    1               0               1

After step 3, CPU B's cache still holds the stale value 0 for X – this is the coherence problem.

Page 40: Memory/Storage Architecture Lab Computer Architecture Memory Hierarchy


Snoopy Protocols

Write Invalidate Protocol
− On a write to shared data, an invalidate is sent to all caches, which snoop the bus and invalidate any copies

Write Broadcast Protocol
− On a write to shared data, the new value is broadcast on the bus; processors snoop and update their copies

Write serialization: the bus serializes requests
− The bus is the single point of arbitration

Page 41: Memory/Storage Architecture Lab Computer Architecture Memory Hierarchy


Write Invalidate Protocol

A cache gets exclusive access to a block when the block is to be written
− It broadcasts an invalidate message on the bus
− A subsequent read in another cache misses, and the owning cache supplies the updated value

CPU activity          Bus activity       CPU A’s cache   CPU B’s cache   Memory
                                                                         0
CPU A reads X         Cache miss for X   0                               0
CPU B reads X         Cache miss for X   0               0               0
CPU A writes 1 to X   Invalidate for X   1                               0
CPU B reads X         Cache miss for X   1               1               1

Page 42: Memory/Storage Architecture Lab Computer Architecture Memory Hierarchy


Summary

Memory hierarchies are an optimization resulting from a perfect match between memory technology and two types of program locality
− Temporal locality
− Spatial locality

The goal is to provide a “virtual” memory technology (an illusion) that has the access time of the highest-level memory with the size and cost of the lowest-level memory

Cache memory is an instance of a memory hierarchy
− It exploits both temporal and spatial locality
− Direct-mapped caches are simple and fast but have higher miss rates
− Set-associative caches have lower miss rates but are more complex and slower
− Multilevel caches are becoming increasingly popular
− Cache coherence protocols ensure consistency among multiple caches