The Memory System (Chapter 5)
Post on 21-Dec-2015
http://www.pds.ewi.tudelft.nl/~iosup/Courses/2011_ti1400_9.ppt
Agenda
1. Basic Concepts
2. Performance Considerations: Interleaving, Hit ratio/rate, etc.
3. Caches
4. Virtual Memory

1.1. Organization
1.2. Pinning
TU-Delft TI1400/11-PDS
1.1. Organization

[Figure: memory viewed as an array of words — word addresses 0, 4, 8, ... with byte addresses 0, 1, 2, 3 within each word]
1.1. Connection Memory-CPU

[Figure: CPU and Memory connected via the MAR (address) and MDR (data) registers; control signals Read/Write from the CPU to memory, and MFC (Memory Function Completed) back to the CPU]
1.1. Memory: contents

• Addressable number of bits
• Different orderings
• Speed-up techniques
  - Memory interleaving
  - Cache memories
• Enlargement
  - Virtual memory
1.1. Organisation (1)

[Figure: a 16x8 memory cell array — address lines A0-A3 feed an address decoder that selects word lines W0-W15; each cell is a flip-flop (FF); sense/write circuits connect the cells to input/output lines b7 ... b1, b0, under control of R/W and CS]
1.2. Pinning

• Total number of pins required for a 16x8 memory: 16
  - 4 address lines
  - 8 data lines
  - 2 control lines
  - 2 power lines
1.2. A 1K by 1 Memory

[Figure: a 32 by 32 memory cell array — a 5-bit decoder drives word lines W0-W31 from 5 of the 10 address lines; two 32-to-1 multiplexors, driven by the other 5 address bits, select the bit within a row for the in/out data lines]
1.2. Pinning

• Total number of pins required for a 1024x1 memory: 16
  - 10 address lines
  - 2 data lines (in/out)
  - 2 control lines
  - 2 power lines
• For a 128x8 memory: 19 pins (7+8+2+2)
• Conclusion: the smaller the addressable unit, the fewer pins needed
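The pin counts above follow a simple model: address pins plus data pins plus 2 control and 2 power pins. A minimal sketch of that model (the function name and the optional separate-in/out parameter are mine, not from the slides):

```python
from math import ceil, log2

def pin_count(words: int, bits_per_word: int, data_pins: int = None) -> int:
    """Pins = address + data + 2 control (R/W, CS) + 2 power,
    per the simple model on this slide. data_pins overrides the
    default of one pin per data bit (e.g., separate in/out lines)."""
    address_pins = ceil(log2(words))
    if data_pins is None:
        data_pins = bits_per_word
    return address_pins + data_pins + 2 + 2

print(pin_count(16, 8))      # 16x8 memory: 4 + 8 + 2 + 2 = 16
print(pin_count(1024, 1, 2)) # 1024x1 with separate in/out: 10 + 2 + 2 + 2 = 16
print(pin_count(128, 8))     # 128x8 memory: 7 + 8 + 2 + 2 = 19
```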
Agenda
1. Basic Concepts
2. Performance Considerations
3. Caches
4. Virtual Memory

2.1. Interleaving
2.2. Performance Gap Processor-Memory
2.3. Caching
2.4. A Performance Model: Hit ratio, Performance Penalty, etc.
2.1. Interleaving: Multiple Modules (1)

[Figure: the main memory (MM) address is split into k module bits and m address-in-module bits; the k high-order bits select one of modules 0 ... n-1 via its CS input]

Block-wise organization (consecutive words in a single module). CS = Chip Select.
2.1. Interleaving: Multiple Modules (2)

[Figure: the main memory (MM) address is split into m address-in-module bits and k module bits; the k low-order bits select one of modules 0 ... 2^k-1 via its CS input]

Interleaving organization (consecutive words in consecutive modules). CS = Chip Select.
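The two organizations differ only in which address bits select the module. A minimal sketch (function names are mine; k and m are the module and address-in-module bit widths from the slides):

```python
def blockwise_module(addr: int, m: int) -> int:
    """Block-wise: the high-order bits (above the m address-in-module
    bits) select the module, so consecutive words stay in one module."""
    return addr >> m

def interleaved_module(addr: int, k: int) -> int:
    """Interleaved: the k low-order bits select the module,
    so consecutive words land in consecutive modules."""
    return addr & ((1 << k) - 1)

# k=2 module bits (4 modules), m=4 address bits within a module:
print([blockwise_module(a, m=4) for a in range(4)])    # [0, 0, 0, 0]
print([interleaved_module(a, k=2) for a in range(4)])  # [0, 1, 2, 3]
```

The second line shows why interleaving gives higher bandwidth: a run of consecutive addresses touches all modules, so transfers can proceed in parallel.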
Questions

• What is the advantage of the interleaved organization?
  Higher bandwidth CPU-memory: data can be transferred to/from multiple modules simultaneously.
• What is the disadvantage?
  When a module breaks down, memory has many small holes.
2.2. Problem: The Performance Gap Processor-Memory

• Processor: CPU speeds 2x every 2 years (~Moore's Law; limit ~2010)
• Memory: DRAM speeds 2x every 7 years
• Gap: 2x every 2 years
• Gap still growing?
2.2. Idea: Memory Hierarchy

[Figure: hierarchy from the CPU through the primary cache (L1), secondary cache (L2), and main memory down to disks — size increases going down the hierarchy, while speed and cost per bit increase going up toward the CPU]
2.3. Caches (1)

• Problem: main memory is slower than CPU registers (by a factor of 5-10)
• Solution: a fast and small memory between the CPU and main memory
• Contains: recently referenced memory contents

[Figure: CPU - Cache - Main memory]
2.3. Caches (2) / 2.4. A Performance Model

• Works because of the locality principle
• Profit:
  - cache hit ratio (rate): h
  - cache miss ratio (rate): 1-h
  - access time cache: c
  - access time main memory: m
  - mean access time: h·c + (1-h)·m
• The cache is transparent to the programmer
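The mean-access-time formula above can be sketched directly (the example hit ratio and access times are illustrative values, not from the slides):

```python
def mean_access_time(h: float, c: float, m: float) -> float:
    """Mean access time = h*c + (1-h)*m, with hit ratio h,
    cache access time c, and main memory access time m."""
    return h * c + (1 - h) * m

# e.g., 95% hit ratio, 1-cycle cache, 10-cycle main memory:
print(mean_access_time(0.95, 1, 10))  # 1.45 cycles on average
```

Note how even a small miss ratio dominates: with m ten times c, 5% misses already add half a cycle to every access on average.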
2.3. Caches (3)

• READ operation:
  - if not in cache, copy the block into the cache and read out of the cache (possibly read-through)
  - if in cache, read out of the cache
• WRITE operation:
  - if not in cache, write in main memory
  - if in cache, write in cache, and either:
    • write in main memory as well (store through), or
    • set the modified (dirty) bit, and write back later
2.3. Caches (4): The Library Analogy

• Real-world analogue:
  - borrow books from a library
  - store these books according to the first letter of the first author's name, in 26 locations
• Direct mapped: a single location for one book per letter of the alphabet
• Associative: any book can go to any of the 26 locations
• Set-associative: two locations for letters A-B, two for C-D, etc.

[Figure: 26 shelf locations, numbered 1 ... 26, for letters A-Z]
2.3. Caches (5)

• Suppose:
  - size of main memory in bytes: N = 2^n
  - block size in bytes: b = 2^k
  - number of blocks in cache: 128
  - e.g., n=16, k=4, b=16
• Every block in the cache has a valid bit (reset when memory is modified)
• At a context switch: invalidate the cache
Agenda
1. Basic Concepts
2. Performance Considerations
3. Caches
4. Virtual Memory

3.1. Mapping Function
3.2. Replacement Algorithm
3.3. Examples of Mapping
3.4. Examples of Caches in Commercial Processors
3.5. Write Policy
3.6. Number of Blocks/Caches/…
3.1. Mapping Function
1. Direct Mapped Cache (1)

• A block in main memory can be at only one place in the cache
• This place is determined by its block number j:
  - place = j modulo the number of blocks in the cache

Main memory address: 5-bit tag | 7-bit block | 4-bit word
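The 5/7/4 address split above can be sketched as a small function (the function name and bit-mask style are mine; the field widths are the slide's, for a 16-bit address, 128 cache blocks, and 16-byte blocks):

```python
def direct_mapped_split(addr: int, block_bits: int = 7, word_bits: int = 4):
    """Split a 16-bit address into (tag, block, word) for a direct-mapped
    cache with 2**block_bits blocks of 2**word_bits bytes each."""
    word  = addr & ((1 << word_bits) - 1)
    block = (addr >> word_bits) & ((1 << block_bits) - 1)
    tag   = addr >> (word_bits + block_bits)
    return tag, block, word

# Memory blocks j and j+128 land in the same cache block (j modulo 128):
print(direct_mapped_split(0 << 4))    # (0, 0, 0): memory block 0 -> cache block 0
print(direct_mapped_split(128 << 4))  # (1, 0, 0): memory block 128 -> cache block 0
```

The tag stored with the cache block is what distinguishes memory block 0 from memory block 128 once both map to the same place.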
3.1. Direct Mapped Cache (2)

[Figure: main memory blocks 0 ... 127 map one-to-one onto cache blocks 0 ... 127; blocks 128, 129, ..., 255, 256, ... wrap around, so blocks 0, 128, 256, ... all map to cache block 0; each cache block stores a 5-bit tag identifying which memory block it currently holds]
3.1. Direct Mapped Cache (3)

[Figure: the same mapping as the previous slide, now with cache blocks 0 and 1 both filled, each tagged with the 5-bit tag of the memory block it holds]
3.1. Mapping Function
2. Associative Cache (1)

• Each block can be at any place in the cache
• Cache access: parallel (associative) match of the tag in the address with the tags in all cache entries
• Associative: slower, more expensive, higher hit ratio

Main memory address: 12-bit tag | 4-bit word
3.1.2. Associative Cache (2)

[Figure: any main memory block (0, 1, ..., 127, 128, 129, ..., 255, 256, ...) can be placed in any of the 128 cache blocks; each cache block stores a 12-bit tag identifying the memory block it holds]
3.1. Mapping Function
3. Set-Associative Cache (1)

• Combination of direct mapped and associative
• The cache consists of sets
• Mapping of a block to a set is direct, determined by the set number
• Each set is associative

Main memory address: 6-bit tag | 6-bit set | 4-bit word
3.1.3. Set-Associative Cache (2)

[Figure: 128 cache blocks organized as 64 two-way sets (set 0, set 1, ...); each cache block carries a 6-bit tag; main memory blocks 0, 1, ..., 127, 128, ..., 255, 256, ... are shown mapping into the sets]

Q: What is wrong in this picture?
Answer: there are 64 sets, so block 64 also goes to set 0
3.1.3. Set-Associative Cache (3)

[Figure: the corrected mapping — with 64 two-way sets, main memory blocks whose block number is a multiple of 64 (0, 64, 128, ...) map to set 0; each cache block carries a 6-bit tag]
Question

• Main memory: 4 GByte
• Cache: 512 blocks of 64 byte
• Cache: 8-way set-associative (set size is 8)
• All memories are byte addressable

Q: How many bits is the:
- byte address within a block
- set number
- tag
Answer

• Main memory is 4 GByte, so a 32-bit address
• A block is 64 byte, so a 6-bit byte address within a block
• 8-way set-associative cache with 512 blocks, so 512/8 = 64 sets, so a 6-bit set number
• So the tag is 32-6-6 = 20 bits

Main memory address: 20-bit tag | 6-bit set | 6-bit word
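The same arithmetic can be sketched as a function (the function name is mine; the formula follows the worked answer above):

```python
from math import log2

def cache_address_bits(mem_bytes: int, blocks: int, block_bytes: int, ways: int):
    """Return (tag, set, word) bit widths for a set-associative cache.
    All sizes are assumed to be powers of two."""
    addr_bits = int(log2(mem_bytes))
    word_bits = int(log2(block_bytes))          # byte address within a block
    set_bits  = int(log2(blocks // ways))       # number of sets = blocks/ways
    return addr_bits - set_bits - word_bits, set_bits, word_bits

# 4 GByte memory, 512 blocks of 64 byte, 8-way set-associative:
print(cache_address_bits(4 * 2**30, 512, 64, 8))  # (20, 6, 6)
```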
3.2. Replacement Algorithm: Replacement (1)

(Set-)associative replacement algorithms:
• Least Recently Used (LRU)
  - if 2^k blocks per set, implement with a k-bit counter per block
  - hit: increase all counters lower than the referenced one by 1, set the referenced counter to 0
  - miss and set not full: place the new block, set its counter to 0, increase the rest
  - miss and set full: replace the block with the highest value (2^k - 1), set the new block's counter to 0, increase the rest
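The three counter rules above can be sketched as a small class (a toy model — the class and method names are mine; counter 0 means most recently used, the highest counter least recently used):

```python
class LRUSet:
    """LRU bookkeeping for one cache set, with k-bit counters per block,
    following the hit / miss-not-full / miss-full rules on this slide."""

    def __init__(self, k: int):
        self.size = 2 ** k
        self.blocks = []  # list of (tag, counter), at most 2**k entries

    def access(self, tag) -> bool:
        counters = dict(self.blocks)
        if tag in counters:                       # hit
            ref = counters[tag]
            self.blocks = [(t, 0) if t == tag else
                           (t, c + 1 if c < ref else c)
                           for t, c in self.blocks]
            return True
        if len(self.blocks) < self.size:          # miss, set not full
            self.blocks = [(t, c + 1) for t, c in self.blocks] + [(tag, 0)]
        else:                                     # miss, set full
            victim = max(self.blocks, key=lambda tc: tc[1])[0]
            self.blocks = [(t, c + 1) for t, c in self.blocks
                           if t != victim] + [(tag, 0)]
        return False

s = LRUSet(k=2)
for t in "ABCD":
    s.access(t)        # fill the set: D most recent, A least recent
s.access("E")          # miss with the set full: evicts A (highest counter)
print(sorted(t for t, _ in s.blocks))  # ['B', 'C', 'D', 'E']
```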
3.2.1. LRU: Example 1

[Figure: a set of 4 blocks (k=2), each with a 2-bit counter, before and after a HIT — counters lower than the referenced block's are increased, the referenced block's counter is set to 0 (now at the top), and higher counters are unchanged]
3.2.2. LRU: Example 2

[Figure: k=2, a miss while the set is not full — the new block takes an EMPTY position with counter 0 (now at the top) and all other counters are increased]
3.2.3. LRU: Example 3

[Figure: k=2, a miss while the set is full — the block with the highest counter is replaced, the new block's counter is set to 0 (now at the top), and the remaining counters are increased]
3.2. Replacement Algorithm: Replacement (2)

• Alternatives to LRU:
  - replace the oldest block: First-In-First-Out (FIFO)
  - Least-Frequently Used (LFU)
  - random replacement
3.3. Example (1): program

  int SUM = 0;
  for (j = 0; j < 10; j++) {
    SUM = SUM + A[0][j];
  }
  AVE = SUM / 10;
  for (i = 9; i > -1; i--) {
    A[0][i] = A[0][i] / AVE;
  }

• Normalize the elements of row 0 of array A
• First pass: from start to end
• Second pass: from end to start
3.3. Example (2): cache

[Figure: a cache of 8 blocks, each 1 word with a tag, LRU replacement; blocks 0-3 form set 0 and blocks 4-7 form set 1 in the set-associative case]

Address splits for a 16-bit address with 1-word blocks:
• direct: 13-bit tag | 3-bit block
• associative: 16-bit tag
• set-associative: 15-bit tag | 1-bit set
3.3. Example (3): array

[Figure: memory addresses of the 4x10 array A, stored in column-major ordering starting at address 7A00 — consecutive addresses hold a(0,0), a(1,0), a(2,0), a(3,0), a(0,1), ..., a(0,9), a(1,9), a(2,9), a(3,9), so the elements of row 0 are four locations apart; the figure also shows how each address splits into the direct, set-associative, and associative tags]
3.3. Example (4): direct mapped

[Table: contents of the cache after each pass (j=1, 3, 5, 7, 9 and i=6, 4, 2, 0) — the even-indexed elements a[0,0], a[0,2], ... keep replacing each other in one cache block position, and the odd-indexed elements a[0,1], a[0,3], ... in another; every access is a miss]

• Elements of row 0 are also 4 locations apart in the cache
• Conclusion: of the 20 accesses, none hit in the cache
3.3. Example (5): associative

[Table: contents of the cache after passes j=7, j=8, j=9, i=1, i=0 — a[0,0] through a[0,7] first fill the 8 blocks; a[0,8] and a[0,9] then replace the least recently used entries a[0,0] and a[0,1]; on the second pass, from i=9 down to i=2 all accesses are in the cache]

• Conclusion: of the 20 accesses, 8 hit in the cache
3.3. Example (6): set-associative

[Table: contents of the cache after passes j=3, j=7, j=9, i=4, i=2, i=0 — all elements of row 0 are mapped to set 0 (4 blocks), so each group of four elements replaces the previous one; on the second pass, only i=9 down to i=6 are in the cache]

• Conclusion: of the 20 accesses, 4 hit in the cache
3.4. Example: PowerPC (1)

• PowerPC 604
• Separate data and instruction caches
• Each cache is 16 KByte
• Four-way set-associative
• Each cache has 128 sets
• Each block has 8 words of 32 bits
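These parameters are consistent with each other, which a quick arithmetic check confirms (the variable names are mine):

```python
# Geometry check for the PowerPC 604 caches described above:
sets, ways = 128, 4
words_per_block, bytes_per_word = 8, 4
block_bytes = words_per_block * bytes_per_word   # 32 bytes per block
cache_bytes = sets * ways * block_bytes
print(cache_bytes == 16 * 1024)  # True: 128 sets x 4 ways x 32 B = 16 KByte
```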
3.4. Example: PowerPC (2)

[Figure: lookup of address 0000 0000 0011 1111 0100 0000 0000 1000, which splits into tag 003F4, set number 0, and word address 8 within the block; the set number selects set 0, and the tag 003F4 is compared in parallel (=?) with the tags of the four blocks in that set — block 0's tag 00BA2 does not match (no), block 3's tag 003F4 matches (yes): a hit]
Agenda
1. Basic Concepts
2. Performance Considerations
3. Caches
4. Virtual Memory

4.1. Basic Concepts
4.2. Address Translation
4.1. Virtual Memory (1)

• Problem: a compiled program does not fit into memory
• Solution: virtual memory, where the logical address space is larger than the physical address space
• Logical address space: addresses referable by instructions
• Physical address space: addresses referable in the real machine
4.1. Virtual Memory (2)

• Realizing virtual memory requires an address conversion:

  am = f(av)

• am is the physical address (machine address)
• av is the virtual address
• This is generally done by hardware
4.1. Organization

[Figure: the Processor issues virtual addresses (av) to the MMU, which translates them into physical addresses (am) for the Cache and Main Memory; data flows between processor, cache, and main memory; pages move between Main Memory and Disk Storage by DMA transfer]
4.2. Address Translation

• Basic approach: partition both the physical address space and the virtual address space into equally sized blocks called pages
• A virtual address is composed of:
  - a page number
  - a word number within the page (the offset)
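The conversion am = f(av) then amounts to splitting off the page number, looking up its page frame, and re-attaching the offset. A minimal sketch (the 12-bit offset, the function names, and the toy page-table contents are illustrative assumptions, not from the slides):

```python
PAGE_OFFSET_BITS = 12  # illustrative: 4 KByte pages

def split_virtual_address(av: int):
    """Split a virtual address into (page number, offset within the page)."""
    return av >> PAGE_OFFSET_BITS, av & ((1 << PAGE_OFFSET_BITS) - 1)

def translate(av: int, page_table: dict) -> int:
    """am = f(av): look up the page frame and re-attach the offset."""
    page, offset = split_virtual_address(av)
    return (page_table[page] << PAGE_OFFSET_BITS) | offset

page_table = {0: 5, 1: 2}  # toy mapping: virtual page -> page frame
print(hex(translate(0x1ABC, page_table)))  # virtual page 1 -> frame 2: 0x2abc
```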
4.2. Page tables (1)

[Figure: the virtual address from the processor splits into a virtual page number and an offset; the page table base register plus the virtual page number (+) index the page table in main memory; the selected entry holds control bits and a page frame number, which is combined with the offset to form the physical address]
4.2. Page tables (2)

• Having page tables only in main memory is much too slow
• It would add a memory access for every instruction and operand
• Solution: keep a cache with recent address translations: a Translation Look-aside Buffer (TLB)
4.2. Operation of TLB

[Figure: the virtual page number of the virtual address is compared (=?) with the virtual page # fields of the TLB entries; on a hit, the matching real page # is combined with the offset to form the physical address; on a miss, the page table in memory is consulted. Idea: keep the most recent address translations]
4.2. Policies

• The pages of a process in main memory form its resident set
• The mechanism works because of the principle of locality
• Page replacement algorithms are needed
• Protection is possible through the page table register
• Sharing is possible through the page table
• Hardware support: the Memory Management Unit (MMU)
Question

• Main memory: 256 MByte
• Maximal virtual-address space: 4 GByte
• Page size: 4 KByte
• All memories are byte addressable

Q: How many bits is the
- offset within a page
- virtual page frame number
- (physical) page frame number
Answer

• Main memory: 256 MByte
• Maximal virtual-address space: 4 GByte
• Page size: 4 KByte
• All memories are byte addressable

• Virtual address: 32 bits (2^32 = 4 GByte)
• Physical address: 28 bits (2^28 = 256 MByte)
• Offset in a page: 12 bits (2^12 = 4 KByte)
• Virtual page frame number: 32-12 = 20 bits
• Physical page frame number: 28-12 = 16 bits
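The same bit widths fall out of a few lines of arithmetic (the variable names are mine):

```python
from math import log2

GBYTE, MBYTE, KBYTE = 2**30, 2**20, 2**10

virtual_bits  = int(log2(4 * GBYTE))    # 32-bit virtual address
physical_bits = int(log2(256 * MBYTE))  # 28-bit physical address
offset_bits   = int(log2(4 * KBYTE))    # 12-bit offset within a page

print(virtual_bits - offset_bits)   # 20-bit virtual page frame number
print(physical_bits - offset_bits)  # 16-bit physical page frame number
```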