Post on 13-Dec-2015
Slide 1
Memory Hierarchy Design
• Motivated by a combination of the programmer's desire for unlimited fast memory and economic considerations, and based on:
– the principle of locality, and
– the cost/performance ratio of memory technologies (fast → small, large → slow),
to achieve a memory system with cost almost as low as the cheapest level of memory and speed almost as fast as the fastest level.
• Hierarchy: CPU/register file (RF) → cache (C) → main memory (MM) → disk memory/I/O devices (DM)
• Speed in descending order: RF > C > MM > DM
• Space in ascending order: RF < C < MM < DM
Slide 2
Memory Hierarchy Design
• The gaps in speed and space between the different levels are widening:
Level/Name                | 1/RF                                    | 2/C                           | 3/MM             | 4/DM
Typical size              | < 1 KB                                  | < 16 MB                       | < 16 GB          | > 100 GB
Implementation technology | Custom memory with multiple ports, CMOS | On-chip or off-chip CMOS SRAM | CMOS DRAM        | Magnetic disk
Access time (ns)          | 0.25-0.5                                | 0.5-25                        | 80-250           | 5,000,000
Bandwidth (MB/s)          | 20,000-100,000                          | 5000-10,000                   | 1000-5000        | 20-150
Managed by                | Compiler                                | Hardware                      | Operating system | OS/operator
Backed by                 | Cache                                   | Main memory                   | Disk             | CD or tape
Slide 3
Memory Hierarchy Design
• Cache performance review:
Memory stall cycles = Number_of_misses * Miss_penalty
= IC * Misses_per_instr * Miss_penalty
= IC * MAPI * Miss_rate * Miss_penalty
where IC is the instruction count and MAPI stands for memory accesses per instruction
• Four Fundamental Memory Hierarchy Design Issues:
1. Block placement issue: where can a block, the atomic memory unit in cache-memory transactions, be placed in the upper level?
2. Block identification issue: how is a block found if it is in the upper level?
3. Block replacement issue: which block should be replaced on a miss?
4. Write strategy issue: what happens on a write?
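The stall-cycle formula in the cache performance review can be sketched as a small Python function. The workload numbers below are hypothetical and chosen only to illustrate the arithmetic; they do not come from the slides.

```python
def memory_stall_cycles(ic, mapi, miss_rate, miss_penalty):
    """Memory stall cycles = IC * MAPI * Miss_rate * Miss_penalty."""
    return ic * mapi * miss_rate * miss_penalty


# Hypothetical workload (assumed values, not from the slides):
# 1,000,000 instructions, 1.5 memory accesses per instruction,
# a 2% miss rate and a 100-cycle miss penalty.
stalls = memory_stall_cycles(ic=1_000_000, mapi=1.5, miss_rate=0.02, miss_penalty=100)
print(f"memory stall cycles: {stalls:,.0f}")
```

Because the four factors multiply, halving any one of them (for example the miss rate) halves the stall cycles in this model.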
Slide 4
Memory Hierarchy Design
1. Placement: three approaches:
1) fully associative: any block in the main memory can be placed in any block frame. It is flexible but expensive, because every tag must be searched associatively
2) direct mapping: each block in memory is placed in a fixed block frame given by the mapping function: (Block Address) MOD (Number of blocks in cache)
3) set associative: a compromise between fully associative and direct mapping. The cache is divided into sets of block frames, and each block from the memory is first mapped to a fixed set, within which it can be placed in any block frame. Mapping to a set follows the function below, called bit selection:
(Block Address) MOD (Number of sets in cache)
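The three placement schemes differ only in how many sets the cache is divided into, which the following Python sketch tries to show. The function name and the convention that set s owns frames s*associativity onward are assumptions made for illustration.

```python
def candidate_frames(block_address, num_frames, associativity):
    """Return the block frames in which a memory block may be placed.

    associativity == 1           -> direct mapping
    associativity == num_frames  -> fully associative
    anything in between          -> set associative
    Assumed frame numbering: set s owns frames s*associativity .. s*associativity + associativity - 1.
    """
    num_sets = num_frames // associativity
    set_index = block_address % num_sets          # bit selection
    first = set_index * associativity
    return list(range(first, first + associativity))


# Memory block 12 in a cache with 8 block frames:
print(candidate_frames(12, 8, 1))   # direct mapping
print(candidate_frames(12, 8, 2))   # two-way set associative
print(candidate_frames(12, 8, 8))   # fully associative
```

With one set the modulo always yields 0, so every frame is a candidate; with as many sets as frames there is exactly one candidate.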
Slide 5
Memory Hierarchy Design
2. Identification:
• Each block frame in the cache has an address tag indicating the block's address in the memory
• All possible tags are searched in parallel
• A valid bit is attached to the tag to indicate whether the block contains valid information or not
• An address for a datum from the CPU, A, is divided into a block address field and a block offset field:
block address = (A) / (block size) (integer division)
block offset = (A) MOD (block size)
• The block address is further divided into tag and index:
– the index indicates the set in which the block may reside
– the tag is compared to indicate a hit or a miss
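The address split described above can be written out directly. This is a minimal sketch; the function name and the example geometry (64-byte blocks, 512 sets) are assumptions for illustration.

```python
def split_address(addr, block_size, num_sets):
    """Split a byte address into (tag, index, block offset)."""
    block_address = addr // block_size      # block address = A / block size
    block_offset = addr % block_size        # block offset  = A MOD block size
    index = block_address % num_sets        # selects the set
    tag = block_address // num_sets         # compared against the stored tag
    return tag, index, block_offset


tag, index, offset = split_address(0x12345, block_size=64, num_sets=512)
print(tag, index, offset)
```

When block size and number of sets are powers of two, the divisions and modulo operations reduce to selecting bit fields of the address, which is why hardware can do this split for free.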
Slide 6
Memory Hierarchy Design
3. Replacement on a cache miss:
• The more choices for replacement, the more expensive the hardware; direct mapping is the simplest
• Random vs. least-recently used (LRU): the former spreads allocation uniformly and is simple to build, while the latter can take advantage of temporal locality but can be expensive to implement (why?). First in, first out (FIFO) approximates LRU and is simpler than LRU
Data cache misses per 1000 instructions, by associativity and replacement policy:
       | Two-way               | Four-way              | Eight-way
Size   | LRU    Random  FIFO   | LRU    Random  FIFO   | LRU    Random  FIFO
16 KB  | 114.1  117.3   115.5  | 111.7  115.1   113.3  | 109.0  111.8   110.4
64 KB  | 103.4  104.3   103.9  | 102.4  102.3   103.1  | 99.7   100.5   100.3
256 KB | 92.2   92.1    92.5   | 92.1   92.1    92.5   | 92.1   92.1    92.5
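To make the difference between LRU and FIFO concrete, here is a small Python sketch that counts misses for a single set (or a fully associative cache) with a given number of frames. The function name and the toy reference trace are assumptions; the trace is not related to the measurements in the table.

```python
from collections import OrderedDict


def count_misses(trace, frames, policy):
    """Count misses for a block reference trace under 'LRU' or 'FIFO' replacement."""
    cache = OrderedDict()                 # first key is always the next victim
    misses = 0
    for block in trace:
        if block in cache:
            if policy == "LRU":
                cache.move_to_end(block)  # a hit refreshes recency under LRU only
            continue
        misses += 1
        if len(cache) >= frames:
            cache.popitem(last=False)     # evict the oldest (FIFO) or least recent (LRU)
        cache[block] = True
    return misses


trace = [1, 2, 3, 1, 4, 1]                # block 1 is reused, i.e. temporal locality
print("LRU misses: ", count_misses(trace, 3, "LRU"))
print("FIFO misses:", count_misses(trace, 3, "FIFO"))
```

On this trace FIFO evicts block 1 just before it is reused, while LRU keeps it because the earlier hit marked it as recently used. The only extra work LRU does is the bookkeeping on every hit, which is what makes it costly in hardware.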
Slide 7
Memory Hierarchy Design
4. Write strategies:
• Most cache accesses are reads: with 10% stores + 37% loads + 100% instruction fetches, only about 7% of all memory accesses are writes
• Optimize reads to make the common case fast, observing that the CPU does not have to wait for writes but must wait for reads:
– fortunately, a read is easy in direct mapping: reading the block and comparing the tag can be done in parallel (what about associative mapping?);
– but a write is hard:
a) Tag checking and block writing cannot be overlapped (writing is destructive)
b) The CPU specifies the write size: only 1 - 8 bytes
• Thus write strategies often distinguish cache designs. On a write hit:
i. write through (or store through):
– ensures consistency at the cost of memory and bus bandwidth
– write stalls may be alleviated by using write buffers
ii. write back (store in):
– minimizes memory and bus traffic at the cost of weakened consistency
– uses a dirty bit to indicate modification
– read misses may result in writes (why?)
• On a write miss:
a) write allocate (fetch on write)
b) no-write allocate (write around)
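The 7% figure follows from the instruction mix quoted on this slide. A short Python check, with the function name chosen here for illustration:

```python
def write_fractions(stores_per_instr, loads_per_instr, fetches_per_instr=1.0):
    """Return (writes / all memory accesses, writes / data accesses)."""
    data_accesses = stores_per_instr + loads_per_instr
    all_accesses = data_accesses + fetches_per_instr
    return stores_per_instr / all_accesses, stores_per_instr / data_accesses


of_all, of_data = write_fractions(stores_per_instr=0.10, loads_per_instr=0.37)
print(f"writes are {of_all:.1%} of all accesses and {of_data:.1%} of data accesses")
```

Writes are rare overall because every instruction needs a fetch, but they make up roughly a fifth of the data cache traffic, so the data cache cannot ignore them.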
Slide 8
Memory Hierarchy Design
An Example: The Alpha 21264 Data Cache
• Cache size = 64 KB, block size = 64 B, two-way set associative, write-back, write allocate on a write miss.
• What is the index size?
2^index = 64 KB / (64 B * 2) = 2^16 / 2^(6+1) = 2^9, so the index is 9 bits
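The index-size calculation generalizes to any power-of-two cache geometry. A minimal Python sketch, where the function name is an assumption and all three parameters are assumed to be powers of two:

```python
def cache_geometry(cache_bytes, block_bytes, ways):
    """Return (number of sets, index bits, block offset bits) for power-of-two sizes."""
    num_sets = cache_bytes // (block_bytes * ways)
    index_bits = num_sets.bit_length() - 1      # log2 for a power of two
    offset_bits = block_bytes.bit_length() - 1
    return num_sets, index_bits, offset_bits


# 64 KB cache, 64-byte blocks, two-way set associative
sets, index_bits, offset_bits = cache_geometry(64 * 1024, 64, 2)
print(sets, index_bits, offset_bits)
```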
Slide 9
Memory Hierarchy Design
Cache Performance:
Memory access time is an indirect measure of performance and it is not a substitute for execution time:
Slide 10
Memory Hierarchy Design
Example 1: How much does cache help in performance?
Slide 11
Memory Hierarchy Design
Example 2: What's the relationship between AMAT and CPU time?
Slide 12
Memory Hierarchy Design
Improving Cache Performance
• The average memory access time can be improved by reducing any of the three parameters above:
1. R1: reducing miss rate;
2. R2: reducing miss penalty;
3. R3: reducing hit time.
• Four categories of cache organizations that help reduce these parameters:
1. Organizations that help reduce miss rate: larger block size, larger cache size, higher associativity, way prediction and pseudoassociativity, and compiler optimization;
2. Organizations that help reduce miss penalty: multilevel caches, critical word first, read miss before write miss, merging write buffers, and victim cache;
3. Organizations that help reduce miss penalty or miss rate via parallelism: non-blocking caches, hardware prefetching, and compiler prefetching;
4. Organizations that help reduce hit time: small and simple caches, avoiding address translation, pipelined cache access, and trace cache.
Slide 13
Memory Hierarchy Design
Reducing Miss Rate
There are three kinds of cache misses, depending on the cause:
1. Compulsory: the very first access to a block cannot be a hit, since the block must first be brought in from the main memory. Also called cold-start misses;
2. Capacity: lack of space in the cache to hold all blocks needed for the execution. Capacity misses occur because blocks are discarded and later retrieved;
3. Conflict: due to a mapping that confines blocks to a restricted area of the cache (e.g., direct mapping, set associative). Also called collision misses or interference misses.
While the 3-C characterization gives insight into causes, it is at times too simplistic (and the three categories are interdependent). For example, it ignores replacement policies.
Slide 14
Memory Hierarchy Design
Roles of 3-C
Slide 15
Memory Hierarchy Design
Roles of 3-C
Slide 16
Memory Hierarchy Design
First Miss Rate Reduction Technique: Larger Block Size
• Takes advantage of spatial locality → reduces compulsory misses
• Increases miss penalty (it takes longer to fetch a block)
• Increases conflict misses, and/or increases capacity misses
• Must strike a delicate balance among MP, MR, and AMAT in finding an appropriate block size
Slide 17
Memory Hierarchy Design
First Miss Rate Reduction Technique: Larger Block Size
• Example: Find the optimal block size in terms of AMAT, given a miss penalty of 40 cycles of overhead plus 2 cycles per 16 bytes, and the miss rates of the table below.
• Solution: AMAT = HT + MR * MP = HT + MR * (40 + block size * 2 / 16)
• High latency and high bandwidth encourage a large block size
• Low latency and low bandwidth encourage a small block size
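The AMAT formula for this example can be evaluated per block size as sketched below. The miss penalty model (40 cycles plus 2 cycles per 16 bytes) comes from the slide; the miss rates and the 1-cycle hit time are hypothetical placeholders, since the slide's miss-rate table is not reproduced in this transcript.

```python
def miss_penalty(block_bytes, overhead=40, cycles_per_16_bytes=2):
    """Miss penalty in cycles: fixed overhead plus transfer time."""
    return overhead + cycles_per_16_bytes * block_bytes / 16


def amat(hit_time, miss_rate, block_bytes):
    """AMAT = HT + MR * MP."""
    return hit_time + miss_rate * miss_penalty(block_bytes)


# HYPOTHETICAL miss rates per block size (assumed for illustration only).
hypothetical_miss_rates = {16: 0.040, 32: 0.030, 64: 0.025, 128: 0.024, 256: 0.030}

for size, rate in hypothetical_miss_rates.items():
    print(f"{size:4d} B: MP = {miss_penalty(size):.0f} cycles, AMAT = {amat(1, rate, size):.3f}")

best = min(hypothetical_miss_rates, key=lambda s: amat(1, hypothetical_miss_rates[s], s))
print("block size with lowest AMAT under these assumed miss rates:", best)
```

The sketch shows the trade-off the slide describes: the miss rate may keep falling as blocks grow, but the growing miss penalty eventually outweighs it.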
Slide 18
Memory Hierarchy Design
Second Miss Rate Reduction Technique: Larger Caches
• An obvious way to reduce capacity misses.
• Drawback: longer hit time and higher cost.
• Popular in off-chip caches (2nd- and 3rd-level caches).
Third Miss Rate Reduction Technique: Higher Associativity
• Miss rate rules of thumb:
i. 8-way associativity is almost equal to full associativity;
ii. the miss rate of a 1-way cache of size N is almost equal to the miss rate of a 2-way cache of size 0.5N;
iii. the higher the associativity, the longer the hit time (why?)
• Higher miss rate rewards higher associativity.
Slide 19
Memory Hierarchy Design
Fourth Miss Rate Reduction Technique: Way Prediction and Pseudoassociative Caches
• Way prediction helps select one block among those in a set, thus requiring only one tag comparison (on a hit).
– Preserves the advantages of direct mapping (why?);
– In case of a miss, the other block(s) are checked.
• Pseudoassociative (also called column associative) caches
– Operate exactly as direct-mapped caches on a hit, thus again preserving the advantages of direct mapping;
– In case of a miss, another block is checked (as if in a set-associative cache), by simply inverting the most significant bit of the index field to find the other block in the "pseudoset";
– real hit time < pseudo-hit time
– too many pseudo hits would defeat the purpose
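The pseudoset lookup can be sketched in a few lines of Python. This is a simplified model under stated assumptions: each frame stores the full block address rather than a tag, and the names `pseudo_set_partner` and `lookup` are invented for illustration.

```python
def pseudo_set_partner(index, index_bits):
    """Invert the most significant bit of the index to find the other frame in the pseudoset."""
    return index ^ (1 << (index_bits - 1))


def lookup(frames, block_address, index_bits):
    """Return 'hit', 'pseudo-hit' or 'miss' for a pseudoassociative cache.

    frames[i] holds the block address cached in frame i, or None (a simplification).
    """
    index = block_address % (1 << index_bits)
    if frames[index] == block_address:
        return "hit"                       # as fast as a direct-mapped hit
    if frames[pseudo_set_partner(index, index_bits)] == block_address:
        return "pseudo-hit"                # found, but only on the second, slower probe
    return "miss"


frames = [None] * 8                        # 8 frames, 3 index bits
frames[5] = 13                             # 13 MOD 8 = 5: block sits in its primary frame
frames[1] = 21                             # 21 MOD 8 = 5, stored in the partner frame 5 XOR 4 = 1
print(lookup(frames, 13, 3), lookup(frames, 21, 3), lookup(frames, 29, 3))
```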
Slide 20
Memory Hierarchy Design
Fifth Miss Rate Reduction Technique: Compiler Optimizations
Slide 21
Memory Hierarchy Design
Fifth Miss Rate Reduction Technique: Compiler Optimizations
Slide 22
Memory Hierarchy Design
Fifth Miss Rate Reduction Technique: Compiler Optimizations
Slide 23
Memory Hierarchy Design
Fifth Miss Rate Reduction Technique: Compiler Optimizations
IV. Blocking: improve temporal and spatial locality
a) multiple arrays are accessed in both ways (i.e., row-major and column-major), namely, orthogonal accesses that cannot be helped by earlier methods
b) concentrate on submatrices, or blocks
c) All N*N elements of Y and Z are accessed N times and each element of X is accessed once. Thus, there are N^3 operations and 2N^3 + N^2 reads! Capacity misses are a function of N and cache size in this case.
Slide 24
Memory Hierarchy Design
Fifth Miss Rate Reduction Technique: Compiler Optimizations
a) To ensure that the elements being accessed can fit in the cache, the original code is changed to compute on a submatrix of size B*B, where B is called the blocking factor.
b) The total number of memory words accessed is 2N^3/B + N^2
c) Blocking exploits a combination of spatial (Y) and temporal (Z) locality.
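The code for this slide is not preserved in the transcript, so the following Python sketch is a reconstruction under assumptions: it uses the array names X, Y, Z and blocking factor B from the text, and shows the usual blocked form of X = Y * Z next to the unblocked version. Python lists do not expose cache behaviour; the point is only the loop structure and the word-count formulas.

```python
def matmul_naive(y, z):
    """Unblocked X = Y * Z for square matrices given as lists of lists."""
    n = len(y)
    x = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            r = 0
            for k in range(n):
                r += y[i][k] * z[k][j]
            x[i][j] = r
    return x


def matmul_blocked(y, z, b):
    """Blocked X = Y * Z: work on B-wide strips so the touched data can stay in cache."""
    n = len(y)
    x = [[0] * n for _ in range(n)]
    for jj in range(0, n, b):
        for kk in range(0, n, b):
            for i in range(n):
                for j in range(jj, min(jj + b, n)):
                    r = 0
                    for k in range(kk, min(kk + b, n)):
                        r += y[i][k] * z[k][j]
                    x[i][j] += r          # partial sums accumulate across kk blocks
    return x


def words_accessed(n, b=None):
    """Worst-case memory words accessed: 2N^3 + N^2 unblocked, 2N^3/B + N^2 blocked."""
    return 2 * n**3 + n**2 if b is None else 2 * n**3 // b + n**2


y = [[i + j for j in range(5)] for i in range(5)]
z = [[i * j + 1 for j in range(5)] for i in range(5)]
print(matmul_blocked(y, z, 2) == matmul_naive(y, z))
print(words_accessed(512), words_accessed(512, b=32))
```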
Slide 25
Memory Hierarchy Design
First Miss Penalty Reduction Technique: Multilevel Caches
a) To keep up with the widening gap between CPU and main memory, try to:
i. make the cache faster, and
ii. make the cache larger
by adding another, larger but slower cache between the cache and the main memory.
Slide 26
Memory Hierarchy Design
First Miss Penalty Reduction Technique: Multilevel Caches
b) Local miss rate vs. global miss rate:
i. Local miss rate is defined as the number of misses in a cache divided by the total number of accesses to that cache
ii. Global miss rate is defined as the number of misses in a cache divided by the total number of memory accesses generated by the CPU
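The distinction between local and global miss rates is easiest to see with numbers. The Python sketch below uses the standard textbook definitions (local: misses divided by accesses to that cache; global: misses divided by all CPU accesses) and the usual two-level AMAT expansion. The access counts, hit times and penalty are hypothetical values chosen for illustration.

```python
def miss_rates(cpu_accesses, l1_misses, l2_misses):
    """Return (L1 miss rate, L2 local miss rate, L2 global miss rate)."""
    l1_rate = l1_misses / cpu_accesses    # for L1, local and global coincide
    l2_local = l2_misses / l1_misses      # L2 only sees the L1 misses
    l2_global = l2_misses / cpu_accesses
    return l1_rate, l2_local, l2_global


def two_level_amat(hit_l1, hit_l2, penalty_l2, l1_rate, l2_local):
    """AMAT = HT_L1 + MR_L1 * (HT_L2 + MR_L2(local) * MP_L2)."""
    return hit_l1 + l1_rate * (hit_l2 + l2_local * penalty_l2)


# Hypothetical counts: 1000 CPU accesses, 40 miss in L1, 20 of those also miss in L2.
l1_rate, l2_local, l2_global = miss_rates(1000, 40, 20)
print(l1_rate, l2_local, l2_global)
print(two_level_amat(hit_l1=1, hit_l2=10, penalty_l2=100, l1_rate=l1_rate, l2_local=l2_local))
```

The L2 local miss rate looks alarming because L1 has already filtered out the easy hits; the global miss rate is the figure that reflects how often the CPU actually goes to main memory.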
Slide 27
Memory Hierarchy Design
Second Miss Penalty Reduction Technique: Critical Word First and Early Restart
• The CPU needs just one word of the block at a time:
– critical word first: fetch the required word first, and
– early restart: as soon as the required word arrives, send it to the CPU.
Third Miss Penalty Reduction Technique: Giving Priority to Read Misses over Write Misses
• Serves reads before writes have been completed:
– while write buffers improve write-through performance, they complicate memory accesses by potentially delaying updates to memory;
– instead of waiting for the write buffer to become empty before processing a read miss, the write buffer is checked for content that might satisfy the missing read;
– in a write-back scheme, the dirty copy being replaced is first written to the write buffer instead of the memory, thus improving performance.
Slide 28
Memory Hierarchy Design
Fourth Miss Penalty Reduction Technique: Merging Write Buffer
• Improves the efficiency of write buffers, which are used by both write-through and write-back caches:
– Multiple single-word writes are combined into a single write buffer entry, which is otherwise used for a multi-word write.
– Reduces stalls due to the write buffer being full
Slide 29
Memory Hierarchy Design
Fifth Miss Penalty Reduction Technique: Victim Cache
• Victim caches attempt to avoid the miss penalty on a miss by:
– Adding a small fully associative cache that is used to contain discarded blocks (victims)
– It is proven to be effective, especially for small 1-way caches, e.g., a 4-entry victim cache removes 20%!
Slide 30
Memory Hierarchy Design
Reducing Cache Miss Penalty or Miss Rate via Parallelism
Nonblocking Caches (Lock-free caches):
Hardware Prefetching of Instructions and Data:
Slide 31
Memory Hierarchy Design
Reducing Cache Miss Penalty or Miss Rate via Parallelism
Compiler-Controlled Prefetching: compiler inserts prefetch instructions
Slide 32
Memory Hierarchy Design
Reducing Cache Miss Penalty or Miss Rate via Parallelism
Compiler-Controlled Prefetching: An Example
for (i = 0; i < 3; i = i + 1)
    for (j = 0; j < 100; j = j + 1)
        a[i][j] = b[j][0] * b[j+1][0];
• 16-byte blocks, 8 KB cache, 1-way, write back, 8-byte elements. What kind of locality, if any, exists for a and b?
– Array a: 3 rows and 100 columns; spatial locality: even-indexed elements miss and odd-indexed elements hit, leading to 3*100/2 = 150 misses
– Array b: 101 rows and 3 columns; no spatial locality, but there is temporal locality: the same element is used in the jth and (j + 1)st iterations, and the same elements are accessed again in each i iteration. 100 misses for i = 0 plus 1 miss for j = 0, for a total of 101 misses
• Assuming a large miss penalty (50 cycles), prefetches must be issued at least 7 iterations ahead. Splitting the loop into two, we have
Slide 33
Memory Hierarchy Design
Reducing Cache Miss Penalty or Miss Rate via Parallelism
Compiler-Controlled Prefetching: An Example (continued)
for (j = 0; j < 100; j = j + 1) {
    prefetch(b[j+7][0]);    /* b for 7 iterations later */
    prefetch(a[0][j+7]);    /* a for 7 iterations later */
    a[0][j] = b[j][0] * b[j+1][0];
}
for (i = 1; i < 3; i = i + 1)
    for (j = 0; j < 100; j = j + 1) {
        prefetch(a[i][j+7]);    /* a for 7 iterations later */
        a[i][j] = b[j][0] * b[j+1][0];
    }
Assuming that each iteration of the pre-split loop takes 7 cycles and that there are no conflict or capacity misses, the original loop takes a total of 7*300 + 251*50 = 14650 cycles (total iteration cycles plus total cache miss cycles), whereas the split loop takes a total of (1+1+7)*100 + (4+7)*50 + (1+7)*200 + (4+4)*50 = 3450 cycles.
Slide 34
Memory Hierarchy Design
Reducing Cache Miss Penalty or Miss Rate via Parallelism
Compiler-Controlled Prefetching: An Example (continued)
• the first loop consumes 9 cycles per iteration (due to the two prefetch instructions)
• the second loop consumes 8 cycles per iteration (due to the single prefetch instruction)
• during the first 7 iterations of the first loop, array a incurs 4 cache misses and array b incurs 7 cache misses
• during the first 7 iterations of the second loop, for i = 1 and i = 2, array a incurs 4 cache misses each
• array b does not incur any cache miss in the second split loop!
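The cycle counts of this prefetching example can be re-derived step by step. The Python sketch below only restates the arithmetic from the slides; the variable names are chosen here for readability.

```python
MISS_PENALTY = 50        # cycles per cache miss
ITERATION_CYCLES = 7     # cycles per iteration of the original loop body
PREFETCH_CYCLES = 1      # cycles per prefetch instruction

# Original loop: 300 iterations, 150 misses on a plus 101 misses on b.
original = ITERATION_CYCLES * 300 + (150 + 101) * MISS_PENALTY

# First split loop (i = 0): two prefetches per iteration,
# 4 misses on a and 7 misses on b before the prefetches take effect.
first_loop = (2 * PREFETCH_CYCLES + ITERATION_CYCLES) * 100 + (4 + 7) * MISS_PENALTY

# Second split loop (i = 1, 2): one prefetch per iteration,
# 4 misses on a for each value of i, none on b.
second_loop = (PREFETCH_CYCLES + ITERATION_CYCLES) * 200 + (4 + 4) * MISS_PENALTY

split = first_loop + second_loop
print(original, split, round(original / split, 2))
```

The extra prefetch instructions make every iteration slightly slower, but removing most of the 251 misses more than pays for that.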
Slide 35
Memory Hierarchy Design
First Hit Time Reduction Technique: Small and simple caches
• smaller is faster: small index, less address translation time
• a small cache can fit on the same chip
• low associativity: in addition to a simpler/shorter tag check, a 1-way cache allows overlapping the tag check with the transmission of data, which is not possible with any higher associativity!
Second Hit Time Reduction Technique: Avoid address translation during indexing
• Make the common case fast:
– use virtual addresses for the cache, because most memory accesses (more than 90%) take place in the cache, resulting in a virtual cache
Slide 36
Memory Hierarchy Design
Second Hit Time Reduction Technique: Avoid address translation during indexing
• Make the common case fast:
– there are at least three important performance aspects that directly relate to virtual-to-physical translation:
1) improperly organized or insufficiently sized TLBs may create excess not-in-TLB faults, adding to program execution time
2) for a physical cache, the TLB access must occur before the cache access, extending the cache access time
3) two line addresses (e.g., an I-line and a D-line address) may be independent of each other in the virtual address space yet collide in the real address space, when they draw pages whose lower page address bits (and upper cache address bits) are identical
• problems with virtual cache:
1) Page-level protection must be enforced no matter what during address translation (solution: copy protection info from the TLB on a miss and hold it in a field for future virtual indexing/tagging)
2) when a process is switched in/out, the entire cache has to be flushed because the same virtual addresses will refer to different physical addresses each time, i.e., the problem of context switching (solution: process identifier tag -- PID)
Slide 37
Memory Hierarchy Design
Second Hit Time Reduction Technique: Avoid address translation during indexing
• problems with virtual cache:
3) different virtual addresses may refer to the same physical address, i.e., the problem of synonyms/aliases
– HW solution: guarantee every cache block a unique physical address
– SW solution: force aliases to share some address bits (e.g., page coloring)
• Virtually indexed and physically tagged
Third Hit Time Reduction Technique: Pipelined cache writes
• the solution is to reduce the clock cycle time (CCT) and increase the number of stages, which increases instruction throughput
Fourth Hit Time Reduction Technique: Trace caches
• Finds a dynamic sequence of instructions, including taken branches, to load into a cache block:
– Put traces of the executed instructions into cache blocks as determined by the CPU
– Branch prediction is folded into the cache and must be validated along with the addresses to have a valid fetch.
– Disadvantage: stores the same instructions multiple times
Slide 38
Memory Hierarchy Design
Main Memory and Organizations for Improving Performance
Slide 39
Memory Hierarchy Design
Main Memory and Organizations for Improving Performance
a) Wider main memory bus
• Cache miss penalty decreases proportionally
• Cost:
i. wider bus (x n) and multiplexer (x n),
ii. expandability (x n), and
iii. error correction is more expensive
b) Simple interleaved memory
• Potential parallelism with multiple DRAMs
• Sending the address and accessing multiple banks in parallel, but transmitting data sequentially (4 + 24 + 4x4 = 44 cycles; 16/44 ≈ 0.4 byte/cycle)
c) Independent memory banks
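The 44-cycle figure for interleaved memory can be compared with other organizations using the same cost parameters: 4 cycles to send the address, 24 cycles per access, 4 cycles to transfer a word, and a 16-byte block of four 4-byte words. In the Python sketch below, the "sequential" and "wide" cases are assumptions derived from the same cost model rather than numbers given on the slide.

```python
def miss_penalty_cycles(words, organization, addr=4, access=24, transfer=4):
    """Cycles to fetch a block of `words` words under a simple cost model."""
    if organization == "sequential":    # one word at a time, full cost per word (assumed baseline)
        return words * (addr + access + transfer)
    if organization == "interleaved":   # banks accessed in parallel, words sent one by one
        return addr + access + words * transfer
    if organization == "wide":          # bus as wide as the block (assumed), one transfer
        return addr + access + transfer
    raise ValueError(f"unknown organization: {organization}")


BLOCK_BYTES = 16
for org in ("sequential", "interleaved", "wide"):
    cycles = miss_penalty_cycles(4, org)
    print(f"{org:12s} {cycles:4d} cycles  {BLOCK_BYTES / cycles:.2f} byte/cycle")
```

Interleaving recovers most of the benefit of a wide bus because the long access time is paid once, while only the short transfers are serialized.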
Slide 40
Memory Hierarchy Design
Main Memory and Organizations for Improving Performance
Slide 41
Memory Hierarchy Design
Main Memory and Organizations for Improving Performance
Slide 42
Memory Hierarchy Design
Virtual Memory
Slide 43
Memory Hierarchy Design
Virtual Memory
Slide 44
Memory Hierarchy Design
Virtual Memory
Slide 45
Memory Hierarchy Design
Virtual Memory
Fast address translation: an example – Alpha 21264 data TLB
» ASN is used as PID for virtual caches;
» TLB is not flushed on a context switch but only when ASNs are recycled;
» Fully associative placement
Slide 46
Memory Hierarchy Design
Virtual Memory
What is the optimal page size? – It depends:
• page table size ∝ 1/page size
• a large page size makes a virtual cache possible (avoiding the aliases problem), thus reducing cache hit time
• transfer of larger pages (over the network) is more efficient: efficiency of transfer
• a small TLB favors larger pages
• main drawbacks of a large page size:
– internal fragmentation: waste of storage
– process startup time: large context switching overhead
Slide 47
Memory Hierarchy Design
Summarizing Virtual Memory & Caches
A hypothetical memory hierarchy going from virtual address to L2 cache access:
Slide 48
Memory Hierarchy Design
Protection and Examples of Virtual Memory
• The invention of multiprogramming led to the need to share computer resources such as CPU, memory, I/O, etc. among multiple programs, whose instantiations are called "processes";
• Time-sharing of computer resources by multiple processes requires that processes take turns using such resources, and designers of operating systems and computers must ensure that the switching among different processes, also called "context switching", is done correctly:
a) The computer designer must ensure that the CPU portion of the process state can be saved and restored;
b) The operating systems designer must guarantee that processes do not interfere with each other's computations
• Protecting processes:
a) Base and Bound – each process falls in a pre-defined portion of the address space, that is, an address is valid if Base ≤ Address ≤ Bound, where the OS defines the values of Base and Bound and keeps them in two registers.
Slide 49
Memory Hierarchy Design
Protection and Examples of Virtual Memory
The computer designer’s responsibilities in helping the OS designer protect processes from each other:
a) Providing two modes to distinguish a user process from a kernel process (or equivalently, supervisor or executive process);
b) Providing a portion of the CPU state, including the base/bound registers, the user/kernel mode bit(s), and the exception enable/disable bit, that a user can use but cannot write; and
c) Providing mechanisms by which the CPU can switch between user mode and supervisor mode.
While base-and-bound constitutes the minimum protection system, virtual memory offers a more fine-grained alternative to this simple model:
a) Address translation provides an opportunity to check any possible violations – the read/write and user/kernel signals from CPU vs. the permission flags marked on individual pages by virtual memory (or OS) to detect stray memory accesses;
b) Depending on the designer’s apprehension, protection can be either relaxed or escalated. In escalated protection, multiple levels of access permissions can be used, much like the military classification system.
Slide 50
Memory Hierarchy Design
Protection and Examples of Virtual Memory
A Paged Virtual Memory Example – The Alpha Memory Management and the 21264 TLB (one for Instruction and one for Data)
a) A combination of segmentation and paging, with 48-bit virtual addresses, while the 64-bit address space is divided into three segments: seg0 (bits 63-47 = 0…00), kseg (bits 63-46 = 0…10), and seg1 (bits 63-46 = 1…11)
b) Advantages: segmentation divides the address space and conserves page table space, while paging provides virtual memory, relocation, and protection
c) Even with segmentation, the size of the page tables for the 64-bit address space is alarming. A three-level hierarchical page table is used in Alpha, with each PT contained in one page:
Slide 51
Memory Hierarchy Design
Protection and Examples of Virtual Memory
A Paged Virtual Memory Example – The Alpha Memory Management and the 21264 TLB
d) A PTE is 64 bits long: the first 32 bits contain the physical page number and the other half includes the following five protection fields:
1) Valid – whether the page number is valid for address translation
2) User read enable – allows user programs to read data within this page
3) Kernel read enable – allows kernel programs to read data within this page
4) User write enable – allows user programs to write data within this page
5) Kernel write enable – allows kernel programs to write data within this page
e) The current design of Alpha has 8-KB pages, thus allowing 1024 PTEs in each PT. The three page level fields and the page offset account for 10+10+10+13 = 43 bits of the 64-bit virtual address. The 21 bits to the left of the level-1 field are all "0"s for seg0 and all "1"s for seg1.
f) The maximum virtual address and physical address are tied to the page size, and Alpha has provisions for future growth: 16KB, 32KB and 64KB page sizes for the future.
g) The following table shows memory hierarchy parameters of the Alpha 21264 TLB
Parameter              | Description
Block size             | 1 PTE (8 bytes)
Hit time               | 1 clock cycle
Miss penalty (average) | 20 clock cycles
TLB size               | 128 PTEs per TLB, each of which can map 1, 8, 64, or 512 pages
Block selection        | Round-robin
Write strategy         | (not applicable)
Block placement        | Fully associative
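The 10+10+10+13 split of the virtual address and the 1024-PTE page table size can be checked with a short Python sketch. The function and constant names are assumptions; the field widths, page size and PTE size come from the slide.

```python
PAGE_BYTES = 8 * 1024      # 8-KB pages
PTE_BYTES = 8              # 64-bit page table entries
OFFSET_BITS = 13           # log2(8 KB)
LEVEL_BITS = 10            # log2(1024 PTEs per page table)

PTES_PER_TABLE = PAGE_BYTES // PTE_BYTES


def split_virtual_address(va):
    """Return (level-1 index, level-2 index, level-3 index, page offset)."""
    level_mask = (1 << LEVEL_BITS) - 1
    offset = va & ((1 << OFFSET_BITS) - 1)
    level3 = (va >> OFFSET_BITS) & level_mask
    level2 = (va >> (OFFSET_BITS + LEVEL_BITS)) & level_mask
    level1 = (va >> (OFFSET_BITS + 2 * LEVEL_BITS)) & level_mask
    return level1, level2, level3, offset


va = (5 << 33) | (6 << 23) | (7 << 13) | 0x123
print(PTES_PER_TABLE, split_virtual_address(va))
```

Each level index selects one of the 1024 PTEs in a page table that itself fits in exactly one page, which is why the level fields are 10 bits wide.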
Slide 52
Memory Hierarchy Design
Protection and Examples of Virtual Memory
A Segmented Virtual Memory Example – Protection in the Intel Pentium
a) Pentium has four protection levels, with the innermost level (0) corresponding to Alpha's kernel mode and the outermost level (3) corresponding to Alpha's user mode. Separate stacks are used for each level to avoid security breaches between levels.
1) A user can call an OS routine and pass parameters to it while retaining full protection
2) Allows the OS to maintain the protection level of the called routine for the parameters that are passed to it
3) The potential loophole in protection is prevented by not allowing the user process to ask the OS to access something indirectly that it would not have been able to access itself (such security loopholes are called Trojan Horses).
b) Bounds checking and memory mapping in Pentium use a descriptor table (DT, which plays the role of PTs in the Alpha). The equivalent of a PTE in a DT is a segment descriptor containing the following fields:
1) Present bit – equivalent to the PTE valid bit, used to indicate this is a valid translation
2) Base field – equivalent to a page frame address, containing the physical address of the first byte of the segment
3) Access bit – like the reference bit or use bit in some architectures, helpful for replacement algorithms
4) Attributes field – specifies the valid operations and protection levels for the operations that use this segment
5) Limit field – not found in paged systems, establishes the upper bound of valid offsets for this segment.
Slide 53
Memory Hierarchy Design
Crosscutting Issues: The Design of Memory Hierarchies
• Superscalar CPU and Number of Ports to the Cache
The cache must provide sufficient peak bandwidth to benefit from multiple issue. Some processors increase the complexity of instruction fetch by allowing the instructions to be issued to be found on any boundary instead of, say, multiples of 4 words.
• Speculative Execution and the Memory System
Speculative and conditional instructions generate exceptions (by generating invalid addresses) that would otherwise not occur, which in turn can overwhelm the benefits of speculation with the exception handling overhead. Such CPUs must be matched with non-blocking caches and only speculate on L1 misses (due to the unbearable penalty of L2).
• Combining Instruction Cache with Instruction Fetch and Decode Mechanisms
Increasing demand for ILP and clock rate has led to the merging of the first part of instruction execution with the instruction cache, by incorporating a trace cache (which combines branch prediction with instruction fetch) and storing the internal RISC operations in the trace cache (e.g., Pentium 4's NetBurst microarchitecture). A cache hit in the merged cache saves a portion of the instruction execution cycles.
• Embedded Computer Caches and Real-Time Performance
In real-time applications, variation of performance matters much more than average performance. Thus, caches that offer average performance enhancement have to be used carefully. Instruction caches are often used due to the high predictability of instructions, whereas data caches are "locked down", forcing them to act as small scratchpad memories under program control.
• Embedded Computer Caches and Power
It is much more power efficient to access on-chip memory than off-chip memory (which needs to drive the pins and buses, activate external memory chips, etc.). Other techniques, such as way prediction, can be used to save power (by only powering half of a two-way set-associative cache).
• I/O and Consistency of Cached Data
The cache coherence problem must be addressed when I/O devices also share the same cached data.
Slide 54
Memory Hierarchy Design
An Example of the Cache Coherence Problem
Slide 55
Memory Hierarchy Design
Putting It All Together: Alpha 21264 Memory Hierarchy
» Instruction cache is virtually indexed and virtually tagged; Data cache is virtually indexed but physically tagged;
» Operations:
1. The chip loads instructions serially from an external PROM and loads configuration info for the L2 cache
2. Execute preloaded code in PAL mode to initialize: e.g., update the TLB
3. Once the OS is ready, it sets the PC to the appropriate address in seg0
4. 9 index bits + 1 way-predict bit + 2 bits selecting a 4-instruction group = 12 address bits are sent to the I-$; 48 – 9 – 6 = 33 bits form the virtual tag
5. Way prediction and line prediction (11 bits for the next 16-byte group on a miss, updated by branch prediction) are used to reduce I-$ latency
6. The next way and line prediction is loaded to read the next block (step 3) on an instruction cache hit.
7. An instruction cache miss leads to a check of the I-TLB and the prefetcher (4 – 7), or an access to the L2 $ (if the instruction address is not found) (8)
8. The L2 $ is direct mapped, 1-16 MB
Slide 56
Memory Hierarchy Design
Putting It All Together: Alpha 21264 Memory Hierarchy
9. The instruction prefetcher does not rely on TLB for address translation, it simply increments the physical address of the miss by 64 bytes, checking to make sure that the new address is within the same page. Prefetching is suppressed if new address is out of the page (step 14).
10. If the instruction is not found in L2, the physical address command is sent to the ES40 system chip set via four consecutive transfer cycles on a narrow, 15-bit outbound address bus (step 15). Address and command take 8 CPU cycles. CPU is connected to memory via a crossbar to one of two 256-bit memory buses (16)
11. Total penalty of the instruction miss is about 130 CPU cycles for critical instructions, while the rest of the block is filled at a rate of 8 bytes per 2 CPU cycles (step 17).
12. A “victim file” is associated with L2, a write-back cache, to store a replaced block (victim block) (step 18); The address of the victim is sent out the system address bus following the address of the new request (step 19), where the system chip set later extracts the victim data and writes to the memory.
13. D-$ is a 64-KB two-way set-associative, write-back cache, which is virtually indexed but physically tagged. While the 9-bit index + 3-bit word selection is sent to index the required data (step 24), virtual page # is being translated at D-TLB (step 23), which is fully associative and has 128 PTEs of which each represents page size from 8KB to 4MB (step 25).
Slide 57
Memory Hierarchy Design
Putting It All Together: Alpha 21264 Memory Hierarchy
14. A TLB miss will trap to PAL (privileged architecture library) code to load the valid PTE for this address. In the worst case, a page-fault happens, in which case OS will bring the page from disk while context is switched.
15. The index field of the address is sent to both sets of data cache (step 26). Assuming a TLB hit, the two tags and valid bits are compared to the physical page # (steps 27-28), with a match sending the desired 8 bytes to the CPU (step 29)
16. A miss at the D-$ goes to the L2 $, which proceeds similarly to an instruction miss (step 30), except that it must check the victim buffer to make sure the block is not there (step 31)
17. A write-back victim can be produced on a data cache miss. The victim data are extracted from the data cache simultaneously with the fill of the data cache with the L2 data and sent to the victim buffer (step 32)
18. In case of a L2 miss, the fill data from the system is written directly into the (L1) data cache (step 33). The L2 is written only with L1 victims (step 34).
Slide 58
Memory Hierarchy Design
Another View: The Emotion Engine of the Sony Playstation 2
3 Cs captured by cache for SPEC2000 (left) and multimedia applications (right)
Slide 59
Memory Hierarchy Design
Another View: The Emotion Engine of the Sony Playstation 2