TK 2123 Computer Organisation & Architecture
Lecture 7: CPU and Memory (3)
Dr Masri Ayob


Page 1: TK 2123 COMPUTER ORGANISATION & ARCHITECTURE

Dr Masri Ayob

TK 2123 COMPUTER ORGANISATION & ARCHITECTURE

Lecture 7: CPU and Memory (3)

Page 2: TK 2123 COMPUTER ORGANISATION & ARCHITECTURE

2

Contents

This lecture will discuss:
- Cache.
- Error Correcting Codes.

Page 3: TK 2123 COMPUTER ORGANISATION & ARCHITECTURE

3

The Memory Hierarchy

Trade-off between cost, capacity and access time:
- Faster access time means greater cost per bit.
- Greater capacity means smaller cost per bit.
- Greater capacity means slower access time.

Access time: the time it takes to perform a read or write operation.
Memory cycle time: access time plus any time the memory needs to "recover" before the next access can begin.
Transfer rate: the rate at which data can be moved.

Page 4: TK 2123 COMPUTER ORGANISATION & ARCHITECTURE

4

Memory Hierarchies

A five-level memory hierarchy.

Page 5: TK 2123 COMPUTER ORGANISATION & ARCHITECTURE

5

Hierarchy List

- Registers
- L1 cache
- L2 cache
- Main memory
- Disk cache
- Disk
- Optical
- Tape

Registers through disk cache are internal memory; disk, optical and tape are external memory. Moving down the list gives decreasing cost/bit, increasing capacity, and slower access time.

Page 6: TK 2123 COMPUTER ORGANISATION & ARCHITECTURE

6

Hierarchy List

It would be nice to use only the fastest memory, but because that is the most expensive memory, we trade off access time for cost by using more of the slower memory. The design challenge is to organise the data and programs in memory so that the accessed memory words are usually in the faster memory.

In general, it is likely that most future accesses to main memory by the processor will be to locations recently accessed. So the cache automatically retains a copy of some of the recently used words from the DRAM. If the cache is designed properly, then most of the time the processor will request memory words that are already in the cache.

Page 7: TK 2123 COMPUTER ORGANISATION & ARCHITECTURE

7

Hierarchy List

No one technology is optimal in satisfying the memory requirements for a computer system. As a consequence, the typical computer system is equipped with a hierarchy of memory subsystems:
- some internal to the system (directly accessible by the processor), and
- some external (accessible by the processor via an I/O module).

Page 8: TK 2123 COMPUTER ORGANISATION & ARCHITECTURE

8

Cache

- A small amount of fast memory.
- Sits between normal main memory and the CPU.
- May be located on the CPU chip or in a module.
- Data moves between main memory and the cache in fixed-size blocks; the cache slot that holds one block is called a cache line.

Page 9: TK 2123 COMPUTER ORGANISATION & ARCHITECTURE

9

Cache

The cache contains a copy of portions of main memory. When the processor attempts to read a word of memory, a check is made to determine if the word is in the cache.
- If so (hit), the word is delivered to the processor.
- If not (miss), a block of main memory, consisting of some fixed number of words, is read into the cache and then the word is delivered to the processor.

Because of the phenomenon of locality of reference, when a block of data is fetched into the cache to satisfy a single memory reference, it is likely that there will be future references to that same memory location or to other words in the block.

The ratio of hits to the total number of requests is known as the hit ratio.
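The hit ratio directly determines average memory performance. As an illustrative sketch (the simple model and the names below are assumptions, not from the slides), we can charge every access the cache access time and add the main-memory access time only on a miss:

def effective_access_time(hit_ratio, t_cache_ns, t_main_ns):
    # A hit costs one cache access; a miss costs the cache check plus
    # a main-memory access (block-fill overhead is ignored here).
    return hit_ratio * t_cache_ns + (1 - hit_ratio) * (t_cache_ns + t_main_ns)

# Example: a 95% hit ratio with a 1 ns cache and 60 ns DRAM
print(effective_access_time(0.95, 1, 60))   # ~4.0 ns on average

Even a few per cent of misses dominate the average, which is why the design effort below goes into keeping the hit ratio high.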

Page 10: TK 2123 COMPUTER ORGANISATION & ARCHITECTURE

10

Cache/Main Memory Structure

Page 11: TK 2123 COMPUTER ORGANISATION & ARCHITECTURE

11

Cache Operation – Overview

- The CPU requests the contents of a memory location.
- The cache is checked for this data.
- If present, the word is delivered from the cache (fast).
- If not present, the required block is read from main memory into the cache, then delivered from the cache to the CPU.
- The cache includes tags to identify which block of main memory is in each cache slot.

Page 12: TK 2123 COMPUTER ORGANISATION & ARCHITECTURE

12

Cache Operation

Page 13: TK 2123 COMPUTER ORGANISATION & ARCHITECTURE

13

Cache Design

- Size
- Mapping function
- Replacement algorithm
- Write policy
- Block size
- Number of caches – L1, L2, L3, etc.

Page 14: TK 2123 COMPUTER ORGANISATION & ARCHITECTURE

14

Size does matter

Cost: more cache is expensive.
Speed: more cache is faster (up to a point), but checking the cache for data takes time.

We would like the size of the cache to be small enough so that the overall average cost per bit is close to that of main memory alone, and large enough so that the overall average access time is close to that of the cache alone.

The larger the cache, the larger the number of gates involved in addressing the cache. The result is that large caches tend to be slightly slower than small ones.

Page 15: TK 2123 COMPUTER ORGANISATION & ARCHITECTURE

15

Comparison of Cache Sizes

Processor        Type                            Year  L1 cache       L2 cache        L3 cache
IBM 360/85       Mainframe                       1968  16 to 32 KB    —               —
PDP-11/70        Minicomputer                    1975  1 KB           —               —
VAX 11/780       Minicomputer                    1978  16 KB          —               —
IBM 3033         Mainframe                       1978  64 KB          —               —
IBM 3090         Mainframe                       1985  128 to 256 KB  —               —
Intel 80486      PC                              1989  8 KB           —               —
Pentium          PC                              1993  8 KB/8 KB      256 to 512 KB   —
PowerPC 601      PC                              1993  32 KB          —               —
PowerPC 620      PC                              1996  32 KB/32 KB    —               —
PowerPC G4       PC/server                       1999  32 KB/32 KB    256 KB to 1 MB  2 MB
IBM S/390 G4     Mainframe                       1997  32 KB          256 KB          2 MB
IBM S/390 G6     Mainframe                       1999  256 KB         8 MB            —
Pentium 4        PC/server                       2000  8 KB/8 KB      256 KB          —
IBM SP           High-end server/supercomputer   2000  64 KB/32 KB    8 MB            —
CRAY MTA         Supercomputer                   2000  8 KB           2 MB            —
Itanium          PC/server                       2001  16 KB/16 KB    96 KB           4 MB
SGI Origin 2001  High-end server                 2001  32 KB/32 KB    4 MB            —
Itanium 2        PC/server                       2002  32 KB          256 KB          6 MB
IBM POWER5       High-end server                 2003  64 KB          1.9 MB          36 MB
CRAY XD-1        Supercomputer                   2004  64 KB/64 KB    1 MB            —

Page 16: TK 2123 COMPUTER ORGANISATION & ARCHITECTURE

16

Cache: Mapping Function

There are fewer cache lines than main memory blocks, so an algorithm is needed for mapping main memory blocks into cache lines. Three techniques are used:
- Direct
- Associative
- Set associative

Page 17: TK 2123 COMPUTER ORGANISATION & ARCHITECTURE

17

Direct Mapping

Each block of main memory maps to only one cache line, i.e. if a block is in the cache, it must be in one specific place.

Pros and cons:
- Simple.
- Inexpensive.
- Fixed location for a given block: if a program repeatedly accesses two blocks that map to the same line, cache misses are very high.
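A minimal sketch of the address arithmetic, assuming a power-of-two number of lines (the function and values are illustrative, not from the slides):

def direct_map(block_addr, num_lines):
    # Each block has exactly one possible line; the tag stored with the
    # line identifies which of the colliding blocks currently occupies it.
    line = block_addr % num_lines
    tag = block_addr // num_lines
    return tag, line

# Blocks exactly num_lines apart collide on the same line:
print(direct_map(5, 128))    # (0, 5)
print(direct_map(133, 128))  # (1, 5) -- fetching this evicts block 5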

Page 18: TK 2123 COMPUTER ORGANISATION & ARCHITECTURE

18

Associative Mapping

A main memory block can load into any line of the cache.
- The memory address is interpreted as a tag and a word.
- The tag uniquely identifies a block of memory.
- Every line's tag is examined for a match.

Disadvantage: cache searching gets expensive, because complex circuitry is required to examine the tags of all cache lines in parallel.

Page 19: TK 2123 COMPUTER ORGANISATION & ARCHITECTURE

19

Set Associative Mapping

A compromise that exhibits the strengths of both the direct and associative approaches while reducing their disadvantages.
- The cache is divided into a number of sets.
- Each set contains a number of lines.
- A given block maps to any line in a given set, e.g. block B can be in any line of set i.

With fully associative mapping, the tag in a memory address is quite large and must be compared to the tag of every line in the cache. With k-way set associative mapping, the tag in a memory address is much smaller and is only compared to the k tags within a single set.
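Continuing the illustrative sketch from direct mapping, only the modulus changes: the index now selects a set rather than a single line, and the block may occupy any of the k lines in that set.

def set_assoc_map(block_addr, num_sets):
    # The tag is compared against only the k tags within one set.
    set_index = block_addr % num_sets
    tag = block_addr // num_sets
    return tag, set_index

# With 64 sets (e.g. a 2-way cache of 128 lines), blocks 5 and 69 share
# set 5 but can coexist in its two lines instead of evicting each other:
print(set_assoc_map(5, 64))   # (0, 5)
print(set_assoc_map(69, 64))  # (1, 5)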

Page 20: TK 2123 COMPUTER ORGANISATION & ARCHITECTURE

20

Replacement Algorithms

When cache memory is full, some block in cache memory must be selected for replacement.

Direct mapping: no choice. Each block only maps to one line, so that line is replaced.

Page 21: TK 2123 COMPUTER ORGANISATION & ARCHITECTURE

21

Replacement Algorithms (2): Associative & Set Associative

The algorithm is implemented in hardware, for speed:
- Least recently used (LRU): keeps track of the usage of each block and replaces the block that was last used the longest time ago.
- First in first out (FIFO): replaces the block that has been in the cache longest.
- Least frequently used (LFU): replaces the block which has had the fewest hits.
- Random.
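As a software model of the LRU policy described above (a sketch with assumed names; real caches implement this in hardware):

from collections import OrderedDict

class LRUSet:
    # Models one k-way set; keys are block tags, capacity is the line count.
    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()

    def access(self, tag):
        if tag in self.lines:
            self.lines.move_to_end(tag)     # hit: mark most recently used
            return "hit"
        if len(self.lines) >= self.capacity:
            self.lines.popitem(last=False)  # evict the least recently used
        self.lines[tag] = True              # fetch the block into the set
        return "miss"

s = LRUSet(2)
print([s.access(t) for t in (1, 2, 1, 3, 2)])
# ['miss', 'miss', 'hit', 'miss', 'miss'] -- 2 was least recently used when 3 arrived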

Page 22: TK 2123 COMPUTER ORGANISATION & ARCHITECTURE

22

Write Policy

Issues:
- A cache block must not be overwritten unless main memory is up to date.
- Multiple CPUs may have individual caches.
- I/O may address main memory directly.

Page 23: TK 2123 COMPUTER ORGANISATION & ARCHITECTURE

23

Write Through

All writes go to main memory as well as to the cache. Multiple CPUs can then monitor main memory traffic to keep each local (to one CPU) cache up to date.

Disadvantages:
- Lots of memory traffic.
- Slows down writes.
- Can create a bottleneck.
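A minimal sketch of the policy (illustrative names, not from the slides; a write-back design would instead mark the line dirty and update memory only on eviction):

class WriteThroughCache:
    def __init__(self, memory):
        self.memory = memory   # backing store: address -> value
        self.lines = {}        # cached copies: address -> value

    def write(self, addr, value):
        if addr in self.lines:
            self.lines[addr] = value
        self.memory[addr] = value   # every store also goes to memory: bus traffic per write

mem = {0x10: 0}
cache = WriteThroughCache(mem)
cache.write(0x10, 42)
print(mem[0x10])   # 42 -- main memory is never stale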

Page 24: TK 2123 COMPUTER ORGANISATION & ARCHITECTURE

24

Cache: Line Size

As the block size increases from very small to larger sizes, the hit ratio will at first increase because of the principle of locality.

Two issues with larger blocks:
- Larger blocks reduce the number of blocks that fit into a cache. Because each block fetch overwrites older cache contents, a small number of blocks results in data being overwritten shortly after it is fetched.
- As a block becomes larger, each additional word is farther from the requested word, and therefore less likely to be needed in the near future.

Page 25: TK 2123 COMPUTER ORGANISATION & ARCHITECTURE

25

Number of Caches

Multilevel caches:
- On-chip cache: a cache on the same chip as the processor. It reduces the processor's external bus activity and therefore speeds up execution times and increases overall system performance.
- External cache: is it still desirable? Yes. Most contemporary designs include both on-chip and external caches, e.g. a two-level cache with an internal cache (L1) and an external cache (L2).
- Why? If there is no L2 cache and the processor makes an access request for a memory location not in the L1 cache, then the processor must access DRAM or ROM memory across the bus, giving poor performance.

Page 26: TK 2123 COMPUTER ORGANISATION & ARCHITECTURE

26

Number of Caches

More recently, it has become common to split the cache into two: one dedicated to instructions and one dedicated to data.

There are two potential advantages of a unified cache:
- For a given cache size, a unified cache has a higher hit rate than split caches because it balances the load between instruction and data fetches automatically.
- Only one cache needs to be designed and implemented.

Nevertheless, the trend is toward split caches, as in the Pentium and PowerPC, which emphasise parallel instruction execution and the prefetching of predicted future instructions. The advantage: a split cache eliminates contention for the cache between the instruction fetch/decode unit and the execution unit.

Page 27: TK 2123 COMPUTER ORGANISATION & ARCHITECTURE

27

Intel Cache Evolution

- Problem: external memory is slower than the system bus.
  Solution: add an external cache using faster memory technology. (First appears: 386)

- Problem: increased processor speed makes the external bus a bottleneck for cache access.
  Solution: move the external cache on-chip, operating at the same speed as the processor. (486)

- Problem: the internal cache is rather small, due to limited space on the chip.
  Solution: add an external L2 cache using faster technology than main memory. (486)

- Problem: contention occurs when both the instruction prefetcher and the execution unit simultaneously require access to the cache; the prefetcher is then stalled while the execution unit's data access takes place.
  Solution: create separate data and instruction caches. (Pentium)

- Problem: increased processor speed makes the external bus a bottleneck for L2 cache access.
  Solution: create a separate back-side bus (BSB) that runs at a higher speed than the main (front-side) bus and is dedicated to the L2 cache (Pentium Pro); then move the L2 cache onto the processor chip (Pentium II).

- Problem: some applications deal with massive databases and must have rapid access to large amounts of data; the on-chip caches are too small.
  Solution: add an external L3 cache (Pentium III); then move the L3 cache on-chip (Pentium 4).

Page 28: TK 2123 COMPUTER ORGANISATION & ARCHITECTURE

28

Locality

See Stallings, page 129.

Page 29: TK 2123 COMPUTER ORGANISATION & ARCHITECTURE

29

Internal Memory (revision)

Page 30: TK 2123 COMPUTER ORGANISATION & ARCHITECTURE

30

Memory Packaging and Types

A group of chips, typically 8 or 16, is mounted on a tiny PCB and sold as a unit.
- SIMM (single inline memory module): has a row of connectors on one side.
- DIMM (dual inline memory module): has a row of connectors on both sides.

Figure: a SIMM holding 256 MB; two of the chips control the SIMM.

Page 31: TK 2123 COMPUTER ORGANISATION & ARCHITECTURE

31

Error Correction

Hard failure:
- A permanent defect.
- Caused by harsh environmental abuse, manufacturing defects, and wear.

Soft error:
- Random and non-destructive; no permanent damage to memory.
- Caused by power supply problems.

Errors are detected using a Hamming error-correcting code.

Page 32: TK 2123 COMPUTER ORGANISATION & ARCHITECTURE

32

Error Correction

When the stored word is read out, a new set of K code bits is generated from the M data bits and compared with the fetched code bits. Three results are possible:
- No errors: the fetched data bits are sent out.
- An error is detected and can be corrected: the data bits plus the error-correction bits are fed into a corrector, which sends out the corrected set of M bits.
- An error is detected but cannot be corrected: this condition is reported.

Page 33: TK 2123 COMPUTER ORGANISATION & ARCHITECTURE

33

Error-Correcting Code Function

A function generates K code bits from the M data bits; the stored codeword is M + K bits.

Page 34: TK 2123 COMPUTER ORGANISATION & ARCHITECTURE

34

Error Correction

See Tanenbaum, pages 73-75.

Page 35: TK 2123 COMPUTER ORGANISATION & ARCHITECTURE

35

Error Correcting Codes (1)

Number of check bits for a code that can correct a single error.
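The table follows from the requirement that the k check bits must distinguish "no error" from a single-bit error in any of the m + k positions, i.e. 2^k >= m + k + 1. A quick tabulation (illustrative code, not from the slides):

def check_bits(m):
    # Smallest k with 2**k >= m + k + 1 (single-error correction).
    k = 1
    while 2 ** k < m + k + 1:
        k += 1
    return k

for m in (8, 16, 32, 64):
    print(m, check_bits(m))   # 8 -> 4, 16 -> 5, 32 -> 6, 64 -> 7

Note how the relative overhead shrinks as the word gets longer: 16 data bits need 5 check bits, but 64 data bits need only 7.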

Page 36: TK 2123 COMPUTER ORGANISATION & ARCHITECTURE

36

Error Correcting Codes (2)

(a) Encoding of 1100. (b) Even parity added. (c) Error in AC.

Page 37: TK 2123 COMPUTER ORGANISATION & ARCHITECTURE

37

Error Correcting Codes (3)

Construction of the Hamming code for the memory word 1111000010101110 by adding 5 check bits to the 16 data bits.
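A sketch of that construction in code (the function is illustrative, following Tanenbaum's scheme: positions numbered from 1, check bits at the power-of-two positions, even parity per group):

def hamming_encode(data_bits):
    # Find the number of check bits k with 2**k >= m + k + 1.
    m, k = len(data_bits), 1
    while 2 ** k < m + k + 1:
        k += 1
    total = m + k
    word = [0] * (total + 1)              # index 0 unused; positions 1..total
    bits = iter(data_bits)
    for pos in range(1, total + 1):       # data fills the non-power-of-two positions
        if pos & (pos - 1) != 0:
            word[pos] = next(bits)
    for p in (2 ** i for i in range(k)):  # check bit p gives even parity to the
        word[p] = sum(word[q] for q in range(1, total + 1) if q & p) % 2  # positions containing p
    return word[1:]

data = [int(b) for b in "1111000010101110"]
print("".join(map(str, hamming_encode(data))))   # 21-bit codeword: 001011100000101101110

On a read, recomputing each parity group yields a syndrome; a nonzero syndrome is the position of the single flipped bit, which is how the corrector on slide 32 locates the error.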

Page 38: TK 2123 COMPUTER ORGANISATION & ARCHITECTURE

38

Thank you
Q & A