8. Memory Organization

Memory is basic to the operation of a computer. Access to large memories, such as disks and even RAM, is slow relative to the operations of the CPU. Organizations of memory which increase its effective speed, while maintaining ease of access for the programmer, are a major consideration in memory design.

8.1. Memory Hierarchy

Both [Hamacher 88] and [Hayes 88] cover this topic.

From the programmer's point of view the memory is ideally thought of as a single system controlling a single high speed memory, receiving addresses of program instructions and data for storage, and returning these to the CPU. This is called virtual memory (VM) because its simple appearance to the programmer belies its complex internal structure.

Instead of consisting of a single type of memory with a single speed, the internal structure is a hierarchy of component memories, each of a different speed and size. These are designated M1, M2, . . ., Mn, with Mi being smaller and faster than Mi+1.

The behaviour of these component memories determines the control necessary to make the average speed of their combination, acting as a single memory, fall somewhere between that of the fastest component and that of the slowest.

The VM space is much larger than the space available in an actual component memory, but in fact only a part of any large program or data set is needed at any one time. So such programs or data are partitioned into blocks, and the component memories are partitioned into block size areas, called block frames. The fixed size allows efficient use of memory: if variable size sections of programs/data were brought in and removed, randomly distributed unoccupied parts of memory would come to exist, and keeping track of these, and finding useful information to store in them, becomes prohibitive.

Within a program, memory locations are needed

1. for transfer to a program address, or for accessing specific data, and also

2. for input-output of files by reads and writes.

In the first case the address in the program is the VM address. In the second the address is given by the file name. In the first case the VM address must be translated into an address in Mi using tables in Mi; in the second it is translated to a disk location using a directory, also on disk.

Information is addressed by its VM location, but for any further use of it, it would be good to have it in M1, the fastest and smallest memory system component. And so a block of information including and surrounding the addressed item in Mn is moved through the hierarchy into M1. However, one might want to move, in addition to the one block, other surrounding blocks from Mn part way down the hierarchy, so they are close to M1 if they are needed. In fact, whenever a block is to be moved from Mi+1 to Mi, some surrounding blocks might also advantageously be moved. An alternate way to arrange this is to assume that the block size in Mi is less than or equal to that in Mi+1, with the understanding that the block size at Mi gives the number of bytes to be moved between Mi and Mi+1.

The size of grouping of instructions or data into blocks is largely based on empirical studies of execution sequences of programs.


In execution, programs do not move randomly through their instructions, but rather spend the bulk of their time within small and contiguous sets of instructions and very little in moving between such sets. They are said to exhibit locality of reference. The smaller memories in the hierarchy will soon be filled with these blocks while new blocks continue to be addressed. So there is the need to move blocks back up through the hierarchy to make room for the more recently addressed blocks. In fact, though, only blocks which have been written (are dirty) need make this ascension. In any case a strategy (a replacement algorithm) is needed to determine which block(s) to move and/or replace so as to minimize movement. Blocks which are least likely to be needed soon should be candidates for movement.

As well as depending on the measure of locality actually found in programs, the size of a block depends on the time to move it from memory component to memory component. The size of block moved could be the same for all memory components, but because of the physical structure of different memories, different block sizes may be appropriate for different levels of the hierarchy. However, because blocks must be moved up and down the hierarchy, different block sizes complicate that movement. If block sizes are different in Mi and Mi−1 then certainly one should be a multiple of the other.

A program address, An, locates a block in Mn. When demanded at time t, that block may be moved to Mn−1 at block address An−1(t), and on down the hierarchy to M1 at address A1(t). These addresses generally depend on what space is available in Mn−1, . . ., M1 at time t. Then when the program again issues the block address An there must be some way of determining that An is in M1 at address A1(t).

There may be full or only partial access from a block in Mi to a block frame in Mi−1. That is, when a block B is to be moved from Mi to Mi−1, only a fixed subset of all block frames in Mi−1 may be available for that move. This kind of restriction is imposed to simplify the translation of a given program address, An, in Mn to the location of that block in Mi. With the block in Mi, its address An is kept with it. Then the block is located by matching the address An against all the addresses stored with blocks in Mi simultaneously and choosing the one that matches--or knowing it to be absent if there are no matches.

8.1.1. Overview Of Hierarchical Memory Control

The general addressing problem then is:

Given the virtual address issued by the program, find the location of that block in M1 if it is there, otherwise in M2 if it is there, etc., up to Mn--that is, in the lowest indexed memory in which it appears.

Generally a block can be located in a range of locations in the lowest indexed memory in which it appears, and furthermore a block may be at different locations at different times. This implies that the block's location in Mi, i < n, must be recorded in volatile memory--generally within Mi itself.

8.1.1.1. Linked Tables

Linked tables are illustrated in figure 8-1.

The most straightforward way of solving the addressing problem is by use of a table (the page table) in Mi with one entry per block address. If the block address requires p bits, then the table has 2^p entries. An entry contains all the information needed when that address is received. That includes its address in Mi, and its status (present or not, written in Mi or not, etc.).


[Figure 8-1: Page Directories, VM Page Addresses. (a) A single permanent page directory in main memory, indexed by the page number p of the virtual address <p, o>; each entry carries presence, protection, modified, and reference bits. (b) A primary page directory, permanently in MM, indexed by the k1-bit field p1, pointing to secondary page directories, paged in as needed and indexed by the k2-bit field p2. If k2 = q then each secondary page directory occupies one page in MM.]

When the VM address, A, is issued by a program, the block address part serves as an index into the table. If the block is not already in Mi it is given space in Mi and its address in Mi is entered at that index. The indexing allows fast access to the needed information, a very important consideration since this lookup must be done whenever the program issues an address. On the other hand, the space for this page table must be permanently assigned--it cannot be used for other purposes.

By using linked tables the permanent assignment of space can be considerably reduced, at the expense of time on each lookup.


The block address, p, can be partitioned into two parts, p1 and p2, having respectively k1 and k2 bits. The k1 bits refer to a group of blocks in VM, and the k2 bits pick out the block within that group. Then a "primary" table with only 2^k1 entries can be kept permanently in Mi. When a block is assigned for block address A = A1 || A2 in Mi, the entry in the primary table at index A1 is given the address of another "secondary" table in Mi with 2^k2 entries. At the A2 entry of this secondary table is the address of a block in Mi where the information of the addressed block is kept. Space for a secondary table is only needed when the first block in the group given by its index in the primary table is needed.
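A minimal sketch of this linked-table lookup in Python follows. The class and method names, and the k1/k2 bit widths, are invented for the example; it shows only the splitting of the block address and the allocate-on-first-use of secondary tables.

    # Two-level (linked) page directory: only the primary table is permanent.
    K1, K2 = 4, 6                           # illustrative bit widths

    class LinkedPageTable:
        def __init__(self):
            # The primary table (2^K1 entries) is permanently allocated.
            self.primary = [None] * (1 << K1)

        def set_entry(self, p, address_in_mi):
            a1, a2 = p >> K2, p & ((1 << K2) - 1)
            if self.primary[a1] is None:
                # A secondary table is allocated only on first use of its group.
                self.primary[a1] = [None] * (1 << K2)
            self.primary[a1][a2] = address_in_mi

        def lookup(self, p):
            a1, a2 = p >> K2, p & ((1 << K2) - 1)
            secondary = self.primary[a1]
            return None if secondary is None else secondary[a2]

    t = LinkedPageTable()
    t.set_entry(0b0011_000101, 0x2F)             # group 3, block 5
    assert t.lookup(0b0011_000101) == 0x2F
    assert t.lookup(0b0100_000000) is None       # untouched group: no secondary table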

The linked table approach can be extended to three or more levels of table, with only the primary table permanently in Mi. Of course this will require additional time for resolving each program address. It is important to make this time as low as possible. To do this, hardware can be used. In some cases the tables themselves can be in fast registers independent of Mi, and the algorithm of skipping from table to table can be implemented in hardware--perhaps in a microcomputer. Putting the algorithm in hardware is an option generally available for all i < n. Putting the tables in fast registers is only an option for small memories, and thus for the memories Mi with i at the small end of its range.

8.1.1.2. Hashing, Associative Memory

An alternative approach to translating from the program issued block address to the location of that block in Mi is to maintain an inverted page table, IT, with 2^m entries, where there is room for 2^m blocks in Mi. In each entry for which there is a block assigned in Mi, the corresponding VM block address is stored. When the program issues a VM block address, IT is searched for that address. The matching entry contains the information necessary to find the corresponding block in Mi, either by its 1-1 correspondence to blocks in Mi, or by having the address of that corresponding block in Mi actually stored in the inverted page table entry.

The search of IT could be very time consuming. A linear search takes time proportional to the length of the table--if the item sought is not in the table, this is not known until all of IT's entries have been visited. To make this approach at all viable a much faster search is needed--one requiring, on the average, small constant time. This can be achieved in a number of ways.

Using hashing, a given VM block address, A, coming from a vast range, is used to generate a series of IT addresses in the much smaller range of block addresses in Mi. This is done by applying a series of hash functions to A which result in a series of addresses to be checked in IT. When A is received for the first time, a number, say n, of occupied entries at addresses generated by the hash functions will be checked before the first unoccupied address in IT, say Ainv, is generated. When A is later reissued by the program, the same hash functions are applied to it and result in checking the same n occupied entries before arriving at Ainv. Hash functions will be considered in more detail later. The generation of addresses by the hash functions can be implemented in software or in hardware. Both require the same number of accesses to Mi to arrive at their destination; hardware can be used to speed up the generation of the hash addresses.

Another way that a fast search can be made is by simultaneously comparing all stored VM block addresses in the inverted table with a program issued VM block address. This requires special hardware, not only to make the comparisons, but also to address all of Mi's entries at the same time. Typically in larger memories only one entry, or at most a small number, can be addressed at one time, so to address many at the same time requires a special type of memory--


an associative memory. These are only feasible for the smaller, low indexed Mi's. In (a) of figure 8-2, Mi is shown with its block entries corresponding 1-1 with the entries in the inverted table. The issued program VM block address, p, is shown being matched simultaneously with all the addresses in the inverted table--the search situation in an associative memory. In (b) the alternative to the matching, the hash search, is shown.

[Figure 8-2: Inverted Table Control. The program issued VM address <p, o> (k and q2 bits) is resolved against the inverted table of Mi either by (a) matching p simultaneously against all stored VM block addresses (associative memory), or (b) hashed access: i = 1; search at Hash_i(p); i = i + 1 until a match or an empty entry is found.]

8.1.1.3. Access Between Levels Of The Hierarchy, Grouping

Figure 8-3 supplements this discussion.

Mi has fewer blocks than Mi+1. If each block in Mi+1 has access to any block frame in Mi, then any of the ways considered above for determining the location of a block in Mi given its VM block address are applicable. In particular, if an inverted table with T entries is used in Mi, then instead of a hash through this table, a hardware approach using an associative memory may be used, in which all of the T entries, each holding a VM block address in the inverted table in Mi, are simultaneously compared with the given VM block address.


The average number of lookups necessary to do a search with hashing can be reduced by using a table larger than T, so decreasing T may be desirable. A large number of simultaneous comparisons can make the cost of an associative memory prohibitive. T can in fact be reduced if each block frame in Mi only accepts blocks from a subset of the blocks in VM.

Blocks in VM can be grouped, with 2^k2 blocks in a group, each block having 2^q2 words. With 2^k1 groups, the address of a word in VM has three parts: p1 with k1 bits, p2 with k2 bits, and o with q2 bits. This address is sent to Mi. Its blocks are also grouped. Mi has the same number of groups as VM, but its groups are smaller. There are 2^g blocks in an Mi group (2^g << 2^k2)--4 in the figure. There is a 1-1 correspondence between groups in the two memories. Each block frame in a group in Mi receives only blocks from the same group in VM. There is an entry in the inverted table for each block frame in Mi.

The address of a block in VM, <p1, p2, o>, is given to Mi. p1 is used to select the group of blocks in Mi which contains the block addressed. p2 is then matched against the addresses within that group (4 of them in the figure). So the number of simultaneous comparisons, or of entries required in a hash table, is now only the size of a group in Mi. The penalty is the limited access of blocks in Mi+1 to block frames in Mi. There may be a block frame available in a group, G', in Mi while trying to bring in a block, b, from a different group, G, in VM whose companion group in Mi is full. A block must then first be removed from G in Mi before b can be moved into Mi. It may however be possible to arrange the groups so that members of the same group are very unlikely to be needed at the same time.

8.1.2. More On Hashing

As stated earlier, hashing applies hash functions to a key, e.g. the VM block address, to determine its table location.

Let M be the number of blocks in Mi and V the number in VM. There is a sequence of hash functions hash1, hash2, . . ., hashj, . . . to be performed on each key, with the property that, for all j, 1 ≤ hashj(K) ≤ M for 1 ≤ K ≤ V, and, with high probability or even certainty, hashi(K) ≠ hashj(K) if i ≠ j. This can be accomplished in a number of ways.

In general the first few values of the hash sequence are each computed directly from the given key, K, in different ways, to give numbers with a high probability of differing. These are the primary hash functions, for example:

hash1(K) = the remainder of K/M (these are the low order bits of the key if V and M are powers of 2),

hash2(K) = the middle log M bits of K.

The remaining, secondary, hash functions are given by a simple formula based on the primary hash function values, which is guaranteed to produce different values in the range 1 to M. For example, the linear secondary sequence is

hashp(K) = (hash1(K) + (p − 1)) mod M ,  3 ≤ p ≤ M.

The primary hash functions should be chosen to give different addresses to each of the names expected. If the names are equally likely to be anywhere in their range, then any selection of log M bits of the name N is equally good, and disjoint sequences of log M bits of N are likely to be unrelated. However, if the representations of the names are expected to be close to each other, then the primary hash functions might better be taken from the log M low order bits.
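A small illustrative sketch of such a probe sequence in Python follows; the function names and the table size are assumptions for the example, not part of the text.

    # Primary and secondary hash functions for a table of M slots, M a power of 2.
    M = 1 << 10
    LOG_M = 10

    def hash1(k):
        return k & (M - 1)                    # low order log M bits of the key

    def hash2(k):
        return (k >> (LOG_M // 2)) & (M - 1)  # "middle" log M bits of the key

    def probe_sequence(k):
        yield hash1(k)                        # primary probes first
        yield hash2(k)
        for p in range(3, M + 1):             # linear secondary probes
            yield (hash1(k) + (p - 1)) % M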


[Figure 8-3: Inverted Table Control (with grouping). The VM address <p1, p2, o> (k1, k2, and q2 bits) first selects, via p1, a group of 4 block frames in Mi; p2 is then either (a) matched associatively against the 4 tags of that group, or (b) located by hashed access within the group. Each VM group holds many more blocks than the 4 frames of its companion Mi group.]

Hashing For Entering, Searching, and Removing Keys

The use of the hash table for entering, searching for, and removing keys is given in the following pseudo code.


<"0" represents absence of usual data><"0!" represents either absence or presents of data depending on context>save = "0"; count = m ;while ( count ≠ 0 & hashi(K) ≠ <K, x> or "0" or "0!" )

{

i##=##i##+##1;##count##=##count##-##1,save##=##0;if hashi(K) = <K, x> then hashi(K) is the entry for key K;

<If a search for use: the entry has been found,furthermore, if save ≠ 0 then the entryin hashi(K) can be moved to address insave and hashi(K)##=##"0!"} and

If search for removal hashi(K) = "0!"if hashi(K) == "0" then

<If a search for use or removal: then there is no entryfor K in the table, Done;

If making a new entry hashi(K) = <K, information>

if hashi(K) == "0!" then<If a search for use or removal: Continue;If making a new entry: hashi(K) =<K, information>

}if count == 0 then the table is full;

The use of "0!" is interesting. "0!" replaces a table entry which is removed so that any searchwhose sequence of hash functions would pass through that entry will still do so. However, if afterpassing through a "0!" entry the search does find a the key it is looking for, that key can be movedto the first "0!"" found in the search. "0!" can also be replaced when a key, not already in the table,generates a sequence of hash values which arrive at a "0!". But to check that it is not already in thetable the search would continue till a "0" is found. Then the key may be placed at the address of thefirst "0!" found which would be kept in location save.

8.1.3. Hashing And Inverted Page Table

In figure 8-4 an inverted table implemented as a hash table T, with 2^m entries, is shown. In addition to the VM address space, in each entry there is space for a pointer to MM. Note that there are 2^n blocks in MM. If m = n then, except for the addition of the pointer space, this is equivalent to figure 8-3. If however m > n, the pointers allow each VM block access to any MM block. Later we show that this difference in size can be advantageous. In any case we use figure 8-4, with T partially filled, to illustrate the following discussion about hashing.

8.1.3.1. Analysis Of Hashing

The analysis for the average number of probes per entry in T is calculated assuming each hash probe is independent and equally likely to find any location in the hash table. The probes are not guaranteed to be to different locations.

S(f, n) is the average number of probes to find a key when the fraction of the table occupied is f, and the table has n locations.


[Figure 8-4: Hierarchical Page Directories. A virtual address <p, o> (page, offset) is resolved through a hashed page table T with 2^m entries: probe at Hash_i(p), i = 1, 2, . . ., until the entry holding the VM page address p is found; the entry supplies the MM page (of which there are 2^n). Entries carry presence, protection, modified, and reference bits. T = occupied fraction F + unoccupied fraction. The table at the bottom of the figure gives the average number of probes:

    F     P(F)    S(F)
    .5    1.39    2.00
    .6    1.53    2.50
    .7    1.72    3.33
    .8    2.01    5.00
    .9    2.55    10.0  ]

Since the probability that exactly k probes are needed is f^(k−1) (1 − f), S(f, n)/(1 − f) is the truncated series below; multiplying by −f and adding, twice, gives a closed form:

S(f, n) / (1 − f) = 1 + 2f + 3f^2 + · · · + n f^(n−1)

−f S(f, n) / (1 − f) = −f − 2f^2 − · · · − (n − 1) f^(n−1) − n f^n
------------------------------------------------------------------
S(f, n) = 1 + f + f^2 + · · · + f^(n−1) − n f^n

−f S(f, n) = −f − f^2 − · · · − f^(n−1) − f^n + n f^(n+1)
------------------------------------------------------------------
(1 − f) S(f, n) = 1 − (n + 1) f^n + n f^(n+1)


(1 − f) S(f, n) = 1 − f^n − n f^n (1 − f)

S(f, n) = 1/(1 − f) − n f^n − f^n/(1 − f)

S(f, n) = 1/(1 − f) − f^n (n (1 − f) + 1)/(1 − f)

S(f, n) < 1/(1 − f), and S(f, n) → 1/(1 − f) as n goes to infinity.

S(f) is the average number of probes to find a key when the fraction of the table occupied is f, as the table size approaches infinity.

P(F) is the average number of probes to find an unoccupied location while making entries in the table until it is filled to fraction F.

P(F) = (1/F) ∫ from 0 to F of S(f) df

     = (1/F) ∫ from 0 to F of df/(1 − f)

P(F) = −ln(1 − F) / F

The table at the bottom of figure 8-4 gives the values of P(F) and S(F) for a number of values of F.

The operating model for a hashed symbol table in a compiler is: the table is filled to a fraction F with symbolic addresses and corresponding actual addresses, and then searched without making further entries. In such a case P(F) gives the applicable average number of probes. Using as much as .8 of the symbol table, the average number of probes is about 2, a reasonable number.

The operating model for a hashed inverted table is: the table is first filled to a fraction F, and then entries are removed and made, a mode of operation whose evaluation is better approximated by S(F).

If the inverted table has a number of entries equal to the number of pages, each entry can have a fixed correspondence to an MM page (an inverted page table). This means no information about the page location associated with a VM address need be stored with that VM address.

Hash Table Oversized

On the other hand, for the inverted table model the table would be filled, and then removals and new entries would be made with the fraction filled remaining very high. Then the average number of probes for an entry or a removal would likely be greater than 10. So it would seem advisable to increase the size of the hashed inverted table so as to use perhaps only half of it. This means the fixed correspondence between a hashed inverted table entry and an MM page must be abandoned. The entries in T must then be made large enough to carry an MM page address in addition to their other information (this is illustrated in figure 8-4 when m > n). So the table size increases up to 4 times for this arrangement.

8.1.3.2. Implementation: Hardware, Software

First of all, one can keep a subset of the most frequently used page addresses in hardware. Then when the virtual block address is issued by the program it can be matched in parallel with all these most frequently used addresses. If there is a match, the result can be used as the MM page address.


[Figure 8-5: Hashed Page Directories. In hardware, the virtual address <p, o> is first matched associatively against a small table of the latest/most often used (VM page, MM page) pairs; on a match the MM page is selected directly. On a miss (NG), the hashed-inverted page table is consulted: probe at Hash_i(p), i = 1, 2, . . ., until the entry with VM page address p is found, yielding the MM page number. Page table entries carry presence, protection, modified, and reference bits; MM page numbers correspond to inverted table entries either 1-1 in order, or as assigned.]

This is all hardware and is considerably faster than doing the same operations in software. This arrangement, called an associative memory, is often used for an initial attempt to go from the issued virtual memory page address to the MM page address. If this fails, then one next goes to the hierarchical or hash schemes to find the correct MM page address. The combination of an associative memory with an inverted page table (a redrawing of that in figure 8-3) is shown in figure 8-5.

As noted previously, much of the work required to find the MM page address corresponding to a given VM address, in the process required when the associative memory does not hold the VM address, can also be assigned to hardware.

When hashing is used, the computation of the hash functions can be done completely in hardware. Again, however, the bottom line is the fact that a number of memory accesses is necessary, and these are required for each access to memory issued by the program.


8.2. Cache And VM In Operation

To take advantage of locality of reference, the sets of instructions or data within which the program tends to linger, the coherent sets, are the blocks. These are loaded from their relatively permanent home in a large capacity memory system, MII, into a higher speed but considerably smaller cache, M1, the memory from which they are executed. When the program departs to another block, that block will be loaded into the cache, if necessary replacing a previously loaded block (the replaced block having been written back to MII if it has changed while in the cache).

Different arrangements of the MII-M1 combination give different tradeoffs between simplicity (hardware cost) and restrictive access of blocks in MII to block frames in M1. These result from groupings of blocks in MII and their smaller size companion groups in M1, as described in the overview above.

Three different ways of coordinating the two component memories of the MII-M1 (cache) system are shown in figure 8-6. These correspond to different group sizes in M1. Blocks typically range in size from 16 to 2048 bytes.

In the pure associative design any block in MII can be placed in any block frame in M1. The address is that of a word in MII and has a Tag and an Offset field: the tag is the block number in MII, the offset is the word's location within that block. When a block is placed in M1, its tag accompanies it. To determine whether an addressed block is in the cache, the tag field is compared with the tags of all blocks in the cache; it is an associative memory. Of the three cache organizations this is the least restrictive: when any block in MII is to be placed in the cache, it can be placed in any empty cache frame--no restriction. On the other hand, the comparator must compare the tags of all blocks in the cache against the tag of the address, making this the most costly of the three alternatives.

In the set associative design (see figure 8-6), MII is partitioned into disjoint block groups numbered 0 to 2^k2 − 1. Blocks within each group are numbered 0 to 2^k1 − 1; these are the Block_In_Group numbers. All the blocks in a group have access to the same cache group of block frames--there are 2^k of them. The address in MII gives a word's location with a Tag, giving its block_in_group number, an Index, giving its group number, and an Offset to locate the word within the block. Since all blocks with the same group number have access to one small group of cache block frames in M1, program layout within the memory will, as far as possible, assure that blocks with the same group number are from non-communicating programs in MII. To determine whether an addressed block, b, is in the cache: first the only group, say q, in the cache which could contain b is found using the group number field (Index) of the address. Then the tag fields of q are all compared, in parallel, to the block_in_group field (Tag) of the address. There is a match, picking out the correct block, only if it is in the cache; otherwise it is in MII. This design is less costly than the pure associative, but more restrictive.

The direct design is set associative with k = 1, and so is the least costly but most restrictive one. Note also that the pure associative is a special case of the set associative cache.
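The tag/index/offset decomposition can be made concrete with a short sketch (illustrative Python; the field widths, the modeling of a set as a dict, and the arbitrary eviction in fill are assumptions of the example):

    # Set associative lookup: the address splits into TAG | INDEX | OFFSET.
    K1, K2, Q = 18, 10, 4         # illustrative tag/index/offset widths

    class SetAssociativeCache:
        def __init__(self, ways=4):
            # One small group (set) of up to 'ways' frames per index value.
            self.sets = [dict() for _ in range(1 << K2)]   # tag -> block words
            self.ways = ways

        def lookup(self, addr):
            offset = addr & ((1 << Q) - 1)
            index = (addr >> Q) & ((1 << K2) - 1)
            tag = addr >> (Q + K2)
            group = self.sets[index]       # only group that can hold the block
            if tag in group:               # done in parallel in hardware
                return group[tag][offset]  # hit: word within the block
            return None                    # miss: block must come from MII

        def fill(self, addr, block_words):
            index = (addr >> Q) & ((1 << K2) - 1)
            tag = addr >> (Q + K2)
            group = self.sets[index]
            if len(group) >= self.ways:
                group.pop(next(iter(group)))   # evict; policy: section 8.2.6
            group[tag] = block_words

With ways = 1 this degenerates to the direct design; with a single group (K2 = 0) it is the pure associative design.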

To the programmer all of this communication between memories is transparent.


[Figure 8-6: Cache Architectures. Associative: the VM address is <TAG (block number, p bits), OFFSET (word, q bits)>, with 2^q words/block; the tag is matched against every cache frame's tag. Set associative: the VM address is <TAG (Block_In_Group number, k1 bits), INDEX (Group number, k2 bits), OFFSET (q bits)>; the index selects one cache group of 2^k frames, within which the tags are matched in parallel. Direct: the set associative case with k = 1.]

8.2.1. Software-Hardware Control Of Memory System

The memory management unit, MMU, and the translation look-aside buffer, TLB (physically the MMU may contain the TLB), receive inputs from the address field of each instruction, which gives the segment, page, and offset within the page of the word addressed, i.e., the position of the word in M3.



The TLB provides rapid access to some of the most frequently used segment and page table entries, by associating an address field triplet of segment, page, and offset with the real address in M2. Small independent memories with full and set associative caches have been used to implement the TLB. If the TLB has an entry at the virtual address presented to it, and this information shows the memory request to be valid, the addressed page in M2 is passed on to the MM (where it will be checked for availability in M1 or M2). If there is no entry for the given virtual address in the TLB, it is passed to the MMU which, referring to tables in M2, determines where in storage, the MM or M3, the addressed object resides. If it is in the former, it is given for fetching to the MM; otherwise a memory fault is generated, which results in the memory system bringing the page addressed into M2 and then into the cache, from which it will be fetched. The information needed to perform these functions is kept in a group of tables, the page directory, managed by the MMU. A number of different ways to organize a page directory are extant.

8.2.2. Virtual Memory--Segments, Pages, Locating Bytes In MM

A program consists of meaningful parts, including tables, stacks, and code. The parts are called segments. Some of the parts, e.g. the run time stack, are created by a compiler; others are within the programmer's design. The collection of parts composing the program will be termed a programming environment. The segment is meaningful in that it is characterized by a set of features distinguishing it from other segments. A segment may be of fixed or variable length, read-only or writeable; it may be temporary or permanent. It may be a unit to which the owner wishes to give others access, or not. It gives a natural scope for protection. The attempt is made to have a single programming environment run to completion before another is used. This is not always practical. When the computer is shared, because of time sharing and/or multi-programming, it is often necessary to jump from one environment to another, and thus from one set of segments to another. Such a change is called a context switch.

To avoid massive duplication of programs, it is often necessary to have several different program environments share segments. For example, a number of programs may need the same compiler, the same data base system, or the same data files. They may want to share the code while keeping some tables to themselves. There are then two types of segments, shared and unshared. For each programming environment there is a segment table with one entry for each segment in that environment. If such a segment is shared, that entry may point to an entry in a shared segment table. Having one shared segment table containing all shared segments means that there need be only one copy of the associated shared pages. Also, changes in attributes of such shared segments and pages need only be recorded in the shared tables. Every time there is an access to the shared table a reference count is incremented, and when such a reference terminates the reference count is decremented. While segment tables may move in and out, the shared table stays until the reference count is zero.

All segments are composed of pages. Each page consists of a number of words or bytes, or of a number of blocks, each of which is composed of a number of words or bytes. The sizes of a page, block, and word are all powers of two.

Remember we assume three levels of memory: M1, M2, and M3. All segments are stored in M3. Each word is addressable in M3. The address field of every instruction referring to memory consists of three parts (in figure 8-8 the table layout for the virtual memory implementation is given):

1. The SEGMENT identification number, called i3.


[Figure 8-7: A Virtual Memory Architecture. The virtual address <SEGMENT, PAGE, OFFSET> of an instruction's address field goes to the TLB (translation look-aside buffer, an associative memory over the most used translations) and, on a miss, to the MMU (memory management unit, which may generate memory faults); either path yields a cache address into the cache M1. Main memory M2 is partitioned into groups of 2^b blocks, with 2^p words/block; virtual memory M3 holds segments Seg 1 . . . Seg n, each composed of pages, with 2^q blocks/page.]

2. The PAGE identification number relative to the start of the segment, called i2, and the OFFSET number, which gives the address of the word relative to the start of the page, called i1.



We have also considered the case where the pages are composed of blocks and the real page address, say p, picks out one of several parts, each a power of 2 in size, of a group, and thus a number of blocks. (Alternatively a page may consist of one or more groups.) This amounts to supplying part of the block address so as to locate the word within M2 (which is partitioned into groups and blocks for caching purposes). Given p, finding the word requires the remainder of the block address and the word location in that block. This is given by the OFFSET field. See "real page location" in figure 8-8, where i1' is the remainder of the block address, and i0' the word address.

The segment tables are in M2. At each moment, the address of the single active table is in the process segment base address register, while the shared segment base register points to the start of the shared segment table. The segment tables are addressed by the segment identification number in the address field. Each addressed entry contains information about that segment:

1. Presence bit, indicating by 1 that the page table for this segment is in M2, by 0 that it is not. If it is an entry for a shared segment, this bit is also 1.

2. Share bit, indicating whether it is shared or not. If it is, then the segment number in the shared segment table is given (see 4).

3. Access control bits (assume there are three, indicating whether this segment can be read (1 bit), written (1 bit), or executed (1 bit)), or a combination of some of these conditions.

4. The address of its page table in M2 if the page table is in M2, or else the index into the shared segment table.

5. Its length in pages.

Each segment is composed of a number of pages. The segment table gives the address of the page table, which in turn contains information about the pages in that segment:

1. Presence bit, indicating whether the page is in M2 or not.

2. A "dirty bit" indicating whether the page has been written into or not.

3. The address of the page, given as the page frame number (M2 having been partitioned into page frames), if the page is in M2.

4. The location in M3 of the page.

8.2.3. Memory Management Control

The flow of control of the memory management system is given below. The steps are executed in the order shown. Sometimes the flow will terminate at an intermediate step, as indicated. As in figure 8-8, ns and np are respectively the sizes of an entry in the segment and the page table, and sp is the page size.

1. Fetch at (segment base register contents) + i3 × ns.

2. Check access bits against the operation attempted. If there is no agreement then generate "access error" to the operating system. This causes a trap which terminates the program.

3. Check the presence bit for the presence in M2 of the page table for the addressed segment. If it is 0 (no), generate a segment absence fault. The fault causes the operating system to bring in the missing page table and to update the segment table accordingly. Then the program restarts the instruction that caused the memory fault.


[Figure 8-8: Virtual Memory Management Tables. The instruction's address field <SEGMENT i3, PAGE i2, OFFSET i1> is resolved as follows: the process segment base register contents plus i3 × ns selects a process segment table entry (presence bit, shared bit, access bits, length in pages, address of page table, or number of shared segment); for a shared segment the shared segment base register leads to the shared segment table instead. The page table address plus i2 × np selects a page table entry (presence bit, dirty bit, page frame number, location in secondary memory). The page frame number, combined with i1 by + or by concatenation ||, gives the real page location <group, block, word>, assuming 2^k pages/group, and thence the cache address.]


4. Check the shared bit. If 1 (yes), go to the shared segment table at the shared segment address. Return to 2 if present. If not present, generate an absent segment fault, etc.

5. Check the length, in pages, of the segment against i2. If it is out of bounds, generate an "out of bounds" fault to the operating system. This fault causes a trap which terminates the procedure.


6. GO TO PAGE TABLE at (page table base address given in the segment table entry) + np × i2.

7.
a. Check the presence bit in the page table entry. That is a check of whether the page has been brought into M2. If 0 (no), generate a page absence fault. The fault causes the operating system to bring in the missing page from its location in M3 and to update the page table accordingly. Then the program restarts the instruction that caused the memory fault.

b. If the word is to be written, mark the dirty bit.

c. The word is at: (the page table entry for the M2 location of the page) + i1. The "+" is sometimes replaced with concatenation, ||, as in figure 8-8. The whole walk is sketched in code after this list.

8.2.3.1. Variations In Moving Data Through A Hierarchy

In the description of the operation of the virtual memory, a dirty bit was used to indicate that data has been changed in the cache, M1, so that M1 and M2 can be kept consistent (maintain their integrity). There are other ways to keep the copies of the same data at different levels of the memory hierarchy consistent. These are shown in figure 8-9. As previously, M1 and M2 are the cache and high speed memory of the hierarchy; the CPU registers are referred to as M0, and we assume that instructions get data from, and put data into, these registers.

When the microcode has set up the memory address in IMAR or DMAR and wants to read at that address, it issues a get command. If it wants to write, it issues a put command. If there is a cache, the memory's internal response to these commands depends on the location of the data associated with the given address. When data is to be moved it is either in, or not in, M1 (the cache) when a put or get is received by the memory. The memory can respond in different ways to these requests, as shown in figure 8-9. The variations are in the implementation of the puts, because changes in the cache due to puts must appear in M1 and in M2 either immediately or eventually, when the block in which they appear is replaced; a sketch of the two policies follows.

[Figure 8-9: Cache Updates--Maintaining Integrity. The chart tabulates, for each memory operation (GET, PUT, REPLACE) and for the cases "in M1" / "not in M1", the transfers among M0, M1, and M2 under two variants. GET: M1 → M0 if in M1; otherwise M2 → new M1 line, then M1 → M0. PUT, dirty bit variant: M0 → M1 and dirty bit ← 1 (preceded by M2 → new M1 line if not in M1); on REPLACE, if dirty bit = 1 then M1 → M2, then M2 → new M1 line with dirty bit ← 0. PUT, dirty bit not used: M0 → M1 (when in M1) and M0 → M2 in all cases.]

Page 19: 8. Memory Organizationpaull/chapt8.pdf · 8. Memory Organization Memory is basic to the operation of a computer. Access to large memories, such as disks and even RAM_ memories are

Spring 02 Memory Organization copywrite

263

The table in figure 8-9 is formulated under the assumption that the gets and puts all refer to data which is definitely in M2. M0 in the figure represents the instruction register. This same chart can be applied to M3 if we add 1 to all the subscripts in the chart and assume the data is in M3.

8.2.4. Segment Allocation--Fragmentation

Suppose it were necessary to load entire segments at the same time into contiguous locations in M2, and many segments can be in M2 at the same time. As time goes by, segments will be removed and others brought in. In general there will be alternating filled and empty parts of M2. When, into this distribution of available and unavailable space of different sizes, it becomes necessary to place the next segment to be loaded, a plan or policy should guide the placement. Some popular plans for locating a segment S are listed below:

1. First Fit, FF: The runs of available space are searched in location order until the first one is found that accommodates S--S is placed there.

2. Next Fit, NF: This is first fit as above, with the available runs of space linked together, and the starting point for the search moving one place along the chain after each segment is accommodated. The links have to be repaired after each storage of a new segment and after each removal. This should be done in such a way that when a removal leaves two runs of empty space abutting, they are gathered into a single larger available run.

3. Best Fit, BF: The runs of available space are searched in location order until all those that accommodate S are found--S is placed in the one it most closely fits, i.e., the one which, with S in it, leaves the smallest unoccupied remainder.

One might think that BF, which clearly is more complex and considerably more time consuming than FF, would always be at least as good as FF. This is not true. Assume n > k > j > 0, and that at some point there are available spaces of size n + k and n, in that order. Then suppose a segment of size j arrives. FF would leave available spaces of n + k − j > n and n, while BF would leave n + k and n − j. Now if the next two segments were of size n they would both fit using FF, but only the first would fit in the space left by BF.
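A small illustrative sketch of the two policies (Python; free space modeled simply as a list of run sizes in location order), replaying the counterexample above:

    def first_fit(runs, s):
        # Place s in the first run, in location order, that accommodates it.
        for i, r in enumerate(runs):
            if r >= s:
                runs[i] = r - s
                return True
        return False

    def best_fit(runs, s):
        # Place s in the accommodating run leaving the smallest remainder.
        candidates = [i for i, r in enumerate(runs) if r >= s]
        if not candidates:
            return False
        i = min(candidates, key=lambda i: runs[i] - s)
        runs[i] -= s
        return True

    n, k, j = 10, 5, 2
    ff, bf = [n + k, n], [n + k, n]
    first_fit(ff, j); best_fit(bf, j)               # ff = [13, 10], bf = [15, 8]
    assert first_fit(ff, n) and first_fit(ff, n)    # both size-n segments fit
    assert best_fit(bf, n) and not best_fit(bf, n)  # only the first fits after BF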

Clearly if one could wait until more than one segment needed placing then one could do betterwith any of the algorithms.

In the paged segmentation memory allocation this problem does not arise, since the units that are moved in and out are fixed size pages. They are moved into page frames, so any time pages are removed they leave space for the number of pages that were removed. Of course it is necessary to keep track of the available space.

Fragmentation is the accumulation of many isolated small available spaces, none of which can accommodate a reasonable size segment, although their sum is enough to do so. One must then rearrange the memory allocation to make one large available space.

8.2.5. Analysis Of Memory Hierarchy Performance

A useful performance measure of a computer memory system in executing a program P is the hit ratio, H. H is the fraction of cache accesses which are successful (in which the word needed is already in the cache) in executing P. M = 1 − H is the miss ratio, the fraction of cache accesses that fail. Generally there is an initial access to a block, b1, followed by a sequence of accesses within b1, then an initial access to b2, b1 ≠ b2, followed by a series of accesses within b2, etc.


The sequence of blocks b1, . . ., bn with bi ≠ bi+1 thus formed is a block input sequence. A block bi in such a sequence is a hit if the first access to it is a hit, otherwise it is a miss. The block hit ratio, Hb, of a block input sequence is the fraction of blocks in the sequence which are hits. Note that if there is a miss on the first access to block bi in the block input sequence, all subsequent accesses of this occurrence of that block will be hits.

Nb is the total number of blocks in the block input sequence.

na is the average number of successive accesses to the same block.

H = (Hb Nb na + (1 − Hb) Nb (na − 1)) / (na Nb)

  = (Hb na + na − 1 − Hb na + Hb) / na

  = (na − 1 + Hb) / na

  = 1 − (1 − Hb) / na

Figure 8-10 illustrates that the word hit ratio may still be quite large even if a program execution frequently jumps from block to block (e.g., half the time). Again this is a consequence of locality of reference, reflected in a high na; na is a non-decreasing function of block size.
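The figure's entries follow directly from H = 1 − (1 − Hb)/na; a one-loop check (illustrative Python):

    for na in (10, 20):
        for hb in (0.5, 0.6, 0.8):
            print(na, hb, 1 - (1 - hb) / na)   # word hit ratio H
    # na=10: .95 .96 .98     na=20: .975 .980 .990  (the table in figure 8-10)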

[Figure 8-10: Hit Ratio.

    Hb (block)    H (word, na = 10)    H (word, na = 20)
    .5            .95                  .975
    .6            .96                  .980
    .8            .98                  .990  ]

8.2.5.1. The Access Time Of Cached Memory

The average number of hits per block and the block hit ratio reflect the degree to which locality of reference holds, as well as the ratio of cache to MM size. These hit ratios, together with the access times to M1 and M2, determine the efficiency and speedup of a cached memory system. The efficiency is the ratio of the time to access the cache directly to the average access time of the entire memory system. The speedup is the ratio of the time of a direct main memory access to the average time to access the entire memory system; the higher the better. The cache-main memory system is an example of a two mode system, as is a machine with vector and scalar operations. The formulation of the efficiency and speedup measures for a general two mode system applies to both.


8.2.5.2. Two Mode Systems--Efficiency And Speedup

Assume that a program P executes each instruction in one of two modes. In this mixed mode operation the execution time of an instruction depends on its mode of operation. P could in theory be executed using only one of the modes. Since one mode is typically faster than the other, it would be best if P were implemented entirely in the faster mode, but the nature of the two modes makes this impossible in reality. Mixed mode is a must. The net performance of the mixed mode depends on the fraction of the time spent in each mode, and the relative speed of the two modes. The efficiency, which is always less than one, and the speedup, always greater than one, provide measures of that performance.

T1 is the time to execute P entirely in mode 1, assumed to be the smaller time.

T2 is the time to execute P entirely in mode 2, assumed to be the larger time.

R = T2/T1 is the ratio of the speeds of the two modes. Note that T2 = R T1 and T1 = T2/R.

f is the fraction of its time that P spends in mode 1 when operating in mixed mode.

T12 is the time to execute P in mixed mode.

T12 = f T1 + (1 − f) T2 = f (T2/R) + (1 − f) T2 = f T1 + (1 − f) R T1

Efficiency, e = T1/T12 = T1/(f T1 + (1 − f) R T1) = 1/(f + (1 − f) R)

Speedup, s = T2/T12 = T2/(f (T2/R) + (1 − f) T2) = 1/((f/R) + (1 − f))

s = R e

Then for a cached memory system: mode 1 = cache, mode 2 = main memory, f = hit ratio = H, and R = the ratio of the time to run P if all accesses were to the main memory to that if all accesses were to the cache (R > 1).

e = Tcache/Tcache&main = Tcache/(H Tcache + (1 − H) R Tcache) = 1/(H + (1 − H) R)

s = 1/((H/R) + (1 − H))

Figure 8-11, with f and R interpreted as above, illustrates that even when the speed of the main memory is close to that of the cache, very high hit ratios are needed for efficient operation of the memory system.

[Figure 8-11: Efficiency And Speedup Of Two Mode Systems. Each entry is efficiency/speedup, e/s, for the given mode-1 fraction f and speed ratio R; entries unreadable in the source are left blank:

    R \ f    .50        .90        .95        .99       .999
    1        1.0/1.0    1.0/1.0    1.0/1.0    1.0/1.0   1.0/1.0
    5        .33/1.6    .71/3.5    ---/---    .96/4.8   1.0/5.0
    10       .18/1.8    .52/5.2    .67/---    .91/9.1   .99/9.9
    20       .10/2.0    .34/6.8    ---/---    .83/17    .98/20
    50       .04/2.0    .22/11     .29/---    .67/34    .96/48
    100      .02/2.0    .11/11     ---/---    .50/50    .91/99  ]


8.2.6. Replacement Policy

[Feldman 94]

A replacement policy or algorithm chooses which cache block in a full cache to replace when a new block must be entered. Maximizing the hit ratio is an important requirement of a replacement algorithm. It is also important to achieve this result in hardware at minimal cost. The cache structure itself constrains the possible replacement policies: for the direct cache there is no choice, for the set associative there is limited choice, and for the associative there is no limit imposed by the hardware on the block in the cache that is to be replaced.

A simple algorithm called RANDOM chooses the block to be replaced randomly. It is easy to implement with a simple random number generator and has been shown, experimentally, to be one of the best replacement algorithms if the cache is sufficiently large.

In the First-In-First-Out, FIFO, algorithm, the block in the cache that entered earliest is the one replaced. It is fairly inexpensive to implement. The cache block locations over which a choice is to be made are thought of as being arranged in a circle. A replacement pointer, started anywhere on the circle, gives the next place to store a block. After a block is stored at the pointer (whether the frame was empty or replaced), the pointer is moved one location along the circle. Unfortunately this replacement algorithm is less effective than RANDOM.

Another algorithm is LRU, in which the least recently used block is replaced. Though its performance is good, comparable in many cases to RANDOM, it is more expensive to implement except in caches in which only two choices are available, e.g., set size 2 in a set associative cache.

A simplification of LRU, the Deville-Gobert algorithm [Deville 92], is inexpensively implemented and has performance which, as shown experimentally, even surpasses that of LRU.

1. All the block frames, empty or occupied, are always in one of two states: 0 (LRU) or 1 (MRU, most recently used).

2. Initially all block frames are in state 0.

3. A randomly selected frame among those in state 0 is the next one filled (if empty) or replaced (if occupied).

4. A frame goes to state 1 when a block first enters it, or when the block in that frame is used.

5. When all frames are in the 1 state, all are switched to the 0 state.

In summary, when the cache is full this algorithm randomly selects a block frame which has not been used since the last time all blocks in the cache had been used. If a block frame is empty, that one is used.
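A minimal sketch of rules 1 through 5 (Python; the class name and interface are ours):

    import random

    class TwoStateCache:
        """The two-state (0/1) replacement scheme described above."""

        def __init__(self, nframes):
            self.blocks = [None] * nframes   # rules 1-2: frames start empty,
            self.state  = [0] * nframes      # all in state 0
            self.where  = {}                 # block -> frame index

        def _touch(self, i):
            self.state[i] = 1                # rule 4: entered or used -> state 1
            if all(self.state):              # rule 5: all in state 1 -> all to 0
                self.state = [0] * len(self.state)

        def access(self, block):
            """Return True on a hit, False on a miss (fill or replace)."""
            if block in self.where:
                self._touch(self.where[block])
                return True
            # rule 3: randomly pick a state-0 frame to fill or replace
            i = random.choice([j for j, s in enumerate(self.state) if s == 0])
            if self.blocks[i] is not None:
                del self.where[self.blocks[i]]
            self.blocks[i] = block
            self.where[block] = i
            self._touch(i)
            return False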

Replacement algorithms are compared by running them on sample input sequences, actual and random, and comparing their block hit ratios. There are some theoretical results which make it possible to perform such tests efficiently.


8.2.6.1. The Optimal Replacement Algorithm, OPT

Any real replacement algorithm has access only to the input block sequence up to and including the current block. It cannot know the future. The information it does have is used to predict the future block sequence. To understand the goal of that prediction it is well to examine a mythical algorithm whose input is the entire input sequence to the present, and beyond. Given the future, in fact, the past is irrelevant. Of replacement algorithms which use the future of their input block sequence, OPT gives the highest possible hit ratio.

OPT: When replacement is necessary, replace the cache block whose next appearance in the future block input sequence occurs latest.

Briefly, it can be designated "longest till reuse", LTR.
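Since OPT needs the entire future block input sequence, it can only be simulated offline. A minimal sketch of the rule above (Python; quadratic and for measurement only, with ties among blocks broken arbitrarily):

    def opt_hits(sequence, nframes):
        """Count hits of the OPT (longest-till-reuse) policy in a cache
        of nframes block frames."""
        cache, hits = set(), 0
        for t, b in enumerate(sequence):
            if b in cache:
                hits += 1
                continue
            if len(cache) < nframes:         # fill empty frames first
                cache.add(b)
                continue
            future = sequence[t + 1:]
            def next_use(x):                 # position of next reference, or infinity
                return future.index(x) if x in future else float("inf")
            cache.remove(max(cache, key=next_use))  # replace latest-reused block
            cache.add(b)
        return hits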

8.2.6.2. Proof Of Optimality

Until a cache is full no replacement is necessary, so only algorithms which make no replacements until caches are full are considered.

In figure 8-12 the following lemma is restated, and its proof is charted in the same notation as used below.

It is assumed that for any input sequence OPT uniquely determines which block in the cache will be replaced; that is, all members of the cache can be uniquely ordered by their time to next occurrence. The following lemma shows that for any block input sequence to which an algorithm other than OPT is applied, there is another algorithm which will do at least as well (≥ hit ratio). So OPT is the only candidate for a replacement algorithm that is uniformly (for all input block sequences) at least as good as all others.

Lemma 8.2.1: Let the block input sequence be S = {s1 , s2 , . . . , sL}. Let X and Y be full caches of the same size whose initial states are:

X = T + {b}

Y = T + {a} , a , b ∉ T

and let the first appearance of a, at ta in S, be before that of b, at tb. For any replacement algorithm aX corresponding to S and X, there is an algorithm aY, corresponding to S and Y, with no fewer hits.

PROOF:

1. aY is the same as aX up to the time that a replacement is required. If this happens before ta and

a. the replacement is of a member of T, then aY will substitute for the same member of T and the situation for aX and aY remains as it was initially (T is changed, but it is the same in caches X and Y);

b. the replacement by aX is of b in cache X, then let a be replaced in Y at that time. Both caches experience a miss and both will then have the same contents. aY is the same as aX from then on, and has the same number of hits.

2. Otherwise the first replacement in X occurs at ta, when a arrives: a miss in X. Since a is in Y there is a hit in Y at that time. Now aX may choose to do one of two things:

a. It can replace b with a, and then both caches have the same contents. From that point aX is the same as aY. aY has 1 more hit than aX. Or


[Figure 8-12 charts the lemma (if the first occurrence of a is before that of b, then aY produces at least as many hits as aX on the subsequent block input sequence) and traces its proof: for each case it shows the block input sequence, the choices of aX on cache X and aY on cache Y, the cache changes, the resulting states at each point, whether X = Y, and the hits gained (+0 or +1).]

Figure 8-12: OPT LEMMA

b. Not b but some other block c is replaced in cache X at time ta (there must be a replacement in X at ta). aY has a hit at ta, and Y, unlike X, still holds c. ** aY has 1 more hit than aX. ** The states of X and Y are now

X = Tta + {b}

Y = Tta + {c} , b , c ∉ Tta , a ∈ Tta

where Tta denotes the common contents of the caches at ta.


i. If the next occurrence of c precedes that of b, this is the same as the situation at the start, with a replaced by c and with a shorter input sequence in place of S. This can only occur a finite number of times. So aY has at least 1 more hit than aX.

ii. If the next occurrence of c follows that of b, then

1. a. If before tb a replacement by input block e is required and aX replaces a member of T, aY will replace the same block in T, leaving the situation basically the same (the revised T is the same in both X and Y).

b. If instead b leaves X before tb under aX, then let aY replace c in Y at that time. Now both caches are the same, so in total aY has 1 more hit than aX.

2. Otherwise at tb aX has a hit and aY has a miss, and aY now chooses to replace c with b. Now X and Y are the same, but aY has lost some ground, namely 1 hit; however it was 1 hit ahead (see ** above), so aY still has as many hits as aX.

Theorem 8.2.1: Let the block input sequence be S = {s1 , s2 , . . . , sL}. Let X be a full cache and A0 a replacement algorithm run on that state and block input sequence. Then OPT, with the same starting state and block input sequence, will have at least as many hits as A0.

[Figure 8-13 pictures the chain of algorithms A0, A1, A2, A3, A4, . . . , each matching OPT for at least one more replacement than its predecessor and gaining at least as many hits.]

Figure 8-13: OPT THEOREM

The following development is pictured in figure 8-13. Run A0 until a time t0 at which it first replaces a block a0 that is closer in the future input block sequence than a block b0; up to this point it mimics OPT. Now consider an algorithm A1 which is like A0 up to this point but which, instead of A0's action at this point, replaces b0 and not a0 (it mimics OPT for one more replacement). The corresponding caches X0 and Y0 of the two algorithms will then be:

X0 = T + {b0}

Y0 = T + {a0} , a0 , b0 ∉ T

From lemma 8.2.1, A1 will have at least as many hits as A0. So continue running A1 until a similar situation occurs: a time t1 at which it first replaces a block, say a1, that is closer in the future input block sequence than a block b1 (it is no longer like OPT). Now consider an algorithm A2 which, instead of A1's action at this point, replaces b1 and not a1 (thus being like OPT for one more replacement). So A2 is like OPT from the beginning to this point in the block input sequence. The corresponding caches Y1 and Z1 of the two algorithms will then be:

Y1 = T + {b1}

Z1 = T + {a1} , a1 , b1 ∉ T

From lemma 8.2.1, A2 will have at least as many hits as A1. So continue running A2 until a similar situation occurs, etc. Each new algorithm is introduced when the previous one fails to satisfy the OPT property, and each new algorithm advances at least one position through the input block sequence. So finally, for some j, Aj finishes up the sequence. So Aj is OPT for the given block input sequence, and OPT has at least as many hits as A0.

8.2.6.3. Stack Replacement Algorithms

First note a restriction which applies to every replacement algorithm considered here.

A block in the cache is replaced by an incoming block, b, iff b is not in the cache and there are no empty block frames in the cache. If there are empty frames and b is not in the cache, b occupies one of them.

Any replacement algorithm, then, determines which cached block to replace when one is to be replaced under the above restriction.

A replacement algorithm is a Stack Algorithm iff its use guarantees that the blocks found in a cache of size n are always a subset of those found in one of size n + 1.

Such algorithms can be efficiently simulated because caches of all sizes, from 1 to n, can be simulated as a group, with considerable saving in space and time. That is, by simulating a size n cache and ordering the entries properly, the state of all smaller caches can be seen.
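The inclusion property itself is easy to test empirically. A minimal sketch (Python; the helper names are ours) comparing the contents of 2- and 3-frame caches on the sequence 1, 2, 3, 1, 4 used in figure 8-15 below; LRU preserves inclusion, FIFO does not:

    from collections import OrderedDict, deque

    def lru_contents(seq, n):
        """Set of blocks an n-frame LRU cache holds after processing seq."""
        c = OrderedDict()
        for b in seq:
            if b in c:
                c.move_to_end(b)
            else:
                if len(c) == n:
                    c.popitem(last=False)    # evict least recently used
                c[b] = True
        return set(c)

    def fifo_contents(seq, n):
        """Set of blocks an n-frame FIFO cache holds after processing seq."""
        c = deque()
        for b in seq:
            if b not in c:
                if len(c) == n:
                    c.popleft()              # evict earliest-entered block
                c.append(b)
        return set(c)

    seq = [1, 2, 3, 1, 4]
    print(lru_contents(seq, 2) <= lru_contents(seq, 3))    # True: inclusion holds
    print(fifo_contents(seq, 2) <= fifo_contents(seq, 3))  # False: not a stack algorithm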

(An algorithm which replaces blocks when there are empty frames, or which keeps more than one copy of a given block, could be designed, though it's not clear why anyone would want one.)

The block in a cache that is replaced is called the dispensable block. To make the handling of empty block frames consistent with that of non-empty ones, assume each cache to be filled initially with "empty" blocks, numbered 1 to k in the cache of size k. If there are "empty" blocks in a cache, the lowest numbered is the dispensable one. Note that empty blocks in caches of different sizes are then removed from the lowest numbered to the highest. A replacement algorithm is a stack algorithm if and only if its assignment of dispensable blocks obeys the following constraint.


Lemma 8.2.2: A is a stack algorithm iff, using A, the dispensable block D in Cn is dispensable in any smaller cache Cn−k, k > 0, in which D appears.

PROOF:

This proof is illustrated by figure 8-14, in which the smaller cache, Cn−k, has 3 block frames and the larger, Cn, has 5.

Initially every block (they are all numbered empty blocks) in Cn−k is in Cn.

This inclusion property is not destroyed by replacements involving empty blocks. When there are empty blocks in both Cn and Cn−k, the lowest numbered one is the same in both, so replacement of that empty block by an input block meets the lemma's requirement, and still any block in Cn−k, empty or otherwise, is also in Cn. Similarly, when there is an empty block in Cn but none in Cn−k, the replacement maintains the property that every block in Cn−k is in Cn, no matter what block in Cn−k is replaced.

Now consider the cases in which only non-empty blocks are dispensable.

The "only if" part follows because if it were false, i.e., if D was dispensable in Cn, but not inCn−k, though it appeared there, then when a block present in neither, were received, D would bedisplaced from Cn but not from Cn−k. So Cn−k, the smaller cache, would contain a block, D, nolonger in the larger cache, Cn

The inclusion property holds initially and is not destroyed by any replacement involving empty slots in the cache. The algorithm fails to be a stack algorithm only if this inclusion is destroyed by some input block replacing cache blocks in Cn and Cn−k. As long as the inclusion holds, the only way Cn−k can come to hold a block not in Cn is if a block B, in neither cache, is received and the block D replaced in Cn is in Cn−k but not displaced from it. (If B is in both caches, no change occurs. If B is in Cn but not in Cn−k, inclusion still holds after the replacement in Cn−k: a block included in Cn is replaced by one also in Cn. And B cannot be in Cn−k yet not in Cn.) Thus the condition given in the lemma for an algorithm to be a stack algorithm must be violated.

In figure 8-14 all possible combinations of types of inputs (heading the columns) and relations between the dispensable blocks in Cn and Cn−k (heading the rows) are illustrated. Inclusion holds initially, and is shown to be maintained in all cases except the bottom rightmost entry, in which the relation between the dispensable blocks violates the requirement for a stack algorithm in Lemma 8.2.2 and an input in neither cache is received.

8.2.6.4. Examples Of Stack And Non-Stack Algorithms

Figure 8-15 shows the behavior of FIFO with 2- and 3-block caches for the block input sequence 1, 2, 3, 1, 4. Inclusion holds until the final state, in which block 1 is still in the smaller cache but not in the larger.

Figure 8-15 also shows (in the last columns) the behavior of a 2-block cache using the LRU and OPT strategies respectively for a very short block input sequence, 1, 2, 3, 1. Even for this short sequence OPT already is better than LRU. It is also better than FIFO (for a 2-block cache), which like LRU misses on every input in this illustrative example.


[Figure 8-14 charts the four cases of the lemma proof on a 3-frame cache Cn−k and a 5-frame cache Cn: each combination of input type (in both caches, in Cn only, in neither) with the relation between the dispensable blocks D(Cn) and D(Cn−k) is traced before and after the input, showing inclusion maintained in every case but the last.]

Figure 8-14: The Four Cases Of The Lemma Proof

8.2.7. General Characterization Of A Stack Algorithm

Figure 8-16 illustrates the effect of a replacement algorithm in which the dispensable block is the highest numbered block in the cache. At the top of the figure the contents of caches of size 1 through 7 for the block input sequence 1, 2, 6, 4, 9, 3, 4, 7, 3 are shown in a stack chart with seven rows. The top k rows display the contents of a k block-frame cache as each block is presented. A starred block is a dispensable block, in this example the highest numbered one in its cache. This figure provides an example to illustrate some general principles.


[Figure 8-15 traces FIFO on 2- and 3-block caches for the input 1, 2, 3, 1, 4, with inclusion failing on the last input (FIFO is not a stack algorithm), and traces LRU and OPT on a 2-block cache for the input 1, 2, 3, 1, marking hits (H) and misses (M): OPT does better than LRU.]

Figure 8-15: FIFO is NOT a Stack Algorithm // LRU-OPT

The Stack Star Interpretation

If there are two successive starred blocks, b and b', within a column, and these are in rows j and j + p respectively, then b is dispensable in all caches of size j to size j + p − 1.

Lemma 8.2.3: This way of displaying dispensable blocks for a sequence of cache sizes is possible iff the replacement algorithm is a stack algorithm.

PROOF:

Consider a set of caches C1 to Cn, all filled by replacement algorithm A. Blocks are numbered 1 to n, with blocks numbered 1 through k being in cache Ck. The dispensable block in cache Ck is the one numbered jk, so jk ≤ k. Consider the sequence ji, i going from 1 to n. That sequence must be monotonic non-decreasing; if it were not, then jk+a < jk ≤ k for some k and some positive a.


[Figure 8-16 shows, at top, the stack chart for cache sizes 1 through 7 and the block input sequence 1, 2, 6, 4, 9, 3, 4, 7, 3, with the largest number dispensable (starred) in each cache. At bottom it illustrates the general update rules: inserting a block i not currently in any cache using the NBRS <1, 3, 6, 8, 9>, and re-entering a block d currently at row 5 using the OBRS <1, 3, 5>.]

Figure 8-16: The Stack Chart Construction Rules

That would imply that the dispensable block in Ck+a is in Ck but is not dispensable in Ck, which, according to Lemma 8.2.2, cannot happen if A is a stack algorithm.

This representation of a stack algorithm is illustrated in figure 8-17. In (1), caches of size 1 through 5 are shown with contents, including the starred entries, that satisfy lemma 8.2.2.


[Figure 8-17 shows: in (1) and (2), sets of caches C1 through C5 satisfying the stack condition, each with its legitimate single-column (stack) representation; in (3), a set of caches in which b, dispensable (starred) in C5, is in C4 but not dispensable there, together with three failed attempts at a single-column representation.]

Figure 8-17: Multiple Cache Representation

To the right of these is a single vertical list in which all 5 of these caches are represented; it legitimately follows the stack star interpretation. In (2) another set of caches is shown along with its legitimate representation as a single list. In both these cases the set of caches satisfies the stack condition of the lemma, i.e., when a dispensable block in a cache is also in a smaller cache it is also dispensable there. In (3) the larger caches' contents include all of the contents of the smaller caches, as in (1) and (2), but here a dispensable block in the cache of size 5, namely b (starred), also appears in the cache of size 4 yet is not dispensable there. So this set of caches cannot be represented in such a way that the stack star interpretation holds. Three failed attempts are shown.

So the state of a sequence of cache sizes can be represented in a single column, c, of length nc, in a stack chart. How does the state change when the next block appears? In general the next block will be found in stack column c at some row k ≤ n; if it is not there, we say by convention that it is at row n + 1 (actually an unoccupied position in a cache of size n + 1). There is a rule for how the next column is formed, illustrated for our running example in figure 8-16. The general rule is:

Let <j1 , j2 , . . . , jn+1>, ji > ji−1, be the sequence of increasing row positions with starred entries in column c of a stack chart. j1 is a constant, namely 1, since the first row entry is always starred, and jn+1 = n + 1 is always the first unoccupied block frame. Call this the "New Block Replacement Sequence", NBRS. (It is used when a block not already in the stack is entered.)


If the block b that is to be entered into column c is already in the cache, at row je, then the sequence <j1 , j2 , . . . , jr , je>, where jr < je ≤ jr+1, is called the "Old Block Replacement Sequence", OBRS.

Stack Update Rule

For whichever sequence, NBRS or OBRS, is appropriate, the replacement rule is:

The block at j1 = 1 is replaced by the block now entering the caches. For each subsequent ji in the sequence (up to jn+1 = n + 1 in the NBRS, up to je in the OBRS), the block originally at row ji is removed and replaced by the block originally at row ji−1. In the NBRS case the block that was at row jn thus moves into row jn+1, extending the number of occupied rows to n + 1.

This is illustrated in the lower part of figure 8-16: a replacement in a column with NBRS = <1 , 3 , 6 , 8 , 9> when a block i not already in the cache is entered, and also with OBRS = <1 , 3 , 5> for the case that d, at position 5 in the column, is the block being entered. When i is entered this procedure gives the correct representation of all caches up to size n + 1 = 9 after the block is inserted, because the procedure removes the block which is dispensable in all caches of size 8 or less. For the case that d is entered, since d is present in all caches of size 5 or greater, the result gives the content of all caches of size 8 or less.
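A minimal sketch of this update rule (Python; stack holds one column with row 1 at index 0, seq is the NBRS or OBRS as 1-based row numbers, and the block letters are illustrative, not those of the figure):

    def stack_update(stack, seq, block):
        """Apply the stack update rule: the entering block goes to row
        j1 = 1, and the block originally at each starred row moves down
        to the next row in the replacement sequence.  For an OBRS the
        final swap overwrites the entering block's old copy at row je."""
        carried = block
        for j in seq:
            if j - 1 == len(stack):      # NBRS: first unoccupied row, extend
                stack.append(carried)
            else:                        # displace this row's block downward
                stack[j - 1], carried = carried, stack[j - 1]

    # NBRS example in the style of figure 8-16: stars at rows 1, 3, 6, 8,
    # first empty row 9; a block i not in any cache is entered.
    column = list("abcdefgh")
    stack_update(column, [1, 3, 6, 8, 9], "i")
    print(column)   # ['i', 'b', 'a', 'd', 'e', 'c', 'g', 'f', 'h']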

Now the question arises: which blocks are dispensable in the updated cache? Under certain conditions these are the same starred blocks present before the update, some of them moved to different rows. The conditions are:

Update Inclusion Conditions

Let all starred blocks be ordered by their row positions. By this "dispensability" ordering, the starred block in row j is more dispensable than the one in the closest previous starred row, say j − k. This must be true before the update, and it may continue to hold for the same dispensable (starred) blocks after it. The block just entered, now in row 1, is the dispensable block in the cache of size 1, and it must be less dispensable than all other starred blocks remaining in the cache; i.e., it must be the most recently used, MRU. This implies that the relative dispensability of the other starred blocks is unchanged by the update. The condition holds for a replacement algorithm that always replaces the MRU block, and for one that always replaces the LRU (least recently used) block. But it does not hold for all stack replacement algorithms.

For many stack replacement algorithms the update inclusion condition is not met. After each replacement according to the stack update rule, an algorithm could reassign dispensable blocks in any way that ensured the maintenance of the stack star interpretation, and it would be a stack algorithm. OPT and the RANDOM algorithm, unlike LRU and MRU, generally require such reassignment.

8.3. OVERVIEW OF STACK ALGORITHMS

Now a general formulation of the properties of a stack algorithm is explored.


8.3.0.1. Properties Of A Stack Algorithm

Basic Requirements

In order to have a uniform basis for selection of the block to be replaced (the dispensable block) when replacement is necessary, a number, mj, is associated with the j th block, bj, in the input sequence. Similarly a number, ni, is associated with the block which is in Ci but not in any Ci−k, k ≤ i − 1 (i.e., in row i of the stack chart). The dispensable block is chosen by a function whose arguments are the numbers assigned to blocks in the cache. The value of the function is one of those numbers (one of its arguments), and the dispensable block is the one associated with it. The number associated with a block, b, in the cache can only change when b appears again in the block input sequence; then the number assigned to b in the cache becomes its number in the block input sequence. No two blocks in the cache can have the same associated number.

Given the requirements above the nature of the selection function is strongly constrained.

The dispensable block, Dk ∈ Ck (in the first k block frames of the cache), is chosen with a function defined as follows:

Necessary Properties Of Stacking Function

Since the number associated with the block which is in Ci and not in Ci−1 is ni,

dsk(n1, . . . , nk) = nd , 1 ≤ d ≤ k , is the number associated with the dispensable block in Ck.

In order for this function to maintain the inclusion property of a stack algorithm, either

1. Dk+p = dsk+p(n1, . . . , nk, . . . , nk+p) = nd with d ≤ k, so that necessarily Dk = dsk(n1, . . . , nk) = nd, or

2. Dk+p = dsk+p(n1, . . . , nk, . . . , nk+p) = nk+e , e > 0.

And since this is true for all k ≥ 1 and all p ≥ 1 it follows that:

Dm = dsm(n1, . . . , nm) = dsm(Dm−1, nm) = dsm( . . . ds4(ds3(ds2(n1, n2), n3), n4) . . . , nm)

or, using infix notation ⊕i in place of dsi,

n1 ⊕2 n2 ⊕3 · · · ⊕m nm = Dm−1 ⊕m nm = ((((n1 ⊕2 n2) ⊕3 n3) ⊕4 n4) · · · ⊕m nm)

i.e., the function is a sequence of possibly different (indicated by the subscripts) binary operations, associative from the left. It is called a stacking function.
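Left associativity means Dm can be computed as a running fold down the column, one ⊕i at a time. A minimal sketch (Python; the names are ours), using the minimum operation of algorithm A1 below:

    from functools import reduce

    def dispensable(ns, ops):
        """D_m for the column n1..nm, folding left to right.

        ns  : [n1, ..., nm], numbers down one column of the stack chart.
        ops : [f2, ..., fm], binary operations realizing the ⊕_i.
        """
        return reduce(lambda D, op_n: op_n[0](D, op_n[1]),
                      zip(ops, ns[1:]), ns[0])

    # Algorithm A1 (minimum at every level): the smallest number is dispensable.
    print(dispensable([5, 2, 9, 4], [min] * 3))   # -> 2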

Given all the basic requirements for a stack algorithm, it follows that its stacking function must have these properties. The description of the function exactly matches the premise of lemma 8.2.2. So:

Theorem 8.3.1: An algorithm satisfying the Basic Requirements above is a stack algorithm if and only if it chooses its dispensable block with a stacking function.

PROOF: As stated above, a proof can be based directly on lemma 8.2.2. In fact conditions 1 and 2 above are clearly equivalent to the conditions of that lemma.

An alternative proof depends on the representation of a stack algorithm as one in which the contents of a set of caches can be represented in a single column for which the stack star interpretation holds.

If there are two successive starred blocks, b and b', within a column, in rows j and j + p respectively, then b is dispensable in all caches of size j to size j + p − 1. So b is dispensable in Cj, Cj+1, . . . , Cj+p−1 and b' is dispensable in Cj+p. Since k is either a row between two starred entries or a starred entry itself, it follows that for all k > 1 a block is dispensable in a cache of size k if either

1. it was dispensable in Ck−1 (no star in row k), or

2. the block which is in Ck (and thus not in Ck−1) is dispensable (star in row k).

But this is true iff the replacement algorithm A is built with a stacking function.

8.3.1. Examples

Definition: In what follows the highest / lowest member of the set of numbers associated with blocks in the cache is denoted H1 / L1. The i th highest / lowest number is denoted Hi / Li.

All the examples given satisfy the basic requirements and use stacking functions which are binary and jointly associative from the left. In particular, the dispensable block in each case is chosen by a composition of minimum (↓) and maximum (↑) functions. Also, in each example the cache contents have interesting properties.

8.3.1.1. Replacement Algorithm A1

BLOCK L1 IS DISPENSABLE

Stacking Function

Di = Di−1 ⊕ i ni = Di−1 ↓ ni , for all i ≥ 2

(If the number assigned to a block is its index in the block input sequence, that is, the j th block in the block input sequence is associated with the number j (mj = j), then A1 is clearly the LRU algorithm. In this case, whenever a new number for a block appears it is greater than the previous number associated with that block in the cache.

Analogous remarks hold if

Di = Di−1 ⊕ i ni = Di−1 ↑ ni ,

which is the case with MRU (additionally, for MRU the highest number is assigned to the block just entered, so it is the dispensable block in caches of all sizes). In this case too, whenever a new number for a block appears it is greater than the previous number associated with that block in the cache.)

In order to ensure that the caches have the property P1 given below, it is necessary to restrict the block input sequence.

Restriction R1:

Whenever a block reappears in the block input sequence, its associated number is ≥ its number in the caches that hold it: if bj+k = bj then mj+k ≥ mj.

Such a stack algorithm has the properties P1 given below (figure 8-18 gives examples).


[Figure 8-18 shows stack charts for the example algorithms, with each block's associated number as a subscript: (a1) algorithm A1 (L1 dispensable) with Restriction R1 violated; (a2) A1 with R1 holding, for the input sequence 1, 3, 1, 2, 4, 5, 9, 7, 8, 6, where the last column is completely sorted; (b) algorithm A2 with L3 dispensable in caches of size greater than 3, with no restriction needed since block numbers never change; (c) algorithm A3, the median dispensable, with each entry's rank (Hi or Li) marked.]

Figure 8-18: Examples: Lowest, kth Lowest, Median Stack Algorithms

Properties P1:


Of the numbers associated with the most recent occurrences of each block that has appeared in the input sequence, the k − 1 highest are in cache Ck, for all k ≥ 1. Or, equivalently: of all numbers appearing in all caches, the k − 1 highest are in cache Ck.

Prior to the proof, consider the examples in figure 8-18. Each chart in this figure has the contents of cache Ck in its top k rows, in the usual condensed representation for stack algorithms.

In (a1) Restriction R1 is violated: when block 1 arrives for the second time, as the 4th input, its associated number, 2, is less than its associated number, 5, on its first arrival. As a consequence, in the 4th column the highest associated number is not in the cache of size 2. So if R1 is violated, Property P1 does not necessarily hold.

In (a2) R1 does hold, and so Property P1 holds. For example, in column 6 the highest number, 9, associated with block 1 is in every cache Ck, k ≥ 2; the two largest, 9 and 7, are in all Ck, k ≥ 3; the 3 largest, 9, 7, 5, are in Ck, k ≥ 4; etc. Notice that in the last column the associated numbers (subscripts) are completely sorted from 11 down to 0.

Proof Of P1:

Remember Hi is the i th highest of the block numbers associated with blocks in the block input sequence (but if the same block has appeared more than once, only its most recent number is included in the selection of the i th highest).

When Ck is first filled it certainly contains H1, . . . , Hk−1 and one further number, Z, the smallest in the cache and hence the dispensable one. As long as no block already in the cache arrives, this remains so (although the numbers of those blocks may increase).

If a block not in the cache arrives, its number, say W, is the largest number that has yet arrived for that block. Z, the smallest numbered block in the cache, is bumped (removed); H1, . . . , Hk−1 and W remain. If W is less than Hk−1, then each Hj, k−1 ≥ j ≥ 1, is unchanged and W is the new dispensable number in Ck. If W is greater than Hk−1, then Ck still contains the k−1 highest block numbers that have arrived, and the old Hk−1 becomes the dispensable number in Ck.

On the other hand, a block b already in Ck may arrive associated with a number different from the one it carries in Ck; the number in Ck associated with b then changes to, say, y. If y replaces Ha, for some a, k−1 ≥ a ≥ 1, it must, according to R1, be higher than Ha. So even if y replaces the dispensable block number in Ck, P1 still holds.

8.3.1.2. Replacement Algorithm A2

BLOCK Lj IS DISPENSABLE

Di = Di−1 ⊕ i ni = Di−1 ↓ ni , ∀ i, i > j

and

Di = Di−1 ⊕ i ni = Di−1 ↑ ni , ∀ i, i ≤ j


An example with j = 3 is given in figure 8-18 (b). It is assumed that a block has the same block number at each of its occurrences in the input block sequence, so only that number is shown. Notice that there are two numbers less than the dispensable block number in every cache of size greater than 3.

Restriction R2

The properties given below hold if every reappearance of a block in the block input sequence has the same associated number as the block has in the cache, and different blocks have different numbers.

(Alternatively, the associated number of a reappearance may differ, constrained as follows: whenever a block whose current number in the cache is one of the k − 1 lowest reappears, it is given a new block number lower than the one in the cache; for any other block in the cache the new number is greater than or equal to the current one.)

Properties P2

In each cache of size greater than j it is Lj, the j th lowest block number, that is dispensable.

Also, Cj contains the lowest j − 1 block numbers that have occurred in the block input sequence, and Ci, i > j, contains the i − j highest block numbers that have occurred in the block input sequence.

Proof Of P2

By the proof of P1 (with highest and lowest interchanged), Cj contains the j − 1 lowest of the most recent block number instances from the block input sequence. The other number, O, included therein is the dispensable one in Cj. Cj+1, when first filled, also contains the highest block number, say H1, from the input block sequence. If H1 is in Cj then it is dispensable there (it is the maximum there), and O is dispensable in Cj+1 (because O = H1 ↓ O). So if a new block with number Y arrives it will replace O; H1 ↓ Y becomes dispensable in Cj+1, and H1 ↑ Y, which is the maximum so far, remains in Cj. If H1 ∉ Cj but is in Cj+1, then it is not dispensable in either; after Y arrives, H1 ↑ Y, the new maximum, is in Cj+1 while H1 ↓ Y is dispensable there.

8.3.1.3. Replacement Algorithm A3

THE DISPENSABLE BLOCK IS FOUND BY ALTERNATING ↓ AND ↑ FUNCTIONS

De−1 ⊕ e ne = De−1 ↓ ne , where e ( ≥ 2) is an even number, and

Do−1 ⊕ o no = Do−1 ↑ no , where o ( ≥ 3) is an odd number.

An example is given in figure 8-18 (c), in which block input numbers are shown with subscripts giving the rank of each number in the largest cache in the column in which it occurs. Here the number associated with a block is the same at each appearance in the block input sequence.

Restriction R3

If the same block always has the same number, then the dispensable block is the median. That means, however, that when a block reappears in the block input sequence it cannot change its number, which is a significant restriction. (Under certain very tight constraints such changes can be allowed.)


Properties P3

Let e be an even and o an odd number. Then:

Ce contains H1 > . . . > He/2 > M > L(e−2)/2 > . . . > L1, and M, the median, is dispensable.

Co contains H1 > . . . > H(o−1)/2 > M > L(o−1)/2 > . . . > L1, and M, the median, is dispensable.

Proof Of P3

Consider C2 first. When two entries first appear in C2, n1 ↑ n2 = H1. As long as no new block appears in the block input, H1 remains in C2. When a block different from those in C2 arrives, with associated number say N, the smaller number H1 ↓ N is bumped, so the maximum so far, H1 ↑ N (the new H1), remains; it is not dispensable in C2. H1 ↓ N, on the other hand, is dispensable and is the median of C2's contents.

C3 contains H1, X and Y. One of X, Y is dispensable in C2, and the other is in C3 and not in C2. So X ↑ Y is dispensable in C3 and X ↓ Y is the other member of C3. Since H1 > X ↑ Y > X ↓ Y, X ↑ Y is the median. Also note that the first time there are three members in C3, X ↓ Y = L1. As long as no new block appears in the block input, L1 remains in C3. When a block different from those in C3, say with number N, arrives, L1 ↓ N, the minimum so far (the new L1), remains trapped in C3; so C3 always contains L1. L1 ↑ N, on the other hand, is dispensable and is the median.

As shown for C2 and C3, the general situation with minimum (↓) and maximum (↑) as the choice functions is that certain quantities get "trapped" within the lower caches, or near the top of a column of the stack representation of a set of caches. We prove the conclusions for the general case of Co, where o is an odd number ≥ 3.

Assume that, after the latest block enters the caches, Co−1 contains

H1 > . . . > H(o−1)/2 > M > L(o−3)/2 > . . . > L1

with M, the median, dispensable in Co−1. The cache Co contains all of these and also X. So its contents are either

H1 > . . . > H(o−1)/2 > X ↑ M > X ↓ M > L(o−3)/2 > . . . > L1 or

H1 > . . . > H(o−1)/2 > M > X > L(o−3)/2 > . . . > L1

and in either case X ↑ M, the median, is dispensable in Co.

When Co is first completely occupied its contents are

H1 > . . . > H(o−1)/2 > M′ > L(o−1)/2 > . . . > L1 , where M′ = X ↑ M.

As long as only members of Co arrive, the contents of Co are as given. If a new block number Z, not in Co, arrives, it bumps M′. We then have one of the following (with the Hi and Li as they were before Z arrived):

1. Z > H(o−1)/2

2. Z < L(o−1)/2

3. H(o−1)/2 > Z > L(o−1)/2

In each of these cases, after the Hi's and Li's are recalculated, Co contains

H1 > . . . > H(o−1)/2 > M′ > L(o−1)/2 > . . . > L1.

So after each block entry the highest and lowest (o−1)/2 block numbers are in Co and the median there is the dispensable block number.


The proof for the general case of Ce, with e an even number, is analogous and is left as an exercise.

NOTE: Interchanging Li and Hi in each of the arguments above leads to analogous interesting conclusions.

8.3.1.4. The Distance String

Notice that in figure 8-16 block 4, which is the 7th block input (in column 7), is in column 6, where it is present in caches of size ≥ 4, and thus in these caches when the 7th block input arrives. On the other hand the 6th block input, block 3, is not in the size 5 cache when that block input arrives.

In general the stack chart reveals whether the i th block input, bi, is in the stack in column i − 1. If it is not, the i th input causes a miss in a cache of any size; this is indicated by saying that d(i) = ∞. If it is in the stack in column i − 1 at level n, then all caches of size n or greater have a hit for the i th input, while all caches of smaller size have a miss; this is indicated by saying d(i) = n. It follows that for any input string b0, b1, . . . , bn there is a corresponding string (the distance string) ∞, d(1), d(2), . . . , d(n), where d(i) is the smallest cache size in which block input bi is a hit. So if Hm is the hit ratio of a cache of size m as determined by a block input sequence, the hit ratio experienced by a particular block input sequence can be determined from the corresponding distance string:

Hm = ( ∑ j=1 to n [ m ≥ d(j) ] ) / n , where [ · ] is 1 if the condition holds and 0 otherwise.
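As a concrete instance, LRU is a stack algorithm whose column simply orders blocks by recency of use, so d(i) is the classical LRU stack distance. A minimal sketch (Python; the choice of LRU here is our illustration) computing the distance string and Hm from it:

    def distance_string(seq):
        """d(i) for each input: 1-based depth of the block in the LRU
        stack at the time of reference, or infinity on a first reference."""
        stack, dist = [], []
        for b in seq:
            if b in stack:
                dist.append(stack.index(b) + 1)
                stack.remove(b)
            else:
                dist.append(float("inf"))
            stack.insert(0, b)           # the referenced block becomes row 1
        return dist

    def hit_ratio(dist, m):
        """Hm = (number of inputs j with m >= d(j)) / n."""
        return sum(d <= m for d in dist) / len(dist)

    dist = distance_string([1, 2, 6, 4, 9, 3, 4, 7, 3])
    print([round(hit_ratio(dist, m), 2) for m in (1, 2, 3, 7)])
    # -> [0.0, 0.0, 0.22, 0.22]: only the re-references to 4 and 3 can hit,
    #    and under LRU they hit in every cache of size >= 3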

LRU Implementations

A two-way linked list is maintained connecting the blocks in the cache in order of use, from the MRU to the LRU. Separate head pointers to the MRU and LRU ends are also maintained. When the block referred to changes, that block becomes the MRU and is pointed to by the MRU head pointer; if the block is already in the cache, the other pointers are rearranged (simply) to maintain the list order. If the block is not in the cache, the block at the LRU pointer is replaced and the LRU pointer is updated to the block preceding the removed one. The pointers can be stored along with the addresses in the cache. So that the rearrangement of the list occurs only when the block reference changes, a register can hold the last block referenced, to be compared with each new reference.
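A minimal sketch of this scheme (Python; names are ours). The MRU and LRU head pointers become the two ends of a doubly linked list:

    class Node:
        __slots__ = ("block", "prev", "next")
        def __init__(self, block):
            self.block, self.prev, self.next = block, None, None

    class LinkedLRU:
        """Doubly linked list ordered from MRU to LRU."""

        def __init__(self, nframes):
            self.n, self.table = nframes, {}   # block -> Node
            self.mru = self.lru = None         # the two head pointers

        def _unlink(self, node):
            if node.prev: node.prev.next = node.next
            else:         self.mru = node.next
            if node.next: node.next.prev = node.prev
            else:         self.lru = node.prev

        def _push_front(self, node):           # make node the MRU entry
            node.prev, node.next = None, self.mru
            if self.mru: self.mru.prev = node
            self.mru = node
            if self.lru is None: self.lru = node

        def access(self, block):
            """True on a hit; on a miss the LRU block is replaced."""
            node = self.table.get(block)
            if node:                           # hit: move to the MRU end
                self._unlink(node)
                self._push_front(node)
                return True
            if len(self.table) == self.n:      # replace block at LRU pointer
                victim = self.lru
                self._unlink(victim)
                del self.table[victim.block]
            node = Node(block)
            self._push_front(node)
            self.table[block] = node
            return False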

8.4. Interleaved Memory--Prefetching And Pipelining

Confounding the effort to increase the speed of computers is the relatively long access time to memory. Generally this time increases with the size of the memory. The memory bottleneck can be widened by the use of parallel memories (modules), provided that related items can be simultaneously accessed in the different memory modules. To get a speed advantage, each memory module is buffered. See figure 8-19 (a). The module addressing allows a single access to read all modules at the same "cross module" address. Successive program instructions, or related data words, are placed, when possible, at the same cross module address in successive modules. These pass through module data registers and then to a buffer from which they can be read at high speed, thus decreasing the average memory access time.


[Figure 8-19 (a) shows 2^m memory modules of 2^n words each; the address splits into a cross module part x (n bits) and a module part y (m bits), and each module feeds a module data register and an instruction buffer register. (b) shows the pipelined arrangement: the CPU's memory address register MAR (fields MAR1, MAR2) feeds an address selector SA, which loads one of the module address registers MdAR; each memory module MdMEM reads into its module data register MdDR, and a data selector SD returns the word to the CPU's MDR.]

Figure 8-19: Interleaved Memory--Pipelining

To get a word not already in the output buffer, the cross module address x chooses a row through all modules, and the module address y chooses the word within that row. There are 2^m modules and 2^n words in each module, so x has n bits and y has m bits. The outputs from each module at a cross module address x go to individual data registers, and thence to instruction buffer registers. The addressed buffer register can then be selected using address y. Often the first of the buffer registers is the first output sought, then the second, etc., so the buffer registers can be shifted to obtain successive words.


The reason there is a data register as well as an instruction buffer register associated with each module is to allow new accesses as soon as the buffers have been emptied into the instruction registers; but even without the extra set of registers and the accompanying overlap, the interleaved memories provide a speedup.
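Splitting an address into its cross module part x and module part y is a mask and a shift. A minimal sketch (Python), assuming the low-order m bits select the module, which is what places successive words in successive modules:

    def split_address(addr, n, m):
        """Split an address into (x, y): x is the cross module row
        (n bits, a word address within every module); y selects one of
        the 2**m interleaved modules (the low-order bits, so consecutive
        words fall in consecutive modules)."""
        y = addr & ((1 << m) - 1)
        x = (addr >> m) & ((1 << n) - 1)
        return x, y

    # Four modules (m = 2): consecutive addresses hit modules 0, 1, 2, 3, 0, ...
    for a in range(6):
        print(a, split_address(a, n=8, m=2))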

Access to memory arranged in independently addressable modules can be made much more flexible and still provide significant speedup through pipelining. The arrangement of access to the modules is shown in figure 8-19 (b). The operation of the pictured system in executing a read is:

1. First a memory address found in an instruction is brought into the Memory Address Register, MAR, in the CPU. This address has two parts: MAR2 identifies a module, and MAR1 an address within that module. MAR provides input to the Address Selector, SA, which selects a Module Address Register, MdAR(MAR2), and stores MAR1 there. (After this storage in MdAR(MAR2), MAR is free and ready to accept the next memory address, provided the next address is to a different module.)

2. MdAR(MAR2) provides input to the addressing circuitry of the selected module, MdMEM(MAR2), which reads the data at address MAR1 into the Module Data Register for that module, MdDR(MAR2). (At that point MdAR(MAR2) becomes available for use by subsequent memory accesses, even if the same module is addressed by them.)

3. MdDR(MAR2) provides input to the Data Selector, SD, which behaves functionally like a set of "or" gates, to feed MdDR(MAR2) back to the CPU's MDR.

Figure 8-20 is a standard task table. It is assumed that reading the memory modules, MdMEM, in contrast to other reads, requires 3 units of time (even if a longer time is more realistic, the principles remain the same). The MdAR has a forbidden interval of 3, the largest in the table; it is in use for 4 time units. So with a constant pipe start interval of 4 units, pipelining is sensible even if the same module is read each time. The output interval, however, is then longer than the module memory access time. This can be improved.

In fact, if two succeeding memory addresses in MAR are to different modules then, as noted, the second can write into an MdAR while the first is still using one, since they use different MdAR's and different MdMEM's. This is shown in figure 8-21, in which successive tasks (reads) to different modules are traced (two tasks, with the start of a third). The outputs in this case emerge with an interval of 2 separating them. Thus the interval between successive accesses is 2/3 of a module's access time. This is the major benefit of this arrangement. The reduction to a 2 stage interval between tasks holds during a sequence of k ≥ 2 instructions accessing k different modules, for module access times ≥ 2. This is analogous to what happens with the multiphasing approach to removal of forbidden intervals.

When there are two succeeding accesses to the same module in the program, a NOOP can be inserted or a one instruction stall executed, so as to prevent that condition from reaching the pipeline.

By eliminating the forbidden interval of 3 in the use of MdAR, it is possible to increase the throughput even when the same module is addressed by successive instructions. This is accomplished by adopting the shift register technique, as shown in figure 8-22. It requires doubling the number of MdAR registers. As an address is shifted from MdAR0 to MdAR1, the input to MdMEM must remain unchanged even though it comes from MdAR0 for a stage and then from MdAR1 for a stage. This technique does not generalize simply to the case in which three or more successive stages are needed to read the memory.

Task Table (w = written, r = read; columns are time steps)

            1   2   3   4   5   6   7   8   9   10
  MAR       w   r
  SA            w   r
  MdAR              w   r   r   r
  MdMEM                 r   r   r
  MdDR                          w   r
  SD                                w   r
  MDR                                   w   r

Figure 8-20: Memory Pipeline Task or Resource Use Table

Pipe Table (entries give the task (read) number using each resource at each time step)

            1   2   3   4   5   6   7   8   9   10
  MAR       1   1   2   2
  SA            1   1   2   2   3   3
  MdAR              1   1   12  12  23  23  3
  MdMEM                 1   1   12  2   23  3   3
  MdDR                          1   1   2   2   3
  SD                                1   1   2   2
  MDR                                   1   1   2

Figure 8-21: Table For Reads To Different Modules Of A Pipelined Memory

The conclusions drawn for reads only still hold if reads and writes are interspersed.

8.5. Cache Coherence

In figure 8-23 (a) a group of processors, the Pi's, is connected through a connection network to a group of memory modules, the MMi's, for parallel operation. Under the assumed tightly coupled arrangement any processor can access any memory, and work done by one processor can be transmitted through memory to another processor. In a later chapter we will consider the construction of the connection network. The concern now is the consequence of adding caches to this system.

Let the time to set up and transmit information through the network be tNet, and the time to read or write a memory module, an MM, be tMM. A memory access then costs tNet + tMM. Greater speed can be obtained by caching; however, there must be restrictions on cache usage.


Task Table (w = written, r = read; columns are time steps)

            1   2   3   4   5   6   7   8   9   10
  MAR       w   r
  SA            w   r
  MdAR0             w   r
  MdAR1                 w   r   r
  MdMEM                 r   r   r
  MdDR                          w   r
  SD                                w   r
  MDR                                   w   r

Figure 8-22: Memory Pipeline Task With Added Resources

In this section we consider restrictions based on a static (compile time) classification of memory blocks. In the next section a more permissive classification of memory blocks, one that is dynamic (it can change at run time), will be considered.

In part (b) of figure 8-23 a cache has been added on the processor side of the network, one for each processor. Each such private cache has access to every memory module. The problem of cache coherence arises in this configuration. In parallel operation, communication between processors is accomplished by having one processor write and then one or more others read the same locations. Use of semaphores or other means in the program must guarantee that information is sent before it is retrieved; but since information can generally be stored in either an MM module or an associated cache, exactly where it is must be determined for communication to succeed. So suppose processor A writes to its cache, which, for efficiency, is not generally written back to MM immediately. An attempt to get the information, say by processor B with a different cache, must be aware of its presence in processor A's cache and find a means to retrieve it. In general, coherence requires that a LOAD X by any processor P follow the latest STORE X of any processor whatsoever. While cache coherence must be maintained, it is desirable to give processors access to all words in all memories.

8.5.0.1. Static Solutions

With the private caches of the figure, one can assure coherence by making any change in a private cache visible to all other processors; one can broadcast each write to every cache. This and other dynamic approaches will be considered in the next section. A simpler, though less efficient, approach is to restrict the nature (state) of the blocks read into a cache. The state of a block can be RO (read-only), RW (read-write) but available to only one processor (Pexcl), or RW and available to more than one processor. These are permanent classifications determined at compile time. For figure 8-23:

Access Restrictions (b) in the figure

1. All RO words, and all RW (Pexcl) words available to Pi only, are allowed in processor Pi's cache.


These may be from any MM.

2. All other RW blocks to/from any processor must pass through the net and memory (NET + MEM).

This is shown at the bottom of (b) in the figure. In each entry, the nature of the word accessed (RO, RW Pexcl, or other RW) is given to the left of the "/"; to its right are the feasible accesses (the set of devices through which such a word can be acquired), fastest first.

(Consider even a modest violation of these restrictions, e.g., allowing Pj's cache to hold RW blocks from MMi. Suppose an RW block in MMi is written by Pj, j ≠ i, directly through the network (no violation), and suppose that RW word was also in Pi's cache. Then to maintain coherence, whenever such a write occurs the word written must also be updated in cache i if it is there. This implies dynamic information exchange between processors, or central maintenance of information about all processors.)

Let the access time to memory when the information is in the cache be tcache; when it is not in the cache the time is tNet + tMM. If hpriv is the fraction of accesses that hit in the private cache, and accesses to cacheable (RO and Pexcl) blocks are the fraction f1 of all accesses, then

Tprivate−coherent = f1(hpriv tcache + (1 − hpriv)(tNet + tMM)) + (1 − f1)(tNet + tMM)

Note that RO blocks need no write-back; Pexcl blocks can be written to, so they may be dirty and require write-back.

Access Restrictions (c) in the figure

1. All RO and RW blocks in MMi are kept in MMi's cache.

2. All blocks to/from any processor must pass through the net (NET + CACHE on a hit, NET + MEM on a miss).

In this arrangement each cache has access to only one MM, but it can be accessed, or shared, by all processors using that memory module. Such a cache is called a public cache. Write-back is required when a cached block is written, but there is no coherence problem with this arrangement, so again no information about the state of other caches is necessary. On the other hand, all accesses to the cache or MM go through the network.

In addition to the delay in memory access caused by the need to go through the network and/or access main memory, there is the possibility of delay due to blockage in the network. Networks may be limited in the number of simultaneous connections they can accommodate, so if a number of different processors need access to memory or a public cache, some may have to wait. A variety of networks, from those that can accommodate any number of connections, no two of which involve the same processor or the same MM, to those that can accommodate only one, are considered in the final chapter.

Actually, in most systems there are at least three levels of memory; that is, above the MM's in figure 8-23 there is another level of memories in the disk class. A network might also be imposed between this third level and the second.

In another memory arrangement, instead of assigning each public/private cache to one MM/P, one cache may be used for a group of MM's/P's.

The average access time in this arrangement is given by:


Tpublic−coherent = hpub (tNet + tcache) + (1 − hpub) (tNet + tMM)

[Figure 8-23 shows: (a) processors P1 . . . Pn connected to memory modules MM1 . . . MMn through a connection network; (b) the same with a private cache per processor; (c) with a public cache per memory module; (d) with both private and public caches. Under each caching arrangement the feasible access paths are listed, fastest first: in (b), RO and RW Pexcl via CACHE, NET + MEM, other RW via NET + MEM; in (c), RO and RW via NET + CACHE, NET + MEM; in (d), RO and RW Pexcl via CACHEp, NET + CACHEm, NET + MEM, other RW via NET + CACHEm, NET + MEM.]

Figure 8-23: Multi-Processor Multi-Memory Module Systems

Access Restrictions (d) in the figure


In part (d) of figure 8-23 a combined configuration using two caches per processor is shown. It uses the best of the private and public cache accesses.

1. All RO and Pexcl blocks are accessed through a private cache on a hit; on a miss, through either NET + CACHE or NET + MEM.

2. All blocks are accessed through NET + CACHE on a hit in the public cache, NET + MEM otherwise.

In this arrangement write-back is required from the public caches, and also from the private caches. If the separate category of Pexcl blocks were dropped, write-back from the private caches could be eliminated.

The average execution time of this configuration is better than that of either of the two previous configurations, since it uses the fastest access of the private arrangement (b) for RO words and the fastest of the public arrangement (c) for RW words. Beyond this, it can access RO words on a miss in two ways, one of which is faster and unavailable in (b). Furthermore it allows cached access to some RW blocks where (b) allows none. Write-back is required in this model, but it may be done immediately or deferred until the block is replaced.

T_combined-coherent = f_1 (h_priv1 t_cachep + h_priv2 (t_Net + t_cachem) + (1 − h_priv1 − h_priv2) (t_Net + t_MM)) + (1 − f_1) (h_pub (t_Net + t_cachem) + (1 − h_pub) (t_Net + t_MM))
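The corresponding evaluation for the combined arrangement, again as a sketch under assumed timings: f_1 is the fraction of accesses to RO/Pexcl blocks that take the private-cache path, and h_priv1, h_priv2 and h_pub are the hit ratios named in the formula.

    # Average access time for the combined private/public arrangement (d).
    def t_combined_coherent(f1, h_priv1, h_priv2, h_pub,
                            t_net, t_cachep, t_cachem, t_mm):
        # RO/Pexcl accesses: private cache, then public cache, then MM.
        private_path = (h_priv1 * t_cachep
                        + h_priv2 * (t_net + t_cachem)
                        + (1 - h_priv1 - h_priv2) * (t_net + t_mm))
        # Remaining accesses: public cache, then MM.
        public_path = (h_pub * (t_net + t_cachem)
                       + (1 - h_pub) * (t_net + t_mm))
        return f1 * private_path + (1 - f1) * public_path

    print(t_combined_coherent(0.6, 0.8, 0.1, 0.9,
                              t_net=20, t_cachep=5, t_cachem=5, t_mm=100))  # 24.9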

8.5.1. Dynamic Parallel Memory Access

Consider now a single memory, MM, to which many processors, each with its own cache, have access.

The memory may consist of one or more modules. There must be a separate data path from the memory to each processor's cache. Figure 8-24 shows a block diagram of a likely arrangement. A processor's request for block movement to or from MM is directed to a memory controller. The memory controller then establishes the connection between the cache and the memory module and initiates the data transfer. It is also possible for the controller to broadcast information to all processor caches, or to all memory modules.

The figure also shows a bus (which may or may not be present) on which all processors may communicate. This can be used to communicate information about changes in a cache which require processing by other caches. This bus can be used in the snooping mode to handle all communication between caches. That communication can also be achieved by a directory in MM which can be written and/or read by all processors.

8.5.1.1. Keeping Coherent

The key problem is to develop a complete description of the actions necessary to maintain coherence with as much autonomy given to the individual processors and their caches as possible. To do this we enumerate the relevant states that the memory block may be in relative to the contents of a given block in a processor cache.

A block may be in zero or more caches, it can be dirty or clean, and in general there are only a finite number of significant states that a cache block can occupy. For example, consider a block B which is to be read or written by processor pA through cache cA when B is either not present at all, or present but inconsistent with the most recently written version of B. We say B is in the invld local state in cA.


[Figure 8-24: Multiple Processor Memory Access. Processors P1..Pn, each with a CPU, an MMU, and a private cache (tag, state, and block fields per entry), are joined by a bus carrying addresses and snoop/send traffic; a memory control unit connects the caches through a connector to the storage modules.]

The read or write is a miss in this case and a valid copy of B must be found. As another example: if cA is the one cache that contains a valid copy of block B which has been written since its arrival there, then its local state in cA is excl−d (exclusive and dirty). In addition to the local state of a block B in a particular cache there is also a global state, which indicates whether a block is valid in any cache, shared by more than one cache, or exclusively held in one of the caches. This alerts cache cA, when about to read block B, if another cache holds a valid copy of B. Block B in the excl−d local state in cA may change to the shared local state in cA if the same block, B, is read into another cache. There are other block states which for a given block never change. Block B may be RO (read-only), or it may be allowed in one processor cache only, Pexcl. The inclusion of this information may require additional cost, but it is conceptually straightforward and may give significant time savings where much read and/or write sharing is expected.

Generally then, to know exactly what actions should be taken to preserve coherence, it is necessary to know the local state of B in the cache, say cA, in which a read, write, replace, or write-back is contemplated, as well as its global state, which summarizes the condition of all cached copies of B (except for RO and Pexcl, where it merely identifies the block type). There are seven local states in cA and seven global states that are of significance. Though their meaning differs when local or global, they are given the same name in each. The possibilities that are considered are:

1.a. Local state in cA: Invalid (either completely absent, or present but incorrect) in cA, invld

b. Global state: Invalid in all caches, invld

2.a. Local state in cA: A valid copy held exclusively in cA, but not written in cA since moved there, exclusive and clean, excl-c

b. Global state: Valid exclusively in exactly one of the caches, say in cX, but not written in cX since moved there, exclusive and clean, excl-c

3.a. Local state in cA: A valid copy held exclusively and written in cA but not updated in MM, exclusive and dirty, excl-d

b. Global state: Valid exclusively and written in exactly one cache, but not updated in MM, exclusive and dirty, excl-d

4.a. Local state in cA: Valid in cA and also in one or more other caches, shared.

b. Global state: Valid in two or more caches, shared. In this case shared is viewed as a number. If it is 1, that means there are a total of 2 caches sharing a valid copy of the block. We use the notation shared +/− 1 to indicate that the number of shared copies of a block (its global state) is to be increased/decreased by 1. Instead of keeping this number, the global shared state can be simplified to a boolean: there exist shared local states or not. If this is done, however, when a shared block is replaced it is not known whether there is only one valid copy of the block left, which could then be made excl−c.

Though we consider only one type of shared state, one could distinguish two types:

a. shared−dirty: the other cache copies have all been written into, but still are shared. This state will not be allowed here; that is, any time a cache needs to share a dirty block, that block must first be written back to MM. (Sharing such a dirty block could be accomplished by sending blocks between caches without putting a copy back in MM; if instead it is sent from the first cache to MM and then to the second cache, it is not much less efficient.)

b. shared−clean: this is what is meant by shared here.

In addition there are the static states. There are only two such global states, RO and Pexcl. These are associated with the block in MM, can be determined at compile time, and do not change. Corresponding to these there are three local states: RO, and, corresponding to the global Pexcl, the two local states Pexcl−c and Pexcl−d (these can change from one to the other).

1.a. Local state: RO: Block B in cache A is read only

b. Global state: RO. This is determined when the block B is read or written because it is associated with the block in MM. (The notation invld//!RO in the state diagram indicates that the global state in this case merely identifies the nature of the block being read, written, or replaced. In other cases the global state indicates the state of the caches other than cA.) There may be a number of copies of an RO block in different caches, but unlike a shared block there is no reason to keep track of that number.

2.a. Local state: Pexcl−c (clean)

b. Global state: Pexcl. This is determined when the block B is read or written because it is associated with the block in MM. (The notation invld//!Pexcl in the state diagram indicates that the global state in this case merely identifies the nature of the block being read, written, or replaced. In non-static cases the global state indicates the state of B in all caches.)

3.a. Local state: Pexcl−d (dirty)

b. Global state: Pexcl

This set of seven states is closed:

If there is an operation, read, write, or replace, of block B in cache cA, and if the local state of B in cA and the global state of B are given from the above lists, then the associated actions necessary to maintain coherence, and the next local and global states, chosen from those above, are determined.

The goal is to ensure that the correct value of B is in cA when the operation of read, write, or replace is performed on B in cA, even though the action needed to do this may make other copies of B invalid. For example, if B is shared and the valid copy in cA is written, all copies of B elsewhere become locally invld; it is not necessary to immediately update B elsewhere. (One might update the other copies of B in the background, but in the mode of operation considered here this would not be done until one of the invalidated copies were read or written.)

What must be done to keep the information correct for each possible state-operation pair is detailed in figure 8-25 (a). This state table gives, for a block B, the next state (local//global) as a function of the current state (local//global) and the operation affecting B. This operation may be initiated locally, e.g., read, or remotely, e.g., other−read. Given such an operation there are a number of other possible consequences beyond a state change. Since reducing the number of states will be an issue, these will be important when state merging is considered. An operation may be impossible for a given state: if B is RO (read−only) then it will never be written, so we don't care what action is taken (dc). Also, a next state may result from an action in a given state where the processor in the current state does not itself have to perform that action. This is indicated by dc1. For example, if the current state with relation to block B of a processor pX is invld//shared and another processor pY reads block B, then the number of sharing processors will either increase by one or remain the same, depending on whether B is invld//shared or shared//shared in pY. pX need take no action on this read; pY will take care of it.

If the local state of B is invld and a read or write is executed, the MM copy of B may not be valid. The valid copy may be a dirty copy of B in another cache. This happens when the global state is excl−d and is indicated in the state table with **. In this case, in addition to getting the valid copy from the cache in which it is excl−d, the state of that cache must be changed as shown in the table. From the point of view of the cache holding B in state excl−d//excl−d, this change of state must occur when some other cache reads or writes B. This also is indicated by ** in the table. Other cases in which a local operation may cause a change of state at another cache are indicated by * and ^. For example, if the state is invld//excl−c at cA and the operation is a read, the state changes to shared//[shared=1] (this entry is *ed). At the same time the remote cache whose state was excl−c//excl−c must also be changed to shared//[shared=1] (this entry is tagged with *).
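To make the shape of such a table concrete, the following Python sketch encodes a few representative entries; the state names and actions follow the discussion above, but the entries shown are only a sample under assumed encodings, not the complete table of figure 8-25.

    # (local, global, operation) -> (next local, next global, coherence action).
    # "wb" = write-back; "fetch-cache" = get the valid copy from the dirty cache.
    TABLE = {
        ("invld",  "invld",  "read"):       ("excl-c", "excl-c", "fetch from MM"),
        ("invld",  "excl-d", "read"):       ("shared", "shared", "fetch-cache"),   # a ** case
        ("excl-c", "excl-c", "write"):      ("excl-d", "excl-d", None),
        ("excl-d", "excl-d", "other-read"): ("shared", "shared", "wb"),            # a ** case
        ("shared", "shared", "write"):      ("excl-d", "excl-d", "invalidate others"),
        ("RO",     "RO",     "read"):       ("RO",     "RO",     None),
    }

    def step(local, glob, op):
        # Next states and action for one operation on block B in cache cA.
        return TABLE[(local, glob, op)]

    print(step("invld", "excl-d", "read"))  # ('shared', 'shared', 'fetch-cache')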

Also note that in figure 8-25 (a), when an entry gives only a local state, it is shorthand for the given local state paired with the same global state; for example, an entry excl−d stands for excl−d//excl−d.


[Figure 8-25: Maintain Coherence: State Tables, Original And Merge 1. Part (a): the full state table for a block B, giving the next local//global state and output action (write-back?, dc, dc1, and the *, **, ^ annotations) for each current state and each operation (read, write, repl, other-read, other-write, other-repl), divided into hit and miss regions. Part (b): the table after the mergings Pexcl-c//Pexcl ~= excl-c//excl-c, Pexcl-d//Pexcl ~= excl-d//excl-d, and invld//Pexcl ~= invld//invld. Listed beside (b) are the further mergings that give figure 8-26: (excl-c//excl-c ~ RO//RO ~ shared//shared) = sharexcl//sharexcl and (invld//shared ~ invld//RO ~ invld//excl-c) = invld//sharexcl.]

Some state merging is possible. The mergers listed in figure 8-25 next to the table in (a) result in the table in (b). Further mergers are possible if certain simplifications are allowed; these are shown on the left of (b). They result in the state table in figure 8-26, which has four local states. This further merging is possible if one is willing to give up the advantage of keeping track of the number of shared states, and the simplicity of managing the RO block when it is distinguished from other shared and excl−c blocks.

Two states X and Y are not mergeable iff, for some operation:

1. they require incompatible output actions, or

2. they go to next states Z and W, where Z and W are not mergeable (note that the dc and dc1 outputs are compatible with any other output state and/or action).

So state excl−c//excl−c is compatible (~) with Pexcl−c//Pexcl, and given this merge, excl−d//excl−d is compatible with Pexcl−d//Pexcl. From this it follows that invld//invld ~ invld//Pexcl. Applying these merges to table (a) of figure 8-25 gives table (b). Further merging can be applied to (b) if the count associated with the global shared state is dropped, so that notation like [shared+/−1] and [shared=1] becomes simply shared and [0−>excl−c] is removed. Then excl−c//excl−c ~ shared//shared, provided that when there is a write in the resultant sharexcl state there may be others in that state (which is true in the shared//shared state but not in the excl−c//excl−c state). Also, the shared* output action that occurs with the other−read operation applied to the excl−c//excl−c state drops the * with this merger.

8.5.1.2. Communication

From the final reduced state table the cases in which some sort of communication is necessary can be enumerated (a sketch of the resulting invalidation traffic follows the list).

1. δ(sharexcl//sharexcl, write) = excl−d//excl−d, and δ(sharexcl//sharexcl, other−write) = invld//excl−d. Others in the sharexcl//sharexcl state need to become invld//excl−d.

2. δ(invld//sharexcl, write) = excl−d//excl−d, and δ(sharexcl//sharexcl, other−write) = invld//excl−d. Others in the sharexcl//sharexcl state need to become invld//excl−d.

3. δ(invld//excl−d, read) = sharexcl//sharexcl, and δ(excl−d//excl−d, other−read) = sharexcl//sharexcl. The other cache in the excl−d//excl−d state must change to sharexcl//sharexcl.

4. δ(invld//excl−d, write) = excl−d//excl−d, and δ(excl−d//excl−d, other−write) = invld//excl−d. The other cache in the excl−d//excl−d state must change to invld//excl−d.
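A minimal sketch of the invalidation traffic in cases 1 and 2, assuming caches are represented as dictionaries from blocks to local//global states and that the other−write is seen by every other cache; the representation is an assumption for illustration.

    # A write to a sharexcl block: the writer goes to excl-d//excl-d and the
    # broadcast other-write drops every other sharexcl holder to invld//excl-d.
    def write_sharexcl(writer, caches, block):
        writer[block] = "excl-d//excl-d"
        for c in caches:
            if c is not writer and c.get(block, "").startswith("sharexcl"):
                c[block] = "invld//excl-d"

    c1 = {"B": "sharexcl//sharexcl"}
    c2 = {"B": "sharexcl//sharexcl"}
    write_sharexcl(c1, [c1, c2], "B")
    print(c1["B"], c2["B"])  # excl-d//excl-d invld//excl-d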

8.5.1.3. Residence Of The Global State: Memory Controller Directory

For each block the global state can be kept in a directory located at the memory controller. Even if we maintain much of the complexity of figure 8-25 (a), the communication necessary to maintain coherence can feasibly be directed from the directory.

Except for the count of the number of sharers, the global state of block B can be recorded with a few bits in association with that block. Counts of the number of sharers of a block can be stored separately, in association with blocks that are shared. In general, then, when an operation on B in cA is performed and information is needed about the global state, the intended operation is sent to the directory, where the operation together with the global state determines the valid location of B.


[Figure 8-26: Maintain Coherence: State Table And Diagram, Merge2. The reduced state table over the local//global states sharexcl//sharexcl, excl−d//excl−d, invld//sharexcl, and invld//excl−d, with next states and write-back requirements for read, write, repl, other-read, other-write, and other-repl, plus the corresponding state diagram. Beside it are repeated the mergings in (b) that give this table: (excl-c//excl-c ~ RO//RO ~ shared//shared) = sharexcl//sharexcl and (invld//shared ~ invld//RO ~ invld//excl-c) = invld//sharexcl.]

If the valid location is MM, the block can be sent back to cA; if it is already in another cache, the controller can broadcast to all caches that the one in which B is excl−d is to return its block to MM, and then that block is sent back to cA. The intended operations sent from cA also determine the next global state, which is recorded at the directory. Then the next local state at cA is determined at the directory, or locally from the other information returned from the directory. Finally, caches other than cA may have to change state. The command to do this can be broadcast from the directory.

8.5.1.4. Residence Of The Global State: Distributed, With Snooping

Start by considering figure 8-25 (a). Information about the global state of B is implicit in the local states of the other caches. So it may be possible to get this information directly from the other caches without keeping it in a directory. However, determining cumulative information, like the number of shared valid copies of B, would require relatively complicated communication amongst the caches. Loss of such information will decrease efficiency somewhat. If, for example, the count of shared blocks is eliminated (and only shared or not shared remains), then once a block is shared it remains shared until one of the sharers writes to that block or until all copies are replaced, but not if all but one copy is replaced. If it is acceptable to lose this detection of the transfer from shared to exclusive status, then fairly simple communication on a bus is sufficient to handle all the other transformations in the charts. Note that the count is not important for a block that is RO. Now the feasibility of communication with snooping is considered for figure 8-25 (b), with the global state shared being binary, no longer containing a count.

If a request for their local states is addressed to the bus, and all caches catch that request, then:

1. if there is no response, the current global state is all-invld, RO, or Pexcl; in all these cases the valid version of B is in MM, where the determination of which of the three possibilities holds will be revealed.

2. if at least one cache responds locally shared (note that then there should only be sharer responses), the global state is shared.

3. if there is one locally excl−c response (note that then there can only be one), the global state is excl−c.

4. if there is one locally excl−d response (note that then there can only be one), the global state is excl−d.

So given the local state of B in cache cA, its global state can be determined. This information, together with the operation being performed and the state table, gives the next state and the output action necessary. So if B is invld in cache cA, it needs to know whether the global state is invld, RO, or something else to determine its next local state, which would respectively be excl−c, RO, or shared.
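The determination just described can be sketched as a small function over the snoop responses; the string encoding of the responses is assumed for illustration.

    # Global state of block B from the local-state responses of the other caches.
    def global_state(responses):
        if not responses:
            return "valid copy in MM (global invld, RO, or Pexcl)"
        if "excl-d" in responses:   # at most one cache can hold B excl-d
            return "excl-d"
        if "excl-c" in responses:   # likewise at most one excl-c response
            return "excl-c"
        return "shared"             # only sharer responses remain

    print(global_state([]))                    # valid copy in MM ...
    print(global_state(["shared", "shared"]))  # shared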

In addition to determining global states it is necessary, as a result of some operations, to inform others of changes which should be made in their local states. So if there is a write operation for a cache in the invld//excl−c state, the cache whose local state is excl−c must become invld locally. In general, wherever there is a *, **, or ^ in the state table, information must be transmitted to change local states.

There is also the problem of synchronization of the information about the different blocks. If each transaction on the bus is completed before others are attempted, this problem is solved.

8.5.1.5. Notes In Summary

Generally there is a group of processors, each with its private cache. Each block in MM is in one of k states in each of the n caches, so there are k^n possible states of the collection. Thinking of the operations executed by the processors as a sequence of inputs, the inputs determine the next collective state of this giant state machine. The fact is, however, that the bulk of the k^n possible states are impossible; for example, if one cache is in the excl−d state all the others must be invld. As has been described, all the possible states are included in the set of pairs <x,y>, where x is a cache state of a block (seven, including read-only and Pexcl) and y is the state of the others (also seven). Even amongst these there are a large number that are impossible, e.g., cA state: excl−d, global state: shared. Despite these impossibilities it is convenient to divide the collective states into two sets whose product includes all the possible states, because it is useful to have one of the state sets kept locally at the cache, which allows many operations to be performed without need of interrogating the global states. (The division into local and global states results in there being many impossible local-global pairs, e.g., local: excl−d, global: all invalid.) Another interesting generality is that the global states can be effectively determined through communication on a bus if the shared count is restricted to shared or not shared. This is possible because the communication required is from at most one cache (exclusive), or from more than one only when all the communications are the same (shared), or from none at all.

8.6. Large Scale Memories

8.6.0.1. Disks

A disk pack consists of a stack of magnetically coated platters connected permanently at their centers to a spindle which is driven by the disk drive. The disk drive keeps the stack of platters rotating at a constant rotational speed (rpm). There is a read-write head for each platter surface; all are similarly positioned and always move in synchronism. Each bit stored on a platter is located by two coordinates, its radial position and its angle around the circumference. To read that bit the read-write head must be moved radially to that position, and then wait for the continuing rotation to place the head at the correct angle. The time for the radial movement is the seek time; the rotation time is the rotational latency. In general the seek time is long (10-100 milliseconds) compared to the latency time (on the order of 10 milliseconds), which in turn is large compared to the transfer time to move a file between the disk and a buffer (on the order of 10^3 microseconds).

The radius is partitioned into a number of positions. The entire circumference at a radial position is a track. A track is partitioned into segments. Each segment is addressable. The unit of storage on a disk is a file, which is distributed over a set of sectors ordered by two-way pointers. A sector of the disk is reserved for a disk map, which associates each file with its sector start address.

After reading part of a file on demand, the remainder of the file may also be read into a buffer in anticipation of its imminent use. This buffer, together with the identity of the information contained therein, is the disk cache. Every read to a disk is first directed to its cache and retrieved from there if available.
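A read-through disk cache of this kind can be sketched as below; the dictionary keyed by sector address and the whole-file prefetch are assumptions for illustration.

    # Every read goes to the disk cache first; on a miss the demanded sector
    # is read and the rest of the file is prefetched in anticipation of use.
    disk_cache = {}  # sector address -> data

    def read_sector(addr, file_sectors, read_from_disk):
        if addr not in disk_cache:            # miss: go to the disk
            for s in file_sectors:            # prefetch the rest of the file
                disk_cache[s] = read_from_disk(s)
        return disk_cache[addr]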

The memory hierarchy, from fastest to slowest, is processor registers, cache, main or primary memory, disk cache, secondary memory.

8.7. Parallel Operation of Disk Packs

Several disk packs with overlapped seeks can be used to speed up access. Two simple ways this can be done are with shadowing and striping.

Shadowing uses two disks, say A and B, each of which contains the same information. The disks are operated independently (not in synchronism) and in parallel.

1. Doubles the cost.

2. Each write must be made to both A and B.

3. Read is from the first available disk pack of the pair. This provides a speedup, mostly by allowing overlapped seeks.

4. It also provides automatic backup. Failure of one drive is not catastrophic.

Page 55: 8. Memory Organizationpaull/chapt8.pdf · 8. Memory Organization Memory is basic to the operation of a computer. Access to large memories, such as disks and even RAM_ memories are

Spring 02 Memory Organization copywrite

299

Striping spreads the entire file over n disks, D1 to Dn. Assuming the file is partitioned into pages/words/bytes/bits, the k-th page/word/byte/bit is stored on the (k mod n)-th disk. This distribution is called round robin by pages/words/bytes/bits. (The multiple memory modules discussed earlier use round robin by words.) This has interesting advantages and is used in the RAID systems to be discussed.
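The round-robin placement amounts to one line of arithmetic; a sketch, with 0-based numbering assumed:

    def disk_of(k, n):
        # Disk (0-based) holding the k-th page/word/byte/bit of a file striped over n disks.
        return k % n

    print([disk_of(k, 4) for k in range(10)])  # [0, 1, 2, 3, 0, 1, 2, 3, 0, 1]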

For striping which is round robin by pages:

1. Each write is to one disk pack only and its speed is not affected.

2. A read of a record of n or more pages will require that all n disks be searched. There will be savings in time because of the overlap of the seek and rotational times of the different packs. The fact that all n can be read in parallel is not so important, because very little time is needed for data transfer compared to the search time.

3. Because of the "by page" distribution, disk packs can be added to this system with relative ease as compared to "by byte" and "by bit".

RAID is a generic name for "by bit, byte or word" schemes which may use shadowing and striping together with parity checking, disk pack failure recovery, and backup. The goal is reliable speedup with many inexpensive disk packs.

Striping with a parity check disk pack:

The data is arranged round robin by units of pages/words/bytes/bits over n − 1 disk packs. A disk pack containing a parity unit over each sequence of n − 1 units in one slice of the round robin is added, so there are n disks in all. Now if any disk drive fails, the information on the failed drive can be recovered, so operation can continue with one less disk, and/or a new disk may be added which holds the contents recovered from each slice of the remaining healthy disks. The data disks are D_i, i = 1 through n − 1, and the parity disk is P = D_n, chosen so that:

D_1 ⊕ D_2 ⊕ ⋅⋅⋅ ⊕ D_n = 0

Let Sum(D~j) = D_1 ⊕ D_2 ⊕ ⋅⋅⋅ ⊕ D_{j−1} ⊕ D_{j+1} ⊕ ⋅⋅⋅ ⊕ D_n

So Sum(D~j) ⊕ D_j = 0

So Sum(D~j) ⊕ D_j ⊕ D_j = 0 ⊕ D_j

Sum(D~j) = D_j

So the contents of D_j can be computed from the contents of all the other disks D_k. If the D_j disk drive fails, its contents can be recovered. This provides error correction when an entire disk pack fails: a simple parity gives error correction when the failed position is known.
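The recovery can be exercised directly: XORing the surviving disks' units in a slice reproduces the failed disk's unit. A sketch, using one integer per disk to stand for a slice:

    from functools import reduce
    from operator import xor

    data = [0b1011, 0b0110, 0b1100]         # data disks D1..D3, one slice each
    parity = reduce(xor, data)              # parity disk P: XOR of all disks is 0

    surviving = [data[0], data[2], parity]  # suppose disk D2 fails
    recovered = reduce(xor, surviving)      # Sum(D~j) = Dj
    assert recovered == data[1]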

In this striping system the work required on a write varies depending on the storage unit size. Assume a word is available to be written. If the unit is small, a bit or a byte, then all units to be written to a slice are normally available at once, so that the parity can be computed immediately and written on the parity disk pack P. If on the other hand the units are large, e.g., a word, then the available information, say a word, is not enough to compute the parity. So a read of both the current value on the disk D to which the unit is to be written and the current corresponding parity on P is necessary to compute the new parity.

In detail:

If D_j^old is replaced with D_j^new then the new parity P^new (= D_n^new) must be computed. To start with we have:

Sum(D~j,~n) = D_1 ⊕ D_2 ⊕ ⋅⋅⋅ ⊕ D_{j−1} ⊕ D_{j+1} ⊕ ⋅⋅⋅ ⊕ D_{n−1}

Sum(D~j,~n) ⊕ D_j^old ⊕ D_n^old = Sum(D~j,~n) ⊕ D_j^new ⊕ D_n^new = 0

so

D_j^new ⊕ D_j^old ⊕ D_n^old = D_n^new

or

D_j^new ⊕ D_j^old ⊕ P^old = P^new

which is why D_j^old and P^old must be read to compute P^new. The fact that P must be read with every write to a D_j means that it will be more active than the other disk packs. This can be alleviated by the next version of RAID.
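The read-modify-write that this derivation implies can be sketched as follows; the reads of D_j^old and P^old are the two extra accesses just described.

    def new_parity(d_old, d_new, p_old):
        # P_new = D_new xor D_old xor P_old, from the derivation above.
        return d_new ^ d_old ^ p_old

    d_old, p_old = 0b0110, 0b0001           # read D_j_old and P_old from disk
    d_new = 0b1010                          # the unit being written
    p_new = new_parity(d_old, d_new, p_old)
    # The slice invariant (XOR of all data disks and the parity == 0) holds:
    assert 0b1011 ^ d_new ^ 0b1100 ^ p_new == 0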

Striping with a parity check distributed over disk packs: striping with parity, as above, is used here, but now the parity units, instead of all being on the same disk pack, are uniformly distributed over all the packs. So reading parities will involve the disk packs with equal frequency.


8.8. Problems

1. Consider two caching methods:

a. C1: Direct (enough room for 1 complete group of blocks in the cache)

b. C2: Set Associative with enough room for 2 groups in the cache.

Consider a program P1 that is to be laid out in a sequence of addresses (all orders are possible). P1 is straight-line code, i.e., there are no loops or branches. P1 just fills memory, i.e., the number of instructions in P1 = the number of addresses in M2.

a. DESCRIBE a layout(s) for P1 which gives the LOWEST possible hit ratio when C1 is used, and when C2 is used. Give the hit ratios.

b. DESCRIBE a layout(s) for P1 which gives the HIGHEST possible hit ratio when C1 is used, and when C2 is used. Give the hit ratios.

c. It seems that the highest possible hit ratio that can be obtained with P1 is the same for both C1 and C2. IS that so? WHY?

d. Let P2 be a program which passes through each location in M2 twice. DESCRIBE a sequence of locations for P2 with a better hit ratio when using C2 than when using C1.

2. Consider a machine with a cache. Its behavior on a set of programs, P, is given below.

a. The cache has room for B blocks.

b. There are b instructions in a block.

c. No program in P is bigger than B blocks.

d. The number of instructions in any program in P is a multiple of b.

e. Every time a program in P is run, all its instructions are executed at least once.

f. The average number of times that each instruction is run during the execution of any program in P is T.

g. Each program is run to completion before another one is run (no multiprogramming).

GIVE THE HIT RATIO in running a series of programs in P in terms of b and T.

3. Consider memory system S. It has two caches, C1 and C2, and main memory, MM. MM has a higher access time than C2, and C2 has a higher access time than C1. A memory access for a block b is always finally from C1. If b is not initially in C1 it will be brought there in the following way.

a. If b is not in C1 then S looks for b in C2,

b. if b is in C2 then the block is read into C1. (This may require replacing a block, B, in C1, in which case B is returned to C2, replacing b in C2.)

c. If b is not in C1 nor in C2, S finally goes to MM (assume every block is in MM). In this case it is read directly into C1 and accessed from there. (This may require replacing a block in C1, say B1; if so, B1 is placed in C2. If C2 is full, B1 will replace a block in C2, say B2. B2 should be the LRU block in C2, the block whose closest previous arrival in C2 occurred longest ago. If B2 is dirty it will also have to be written back to MM.)

The replacement algorithm used for both C1 and C2 is LRU. Given the following information:

    Location of Block Addressed        | Prob | Access Time
    -----------------------------------|------|------------
    in C1 or C2                        | H12  |
    in C1, given in C1 or C2           | H1   | t1
    ~in C1, in C2, given in C1 or C2   |      | t2
    ~in C1, ~in C2, in MM              |      | T

a. FILL IN THE remaining probabilities in the table

b. GIVE an expression for the average time to access a block.

tave = __________________________

c. The blocks found in the union of C1 and C2 are the same as if there were one cache of the size of C1 and C2 using the same LRU algorithm, TRUE or FALSE?

4. In choosing a replacement strategy the cost of its implementation (often in hardware) is significant. The LRU strategy's implementation is costly. Often a RANDOM or almost RANDOM strategy is preferred because it can be done relatively inexpensively. Here is a possible almost RANDOM implementation for a set associative architecture.

As blocks enter the cache, assign them pseudo-random id numbers different from all other block numbers remaining in the cache. (So the same block can have a different number each time it enters the cache, and different blocks can have the same number at different times in the cache.) When a replacement is required, choose the block with the lowest id number.

Assume program P is run with two different caches, one with 3 and the other with 4 blocks in their set associative caches for all memory blocks with the same block number. Assume that, whenever at a program step a new block must be entered in both caches, it is assigned the same id.

IS the larger cache guaranteed to require no more replacements than the smaller? PROVE your answer.

5. Consider the REPLACEMENT POLICY, RP:

When replacement is necessary, replace the block whose identification number is next to the highest in the cache.

For example, if a cache of size 3 contains 3 blocks with identifications 1, 2 and 3, and the next block has identity 5, RP will replace 2 with 5 in the cache. IS an algorithm which implements this RP a stack algorithm? Justify.

6. In the MU replacement algorithm the block which has the highest use-count is the dispensable one; in the case that two or more have the same use-count, the block with the higher address is dispensable. Each time a block enters the cache its use-count is set to 0. Every time a word in a block in the cache is accessed, its use-count is incremented. IS MU A STACK ALGORITHM? JUSTIFY.

7. Describe an algorithm for a FIFO replacement algorithm for an associative cache of size m, assuming the identities of the blocks in the cache are kept in an array S = S1,...,Sm.

8. Is MRU (most recently used) a stack algorithm? If so, describe a stack update algorithm for MRU.

9. For the block input sequence <2, 3, 2, 1, 5, 2, 4, 5, 3, 5, 2, 1> give the block hit ratio for OPT and for MRU, assuming the cache holds 3 blocks.

10. Assume a direct cache-main memory system. Consider a program, P, that occupies all 2^4 blocks in group 5 and the first k of the blocks in group 6. The program runs as one big loop and runs many times through that loop.

a. GIVE the block hit ratio in terms of k for P as it runs from its 2nd time to its last time through its loop.

The block size is 2^5 and the total size of main memory is 2^15.

b. HOW many groups in the main memory?

c. WHAT is the address in binary of the last block of P if k = 3?

d. WHAT is the hit ratio of P as run above?

11. Replacement Policies

a. Using any block replacement strategy which cannot see the future, there is always an infinite block input sequence for which the block hit ratio is 0 if the number of blocks is larger than the cache capacity. WHY?

b. On the other hand it is said that for every infinite block input sequence the block hit ratio is always greater than 0 using the OPT replacement strategy, provided the cache size is greater than 1/2 the main memory. True or False? PROVE your answer.

12. Find the smallest percentage fill of a set of M bins, each of size N, at which WF (worst fit) fails and FF (first fit) still succeeds.

a. Give the necessary sequence of chunk sizes.

b. Do the same for the case that FF fails and WF succeeds.

13. Prove that if, in allocating a new block, it is found to fit exactly in one of the available spaces, then that is the optimal position for it.

14. Consider the static solution to cache coherence with multiple memories and processors interconnected by a network, figure 8-23. Three kinds of memory words are distinguished: RO, RW-Pexcl, and RW (by two or more processors). It is claimed that the restrictions given there guarantee that each memory word has access to at most 1 cache, and so different versions of that word can never be in two different caches at any time. For each of the three cache arrangements, Private, Public, and Private-Public, indicate whether this is true; explain briefly.

15. Consider the dynamic solution to the cache coherence problem. If processor P wants access to a memory block which is not present (invld) in its cache, in order to write in it, it needs to know the global state of that block. For each of the following global states describe what must be done to get a valid version of that block into P's cache and write it. Global states: excl-c, shared, excl-d.

16. Consider an n × n (n ≥ 2) array of disks. A parity disk is added to each row and to each column, containing the parity of its respective row or column. Can this arrangement be used to correct bit errors due to the failure of any two of the identified disks in the complete array? If not, can one or a number of disks, independent of n, be added so this can be done? Can a larger number of identified disk failures be corrected?