Virtual Memory for the Sprite Operating System


  • 8/12/2019 Virtual Memory for the Sprite Operating System

    1/27

    Virtual Memory for the Sprite Operating System

    Michael N. Nelson

    Computer Science DivisionElectrical Engineering and Computer Sciences

    University of CaliforniaBerkeley, CA 94720

    Abstract

Sprite is an operating system being designed for a network of powerful personal workstations. A virtual memory system has been designed for Sprite that currently runs on the Sun architecture. This virtual memory system has several important features. First, it allows processes to share memory. Second, it allows all of the physical pages of memory to be in use at the same time; that is, no pool of free pages is required. Third, it performs remote paging. Finally, it speeds program startup by using free memory as a cache for recently-used programs.

This work was supported in part by the Defense Advanced Research Projects Agency under contract N00039-85-C-0269, in part by the National Science Foundation under grant ECS-8351961, and in part by General Motors Corporation.


    Virtual Memory for the Sprite Operating System January 5, 1993

    1. Introduction

This paper describes the virtual memory system for the Sprite operating system. Sprite's virtual memory model is similar in many respects to Unix 4.2BSD, but has been redesigned to eliminate unnecessary complexity and to support three changes in computer technology that have occurred recently or are about to occur: multiprocessing, networks, and large physical memories. The resulting implementation is both simpler than Unix and more powerful. In order to show the relationship between the Sprite and Unix virtual memory systems, this paper not only provides a description of Sprite virtual memory, but also, where appropriate, describes the Unix virtual memory system and compares it to Sprite.

Although Sprite is being developed on Sun workstations, its eventual target is SPUR, a new multiprocessor RISC workstation being developed at U.C. Berkeley [HILL 85]. In order to support the multiprocessor aspect of SPUR, the user-level view of virtual memory provided by Unix has been extended in Sprite to allow processes to share writable segments. Section 2 compares the Sprite and Unix virtual memory systems from the user's view, and Section 3 describes the internal data structures used for the implementation of writable segments.

Although Sprite provides more functionality than Unix with features such as shared writable segments, an equally important contribution of Sprite is its reduction of complexity. Unix requires that a portion of the page frames in memory be kept available to handle page faults. These page frames are kept on a list called the free list. The problem with the free list is that extra complexity is required to manage it, and page faults are required for processes that reference pages that are on the list. Sprite has eliminated the need for a free list, which has resulted in less complex algorithms and more efficient use of memory. The Sprite page replacement algorithms are described in Section 4.

Another major simplification in Sprite has been accomplished by taking advantage of high-bandwidth networks like the Ethernet. These networks have influenced the design of Sprite's virtual memory by allowing workstations to share high-performance file servers. As a result, most Sprite workstations will be diskless, and paging will be carried out over the network. This use of file servers has allowed Sprite to use much simpler methods than Unix for demand loading of code and managing backing store. Section 5 describes the characteristics of Sprite file servers and how they are used by the virtual memory system, including how ordinary files, accessed over the network, are used for backing store.

Another key aspect of modern engineering workstations that Sprite takes advantage of is the advent of large physical memories. Typical sizes today are 4-16 Mbytes; within a few years workstations will have hundreds of Mbytes of memory. Large physical memories offer the opportunity for speeding program startup by using free memory as a cache for recently-used programs; this mechanism is described in Section 6.

All future references to Unix in this paper will refer to Unix 4.2BSD unless otherwise specified.



The design given here has been implemented and tested on Sun workstations. Section 7 gives some statistics and performance measurements for the Sun implementation.

    2. Shared Memory

Processes that are working together in parallel to solve a problem need an interprocess communication (IPC) mechanism to allow them to synchronize and share data. Since Sprite will eventually be ported to a multiprocessor architecture, IPC is needed to allow users to exploit the parallelism of the multiprocessor. The traditional methods of IPC are shared writable memory and messages. Sprite uses shared writable memory for IPC because of considerations of efficiency and the SPUR architecture:

• Shared writable memory is at least as efficient as messages, since shared writable memory uses the lowest-level hardware primitives without any additional layers of protocol.

• The SPUR architecture is designed to make sharing memory both simple to implement in the operating system and efficient for processes to use.

The rest of this section describes the user-level view of shared memory in both Unix and Sprite. The user-level view of shared memory in Unix is given because the Sprite view is just an extension of the Unix view.

    2.1. Unix Sharing

In Unix the address space for a process consists of three segments: code, heap, and stack (see Figure 1). Processes can share code segments but they cannot share heap or stack segments. Thus writable shared memory cannot be used for IPC in Unix. There are two system calls in Unix that allow the code sharing. One call is fork, which is used for process creation. The other call is exec, which is used to overlay the calling process's program image with a new program image. When a process calls fork, a new process is


Figure 1. The Unix and Sprite Address Space. The address space for a process is made up of three segments: code, heap, and stack. The code segment begins at address 0, is of fixed size, is read-only, and contains the object code for the process. The heap segment begins on the next page boundary after the end of the code segment and grows towards higher addresses. It contains all statically-allocated variables and all dynamically-allocated memory from memory allocators. The stack segment begins at the highest address and grows towards lower addresses. Thus the stack and the heap segment grow towards each other.
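The caption's rule that the heap begins on the next page boundary after the code segment amounts to a round-up calculation. A minimal sketch follows; the 4096-byte page size is an assumption for illustration, not something the paper specifies here:

```python
PAGE_SIZE = 4096  # assumed page size; the paper does not fix one at this point

def next_page_boundary(addr):
    """Round addr up to a page boundary (addr is unchanged if already aligned)."""
    return (addr + PAGE_SIZE - 1) // PAGE_SIZE * PAGE_SIZE

def heap_start(code_size):
    """The heap begins on the next page boundary after the code segment,
    which itself starts at address 0 and runs for code_size bytes."""
    return next_page_boundary(code_size)
```

A 10,000-byte code segment, for example, leaves the heap starting at address 12288 under this assumed page size.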



created that shares the caller's (parent's) code segment and gets its own copy of the parent's heap and stack segments. When a process calls exec, the process will get new stack and heap segments and will share code with any existing process that is using the same code segment.

    2.2. Sprite Sharing

In Sprite processes can share both code and heap segments. Sharing the heap segment permits processes to share writable memory. Sprite has system calls equivalent to the Unix fork and exec that allow this sharing of code and heap. The Sprite exec is identical in functionality to the Unix exec. The Sprite fork is similar to the Unix fork except that it will also allow a process to share the heap segment of its parent. When a process is created by fork it is given an identical copy of its parent's stack segment, it shares its parent's code segment, and it can either share its parent's heap segment or it can get an identical copy. Whether or not a child shares its parent's heap is the option of the parent when it calls fork.
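These fork semantics can be modeled in a few lines. The following is a toy in-memory sketch, not Sprite's actual system call interface: the `share_heap` flag stands in for whatever option the real Sprite fork takes, and segments are plain objects rather than page-backed memory.

```python
import copy

class Segment:
    """Stands in for a segment's contents; real segments are page-backed."""
    def __init__(self, data):
        self.data = data

class Process:
    def __init__(self, code, heap, stack):
        self.code, self.heap, self.stack = code, heap, stack

    def fork(self, share_heap=False):
        """Sprite-style fork: code is always shared, the stack is always an
        identical copy, and the heap is shared or copied at the parent's option."""
        heap = self.heap if share_heap else Segment(copy.deepcopy(self.heap.data))
        return Process(self.code, heap, Segment(copy.deepcopy(self.stack.data)))
```

With `share_heap=True` a child's store into the heap is visible to its parent, which is exactly the writable sharing that the Unix fork cannot provide.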

    3. Segment Structure

Although to the user the address spaces of Unix and Sprite look identical, the two operating systems have different internal representations of segments. In this section the internal segment structure of each system is described. For a description of how the segment structure is used in each system to implement shared memory on the Sun-2 and VAX architectures, see Appendix A.

    3.1. Unix Segments

When a Unix process is created or overlaid, its address space is broken up into the three segments as previously described. Associated with each process is a page table for the code and heap segments, a page table for the stack segment, backing store for the heap and stack segments, and a pointer to a structure that contains the backing store for the code segment (see Figure 2). In addition, all processes that are sharing the same code segment are linked together. Thus page tables are kept on a per-process basis rather than


Figure 2. The state of two Unix processes that are sharing code. Both process A and process B have their own page tables for code, heap, and stack segments, and their own backing store for heap and stack segments. Each points to a common data structure that contains the backing store for the code segment. In addition the two processes are linked together.



a per-segment basis, and backing store is kept on a per-segment basis. If a code segment is being shared, then the code portion of the code and heap page table for all processes that are sharing the segment is kept consistent. This means that when any process that is sharing code has the code portion of its code and heap page table updated, the page tables in all processes that are sharing the segment are updated.

3.2. Sprite Segments

Sprite has a different notion of segments than Unix. In Sprite, information about segments is kept in a separate data structure, independent of process control blocks. Each process contains pointers to the records describing its three segments. Each segment has its own page table and backing store. Thus page tables and backing store are both allocated on a per-segment basis. Since page tables are associated with each segment instead of each process, processes that share code or heap segments automatically share the same page tables. Figure 3 gives an example of two Sprite processes sharing code and heap.
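The structural difference from Unix can be made concrete with a sketch of the segment records. The field names here are invented for illustration; the point is only that the page table hangs off the segment, so two processes pointing at the same segment record see every page-table update at once, with no Unix-style consistency pass over per-process tables:

```python
class Segment:
    def __init__(self, kind):
        self.kind = kind
        self.page_table = {}     # virtual page number -> physical page frame
        self.backing_store = {}  # virtual page number -> location in a file

class Process:
    """A Sprite process control block holds only pointers to its segments."""
    def __init__(self, code, heap, stack):
        self.code, self.heap, self.stack = code, heap, stack

# Two processes sharing code and heap, as in Figure 3.
code, heap = Segment("code"), Segment("heap")
a = Process(code, heap, Segment("stack"))
b = Process(code, heap, Segment("stack"))
a.heap.page_table[3] = 17    # install a mapping through process A
```

Because the mapping was installed in the shared segment record, process B observes it immediately, while the two stack segments remain private.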

    4. Page Replacement Algorithms

Several different page replacement algorithms have been developed. The most commonly used types of algorithms are those that are based on either a least-recently-used (LRU) page replacement policy, a first-in-first-out (FIFO) page replacement policy, or a working set policy [DENN 68]. LRU and FIFO algorithms are used to determine which page frame to replace when handling a page fault. In an LRU algorithm the page that is chosen for replacement is the page that has not been referenced by a process for the longest amount of time. In a FIFO type of algorithm, the page that is replaced is the one that was least recently paged in. Both LRU and FIFO can be applied either to all pages in the system (a global policy) or only those pages that are owned by the faulting process (a local policy). Algorithms that use a global policy allow processes to have a variable-size partition of memory whereas algorithms that use a local policy normally


Figure 3. The state of two Sprite processes that are sharing code and heap. All that the process state contains is pointers to the three segments. Since process A and process B are sharing heap and code, they point to the same heap and code segments. However, each process has its own stack segment.



give processes fixed-size memory partitions.

Unlike the LRU or FIFO page replacement policies, the working set policy does not determine which page to replace at page fault time, but rather attempts to determine which pages a process should have in memory to minimize the number of page faults. The working set of a process is those pages that have been referenced in the last τ seconds. Any pages that have not been referenced in the last τ seconds are eligible to be used for the working sets of other processes. A process is not allowed to run unless there is a sufficient amount of free memory to fit all of the pages in its working set in memory.
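Denning's definition can be stated as a one-line computation over a reference trace. In the sketch below, virtual time is measured in references rather than seconds, a common simplification, and τ is the window length:

```python
def working_set(trace, t, tau):
    """W(t, tau): the set of pages referenced in the last tau references,
    up to and including time t. trace[i] is the page touched at time i."""
    return set(trace[max(0, t - tau + 1):t + 1])
```

A process would only be allowed to run if W(t, τ) fits in free memory; pages that fall outside every process's window are eligible to be reused for other working sets.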

    Studies of algorithms that use LRU, FIFO and working set policies have yielded thefollowing results:

• Algorithms that use an LRU policy have better performance (where better performance is defined to be a lower average page fault rate) than those that use a FIFO policy [KING 71].

• Algorithms that use global LRU have better performance than ones that use local LRU with fixed-size memory partitions [OLIV 74].

• The global LRU and working set policies provide comparable performance [OLIV 74, BABA 81a].
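The first of these results is easy to reproduce with a toy simulation. The sketch below counts faults for a reference trace under FIFO and LRU replacement; the trace is invented, chosen to show the locality that real programs exhibit:

```python
def count_faults(trace, nframes, policy):
    """Count page faults under FIFO or LRU replacement with nframes frames.
    FIFO evicts the page resident longest; LRU the page unused longest."""
    frames = []  # front of the list is the next victim
    faults = 0
    for page in trace:
        if page in frames:
            if policy == "lru":       # a hit refreshes recency under LRU
                frames.remove(page)
                frames.append(page)
            continue
        faults += 1
        if len(frames) == nframes:
            frames.pop(0)             # evict from the front
        frames.append(page)
    return faults
```

On the trace [1, 2, 1, 3, 1, 4, 1, 5] with three frames, LRU takes 5 faults to FIFO's 6, because FIFO eventually evicts the hot page 1 simply for being resident longest.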

These results indicate that Sprite should use either a global LRU algorithm or an algorithm that uses the working set policy. In practice, these two types of algorithms are approximated rather than being implemented exactly. The first part of this section provides a description of three operating systems that have already implemented an approximation of one of these algorithms. In addition a fourth operating system, VMS, is presented that implements a different type of algorithm that can give an approximation to global LRU. The second part of this section provides a description of the Sprite page replacement algorithms.

    4.1. Implementations of Page Replacement Algorithms

    4.1.1. Multics

The Multics paging algorithm [CORB 69] is generally referred to as the clock algorithm. The algorithm has been shown to provide a good approximation of global LRU [GRIT 75]. In the algorithm all of the pages are arranged in a circular list. A pointer, called the clock hand, cycles through the list looking for pages that have not been accessed since the last sweep of the hand. When it finds unreferenced pages, it removes them from the process that owns them and puts them onto a list of free pages. If the page is dirty, it is written to disk before being put onto the free list. This algorithm is activated at page fault time whenever the number of free pages falls below a lower bound. Whenever the algorithm is activated several pages are put onto the list of free pages.

The clock algorithm is relatively simple to implement. The only hardware support required is reference and modified bits. In fact the Unix version of the clock algorithm was implemented without any hardware reference bits [BABA 81a, BABA 81b].
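The clock hand's behavior can be sketched directly from this description. Dirty-page writeback is omitted to keep the sketch short, and each page is reduced to its reference bit:

```python
def clock_sweep(pages, hand, want):
    """Advance the clock hand until `want` pages have been reclaimed.
    A referenced page gets its bit cleared (a second chance); an
    unreferenced page is moved to the free list. pages[i]['ref'] is the
    hardware reference bit; returns (freed indices, new hand position)."""
    freed = []
    while len(freed) < want:
        page = pages[hand]
        if page["ref"]:
            page["ref"] = False          # clear the bit and move on
        else:
            freed.append(hand)           # not touched since the last sweep
        hand = (hand + 1) % len(pages)
    return freed, hand
```

Because the first lap clears every reference bit it passes, the sweep always terminates within two trips around the list.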

    4.1.2. Tenex

There are very few examples of operating systems that have implemented algorithms that use the working set policy. An example of an operating system that



implements something similar to it is Tenex [BOBR 72]. There are two main differences between the policy used by Tenex and the working set policy. First, instead of defining the working set of a process to be those pages that have been referenced in the last τ seconds, it is defined to be the set of pages that keeps the page fault frequency down to an acceptable level. Second, instead of removing pages from the working set as soon as they have been unreferenced for τ seconds, pages are only removed from the working set at page fault time.

The actual algorithm used in Tenex is the following. Whenever a process experiences a page fault, the average time between page faults is calculated. If it is greater than PVA, a system parameter, then the process is considered to be below its working set size. In this case the requested page is brought into memory and added to the working set. If the page fault average is below PVA, then before loading the requested page into memory the size of the working set is reduced by using an LRU algorithm. Since removing pages is costly, whenever the working set is reduced all sufficiently old pages are removed.
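The fault-time decision reads naturally as a small function. This is a loose paraphrase, not the Tenex code: avg_interval is the measured average time between faults, pva the system threshold, and old_age the cutoff beyond which a page counts as "sufficiently old"; all the numbers used to exercise it are invented.

```python
def on_page_fault(resident, last_use, now, avg_interval, pva, old_age, page):
    """Tenex-style working set maintenance, done only at fault time.
    If faults are arriving too often (avg_interval < pva), shed every
    sufficiently old page before admitting the faulting page."""
    if avg_interval < pva:
        resident = {p for p in resident if now - last_use[p] < old_age}
    else:
        resident = set(resident)   # below working-set size: just grow
    resident.add(page)
    last_use[page] = now
    return resident
```

Evicting all old pages at once, rather than one per fault, amortizes the cost of removal that the text notes is high.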

Chu and Opderbeck [CHU 76] performed some measurements of an algorithm that is very similar to the algorithm used in Tenex. They showed that this type of algorithm is able to produce performance comparable to that of algorithms that use Denning's working set policy.

    4.1.3. VMS

VMS uses a fixed-size memory partition policy [LEVY 82, KENA 84]. Each process is assigned a fixed-size partition of memory called its resident set. Whenever a process reaches the size of its resident set, the process must release a page for every page that is added to its resident set. VMS uses a simple first-in-first-out (FIFO) policy to select the page to release. There are two lists that contain pages that are not members of any process's resident set: the free list and the dirty list. Both of these lists are managed by an LRU policy. Whenever a page is removed from a process's resident set it is put onto the free list if it is clean or the dirty list if it is dirty. The two lists are used when handling a page fault. If the page that is faulted on is still on one of the lists then it is removed from the list and added to the process's resident set. Otherwise the page on the head of the free list is removed, loaded with data, and added to the process's resident set. Once the dirty list gets too many pages on it, some are cleaned and moved to the free list.
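A toy version of the resident set mechanism shows why the lists make some faults cheap. The dirty list and its cleaning are left out of this sketch, so every evicted page goes straight to the free list:

```python
class VMSMemory:
    """Fixed-size FIFO resident set per process, plus a global free list
    from which evicted pages can be reclaimed without any disk I/O."""
    def __init__(self, resident_limit):
        self.limit = resident_limit
        self.resident = []   # FIFO order: front is the next victim
        self.free_list = []  # evicted pages, oldest first

    def fault(self, page):
        """Handle a fault; returns True if it was a cheap list reclaim."""
        reclaimed = page in self.free_list
        if reclaimed:
            self.free_list.remove(page)
        if len(self.resident) >= self.limit:
            self.free_list.append(self.resident.pop(0))
        self.resident.append(page)
        return reclaimed
```

With a two-page resident set, touching a third page pushes the first onto the free list, but faulting on that page again reclaims it from the list instead of reloading it.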

The VMS algorithm is interesting for two reasons. First, it does not require reference bits. Second, it is an example of a hybrid between a FIFO and an LRU algorithm. Each process's local partition is managed FIFO but the global free list is managed LRU. It has been shown that for a given program and a given memory size, the resident set size can be set so as to achieve a fault rate close to that of LRU [BABA 81a]. However, the optimal resident set size for a fixed free list size is very sensitive to changes in memory size and the program. With a fixed-size partition for all programs it would be very difficult to choose one partition size that would be optimal for all programs. Another problem with fixed-size partitions is that programs cannot take advantage of a large number of free pages once they have reached their maximum partition size. More recent versions of VMS [KENA 84] have relaxed the restriction on the size of the resident set by allowing the resident set size to increase to a higher limit if the free and dirty lists contain a sufficient number of pages.



    4.1.4. Unix

The Unix page replacement policy is based on a variant of the clock algorithm [BABA 81b]. Unlike Multics, the clock algorithm is run periodically instead of at page fault time. The data structures used to implement the clock algorithm are the core map and the free list (see Figure 4). The core map is a large array containing one entry for each page frame, with the entries stored in order of appearance in physical memory. It forms the circular list of pages required by the clock algorithm. The free list is a list of pages that is used at page fault time. It contains two types of pages: those that are not being used by any process and those that were put onto the list by the clock algorithm. The unused pages are at the front of the list.

The clock algorithm is simulated by a kernel process called the pageout daemon. It is run often enough so that there are a sufficient number of pages on the free list to handle bursts of page faults. The pageout daemon cycles through the core map looking for pages that have not been referenced since the last time that they were looked at by the pageout daemon. Any unreferenced pages that it finds are put onto the end of the free list.

There are four constants used to manage the pageout daemon. As long as the number of free pages is greater than lotsfree (1/8 of memory on the Sun version of Unix) then the pageout daemon is not run. When the number of free pages is between desfree (at most 1/16 of memory) and lotsfree then the pageout daemon is run at a rate that attempts to keep the number of pages on the free list acceptably high, while not taking more than ten percent of the CPU. When the number of free pages falls below desfree for an extended period of time then a process is swapped out. The process chosen to be swapped out is the oldest of the nbig largest processes. Minfree is the minimum tolerable size of the free list. If the free list has only minfree pages then every time a page is removed from the free list the pageout daemon is awakened to add more pages to the list.
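The threshold part of this policy amounts to a small decision function. The sketch below uses the Sun fractions quoted in the text for lotsfree and desfree; it ignores minfree wakeups and treats any dip below desfree as the prolonged shortage that triggers swapping:

```python
def pageout_action(free_pages, memory_pages):
    """What the 4.2BSD pageout daemon does at a given free-page level."""
    lotsfree = memory_pages // 8    # 1/8 of memory on the Sun version
    desfree = memory_pages // 16    # at most 1/16 of memory
    if free_pages > lotsfree:
        return "idle"               # plenty of free pages: daemon is not run
    if free_pages >= desfree:
        return "scan"               # run the hand, capped at 10% of the CPU
    return "swap"                   # prolonged shortage: swap a process out
```

For a 1024-page memory this gives thresholds of 128 and 64 pages, so a machine with 100 free pages scans while one with 50 begins swapping.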


Figure 4. Unix page replacement data structures. In this example there are 6 pages in physical memory. Pages 1 and 4 are currently being used by processes, pages 3 and 6 have been detached from the processes that were using them by the clock algorithm, and pages 2 and 5 are not being used by any process. Thus pages 5, 2, 3, and 6 (in that order) are eligible to be used when a new page is needed and pages 1 and 4 are not eligible. Pages 3 and 6 are also eligible to be reclaimed by the process that originally owned them if the original owner faults on them.



The free list is used when a page fault occurs. Like VMS, there are two ways in which the free list is used to handle a page fault. The first is if the requested virtual page is still on the free list. If this is the case, then the page can be reclaimed by taking it off of the free list and giving it to the faulting process. The second way in which the free list is used is if the requested virtual page is not currently on the free list. In this case a page is removed from the front of the free list, it is loaded with code or data, and it is given to the faulting process.

    4.2. Sprite Page Replacement

We decided to base the Sprite page replacement algorithm on the clock algorithm because of its inherent simplicity. However, we decided to do things differently than Unix. The main difference is that Sprite no longer uses a free list. The free list serves an important function by eliminating the need to activate the clock algorithm on every page fault. However, it has the disadvantage of reducing the amount of memory that can be referenced without causing page faults; any reference to a page on the free list requires a page fault to reclaim the page. Sprite does not need a free list because it uses the clock algorithm to maintain an ordered list of all pages in the system. When a page fault occurs this ordered list of pages can be used to quickly find a page to use for the page fault. This section describes the Sprite algorithm that maintains and uses this list.

    4.2.1. Data Structures

There are three data structures that the page replacement algorithm uses: the core map, the allocate list, and the dirty list. Diagrams of all three data structures are shown in Figure 5. The core map is identical in form to the Unix core map. It is used by the Sprite version of the clock algorithm to help manage the allocate list. The allocate list contains all pages except those that are on the dirty list. All unused pages are on the front of the list and the rest of the pages follow in approximate LRU order. The list is used when trying to find a new page. The allocate list is managed by both a version of the clock algorithm and the page allocation algorithm. The dirty list contains pages that are being written to backing store. Pages are put onto the dirty list by the page allocation


Figure 5. Sprite page replacement data structures. Every page is present in the core map and either the allocate list or the dirty list. In this example there are 5 pages in physical memory. Pages 1, 3, and 4 are currently being used by processes and pages 2 and 5 are unused. Of the pages in use, pages 1 and 4 are on the dirty list waiting to be cleaned.



    algorithm.

    4.2.2. Clock Algorithm

Sprite uses a variant of the clock algorithm to keep the allocate list in approximate LRU order. A process periodically cycles through the core map moving pages that have their reference bit set to the end of the allocate list. All pages have their reference bit cleared before they are moved to the end of the allocate list. Since unreferenced pages are not moved by the clock algorithm, they will migrate towards the front of the allocate list and referenced pages will stay near the end. The rate at which the core map is cycled through is not known yet.

    4.2.3. Page Allocation

Figure 6 summarizes the Sprite page allocation algorithm. The basic algorithm is to remove pages from the front of the allocate list until an unreferenced, unmodified page is found. If the frontmost page has been referenced since the last time that it was examined by the clock algorithm, then it is moved to the end of the allocate list. If the page is modified but not referenced since the last time that it was examined by the clock algorithm, then it is moved to the dirty list. Once an unreferenced, unmodified page is found it is used to handle the page fault. If the page that is selected to handle the page fault is still being used by some process, the page must be detached from the process that owns it before it can be given to the faulting process.
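This algorithm is short enough to state directly. Each page in the sketch carries its reference and modify bits; the sketch assumes the invariant the text relies on, namely that read-only code pages guarantee a clean page is eventually found:

```python
def allocate_page(allocate_list, dirty_list):
    """Pop pages off the front of the allocate list until one is both
    unreferenced and unmodified. Referenced pages go to the rear with
    the bit cleared; modified-but-unreferenced pages go to the dirty list."""
    while True:
        page = allocate_list.pop(0)
        if page["ref"]:
            page["ref"] = False
            allocate_list.append(page)
        elif page["mod"]:
            dirty_list.append(page)  # cleaned later, then returned to the front
        else:
            return page              # detach from its owner and reuse
```

Note how the scan itself repairs the list: every referenced page it touches is re-aged to the rear, and every dirty page it touches is queued for cleaning, so later faults find a clean victim sooner.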

    4.2.4. Page Cleaning

A kernel process is used to write dirty pages to backing store. It wakes up when pages are put onto the dirty list. Before a page is written to backing store its modified bit is cleared. After a page is written to backing store it is moved back to the front of the allocate list.

4.2.5. How Much Does a Page Fault Cost?

When a page fault occurs on Sprite there are three steps to processing the page fault: find a page frame, detach it from the process that owns it (if any), and fill the page with code or data. The overhead required to fill the page on a Sun-2 is between 3.6 ms for a zero-filled page fault (see Section 8) up to as much as 30 ms to fill the page from disk. The actual time required to fill a page will vary with the page size. The overhead of filling a page must be paid by any page replacement algorithm. The additional cost that Sprite has to pay at page fault time because it does not keep a free list is searching the allocate list for an unreferenced, unmodified page and then detaching the page from its


Figure 6. Summary of the Sprite page allocation algorithm.



owner.

Since the allocate list is in approximate LRU order, an unreferenced page should be able to be found quickly. How quickly an unmodified page can be found is dependent on the percentage of memory that is dirty. Since code pages are read-only, there must be clean pages. If a large percentage of memory is dirty then it might take a long search on a page fault to find a clean page. However, since all dirty pages that were found during the search will be cleaned and then put back onto the front of the allocate list, subsequent page faults should be able to find clean pages very quickly. Therefore in the average case the number of pages that have to be searched to find an unreferenced, unmodified page should be small.

The cost of detaching a page from its owner is shown in Section 8 to be 0.3 ms. This cost combined with the cost of searching a small number of pages on the allocate list is small in comparison to the cost of filling the page frame. Therefore the overhead added because Sprite does not keep a free list should be small relative to the large cost of either zero-filling a page or filling it from the file server.
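The claim can be checked with rough arithmetic. The 0.3 ms detach cost and the 3.6-30 ms fill costs are the paper's Sun-2 numbers; the per-page search cost and the number of pages searched are invented small values, consistent with the argument above that the search is short:

```python
def extra_fault_overhead(fill_ms, detach_ms=0.3,
                         pages_searched=2, search_ms_per_page=0.01):
    """Fraction of total fault time spent on Sprite's extra work:
    scanning the allocate list and detaching the page from its owner."""
    extra = detach_ms + pages_searched * search_ms_per_page
    return extra / (fill_ms + extra)
```

Even for the cheapest fault considered here, a zero-filled page at 3.6 ms, the extra work is roughly 8% of the total; filling from disk at 30 ms it drops to about 1%.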

    5. Demand Loading of Code and Backing Store Management

A virtual memory system must be able to load pages into memory from the file system or backing store when a process faults on a page and write pages to backing store when removing a dirty page from memory. This can be done either by using the file system both to load pages and to provide backing store, or by using the file system to load pages and a separate mechanism for backing store. For Sprite we chose to use the file system for both demand loading and backing store. Examples of other systems that use the file system for backing store are Multics [ORGA 72] and Pilot [REDE 80]. An example of a system that uses a separate mechanism for backing store is Unix. In this section the methods that Sprite and Unix use for demand loading of pages and managing backing store are described in detail and compared.

    5.1. Lifetime of a Page

There are three different types of pages in Unix and Sprite: read-only code pages, initialized heap pages, and uninitialized heap or stack pages (i.e. those whose initial contents are all zeroes). There are also three places where these pages can live: in the file system, in backing store, or in main memory. This section describes where the three types of pages spend their lifetimes.

    5.1.1. Lifetime of a Unix Page

Figure 7 shows the lifetime of the three types of Unix pages. Code pages can live in three places: main memory, an object file in the file system, and backing store. An object file is a file that contains the code and initialized heap pages for a program. When the first page fault occurs for a code page, the page is read from the object file into a page frame in memory. When the page is replaced, its contents are written to backing store even though the page could not have been modified, since code pages are read-only. Subsequent reads of the code page come from backing store. Since a code page is read-only, it only has to be written to backing store the first time that it is removed from memory.

The lifetime of an initialized heap page is almost identical to that of a code page. Like a code page an initialized heap page is loaded from the object file when it is first


  • 8/12/2019 Virtual Memory for the Sprite Operating System

    12/27

    Virtual Memory for the Sprite Operating System January 5, 1993

[Figure 7 diagram: code pages, initialized heap pages, and zero-filled uninitialized heap or stack pages shown moving among the file system, memory, and backing store.]

Figure 7. Lifetime of a Unix page. Code pages and initialized heap pages begin their life in the file system. However, once they are thrown out of memory they are written to backing store, and spend the rest of their life in memory and backing store. Uninitialized heap and stack pages are filled with zeroes initially and then spend the rest of their life in memory and backing store.

faulted on, it is written to backing store the first time that it is replaced, and all faults on the page after the first one load the page from backing store. Unlike code pages, initialized heap pages can be modified. Because of this, in addition to being written to backing store the first time that they are replaced, initialized heap pages must be written to backing store whenever they are replaced and have been modified.

Unlike code and initialized heap pages, uninitialized heap pages and stack pages never exist in the file system. Their entire existence is spent in memory and in backing store. When an uninitialized heap or stack page is first faulted in, it is filled with zeroes. When the page is taken away from the heap or stack segment, it is written to backing store. From this point on, whenever a page fault occurs for the page, it is read from backing store. Whenever the page is thrown out of memory, if it has been modified then it is written back to backing store.

[Figure 8 diagram: code pages, initialized heap pages, and zero-filled uninitialized heap or stack pages shown moving among the file system, memory, and backing store; code pages never reach backing store.]

Figure 8. Lifetime of a Sprite page. Code pages live in two places, the file system and memory. Code pages never have to live on backing store because they are never modified. Initialized data pages live in the file system and memory until they get modified, and then spend the rest of their life in memory and backing store. Uninitialized heap and stack pages are filled with zeroes initially and then spend the rest of their life in memory and backing store.


    5.1.2. Lifetime of a Sprite Page

The lifetime of a Sprite page (see Figure 8) is very similar to the lifetime of a Unix page. The only difference is the treatment of code pages. Unlike Unix code pages, Sprite code pages can only live in memory and in the object file. When a page fault occurs for a code page, the page is read from the object file into a page frame in memory. When the page is replaced, its contents are discarded since code is read-only. Thus whenever a page fault occurs for a code page it is always read from the object file. The reasons why Unix writes read-only code pages to backing store and Sprite does not are explained in the next section.

    5.2. Demand Loading of Code

Unix and Sprite both initially load code and initialized heap pages from object files in the file system. However, the method of reading these pages is different. In Sprite the pages are read using the normal file system operations. In Unix the pages are read directly from the disk without going through the file system. In addition, after a code page has been read in once, the object file is bypassed, with all subsequent reads coming from backing store. The reasons behind the differences between Unix and Sprite are based on different perceptions of the performance of the file system.

The designers of Unix believed that the file system was too slow for virtual memory. Therefore they tried to use the file system as little as possible. This is why Unix bypasses the file system and uses absolute disk block numbers when demand-loading code or initialized heap pages, and why it writes code pages to backing store for subsequent page faults. When a code or initialized heap segment is initially created for a process, the virtual memory system determines the physical block numbers for each page of virtual memory that is on disk. When a code or initialized heap page is first faulted in, the physical address is used to determine where to read the page from. All later reads of the code page are able to use an absolute disk address for the code page in backing store.

    There are two reasons why the Unix designers believed that the Unix file system was too slow for virtual memory. The first reason is that when performing a read through the file system there is overhead in translating from a file offset to an absolute disk address. Thus the Unix virtual memory system uses absolute disk addresses to eliminate this overhead. The second reason is that the file system block size was too small. In the original versions of Unix (prior to Unix 4.1BSD), the block size was only 512 bytes [THOM 78]. This severely limited file system throughput [MCKU 84]. The solution was to write code pages to backing store, which, as will be explained later, is allocated in contiguous blocks on disk. This allows many pages to be transferred at once, providing higher throughput. Since Unix 4.2BSD uses large block sizes (4096 bytes or larger), there is no longer any advantage for Unix to write code pages to backing store.

    We believe that the Sprite file system is fast enough for the virtual memory system to use it for all demand loading; there is no need to use absolute disk block numbers or to write code pages to backing store. The reason is that the file system will be implemented using high-performance file servers, which will be dedicated machines with local disks and large physical memories. The Sprite file system will have the following characteristics:


- The large memory of the file servers will be used as a block cache. The cache size for the present will be around 8 Mbytes. In a few years the cache size will probably be between 50 and 100 Mbytes. It has been shown that when page caches are large enough (several Mbytes), disk traffic can be reduced by 90% or more [OUST 85].

- Combined with the large cache will be large file block sizes (16 Kbytes or more). This results in the fewest disk accesses [OUST 85].

- In addition to caching pages, the large memory can be used to cache file indexing information to allow faster access to blocks within a file.

    Since the file system is used for demand loading of code and initialized heap, the normal file system operations can be used. When a code or heap segment is first created, the open operation is used to get a token to allow access to the object file for future page faults. The close operation is used when the virtual memory system has finished with an object file. The read operation is used to load in a page when a page fault occurs. When a read operation is executed, the file server is sent the token for the file, an offset into the file where to read from, and a size. The file server then returns the page.

The Sprite method of using a high-performance file system has several advantages over the Unix method of bypassing the file system. First, the virtual memory system is simplified because it does not have to worry about the physical location of pages on disk. This is because, when demand loading a page, offsets from the beginning of the object file are used instead of absolute disk block numbers. These offsets can be easily calculated given any virtual address. Second, the file server's cache can be used to increase performance. Page reads may be serviced out of the cache instead of having to go to disk. Third, code pages do not have to be transferred to backing store.

    5.3. Backing Store Management

Backing store is used to store dirty pages when they are taken away from a segment. In Sprite, each segment has its own file in the file system that it uses for backing store. Unix, on the other hand, gives each CPU its own special partition of disk space, which all segments share for backing store. The reason for the difference is once again based on different ideas about the performance of the file systems in Unix and Sprite.

As was said before, the Unix designers believed that the Unix file system was too slow for the virtual memory system. Because of this, a special area of disk is allocated for the virtual memory system. When each process is created, contiguous chunks of backing store are allocated for each of its code, heap, and stack segments. As segments outgrow the amount of backing store allocated for them, more backing store is allocated in large contiguous chunks on disk. A table of disk addresses for the backing store is kept for each segment. When a page is written to disk, the physical address from this table is used to determine where to write the page. If the machine has a local disk, then all backing store is allocated on its local disk. Otherwise all backing store is allocated on a partition of the server's disk.

Since Sprite uses a high-performance file system, all writing out of dirty pages is done to files in the file system instead of to a special disk partition dedicated to backing store. The first time that a segment needs to write a dirty page out, a file is opened. We chose not to open the file until a dirty page has to be written out, on the assumption that most segments will never have any dirty pages written out. From this point on, all dirty


pages that need to be saved in backing store are written to the file using the normal file write operation. This just involves taking the virtual address of the page to be written and presenting it, along with the page, to the file server. All reads from backing store can use the normal file read operation by using the virtual address of the page to be read.

When the segment is destroyed, the file is closed and removed.

    There are several advantages of using files for backing store instead of using a

separate partition of disk. First, the virtual memory system can deal with virtual addresses instead of absolute disk block numbers. This frees the virtual memory system from having to perform bookkeeping about the location of pages on disk. Second, no preallocated partition of disk space is required for each CPU. This can represent a major savings in disk space. For example, under Unix each Sun workstation requires a disk partition of approximately 16 Mbytes for backing store, most of which is rarely used. Third, process migration (the ability to preempt a process, move its execution state to another machine, and then resume its execution) is simplified. In Unix, backing store is private to a machine, so the actual backing store itself would have to be transferred if a

process were migrated. In Sprite, the backing store is part of a shared file system, so only a pointer to the file used for backing store would have to be transferred. Finally, Sprite has the potential for higher performance. Since file servers will have large physical memories dedicated to caching pages, the virtual memory system may be able to get pages by hitting in the cache instead of going to disk.

There is one problem with the Sprite scheme of using files. Since disk space is not preallocated, it is possible that there may be no disk space available when a dirty page needs to be written out of memory. This will not be a problem as long as the file server's cache has enough room to hold all dirty pages that are written out. However, if the file server cannot buffer all pages that are written out, then this could be a serious problem. A solution is to create a separate partition on the server's disk that is only used for backing store files. This should still represent a savings in disk space over the Unix method because the partition for backing store files can be shared by all workstations.

    6. Fast Program Startup

In order for a program to run, it must demand-load its code and initialized heap pages. These page faults are much more expensive than the initial faults for stack and uninitialized heap pages, because code and initialized heap faults require a file server transaction, whereas stack and uninitialized heap faults just cause new zero-filled pages to be created. The section on shared memory described one method of eliminating these code and initialized heap page faults by having a newly-created process share code with another process that is already using the desired code segment. However, this is of no use if there is no process that is actively using the code segment that is needed. In order to help eliminate code page faults in the case when the normal shared memory mechanism will not work, Sprite uses memory as a cache to hold pages from code segments that were recently in use. The use of memory for this purpose is possible because of the advent of large physical memories and the fact that code is read-only.

Caching of code pages is implemented by saving the state of a code segment when the last process to reference it dies. The saved state includes the page table and all pages that are in memory. Such a code segment is called inactive. All of the pages associated with an inactive segment will remain in memory until they are removed by the normal


page replacement algorithm. When a process overlays its address space with a new program image, if the code segment of the program corresponds to one of these inactive segments, then the process will reuse the inactive segment instead of allocating a new one.

There are two details to the algorithm for reusing inactive segments that need to be mentioned. First, it may be the case that when a new segment is being created, all segments are either inactive or in use. In this case the segment that has been inactive for the longest amount of time is recycled and used to create the new segment. This entails freeing any memory-resident pages that the inactive segment may have allocated to it. Second, whenever a process overlays its address space with a new program image, the code segment that is used needs to be from the most recent version of the object file. This is accomplished by opening the object file for the needed code segment to get a token that uniquely identifies the current version of the file. This token is then compared to the tokens for all inactive code segments. A match will occur only if there is a code segment that is from the most recent version of the object file.

The actual performance improvement from the use of inactive segments cannot be measured until the operating system becomes fully operational. However, an indication of the reduction in startup cost can be obtained by determining the ratio of code to initialized heap in the standard Unix programs. I examined over 300 Unix programs and determined that on the average there was nearly three times as much code as initialized heap. This means that the elimination of all code page faults could reduce the startup cost by as much as 75 percent. The actual improvement depends on how recently the program has been used, the demand on memory, and how many code and initialized heap pages are faulted in by the program.

    7. Implementation of Virtual Memory

The virtual memory system that has been described in this paper has been fully implemented on the Sun architecture. In this section some issues related to the implementation and performance of the virtual memory system are discussed.

    7.1. Hardware Dependencies

Certain portions of all virtual memory systems are inherently hardware dependent. This includes reading and writing a page table entry and reading and writing hardware virtual memory registers. Even if the code for a virtual memory system is written in a high-level language, it must be modified when it is made to run on different machine architectures so that it can handle these hardware dependencies. The Sprite implementation is structured to allow these modifications to be made easily. In contrast, the Unix implementation is not structured to allow it to be easily ported to other architectures.

The Unix virtual memory system was initially written for the VAX architecture. The implementation contains code, algorithms, and data structures that are specific to the VAX architecture. Instead of being isolated in a single portion of the code, these

* The programs that were examined came from Sun's version of /bin, /usr/bin, /usr/local, and /usr/ucb. The measurements were performed by using the size command on each program and then taking the ratio of initialized code to initialized heap.


hardware dependencies are strewn throughout the code. An example of what is involved in porting the Unix code is the job that the people from Sun Microsystems did when porting Unix from the VAX to the Sun. When they did the port they did three things. First, they used the conditional compilation feature of C (the language that Unix is written in) to separate many of the VAX and Sun hardware dependencies within the code. As a result, the code consists of a mixture of VAX and Sun code separated by #ifdef statements. Second, they wrote a separate file of hardware-dependent routines to manage the Sun virtual memory hardware. Third, they left in some data structures and algorithms that are there because the code was written for a VAX. An example of this is the simulation of reference bits [BABA 81a, BABA 81b]. This was only necessary because the VAX does not have reference bits. Even though the Sun does have reference bits, they were not used because it was simpler to stick with the simulation of reference bits.

Since Sprite is going to be ported to the SPUR architecture, one of my goals was to make the virtual memory system easier to port than Unix. This involved defining a separate set of routines that handle all hardware-dependent operations. These routines are called by the hardware-independent routines. When we port Sprite to another architecture, only the hardware-dependent routines should need to be rewritten.

    The result of dividing the code into hardware-dependent and hardware-independent parts is that approximately half of the code is hardware-dependent and half hardware-independent (the actual code size is given in Table 1). Implementation of the clock and page replacement algorithms, handling of page faults, creation of new segments, and management of the segment table have been done with hardware-independent code. Reading and writing page table entries, validating and invalidating pages, mapping pages into and out of the kernel's address space, and managing the Sun memory mapping hardware have all been written as hardware-dependent code. Thus only the low-level operations have to be reimplemented when we port Sprite to another architecture.

                        Lines of C            Lines of Assembler
                      With      Without       With      Without      Bytes of
                    Comments   Comments     Comments   Comments   Compiled Code
Machine Dependent     2423        970          359        107          8560
Machine Independent   2825       1095            0          0          8484
Total                 5248       2065          359        107         17044

    Table 1. Sprite code size. Half of the Sprite code is hardware-dependent and half is hardware-independent. In addition, almost two-thirds of the code is comments. Blank lines are treated as comments.


                        Lines of C            Lines of Assembler
                      With      Without       With      Without      Bytes of
                    Comments   Comments     Comments   Comments   Compiled Code
Total                 5855       3833          173        125         32948

    Table 2. Unix code size. Unix code is a mixture of hardware-dependent and hardware-independent code. Thus, unlike Table 1, there is no differentiation in this table between hardware-dependent and hardware-independent code. Also note that only a little more than one-third of the code is comments. Blank lines are treated as comments.

    7.2. Code Size

The virtual memory systems for Sprite and Unix were both written mostly in C, with a small amount written in assembler. Tables 1 and 2 give the sizes of the source and object code for both systems. There are three observations that can be made from these two tables. First of all, the Sprite source code is much more heavily documented than the Unix source. This can be seen from the fact that although Unix and Sprite have almost the same amount of C code including comments, Unix has twice as many lines of code not including comments. The second observation is that Sprite is not as complex as Unix. This can be seen from the fact that Unix has twice as many bytes of compiled code as Sprite. Although code size is not an absolute measure of complexity, the fact that Sprite requires half as much code as Unix is an approximate indication of their relative complexities. The final observation from the two tables is the small amount of assembly code in either system. This shows that both virtual memory systems can be written almost entirely in a high-level language.

    7.3. Performance

Since the Sprite virtual memory system is currently running on the Sun architecture, I have been able to do some preliminary performance evaluation of the virtual memory system. All of the performance evaluation was done on a Sun-2 workstation (which uses a Motorola 68010 microprocessor). This performance evaluation falls into two categories. The first is the speed of simple zero-filled page faults. The second category is the overhead of the clock algorithm. This measurement consists of determining what percentage of the CPU is required to run the clock algorithm at different rates. Things that were not measured include the time required to read and write pages from/to backing store, the time required to demand-load a page from an object file, and the overhead required when memory demand is so tight that pages have to be written to backing store. These were not measured because they are dominated by the speed of the file server and not the speed of virtual memory. Since we are not using the high-performance file server that we will be using in a few months, these measurements are not a good indication of the eventual speed of the virtual memory system.

    7.3.1. Simple Page Faults

Table 3 gives the times required to perform zero-filled page faults on Sprite and Unix. These page faults occur when either an uninitialized heap page or a stack page is


                  Sprite         Unix
    Free Page     3.6 ms     3.5 to 4.0 ms
    In Use Page   3.9 ms         ----

    Table 3. Time to zero-fill a page in milliseconds. The first row is the amount of time required when the page that is being zero-filled does not have to be detached from a process. The second row is the amount of time required when the page must be detached from the process that owns it. Since Unix never has to detach a page, there is no entry in the second row for Unix.

first faulted on. These simple page faults consist of getting a page, filling it with zeroes, and giving it to the process that faulted on it. In Sprite, if the page is not in use it takes 3.6 ms to zero-fill it, and if it is in use it takes 3.9 ms. Thus it takes only 0.3 ms to detach a page from a user process. Unix, which always has free pages available, varies between 3.5 and 4.0 ms for a zero-filled page fault. For both Unix and Sprite, the portion of the zero-filled page fault time that is spent writing zeroes into the page is approximately 1 ms.

                All in Use,          All in Use,
              All Referenced      None Referenced         None in Use
    Pages    Execution Percent   Execution Percent    Execution Percent
     per       Time    Slowdown    Time    Slowdown     Time    Slowdown
    Second    (sec)               (sec)                (sec)
       0       360        0        360        0         360        0
     100       376        4        367        2         362        0.5
     500       415       16        390        8         370        3
    1000       475       32        415       15         384        7

    Table 4. Clock algorithm overhead. There are three different states for pages examined by the clock algorithm: in use and referenced, in use and not referenced, and not in use. An in-use-and-referenced page requires the most overhead because the reference bit has to be examined and cleared, and the page moved to the end of the allocate list. An in-use-and-not-referenced page requires the second largest overhead because the reference bit has to be examined, but it does not have to be cleared and the page does not have to be moved. Finally, pages that are not in use require the least amount of overhead because they can be ignored. The above table shows the amount of overhead required to execute the clock algorithm for pages of the three different types at different speeds. The left column shows the overhead when all pages in memory are in use and referenced (the worst case). The middle column shows the overhead when all pages in memory are in use but none are referenced. The right column shows the overhead when no pages are in use (the best case). The overhead was measured by executing a low-priority user process that executes a simple four-instruction loop 100,000,000 times. While this process was executing, the clock algorithm was running at high priority once a second. The overhead was measured by determining how much extra time was required by the user process to complete while the clock algorithm executed.


    7.3.2. Overhead of Clock Algorithm

I was able to measure the overhead of the clock algorithm for Sprite. However, I was unable to do similar measurements for the Unix clock algorithm. The reason is that, whereas I was able to easily modify the Sprite kernel to give me good information, it was unclear how to easily modify the Unix kernel to provide me with the necessary information. As a result, there is no comparison of the overheads required by the Sprite clock algorithm and the Unix clock algorithm.

Table 4 shows the overhead required by the Sprite clock algorithm. In the worst case, when all pages in memory are in use and being referenced, the algorithm requires between 3 and 4 percent of the CPU for each 100 pages checked per second. In the best case, when nothing is in use, it takes around one-half of one percent of the CPU. The significance of these numbers can be seen by determining the maximum rate at which the clock must be run to give a good approximation of LRU.

In order for the clock algorithm to keep the allocate list in approximate LRU order, pages must be scanned at a rate several times that of the page fault rate. This will guarantee that as a page rises from the end of the allocate list to the front of the allocate list, it will be examined multiple times by the clock algorithm, giving it multiple chances to be put back onto the end of the list. Thus pages that make it to the front of the allocate list will not have been accessed very recently, with the exception of pages that were accessed since the last time that the clock algorithm examined them; that is, the allocate list will be in approximate LRU order.

Since the clock hand must move at several times the page fault rate, the maximum clock rate can only be determined after determining the maximum page fault rate. An upper bound on the page fault rate is the number of zero-filled page faults that can be executed per second. At 3.6 ms per page fault, there can be 277 of these types of page faults per second. However, this is unrealistically pessimistic. When a page is zero-filled it is marked as dirty. Thus if all of memory is being filled by zero-filled pages, then all of memory must be dirty. This means that whenever a zero-filled page fault is handled, a page must be written to the file used for backing store. Therefore the actual maximum sustained rate of zero-filled page faults is the rate at which pages can be written to the file server. Since all other types of page faults require a read from the file server, the maximum page fault rate is limited by the speed at which pages can be read or written from/to the file server. Assuming optimistically that it only takes 10 ms to read or write a page from/to the file server, the upper bound on the number of page faults per second is 100. Running the clock algorithm at a rate five times this maximum page fault rate would require 500 pages to be scanned per second, or a worst-case overhead of 16 percent.

The page fault rate of 100 pages per second just given is a very pessimistic estimate. There was a study done of the paging activity of several VAX computers running Unix [NELS 84]. It measured the number of page-ins per second averaged over 5, 10, 30, and 60 second intervals when the system had an average of 8 runnable processes. The results showed that the maximum page-in rate over a 5 second interval was 25 pages per second. However, the page-in activity was very bursty, and when averaged over longer intervals it dropped dramatically. For example, when averaged over 60 second intervals, the maximum page-in rate was 4 per second. The page-in rate averaged over the whole study was less than one per second. Because of the results of this study, we expect that the


actual page fault rate that we will experience will be much smaller than the upper bound of 100 pages per second. This should result in a clock rate much lower than 500 pages per second. In addition, the overhead of the clock algorithm will not be as bad as the worst case of 3 percent per 100 pages scanned, because memory is not usually all in use and all referenced. The combination of a lower clock rate and lower overhead should make the actual total overhead of the clock algorithm negligible (between one and two percent).

    8. Conclusion

A virtual memory system has been presented that provides the basic functionality of the Unix 4.2BSD virtual memory system while being simpler, avoiding some of its problems, and providing additional functionality. The simplifications come from using remote servers for paging and a simplified page replacement algorithm. A problem with Unix that is eliminated is the extra page faults required to reclaim pages off of the free list. Sprite eliminates these extra page faults by replacing the free list with a list that contains all of the pages in memory in approximate LRU order. The list is then used to quickly find pages to handle page faults. The extra functionality that Sprite provides is that it allows processes to share writable memory and it reuses old segments to allow fast program startup. The shared writable memory allows high-speed interprocess communication between processes.

The Sprite virtual memory system as described in this paper is fully operational. The rest of the operating system is still under development. We hope that Sprite will be in use by the developers by the end of Summer 1986 and in use in the research community in 1987.

    9. References

[BABA 81a] Babaoglu, O. Virtual Storage Management in the Absence of Reference Bits. Ph.D. Thesis, Computer Science Division, University of California, Berkeley, November 1981.

[BABA 81b] Babaoglu, O., and Joy, W.N. Converting a Swap-based System to do Paging in an Architecture Lacking Page-Referenced Bits. Proceedings of the 8th Symposium on Operating Systems Principles, 1981, pp. 78-86.

[BOBR 72] Bobrow, D.G., et al. TENEX, a Paged Time Sharing System for the PDP-10. Communications of the ACM, Vol. 15, No. 3, March 1972, pp. 135-143.

[CHU 76] Chu, W., and Opderbeck, H. Program Behavior and the Page Fault Frequency Replacement Algorithm. IEEE Computer, Nov. 1976, pp. 29-38.

[CORB 69] Corbato, F.J. A Paging Experiment with the Multics System. In In Honor of Philip M. Morse (edited by Feshbach and Ingard), MIT Press, Cambridge, Mass., 1969, pp. 217-228.


[DENN 68] Denning, P.J. The Working Set Model for Program Behavior. Communications of the ACM, Vol. 11, No. 5, May 1968, pp. 323-333.

[GRIT 75] Grit, D.H., and Kain, R.Y. An Analysis of the Use Bit Page Replacement Algorithm. Proceedings of the ACM Annual Conference, Minneapolis, Minn., 1975, pp. 187-192.

[HILL 85] Hill, M.D., et al. SPUR: A VLSI Multiprocessor Workstation. Computer Science Division Technical Report No. UCB/CSD 86/273, University of California, Berkeley, December 1985.

[KENA 84] Kenah, L.J., and Bate, S.F. VAX/VMS Internals and Data Structures. Digital Press, Bedford, Mass., 1984.

[KING 71] King, W.F. III. Analysis of Demand Paging Algorithms. Proceedings of IFIPS Congress, Ljubljana, Yugoslavia, 1971, Vol. 1, pp. 485-490.

[LEVY 82] Levy, H., and Lipman, P. Virtual Memory Management in the VAX/VMS Operating System. IEEE Computer, March 1982, pp. 35-41.

[MCKU 84] McKusick, M.K., Joy, W.N., Leffler, S.J., and Fabry, R.S. A Fast File System for UNIX. ACM Transactions on Computer Systems, Vol. 2, No. 3, August 1984, pp. 181-197.

[NELS 84] Nelson, M.N., and Duffy, J.A. Feasibility of Network Paging and a Page Server Design. Term project, CS 262, Department of EECS, University of California, Berkeley, May 1984.

[ORGA 72] Organick, E.I. The Multics System: An Examination of Its Structure. MIT Press, Cambridge, Mass., 1972.

[OLIV 74] Oliver, N.A. Experimental Data on Page Replacement Algorithm. Proceedings NCC, 1974, pp. 179-184.

[OUST 85] Ousterhout, J.K., et al. A Trace-Driven Analysis of the 4.2 BSD UNIX File System. Proceedings of the 10th Symposium on Operating Systems Principles, 1985, pp. 15-24.

[REDE 80] Redell, D.D., et al. Pilot: An Operating System for a Personal Computer. Communications of the ACM, Vol. 23, No. 2, Feb. 1980, pp. 81-92.

[THOM 78] Thompson, K. The Unix Time-Sharing System: Unix Implementation. Bell System Technical Journal, Vol. 57, No. 6, July-August 1978, pp. 1931-1946.


    Appendix A

    Implementation of Sharing on VAX and Sun-2 Architectures

Section 3 described the internal representation of segments in Unix and Sprite. This appendix describes how each system either uses or could use its representation of segments to implement sharing on the VAX and Sun-2 architectures. The description of the implementation of sharing includes a description of the memory mapping hardware provided by the VAX and Sun-2 architectures.

    A.1 VAX

The VAX divides the address space of each process into the two regions P0 and P1 [KENA 84]. P0 contains the code and heap, and P1 contains the stack. Page tables for each of these regions are kept in the kernel's virtual address space. There are two registers, P0BR and P1BR, that point to the base of the P0 and P1 region page tables respectively. These two registers must be set up correctly by the operating system whenever a user process begins executing in order to allow virtual address translation to be performed.

The Unix organization of per-process page tables was developed on a VAX; this explains why the page table structure described in Section 3 came about. When a process begins execution, Unix simply sets the P0BR and P1BR registers to point to the process's page tables. Thus the Unix page table structure works very easily on the VAX,


Figure 9. Implementing Sprite on a VAX. In this example Process A and Process B are sharing code but not heap. The page tables for each process and each of the three segments are kept in the kernel's virtual address space. The page tables are allocated such that the code and heap portions each begin on a page boundary. Since Processes A and B are sharing code, the kernel's page table is set up so that the code portion of each process's page table uses the same physical pages as those used by the page table for segment 1, the code segment that they are sharing. In addition, the kernel's page table is set up so that the heap portion of each process's page table uses the same physical pages as the page table for the heap segment that it is using.

    - 23 -

  • 8/12/2019 Virtual Memory for the Sprite Operating System

    25/27

    Virtual Memory for the Sprite Operating System January 5, 1993

but it still has the problem that the code page tables of two processes that are sharing code must be kept consistent.

Sprite has not been implemented on a VAX. However, with a small addition to the segment structure described in Section 3, it can be implemented quite easily (see Figure 9). Each segment is allocated a page table that begins on a page boundary. In addition, each process is allocated one page table for its code and heap, which begins on a page boundary; the heap portion of this page table also begins on a page boundary. The page table entries in the kernel page table that map a process's page table are set up so that the code portion of the process's page table uses the same physical pages as the page table for the code segment that the process is using. Similarly, the heap portion of the process's page table uses the same physical pages as the page table for the heap segment that the process is using. When a process begins executing, the P0BR register is set to point to the page table for the process and P1BR is set to point to the page table associated with the stack segment. Thus there can be multiple virtual copies of segment page tables but only one physical copy.

The advantage of the Sprite scheme over the Unix scheme is that the page tables of processes that are sharing segments automatically remain consistent. The disadvantage is that since the heap portion of a process's page table begins on a page boundary, heap segments must begin on 64K-byte boundaries. Since the virtual address space of a process is so large, this should not be a problem.

    A.2 Sun-2

Figure 10 shows a diagram of the memory mapping unit (MMU) provided by the Sun-2 architecture. The memory mapping mechanism is based on the idea of contexts. Each context is able to map the virtual address space of one process. Since there are N contexts and one is used to map the kernel, N - 1 processes can be mapped at once.


Figure 10. The Sun-2 memory mapping mechanism. The values of N, M, P, and Q on a Sun-2 are 8, 512, 256, and 16 respectively. The Sun-3 architecture uses the same memory mapping mechanism as the Sun-2 with the exception that M is equal to 2048.

Each VAX page is 512 bytes and each page table entry (PTE) is four bytes, so there are 128 PTEs per VAX page. Since each PTE maps 512 bytes, a VAX page full of PTEs maps 128 * 512 bytes, or 64K bytes.


Before a process begins executing on the hardware, a special context register is set to indicate which context to use to translate the process's virtual addresses. A context is broken up into M segments of Q pages each. Each context contains a segment table with M entries, each of which points to a page map entry group (PMEG) with Q entries, one for each page. There are a total of P PMEGs for the whole system. The following procedure is performed when translating a virtual address:

1) The context register is examined to determine which context to use.

2) The high-order bits of the virtual address are used to index into the segment table for the context to determine which of the PMEGs to use. If there is no PMEG for the segment, then a fault is generated.

3) The middle bits of the virtual address index into the PMEG to select one of the page map entries. Each page map entry contains a valid bit, a referenced bit, a modified bit, protection bits, and the physical page frame number. If the valid bit is not set, then a fault is generated.

4) Finally, the physical page frame number is concatenated with the remaining low-order bits of the virtual address to form the physical address.

Unlike the VAX, the Sun-2 MMU just described does not use page tables in the kernel's virtual address space; all of the memory mapping information is kept in hardware. However, since there are a small number of contexts (eight) and PMEGs (256), there is not sufficient hardware to map all processes at once. Therefore page tables need to be kept in software to store the state of processes or software segments when they are not mapped in hardware. In addition, the limited number of contexts and PMEGs must be multiplexed across all processes. The page table structure used by Sprite and Unix is just as described in Section 3. The remainder of this section describes the methods that Unix and Sprite use to manage the contexts and PMEGs.

In Unix, contexts are shared by all processes using an LRU policy: the processes that have contexts allocated to them are the ones that have run most recently. PMEGs are shared by all contexts using a FIFO policy. As a process faults in pages, PMEGs are allocated to the context in order to map the pages. If multiple processes that are sharing code are mapped in contexts at the same time, they will each use different PMEGs as they fault in their code pages; there is no attempt to share PMEGs between processes that are sharing code. Whenever a context is taken away from one process and given to another, any PMEGs allocated to that context are freed.

(For the remainder of this section, the segments used by the Sun hardware will be referred to as hardware segments, and the code, heap, and stack segments used by Sprite will be referred to as software segments.)

In Sprite, as in Unix, contexts are shared between processes using an LRU policy and PMEGs are managed using a FIFO policy. However, unlike Unix, PMEGs are shared between software segments instead of between contexts. When a page fault causes a PMEG to be allocated, in addition to putting a pointer to the PMEG in the hardware segment table, a pointer to the PMEG is stored in a table associated with the software segment in which the page fault occurred. This table of PMEGs is used to make sure that


processes that are sharing a software segment will use the same PMEGs to map the software segment. When a context is removed from one process and given to another, instead of freeing any PMEGs in that context, they remain allocated to the software segments that own them. When a context is allocated to a process, the tables of PMEGs stored with each software segment that the process uses are copied into the hardware segment table.

The Sprite method of managing PMEGs and contexts has two advantages over the Unix method. First, a page fault in a software segment in one process will have the effect of validating the page for all processes that are sharing the software segment. Second, a process that has its context stolen from it may still have PMEGs allocated to it when it begins running again. The disadvantage of the Sprite scheme is that the hardware segment tables of all processes that are sharing software segments have to be kept consistent; if a PMEG is taken away from a software segment of a process in one context, it must be removed from all contexts that are sharing the software segment.