
PARALLELIZING SPICE IN A TIMESHARING ENVIRONMENT

ROBERT OLSON and STEVE MCGROGAN ELXSI, 2334 Lundy Place, San Jose, California 95131

ABSTRACT

SPICE was parallelized for use on the ELXSI System 6400 in a timeshared production environment. The first part of the paper discusses implementation issues in doing the conversion and reports elapsed and CPU times for parallel SPICE on one to eight CPUs in a standalone environment. The second part of the paper considers the implications of competition with other users in a timeshared environment. In particular, a load leveling mechanism was introduced into the parallel tasks to minimize total elapsed time in spite of unpredictable demands by other users on the system.

In most engineering application environments there are a few production programs for which turnaround times map almost directly into the productivity of the engineer. It is often the case that an engineer must wait for the completion of that program before doing whatever is next. SPICE, a circuit simulation program, is one such production program for electronic CAD environments.

The fashion in which a timesharing computer delivers resources to its users must reasonably match the priorities and objectives of its user community. In particular this means that when the system is lightly loaded all the resources should be bent to key application programs such as SPICE, but when the system is heavily loaded the resources should be apportioned in some manner which is specific to the circumstances of the user community.

The ELXSI System 6400 was designed as a high performance multiprocessor specifically for timeshared applications in engineering environments. We parallelized SPICE to utilize an arbitrary number of CPUs to give the minimum turnaround time possible consistent with the priority of the program relative to other programs running in the system. This paper is a report on our experiences in parallelizing SPICE for the ELXSI, some measurements of the effectiveness of that parallelism, and what we did to help a parallelized SPICE behave itself in a system running many timesharing users.

ELECTRONIC CAD WORKLOADS ON A SHARED COMPUTER

Perhaps the most heavily used tools in an electronic CAD environment are the circuit simulators. A description of the circuit, its underlying device characteristics and its initial state are used to simulate circuit operation by evaluating device conductances and node voltages over time. For a given circuit many such simulations are run with different input values. Time steps may be as small as hundreds of picoseconds. A large circuit may require many hours or days for each simulation on a superminicomputer.

The value of such a circuit simulator is obvious, since it allows the engineer to test and modify his circuit many times before it is actually realized. The other side of the coin, however, is that the engineer's project often must wait for the simulations to complete. Decreasing the time for a simulation turnaround will directly increase the productivity of that expensive engineer. This suggests that the simulations should be run on a large scale computer.

Large scale computers are expensive resources, and as such are typically shared by many engineers, perhaps as a computational node on a network or as a timesharing computer. Any shared resource is subject to conflicts between fairness in access and organizational priorities. Whenever a large job such as a circuit simulation dominates a central computer for hours or days there will be conflicts over whose project gets hurt. It is important that the system administration be able to allow multiple people to make progress on their work. Since most jobs wait at some point for I/O or other system services, it also seems important to allow multiple jobs to run in parallel for most efficient utilization of this expensive resource.

In our work on SPICE we felt it was necessary to take into account the reality that other people were using the computer while we were. Furthermore, since the 6400 is a timesharing multiprocessor with dynamic load balancing the load on any one processor is unpredictable, the number of processors available to the task is unpredictable, and the relative priority of the SPICE job is unpredictable. In our implementation of SPICE we introduced an adaptive scheduler which shifted work to whichever of the parallel tasks was getting the CPU cycles. That is, the SPICE tasks performed a measure of load leveling internal to the application. The result is that SPICE uses all the CPU cycles it is allowed to have, consistent with externally imposed relative priorities. This minimizes the elapsed time while living within the constraints of a shared resource.

ELXSI SYSTEM 6400 ARCHITECTURE

As currently designed, the air-cooled ELXSI System 6400 can support one to twelve CPUs, 8M to 192M bytes of main memory and 16M to 64M bytes per second of I/O. All of the main functional units connect to the Gigabus. This synchronous bus clocks 64 bits of data every 25 ns, providing a maximum bandwidth of 320M bytes of data per second. The clock cycle of the current CPU is 50 ns. Each CPU consists of a private writeback cache, an ALU board and an accelerator board. Main memory is global to all CPUs and IOPs.

The instruction set architecture is a fairly standard scalar architecture. All processes have a 32-bit address space available and all addresses are virtual. There are no instructions for addressing physical addresses. All virtual memory is byte-addressable; word, cache block and page boundaries are not visible at the instruction-set level. Words and registers are 64 bits, cache blocks are 32 bytes (four words), and pages are 2048 bytes. Floating point is the IEEE standard. A 64-bit integer multiply takes 250 ns, a 64-bit floating multiply takes 450 ns, and cache reads take 100 ns.

In addition to the standard instruction set there are about 40 operating system functions at the instruction level. These are mainly message-system instructions, but there are also instructions to read the clocks, cause breakpoints, read page map entries, flush the cache and so forth. In addition to instructions there are certain operating system functions which also reside in microcode, such as scheduling within a single CPU. Most of the microcode space is taken up by these higher level functions.


Messages as an Operating System Parallelizing Concept

The native operating system, Embos, was designed to support multiprogramming and multiprocessing on a wide range of hardware configurations. It is partitioned into many independent processes which communicate via messages, thereby achieving parallelism at the resource level. A complete discussion of the mechanism and motivation for the use of messages may be found in [1]. Embos processes can be split between user interface processes, such as the file system servers, and kernel processes, which perform memory management, process creation and destruction and a few related services. The second and third operating systems, a port of AT&T Unix V.2 and Berkeley Unix 4.3, also use messages to communicate with the kernel and can co-exist with the Embos user interface processes. Other operating systems are under consideration. The key point is that the machine and operating system architectures are designed to allow the kernel to migrate user and kernel processes from CPU to CPU without their knowledge. The goal is to make the system look like a uniprocessor to the vast majority of programs for which parallel operation would not add significantly to the throughput of the system.

Central to the success of the system as a multiuser multiprocessor is the use of messages between the operating system processes (and the user processes, for that matter). Operating system and user processes request services via an explicit message rather than through interlocked manipulations of a shared table. Each server deals with requests at its own rate and using its own private data structures, enforcing data abstraction. Microcode is responsible for the actual message delivery, including message queue management, the untimely death of the target of a message and the migration of a process to a different CPU. This allows the operating systems and user processes to easily spread across however many CPUs are present and yet be insensitive to the actual number of CPUs and their allocation. Thus the operating system itself is a parallel processing application which adapts itself to whatever hardware resources are available.

Parallel Processing Primitives

ELXSI provides a procedure-level parallel processing capability through a low level runtime library available to any programmer. The runtime library routines are not integrated directly into the languages but instead are accessed through explicit statements such as the Fortran CALL statement. The ELXSI approach to parallel processing is that the parent process calls an application subroutine which executes in parallel with the parent and possibly several siblings. The subroutine usually operates on data shared with the parent and the sibling processes. Control generally returns to the parent when the subroutine completes, although other options are available to the programmer. Semaphores, spin locks and specialized operating system services are available to help the programmer deal with concurrency and stale cache data problems.

The parent process creates the child process via forking into CPUs chosen either by the operating system or by the application. As a consequence of using the fork primitive, code and data start off identical to the parent process. In the case of read-only data and code all processes will share the same physical memory copy. At the same time the programmer can specify which regions of modifiable memory are to be physically shared between the tasks and which caching discipline is to be used for those shared pages. The child starts executing at an entry point supplied by the parent, notifies the parent when the computation is complete, and then sleeps. The parent may then restart the child task at the same or a different entry point, or it may terminate the child.
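The sketch below assembles these primitives into the lifecycle just described, using the routine names listed in Appendix I. The argument lists, the WORKER entry point and the data sizes are our illustrative assumptions rather than the exact ELXSI calling sequences.

C     Parent-side lifecycle sketch; routine names are from
C     Appendix I, argument shapes are assumed for illustration.
      EXTERNAL WORKER
      INTEGER I
      INTEGER NTASKS, NBYTES
      PARAMETER (NTASKS = 3, NBYTES = 8000)
      REAL SHARED(1000)
      COMMON /SHM/ SHARED
C     Declare the pages holding SHARED to be shared with the
C     future children.
      CALL MT$ShareMemory (SHARED, NBYTES)
C     Fork NTASKS sleeping copies of this process.
      CALL MT$SetupTasks (NTASKS)
C     Wake each child at the WORKER entry point, passing one
C     parameter: its task number.
      DO 10 I = 1, NTASKS
         CALL MT$StartTask (I, WORKER, 1, I)
   10 CONTINUE
C     Sleep until each child signals that it is done.
      DO 20 I = 1, NTASKS
         CALL MT$WaitOnTask (I)
   20 CONTINUE
C     One call from the parent tells every child to kill itself.
      CALL MT$DestroyTasks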


On the ELXSI it is very important to carefully manage the data shared between the tasks. The essence of the problem is that the default caching discipline of the machine makes it possible for a program to unknowingly use obsolete copies of data. The default caching discipline is write-back, which means that data is flushed to main memory only when the CPU needs to reuse that cache block for something else or when the process is migrated. There are no hardware interlocks to prevent stale data; managing the problem is completely the responsibility of the programmer.

Explicit services are provided to allow the programmer to make a page cachable or non-cachable and to flush a certain region of the process's virtual address space from the cache back to main memory. Typical parallel applications combine these services with critical region techniques to efficiently share data between parallel tasks. For example, a child task might explicitly flush its results back to main memory before signalling completion to the parent process. A list of the multitasking runtime routines is in Appendix I.
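For instance, a child's modeling subroutine might end as sketched here before signalling completion; the GVALS region and the byte count are stand-ins for whatever shared data the task actually wrote (we assume 8-byte reals, matching the 64-bit words described earlier).

C     Child-side completion sketch (hypothetical names).
      SUBROUTINE WORKER (ITASK)
      INTEGER ITASK
      REAL GVALS(1000)
      COMMON /RESULT/ GVALS
C     ... compute this task's share of the results into GVALS ...
C     Push the new values out of this CPU's write-back cache to
C     main memory so the parent and siblings see current data,
C     not stale copies.
      CALL MT$FlushMemory (GVALS, 8000)
      RETURN
      END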

SPICE

SPICE itself is very old. It was originally developed for a computer similar to the Control Data 6400. As that system had a small memory in modern terms, great effort was made to conserve memory, even at the expense of runtime. Over time the program was modified and extended to support more device types and larger circuits. The result is a classic example of the worst kind of Fortran application program. That is, the control and data flows are a mess, but the program is too valuable and delicate to change significantly. Since such programs are the rule rather than the exception, SPICE seemed like a good testbed for this experiment.

SPICE performs its simulations by alternating modeling routines with a current matrix solution routine. This alternation occurs for each timestep during the circuit simulation. The modeling routines calculate the new device conductances based on device operating points. There is one model for each type of device, such as diodes, JFETs, MOSFETs, and bipolar transistors. One model must simulate many different operating modes and consequently has many branches and special cases. The matrix solution routine calculates branch currents based on the conductances calculated by the modeling routines. From these currents new node voltages are obtained. This routine uses sparse Gaussian Elimination techniques with a great deal of time spent in traversing the linked lists which describe the sparse conductance matrix. The time required by this routine grows very rapidly and non-linearly with circuit complexity.
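In outline, the alternation looks like the following sketch. The routine names here are placeholders, not SPICE's own, and the convergence iteration SPICE performs within a timestep is omitted.

C     Schematic of the simulation loop described above.
      DO 100 ISTEP = 1, NSTEPS
C        Modeling phase: one routine per device type computes new
C        conductances from the device operating points.
         CALL DIOMOD
         CALL BJTMOD
         CALL JFTMOD
         CALL MOSMOD
C        Solution phase: sparse Gaussian elimination turns the
C        conductance matrix into branch currents, from which the
C        new node voltages are obtained.
         CALL SOLVE
  100 CONTINUE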

Parallelization of SPICE

There were five basic goals for the parallelization effort for SPICE on the ELXSI. First, the elapsed time should be minimized, subject to external constraints such as the right of others to use the machine. Second, the number of changes to the program should be as small as possible. Given the complexity of SPICE, this is only self interest. It also reflects reality for most old, large application programs of any sort. The original programmers and perhaps subsequent generations are long gone and cannot help with major modifications. Third, additional processors on the computer should be immediately useful without recompilation of SPICE. This constraint on the project was introduced to reflect the support requirements of a third party software house. It is certainly undesirable to have different versions of the code for each of the cases of one to twelve CPUs. The fourth goal was that SPICE be able to adapt itself to the

load on the machine. That is, it should be possible for the parallel tasks to shift work among themselves as the workloads on the various CPUs change. Finally, the answers should be the same without regard to the number of CPUs used.

The two areas of SPICE which were parallelized were the device simulation and truncation error calculations. In general the truncation error calculations are small relative to the device simulation, so this paper will concentrate on the former. It should be pointed out, however, that it was trivial to parallelize the truncation error calculations.

As noted above, managing the shared data is critical to the correctness and performance of a parallel application on the ELXSI. The tasks which implement the SPICE device simulation must share the circuit description. This sharing was achieved by placing the COMMON blocks referenced by the device routines in memory which was shared by all the tasks. In the version we parallelized we made these regions non-cachable, since that is straightforward and easy to debug. Parallel HSPICE, a version of SPICE sold by Meta-Software on the ELXSI, was made by Meta to use cacheable memory for higher performance at small additional complexity. To achieve the desired sharing in our version of SPICE only six statements were added to the single SPICE subroutine LOAD.

A prompt was added to ask the user for the number of processors to be used.

This allowed us to measure the relationship between the number of CPUs and the total runtime without actually modifying the software or the number of physically present CPUs. In the normal case SPICE automatically adapts itself to the number of CPUs present.

The subroutine LOAD creates identical copies of SPICE in the other processors before calling the normal SPICE modeling routines. This creation required one FORTRAN subroutine call on the ELXSI. LOAD then starts the multiple copies of the modeling routines which it has just created. Approximately twenty additional lines of code were required at this point. The modifications were made in such a way that the original sequential SPICE can be run if the user specifies that no parallel processes are desired. In this way it is easy to verify that a given circuit produces the same results in sequential or parallel modes.
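The resulting control flow in LOAD can be sketched as follows. The names are ours, and whether the parent itself runs a copy of the models while the children work is an assumption.

C     Sketch of LOAD's dispatch (hypothetical names).
      IF (NCPUS .LE. 1) THEN
C        Original sequential path, kept intact for verification.
         CALL MODELS
      ELSE
C        Restart each sleeping child at the modeling driver.
         DO 10 I = 1, NCPUS - 1
            CALL MT$StartTask (I, MODELS, 0)
   10    CONTINUE
C        The parent takes a share of the work, then waits for
C        every child to signal completion.
         CALL MODELS
         DO 20 I = 1, NCPUS - 1
            CALL MT$WaitOnTask (I)
   20    CONTINUE
      ENDIF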

The conductance matrix is updated as a result of the device simulations. To keep these concurrent updates consistent, the matrix is "locked" during the interval that a parallel device simulation routine is updating it. This locking required two additional FORTRAN CALL statements in each of the modeling routines, a total of eight statements. A lock can be established and released in less than five microseconds, so this introduces only minimal overhead to the modeling routines. Since lock periods are short compared to the runtime of the model and the maximum number of CPUs is relatively small, the initial lock collisions tend to spread the parallel tasks such that subsequent lock collisions are rare and thus do not significantly affect the elapsed time of the program.
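Concretely, the two added statements bracket the matrix update in each modeling routine, with the lock structure created once at setup; MTXLCK is an illustrative name.

C     Once, during setup, in memory shared by all tasks:
      CALL MT$SetupLock (MTXLCK)

C     Then, in each modeling routine, around its matrix update:
      CALL MT$Lock (MTXLCK)
C     ... add this device's conductance terms to the shared
C     matrix; the update statements themselves are unchanged
C     from sequential SPICE ...
      CALL MT$Unlock (MTXLCK)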

Total modifications to SPICE to allow parallel execution of all four device types on an arbitrary number of processors consisted of thirty changed lines and about 300 added lines. Most of the added lines were debugging statements which were later removed. The total changes and additions represent about one percent of the original SPICE program. After only three days of effort, the code was run and functioned properly following one minor error correction. This compares quite favorably with the time involved and problems encountered in efforts to vectorize SPICE. In those efforts many thousands of lines needed to be rewritten and major algorithm changes were required [2][3].

Performance of Parallel SPICE

Circuits were simulated on one to eight processors with generally pleasing results. Device simulation time (i.e., the parallel part) was essentially inversely proportional to the number of processors, indicating that there was minimal contention for shared resources and minimal dispatching overhead. A small amount of serial time is spent in the LOAD routine itself, preventing the modeling performance from being absolutely linear. Table I illustrates the elapsed modeling time for three different circuits with one to four processors. The number in parentheses is the speedup, where speedup is calculated as (one CPU time) / (actual).

TABLE I. Elapsed time for model processing (seconds; speedup in parentheses)

Circuit type        1 CPU   2 CPUs        3 CPUs        4 CPUs
128 MOS devices      1009   515 (1.96)    352 (2.87)    273 (3.70)
200 JFET devices      408   228 (1.79)    169 (2.41)    143 (2.85)
200 BJT devices       379   209 (1.81)    141 (2.69)    117 (3.24)

In parallel processing the ultimate elapsed time attainable is limited by the amount of overhead time not spent in parallel operation. Even given an infinite number of processors, certain functions which are not performed in parallel, such as reading the circuit description, will become the limiting time factor. Careful attention to peripheral routine performance reduced the non-parallel time for a 1000 second MOSFET simulation to only 31 seconds. Of paramount importance was the performance of the Gaussian elimination and LU decomposition routines. These were modified using techniques which use an auxiliary array to remove all the linked list searching that normally occurred. In addition, algorithmic modifications removed almost all the division operations and most of the memory stores. Together these changes reduced the time spent in the routines by almost a factor of ten at the expense of changes to a few dozen lines.
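We do not reproduce the original code here, but the auxiliary-array idea can be sketched as follows: walk the linked lists once, record the locations visited in a flat array, and let every subsequent factorization sweep that array directly. All names in this sketch are hypothetical, and FAC and PIVVAL stand in for the pivot arithmetic.

C     Hypothetical sketch of the auxiliary-array technique.
C     One symbolic pass records the element locations in order:
      K = 0
      IPTR = ROWHD(I)
   10 IF (IPTR .EQ. 0) GO TO 20
      K = K + 1
      OPLIST(K) = IPTR
      IPTR = NEXT(IPTR)
      GO TO 10
   20 NOPS = K
C     Every later numeric factorization indexes the flat array
C     directly, with no list searching:
      DO 30 K = 1, NOPS
         VALUE(OPLIST(K)) = VALUE(OPLIST(K)) - FAC * PIVVAL(K)
   30 CONTINUE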

The elapsed time performance of the whole application is shown in Table II for a circuit with 128 MOSFET devices. The "efficiency" value is the theoretical best time (the single-CPU time divided by the number of CPUs) divided by the actual time. In some sense it shows how effectively the additional CPUs are being used. Note that the "inefficiency" includes time where the application is serial, so that the CPU cycles on the other processors are available for work on behalf of some other job and not simply lost.

TABLE II. Elapsed time for whole application

number of CPUs         1     2     3     4     5     6     7     8
elapsed time (sec)  1111   597   433   345   296   264   246   232
speedup             1.00  1.86  2.57  3.22  3.75  4.21  4.52  4.79
efficiency (%)       100    93    86    81    75    70    65    60
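In symbols, with T_1 the single-CPU elapsed time and T_N the elapsed time on N CPUs:

\[
  \mathrm{speedup}(N) = \frac{T_1}{T_N}, \qquad
  \mathrm{efficiency}(N) = \frac{T_1/N}{T_N} = \frac{\mathrm{speedup}(N)}{N}
\]

For example, at N = 2 the table gives speedup = 1111/597 = 1.86 and efficiency = 1.86/2 = 93 percent.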

About half of the non-parallel time was spent reading in the circuit description, building the internal matrices and printing the results of the simulation. Much of the remaining time was spent in overhead introduced when the LOAD routine had to call several copies of each modeling routine

instead of just one. A small amount of parallel overhead was introduced in the internal dispatching scheme described below, and a small amount was introduced by some timing routines.


One of the great strengths of the ELXSI is its ability to process very large programs. To test SPICE with increasingly large circuits, a test circuit was repetitively doubled in size. We were interested to see how our parallel code would perform on circuits normally tackled only by supercomputers. On a 4 CPU, 32 Megabyte ELXSI with an average of about 30 interactive users we obtained the results in Table III. The circuit simulation took over 4000 time steps.

TABLE III. Models of Large Circuits

size (MOSFETs)   modeling seconds, 2 CPUs   modeling seconds, 4 CPUs
   128                    501                       --
   256                   1037                      498
   512                   2015                     1113
  1024                   4055                     2181
  2048                   8122                     4121
  4096                  16230                     8562
  8192                  32300                       --

The elapsed times scaled as well. Since the benchmarks reported in Table III were run during weekdays on a shared computer with an uncontrolled workload, the exact numbers are meaningless and thus not reported. As a point of interest, in this environment we found that the application got around half of the total CPU cycles, with the other half going to operating system and other user processes.

PARALLEL SPICE IN A PRODUCTION ENVIRONMENT

Scheduling policies on a load balancing multiprocessor like the ELXSI significantly complicate the accomplishment of the goal of minimum elapsed time for a parallel application. This is due to the unpredictability of the availability of CPU cycles and other machine resources.

ELXSI Process Scheduling and Migration

Making the most efficient use of an expensive shared resource requires that the objectives and priorities of the users be considered in scheduling policies. This is normally accomplished in modern operating systems by allowing users and administrators to attach external priorities to work to be done. The operating system then executes the scheduling policies to make sure that the "most important" work is always being performed. Crafting the scheduling policies themselves is a very delicate business. There is a need to balance fairness, that is, equal access for equal priorities, with efficiency, that is, better utilization of the resources and low cost scheduling decisions.

A moderately detailed understanding of the three levels of process scheduling algorithms on the ELXSI is important to understanding the motivation for the adaptive nature of parallel SPICE. The direct effect of those algorithms is to make the ELXSI responsive to a wide variety of environmental considerations, but a side effect is to make it difficult to forecast the availability of CPU cycles on any one CPU. This in turn affects how a programmer for a parallel application partitions his problem.

Within a CPU there are fifteen sets of process contexts which are managed by microcode. The microcode always runs the highest priority ready process which is in a register set. Whenever a message comes in, the microcode switches processes if that message would result in a higher priority process being ready to run. Context switches of this sort take about 10 microseconds. A process becomes blocked (i.e., not ready) when it waits for a message. Operating system functions such as a page fault or I/O are constituted as messages to allow the CPU microcode to have a reasonably simple view of the world. The use of microcode to implement this very simple scheduling algorithm significantly improves the efficiency of the CPU by drastically reducing the number of times software processes run to reschedule the CPU.

Processes are assigned to a register set by a software process called the Register Set Manager (RSM). This assignment is based on process priority within a list of processes assigned to this CPU. The RSM for a CPU is always ready in a register set at the lowest possible priority, so if all processes currently assigned to register sets go idle, the RSM is awakened. However, the quantum fault message coming in from the timer will cause the RSM's priority to be changed to a very high priority. Generally this will preempt the process which is running, to allow the RSM to reshuffle the register sets and implement a "fairness" policy for access to the CPU. Since the microcode handles the majority of scheduling decisions we can afford slightly more elegant (and therefore more expensive) scheduling at this level.

The software process known as the Process Manager (PM) is responsible for "fairly" balancing the load across the CPUs. The RSMs periodically send the PM a message detailing the current load on their CPU. The PM uses this information along with some global considerations to rebalance the load between the CPUs by migrating processes from one CPU to another. Migrating a process costs three to four milliseconds but is relatively rare, so intelligent decisions can be afforded.

Generally, user processes run at timesharing, batch or background levels, where all the processes at each level have roughly the same priority numbers. Kernel processes typically run at a real-time priority which is below the priority level of the RSM when the RSM is processing quantum faults, but above most user processes. This means that the RSM and the PM can together migrate an operating system process from one CPU to another in the same fashion they migrate user processes. It is also possible for user processes to run at real-time priority either below or above that of the RSM. Among other things, this means that the RSM might not be able to reschedule a CPU until the real-time process completes its work.

For total timesharing throughput the ELXSI architecture is a success, at least up to the number of CPUs we can support. In [4] Sanguinetti reports measurements of a fixed and a progressive workload on one to ten CPUs. The workloads were selected to be representative of CAD timesharing environments. The conclusions he draws from the measurements are that system overhead actually decreases in the fixed workload and that overhead rises less than linearly in the progressive workload. This success is largely due to designing the operating system and the scheduling policies and mechanisms for a timeshared multiprocessor environment.


The current scheduling policies tend to schedule processes as independent entities. For example, a process is considered I/O bound or CPU bound based on the number of times that individual process waits for a message to arrive. No consideration is given to the possibility that several processes are working together and should be considered together when scheduling decisions are made. This is a generally successful simplifying assumption, but in the case of parallel applications it can result in unexpected distortions in the behavior of the various parallel tasks.

Impact of Scheduling Policies on Parallel Applications

In a simple standalone environment, where most parallel processing research is performed, all of the resources of the machine are ready to service the parallel processes. In production environments the parallel processes must compete with other demands on the system resources. As described above, in the ELXSI case the CPUs are micro-scheduled independently based on local conditions. This will result in the parallel tasks on one CPU getting less or more favorable treatment than tasks on another CPU. Indeed, there are ways a system administrator can manipulate the system to deliberately cause this effect. In the case of SPICE we tried to anticipate those variations in treatment.

There are a large number of circumstances under which tasks on one CPU might get a significantly different number of CPU cycles than tasks on another CPU. For example, if the application program were running in detached, batch or background mode it would be competing with higher priority interactive processes. Allocation of memory or disk bandwidth to the interactive process might affect one task more than another. Similarly, operating system tasks such as the memory manager are not uniformly distributed across the CPUs. The task unlucky enough to share a CPU with a busy operating system process would get less of its CPU than a process running on an unloaded CPU. Migration of processes on and off of a CPU would also affect the availability of CPU cycles. An even more subtle effect would be contention for disk bandwidth between child processes. It is unreasonable to assume that even a single file is uniformly distributed across the available disks, so disk latency might cause one process to be delayed more than another. The scheduling algorithms required to administer truly fair access to the system resources under these conditions would need to be nightmarishly complex. It is much cheaper for the system as a whole if the application itself schedules its own work to load level between the parallel tasks.

In our parallel SPICE each modeling process shares a variable with its siblings. This variable points into a list of devices to be simulated. When a modeling process starts to model a device, it updates this variable to point to the next device on the list. In this manner the parallel tasks automatically load level with other tasks running on the ELXSI. If one processor is partially occupied with another user's job or with operating system services, that processor will not take devices from the list as quickly as another. This sharing technique is much less sensitive to system load than rigidly dividing the number of devices by the number of processors and performing fixed device assignment. Of course, there are many techniques for having multiple servers share a single work queue. We chose this one because it was simple and satisfactory for this experiment. The key point is that by using this simple load leveling technique we helped our parallel application adapt itself to a shared system.
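A sketch of the dispatch loop each task runs, under our simplifying assumption that the shared variable is an index protected by a lock (the actual code points into the device list, as described above; all names are hypothetical):

C     Per-task self-scheduling sketch. NEXTDV lives in memory
C     shared by all the tasks.
  100 CALL MT$Lock (QLOCK)
      IDEV = NEXTDV
      NEXTDV = NEXTDV + 1
      CALL MT$Unlock (QLOCK)
      IF (IDEV .GT. NUMDEV) GO TO 200
C     Model device IDEV; the conductance matrix update inside is
C     bracketed by its own lock, as described earlier.
      CALL MODEL1 (IDEV)
      GO TO 100
  200 CONTINUE

A task on a busy CPU simply claims devices more slowly; no fixed partition of the work is ever computed.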


On the ELXSI it is possible for the number of CPUs available to an application to vary dynamically. This is accomplished through a concept we call CPU classes. A class can include one or more CPUs and can be changed without rebooting the system. An application or a user can be restricted to a particular CPU class. This might be used to boost responsiveness during daylight hours at the expense of batch throughput by restricting the CPUs available to batch during daylight hours, then opening things up in the evening. Alternatively, a realtime user might take over a CPU class for a while, then return it to the system for general use. There are probably other exotic uses for dynamically varying the CPUs available to application programs. For a parallel application, this means that the number of CPUs can vary. The load leveling mechanism we have introduced into SPICE ensures the minimum possible elapsed time even under such unpredictable circumstances.

SUCCESS OF THE EXPERIMENT

We were gratified by the success of our parallelization effort. The changes were made and debugged in a few days. A total of thirty lines of SPICE were changed and three hundred were added. Most of the added lines were debugging code. To be sure, one of the authors was very familiar with SPICE from other work, but it is not an unreasonable expectation for any heavily used application that somebody will understand how it works. The resulting program was as robust and easily maintained as the original, with one version for both sequential and parallel operation. The elapsed time was minimized, even with other unpredictable activity on the system to interfere with the tasks. Additional processors could be utilized immediately without changes to the program. The only goal which we missed was the fifth, the requirement that the parallel program yield exactly the same result as the sequential version. After a couple of days of intense debugging (and mild panic) we discovered that SPICE 2G5 has some uninitialized variables which had different values in the parallel and sequential cases. We left the bug in so that the sequential version would get the same (slightly wrong) answers as it would get on other computers.

The most valuable thing for us was the experience we gained in producing a parallel application for real world usage. As noted above, another version of SPICE was parallelized by a customer and is now being offered commercially on the ELXSI. This is a significant milestone as parallel processing moves out of the laboratory into commercial applications.

ACKNOWLEDGMENTS

We had much help on this project. Dianne Kennedy implemented the procedure level parallel processing facilities in Embos. Bob Hedges encouraged the project and gave Steve the time to work on it. As usual, there were a lot of people who had a role in building and explaining the underlying operating system and machine architecture.

REFERENCES

1. R.A. Olson, Parallel Processing in a Message Based Operating System. IEEE Software, Vol. 2, No. 4, pp. 39-49 (July 1985).

2. A. Vladimirescu and D.O. Pederson, Circuit Simulation on Vector Processors. IEEE International Conference on Circuits and Computers, New York, (Oct. 1982).


3. S. McGrogan and G. Tarsy, Vector Enhancement of a Circuit Simulation Program. Proceedings, Symposium on CYBER 205 Applications, Colorado State University, Fort Collins, Colo., (August 1982).

4. J. Sanguinetti and B. Kumar, Performance of a Message-Based Multiprocessor. Proceedings of the 12th Annual International Symposium on Computer Architecture, Boston, Massachusetts, (June 1985).

APPENDIX I. Multitasking runtime routines

MT$ShareMemory: A parent process can define specific virtual pages to be shared among the parent and its future children.

MT$SetupSemaphore: Creates a semaphore in the user's heap space. The user must specify the semaphore name and the number of parallel processes that can wait for the semaphore at any given time.

MT$SetupLock: Creates and initializes a binary lock structure in the user's heap space.

MT$SetupTasks: Creates N user-specified child processes, establishes communication between them (every process links to every other process), and puts each child to sleep after it is created.

MT$StartTask: Awakens a specified process, tells it to begin execution at some location (i.e. subroutine) and passes parameters. The task number of the process which is to start, the location at which the execution is to begin, the number of parameters required by the subroutine, and the parameters all must be passed to the intrinsic.

MT$WaitOnSemaphore: Increments the counter which indicates the number of processes waiting for that semaphore. If the counter is greater than zero, the process is added to the wait queue and put to sleep.

MT$SignalSemaphore: Decrements the counter which indicates the number of processes waiting for that semaphore. If one or more processes is waiting, the first process in the queue is awakened.

MT$Lock: Causes the calling process to wait in a tight spin loop until the lock is unlocked, then locks it.


MT$Unlock: Unlocks a lock, but the intrinsic does not check if the lock is in the "locked" state when it does the unlock.

MT$WhatTaskAmI: Returns the task number of the calling process.

MT$FlushMemory: Flushes a specified region of a cache back to memory. The user must specify a starting address and the number of bytes to be flushed.

MT$FlushMemoryAllSharers: Flushes a specified region of ALL caches back to memory. The user must specify a starting address and the number of bytes to be flushed.

MT$IsCacheablePage: Returns a Boolean value which indicates if the specified address is cacheable. This is useful for debugging.

MT$IsSharedPage: Returns a Boolean value which indicates if the specified address is shareable. This is useful for debugging.

MT$WaitOnTask: Checks the status of a specified process. If the process has not completed execution, the calling process goes to sleep. When the process completes execution, it awakens the sleeping process.

MT$ChildDone: Returns a boolean indicating whether the specified task has completed execution. Does not wait.

MT$DestroyTasks: Sends a message to each parallel process to kill itself. This intrinsic should be executed only by the parent. All parallel processes except the parent are destroyed by one call to this intrinsic.

MT$LockStatus: Returns the current state of a lock data structure.

MT$SemaphoreStatus: Returns the current state of a semaphore data structure.

MT$PageState: Returns the state of a specified page ("cacheable," "shared," "in physical memory").