
PROCEEDINGS OF THE IEEE, VOL. 54, NO. 12, DECEMBER 1966

A Survey of Problems and Preliminary Results Concerning Parallel Processing and Parallel Processors

M. LEHMAN, MEMBER, IEEE

Abstract: After an introduction which discusses the significance of a trend to the design of parallel processing systems, the paper describes some of the results obtained to date in a project which aims to develop and evaluate a unified hardware-software parallel processing computing system and the techniques for its use.

1. MULTIPROGRAMMING, MULTIPROCESSING, AND PARALLEL PROCESSING

A BRIEF review of the literature, of which a partial listing is given in the attached bibliography, reveals an active and growing interest in multiprogramming, multiprocessing, and parallel processing. These three terms distinguish three modes of usage and also serve to indicate a certain historical development. We cannot here attempt to trace this history in detail and so must rely on the bibliography to credit the contributions from industrial, university, and other research and development organizations.

The emergence of autonomous input-output devices first suggested [1] the time-sharing of the processing and peripheral units of a computing system among several jobs. Thus surplus capability that could not be applied to the processing of the leading job in a batch processing load, at any stage of the computation, could be usefully applied to successor jobs in the work load. In particular, while any computation was held up for some I/O activity, the single main processor could be used for other computation. The necessary decision-taking, scheduling, and allocation procedures were vested in a supervisor program, within which the user-jobs were embedded, and the resultant mode of operation was termed Multiprogramming.

The use of computers in on-line control situations and for other applications giving rise to ever more stringent reliability and availability specifications resulted in the construction of systems including two or more central processing units [2]-[5]. Under normal circumstances, with all units operational, each could be assigned a specific activity within an overall control program. As a result of the multiplicity of units in such Multiprocessing Systems, failure of any one would degrade, but not immobilize, the system, since a supervisor program could re-assign activities and configure the failed unit out of the system. Subsequently, it was recognized that such systems had advantages over a single processor system in a more general environment, with each processor in the system having a multiprogramming capability as well.

Manuscript received July 1, 1966; revised August 23, 1966. The author is with the IBM Thomas J. Watson Research Center, Yorktown Heights, N. Y.

Finally, following from ideas first exploited in the Gamma 60 Computer [6], there has come the realization that multi-instruction counter systems can speed up computation, particularly of large problems, when these may be partitioned into sections which are substantially independent of one another, and which may therefore be executed concurrently, that is, in parallel. When the several units of a multiprocessing system are utilized to process, in parallel, independent sections of a job, we exploit the macro-parallelism [7] of the job, which is to be distinguished from micro-parallelism [7], the relative independence of individual machine instructions, exploited in look-ahead machines. This mode of operation is termed Parallel Processing and, as in PL/I [8], the execution of any program string is termed a Task. We note that parallel processing may, and normally will, include multiprocessing activity.

2. THE APPROACH TO PARALLEL PROCESSING SYSTEM DESIGN

In the previous section we indicated that the prime impetus for the development of parallel processing systems arose from their potential for high performance and reliability. These systems may operate as pools of resources organized in symmetrical classes, and it is this property that promises High Availability. They also possess a great reserve of power which, when applied to a single problem with the appropriate degree of parallelism, can yield high performance and fast turn-around time. Surplus resources can be applied to other jobs, so that the system is potentially efficient, displaying a peak-load averaging effect and hence high utilization of hardware [9]. The concept of sharing in parallel processing systems and its related cost reduction is not, however, limited to hardware. Perhaps even more significant is the common use of data-sets maintained in a system library or file, and even concurrent access during execution from a high-speed store. This may represent considerable economy in storage space and in processing time for I/O and internal memory-hierarchy transfers. But above all [9] it facilitates the sharing of ideas, experience, and results and a cross-fertilization among users, a prospect which from a long-term point of view represents perhaps the most significant potential of large, library-oriented, multiprocessing systems.


Finally, in this brief summary of the basic advantages of parallel processing systems, we refer to their intrinsic modularity, which may yield an expandable system in which the only effect of expansion on the user is improved performance.

Adequate performance of parallel processing systems is, however, predicated on an appropriately low level of overhead. Allocation, scheduling, and supervisory¹ strategies, in particular, must be simplified and the related procedures minimized to comprise a small proportion of the total activity in the system. The system design must be based on performance objectives that permit a user to specify a time period and a tolerance within which he requires and expects to receive results, and the cost for which these will be obtained. In general the entire system must yield minimum throughput time for the large job, adequate response time to terminal requests in conversational mode, guaranteed throughput time for real-time tasks, and minimum-cost processing for the batch-processed small job. These needs require the development of an executive and supervisory system integrated with the hardware into a single, unified computing system. Finally, the techniques and algorithms of classical computation, of problem analysis, and of programming must be modified, and new, intrinsically parallel procedures developed, if full advantage is to be gained from exploitation of these parallel systems.

Our studies to date represent but a small fraction of the ground that will have to be covered if effective parallel processing systems are to come into their own. It is, however, abundantly clear that such systems will yield their potential only if the design is approached on a broad but unified front ranging from problem analysis and usage techniques, through executive strategies and operating systems, to logic design and technology. We therefore present concepts and results from each of these areas, as obtained during our preliminary investigation into the design and use of parallel processing systems.

3. LANGUAGE

3.1 Parallelism in High Level Languages

The analysis of high level language requirements for parallel processing has received considerable attention in the literature. We may refer in particular to the paper by Conway [10], which discussed the concepts of Fork, Join, and Quit, and the recent review by Dennis and Van Horn [11].

Recognizing that programming languages should possess capabilities that express the structure of the computational algorithm, Schlaeppi [12] has proposed augmentations to PL/I-like languages that portray the macro-parallelism in numerical algorithms. These in turn have been reflected in proposals for machine-language implementation. As examples we discuss Split, Terminate, Assemble, Test and Set or Wait (interlock), Resume, Store-Test and Branch, and External Execute instructions. We describe here only the basic functional elements from which machine instructions for an actual realization will be composed, as suggested by practical programming experience.

¹ We differentiate intuitively between executive and supervisory activities. The former are those whose costs should be chargeable to the individual user directly, whereas the latter are absorbed in the system running costs.

3.2 Machine Level Instructions for Tasking

Split provides the basic task-generating capability. It indicates that, in addition to continuing the execution of the present instruction string in normal fashion, a new task, or set of tasks, may be initiated, execution starting at a specified address or set of addresses. Such potential tasks will be queued to await pick-up by an appropriate processing unit.

Terminate causes cessation of activity on a task. The terminating unit will, of its own volition, access an appropriate queue to obtain its next task. Alternatively, it may execute an executive allocation-task to determine which of a number of task-queues is to be accessed next according to the current urgency status of work in the system.

Assemble permits the merging of several tasks. The first (n-1) tasks in an n-way parallel set belonging to a single job that reach the assemble instruction terminate. The nth task, however, will proceed to execute the program string which constitutes the continuation of all n tasks.

Test and Set or Wait provides an interlock facility. Thus a number of tasks all operating on a common data set may be required to filter through certain sections of program or data, one at a time. This may be achieved by an instruction related to the S/360 test and set instruction [13], but causing the task finding the specified location to be already set to go into a wait state. System efficiency requires that processors do not idle, so that the waiting task will generally be returned to queue and the processor released for other work.

Resume directs a processor or processors waiting as a result of a test on a specified location to proceed or, more generally, directs that specified waiting tasks that have been returned to queue be re-activated to await the spontaneous availability of an appropriate processor.

Test and Branch Storage Location permits communication between parallel tasks based on tests analogous to the register tests of uniprocessors, but associated with the contents of storage locations. This is desirable since processor registers are private to the processor and inaccessible from outside.

External Execute is a special case of the general interaction facility discussed in Section 4 that permits related tasks to influence one another. This could be achieved through the application of instructions already discussed. It is, however, more efficient to provide a new facility akin to the Interrupt concept. By applying this Interaction function, a task may cause other specified tasks to execute an instruction at a specified location, each on completion of its present instruction. Thus, for example, a number of processors searching for a particular item in a partitioned list can be caused to abandon the search when the item has been located by one, while processors searching for other items, or otherwise busy, will not be redirected.
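To make the flavor of these primitives concrete, the sketch below (not from the paper) maps them onto present-day threading constructs, with a shared work queue standing in for the system task queues. All names, and the use of Python's threading and queue modules, are illustrative assumptions.

```python
# Illustrative mapping of the tasking primitives of Section 3.2 onto a shared
# task queue and a pool of worker "processors".  A sketch only: the real
# proposals are machine-level instructions, not library calls.
import queue
import threading

task_queue = queue.Queue()      # the system work-queue complex
interlocks = {}                 # locations used by Test and Set or Wait

def split(*entry_points):
    """Split: queue one or more new tasks; the issuing task continues."""
    for entry in entry_points:
        task_queue.put(entry)

def processor():
    """A processing unit.  Terminate is modelled by simply returning to the
    queue for the next task instead of idling."""
    while True:
        task = task_queue.get()
        if task is None:        # sentinel used here to shut the unit down
            return
        task()                  # run until the task terminates

def assemble(rendezvous, n, continuation):
    """Assemble: the first n-1 tasks to arrive terminate; the n-th executes
    the common continuation of all n tasks."""
    with rendezvous["lock"]:
        rendezvous["arrived"] += 1
        is_last = rendezvous["arrived"] == n
    if is_last:
        continuation()

def test_and_set_or_wait(location):
    """Interlock: tasks filter through a protected region one at a time.  A
    real system would requeue the waiting task rather than block the unit."""
    interlocks.setdefault(location, threading.Lock()).acquire()

def resume(location):
    """Resume: release a task waiting on the specified location."""
    interlocks[location].release()
```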


4. INTERACTION

4.1 The Interaction Concept

An extension of the task interaction concept introduced in the preceding section is fundamental to efficient parallel processing. In the particular example cited, the interaction, in the form of an external execute instruction, forms part of the computational procedure. In fact, many other situations arise in which processing for inter-task communication may be detached from problem processing and carried through concurrently in autonomous units, thereby increasing system utilization.

We therefore propose to associate with each active unit in the system an autonomous Interaction Controller. Groups of controllers are linked by a special bus. This provides facilities whereby any one unit may, at a given time, act as a command or signal source with all other units potential recipients. By thus systemizing inter-unit communication and making it a concurrent activity, we both increase system utilization and remove a maze of interconnecting cables. Succeeding subsections describe some of the functions that the controllers fulfill and, briefly, one hardware proposal for their realization.

4.2 Interaction Activities

In present-day systems there already exist activities of the type to be classified as interaction. Thus, for example, in System/360 we find a CPU-to-Channel Halt I/O facility, channel interruptions of processors, and timer interruptions. In extending the concept we differentiate among three classes of interaction.

Problem Interaction: These relate to logical dependencies between tasks, and will generally require waits, forced branches, or terminations. Search termination, previously discussed, is an example of this type of interaction, as are data and instruction-sequence interlocks.

Executive Interaction: This activity is concerned primarily with the allocation of system resources. Consider, for example, the problem of processing interrupts in a parallel processing system. These will usually not need to interrupt a computing activity, but may await the spontaneous availability of a unit at a Terminate, a natural break-point.² If an interrupt does become critical it should not be applied to a specific physical unit. Instead the interruption should be steered to that unit which, by virtue of the work it is processing, may be classed as Most Interruptable. Selection of the latter may be obtained ahead of time and is maintained by the interaction system, on the basis of the relative urgency of tasks.

² This is possible in a parallel processing system since tasks are smaller than jobs and since there are many processors. Furthermore, units operate anonymously. That is, on picking up a task, a unit records the task identity in an internal register and its own identity in a table associated with the work queue. Other processors do not, therefore, know how tasks and processors are matched at any time, since this is a matter of chance, and determination would require an extensive and wasteful table search.
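The steering policy can be pictured with a small sketch (an illustration, not the paper's mechanism): the interaction system keeps an urgency value for the task each processor is running and directs a critical interrupt to the least urgent, i.e. Most Interruptable, unit. The table and its fields are assumptions introduced here.

```python
# Sketch of "Most Interruptable" selection.  urgency maps processor id to the
# urgency of its current task (higher = more urgent, so less interruptable);
# in the proposed design this selection is maintained ahead of time by the
# interaction controllers rather than recomputed on demand.
def most_interruptable(urgency):
    return min(urgency, key=urgency.get)

def steer_critical_interrupt(interrupt, urgency, post):
    """post(processor_id, interrupt) is a hypothetical delivery hook."""
    post(most_interruptable(urgency), interrupt)
```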

Another example of executive interaction concerns the constant provision of queue status information to all active units. Besides simplifying scheduling activity, this may prevent units from accessing empty queues, reducing both storage and executive interference. Similarly, units can be caused to access a previously empty queue when an entry is made, obviating continuous testing of queue status.

The interaction system also supports other activities associated with accounting, recording, and general system supervision.

System Interaction: System interaction provides controls and interlocks for operation and maintenance of the physical system. It includes, for example, interchange of information between active units about the validity of storage map entries, storage protection control, queue interlocks, checks and counts of unit availability, the initiation of routine and emergency diagnostic and maintenance activity, and the isolation of malfunctioning units.

Summary: The preceding paragraphs have indicated some of the many applications of an interaction controller. The common property which, for practicality, has been used to identify potential interaction activities is that they should be autonomous relative to the main computational stream and that their execution should not require access to storage.

4.3 The Interaction Controller

4.3.1 The Basic System Hardware Architecture: It is not intended to give a full description of an interaction controller in the present paper. We shall, however, outline its basic structure, indicate its mode of operation, and list some of the proposed interaction instructions, termed Directives.

As a first step we introduce, in Fig. 1, a diagrammatic description of an overall representative hardware system. This consists of central processors (Pi) with local storage (LSi), I/O processors (SCi), storage modules (Si), a requestor-storage queue (Qi), and a communication system functionally equivalent to a crossbar switch. An I/O area, including a bulk store, files, channels (Ch), devices, device control units (Cu), and interconnection networks, is indicated in less detailed fashion.

4.3.2 Interaction Controllers: Interaction controllers (IC) are associated with all central and I/O processors, and communicate with each other over a special bus. Similarly, localized interaction systems may provide a facility for certain classes of I/O units or devices to interact amongst themselves.

To be economically feasible, the Interaction Controller must be simple. Figure 2 illustrates a structure which includes about two hundred and fifty bits of storage, of which about half are organized in registers. The remainder are used as status bits or appear in the controller-processor interface. Control is obtained from a read-only store, whose capacity depends on the size of the directive repertoire (an interaction directive being analogous to a processor instruction) and the number of interaction functions it is required to implement.

Controller connection to the ten-bit-wide interaction bus is by means of OR gates. When an interaction is occurring, one and only one controller will be in command of the bus.


Fig. 1. A representative system hardware configuration.

Fig. 2. The interaction controller.


Figure 3 illustrates the sequence of events required to implement an interaction. The controller required by its associated processor to initiate an activity will await availability of the bus, indicated by an ALL ZERO state, and will then attempt to seize control by transmitting a unique identifying four-out-of-eight code.

Fig. 3. The interaction sequence.

Should more than one controller attempt to seize the bus at the same time, a conflict resolution procedure is initiated. This is based on the simultaneous transmission by all requesting controllers of a second, two-byte, identifying code. Each byte consists of one or more ONES followed by all ZEROS. A simple comparison by each controller of its transmitted signals with the state of the bus identifies to itself the controller having the most ONES in each byte, since that controller alone will have found a match on both comparisons. This enables it to seize the bus and to switch to the command state. All remaining controllers remain in the listening state.
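The comparison rule lends itself to a compact sketch (below, an illustration under stated assumptions, not the controller's logic design): the bus behaves as a wired OR of everything transmitted, and a controller stays in contention only while what it reads back equals what it sent, byte by byte.

```python
# Sketch of the conflict-resolution step: each requesting controller drives a
# two-byte identifying code (one or more ONES followed by ZEROS) onto a
# wired-OR bus and drops out as soon as the bus state differs from its own
# transmission; the controller with the most ONES in each byte survives.
def resolve_contention(requesters):
    """requesters: dict of controller id -> (byte1, byte2) as 8-bit ints."""
    contenders = dict(requesters)
    for i in (0, 1):                       # the two identifying bytes in turn
        bus = 0
        for code in contenders.values():
            bus |= code[i]                 # simultaneous transmission, OR'd
        contenders = {cid: code for cid, code in contenders.items()
                      if code[i] == bus}   # keep only exact matches
    return next(iter(contenders), None)    # the controller that seizes the bus

# Example: 0b11100000 beats 0b11000000 in the first byte, so only the former
# continues into the second comparison and switches to the command state.
```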

The controller in command of the bus then transmits signals which select recipients for the directives which are to follow. Other controllers ignore all further communications until the next selection signal appears.

4.4 Interaction Directives

A signal designating the interaction function required by a processor is transmitted across the processor/controller interface as the result of the execution of some processor instruction. The processor will then generally continue its execution sequence unless or until it is required to pass on a second interaction function before a previously issued function has been completed. Upon receipt of the interaction command, and after successful seizure of the bus as described, the command controller may initiate execution of the interaction by transmitting a sequence of one or more directives to the selected units. A basic set of directives is listed in Table I.

TABLE I

    Send and Compare
    Compare Received
    Set Status Bits
    Interact
    External Execute
    Attention
    Free Bus

The Compare directives are most frequently used to seize the bus and to select a subset of the controllers for the receipt of subsequent directives. The remaining units ignore further directives until alerted by an Attention signal or until Free Bus provides the release that permits waiting controllers to attempt to seize the bus. Receive provides for transmittal of data between controllers; for example, transmission of a machine instruction to a selected set of controllers, followed by the directive Interact. Thus this sequence could realize the basic interaction function. External Execute is, however, considered so fundamental to efficient exploitation of a parallel processing system that we include it as an explicit directive. Status bits that may be set or reset by appropriate directives provide data on the status of various system queues, on the interruptability of given processors, on Wait status, and so on.
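As an illustration of how the directives of Table I might be strung together for the External Execute case, consider the sketch below; the bus and controller calls are invented for the example and are not the paper's interface.

```python
# Sketch of a command controller driving an External Execute interaction:
# seize the bus, select recipients, issue the directive, release the bus.
def external_execute(bus, controller, selection_code, instruction_address):
    controller.seize(bus)                         # four-out-of-eight code plus the
                                                  # two-byte conflict resolution above
    bus.send("SEND_AND_COMPARE", selection_code)  # controllers whose code matches are
                                                  # selected; the rest ignore directives
                                                  # until Attention or Free Bus
    bus.send("EXTERNAL_EXECUTE", instruction_address)
    # each selected processor executes the instruction at instruction_address
    # on completion of its current instruction
    bus.send("FREE_BUS")                          # waiting controllers may now try
                                                  # to seize the bus
```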

5. STORAGE COMMUNICATION

The fact that interest in large parallel processing systems is increasing rapidly as technology enters the integrated or monolithic era is no coincidence. Such systems will not, in fact, be practical for general purpose application until miniaturization reaches the stage where the large amount of hardware required can be assembled in compact fashion. This need is most apparent when one considers communication between the high-speed store and the various classes of processors, which may collectively be termed Requestors. Already in presently available systems, the transmission delay between storage and requestors is of the same order of magnitude as the storage cycle time; and cycle times are still decreasing.

Formulation of a hardware model as in Fig. 1 led to the immediate conclusion that the feasibility of the interconnection of large numbers of units had first to be established. Many possible systems were considered, and preliminary studies concluded that the crossbar switch was the most appropriate system for early study in view of its regular structure, simplicity, and basic modularity. More particularly, monolithic crossbar modules are visualized which it will be possible to interconnect to provide networks of any required dimensions. Alternatively, or additionally, other interconnections of these modules can provide highly available, multi-level trunking systems.

In addition to the switch proper, the crossbar network requires a selection and control mechanism. It is moreover appropriate to locate the queues, which store all but one of a group of conflicting requests, within the switching area. A switch complex, as in Fig. 4, has been designed for a system configuration including twenty-four requestors, thirty-two memory modules, thirty-two data plus four parity bit words, and sixteen plus two parity bit addresses.
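The queueing discipline can be sketched as follows (an illustration only; the per-module queues and the data structures are assumptions consistent with, but not specified by, the text):

```python
# Sketch of one storage cycle of the crossbar with requestor-storage queues:
# of the requests addressing the same module, one is granted and the rest are
# queued, which is the storage interference measured in Section 7.
from collections import deque

def crossbar_cycle(new_requests, pending):
    """new_requests: list of (requestor_id, module_id) arriving this cycle.
    pending: dict module_id -> deque of waiting requestor ids (the queues Qi).
    Returns the list of (requestor_id, module_id) accesses granted this cycle."""
    for requestor, module in new_requests:
        pending.setdefault(module, deque()).append(requestor)
    granted = []
    for module, waiters in pending.items():
        if waiters:                              # one access per module per cycle
            granted.append((waiters.popleft(), module))
    return granted
```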

Fig. 4. The centralized crossbar switch.

The result of this design study shows that the size and complexity of such a switch is not excessive for a large-scale system. In its simplest form and using standard high-performance logical devices, with a fan-in of four, a fan-out of ten, and a four-way OR capability, its use leads to a worst-case delay of some seven logical levels in the control and queue decision circuits and two levels in each direction of the switch. The switch uses between two and three times as many circuits as a central processor such as the Model 75 of System/360. While this, in itself, represents a considerable amount of hardware, it is still an order of magnitude less than the hardware found in the units that the switch is interconnecting. Moreover, its regular structure and simple, repetitive logic suggest ultimate economical realization using monolithic circuit techniques.

6. USAGE

6.1 The Executive System

The basic properties outlined in Section 2 give parallel processing systems the potential to overcome many of the ills and shortcomings that presently beset computer systems. For maximum effectiveness, the system must be library- or file-oriented. It can, however, be exploited efficiently only if the overhead resulting from executive control and supervisory activity does not strangle the system. More particularly, the gains from the sharing of resources and any peak averaging effect must exceed any additional overhead due to resource allocation procedures, conflict resolutions, and other processing activity arising from the concurrent operation of many units. Thus a unified and integrated design approach is required in which software and hardware, operating system and processing units, lose their separate identities and merge into one overall complex, for which allocation and scheduling procedures, for example, are as basic and as critical as arithmetic operations.


Equally significant to the successful exploitation of parallel processing potential are the problems of data management, man-machine interactions, and, most generally, problem preparation and usage of the system. We restrict the present discussion to brief comments on programming techniques for task generation and on the development of algorithms possessing macro-parallelism. In particular we indicate that multi-instruction counter systems can be profitably applied to the solution of the large problems whose computing requirements tax the speed capability and storage of the largest computer and the patience of their users. In the following section we evaluate these proposals by quoting some performance measurements obtained from an executing simulator.

6.2 Programmed Task Generation

Study of the usage of parallel processing systems for the rapid solution of large real-time problems involves two aspects. On the one hand we must consider the development of algorithms displaying an appropriate form of macro-parallelism. On the other hand, programming techniques must be developed for efficient exploitation in terms of both problem- and machine-oriented instructions, such as those discussed in Section 4.

It is appropriate to discuss programmed task generation first. For simplicity we consider a job segment that requires n executions of a procedure I. The procedure will itself include modification of index registers or other changes that distinguish the individual tasks. We assume that on completion of all n tasks, a new procedure J should be initiated. Moreover, should processing power be available at a time when n executions of I have been initiated but not all n completed, we assume that an independent procedure K, belonging to the same job, may be initiated. In the simplest case K will be a terminate instruction which releases the processor and makes it available to process other work as determined from the work-queue complex.

Execution of split and terminate instructions involves executive overheads, so that these instructions should not be used indiscriminately. Within a system in which a maximum of p processors are available to a job, it is pointless to partition a job, at any one time, into more than p tasks. It is, however, undesirable to guarantee a user that p processors, or even more than one processor, will execute his program. A simple task generation scheme that makes as many entries in the task queue as there are potentially concurrent parts of the algorithm (for example, from a loop containing a split instruction) is inefficient when that number is much larger than the number of processors that happen to be available. The technique also leads to very large queues. An alternative, termed Onion Peeling by us, puts the instruction sequence containing the split at the head of procedure I and ends each execution of the procedure with a terminate. This restricts the queue length for this job segment to one, but it is otherwise as inefficient as the previous method.

A Modified Onion Peeling (MOP) scheme restricts the split and terminate overhead to at most one more³ than the number of processors actually applied to the segment. It also ensures that processing is completed as quickly and as efficiently as possible with the number of processors that become available to the job segment. Thus if during execution no further processors are freed, the n tasks are executed sequentially with only one split and no terminate. If, on the other hand, some other number of processors is used for execution, the procedure is speeded up accordingly. The maximum number p of processors that may be applied to the job may be limited by the number of processors in the system and available, or by executive edict.

The basic scheme is illustrated by the following program, in which the first expressions following the zeroing of counters ensure that no unnecessary splits are queued.

      A = 0
      B = 0
      C = 0
ST    IF N - B < 1 THEN GO TO IN     Suppress split if nth task being initiated
      A = A + 1
      IF A ≥ P THEN GO TO IN         Split if fewer than p processors allocated
      SPLIT TO ST
      B = B + 1
IN    IF B > N THEN GO TO FIN        If all n I-tasks started, proceed with K
      CALL I PROCEDURE
      C = C + 1
      IF C < N THEN GO TO IN         If all n I-tasks completed, proceed with J
      CALL J PROCEDURE
FIN   CALL K PROCEDURE
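Re-expressed in present-day terms, the control flow above looks roughly as follows. This is a sketch of the Modified Onion Peeling idea, not the paper's code: the thread pool stands in for processors, the submit callback for the Split, and, unlike the listing above (see footnote 3), the shared counters here are explicitly interlocked.

```python
# Sketch of MOP-style task generation: each task that starts on the segment
# offers at most one further split, then repeatedly claims and executes
# outstanding executions of procedure I; the task completing the last one
# proceeds with J, and tasks finding nothing left to start proceed with K.
import threading

def make_segment(n, p, submit, proc_i, proc_j, proc_k):
    state = {"A": 0, "B": 0, "C": 0}      # splits issued, I-tasks started, completed
    lock = threading.Lock()

    def run_segment():
        with lock:
            if state["B"] < n and state["A"] < p:
                state["A"] += 1
                submit(run_segment)       # the Split: offer the segment to one
                                          # more spontaneously available processor
        while True:
            with lock:
                if state["B"] >= n:       # all n I-tasks already started
                    break                 # ...so proceed with K (e.g. Terminate)
                state["B"] += 1
            proc_i()
            with lock:
                state["C"] += 1
                finished = state["C"] >= n
            if finished:                  # all n I-tasks completed: proceed with J
                proc_j()
                return
        proc_k()

    return run_segment

# Example wiring with a thread pool standing in for p processors:
#   from concurrent.futures import ThreadPoolExecutor
#   pool = ThreadPoolExecutor(max_workers=4)
#   segment = make_segment(n=40, p=4, submit=pool.submit,
#                          proc_i=do_one_row, proc_j=finish_job, proc_k=lambda: None)
#   pool.submit(segment)
```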

6.3 Macro-Parallelism

Commonly used numerical algorithms, data processing procedures, and computer programs are generally sequential in nature. The reason for this is largely historical, a consequence of the fact that the mechanisms, human, mechanical, and electronic, used in developing and executing these procedures have been incapable of significant parallel activity, other perhaps than the simultaneous, coordinated use of many humans. The advent of parallel processing systems thus calls for the modification of accepted techniques to expose any inherent parallelism. The resultant procedures must then be further adapted to make parallel tasks of such a magnitude that the overhead involved in their generation becomes insignificant. But the ultimate benefit from parallel execution will be obtained only by going back to the problems themselves. These must be analyzed anew. Algorithms must be developed that make it possible to exploit the parallel executing capability, by introducing into the mathematical and program model parallelism that ultimately reflects the parallelism of the physical system or phenomena being studied.

³ This is not quite accurate. The simple MOP algorithm here does not explicitly interlock the split sequence. There is therefore a possibility that unnecessary task calls may be queued during the execution of the split which is to generate the nth task. The probability of this is, however, small, while the degradation arising from an interlock could be significant, and the algorithm in the form given appears more economical.


In this need to return to fundamentals, the situation is somewhat analogous to the early days of electronic computing, when attempts at commercial application were largely frustrated until it was realized that widespread application required the development of new techniques, rather than the adaptation and mechanization of existing procedures.

At the present time, however, our direct activity in problem analysis has concentrated mainly on the adaptation of existing numerical techniques for parallel processing, for problems in which the basic macro-parallelism was self-evident. These include, for example, linear algebra and the solution of elliptic partial differential equations. In these areas the extent and nature of the parallelism had previously led to proposals for vector processing systems such as Solomon [14] and Vamp [15]. Other areas in which the parallelism is self-evident but where vector processors prove less effective are those in which the algorithms model distinct physical activities, such as in file processing and Monte Carlo techniques. For all significant problems investigated [12] it was possible to establish the existence of parallel tasks of such a length that tasking overheads could be expected to be negligible.

Other classes of problems have been studied, both in terms of the extension of existing algorithms and the development of new ones. In particular we refer to the extraction of polynomial roots [16], the solution of equations [17], and the solution of linear differential equations [18], [19]. These various studies, not all directly related to the present project, were more mathematical in nature, and to the best of our knowledge, no attempt has yet been made to develop efficient parallel computer programs. Thus, while numerical methods are beginning to emerge which enable the exploitation of macro-parallelism in the solution of time-limited problems, and from which it appears that significant reductions may be obtained in throughput times, much work remains to be done on re-programming the problems themselves.

7. SIMULATION

7.1 Simulation as a Design Tool

It has been our experience with simulation that its principal function as a design tool is to focus attention on features that require investigation and explanation. Many results, qualitative and quantitative, that are obtained during simulation experiments may also be obtained analytically. It is, however, the insight and understanding gained from the design of simulation experiments and the analysis of their results that draws attention to specific details and difficulties. The undeniable value of simulation in development and design is therefore quite different from that in system evaluation, where meaningful performance figures may be obtained when the work load is well defined.

7.2 The Executing Simulator

In the present study simulation was seen as fulfilling a number of additional functions. In particular it made available a usable working model of a parallel processing system. This would give potential users the incentive to undertake actual programming and to gain limited operational experience. An executing simulator was also required for the investigation of what is commonly regarded as the most immediate question in parallel processing, the extent of performance degradation due to storage-access interference and executive (queue-access) interference. Such an executing simulator is now operational and its use is discussed in the next section. We note parenthetically that a limitation of this type of simulator is its speed. For the evaluation of total system performance over any length of time, particularly when using a computer itself much slower than the simulated system, only gross, nonexecuting simulation is reasonable [20].

The system presently modeled in the executing simulator includes the processors, switch, and storage modules of Fig. 1. The storage modules are accessed through a fully interleaved address structure, though it is clear that in any realization interleaving will be partial, both to sustain high availability and to decrease storage interference between independent jobs. The individual processors have a System/360-like structure [21] and execute an augmented subset of S/360 machine language. The nonstandard instructions added to the repertoire include the functions discussed in Section 4. The local store LSi, to be used also as an instruction buffer, is however not included in the model for which the interference results are quoted in the next section. The simulator configuration is parameterized so that, for example, the numbers of storage modules and processors, instruction execution times (in storage cycles), and the nature of statistics gathered and printed may be selected for each run. The program itself is modular, and both system features and measurement facilities may be expanded or modified as required.

7.3 Simulator Experiments

7.3.1 Kernels: Simulation experiments first concentrated on an investigation of storage interference arising in the execution of typical kernels from numerical analysis. The results indicated that under the limited conditions of the experiments and for a storage module-to-processor ratio of two, interference would degrade performance by less than twenty percent, dropping to some five percent for a storage module-to-processor ratio of eight. Addition of a local processor store and its use as an instruction buffer effectively eliminated interference, as expected, indicating that it had been substantially due to instruction-fetch interference.

These results were considered to have been generated under conditions too restrictive to permit generalization. In particular each set referred only to concurrent executions of a single loop. Thus more recent experiments have included many runs of a matrix-multiply subroutine and the solution of an electrical network problem using an appropriately modified version of the Jacobi variant of the Gauss-Seidel solution of a set of linear algebraic equations.

7.3.2 The Matrix Multiplication: The Matrix Multiply program was written in two versions. A classical sequential program excluding all the special instructions provided the standard on which measurement of the parallelism overhead and interference could be based.


TABLE II
Instruction Execution Times in Storage Cycles

    Fixed Point Addition                    0.4
    Floating Point Addition                 0.5
    Floating Point Multiplication           1.0
    Floating Point Division                 2.0
    Split                                  25.0
    Terminate                              25.0
    New Task Fetch (part of Terminate)     25.0

TABLE III
Results of the Matrix Multiply Simulation
Note: All times in thousands of storage cycles. NA - Not Applicable; NV - Not Available.

No. of  No. of    Matrix  Run    Total   Storage Interference   Exec.     No. of     Storage       Notes
Proc.   Storage   Dim.    Time   Proc.   Time         %         Interf.   Storage    Utilization
        Mods.                    Time                           %         Accesses   %
  1       64       40     427     427      1.02       0.2        NA        459K        1.69        Sequential program
  1       64       40     429     429      0.21       0.05       NA        460K        1.68        Interference between instruction and data fetches
  2       64       40     216     432      1.77       0.4        0.33      460K        3.3
  4       64       40     109     436      5.79       1.3        0.39      460K        6.6
  8       64       40      56     445     14.4        3.3        0.68      460K       13.0
 16       64       40      35     461     30.3        7.0        0.76      460K       25.0
 16       32       40      38     507     75.9       17.7        0.88      460K       45.4
 16       16       40      47     639    207.0       48.2        0.64      460K       72.1
 16       64       39      33     428     26.1        6.5        NV        427K       26.9

% Storage Utilization = (No. of Storage Accesses × No. of Proc.) / (Total Proc. Time × No. of Storage Mods.)

The second, parallel, program used the onion peeling rather than the MOP algorithm described in Section 6.2. The product matrix was partitioned by rows, with the computation of each comprising one task. The experiments were performed for square matrices of dimensions thirty-nine and forty, with from one to sixteen processors and sixteen to sixty-four storage modules. Two sizes of matrices were used to isolate the effect of commensurate periodicities of array mapping with the address structure of the store, which demonstrably had significant influence on the results.
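For concreteness, the partitioning used in the experiment can be sketched as below (an illustration, not the simulator program): the product matrix is divided by rows, and the computation of each row forms one task picked up by whichever processor becomes free.

```python
# Sketch of row-partitioned matrix multiplication: one task per row of the
# product, executed by a pool of workers standing in for processors.
from concurrent.futures import ThreadPoolExecutor

def row_task(a_row, b):
    cols = len(b[0])
    return [sum(a_row[k] * b[k][j] for k in range(len(b))) for j in range(cols)]

def parallel_matmul(a, b, processors=16):
    with ThreadPoolExecutor(max_workers=processors) as pool:
        return list(pool.map(lambda row: row_task(row, b), a))
```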

Instruction execution times for the most frequently executed instructions used in the experiment are given in Table II.

These times exclude the instruction fetch time (one instruction for each fetch), since these are overlapped unless a storage conflict occurs, when a request must be queued. The arithmetic operations may also include a data fetch (RX instructions), in which case a further store access time is required.

In the absence of an internal instruction buffer, processors executing the same program string interfere with each other continuously during instruction fetches. To minimize this effect for loops that are short relative to the width of the interleaving, it is profitable to unwind such loops by repetition so that the resultant string stretches as far as possible across the interleaved store. The program was unwound in this way. We note, however, that it is in fact better [22] to repeat the loop, appropriately modified, several times across the interleaved store, directing successive processors to successive, but unconnected, loops. This can decrease interference by as much as twenty percent over the previous case.
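The benefit of unwinding can be seen from a small sketch of fully interleaved addressing; the mapping module = address mod M is an assumption consistent with a fully interleaved store but not spelled out in the paper.

```python
# Sketch of why unwinding helps: with consecutive addresses assigned to
# consecutive modules, a short loop occupies only a few modules, so several
# processors fetching the same loop keep colliding; unwinding stretches the
# instruction string across more of the interleaved store.
def modules_holding_loop(base_address, loop_length, unwind_factor, num_modules):
    """Return the set of storage modules containing the (unwound) loop body."""
    return {(base_address + offset) % num_modules
            for offset in range(loop_length * unwind_factor)}

# e.g. a 6-word loop in a 64-way interleaved store occupies 6 modules, while
# the same loop unwound 8 times occupies 48, so simultaneous instruction
# fetches by several processors rarely address the same module.
```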

Some results of the simulation are given in Table III and plotted in Figs. 5 and 6.

We note that running time (col. 4) is defined as the interval between the start of the first processor on its first task and the completion, by the last processor to finish, of its final task. Since an onion peel technique has been used for the splitting, there is an interval (of order 70 storage cycles) between the start of successive tasks. There is also an initial interval (87 memory cycles) in which the first processor initializes the program. Finally, the finish of processors is staggered and, in particular, for the sixteen-processor case, eight processors are assigned two tasks (rows) in succession, and eight, three tasks. The former processors will, of course, terminate considerably earlier than the latter. Thus, as indicated by the corresponding entry in column four, the particular mode of partitioning is not optimum if the shortest execution time is to be obtained.


Fig. 5. Execution time for matrix multiply.

From a system efficiency point of view, however, and in actual operation with other jobs and tasks in the system, it is of no consequence, since processor idling does not actually occur. New tasks, perhaps arising from quite different jobs, are initiated, according to some scheduling strategy, whenever a processor becomes spontaneously available.

In addition to run time, we define a total processor time (col. 5). This represents the sum total of time that individual processors were active in the program and is therefore a reflection of total processor running cost. Storage interference (cols. 6, 7) measures the total time that processors were inactive due to attempts to initiate simultaneous accesses to the same storage module. It occurs also when only a single processor is applied, when it represents a conflict between a data fetch and an attempt by the overlap circuit to initiate an instruction fetch from the same module. Executive interference (col. 8) represents processor hold-ups due to simultaneous attempts by two or more processors to access the system work-queues. These interferences are of course representative of a whole class of effects that can lead to performance degradation in parallel processing systems.

In Table III interference has been related to the number of interleaved storage modules and to the number of processors. In an actual system it is of course a complex function of the number of storage modules, of the degree of address interleaving, of the relationship between active jobs and the degree of program and data sharing, and of the total system utilization of storage. In optimizing a design, the numbers of processors and storage modules and the addressing scheme must be fixed subject to constraints related to cost, total storage capacity, the capacity of available storage modules, the degree of availability desired, and the expected nature of the work load. Processor utilization of storage alone is not very significant, since a critical factor is the I/O storage activity present, the degree of storage utilization required to get program and data into the high-speed store and to output results.

Fig. 6. Total processor time and interference in matrix multiply.

We include utilization figures for these executions in Table III, to aid in analysis of the system behavior but not for evaluation purposes.

7.3.3 The Electrical Network Analysis Problem: This problem represents the solution of a set of simultaneous linear equations, described by a sparse coefficient matrix. The technique used for its solution on the executing simulator essentially comprises a relaxation procedure. Extensive runs have been made using a specific thirty-six node network, yielding twenty-six equations with up to four terms in each equation.

From the wealth of results obtained we present representative sets that indicate some general trends related to the characteristics and performance of the parallel processing system. Available space will not permit, however, detailed analysis in the present paper, nor does it permit a discussion of the equally interesting results obtained concerning speed of convergence, in particular, and other effects which must be understood within the framework of a numerical analysis of the relaxation solutions.

Figures 7, 8, and 9 present the basic performance data, throughput time, and total processor time, for a total of one hundred and forty-four cases. The variables are the number of processors in the system (12 cases), the size of the inner loop as represented by the number of currents (from 2 to 5) evaluated in the loop, and the number of interleaved storage modules (16, 32, 64).

These curves clearly indicate the reduction in throughput time to be obtained from the use of parallel processing, the consequent increase in processor cost due to interferences of various sorts, the resultant effect of diminishing returns, and the actual increase in throughput time when too many processors chase too few equations and generally get seriously "into each other's way."


Fig. 7. Total processor and throughput times in electrical network analysis, sixteen storage modules.

Fig. 8. Total processor and throughput times in electrical network analysis, thirty-two storage modules.

For the smaller inner loops, and when interference between processors is low, total processor times vary somewhat erratically. The causes for this are related to the relaxation pattern and the rate of convergence in each case. In fact there appears strong circumstantial evidence that an ad hoc procedure, which does not guarantee sequential evaluation of the equations, improves performance. This point, however, requires further study.

Fig. 9. Total processor and throughput times in electrical network analysis, sixty-four storage modules.

Fig. 10. Total processor and throughput times in electrical network analysis with number of storage modules as a parameter.

Figure 10 reproduces some of the results of the previous three figures for the case of a five-equation inner loop. Table IV lists these same results as a percentage of the time using one processor and compares them with the reciprocal of the number of processors.


TABLE IV
Run Time for Resistor Network System Relative to the Run Time Using One Processor, with a Five-Equation Inner Loop

No. of        Relative Time (%)                  100 / No. of
Processors    16 Mods.   32 Mods.   64 Mods.     Processors (%)
    1             -          -        100.0          100.0
    2             -          -         51.2           50.0
    4             -          -         27.1           25.0
    6             -          -         19.5           16.7
    7           20.3       17.9        17.1           14.3
    8           19.2       16.8        15.8           12.5
    9           17.8       15.2        14.2           11.1
   10           17.6       14.5        13.7           10.0
   11           16.8       13.9        12.9            9.1
   12           17.5       13.9        13.0            8.3
   14           17.3       13.2        11.7            7.2
   16           17.7       13.7        11.7            6.3

Figure 11 indicates storage interference and parallel processing overheads as a function of the number of processors, with storage modularity again a parameter and an inner loop again comprising the evaluation of five currents. Storage interference has previously been defined. The parallel processing overhead represents, as a percentage, the excess of the total number of storage cycles required for execution, excluding storage interference cycles, when more than one processor is used, relative to the number of cycles required by a one-processor execution.
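Written out (the symbols below are introduced here for illustration and are not from the paper), with $C_n$ the total storage cycles consumed by an $n$-processor execution, $C_{\text{int}}(n)$ the cycles lost to storage interference, and $C_1$ the cycles of the one-processor execution, the definition above reads

$$ \text{overhead}(n) \;=\; \frac{\bigl(C_n - C_{\text{int}}(n)\bigr) - C_1}{C_1} \times 100\% . $$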

Actual counts during execution show that in general some sixty-seven percent of store accesses are instruction fetches in this program and some thirty-three percent are data fetches. Thus incorporation of a substantial instruction buffer in each processor clearly reduces all interference by an order of magnitude, since of the four ways in which a storage interference can occur, only one, a data fetch conflicting with a data fetch, remains in the inner loop. Moreover, these measurements refer to a processor in which arithmetic speeds, as in Table II, are of the order of magnitude of a memory cycle time, which implies a somewhat powerful processor. Thus in every sense the interference figures are worst-case results which, with the performance curves to which they relate, support the view that storage interference is not a serious obstacle to parallel processing.

The four contours drawn on these curves represent lines of constant storage module-to-processor ratio. They slope slightly upward due to the statistical Marbles and Boxes [22] effect previously referred to.

Figure 12 presents two sets of data, based on the five-equation inner loop. The upper family of curves relates to storage utilization. The reservations made at the end of Section 7.3.2, with reference to the significance of utilization figures, also apply. The second family of curves represents a first attempt at estimating the relative quality of processing, that is, some function of a cost/performance factor. Such a factor is intuitive and environment-sensitive, depending on the relative concern for speed and for costs of various sorts.

Fig. 11. Storage and executive interference.

Fig. 12. Storage utilization and cost/performance factors.

For the present data we have chosen to display the function

    K / (throughput time × total processor time)

where K is a constant, throughput time a measure of the speed of computation, and total processor time a measure of the cost.

8. CONCLUSION

In this paper we have presented some thoughts on parallel processing. In particular we have chosen to survey the topic by including an extensive bibliography and some of the results of our work in this area. The discussion has had to be brief, but our intention has been to convey the picture of the potential that parallel processing systems offer for the future development of computing.


The key to successful exploitation lies in a new, unified, and scientific approach to the entire problem of the design and usage of computing systems. The development of large, integrated systems raises many problems, but there can be no doubt that economic solutions to these will be found. Their development should comprise a significant part of the computer system architectural design effort of the next few years.

Any ultimate evaluation of a parallel processing system within a working environment depends on actual operating experience. This in turn requires the existence of a system and the interest of users. Only when usable systems become available will the concept of parallel processing in integrated systems be accurately evaluated.

ACKNOWLEDGMENT

This paper reports on a group activity in which each individual member had his own specific assignments and in addition participated in regular discussions on all aspects of the project. Credit is therefore due to all members of the group which, during the period covered by the contents of this paper, included G. C. Driscoll, J. M. Lee, A. P. Mullery, J. L. Rosenfeld, H. P. Schlaeppi, and M. Weitzman. I should also like to express my sincere thanks to Dr. H. A. Ernst for the constructive criticism, advice, and encouragement offered during the preparation of this paper. My sincere thanks are also due to members of the Graphics and Design Department at the Thomas J. Watson Research Center, and in particular to G. Massi and Mrs. M. J. LaMarre, for their preparation of the charts and figures. Last, my thanks to Mrs. J. Galto for her infinite patience in the repeated retypings of the manuscript.

REFERENCES

[1] S. Gill, "Parallel programming," Computer J., vol. 1, pp. 2-10, April 1958.
[2] A. L. Leiner, W. A. Notz, J. L. Smith, and A. Weinberger, "PILOT-a new multiple computer system," J. ACM, vol. 6, pp. 313-335, July 1959.
[3] H. S. Bright, "A Philco multi-processing system," 1964 Proc. FJCC, pp. 97-141.
[4] W. H. Desmonde, Real Time Data Processing Systems. Englewood Cliffs, N. J.: Prentice-Hall, 1964.
[5] J. D. McCullogh, K. H. Speierman, and F. W. Zurcher, "Design for a multiple user multiprocessing system," 1965 Proc. FJCC, pp. 611-618.
[6] P. Dreyfuss, "System design of the Gamma 60," 1958 Proc. WJCC, pp. 130-133.
[7] M. Lehman, "Serial mode operation and high-speed parallel processing," Information Processing, 1965 Proc. IFIP, pt. 2. New York: Spartan, October 1966, pp. 631-633.
[8] IBM OS/360, "PL/I language specification," IBM Form C28-6571, p. 74.
[9] F. J. Corbato and V. A. Vyssotsky, "Introduction and overview of the Multics system," 1965 Proc. FJCC, pp. 185-196.
[10] M. Conway, "A multiprocessor system design," 1963 AFIPS Proc. FJCC, pp. 139-146.
[11] J. Dennis and E. C. Van Horn, "Programming semantics for multiprogrammed computation," Commun. ACM, vol. 9, pp. 143-155, March 1966.
[12] H. P. Schlaeppi, "Extensions of PL/I-like languages for parallel processing, with programming examples," in preparation.
[13] A. D. Falkoff, K. E. Iverson, and E. H. Sussenguth, "A formal description of System/360," IBM Sys. J., vol. 3, no. 3, p. 213, 1964.
[14] D. L. Slotnick, W. C. Borck, and R. C. McReynolds, "The Solomon computer," 1962 Proc. FJCC, pp. 97-107. Also, J. Gregory and R. McReynolds, "The Solomon computer," IEEE Trans. on Electronic Computers, vol. EC-12, pp. 774-781, December 1963.
[15] R. V. Smith and D. N. Senzig, "Computer organization for array processing," 1965 Proc. FJCC, pp. 117-128.
[16] G. S. Shedler and M. Lehman, "Parallel computation and the solution of polynomial equations," IBM, Yorktown Heights, N. Y., Research Rept. RC 1550, February 1966.
[17] G. S. Shedler, "Parallel numerical methods for the solution of equations," IBM, Yorktown Heights, N. Y., Research Rept. RC 1619, June 1966.
[18] J. Nievergelt, "Parallel methods for integrating ordinary differential equations," Commun. ACM, vol. 7, pp. 731-733, December 1964.
[19] W. L. Miranker and W. M. Liniger, "Parallel methods for the numerical integration of ordinary differential equations," to be published in Math. of Computation.
[20] J. H. Katz, "Simulation of a multiprocessor computing system," 1966 Proc. AFIPS Conf., vol. 28, pp. 127-139.
[21] G. A. Blaauw and F. P. Brooks, "The structure of System/360," IBM Sys. J., vol. 3, no. 2, pp. 119-135, 1964.
[22] J. Rosenfeld, "Marbles and boxes," IBM Research Center, Yorktown Heights, N. Y., Project Rept., November 1965.

BIBLIOGRAPHY

1. J. M. Frankovich and H. P. Peterson, "A functional description of the Lincoln TX-2 computer," 1957 Proc. WJCC, pp. 146-155.
2. S. Gill, "Parallel programming," Computer J., vol. 1, pp. 2-10, April 1958.
3. P. Dreyfuss, "System design of the Gamma 60," 1958 Proc. WJCC, pp. 130-133.
4. C. Strachey, "Time sharing in large fast computers," in Information Processing 1959. Paris: UNESCO; Munich: Oldenbourg; London: Butterworth, 1960, pp. 336-341.
5. A. L. Leiner, W. A. Notz, J. L. Smith, and A. Weinberger, "PILOT-a new multiple computer system," J. ACM, vol. 6, pp. 313-335, July 1959.
6. N. Lourie, H. Schrimpf, R. Reach, and W. Kahn, "Arithmetic and control techniques in a multiprogram computer," 1959 Proc. EJCC, pp. 75-81.
7. G. Estrin, "Organization of computer systems, the fixed plus variable structure computer," 1960 Proc. WJCC, pp. 33-40.
8. H. Hellerman, "On the organization of a multiprogramming-multiprocessing system," IBM Corp., Yorktown Heights, N. Y., Research Rept. RC-522, September 1961, 52 pages.
9. F. J. Corbato, M. Merwin-Daggett, and R. C. Daley, "An experimental time-sharing system," 1962 AFIPS Proc. SJCC, pp. 335-344.
10. G. M. Amdahl, "New concepts in computing systems design," Proc. IRE, vol. 50, pp. 1073-1077, May 1962.
11. E. F. Codd, "Multiprogramming," in Advances in Computers, vol. 3. New York: Academic, 1962, pp. 77-153.
12. "Symposium on Multi-Programming (Concurrent Programs)," Information Processing, 1962 Proc. IFIP Congress. North Holland Co., 1963, pp. 570-575.
13. D. L. Slotnick, W. C. Borck, and R. C. McReynolds, "The Solomon computer," 1962 Proc. FJCC, pp. 97-107.
14. J. P. Penny and T. Pearcey, "Use of multiprogramming in the design of a low-cost digital computer," Commun. ACM, vol. 5, pp. 473-476, September 1962.
15. Planning a Computer System, W. Buchholz, Ed. New York: McGraw-Hill, 1962.
16. F. R. Baldwin, W. B. Gibson, and C. B. Poland, "A multiprocessing approach to a large computer system," IBM Sys. J., pp. 64-76, September 1962.
17. J. P. Anderson, S. A. Hoffman, J. Shifman, and R. J. Williams, "D825-a multiple-computer system for command and control," 1962 AFIPS Proc. FJCC, pp. 86-96.
18. D. L. Slotnick, W. C. Borck, and R. C. McReynolds, "The Solomon computer," ibid., pp. 97-107.
19. J. McCarthy, "Time sharing computer systems," in Management and the Computer of the Future. Cambridge, Mass.: M.I.T. Press, 1962.
20. M. J. Marcotty, F. M. Longstaff, and A. P. M. Williams, "Time sharing on the Ferranti-Packard FP6000 computer system," 1963 AFIPS Proc. SJCC, pp. 29-40.
21. J. S. Squire and S. M. Polais, "Programming and design considerations of a highly parallel computer," ibid., pp. 395-400.
22. B. Russell and G. Estrin, "An evaluation of the effectiveness of parallel processing," 1963 IEEE Pacific Comp. Conf., pp. 201-220.


23. H. A. Ernst, "TCS, an experimental multiprogramming system for the IBM 7090," IBM Corp., Yorktown Heights, N. Y., Research Rept. RJ248, June 1963, 41 pages.
24. M. Lehman, R. Eshed, and Z. Netter, "SABRAC, a time sharing low-cost computer," Commun. ACM, vol. 6, pp. 427-429, August 1963.
25. R. V. Smith and D. N. Senzig, "Computer organization for array processing," IBM Corp., Yorktown Heights, N. Y., Research Rept. RC 1330, December 1964.
26. A. S. Critchlow, "Generalized multiprocessing and multiprogramming systems," 1963 AFIPS Proc. FJCC, pp. 107-126.
27. M. E. Conway, "A multiprocessor system design," ibid., pp. 139-146.
28. R. R. Seeber and A. B. Lindquist, "Associative logic for highly parallel systems," ibid., pp. 489-493.
29. R. M. Meade, "604 machine description," IBM internal memo., December 1963, 38 pages.
30. M. Lehman, R. Eshed, and Z. Netter, "SABRA-a new generation serial computer," IEEE Trans. on Electronic Computers, vol. EC-12, pp. 618-628, December 1963.
31. M. W. Allen, T. Pearcey, J. P. Penny, G. A. Rose, and J. G. Sanderson, "CIRRUS, an economical multiprogram computer with microprogram control," ibid., pp. 663-671.
32. W. F. Miller and R. A. Aschenbrenner, "The GUS multicomputer system," ibid., pp. 671-676.
33. G. Estrin, B. Russell, R. Turn, and J. Bibb, "Parallel processing in a restructurable computer system," ibid., pp. 747-755.
34. J. Gregory and R. McReynolds, "The Solomon computer," ibid., pp. 774-781.
35. H. S. Bright, "A Philco multi-processing system," 1964 Proc. FJCC, pp. 97-141.
36. R. G. Ewing and P. M. Davies, "An associative processor," 1964 AFIPS Proc. FJCC, pp. 147-158.
37. H. A. Kinslow, "The time-sharing monitor system," ibid., pp. 443-454.
38. J. Nievergelt, "Parallel methods for integrating ordinary differential equations," Commun. ACM, vol. 7, pp. 731-733, December 1964.
39. W. H. Desmonde, Real Time Data Processing Systems. Englewood Cliffs, N. J.: Prentice-Hall, 1964.
40. M. Lehman, "Serial mode operation and high-speed parallel processing," Information Processing, 1965 Proc. IFIP, pt. 2. New York: Spartan, 1966, pp. 631-633.
41. R. V. Smith and D. N. Senzig, "Computer organization for array processing," 1965 Proc. FJCC, pp. 117-128.
42. E. W. Dijkstra, "Solution of a problem in concurrent programming control," Commun. ACM, vol. 8, p. 569, September 1965.
43. J. B. Dennis, "Segmentation and the design of multiprogrammed computer systems," J. ACM, vol. 12, pp. 589-602, October 1965.
44. F. J. Corbato and V. A. Vyssotsky, "Introduction and overview of the Multics system," 1965 Proc. FJCC, pp. 185-196.
45. E. L. Glaser, J. Couleur, and G. Oliver, "System design of a computer for time sharing applications," ibid., pp. 197-202.
46. V. A. Vyssotsky, F. J. Corbato, and R. M. Graham, "Structure of the Multics supervisor," ibid., pp. 203-212.
47. R. C. Daley and P. G. Neumann, "A general-purpose file system for secondary storage," ibid., pp. 213-229.
48. J. F. Ossanna, L. E. Mikus, and S. D. Dunten, "Communication and input/output switching in a multiplex computing system," ibid., pp. 231-241.
49. J. W. Forgie, "A time- and memory-sharing executive program for quick-response on-line applications," ibid., pp. 599-610.
50. J. D. McCullogh, K. H. Speierman, and F. W. Zurcher, "Design for a multiple user multiprocessing system," ibid., pp. 611-618.
51. W. T. Comfort, "A computing system design for user service," ibid., pp. 619-628.
52. J. P. Anderson, "Program structures for parallel processing," Commun. ACM, vol. 8, pp. 786-788, December 1965.
53. B. W. Arden, B. A. Galler, T. C. D. O'Brien, and F. H. Westervelt, "Program and addressing structure in a time-sharing environment," J. ACM, vol. 13, pp. 1-16, January 1966.
54. J. H. Katz, "Simulation of a multiprocessor computer system," SR & D Rept. LA-009, February 1966; to be published in 1966 Proc. SJCC.
55. G. S. Shedler and M. M. Lehman, "Parallel computation and the solution of polynomial equations," IBM, Yorktown Heights, N. Y., Research Rept. RC 1550, February 1966.
56. H. Hellerman, "Parallel processing of algebraic expressions," IEEE Trans. on Electronic Computers, vol. EC-15, pp. 82-91, February 1966.
57. J. B. Dennis and E. C. Van Horn, "Programming semantics for multiprogrammed computation," Commun. ACM, vol. 9, pp. 143-155, March 1966.
58. N. Wirth, "A note on 'program structures' for parallel programming," Commun. ACM, vol. 9, pp. 320-321, May 1966.
59. D. E. Knuth, "Additional comments on a problem in concurrent programming control," ibid., pp. 321-322.

Very High-Speed Computing Systems
MICHAEL J. FLYNN, MEMBER, IEEE

Abstract-Very high-speed computers may be classified as follows:

1) Single Instruction Stream-Single Data Stream (SISD)
2) Single Instruction Stream-Multiple Data Stream (SIMD)
3) Multiple Instruction Stream-Single Data Stream (MISD)
4) Multiple Instruction Stream-Multiple Data Stream (MIMD).

"Stream," as used here, refers to the sequence of data or instructions as seen by the machine during the execution of a program.

The constituents of a system: storage, execution, and instruction handling (branching) are discussed with regard to recent developments and/or systems limitations. The constituents are discussed in terms of concurrent SISD systems (CDC 6600 series and, in particular, IBM Model 90 series), since multiple stream organizations usually do not require any more elaborate components.

Representative organizations are selected from each class and the arrangement of the constituents is shown.

Manuscript received June 30, 1966; revised August 16, 1966. This work was performed under the auspices of the U. S. Atomic Energy Commission.
The author is with Northwestern University, Evanston, Ill., and Argonne National Laboratory, Argonne, Ill.

INTRODUCTION

MANY SIGNIFICANT scientific problems require the use of prodigious amounts of computing time. In order to handle these problems adequately, the large-scale scientific computer has been developed. This computer addresses itself to a class of problems characterized by having a high ratio of computing requirement to input/output requirements (a partially de facto situation