
Associative and Parallel Processors

KENNETH J. THURBER

Product Development Group, Sperry Univac Defense Systems, St. Paul, Minnesota 55165, and Computer, Information and Control Sciences Department, University of Minnesota, Minneapolis, Minnesota 55455

LEON D. WALD

Test Equipment Engineering, Government and Aeronautical Products Division, Honeywell, Inc., Minneapolis, Minnesota

This paper is a tutorial survey of the area of parallel and associative processors. The paper covers the main design tradeoffs and major architectures of SIMD (Single Instruction Stream Multiple Data Stream) systems. Summaries of ILLIAC IV, STARAN, OMEN, and PEPE, the major SIMD processors, are included.

Keywords and Phrases: associative processor, parallel processor, OMEN, STARAN, PEPE, ILLIAC IV, architecture, large-scale systems, SIMD processors, array processors, ensemble.

CR Categories: 3.80, 4.20, 6.22

INTRODUCTION

Purpose and Scope

The purpose of this paper is to present a tutorial survey on the subject of parallel and associative processors. The paper covers the topics of system categorizations, applications, main tradeoff issues, historically important architectures, and the architectures of systems that are currently available.

Currently, the microprocessor/computer-on-a-chip revolution is providing the potential for production of very cost-effective, high-performance computers through utilization of a large number of these processors in parallel or in a network. Parallel and associative processors provide a class of architectures which can readily take immediate advantage of this microprocessor technology.

Surveys

A number of good surveys on the subject (or bordering on the subject) of parallel and associative processors have been published [1-10]. Some of these surveys are only of historical interest due to their age. Several conferences in this area may also be of interest [11-14].

Classification of Computer Architectures

Many approaches to the classification of computer architectures are possible. Most techniques use global architecture properties and are thus valid only within limited ranges. The main classification techniques discussed below provide a framework within which to view associative and parallel processors.

Flynn [15] proposed a classification scheme that divides systems into categories based upon their procedure and data streams. The four categories are:

1) SISD (Single Instruction Stream Single Data Stream) - a uniprocessor such as a single processor IBM 360.

2) MISD (Multiple Instruction Stream Single Data Stream) - a pipeline processor such as CDC STAR.

Copyright © 1976, Association for Computing Machinery, Inc. General permission to republish, but not for profit, all or part of this material is granted, provided that ACM's copyright notice is given and that reference is made to this publication, to its date of issue, and to the fact that reprinting privileges were granted by permission of the Association for Computing Machinery.


CONTENTS

INTRODUCTION
  Purpose and Scope
  Surveys
  Classification of Computer Architectures
  Definitions
  Reasons for Use of SIMD Processors
  Applications
  Matrix Multiplication on a Parallel Processor
SIMD GENEALOGY
MAIN TRADEOFF ISSUES
  Associative Addressing
  The State Transition Concept and Associative Processor Algorithms
  Processing Element (PE) Interconnections
  Input/Output
  Host Interaction
  Memory Distribution
  Software
  Activity Control and Match Resolution
  Balance Between Logic and Memory
ASSOCIATIVE PROCESSORS
  Introduction
  DLM (Distributed Logic Memory)
  Highly Parallel Associative Processors
  Mixed Mode and Multidimensional Memories
  Implemented Associative Processors
  Associative Processor Simulators
PARALLEL PROCESSORS AND ENSEMBLES
  Introduction
  Unger's Machine
  SOLOMON I, SOLOMON II, and ILLIAC IV
  Machines With Other Interconnection Structures
  Orthogonal Computer
  PEPE
  Comment
ACTUAL MACHINES
  Introduction
  STARAN
  OMEN
  PEPE
  ILLIAC IV
CONCLUSION
REFERENCES

3) SIMD (Single Instruction Stream Multiple Data Stream) - an associative or parallel processor such as ILLIAC IV.

4) MIMD (Multiple Instruction Stream Multiple Data Stream) - a multiprocessor such as a Univac 1108 multiprocessor system.

Murtha and Beadles [16] proposed a classification technique based upon parallelism properties. Their categories are:

1) General-purpose network computer;
2) Special-purpose network with global parallelism; and
3) Non-global, semi-independent network with local parallelism. This category is a catchall for machines which do not fit into the first two categories.

Categories 1 and 2 have the following subcases:

1.a) General-purpose network with centralized common control;
1.b) General-purpose network with identical processors but independent instruction execution actions;
2.a) Pattern processors;
2.b) Associative processors.

The purpose of this classification technique was to differentiate between multiprocessors and highly parallel organizations.

Another possible classification view, suggested by Hobbs, et al. [1], consists of the following categories:

1) Multiprocessor;
2) Associative processor;
3) Network or array processor; and
4) Functional machines.

Further, Hobbs, et al. suggested that architectures could be classified based upon the amount of parallelism in:

1) Control;
2) Processing units; and
3) Data streams.

However, it was noted that these parameters were present in all highly parallel machines and were therefore not adequate to define a machine architecture.

In his article against parallel processors, Shore [17] presents a unique classification technique which derives the machine descriptions from the description of a uniprocessor. The machine categories considered are summarized as follows:

1) Machine I - a uniprocessor.
2) Machine II - a bit-slice associative processor built from Machine I by adding bit-slice processing and access capability (e.g., STARAN).
3) Machine III - an orthogonal computer derived from Machine II by adding parallel word processing and access capability (e.g., OMEN).


4) Machine IV - a machine derived from Machine I by replicating the processing units (e.g., PEPE).

5) Machine V - a machine derived from Machine IV by adding interconnections between processors (e.g., ILLIAC IV).

6) Machine VI - a machine derived from Machine I by integrating processing logic into every memory element (e.g., Kautz's logic-in-memory computers [85, 86]).

Higbie [67] classifies computers using Flynn's four basic categories. However, he expands the SIMD category into the following four subcategories:

1) Array Processor - a processor that processes data in parallel and addresses the data by address instead of by tag or value.
2) Associative Memory Processor - a processor that operates on data addressed by tag or value rather than by address. (Note: this definition does not require parallel operation; however, it does allow for machines that operate on data in parallel.)
3) Associative Array Processor - a processor that is associative and also operates on arrays of data (typically, the operations are on a bit-slice basis, i.e., a single bit of many words).
4) Orthogonal Processor - a processor with two subsystems - an associative array processor subsystem and a serial (SISD) processor subsystem - which share a common memory array.

Higbie's categories provide for the identification of the ability to perform parallel, associative, and serial processing. Higbie defines a parallel processor to be any computer that contains multiple arithmetic units and operates on multiple data streams. Clearly, all four of Higbie's subcategories of SIMD processors fit his definition of a parallel processor.

Although none of the classification schemes presented here is mathematically precise, they are necessary to place associative and parallel processors in perspective. In the next section the terminology of associative and parallel processors is defined for use in the remainder of this paper.

Definitions

The definitions used for the remainder of this paper are based upon Flynn [15] and are consistent with those used by Thurber [10]. The definitions are:

1) SIMD machine: any computer with a single global control unit which drives multiple processing units, all of which either execute or ignore the current instruction.
2) Associative Processor: any SIMD machine in which the processing units (or processing memory) are addressed by a property of the data contents rather than by address (i.e., multiply A and B together in all elements where A ≥ B).
3) Parallel Processor: any SIMD machine in which the processing elements are of the order of complexity of current-day small computers and which typically has a high level of interconnectivity between processing elements.
4) Ensemble: a parallel processor in which the interconnection level between processing elements is very low or nonexistent.

Clearly, the intersection of these definitions will not be null, due to the overlap in classifying machine architectures.

Reasons for Use of SIMD Processors

There are many reasons for the use of SIMD architectures. Some of the most important are:

1) Functional
   a) Large problems such as weather data processing.
   b) Problems with inherent data structure and parallelism.
   c) Reliability and graceful degradation.
   d) System complexity.
   e) Computational load.
2) Hardware
   a) Better use of hardware on problems with large amounts of parallelism.
   b) Advent of LSI microprocessors.
   c) Economy of duplicate structures.
   d) Lower nonrecurring and recurring costs.
3) Software
   a) Simpler than for a multiprocessor.
   b) Easier to construct large systems.
   c) Less executive function requirements.

However, the reader is cautioned to remember that these are special-purpose machines, and any attempt to apply them to an incorrectly


sized, or designed, problem is an exercise in futility.

Applications

Numerous applications have been proposed for associative and parallel processors. Whether an application is ever implemented is based upon the economics of the problem. It is not enough to be able to describe a problem solution. To date, considerable energy has been expended searching for cost-effective parallel solutions, but little has been done to actually implement them. Examples of such applications are: matrix manipulation, differential equations, and linear programming.

Several areas of application appear quite well suited to associative and parallel processing. In some cases, SIMD processors provide cost-effective system augmentation (e.g., air traffic control and associative head-per-track disks). In others, SIMD processors are very close to functional analogs of the physical system (e.g., data compression, concentration, and multiplexing).

The use of associative and parallel processors appears promising for the elimination of critical bottlenecks in current general-purpose computer systems; moreover, only very small associative memories are required, and cost is really not a critical factor in the solution of these problems. Such problems may be encountered in the management of computer resources, and might involve such items as protection mechanisms, resource allocation, and memory management.

The application trend for associative processors is well defined. Due to cost factors, applications will be limited (in the near future) to problems such as resource management, virtual memory mechanisms, and some augmentation of current systems, rather than to data-base management or large file searching and processing. The application trend for parallel processors and ensembles seems to be in the area of large data computation problems such as weather data processing, nuclear data processing, and ballistic missile defense. Researchers must concentrate on the systems aspects of the problem, not on developing solutions to problems which are ill-defined or nonexistent. The only truly useful data-processing applications of parallel and associative processors have been developed through extensive work at the systems level, followed by parallel system design, rather than vice versa.

SIMD processors have been proposed and appear well suited for a number of applications; these are listed here along with their primary references. Factors to be considered in applying SIMD processors have been discussed elsewhere [18]. To be useful for solving the listed problems, the highly parallel processors must become more cost-effective. The suggested application areas are:

1) Applications in which associative processors appear to be cost-effective:
   a) Virtual memory mechanisms [19]
   b) Resource allocation [20]
   c) Hardware executives [21]
   d) Interrupt processing [22, 23]
   e) Protection mechanisms [19]
   f) Scheduling [20]

2) Applications in which parallel processors appear cost-effective and in which associative processors may be cost-effective:
   a) Bulk filtering [24]
   b) Tracking [25]
   c) Air traffic control [26]
   d) Data compression [27]
   e) Communications multiplexing [28, 29]
   f) Signal processing [30]

3) Applications in which associative processors would be useful if they were more cost-effective and in which parallel processors are probably cost-effective:
   a) Information retrieval [31]
   b) Sorting [32]
   c) Symbol manipulation [33]
   d) Pattern recognition [34]
   e) Picture processing [35]
   f) Sonar processing [36]
   g) Sea surveillance [37]
   h) Graph processing [38]
   i) Dynamic programming [39]
   j) Differential equations [40]
   k) Eigenvector [41]
   l) Matrix operations [42]
   m) Network flow analysis [43]
   n) Document retrieval [44]
   o) Data file manipulation [45]
   p) Machine document translation [46]
   q) Data file searching [47]
   r) Compilation [48]
   s) Formatted files [49]
   t) Automatic abstracting [50, 51]


   u) Dictionary look-up translations [52]
   v) Data management [53]
   w) Theorem proving [54]
   x) Computer graphics [55]
   y) Weather data processing [56]

Hollander's paper [2] presents many experts' opinions about the applicability of associative and parallel processors. Slotnick [57] and Fuller [58] have compared associative and parallel processors. They concluded that parallel processors appear more useful than associative processors, but that the machines would always be special purpose. Shore [17] has presented a very eloquent case against SIMD machines and parallel processors.

Matrix Multiplication on a Parallel Processor

Matrix manipulations can be used to illustrate the potential of a parallel processor. One of the major computations in many of the possible applications cited previously is the matrix multiply. Assume that a parallel processor such as that shown in Figure 1 is used to perform this operation. Each processing element will be assumed to have three registers, AREG, BREG, and CREG. Each cell is interconnected to its four nearest neighbors. Cannon [59] derived the following algorithm to multiply two n x n matrices together in n stages using this processor.

Figure 1. Parallel Processor with PE Connections to the Four Nearest Neighbors.

Algorithm:
  Set:      CREG = 0; BREG[PE I,J] = B[I,J] for all I,J <= N; AREG[PE I,J] = A[I,J] for all I,J <= N
  Shift:    Ith row of A left I-1 columns for all I <= N; Jth column of B up J-1 rows for all J <= N
  Multiply: TREG = AREG times BREG, in parallel in all PEs
  Add:      CREG = CREG + TREG, in parallel in all PEs
  Shift:    AREG right one position in each row; BREG down one position in each column
  Jump:     if not the Nth pass, jump to Multiply

As an example let:

  A = | a1 a2 a3 |        B = | b1 b4 b7 |
      | a4 a5 a6 |            | b2 b5 b8 |
      | a7 a8 a9 |            | b3 b6 b9 |

and C = A x B. After initialization, the memory map for CREG is all zeros. For AREG, the memory map is:

  | a1 a2 a3 |
  | a5 a6 a4 |
  | a9 a7 a8 |

And the BREG map is:

  | b1 b5 b9 |
  | b2 b6 b7 |
  | b3 b4 b8 |

After the multiply, add, shift, and jump, the memory maps appear as follows:

  CREG = | a1b1 a2b5 a3b9 |    AREG = | a3 a1 a2 |    BREG = | b3 b4 b8 |
         | a5b2 a6b6 a4b7 |           | a4 a5 a6 |           | b1 b5 b9 |
         | a9b3 a7b4 a8b8 |           | a8 a9 a7 |           | b2 b6 b7 |

After two more iterations the multiply will be finished, and CREG[PE I,J] will contain C[I,J].
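The algorithm is easy to check with a short simulation. The following sketch (our own illustration in Python; the function name and the torus-wraparound convention are ours, not the paper's) models the PE grid as N x N lists and performs the Set, Shift, Multiply, Add, and Shift steps described above:

```python
# A minimal simulation of Cannon's algorithm as described above.
# Each PE(i, j) holds AREG, BREG, and CREG; all shifts wrap around
# (the grid is treated as a torus).

def cannon_multiply(A, B):
    n = len(A)
    creg = [[0] * n for _ in range(n)]                # Set: CREG = 0
    # Shift: row I of A left I-1 columns; column J of B up J-1 rows.
    areg = [[A[i][(j + i) % n] for j in range(n)] for i in range(n)]
    breg = [[B[(i + j) % n][j] for j in range(n)] for i in range(n)]
    for _ in range(n):                                # N passes
        for i in range(n):
            for j in range(n):                        # "in parallel in all PEs"
                creg[i][j] += areg[i][j] * breg[i][j]
        # Shift: AREG right one position in each row,
        # BREG down one position in each column.
        areg = [[areg[i][(j - 1) % n] for j in range(n)] for i in range(n)]
        breg = [[breg[(i - 1) % n][j] for j in range(n)] for i in range(n)]
    return creg

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
assert cannon_multiply(A, B) == [[19, 22], [43, 50]]
```

Each pass of the loop corresponds to one Multiply/Add/Shift stage, so an n x n product completes in n stages.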

SIMD GENEALOGY

Figure 2 indicates the major parallel and associative architectures considered in this paper, together with their generic relationships. These architectures may differ substantially, yet they are related by virtue of their global associative or parallel properties.

Figure 2. SIMD Genealogy. (The figure is a tree relating the machines surveyed here: an associative branch from Memex [72] and the cryogenic catalog memory [73] through the distributed logic memory [31], the association storing processor [78], the tree channel processor [80], two-dimensional and bulk distributed logic memories [81], and the augmented content-addressable memory [85]; and a parallel/ensemble branch from the Unger machine [107] through SOLOMON I [108], SOLOMON II [109], the Orthogonal Computer [105], OMEN [112], ILLIAC IV [57], STARAN [114], and PEPE [115].)


MAIN TRADEOFF ISSUES

To allow the reader to place the architectures presented here in perspective, the major system tradeoff issues in SIMD processors will be discussed next. It should be understood that SIMD processors are special purpose. They are useful for a limited set of applications, which may be characterized as having [18]:

1) A large number of independent data sets;
2) No relationships between data sets that prevent them from being processed in parallel;
3) The requirement for quick throughput response; and
4) The potential to exploit associative addressing selection techniques.

Associative Addressing

Figure 3 indicates the main features required to implement an associative address process. The designer must include facilities to address or query the processor associatively, i.e., to address the memory array by content. In Figure 3, the application may require that all employees with a salary of $35 per day be located. This would be accomplished by performing an equality search on the daily-salary field of the file. This search is performed in parallel. To set up the search, a Data Register must be loaded with the daily salary ($35) for comparison; a Mask Register is included to mask the data register so that only the desired fields are searched; a Word Select (activity select) Register specifies the subset of words to be addressed; a Results Register is required to collect the results of the search; a Match Indicator is used to indicate the number of matches; and a Multiple Match Resolver must be provided to generate a pointer to the "topmost" matched word. An eight-word example of a personnel file stored in a simple associative memory is illustrated in Figure 3; it assumes that the search for a salary of $35 per day was initiated on the entire file. All necessary registers and features are illustrated.

Figure 3. Associative Processor Storing a Personnel File. (The figure shows an eight-word file with name, daily salary, and employee number fields, together with the control unit, data register, mask register, word select register, search results register, multiple match resolver, and external processing logic.)


The State Transition Concept and Associative Processor Algorithms

In 1965, Fuller and Bird [35] described a bit-slice associative processor architecture. Fuller and Bird discussed their algorithms in terms of state transitions. To perform parallel processing of data in an associative processor, two basic operations are required. These are a content search and a "multiwrite" operation. The multiwrite operation consists of the ability to write simultaneously into selected bits of several selected words. With a content search and multiwrite capability, an associative processor can be used to process data in a highly parallel fashion.

For example, consider the problem of adding field A to field B. This can be accomplished as a sequence of bit-pair additions (A_i added to B_i), with suitable attention paid to the carry bit. If bit 1 is the least significant and bit n the most, Figure 4 indicates the state table for the bit-serial addition of A to B with the result accumulating into B. As shown, B_i (result) and C_i+1 differ from B_i (operand) and C_i in four states: 1, 3, 4, and 6. One possible addition method would be to search for these states and multiwrite the correct B_i and C_i+1 into the appropriate fields, i.e., set up a "tag field" designated C_i, and set C_1 = 0. Operations would then proceed from the least significant bit to the most significant bit of the pair of operands. At each bit position i, the fields A_i, B_i, and C_i would be searched for state 1; then C_i+1 = 0, B_i = 1 would be multiwritten into all matching words. After a search for state 3, C_i+1 = 1, B_i = 0 would be multiwritten into all matching words. This same procedure would be carried out for states 4 and 6. If a word is matched on any search, it is removed from the list of active cells until operations move from bit position i to i+1, so that writing the resultant B_i and the carry into bit position i+1 does not cause data already processed to be reprocessed. Obviously, utilizing the sequential state transformation approach, any desired transformation can be processed in an associative processor with simple content search and multiwrite capability using very simple word logic.

Figure 4. Bit-Slice Addition State Transition Diagram.

  State   A_i  B_i  C_i   |   B_i (result)   C_i+1 *
    0      0    0    0    |        0            0
    1      0    0    1    |        1            0
    2      0    1    0    |        1            0
    3      0    1    1    |        0            1
    4      1    0    0    |        1            0
    5      1    0    1    |        0            1
    6      1    1    0    |        0            1
    7      1    1    1    |        1            1

  * Carry into bit position i+1 from bit position i.

Operating on the state transition concept, six general requirements for the architecture of an associative processor were deduced. These are:

1) The basic processing primitive is the identification of all words meeting some binary state variable configuration;
2) No explicit addressing is necessary, since processing can be performed simultaneously over all data sets of interest;
3) Data is processed "column serially" to limit the number of state transformations to be remembered;
4) Tag bits (temporary bit-slice storage) are very useful (for example, for the carry bit);
5) A search results/word select register and a multiword write capability are desirable; and
6) Stored program control of the associative processor is desirable.
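A minimal rendering of the addition procedure just described, assuming LSB-first bit lists and a one-bit carry tag per word (the data layout and names are ours, not Fuller and Bird's):

```python
# Bit-serial addition B := A + B by content search and multiwrite,
# following the state table of Figure 4.

N_BITS = 4
# Each word holds fields A and B as bit lists, LSB first (bit 1).
memory = [{"A": [1, 0, 1, 0], "B": [1, 1, 0, 0], "C": 0},   # A=5,  B=3
          {"A": [1, 1, 1, 1], "B": [1, 0, 0, 0], "C": 0}]   # A=15, B=1

# States whose outputs differ from their inputs (states 1, 3, 4, 6):
# (A_i, B_i, C_i) -> multiwrite (new B_i, new carry C_i+1)
TRANSITIONS = {(0, 0, 1): (1, 0),   # state 1
               (0, 1, 1): (0, 1),   # state 3
               (1, 0, 0): (1, 0),   # state 4
               (1, 1, 0): (0, 1)}   # state 6

for i in range(N_BITS):
    done = set()          # words rewritten at this bit are deactivated
    for state, (b_new, c_new) in TRANSITIONS.items():
        # Content search for `state` over all still-active words...
        matches = [k for k, w in enumerate(memory)
                   if k not in done
                   and (w["A"][i], w["B"][i], w["C"]) == state]
        # ...then multiwrite the new partial sum and carry into them.
        for k in matches:
            memory[k]["B"][i], memory[k]["C"] = b_new, c_new
            done.add(k)

def value(bits):   # LSB-first bit list -> integer
    return sum(b << i for i, b in enumerate(bits))

assert [value(w["B"]) for w in memory] == [8, 0]   # 5+3=8, 15+1=16 mod 16
```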

Fuller and Bird recognized an important correlation between the way in which data is stored in the processor's memory and its processing speed. As examples of this phenomenon, they described both word-per-cell (W/C) and bit-per-cell (B/C) data organizations for their associative processor. Figures 5 and 6 show the W/C and B/C storage organizations, respectively, for nine four-bit words X11, ..., X33. The word-per-cell organization has become an industry standard, and it is, in fact, the way most associative processors store data. The desire to perform bit-slice processing on many data sets simultaneously, without much inter-word communication, led to the widespread use of this type of organization. W/C is very efficient if a large number of cells are involved;


Figure 5. Word Per Cell Organization.

operands can be stored in the same cell, and most operands are processed by each bit-slice command. For example, in Figure 5 this type of organization would be very efficient if, for all operands X_i1, X_i2, X_i3, it were desired to calculate X_i1 + (X_i2 * X_i3) for all i. In this case all cells could be activated and the calculation could be performed in a bit-serial manner over all operands.

However, there are problems for which a B/C approach is more efficient. (The B/C approach is the same as that described by Crane and Githens [76]; it was discovered independently by Fuller and Bird [35].) The B/C organization allows many operands to be processed simultaneously in a bit-parallel mode. Because many operands are treated in a bit-parallel fashion, B/C processing can be very efficient, even on only a few operands, in comparison with W/C processing. Also, operands stored in the same bit slice can easily be picked up. Therefore, B/C processing can be efficient in cases where only a few operands are activated for processing by an instruction, or for applications in which extensive pairing of operands must occur.

Using the state transition concept, associative processor algorithms may be clearly described for a variety of architectures. Considering the range of architectures currently available, no attempt will be made to describe the algorithms here. Rather, a list will be given of the more important operations that can be performed. Obviously there are many variations of these algorithms.

Figure 6. Bit Per Cell Organization.

Define a field as a set of bit slices. There are two main types of operations that can then be performed in an associative processor. First are central operations, which process a central operand in the data register against a selected field. On the other hand, there are field operations, which process two fields together without utilizing the data register. A central

equality search would return responses for all active words in which the field matched the data word, i.e., a search of the form C = V, where C is the central operand and V is a vector of active words. A field equality search would return responses for all active words in which the fields were equal, i.e., a search of the form V = V1, where V and V1 are vectors of length P and a response is given only if V_i = V1_i. Table 1 lists typical operations that can be performed in an associative processor. Most operations are meaningful in either the field or central format. A comprehensive discussion of these operations may be found in [60]. Thurber and Patton [61] discussed floating-point algorithms for associative processors.

Table 1. Typical Associative Processor Operations.

  Class of Operation | Example Operations
  -------------------|--------------------------------------------
  SEARCHES           | Maximum; minimum; between limits; equality;
                     | greater than; less than; proximity
  ARITHMETIC         | Add; subtract; multiply
  SYSTEM LEVEL       | Response resolution; multiple match
                     | manipulation; response and match register
                     | manipulation
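As one example from the table, the classic maximum search needs only the bit-slice search and tag facilities already described. The sketch below (our own illustration, not an algorithm given in the paper) scans the field from the most significant bit, restricting the word-select tags to candidates holding a 1 whenever any responder exists:

```python
# Bit-serial maximum search: find all words holding the largest value
# in a field, using only bit-slice searches and a tag register.

def maximum_search(field_values, n_bits):
    tags = [True] * len(field_values)        # word select: all active
    for i in reversed(range(n_bits)):        # MSB first
        # Search the current bit slice for a 1 among tagged words.
        ones = [t and (v >> i) & 1 == 1
                for t, v in zip(tags, field_values)]
        if any(ones):      # match indicator: at least one responder
            tags = ones    # restrict candidates to words with a 1 here
    return tags

values = [9, 14, 3, 14, 7]
print(maximum_search(values, 4))   # -> [False, True, False, True, False]
```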


Processing Element (PE) Interconnections

There are five major solutions to the problem of the level of PE interconnection to be provided in a SIMD processor; these are to establish:

1) No direct connection between processors (usually this condition occurs in an associative processor or ensemble), thus forcing all PE-to-PE communication to pass through the controller [62].
2) Connections to the four nearest neighbors - this is the most common interconnection and allows for efficient use of the processing array in matrix manipulations [57].
3) Connections to the six (hexagonal) or eight (octagonal) nearest neighbors - these interconnections are sometimes used in pattern-recognition processors [63].
4) Perfect shuffle [64] - this connection forms a permutation (which can be modelled as a shuffle of a deck of n cards: 1 -> 1, n/2+1 -> 2, 2 -> 3, ...) and is quite useful in signal processing applications [30]; a sketch follows the end of this list.
5) Connections to the nearest n-cube neighbors - this connection pattern maps the PEs onto the nodes of an n-cube [65].
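A minimal sketch of the perfect shuffle (our own rendering of the card-shuffle description above; the code is 0-indexed internally):

```python
# The perfect shuffle of n PEs (n even): like riffling a deck of cards,
# position 1 -> 1, n/2+1 -> 2, 2 -> 3, ... (1-based, as in the text).

def perfect_shuffle(data):
    n = len(data)
    half = n // 2
    out = [None] * n
    for k in range(half):
        out[2 * k] = data[k]             # first half -> even slots
        out[2 * k + 1] = data[half + k]  # second half -> odd slots
    return out

print(perfect_shuffle([1, 2, 3, 4, 5, 6, 7, 8]))
# -> [1, 5, 2, 6, 3, 7, 4, 8]
```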

There are some issues involving end conditions on the four nearest-neighbor interconnection structure. Usually, the convention is that the leftmost and rightmost cells in a specific row can be connected together. This allows the network to be connected as either a cylinder or a torus. Such a convention would also be useful with respect to the six and eight nearest-neighbor connection structures.

Input/Output

In addition to the typical I/O tradeoffs encountered in general-purpose computers, SIMD processors encounter an I/O bottleneck with respect to output from the PEs to the host machine [26, 50, 51, 66]. An additional type of I/O also exists: a parallel set of parallel channels (one per PE) can be defined to provide high-speed I/O to the processing elements.

The host-SIMD processor bottleneck can be particularly acute due to:

1) The match resolution required between word outputs if the processor is associative;
2) Bit- or byte-serial processor operation; and
3) The number of processing elements.

The parallel I/O channels are connected to the array of processing elements in one of three ways: to all elements; to a subset of the elements; or, in a processor with four nearest-neighbor connections, to both a row and a column of processing elements under program control.

Host Interaction

There are two basic techniques for interfacing the processor to a host machine. One is to make the SIMD machine look like a peripheral to the host [26]. The other is to connect the host and SIMD processor onto a scoreboard so that they appear as functional units [67]. Due to its potential complexity, this latter technique has seen very little use.

Memory Distribution

Functionally, the control unit of the processor has a memory which must be able to store both data and instructions. Each processing element needs only a data memory, since the controller broadcasts the instructions to all PEs simultaneously. This may be either a physically dedicated memory (e.g., PEPE) or a shared memory that functionally provides unique data paths to each PE (e.g., OMEN).

The program storage may be dedicated to the control unit, may be shared between the control unit and the host, or may be shared between the PEs and the control unit (e.g., ILLIAC IV). In ILLIAC IV both instructions and data are stored in the PE memories. The data in a PE's memory is available only to that PE. Instructions stored in PE memory are distributed; a page of instructions is fetched by the control unit, which fetches from eight PEs simultaneously and reads a word from each PE into an eight-word CU (Control Unit) buffer.
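A sketch of this instruction-fetch arrangement, assuming simple modulo-8 interleaving of the page across the PE memories (the interleaving rule and the names are our assumption; ILLIAC IV's actual addressing details are not given here):

```python
# Sketch of distributed program storage: a page of instructions is
# interleaved across 8 PE memories, and the control unit refills its
# 8-word buffer by reading one word from each PE in parallel.

N_PES = 8

def store_page(pe_memories, page):
    """Interleave a page of instructions across the PE memories."""
    for addr, instruction in enumerate(page):
        pe_memories[addr % N_PES].append(instruction)

def fetch_buffer(pe_memories, row):
    """One parallel fetch: read word `row` from every PE memory."""
    return [mem[row] for mem in pe_memories]

pe_memories = [[] for _ in range(N_PES)]
store_page(pe_memories, [f"INSTR{i}" for i in range(16)])
print(fetch_buffer(pe_memories, 0))   # -> ['INSTR0', ..., 'INSTR7']
```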

Software

SIMD machines are special purpose. Therefore, the available software is based upon the application. There are, however, some software tradeoffs that are visible due to the SIMD property of the machines.


A number of higher-order languages have been extended for use on SIMD machines [68, 69]. The major extensions of the languages are summarized as follows:

1) Data Declarations - a means of differentiating parallel (Processing Element) and sequential (Control Unit) variables is necessary. Typically, any data type may be declared parallel. This declaration causes the variable to be assigned the same location in every PE. Since the amount of PE memory may be small, only a few (a hundred or so) variables should be declared parallel.

2) Arithmetic and Logical Expressions - the use of arithmetic and logical operators is usually directly extended. The result of the typical operation is described here (a code sketch of the three cases follows this list):
   a) SER1 OP SER2 - Operations upon two serial variables yield either a serial data item, if assigned to a serial variable location, or a duplication, if assigned to a variable declared to be parallel.
   b) PAR1 OP PAR2 - Operations on parallel variables yield a parallel result and should be assigned to a parallel variable; otherwise the programmer must assure that only a single PE is active so that the assignment can be made to a serial variable.
   c) SER OP PAR - Operations with both serial and parallel variables in the expression yield a parallel result. There will be system problems if this expression tries to assign a value to a serial variable unless only one PE is active.

3) Parallel Control Statements - the control of a program on a SIMD machine involves not only the types of control statements seen in sequential machines, but also control statements that use the amount of activity in the array as a means of execution control. Typically, such control statements limit the control to conditional jumps on none active, one active, more than one active, or all active.
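The three expression cases can be made concrete with a toy model of a PE array (our own sketch; the identifiers are invented and do not come from GLYPNIR or any actual SIMD language):

```python
# Toy model of serial (control unit) vs. parallel (per-PE) variables.
N_PES = 4

def par(x):                  # duplicate a serial value into all PEs
    return [x] * N_PES

# SER1 OP SER2: a serial result, or a duplication if assigned parallel.
ser = 2 + 3                              # serial result: 5
par_copy = par(2 + 3)                    # duplicated: [5, 5, 5, 5]

# PAR1 OP PAR2: element-wise across the PEs; the result is parallel.
a = [1, 2, 3, 4]
b = [10, 20, 30, 40]
par_sum = [x + y for x, y in zip(a, b)]  # [11, 22, 33, 44]

# SER OP PAR: the serial operand is broadcast; the result is parallel.
scaled = [ser * x for x in a]            # [5, 10, 15, 20]

# Assigning a parallel result to a serial variable is only safe when
# exactly one PE is active:
activity = [False, False, True, False]
assert sum(activity) == 1
ser_out = [v for v, act in zip(par_sum, activity) if act][0]   # 33
```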

There are other issues with respect to software extensions that are unique to ILLIAC IV. These are: 1) storage allocation and deallocation; 2) local indexing; and 3) mode control. The language for ILLIAC IV (GLYPNIR) allows blocks of storage to be allocated and deallocated (with the GETPEB and FREEPEB instructions, respectively). Further, ILLIAC IV has a register in each PE which allows the address generated by the CU to be individually indexed in each PE. Mode control involves setting up instructions that allow the programmer to reconfigure the set of PEs. These instructions must be tailored so that they reflect the details of the physical and functional modularity of the set of PEs.

Activity Control and Match Resolution

Not all elements in a SIMD machine participate in all operations. Thus, an activity bit is usually provided so that a PE can be stopped from participating in a particular operation. In some machines (e.g., PEPE) a stack is provided for activity. Instructions allow for pushing and popping the current activity state (in all PEs) into and out of the stack. This feature is used in conjunction with the nesting provided by the parallel DO, IF, and WHERE statements provided by PEPE's PFOR language [68].
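A sketch of the activity-stack idea (our own illustration; the statement names are invented and are not PEPE's actual PFOR syntax):

```python
# Activity control with a stack, allowing nested parallel conditionals.
N_PES = 4
activity = [True] * N_PES
stack = []

def where(cond):            # enter a parallel conditional
    global activity
    stack.append(activity)             # push the current activity state
    activity = [a and c for a, c in zip(activity, cond)]

def end_where():            # leave it
    global activity
    activity = stack.pop()             # pop the saved activity state

x = [3, 8, 1, 9]
y = [0] * N_PES
where([v > 2 for v in x])              # only PEs with x > 2 participate
y = [v if act else old for v, act, old in zip(x, activity, y)]
end_where()
print(y)    # -> [3, 8, 0, 9]
```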

The activity bit is used to control the PEs participating in an operation; however, there are instructions (e.g., some I/O-type instructions) which require that a unique PE be identified. Further, some control instructions require that the number of active PEs be known. These requirements lead to the problem of multiple match resolution. Match resolution is a particularly acute problem in associative processors and associative memories. Some machines provide a simple match indicator (flip-flop) which provides a match/no-match indication. Other systems provide a match indicator subsystem which provides either a count of the exact number of matches or an indication that none, one, more than one, or all elements matched. The choice of the available match indication is directly tied to the speed desired and to the language constructs desired to control system activity. PEPE provides a unique technique to count the number of matches: each active element's activity flip-flop outputs a unit of current; the match indicator collects the units of current and performs an analog-to-digital conversion of the current sum to obtain the number of matched elements.

In addition to knowing the number of matched elements, it is often useful to retrieve, in a sequential fashion, the data from the


matched elements. This requires that a technique be provided which allows the matched elements to be serially activated. The device that performs this function is known as the multiple match resolver (MMR). Implementation of high-speed multiple match resolvers is one of the most difficult design problems associated with SIMD machines [70]. Typically, the MMR is implemented using static logic networks or shift register networks. STARAN has a two-level resolution system: its memory is partitioned into 256-word memory arrays; the first level of resolution occurs at the array level (between the 256 words) and the second level occurs in the control unit, between the arrays in the system. A multiple-level technique not only can be quite fast, but is also very modular and flexible with respect to the number of arrays in the system.
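The two-level resolution scheme can be sketched as follows (our own illustration, using 4-word arrays rather than STARAN's 256-word arrays):

```python
# Two-level multiple match resolution: resolve within each array first,
# then let the control unit resolve between arrays.

WORDS_PER_ARRAY = 4   # STARAN partitions its memory into 256-word arrays

def first_responder(bits):
    """Priority-encode a list of match bits; None if no responder."""
    for i, b in enumerate(bits):
        if b:
            return i
    return None

def resolve(match_bits):
    # Level 1: each array resolves among its own words (in parallel).
    arrays = [match_bits[i:i + WORDS_PER_ARRAY]
              for i in range(0, len(match_bits), WORDS_PER_ARRAY)]
    local = [first_responder(a) for a in arrays]
    # Level 2: the control unit resolves between the arrays.
    for array_no, word_no in enumerate(local):
        if word_no is not None:
            return array_no * WORDS_PER_ARRAY + word_no
    return None

matches = [0, 0, 0, 0,  0, 0, 1, 0,  1, 0, 0, 1]
print(resolve(matches))   # -> 6 (array 1, word 2)
```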

Occasionally it is useful to retrieve items from the processor in an ordered fashion, for example, in ascending or descending order on some value. Techniques to implement such requirements quickly are quite common, but one of the best designed to date is the ordered retrieval algorithm of Lewin [71]. The importance of the Lewin algorithm is that retrieval time is dependent upon the number of words to be retrieved, not the number of bits in the words, which makes it a very fast technique.

Balance Between Logic and Memory

Most of the tradeoff issues discussed so far have dealt with detailed tactical issues concerned with the implementation of features and facilities necessary to design a SIMD machine. One remaining major issue is that of balancing the processing logic and storage sections of each system so that an efficient, application-tuned system is the design result. The tradeoffs involved in this area are implicit in the architecture of the system and thus are the subject of the following sections on architecture. The machines described are tuned for different applications, but they all exploit SIMD properties.

ASSOCIATIVE PROCESSORS

Introduction

Associative addressing is the basis of associative processing. A simple example of an address


association process is a catalog index, in which the search is for all pages which are associated with a property A. This is accomplished by looking up A in the index and finding, associated with A, the list of all pages which mention or contain A. Associative processor architectures mechanize this process in hardware. Since the search property can be simply mechanized in software using hash coding techniques, hardware mechanization is required principally for speed. Hash code techniques are also limited in speed by the match resolution problem.

Associative processors can make efficient catalog search mechanisms. The first associative processor, conceived in 1945, was a device useful for searching the information files of an individual [72]. The first associative hardware mechanism implemented was a cryogenic device built in 1956 as a catalog search memory [73].

DLM (Distributed Logic Memory)

Lee [31] described a distributed logic memory. This machine is designed as an information retrieval and string processing memory. The architecture of the computer is given in Figure 7. The DLM exploits the associative addressing technique for the information retrieval process. Lee's design goals were to develop a machine in which: 1) cells are indistinguishable; 2) the string retrieval time is independent of the number of cells; 3) the amount of time for cross retrieval (retrieval of the identifier given the information contents as the search criterion) is independent of the number of cells; and 4) the memory is modularly expandable. The reason for this associative processor design was to replace conventional systems, in which there is really only a superficial relationship between information and its corresponding address. Additionally, counting, addressing, and scanning operations were eliminated. These decisions then dictated a new architectural concept. In this associative processor, Lee thought of information retrieval as a process of getting rid of useless data rather than as a search for information.

Figure 7. Distributed Logic Memory. (The figure shows the chain of cells with its command signal lines, data lines, direction specification lines, input and output leads, and output data buffer.)

The elementary functions of the processor are indicated by the signal and control lines in Figure 7. The associative processor has four commands - input, output, propagate, and match. There are also lines for input data, output data, and the direction of desired

propagation (right or left). Match gives the processor its associative capability. When the match command is enabled, the contents of all active cells are simultaneously compared to the information on the input leads. If the match against the contents of cell i is successful, then an internal signal, M_i, is generated and transmitted to one of the adjacent cells, depending upon the specified direction of propagation. Assume, for example, that all propagation is from left to right. The transmission of M_i to the adjacent cell causes that cell to become active. The propagate command causes the transfer of the activity bits of all cells to the adjacent cell in the specified direction of propagation, namely, the right. Each cell has the capability to store one unit of information. A cell is either active or inactive. Examples of retrieval and string processing are detailed in Lee's paper.
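A toy rendering of the match and propagate primitives (our own sketch; it fixes the propagation direction as left-to-right and represents each cell's contents as a single character):

```python
# Sketch of Lee's DLM primitives: match activates the right-hand
# neighbor of each matching active cell; propagate shifts all activity
# bits one cell in the propagation direction.

cells = list("CATXCAR")             # one symbol per cell
active = [True] * len(cells)        # activity bits

def match(symbol):
    """Compare all active cells against the input; activate the cell
    to the right of each match."""
    global active
    hits = [a and c == symbol for a, c in zip(active, cells)]
    active = [False] + hits[:-1]    # M_i activates the adjacent cell

def propagate():
    """Shift every activity bit one cell to the right."""
    global active
    active = [False] + active[:-1]

# Find every string "CA": match 'C', then match 'A' among the cells
# that became active.
match("C")          # cells following a 'C' are now active
match("A")          # of those, cells following an 'A' remain active
print([i for i, a in enumerate(active) if a])   # -> [2, 6]
```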

Lee pointed out that the successful exploitation of his architecture is dependent upon the capability to construct systems consisting of thousands, or even millions, of cells. The distributed logic memory concept was further developed by Lee and Paull [74] and by Gaines and Lee [75], with the emphasis on information retrieval and the processing of strings of information. This work became the basic architecture which Crane and Githens [76] modified for bulk processing. Further, as Crane and Githens developed the PEPE (Parallel Element Processing Ensemble) architecture, Lee's work made a significant contribution to the organization of the correlation units.

Many other researchers have considered the use of Lee's basic DLM concept. Sturman [77] constructed a general-purpose processor using a DLM organization. Savitt, et al. [78] used the DLM concept for their phrase-oriented ASP (Association Storing Processor). Lipovski [79, 80] designed one of the most interesting DLM architecture derivatives. His architecture ad-


dressed the problem of building a practical DLM for information storage and retrieval problems.

A major complication encountered in the application of DLM-type architectures to information storage and retrieval is that very high-speed I/O, and the ability to segment the processor to execute subprograms, are normally required. A linear DLM (Figure 8) is deficient in these areas and is handicapped by propagation delays. Lipovski designed the Tree Channel Processor (TCP) interconnection scheme of Figure 9 to solve these problems.

Figure 8. Distributed Logic Memory Used for the Tree Channel Processor. (A channel connects all cells in parallel; each cell contains a storage register, a comparator, and a match flip-flop.)

Lipovski proposes (Figure 9) that the processor have a single channel and two identical rail complexes. The rail communication is used so that cells may appear to the programmer as an

ordered one-dimensional array able to detect subsets and substrings, to count elements, etc. Cells are ordered such that all cells below cell i or to the right of cell i have lower order. Using this rule, identical logic in each cell (similar to carry look-ahead logic) can propagate signals to higher or lower cells, taking the shortest route in the tree. Each cell is so equipped that a signal to be transferred from point A to point B can bypass cells 1 and 2; cell 3 can be set so that the signal proceeds from A through 3 to B instead of from A past 3, past 1, past 2, past 3 to B. The root cell will always have the largest integer value and therefore be below every other cell. Propagation delays are thereby decreased by several orders of magnitude. The TCP architecture also allows subtrees to be pruned out of the array if they are faulty.

Figure 9. Tree Channel Processor Interconnection.

Figure 10. Tree Channel Processor Construction in a Cube.

There exists a three-dimensional realization of the TCP in a seven-way homogeneous tree, as shown in Figure 10. To illustrate the construction, first draw a cube and number the corners 1, 2, ..., 7, A. Into the center of the cube place a cell and connect it to every other

corner. Call this entity a leaf module. In general, to construct a tree, take seven identical leaf modules and connect them such that their A corners coincide, linking this corner to the "free corner," corner B, to form a larger leaf module. This procedure can be repeated by setting A = B and repeating the operation. There also exist realizations for 25-way and 50-way homogeneous trees. The propagation delay through such trees has been calculated by Lipovski [80].

The ASP (Association Storing Processor) [78] consists of much more than a processor. The concept includes a language specification [81], two special-purpose machine designs, and an interpreter for the language. A principle of design for the ASP machines was that the language drive the hardware design, i.e., that the hardware be designed to support the language.

The language is based on two fundamental premises. The first is that data and processes are constructed from ordered triples of items, (A, R, B). These triples represent relations and may be interpreted as follows: A is related to B by the relation R. The second premise is that all processes may be expressed in terms of matching and replacing structures.

Each ASP instruction consists of two data structures. One is the control structure; it represents a specification of the substructures being searched for in the data structure. The second is the replacement structure, specifying the data which is to replace the set of matching data structures. Two structures are defined as matching if, and only if, there is a one-to-one correspondence between their items and links such that corresponding items and link labels match. Each matching data structure is replaced.
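A toy rendering of the match-and-replace cycle over (A, R, B) triples (our own sketch; ASP's control structures are general graphs, which are reduced here to single-triple patterns with one variable, and all data values are invented):

```python
# Toy ASP step: data is a set of (A, R, B) triples; an instruction is
# a control structure (pattern) plus a replacement structure.
# '?x' is a variable that must bind consistently.

data = {("JOHN", "WORKS-FOR", "ACME"),
        ("JIM",  "WORKS-FOR", "ACME"),
        ("ACME", "LOCATED-IN", "ST-PAUL")}

def match(pattern, triple):
    """Return a binding dict, or None if the triple doesn't match."""
    binding = {}
    for p, t in zip(pattern, triple):
        if p.startswith("?"):
            if binding.setdefault(p, t) != t:
                return None
        elif p != t:
            return None
    return binding

def substitute(structure, binding):
    return tuple(binding.get(s, s) for s in structure)

# Instruction: replace every (?x WORKS-FOR ACME)
# with (?x BASED-IN ST-PAUL).
control     = ("?x", "WORKS-FOR", "ACME")
replacement = ("?x", "BASED-IN", "ST-PAUL")

for triple in list(data):
    b = match(control, triple)
    if b is not None:                   # each matching structure...
        data.remove(triple)             # ...is replaced
        data.add(substitute(replacement, b))

print(sorted(data))
```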

The data structures may be represented as directed graphs. The ASP language does not limit the size or complexity of the graph. There may be more than one relation per item. The ASP instruction may specify the next instruction to be executed in case of either a match or no match, or else the processor will halt (return control to the calling process from a subroutine). Empty structures are allowed.

There are three means of indirect match. These are 1) rule of inference, 2) wired-in functions, and 3) subroutines.

In rule-of-inference matching, a relation matches if there exists a set of data relations which imply it. This process may be recursive. Rules of inference appear as relations with the link label IMP (implies), and they may contain X symbols. The tail of IMP is connected to the antecedent of the rule, and the head to its con- sequent. Logical functions may appear in antecedents, making them very flexible.

There are a number of wired-in interpreted functions. The control structure only locates the matches. Instead of replacement, it is possible to exercise some wired-in interpretive functions, such as DECONC (deconcatenate), COUNT (count items or types of items), etc.

Subroutine matching involves the calling of a prestored subroutine that generates a matching relation, much like a function generator. This feature is useful for retrieving relations which can be more conveniently and flexibly stored as a process and interpreted at run time rather than stored in explicit form.

Figure 11. Association Storing Processor. (The machine comprises a central control computer, a context-addressed memory, mask and comparison registers, a microprogram control unit and microprogram memory, and an arithmetic unit with reserved-word hardware.)

The general block diagram of both ASP machines is shown in Figure 11. The main element in these machines is the "Context-Addressed Memory." The Arithmetic Unit is somewhat unique: it provides special hardwired functions to help in executing the program. It may furnish

arithmetic, etc., but the detailed functions provided are application-dependent.

The phrase-oriented ASP system has been designed and optimized to store, search, and replace phrases, i.e., sequences of relations (substructures of the data structure). Thus, instead of identifying items, the phrase-oriented system must be able to trace through detailed structured graphs with multiple relations. For this reason, the phrase system architecturally looks like Lee's DLM, and the data is stored as a set of relations.

The phrase-oriented processor memory is functionally illustrated in Figure 12(a). Each word of the memory is broken into two fields: the relation (link-label) field and the item field. The start of a phrase is marked by the special symbol PS in the link-label field. The item associated with the phrase is placed in the item field of the same word as PS. All relations between items are then detailed, with the relation being specified.

The phrase format was selected because it contains a single phrase for each item. A phrase consists of a header, the item, all link labels, and their associated items. Inverse link labels are also included. Phrases are stored contiguously.

  word i+2:  PS  A
  word i+1:  R1  B
  word i:    R2  C
  word i-1:  PS  D

Figure 12(a). Phrase-Oriented Processor.
Figure 12(b). Phrase that Results in the Memory Map Shown in Figure 12(a).

The means of performing searches is

very simple. Phrases that contain all relations specifying the context address of the variable are selected, and then the items at the head of the selected phrases (PS) are tagged as the values of the variables. PS is implemented as a one-bit tag in the actual ASP system. Since the processor is associative, ASP imposes no restriction on phrase lengths, item ordering within a phrase, or the ordering of phrases. However, phrases must be contiguous. This restriction could be removed if an actual name tag were included in the cell instead of a tag bit. Facilities are provided to ease the processing of compound relations, but they will not be detailed here.
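The phrase search reduces to a few lines in a sketch (ours, not ASP's actual hardware behavior): phrases are contiguous runs headed by a PS word, and a phrase responds when it contains every queried relation.

```python
# Phrase-oriented search: memory words are (link_label, item) pairs;
# PS marks the head of a phrase. A phrase is selected when it contains
# all relations in the query, and its head item is the returned value.

memory = [("PS", "A"), ("R1", "B"), ("R2", "C"),
          ("PS", "D"), ("R1", "B")]

def phrases(mem):
    """Split memory into contiguous phrases, each headed by a PS word."""
    out = []
    for word in mem:
        if word[0] == "PS":
            out.append([word])
        else:
            out[-1].append(word)
    return out

def search(query):
    """Tag the head item of every phrase containing all queried relations."""
    return [ph[0][1] for ph in phrases(memory)
            if all(rel in ph[1:] for rel in query)]

print(search([("R1", "B")]))                # -> ['A', 'D']
print(search([("R1", "B"), ("R2", "C")]))   # -> ['A']
```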

The primary difference between the phrase-oriented and item-oriented ASP is the memory design. The context-addressed memory for the item-organized ASP is shown in Figure 13.

Figure 13. Item-Oriented ASP.

The item-oriented ASP has a square array of storage

cells with three bus structures. Two of the buses connect local cells: one set connects local cells horizontally, the other connects the cells vertically. In each of these cases a cell is connected directionally to its nearest neighbor. In Figure 13, each cell can only propagate a local signal to the cells north or west of itself. The third bus structure is global and connects the control circuitry to each cell. The cells at the top (left) connect back to the cells at the bottom (right), so that any cell can propagate a signal (through a chain of intermediate cells) to any other cell.

Each storage cell in the memory contains five fields and its own address. Relations are coded in terms of the cell addresses, i.e., a cell containing (A, R, B) actually contains in field A the address of the cell containing "A," etc. Of the five fields, two are for tags. The cells can perform comparison operations, reads, and writes, and can generate intercell communications. A match flip-flop is included in each cell. Local signals propagate by address, vertically (north) until the row number in the signal and the cell match, then left (west) until the correct cell is found. A "blockage" east-to-west signal

in conflict with a south-to-north signal causes the east-to-west signal to have priority. Express routes are available for local signals. DLMs found actual application in the correlation units of PEPE. It appears they may be quite useful for constructing an associative disk, and a number of researchers are investigating this area [82, 83, 84].
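The routing rule lends itself to a short functional sketch (a software model of the signal path only, not the cell logic; the wraparound directions are assumed from the bus description above):

```python
# Sketch of local-signal propagation in an n x n item-oriented ASP array:
# north (wrapping bottom-to-top) until the row in the signal matches,
# then west (wrapping right-to-left) until the addressed cell is reached.
def route(src, dst, n):
    r, c = src
    path = [(r, c)]
    while r != dst[0]:
        r = (r - 1) % n          # propagate north through intermediate cells
        path.append((r, c))
    while c != dst[1]:
        c = (c - 1) % n          # then west to the target cell
        path.append((r, c))
    return path

print(route((3, 3), (1, 0), n=4))  # [(3,3), (2,3), (1,3), (1,2), (1,1), (1,0)]
```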

Highly Parallel Associative Processors

A number of researchers have described processors that are associative but which have either extensive control built into the memory array or multiple control units. These architec- tures are discussed next.

Kautz [85] has designed an associative processor in which extensive arithmetic ability exists along with other functions at each storage cell location. This array is called an ACAM (Augmented Content-Addressed Memory). Kautz and Pease have further refined designs of this type [86]. Whereas many of the previous associative processors were very simple (typically less than 10 gates per cell), this array has cells whose complexity is on the order of 40 NOR gates. To understand the ACAM, think of the array as an m x n array of cells. Each bit is stored in the single cell shown in Figure 14. Each of the values ai, bi, and ci represents a programmable value that determines the array's behavior.

Figure 14. Kautz's Cell.


The c values exist on a column basis and can be used as bit masks so that only the selected columns are included in the operations. The ab values are paired on a per-word basis; there is an ab value for each word in the array. These ab values serve to select the operation to be performed on that particular word. As with the c values (which are constant along a column), the ab values are constant over a word. The intersection of the c and ab values at a cell completely specifies the function the cell performs.

Various words of the array can concurrently perform different operations (e.g., aibi does not have to equal ajbj if i ≠ j), thus allowing extreme flexibility. In fact, several different tests of the input word may occur simultaneously. A price is paid in two major areas for this flexibility. First, each cell has 8 terminals, and an m x n array has 4(m + n) edge terminals (8(m + n) pins if each cell were to be placed in a DIP). Second, unless all ab values are constrained to be equal, or entire words are masked, the actual use of the array in an efficient manner will be very difficult.

In the horizontal direction the array can realize such operations as shifting, column entry, column readout, equality search, and inclusion. In the vertical direction the array allows word shifting, word I/O, masked word complementing, and other operations. Functions such as stacks, queues, etc., may be synthesized by using the left-most or right-most columns as tag registers. Registers to hold the c, a, and b values are highly desirable.
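The control idea can be sketched as follows (the two operations shown are illustrative stand-ins, not Kautz's actual cell functions, which are those tabulated in Figure 14): each word's ab pair selects the operation applied to that word, while the per-column c bits mask which bit positions participate.

```python
# Sketch of ACAM control: per-word ab operation select, per-column c mask.
def acam_step(words, ab, c, arg):
    """words: lists of bits; ab[i] picks word i's operation; c masks columns."""
    results = []
    for w, op in zip(words, ab):
        if op == (0, 0):    # stand-in: masked equality search against arg
            results.append(all(w[j] == arg[j] for j in range(len(w)) if c[j]))
        elif op == (1, 1):  # stand-in: complement the unmasked bits in place
            for j in range(len(w)):
                if c[j]:
                    w[j] ^= 1
            results.append(None)
    return results

words = [[1, 0, 1], [1, 1, 0]]
print(acam_step(words, ab=[(0, 0), (1, 1)], c=[1, 1, 0], arg=[1, 0, 0]))
# -> [True, None]: word 0 was searched while word 1 was complemented.
```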

A unique associative processor designed for bulk filtering of radar data was described by Schmitz, et al. [24]. This processor is shown in Figure 15. In the figure there are M control units and N Processing Elements (PEs). The design of the PEs is actually not pertinent to the basic concept: they could be complex (e.g., the PEPE or ILLIAC IV PEs), or be very simple associative memory cells. The key element in the architecture is the Associative Control Switch (ACS), which allows one of the control units (CUs) to be connected to each of the PEs. Thus, each PE may be connected to any one of the M CUs, or it may be inactive. This allows for the subsetting of the N PEs into M parallel or associative processors.

Figure 15. Associative Control Switch Architecture.

The connection is handled associatively; therefore the machine is an associative processor.

This architecture tends to solve one of the problems of a conventional associative processor, which is that while some PEs are operating, other PEs are inactive and thus not contributing to the solution of the problem. In this associative processor several different portions of the system may be concurrently operating on different instruction streams, thereby raising hardware utilization. Additionally, the ACS acts like an extended search-results register, i.e., the ACS is set to select a CU based upon the result of an operation or sequence of operations performed in the PE. Seeber and Lindquist [87] have discussed a similar architecture in which the PEs were associative processors.
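A minimal sketch of the ACS idea follows (hypothetical data and names; it models the binding and broadcast behavior, not the switch hardware). Setting a PE's control-unit binding from a search result is what makes the switch "associative."

```python
# Sketch: each PE is bound to at most one of M control units, and the
# binding is set by a search result, so subsets of the ensemble follow
# different instruction streams concurrently.
class PE:
    def __init__(self, value):
        self.value = value
        self.cu = None                   # None = inactive

pes = [PE(v) for v in (3, 7, 3, 9)]

def bind(pes, predicate, cu_id):
    """Associatively attach every PE whose data satisfies predicate to CU cu_id."""
    for pe in pes:
        if predicate(pe.value):
            pe.cu = cu_id

def broadcast(pes, cu_id, op):
    """CU cu_id drives only its own PEs; all others ignore the instruction."""
    for pe in pes:
        if pe.cu == cu_id:
            pe.value = op(pe.value)

bind(pes, lambda v: v == 3, cu_id=0)     # PEs holding 3 follow CU 0
bind(pes, lambda v: v > 5, cu_id=1)      # PEs holding >5 follow CU 1
broadcast(pes, 0, lambda v: v + 1)       # CU 0's stream touches only its PEs
print([pe.value for pe in pes])          # [4, 7, 4, 9]
```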

Mixed-Mode and Multidimensional Memories

Memories have been traditionally limited technologically to sequential or random-access, one-dimensional word lists.


The major exception to this has been the class of associative-access memories. Jensen [23] extended the concept of conventional memories to include memories with unusual access modes and dimensionalities. The memories Jensen proposes are unconventional organizations designed to take advantage of LSI regularity and modularity. The claimed advantages of these types of memories are improved run-time efficiencies and memory utilization, and easier programming.

Access modes are the means by which the data is "addressed." Jensen proposes that certain access modes be combined in the same memory (e.g., FIFO (First-In First-Out) and associative), or that memories with a specific access mode (e.g., FIFO) be directly implemented (as discussed by Derickson [88]) rather than emulated as proposed by others (King [89]).

Dimensionality refers to the number of coordinates of a memory. In a random-access (wordwise one-dimensional, i.e., linear) memory, an n-dimensional array would have to be linearized by some type of address translation. In an n-dimensional memory, n-dimensional data structures could be directly stored and addressed with n-tuples. This concept has been proposed for use in a machine to implement APL [90]. It is important to note that a multidimensional memory need not have the same access mode in all directions.

Figure 16. Associative Queue.

An example of this type of memory is Jensen's two-dimensional queued associative memory. A nonlinear address is possible in any dimension.

Figure 16 shows an example of a mixed-mode memory, i.e., one with an associative queue. In this memory, data may be accessed either FIFO or associatively. This type of memory appears very efficient for implementing a Least Recently Used (LRU) page replacement algorithm. For the same functional capacity and performance, the associative queue requires N² fewer associative bits to implement the LRU algorithm for N pages than an associative memory would require. An important feature of the associative queue is that it only requires adding simple shift logic between associative words, and this appears to be quite cost-effective in certain applications.
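As a concrete illustration, here is a minimal software model of that LRU scheme (a deque stands in for the hardware shift logic between words; the page numbers are hypothetical): a hit is an associative search that lifts the page to the most-recent end, and a miss evicts from the FIFO head.

```python
from collections import deque

class AssociativeQueue:
    """Sketch of LRU on a mixed-mode (FIFO + associative) memory."""
    def __init__(self, n_pages):
        self.n = n_pages
        self.q = deque()                 # head = least recently used

    def reference(self, page):
        if page in self.q:               # associative search over all words
            self.q.remove(page)          # shift logic closes the gap
        elif len(self.q) == self.n:
            print("evict", self.q.popleft())   # FIFO access yields the LRU page
        self.q.append(page)              # enqueue as most recently used

aq = AssociativeQueue(3)
for p in [1, 2, 3, 1, 4]:                # reference string; 4 evicts page 2
    aq.reference(p)
```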

A two-dimensional queued associative memory is shown in Figure 17. This is an associative memory with a queue behind every associative-memory word. This type of memory architecture is difficult, if not impossible, to emulate efficiently.

Figure 17. Two-Dimensional Associative Queue.

Jensen envisions these devices applied as small memories distributed throughout the computer. He feels that such memories are extremely useful for hardware support of executive functions. Erwin and Jensen studied the problems of interrupt processing in depth [22].


The area of associative support of executive functions [21] has been studied using small conventional associative processors. Associative memories have also been found to have straightforward application in virtual memory systems [19] and to have some application in I/O processing [91].

Implemented Associative Processors

A number of associative processors have been implemented, with both standard and custom logic. This section will discuss the functional characteristics of these processors and the most common techniques used to implement them.

Figure 18 is a functional block diagram of the typical associative processor architecture that has been realized. It consists of a control unit, data register, mask register, search-results register, word-select register, parallel I/O channel, associative memory array, match indicator, and multiple-match resolver. An example of the operation of an eight-word associative processor, as illustrated in Figure 18, follows.

It is important to note that there is no address decoding logic in an eight-word associative processor, because there are no addresses per se. Data in the processor is addressed by means of the contents, or by some property of the contents. Each word of memory is broken into variable-sized groups of bits called fields. These fields do not have to be made up of contiguous sets of bits; however, they are customarily constructed that way. The data and mask registers contain the same number of bits as a word, and the search-results and word-select registers contain a bit for each word.

In this simplified processor, the data register contains the word to be compared with the stored words, the mask register designates which of the bit positions of the search word are to be included in the search operation, a results register stores the results of the search, and a word-select register selects words to be searched over. For the example situation illustrated in Figure 18, word 7 has not been selected, as indicated by the contents of the word-select register. The contents of the mask register show that only the first field of the data register is to be included in the search. An equality search operation will result in the comparison of the contents of the first field of the search register to the contents of the corresponding field of all stored words.

Figure 18. Associative Processor Operation Example. (MMR = multiple-match resolver.)

It should be noted that only stored words 3 and 6 satisfy the search and are therefore identified by 1s in the results register after completion of the search. Word 7 would have satisfied the search, but it was not in the set of words designated for searching by the word-select register. In many associative processor applications, such a search operation would normally be followed by a readout operation (whereby the identified words are sequentially read out), or by another search operation (in which case the search-results register would be transferred into the word-select register). Note that a series of searches can be performed and the results ANDed together if the results in the search-results register are used as the new contents of the word-select register. This logical function can be quite useful in computation. A multiple-match resolver (MMR, indicated by the arrow in Figure 18) is also an integral part of the memory. The MMR indicates the "first match" in the memory (as defined by the hardware) if there were any matches.
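The whole operation reduces to a few lines when modeled functionally (a sketch of the Figure 18 data flow, not of any particular machine; words are held as integers, and the contents are hypothetical):

```python
# Functional sketch of a masked equality search with word selection and a
# multiple-match resolver.
def equality_search(memory, data, mask, word_select):
    """memory: list of words (ints); mask: 1-bits participate in the search."""
    results = [int(sel and (word & mask) == (data & mask))
               for word, sel in zip(memory, word_select)]
    first = results.index(1) if 1 in results else None   # multiple-match resolver
    return results, first

memory = [0b1010, 0b1100, 0b1011, 0b1010]
results, mmr = equality_search(memory, data=0b1010, mask=0b1110,
                               word_select=[1, 1, 1, 0])
print(results, mmr)   # [1, 0, 1, 0] 0  (word 3 matches but was not selected)
```

Feeding `results` back in as the next call's `word_select` models the search chaining (ANDing) described above.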

A number of techniques are available to implement the memory array. The major ones are bit-slice RAM arrays, bit-slice skewed adder arrays, bit-slice skewed exclusive-OR (EXOR) arrays, byte-slice arrays, and distributed logic arrays. Each of these techniques, and the tradeoffs between them, is discussed briefly below. In the discussion, the following definitions apply:


1) Bit Slice - Bit-Slice i consists of bit i of all selected words.
2) Word Slice - Word-Slice j consists of all unmasked bits of word j.

Implementation of a bit-slice associative memory is quite simple. A random-access memory is oriented to store bit slices instead of word slices. Consider a basic memory chip (e.g., a 256-word by 1-bit memory chip) as shown in Figure 19. A normal word-oriented memory of 256 words, with 256 bits per word, can be constructed as shown in Figure 20. To use this memory as a bit-slice associative memory, simply rotate the bit array 90° so that what was formerly a word slice becomes a bit slice, i.e., when the address i is presented to the memory chip decoders, bit-slice i is addressed. An equality-search operation can be constructed for fields of several bits by ANDing together a number of single-bit equality searches.

Figure 19. 256-Word by 1-Bit Memory Chip.

Figure 20. 256 x 256 Memory Array.


Since bit slices can be read from the memory into the external registers and processing logic, control sequences in the external logic can implement many complex operations (such as addition) between limits searches and equality searches.
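A functional sketch of that bit-serial procedure follows (one "memory cycle" per loop iteration; the data layout is hypothetical):

```python
# Sketch of a bit-serial equality search over bit slices: the memory is read
# one bit slice at a time, and the per-word match vector is ANDed down
# across the bits of the searched field.
def bit_slice_search(bit_slices, field_bits, key):
    """bit_slices[j][w] = bit j of word w; field_bits: bit positions to search;
    key: dict mapping bit position -> required value."""
    n_words = len(bit_slices[0])
    match = [1] * n_words                     # start with all words matching
    for j in field_bits:                      # one memory cycle per bit
        match = [m & (b == key[j]) for m, b in zip(match, bit_slices[j])]
    return match

# Words (LSB first): w0 = 101, w1 = 001, w2 = 101; search bits {0, 2} for 1, 1.
slices = [[1, 1, 1], [0, 0, 0], [1, 0, 1]]
print(bit_slice_search(slices, [0, 2], {0: 1, 2: 1}))   # [1, 0, 1]
```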

There is one major problem with the bit-slice memory array: its serial nature. Operations are implemented using serial techniques. This can still provide high throughput when data sets are being processed in parallel. However, I/O is an example of an operation that may have to occur serially, but cannot be processed in parallel. In bit-slice associative machines these I/O operations can be very slow.

To solve the" I/O problem it is desirable to address the memory array so that both bit and word slices can be accessed in parallel. This can be accomplished by storing bit slices by the flowchart in Figure 21 and word slices by the flowchart in Figure 22 while using the logic system of Figure 23 [92]. The memory map for tins adder-generated skewed storage tech- nique is illustrated m Figure 23. Shifting of the input and output bit slices is required for align- ment of the bit slices due to the address map properties of the storage techmque. The adder- generated skewed logic technique is not very amenable to changes m array size because the adders are not very modular (in today's hard- ware technology). This problem led Goodyear to use the EXOR generated skew logic tech- nique illustrated in Figure 25 [93]. To imple- ment this techmque Goodyear uses a coordi- nate addressing technique which generates both X and Y addresses. If the S register is set to all ls, word slic~s are addressed. If the S register is set to all 0s, bit shces are addressed. Other address modes such as an n-bit field of every nth word (n = 2P for some P) can also be easily d~signed.

Byte-slice machines can be easily implemented from any of the bit-slice skewed logic techniques by simple interleaving access methods.

Another technique is to place more logic into each storage cell, that is, to employ distributed logic memories [85, 94, 95]. Many such memories have been suggested. The most viable concept involves placing an EXOR gate and control logic with each storage cell. Memory arrays of this type can perform basic equality searches, not only parallel-by-word, but also in parallel over all bits.


Figure 21a. Bit-Slice Addressing - Read. (Read address i with the adders enabled, so the address presented to each module is incremented by the module number; shift the output left i places to align the bit slice.)

Figure 21b. Bit-Slice Addressing - Write. (Shift the input right i places, enable the adders, and write at address i.)

Match and mismatch currents are usually summed on the word output lines. A bit-slice processor must AND together n bit-slice equality operations to search an n-bit field. Thus, it is possible that a distributed logic memory can perform such an equality search n times faster than a bit-slice memory. Further, the distributed logic memory has access to both bit and word slices in parallel, thereby alleviating the serial-by-bit word-slice I/O problem. To date, distributed logic memories have been quite expensive and have not seen extensive use. A good comparison of bit-slice and distributed logic memories can be found in [26].


Figure 22a. Word-Slice Addressing - Read. (Read address i in every module; shift the output left i places to obtain word i.)

Figure 22b. Word-Slice Addressing - Write. (Shift the input right i places; write at address i in every module.)

Figure 23. Adder Skew Array Network.

Figure 24. Adder Skew Network Memory Map. (Address i of module m holds bit (m - i) mod 256 of word i.)


Figure 25a. EXOR Skew Network.


WORD\BIT   0  1  2  3  4  5  6  7
   0       0  1  2  3  4  5  6  7
   1       1  0  3  2  5  4  7  6
   2       2  3  0  1  6  7  4  5
   3       3  2  1  0  7  6  5  4
   4       4  5  6  7  0  1  2  3
   5       5  4  7  6  1  0  3  2
   6       6  7  4  5  2  3  0  1
   7       7  6  5  4  3  2  1  0

Figure 25b. EXOR Skew Network Memory Map (8x8). (The entry at (word, bit) is word XOR bit, the module assignment under the EXOR skew.)

Associative Processor Simulators

Associative hardware of any sophistication has always been difficult to obtain. Therefore, a number of simulation systems have been devised. Some of these will be discussed here. In general, most of these systems are very crude in comparison to the hardware structures of associative processors. For example, the software solutions are usually limited in capability to equality searches because their implementations are either a hash coding process or a complex list system. Thus, in terms of capability, the simulators are totally inferior to the envisioned hardware; they do not provide anything close to a realistic associative environment nor the means to evaluate such an environment. Furthermore, since the simulators are implemented on conventional serial machines, they do not have hardware support for some very important functions, such as multiple-match resolution. Thus their utilization may be quite expensive and time-consuming.

LISP is a list-processing language which has been used to implement a software associative retrieval system for graph processing (Simmons [96]). Green [97] compares the properties and uses of the list-processing languages LISP, IPL-V, and FLPL. A concept close to LEAP, but based on a derivative language of PL/I, was used by Symonds [98] to implement a PL/I-oriented data structure to be used for software association of data.


The ASP system (Savitt and Love [81]) had as one of its features a software interpreter which provided a simulation of the associative processor, and which could be used to execute the ASP language. AMPPL-II was a software version of the Goodyear plated-wire Associative Processor at RADC (Findler [99]). Brotherton and Gall [100] designed ALS, which is essentially a hardware version of the associative memory described by Feldman and Rovner for use with LEAP [101]. ALS was designed to emulate a fast hash code search for equality searches. TRAMP (Ash and Sibley [102]) was a data base designed for a software associative processor.

LEAP [101] is an associative language designed for the processing of large, complex data structures. The language is based upon extensions of ALGOL. The basic supporting data structures have been implemented with a hash code scheme, thus limiting its effectiveness to equality searches. The referenced paper [101] presents a good overview of LEAP, along with examples and comparisons to such languages as IPL-V, etc.

LEAP was designed to be utilized in programming Feldman and Rovner's simulated associative memory. This memory was originally designed with capabilities to access table entries by means of hash coding. The underlying data structure was ring-like. A paging simulation was later developed for applications that require large amounts of memory (a two-level hierarchy of core and drum memories was used). LEAP is not a single language, but rather a family based on an extensible version of ALGOL. Each language form adds different capabilities to the basic ALGOL form. Typical forms of LEAP may include matrix operations, on-line graphics, property sets, etc. LEAP has two compilers (one including dynamic checking). The design philosophy was to use a translator writing system to make possible extensible and application-tailored languages for the simulated associative processor.

Another way to achieve the "simulation" capability is to construct a simulator that contains a baseline associative processor. The baseline processor is then used as hardware support to perform more realistic simulations of other associative processors. However, as seen in Gall's hybrid associative processor study [50, 51], this technique has not produced good results to date.


Further work of this sort was performed by Auerbach [37], namely, testing programs on real associative hardware at RADC. Unfortunately, these later attempts share the same problems as Gall's 1966 study, viz., the system configuration is so general-purpose and mismatched to the problem that the results are meaningless unless system adjustments are made to the measured performance in order to account for the configuration imbalance.

PARALLEL PROCESSORS AND ENSEMBLES

Introduction

In the parallel processor area researchers started by investigating machines that were arrays of cells connected in a four-nearest-neighbor manner. Such machines include von Neumann's Cellular Automata [103] and the Holland Machine [104]. Eventually these machines came to be considered curiosities, and interest switched to parallel processors in which a central control mechanism controls the entire array, with the array operating in a SIMD fashion.

Because their applications can vary, some of the architectures discussed in the following sections may be classified in a number of ways. Depending upon the interpretation of the design intent, PEPE [62] could be classified as either an ensemble or an associative processor, and the Orthogonal Processor [105] and SIMDA [106] could be classified as associative or parallel processors. SIMDA, PEPE, and the Orthogonal Processor have been classified here as associative, ensemble, and parallel processors, respectively; these classifications result from the authors' interpretation of the applications intended by the designers of the processors.

Unger's Machine

The Unger machine [107] was designed to perform pattern-recognition processing. It was primarily oriented towards the processing of lines. Line thinning, doubling, extending, and center determination are typical functions that could be accomplished with the machine. The main functional units of the complex consisted of the central control computer and the processing element array. The central control computer sent commands to all PEs in the array in parallel.


The PEs were connected to their four nearest neighbors. The PEs did not have an activity bit; however, the array could enable conditional jumps in the master control computer by means of a mechanism that could test for all zeros in the PE array (i.e., a designated bit of each PE could be tested and all these bits could be ORed together to specify whether or not the array contained the zero value). The master control computer could jump on the condition of all bits zero.

SOLOMON I, SOLOMON II, and ILLIAC IV

Slotnick, et al. [57, 108, 109] designed the SOLOMON I, SOLOMON II, and ILLIAC IV machines, each of which has a PE array with four-nearest-neighbor connections. The machines were designed to work on problems involving differential equations, matrix manipulations, weather data processing, and linear algebra.

SOLOMON I was a bit-serial processor. Each PE contained a serial accumulator and could execute instructions. A block diagram of the SOLOMON I PE is given in Figure 26. The control unit for SOLOMON I is shown in Figure 27. Figure 28 gives a block diagram of the overall SOLOMON I system. The X register was used to obtain parallel I/O from the PE array; it could read from the top, bottom, left, or right of the array. Serial I/O through the control unit was also available. The control unit had a specific set of instructions, and the PEs had a separate set of instructions.

A few major changes to SOLOMON I led to SOLOMON II. The serial arithmetic concept, although quite flexible, was too slow for the intended applications. Therefore, the SOLOMON II arithmetic units were switched to 24-bit floating-point units.

As the SOLOMON processors developed to eventually become ILLIAC IV, the floating-point arithmetic units were modified further to a 32-bit word length. Also, the array configurations changed from four 8 x 16 "quadrants" to four 8 x 8 PE quadrants. Only one quadrant was actually implemented in ILLIAC IV. The purpose of multiple-quadrant configurations was to allow simultaneous processing of unique programs or subroutines. ILLIAC IV will be discussed in detail later in this paper.


Figure 26. SOLOMON I PE.

Figure 27. SOLOMON I Sequencer (Partial Block Diagram of SOLOMON I CU).


Figure 28. SOLOMON I System.

Machines with Other Interconnection Structures

Although machines with connections between the four nearest neighbors predominate in the field of parallel processors, machines with six- [63] and eight-nearest-neighbor connections have been proposed. Further, machines embedded in the n-cube [65], and machines with irregular [110] and variable [111] interconnection structures, have also been discussed.

Orthogonal Computer

Shooman [105] initiated the concept of the orthogonal computer. It consists of a horizontal arithmetic unit (HAU) and a vertical arithmetic unit (VAU) sharing an orthogonal memory (OM). The OM is a memory device that provides access to both word slices (HAU access) and bit slices (VAU access). It is partitioned as shown in Figure 29. Assume that there are R VAU elements. The OM could be implemented as an associative memory; however, it was originally envisioned as a bit-slice processing device that would be useful for high-speed signal processing. The detailed implementation of an OM will be discussed further on in this paper, in the section on OMEN [112].

PEPE

PEPE (Parallel Element Processing Ensemble) is a machine designed for ballistic missile defense data processing [62]. It is an outgrowth of the Distributed Logic Memory (DLM) concept. Originally, PEPE consisted of two major components: the CU (Correlation Unit)/CCU (Correlation Control Unit) complex, and the AU (Arithmetic Unit)/ACU (Arithmetic Control Unit) complex.

Figure 29. Orthogonal Computer Block Diagram. (The HAU performs word-serial, bit-parallel processing and the VAU performs bit-serial, word-parallel processing; both share the orthogonal memory under horizontal/vertical synchronization control, with I/O on each side.)

Incoming targets are correlated by the CU/CCU complex. Each CU contains an associative memory built on the two-dimensional DLM concept. When a target correlates against the track file (a track is stored in each PE, or in an AU/CU pair) it is stored into the common AU/CU memory for future updating, prediction, and phased-array radar pulse generation computations by the AU/ACU complex. When an I/O bottleneck at the ACU was discovered, an additional complex, the AOU (Associative Output Unit)/AOCU (Associative Output Control Unit), was added to PEPE. PEPE will be detailed further in the next section of this paper.

Comment

Associative architectures exist in greater variety than do parallel and ensemble architectures. The major reason for this is the wide variety of intended application areas. Associative machines have been considered for uses ranging from simple virtual memory mechanisms through use in matrix manipulation and data-management processors. Thus, a wide variety of architectures can be expected. Most parallel and ensemble processors are used in problems requiring heavy computational loads; many are oriented toward matrix manipulation problems.


Thus, the architectural variety of these processors has not been as rich as for associative processors, although more of the anticipated applications have materialized and more of these processors have developed beyond the feasibility stage. The next section examines the main SIMD processors that have been developed.

ACTUAL MACHINES

Introduction

This section will give a brief description of the most important SIMD machines currently available. The machines will be discussed in approximately ascending order of complexity.

STARAN

Early Goodyear Associative Processors used plated wire as the storage medium [113]. The machines consisted of a basic bit-slice associative processor and were thus restricted to a bit-serial mode of I/O access. STARAN (a semiconductor version of the Goodyear Bit-Slice Associative Processor) was designed to correct this deficiency [93]. STARAN has parallel word I/O. It uses the EXOR-generated skewed logic storage technique described earlier.

STARAN has access to either a row or a column, much like Shooman's orthogonal processor (OMEN), but it has only one set of registers and processing logic (M, X, Y) instead of both word-slice and bit-slice registers. For effective array utilization, STARAN was partitioned into arrays of N x N bit matrices, where N is a power of 2. Addressing is to a particular row or column. STARAN is composed of a number of these arrays.

STARAN utilizes a PDP-11 as its sequential controller, and has a sequential AP control unit which has a memory cycle about ten times faster than that of the PDP-11.

To the programmer, the basic array appears to be addressable in a bit-slice or word-slice mode, as indicated in Figure 30. Actually, the system contains a logic circuit called a Flip Network (FN). Internally, the machine addresses bit slices (the FN is essentially bypassed) or word slices (the FN addresses diagonals). Goodyear established the number of word slices and bit slices to be 256, as this size is compatible with current commercially available RAMs.

Figure 30. STARAN Main Address Modes. (Words 0-255 may be accessed by word-slice or bit-slice addressing.)

Most of the FN can be implemented utilizing off-the-shelf selector and EXOR chips.

The cost of the FN is about 20% less than the cost of the memory array; the array (256 x 256) contains 65,536 bits and the FN is equivalent to about 50,000 bits.

The block diagram of a basic array is given in Figure 31. The X, Y, and M registers each consist of one bit per associative-memory word. The X register is generally used to store temporary results. The Y register effectively acts as the search-results register; it typically contains the results of search, arithmetic, and logic operations. The M register is used to specify element activity. This register, in the bit-slice mode, corresponds to a word-select register, and in the word-slice mode, to a mask register.

The processing array has the capability to perform any of the two-variable logical functions between registers. STARAN does not have a dedicated serial adder on a per-word basis; Goodyear programs the X and Y registers, using the logical functions, to appear to have this capability.


Figure 31. STARAN Basic Array. (256-bit parallel output; argument input of 32 or 256 bits.)

This cuts the cost of an array, but requires the facility for high-speed operation within the X, Y, M register complex.
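For illustration, a per-word serial adder synthesized purely from logical operations might look as follows (X holds the running carry and Y the sum bit of every word; this is a functional sketch under those assumptions, not Goodyear's actual register microcode):

```python
# Sketch of a per-word serial adder built from logical operations over bit
# slices: one slice of each operand is combined per step, all words at once.
def parallel_add(a_slices, b_slices, n_words):
    """a_slices[j][w], b_slices[j][w]: bit j (LSB first) of the two operand
    fields in word w; returns the sum's bit slices, final carry last."""
    x = [0] * n_words                         # X register: per-word carry
    sum_slices = []
    for a, b in zip(a_slices, b_slices):      # one bit slice per step
        y = [ai ^ bi ^ xi for ai, bi, xi in zip(a, b, x)]            # sum (Y)
        x = [(ai & bi) | (xi & (ai | bi)) for ai, bi, xi in zip(a, b, x)]
        sum_slices.append(y)
    sum_slices.append(x)                      # final carry slice
    return sum_slices

# Words 0 and 1 compute 3+1 and 2+3 simultaneously (LSB-first slices):
print(parallel_add([[1, 0], [1, 1]], [[1, 1], [0, 1]], 2))
# -> [[0, 1], [0, 0], [1, 1]], i.e., 4 and 5
```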

Each of the 256-word x 256-bit associative arrays includes a 256-bit resolution system. Resolution is always going on in each array, and the system interface is a 9-bit response output: eight bits give the address of the first responder, and the ninth bit is the inclusive OR of the response register. The overall system structure is given in Figure 32. A summary of STARAN is tabulated in Figure 33.

The system I/O is not definable because each system is unique in regard to the kind of I/O required. Typical options include DMA to a host computer, buffered I/O for peripherals, communication through the external function logic, and parallel I/O channels into any of the arrays.

The assembly language APPLE (Associative Processor Procedural Language) [114] has been developed for STARAN.

Figure 32. STARAN System. (AP microprogrammed control memory, AP controller, program pager, PDP-11 sequential controller, external function logic, up to 32 associative memory arrays, and parallel I/O logic with DMA and buffered I/O to a host computer.)


Figure 33. STARAN System Capability Summary. (Standard features include one basic associative array, one AP controller, bipolar control-memory pages plus a 16K-word bulk core, a program pager, and a PDP-11 with 8K of 16-bit core and basic peripherals; options include up to 31 additional arrays, dual control, additional interrupts, larger control and bulk memories, additional peripherals, and DMA, BIO, EXF, and PIO interfaces.)

Assemblers for APPLE are available and are tailored for the individual machine installation. Few I/O instructions are included in APPLE, since I/O is customized for each installation.

STARAN has been purchased by RADC (Rome Air Development Center), the Johnson Space Center at NASA-Houston, and the US Army Topographic Laboratory. The use of STARAN in these various facilities should provide a good test of its overall processing characteristics.

OMEN

OMEN (Orthogonal Mini EmbedmeNt) is the processing system designed by Sanders Associates. OMEN is based upon the orthogonal computer architecture [105]. It utilizes a modified byte-slice associative-access memory designed using Intel 1103 chips as the basic cells.

Figure 34 shows the functional organization of OMEN. The key concepts in OMEN are: 1) use of the scoreboard for the control concept; 2) implementation of the orthogonal memory (OM); and 3) use of the PDP-11 as the Horizontal Arithmetic Unit (HAU) to bypass development of costly system and support software. There are 64 PEs in the Vertical Arithmetic Unit (VAU).


Figure 34. OMEN System Block Diagram.

Four models of OMEN are distinguished by their PE complexities.

The OM is interleaved so that it appears to the HAU as shown in Figure 35, and to the VAU as shown in Figure 36. Random access in the vertical direction is to the byte slice. The cost to build the 1103-based OM is only 19% more than the cost of a comparable 1103-based random-access read/write memory.

There are eight holding registers in the OMEN VAU. These, along with the VAU skew logic, allow for broadcasting a value to all PEs, matrix pre-multiply, matrix post-multiply, perfect shuffle, barrel shift, and order reversal. These PE interconnections allow efficient processing of matrices, vectors, and signal processing algorithms such as the Fast Fourier Transform.
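Of these interconnections, the perfect shuffle is easily illustrated (a functional sketch; the VAU realizes the permutation with skew logic rather than software):

```python
# Sketch of the perfect-shuffle permutation: interleave the two halves of
# the vector, [a0, b0, a1, b1, ...], the data movement underlying FFT routing.
def perfect_shuffle(v):
    n = len(v) // 2
    out = []
    for a, b in zip(v[:n], v[n:]):
        out.extend([a, b])
    return out

print(perfect_shuffle([0, 1, 2, 3, 4, 5, 6, 7]))  # [0, 4, 1, 5, 2, 6, 3, 7]
```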

Figure 35. OMEN HAU Memory Map. (65,536 horizontal words of two bytes each, bytes 0 through 131071.)


Figure 36. OMEN VAU Memory Map. (16,384 bit slices of 64 bits each, one bit per PE, grouped eight to a vertical byte slice; byte slices 0 through 2047.)

Figure 37 gives the PE variations that are available with OMEN. I/O is accomplished via the UNIBUS or directly through the OM, using a parallel I/O channel.

The software available for OMEN consists of extensions of the FORTRAN and BASIC supplied with the PDP-11. In addition, an extended version of APL is being planned.

PEPE

PEPE (Parallel Element Processing Ensemble) is an ensemble designed for ballistic missile radar defense processing [115]. The system configuration of PEPE is shown in Figure 38. The ACU (Arithmetic Control Unit) and AU (Arithmetic Unit) block diagrams are shown in Figures 39 and 40, respectively.

Figure 37. OMEN PE Variations. (The four models differ in ALU type, from bit-serial to fully parallel, in floating-point hardware, and in register complement, from 8 bits per PE up to 133 bits per PE including eight 16-bit general registers.)

Figure 38. PEPE System. (EMC = element memory control; ODC = output data control.)

As can be seen from Figure 38, the system can simultaneously be inputting data in the CCU/CU complex, updating tracks in the ACU/AU complex, and outputting radar control commands through the AOCU/AOU complex.


Figure 39. PEPE ACU.

Figure 40. PEPE AU. (DPC = double precision carry; EA = element activity; EF = element fault; OV = arithmetic overflow.)

The complexes are quite similar, so only the ACU/AU complex need be described. Execution sequencing consists of: 1) instruction fetch, and 2) instruction evaluation. The result of this process is an instruction which is routed to the sequential control section or to the PIQ (Parallel Instruction Queue) for transmission to the PICU (Parallel Instruction Control Unit). The PIQ is invisible to the programmer. The PICU is microprogrammed, but the PEs are hardwired to execute as slaves to the PICU microinstructions.

The ACU has accumulator and extension, index, condition, interrupt mode, and I/O buffer registers. The AU has accumulator, overflow, double-precision carry, element activity, fault, tag, and activity registers. PEPE PEs use the activity stack concept to support nested control structures from the extended version of FORTRAN (PFOR) that is available.

PFOR contains constructs that allow for both sequential (control unit) variable and parallel (PE) variable declarations. Parallel arithmetic and logic expression evaluations are also provided. The WHERE statement is the parallel analog of the FORTRAN IF statement. A counting function is available to tally the number of active elements, and to furnish the exact number of matches and indications of none, one, many, or all in the match indication subsystem.


An analogy to the FORTRAN logical IF statement is also provided. Lastly, a parallel DO statement to control sequencing is also available. An assembly language, PAL (Parallel Assembly Language), supports commonality: each of the six units (ACU, CCU, AOCU, AU, CU, and AOU) is able to execute a subset of PAL, thereby simplifying the software problem as much as possible.
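The semantics of the WHERE statement described above can be modeled functionally as masked execution (a sketch of the activity-bit mechanism only; PFOR syntax is not reproduced, and the data are hypothetical):

```python
# Sketch of parallel WHERE semantics: evaluate the condition in every
# element, set the activity bits, and apply the body only where active.
tracks = [120.0, 80.0, 150.0, 60.0]      # one datum per processing element
activity = [t > 100.0 for t in tracks]    # WHERE (t > 100) sets activity

tracks = [t * 0.5 if act else t           # body takes effect only where active
          for t, act in zip(tracks, activity)]
count = sum(activity)                     # the counting function: 2 responders
print(tracks, count)                      # [60.0, 80.0, 75.0, 60.0] 2
```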

To date, a 16-element version of PEPE has been built and benchmarked. A 288-element version is currently being constructed. Each processing element is a 32-bit floating-point unit with a complexity of about 8800 gates, plus memory. A floating-point add from PE memory takes about 800 nanoseconds.

ILLIAC IV

ILLIAC IV is the largest parallel processor currently operating. It has a four-nearest-neighbor interconnection structure, and is partitioned into four 8 x 8 PE quadrants. However, only one of the four quadrants has been built.

A functional block diagram of the ILLIAC IV CU is given in Figure 41. The CU is composed of five major subsections: ILA (Instruction Look-Ahead), ADVAST (Advanced Station), FINST (Final Station), MSU (Memory Service Unit), and TMU (Test and Maintenance Unit). The CU controls the sequencing of the PE quadrants. CU instructions are fetched from the PE memories and paged into the ILA. Thus, functionally the CU has an instruction memory, but physically the memory is an integral part of the quadrant PEs' memories. This allows the ILLIAC IV programs to be different in separate quadrants and to be fetched from backing storage at the same time the PE data is fetched. The CU contains four general-purpose accumulators, several control registers, a 64-word scratch pad, and quadrant control registers. The ADVAST subsection examines each instruction and executes sequential instructions. Parallel instructions are decoded by the FINST and transmitted to the PEs for execution. (FINST functions much like the PEPE PICU.)

The processing unit (PU) consists of the PE, its memory (PEM), and the MLU (Memory Logic Unit). The PE function is diagrammed in Figure 42. The PE contains no control logic; it is a slave to the CU. The PE consists mainly of registers and high-speed arithmetic logic, plus parallel shift logic.

• Kenneth J . Thurber and Leon D. Wald

Figure 41. ILLIAC IV CU.

As shown in Figure 42, the PE registers that are visible to the programmer are:

A - results register, activated by the PE activity status;
B - operand register;
R - intermediate storage register, which is always enabled and used for communication; and
S - an intermediate storage register that is only operable if the PE is active.

Double indexing is possible in ILLIAC IV. Addresses may be indexed in the CU, and the address that is passed to the PE array may be individually indexed in each PE. A number of higher-order languages (TRANQUIL [116] and IVTRAN [117]) have been proposed for ILLIAC IV. The current de facto standard appears to be GLYPNIR [69], an extension of ALGOL. It is block structured, and provides for both sequential and parallel variable data declarations. Parallel assignment statements are available. Further, arithmetic capabilities may be controlled with a routing index which allows a computation to be performed remotely (in another PE) and routed to the currently active PE. GLYPNIR constructs are available to provide dynamic storage allocation, and data declarations allow static storage allocation. Pointers are supplied to support a record-processing capability. Pointers may be vectors, and may be confined (PE pointer) or nonconfined (CU pointer).


Figure 42. ILLIAC IV PE.


The ILLIAC IV PE consists of approximately 10,000 gates. Typical execution speeds are on the order of 500 nanoseconds. The read-cycle time of the PEM is about 250 nanoseconds.

CONCLUSION

SIMD machines are an innovative and controversial architecture concept. Currently, they have their detractors [15, 17, 118] and their proponents [119]. Each group is eminently able to defend its positions. However, the arguments of the detractors become moot if the machines are properly represented as special purpose.

There are a number of laws [120], conjectures [118], and effects [121] which detractors claim will spell the doom of SIMD machines. Yet there are problems on which these machines perform spectacularly [122].

Consider ILLIAC IV. If the cost measure applied were hardware duty cycle as a function of percent of cost, then ILLIAC IV would be quite cost-effective.


About half of the hardware cost is in the control section. Each of the ILLIAC PEs (processing elements) contains nearly 10K gates. The CU is nearly 100K gates. However, the PE gates are mainly ALU and register gates, which are quite regular, whereas the CU gates are mostly random-control logic. Therefore, even in a large 256-element configuration, although there are more total gates in the PEs, the cost of these gates is about equal to the CU cost. Thus, in the 64-processing-element ILLIAC IV, even if only one PE is being driven by the control unit, probably 50% of the effective hardware (on a cost basis) is being utilized. Due to the careful design, the PEs are simple slaves, and all control unit functions are centralized so that a large portion of their hardware is usually active. Keeping large portions of the hardware busy most of the time is a good goal, but it is not necessarily the most critical task. This is not to say that all problems are perfectly suited for ILLIAC IV, or that hardware duty cycle utilization is a reasonable cost measure; in most ILLIAC IV applications it is probably meaningless to determine what percent of the hardware is active at any given time. McIntyre [123] points out that a 256-element ILLIAC IV gives a speedup of only 60 times on the table lookup problem, because 80% of the time is spent moving data between PE memories. However, one must note this is an order of magnitude better than Minsky's Conjecture would predict. Kuck and Sameh [124] have shown that the efficiency of ILLIAC IV on matrix eigenvalue problems may range between 11 and 90% for different algorithms. Stone [125] shows how ILLIAC can cut the solution time considerably for certain linear systems of equations. In a sequential processor, this time is proportional to N; ILLIAC IV can reduce this proportionality to log2 N if N processors are available. Using N processors, Stone has thus obtained a speedup of N/log2 N. Minsky's Conjecture says that the maximum effectiveness is log2 N for N processors, i.e., that the maximum speedup is log2 N. One notes that for N large enough (N > 16), Stone's speedup exceeds the Minsky-conjectured value; for example, for N = 256, Stone's solution runs four times faster than Minsky predicts it would.
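The comparison can be made explicit (a worked restatement of the two estimates; the arithmetic is ours):

```latex
\[
S_{\mathrm{Stone}}(N) = \frac{N}{\log_2 N}, \qquad
S_{\mathrm{Minsky}}(N) = \log_2 N, \qquad
\frac{S_{\mathrm{Stone}}(N)}{S_{\mathrm{Minsky}}(N)} = \frac{N}{(\log_2 N)^2}.
\]
% For N = 256: Stone gives 256/8 = 32 against Minsky's 8,
% a factor-of-four advantage for Stone's algorithm.
\[
N = 256:\quad S_{\mathrm{Stone}} = \frac{256}{8} = 32, \qquad
S_{\mathrm{Minsky}} = 8.
\]
```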


To compound the difficulty of evaluating system performance, different types of special-purpose processors are available. How should they be compared? Gilmore [126] shows that a 6600/STARAN combination can solve weather forecasting problems about seven times faster than a 360/65 augmented by the special-purpose IBM 2938, Model 2. This illustrates the problems involved in making such comparisons.

Gall [50, 51] reported associative-processor time improvement factors of from 2 to nearly 200 over a CDC 1604 on spelling correction and automatic abstracting problems.

The key point is that special-purpose machines correctly designed for their applications can perform spectacularly, but these machines are not effective when applied to problems they were not designed to solve. Because they are individually designed for different problems, they are difficult, if not impossible, to compare in terms of performance.

It is true that the sizing problems which occur in the application of rigidly structured machines such as ILLIAC IV pose hazards to the use of these machines, but they are special-purpose machines, designed for specific problems or classes of problems. Application to any other problems must result in a performance degradation. To date, much of the increased performance of large-scale machines has come basically from advances in componentry (faster and smaller circuits). There are physical limits which will surely hinder major advances of this sort in the future; therefore, it is imperative that practitioners of the art of system architecture not disregard parallel processors too soon. The impact of processors that are functional analogs of physical systems has just begun to be felt. Effective use of LSI, large military and civilian real-time problems, and the advent of microcomputers are necessitating a reexamination of the fundamental nature of computer structures. One promising architecture is that of parallel processors. With PEPE, ILLIAC, STARAN, OMEN, etc., SIMD architecture can now be assessed; the impact of these systems has only begun to be measured.

REFERENCES

[1] Hobbs, L. C.; and Theis, D. J. "Survey of Parallel Processor Approaches and Techniques," Parallel Processor Systems, Technologies, and Applications, L. C. Hobbs, et al. (Ed.), Spartan Books, New York, 1970, pp. 3-20.


[2] Hollander, G. L. "Architecture for Large Computing Systems," Proc. AFIPS Spring Jt. Computer Conf., 1967, pp. 463-466.
[3] Minker, J. "An Overview of Associative Memory or Content-Addressable Memory Systems and a KWIC Index to the Literature: 1956-1970," Computing Reviews, Oct. 1971, pp. 453-504.
[4] Minker, J. "Associative Memories and Processors: A Description and Appraisal," Technical Report TR-195, University of Maryland, July 1972.
[5] Mukhopadhyay, A. "Survey on Macrocellular Research," Technical Report on NSF Grant GJ-723, University of Iowa, Dec. 1972.
[6] Cannel, M. H., et al. "Concepts and Applications of Computerized Associative Processing, Including an Associative Processing Bibliography," National Technical Information Service, Dec. 1970, AD 879 281.
[7] Hanlon, A. C. "Content-Addressable and Associative Memory Systems: A Survey," IEEE Trans. Electronic Computers, August 1966, pp. 509-521.
[8] Murtha, J. C. "Highly Parallel Information Processing Systems," Advances in Computers, Academic Press, New York, 1966.
[9] Parhami, B. "Associative Memories and Processors: An Overview and Selected Bibliography," Proc. IEEE, June 1973, pp. 722-730.
[10] Thurber, K. J. Large Scale Architecture: Associative and Parallel Processors, Hayden Publishing, Rochelle Park, New Jersey, in press.
[11] Proc. 1972 Sagamore Computer Conference on RADCAP and its Applications, August 23-25, 1972, published jointly by IEEE, RADC, and Syracuse University.
[12] Proc. 1973 Sagamore Computer Conference on Parallel Processing, August 22-24, 1973, published jointly by IEEE, ACM, and Syracuse University.
[13] Proc. 1974 Sagamore Computer Conference on Parallel Processing, T. Y. Feng (Ed.), August 20-23, 1974, Springer-Verlag, New York.


[14] Proc. 1975 Sagamore Computer Conference on Parallel Processing, August 19-22, 1975, to be published by IEEE.
[15] Flynn, M. J. "Some Computer Organizations and Their Effectiveness," IEEE Trans. Computers, Sept. 1972, pp. 948-960.
[16] Murtha, J. C.; and Beadles, R. L. "Survey of the Highly Parallel Information Processing Systems," Office of Naval Research Report No. 4755, Nov. 1964.
[17] Shore, J. E. "Second Thoughts on Parallel Processing," Computing and Electrical Engineering, Vol. 1, Pergamon Press, Oxford, England, 1973, pp. 95-109.
[18] Thurber, K. J.; and Berg, R. O. "Applications of Associative Processors," Computer Design, Nov. 1971, pp. 103-110.
[19] Denning, P. J. "Virtual Memory," Computing Surveys, Sept. 1970, pp. 153-189.
[20] Wald, L. D.; and Anderson, G. A. "Associative Memory for Multiprocessor Control," Final Report NAS 12-2087, Sept. 1971.
[21] Berg, R. O.; and Johnson, M. D. "An Associative Memory for Executive Control Functions in an Advanced Avionics Computer System," Proc. of the IEEE 1970 Int'l. Computer Group Conference, June 1970, pp. 336-342.
[22] Erwin, J. D.; and Jensen, E. D. "Interrupt Processing with Queued Content-Addressable Memories," Proc. AFIPS Fall Jt. Computer Conf., 1972, pp. 621-627.
[23] Jensen, E. D. "Mixed-Mode and Multidimensional Memories," COMPCON '72, 1972, pp. 119-121.
[24] Schmitz, H. G., et al. "ABMDA Prototype Bulk Filter Development Concept Definition Phase," Final Report, Contract No. DAH 60-72-C-0050, National Technical Information Service, April 1972.
[25] Joseph, E. C.; and Kaplan, A. "Target Track Correlation with a Search Memory," Proc. 6th National MIL-E-CON, June 1972, pp. 255-261.

[26] Thurber, K. J. "An Associative Processor for Air Traffic Control," Proc. AFIPS Spring Jt. Computer Conf., 1971, pp. 49-59.

[27] Johnson, M. D.; and Gunderson, D. C. "An Associative Data Acquisition System," Proc. 1970 International Telemetry Conference, April 1970, pp. 107-115.

[28] Wald, L. D. "An Associative Processor for Voice/Data Communications," Proc. 1972 Sagamore Computer Conference, August 1972, pp. 135-144.

[29] Wald, L. D. "Integrated Voice/Data Compression and Multiplexing Using Associative Processing," Proc. AFIPS 1974 National Computer Conference, May 1974, pp. 133-138.

[30] Pease, M. C. "An Adaptation of the Fast Fourier Transform for Parallel Processing," J. ACM, April 1968, pp. 252-264.

[31] Lee, C. Y. "Intercommunicating Cells, Basis for a Distributed Logic Computer," Proc. AFIPS Fall Jt. Computer Conf., 1962, pp. 130-136.

[32] Seeber, R. R.; and Lindquist, A. B. "Associative Memory with Ordered Retrieval," IBM J. Research and Development, Jan. 1962, pp. 126-136.

[33] Seeber, R. R. "Symbol Manipulation with an Associative Memory," Proc. 16th National ACM Conf., ACM, Sept. 1961.

[34] McCormick, B. H.; and Divilbiss, J. L. Tentative Logical Realization of a Pattern Recognition Computer, Report No. 4031, Digital Computer Lab, University of Illinois, 1969.

[35] Fuller, R. H.; and Bird, R. M. "An Associative Parallel Processor with Application to Picture Processing," Proc. AFIPS Fall Jt. Computer Conf., 1965, pp. 105-116.

[36] Bird, R. M. "An Associative Memory Parallel Deltic Realization for Active Sonar Signal Processing," Parallel Processor Systems, Technologies, and Applications, L. C. Hobbs, et al. (Eds.), Spartan Books, Washington, D.C., 1970, pp. 107-129.

[37] Auerbach. "Associative Memory Investigations - Substructuring, Searching and Data Organizations," Final Report, Air Force Contract AF 30(602)-4309, May 15, 1968.

[38] Crane, B. A. "Path Finding with Associative Memory," IEEE Trans. Computers, July 1968, pp. 691-693.

[39] Estrin, G.; and Fuller, R. H. "Some Applications for Content-Addressable Memories," Proc. AFIPS Fall Jt. Computer Conf., 1963, pp. 495-508.

[40] Bussell, B. "Properties of a Variable Structure Computer System in the Solution of Parabolic Partial Differential Equations," PhD Thesis, UCLA, August 1962.

[41] Estrin, G., and Viswanathan, C. R. "Organization of a Fixed-Plus-Variable Structure Computer for Computation of Eigenvalues and Eigenvectors of Real Symmetric Matrices," J. ACM, Jan. 1962, pp. 41-60.

[42] Gilmore, P. "Matrix Computations on an Associative Processor," GER-15260, Goodyear Aerospace Corporation, June 1971.

[43] Orlando, V. A.; and Berra, P. B. "The Solution of the Minimum Cost Flow and Maximum Flow Network Problems Using Associative Processing," Proc. AFIPS Fall Jt. Computer Conf., 1972, pp. 859-866.

[44] Cheydleur, B. F. "Dimensioning in an Associative Memory," Vistas in Information Handling, Vol. 1, P. W. Howerton and D. C. Weeks (Eds.), Spartan Books, Washington, D.C., 1963, pp. 55-77.

[45] Goldberg, J.; and Green, M. W. "Large Files for Information Retrieval Based on Simultaneous Interrogation of All Items," Large Capacity Memory Techniques for Computing Systems, M. C. Yovits (Ed.), Macmillan Co., New York, 1962, pp. 63-77.

[46] Hayes, J. P. A Content Addressable Memory with Applications to Machine Translation, University of Illinois Computer Lab., Report 227, June 1967.

[47] Savitt, D. A., et al. "ASP: A New Concept in Language and Machine Organization," Proc. AFIPS Spring Jt. Computer Conf., 1967, pp. 87-102.

[48] Peters, C. Associative Memory Compiler Techniques Study, National Technical Information Service, Nov. 1967, AD 824 213.

[49] Bird, R. M., et al. Study of Associative Processing Techniques, National Technical Information Service, Sept. 1966, AD 376 572.

[50] Gall, R. G. "Hybrid Associative Computer Study," Vol. I, Basic Report, National Technical Information Service, July 1966, AD 489 929.

[51] Gall, R. G. "Hybrid Associative Computer Study," Vol. II, Appendices, National Technical Information Service, July 1966, AD 489 930.

[52] Baker, F. T., et al. Advanced Computer Organization, National Technical Information Service, May 1966, AD 844 444.

[53] DeFiore, C. R., et al. "Associative Techniques in the Solution of Data Management Problems," Proc. ACM National Conf., 1971, ACM, pp. 28-36.

[54] Stillman, R. B. "Computational Logic: The Subsumption and Unification Computations," PhD Dissertation, Syracuse University, Jan. 1972.

[55] Stillman, N. J., et al. "Associative Processing of Line Drawings," Proc. AFIPS Spring Jt. Computer Conf., 1971, pp. 557-562.

[56] Morenoff, E., et al. "4-Way Parallel Processor Partition of an Atmospheric Primitive-Equation Prediction Model," Proc. AFIPS Spring Jt. Computer Conf., 1971, pp. 39-48.

[57] Slotnick, D. L. "Unconventional Systems," Proc. AFIPS Spring Jt. Computer Conf., 1967, pp. 477-481.

[58] Fuller, R. H. "Associative Parallel Processing," Proc. AFIPS Spring Jt. Computer Conf., 1967, pp. 471-475.

[59] Cannon, L. E. "A Cellular Computer to Implement the Kalman Filter Algorithm," PhD Thesis, Montana State University, August 1969.

[60] Feng, T. Y. "Search Algorithms for Associative Memories," Proc. Fourth Annual Princeton Conference on Information Sciences and Systems, Princeton University, Princeton, N.J., March 1970, pp. 442-446.

[61] Thurber, K. J., and Patton, P. C. "Hardware Floating Point Arithmetic on an Associative Processor," COMPCON '72, 1972, pp. 275-278.

[62] Berg, R. O., et al. "PEPE - An Overview of Architecture, Operation and Implementation," Proc. National Electronics Conference, IEEE, New York, 1972, pp. 312-317.

[63] McCormick, B. H. "The Illinois Pattern Recognition Computer ILLIAC III," IEEE Trans. Computers, Dec. 1963, pp. 791-813.

[64] Stone, H. S. "Parallel Processing with the Perfect Shuffle," IEEE Trans. Computers, Feb. 1971, pp. 153-161.

[65] Squire, J. S.; and Palais, S. M. "Programming and Design Considerations of a Highly Parallel Computer," Proc. AFIPS Spring Jt. Computer Conf., 1963, pp. 395-400.

[66] Nuspl, S. J., and Johnson, M. D. "The Effect of I/O Characteristics on the Performance of a Parallel Processor," 1971 IEEE International Computer Society Conference Digest, Sept. 22-24, 1971, pp. 127-128.

[67] Higbie, L. C. "Supercomputer Architecture," Computer, Dec. 1973, pp. 48-58.

[68] Cornell, J. A. "PEPE Application and Support Software," WESCON '72, Sept. 1972, pp. 1/3-1 to 1/3-3.

[69] Lawrie, D. H., et al. "GLYPNIR - A Programming Language for ILLIAC IV," Comm. ACM, March 1975, pp. 157-164.

[70] Anderson, G. A. "Multiple Match Resolvers: A New Design Method," IEEE Trans. Computers, Dec. 1974, pp. 1317-1320.

[71] Lewin, M. H. "Retrieval of Ordered Lists from a Content-Addressed Memory," RCA Review, June 1962, pp. 215-229.

[72] Bush, V. "As We May Think," Atlantic Monthly, Vol. 176, July 1945, pp. 101-108.

[73] Slade, A. E., and McMahon, H. O. "A Cryotron Catalog Memory System," 1956 Eastern Jt. Computer Conference, pp. 115-120.

[74] Lee, C. Y., and Paull, M. C. "A Content-Addressable Distributed Logic Memory with Application to Information Retrieval," Proc. IEEE, June 1963, pp. 924-932.

[75] Gaines, R. S.; and Lee, C. Y. "An Improved Cell Memory," IEEE Trans. Electronic Computers, Feb. 1965, pp. 72-75.

[76] Crane, B. A.; and Githens, J. A. "Bulk Processing in Distributed Logic Memory," IEEE Trans. Electronic Computers, April 1965, pp. 186-196.

[77] Sturman, J. N. "An Iteratively Structured General-Purpose Digital Computer," IEEE Trans. Computers, Jan. 1968, pp. 2-9.

[78] Savitt, D. A., et al. Association-Storing Processor Study, National Technical Information Service, June 1966, AD 488 538.

[79] Lipovski, G. J. "The Architecture of a Large Distributed Logic Associative Memory," National Technical Information Service, July 1969, AD 692 195.

[80] Lipovski, G. J. "The Architecture of a Large Associative Processor," Proc. AFIPS Spring Jt. Computer Conf., 1970, pp. 385-396.

[81] Love, H. H., and Savitt, D. A. "An Iterative-Cell Processor for the ASP Language," Associative Information Techniques, E. L. Jacks (Ed.), American Elsevier, New York, 1971, pp. 147-172.

[82] Hollander, G. L. "Drum Organization for Strobe Addressing," IRE Trans. Computers, Dec. 1961, p. 722.

[83] Minsky, N. "Rotating Storage Devices as Partially Associative Memories," Proc. AFIPS Fall Jt. Computer Conf., 1972, pp. 587-595.

[84] Healy, L. D., et al. "The Architecture of a Context Addressed Segment-Sequential Storage," Proc. AFIPS Fall Jt. Computer Conf., 1972, pp. 691-701.

[85] Kautz, W. H. "An Augmented Content-Addressed Memory Array for Implementation with Large-Scale Integration," J. ACM, Jan. 1971, pp. 19-33.

[86] Kautz, W. H., and Pease, M. C. "Cellular Logic-in-Memory Arrays," National Technical Information Service, Nov. 1971, AD 763 710.

[87] Seeber, R. R.; and Lindquist, A. B. "Associative Logic for Highly Parallel Systems," Proc. AFIPS Fall Jt. Computer Conf., Nov. 1963, pp. 489-493.

[88] Derickson, R. B. "A Proposed Associative Push Down Memory," Computer Design, March 1968, pp. 60-66.

[89] King, W. K. "Design of an Associative Memory," IEEE Trans. Computers, June 1971, pp. 671-674.

[90] Thurber, K. J.; and Myrna, J. W. "System Design of a Cellular APL Machine," IEEE Trans. Computers, May 1970, pp. 291-303.

[91] Berg, R. O.; and Thurber, K. J. "A Multiplexed I/O System for Real Time Computers," Computer Design, May 1971, pp. 99-103.

[92] Stone, H. S. "Associative Processing for General Purpose Computers through the Use of Modified Memories," Proc. AFIPS Fall Jt. Computer Conf., 1968, pp. 949-955.

[93] Batcher, K. E. "Multi-Dimensional Access Solid State Memory," US Patent 3800289, March 1974.

[94] Wald, L. D. "An Associative Memory Using Large-Scale Integration," NAECON '70, IEEE, New York, 1970, pp. 277-281.

[95] Kressler, R. R., et al. "Development of an LSI Associative Processor," National Technical Information Service, August 1970, Air Force Report No. AFAL-TR-70-142.

[96] Simmons, R. F. "Storage and Retrieval Aspects of Meaning in Directed Graph Structures," Comm. ACM, March 1966, pp. 211-215.

[97] Green, B. F. "Computer Languages for Symbol Manipulation," IRE Trans. Computers, Dec. 1961, pp. 729-735.

[98] Symonds, A. J. "Auxiliary Storage Associative Data Structure for PL/I," IBM Systems Journal, 1968, pp. 229-246.

[99] Findler, N. V. "On a Computer Language which Simulates Associative Memory and Parallel Processing," Cybernetica, Vol. 10, No. 4, 1967, pp. 229-254.

[100] Gall, R. G.; and Brotherton, D. E. Associative List Selector, National Technical Information Service, Oct. 1966, AD 802 993.

[101] Feldman, J. A., and Rovner, P. D. "An ALGOL-Based Associative Language," Comm. ACM, August 1969, pp. 439-449.

[102] Ash, W. L.; and Sibley, E. H. "TRAMP: An Interpretive Associative Processor with Deductive Capabilities," Proc. ACM 23rd National Conference, 1968, pp. 143-156.

[103] Von Neumann, J. "A System of 29 States with a General Transition Rule," Theory of Self-Reproducing Automata, A. W. Burks (Ed.), University of Illinois Press, Urbana, Illinois, 1966, pp. 132-156 and 305-317.

[104] Holland, J. H. "A Universal Computer Capable of Executing an Arbitrary Number of Sub-Programs Simultaneously," Proc. AFIPS Fall Jt. Computer Conf., 1959, pp. 108-113.

[105] Shooman, W. "Parallel Computing with Vertical Data," 1960 Eastern Jt. Computer Conf., pp. 111-115.

[106] Gonzalez, M. J. "SIMDA Overview," Proc. 1972 Sagamore Computer Conference, August 1972, pp. 17-28.

[107] Unger, S. H. "A Computer Oriented Toward Spatial Problems," Proc. IRE, Oct. 1958, pp. 1744-1750.

[108] Slotnick, D. L., et al. "The SOLOMON Computer," Proc. AFIPS Fall Jt. Computer Conf., Dec. 1962, pp. 97-107.

[109] Westinghouse. "Multiple Processing Techniques," June 1964, AD 602 693.

[110] Hawkins, J. K.; and Munsey, C. A. "A Parallel Computer Organization and Mechanizations," IEEE Trans. Computers, June 1963, pp. 251-262.

[111] Rohrbacher, D. L. "Advanced Computer Organization Study," National Technical Information Service, April 1966, AD 631870 and AD 6313811.

[112] Higbie, L. C. "The OMEN Computers: Associative Array Processors," COMPCON '72, 1972, pp. 287-290.

[113] Fulmer, L. C., and Mellander, W. C. "A Modular Plated-Wire Associative Processor," Proc. 1970 IEEE International Computer Group Conference, 1970, pp. 325-335.

[114] Goodyear. "STARAN APPLE Programming Manual," Document GER-1563B, Sept. 1974.

[115] Cornell, J. A. "Parallel Processing of Ballistic Missile Defense Radar Data with PEPE," COMPCON '72, Sept. 1972, pp. 69-72.

[116] Abel, N. E., et al. "TRANQUIL: A Language for an Array Processing Computer," Proc. AFIPS Fall Jt. Computer Conf., 1969, pp. 57-75.

[117] Millstein, R. E. "Compiler Design for ILLIAC IV," National Technical Information Service, Jan. 1972, AD 737260.

[118] Minsky, M.; and Papert, S. "On Some Associative, Parallel and Analog Computations," Associative Information Techniques, E. J. Jacks (Ed.), American Elsevier, New York, 1971.

[119] Thurber, K. J.; and Patton, P. C. "The Future of Parallel Processing," IEEE Trans. Computers, Dec. 1973, pp. 1140-1143.

[120] Knight, K. E. "Changes in Computer Performance," Datamation, Vol. 12, Sept. 1966, pp. 40-54.

[121] Amdahl, G. M. "Validity of the Single Processor Approach to Achieving Large Scale Computing Capabilities," Proc. AFIPS Spring Jt. Computer Conf., Vol. 30, Thompson Publishing, Washington, D.C., 1967, pp. 483-485.

[122] Berg, R. O.; and Kinney, L. L. "A Digital Signal Processor," COMPCON '72, Sept. 1972, pp. 45-48.

[123] McIntyre, D. E. "The Table Lookup Problem Revisited," National Technical Information Service, AD 831943, April 26, 1968.

[124] Kuck, D. J., and Sameh, A. "Parallel Computation of Eigenvalues of Real Matrices," AD 737292, National Technical Information Service, Nov. 1971.

[125] Stone, H. S. "An Efficient Parallel Algorithm for the Solution of a Tridiagonal Linear System of Equations," J. ACM, Jan. 1973, pp. 27-38.

[126] Gilmore, P. A. "Numerical Solution to Partial Differential Equations by Associative Processing," Proc. AFIPS Fall Jt. Computer Conf., Vol. 39, AFIPS Press, Montvale, N.J., 1971, pp. 411-418.
