
W. Daniel Hillis and Lewis W. Tucker

THE CM-5 CONNECTION MACHINE: A SCALABLE SUPERCOMPUTER

The CM-5 Connection Machine is a scalable homogeneous multiprocessor designed for large-scale scientific and business applications. In this article we describe its architecture and implementation from the standpoint of the programmer or user of parallel machines. In particular, we emphasize three features of the Connection Machine architecture: scalability, distributed memory/global addressing, and distributed execution/global synchronization. We believe that architectures of this type will replace most other forms of supercomputing in the foreseeable future. Examples of the current applications of the machine are included, focusing particularly on the machine's ability to support a variety of programming models. The article is intended to be a general overview appropriate to a user or programmer of parallel machines, as opposed to a hardware designer. The latter may prefer more detailed descriptions elsewhere [16, 23, 24].

Architecture vs. Implementation

In describing the CM-5, it is useful to distinguish between issues of architecture and implementation. While there is no hard line between the two, the architectural features of the machine are those that are intended to remain constant across multiple implementations. In a microprocessor, for example, one of the most important features of the architecture is usually the instruction set. Object-code compatibility from one generation of the machine to the next is an important issue for many microprocessor applications. Supercomputer applications, on the other hand, are usually written in higher-level languages, so the instruction set is not a necessary part of the architectural specification of the CM-5. The architecture is defined by other issues, such as the addressability of memory, the user-visible mechanisms of synchronization, and the functionality of the communication and I/O systems.

We will describe both the architecture of the CM-5 and its current implementation. Figure 1, for example, is the architectural block diagram of the machine that is unlikely to change. The specifications shown in Table 1, on the other hand, are descriptions of the current implementation, which are likely to change from year to year as new versions of the machine are introduced and faster, denser chips become available.

Architectural Overview

The CM-5 is a coordinated homogeneous array of RISC microprocessors. We note that the acronym for this description is CHARM. CHARM architectures take advantage of high-volume microprocessors and memory components to build high-performance supercomputers that are capable of supporting a wide range of applications and programming styles. By CHARM we mean any homogeneous collection of RISC microprocessors that has the following coordination mechanisms:

1. A low-latency, high-bandwidth communications mechanism that allows each processor to access data stored in other processors

2. A fast global synchronization mechanism that allows the entire machine, including the network, to be brought to a defined state at specific points in the course of a computation

These coordinating mechanisms support the efficient execution of sequentially deterministic parallel programs. Both mechanisms are important for supporting the data-parallel model of programming [14]. The CM-5 is a CHARM architecture, and we believe most of the massively parallel machines currently on the drawing boards in the U.S., Europe, and Japan are also CHARMs.
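To make the two mechanisms concrete, here is a minimal sketch of one relaxation step written against hypothetical primitives. The functions get_remote and global_barrier stand in for data-network access and the control-network barrier; they are illustrative assumptions, not the CM-5 API.

/* Hypothetical coordination primitives; illustrative only. */
extern double get_remote(int node, const double *addr); /* read via the data network */
extern void global_barrier(void);                       /* control-network barrier */

/* Each of nnodes processors owns a slice x[0..n-1]; node `me` averages
   neighbors, fetching the boundary element from the node to its left. */
void relax_step(double *x, double *y, int n, int me, int nnodes)
{
    double left = (me > 0) ? get_remote(me - 1, &x[n - 1]) : 0.0; /* node 0 boundary simplified */

    global_barrier();   /* all remote reads (and in-flight traffic) complete
                           before any node overwrites x in a later phase */

    y[0] = (left + x[1]) / 2.0;
    for (int i = 1; i < n - 1; i++)
        y[i] = (x[i - 1] + x[i + 1]) / 2.0;
    /* right boundary omitted for brevity */
}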

The CM-5 consists of a large number of processing nodes connected by a pair of networks, implementing the coordination functions described earlier. The number of processors may vary from a few dozen to tens of thousands. The number of processors in the machines currently installed ranges from 32 to 1,024. The networks are similar to the backplanes of conventional computers in that the machine's I/O and processing can be expanded by plugging more modules into the networks. Unlike a bus in a conventional computer, the bandwidths of the networks increase in proportion to the number of processors (see Figure 1). The memory of the machine is locally distributed across the machine for fast local access. Each processor accesses data stored in the remote memories of the other processors via the network.

One of the most important features of the CM-5 is its scalability. The current implementation, including the networks, clocking, I/O system, and software, is designed to scale reliably up to 16,384 processors. The same application software runs on all of these machines. Because the machine is constructed from modular components, the number of processors, amount of memory, and I/O capacity can be expanded on-site, without requiring any change in application software.

The Processor Node

Microprocessor technology changes very rapidly, and the CM-5 was designed with the assumption that the processing node would therefore change from implementation to implementation. This implies that the system software, including compilers and the operating system, must be written in a way that is largely processor-independent. It also requires, at the hardware level, that the clocking system of the network be decoupled from that of the processor, so the processor and network can be upgraded independently.

The current implementation of the CM-5 uses a SPARC-based processing node, operating at a 32- or 40MHz clock rate, with 32MB of memory per node. The SPARC is augmented with auxiliary chips that increase floating-point execution rate and memory bandwidth. The result is a 5-chip microprocessor with performance characteristics similar to those of single-chip microprocessors that will be widely available in a few more years. The SPARC was chosen primarily for its wide base of existing software. One drawback in using a standard microprocessor, SPARC included, is that standard RISC processors generally depend on a data cache to increase the effective memory bandwidth. These caching techniques do not perform as well in a large parallel machine, because many applications reference most of the data in memory on each pass of an iteration. This results in relatively little reuse of cached data, reducing the effectiveness of the cache. In a parallel machine, the absolute uncached bandwidth between the memory and processor is an important parameter for performance.

In the current implementation of the CM-5 we addressed the memory bandwidth problem by adding four floating-point data paths, each with direct parallel access to main memory, to the standard SPARC microprocessor. The four data paths have an aggregate bandwidth to main memory of 512MB/sec per processing node, and a 128MFLOPS peak floating-point rate. This is enough bandwidth to allow a single memory reference for each multiply and add when all of the floating-point units are operating at peak rate. A block diagram of the node is shown in Figure 2. The data paths are capable of performing scalar or vector operations on IEEE single- or double-precision floating-point numbers, as well as 64-bit integers. They also support specialized operations such as bit count. While the use of additional floating-point units with their own paths to memory is an important implementation feature of the current CM-5, it is not generally visible to users at the programming level.
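As a consistency check on these numbers (treating a multiply-add as two floating-point operations, the usual convention for peak rates):

$$128\ \text{MFLOPS} = 64\times10^{6}\ \text{multiply-adds/sec}, \qquad 64\times10^{6}\ \text{refs/sec}\times 8\ \text{bytes/ref} = 512\ \text{MB/sec},$$

so the four data paths can deliver one fresh 64-bit operand for every multiply-add pair at peak rate.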

The measured rate in a single node is currently 64MFLOPS for matrix multiply and 100MFLOPS for matrix-vector multiplication. The rate for the 500 x 500 Linpack benchmark is 50MFLOPS. The sustained rate for 8K-point FFTs on the node is 90MFLOPS. In a typical application, since much of the time is spent in interprocessor communications, the peak execution rate of the node is not usually a reliable predictor of average performance on applications. In data-parallel applications, where the opportunities for parallelism increase with the size of the data, performance increases almost linearly with an increasing number of processors and dataset size. This near-linear scaling is possible because network bandwidth scales in proportion to the number of processing nodes. Benchmarks such as the large-matrix Linpack [5], in which the problem size increases with the amount of available memory, as shown in Figure 3, are an example of a problem with near-linear speedup. Measurements on various sizes of CM-5's show the Linpack benchmark running with near-linear speedups at rates of 45MFLOPS/node. A 16,000-processor CM-5 is expected to exceed 700GFLOP/sec in performance on this benchmark.
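That projection is consistent with a simple extrapolation from the measured per-node rate:

$$16{,}384\ \text{nodes}\times 45\ \text{MFLOPS/node}\approx 737\ \text{GFLOPS}.$$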

Choosing the right balance between memory bandwidth, communication bandwidth, and floating-point capacity is one of the major design issues of a supercomputer. In the past, when floating-point units were very expensive, designers measured the efficiency of their machine by the percentage utilization of the floating-point units in a typical application. This sometimes led to designs that were inefficient at using an even more precious resource, such as memory bandwidth. Since applications vary in their floating-point-to-memory reference ratio, no architecture uses both of these resources at 100% capacity in every application. Today, floating-point units represent a relatively small portion of the total cost, so optimizing the design around their utilization makes little sense. Instead, the design point chosen in the CM-5 was based on the balance point calculated by the equal-ratios method [12], which takes the relative cost and performance aspects into account.

[Figures 1-3 appear here: the CM-5 network block diagram (processing nodes and I/O processors on the data and control networks); the processing-node block diagram (SPARC microprocessor with instruction fetch and program control, data- and control-network interfaces, and four vector units with 64-bit data paths to 32MB of memory); and a plot of sustained Linpack GFLOPS (0 to 40) versus problem size for 32-, 128-, 512-, and 1,024-node CM-5s, a CM-2, and a 16-processor Y-MP C90. Captions follow.]

Figure 1. CM-5 communications networks
The CM-5 consists of processing, control, and I/O nodes connected by two scalable networks which handle communication of data and control information. Unlike a bus on a conventional computer, the communications network is scalable in the sense that the bandwidth of the network grows in proportion to the number of nodes. Shown is the expansion of the network that results when additional processing and I/O nodes are added.

Figure 2. CM-5 processing node with vector units
Each processing node of the current implementation of the CM-5 consists of a RISC microprocessor, 32MB of memory, and an interface to the control and data interconnection networks. The processor also includes four independent vector units, each with a direct 64-bit path to an 8MB bank of memory. This gives a processing node a double-precision floating-point rate of 128MFLOPS and a memory bandwidth of 512MB/sec.

Figure 3. Scalable performance
The performance of data-parallel applications increases with the size of the data and the available processing resources. Shown is the measured performance of the Linpack benchmark on the CM-5 for increasing problem size and number of processors.



The Data Network

Like the CM-2, the CM-5 incorporates two important types of communication networks: one for point-to-point communication of data, and a second for broadcast and synchronization. As in the CM-2, the data network is implemented as a packet-switched logarithmic network. The CM-5 uses a fat-tree topology (see Figure 4) [15], which is a more general and flexible structure than the grid and hypercube used in the CM-1 and CM-2. This topology offers better scalability properties than hypercubes or grids. Specifically, the network expands naturally as new nodes are added, and the capacity for network traffic across the machine (the bisection bandwidth) is proportional to the number of nodes in the machine. This is not the case with grid-based networks. The fat-tree topology also has good fault-tolerance properties.
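The contrast with grids can be made quantitative (a standard network-topology result, not a CM-5-specific figure): cutting an $N$-node machine in half severs a number of links that scales as

$$B_{\text{2-D mesh}}\propto\sqrt{N},\qquad B_{\text{3-D mesh}}\propto N^{2/3},\qquad B_{\text{fat tree}}\propto N,$$

so only the fat tree holds per-node bisection bandwidth constant as the machine grows.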

The data network uses a packet-switched router which guarantees deadlock-free delivery. The message-routing system uses an adaptive load-balancing algorithm to minimize network congestion. Packets are delivered at a rate of 20MB/sec to nearby nodes. The bisection bandwidth of a 1,024-node CM-5, that is, the minimum bandwidth between one half of the machine and the other, is 5GB/sec in each direction. Network latencies across a 1K-processor machine are between 3 and 7 microseconds. This bandwidth is achieved on regular or irregular communication patterns. Bounds-checking hardware, error checking, end-to-end counting of messages, and rapid context switching of network state protect users from interference in a time-shared, multiple-partition configuration. As discussed later, the data network, in conjunction with the run-time system software, supports either a message-passing or a global address space programming model.

Table 1. CM-5 specifications
The specifications of the current implementation of the CM-5 are shown here. The largest machine shipped to date is 1,024 processors, but the software, networks, clocking, power distribution, and so forth are all designed to scale to the 16,384-processor system.

Processors               32              1,024           16,384
Network addresses        64              2,048           32,768
Number of data paths     128             4,096           65,536
Peak floating point      4 GFLOPS        128 GFLOPS      2,048 GFLOPS
Memory                   1GB             32GB            512GB
Memory bandwidth         16GB/sec        512GB/sec       8,192GB/sec
Bisection bandwidth      320MB/sec       10GB/sec        160GB/sec
I/O bandwidth            320MB/sec       10GB/sec        100GB/sec
Global synch time        1.5 microsecs   3 microsecs     4 microsecs




The Control Network

The CM-5's control network supports the broadcast, synchronization, and data reduction operations necessary to coordinate the processors. The network can broadcast information, such as blocks of instructions, to all processors simultaneously or within programmable segments. This control network also implements global reduction operations, such as computing the total of numbers posted by all processors, computing a global logical AND, a global MIN, and so forth. Also, the control network implements global and segmented reduction and parallel prefix functions that have been shown to be important building blocks for parallel programming [3, 14].
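To pin down what these operations compute, the following fragment gives their sequential reference semantics. This is only a sketch: on the CM-5 these are single hardware operations on the control network, not loops over processors; v[i] is the value posted by processor i.

#include <stdio.h>

/* Reference semantics for a global reduction and a segmented
 * parallel prefix (scan).  seg[i] is nonzero where a segment begins. */
static double global_sum(const double *v, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; i++) s += v[i];
    return s;                        /* every processor receives s */
}

static void segmented_prefix_sum(const double *v, const int *seg,
                                 double *out, int n)
{
    double run = 0.0;
    for (int i = 0; i < n; i++) {
        if (seg[i]) run = 0.0;       /* reset at segment boundaries */
        run += v[i];
        out[i] = run;                /* processor i receives its prefix */
    }
}

int main(void)
{
    double v[]   = {3, 1, 4, 1, 5, 9};
    int    seg[] = {1, 0, 0, 1, 0, 0};   /* two segments of three */
    double out[6];

    printf("sum = %g\n", global_sum(v, 6));           /* 23 */
    segmented_prefix_sum(v, seg, out, 6);
    for (int i = 0; i < 6; i++) printf("%g ", out[i]); /* 3 4 8 1 6 15 */
    printf("\n");
    return 0;
}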

Unlike the CM-2, the operations of the individual CM-5 instructions are not necessarily synchronized. Each processor has its own program counter, and fetches instructions from a local instruction cache. Rapid synchronization, however, is provided by the control network, employing a barrier synchronization mechanism. Each processor may elect to wait for all the other processors to reach a defined point in the program before proceeding. This mechanism allows processors to detect whether data network messages are still in transit between processors before establishing synchronization. The details of this mechanism are not normally visible to the application programmer, since the appropriate synchronization commands are typically generated by compilers and run-time libraries.

From the programmer's standpoint, the fast synchronization supports the sequential control flow required by the data-parallel model. The automatically inserted synchronization barriers guarantee that each data-parallel operation is completed before the next begins. Such synchronization events occur whenever there are potential dependencies between two parallel operations.
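For example, a data-parallel compiler might emit code along the following lines for two dependent array statements. This is a sketch, not actual compiler output; global_barrier and get_remote are the same hypothetical stand-ins used earlier.

/* Pseudo-output for the two data-parallel statements
 *     B = A + 1
 *     C = shift(B, 1) * 2
 * The second statement reads an element of B written by a neighboring
 * processor, so a machine-wide barrier must separate the two steps.
 * Each of the nprocs processors owns slice_len consecutive elements. */
extern void   global_barrier(void);
extern double get_remote(int node, const double *addr);

void two_dependent_steps(const double *a, double *b, double *c,
                         int slice_len, int me, int nprocs)
{
    for (int i = 0; i < slice_len; i++)
        b[i] = a[i] + 1.0;

    global_barrier();   /* every write to B is complete machine-wide */

    /* boundary element comes from the next processor's slice
       (last node's boundary simplified to zero for brevity) */
    double next = (me + 1 < nprocs) ? get_remote(me + 1, &b[0]) : 0.0;
    for (int i = 0; i < slice_len; i++)
        c[i] = ((i + 1 < slice_len) ? b[i + 1] : next) * 2.0;
}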

Input/Output

Many supercomputers spend much of their time accessing and updating data stored externally to the processor. The most challenging part of input/output is generally the transfer of data to and from disk subsystems. The CM-5 was designed with the assumption that mass storage devices, such as disk arrays, will continue to increase in capacity and bandwidth. One implication is that no fixed I/O channel capacity will remain appropriate for very long. The CM-5 I/O system is therefore constructed in such a way as to be connected directly into the processor's communication network. The network bandwidth available to a given I/O device is determined by the number of taps into the network. For each I/O device, multiple network taps are combined to form a single I/O channel that operates at whatever rate the device will allow.

On CM-5 systems, in addition to processing nodes, there will typically be a number of independent nodes with attached disk drive units. These "disk nodes" are configured by software to form a completely scalable, high-bandwidth RAID file system, limited only by the aggregate bandwidth of its individual drives. To an application, the disks appear as a single very-high-bandwidth mass storage subsystem.

Unix Operating System

The CM-5 is normally operated as a multiuser machine. The system supports a Unix-based operating system capable of time-sharing among multiple users, and a priority-based job queuing system compatible with NQS. Tasks are fully protected from one another through memory management units and bounds-checking hardware in the network. The network state is always accessible to the operating system, so messages in transit can be "swapped out" during the process switch from user to user.

In addition to normal time-sharing, the CM-5 supports a form of space-sharing called "partitioning." The CM-5 can be partitioned under software control into several independent sections. Each partition effectively works as an independent, time-shared machine. Hardware in the communication network isolates these partitions in such a way that the computations in one partition of the machine have no effect on the computations in another. Each partition appears to the software as a single linear address space, although it may map onto physically discontinuous sections of the machine. It is even possible to power down a partition of the machine for maintenance while other partitions continue to operate.

Because the CM-5 is used in large-scale, long-running applications requiring very large datasets, the standard Unix file system has been extended to handle files with terabytes of data. Checkpointing facilities, invoked either automatically or under user control, have been added to the operating system [22].

Figure 4. Data network topology
The data network uses a fat-tree topology. Each interior node connects four children to two or more parents (only two parents are shown). Processing nodes form the leaves of the tree. As more processing nodes are added, the number of levels of the tree increases, allowing the total bandwidth across the network to increase proportionally to the number of processors.


For a 1-D array, compute the average of nearby neighbors.

Fortran 77 with automatic parallelization

      integer size, i
      parameter (size=1000)
      double precision x(size), y(size)

      do i = 2, size-1
         y(i) = (x(i-1)+x(i+1))/2.0
      enddo

High-Performance Fortran (Fortran 90)

forall (i=2:size-1)
   y(i) = (x(i-1)+x(i+1))/2.0

C*

where (pcoord(0)>0 && pcoord(0)<SIZE-1)
    y = ([.-1]x + [.+1]x)/2.0;

Message passing

int nodesize = size/nproc;

/* exchange boundary elements with neighbors (interior nodes shown;
   nodes 0 and nproc-1 would skip the missing neighbor) */
if (myself > 0)
    CMMD_send_noblock(myself-1, tag, &x[0], len);
if (myself < nproc-1) {
    CMMD_send_noblock(myself+1, tag, &x[nodesize-1], len);
    CMMD_receive(myself+1, tag, &right, len);
}
if (myself > 0)
    CMMD_receive(myself-1, tag, &left, len);

for (i = 1; i < nodesize-1; i++)
    y[i] = (x[i-1] + x[i+1])/2.0;
y[0] = (left + x[1])/2.0;
y[nodesize-1] = (x[nodesize-2] + right)/2.0;

*Lisp

(*if (and (> (self-address!!) 0)
          (< (self-address!!) (1- size)))
    (*set y (/ (+ (news!! x -1) (news!! x 1)) 2)))

Id

def relax x =
  { l,u = bounds x;
    typeof x = vector *0;
  in
    { array (l,u) of
        | [i] = (x[i-1]+x[i+1])/2.0 || i <- l+1 to u-1 }};


Figure 5. Rosetta stone
This example program fragment shows the same function written in different programming styles. All of these programs compile and run efficiently on the CM-5. The program computes the average of neighboring elements in a one-dimensional array.


Relation of CM-5 to CM-2

The CM-5 is the successor to the CM-2 and CM-200. The CM-5 shares with the CM-2 the two-network organization, the global-address local-memory structure, the ability to broadcast and synchronize instructions, and special hardware support for data-parallel execution. Its architecture differs most dramatically from the CM-2 in the ability of the individual nodes to execute instructions independently and in the organization of the I/O system. Also, the CM-5 uses a larger-grained 64-bit processor, instead of the single-bit processors and 32-bit floating-point paths of the CM-2. Most programs written for the CM-2 require only recompilation to run on the CM-5.

The most important lesson learned from the CM-2 was the importance of the scalable data-parallel model, which is used in most commercial and scientific applications of massively parallel machines. As it turned out, the strict per-instruction synchronization of the CM-2 was not required to support this model. The CM-5 is a MIMD machine (multiple instruction/multiple data), although it incorporates many features that are normally found only in SIMD (single instruction/multiple data) machines. The machine incorporates several important architectural features from message-passing machines such as the Caltech Hypercube [21]. The fat-tree data network is based on work from MIT [15] and the New York University Ultracomputer [20].

Support for Multiple Programming Models

The CM-5 is particularly effective at executing data-parallel programs, but it was designed with the assumption that no one model of parallel programming will serve all users' needs. The data-parallel model, which is embodied in languages such as Fortran-90, *Lisp, and C*, has proved extremely successful for broad classes of scientific and engineering applications. On the other hand, this model does not seem to be the best way to express a control-structured algorithm such as playing chess by searching the game tree [8]. In this case, a message-passing model is often more appropriate. Other applications may be better suited by process-based models, in which tasks communicate via pipes or shared variables; by object-based models, in which each processor executes a different program; or even by dataflow models, such as those supported by the language Id. Figure 5 shows a "rosetta stone" of program fragments performing a similar operation in many different programming languages. All of these languages compile efficiently for the CM-5. For this particular example, the data-parallel languages offer the most concise formulation, but in other cases message-passing approaches may offer a more natural expression of an algorithm.

Since the actual hardware required to support all of these programming models is very similar, it is likely that most parallel machines of the future will be designed to support all of them. All that is required is a good communication system, an efficient global synchronization mechanism, and fast hardware broadcast and reduction. With today's technology, this is relatively inexpensive. This is part of the charm of CHARM architectures.

Shared Memory vs. Shared Address Space

Some of the most interesting issues in computer architecture arise in resolving the conflicts between the needs of the programmer and the constraints of the hardware. Sometimes apparent conflicts are resolved by careful separation of the actual requirements. Consider, for example, the shared vs. distributed memory issue. Programmers would like all of the data to be accessible from every processor, which suggests the use of a single central memory that is accessible by all the processors. Hardware constraints, on the other hand, make this central-memory approach costly and inefficient. Local memory, distributed among the processors, is much faster and less expensive.

It would seem at first glance that the conflict is irresolvable, and that the user will be forced to choose between convenience and efficiency. Fortunately, the choice is unnecessary. It is possible to address both the hardware and the software needs simultaneously by distributing memories for fast local access and providing an additional mechanism for accessing all of the memory through a network. This allows the compiler and operating system to provide the user with a shared address space, without sacrificing the cost-effectiveness, speed, and efficiency of distributed memory. The CM-5 adopts this combination of shared address space and distributed memory. Hennessy and Patterson have published a discussion of these two dimensions of architecture [10]. Figure 6 shows one of their tables with the addition of the CM-5.
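One plausible way a compiler and run-time system can present a shared address space over distributed memory is to split each global address into a node number and a local offset. The field widths and the primitives local_read and remote_read below are illustrative assumptions, not the CM-5's actual encoding.

#include <stdint.h>

#define OFFSET_BITS 25                       /* 2^25 = 32MB per node (assumed) */

extern double   local_read(uint32_t offset);
extern double   remote_read(uint32_t node, uint32_t offset);
extern uint32_t my_node;

/* Load one double from a global address: local accesses take the
   fast path; remote accesses become data-network requests. */
double load_global(uint64_t gaddr)
{
    uint32_t node   = (uint32_t)(gaddr >> OFFSET_BITS);
    uint32_t offset = (uint32_t)(gaddr & ((1u << OFFSET_BITS) - 1));

    return (node == my_node)
        ? local_read(offset)            /* fast local path      */
        : remote_read(node, offset);    /* data-network request */
}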

SIMD vs. MIMD

A similar analysis can be applied to the SIMD/MIMD issue. Programmers often find it most convenient to specify programs in terms of a single flow of control [2], using a sequential or a data-parallel programming style. This frees the programmer from issues of synchronization, deadlock, stale access, nondeterministic ordering, and so forth. Parallel computers, on the other hand, can operate more efficiently executing independent instruction streams in each processor, since a sequential program overspecifies the ordering and synchronization. These two issues are resolved by providing a machine with both independent instruction streams and the synchronization and broadcast mechanisms necessary to support a sequential programming style. The program is guaranteed to compute exactly the same results as it would operating on a single processor, although the result is generally computed more quickly.

If we take the classical SIMD machine to be one that is synchronized on every instruction, and the classical MIMD machine to be one in which synchronization is left to the programmer, then the CM-5 normally operates in the middle ground; that is, synchronization occurs only as often as necessary to guarantee getting the same answer as the globally synchronized machine. These invocations of the hardware synchronization operations are generated automatically by the compiler. The processors are free to operate independently whenever such synchronization is unnecessary. Even applications that were originally written for traditional MIMD machines have been able to take good advantage of these hardware synchronization mechanisms. Of course, it is also possible to program the machine as a standard MIMD machine, or for that matter to force a synchronization on every instruction execution, in which case it operates as a SIMD machine. This is illustrated in Figure 7.

Applications

When massively parallel computers were first introduced, many expected that they would achieve good performance on only a narrow range of applications. Experience has shown that, contrary to initial expectations, most applications involving large amounts of data can take advantage of the computational power of massively parallel machines [9]. This is not to say that most existing applications, as currently written, run well on massively parallel machines. They do not. Most existing supercomputer applications were designed with much smaller, nonparallel machines in mind. The average application on the Cray Y-MP at the NCSA Supercomputer Center, for example, runs at 70MFLOP/sec in 25MB of memory. Such small-scale applications rarely have the inherent parallelism to take advantage of massive parallelism. On the other hand, large-scale applications that involve many GBs of data can almost always be written in a way that takes good advantage of large-scale parallelism. In a recent survey of the world's most powerful supercomputers, four of the top five systems reported were Connection Machines [6].

Current applications of the Connection Machine system include high-energy physics, global climate modeling and geophysics, astrophysics, linear and nonlinear optimization, computational fluid dynamics and magnetohydrodynamics, electromagnetism, computational chemistry, computational electromagnetics, computational structural mechanics, materials modeling, evolutionary modeling, neural modeling, and database search [19]. There are many other examples. These applications use a variety of numerical methods, including finite difference and finite element schemes, direct methods, Monte Carlo calculations, particle-in-cell methods, n-body calculations, and spectral methods [4].

Many applications that were originally developed for the CM-2 achieve high rates of sustained performance on the CM-5. For example, a rarefied gas dynamics simulation for modeling situations such as spacecraft reentry [17, 18], which runs at 8GFLOP/sec on the largest CM-2, runs at a sustained rate of 64GFLOP/sec on a 1,024-processor CM-5 (1GFLOP/sec is about the peak rate of a single processor on a Y-MP C90). Another example is the global climate modeling software developed by Los Alamos National Laboratory (see Figure 8). This simulation is designed to model global climate warming due to the buildup of greenhouse gases. The algorithm relies on a dynamically varying mesh of over 250,000 points and 1,200,000 tetrahedra to cover the Earth's atmosphere. The application consists of over 30,000 lines of Fortran. Dynamic rearrangement of the mesh automatically adjusts local resolution to resolve important details of the flow, such as storm formations. Simulation of a six-week time period requires about one hour of simulation time on a 64K CM-2. When run on a 1,024-processor CM-5, this six-week simulation completes in less than five minutes.

Massively parallel machines are also becoming increasingly important in applications outside of science and engineering. Many of these commercial applications take advantage of the scalable input/output capability of the CM-5. For example, the American Express Corporation is using two CM-5's to process its large credit card databases. Another example is the oil industry, where companies such as Mobil and Schlumberger use CM-5's for large-scale seismic processing.

Because the CM-5 provides direct user-level access to the network hardware, it is also used as an experimental platform for testing new architectures. There is no operating system software overhead between the user and the network. Computer scientists at the University of California, Berkeley are using this property of the CM-5 to investigate a model of parallel computation based on what they term a "Threaded Abstract Machine" (TAM). This abstract machine is the compilation target for parallel languages such as Id90. This language, originally designed for dataflow computers, relies on a fine-grained, message-driven execution model. They have taken advantage of direct access to the data network to implement an ultralight form of remote procedure call, called active messages [7].
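The idea behind an active message can be sketched in a few lines. This is a hedged illustration of the mechanism described in [7], not Berkeley's actual implementation; the types and net_receive are assumptions. Each message carries the address of a user-level handler that the receiver invokes directly on the payload, avoiding buffering and operating-system dispatch.

#include <stdint.h>

typedef void (*am_handler)(uint32_t src_node, const uint64_t payload[3]);

struct am_packet {
    am_handler handler;     /* runs in user space on the receiver */
    uint32_t   src_node;
    uint64_t   payload[3];  /* small fixed payload */
};

extern int net_receive(struct am_packet *p);  /* nonblocking; 1 if a packet arrived */

/* Drain the network: dispatch each arriving message straight to its
   handler, with no buffering or OS involvement on the critical path. */
void net_poll(void)
{
    struct am_packet p;
    while (net_receive(&p))
        p.handler(p.src_node, p.payload);
}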

At the University of Wisconsin, investigators are also using the CM-5 for virtual prototyping; that is, they use the machine to simulate other architectures to better understand alternative architectural choices. The CM-5 Wisconsin Wind Tunnel prototyping system mimics the operations of a large-scale, cache-coherent, shared-memory machine. Almost all application instructions are directly executed by the hardware. References to shared data that are not locally cached invoke communication to retrieve memory pages from the global address space. The scalability and flexibility of the CM-5 have allowed these researchers to run full-sized applications on a simulated architecture and study the effects of various cache-coherency schemes [11].

Some of the applications of the CM-5 really require even greater computing power than is available today. For example, a consortium of particle physicists has formed to construct a special version of the CM-5 for teraflops calculations in quantum chromodynamics.

[Figure 6 tabulates parallel processors by physical memory location and addressing:

                        Multiple addressing             Shared addressing
  Distributed memory    Cosmic Cube, iPSC/860,          CM-2, CM-5, IBM RP3, Cedar
                        nCUBE 2
  Centralized memory                                    IBM 3090-600, Multimax, Y-MP8

Figure 7 classifies machines by instruction stream: a classic SIMD machine has a single, globally synchronous instruction stream; a classic MIMD machine has multiple, pairwise-synchronous streams; the CM-5 has multiple instruction streams with global synchronization.]

Figure 6. Parallel processors according to centralized versus distributed memory and shared versus multiple addressing (source: Patterson and Hennessy)
The hardware notion of shared memory is often confused with the software notion of shared addressing. Machines with locally distributed memory can be designed to support a programming model in which the user sees all data in a single address space. This allows the machine to combine the hardware efficiency of distributed memory with the software convenience of a single address space.

Figure 7. Instruction stream
In a classical SIMD machine, global synchronization is provided by a single broadcast instruction stream. In a classical MIMD machine, multiple instruction streams are synchronized via pairwise interactions, for example through messages or shared-memory references. The CM-5 is an example of a coordinated MIMD machine, in which multiple instruction streams are synchronized whenever necessary through global synchronization hardware.

Figure 8. X3D: massively parallel adaptive advanced climate model
This global climate model simulation is an example of a scientific application of the CM-5. Lines represent an irregular computational mesh of approximately 1,200,000 tetrahedra, used to compute Lagrangian hydrodynamics, Monte Carlo transport, and thermal heat diffusion. Land masses are shown in yellow. The goal of this application is to better answer fundamental questions about changes in the Earth's global climate through computer simulation. Credit: Harold Trease (Los Alamos National Laboratory)


These calculations could potentially advance our understanding of the structure of atomic particles [1]. Protein structure prediction and global climate modeling are other examples of problems that will ultimately require teraflops performance. These calculations have tremendously large potential economic impact. As a final example of an application that will require even greater performance, one of us (W.D. Hillis) is using a Connection Machine to design computer algorithms by a process analogous to biological evolution [13]. Perhaps some future version of the Connection Machine will use circuits designed by these evolutionary methods.

Summary

In summary, we believe the most important features of the Connection Machine are its scalability, its global access to locally distributed memory, and its global coordination of locally executing processes. Together, these features allow the CM-5 to support a variety of parallel programming models and a wide range of applications with good performance. Because the cost of this universality is relatively low, we predict that architectures of this type, using coordinated homogeneous arrays of RISC microprocessors (CHARM), will be the dominant form of large-scale computing in the 1990s.

References
1. Aoki, A., et al. Physics goals of the QCD Teraflop project. Int. J. Mod. Phys. C 2, 4 (1991), 892-947.
2. Backus, J. Can programming be liberated from the von Neumann style? Commun. ACM 21, 8 (Aug. 1978).
3. Blelloch, G. Scan primitives and parallel vector models. Ph.D. thesis, Carnegie Mellon University, Jan. 1989.
4. Boghosian, B. Computational physics on the Connection Machine. Comput. Phys. 4, 1 (1990).
5. Dongarra, J.J. Performance of various computers using standard linear equations software. University of Tennessee, Sept. 1992.
6. Dongarra, J.J. TOP500. Tech. Rep. 37831, Computer Science Dept., University of Tennessee, July 1993.
7. von Eicken, T., Culler, D., Goldstein, S. and Schauser, K. Active Messages: A mechanism for integrated communication and computation. In Proceedings of the Nineteenth International Symposium on Computer Architecture. ACM Press (May 1992).
8. Fox, G.C. Parallel computing comes of age: Supercomputer level parallel computations at Caltech. Concurrency: Pract. Exp. 1, 1 (Sept. 1989).
9. Fox, G.C. Achievements and prospects for parallel computing. Concurrency: Pract. Exp. 3 (1991), 725.
10. Hennessy, J.L. and Patterson, D.A. Computer Architecture: A Quantitative Approach. Morgan Kaufmann, 1990.
11. Hill, M., Larus, J., Reinhardt, S. and Wood, D. Cooperative shared memory: Software and hardware for scalable multiprocessors. Tech. Rep. 1096, Computer Science Dept., University of Wisconsin-Madison, July 1992.
12. Hillis, W.D. Balancing a design. IEEE Spectrum (May 1987).
13. Hillis, W.D. Co-evolving parasites improve simulated evolution as an optimization procedure. Physica D 42 (1990).
14. Hillis, W.D. and Steele, G.L. Data parallel algorithms. Commun. ACM 29, 12 (Dec. 1986).
15. Leiserson, C.E. Fat-trees: Universal networks for hardware-efficient supercomputing. IEEE Trans. Comput. C-34, 10 (Oct. 1985).
16. Leiserson, C.E., et al. The network architecture of the Connection Machine CM-5. In the Fourth Annual ACM Symposium on Parallel Algorithms and Architectures (June 1992).
17. Long, L.N., Kamon, M., Chyczewski, T.S. and Myczkowski, J. A deterministic parallel algorithm to solve a model Boltzmann equation (BGK). Comput. Syst. Eng. 3, 1-4 (Dec. 1992), 337-345.
18. Long, L.N., Kamon, M. and Myczkowski, J. A massively parallel algorithm to solve the Boltzmann (BGK) equation. AIAA Rep. 92-0563, Jan. 1992.
19. Sabot, G., Tennies, L., Vasilevsky, A. and Shapiro, R. In Scientific Applications of the Connection Machine, H.D. Simon, Ed. World Scientific, River Edge, N.J., second ed., 1992, pp. 364-378.
20. Schwartz, J. Ultracomputers. ACM Trans. Program. Lang. Syst. 2, 4 (Oct. 1980).
21. Seitz, C.L. The cosmic cube. Commun. ACM 28, 1 (Jan. 1985).
22. Thinking Machines Corporation. CM-5 Software Summary, CMost Version 7.1, Jan. 1992.
23. Thinking Machines Corporation. The Connection Machine CM-5 Technical Summary (Oct. 1991).
24. Wade, J. The vector coprocessor unit (VU) for the CM-5. In Symposium Record: Hot Chips IV, Stanford University, IEEE Computer Society, Aug. 1992.

CR Categories and Subject Descriptors: C.1.2 [Processor Architectures]: Multiple Data Stream Architectures: connection machines; C.5.1 [Computer System Implementation]: Large and Medium ("Mainframe") Computers: super (very large) computers

General Terms: Design, Performance

Additional Key Words and Phrases: CM-5, massively parallel systems

About the Authors:
W. DANIEL HILLIS is chief scientist and cofounder of Thinking Machines Corp. He is the architect of the Connection Machine computer. His research interests include parallel programming, evolutionary biology, computer architecture, and the study of complex dynamical systems with emergent behavior. His current research is on evolution and parallel learning algorithms.

LEWIS W. TUCKER is director of programming models at Thinking Machines Corp., and is responsible for the CM-5's message-passing and scientific visualization library development. His research interests include parallel architecture design, scientific visualization, and computer vision.

Authors' Present Address: Thinking Machines Corp., 245 First Street, Cambridge, MA 02142; email: {danny, tucker}@think.com

Connection Machine is a registered trademark of Thinking Machines Corp.

