
Exploring Optimal Cost-Performance Designs for Raw Microprocessors

Csaba Andras Moritz Donald Yeung Anant Agarwal

Laboratory for Computer Science

Massachusetts Institute of Technology

Cambridge, Massachusetts 02139

{andras,piano,agarwal}@lcs.mit.edu

Abstract

The semiconductor industry roadmap projects that advances in VLSI technology will permit more than one billion transistors on a chip by the year 2010. The MIT Raw microprocessor is a proposed architecture that strives to exploit these chip-level resources by implementing thousands of tiles, each comprising a processing element and a small amount of memory, coupled by a static two-dimensional interconnect. A compiler partitions fine-grain instruction-level parallelism across the tiles and statically schedules inter-tile communication over the interconnect. Because Raw microprocessors fully expose their internal hardware structure to the software, they can be viewed as a gigantic FPGA with coarse-grained tiles, in which software orchestrates communication over static interconnections.

One open challenge in Raw architectures is to determine their optimal grain size and balance. The grain size is the area of each tile, and the balance is the proportion of area in each tile devoted to memory, processing, communication, and I/O. If the total chip area is fixed, more area devoted to processing will result in higher processing power per node, but will lead to fewer tiles.

This paper presents an analytical framework with which designers can reason about the design space of Raw microprocessors. Based on an architectural model and a VLSI cost analysis, the framework computes the performance of applications, and uses an optimization process to identify designs that will execute these applications most cost-effectively.

Although the optimal machine configurations obtained vary for different applications, problem sizes and budgets, the general trends for various applications are similar. Accordingly, for the applications studied, assuming a 1 billion logic transistor equivalent area, we recommend building a Raw chip with approximately 1000 tiles, 30 words/cycle of global I/O, 20 Kbytes of local memory per node, 3-4 words/cycle of local communication bandwidth, and single-issue processors. This configuration will give performance near the global optimum for most applications.

1 Introduction

Advances in semiconductor technology have made possible the integration of multiple functional units, large cache memories, reconfigurable logic arrays and peripheral functions into single-chip microprocessors. Unfortunately, increases in the performance of contemporary microprocessors have come at the cost of increasing inefficiencies in silicon area usage. The inefficiencies arise from the complexity of designs that use hardware support to exploit more instruction level parallelism.

Maintaining a rapid increase in microprocessor performance will require a cost efficient utilization of silicon area. The MIT Raw microprocessor is a proposed architecture that exposes its internal hardware structure to the compiler, so that the compiler can determine and orchestrate the best mapping of an application to the hardware. A Raw microprocessor [1] is reminiscent of a coarse-grained FPGA and comprises a replicated set of tiles coupled together by a set of compiler orchestrated, pipelined switches (Figure 1). Each tile contains a simple RISC-like processing core and an SRAM memory for instructions and data. Instruction memory allows the multiplexing of the compute logic on a cycle by cycle basis. SRAM memory distributed across the tiles eliminates the memory bandwidth bottleneck, provides low latency to each memory module, and prevents off-chip I/O latency from limiting effective computational throughput.

The tiles are interconnected by a high-speed 2D mesh network, allowing inter-tile communications to occur with register-like latencies. The switches themselves contain some amount of SRAM so that the compiler can load into the switch a program that multiplexes the interconnect in a cycle by cycle fashion, just as in a virtual-wires based multi-FPGA system [4].

A typical Raw system includes a Raw microprocessor coupled with off-chip RDRAM (Rambus DRAM) through multiple high bandwidth paths. The two level memory hierarchy, namely a local SRAM memory attached to each tile inside the Raw chip and a large external RDRAM memory, is necessary to be able to solve large problems that exceed the size of the on-chip memory.

Raw architectures achieve the performance of FPGA-based custom computing engines by exploiting fine-grained parallelism and fast static communication, and by exposing the low-level hardware details to facilitate compiler orchestration. Unlike FPGA systems, however, Raw machines support instruction sequencing and are more flexible because the execution of a new operation can be accomplished merely by pointing to a new instruction. Compilation in Raw is faster than in FPGA systems because it binds into hardware commonly used compute mechanisms such as ALUs and memory paths, thereby eliminating repeated low-level compilations of these macro units. Binding of common mechanisms into hardware also yields better execution speed, lower area, and better power efficiency than FPGA systems.

The designer of an FPGA device or a Raw microprocessor is faced with the challenge of determining the best division of VLSI resources among computing, memory, and communication. This challenge is termed the balance problem. Furthermore, the designers of both an FPGA and a Raw device must address the grain size issue, in other words, whether to implement a few powerful tiles, or many small tiles each with lower performance.

This paper presents an analytical framework with which designers can reason about the division of resources in a VLSI chip. Although our analysis in this paper is focused on the Raw microprocessor, the analysis generalizes to other architectures. Our objective in this paper is to gain more insight into cost-performance optimal designs given a fixed amount of resources.

The framework presented in this paper focuses on the performance requirements of applications, introduces an architecture model, a cost model and a performance model for applications, and defines an optimization process to search for performance-optimal designs given a cost constraint.

The architecture model defines an architecture based on parameters that include the number of tiles P, the processing power of each tile p, the amount of memory in each tile m, and the communication bandwidth out of each tile c. The cost model estimates the cost, in terms of chip area, of realizing the given architecture with the specified set of parameters. The performance model estimates the runtime of each application as a function of the problem size. Performance estimation is based on both (1) a characterization of the application and its algorithms in terms of its requirements, including processing steps, memory and communication volumes, and (2) the architecture model.

Together with a cost constraint defined in terms of the cost model, our performance model allows us to perform a constrained optimization on the independent architectural variables. We can, for example, compute the points or contours in the architectural space that correspond to the best performance for a given cost, the lowest cost for a given level of performance, or the best efficiency defined by performance/cost.

The algorithms used in this study have been adapted to the Raw system architecture illustrated in Figure 1 by first partitioning them into subproblems that can fit within the Raw chip. Each subproblem is loaded from the external global RDRAM memory into the set of local memories in the tiles. Computation occurs on the subproblem, and the results are stored back into external RDRAM. All the subproblems are visited (possibly multiple times) in sequence. The algorithmic slowdown due to blocking the problem in this manner is accurately modeled. Each subproblem is solved in parallel with a blocking algorithm. Applications studied in this paper include Jacobi Relaxation, Dense Matrix Multiply, Nbody, FFT, and Largest Common Subsequence.

The specific contributions of this paper include:

• A general framework for reasoning about the design space of VLSI-based parallel architectures, including models for cost and performance.

• Insights on optimal grain size and balance in Raw microprocessors.

The remainder of this paper is organized as follows. Section 2 describes the three models developed in this paper: the performance model, the cost model and the application model, and gives a qualitative analysis of cost and performance. Section 2.7 formulates the optimization process based on the previous model assumptions, and Section 3 gives our experimental results. Section 4 concludes the paper.

Figure 1: Raw system composition. A typical Raw system includes a Raw microprocessor coupled with off-chip DRAM and stream I/O devices. Each Raw tile contains a simple RISC-like processor, an SRAM memory for instructions and data, and a switch. The tiles are interconnected in a 2D mesh network that is orchestrated by the compiler. The switches themselves contain some amount of SRAM so that the compiler can load into the switch a program that multiplexes the interconnect in a cycle by cycle fashion, just as in a virtual-wires based multi-FPGA system.

2 Framework

This section presents the analytical framework used in analyzing candidate designs in terms of their grain size and balance. We start with the motivation for a study of grain size issues.

2.1 Motivation

Two key questions in the design of a Raw microprocessor involve the grain size of its tiles and their balance. The grain size reflects the sizes of various components inside the tiles, such as memory, processing, and communication. A very coarse grain design would involve multiple-issue superscalars for processing and large local memories. Very fine grain designs would be similar to contemporary FPGAs and include a few bits worth of logic and memory within each tile, and a few wires connecting the individual tiles. Designs with a moderate grain size would involve very simple single-issue processors in each node.

Grain size and balance play a large part in determining the efficiency, or performance per unit cost, of a machine assuming a fixed total budget. If an engineer builds a small number of very large (coarse grain) nodes, a point of diminishing returns is reached where node performance increases very slowly (if at all) as node size is increased. On the other hand, building a large number of very small (fine grain) nodes will also result in diminishing returns as the communication costs dominate. The highest efficiency occurs at an optimal point between the two extremes. Similarly, as observed by Kung and others [11, 5], there is an optimal balance of resources between the processor, memory, and communication components within a node.

While there has been much debate on this topic, few concrete results have been reported. Machine balance and grain size continue to be determined more by convenience and market forces than by engineering analysis. Our primary motivation in undertaking this study is to provide an analytical framework that enables engineers to obtain insights into the tradeoffs in choosing various machine parameters.

Let us first provide an overview of the framework. Table 1 summarizes our notation, organized by model category. Throughout the paper, execution times are measured in machine cycles, information in units of machine words, and cost in SRAM bit equivalents (Sbe). As discussed in Section 2.4, an Sbe is the area occupied by one bit of SRAM memory.

ARCHITECTURE MODEL
  P      number of tiles in Raw
  p      processing power of each tile
  m      amount of SRAM in a tile
  c      local communication bandwidth
  l      single hop interconnect latency of a word
  o      software overhead for communication
  kd     average network distance traversed by messages
  lg     DRAM latency for global communication
  bg     global communication bandwidth or DRAM accessing bandwidth
  x      machine configuration: x = <P, p, m, c, l, o, kd, lg, bg>

COST MODEL
  Kp(p)    processor cost per tile
  Km(m)    memory cost per tile
  Kc(c)    local communication router cost per tile
  Klg(lg)  global latency cost for entire Raw system
  Kbg(bg)  global bandwidth cost for entire Raw system
  K(x)     total cost of Raw system

APPLICATION MODEL
  N             problem size
  N'            subproblem size: part of the original problem requiring one global step
  Rp(P,N,N')    total amount of computation required per tile to solve the problem
  Rm(P,N,N')    total amount of local memory required per tile
  Ram(P,N,N')   total amount of local memory per tile required to hold a subproblem
  Rlm(P,N,N')   total amount of buffer per tile for overlapping local communication
  Rgm(P,N,N')   total amount of buffer per tile for overlapping global communication
  Rc(P,N,N')    total amount of data required to be sent or received between tiles
  Ro(P,N,N')    software overhead attached to local communication events
  Rbg(P,N,N')   total amount of global communication required per chip
  Rlg(P,N,N')   total amount of overhead and latency required for global communication

PERFORMANCE FUNCTIONS
  Ts(N,p)       sequential run-time of application
  T(N,N',x)     parallel run-time of application
  Tp(N,N',x)    computation time per node including overheads
  Tc(N,N',x)    local communication time per node
  Tg(N,N',x)    global communication time per node
  Tai(N,N',x)   amount of overhead per tile resulting from algorithmic load imbalance

Table 1: Overview of model parameters and functions.
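For concreteness, the notation of Table 1 can be captured in code. The following is a minimal Python sketch of a machine configuration x (field names follow Table 1). The example values loosely echo the configuration recommended in Section 3; the single-hop latency l, the average hop distance kd, and the conversion of 20 Kbytes to 2560 words are illustrative assumptions, not values taken from the paper.

```python
from dataclasses import dataclass

@dataclass
class MachineConfig:
    """Machine configuration x = <P, p, m, c, l, o, kd, lg, bg> (Table 1)."""
    P: int      # number of tiles
    p: float    # processing power per tile (operations/cycle)
    m: int      # SRAM per tile (words)
    c: float    # local communication bandwidth (words/cycle)
    l: int      # single-hop interconnect latency (cycles)
    o: int      # software overhead per communication event (cycles)
    kd: float   # average network distance traversed by messages (hops)
    lg: int     # DRAM latency for global communication (cycles)
    bg: float   # global off-chip bandwidth (words/cycle)

# Illustrative example, roughly the configuration recommended in Section 3;
# l and kd are arbitrary placeholders, and m assumes 64-bit words (20 KB ~= 2560 words).
example_x = MachineConfig(P=1000, p=1, m=2560, c=4, l=1, o=3, kd=16, lg=100, bg=30)
```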

2.2 Overview of the Framework

Let us overview our analytical framework, illustrated in Figure 2, by considering a simple machine model. In its simplest form, a parallel machine can be characterized by the number of tiles or nodes P, the processing power of each node p (operations per cycle), the communication bandwidth of each node c (words per cycle), and the amount of local memory per node m (words).

For a given problem size and partitioning strategy, an application can be described by its processing, communication and memory requirements, or Rp (operations to be performed), Rc (words to be communicated) and Rm (words).

The performance of the application in terms of its runtime T is derived from the application requirements and the architectural model. If the processing time is Tp = Rp/p and the communication time is Tc = Rc/c, then, if processing and communication are fully overlapped, the runtime is given by T = max(Tp, Tc).

We use cost models Kp(p), Kc(c), Km(m) to map the machine parameters P, p, c, m into costs. In other words, the processor cost model Kp(p) provides the area cost of implementing a processor that can perform p operations per cycle. The total machine cost for a P processor machine is then K = P(Kp + Kc + Km).
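As a small sketch of this simplest form of the model, assuming fully overlapped processing and communication and treating the per-tile costs as given numbers:

```python
def runtime(Rp, Rc, p, c):
    """Runtime with fully overlapped processing and communication: T = max(Rp/p, Rc/c)."""
    return max(Rp / p, Rc / c)

def machine_cost(P, Kp, Kc, Km):
    """Total cost of a P-node machine: K = P * (Kp + Kc + Km)."""
    return P * (Kp + Kc + Km)
```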

Given an application with a fixed problem size N and an area budget B, a constrained optimization problem is defined with the objective of finding the optimal machine configuration that gives the smallest runtime for that budget. In other words, the framework finds the set of architectural parameters P, p, c, m that yield a minimum value for T, given that the cost K cannot exceed the available budget B. Or more formally:

find P, p, c, m to minimize T = max(Tp, Tc), subject to B ≥ K.

As discussed in more detail later, the optimization process is sped up by a set of balance constraints. The balance constraints state that for the optimal solution the computation time and communication time must be equal, and that the physical memory should fit the problem. The balance constraints greatly reduce the size of the search space, and thus the complexity of the optimization procedure.

Figure 2: Analytical framework. The key components of the framework are the models and the optimization process. Given an application with an associated problem size and a fixed budget, the constraint equations are derived for the optimization. The nonlinear optimization process searches the machine configuration space for the configuration that gives the minimal runtime for the application.

The following sections discuss each of the components of the framework and the optimization process in more detail.

2.3 Architecture model

This section discusses the parameters necessary for architecture characterization. Although several approaches to modeling the performance of a parallel computer have been proposed in the literature [2, 3], none are completely suited to modeling fine-grain parallel systems built on a chip. Figure 3 shows our characterization of a Raw system using the parameters described below. Our machine characterization differs from previous ones in the sense that it captures both local and global communication performance, and includes software overheads.

Figure 3: A four node illustrative Raw system characterized by the parameters <P, p, m, c, l, o, bg, lg>, where the processing power per node in operations per cycle is p, the amount of SRAM memory per node is m, the local communication bandwidth per node in words per cycle is c, the software overhead for a communication event in cycles is o, the single hop communication latency is l, the global off-chip communication bandwidth per Raw chip in words per cycle is bg, and the RDRAM latency expressed in cycles is lg.

We choose as independent parameters the number of nodes, P; the processing power per node in operations per cycle, p; the memory per node in words, m; the local communication bandwidth per node in words per cycle, c; the software overhead for communication in cycles, o; the single hop latency of the network, l; the global off-chip communication bandwidth per Raw chip in words per cycle, bg; and the RDRAM latency expressed in cycles, lg.

As an example, sending a local inter-tile message of length L words first involves spending o cycles launching the message. The message header word travels on average a distance of kd hops in the network, using l cycles per hop. Because the bandwidth out of a node is c words per cycle, subsequent message words take 1/c cycles each to enter the network. The receiving tile also spends o cycles receiving the message. Thus, the communication time per message is

Tc = 2o + kd·l + (L - 1)·(1/c)    (1)

Writing a block of data to the off-chip RDRAM memory first involves an overhead o associated with starting up global communication. The latency of accessing the DRAM is the sum of the latency of traversing the interconnection network in one dimension (kd·l/2) plus the DRAM latency lg. (We divide by two to indicate that RDRAM memory messages do not have to traverse both the X and Y network dimensions.) The transfer rate of subsequent words will be the minimum of the local communication bandwidth and the global communication bandwidth per tile (since multiple tiles might be writing external memory). Thus the time for writing a block of size L to memory is

Tg = o + (kd/2)·l + lg + (L - 1)·max(1/c, P/bg)    (2)

Communication locality can be captured at the application level by accounting for it in the average distance that messages travel (kd). We ignore contention effects (e.g., resource and network contention) because we assume that the compiler can statically orchestrate communication events much as in a virtual-wires system. We also use a conservative approach in defining applications' communication requirements.
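Equations (1) and (2) translate directly into code. The sketch below is only an illustration of the two timing expressions; parameter names follow the architecture model above.

```python
def local_message_time(L, o, kd, l, c):
    """Eq. (1): cycles to send an L-word message between tiles."""
    return 2 * o + kd * l + (L - 1) * (1.0 / c)

def global_write_time(L, o, kd, l, lg, c, P, bg):
    """Eq. (2): cycles to write an L-word block to off-chip RDRAM."""
    return o + (kd / 2.0) * l + lg + (L - 1) * max(1.0 / c, P / bg)
```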

2.4 Cost model

We use silicon area as a measure of cost. Silicon area reflects the fundamental cost of building a component and is a good basis for comparing alternatives, as opposed to market price, which includes many artificial factors. The cost model is based on CMOS microprocessors, SRAM and DRAM memories, and a mesh interconnection technology. For simplicity we consider the off-chip RDRAM memory free. Although our assumptions may change specific numerical results, the methodology for determining balance and grain size remains the same.

We normalize cost to units of SRAM bits, viz. one bit of SRAM takes one unit of area and therefore one unit of cost. We express the cost of all other components in terms of SRAM bit equivalents (Sbe).

We use the notion of relative density to enable the normalization of logic, memory and communication areas into units of SRAM bit equivalents. Relative density captures the area impact of wires and more irregular structures such as logic areas versus the more regular memory arrays. Although an SRAM bit typically comprises 4 to 6 transistors, we observe that the area it occupies is similar to the area of a logic transistor in a CPU die because of its regular structure and therefore its higher relative density. Thus, the chip size expressed in Sbe units is equivalent to the total number of transistors for logic areas.

Component                  Relative density
CPU logic transistor       1 Sbe
Router logic transistor    1 Sbe
SRAM bit                   1 Sbe
DRAM bit                   1/16 Sbe

Table 2: Relative densities of constituent VLSI components.

A DRAM bit is realized with one transistor, and the area it occupies is 10-16 times smaller than an SRAM bit area. We arrived at this conclusion as the typical SRAM cell requires a wire grid of dimension 3 × 4 compared to a DRAM cell implemented at the intersection of two wires. Factors such as the number of metal layers may change the relative density relations, as more layers increase the density of logic areas. The logic area density is also reduced because of the greater amount of area devoted to wiring.

The following cost functions are based on empirical observations and statistics gathered on current implementations of superscalars and router chips.

Processor cost Kp. The processor cost model computes the area cost as a function of p. We find it convenient to relate p to the cost Kp using an intermediate parameter i, the number of issue units in the processor. Thus, i = 4 implies a 4-way superscalar with a maximum of 4 operations per cycle.

We model the relationship between processing cost and instruction issue structure as a quadratic curve, which captures the cost increase due to multiple-issue superscalars.

Kp(i) = Bp + Kps·(i - 1)²    (3)

In the above, a cost of Bp is required to achieve a single-issue processor with i = 1.

We relate processing power p and the number of issue units i using:

p = √i    (4)

This model captures the relationship between performance and cost due to the more aggressive clock rates of lower-issue processors. Typically single-issue designs obtain 1.6-2 times faster clock rates than corresponding high-issue rate processors. It also captures the fact that it is easier to obtain performance close to the theoretical maximum cycles per instruction in lower-issue processors, as they require a smaller amount of instruction-level parallelism in applications.

Studying the layout of some simple RISC processors [6, 14, 13, 8] leads to a base cost of Bp = 2.5 × 10^5 transistors. That is, a minimal single-issue 64-bit processor can be built in the area of 250K SRAM bits or with 250K logic transistors. A cost constant of Kps = 4 × 10^5 Sbe was arrived at from the study of some high-end processors [22, 20, 21, 8].

For validation, Figure 4 compares the number of transistors dedicated to logic in several superscalar microprocessors with our cost model Kp(i). We observe that for higher-issue superscalars the variation in the number of transistors dedicated to logic areas is large. This variation is caused by important differences in the implementation of components like the issue structure, scheduling and memory interfaces. A more detailed cost model for superscalars might also deal with the cost impact of dynamic or static issue structures, scheduling and memory interfacing.

Memory cost Km. We approximate memory cost as a linear function of capacity m.

Km(m) = Bm + W·m    (5)

Here, m is the memory size in words, W is the word size (so W·m is the cost of the memory cells in Sbe), and Bm is the fixed overhead cost of the memory. This overhead includes logic for translation, address decode, data multiplexing, and memory peripheral circuitry. For our calculations, we assume that W (the word size) is 64 and that the overhead Bm is 5 × 10^4.

Communication cost Kc. The main components of a typical router comprise a routing module, a crossbar arbiter, and input/output modules, often including large FIFOs. We observe that most of the area in current router chips is taken up by FIFOs and pad frames (circa 20%). The amount of FIFO storage depends on such factors as the number of virtual channels. The area of the queues reflects the size of message flits and a queue length which is typically 16-20 flits. A flit is the number of bits transferred in one cycle, and therefore it also equals c expressed in bits. One word per cycle of communication bandwidth thus requires a flit size of one word. Although not necessary, we also assume the flit size is equal to the physical channel width. We denote the dimension of the network as n; the total number of bidirectional channels is then 2n. Our results focus on two-dimensional networks, so n = 2 for most of this paper. Logic areas such as the crossbar usually occupy a small part of the total area. The cost function for the routers is described by the following equation:

Kc(c) = Bc + Kcs·W·FIFOl·2n·SetofQ·c    (6)

Figure 4: Comparison between the processor cost function Kp(i) and the cost of logic areas in current superscalar microprocessors (R2000, R5000, PA7100, PA7200, PC601, Pentium II, PC604e, R8000, R10000, PC620, PA8000, Alpha 21164A, UltraSPARC II). Cache memory areas are factored out.

In the equation above, FIFOl is the length of the FIFOs and SetofQ is the number of queue sets due to virtual channels. Our results use SetofQ = 1. The communication cost factor, Kcs, is derived by fitting the cost function equation to the areas of the routing chips shown in Table 3.

For our calculations, we use Kcs = 25. For example, a router with a 64-bit flit size and with one set of queues, each of length 16 flits, takes approximately the area of 125,000 logic transistors in our model.

The base area for a router, Bc, is estimated at 2.5 × 10^4. We arrive at this from a study of simple routers [10, 9, 6, 15]. Examples of routers with the number of transistors used in current implementations are shown in Table 3. The estimates using our communication cost model are also shown. The comparison indicates that our cost model reflects relatively accurately the area occupied by these routers, except for the RDT [7] router chip, which has more than half of its area devoted to a multicast mechanism module and a bit-map generator.

Router     Transistors  Estimated  Type                 Network  Flits  FIFOl  SetofQ  Pins
J machine  29000        29050      wormhole             3D mesh  9      3      1       -
Postech    30140        31400      virtual cut-through  2D mesh  8      8      1       100
Mosaic C   60000        44200      wormhole, asynch.    2D mesh  8      4      3       c.a. 88
Chaos      110000       121000     chaotic              2D mesh  16     20     3       132
RDT        320000       111400     wormhole, 2v         3D RDT   18     16     2       299
RR         600000       625000     adaptive, 5v         2D mesh  75     16     5       300

Table 3: Important cost factors for router chips. In the Type column we give the number of virtual channels where necessary, e.g. 2v means 2 virtual channels. The second and third columns compare the actual and the estimated number of transistors. Flits gives the flit size, the number of bits transferred in one cycle. FIFOl shows the length of the FIFOs in flits, and SetofQ shows the number of queue sets in the design, often reflecting the number of virtual channels.

Global communication cost Kbg. We approximate global communication cost as a linear function of global off-chip communication capacity. The base area for global I/O, Bbg = 10^4, is estimated to be somewhat smaller than a simple router area, as no routing functions are necessary. The global communication bandwidth is limited by the maximum number of pins a packaging technology will allow. As current microprocessor packaging technologies use from one hundred to several hundred pins, we assume that in 10-12 years packaging will allow no more than roughly 2000 pins. The maximum possible global bandwidth is then bmax = 2000/W. The global communication cost factor Kbsg = 10^5, multiplied by the word size, is approximately the cost in SRAM bit equivalents of one word per cycle of global I/O bandwidth.

Kbg(bg) = ∞ if bg > bmax;  Bbg + Kbsg·W·bg otherwise    (7)

Global latency cost Klg. For simplicity we assume this cost is constant, reflecting the more or less constant speed of DRAM access over time. Blg is estimated at 10^5.

Klg(lg) = Blg    (8)

Total cost of the system. The total cost of the system is equal to the sum of its components.

K(x) = P(Kp(p) + Kc(c) + Km(m)) + Kbg(bg) + Klg(lg)    (9)
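Collecting the cost functions, equations (3) through (9) can be written as the Python sketch below. The constants are the values quoted in the text (Bp = 2.5e5, Kps = 4e5, Bm = 5e4, W = 64, Bc = 2.5e4, Kcs = 25, Bbg = 1e4, Kbsg = 1e5, Blg = 1e5, with FIFOl = 16, SetofQ = 1 and n = 2 as defaults); treat the composition as an illustrative reading of the model, not the authors' implementation.

```python
import math

W = 64                    # word size in bits
BP, KPS = 2.5e5, 4e5      # processor base cost and issue-width cost factor (Sbe)
BM = 5e4                  # memory overhead cost (Sbe)
BC, KCS = 2.5e4, 25       # router base cost and cost factor (Sbe)
BBG, KBSG = 1e4, 1e5      # global I/O base cost and per-word/cycle cost factor (Sbe)
BLG = 1e5                 # global latency cost (Sbe)
B_MAX = 2000 / W          # pin-limited maximum global bandwidth (words/cycle)

def K_p(i):                                  # Eq. (3): processor cost for an i-issue tile
    return BP + KPS * (i - 1) ** 2

def p_of_i(i):                               # Eq. (4): processing power of an i-issue tile
    return math.sqrt(i)

def K_m(m):                                  # Eq. (5): memory cost for m words of SRAM
    return BM + W * m

def K_c(c, fifo_len=16, set_of_q=1, n=2):    # Eq. (6): local communication router cost
    return BC + KCS * W * fifo_len * 2 * n * set_of_q * c

def K_bg(bg):                                # Eq. (7): global I/O cost (infinite past the pin limit)
    return float('inf') if bg > B_MAX else BBG + KBSG * W * bg

def K_lg():                                  # Eq. (8): global latency cost (constant)
    return BLG

def K_total(P, i, m, c, bg):                 # Eq. (9): total system cost (Kp expressed via i)
    return P * (K_p(i) + K_c(c) + K_m(m)) + K_bg(bg) + K_lg()
```

For example, K_c(1.0) evaluates to roughly 127,000 Sbe, close to the approximately 125,000 logic transistor area quoted above for a one word per cycle router.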

2.5 Application model

The application model contains the functions and parameters that are necessary for application performance characterization. To predict the performance of an application with a particular machine configuration, we assume that the resource demands are uniform over time and that processing, local and global communication can be completely overlapped. Some algorithms, such as those used in dynamic programming, also require the estimation of the algorithmic imbalance, or the idle time due to synchronization overhead. Applications with several phases can be handled by dividing the application into its phases and characterizing each phase separately. Our assumption that processing, local and global communication are overlapped imposes constraints on how the problem is partitioned and on the total amount of memory required. As we will show later, besides the memory needed to hold the problem, local and global communication buffers are required in order to be able to overlap communication times.

We will exemplify the concepts of this section by analyzing the Jacobi relaxation problem. The requirements of the other applications considered in this paper are presented in the Appendix. The Jacobi Relaxation problem is an iterative algorithm which, given a set of boundary conditions, finds discretized solutions to differential equations of the form ∇²A + B = 0. Each step of the algorithm replaces the value at each node of a grid with the average of the values of its nearest neighbors.

The original Jacobi problem, defined by a grid of size N, is partitioned into subgrids of size N' as illustrated in Figure 5. Each subgrid or subproblem is solved by storing the subproblem of size N' in the internal memory of a Raw microprocessor and running a blocking relaxation algorithm. After a given number of phases, the subgrid is stored in external RDRAM, and the next subgrid is loaded. Clearly, a given subgrid has to be loaded and operated upon multiple times to reflect the effect of synchronization with the values computed in neighboring subgrids.

Because values from neighboring subgrids do not impact the relaxations on a given subgrid stored in the microprocessor, the number of iterations needed for convergence increases. We choose is = √N'/2 as the number of iterations after which resynchronizations must occur between subproblems. Starting with some boundary conditions, this means propagating border values to all points in a subproblem. We chose the total number of iterations to be it = N², giving an error reduction factor of ten.

Let us analyze the requirements of this application.

Figure 6: Graphical illustration of the solution to the optimization problem; the example shown is for Jacobi Relaxation (one iteration), grid size 4000 x 4000 points, software overhead of zero, and a global latency of 100 cycles. The two surfaces correspond to the runtime performance function and a combined equation for the constraint functions. The intersection of the two surfaces determines the points that give balanced machine configurations. The points (copt, Popt) that correspond to the smallest runtime are the global optimal solutions of the optimization. Other parameters such as processing power and memory size can be determined from the constraint equations by substitution.

Figure 5: Jacobi Relaxation. The problem of size N is first partitioned into subproblems of size N'. Each subproblem is solved with blocking on P processors. Each processor receives bordering data from its four neighbors and sends its data along borders to its neighbors. Subproblems are resynchronized after a number of iterations.

Required processing per node, Rp. This requirement reflects the total amount of computation required per Jacobi node, given the algorithmic assumptions described above. The total number of operations for each point is three additions and one multiplication.

Rp = it·4N/P = 4N³/P,    P ∈ [1, N]    (10)

Required amount of memory words per node, Rm. The required memory comprises the memory needed to hold the subblock of size N' and also the memory buffers needed to overlap local and global communication.

Rm = N'/P + 4·√(N'/P) + 2N'/P = 3N'/P + 4·√(N'/P)    (11)

Required number of words of local communication per node, Rc. The required local communication is the total amount of data sent or received during the whole execution time. For every iteration, each processor requires the bordering points from its neighbor processors.

Rc = it·8·√(N'/P)·(N/N') = 8N³/√(N'P)    (12)

Required local communication events, Ro. These events incur a software penalty for initiating a communication step. Ro reflects the total number of times a local send or receive is issued.

Ro = Rc·(1/√(N'/P))    (13)

Required latency of events, Rl. Rl reflects the total number of times a local send is issued.

Rl = Rc·(1/(2·√(N'/P)))    (14)

Required global communication, Rbg. Rbg reflects the total number of words of global communication per chip.

Rbg = (it/is)·2N = 2N²·√N    (15)

Required global communication events, Rlg. Rlg reflects the total number of times global sends or reads are initiated per chip.

Rlg = Rbg/N'    (16)
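The Jacobi requirement functions above translate directly into code. The sketch below simply evaluates equations (10) through (16) as reconstructed here; N and N' count grid points, and it = N² as in the text.

```python
import math

def jacobi_requirements(N, Np, P):
    """Requirements for blocked Jacobi relaxation (equations 10-16).

    N : problem size (grid points), Np : subproblem size N', P : number of tiles.
    """
    it = N ** 2                                      # total number of iterations
    Rp = it * 4 * N / P                              # Eq. (10): 4 operations per point per iteration
    Rm = 3 * Np / P + 4 * math.sqrt(Np / P)          # Eq. (11): subblock plus communication buffers
    Rc = 8 * N ** 3 / math.sqrt(Np * P)              # Eq. (12): border-exchange traffic per node
    Ro = Rc / math.sqrt(Np / P)                      # Eq. (13): local send/receive events
    Rl = Rc / (2 * math.sqrt(Np / P))                # Eq. (14): local sends
    Rbg = 2 * N ** 2 * math.sqrt(N)                  # Eq. (15): global (off-chip) traffic per chip
    Rlg = Rbg / Np                                   # Eq. (16): global transfer events per chip
    return dict(Rp=Rp, Rm=Rm, Rc=Rc, Ro=Ro, Rl=Rl, Rbg=Rbg, Rlg=Rlg)
```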

2.6 Performance Functions

The performance functions estimate the running time of an application in terms of application requirements and architecture parameters.

Runtimes <T, Tp, Tc, Tg>. Let the times for processing, local communication, and global communication be Tp, Tc, and Tg respectively. Under the assumption that local and global communication time are overlapped with computation, the parallel runtime is defined as the maximum of these times.

T = max(Tp, Tc, Tg)
Tp = Rp/p + Ro·o + Rlg·o
Tc = Rc/c + Rl·kd·l
Tg = Rbg/bg + Rlg·(kd/2)·l + Rlg·lg    (17)

As an example, if the number of operations that must be performed is Rp and the processing power is p operations per cycle, then the processing time is simply Rp/p. Similarly, if the number of events incurring the message overhead (o cycles) is Ro, then the time wasted in message overhead activity is Ro·o.
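Equation (17) in code form (a direct transcription; R is a dict of requirements such as the jacobi_requirements sketch above, and x is a configuration object with the fields of Table 1, for example the MachineConfig sketch from Section 2.1):

```python
def parallel_runtime(R, x):
    """Eq. (17): parallel runtime T = max(Tp, Tc, Tg)."""
    Tp = R['Rp'] / x.p + R['Ro'] * x.o + R['Rlg'] * x.o                     # computation + overheads
    Tc = R['Rc'] / x.c + R['Rl'] * x.kd * x.l                               # local communication
    Tg = R['Rbg'] / x.bg + R['Rlg'] * (x.kd / 2) * x.l + R['Rlg'] * x.lg    # global communication
    return max(Tp, Tc, Tg), Tp, Tc, Tg
```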

2.7 The optimization problem

In this section we describe the optimization procedure in more detail. The optimization procedure is also illustrated graphically in Figure 6.

The problem solved is the following constraint-based nonlinear optimization problem:

Given: a fixed chip area or budget B and a problem size N.

Objective:

min T(N, N', x)    (18)

subject to the constraints given below, where x is a specific machine configuration <p, P, m, c, o, l, bg, lg>. The solution of this optimization is the optimal machine configuration xopt = <p, P, m, c, o, l, bg, lg>opt and the optimal subproblem size N'opt.

Constraints:

1. The budget B must be greater than or equal to the total cost. The total cost of the system is computed as the sum of its components.

B ≥ P(Kp + Kc + Km) + Kbg + Klg    (19)

It is expedient to use an additional set of balance constraints, given below, when communication and computation are overlapped. The balance constraints focus the search for the optimal solution on balanced machine configurations. In other words, the second and third constraints state that communication and computation times should be equal. If they are not equal, we can take resources from the faster component without increasing runtime. The fourth balance constraint states that the memory should fit the problem. If the memory is larger than this amount, it can be reduced without impacting performance. When the local and global communication times are equal and the memory fits the problem, the machine configuration is balanced for the application. In a balanced machine each resource is utilized to its fullest. The balance constraints greatly reduce the search space, and thus the complexity of the optimization procedure.

2. Balanced local communication with computation.

Tp = Tc    (20)

3. Balanced global communication and computation.

Tg = Tp    (21)

4. The memory on a processing element must fit the memory required for a block. Besides the memory required for actual computations, Ram, buffers for local and global communication, Rlm and Rgm, are allocated because of the overlapping conditions.

m = Rm = Ram + Rlm + Rgm    (22)

3 Analysis

In this section, we study a set of applications in the context of the framework presented. The applications are: Jacobi Relaxation, Dense Matrix Multiply, Nbody, FFT, and Largest Common Subsequence. We chose these applications because they are diverse and require conflicting machine capabilities to run efficiently. The optimization procedure has been implemented in Mathematica. We use a 3 cycle software overhead and a 100 cycle DRAM access latency. We also counted an 8 Kbyte SRAM-based instruction cache per node.

In all the experiments we used a budget of 1 billion SRAM bit equivalents, or the area required for 1 billion logic transistors. This budget is achievable in 10-12 years as projected by the Semiconductor Industry Association (SIA), given a 10-20% growth rate per year in die area and a growth rate in transistor counts of between 60 and 80% per year due to increasing densities.

Application specific results. Figure 7 shows the optimal division of chip resources for the various applications as a function of problem size. The optimal amount of each resource is shown in greater detail in Figures 8 through 12. Table 4 summarizes the optimal configurations and chip sizes.

Perhaps the most important result from Figure 7 is that the amount of area devoted to processing and local communication is more or less constant at about 75 percent for all the applications and problem sizes.

The global communication bandwidth of 30 words per cycle is the maximum achievable given a packaging technology allowing 2000 pins. The only application that is I/O limited and requires this bandwidth is FFT. All the other applications have a negligible area allocated to global communication. The total chip area for global communication is relatively small even for FFT. Therefore, providing the maximum possible global bandwidth is not a bad idea in a final configuration.

As we can see, the relative communication area required is small in applications such as Jacobi and LCS, as they also show good spatial locality. These applications can use most of the resources for processing. FFT and Nbody require the largest communication area, with an optimal communication bandwidth between 4 and 5 words per cycle. The division between processing and memory areas is uniform.

The matrix multiplication, based on Cannon's memory efficient blocking algorithm, gives the most uniformly divided configuration. For this application, memory, local communication and processing areas are approximately equal.

The amount of memory per node obtained is relatively small compared to modern day multiprocessors in all applications. The reason is twofold. First, the total amount of memory in the entire Raw chip is still quite large, since it is the product of P and m. Second, fast local communication obviates the need for huge amounts of local memory. The matrix multiplication required the largest amount of memory, giving a total of 24 Kbytes per node. The smallest memory is required for Nbody.

For all the applications the optimal processing power obtained is equivalent to a single-issue processor. The total number of processors P varied between 1100 and 2310 for large problem sizes.

Although the optimal machine configurations obtained vary for different applications, problem sizes and budgets, the general trends for various applications are similar. Accordingly, for the applications studied, assuming a 1 billion logic transistor equivalent area, we recommend building a Raw chip with approximately 1000 tiles, 30 words/cycle of global I/O, 20 Kbytes of local memory per node, 3-4 words/cycle of local communication bandwidth, and single-issue processors. This configuration will give performance near the global optimum for most applications.

Figure 7: Breakdown of chip areas for processing, memory, local communication and global communication that give optimal machine configurations for a budget of 1 billion logic transistor equivalent area.

Figure 8: Number of processors in optimal machine configurations for different problem sizes.

Figure 9: Global I/O bandwidth bg in optimal machine configurations for different problem sizes.

Figure 10: Local SRAM data memory m per node in optimal machine configurations.

Figure 11: Local communication bandwidth c in optimal machine configurations for different problem sizes.

Figure 12: Processing power p in optimal machine configurations for different problem sizes.

Sensitivity of grain size. The framework helps answer many other questions about machine configurations. Let us study the sensitivity of performance to the machine configuration near the optimum machine configuration point. This study is useful for determining a machine configuration that is robust across many applications. As an example, let us determine the machine configuration with the smallest number of nodes whose performance is within 25 percent of the optimal configuration.

Results are shown in Table 5. For each application, the first row gives the optimal configuration. The second row gives the configuration with the smallest number of nodes under the condition that the performance is no more than 25 percent worse than the optimal. As we can see, balanced machine configurations with fewer nodes usually take advantage of the parallelism available in superscalar processors. However, for all the applications studied, the configuration that gave the best performance used nodes based on at most 2-way superscalars.

Design comparisons. The framework also allows us to compare competing designs for the same budget. As an example, let us compare two designs: (1) using on-chip SRAM and routers with 16-flit FIFOs, and (2) using only a small SRAM cache, with the rest of the memory in on-chip DRAM, as well as small 2-flit FIFOs. We derive the performance/cost optimal configurations and look at application performance for different problem sizes.

Since DRAM densities are much higher than SRAM densities, we can have more memory per node in alternative (2). One problem in using DRAM is that the access latency is much higher than for corresponding SRAM. To reduce the impact of the latency, we include a small SRAM cache in each node and assume that the SRAM cache results in a near perfect hit rate. Case (2) also has small FIFOs; with good static scheduling of the communication channels, the need for deep FIFOs is reduced.

The question is, how much do these changes impact the performance of applications given a performance/cost optimal partitioning of resources in both cases? Figure 13 shows the performance ratio between the second and the first designs. It is easy to see that the larger amount of on-chip memory in case (2) results in significantly higher performance.

Figure 13: Performance comparison between two cost optimal designs, each with a budget of 1 billion logic transistor area. In the first design we use SRAM and routers with 16-flit FIFOs, while in the second design we use on-chip DRAM with a 1 Kbyte SRAM cache per node and 2-flit FIFOs.

4 Conclusions

This paper provides a framework for reasoning about single chip microprocessors such as Raw with replicated, fine-grain processing elements. The framework uses a machine characterization that considers processing, memory, local and global communication, and latency as separate machine resources. This is a unique characterization of the machine space since it captures the effects of locality by treating local and global communication separately. The framework incorporates a cost model based on empirical observations and statistics gathered on current implementations of superscalars and router chips.

Problem   Size   P     c     p     m     bg     PKp  PKc  PKm   Kbg     Klg
Matmul    10^8   1290  2.6   1.25  1640  3.9    35   35   28    0.02    0.01
Matmul    10^6   1290  2.3   1.25  2230  3.3    35   30   33    0.02    0.01
Matmul    10^4   724   8     1.5   97    8.8    25   65   9     0.05    0.01
Jacobi    10^8   2180  0.19  1.25  464   4.4    60   7.4  32.6  0.03    0.01
Jacobi    10^6   2171  0.19  1.25  502   4.2    60   7.4  32.6  0.03    0.01
Jacobi    10^4   1950  1     1.25  25    21     53   22   23    0.16    0.01
Nbody     10^8   1100  5     1     61    0.06   26   60   13    0.0004  0.01
Nbody     10^6   1080  5     1     67    0.06   26   60   13    0.0004  0.01
Nbody     10^4   1070  5     1     8     0.5    26   60   13    0.004   0.01
FFT       10^8   1160  4.2   1.25  178   30     31   50   15    0.2     0.01
FFT       10^6   1160  4.2   1.25  178   30     31   50   15    0.2     0.01
FFT       10^4   1160  4.2   1.25  178   30     31   50   15    0.2     0.01
LCS       10^8   2310  0.01  1.25  337   0.015  63   3.7  32    0.0001  0.01
LCS       10^6   2330  0.01  1.25  291   0.014  63   3.7  32    0.0001  0.01
LCS       10^4   2290  0.25  1.5   20    0.25   53   9    27    0.002   0.01

Table 4: Breakdown of resources and optimal machine configurations for three problem sizes. The columns P to bg represent the optimal machine configuration, and the columns from PKp to Klg are the chip areas in percent of the total cost.

Problem   P     N'        c     p     m      bg       PKp  PKc  PKm   Kbg      Klg
Matmul    1290  548x548   2.6   1.25  1640   3.9      35   35   28    0.01     0.01
  (18%)   826   685x685   1.7   2     3975   2.5      46   20   32    0.01     0.01
Jacobi    2180  302230    0.19  1.25  464    4.4      60   7.4  32.6  0.01     0.01
  (24%)   811   302230    0.3   2     538    5        77   5.5  16    0.01     0.01
Nbody     1100  8306      5     1     61     0.06     26   60   13    0.0002   0.01
  (21%)   799   295145    5     1.5   2954   0.001    28   47   24    0.00001  0.01
FFT       1160  68718     4.2   1.25  178    30       31   50   15    0.2      0.01
  (23%)   356   3518440   2.7   1.75  29594  30       17   10   71    0.2      0.01
LCS       2310  193427    0.01  1.25  337    0.015    63   3.7  32    0.0001   0.01
  (23%)   1119  4354      0.05  1.75  499    0.015    73   2.4  23    0.0001   0.01

Table 5: Solutions that come within 25% of the optimal for a problem size of 10^8 with the smallest number of nodes P. The first row of each application shows the global optimum, and the second row shows the solution with the minimum number of processors and performance within 25% of the optimal. The numbers in parentheses show the performance degradation compared to the global optimum for the configurations with the minimum number of processors. The columns from P to bg represent the machine configuration, and the columns from PKp to Klg are the chip areas in percent of the total cost.

The framework recognizes the importance of balance in good design, and integrates this idea with a cost and performance model to provide a useful design tool. Having provided this framework, this paper chooses a diverse application suite in order to exercise the framework and to address some general questions in parallel computer design in general. More specifically, it addresses the questions of on-chip resource division in the MIT Raw microprocessor.

Although the optimal machine configurations vary for different applications, problem sizes and budgets, the general trends are consistent. The framework recommended that chip designers devote about 75 percent of the chip area to processing and local communication. The framework further suggested that for the applications studied, and assuming a 1 billion logic transistor equivalent area, designers should build a system with about 1000 nodes, 30 words/cycle of global I/O, 20 Kbytes of local memory per node, 3-4 words/cycle of local communication bandwidth, and single-issue processors for optimal performance.

5 Acknowledgements

This work leverages the early work on grain size by Yeung, Dally, and Agarwal [5]. The research is funded by DARPA contract #DABT63-96-C-0036. We are also grateful to Tom Knight for many relevant discussions on cost modeling.

A Applications

References

[1] Elliot Waingold, Michael Taylor, Devabhaktuni Srikrishna, Vivek Sarkar, Walter Lee, Victor Lee, Jang Kim, Matthew Frank, Peter Finch, Rajeev Barua, Jonathan Babb, Saman Amarasinghe, and Anant Agarwal. Baring It All to Software: Raw Machines. IEEE Computer, September 1997, pp. 86-93.

[2] D. Culler, R. Karp, D. Patterson, A. Sahay, K. Schauser, E. Santos, R. Subramonian, and T. Eicken. "LogP: Towards a Realistic Model of Parallel Computation," Proc. of the Fourth ACM SIGPLAN Symp. on Principles and Practices of Parallel Programming, May 1993.

[3] A. Alexandrov, M. Ionescu, K. E. Schauser, and C. Scheiman. "LogGP: Incorporating Long Messages into the LogP Model," Proc. of SPAA'95, Santa Barbara, CA, July 1995.

[4] J. Babb, R. Tessier, M. Dahl, S. Hanono, D. Hoki, and A. Agarwal. Logic Emulation with Virtual Wires. IEEE Transactions on Computer Aided Design, Vol. 16, No. 6, June 1997, pp. 609-626.

[5] Donald Yeung, William J. Dally, and Anant Agarwal. How to Choose the Grain Size of a Parallel Computer. MIT/LCS Technical Report MIT-LCS-TR-739.

[6] Charles L. Seitz, Nanette J. Boden, Jakov Seizovic, and Wen-King Su. The Design of the Caltech Mosaic C Multicomputer. Research on Integrated Systems, Proceedings of the 1993 Symposium, The MIT Press, Cambridge, Massachusetts, 1993, pp. 1-22.

[7] Hiroaki Nishi, Ken-ichiro Anjo, Tomohiro Kudoh, and Hideharu Amano. The RDT Router Chip: A Versatile Router for Supporting Shared Memory. Special Issue on Architecture, Algorithms and Networks for Massively Parallel Computing, IEICE, Vol. E00-A, No. 1, January 1997.

[8] CPU Info Center. http://infopad.eecs.berkeley.edu/CIC/tech/

[9] Peter R. Nuth and William J. Dally. The J-Machine Network. Proceedings of the 1992 IEEE International Conference on Computer Design: VLSI in Computers and Processors, October 1992, pp. 420-423.

[10] Charles L. Seitz and Wen-King Su. A Family of Routing and Communication Chips Based on the Mosaic. Research on Integrated Systems, Proceedings of the 1993 Symposium, The MIT Press, Cambridge, Massachusetts, 1993, pp. 320-337.

[11] H. T. Kung. Memory Requirements for Balanced Computer Architectures. IEEE 1986, pp. 49-54.

[12] Thomas J. Holman and Lawrence Snyder. Architectural Tradeoffs in Parallel Computer Design. Advanced Research in VLSI, Proceedings of the 1989 Decennial Caltech Conference, The MIT Press, Cambridge, Massachusetts, March 1989, pp. 317-334.

[13] Paul Chow. The MIPS-X RISC Microprocessor. Kluwer Academic Publishers, August 1989.

[14] William J. Dally. Architecture of a Message-Driven Processor. Proceedings of the 14th Annual Symposium on Computer Architecture, June 1987, pp. 189-196.

Application   Rp            Rc             Rl            Ro            Rm                  Rbg           Rlg
Jacobi        4N³/P         8N³/√(N'P)     4N³/N'        8N³/N'        3N'/P + 4√(N'/P)    2N²√N         2N²√N/N'
Matmul        2N³/P         4N³/(N'√P)     2N³√P/N'³     4N³√P/N'³     7N'²/P              2N³/N' + N²   2N³/N'³
Nbody         2N²/P         2N²/P          N²/P          2N²/P         8N'/P               4N²/N'        N²/N'²
FFT           12(N/P)logN   2(N/P)logN     (N/N')logN    2(N/N')logN   4N'/P               2N·logN       2(N/N')logN
LCS           2N²/P         2N²/N'         N²/N'         2N²/N'        4N'/P               4N            N/N'

Table 6: Overview of application requirements.

[15] William J. Dally and Charles L. Seitz. The Torus Routing Chip. Distributed Computing, Volume 1, pp. 187-196.

[16] Anant Agarwal, David Chaiken, Godfrey D'Souza, Kirk Johnson, David Kranz, John Kubiatowicz, Kiyoshi Kurihara, Beng-Hong Lim, Gino Maa, Dan Nussbaum, Mike Parkin, and Donald Yeung. The MIT Alewife Machine: A Large-Scale Distributed-Memory Multiprocessor. Proceedings of the Workshop on Scalable Shared Memory Multiprocessors, Kluwer Academic Publishers, 1991. Also appears as MIT/LCS Memo TM-454, 1991.

[17] William J. Dally et al. The J-Machine: A Fine-Grain Concurrent Computer. Proceedings of the IFIP (International Federation for Information Processing) 11th World Congress, Elsevier Science Publishing, New York, 1989, pp. 1147-1153.

[18] CM5 Technical Summary, Thinking Machines Corporation, Cambridge, MA, October 1991.

[19] CRAY T3D System Architecture Overview, Cray Research, Inc., Revision 1.C, September 23, 1993.

[20] Keith Diefendorff and Michael Allen. Organization of the Motorola 88110 Superscalar RISC Microprocessor. IEEE Micro, Volume 2, Number 2, April 1992, pp. 40-63.

[21] Dennis Allison and Michael Slater. National Unveils Superscalar RISC Processor. Microprocessor Report, Volume 5, Number 3, February 20, 1991.

[22] Daniel Dobberpuhl et al. A 200 MHz 64b Dual-Issue Microprocessor. IEEE Solid State Circuits Conference, Volume 35, February 1992, pp. 106-107.