

SPECIAL ISSUE

    Articles

ADVANCED COMPILER OPTIMIZATIONS FOR SUPERCOMPUTERS

Compilers for vector or multiprocessor computers must have certain optimization features to successfully generate parallel code.

    DAVID A. PADUA and MICHAEL J. WOLFE

Supercomputers use parallelism to provide users with increased computational power. Most supercomputers are programmed in some higher level language, commonly Fortran; all supercomputer vendors provide Fortran compilers that detect parallelism and generate parallel code to take advantage of the architecture of their machines [25, 46, 53]. This article discusses some of the common (and not so common) features that compilers for vector or multiprocessor computers must have in order to successfully generate parallel code. The many examples given throughout are related to the generic types of machines to which they apply. Where appropriate, we also relate these parallel compiler optimizations to those used in standard compilers.

Some of the supercomputers available today are the FX Series from Alliant Computer Corporation, the Cyber 205 from Control Data Corporation, the Convex C-1 from Convex Computer Corporation, the Cray 2 and the Cray X-MP [15] from Cray Research, the Facom VP from Fujitsu [39], the S-810 Vector Processor from Hitachi [40], the SX System from NEC Corporation, and the SCS-40 from Scientific Computer Systems. Alliant's FX and Convex's C-1 are usually classified as minisupercomputers. Besides the vendor-supplied compilers, there are a number of experimental and third-party source-to-source restructurers. Among them are the University of Illinois's Parafrase [32], KAI's KAP [17, 18], Rice University's PFC [5], and Pacific Sierra's VAST [11].

This work was supported in part by the National Science Foundation under Grants US NSF DCR84-06916 and US NSF DCR84-10110, the U.S. Department of Energy under Grant US DOE-DE-FG02-85ER25001, and the IBM Donation.

© 1986 ACM 0001-0782/86/1200-1184 75¢

Fortran is currently the programming language of choice for supercomputers, largely because vendors presently provide optimizing, vectorizing, and concurrentizing compilers only for Fortran. In addition, Fortran has historically been the most frequently used numerical programming language, and much investment has been made in its programs. The examples in the text are written in Fortran, using some of the new features described in the latest Fortran 8x proposal [7], such as the end do statement and array assignments. The literature distinguishes two types of concurrent loops: doall and doacross [16, 43]. The latter imposes a partial execution order across iterations in the sense that some of the iterations are forced to wait for the execution of some of the instructions from previous iterations. Doall loops do not impose any partial ordering across iterations, even though there may be critical regions in the loop bodies. Despite our use of Fortran, none of the features or transformations explained here are peculiar to this language, and they can be successfully applied to many others.

The term parallelize is often used to describe the translation of serial code for parallel computers. We prefer the more specific concurrent over parallel, since parallel may mean vector, concurrent, or lockstep multiprocessor computation. Also, Alliant Computer Corporation, the first commercial vendor to supply a compiler that automatically translates code for multiple processors, uses the term concurrent. The DOALL statement was originally defined in the Burroughs FMP Fortran [13, 36].
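As a minimal sketch of the difference (our own illustration, written with the doall/doacross notation and the signal/wait synchronization primitives that appear in later examples; the exact spelling of these constructs varies between machines), the first loop below has fully independent iterations, while in the second only statement S2 is ordered across iterations, so each iteration must wait for the previous iteration to complete S2:

      doall I = 1,N
         A(I) = B(I) + C(I)
      end doall

      doacross I = 2,N
S1:      B(I) = C(I) * 2.
         if ( I > 2 ) wait ( I-1 )
S2:      A(I) = A(I-1) + B(I)
         signal ( I )
      end doacross

In the doacross loop, every iteration may execute S1 immediately, which is the partial order the text describes; a doall loop imposes no such ordering at all.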



We will discuss how data-dependence testing of some form is required in any compiler that wishes to detect parallelism in serial code, and explain the different types of parallel code that can be generated. Vector code is appropriate for computers with vector instruction sets, while concurrent constructs are used in multiprocessor environments. Also covered are ways to improve the computation of the data-dependence graph. A (very incomplete) catalog of transformations and restructuring tricks that compilers use to optimize code for parallel computers is given, followed by a discussion of how these compilers communicate with programmers.

DATA DEPENDENCE

Data-dependence testing [5, 8, 10, 52] is required for any form of automatic parallelism detection. Data-dependence relations are used to determine when two operations, statements, or two iterations of a loop can be executed in parallel. For instance, in the code

S1:   A = B + C
S2:   D = A + 2.
S3:   E = A * 3.

statements S1 and S2 cannot be executed at the same time since S2 uses the value of A that is computed by S1. This is called true dependence or flow dependence since the data value flows from S1 to S2, and is denoted S1 δ S2. S3 also depends on S1 (denoted S1 δ S3); thus S1 must be executed before both S2 and S3. The data-dependence relations are often depicted in a data-dependence graph, with arcs representing the relations, as follows:

         S1
        /  \
       v    v
      S2    S3

Notice that S2 and S3 are not connected by data-dependence arcs and so may be executed in parallel if two processors are available.

Two other kinds of data dependence are important. In the program segment

S1:   A = B + C
S2:   B = D / 2

S1 uses the value of B before S2 assigns a new value to B. Since S1 is to use the old value of B, it must be executed before S2; this is called antidependence, as the relation is from the use to the assignment, and is denoted S1 δ̄ S2.

The third kind of dependence is shown in the program segment below:

S1:   A = B + C
S2:   D = A + 2
S3:   A = E + F

Here S3 assigns a new value to A after S1 has already given a value to A. If S1 is executed after S3, then A will contain the wrong value after this program segment. Thus, S1 must precede S3; this is called output dependence and is denoted S1 δ° S3.

The flow of control must also be taken into account when building data-dependence relations. For instance, in the program segment

S1:   A = B + C
      if ( X >= 0 ) then
S2:      A = 0.
      else
S3:      D = A
      end if

the relations S1 δ S2 and S1 δ S3 hold, but S2 δ S3 does not hold even though S2 assigns a value to A and S3 uses A, and S3 appears after S2 in the program. Since S2 and S3 are on different branches of the same if statement, the value of A used in S3 will never come from S2.

Since the actual execution flow of a program is not known until run time, a data-dependence relation does not always imply data communication or memory conflict. For instance, in this program segment

S1:   A = B + C
C1:   if ( X >= 0 ) then
S2:      A = A + 2
      end if
S3:   D = A * 2.1

the data-dependence relations S1 δ S3 and S2 δ S3 will both be computed by the compiler, even though S3 will in fact take the value of A from only one of S1 or S2, depending on the value of X.

Many compilers also use the concept of dependence from an if statement to the statements under control of the if. This is called control dependence and is denoted δ^c. In the program segment above, for example, the control-dependence relation C1 δ^c S2 holds. Control-dependence relations are often added to the data-dependence graph in order to find which statements in the program can be reordered, or when looking for cycles in the graph.

Data Dependence in Loops

Inside loops we are interested in data-dependence relations between statements and also dependence relations between instances of statements.



We distinguish between different instances of execution of a statement by superscripting the statement label with the loop iterations. For instance, in the loop

      do I = 1,3
S1:      A(I) = B(I)
         do J = 1,2
S2:         C(I,J) = A(I) + B(J)
         end do
      end do

statement S1 is executed three times: S1^1, S1^2, S1^3; and S2 is executed six times: S2^(1,1), S2^(1,2), S2^(2,1), S2^(2,2), S2^(3,1), S2^(3,2) (we put the outer loop iteration number first). We can also draw the iterations as points with Cartesian coordinates, as below:

      S1^1 ..> S2^(1,1) ..> S2^(1,2)
                               .
      S1^2 ..> S2^(2,1) ..> S2^(2,2)
                               .
      S1^3 ..> S2^(3,1) ..> S2^(3,2)

This diagram illustrates the iteration space; the dotted arrows show the order in which the instances of the statements are executed. To find data-dependence relations in loops, the arrays and subscripts are examined. In the loop

      do I = 2,N
S1:      A(I) = B(I) + C(I)
S2:      D(I) = A(I)
      end do

the relation S1 δ S2 holds, since, for any iteration i, S1^i will assign A(i) and S2^i will use A(i) on the same iteration of the loop. Since the dependence stays in the same iteration of the loop, we say S1 δ= S2.

In the following similar loop

      do I = 2,N
S1:      A(I) = B(I) + C(I)
S2:      D(I) = A(I-1)
      end do

the relation S1 δ S2 still holds, but for any iteration i, S2^i will use an element of A that was assigned on the previous iteration of the loop by S1^(i-1) (except S2^2, which uses an old value of A(1)). Since the dependence flows from iteration i-1 to iteration i, we say that S1 δ< S2 (the dependence is carried from one iteration to a later one, in the "<" direction).

The simplest subscript tests consider a pair of references to the same array whose subscripts are linear functions of the loop index, as in the loop

      do I = L,U
S1:      A(c*I+j) = ...
S2:      ... = A(d*I+k)
      end do

  • 7/27/2019 p1184-Padua - Advanced Compiler Optimizations for Supercomputers

    4/18

    Special Issue

(where c, d, j, and k are integer constants). For a dependence to exist, the greatest common divisor of c and d (GCD(c,d)) must divide (k-j). For example, no dependence would exist in the following loop

      do I = L,U
S1:      A(2*I) = ...
S2:      ... = A(2*I+1)
      end do

since GCD(2,2) = 2, which does not divide 1-0 = 1.

More importantly, there must exist two values of the loop index variable I, say x and y, such that

      L <= x <= y <= U
      c*x + j = d*y + k

In the sample loop below

      do I = 1,10
S1:      A(19*I+3) = ...
S2:      ... = A(2*I+21)
      end do

the only two values that satisfy this dependence equation are x = 2 and y = 10:

      19*2 + 3 = 41 = 2*10 + 21

Solving this Diophantine equation is the subject of several of the references cited above, which also generalize this to nested loops where several loop indexes appear in a single subscript.
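A minimal sketch of the GCD part of this test is shown below. It decides only whether a dependence between A(c*I+j) (written) and A(d*I+k) (read) can be ruled out by divisibility, and it ignores the loop bounds; the function name GCDDEP and its interface are ours, not those of any particular compiler.

c     Returns .TRUE. when the GCD test cannot rule out a dependence
c     between A(c*I+j) (written) and A(d*I+k) (read).
c     Assumes c and d are not both zero; loop bounds are ignored.
      LOGICAL FUNCTION GCDDEP(C, J, D, K)
      INTEGER C, J, D, K, G, R, T
c     Euclid's algorithm for GCD(|c|, |d|)
      G = ABS(C)
      R = ABS(D)
   10 IF (R .NE. 0) THEN
         T = MOD(G, R)
         G = R
         R = T
         GO TO 10
      END IF
c     a dependence is possible only if GCD(c,d) divides (k - j)
      GCDDEP = MOD(K - J, G) .EQ. 0
      END

For the sample loop above, GCDDEP(19, 3, 2, 21) returns .TRUE., and the bounds (Diophantine) part of the test is still needed to decide whether the dependence actually occurs within the loop range.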

CODE GENERATION

When a good data-dependence graph has been built for a loop nest, the compiler can generate parallel code. Since many supercomputers have vector instruction sets, vectorization is important. Most vector computers can compute certain reduction operations, such as the sum of all the elements of a vector, using vector instructions. At least one supercomputer also has hardware to assist in the solution of first order linear recurrences. (A first order linear recurrence is defined by the equations x_i = c_i + a_i * x_(i-1), for i = 2, 3, ..., n, and x_1 = c_1.)

Newer supercomputers achieve higher speeds by using multiple processors. For these machines, generation of concurrent code will utilize all processors. Concurrent loops and concurrent blocks of code are two types of parallel code that can be generated.

Loop Vectorization

Loops are vectorized by examining the data-dependence graphs for innermost loops. A simple graph algorithm to find cycles in the data-dependence graph identifies any trouble spots in

vectorization. If there are no cycles in the graph, then the whole loop can be vectorized. For example, the following loop

      do I = 1,N
S1:      A(I) = B(I)
S2:      C(I) = A(I) + B(I)
S3:      E(I) = C(I+1)
      end do

has the following data-dependence graph:

      S1 --(flow)--> S2 <--(anti)-- S3

Because the data-dependence graph has no cycles, it can be completely vectorized, although some statement reordering will be necessary. Since S3 δ̄ S2, S3 must precede S2 in the vectorized loop:

S1:   A(1:N) = B(1:N)
S3:   E(1:N) = C(2:N+1)
S2:   C(1:N) = A(1:N) + B(1:N)

The following loop contains a data-dependence cycle:

      do I = 2,N
S1:      A(I) = B(I)
S2:      C(I) = A(I) + B(I-1)
S3:      E(I) = C(I+1)
S4:      B(I) = C(I) + 2.
      end do

The following is the data-dependence graph for this loop:

[Data-dependence graph: S1 δ S2; S2 δ S4 and S4 δ< S2, forming a cycle; S3 δ̄ S2.]

Statements S2 and S4 comprise a cycle in the data-dependence graph; when the dependence graph



contains a cycle, the strongly connected components (or maximal cycles) must be found. All the statements in a data-dependence cycle must be executed in a serial loop unless the cycle can be broken. However, other statements may still be vectorized. Thus, the previous loop may be partially vectorized as follows:

S1:   A(2:N) = B(2:N)
S3:   E(2:N) = C(3:N+1)
      do I = 2,N
S2:      C(I) = A(I) + B(I-1)
S4:      B(I) = C(I) + 2.
      end do

A sufficient (but not necessary) condition for vectorization of a loop is that no upward data-dependence relations (Si δ Sj, where Sj lexically precedes Si) appear in the loop. This will guarantee legal vectorization, but will miss loops where simple statement reordering would allow vectorization (as in the first example above).
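A sketch of this simple check is shown below. The representation is ours and purely hypothetical: dependence arc k runs from statement number SRC(k) to statement number SNK(k), where statements are numbered by their lexical position in the loop body.

c     .TRUE. when the simple test allows vectorization in the original
c     statement order: no arc may point to a statement that lexically
c     precedes its source (an "upward" dependence)
      LOGICAL FUNCTION SIMPLE(NDEP, SRC, SNK)
      INTEGER NDEP, SRC(NDEP), SNK(NDEP), K
      SIMPLE = .TRUE.
      DO 10 K = 1, NDEP
         IF (SNK(K) .LT. SRC(K)) SIMPLE = .FALSE.
   10 CONTINUE
      END

Loops that fail this check may still be vectorizable after statement reordering or after the cycle analysis described above.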

Certain reduction operations appear frequently in programs and are recognized by vectorizing compilers. Prominent among these is the SUM of a vector:

      do I = 1,N
S1:      A(I) = B(I) + C(I)
S2:      ASUM = ASUM + A(I)
      end do

Vector code for this loop would appear as follows:

S1:   A(1:N) = B(1:N) + C(1:N)
S2:   ASUM = ASUM + SUM(A(1:N))

Here, the SUM function returns the sum of its argument. A special case of a SUM is a dot product; this is important because many supercomputers have a distinct adder and multiplier that can perform a dot product (a SUM of a multiplication) in the same time as a SUM alone, thus performing the multiplication with almost no time penalty.

Care must be taken by the compiler and the programmer to ensure that the correct answer will always result. Most methods to produce a sum using vector instructions involve accumulating partial sums, then adding the partial sums together. Since this will add the arguments in a different order than the original loop, round-off errors may accumulate differently, and some programs may produce substantially different answers. Some compilers have a switch that will disable generation of reductions in order to guarantee the same answers from the vectorized code as from the serial code.
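One possible shape of such partial-sum code is sketched below, assuming 64-element vector operations; the strip length and the layout of the temporary PART are illustrative only, and A, N, and ASUM are the variables of the loop above.

      REAL PART(64)
c     accumulate 64 interleaved partial sums with vector operations
      PART(1:64) = 0.0
      M = N - MOD(N,64)
      DO 10 I = 1, M, 64
         PART(1:64) = PART(1:64) + A(I:I+63)
   10 CONTINUE
c     fold in any leftover elements, then reduce the partial sums
      DO 20 I = M+1, N
         PART(1) = PART(1) + A(I)
   20 CONTINUE
      ASUM = ASUM + SUM(PART(1:64))
c     the additions occur in a different order than in the serial
c     loop, so round-off may accumulate differently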

Other common reductions are the PRODUCT of a vector or the maximum or minimum of a vector. Simple pattern recognition can be used to find these operations (as well as SUM) in loops. More complicated patterns, such as a loop that finds the maximum of a vector and saves the index of the maximum, are also frequently found:

      IMAX = 1
      AMAX = A(1)
      do I = 2,N
         if ( A(I) > AMAX ) then
            AMAX = A(I)
            IMAX = I
         end if
      end do

Recognizing and generating vector code for such multistatement patterns enhances the overall power of a vectorizer.

At least one supercomputer has been designed with an instruction to solve first order linear recurrences. These recurrences can be recognized by pattern matching:

      do J = 2,N
         A(J) = A(J-1) * B(J) + C(J)
      end do

Fast algorithms for solving recurrences on parallel computers have been devised that may be useful for some systems, although these suffer from the same round-off error accumulation problem mentioned earlier. Many computer systems offer a library of procedures to compute certain common forms of recurrences; some compilers recognize these recurrences and translate them into calls to the appropriate library procedure. (The Burroughs Scientific Processor (BSP) [30] was designed with a recurrence instruction, but the project was canceled before the first machine was delivered. Among current supercomputers, the Hitachi S-810 has a recurrence instruction. One such library is STACKLIB for the Control Data 6600 [46], 7600, Cyber 70, Cyber 170, and Cyber 205 computers.)

Loop Concurrentization

One way to use multiple processors in a computer is to partition the set of iterations of a loop and assign a different subset to each processor [16, 19, 34, 38, 42, 43]. Two important factors that determine the quality of the concurrent code are the balance of processor load, and the amount of processor idle time due to synchronization. Concurrentization, therefore, should aim at an even distribution of the iterations among processors and should try to organize the code so as to avoid synchronization or, at least, to minimize waiting.

Checking for independent iterations is done by examining the data-dependence directions. If all the data-dependence relations in a loop have an = direction for that loop, then the iterations are independent. The only time that communication between



and then add the partial sums into the summation variable ASUM at the end of the loop. This produces completely parallel code without synchronization, but may accumulate different round-off errors.

High-Level Spreading

Another approach to utilizing multiple processors is to spread independent operations over several processors. A fine-grain parallelism model would spread the computation tree of an expression evaluation over several processors. For instance, the computation tree for the expression

      A + (B**2 - C*1.5) - (D - 2*E)

is as follows:

      Level 1                -
                           /   \
      Level 2            +       -
                         / \     / \
      Level 3           A   -   D   *
                           / \     / \
      Level 4            **   *   2   E
                         / \   / \
      Level 5           B   2 C   1.5

In the computation tree at levels 2, 3, and 4, there are two arithmetic operations that could be performed simultaneously. With two processors, this expression could be computed in four time steps (assuming + and * take the same time and assuming no time for variable fetching or processor communication), instead of the seven time steps necessary for a single processor.

The problem with this model is that the assumptions are invalid for current machines; variable fetching does take time, and if several processors share a common memory, they can interfere with each other. Communication between processors also takes time; thus, it is better to try to spread blocks of code that represent a relatively large workload. As with concurrent loops, it is best if the blocks of code assigned to different processors are completely independent, so that no communication costs are incurred. If the workload is spread unevenly over the processors, then some processors may finish early and be left idle while other processors are still busy.

Adjacent blocks of code, such as adjacent independent recurrence loops or procedure calls, are candidates for high-level spreading [50]. The data-dependence relations between these high-level objects are examined, and if the blocks of code are independent, then they can be spread over several processors. If there are data-dependence relations between the blocks of code, then either synchronization must be added (as in loop concurrentization) to perform spreading, or the code must be executed serially.

The potential for speedup with spreading is much lower than for loop concurrentization. Loop concurrentization may find N independent blocks of code, where N is the loop bound. Spreading, in practice, will usually find only 2-3 independent blocks of code suitable for parallel execution.

Trade-offs between Vectorization and Concurrentization

Some recent computer designs (e.g., the Cray X-MP, Alliant FX/8, and ETA) have multiple processors with vector instructions. The techniques of vectorization and concurrentization must be used together to take full advantage of these computers. When only one loop exhibits any parallelism, the compiler must decide whether to generate vector code, concurrent code, or whether to split the index set into segments, all of which can be executed concurrently and in vector mode. When the loop contains many if statements that would produce sparse vector operations, concurrent execution may be more efficient.

When several nested loops exhibit parallelism, the compiler must choose which loop to execute in vector mode and which in concurrent mode. Several factors should be considered. For example, since the vector speed of some machines depends on the stride of memory accesses, the compiler may choose to execute the loop that generates stride-1 memory operations in vector mode and some other loop in concurrent mode. However, since a


order), and with the J loop in concurrent mode, as follows:

      doacross J = 1,N
S1:      A(1:N,J+1) = B(1:N,J) + C(1:N,J)
         signal ( J )
         if ( J > 1 ) wait ( J-1 )
S2:      D(1:N,J) = A(1:N,J) * 2.
      end doacross

This may produce the best vector execution speed, but the data-dependence relation S1 δ< S2 requires synchronization in the concurrent loop. An alternate method to compile the loop would perform the J loop in vector mode:

      doall I = 1,N
S1:      A(I,2:N+1) = B(I,1:N) + C(I,1:N)
S2:      D(I,1:N) = A(I,1:N) * 2.
      end doall

Although this loop would have better concurrent execution speed, it would perhaps be at the expense of slower vector execution. Balancing the different methods of compiling the loop to get the best performance is a tough job for the compiler.

IMPROVING POTENTIAL PARALLELISM

The previous sections should clarify the importance of a good data-dependence testing procedure. If unnecessary relations are added to the data-dependence graph, then the potential for parallelism discovery can be reduced dramatically. Some methods for computing a more accurate data-dependence graph are given here.

Induction Variable Recognition

Variables in loops whose successive values form an arithmetic progression are called induction variables; the most obvious example of an induction variable is the index variable of a loop. Induction variables are often used in array subscript expressions. Traditional optimization techniques are aimed at finding induction variables to remove them from the loop and also to optimize the array address calculation [2]. For data-dependence tests, the array subscripts should be known in terms of the loop index variables; therefore, discovery of induction variables is important. Most compilers will recognize that the loop

      INC = N
      do I=1,N
         I2 = 2*I-1
         X(INC) = Y(I) + Z(I2)
         INC = INC - 1
      end do

refers to the same array addresses as the following loop:

      do I=1,N
         X(N-I+1) = Y(I) + Z(2*I-1)
      end do

The expression assigned to I2 is recognized as a function of the loop index variable, so I2 is easily recognized as an induction variable. The assignment to INC is a self-decrement, which qualifies INC as an induction variable. If the last values of I2 and INC are not used later in the program, the two loops above may be used interchangeably. After vectorization, both loops become

      X(N:1:-1) = Y(1:N) + Z(1:2*N-1:2)

Wraparound Variable Recognition

Sometimes a variable may look like an induction variable, but does not quite qualify. The assignment to J in the loop

      J = N
      do I=1,N
         B(I) = (A(J) + A(I)) / 2.
         J = I
      end do

appears to qualify J as an induction variable, but J is used before it is assigned. In fact, the programmer used a trick to make the array A look like a cylinder. The loop takes the average of two adjacent elements of the array A; in the first iteration, the neighbor of A(1) is defined to be A(N); the J variable accomplishes this trick. J is called a wraparound variable, since the values assigned to it are not used until the next iteration of the loop.

By peeling off one iteration of the loop, J can be treated as a normal induction variable:

      if (N >= 1) then
         B(1) = (A(J) + A(1)) / 2.
         do I=2,N
            B(I) = (A(I-1) + A(I)) / 2.
         end do
      end if

The if is necessary to test the zero-trip condition of the loop. The loop may be vectorized to become

      if (N >= 1) then
         B(1) = (A(J) + A(1)) / 2.
         B(2:N) = (A(1:N-1) + A(2:N)) / 2.
      end if

Symbolic Data-Dependence Testing

As mentioned in the data-dependence section, the simplest data-dependence subscript tests will be sufficient for a large number of cases, but are too lim-



ited for a general-purpose powerful compiler. Even some of the more sophisticated tests have severe restrictions, such as requiring that the loop bounds be compile-time constants. To handle a large majority of cases, a compiler must be able to compute precise data-dependence relations for very general array references. For instance, in the following loop

      do I = LOW,HIGH
S1:      A(I) = B(I) + C(I)
S2:      D(I) = A(I-1)
      end do

the data-dependence relation S1 δ< S2 can be computed using the simplest tests. However, in the similar loop

      do I = LOW,HIGH,INC
S1:      A(I) = B(I) + C(I)
S2:      D(I) = A(I-INC)
      end do

the same relation S1 δ< S2 holds, but is more difficult to detect, since the values of LOW, HIGH, and INC are all unknown to the compiler, and even the sign of INC is unknown (a negative increment would make the loop go backwards). The following is another case in which symbolic data dependence (so called because the subscript expression cannot be decomposed into compile-time constants) is needed:

      do I = 1,N
S1:      A(LOW+I-1) = B(I)
S2:      B(I+N) = A(LOW+I)
      end do

Here, the two references to the array A can be compared by canceling out the loop-invariant value LOW. This is then the same as comparing A(I-1) to A(I), which can be handled by simpler tests. The two B array references cause no data dependence, since the section of the array referenced by B(I) is B(1:N), which does not intersect with the section of the array referenced by B(I+N), namely B(N+1:N+N).
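As a toy illustration of this kind of reasoning, the sketch below performs only a range (bounds) comparison for two references C1*I + J1 and C2*I + J2 over I = 1..N. The function name and interface are ours, and a real compiler combines such bounds reasoning with the GCD test and with symbolic simplification of the kind described above.

c     .TRUE. when the value ranges of the two subscripts overlap, so
c     the bounds test alone cannot rule out a dependence.
c     Assumes positive strides C1 and C2 and I running from 1 to N.
      LOGICAL FUNCTION OVRLAP(N, C1, J1, C2, J2)
      INTEGER N, C1, J1, C2, J2
      INTEGER LO1, HI1, LO2, HI2
      LO1 = C1 + J1
      HI1 = C1*N + J1
      LO2 = C2 + J2
      HI2 = C2*N + J2
      OVRLAP = MAX(LO1, LO2) .LE. MIN(HI1, HI2)
      END

For the B references above, OVRLAP(N, 1, 0, 1, N) compares the ranges 1..N and N+1..2N and returns .FALSE., matching the argument in the text.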

Global Forward Substitution

Global forward substitution is a transformation that substitutes the right-hand side of an assignment statement for occurrences of the left-hand-side variable; it is especially useful in conjunction with symbolic data dependence. In programs, temporary variables are frequently used to hold commonly used subexpressions or offsets; these variables appear later in the program in array subscripts. Without some kind of global knowledge, the data-dependence tests must assume that the set of subscript values might intersect. For example, the program

      NP1 = N+1
      NP2 = N+2
      do I = 1,N
S1:      B(I) = A(NP1) + C(I)
S2:      A(I) = A(I) - 1.
         do J = 2,N
S3:         D(J,NP1) = D(J-1,NP2)*C(J) + 1.
         end do
      end do

defines two variables, NP1 and NP2, in terms of N. A loop, later in the program, uses NP1 and NP2 in array subscripts. If the compiler does not keep any information about NP1, then it must assume that the assignment of A(I) might reassign A(NP1), and thus there is dependence between S1 and S2. However, NP1 was defined to be N+1, and the assignment to A(I) will not ever reach A(N+1), so this is a false dependence. Similarly, if the compiler does not keep information about NP1 and NP2, it must assume that S3 forms a recurrence. In fact, since NP1 can not equal NP2, the two references to the D array in S3 are independent, and the J loop can be vectorized or concurrentized. Many compilers perform constant propagation [26, 45, 51], which is a special case of global forward substitution.
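After global forward substitution of NP1 and NP2, the loop the dependence tests actually analyze might look like the following sketch (assuming the last values of NP1 and NP2 are not needed elsewhere):

      do I = 1,N
S1:      B(I) = A(N+1) + C(I)
S2:      A(I) = A(I) - 1.
         do J = 2,N
S3:         D(J,N+1) = D(J-1,N+2)*C(J) + 1.
         end do
      end do

With the symbolic values exposed, the tests can see that A(I) with I <= N never reaches A(N+1), and that the two D subscripts N+1 and N+2 can never be equal.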

Semantic Analysis

Semantic analysis of the program can also help remove data-dependence relations. For instance, in the loop

      do I = LOW,HIGH
S1:      A(I) = B(I) + A(I+M)
      end do

S1 can be vectorized if M >= 0, but not if M < 0. By looking at the surrounding code, the compiler might find an if statement:

      if ( M > 0 ) then
         do I = LOW,HIGH
S1:         A(I) = B(I) + A(I+M)
         end do
      end if

Taking the if statement into account, this loop will not be executed unless M > 0; therefore, it can be vectorized. Note that since the iterations are not independent, concurrent code for this loop may still not be appropriate.



Sometimes, even if the if statement is not present, the program can be modified to achieve a result similar to the previous transformation. The original loop could be transformed into the following:

      if ( M >= 0 ) then
         do I = LOW,HIGH
            A(I) = B(I) + A(I+M)
         end do
      else
         do I = LOW,HIGH
            A(I) = B(I) + A(I+M)
         end do
      end if

The then part of the if statement can now be vectorized. This transformation is called two-version loops for obvious reasons.
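A sketch of the result after vectorizing the then branch (the else branch keeps the serial loop) might look like this:

      if ( M >= 0 ) then
         A(LOW:HIGH) = B(LOW:HIGH) + A(LOW+M:HIGH+M)
      else
         do I = LOW,HIGH
            A(I) = B(I) + A(I+M)
         end do
      end if

When M >= 0, each A(I+M) read on the right-hand side has not yet been overwritten by the sweep, so the array assignment produces the same values as the serial loop.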

An alternative to semantic analysis is for the compiler to request information from the user. If the user inputs M >= 0 as an assertion that is true immediately before beginning of execution of the original loop, then the compiler can proceed to vectorize the loop. (Assertions are discussed in greater detail in the section on interaction with the programmer, p. 1198.)

Semantic analysis can also be used to generate vector code for the following loop:

      do I = LOW,HIGH
S1:      A(I) = A(M)
      end do

Since the relative values of LOW, HIGH, and M are not known, the data-dependence relation S1 δ S1 must be assumed. This loop can, however, be executed with a vector instruction. The value A(M) is loop invariant, even if M falls between LOW and HIGH. The value A(M) will at worst be copied to itself; it will not change. The vector code for this statement will give the same results as the serial loop, but the entire statement must be examined to prove this.

Interprocedural Dependence Analysis

When a procedure or function call appears in a loop, most compilers will assume that the loop must be executed serially. Analysis of the effects of the procedure or function call, including which parameters are changed and what global variables are used or changed, can allow dependence testing to decide whether or not the procedure call prevents parallel code from being generated. Studying other information about the parameters, such as values of constants, across procedure call boundaries can help the compiler optimize the code [9, 12, 14, 49].

An alternative method for handling procedure calls is to expand the procedure in-line (also called procedure integration [35]). This makes possible the application of some transformations that simultaneously manipulate code in the calling and in the called routines. (For example, a loop surrounding the subroutine invocation and a loop in the body of the routine could be interchanged; see the section on loop interchanging.) Also, dependence analysis for the subroutine body is more exact, since only the effects of the one call must be taken into account, and the overhead of the subroutine call is eliminated. In-line expansion is useful even for serial computers. (Examples of compilers that do in-line substitution are the Experimental Compiling System [4], Parafrase [24], and the Perkin-Elmer Fortran compiler.) However, in-line expansion should be done with care to avoid an undue increase in the time required for compilation.

Removal of Output and Antidependences

Output dependences and antidependences are, in some sense, false dependences. They arise not because data are being passed from one statement to another, but because the same memory location is used in more than one place. Often, these false dependences can be removed by changing variable names or copying data.

Variable Renaming. Renaming introduces new variable names to replace some of the occurrences of the old variables throughout the program. In the following program segment

S1:   A = B + C
S2:   D = A + 1
S3:   A = D + E
S4:   F = A - 1

variable A is assigned twice. This produces the output dependence S1 δ° S3 and the antidependence S2 δ̄ S3. Both of these dependences disappear if a new variable replaces A in S1 and S2:

S1:   A2 = B + C
S2:   D = A2 + 1
S3:   A = D + E
S4:   F = A - 1

Renaming arrays is a difficult problem in general. Current compilers rename arrays only in very limited cases, if at all.

Node Splitting. Some loops contain data-dependence cycles that can be easily eliminated by copying data. The following loop

      do I = 1,N
S1:      A(I) = B(I) + C(I)
S2:      D(I) = (A(I) + A(I+1)) / 2.
      end do



has the following data-dependence graph:

      S1 --(flow)--> S2
      S2 --(anti)--> S1        (a cycle)

This appears to be a data-dependence cycle. However, one of the arcs in the cycle corresponds to an antidependence; if this arc were removed, the cycle would be broken. The antidependence relation can be removed from the cycle by inserting a new assignment to a compiler temporary array as follows:

      do I = 1,N
S2':     ATEMP(I) = A(I+1)
S1:      A(I) = B(I) + C(I)
S2:      D(I) = (A(I) + ATEMP(I)) / 2.
      end do

The modified loop has the following data-dependence graph:

      S2' --(flow)--> S2
      S1  --(flow)--> S2
      S2' --(anti)--> S1       (no cycle)

The data-dependence cycle has been eliminated by splitting the S2 node in the data-dependence graph into two parts; the new loop can now be vectorized:

S2':  ATEMP(1:N) = A(2:N+1)
S1:   A(1:N) = B(1:N) + C(1:N)
S2:   D(1:N) = (A(1:N) + ATEMP(1:N)) / 2.

A similar technique can be used to remove output dependences in data-dependence cycles. The loop

      do I = 1,N
S1:      A(I) = B(I) + C(I)
S2:      A(I+1) = A(I) + 2*D(I)
      end do

has the following data-dependence graph:

      S1 --(flow)--> S2
      S2 --(output)--> S1      (a cycle)

The data-dependence cycle can be broken by adding a temporary array:

      do I = 1,N
S1:      ATEMP(I) = B(I) + C(I)
S2:      A(I+1) = ATEMP(I) + 2*D(I)
S1':     A(I) = ATEMP(I)
      end do

The data-dependence graph for the modified loop is

      S1 --(flow)--> S2
      S1 --(flow)--> S1'
      S2 --(output)--> S1'     (no cycle)

which has no cycles, and can now be vectorized:

S1:   ATEMP(1:N) = B(1:N) + C(1:N)
S2:   A(2:N+1) = ATEMP(1:N) + 2*D(1:N)
S1':  A(1:N) = ATEMP(1:N)

In both of these cases the added cost is a copy either to or from a compiler temporary array. For machines with vector registers, however, the temporary array will be assigned to a vector register. The extra copy is just a vector register load or store that needs to be done anyway; proper placement of the load or store will allow vectorization of these loops without additional statements.

OPTIMIZATIONS FOR VECTOR OR CONCURRENT COMPUTERS

Many of the optimizations in this section were designed with vector or concurrent computers in mind. These optimizations are used to exploit more parallelism or to use parallelism more efficiently on the target computer. Due to space limitations, the list of methods we discuss is incomplete. Among the absentees are the methods that deal with while loops and with recursion. Only recently, in the context of developing optimizing compiler methods for Lisp, has recursion been carefully considered [22, 23]. While loops, on the other hand, are hard to manipulate, and the known transformation techniques are successful only in limited cases.

Scalar Expansion

Vectorizing compilers will promote (or expand) scalars that are assigned in loops into temporary



arrays. This is clearly necessary in order to generate vector code. For instance, the loop

      do I=1,N
S1:      X = A(I) + B(I)
S2:      C(I) = X ** 2
      end do

can be vectorized by first expanding X into a temporary array, XTEMP,

      allocate (XTEMP(1:N))
      do I=1,N
S1:      XTEMP(I) = A(I) + B(I)
S2:      C(I) = XTEMP(I) ** 2
      end do
      X = XTEMP(N)
      free (XTEMP)

and then generating vector code:

      allocate (XTEMP(1:N))
S1:   XTEMP(1:N) = A(1:N) + B(1:N)
S2:   C(1:N) = XTEMP(1:N) ** 2
      X = XTEMP(N)
      free (XTEMP)

The explicit allocate and free are not necessary on many computers, such as those with vector registers, since the temporary array will exist only in the registers.

When the target machine is a multiprocessor, there is another alternative. Multiprocessor languages (such as Cedar Fortran and Blaze [37]) allow the declaration of iteration-local variables. (Cedar Fortran is the Fortran dialect being designed for the Cedar multiprocessor project at the University of Illinois [21, 31].) Thus, the loop

      doall I=1,N
         real X
         X = A(I) + B(I)
         C(I) = X ** 2
      end doall

is another valid transformation of the previous loop, as long as X is not used outside the loop in the original program. The real X declaration in the doall loop means that there will be a separate copy of X for each iteration of the loop. Declaring a scalar variable as iteration local has the same effect as transforming the scalar into an array.

Loop Interchanging

In a multiply-nested loop, the order of the loops may often be interchanged without affecting the outcome [6, 52]. For instance, the loop

      do J=1,N
         do I=2,N
            A(I,J) = A(I-1,J) + B(I)            (L1)
         end do
      end do

is equivalent to the following loop:

      do I=2,N
         do J=1,N
            A(I,J) = A(I-1,J) + B(I)            (L2)
         end do
      end do

Loop interchanging can translate either of the previous two loops to the other. Interchanging loops is not always possible. The following loop is an example of a loop that cannot be interchanged:

      do K=2,N
         do L=1,N-5
            A(K,L) = A(K-1,L+5) + B(K)
         end do
      end do

Figure 1 (p. 1197) illustrates how data-dependence graphs are used to determine when loop interchanging is valid.

Loop interchanging may be used to aid in loop vectorization. For instance, L1 (above) computes a linear recurrence in its inner loop; interchanging it to create L2 allows vectorization of the J index:

      do I=2,N
         A(I,1:N) = A(I-1,1:N) + B(I)
      end do

Loop interchanging may also be used to put a concurrent loop on the outside, leading to a better program after loop concurrentization. Thus, loop L2 could be interchanged into L1, which would then be concurrentized to become the following:

      doall J=1,N
         do I=2,N
            A(I,J) = A(I-1,J) + B(I)
         end do
      end doall

Fission by Name

Techniques have been developed to handle virtual memory systems, cache memories, and register allocation. The fission-by-name transformation tries to break a single DO loop into several adjacent loops. Two statements in the original loop will be in the same resulting loop if there is at least one variable or array referenced by both statements. Fission by name (originally called distribution of name partitions [1]) is used to enhance memory hierarchy performance.
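A minimal illustration (the array names are ours): the loop below references the disjoint name sets {A, B} and {C, D}, so fission by name splits it into two loops, each touching a smaller set of arrays:

      do I = 1,N
         A(I) = A(I) + B(I)
         C(I) = C(I) * D(I)
      end do

becomes

      do I = 1,N
         A(I) = A(I) + B(I)
      end do
      do I = 1,N
         C(I) = C(I) * D(I)
      end do

Each resulting loop now works on fewer pages, cache lines, or registers at a time, which is the memory-hierarchy benefit the transformation aims for.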



Loop Fusion

Loop fusion is a conventional compiler optimization [3, 35] that transforms two adjacent loops into a single loop. The use of data-dependence tests allows fusion of more loops than is possible with standard techniques. For example, the loops

      do I = 2,N
S1:      A(I) = B(I) + C(I)
      end do
      do I = 2,N
S2:      D(I) = A(I-1)
      end do

would not be fused by conventional compilers that do not study the array subscripts. However, the loop fusion is legal since the data-dependence relation S1 δ S2 would not be violated:

      do I = 2,N
S1:      A(I) = B(I) + C(I)
S2:      D(I) = A(I-1)
      end do

A slightly modified example shows when loop fusion is not legal:

      do I = 2,N
S1:      A(I) = B(I) + C(I)
      end do
      do I = 2,N
S2:      D(I) = A(I+1)
      end do

In the original two loops, the data-dependence relation S1 δ S2 holds; the fused loop below, however, has the relation S2 δ̄ S1:

      do I = 2,N
S1:      A(I) = B(I) + C(I)
S2:      D(I) = A(I+1)
      end do

Loop fusion is used in conventional compilers to reduce the overhead of loops. Likewise, fusion helps to reduce start-up costs for doall loops. It may also increase overlapping if two doall loops require synchronization between iterations.

Since loop fusion and loop fission are dual transformations, any compiler that uses both of them should do so carefully. Loop fusion should not be used to fuse the loops just created by fission. Loop fusion can be used to combine separate loops, if they all refer to the same set of variables, with the same

goal as fission by name. For instance, by fusing several loops that refer only to the arrays {A, B, D}, the compiler would have a larger loop with the benefits of loop fusion, but still have the improved memory hierarchy performance that comes from only referring to a small set of arrays in the loop.

Strip Mining

Strip mining [35] is used for memory management; it transforms a singly nested loop into a doubly nested one. The outer loop steps through the index set in blocks of some size, and the inner loop steps through each block. As an example, consider the following loop:

      do I=1,N
         A(I) = B(I) + 1
         D(I) = B(I) - 1
      end do

After strip mining, this becomes the following:

      do J=1,N,32
         do I=J,MIN(J+31,N)
            A(I) = B(I) + 1
            D(I) = B(I) - 1
         end do
      end do

Thus, the loop is excavated in chunks, just as a strip mine is excavated in shovelfuls. The block size of the outer block loop (32 in this example) is determined by some characteristic of the target machine, such as the vector register length or the cache memory size (see Figure 2, p. 1198, for an example of strip mining and fission by name). For vector machines, the inner strip loop will be vectorized; for parallel computers, the outer block loop can sometimes be concurrentized. Figure 3 (p. 1199) shows how strip mining and loop interchanging can be combined to optimize performance.

Loop Collapsing

Loop collapsing [44, 52] transforms two nested loops into a single loop, which is used to increase the effective vector length for vector machines. Therefore,

      real A(5,5), B(5,5)
      do I = 1,5
         do J = 1,5
            A(I,J) = B(I,J) + 2.
         end do
      end do

becomes as follows:



(a)
      do I=1,3
         do J=1,3
S:          A(I,J) = A(I-1,J+1)
         end do
      end do

(e)
      do I=1,3
         do J=1,3
T:          A(I,J) = A(I-1,J-1)
         end do
      end do

[Panels (b)-(d) and (f)-(h) show the iteration spaces of loops (a) and (e), the data-dependence arrows across each iteration space, and the execution orders before and after interchanging.]

The compiler uses data dependences to determine when loop interchanging is valid. For example, the loops in (a) may not be interchanged. To see why, consider its iteration space in (b). The arrows in (b) show the data-dependence relations flowing across the iteration space. The instances of statement S are executed by (a) in the order shown by the dotted arrows in (c). If the loops in (a) were interchanged, the new order of execution would be as shown in (d). In this loop ordering, S^(3,2) would be executed before S^(2,3), even though S^(3,2) needs a value computed by S^(2,3); thus this statement ordering is invalid. As a second example, consider the loops in (e). In this case interchanging is clearly valid since no dependences are violated in the new execution order.

FIGURE 1. How to Determine When Loop Interchanging Is Valid

      real A(25), B(25)
      do IJ = 1,25
         A(IJ) = B(IJ) + 2.
      end do

A general version of loop collapsing is useful for parallel computers where only a single doall nest is supported or to improve the performance of self-scheduled loops. In general, this may require the introduction of some extra assignment statements. For instance, the loop

      do I=1,N
         do J=1,M
            A(I,J) = B(I,J) + 2.
         end do
      end do



(a)
      do I=1,N
         A(I) = B(I) + C(I)
         E(I) = F(I) + G(I)
         D(I) = A(I) + B(I)
      end do

(b)
      do I=1,N
         A(I) = B(I) + C(I)
         D(I) = A(I) + B(I)
      end do
      do I=1,N
         E(I) = F(I) + G(I)
      end do

(c)
      do J=1,N,32
         do I=J,min(N,J+31)
            A(I) = B(I) + C(I)
            D(I) = A(I) + B(I)
         end do
      end do
      do J=1,N,32
         do I=J,min(N,J+31)
            E(I) = F(I) + G(I)
         end do
      end do

(d)
      do J=1,N,32
         K = min(N,J+31)
         A(J:K) = B(J:K) + C(J:K)
         D(J:K) = A(J:K) + B(J:K)
      end do
      do J=1,N,32
         K = min(N,J+31)
         E(J:K) = F(J:K) + G(J:K)
      end do

(e)
      do J=1,N,32
         K = min(N,J+31)
         vr1 <- B(J:K)
         vr2 <- C(J:K)
         vr3 <- vr1 + vr2
         A(J:K) <- vr3
         vr2 <- vr3 + vr1
         D(J:K) <- vr2
      end do
      do J=1,N,32
         K = min(N,J+31)
         vr1 <- F(J:K)
         vr2 <- G(J:K)
         vr3 <- vr1 + vr2
         E(J:K) <- vr3
      end do

Fission by name and strip mining are used to improve memory performance. Assume a target computer with 32-element vector registers. Before register allocation is performed, the input loop (a) will go through the following sequence of transformations. First, fission by name is applied (b), then strip mining (c), and finally loop vectorization (d). The target program (d) is now a sequence of 32-element vector operations. Registers can now be allocated as shown in (e) (vr1, vr2, and vr3 are vector registers).

Fission by name is useful in decreasing the number of registers needed in a loop. Had the statement E(I) = F(I) + G(I) remained in its original position, either more registers or more memory read/writes would have been needed.

FIGURE 2. An Example of Fission by Name and Strip Mining

may be transformed into

      do L=1,N*M
         I = ⌈L/M⌉
         J = mod(L-1,M) + 1
         A(I,J) = B(I,J) + 2.
      end do

regardless of the bounds of the array A. (For instance, with M = 5, iteration L = 7 of the collapsed loop computes I = ⌈7/5⌉ = 2 and J = mod(6,5)+1 = 2, which is the seventh iteration of the original nest.)

INTERACTION WITH THE PROGRAMMER

Several user interaction strategies have been used by optimizing compilers for parallel computers. The most frequent approach is for the compiler to translate directly into object code and to provide a summary specifying what was vectorized and what was not. When something is not vectorized, the compiler gives a reason, which could be the presence of a data dependence, the need to assume a dependence because a subscript range is not known at compile time, the presence of a call to an unknown routine, and so on. If users are not satisfied with the outcome, they may resubmit the program after rewriting parts of it or after inserting directives or assertions.

For example, consider the following loop:

      do I=1,N
         A(K(I)) = A(K(I)) + C(I)
      end do

Most compilers will not vectorize this loop; instead they will notify the user that the compiler was forced to assume a dependence since it did not know the value of vector K at compile time. If programmers know that K is a permutation of a subset of the integers, they may order the compiler, through a directive, to vectorize the loop. This directive usually takes the form of a comment line preceding the loop.
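A sketch of what such a directive might look like follows; the exact spelling is vendor specific, and the Cray-style CDIR$ IVDEP line shown below, which tells the compiler to ignore assumed vector dependences, is only one illustrative example.

CDIR$ IVDEP
      do I=1,N
         A(K(I)) = A(K(I)) + C(I)
      end do

The responsibility for the assertion behind the directive (here, that K is a permutation) rests with the programmer.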



(a)
      do I=1,N
         do J=1,N
            do K=1,N
               C(I,J) = C(I,J) + A(I,K) * B(K,J)
            end do
         end do
      end do

(b)
      do J=1,N
         do K=1,N
            do I=1,N
               C(I,J) = C(I,J) + A(I,K) * B(K,J)
            end do
         end do
      end do

(c)
      do J=1,N
         do K=1,N
            do L=1,N,64
               do I=L,min(L+63,N)
                  C(I,J) = C(I,J) + A(I,K) * B(K,J)
               end do
            end do
         end do
      end do

(d)
      do J=1,N
         do K=1,N
            do L=1,N,64
               I = min(L+63,N)
               C(L:I,J) = C(L:I,J) + A(L:I,K) * B(K,J)
            end do
         end do
      end do

(e)
      do J=1,N
         do K=1,N
            do L=1,N,64
               I = min(L+63,N)
               vr1 <- C(L:I,J)
               vr2 <- A(L:I,K)
               sr0 <- B(K,J)
               vr3 <- vr2 * sr0
               vr1 <- vr1 + vr3
               C(L:I,J) <- vr1
            end do
         end do
      end do

(f)
      do L=1,N,64
         I = min(L+63,N)
         do J=1,N
            do K=1,N
               vr1 <- C(L:I,J)
               vr2 <- A(L:I,K)
               sr0 <- B(K,J)
               vr3 <- vr2 * sr0
               vr1 <- vr1 + vr3
               C(L:I,J) <- vr1
            end do
         end do
      end do

(g)
      do L=1,N,64
         I = min(L+63,N)
         do J=1,N
            vr1 <- C(L:I,J)
            do K=1,N
               vr2 <- A(L:I,K)
               sr0 <- B(K,J)
               vr3 <- vr2 * sr0
               vr1 <- vr1 + vr3
            end do
            C(L:I,J) <- vr1
         end do
      end do

The translation of (a) illustrates two interesting uses of loop interchanging. Loop (a) could be vectorized in the form in which it is presented, but it would require a SUM, which may cause round-off error problems. The loops can be interchanged so that I becomes the innermost index (b). After strip mining (c), the innermost loop is vectorized (d). Notice that, thanks to loop interchanging, the elements of the vectors in (d) are in contiguous memory locations. Vector register assignment may now be performed, leading to (e). The block loop L may be interchanged to become the outermost loop (f). We can now illustrate a second consequence of loop interchanging. The vector register load vr1 <- C(L:I,J) and store C(L:I,J) <- vr1 in (e) are loop invariant and may be moved outside the innermost loop (g). Loop interchanging to increase the number of loop-invariant register loads and stores is also a useful sequential optimization technique.

FIGURE 3. Two Examples of Loop Interchanging



Another way for the user to supply information to the compiler is through assertions. This is sometimes provided as an alternative to compiler directives. For example, in the previous loop the programmer could have asserted that K is a permutation of a subset of the integers. Assertions have two advantages over directives: They are self-explanatory, and they can be tested at run time while debugging the program. Directives, on the other hand, may be a simpler way or even the only way to specify what the user wants. For example, directives may be the only way for the user to request that some part of the code be executed sequentially.

A second user interaction strategy is to produce a restructured source program with vector and/or concurrent language extensions. This strategy makes it possible for the programmer to learn how a program was translated without having to look at the assembly code. (This is the approach taken by Parafrase at the University of Illinois, KAP from Kuck and Associates, PFC at Rice University, and VAST from Pacific Sierra.) These restructurers, like the compilers discussed above, accept directives and assertions from the user.

Experimental interactive restructurers, such as the Blaze Restructurer at Indiana University and the R^n programming environment at Rice University, are being developed. Some of these restructurers rely on the user to specify the transformations, and users may specify interactively what scalars to expand, what loops to interchange, which ones to vectorize, and so on. These tools make the process of hand-rewriting programs highly reliable and may become a useful tool for program development. Commercial interactive vectorizers are already available. (Notably, Forge from Pacific Sierra for the Cray supercomputers, and an interactive vectorizer facility for the Fujitsu supercomputers [20].) These interactive tools allow users to find the most time-consuming parts of their programs and rewrite them. They will also aid users in writing vectorizable code in order to get the best performance.

SUMMARY

When the Cray-1 was first delivered to Los Alamos Scientific Laboratory in 1976, there was a great deal of skepticism about whether compiler technology would ever catch up with hardware. Much research had been done for the Texas Instruments ASC and for the Illiac IV on automatic detection of parallelism, but it did not yet meet the needs of the user community. Since that time, several vector supercomputers and minisupercomputers have been announced, each with its own vectorizing compiler.

Some of these compilers are more powerful than others, but all are much better than anything hoped for in 1976.

Now, a new generation of computers with multiple processors is coming. Compilers to detect parallelism suitable for spreading over many processors already exist, and more will follow. Many of the optimizations that were used for vectorization are also useful for multiprocessing. That so many of the ideas can be shared by different architectures will make it easier to write compilers for new machines as time passes.

Acknowledgments. The authors would like to thank U. Banerjee, W. L. Harrison, A. Veidenbaum, and the referees for their many helpful comments about this article.

Further Reading

Optimizing compiler algorithms. See [5], [28], [29], [33], [41], and [52].
Dependence analysis. See [8]-[10].

REFERENCES

1. Abu-Sufah, W., Kuck, D.J., and Lawrie, D.H. On the performance enhancement of paging systems through program analysis and transformations. IEEE Trans. Comput. C-30, 5 (May 1981), 341-356.
2. Aho, A.V., Sethi, R., and Ullman, J.D. Compilers: Principles, Techniques, and Tools. Addison-Wesley, Reading, Mass., 1986.
3. Allen, F.E., and Cocke, J. A catalogue of optimizing transformations. In Design and Optimization of Compilers, R. Rustin, Ed. Prentice-Hall, Englewood Cliffs, N.J., 1972, pp. 1-30.
4. Allen, F.E., Carter, J.L., Fabri, J., Ferrante, J., Harrison, W.H., Loewner, P.G., and Trevillyan, L.H. The experimental compiling system. IBM J. Res. Dev. 24, 6 (Nov. 1980), 695-715.
5. Allen, J.R., and Kennedy, K. PFC: A program to convert Fortran to parallel form. Rep. MASC-TR82-6, Rice Univ., Houston, Tex., Mar. 1982.
6. Allen, J.R., and Kennedy, K. Automatic loop interchange. In Proceedings of the ACM SIGPLAN 84 Symposium on Compiler Construction (Montreal, June 17-22). ACM, New York, 1984, pp. 233-246.
7. American National Standards Institute. American National Standard for Information Systems: Programming Language Fortran, S8 (X3.9-198x). Revision of X3.9-1978, Draft S8, Version 99. American National Standards Institute, New York, Apr. 1986.
8. Banerjee, U. Speedup of ordinary programs. Ph.D. thesis, Rep. 79-989, Dept. of Computer Science, Univ. of Illinois at Urbana-Champaign, Oct. 1979.
9. Banerjee, U. Direct parallelization of call statements: A review. Rep. 576, Center for Supercomputing Research and Development, Univ. of Illinois at Urbana-Champaign, Nov. 1985.
10. Banerjee, U., Chen, S.C., Kuck, D.J., and Towle, R.A. Time and parallel processor bounds for Fortran-like loops. IEEE Trans. Comput. C-28, 9 (Sept. 1979), 660-670.
11. Brode, B. Precompilation of Fortran programs to facilitate array processing. Computer 14, 9 (Sept. 1981), 46-51.
12. Burke, M., and Cytron, R. Interprocedural dependence analysis and parallelization. In Proceedings of the SIGPLAN 86 Symposium on Compiler Construction, SIGPLAN Not. 21, 7 (July 1986), 162-175.
13. Burroughs Corp. Numerical Aerodynamic Simulation Facility Feasibility Study. Burroughs Corp., Paoli, Pa., Mar. 1979.
14. Callahan, D., Cooper, K.D., Kennedy, K., and Torczon, L. Interprocedural constant propagation. In Proceedings of the SIGPLAN 86 Symposium on Compiler Construction, SIGPLAN Not. 21, 7 (July 1986), 152-161.
15. Chen, S.C. Large-scale and high-speed multiprocessor system for scientific applications: Cray X-MP Series. In High-Speed Computation, NATO ASI Series, vol. F7, J.S. Kowalik, Ed. Springer-Verlag, New York, 1984, pp. 59-67.



16. Cytron, R.G. Doacross: Beyond vectorization for multiprocessors. In Proceedings of the 1986 International Conference on Parallel Processing (St. Charles, Ill., Aug. 19-22). IEEE Press, New York, 1986, pp. 836-844.
17. Davies, J., Huson, C., Macke, T., Leasure, B., and Wolfe, M. The KAP/S-1: An advanced source-to-source vectorizer for the S-1 Mark IIa supercomputer. In Proceedings of the 1986 International Conference on Parallel Processing (St. Charles, Ill., Aug. 19-22). IEEE Press, New York, 1986, pp. 833-835.
18. Davies, J., Huson, C., Macke, T., Leasure, B., and Wolfe, M. The KAP/205: An advanced source-to-source vectorizer for the Cyber 205 supercomputer. In Proceedings of the 1986 International Conference on Parallel Processing (St. Charles, Ill., Aug. 19-22). IEEE Press, New York, 1986, pp. 827-832.

19. Davies, J.R. Parallel loop constructs for multiprocessors. M.S. thesis, Rep. 81-1070, Dept. of Computer Science, Univ. of Illinois at Urbana-Champaign, May 1981.
20. Dongarra, J.J., and Hinds, A. Comparison of the Cray X-MP-4, Fujitsu VP-200, and Hitachi S-810/20: An Argonne perspective. Rep. ANL-85-19, Argonne National Laboratory, Argonne, Ill., Oct. 1985.
21. Guzzi, M.D. Cedar Fortran Reference Manual. Rep. 601, Center for Supercomputing Research and Development, Univ. of Illinois at Urbana-Champaign, Nov. 1986.
22. Harrison, W.L. Compiling LISP for evaluation on a tightly coupled multiprocessor. Rep. 565, Center for Supercomputing Research and Development, Univ. of Illinois at Urbana-Champaign, Mar. 1986.
23. Harrison, W.L., and Padua, D.A. Representing S-expressions for the efficient evaluation of Lisp on parallel processors. In Proceedings of the 1986 International Conference on Parallel Processing (St. Charles, Ill., Aug. 19-22). IEEE Press, New York, 1986, pp. 703-710.
24. Huson, C.A. An in-line subroutine expander for Parafrase. M.S. thesis, Rep. 82-1118, Dept. of Computer Science, Univ. of Illinois at Urbana-Champaign, Dec. 1982.
25. Kamiya, S., Isobe, F., Takashima, H., and Takiuchi, M. Practical vectorization techniques for the Facom VP. In Information Processing 83, R.E.A. Mason, Ed. Elsevier North-Holland, New York, 1983, pp. 369-394.
26. Kildall, G.A. A unified approach to global program optimization. In Conference Record of the 1st ACM Symposium on Principles of Programming Languages (POPL) (Boston, Mass., Oct. 1-3). ACM, New York, 1973, pp. 194-206.
27. Kruskal, C.P., and Weiss, A. Allocating independent subtasks on parallel processors. In Proceedings of the 1984 International Conference on Parallel Processing, R.M. Keller, Ed. IEEE Press, New York, Aug. 1984, pp. 236-240.
28. Kuck, D.J. Parallel processing of ordinary programs. In Advances in Computers, vol. 15, M. Rubinoff and M.C. Yovits, Eds. Academic Press, New York, 1976, pp. 119-179.

    31.

    32.

    33.

    34.

    35.36.

    37.

    Press, New York, 1976. pp. 119-179.Kuck, D.J. A survey of parallel machine organization and program-ming. ACM Compuf. Surv. 9, 1 (Mar. 1977), 29-59.Kuck, D.J.. and Stokes, R.A. The Burroughs scientific processor(BSP). Special Issue on Supersystem s, IEEE Trans. Com put. C-31, 5 (May7962).363-376.Kuck. D.J., Davidson, E.S., Lawrie, D.H., and Sameh, A.H. P arallelsupercom puting today and the Cedar approach. Science 231, 4740(Feb. 28, 1986). 967-974.Kuck. D.]., Kuhn. R.H., Leasure, B.. and Wolfe, M. The structure ofan advanced retargetable vectorizer. In Tuforial on Supercom puters:Designs and Applications, K. Hwang, Ed. IEEE Press, New York, 1984,pp. 163-178.Kuck. D.J., Kuhn, R.H., Padua. D.A., Leasure, B., and Wolfe, M.Dependence graphs and compiler optimizations. In Proceedings of the8th ACM Sym posium on Principles of Programming Languages (POPL)(Williamsb urg, Va.. Jan. 26-28). 1981, pp. 207-218.Kuck, 0.1.. Sameh. A.H., Cytron, R., Veidenbaum, A.V., Polychrono-poulos. CD.. Lee. G., McDaniel, T., Leasure. B .R., Beckman, C..Davies, J.R .B., and Kruskal. C.P. The effects of program restructur-ing, algorithm change, and architecture choice on program perform -ance. In Proceedings of the 1984 lnternati onnl Conference on ParallelProcessing, R.M. Keller, Ed. IEEE Press, New York, Aug. 1984,pp. 129-138.Loveman. D.B. Program improvement by source-to-source transfor-mation. 1. ACM 24, 1 (Jan. 1977), 121-145.Lundstrom . SF.. Barnes, G.H. A controllable MIMD architecture. InProceedings of rhe 1980 Infernational Conference on Parallel Processing(BeIIaire, Mich.. Aug. 26-29). IEEE Press, New York, 1980. pp. 19-27.Mehrotra, P.. and Van Rosendale, J. The Blaze Language: A parallellanguage for scientific programming. Rep. 85-29, Institute for Com-puter Applications in Science and Engineering , NASA Langley Re-search Center, Hampton, Va., May 1985.

    38.

    39.

    40.

    41.

    42.

    43.

    44.

    45.

    Midkiff, S.P., and Padua. D.A. Compiler generated synchronizationfor Do loops. In Proceedings of fhe 1986 Inlernaliomzl Conference onParallel Processirlg (St. Charles, Ill., Aug. 19-22). IEEE Press, NewYork. 1986.Miura, K.. and Uchida. K. Facom vector proce ssor VP-loo/VP-200.In HighSpeed Computafio tl. NATO AS1 Series, vol. F7, IS. Kowalik.Ed. Springer-Verlag , New York, 1984. pp. 127-138.Nagashima. S.. Inagami, Y.. Odaka. T.. and Kawabe. S. Design con-sideration for a high-speed vector processor: The Hitachi S-810. InProceedings of fhe IEEE International Conference on Computer Design:VLSI irr Compulers. ICCD 84. (Port Chester. N.Y., O ct. 8-11). IEEEPress. New York, 1984. pp. 238-243.Ottens tein, K.J. A brief survey of implicit parallelism detection.In MlMD Computation: The HEP Supercompu ter arrd its Applications,J.S. Kowalik. Ed. MIT Press, Cambridge, Mass., 1985.Padua. D.A. Multiprocessors: Discussion of some theoretical andpractical problems. Ph.D. thesis, Rep. 79-990. Dept. of ComputerScience, Univ. of Illinois at Urbana-Champ aign. Oct. 1979.Padua, D.A., Kuck. D.J., and Lawrie, D.H. High-speed multi-processors and compilation techniques . IEEE Trans. Compur. C-29, 9(Sepl. 1980). 763-776.Polychronopoulos, C.D. Program restructuring, scheduling, and com-munication for parallel processor systems . Ph.D. thesis, Rep. 595,Center for Supercom puting Research and Development, Univ. ofIllinois at Urbana-Champaig n, 1986.Reif, I.H., and Lewis, H.R . Symbolic evaluation and the global valuegraph. In Corlference Record of the 4th Annual ACM Symposium onPrinciples of Programming Languages (POPL) (Los Angeles, Calif.,Ian. 17-19). ACM, New York, 1977, pp. 104-118.46. Scarborough , R.G.. and Kolsky, H.G. A vectorizing Fortran compiler.IBM 1. Res. Dev. 30. 2 (Mar. 1986). 163-171.47. Tang, P.. and Yew, P. Processor self-scheduling for multiple-nestedparallel loops. In Proceedings of the 1986 Infernational Conference onParallel P rocessiug (St. Charles, Ill., Aug. 19-22). IEEE Press, NewYork, 1986. pp. 528-535.48. Tho rnton, I.E. Design of a Computer: The Control Data 6600. Scott,Foresman and Co., Glenview, Ill., 1970.49. Triolet, R., Irigoin. F., and Feautrier, P. Direct para llelization of callstatements. In Proceedings of fhe SIGPLAN 86 Symposium on CompilerConsfrucfio n, SlGPLAN Not. 21, 7 (July 1986), 176-185.50. Veidenbaum. A. Program optimization and architecture design is-sues for high-speed multiprocessors . Ph.D. thesis. Dept. of ComputerScience, Univ. of Illinois at Urbana-Champaig n, 1985.51. Wegman. M., and Zadek. K. Constant propagation with conditionalbranches. In Conference Record of fhe 12th Annual ACM Symposiumon Principles of Progranwing Languages (POPL) (New Orleans, La..Jan. 14-16). ACM, New York, 1985, pp. 291-299.52. Wolfe, M.1. Optimizing supercompilers for supercomputers. Ph.D.thesis, Rep. 82.1105. D ept. of Computer Science, Univ. of Illinois atUrbana-Champa ign, Oct. 1982.53. Yasumura. M., Tanaka, Y.. Kanada, Y ., and Aoyama, A. Compilingalgorithms and techniques for the S-810 vector proce ssor. In Proceed-ings of rhe 1984 lrltenmtiom d Conference on Parallel Processing, R.M.Keller, Ed. IEEE Press, New York , Aug. 1984, pp. 285-290.

CR Categories and Subject Descriptors: C.1.2 [Processor Architectures]: Multiple Data Stream Architectures (Multiprocessors)-array and vector processors; multiple-instruction-stream, multiple-data-stream processors (MIMD); parallel processors; pipeline processors; single-instruction-stream, multiple-data-stream processors (SIMD); D.2.7 [Software Engineering]: Distribution and Maintenance-restructuring; D.3.3 [Programming Languages]: Language Constructs-concurrent programming structures; D.3.4 [Programming Languages]: Processors-code generation; compilers; optimization; preprocessors

General Terms: Languages, Performance

Authors' Present Addresses: David A. Padua, Center for Supercomputing Research and Development, University of Illinois at Urbana-Champaign, Urbana, IL 61801; Michael J. Wolfe, Dept. of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801, and Kuck and Associates, 1808 Woodfield, Savoy, IL 61874.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.