
To appear at The 34th Annual International Symposium on Microarchitecture, December 2001.

Enhancing loop buffering of media and telecommunications applications using low-overhead predication

John W. Sias, Hillery C. Hunter, Wen-mei W. Hwu
Center for Reliable and High-Performance Computing
Department of Electrical and Computer Engineering
University of Illinois at Urbana-Champaign
{sias, hhunter, hwu}@crhc.uiuc.edu

Abstract

Media- and telecommunications-focused processors, increasingly designed as deeply pipelined, statically-scheduled VLIWs, rely on loop buffers for low-overhead execution of simple loops. Key loops containing control flow pose a substantial problem: full predication has a high encoding overhead, and partial predication techniques do not support if-conversion, the transformation of general acyclic control flow into predicated blocks. Using a set of significant media processing benchmarks, drawn from MediaBench and contemporary telecommunications standards, we explore a compromise approach. We demonstrate a compiler using if-conversion and specialized loop transformations to arrange for 70-99% of fetched operations to come from a simple, statically managed 256-instruction loop buffer, saving instruction fetch power and eliminating branch penalties. To complement this we introduce a “niche” form of predication specialized to permit general if-conversion with only a single bit in the encoding of each operation and to eliminate much of the hardware overhead of a predicate register-based approach.

1. Introduction

Most modern high-performance DSP and media processor implementations (e.g., the TI ’C6x, Lucent DSP16000, TriMedia, and StarCore 140) are based on a VLIW design paradigm, with good reason: VLIW offers wide issue (today up to eight operations per cycle) with relatively little instruction issue overhead, clustering is natural and offers enhanced scalability [1], and compiler techniques such as software pipelining [2] effectively employ the VLIW’s many processing units in a wide variety of loop kernels.

In the embedded market, where power margins dictate use of the lowest possible clock frequency to achieve a given processing rate, cycles cannot be wasted waiting for branch resolution and instruction fetch. Traditional solutions such as branch predictors and instruction caches are usually considered too costly and undependable for inclusion in such processors. Thus, the statically-scheduled processor has difficulty dealing efficiently with control flow. Both elements are often “replaced” with a dedicated loop buffer (effectively a software-controlled, straight-line-code cache with knowledge of counted loops), allowing efficient looping fetch. Buffering offers benefits including accurate loop-back branch prediction, reduced power consumption due to localization of fetch, elimination of taken-branch bubbles, and, if data and instruction fetch share the same memory bus, reduced bus contention in key loop kernels [3]. Providing a useful buffer model and utilizing it effectively are thus important design and compilation goals.

A variety of techniques [4, 5] have addressed loops containing internal control flow, a special problem for VLIWs, but embedded implementation constraints discount several options. In general, loop buffers accommodate only simple loops (straight-line blocks of code with a loop-back branch at the end) or, in some restricted cases, loop nests. In general-purpose, VLIW-styled architectures such as Intel’s Itanium, on the other hand, full predicated execution support (a predicate register file, predicate defining operations, and a guard predicate operand on each operation) allows the compiler to implement general control without branches [6]. Such an approach, however, is considered impractical in embedded processors because of the encoding cost of the guard operands (operations on Itanium are 41 bits in length). Media and telecommunications processors therefore typically incorporate a form of partial predication, the addition of a conditional move operation or a few simple condition codes capable of guarding the execution of a handful of opcodes. As these approaches do not support if-conversion [7], the traditional, general algorithm for generating predicated code, compilers have difficulty taking best advantage of these operations. Hand assembly coding or tuning is today often required to take advantage of the buffers.

In this paper, we combine specialized compiler methods and a new predication implementation to address buffering of loops containing control flow. Our compiler transforms a set of media processing benchmarks (not kernels) such that 89.0% of operations fetched come from a 256-operation loop buffer (compared to 38.7% without the transformations), while achieving a 37.6% reduction in execution cycles.¹ To complement these techniques, we present a model of predication that is generally useful for execution of loop kernels, that consumes only a single bit in each operation, and that does not significantly increase the complexity of bypass logic. We propose that this form of predication is as applicable, within the benchmark set and architectural assumptions, as full predication, with its predicate register file and guard operands. In these demonstrations we assume a simple, addressable-memory style loop buffer which can be implemented in a variety of microarchitectural styles. In the end an interesting paradox emerges: carefully applying code-expanding transformations with the goal of increasing fetch regularity allows improved utilization of simple loop buffering, significantly reducing instruction fetch overhead and power consumption.

2. Architecture and application background

Development of image and signal processing in third-generation (3G) cellular and other hand-held products has brought about significant changes in the design of DSP and embedded processors. Along with a shift from a Harvard memory architecture, with dedicated data and instruction memories, to a more general-purpose von Neumann architecture, specially-tailored complex instruction sets are being replaced with deeply pipelined, statically-scheduled LIW and VLIW designs, allowing greater throughput and generality with lower hardware overhead. Given the impact of proportionally long (generally 3 to 5 cycle [8]) branch penalties on the performance of tight DSP loops, architects have also begun to include forms of conditional execution and loop support in their instruction sets. Consensus on the best implementation has not yet been reached.

Texas Instruments’ ’C6x line, one of the first DSP families to adopt a VLIW style and a 32-bit datapath, architects eight issue slots and provides for operation execution to be guarded by condition codes [8]. Four bits of each opcode indicate which of five condition registers guards the instruction and whether a zero or a non-zero value causes the instruction to be nullified. The ’C6x line exposes five branch delay slots rather than incorporating a special mechanism for zero-overhead looping, the ability to execute counted loops without incurring a loop-back branch penalty.

The StarCore SC140 processor includes both a loop buffer and predication support in a 16-bit instruction set. Four sets of loop registers allow for four levels of nested counted loop execution. Using this feature, however, does not guarantee elimination of branch-back overhead, and requires that a host of restrictions be observed [3]. Such a feature is likely useful in hand-coded kernels. SC140 has a single condition register which can be used to guard the odd, the even, none, or all of the operations in each bundle.

¹ Excluding mpeg2enc and jpegenc from MediaBench.

Table 1: Benchmarks

Benchmark        Description and input
adpcm{enc,dec}   ADPCM codec. Input: clinton.pcm
g724{enc,dec}    ETSI GSM 06.60 speech transcoding [10]. Input: 363 frames, speech and ambient noise
jpeg{enc,dec}    (MediaBench) Independent JPEG Group photo codec. Input: testimg.jpg
mpeg2{enc,dec}   (MediaBench) Video codec. Input: mei16v2.m2v
mpg123           MPEG-2 Layer 3 audio decoder. Input: short.mp3
pgp{enc,dec}     Pretty Good Privacy codec. Input: pgptest.plain

One of the most flexible hardware loop support systems commercially available is found on the ST120 DSP/Microcontroller core, a scoreboarded 4-issue LIW with 32-bit instructions [9]. The ST120 provides hardware support for up to three loops, which may be nested, overlapped, or independent of one another. Instruction selection is limited within the loop bodies. Like the ’C6x, most opcodes can be guarded by condition codes; some instructions, however, have access to only one condition register, while others may choose from among up to 16.

Many real systems today incorporate both one of the above DSP cores and a controller or host processor. This fact, combined with the proprietary nature of many commercial algorithmic implementations, limits the set of complete media and telecommunications applications available for study in C source code. Thus, we are currently limited to MediaBench [11], which focuses on media and communications tasks, and other commonly available “component” applications, still a significant advance over kernel-only studies. Table 1 lists the benchmarks used in this study. We replaced MediaBench’s g721 benchmark with a more up-to-date and more complex codec, which we call g724 [10]. The goal of this set of benchmarks is to represent the algorithms supported by near-future hand-held wireless devices, or by base stations accommodating communications with such devices, both of which are highly power- and performance-sensitive applications. Both jpegenc and mpeg2enc push this envelope, and are included only for comparison. The benchmarks were compiled and simulated in the IMPACT Research Environment, in which intrinsic emulation support implements operations such as saturating arithmetic which would be provided on a DSP-oriented processor.

3. Compilation techniques

In a VLIW context, the opposition of branches to parallelism must be overcome by the compiler. The techniques we describe are not “optimizations” in the traditional sense, as they can and do result in the execution of more operations rather than fewer. Nevertheless, they improve overall execution efficiency by using otherwise idle resources to mitigate branch penalties, such as misprediction latency, hard-to-fill delay slots, and control-based dependence height. This places the following transformations, implemented in the context of the IMPACT compiler framework [12], in the general realm of instruction-level parallel (ILP) transformations.

[Figure 1: Nested loop transformations. (a) Loop peeling; (b) Loop collapsing. Legend: hyperblock / buffer region; code added to inner loop.]

As previously stated, the use of a loop buffering technique requires removal of all internal control flow from the loop to be buffered. If-conversion [7] converts any acyclic region of control flow into an equivalent single-entry, straight-line segment of code, called a hyperblock [13]. If the entire body of a loop can be if-converted, it can then be buffered (provided the buffer is of sufficient size). This alone suffices for loops with acyclic internal control flow.
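As a source-level illustration of if-conversion (the compiler itself operates on machine operations, not on C), the sketch below contrasts a loop body containing an if/else diamond with a branch-free form in which each update is guarded by a predicate. The function names and the clamp-style computation are hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Loop body with an if/else diamond (internal control flow). */
int sum_branchy(const int *x, size_t n, int limit)
{
    assert(x != NULL || n == 0);
    int sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (x[i] > limit)
            sum += limit;      /* "then" path */
        else
            sum += x[i];       /* "else" path */
    }
    return sum;
}

/* If-converted body: both paths are issued every iteration, and the
 * complementary predicates p and q decide which update takes effect,
 * so the body is one straight-line block. */
int sum_predicated(const int *x, size_t n, int limit)
{
    assert(x != NULL || n == 0);
    int sum = 0;
    for (size_t i = 0; i < n; i++) {
        int p = (x[i] > limit);   /* plays the role of a ut-type destination */
        int q = !p;               /* plays the role of a uf-type destination */
        sum += p * limit;         /* (p) add sum = sum, limit */
        sum += q * x[i];          /* (q) add sum = sum, x[i]  */
    }
    return sum;
}
```

Both functions compute the same result; only the second has a loop body free of internal branches, which is the property a simple loop buffer requires.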

For nested loops, however, other techniques must be applied. Buffering the outer loop, as opposed to just the inner loop, is beneficial when the instruction and trip counts of the inner loop are small, a situation that abounds in media and telecommunications applications. Rather than proposing a more complicated loop buffer, which could handle some limited cases of multiply-nested loops, we present two compiler techniques that transform nested loops into simple ones. Figure 1 illustrates these techniques. In (a), the loop with header “A” contains an inner loop which is known to have four iterations. Provided that the inner loop contains a reasonable number of instructions, it can be eliminated by peeling it completely. We heuristically peel any counted loop of fewer than six iterations, so long as peeling would create fewer than 36 instructions.
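The peeling heuristic stated above can be sketched as a simple test. This is a minimal sketch, assuming that the size of the fully peeled code is approximated by the trip count times the inner-loop instruction count; `should_peel` and its parameters are hypothetical names, not IMPACT interfaces.

```c
#include <assert.h>
#include <stdbool.h>

enum { PEEL_MAX_TRIPS = 6, PEEL_MAX_INSTRS = 36 };

/* Returns true when a counted inner loop qualifies for complete peeling:
 * fewer than six iterations and fewer than 36 instructions after peeling. */
bool should_peel(int trip_count, int body_instrs)
{
    assert(trip_count >= 0 && body_instrs >= 0);
    if (trip_count >= PEEL_MAX_TRIPS)
        return false;
    /* Assumption: peeled size is estimated as trip_count * body_instrs. */
    return trip_count * body_instrs < PEEL_MAX_INSTRS;
}
```

Under this estimate, a four-iteration loop of eight instructions would be peeled (32 instructions), while a four-iteration loop of nine instructions would not (36 instructions).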

When, however, block “B” contains a substantial amount of code or is of unknown duration, loop peeling is less attractive because of the code expansion cost. When the number of instructions in the outer loop is small relative to the inner loop, and when the number of iterations of the inner loop in any given iteration of the outer loop is not excessive, there is an alternative. Figure 1(b) shows the effect of predicated loop collapsing, which pulls code from an outer iteration into an inner iteration body. Loop collapsing is a documented technique for converting a nested loop iterating over a matrix into a single loop operating linearly (in this case the outer loop is trivially reducible). The form presented here allows any outer loop to be pulled in by predicating the outer loop code so that it executes no more frequently than it originally did. If the impact of absorbing the instructions of the outer loop (those in blocks “A” and “F” in (b)) is smaller than the overhead of entering and leaving the loop buffer and taking the outer loop-back branch, this results in increased performance and increased loop buffer efficiency. Given typically long branch resolution latencies and the fact that there are almost always at least a few NOPs in even the optimal modulo schedule of the inner loop, predicated loop collapsing is often beneficial.

Care must be exercised in choosing loops for collapsing, however, because incorporating instructions from outer loops into the inner loop can engender both resource and dependence penalties. Figure 2 shows a doubly-nested loop from the MediaBench benchmark mpeg2dec. The source code, shown in (a), and the loop’s initial machine-level representation, shown in (b), indicate a low trip count inner loop with a small amount of code in its parent loop. (Note that (b) shows the loop after traditional loop optimizations have been applied, including the promotion of *bp to a register for the duration of the loop.) Although small enough to have been peeled, this instance also provides a convenient example of the application and benefit of collapsing. Here, as in general, collapsing avoids the static code expansion associated with peeling. In collapsing and if-converting this loop, the first and third blocks shown in (b) are pulled into the inner loop body and guarded under a predicate that causes them to execute only on each eighth iteration of the resulting simple loop, resulting in the bufferable form shown in Figure 2(c). This is accomplished as shown in Figure 1(b), by replicating these blocks into a new hammock (as was done with blocks “A” and “F” in the previous example) prior to if-conversion. Finally, (d) shows the code after further transformations and installation of a special counted loop branch set to 64 iterations, the total number of iterations of the inner loop. (In (c) and (d), the new loop preheader, corresponding to the first block in (b), is not shown.)

(a) Source code

    for (i=0; i<8; i++){
      for (j=0; j<8; j++){
        *rfp = Clip[*bp++ + 128];
        rfp++;
      }
      rfp+=incr;
    }

(b) Original generation

    OUTER:
        mov r2 = 0
    INNER:
        ld r5 = [r3]
        st [r4] = r5
        add r3 = r3, 1
        add r4 = r4, 1
        add r2 = r2, 1
        br lt r2,8 INNER
        add r1 = r1, 1
        add r4, r4, r6
        br lt r1,8 OUTER

(c) After loop collapsing

    OUTER:
        ld r5 = [r3]
        st [r4], r5
        add r3 = r3, 1
        add r4 = r4, 1
        add r2 = r2, 1
        cmp p1_ut = (r2 == 8)
      * (p1) mov r2 = 0
      * (p1) r4 = r4 + r6
      * (p1) add r1 = r1, 1
      * (p1) br ge r1,8 DONE
        jump OUTER
    DONE:

(d) After additional transformations

    OUTER:
        cmp p1_ut, p2_uf = (r2 == 7)
        ld r5 = [r3++]
        st [r4++], r5
      * (p1) mov r2 = 0
        (p2) add r2 = r2, 1
      * (p1) r4 = r4 + r6
        br.cloop 64 OUTER

    * = instruction from outer loop

Figure 2: Loop collapsing code example from mpeg2dec Add_Block()
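The effect of predicated loop collapsing on the Figure 2 loop can also be sketched at the C level. This is a minimal sketch under stated assumptions: the function names, the `clip` table layout, and the argument types are hypothetical, and the `if (p1)` and `if (p2)` statements stand in for predicated operations rather than real branches.

```c
#include <assert.h>
#include <stddef.h>

/* Nested form, after Figure 2(a): 8 rows of 8 stores, with rfp advanced
 * by an extra `incr` at the end of each row. */
void store_block_nested(unsigned char *rfp, const short *bp,
                        const unsigned char *clip, int incr)
{
    assert(rfp != NULL && bp != NULL && clip != NULL && incr >= 0);
    for (int i = 0; i < 8; i++) {
        for (int j = 0; j < 8; j++) {
            *rfp = clip[*bp++ + 128];
            rfp++;
        }
        rfp += incr;
    }
}

/* Collapsed form, after Figure 2(d): a single counted loop of 64 iterations.
 * The outer-loop work (reset j, advance rfp by incr) is guarded by p1 so
 * that it takes effect only on every eighth iteration. */
void store_block_collapsed(unsigned char *rfp, const short *bp,
                           const unsigned char *clip, int incr)
{
    assert(rfp != NULL && bp != NULL && clip != NULL && incr >= 0);
    int j = 0;
    for (int n = 0; n < 64; n++) {        /* br.cloop 64 */
        int p1 = (j == 7);                /* cmp p1_ut, p2_uf = (j == 7) */
        int p2 = !p1;
        *rfp++ = clip[*bp++ + 128];       /* ld/st with post-increment */
        if (p1) { j = 0; rfp += incr; }   /* (p1) outer-loop operations */
        if (p2) { j = j + 1; }            /* (p2) inner counter update */
    }
}
```

Because p1 and p2 are mutually exclusive, the reset of `j` and its increment never both take effect in the same iteration, which mirrors the partial dead-code removal visible in Figure 2(d).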

It is important that this transformation not severely impact the resource or recurrence constraints of the loop, since these are important to subsequent, performance-critical transformations such as modulo scheduling. While it is possible that the need to eliminate the inefficiencies of non-buffered execution outweighs a possible resource or recurrence degradation, this balance might vary from architecture to architecture. Height-reducing transformations such as those reflected in Figure 2(d) help to ensure a benefit. Here, in particular, we see expression reassociation (allowing the upward motion of the predicate define) and partial dead-code removal (the store of 0 to r2 can now execute in parallel with the increment of r2, since they execute on mutually exclusive predicates). In addition, the loop-back branch is transformed to a special counted loop form, eliminating the inductor, and directing instruction fetch to fall out of the loop buffer on the last iteration. In this case, such optimizations are able to maintain the recurrence constraint of the inner loop even when the outer loop is collapsed into it. Provided that the inner loop schedule can accommodate the two extra instructions, the outer loop can be pulled into the inner loop with no adverse performance impact. Thus collapsing, like peeling, is able to improve performance and efficiency by keeping execution within the loop buffer for longer periods of execution time. Predicated loop collapsing was applied to 52 doubly-nested loops in the benchmark set, making them candidates for buffering.

Hyperblock side exits also present a problem for loop buffer execution, since the machine’s branch resources are, appropriately, very limited. Application execution profiles reveal that, in many cases, hyperblock side exit branches are numerous but very infrequently taken. In these instances, a technique known as branch combining transforms several branches into a single predicated jump, guarded by a “summary predicate.” The summary predicate, computed using parallel or-type compares, is set to 1 when any exit from the loop is required; when any one of these branches would have been taken, a summary jump directs execution to a “decode block” where the originally-desired control flow direction is discerned.
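Branch combining can be sketched at the source level as follows. This is an illustrative sketch, not the IMPACT implementation: the scanning loop, the exit reasons, and all names are hypothetical. The two rarely taken side exits of the first function are replaced in the second by one summary test, with a decode step after the loop that recovers which exit was intended.

```c
#include <assert.h>
#include <stddef.h>

enum exit_reason { EXIT_NONE = 0, EXIT_NEGATIVE = 1, EXIT_TOO_LARGE = 2 };

/* Original form: two side-exit branches inside the loop body. */
enum exit_reason scan_two_exits(const int *x, size_t n, int max, long *sum)
{
    assert(sum != NULL && (x != NULL || n == 0));
    *sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (x[i] < 0)   return EXIT_NEGATIVE;    /* side exit 1 */
        if (x[i] > max) return EXIT_TOO_LARGE;   /* side exit 2 */
        *sum += x[i];
    }
    return EXIT_NONE;
}

/* Branch-combined form: both exit conditions accumulate into one summary
 * predicate (as or-type defines would), and a single jump leaves the loop. */
enum exit_reason scan_combined(const int *x, size_t n, int max, long *sum)
{
    assert(sum != NULL && (x != NULL || n == 0));
    *sum = 0;
    size_t i;
    int summary = 0;
    for (i = 0; i < n; i++) {
        summary = 0;
        summary |= (x[i] < 0);      /* or-type define into summary */
        summary |= (x[i] > max);    /* or-type define into summary */
        if (summary)
            break;                  /* the single summary jump */
        *sum += x[i];
    }
    if (!summary)
        return EXIT_NONE;
    /* Decode block: re-test the conditions in their original order. */
    return (x[i] < 0) ? EXIT_NEGATIVE : EXIT_TOO_LARGE;
}
```

The decode step re-evaluates the original conditions in program order, so the combined loop reports the same exit as the original while spending only one branch per iteration on exits.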

The IMPACT compiler’s VLIW compilation support, predication, control speculation, alias analysis, profile-guided inlining, profile-directed compilation, modulo scheduling, and other traditional and ILP transformations were all applied to produce high-quality code as a basis for these experiments. Execution profiling is critical, as it helps to distinguish important paths from those less frequently executed, a necessity for many ILP techniques, including hyperblock formation. Profiling also directs function inlining, which is performed to enhance formation of loop regions, since loop regions in our implementation may not contain calls to subroutines. In the experiments that follow, profile-guided inlining was performed up to an estimated limit of 50% static code expansion. Pointer analysis is important for disambiguating pointer-based loads and stores in C code, and is important to both optimization and instruction scheduling. Finally, it is necessary for the compiler to be able to understand the relations among predicates to perform effective optimization on and around predication. These techniques form the foundation of a useful predicated ILP compiler, and were all applied in the experiments that follow.

4. Predication model

Most modern implementations of full predication are descended from the HP Labs HPL-PD design [14], a generalization of the model developed for the Cydrome Cydra 5 [15]. The IMPACT model [16], one derivative, specifies the predicate define:

    (Pg) cmp Pd0_type0, Pd1_type1 = (src0 [cond] src1)

Here the guard predicate (Pg) and comparison (src0 [cond] src1, where [cond] can be =, ≠, >, etc.) are shared by the operation’s two predicate computations, which potentially write to predicate registers Pd0 and Pd1. Table 2 shows the functions implemented by the various predicate define types, indicating how each destination is updated based on the computation type, guard predicate Pg and condition C. In the table, a ’-’ indicates that no update occurs. The two types required for if-conversion are the unconditional (ut/uf) type, which computes simple conditions, and the or (ot/of) type, which is used to compute compound conditions (e.g., (x<0)||(x>3)). Each operation possesses a guard predicate operand (Pg). Lacking the or-type predicate and the ability to predicate all operations, most partial implementations of predication are limited to eliminating simple control flow graph hammocks and diamonds, but cannot handle general control flow.

Table 2: Predicate definition truth table.

Pg  C   ut  uf  ot  of  at  af  ct  cf
0   0   0   0   -   -   -   -   -   -
0   1   0   0   -   -   -   -   -   -
1   0   0   1   -   1   0   -   0   1
1   1   1   0   1   -   -   0   1   0
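The update rules of Table 2 can be written as a small function. This is a sketch for illustration only: the enum, the `NO_UPDATE` sentinel, and the function name are hypothetical, and the return value models the table entry (0, 1, or no update) for one destination.

```c
#include <assert.h>

enum pred_type { PT_UT, PT_UF, PT_OT, PT_OF, PT_AT, PT_AF, PT_CT, PT_CF };

#define NO_UPDATE (-1)   /* models a '-' entry in Table 2 */

/* Value written to a destination of type t, given guard pg and condition c
 * (each 0 or 1), or NO_UPDATE when the destination is left unchanged. */
int pred_define_value(enum pred_type t, int pg, int c)
{
    assert((pg == 0 || pg == 1) && (c == 0 || c == 1));
    switch (t) {
    case PT_UT: return pg & c;                       /* always writes */
    case PT_UF: return pg & !c;                      /* always writes */
    case PT_OT: return (pg && c)  ? 1 : NO_UPDATE;   /* or-type: may set 1 */
    case PT_OF: return (pg && !c) ? 1 : NO_UPDATE;
    case PT_AT: return (pg && !c) ? 0 : NO_UPDATE;   /* and-type: may clear */
    case PT_AF: return (pg && c)  ? 0 : NO_UPDATE;
    case PT_CT: return pg ? c  : NO_UPDATE;          /* writes only if guarded on */
    case PT_CF: return pg ? !c : NO_UPDATE;
    }
    return NO_UPDATE;
}
```

For a compound condition such as (x<0)||(x>3), the destination would first be cleared and then targeted by two ot-type defines, one per comparison; either define can set it to 1 and neither can clear it.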

Full predication is typically based on a register-storage model, making it easy to represent and manipulate in the compiler. Implemented in hardware, it involves the addition of a new register file, the addition of new bypass logic for predicate values, the modification of existing bypass logic to allow squashing of nullified operations, and, perhaps most costly, the addition to each operation of a guard predicate operand. The IMPACT model, like the Itanium model [6], consumes 41 bits per operation, today unacceptable in an embedded domain. Bits in embedded operation encodings are at a premium, as memory size is limited and ILP techniques critical to performance in media applications need many registers to express enough parallelism for wide-issue cores. Adding a field for a guard predicate reduces addressable general register space: providing only eight predicate registers takes three bits per operation, cutting the addressable general register space in half (assuming a three-operand opcode format). Thus, while we use the flexible IMPACT predication scheme during most of the compilation process, a similarly general (i.e. allowing if-conversion) but less costly scheme must be implemented in hardware.
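The register-space arithmetic in this argument can be checked with a short sketch. The 18-bit operand budget used in the usage note is a hypothetical figure chosen for illustration, not an encoding taken from any of the processors discussed, and the function name is likewise an assumption.

```c
#include <assert.h>

/* Number of general registers addressable by each register field when
 * `operand_bits` encoding bits are divided evenly among `num_operands`
 * fields after reserving `guard_bits` for a guard predicate operand. */
unsigned addressable_registers(unsigned operand_bits, unsigned guard_bits,
                               unsigned num_operands)
{
    assert(num_operands > 0 && guard_bits <= operand_bits);
    unsigned bits_per_field = (operand_bits - guard_bits) / num_operands;
    assert(bits_per_field < 32);
    return 1u << bits_per_field;
}
```

With 18 bits shared by three operands, each field has 6 bits and addresses 64 registers. Reserving 3 bits for a guard naming one of eight predicate registers leaves 5 bits per field, or 32 registers: one bit lost per operand, hence half the register space.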

In the context of a VLIW, the compiler has the benefit of knowing to which execution units operations will issue, and in what order. Several clustered architectures already take advantage of this by requiring the compiler to partition operations into connected subgraphs for execution in separate clusters of a sparsely-connected machine [1, 17]; others have applied this principle to avoid writing dead values back into the register file after forwarding has occurred in the pipeline, as a power optimization [18]. Taking these ideas a step further, it is possible to conceive of a general predication system wherein predicates are explicitly “source-routed” from predicate defining operations directly to the units and, indeed, the very operations they are to control. This concentrates changes to the instruction set in the predicate defines themselves, rather than in the consumers (which a register-based system, for example, does not do). Additionally, the hardware required to implement such a system would be easier to incorporate into an existing pipeline than would be a predicate bypassing network. Such an approach, however, is only practical when the number of consumers per predicate is small, or when several consumers can be grouped together in an easily addressed unit; otherwise, a stifling number of predicate defines are required.

4.1. Benchmark predication characteristics

Examining the predication support required in the selected benchmarks guides selection of an appropriate representation. Loop kernels, and in particular, straight-line modulo pipelined kernels, are of primary concern since these are the code regions targeted for inclusion in the loop buffer and therefore the regions in which effective predication support is critical. Having compiled the benchmarks for an implementation of the IMPACT model with eight predicate registers, using aggressive traditional and ILP optimizations, control speculation, modulo scheduling, and the techniques described in the previous section, we examine how extensively the benchmarks used the freedom of full predication and how that model could be reduced (within this application domain) to an equally effective but less costly form.

The studied applications contain 564 modulo pipelined candidate loops, of which 122 use predication. Figure 3 shows three metrics of the predication applied in the benchmarks; “static” metrics refer to physical instances of operations, while “dynamic” metrics refer to the number of times the machine encounters an operation while running the benchmark input. Figure 3(a) shows the cumulative distribution of the number of consumers per predicate define (e.g., 97% of predicates generated guard three or fewer operations). More than 8% of predicate defines issued have multiple consumers; about 2% of predicate defines issued have more than four, and some have as many as 16. Figure 3(b) illustrates another problem: over 3% of predicate live ranges last more than 8 cycles, indicating that regenerating nullifications at a later time for multiple consumers may increase general register pressure as well as operation count (since comparison source operands would need to be preserved). Examining the direct-control system in this light, it appears that, while in the common case performance could be reasonable, degradation in the multiple-consumer case is catastrophic. Clearly, a one-to-one or one-to-two operation nullification scheme is impractical in most cases, and encoding more than that in a 32-bit operation format would be a significant challenge. Likewise, nullifying all operations in a slot for a given number of cycles, a simple addressing scheme, does not make best use of the machine.

[Figure 3: Media application predication (cumulative distributions). (a) Predicate consumers (by define); (b) Predicate live range duration (by define); (c) Predicate live range overlap (by loop).]

A compromise approach places a single bit in each operation that indicates whether or not it is “sensitive” to its guard predicate. Predicate defines then set a single predicate for each slot which has the power to nullify all subsequent sensitive operations in that slot. This scheme is efficient as long as the number of simultaneously live predicates is less than the number of slots available in the machine, and as long as the scheduler has enough freedom to order dependent uses of the same predicate into one or a few slots. To test this hypothesis, the code was prepass- and modulo-scheduled given infinite virtual predicate registers, and then colored to eight physical predicates (no spilling of predicates was required). Thus, given a general (uniform) issue machine, eight predicated slots are sufficient to implement all the program’s predication. As Figure 3(c) shows, only four predicates are sufficient to implement 99% of the dynamic iterations of the 122 predicated loops. Where insufficient slots exist to maintain the live predicates, either extra predicate defines must be inserted to regenerate predicate values in split live ranges, or scheduling and optimization aggressiveness must be reduced. Since such intervention appears largely unnecessary in these benchmarks, this model effectively balances implementation cost and generality.
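The feasibility condition on simultaneously live predicates can be illustrated with a small sketch. It is not the coloring procedure used in the experiments: it assumes straight-line live ranges given as closed cycle intervals (ranges that wrap around a modulo schedule are not modeled), and the struct, function names, and the at-most-one-live-predicate-per-slot reading of the condition are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* A predicate live range: defined in def_cycle, last consumed in
 * last_use_cycle (closed interval; an illustrative assumption). */
struct live_range { int def_cycle; int last_use_cycle; };

/* Largest number of live ranges that overlap in any one cycle. The maximum
 * overlap is always reached at the start of some range, so only those
 * cycles are examined. */
int max_live_predicates(const struct live_range *r, size_t n)
{
    assert(r != NULL || n == 0);
    int best = 0;
    for (size_t i = 0; i < n; i++) {
        int live = 0;
        for (size_t k = 0; k < n; k++)
            if (r[k].def_cycle <= r[i].def_cycle &&
                r[i].def_cycle <= r[k].last_use_cycle)
                live++;
        if (live > best)
            best = live;
    }
    return best;
}

/* Assumption: each slot can hold one live predicate at a time. */
int fits_in_slots(const struct live_range *r, size_t n, int num_slots)
{
    return max_live_predicates(r, n) <= num_slots;
}
```

For plain intervals, the maximum overlap equals the number of predicate names needed, so a loop whose overlap never exceeds the slot count needs no extra predicate defines to regenerate split live ranges.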

4.2. Slot-based predication

Figure 4 shows one possible 32-bit encoding of the predicate defining operations. Commonly used pairs of destination types are selected as units, much as in Itanium. With another two bits (perhaps using more of the opcode space) it is possible to encode all the destination type pairs, but the extra combinations are only infrequently used. The predicate defines can be relatively complex because the destination encoding size is small, given that there are only eight slots in the machine. It is likely that a VLIW beyond this size would have clusters, each of eight slots or less, so this scheme generalizes easily in that direction. In that case, a predicate define would control only slots in its own cluster. Since there is no constant-value predicate (usually p0) in which to “dispose of” unwanted results, a single-destination predicate define specifies the same slot twice. In such a case, the second definition is defined to have no effect. The predicate define, aside from writing to slots as opposed to standard predicate registers, is fairly ordinary. Under HPL-PD / IMPACT predicate define semantics (Table 2), update values are independent of the previous value of the destination predicate; thus, define evaluation does not require the previous destination value.

The more novel part of this scheme is what happens with the predicate values themselves. Rather than being placed in a predicate bypass network and being written into a predicate register file, as in a traditional implementation, predicate updates here are sent directly to the slots they control. Since the machine is explicitly scheduled and the computation of the predicate value is of known duration (presumably one cycle), predicate updates are simply sent from the generating unit to the consuming slot, where they modify the predicate visible to subsequent issuing operations. An update, sent only when the appropriate entry of Table 2 is zero or one, specifies the new value to be written and implies that a write should take place. It is allowable for two defines to write simultaneously to the same slot as long as they write the same value, as can occur with or-type defines. The compiler prevents two defines which could write 0 and 1 to the same slot from being scheduled together. Under this scheme, each slot has one “standing predicate” which remains the same until reset by a predicate define; operations in that slot with their “predicate sensitivity bit” set are guarded on their slot’s standing predicate. Since the statically-defined ordering of bundles enforced by VLIW issue maintains the dependences between predicate defines and predicate consumers, register names and scoreboarding of predicates are unnecessary.
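The statement that an update is sent only when the define produces a zero or a one can be made concrete with a small sketch. Table 2 is not reproduced in this section, so the mapping below is an assumption: it follows the define types as they are commonly described for HPL-PD (unconditional, conditional, or-type, and and-type, each in a true and a false sense) and should be checked against Table 2. The function name `define_update` is hypothetical.

```python
def define_update(kind, guard, cond):
    """Return the value (0 or 1) a define sends to a slot, or None for no write.

    kind  : two-letter define type such as "UT", "CF", "OT", "AF"
            (assumed HPL-PD style naming; verify against Table 2)
    guard : truth value of the define's own guard predicate
    cond  : result of the comparison performed by the define
    """
    base, sense = kind[0], kind[1]
    c = bool(cond) if sense == "T" else not cond   # "F" types use the inverted test
    if base == "U":                 # unconditional: always writes
        return int(bool(guard) and c)
    if base == "C":                 # conditional: writes only under a true guard
        return int(c) if guard else None
    if base == "O":                 # or-type: can only set the predicate to 1
        return 1 if (guard and c) else None
    if base == "A":                 # and-type: can only clear the predicate to 0
        return 0 if (guard and not c) else None
    raise ValueError("unknown define type: " + kind)
```

In this sketch no return value depends on the previous value of the destination, which matches the property noted above, and two or-type defines targeting the same slot can only ever agree on the value 1.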

Figure 4 shows a conceptual diagram of the pipeline’s predication “harness.” The figure depicts the predication-related features of three of the machine’s eight slots: here, slots 0 and 1 both generate and consume predicates, and slot 7 consumes them. At the top of the diagram are decoded instruction fields which control the predication features. The p bit indicates whether or not an operation is sensitive to nullification by its guard predicate. An operation is nullified (or the guard predicate is considered to be 0, in the case of the predicate defining operations) only when the p bit is 1 and the computed predicate, stored in the guard latch, is 0. Predicates are distributed on a 16-bit bus, which contains a value line and a write line for each slot. The bit on the value line is latched when the write line is active. Both these lines are driven by tri-state drivers in each predicate-generating unit; during an update, the predicate define unit activates the target slot’s write line and drives the target slot’s value line to the desired value.

Figure 4: Predicate defines and predication support features. The figure gives two encodings of the predicate define and a diagram of the predication harness for slots 0, 1, and 7 (operation fields, comparator, predicate logic / driver, guard latch, and functional unit).

Register operand form: p | cmp | test | type | src0 | src1 | slot0 | slot1
Register-immediate form (single destination slot): p | cmp.i | test | type | slot0 | src0 | imm9
Tests: eq, lt, lteq, ltu
Types: OT/OT, AT/AT, CT/CT, UT/UT, CF/CT, UF/UT, OF/AT, AF/OT, ...
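The behavior of the harness described above can be summarized in a short behavioral sketch. It is an illustration under stated assumptions, not a description of the actual hardware: the class name `SlotPredicates`, its methods, and the initial guard value of 1 are all hypothetical. Each slot holds one standing predicate; an operation is nullified only when its p bit is 1 and its slot’s guard latch holds 0; and simultaneous writes to one slot are accepted only if they carry the same value.

```python
class SlotPredicates:
    """Behavioral model of per-slot standing predicates (illustrative only)."""

    def __init__(self, num_slots=8):
        # One guard latch per slot; starting at 1 (nothing nullified) is an assumption.
        self.guard = [1] * num_slots

    def issue(self, slot, p_bit):
        """Return True if an operation issued in `slot` executes, False if nullified."""
        return not (p_bit == 1 and self.guard[slot] == 0)

    def update(self, writes):
        """Apply the (slot, value) writes produced by the defines of one cycle.

        Two writes to the same slot are allowed only if they agree, mirroring
        the rule that the compiler never schedules conflicting defines together.
        """
        pending = {}
        for slot, value in writes:
            if pending.get(slot, value) != value:
                raise ValueError("conflicting writes to slot %d" % slot)
            pending[slot] = value
        # Latch the new values; later-issuing operations see the updated guards.
        for slot, value in pending.items():
            self.guard[slot] = value
```

A cycle-by-cycle use of this model would call `issue` for every operation in a bundle and then call `update` with the writes generated by that bundle’s defines, so that the new predicate values become visible to subsequently issuing operations.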

4.3. Code generation

In this approach, a slot’s predicate remains set to a given value until it is reassigned, but nullifies only sensitive operations; the compiler is thus free to intersperse predicated and non-predicated operations in a given slot, but only one predicate is available in each slot at any given time. Predicate defines and predicate consumers must be scheduled such that live ranges, now tied to the issue slot of the consumer, do not interfere. This differs little from allocating the predicates to eight registers; however, two new constraints appear. First, each physical predicate, associated with one slot, can guard only one operation per cycle. Second, and more seriously, in a non-uniform machine, in which different slots perform different operations, it may be necessary to replicate a logical predicate to multiple slots if it has different types of consumers. To assist in providing the necessary control, most predicate defines can supply two slot predicates, as indicated in Figure 4. As indicated in the empirical studies above, these new constraints do not appear to be serious limitations for the studied applications and architecture. The complexity of assigning predicated instructions to slots and of providing the necessary predicate defining instructions increases significantly with the asymmetry of the machine and, in the compiler, requires tighter coupling of operation scheduling to mechanisms traditionally part of the register allocator.

As is usual with registers, it is beneficial to keep predicate lifetimes as short as possible to enable best reuse of resources. One technique that helps to do this in the case of these predicates is predicate promotion, the removal of a guard from an operation that may safely be executed when the predicate is false (although the result is unneeded) [13]. By removing the predicates from all but those that absolutely require guards, the compiler reduces the stress on this critical resource. Given these efforts, 21.5% of dynamic operations in predicated loops are sensitive to predicates (9.9% in all bufferable loops). Compiler and architectural support for general speculation and a generous supply of functional units are critical components of this promotion-based model.

Table 3: Buffer management operations

Operation | Functionality
rec_cloop buf_addr, num, count | Buffer num subsequent operations at address buf_addr, if not already in buffer, and commence count iterations. Fall through to operation after br_cloop.
rec_wloop buf_addr, num | Buffer num subsequent operations at address buf_addr, if not already in buffer, and iterate until br_wloop fails. Fall through to operation after br_wloop.
exec_cloop buf_addr, count | Execute the loop buffered at buf_addr count times. On exit, continue after exec_cloop operation.
exec_wloop buf_addr, count | Execute the loop buffered at buf_addr until its br_wloop falls through. On exit, continue after exec_wloop operation.

5. Loop buffers

A variety of loop buffer implementations exist in the DSP market. We choose to implement the loop buffer as a compiler-managed cache, mapped architecturally into the instruction address space, but residing on-chip in a physically different location. The compiler manages the buffer as a resource, scheduling loop bodies into segments of the buffer as required. As alluded to in Section 6, the goal of scheduling loops into the buffer is to minimize the total number of bundles fetched from the global memory.

The compiler controls buffering by means of the four operations shown in Table 3. To buffer a counted loop, for example, the compiler prefaces the loop with the (branch unit) operation rec_cloop buf_addr, num, count. This instructs the instruction fetch unit to start buffering the subsequent num-operation loop at the buffer offset buf_addr, and to execute that loop count times. During the first loop iteration, the loop is both being executed and being stored into the buffer; subsequent iterations are executed directly from the loop buffer. When the loop is finished executing in the buffer, control must be returned to the global fetch mechanism. Since the size of the loop body and the address of the initial rec_cloop operation are known, this address is easily computed; the following bundles could even be prefetched by the instruction fetch logic if desired. The rec_wloop operation functions similarly for loops of unknown duration, but does not prepare the loop buffer to correctly predict loop exit like the cloop version.

The exec_[cw]loop operations execute a loop known already to be stored in the buffer, returning to the operation after the exec_[cw]loop on exit. These enable a buffered loop to be reused from different locations in the code, almost like a procedure call, as a code size optimization.

With a small amount of additional hardware, a table can be created which maps buffer offsets of active loops to the address of their rec operations, achieving additional savings. Consider, for example, an outer loop containing a collection of buffered loops, all small enough to cohabit in the loop buffer. When one of the inner-loop rec operations is encountered for the second time, the table will indicate that the loop buffered from that address is known to be intact in the buffer, so re-recording of the loop’s operations will not occur, but the loop exit will still fall through to a location num operations later, after the end of the loop’s image in global memory. It is important to note that the loop buffer here is not operating as a hardware cache, as the compiler is responsible for explicitly controlling its population; the hardware is simply given a small memory to avoid useless work. An example of the operation of the loop buffering system is given in the next section.
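The recording and residency behavior described in this section can be sketched as a small fetch-counting model. This is a simplification under stated assumptions: only rec_cloop is modeled, every iteration issues all num operations, and the residency table is a dictionary keyed by buffer offset. The class `LoopBuffer` and its counters are hypothetical names used for illustration.

```python
class LoopBuffer:
    """Counts operations fetched from global memory vs. the loop buffer."""

    def __init__(self, size):
        self.size = size
        self.resident = {}     # buf_addr -> (rec_addr, num) for intact loops
        self.memory_ops = 0    # operations fetched through global fetch
        self.buffer_ops = 0    # operations issued from the loop buffer

    def rec_cloop(self, rec_addr, buf_addr, num, count):
        """Model `rec_cloop buf_addr, num, count` located at address rec_addr."""
        if buf_addr + num > self.size:
            raise ValueError("loop does not fit at this buffer offset")
        if self.resident.get(buf_addr) == (rec_addr, num):
            # Table hit: the loop is intact, so every iteration issues from the buffer.
            self.buffer_ops += num * count
            return
        # Recording overwrites this region: drop any loop that overlaps it.
        self.resident = {
            addr: (rec, n)
            for addr, (rec, n) in self.resident.items()
            if addr + n <= buf_addr or buf_addr + num <= addr
        }
        self.resident[buf_addr] = (rec_addr, num)
        self.memory_ops += num                  # first iteration: executed while recorded
        self.buffer_ops += num * (count - 1)    # remaining iterations replay from the buffer
```

In this model, a loop that stays resident pays its recording cost once, while two loops assigned to overlapping buffer regions re-record each time control alternates between them.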

6. Code example

To demonstrate loop buffering, we examine the function Post_Filter() from the Global System for Mobile Communications (GSM) Enhanced Full-Rate (EFR) speech decoder g724dec [10]. After inlining and transformations, this function accounts for 49% of g724dec’s execution cycles on the target machine; overall buffer performance thus depends heavily on this function (see Section 7 for machine description). Figure 5(a) shows a control-flow graph of the function’s 13 loops after the transformations of Section 3. Back edges are labeled with their traversal weights, per iteration of the outer loop, which has four iterations. After if-conversion of “C” and “J,” both of which contain internal control flow, the twelve inner loops are modulo scheduled.

To the right of the function’s control-flow graph are demonstrations of buffer scheduling for three different buffer sizes: 16, 32, and 64 operations. Figure 5(b) shows an “execution trace” of the four outer loop iterations, indicating the times at which recording and replay take place (time runs vertically through each iteration). To the right of this trace is a “buffer trace” indicating what loop is active and which loops reside in the buffer at a given time. Any horizontal slice yields the contents of the buffer at that time. Both traces are aligned with the control flow graph at the left; columns to the right of the buffer show the number of each loop’s iterations the compiler was able to schedule for buffer issue. The compiler must choose locations for each buffered loop, such that needed loops will not conflict with each other.

The 16-operation buffer’s success is limited because its size is inadequate for all but three loops, and these displace each other over the execution of the outer loop. Each time the rec_cloop prefacing a bufferable loop is encountered, recording of the loop is performed while the first iteration executes through global fetch. As calculated from the “buffered iterations” column, the 1056 operations issued from the loop buffer represent only 1.23% of Post_Filter()’s 85869 loop and non-loop operations. For the 32-operation case (c), several more loops fit within the buffer, but because the sum of the smallest operation-count loop (“E”) and many other loops is greater than 32, by the time “E” is reached again in the control flow graph, its instructions are no longer intact in the buffer (i.e., the compiler has dictated its replacement by “H”).
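The 1.23% figure can be checked with a few lines of arithmetic. The sizes of “E” (12 operations) and “F” (14 operations) and the totals 1056 and 85869 come from the text; the size of the third loop (16 operations) and the per-loop buffered iteration counts (20, 40, and 16) are an assumed reading of the “buffered iterations” column of Figure 5(b), chosen because they are consistent with the reported total. The label `third` is a placeholder, not a loop name from the paper.

```python
# (operations per iteration, iterations issued from the buffer) for each loop
# that fits in the 16-operation buffer; iteration counts are an assumed reading
# of Figure 5(b).
buffered_loops = {
    "E": (12, 20),
    "F": (14, 40),
    "third": (16, 16),   # third loop that fits; its identity is not assumed here
}

TOTAL_OPS = 85869  # Post_Filter() loop and non-loop operations (from the text)

buffered_ops = sum(ops * iters for ops, iters in buffered_loops.values())
buffer_issue_percent = 100.0 * buffered_ops / TOTAL_OPS
```

Under these assumptions the sum is 12·20 + 14·40 + 16·16 = 1056 operations, and 1056 / 85869 rounds to 1.23%.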

The benefits of targeted hyperblock formation are clearly seen in (d), where approximately 99% of loop execution (or 98.2% of Post_Filter()’s total instruction issue) occurs from the 64-operation buffer. Without the techniques described in Section 3, the performance of this function would be significantly degraded because the two highest iteration count regions would no longer be loops suitable for buffering, and would thus not only incur branch penalties, but also cause greater power consumption. The power benefit of this mechanism will be further explored in Section 7.2. In (d), both loops “E” and “F” were detected by the compiler to be compatible in size with the 49-operation loops “C” and “J.” “F” was selected for residence over the entire period of function execution because its recording overhead (14 operations) is greater than that of “E” (12 operations). “F” is still prefaced with a rec_cloop operation so it is recorded into the buffer when the outer loop first executes. In subsequent outer loop iterations, the hardware table will indicate that “F” is still in the buffer, and issue of the rec_cloop operation will cause loop execution to begin from the buffer without re-recording.
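The choice of “F” over “E” can be illustrated with a deliberately simplified sketch. It assumes that the large loops occupy one shared region of the buffer, that at most one additional loop can stay resident in the remaining space, and that every candidate would otherwise be re-recorded equally often, so the candidate with the largest recording overhead (its operation count) is preferred. The function `pick_resident` is hypothetical and is not the heuristic implemented in the compiler.

```python
def pick_resident(buffer_size, big_loop_ops, candidates):
    """Pick one loop to keep resident beside the large loops.

    buffer_size  : loop buffer capacity in operations
    big_loop_ops : size of the region shared by the large loops
    candidates   : dict mapping loop name -> operation count
    Returns the fitting candidate with the largest recording overhead,
    or None if no candidate fits in the remaining space.
    """
    free = buffer_size - big_loop_ops
    fitting = {name: ops for name, ops in candidates.items() if ops <= free}
    if not fitting:
        return None
    return max(fitting, key=fitting.get)
```

With a 64-operation buffer and the 49-operation loops “C” and “J,” 15 operations remain; “E” (12) and “F” (14) each fit alone but not together, and the sketch selects “F,” in agreement with the text.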

7. Experiments

Our experimental machine is an 8-wide unified VLIW with resources loosely modeled after the TI ’C6x series microprocessors. It is assumed that operation bundles are stored in memory in a compressed format [19] in which NOPs consume no space (as in TI ’C6x), and each operation is assumed to be 32 bits in length. The processor has eight integer ALUs, two of which can issue integer multiplies; three memory units; one branch unit; two floating-point units; and four units capable of generating predicate values. Figure 6 shows the fixed assignment of functional units to slots. Arithmetic operations have a latency of 1 cycle; multiplies, 2 cycles; divides, 8 cycles; loads, 3 cycles; and floating point arithmetic, 2 cycles. Sixty-four (64) integer registers are provided. General control speculation is supported by providing all potentially excepting instructions except for stores with a speculative form. The architecture implements the predication scheme of Section 4 with its four predicate-generating units; all slots are capable of receiving predicates.

Figure 5: g724dec Post_Filter() loop control flow graph and buffer content traces. Panels: (a) control flow graph; (b) 16 operation loop buffer; (c) 32 operation loop buffer; (d) 64 operation loop buffer. Panels (b) through (d) each show an execution trace (loop recording, buffered execution) and a buffer trace (active loop, stored loop) over outer-loop iterations 1 to 4, with a “buffered iterations” column. Loop bodies in (a) range from 12 to 49 operations. Post_Filter() issues 85869 instructions; buffer issue is 1.23% in (b), 6.32% in (c), and 98.22% in (d).

Figure 6: VLIW issue slots for modeled architecture. The figure shows slots 0 through 7 populated with Pred, Ialu, Imul/F, Mem, and Br units.

7.1. Effectiveness of transformations

The direct measurement of branch overhead reduction and power savings depends on microarchitectural details below the level we choose to model. Loop buffers today are a ubiquitous feature in DSP, existing in many implementations from hardware buffers capable of holding a few instructions in a single loop, to complex buffers holding multiply-nested loops, to what is effectively a set of pointers into a software-managed instruction cache which implements buffering for very large loops. Likewise, DSPs have a wide variety of pipeline lengths and branching implementations, often including delay slots. In general, though, one thing constitutes an improvement across all architectures, regardless of these variations: getting more of the program into a loop buffer. This, therefore, is the metric we choose to measure the effectiveness of our compiler transformations in fitting loops into the simple buffer model described in Section 5.

Two compilations were performed for each benchmark: one applied only traditional compiler optimizations (i.e., no predication and no loop collapsing), and the second aggressively applied control transformations (hyperblock formation, unrolling, peeling, and collapsing) intended to enhance opportunities for instruction buffering. In both cases, profile-guided selective inlining was performed with up to 50% code expansion, modulo scheduling with modulo variable expansion was performed, and loop bodies were scheduled into the loop buffer by the compiler, given a control flow profile. Figure 7(a) shows, for each of the benchmarks, the fraction of instruction accesses that are satisfied by the loop buffer when only traditional optimizations are applied. Although the benchmarks contain some small loops with no internal control flow, the benefit clearly saturates at a small buffer size. Figure 7(b) shows the same metric for benchmarks after aggressive transformation, in which a much greater proportion of execution is captured. The adpcm benchmarks resolve for the most part to a single predicated loop which, once scheduled into the loop buffer, accounts for over 99% of instruction issue. After aggressive control transformation, even much more complex benchmarks such as the g724 codec issue between 94 and 96% of instructions from the loop buffer.

In light of the code example given in Section 6, it is of interest that a majority of g724dec’s execution occurs in the loop buffer from the 64-operation size onward, which clearly confirms that the success of Post_Filter() in the buffering mechanism has a great impact on the overall benchmark’s outcome. An additional increase of about 20% in loop buffer issue is realized for g724dec at the 128-operation buffer size because important loops in functions other than Post_Filter() are then buffered.

Of the benchmarks for which less than three-quarters of instruction issue is captured by the loop buffer, mpeg2enc is the most problematic. The MPEG-2 encoder contains many large, highly nested loop structures which only iterate several times. Though loop collapsing simplifies loop structures in many applications, mpeg2enc’s nested loops do not meet our criteria for small outer loop size and moderate inner loop iteration count. Hyperblock transformations do, however, almost double the fraction of mpeg2enc instructions issued from the buffer, but because loop buffering is most effective for high-iteration count or small, high instance count loops, mpeg2enc is unlikely to achieve 80-90% loop buffer issue rates, regardless of the code transformations applied.

Figure 7: Percentage instruction issue from loop buffer. Panels: (a) Traditional code optimization only; (b) With hyperblock transformations.

Similarly, predication and loop transformation techniques realize a twofold buffer issue increase for the jpegenc benchmark, but this final value saturates at 63%. The JPEG encoder was found to have significant numbers of inner-nest loops for which the iteration counts were generally small, but varied across different loop invocations. Peeling these loops into an outer hyperblock could significantly increase dependence heights, but if buffer eligibility were to be given priority over dependence height and function code size expansion, a much more significant proportion of jpegenc’s instructions could be merged into single-hyperblock loops which issue from the loop buffer. Likewise, collapsing could be applied, but since these outer loop bodies contained more code than their inner loops, this would be detrimental to performance. Although buffering results are moderate relative to other benchmarks, the compiler heuristics behaved correctly in leaving the loops unmodified.

The buffer performance of our MPEG-2 Layer 3 audio decoder, mpg123, struggles except for very large (2048-operation) buffer sizes primarily because its execution time is concentrated in functions with small trip count loops, which, for optimal performance, must all remain in the loop buffer simultaneously. Additionally, this benchmark contains a number of large loops which were very effectively modulo scheduled, but which require four modulo variable expansions, thus increasing their code size. When modulo scheduled with a rotating register algorithm, the size of these loops was decreased, indicating that buffer performance could likely be further improved through use of architected rotating registers.

7.2. Effects of Loop Buffering

While the loop buffering results are encouraging, it is important to consider other effects of the enabling transformations. Figure 8(a) draws comparisons between the ILP-transformed code and traditionally optimized programs. Clearly ILP transformations trade code size for improved performance. An average program speedup of 1.81 is achieved with our techniques, but as shown by the second grouping, static code size increases significantly relative to traditionally optimized forms. Additionally, while the number of instruction bundles issued does not increase, the total number of operations fetched (“total fetch” in Figure 8) does.

Finally, we examine the net effects of buffering and transformation on the benchmark set. Figure 8(a) shows that for all but one benchmark (mpeg2enc), the transformations applied for loop buffering significantly decrease the amount of code which is fetched from standard instruction memory. Cacti 2.0 [20] power simulation mechanisms were used to quantify the impact of this shift in fetch location on processor power consumption. While DSP architectures traditionally had separate instruction and data memories, wider, more general architectures use a single, large, unified on-chip storage space [3]. While memory power is obviously highly dependent upon design style, it is not uncommon to see a linear increase in power consumption with size. Cacti results for 0.13 µm technology indicate that fetching an operation from a single-port, 256-operation buffer (assuming 32-bit operations) consumes 41.8 times less power than a fetch from a 512KB, 2 read/write port, non-cache memory. This result provides a means of assessing the power implications of the proposed techniques.

Figure 8: Hyperblock and loop-transformed code vs. traditional optimizations. Panels: (a) Performance, code size, and fetch count; (b) Estimated instruction fetch power (normalized).

Since use of a 256-operation buffer allows incorporation of large op-count, high iteration loops, many loops will consume enough cycles to reasonably assume that clock gating could be effective on elements of the global instruction fetch hardware while the loop buffer was being accessed. Figure 8(b) shows the estimated power consumption of instruction fetch, normalized to buffer-less issue of traditionally-optimized code. While a 256-entry loop buffer operating on non-transformed code reduces instruction fetch power by an average of 34.6% (“Baseline buffered”), the proposed transformations reduce it by 72.3% (“Transformed buffer”) relative to the same non-buffered, non-transformed baseline.
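A first-order version of this estimate can be written down directly from the 41.8x per-fetch ratio. The sketch below is an assumption-laden simplification rather than the method used to produce Figure 8(b): it treats every fetch as costing either one memory-fetch unit or 1/41.8 of that unit, scales by the growth in total operations fetched, and ignores recording costs, clock gating details, and leakage. The function name `normalized_fetch_power` and its parameters are hypothetical.

```python
def normalized_fetch_power(buffer_fraction, fetch_ratio=1.0, power_ratio=41.8):
    """Estimate instruction fetch power relative to buffer-less baseline code.

    buffer_fraction : fraction of fetched operations served by the loop buffer
    fetch_ratio     : total operations fetched, relative to the baseline
                      (greater than 1 when transformations add operations)
    power_ratio     : how many times cheaper a buffer fetch is than a
                      memory fetch (41.8 in the Cacti result cited above)
    """
    if not 0.0 <= buffer_fraction <= 1.0:
        raise ValueError("buffer_fraction must be between 0 and 1")
    memory_part = 1.0 - buffer_fraction          # fetches still served by memory
    buffer_part = buffer_fraction / power_ratio  # fetches served by the buffer
    return fetch_ratio * (memory_part + buffer_part)
```

In this model a program with no buffered fetches has normalized power 1.0, and one that issues entirely from the buffer approaches 1/41.8 of the baseline; an increase in total fetch raises the estimate proportionally, which is why the growth in fetched operations noted in Section 7.2 matters.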

7.3. Limitations and future work

While experimental results, some of which were too detailed to include in this format, demonstrated that binding predicates to particular slots did not significantly affect performance or code size in this benchmark set, the implementation of compiler techniques to best manage these intricacies in the more complex benchmarks such as mpeg2enc and jpegenc is continuing.

As the architectural description indicated, the low-overhead predication scheme adds a 1-cycle critical path from the output of the predicate-generator to the predicate-squash input of the controlled functional unit, as well as a system of global busses for communicating predicate updates. These are not likely to cause a design constraint unless the processor is already clustered. Should it become a problem, the predicate harness could itself be clustered, either to the same degree as the VLIW, or to an even greater degree. While this further complicates the compiler’s job, it seems consistent with other existing work in clustering [1].

Finally, the source-routing of predicates by the simple scheme presented could be extended in more complex and elegant directions, possibly to perform more complex operations, such as queuing a predicate to become active at some future time (easing the liveness constraint), or possibly to include some limited-lifetime values traditionally stored in general purpose registers. The current scheme, while demonstrably sufficient for the benchmarks illustrated, may require greater than the optimal number of predicate defines to guard general predicated computation.

8. Previous work

Alternative loop buffering mechanisms were described in Section 2. These hardware features are widely implemented in the DSP community, but their implementation and optimal application have not been explored extensively in academic publications. [21] studied the 31-instruction loop buffers in the Lucent DSP16000 using a set of small DSP kernels and demonstrated a 24.8% decrease in execution cycles without compiler transformations and a 32.7% decrease in execution cycles with compiler transformations. Among these transformations were function inlining and the use of the conditional execution support of the DSP16000 to include loops with simple control diamonds.

Various techniques have been studied for incorporating predication into an architecture without increasing the per-instruction encoding size, at least for non-predicated operations. Probably the most prevalent of these is the conditional move [22], a single “predicated” instruction which operates on the Boolean value stored in a general-purpose register, to select from among speculatively-generated results. Unfortunately, operations which may not be speculated must receive special conditional move handling which requires more than a single instruction. [22] showed a 44% dynamic code size increase for this technique relative to full predication, which makes it questionable for inclusion in an embedded processor. Other papers [23] have proposed adding guard predicates as instruction prefixes, but this is not in general applicable to the fixed-operation-encoding-size VLIW model typical of the latest DSPs. PA-RISC provides instruction nullification to allow an instruction to conditionally nullify one subsequent operation, but this technique lacks the generality necessary for if-conversion, and consumes encoding space in many operations.

9. Conclusions

Elimination of control flow overhead is a key component of high-performance compilation on processors intended for the media and telecommunications segment. We demonstrated techniques supporting the elimination of general internal control flow using full predication, and techniques which improve the utilization of a simple, one-level loop buffer by transforming control flow into a more regular form. The implicit structure of the VLIW was then used to develop a mechanism which provides the benefits of full predication without a predicate register file or a predicate bypass network. Results achieved for an 8-wide VLIW in a high-performance compilation environment indicate a 137.5% increase in the percentage of instructions issued out of the loop buffer, relative to traditional compilation techniques, together with an average speedup of 1.81. Finally, the transformations applied were shown to more than double the instruction fetch power benefit of loop buffering, allowing loop buffering to reduce instruction fetch power by 72.3% in the modeled machine. These results encourage further exploration into models of communication within the VLIW pipeline, which may have the potential to reduce register pressure, save power, and simplify hardware.

Acknowledgments

This work was supported by the Semiconductor Research Corporation under the grant “Memory Efficient EPIC/VLIW Architecture (ID 785)” and by a grant from StarCore. John Sias and Hillery Hunter were supported by fellowships from the IBM Centre for Advanced Study and the National Science Foundation, respectively. We thank the IMPACT contributors for their work on the compiler, particularly Matthew Merten, who recently enhanced modulo scheduling, and the anonymous reviewers for their comments.

References

[1] J. Sanchez and A. Gonzalez, “Modulo scheduling for a fully-distributed clustered VLIW architecture,” in Proc. 33rd Int’l Symposium on Microarchitecture, pp. 124–133, Dec. 2000.

[2] B. R. Rau, “Iterative modulo scheduling: An algorithm for software pipelining loops,” in Proc. 27th Int’l Symposium on Microarchitecture, pp. 63–74, Dec. 1994.

[3] StarCore DSP Technology, SC140 DSP Core Reference Manual, June 2000.

[4] K. Ebcioglu, “A compilation technique for software pipelining of loops with conditional jumps,” in Proc. 20th Int’l Symposium on Microarchitecture, pp. 69–79, 1987.

[5] M. Lam, “Software pipelining: an effective scheduling technique for VLIW machines,” in Proc. Conference on Programming Language Design and Implementation, pp. 318–328, 1988.

[6] Intel Corporation, IA-64 Architecture Software Developer’s Manual, revision 1.1, July 2000.

[7] J. C. Park and M. S. Schlansker, “On predicated execution,” Tech. Rep. HPL-91-58, Hewlett Packard Laboratories, May 1991.

[8] Texas Instruments Incorporated, TMS320C6000 CPU and Instruction Set Reference Guide, Mar. 1999.

[9] STMicroelectronics, ST120 DSP-MCU Programming Manual, Dec. 2000.

[10] ETSI TC-SMG, “Digital cellular communications system; enhanced full rate (EFR) speech transcoding (GSM 06.60),” Tech. Rep. ETS 300 726, European Telecomm. Standards Institute, Mar. 1997.

[11] C. Lee, M. Potkonjak, and W. H. Mangione-Smith, “MediaBench: A tool for evaluating and synthesizing multimedia and communications systems,” in Proc. 30th Int’l Symposium on Microarchitecture, pp. 330–335, Dec. 1997.

[12] W. W. Hwu, R. E. Hank, D. M. Gallagher, et al., “Compiler technology for future microprocessors,” Proceedings of the IEEE, vol. 83, pp. 1625–1995, Dec. 1995.

[13] S. A. Mahlke, D. C. Lin, W. Y. Chen, et al., “Effective compiler support for predicated execution using the hyperblock,” in Proc. 25th Int’l Symposium on Microarchitecture, pp. 45–54, Dec. 1992.

[14] V. Kathail, M. S. Schlansker, and B. R. Rau, “HPL-PD architecture specification: Version 1.1,” Tech. Rep. HPL-93-80 (R.1), Hewlett-Packard Laboratories, Feb. 2001.

[15] J. C. Dehnert and R. A. Towle, “Compiling for the Cydra 5,” The Journal of Supercomputing, vol. 7, pp. 181–227, Jan. 1993.

[16] D. I. August, D. A. Connors, S. A. Mahlke, et al., “Integrated predicated and speculative execution in the IMPACT EPIC architecture,” in Proc. 25th Int’l Symposium on Computer Architecture, pp. 227–237, June 1998.

[17] P. Faraboschi, G. Desoli, and J. Fisher, “Clustered instruction-level parallel processors,” Tech. Rep. HPL-98-204, Hewlett-Packard Laboratories, Dec. 1998.

[18] M. Sami, D. Sciuto, C. Silvano, et al., “Exploiting data forwarding to reduce the power budget of VLIW embedded processors,” in Proc. Design, Automation and Test in Europe, pp. 252–257, 2001.

[19] T. Conte, S. Banerjia, S. Larin, et al., “Instruction fetch mechanisms for VLIW architectures with compressed encodings,” in Proc. 29th Int’l Symposium on Microarchitecture, pp. 201–211, Dec. 1996.

[20] G. Reinman and N. Jouppi, “An integrated cache timing and power model.” Summer Internship Report, COMPAQ Western Research Lab, Palo Alto, 1999.

[21] G.-H. Uh, Y. Wang, D. Whalley, et al., “Efficient exploitation of a zero overhead loop buffer,” in Proc. Workshop on Languages, Compilers, and Tools for Embedded Systems, pp. 10–19, May 1999.

[22] S. A. Mahlke, R. E. Hank, J. McCormick, et al., “A comparison of full and partial predicated execution support for ILP processors,” in Proc. 22nd Int’l Symposium on Computer Architecture, pp. 138–150, June 1995.

[23] D. Connors, J. Puiatti, D. August, et al., “An architecture framework for introducing predication into embedded microprocessors,” in Proc. 5th Int’l Euro-Par Conference, Aug. 1999.