

Processor design and program behavior are inseparable. It is this relationship that allows us to determine how a component will perform before it is implemented.

Instruction-Level Program and Processor Modeling

Myron H. MacDougall, Amdahl Corporation

As a processor is developed from high-level design through logic design and implementation, a large number of design decisions are constantly being made, each of which involves evaluating the relative performance of a set of design alternatives. Whatever the performance metric used, it reflects the rate at which work is being accomplished, and the design decision, in turn, is made with a view toward the work to be performed. This association between work and design is reflected in the adjectives applied to processors (general-purpose, real-time, signal, or scientific, for example) and continues as design objectives evolve from general to specific. Further, as the design level of detail increases, so does the level of detail at which work is defined for performance evaluation. The performance evaluation process, then, involves two interlocking hierarchical models: a model of the design and a model of the workload.

This relationship of design and work is illustrated here in an instruction-level workload model and its decomposition in design analysis. Although details of this discussion pertain to a specific architecture, the IBM 370 and compatible systems, and to a particular implementation, the general approach is applicable to most high-performance hardware and software design environments. In either kind of environment, the performance issue is how fast a specific piece of software runs on a specific piece of hardware. For an analysis of that performance, it is largely irrelevant whether we approach from a hardware perspective, a software perspective, or both; the analysis is essentially the same. What matters is the direction we take to improve performance.

System-level workload characterization and decomposition

Although some processors are designed for a single, specific application, such as signal processing, most are intended for a range of applications. Consequently, we begin the development of the instruction-level workload model presented here with a survey of current and projected user environments, followed by a breakdown of key environments in terms of processor use and program or application type. Samples of each application type are collected; they may comprise programs and data files when the application can be easily separated from the rest of the system, or may be a statistical description of database transactions or timesharing system commands.

This process is illustrated in Figure 1. In this example, a system-level analysis of a second-shift operation in an insurance company (a "commercial batch" environment) showed that 47 percent of processor time was operating system execution time, 30 percent was Cobol batch program execution, 14 percent was sort program execution, and the remaining time was associated with various utilities such as program loading. To analyze these components of the system's workload, programs representative of the Cobol execution portion were identified and collected.

Environment definition is, for the most part, judgmental. Some environments are clearly unique, such as airline reservation systems; many more, however, have essentially similar applications but in different proportions. Characterization and decomposition of the work performed in a given environment is based largely on statistical analyses of standard system accounting data augmented by hardware measurements. When feasible, samples of applications of the same type collected in different environments are merged to form a composite. For example, Cobol programs from varied environments collectively form a representative sample of such programs.

The application sample sets are then used, directly or indirectly, to define instruction-level workloads; these, in turn, are used to estimate processor performance at the instruction level. To estimate performance at the system level, the entire process is simply reversed. Program estimates are combined to estimate component performance, and component estimates are combined to estimate system-



level performance for various environments. (Note that the component proportions originally measured may no longer directly apply because of differences in relative component performance.)
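To make that reversal concrete, the sketch below (a hypothetical illustration in Python, not the author's tooling) rolls per-component estimates up into a system-level estimate using the commercial-batch proportions cited above; the per-component speedups are invented for the example.

# Sketch: roll component performance estimates up to a system-level estimate.
# Time shares are the second-shift insurance-company example (47% OS, 30% Cobol
# batch, 14% sort, 9% utilities); the speedups of a hypothetical new design
# are illustrative only.
base_time_share = {"os": 0.47, "cobol": 0.30, "sort": 0.14, "utilities": 0.09}
component_speedup = {"os": 1.25, "cobol": 1.60, "sort": 1.40, "utilities": 1.10}

# New relative time for each component = old share / speedup.
new_time = {c: share / component_speedup[c] for c, share in base_time_share.items()}
system_speedup = sum(base_time_share.values()) / sum(new_time.values())

# The component proportions shift under the new design, which is why the
# originally measured proportions may no longer directly apply.
total_new = sum(new_time.values())
new_share = {c: round(t / total_new, 3) for c, t in new_time.items()}
print(f"estimated system speedup: {system_speedup:.2f}")
print("new component proportions:", new_share)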

Instruction-level workload characterization

The method used to collect data for characterization of an application sample set at the instruction level depends on the nature of the set. The basic methods used in our environment are illustrated in Figure 2. When the set is a collection of programs, instruction-level data can be collected by interpretively executing and tracing each program. When the sample set is a description of, for example, database transactions, these descriptions are used to define scripts, which are then used with a terminal simulator to generate a workload for the database system. Program trace and hardware monitor tools then are used to obtain data on the instruction-level behavior of the database system. Much the same approach is used for the operating system; a workload of the appropriate type

Figure 1. The workload decomposition process. In our example, Cobol execution programs were collected as the representative program set.

Figure 2. The measurement environment used in characterizing an application sample set at the instruction level.



(e.g., batch, timesharing) is constructed, generally from programs collected in the workload characterization process, which provides the desired system load point. Instruction-level measurement data for application and operating system components then is collected during execution of this workload.

Tools. The primary collection tools used are a program tracer, a system tracer, and a hardware measurement interface with its associated hardware monitor equipment. Both the program and system tracers interpretively execute machine instructions and produce a trace record for each instruction. The trace record includes the instruction address, operation code, operand addresses and lengths, and, when specified, operand values. For branch instructions, the record also includes the target address and a branch-taken indicator. The program tracer is used to trace the problem state portion of individual programs (in most cases, the dominant portion of the program's execution time). This trace operation executes as a simple batch job that can be run at any time on any of our computer center systems. The system tracer requires a dedicated system but can trace both problem and supervisor state portions of a workload.

During the design of Amdahl processors, signals of interest from a performance viewpoint are identified and provisions made to route these signals to the hardware measurement interface. In some instances, these signals exist in the basic design; in other cases, additional logic is incorporated to generate them. The hardware measurement interface provides a variety of facilities for signal manipulation (including counting, scaling, signal stretching, and sampling) as well as an electrical interface compatible with commercial hardware monitors.

A number of tactical issues must be weighed when using these tools. The data obtained by tracing is comprehensive, but the process is relatively expensive and, in operating system tracing, significantly perturbs the behavior of the system. Perturbation is not generally a problem when tracing individual programs; however, since a program may execute billions of instructions, a

Figure 3. Variation in intramix instruction frequency: mean frequency for the mix and individual program frequencies, plotted against instruction rank in the mix.

sampling mechanism is used to keep trace output to a reasonable volume. (The number of instructions that must be traced to get an accurate representation of a program's behavior depends on the complexity of the program. As part of our research into program behavior, we have developed trace reduction algorithms that essentially regenerate an instruction-level program graph from the trace; we hope to build on this work to develop heuristics for trace reduction "on the fly.") Sampling is also used to reduce the perturbation inherent in operating system tracing. Hardware measurement is efficient and does not perturb the system, but it limits the kinds of data that can be collected. One important application of hardware measurement data is the validation of sampled trace data.

Source-level instrumentation (for collecting data on statement execution frequencies and times, or for analyzing source statement flow) is an important tool in environments concerned with instruction set specification or compiler design. In our environment, source-level analysis is used primarily to identify instruction sequences and address reference patterns generated for common operations such as procedure calls and data type conversions.

Instruction mix descriptions. The data obtained from instruction trace analysis and hardware measurement is used to generate instruction mix descriptions for each key application type. Thus, to create an instruction mix description for Cobol execution, each program in the sample set for that type is traced, and the trace is analyzed to obtain data on instruction frequencies, operand length distributions, and taken-branch frequencies. Data is then combined to form a composite of the Cobol programs. The detail of these instruction-mix descriptions expands as the design evolves and design questions arise. For example, designers concerned with branch instruction performance may want to know how often the condition code is set by the instruction immediately preceding the branch; when told "96 percent," they then want to know which instructions are the most frequent branch predecessors. Hundreds of such questions arise during the design process, often accompanied by performance questions like "What's it worth to set the condition code a cycle early in this instruction?"

These instruction-mix descriptions are illustrated in Tables 1 through 4. Table 1 shows an instruction frequency tabulation for a Cobol execution (CBL/X) instruction mix. This and similar tabulations encompass all the instructions in the programs from which the mix is derived, not just the 20 or 25 most frequent. (Certain instructions, while low in frequency, contribute significantly to program execution time.) The graph of Figure 3 provides a visual indication of how instruction frequencies vary among the programs composing the mix.
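As a rough sketch of how such a tabulation is produced (assuming only a stream of traced op-code mnemonics; this is illustrative, not the Amdahl tracer's actual record format), the following Python counts op-code frequencies, ranks them, and accumulates the cumulative-frequency column shown in Table 1.

from collections import Counter

def instruction_mix(opcodes):
    """Build a Table-1-style tabulation (rank, op code, f, cumulative f)
    from a traced stream of op-code mnemonics."""
    counts = Counter(opcodes)
    total = sum(counts.values())
    table, cum = [], 0.0
    for rank, (op, n) in enumerate(counts.most_common(), start=1):
        f = n / total
        cum += f
        table.append((rank, op, f, cum))
    return table

# Toy trace; a real mix is derived from traces of every program in the sample set.
trace = ["L", "BC", "L", "MVC", "BC", "L", "ST", "LA", "BC", "L"]
for rank, op, f, cum in instruction_mix(trace):
    print(f"{rank:>4}  {op:<5} {f:.5f}  {cum:.5f}")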

Table 2 shows a tabulation of the operand length distribution for the Move Character instruction in the CBL/X instruction mix. (This data is for the nonoverlapped operand, or MOVE, case; the MVC instruction with a one-byte operand overlap is frequently used to clear or blank-fill data fields.) The MVC instruction has a relative frequency of 0.0422 in this mix; approximately 96 percent of these are MOVEs. From the cumulative distribution, note that about two thirds of these instructions have operand



lengths of eight bytes or less. However, this high proportion of short moves is only part of the picture. Figure 4 shows the proportion of MVCs with operand lengths greater than b together with the proportion of total bytes moved by this instruction that are moved by MVCs with operand lengths greater than b. Clearly, most bytes moved by this instruction are moved by a small proportion of MVCs with relatively long operands. This behavior is a challenge to the ingenuity of the designer: to handle the frequent short moves efficiently, the MVC algorithm must have minimal initialization time; at the same time, the algorithm must be sophisticated enough to handle long moves at the maximum rate permitted by path widths, for example.
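A small sketch of the Figure 4 computation, assuming a Table-2-style length distribution as input (the abbreviated distribution below is only a stand-in for the full table):

def mvc_tail_proportions(length_dist, b):
    """Given {operand_length_in_bytes: probability}, return the proportion of
    MVCs with length > b and the proportion of total bytes moved by them."""
    total_bytes = sum(length * p for length, p in length_dist.items())
    long_instrs = sum(p for length, p in length_dist.items() if length > b)
    long_bytes = sum(length * p for length, p in length_dist.items() if length > b)
    return long_instrs, long_bytes / total_bytes

# Stand-in for the full Table 2 distribution (probabilities sum to 1).
dist = {1: 0.102, 2: 0.119, 3: 0.130, 4: 0.119, 8: 0.220, 16: 0.120, 64: 0.100, 256: 0.090}
for b in (8, 32, 128):
    instrs, bytes_moved = mvc_tail_proportions(dist, b)
    print(f"length > {b:>3}: {instrs:.2f} of MVCs move {bytes_moved:.2f} of bytes")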

Branch instructions can represent up to one third of the instructions in some mixes and are thus an important consideration in any design, especially in a pipelined design. Table 3 shows part of the instruction description for the Branch on Condition in an instruction mix representing IBM's MVS SP 1.3 operating system. The BC instruction is the single most frequent instruction in this mix; the sequences TM-BC and LTR-BC account for 25 percent of the instructions in this mix, and so draw a great deal of design attention. Other data in branch instruction descriptions may include branch direction and distance statistics, the distribution of the number of instructions in the target block (word, double-word, etc.), branch repeat frequency (how often a branch is taken or not taken on successive ex-

Table 1. Mean instruction frequencies: Cobol execution instruction mix.

RANK  OP CODE      f        Σf
  1   L        .17116   .17116
  2   BC       .10217   .27333
  3   BCR      .09220   .36553
  4   ST       .04643   .41187
  5   LA       .04470   .45657
  6   MVC      .04220   .49878
  7   PACK     .03928   .53806
  8   LH       .03667   .57473
  9   CLC      .03033   .60506
 10   CLI      .02720   .63226
 11   LR       .02543   .65769
 12   TM       .02493   .68263
 ...
 44   EX       .00264   .97239
 45   SH       .00257   .97496
 46   STC      .00243   .97739
 47   ED       .00224   .97963
 48   ALR      .00199   .98163
 49   AL       .00198   .98360
 50   MP       .00175   .98536
 51   XC       .00153   .98688
 52   CL       .00147   .98836
 53   DP       .00140   .98975
 54   MVO      .00117   .99092
 55   C        .00099   .99191
 56   ICM      .00098   .99288
 57   LCR      .00091   .99380
 ...
 89   SRDL     .00000  1.00000
 90   LNR      .00000  1.00000

ecutions), and target repeat frequency (how often successive executions of a taken branch go to the same target address).
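Statistics like those in Table 3 can be derived from a trace in one pass; the sketch below is an illustration with a made-up record format rather than the tracer's actual one, tabulating BC predecessor frequencies and the taken ratio per predecessor.

from collections import defaultdict

def bc_predecessor_stats(trace):
    """trace: list of (opcode, taken) pairs, where taken is None for
    non-branch instructions and True/False for BC.  Returns, per predecessor
    op code, its share of conditional BCs and the fraction of those taken."""
    counts = defaultdict(lambda: [0, 0])  # predecessor -> [bc_count, taken_count]
    for prev, (op, taken) in zip(trace, trace[1:]):
        if op == "BC" and taken is not None:
            counts[prev[0]][0] += 1
            counts[prev[0]][1] += taken
    total = sum(n for n, _ in counts.values())
    return {pred: (n / total, t / n) for pred, (n, t) in counts.items()}

# Toy trace: (opcode, taken flag); only BC carries a taken flag here.
trace = [("TM", None), ("BC", True), ("LTR", None), ("BC", False),
         ("TM", None), ("BC", False), ("L", None), ("CLI", None), ("BC", True)]
for pred, (freq, success) in bc_predecessor_stats(trace).items():
    print(f"{pred:<4} frequency {freq:.2f}  success ratio {success:.2f}")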

Table 2. Distribution of operand lengths for Move Character instructions (MVCs) with nonoverlapped operands in the Cobol execution instruction mix: pr(instrs) = 0.0422; pr(MVCs) = 0.96188; mean length = 32.56 bytes.

LENGTH IN BYTES (b)   pr(i,b)    Σpr(i,b)
   1                  0.10182    0.10182
   2                  0.11899    0.22081
   3                  0.13007    0.35089
   4                  0.11932    0.47030
   5                  0.04411    0.51441
   6                  0.04879    0.56320
   7                  0.05449    0.61769
   8                  0.05513    0.67282
   9                  0.04502    0.71784
  10                  0.02816    0.74600
 ...
 256                  0.03837    1.00000

Table 3. Branch on Condition (BC) in an IBM MVS SP1.3 operating system instruction mix.

BC PREDECESSOR   FREQUENCY   CONDITIONAL BC SUCCESS RATIO
TM               0.3578      0.5049
LTR              0.1599      0.4554
CLI              0.0770      0.5000
CR               0.0666      0.6143
C                0.0572      0.4690
CLC              0.0507      0.5577
CL               0.0334      0.1864
CLR              0.0325      0.7042
CH               0.0278      0.4727
OTHER            0.1100      0.5579

                pr(INSTRS)   TAKEN    NOT TAKEN   TOTAL
Conditional     0.1652       0.3527   0.5386      0.8913
Unconditional   0.0201       0.1081   0.0006      0.1087
Total           0.1853       0.4608   0.5392      1.0000

Figure 4. Proportions of Move Character instructions (MVCs) and bytes moved versus operand length (0 to 256 bytes) in a Cobol execution instruction mix.



Taken branches divide the instruction stream into a series of disjoint execution sequences. An execution sequence is a set of sequentially located instructions beginning with the target of a taken branch and ending with the next taken branch. The distribution of execution sequences is of interest in pipelined instruction fetching and instruction buffer design. Figure 5 shows the distribution of execution sequence lengths for the SP1.3b operating system instruction mix. While the mean execution sequence length is approximately 7.5 instructions, 70 percent of the sequences are seven or fewer instructions long, and almost eight percent are one instruction. Sequences of one instruction represent a taken branch to a taken branch; they may result from the use of branch vector tables by the system or from procedure call protocols, for example. The high proportion of short sequences means that the taken branch "hole" must be made as small as possible to maintain pipeline efficiency. On the other hand, more instructions are executed in the long sequences. Figure 6 shows the proportion of sequences of length greater than s together with the proportion of total instructions represented by sequences of length greater than s. While more than 70 percent of the sequences in this mix are shorter than the mean of 7.5 instructions, sequences

Figure 5. Distribution of execution sequence lengths for the SP1.3b operating system instruction mix.

Figure 6. Proportions of sequences and instructions executed versus execution sequence length.

greater than the mean account for more than 65 percent of all instructions executed. Thus, while frequent branches are of concern to the pipeline designer, they do not prevent a pipelined design from being effective. (Kobayashi1 discusses the characteristics of instruction sequences in greater detail.)
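The execution sequence statistics behind Figures 5 and 6 follow directly from the taken-branch indicators in the trace; here is a minimal sketch in which the trace format is assumed, not the tracer's actual one:

from collections import Counter

def sequence_length_distribution(taken_flags):
    """taken_flags: one boolean per traced instruction, True if it is a taken
    branch.  Returns a Counter of execution sequence lengths, where a sequence
    ends with a taken branch (per the definition in the text)."""
    lengths, current = Counter(), 0
    for taken in taken_flags:
        current += 1
        if taken:
            lengths[current] += 1
            current = 0
    return lengths

# Toy trace of taken-branch flags: sequences of length 3, 1, and 4.
flags = [False, False, True, True, False, False, False, True]
dist = sequence_length_distribution(flags)
total_seqs = sum(dist.values())
total_instrs = sum(length * n for length, n in dist.items())
print(dist, f"mean sequence length = {total_instrs / total_seqs:.2f}")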

As a final example of the types of data composing an instruction-mix description, one kind of interinstruction dependency is shown in Table 4. There are several kinds of interinstruction dependencies; an address generation dependency exists when the result register of one instruction (the setting instruction) is used to form the operand address or target address of a following (using) instruction. When this situation occurs, operand or target address generation for the using instruction must be delayed until the contents of this register become available, which creates a "hole" in the instruction pipeline. For the Cobol execution instruction mix, an address generation dependency exists for almost 12 percent of instructions; about 88 percent of these dependencies are caused by Load (L) or Load Register (LR) instructions. In this instruction mix, about two thirds of the Load instructions are used to load an index or base register for operand or target address generation. One frequent case is the sequence L-BCR, where the load sets the target address for the RR-type branch.
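A sketch of how such dependencies can be counted from a trace; the record fields (result register, base/index registers) are assumed for illustration, and the check is simplified to adjacent pairs, whereas a real pipeline model also looks across intervening instructions.

def agen_dependencies(trace):
    """trace: list of dicts with 'op', 'result_reg' (register written, or None),
    and 'addr_regs' (base/index registers used for address generation).
    Counts adjacent setting/using pairs that would delay address generation."""
    caused_by = {}
    dependent = 0
    for setter, user in zip(trace, trace[1:]):
        if setter["result_reg"] is not None and setter["result_reg"] in user["addr_regs"]:
            dependent += 1
            caused_by[setter["op"]] = caused_by.get(setter["op"], 0) + 1
    return dependent / len(trace), caused_by

# Toy trace: L sets R5, then BCR branches through R5 (the L-BCR case from the text).
trace = [
    {"op": "L",   "result_reg": 5,    "addr_regs": set()},
    {"op": "BCR", "result_reg": None, "addr_regs": {5}},
    {"op": "LA",  "result_reg": 3,    "addr_regs": set()},
    {"op": "MVC", "result_reg": None, "addr_regs": {12}},
]
rate, causes = agen_dependencies(trace)
print(f"dependent fraction = {rate:.2f}; caused by {causes}")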

Using knowledge of what causes dependencies, the pipeline designer can reduce, and in some cases eliminate, address generation delays by incorporating bypasses at appropriate points in the pipeline. A bypass is a data path that routes a value to the address generator from a stage in the pipeline earlier than that at which the value is actually stored in its destination register. Software designers need to consider the impact of interinstruction dependencies in code optimization. Rymarczyk2 discusses this and related topics.

Models. The static model of program behavior represented by instruction mix descriptions is useful for answering questions about the execution of individual instructions and for analyzing some aspects of instruction interaction. Performance questions dealing with the concurrent execution of instructions and interactions among processing units are generally analyzed in relation to dynamic models. These range from simple analytic models used to "size" queueing delays at various points in the system (bus and memory conflicts) through a hierarchy of trace-driven simulation models that, as the design progresses, approach the register-transfer level very closely. Cycle-by-cycle simulation of a large set of programs, each represented by a trace of several million instructions, requires a substantial amount of computation time. There are a variety of ways to reduce these requirements, including decomposition, multilevel simulation, and trace reduction and synthesis. The full program set can be completely simulated only at points in the design process that require overall, absolute predictions of processor performance. While a variety of models may be used to analyze different aspects of performance, all resulting estimates are combined in the instruction-level model discussed below.



An instruction-level performance model

Performance measures. The common measure of performance for 370-compatible systems is the mean instruction execution rate, usually stated in millions of instructions per second, or MIPS. As a measure of processor performance, instruction execution rate is often misused (and, consequently, manufacturers tend to specify processor performance in relative terms). On any particular processor, the execution rate of applications can vary greatly. Thus, it is not uncommon for a Cobol compiler to execute its instructions at a rate twice that of the object program it compiles. Minor changes within a given architecture may decrease instruction rate performance and increase (or, for that matter, decrease) system performance as measured by throughput. For example, IBM has extended the 370 architecture several times over the last few years by incorporating various operating system functions into the processor (usually via microcode). An instruction-rate comparison of different versions of the operating system is not valid. For an invariant instruction stream, the mean execution rate is a useful tool for comparing different processors or different processor designs.

Mean instruction execution time. The reciprocal of the mean instruction execution rate in MIPS is the mean instruction execution time in microseconds. This, in turn, is, at least for synchronous machines, the product of the processor cycle time C and the mean instruction execution time in cycles I. That is,

MIPS = 1/(I*C)   (1)

Most of our analysis work uses the processor cycle as the unit of time, rather than microseconds or nanoseconds. Machine organization decisions made early in the design process establish a cycle-time objective, but the exact cycle time may not be determined until late in the design, when chip layout is complete and path timing can be analyzed. More important, however, is that machine designers tend to think in terms of cycles, while software designers are concerned with the total number of cycles required to perform some function. Conversion from instruction execution time to instruction execution rate is generally made only to compare performances of processors with different cycle times.

Decomposition of instruction execution time. The mean instruction execution time I is assumed to be the sum of three basic components, such that

I = E + D + S   (2)

where E is the mean nominal instruction execution time, that is, the mean instruction execution time if the pipeline could be kept fully busy; D is the mean pipeline delay per instruction, representing pipeline "holes" caused by path conflicts, register use dependencies, and taken branch delays; and S is the mean storage access delay per instruction, representing the delays when an instruction or operand is not found in the buffer and must be fetched from mainstore. E, D, and S are expressed in cycles per instruction.

Equations (1) and (2) define a simple macroscopic model of CPU performance; each term of (2) is, in turn, defined by submodels of increasing detail.
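As a worked illustration of equations (1) and (2), with invented component values rather than measured ones: if E = 1.6, D = 0.5, and S = 0.9 cycles per instruction on a 26-ns cycle, then I = 3.0 cycles and the mean rate is about 12.8 MIPS. The same arithmetic in a short Python sketch:

# Macroscopic CPU model of equations (1) and (2); the numbers are invented
# for illustration, not measurements of any particular processor.
E, D, S = 1.6, 0.5, 0.9           # cycles per instruction
cycle_time_us = 0.026             # 26-ns cycle expressed in microseconds

I = E + D + S                     # equation (2): mean cycles per instruction
mips = 1.0 / (I * cycle_time_us)  # equation (1)
print(f"I = {I:.1f} cycles/instruction, {mips:.1f} MIPS")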

Mean nominal instruction execution time. For a pipelined processor with serial execution, E is the mean number of cycles per instruction spent in the execute stage of the pipeline. (A processor with parallel execution units will have overlapped execution cycles and requires a more complex model of E than that considered here.) For a given instruction mix, E is the weighted sum of the nominal execution times of all instructions in the mix. If f(i) is the relative frequency of instruction i (where i can be the rank of a given operation code in the mix), and e(i) is the mean nominal execution time of that instruction, then

E = Σ e(i)*f(i)   (3)

where f(i)*e(i) is the contribution of instruction i to the mean nominal instruction execution time of the mix. It is assumed that the nominal execution time of each instruction can be computed independently of execution or delay times for any other instruction. Each e(i) represents a submodel of E. Many of these are trivial, such as e(i) = 1. Others, such as the Move Character instruction, are more complex. The MVC instruction has two uses, moving and clearing, which have different operand length distributions and different frequencies of buffer line and memory page crossings. Consequently, a fair amount of arithmetic is needed to compute e for this instruction.

The complexity in computing e for the MVC instruction

arises from the large number of subcases involved; design analysis to determine the execution time for a given subcase generally is simple. For some instructions, it is the design analysis which is complex. To determine e for the 370's Start I/O Fast Release (SIOF) instruction, central processor and channel processor designs and their intercommunication must be analyzed. For the Start I/O (SIO) instruction, we also need to know the timing characteristics of the device controllers to which various proportions of I/O operations are directed.

The values of e(i) computed from various instruction submodels are multiplied by f(i) for the particular instruction mix, and the resulting products are summed to compute the mean nominal execution time for the mix, as illustrated in Table 5.
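A minimal sketch of that summation, using a few rows reconstructed from Table 5 (the remaining rows are omitted, so the total here is only a partial E):

# Equation (3): E = sum of e(i)*f(i) over the mix.  Rows are a subset of
# Table 5 (IBM MVS SP1.3 mix); omitting the remaining rows makes this a
# partial E, for illustration only.
mix = [
    ("BC",  0.18532, 1.000),
    ("L",   0.14447, 1.000),
    ("MVC", 0.02226, 4.667),
    ("STM", 0.01563, 5.543),
]
E_partial = sum(f * e for _, f, e in mix)
for op, f, e in mix:
    print(f"{op:<4} f={f:.5f} e={e:.3f} e*f={f * e:.5f}")
print(f"partial E = {E_partial:.5f} cycles/instruction")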

Mean pipeline delay time. There are a variety of situations in which the execution of an instruction in a pipelined processor may be delayed. For example, an instruction

Table 4. Address generation dependencies in a Cobol execution instruction mix: pr(dependent instrs) = 0.1179.

OP CODE   PROP. OF DEPENDENCIES    PROP. OF OP CODE
          CAUSED BY OP CODE        CAUSING DEPENDENCY
L         0.7979                   0.6684
LR        0.0816                   0.2544
LA        0.0168                   0.0867
A/SR      0.0550                   0.4015
A/S       0.0342                   0.2235
other     0.0146



may have to wait for an address or operand value to be computed by a preceding instruction, or an instruction's operand fetch may be blocked because the buffer is busy performing a store for a preceding instruction. Most such delays result from interactions between instructions, which can extend across several instructions. As instruction execution times decrease, these interactions can increase because instructions are executed more closely together; at the same time, the proportionate effect of delays on performance increases. To counter this effect, current pipeline designs incorporate a variety of delay reduction mechanisms, including multiple copies of registers, special data paths (bypasses), multiple instruction stream buffers, and various kinds of branch prediction algorithms. In software design, code optimization for pipelined processors includes loading address and operand registers as early as possible before their use and making the most frequently followed paths correspond to not-taken branches, for example. Rymarczyk2 describes various design tactics.

Delay analysis is not simple. An instruction incurring a delay may be several instructions away from the instruction causing the delay, and potential delays may be partly or entirely masked by other delays or by the execution time of intervening instructions. A single instruction may be involved in several different delays; for example, a branch may have a potential wait for a condition-code setting overlapped by a wait for computation of a target-address register value, and there may be a delay between the execution of the branch and the execution of its target. Consequently, delay analysis involves examining a sequence of instructions in terms of their relative positions in the pipeline and determining exactly which resources and paths are used by each instruction as the sequence advances from stage to stage.

Delays can be divided into three classes: register access, buffer access, and branch/miscellaneous. A delay submodel is associated with each delay condition, the most important of which are listed below.

Table 5. Nominal instruction execution times in an IBM MVS SP1.3 operating system instruction mix.

RANK  OP CODE      f        e        e*f
  1   BC      0.18532   1.000   0.18532
  2   L       0.14447   1.000   0.14447
  3   TM      0.06289   1.000   0.06289
  4   ST      0.05122   1.000   0.05122
  5   LR      0.04864   1.000   0.04864
  6   LA      0.04645   1.000   0.04645
  7   BCR     0.03035   1.000   0.03035
  8   LTR     0.02903   1.000   0.02903
  9   MVC     0.02226   4.667   0.10389
 10   IC      0.01790   1.000   0.01790
 11   LH      0.01765   1.000   0.01765
 12   BALR    0.01647   1.000   0.01647
 13   STM     0.01563   5.543   0.08664
 ...
104   SCKC    0.00001   4.000   0.00004

E = Σ e*f

* Register Access Delays
GI (generate interlock) delay. This delay can occur when the result register of one instruction is an index or base register of a following instruction.
EI (execute interlock) delay. This delay can occur when the result register of one instruction is an operand register of a following instruction.

* Buffer Access Delays
PI (priority interlock) delay. This delay occurs when an instruction attempts to fetch an operand from the buffer at the same time a preceding instruction is storing an operand into the buffer; the store is given priority, and the fetch is delayed.
SFI (store-fetch interlock) delay. This delay occurs when an instruction accesses a storage location being stored to by a preceding instruction; the fetch access is delayed until the store is complete.

* Branch and Miscellaneous Delays
TFCH (target fetch) delay. This delay represents the pipeline "hole" that may occur between a taken branch instruction and its target.
CCI (condition code interlock) delay. Conditional branches are almost always preceded by condition code setting instructions. Some of the latter can be implemented to set the condition code at an early stage in their execution, but some set the condition code in a late stage, causing a delay in the execution of the subsequent branch instruction.
STIS (store in instruction stream) delay. This delay occurs when a store address falls within the address range of the instructions already in the pipeline and instruction fetch buffer; these instructions are discarded, and instruction fetching is reinitiated when the store is completed.

The submodels themselves may be further decomposed according to causing-instruction type and incurring-instruction type. For example, a key submodel of the GI delay is the load-branch instruction sequence.

Rough estimates of these delays can be obtained from

analytic models using instruction mix description data. However, since the type, frequency, and length of delays depend very much on the details of a particular design, accurate estimates require a detailed model of pipeline operation driven by instruction traces. The computational cost of this modeling can be substantially reduced by trace reduction methods. For example, a program loop may generate repetitive sequences of trace records, each sequence representing one iteration of the loop. Frequently, these sequences differ only in operand addresses and values. From the standpoint of pipeline delays and, generally, execution times, the set of sequences can be represented by a single sequence and a loop count. Pipeline delays are typically one cycle, and rarely more than two or three cycles. Any individual pipeline delay contributes very little to the pipeline delay per instruction (D). Consequently, low-frequency sequences can be ignored without introducing much error in D. The net result is that a program trace comprising millions of instructions can be represented, for pipeline modeling, by a relatively



small number of instruction sequences totaling a few thousand instructions.
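A toy sketch of the loop-compression idea just described: collapse immediately repeated op-code sequences into one representative sequence plus a repeat count. Real trace reduction also matches sequences that differ only in operand addresses and values, which is omitted here.

def compress_repeats(ops, max_period=8):
    """Collapse immediately repeated op-code subsequences into
    (sequence, repeat_count) pairs -- a crude stand-in for trace reduction."""
    out, i = [], 0
    while i < len(ops):
        best = (ops[i:i + 1], 1)
        for period in range(1, max_period + 1):
            seq = ops[i:i + period]
            count = 1
            while ops[i + count * period:i + (count + 1) * period] == seq:
                count += 1
            if count > 1 and count * period > best[1] * len(best[0]):
                best = (seq, count)
        out.append(best)
        i += len(best[0]) * best[1]
    return out

trace = ["L", "AR", "ST", "BC"] * 1000 + ["SVC"]
print(compress_repeats(trace))  # [(['L', 'AR', 'ST', 'BC'], 1000), (['SVC'], 1)]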

Delay estimates are reported in several different forms: by cause, by causing-instruction type, and by incurring-instruction type. An example of the last form is shown in Table 6. Here, f is the relative frequency of each op code, and d is the mean number of delay cycles incurred by all instances of that op code for all delay conditions. (D, the mean pipeline delay time for the instruction mix, is

Σ d(i)*f(i),   (4)

where i denotes the ith-ranked op code.) In analyzing software, execution and delay cycles are reported instruction by instruction so that delay cause and effect can be easily pinpointed.

Mean storage access delay time. A storage access delay occurs when an instruction or operand access references a line not in the buffer, and the pipeline must wait for the line to be fetched from mainstore. A reference to a line not in the buffer is called a buffer miss, or simply a miss. S is the product of the mean number of misses per instruction (miss rate) u and the mean delay per miss (mean miss penalty) M in cycles. That is,

S = u*M   (5)

The miss rate u and buffer hit ratio h are related, such that

h = 1 - u/a   (6)

where a is the mean number of buffer accesses per instruction; h increases as a increases; and a is (among other things) a function of the width of the path between the pipeline and the buffer. Everything else being equal, a system with a four-byte buffer path, for example, will have a higher hit ratio but lower performance than a system with an eight-byte buffer path. Even for the same path width, a can vary from one design to another because of differences in operand alignment handling. Consequently, the miss rate is a more revealing measure of buffer performance than the buffer hit ratio.

While E depends on individual instructions and D

depends on instruction sequences, S, and especially u, generally depends on the overall system environment. Current large-scale processors typically have buffer sizes of 64K bytes, and that figure is likely to increase by a factor of four to eight in future systems, most likely in the form of a buffer hierarchy. (The Fujitsu M-380 already has a two-level buffer hierarchy with 64K bytes at the first level and 256K bytes at the second.) A typical application task may reference the equivalent (in buffer lines) of 20K to 30K bytes in an execution interval before control is switched (voluntarily, because the task issued an I/O or some other system request, or involuntarily, because of an interrupt) to another task. If control is quickly returned to the first task, it will find many of its lines still in the buffer. Conversely, if the delay before resumption of the task is long (in terms of the life of buffer lines, a disk request service time is very long), the task will find few, if any, residual lines, and will spend a good part of its execution interval bringing lines back into the buffer. The miss rate realized by a particular task depends not only on its intrinsic characteristics (number of instructions executed between I/O and other system requests, buffer line locality, address reference patterns, etc.) but also on the characteristics of all other concurrently executing tasks. Thus, by itself, analyzing the buffer behavior of individual application programs does not provide a sufficient basis to estimate miss rates in a system environment. For example, in the typical 370 environment, two thirds or more of all buffer misses occur in supervisor state. What we require is a system view of buffer behavior.

Although we are focusing on mean overall values of u and M, equation (5) can be further decomposed in terms of instruction, target, and operand reference behavior, such that

S = u(i)*M(i) + u(t)*M(t) + u(o)*M(o)   (7)

where u(i), u(t), and u(o) are, respectively, the mean instruction fetch, branch target fetch, and operand fetch (and store) miss rates; and M(i), M(t), and M(o) are the associated mean miss penalties. Generally, u(i) < u(t) < u(o); a sequential instruction fetch is less likely to cause a miss than a branch target fetch, and the locality of operand references is usually less than that of instruction references. The mean miss penalties also differ, and each miss penalty can be further decomposed as

M(j) = m(j) + p(j)*π   (8)

where j = i, t, or o; m(j) is the cost of fetching the instruction, target, or operand line itself; p(j) is the proportion of misses of type j requiring an entry to be made in the translation lookaside buffer (typically from 0.1 to 0.01); and π is the TLB miss penalty, which may include mainstore accesses for page table and/or segment table entries.
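A small sketch of equations (5), (7), and (8) with invented rates and penalties (none of these numbers come from the article):

# Decomposed storage access delay, equations (7) and (8).  All rates and
# penalties below are invented for illustration.
u = {"i": 0.010, "t": 0.015, "o": 0.030}      # misses per instruction, by type
m = {"i": 12.0, "t": 14.0, "o": 16.0}         # line fetch cost in cycles
p = {"i": 0.02, "t": 0.05, "o": 0.05}         # fraction of misses taking a TLB miss
tlb_penalty = 30.0                            # pi: cycles per TLB miss

M = {j: m[j] + p[j] * tlb_penalty for j in u} # equation (8)
S = sum(u[j] * M[j] for j in u)               # equation (7)
print(f"S = {S:.3f} cycles/instruction")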

Major components of the mean miss penalty M include

* Buffer miss processing time, the time required by the buffer to determine that the referenced line is not present, issue a mainstore read request for the missing line and, if the line being replaced has been modified, issue a mainstore write request for that line. Typically, the line to be written is buffered and sent to mainstore after the read is initiated.

* Mainstore processing time, including memory bank queueing and access times. Some bank contention is

Table 6. Pipeline delay times by incurring instruction in an IBM MVS SP1.3 operating system instruction mix.

RANK  OP CODE      f        d        d*f
  1   BC      0.18532   1.029   0.1907
  2   L       0.14447   0.428   0.0618
  3   TM      0.06289   0.408   0.0256
  4   ST      0.05122   0.580   0.0297
  5   LR      0.04864   0.087   0.0042
  6   LA      0.04645   0.313   0.0145
  7   BCR     0.03035   1.437   0.0436
  8   LTR     0.02903   0.773   0.0224
  9   MVC     0.02226   0.982   0.0219
 10   IC      0.01790   0.802   0.0144
 11   LH      0.01765   0.625   0.0110
 12   BALR    0.01647   1.601   0.0264
 13   STM     0.01563   0.559   0.0087

D = Σ d*f



caused by I/O operations, but most memory traffic is generated by the central processor (so that M depends on u). Some buffer designs may permit as many as six memory requests to be in process at any time: operand fetch, operand prefetch, replaced operand line store, instruction fetch, instruction prefetch, and replaced instruction line store.

* Buffer interference time, which occurs when the missing line is sent to the buffer from mainstore. The originally referenced portion of the line may be directly sent to the pipeline so execution can continue, and the line then is stored in the buffer. The number of buffer cycles required for the store depends on the path width and line size; these store cycles may interfere with buffer accesses from the pipeline and cause pipeline delays. (Since these delays are a function of u, they are included in S rather than D.)

Other contributors to M include, as noted earlier, TLB miss processing and, in a multiprocessor system, interprocessor communication to determine if a buffer line referenced by one processor exists in modified form in the buffer of another processor. Part of the time required to process a miss may be overlapped by the execution time of instructions preceding the instruction causing the miss. Consequently, the effective value of M may be less than the sum of these components; penalties for missing a sequential instruction fetch will usually be less than operand or target miss penalties.

Given the miss rate, proportion of modified lines, and pipeline, buffer, and mainstore operation details, we can obtain a reasonable approximation to M using analytic methods. However, because of the "bursty" nature of miss generation, the number of units involved (pipeline, buffer, and mainstore), and the high degree of concurrency in the processing performed by these units, we need a detailed simulation of the entire process to get an accurate estimate. Figure 7 shows the distribution of intervals between misses measured during supervisor state execution in a timesharing environment. The average interval between misses was greater than 14 instructions, but 45 percent of intermiss intervals were less than eight instructions long. This type of behavior causes designers to focus on miss response time, rather than simply buffer-mainstore bandwidth.

Figure 7. Intermiss interval distribution measured during supervisor state in a timesharing environment (without prefetch).

The buffer miss rate u depends on many buffer design parameters, including buffer size, line size, set size, and address mapping, replacement, and prefetch algorithms, for a specific workload. These parameters have traditionally been investigated with trace-driven models. However, it is extremely difficult to reproduce the reference stream seen by a buffer in actual system operation as control switches among various application and operating system tasks, so modeling results must be carefully interpreted. For example, investigations of line size based on application program traces suggest that long lines are more efficient than short lines. On the other hand, similar investigations based on operating system traces suggest that short lines are more efficient than long lines. The operating system tends to reference only a small portion of each line, so long lines dilute the effective capacity of the buffer. Since the system generates most misses, short lines are preferable; the final choice depends on the relative values of buffer-miss processing time and mainstore access time.

Figure 8. Buffer miss rate compared to buffer size during supervisor state in a batch environment (without prefetch).

Figure 9. A stack growth function model.

The different conclusions are due to behavioral differences. The typical batch application program tends to



have relatively strong spatial locality; line utilization (bytes referenced divided by line size in bytes) is high even with long lines, and a long line size expedites buffer reloading after a long suspension interval. The operating system is a collection of different tasks providing a variety of services and exhibits relatively little spatial locality either in its instruction or operand behavior: envision, for example, a long linked-list search. Studies of other parameters, such as set size and address mapping, may also yield different conclusions, depending on the data used.

Because miss rate depends on system-level behavior, it is extremely difficult to model the relationship between miss rate and buffer size. For workloads and buffer designs that are similar except for size, empirical relationships between buffer size and miss rates have been determined through hardware measurement studies. These relationships have the form

u = a*B^(-k)   (9)

where B is the buffer size in kilobytes and a and k are constants for a particular workload or workload component. Typically, k is from 0.35 to 0.50, depending on the workload. Figure 8 shows an example of equation (9) for the supervisor state component of a batch workload. Smith3 provides a more elaborate form that includes problem state data and the effect of set size.

One approach to modeling this relationship (now under

investigation) involves two submodels: a model of line reference behavior for individual tasks and a model of task switching behavior. The first model relates the number of unique lines referenced in a task's execution interval to the length of the interval in instructions. This model, called the stack growth function model,4 is based on LRU stack behavior determined by instruction trace analysis. In many cases, an excellent fit to a trace-derived model is given by

L = r*I^d   (10)

where I is the number of instructions executed, L is the number of unique lines referenced by these instructions, and r and d are parameters for each program or task. Typically, d is from 0.35 to 0.65 for application programs and from 0.65 to 0.80 for operating system tasks. Figure 9 shows equation (10) plotted together with the trace-derived model for an operating system instruction trace. The task switching model being developed is based on data from a system event trace and gives transition frequencies for switches between both operating system and application tasks, together with task execution interval lengths. The next step is to integrate the stack growth function models for individual tasks into the task switching model and validate the behavior of the composite model against system measurements.
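Both equation (9) and equation (10) are power laws, so their constants can be fit by linear regression on a log-log scale; here is a minimal sketch with made-up measurement points (not data from the article):

import math

def fit_power_law(xs, ys):
    """Fit y = a * x**k by least squares in log-log space; returns (a, k)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(xs)
    mx, my = sum(lx) / n, sum(ly) / n
    k = sum((x - mx) * (y - my) for x, y in zip(lx, ly)) / sum((x - mx) ** 2 for x in lx)
    a = math.exp(my - k * mx)
    return a, k

# Hypothetical (buffer size in KB, misses per instruction) measurements.
sizes = [16, 32, 64, 128]
miss_rates = [0.040, 0.029, 0.021, 0.015]
a, k = fit_power_law(sizes, miss_rates)
print(f"u ~= {a:.3f} * B^({k:.2f})")   # k lands near -0.47 for these points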

While our discussion of buffer behavior has been from a processor design perspective, software performance can also benefit from an awareness of buffer organization and operation. Tactics are similar to those used for improving paging behavior (separation of modified and unmodified parts and aggregation of frequently referenced code and operands) but on a smaller scale.

Starting with mean instruction execution rate as a measure of instruction-level performance, we have traced several decomposition levels of a performance model (Figure 10), which encompasses a wide range of program-

Figure 10. Summary of instruction-level model decomposition.




processor interactions. Although this model is oriented toward a specific architecture and a particular implementation view, the underlying concepts have a much broader application. A model of this kind can provide a taxonomy for identifying and describing performance components at successive levels of decomposition. It provides a consistent framework in which to incorporate estimates from different and possibly dissimilar submodels, including measurement as well as analytic and simulation models. When possible, its decomposition should follow lines of minimum component interaction. Decomposition is done not only in terms of processor design, but also in terms of program behavior. For current large-scale processors, performance depends on, literally, hundreds of different low-level interactions between program and processor, and we should be able to represent each of them at some level of the model. While performance analysis may be approached from a hardware or a software perspective, both require a performance model that is a composite of program behavior and processor design models.

Acknowledgments

The work described in this article represents workloads, systems, tools, models, and methods developed over the years. Many individuals have made important contributions to this effort, and I feel privileged to describe some of their accomplishments.

References

1. M. Kobayashi, "Dynamic Profile of Instruction Sequences for the IBM System 370," IEEE Trans. Computers, Vol. C-32, No. 9, Sept. 1983, pp. 859-861.

2. J. W. Rymarczyk, "Coding Guidelines for Pipelined Processors," Proc. Symp. Architectural Support for Programming Languages and Operating Systems, 1982 (also in Computer Architecture News, Vol. 10, No. 2, Mar. 1982, pp. 12-19).

3. A. J. Smith, "Cache Memories," Computing Surveys, Vol. 14, No. 3, Sept. 1982, pp. 473-530.

4. M. H. MacDougall, "The Stack Growth Function Model," tech. report 820228-700A, Amdahl Corp., Sunnyvale, Calif., Apr. 1979.

Myron H. MacDougall is director of Systems Performance Architecture for Amdahl Corporation. Prior to joining Amdahl in 1976, he worked for 12 years at Control Data Corporation. He began working with computers at RCA in 1959. His interests encompass all aspects of computer performance analysis and include a long-term interest in system-level simulation. MacDougall attended Rutgers University. He is a member of the ACM and the IEEE Computer Society. His address is Amdahl Corporation, M/S 139, PO Box 470, Sunnyvale, CA 94086.
