flexible design of sparc cores: a quantitative study

6
See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/3806036 Flexible design of SPARC cores: A quantitative study Conference Paper · February 1999 DOI: 10.1109/HSC.1999.777389 · Source: IEEE Xplore CITATIONS 5 READS 24 2 authors: Some of the authors of this publication are also working on these related projects: AGATE-325630-2 View project Tomás Bautista Universidad de Las Palmas de Gran Canaria 41 PUBLICATIONS 202 CITATIONS SEE PROFILE Antonio Nunez Universidad de Las Palmas de Gran Canaria 141 PUBLICATIONS 468 CITATIONS SEE PROFILE All content following this page was uploaded by Tomás Bautista on 15 May 2014. The user has requested enhancement of the downloaded file.

Upload: others

Post on 16-Oct-2021

3 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Flexible Design of SPARC Cores: A Quantitative Study

See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/3806036

Flexible design of SPARC cores: A quantitative study

Conference Paper · February 1999

DOI: 10.1109/HSC.1999.777389 · Source: IEEE Xplore

CITATIONS

5READS

24

2 authors:

Some of the authors of this publication are also working on these related projects:

AGATE-325630-2 View project

Tomás Bautista

Universidad de Las Palmas de Gran Canaria

41 PUBLICATIONS   202 CITATIONS   

SEE PROFILE

Antonio Nunez

Universidad de Las Palmas de Gran Canaria

141 PUBLICATIONS   468 CITATIONS   

SEE PROFILE

All content following this page was uploaded by Tomás Bautista on 15 May 2014.

The user has requested enhancement of the downloaded file.

Page 2: Flexible Design of SPARC Cores: A Quantitative Study

Flexible Design of SPARC Cores:A Quantitative Study

TOMAS BAUTISTA AND ANTONIO NUNEZ

CAD Division, IUMA – Applied MicroelectronicsResearchInstitute.Universityof LasPalmasdeGranCanaria.

Tel. no.: +34928451250 Faxno.: +34928451243E-35017LasPalmasdeGranCanaria,CanaryIslands.

E-mail: [email protected]

�������������

In this paperwe presentexperimentalresultsobtaineddur-ing themodelling,designandimplementationof a full setofversionsof SPARC v8 Integer Unit coreaimedfor embed-dedapplicationsin digital mediaproducts.VHDL hasbeenthe descriptionlanguage,Synopsistools thosefor the logi-cal synthesis,andDuetTechnologies’Epochhasbeenusedfor thephysicallayoutof thefinal circuits. They have beenmappedto a0.35 m, threemetallayersprocess.Thequan-titative resultsgiven characterizesuitablepoints in the de-signspace.They show how muchmicroarchitecture,design,datapathgranularity and module decisionsaffect perfor-manceandcost functions. Designspaceexplorationdownto physicallayoutsismadepossiblebymodellingtechniquesbasedon configurableVHDL descriptions.

� ��� ���������������� �

Codesigntechniquesfor problemsolving rely on algorith-mic partitioningandassignmentto hardwareandsoftwarebasedtasks.Tasksmappedto hardwareareaimedfor speedandoften requiredevelopmentof fixed specificprocessors.In additiona new market hasemergedwith many compa-niesoffering IntellectualPropertydesignsin theform of ei-ther hardcoresor soft cores,trying to usesocket interfacedraftstandards.IP coresaswell asproprietarycoresarealsointendedfor their reusein new productdevelopmentswithshorttime to market.

Howeverin recentyearstheadvantagesof programmablesolutionshave alsobeenhighlighted. Thesesolutionsrelyon standardprocessorsavailableascoresfor embeddedsys-tems. This approachhelpsin software developmentsincethey are basedon well establishedprocessorarchitecturesandefficient optimisingcompilers.Processtechnologyad-

vancesarealsobringingtheseprocessorsto speedmarksthatmakesoftwaresolutionseverattractive.

As a combination of these trends general-purposeprocessorarchitecturesare evolving introducing familyderivations including better support for domain-specificproblems. One of thesedomainsis signal processingandin particular the digital media domain including audio,video, graphics and signal processing. Variations andderivationsfrom establishedarchitecturesincludeadditionalscalar units, additional extendedprecision accumulators,multiply and accumulate(MAC) units, additionalfloatingpoint (FP) units, different data type supportand formatsin registers and register files —for instanceto supportlimited SIMD operations—,and architecturalextensionsto the instruction set supportedby those units. In thedigital media domain, examplesof this are provided byMIPS, SPARC, HP PA-RISC, or HITACHI and ARMprocessorfamiliesamongothers.A comparisonof thestateof the art featuresoffered by theseprocessorsare givenin <http://www.mips.com/Documentation/-isa5 tech brf.pdf>, including alsothe Intel Pentiumfamily MMX extension.

PentiumandHP PA-RISC areproprietarytechnologies.MIPS andARM areopentechnologiesin thesensethat thearchitectureis aimedto belicensedto differentmanufactur-ers. SPARC is an openarchitecturebasedon architecturecomplianceand designcertification. MIPS hasproven itssuccessin thedigital mediadomain,for low andmediumendapplications.SPARC architectureshave proven an equiva-lent potentialto MIPS architecturesin their respective evo-lution.

However theimplementationparameterswill remainkeyaspectsfor success,for giventechnologies.Embeddedsys-temsandmany low-endgeneralpurposesystemsare verycostsensitive. It is critical to find waysto reducecostwhileincreasingperformance.Performancemustnotonly bemea-suredin cycles/task,oftenoptimisedatpipelineandarchitec-tural levels,but mustalsoincludeclockfrequency andpowerconsumptionachieved. Physicalaspectratio is alsoessen-tial for optimisingembeddedcoreswith thoseapplication-

Page 3: Flexible Design of SPARC Cores: A Quantitative Study

specificmemory, coprocessoror peripheraldevice modulesthatha

�veto fit eachotheron-chip.

The purposeof our researchhasbeento study, for thedigital mediadomain,the impacton realstate,clock speed,powerconsumptionandotherphysicalimplementationfunc-tions,of thefollowing variablesandparameters:

� ISA extensionsincludingMAC,FP, registertypes,andspecialoperators

� Microarchitecturaland designdecisionsfor a givenISA

� Control of physicaldesignby tool managementwithemphasisin modulesand datapathgranularity deci-sionsat synthesistime andat placementand routingtime.

With thisaimtheopenSPARC architecturehasbeencho-sen.Stagesconductedhavebeen:

� Developmentof VHDL versionsof SPARC v8 ISAwith IU, FPUandVIS extensions

� Developmentof aVHDL basedsynthesisandcompila-tion techniquewith easilyconfigurableVHDL options,for flexible synthesisof experimentalversionsof eachdesignandits parameterizedvariations.

Thispaperreportsonresultsobtainedsofar from thede-signexperiencesconductedfor the IntegerUnit of SPARC.Section2 commentson pertinentSPARC featuresandmainstreamcompiler technologyto addressSPARC variationsandextensions.Section3 givesa few relevantdetailsaboutthe strategy andtechniqueusedfor modellingof processorcores. Section4 presentsthe layout of the experimentsforassessingthe impactof microarchitecturalanddesigndeci-sions,andthatof modulesandgranularityasphysicaldesigndecisions.Section5 presentsasummaryof mainresultsandtheir discussion.Section6 concludesthepaperandconclu-sions.

� � � �"!$#%&��(')� �+*,�����-�.*-/0����132-��4 *(��5 � �768*(� � *(49�

SPARC is a RISC style instruction-setarchitecturethatde-fines the instructions,register structureand datatypesfortheintegerunit (IU) andanIEEE754standardfloatingpointunit. It allocatesopcodesanddefinesastandardinterfacefora coprocessorunit. It assumesa linear32-bit virtual addressspacefor userapplicationprograms[New91].

The reasonsfor choosingSPARC1 asa coreareduetoour purposeto useit in differentapplications,andin partic-ular in low-costmultimediaapplications[BMCN96]. Withthisapplicationin mind,therewereseveralreasonsfor mak-ing this decision:

1. Architecturerelatedreasons::It hasbeenchosenmorespecificallytheSPARC 32-bit version8 specification

� SPARC is an open and scalablearchitecture,which leavesthe designerwith a greatdegreeoffreedom.Therearea lot of possibledesignalter-nativesdependingon theperformancedesired.

� SPARC is very well-suitedfor real-timeandem-beddedapplications.Thereareeven specificex-tensionsuponthebasicSPARC specificationses-peciallydevelopedfor usingSPARC coresin em-beddedsystems.

� Another interestingfeaturefor using SPARC inembeddedsystemsis that the architecturepre-sumesthe presenceof a windowed register file.This is attractive in applicationswhereprocessesof different naturecoexist and where fast trapshandlingis required.Supportfor this is achievedwith thewindowedregisterfile sinceit allowsfastcontext switching.

2. Compilerandkernelrelatedreasons:

� For the processorcores in embeddedsystemsa software must be residentin order to handleproperstart-upmechanisms,device intercommu-nication, interruptsmanagement,and main pro-cessexecution. For producingmachinecodefortheprocessortheGNU C compilerhasbeenused.By taking the GNU C compiler as a reference[Sta91, Dar96], theview a compilerhasis a Reg-isterTransferLevel view of themachineandwhatarethevalid RegisterTransferLanguageexpres-sions.That is, thecompiler’sview of a processoris that of the InstructionSet Architecture. Thisincludesany architecturalextensions.From thisknowledgethecompileris capableto produceanoptimizedmachinecodeprogramthatwill corre-spondto a certaintaskdescribedin theprogram-ming language.

� Therearesoftwaredevelopmenttoolsalreadycre-atedfor SPARC thatcanhelp in thedesign,effi-cient compilationanddebuggingof the softwareto berunon thesecores.

� The introductionof Windows CE opensthe areato new multimediaapplicationsfor hand-heldde-vicesbasedon non-Intelprocessors.Thereis ex-tensive supportgivenby otheroperatingsystemsincludingreal-timeonesfor theSPARC platform.

; < ���-*(4�4=� ��> �����(�+* >-?A@ ���B�C'�*D2(�����*&�������B�����.*

By makinga traditionalcustomizedpartitionof oneof thesesystems,a high degreeof freedomhasresultedfor decid-ing themain featuresof theprocessorcore. In this way ar-chitecturalandimplementationparticularitiescanbe betterexploited.

Page 4: Flexible Design of SPARC Cores: A Quantitative Study

Themodellingprocess[BMCN97] is highly basedontheuseofE a graphicalschematicstool with which a systemcanbedescribedasaninterconnectionof differentelements.Atthe sametime, eachelementcanbe describedby meansofanotherschematicsor by a HDL description.By makingasuitablepartitioningonthetypeof basicbuilding blocksthatshouldbepartof everyschematics,thewholesystemcanbeeasilyconceivedanddebugged.

For a processorcore, the first partition to be madecor-respondsto the traditionalview of a processorassplit intoa DataPathanda ControlUnit. On a secondpartitionstep,the DataPath is baseduponan RTL view. In this view theDataPathisconceivedasbuilt of differentelementsfor mak-ing thecalculationsanddatastoringneededin theexecutionof thesetof implementedinstructions.Additionally, specialregistersareinsertedto actaspipelineelementsfor bothdataandcontrolsignals.

On theotherhand,theControlUnit is split into two sub-units. The first one is for managingthe decodinganddis-patchingexecutionof instructions,aswell as for sequenc-ing management.The secondoneis the main heartof thewholemachine,with thecentralFiniteStateMachinewhichrulesall theactionsperformedandtheelementsfor produc-ing propercontrol signalsfor thedifferentelements.Thesesignalsdependon the FSM, InstructionRegister contents,trapsmanagement,externalsignals,andotherelements.

As shown in [BMCN97], this centralizedalternative isbettersuitedthan the one basedon distributed FSMs (seefigure1) becauseit allowsa bettercontrolof theinstructionset,in aflexibledesignworkbench.Thisstrategyoffersaddi-tional significantbenefitswhich resultin a well suitedenvi-ronmentfor makingneededmodificationson theinstructionsetin whatrespectsto thedecodinglogic.

Modifying the instruction set can imply the modifica-tion, removal or appearanceof specificprocessorresourcesin the architecture. This operationis flexible indeeddueto the facilities providedby the graphicalenvironmentandthe featuresof the hardware language. This also appliesto implementationand architecturaldecisionsupon local-ized elementsof the architecture. Other parameterizationoptionsarealso possibledue to the capabilitiesof the de-scriptionlanguage,in which,by meansof constants,specificbehaviouroptionscanbeenabledor disabledin thesynthesissteps.

F G�H�2I*(�J�=1K* � �+��L � 1M2����� � @ 1N�����-4 *&�O����*P � �Q1R�9�,��)&�TS�,'-� �+*,���C�)�U�4��-*,�&�9�C�9� � �

Basedon a defined Instruction Set Architecture a largeamountof processorscanbe built. Even betweentwo pro-cessorswith thesamemicroarchitecture,just a differenceintheimplementationtechniqueof asubpartcanmakeamajordifferencefor goodor for bad.

Theenvironmentsetup for our experimentsallows that,for thesameinstructionset,differentdesignalternativescan

V W X Y

Z [ W \ ]^

_ ` Y [ a ` bc X \ Y W b c

Z [ W \ ]d Z [ W \ ]e Z [ W \ ]Y

_ ` Y [ a ` bc X \ Y W b c _ ` Y [ a ` bc X \ Y W b c _ ` Y [ a ` bc X \ Y W b c

f f f

(a)With distributedFSMs

Main

Stage 1ControlSignals

Stage 2ControlSignals

PipeliningRegisters

PipeliningRegisters

PipeliningRegisters. . .

. . .

. . .

Stage 3ControlSignals

Stage nControlSignals

(b) With acentralizedFSM

Figure1: Two possibleorganizationsfor generatingcontrolsignals

be evaluatedandcompared.Theseare introducedinto themodelsby working on special-purposeelementsof the ar-chitectureor by reorganisingsomeof themandtheir inter-connections.Taking this into account,a setof versionshasbeendevelopedwith the following distinguishingparame-ters2:

1. Ratiosof datapathmodulesvs.standardcells.

2. Branchprediction:

(a) alwayspredicttaken,

(b) predicttakenonly if backwards.

3. Checking of possiblemodification of the conditioncodesby thepreviousinstructiononbranches.As seenin fig. 2, at the decodestageof the branchinstructionin cycle ‘c’, when fetching the instructionin the de-lay slot, the addressfor the next instructionto fetchhasto becalculated.Theproblemis that Instruction1couldchangethe conditionalsoin ‘c’, so the addresscalculatedin this cycle is predicted3. A parametertoevaluatefor makinga predictionor not is to checkiftheconditioncodescouldbemodifiedby Instruction1or not.g

Thesedesignreferenceswill beusedasalternativesin tables1, 2 and3.hIf we wait for theconditioncodesto besetby Exec.1 for decidingtheaddressto

setin ‘d’, thecycle time increases.

Page 5: Flexible Design of SPARC Cores: A Quantitative Study

4. Calculationof theaddressfor thenext fetch

(a) with just anaddoperationin a cycle

(b) with two additions.Thisway, thetwo alternativesare calculatedat a time, the properone is cho-sendependingon the sequencingdecisionsandtheotheroneis savedin a register. If at any timethe decisionwaswrong (for instance,in mispre-dictions),thenthesavedpreviously is retrieved.

5. Bypassmechanismset

(a) fromtheendof theexecutionstageof thepreviousinstructionto thedecodestage,or

(b) from thebeginningof thewrite-backstageof thepreviousstageto theexecutionstage.

Instruction 1

WB. 1Ex. 1Dec. 1F. 1

Instruction 2 (Bicc)

WB. 2Ex. 2Dec. 2F. 2

a b c d e f g h

Post-prediction InstructionF.

D+1Dec.D+1

Ex.D+1

WB.D+1

Predicted Instruction

WB. DEx. DF. D Dec. D

Instruction 3 (delay slot)

WB. 3Ex. 3Dec. 3F. 3

Figure2: Timing of conditionalbranches

i !j*&�k�)49�l�

Within thisbenchof versions,asetof implementationshavebeencarriedout. Thetechnologychosenhasbeena3-metal1-poly 0.35 m CMOS process.The physicalsynthesisofthis subsetof implementationshasbeendevelopedin auto-matic mode—asthe tools canbe further forcedto achievespecificphysicaldesignconstraints—.This way, benefitsofmicroarchitecturalanddesigndecisionscanbebetterevalu-ated.

Tables1, 2 and3 summarizequantitative datafrom theexperiments.Thefull setof implementations,their physicallayouts,andothermeasurementsandratiosobtainedcanbeseenin [Bau99]. Fig. 3 givestwo samplelayouts.

Thenameof thetablessummarizecombinationsof sev-eral physicaldesignparameters.Option ‘2.a, 4.a, 5.a’ intable1 meansan implementationwith ‘predict taken’, ‘oneaddpercycle’ and‘by-passfrom endof executionstage’andreferto section4.

Many qualitative considerationscan be analyzedfromthesedata,sincethey show significantperformancevaria-tions.Thisanalysisis bestcarriedout jointly with datafrom

otheron-chipcircuit blocksrequiredby theembeddedappli-cationunderconsideration,andstressesthe convenienceoftheflexible designstrategy setup for theexperiences.

Table1: ‘Lowest’amountof modulesOptions Area Power Freq. Transistors

(sq.mm) (mW/MHz) (MHz)

2.a,4.a,5.a 15.2719 10.21932 53.38 1846022.a,3, 4.a,5.a 18.7347 10.68084 66.27 1875642.a,4.b,5.a 15.3189 10.07378 58.28 1860882.a,3, 4.b,5.a 13.5150 10.68283 59.36 1878842.a,4.b,5.b 15.6237 10.52504 61.70 1840352.b,4.b,5.a 14.8836 10.55471 67.19 1875802.b,3, 4.b,5.a 13.2435 9.98221 70.93 185909

Table2: ‘Mean’ amountof modulesOptions Area Power Freq. Transistors

(sq.mm) (mW/MHz) (MHz)

2.a,4.a,5.a 16.9508 10.35614 54.95 1735832.a,3, 4.a,5.a 11.4791 9.42461 58.67 1698962.a,4.b,5.a 12.0548 9.37507 89.18 1716342.a,3, 4.b,5.a 7.3983 9.17711 90.63 1721042.a,4.b,5.b 10.3395 9.56579 75.65 1697242.b,4.b,5.a 7.8112 9.14947 90.44 1722282.b,3, 4.b,5.a 12.6047 9.63190 82.77 173816

Table3: ‘Highest’ amountof modules

Options Area Power Freq. Transistors(sq.mm) (mW/MHz) (MHz)

2.a,4.a,5.a 12.3987 10.10201 65.99 1719782.a,3, 4.a,5.a 11.5498 10.09086 70.84 1718462.a,4.b,5.a 11.8015 9.96977 94.18 1742852.a,3, 4.b,5.a 10.2570 10.09608 88.43 1752402.a,4.b,5.b 13.1659 9.99910 76.78 1736312.b,4.b,5.a 12.6801 9.64188 96.94 1746672.b,3, 4.b,5.a 10.9568 9.79798 94.27 174292

As ageneralconclusion,themorestandardcellsusedthelessoptimal thedesignresults.As far asmoremodulesareinsertedinsidethedesign,speedandotherfunctionsusuallyget improved. In someoccasionsthis is not the rule. Somereasonsfor thiscanbefound,in part,in thedetailedconnec-tion mechanismsof thedesigntools,whichcouplinghasnotbeenfully achieved.

On the other hand, area is an item somehow unpre-dictable when modulesand standardcells coexist on thesamedesign.Fromtheresultsit is clearhow muchdesignerinterventionandguidanceis adviceablewhenanarea/shape-optimizeddesignis desired.

m #O� � �&4=���C��� � �

In thispaperwehaveshown someexperimentalresultsfromthemodelling,designandimplementationof setof versionsof SPARC v8 Integer Unit. As it hasbeendemonstrated,the impactof microarchitectureanddesignfeaturesaswell

Page 6: Flexible Design of SPARC Cores: A Quantitative Study

(a)c33 a

(b) c34 a

Figure3: Two differentversionsof theSPARC v8 IU

as the impact of the useof custommodules,can be deci-sive in the performanceobtainedfrom a design. This is inadditionto the plenty of architecturaldecisionsthat canbeincorporatedinto large embeddedprocessordesignsat dif-ferentlevels.

Furtherwork will reporton similar quantitative studiesincluding Register File sizesand formats,FPUs,and VISextensions.

n �o�,6 � �&p54 *,� > *�1N* � �l�

Theauthorsthankthesupportobtainedfrom projectDSIPS,CICYT no.TIC970953.

!q* @ *��.* � ��*&�

[Bau99] T. BAUTISTA. “Flexible GenerationandMod-

elling of SPARC Coresfor Specific-ApplicationDomains”. PhDthesis,Schoolof Telecommu-nicationEngineers,Universityof LasPalmasdeG.C.(1999).

[BMCN96] T. BAUTISTA , G. MARRERO, P.P. CAR-BALLO, AND A. NUNEZ. Towards a low-cost processor architecture for multimedia.In J. FIGUERAS, editor, “XI ConferenceofDesign of Integrated Circuits and Systems”,pages 445–450. Universidad Politecnica deCatalunya, CPDA, Diagonal, 647, E-08028Barcelona,Spain(November1996).

[BMCN97] T. BAUTISTA , G. MARRERO, P.P. CAR-BALLO, AND A. NUNEZ. Rapid-prototypingof high-performanceRISC coreswith VHDL.In CAPT. GREG PETERSON AND DR. PHIL IP

WILSEY, editors,“Rapid SystemsPrototypingwith VHDL”, pages43–52,Arlington, VA (Oc-tober1997).VHDL International,IEEE Com-puterSociety.

[Dar96] G. DART. An investigation into proces-sor descriptionswithin GCC. TechnicalRe-port, CMA, University of Las Palmasde G.C.(1996).

[New91] M. NEWMAN. “SPARC Strategy andTechnol-ogy”. SunMicrosystems,Inc. (1991).

[Sta91] R.M. STALLMAN. “Using and Porting GNUCC” (June1991).

View publication statsView publication stats