Domain-Specific Accelerator Design & Profiling for Deep Learning Applications
From Circuits to Architecture

Andrew Bartolo & William Hwang
{bartolo, hwangw}@stanford.edu

Introduction

Where We've Been

The field of computer architecture is exiting an era of predictable gains, and entering an era of rapid change. For decades, the trends of Dennard scaling and Moore's law improved energy-delay product [Gonzalez96] while allowing ever-higher numbers of transistors to be integrated on a single chip. These trends set the pace for the entire hardware industry, and ultimately drove the economics of computation. Year over year, computer users could count on hardware that was faster, and oftentimes cheaper. If an application didn't work well on existing hardware, there was a decent chance that next year's refresh would bring about a processor that was up to the task. From the 2 MHz Intel 8080 in 1974, to the 3.8 GHz Pentium 4 Prescott in 2005, CPU clock speed increased nearly 2000X over three decades. (Dennard scaling enables smaller, faster logic to fit within existing power and thermal envelopes [Dennard74].)

From an architect's point of view, device scaling provided an increasing number of transistors to play with. The architect's challenge thus became one of how to organize these transistors cleverly, so as to increase performance. In the 1980s, schemes such as superscalar and out-of-order execution gained popularity, and have remained in general-purpose architectures ever since. Superscalar issue – i.e., issuing multiple instructions at once – promotes increased utilization of a processor's functional sub-units, and out-of-order execution allows processors to hide a good deal of memory latency. The imbalance between compute and memory remains a serious problem – perhaps the most fundamentally important problem facing computer architects today.

Following the development of superscalar and OoO, techniques such as branch prediction and speculative execution became popular. As clock speeds ratcheted up, execution pipelines needed to be decomposed into more stages, so that each stage's critical path would not exceed the clock period. Architects correctly surmised that keeping the pipeline full – even with instructions that weren't guaranteed to be the "right" ones – would lead to more instructions processed per cycle, with less energy wasted idling. Thus, branch prediction and speculative execution aimed to keep the pipeline as full as possible.

By the mid-2000s, Dennard scaling had come to a halt. However, Moore's law had granted architects such an abundance of transistors that it became possible to build two high-performance, superscalar, out-of-order cores together on a commodity chip. Intel's Pentium D shipped two such cores on a multi-chip module [MCM], and its successor, Core Duo, integrated these two cores onto a single die. By simply clone-stamping multiple cores onto one die, these CPUs did something unprecedented – they shifted the burden of extracting increased performance to the software layer. No longer could the average computer user buy this year's Pentium and hope for better performance – without a software rewrite for the multi-core paradigm, there was no performance increase to be had!

Like the first multicore machines, new domain-specific designs will require enhanced software and compiler support for efficient use. Designs such as GPUs, FPGAs, CGRAs, TPUs, tiled manycores, and others demand a fundamental rethinking of the software-hardware interface.
Frequently, an intermediate representation such as TensorFlow XLA is used to encode dataflow dependencies before data can actually be processed by hardware [Abadi16]. Therefore, it seems likely that tomorrow's computer architect will need to be as well-versed in software as she is in hardware.


Where We're Going

The mid-to-late 2010s will be remembered for their "Cambrian explosion" of new computer architectures. However, it turns out that many "new" architectures really aren't so new after all. Designs first introduced in the '70s and '80s, and that have languished since, are now poised to make a comeback. For instance, Google's Tensor Processing Unit (TPU) is, at its core, a large systolic matrix multiplier – a scheme that dates back to work done by H. T. Kung in 1982 [Kung82]. NEC's new Aurora vector processor heavily resembles the vector units of the Cray-1 from 1975 [Bell78, NECAurora]. The fundamental reason for these architectures' resurgence is that general-purpose CPUs are ill-equipped to process data in parallel at scale. And, with the dawn of machine learning and massive datasets collected from cheap and abundant sensors, the demand for parallel compute resources has never been higher.

One other reason for these architectures' newfound success is their simplicity – at least, compared to modern superscalar CPUs. Flaws such as Meltdown and Spectre [Mangard18, Kocher18] prove that CPU design carries an unsustainable amount of technical debt. By moving to simpler, yet highly parallel, hardware execution units, the field of computer architecture accomplishes two things: 1.) it shifts a good deal of design complexity from hardware to software, which enjoys much more rapid development, and 2.) it opens the playing field to a host of smaller, innovative participants.

Figure 0: Trends in chip manufacturing and test costs

On one hand, it seems likely that cheap general-purpose cores will displace simpler microcontrollers in all but the lowest-cost and lowest-power devices. (Why buy an Arduino when you can have a Raspberry Pi for the same price?) However, in areas where performance, or performance-per-watt, is crucial, domain-specific accelerators are poised to become the architecture of choice.

For these reasons, our project focuses on domain-specific accelerators for deep learning applications.

Part I: MAC and maxpool circuit-level analyses

Our project first considers domain-specific accelerators at the circuit level. To do this, we asked the following question: if we were to unroll the dataflow graph of a contemporary neural network (say, VGG-19), how much parallelism could we extract from this graph in the absence of energy and area constraints?

Experimentally-calibrated studies of conventional manycore processor architectures (e.g., Xeon Phi) have shown that a majority of energy and execution time (greater than ~90%) is spent accessing memory across a range of abundant-data applications (e.g., PageRank) [Aly15], due to limited connectivity between compute logic and off-chip memories (generally DRAM). For particular applications, fixed-function accelerators (e.g., Eyeriss [Chen16], EIE [Han16], etc.) can improve overall system energy efficiency through optimized dataflow implementations that maximize memory reuse while limiting off-chip memory accesses. Such implementations provide an isolated snapshot of the full architecture design space, and are not necessarily optimized to fully utilize all available compute and memory resources. As such, a key question remains: how does one design energy-efficient accelerators in the abundant-data era, which fully utilize all available compute and memory resources while maximizing computational throughput? Figure 1 graphically illustrates the crux of the design problem using the roofline model, where the y-axis refers to the computational throughput (in operations per second) of a given accelerator architecture, and the x-axis indicates the operational intensity of an application (in operations per byte of data accessed).
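To make the roofline bound concrete, the following minimal Python sketch computes the attainable throughput at a given operational intensity. The hardware parameters are placeholders for illustration, not numbers from any accelerator discussed here:

def roofline_ops_per_s(op_intensity, peak_ops_per_s, mem_bytes_per_s):
    """Attainable throughput (ops/s) at a given operational intensity
    (ops/byte): the lower of the compute roof and the bandwidth slope."""
    return min(peak_ops_per_s, mem_bytes_per_s * op_intensity)

# Placeholder machine: 1 Tops/s compute roof, 100 GB/s memory bandwidth.
PEAK, BW = 1e12, 100e9
ridge = PEAK / BW   # ops/byte where the two roofs meet (10 here): the "optimal point"

Kernels to the left of the ridge are memory-bound; kernels to the right are compute-bound and leave memory bandwidth idle.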

Figure 1: A graphical illustration of the design problem using the roofline model

Design space exploration

In order to understand the design space with greater depth, we constructed a simple analytical model to explore it. We first noted that many popular convolutional neural networks (e.g., VGG-19 [Simonyan14]) are comprised of a series of convolutional, fully connected, pooling, and ReLU (rectified nonlinearity [Krizhevsky12]) layers. In this design space exploration, we focused on the convolutional and fully connected layers for the following reasons:

1. Multiply-accumulate (MAC) operations comprise the bulk of the arithmetic operations. The underlying arithmetic kernel for convolutional and fully-connected (e.g., matrix multiply) layers is the MAC operation. The size of the MAC kernel can be parameterized in terms of the size of the convolutional or fully-connected layer, as summarized in Figure 2.

2. ReLU operations implement the function ReLU(x) = max(0, x). In hardware, this can be implemented as a bitwise AND operation, where one input is x, and the negation of the sign bit of x is broadcast to the other input. In this way, the output of the bitwise AND operation is x if x ≥ 0 and 0 otherwise (see the sketch after this list). Typically, every MAC operation is followed by a ReLU operation, and the energy and execution time of the MAC dominates. This assumption is later substantiated with detailed physical design studies.

3. Max pooling layers implement the function maxpool(x_1, ..., x_n) = max(x_1, ..., x_n). For popular networks (e.g., AlexNet, VGG-19, ResNet-152), n is typically 4 or 9, resulting in a reduction tree of depth 2 or 4, respectively. The energy and execution time of such operations is small relative to that of the MAC operations. This assumption is later substantiated with detailed physical design studies.
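As a sanity check of the sign-bit trick in item 2, here is a minimal Python sketch for 8-bit two's-complement values (the bit width follows our 8-bit fixed-point assumption; the function name is ours):

def relu_8bit(x):
    """ReLU via bitwise AND: broadcast the negated sign bit across all 8 bits."""
    x &= 0xFF                              # interpret as an 8-bit two's-complement word
    sign = (x >> 7) & 1                    # 1 if negative, 0 otherwise
    mask = 0xFF if sign == 0 else 0x00     # negated sign bit, broadcast to 8 bits
    return x & mask                        # passes x through when x >= 0, else 0

assert relu_8bit(0x05) == 0x05             # +5 -> +5
assert relu_8bit(0xFB) == 0x00             # -5 (two's complement) -> 0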

[Figure 1 plot: Performance (OP/s) vs. Operational Intensity (Operations/Byte), annotated with a "Compute-Bound (inefficient use of memory resources)" region and the "Optimal Point (compute & memory resources fully utilized)".]


Figure 2: The size of the MAC kernels for both convolutional and fully connected (e.g., matrix multiply) layers.

We make several simplifying assumptions to model the lower bound of the energy and execution time of a convolutional neural network inference accelerator, in order to understand the inherent parallelism available in the inference phase of convolutional neural networks:

1. Memory access is "free" (i.e., zero energy and delay), as these parameters are not inherent to the compute logic, but rather to the integration technique used to connect logic and memory components (e.g., off-chip memories, interposer-based 2.5D integration [Volta], die-stacked 3D-TSV integration [HMC], monolithic 3D [Aly15], etc.).

2. Compute logic consists of multipliers and adders only, for the reasons stated above.
3. One hardware MAC unit is instantiated per MAC operation in the neural network.
4. All operations are 8-bit fixed-point operations, consistent with [Jouppi17].

To estimate the minimum execution time and energy of the accelerator, we first performed detailed physical design studies of the multiplier and adder circuits. Following this, we analytically express the minimum system energy and delay using the MAC energy and delay derived from the circuit-level simulations, ignoring routing overheads.

Reduction-Tree MAC:

We explore the reduction-tree based MAC as our first MAC topology. Such an implementation represents the maximum parallelization that can be achieved at the circuit level, at the cost of increased chip area. A reduction-tree based MAC of size n is comprised of n multipliers and a reduction adder of depth log2(n) with (n − 1) adders, as shown in Figure 3.

Figure 3: Dataflow of a reduction-tree based MAC.

The energy, execution time, and area of such a MAC topology can be modeled with the following equations, where t, E_dyn, and P_leak refer to the delay, dynamic energy, and leakage power, respectively:

1. delay = t_mult + log2(n) × t_add
2. add_energy = (n − 1) × (E_add,dyn + delay × P_add,leak), where the add subscript denotes the delay, dynamic energy, and leakage power of a single adder. add_energy refers to the total energy of the reduction adder.
3. mult_energy = n × (E_mult,dyn + delay × P_mult,leak)
4. area = area_mult × #mult + area_add × #add, where #mult and #add denote the number of multipliers and adders, respectively.
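The model is simple enough to state directly in code. The Python sketch below evaluates equations 1-3 for an n-input reduction-tree MAC; the circuit parameters in the example are placeholders, not our post-layout 28nm values (those appear only in Figures 4 and 5):

import math

def reduction_tree_mac(n, t_mult, t_add, E_mult_dyn, E_add_dyn,
                       P_mult_leak, P_add_leak):
    """Return (delay, energy) for an n-input reduction-tree MAC (eqs. 1-3)."""
    depth = math.ceil(math.log2(n))                           # tree depth; ceil covers non-power-of-2 n
    delay = t_mult + depth * t_add                            # eq. 1
    add_energy = (n - 1) * (E_add_dyn + delay * P_add_leak)   # eq. 2
    mult_energy = n * (E_mult_dyn + delay * P_mult_leak)      # eq. 3
    return delay, add_energy + mult_energy

# Placeholder example (units: seconds, joules, watts):
d, e = reduction_tree_mac(9, t_mult=0.5e-9, t_add=0.2e-9,
                          E_mult_dyn=1e-13, E_add_dyn=2e-14,
                          P_mult_leak=1e-6, P_add_leak=2e-7)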

We performed detailed physical design studies using an industry 28nm process development kit (PDK) to obtain post-layout values for t, E_dyn, and P_leak. We used Synopsys tools to perform synthesis (Design Compiler), place & route (IC Compiler), and power simulation (PrimeTime). More specifically, we performed RC extraction on the circuits post place & route, prior to power simulation, such that the delay, energy, and power numbers we obtained account for effects present in realistic VLSI circuits (e.g., wire parasitics). The results are summarized in Figures 4 and 5 below. The Pareto frontier is obtained by specifying timing constraints during synthesis, place and route, and power simulation, allowing Design Compiler to optimize the circuit topology (e.g., ripple-carry vs. carry-lookahead adders) to meet the specified timing constraint. We select the energy-delay-product (EDP) minimum point on the Pareto frontier as the optimal circuit implementation, trading off energy and delay with equal weight.

Figure 4: The Pareto frontier of the energy-delay landscape for a 28nm 8-bit adder, derived from post-layout power simulations after RC extraction at the circuit level with an industry PDK.

Figure 5: The Pareto frontier of the energy-delay landscape for a 28nm 8-bit multiplier, derived from post-layout power simulations after RC extraction at the circuit level with an industry PDK.

Using the results from our circuit-level simulations, we can model the convolutional neural network inference accelerator as a fully combinational circuit and estimate its energy and execution time. We note that this approach does not take routing overheads into account, though these overheads can be ignored for now, as we are trying to understand the bounds of the problem. The total system-level energy and execution time can be estimated based on the number of MAC units per layer, and the number of multipliers per MAC. We show VGG-19 [Simonyan14] as an example application in Table 1 below.

Table 1: Estimated minimum energy and execution time of VGG-19 using the model discussed above.

It is important to note that these large EDP benefits come at a large area cost. We estimate the area of a single multiplier and adder to be approximately 270 μm² and 50 μm², respectively, based on post-place-and-route area estimates. To achieve the level of parallelism described here using the reduction-tree based MAC topology, the area cost is on the order of 6.28 m², or approximately 90 industry-standard 300mm silicon wafers! Such an area cost is astronomical and untenable in real-life systems.
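As a quick arithmetic check of the wafer count (the only inputs are the numbers just stated):

import math

wafer_area_m2 = math.pi * 0.150 ** 2   # one 300mm wafer: ~0.0707 m^2
print(6.28 / wafer_area_m2)            # ~88.8, i.e., approximately 90 wafers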

Thus, we explore the sequential MAC topology, which is more area-efficient than the reduction-tree based MAC.

Sequential MAC:

The sequential MAC topology instantiates a single adder and multiplier per MAC, and time-multiplexes the inputs to multiply and accumulate the input arguments over multiple clock cycles. The sequential MAC is typically more area-efficient than the reduction-tree based MAC, as each sequential MAC is comprised of a single adder and multiplier (vs. n − 1 adders and n multipliers in a reduction-tree based MAC) plus flip-flops on the inputs and outputs (Figure 6).

Figure 6: Dataflow of a sequential MAC unit.

The energy and execution time of a sequential MAC can be modeled with the following equations, where t_seq, E_seq,dyn, and P_seq,leak refer to the delay, dynamic energy, and leakage power of the sequential MAC, respectively:

1. delay = n × t_seq, where t_seq is defined as the time it takes to propagate a signal from the input of the multiplier to the output of the output flip-flop (i.e., one clock cycle).
2. energy = n × (E_seq,dyn + delay × P_seq,leak)
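A companion sketch to the reduction-tree model above evaluates these two equations as written, again with placeholder parameters:

def sequential_mac(n, t_seq, E_seq_dyn, P_seq_leak):
    """Return (delay, energy) for an n-input sequential MAC (eqs. 1-2)."""
    delay = n * t_seq                                # eq. 1: one clock cycle per input
    energy = n * (E_seq_dyn + delay * P_seq_leak)    # eq. 2, as stated above
    return delay, energy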

We performed detailed physical design studies of the sequential MAC unit using an industry 28nm PDK, similar to those described previously. For brevity, we refer the reader to the description in the reduction-tree MAC section. We select the EDP minimum point on the Pareto frontier as the optimal circuit implementation (Figure 7).

Figure 7: The Pareto frontier of the energy-delay landscape for a 28nm 8-bit sequential MAC unit, derived from post-layout power simulations after RC extraction at the circuit level with an industry PDK.

We then modeled the energy and execution time of an accelerator based on the sequential MAC unit, similar to the analysis in the previous section. The total system-level energy and execution time can be estimated based on the number of MAC units per layer, and the number of multipliers per MAC. We show the results for VGG-19 [Simonyan14] as an example in Table 2 below. We estimate the area of the sequential MAC to be approximately 444 μm² based on post-place-and-route results, which corresponds to a system-level area cost of approximately 6,600 mm², or approximately 0.1 industry-standard 300mm silicon wafers.

Table 2: Estimated minimum energy and execution time of VGG-19 using the sequential MAC.


Takeaways and Revisiting Assumptions:

Our design space exploration has led us to the following conclusions. First, there is a lot of parallelization left on the table; that is, neural network inference applications (e.g., VGG-19) have a significant amount of inherent parallelism. Without area constraints, the minimum bound for energy and execution time is very low when an accelerator is constructed with a fully-combinational network of reduction-tree based MAC units. However, area is an important constraint. A fully-parallelized network with the reduction-tree based MAC unit produces an equivalent area of ~90 industry-standard 300mm silicon wafers, which is simply unobtainable using today's technology. A sequential MAC, which time-multiplexes the inputs and reuses the same multiplier and adder over multiple clock cycles, achieves a significantly lower area compared to the reduction-tree based MAC (~890× lower). However, this area reduction comes at the cost of ~874× higher execution time, and ~675× higher energy. The higher execution time originates from the time multiplexing, and the increased energy consumption is a result of the leakage power integrated over an ~874× longer execution time. Thus, it is important to understand the cost of scheduling neural network applications onto a limited number of MAC units, as the naïve, fully-combinational implementation (which is easier to schedule) is untenable in terms of area cost.

We revisit an earlier assumption: that the maxpool layers were an insignificant contributor to the energy and execution time of a deep learning inference accelerator. We perform a physical design study of the maxpool kernel from VGG-19, where the maxpool kernel has a 2x2 window size with stride 2. More specifically, this is a 4-to-1 maxpool function that computes the maximum over 4 arguments. We perform the physical design study using the same tool flow as described in previous sections. The results are shown in Figure 8.

Figure 8: The Pareto frontier of the energy-delay landscape for a 28nm 8-bit maxpool unit with a window size of 2x2, derived from post-layout power simulations after RC extraction at the circuit level with an industry PDK.

At the network level, we can estimate the effect of the maxpool layers with the following equations:

1. delay = #maxpool_layers × t_maxpool, where #maxpool_layers refers to the number of maxpool layers, and t_maxpool refers to the EDP-optimal delay through a single maxpool kernel.
2. energy = #maxpool_ops × (E_maxpool,dyn + delay × P_maxpool,leak), where #maxpool_ops refers to the number of maxpool operations.
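These two estimates can be expressed in the same style as the MAC sketches above (placeholder parameters again):

def maxpool_cost(n_layers, n_ops, t_maxpool, E_mp_dyn, P_mp_leak):
    """Return (delay, energy) for a network's maxpool layers (eqs. 1-2)."""
    delay = n_layers * t_maxpool                       # eq. 1
    energy = n_ops * (E_mp_dyn + delay * P_mp_leak)    # eq. 2
    return delay, energy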

For VGG-19, the maxpool delay, energy, and area correspond to ~5 ns, ~25 µJ, and 80 mm², respectively, for the sequential MAC topology, which amount to less than ~0.1% of the sequential MAC delay and energy, and less than ~1% of the sequential MAC area. For this reason, our initial assumption that the maxpool layers were insignificant contributors was correct. We also assumed that the ReLU activation function was insignificant. Earlier, we asserted that the ReLU activation immediately follows a MAC operation and can be constructed in hardware with an 8-bit bitwise AND gate and an inverter. This translates to 9 logic gates. We inspect the post-place-and-route netlist of the sequential MAC unit (which has fewer gates than the reduction-tree based MAC), and note that there are ~400 gates. As such, the activation function accounts for roughly 2% of the logic gates, and would not contribute significantly to energy, as energy roughly correlates with the number of gates in a combinational circuit. The delay would not increase significantly either, as the ReLU activation would add ~1 gate delay to the critical path, which is not very significant considering the serial nature of the sequential MAC topology.

Part II: Top-down Deep Learning simulation

In search of a Deep Learning Accelerator simulator

Modern computer architecture is undeniably complex. Furthermore, the tremendous time and financial costs associated with taping out a chip dictate that most design space exploration must be done within simulations. Within the architecture research community, it is a commonly-held belief that, due to the often-unpredictable interactions between system subparts, analytical models by themselves are insufficient [Miller10, Sanchez13]. While some opinions differ [Nowatzki14], what is generally required are simulators that directly model the underlying components of the system, and the interactions that can occur between them.

Cycle-level simulators such as gem5 [Binkert14] suffer from two fundamental problems. The first is that they are prohibitively slow. These simulators do have a place – they can model individual functional units – but are so sluggish as to be almost useless in modeling chip-level behavior of entire applications (rather than just traces). The second is that they support only CPU-like architectures, and extending the base architectures involves writing and validating substantial amounts of SystemC or C++.

We need a simulator that, like zsim, takes in two things: 1.) a high-level accelerator configuration, and 2.) an off-the-shelf, prepackaged application. For zsim, 1.) takes the form of a simple .config file, and 2.) takes the form of a standard Linux ELF binary. For our deep learning inference simulator, 1.) should also be a .config file, and 2.) should be a high-level input representation such as a pre-trained TensorFlow or CAFFE model.

In light of these requirements, we successfully brought up Nvidia's new NVDLA deep learning inference accelerator and built a lightweight simulator, DLISim, on top of it. Our simulator runs entirely in software (though opportunities for hardware acceleration exist) and spits out rich architecture-level statistics, such as: "How much data was used across all convolution (matmul) stages?", "How many operations of each type (matrix multiply, activation function, pooling, etc.) occurred?", and "How were large operations subdivided and scheduled on the hardware?" NVDLA+DLISim is, to the best of our knowledge, the first such open-source simulator to accomplish the goals outlined in the previous paragraph.

NVDLA – what it gives us

The NVDLA project [NVDLA] is comprised of three distinct, complementary code repositories. The first of these is hw, the hardware repository. It contains Verilog RTL and a SystemC golden model, as well as testbenches. (The provided Verilog passed its testbenches when we ran them in VCS.) The second is sw. It contains the two drivers necessary for NVDLA to interact with its host system: a user-mode driver, UMD, for tasks such as .jpeg image decompression, and a kernel-mode driver, KMD, responsible for setting control registers and transferring data. In addition, sw contains the (as of now) black-box binary nvdla_compiler. Finally, vp provides an ARM QEMU execution environment for the sw drivers to run in. The overall system layout is below.


Figure 9: NVDLA platform overview

This gives us quite a bit to work with. It should be noted, however, that the entire platform is very much in an alpha-release stage, and that source code for some components is simply not available yet. For instance, Nvidia's open-source roadmap indicates that source code for the model compiler (NVDC, i.e., nvdla_compiler) will be available eventually, though no exact date is given – "The compiler will initially be only released in binary form, with a source release coming later" [Roadmap]. This presented a problem when several popular DL models failed to compile – we had no means of fixing the compiler ourselves. Nonetheless, Nvidia has, as of now (mid-March 2018), stuck to the release schedule at http://nvdla.org/roadmap.html, and has completed 2 of the 3 proposed stages (2017-Q3 and 2017-Q4, with 2018-H1 forthcoming). The hw and sw components were first made available to the public in October 2017, and the vp component in December 2017.

Like any processor, NVDLA contains different types of functional units. The following table details the six types of functional units present in NVDLA, their acronyms/abbreviations, and the specific operation within a neural net that each supports.

Table 3: NVDLA functional units

Acronym/Abbr. | Functional Unit | Deep learning functionality
CONV | Convolutional MAC | Matrix-matrix multiplication
SDP | Scalar Data Processor | Applies activation functions (sigmoid, tanh)
PDP | Planar Data Processor | Applies pooling functions (max, min, average)
CDP | Channel Data Processor | Applies normalization (centers mean, variance)
Rubik | Tensor reshape | Splits/slices/merges data to fit into compute units
BDMA | Bridge Direct Memory Access | Transfers data from DRAM to SRAM cache (not typically used)


The diagram below shows the layout of the individual functional units (left-hand side), as well as how the NVDLA core connects to the SRAM buffer ("Second DBB interface"), DRAM ("DBBIF"), and CPU ("IRQ" for a 1-bit synchronous interrupt signal, and CSB for 32-bit synchronous control signaling). Image and weight data is DMA'ed over the DBBIF interface. [Primer]

Figure 10: NVDLA core overview

The Convolution Buffer can be thought of as akin to an L1(-D) cache in a traditional CPU, and the SRAM buffer ("Second DBB interface") as an L2 cache. Finally, the DBB interface provides DMA access to DRAM.

Note that the first four functional units are arranged so that, dataflow scheduling permitting, data can move directly from one stage to another, and not have to make a round-trip to SRAM/DRAM and back. In the NVDLA literature, this is referred to as "fused" mode, in contrast to "independent" mode. Physically, fused-mode pipelining is made possible with a set of small FIFOs between each of the functional units [Primer]. However, in our experiments, the NVDLA KMD scheduler never pipelined any operations, even when it could have (a "fused_parent = [ dla_consumer =>" annotation should have appeared in the logs, but never did). In the neural networks we tested, activation functions nearly always followed convolutions, etc., so it seems that this early implementation of the KMD scheduler leaves a lot of performance on the table.

Regarding the functional units, CONV is really a general-purpose matrix multiply unit. In practice, it's often used to perform convolutions, but it also supports more general matrix-matrix multiplication for fully-connected layers. SDP currently supports sigmoid and tanh activations, as well as P/ReLU, but, since it's implemented as a simple LUT, it could be modified to handle other types of activations by simply changing the LUT ROM. BDMA is capable of transferring data directly from the "Primary" DRAM interface to the "secondary" SRAM cache. None of the four networks we looked at made use of the BDMA FU, and it's possible that nvdla_compiler doesn't support it yet anyway. Likewise, no network we tested made use of the Rubik reshape engine.

DLISim – what NVDLA lacks

Despite the amount of code available in the three repos, NVDLA's pre-compiled drivers grant very little visibility into the runtime workings of the system. After feeding in an NVDLA loadable and input .jpeg image, the provided nvdla_runtime binary simply indicates whether the network ran to completion or encountered an error. However, since the driver is open-source, we were able to instrument it to provide a lot of additional information.


To get useful profiling data, we modified the KMD driver to enable a series of op-triggered printouts. Each time one of the six operations runs on the SystemC model, a message is passed to the kernel via dmesg. Each of these messages contains information such as the number of input bytes processed, data precision, and stride length through the input array(s). Over the course of execution, these messages collect in the kernel's ring buffer, and can be dumped as a log file with timestamps upon completion. (For a driver, dmesg logging is considered cleaner than printing to stdout/stderr.)
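Once dumped, these logs are straightforward to mine. Below is a minimal Python sketch of the kind of per-FU tallying our parsing scripts (described shortly) perform; the message format in the regular expression is hypothetical, since the real format is defined by our KMD modifications:

import re
from collections import Counter

# Hypothetical message shape; the actual fields come from our KMD printouts.
OP_RE = re.compile(r"(?P<op>CONV|SDP|PDP|CDP|RUBIK|BDMA).*?input_bytes=(?P<nbytes>\d+)")

def summarize(log_path):
    """Tally op counts and input bytes consumed per functional unit."""
    ops, nbytes = Counter(), Counter()
    with open(log_path) as f:
        for line in f:
            m = OP_RE.search(line)
            if m:
                ops[m.group("op")] += 1
                nbytes[m.group("op")] += int(m.group("nbytes"))
    return ops, nbytes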

After modifying the source code to enable the printouts, we needed to recompile the driver. The KMD code compiles into a kernel module, which must be loaded in the QEMU emulator's BusyBox Linux environment via insmod before simulation can begin. Unfortunately, since the QEMU simulator runs AArch64, the modified KMD module must be cross-compiled. Furthermore, to cross-compile a kernel module, the kernel itself must be cross-compiled first. Fortunately, the Linaro toolchain [Linaro] was a big help in getting both the kernel compiled and the modified KMD running on AArch64.

In addition to the KMD source code modifications, we also created some Python scripts to parse the logs output by KMD, and boil the statistics down into a more digestible format. The scripts can be found at https://github.com/andrewbartolo/dlisim.

See below for a schematic of our modified NVDLA setup, with DLISim attached.

Figure 11: NVDLA overview with DLISim attached

To get a clear sense of the simulator and architecture's capabilities, we tested 13 networks from CAFFE's Model Zoo [ModelZoo], which are shown in the table below. Unfortunately, many of them didn't work with the given nvdla_compiler. Since Nvidia hasn't open-sourced the compiler yet, we were unable to do much to remedy this issue. What little insight we have into nvdla_compiler is granted by the fact that the provided binary is not stripped; after running it under GDB, we at least have a stack trace of what caused each segmentation fault. Ongoing discussion on the nvdla/sw GitHub issues tracker (https://github.com/nvdla/sw/issues) indicates that the NVDLA development team is aware of the issues, and is working to fix them.


Table 4: Neural nets and NVDLA compiler status

Network | Input Dataset | Status | Notes
AlexNet | ILSVRC2012 | WORKS |
CONV+SDP | n/a | WORKS |
R-CNN | ILSVRC2012 | WORKS |
GoogLeNet | ILSVRC2014 | FAIL | Compiler could not resolve dependency in dataflow graph (CONV layer)
HybridCNN | ILSVRC2012 | FAIL | Compiler segmentation fault (in parseCaffeNetwork())
LeNet | MNIST | WORKS |
MobileNet v1 | ILSVRC2012 | FAIL | Compiler .prototxt parser error (Pooling layer)
MobileNet v2 | ILSVRC2012 | FAIL | Compiler .prototxt parser error (Pooling layer)
Network-in-Network | ILSVRC2012 | FAIL | Compiler segmentation fault (in parseCaffeNetwork())
SqueezeNet v1.0 | ILSVRC2012 | FAIL | Compiler .prototxt parser error (Pooling layer)
SqueezeNet v1.1 | ILSVRC2012 | FAIL | Compiler .prototxt parser error (Pooling layer)
VGG-16 | ILSVRC2014 | FAIL | Compiler segmentation fault (in parseCaffeNetwork())
VGG-19 | ILSVRC2014 | FAIL | Compiler segmentation fault (in parseCaffeNetwork())

Note that all of these are convolutional neural networks (CNNs). Unfortunately, there aren't as many CAFFE models available for RNNs, such as LSTMs and GRUs. In the future, a TensorFlow frontend may facilitate running these networks on NVDLA. Google's TPU supports recurrent neural networks (in fact, Google claims that 90% of its deployed TPUs are running MLPs or LSTM RNNs, rather than CNNs [Jouppi17]). Since NVDLA is very similar to the TPU, it should be possible to add support for these networks. Also note that all of these are pre-trained networks, with trained weights provided in .caffemodel format. (In this project, we didn't consider the problem of NN training, though HW support for training remains an area of active research.)

One final NVDLA limitation is that it currently only supports one accelerator configuration, "nv_full" – see https://github.com/nvdla/hw/issues/94. Per hw/spec/defs/nv_full.spec, this configuration features 2048 multiply-accumulators in the CONV FU, and operates in 8-bit fixed-point (integer) mode for all functional units. The primary and secondary memory buses are each 512 bits wide. Discussion on the hw GitHub issue tracker indicates that different configurations, such as nv_small, will be supported in future NVDLA releases.

Results

For all evaluations, LeNet was run on a 28x28 MNIST .pgm input image. AlexNet and R-CNN were run on a 227x227 .jpeg input image, which was decompressed by the host before being sent to the accelerator. The CONV+SDP loadable was run with its integrated input data matrix. All inference operations were performed on 8-bit fixed-point (integer) representations.

Op counts

To get a high-level sense of the networks' compositions, we first consider the number of NVDLA operations necessary to run each of them to completion. The table and chart below show the relative makeup of each network, with the total number of operations listed atop each bar. Recall that none of the networks we tested use the BDMA or Rubik functional units.


Table 5: Neural net op counts

Network | CONV | SDP | PDP | CDP | Total
AlexNet | 15 | 22 | 3 | 2 | 42
R-CNN | 15 | 22 | 3 | 2 | 42
LeNet | 4 | 5 | 2 | 0 | 11
CONV+SDP | 1 | 1 | 0 | 0 | 2

Figure 12: Neural net op counts

At least in terms of number of operations, CONV and SDP ops dominate all other kinds. This makes sense, per the networks' namesake (CNN), and the fact that activations almost always follow convolutions. Recall that the CONV unit is also used for fully-connected layers, but that SDP activations usually follow these, too. What is more interesting is that the number of SDP ops is commonly greater than the number of CONV ops. Upon investigating the logs, we found that, for all three of the "real" networks, the scheduler was subdividing some – but not all – SDP stages into two separate ops, one immediately following the other. This behavior occurred most often in the early stages of the network, where the amount of data output by the preceding CONV layer was greatest.

This indicates that a wider SDP unit could potentially process all CONV outputs in one operation. Physically, this would be manifested in NVDLA as adding ports to the SDP element-wise LUT. The SDP_EW_THROUGHPUT parameter in the hw/spec/defs/nv_full.spec file indicates that SDP width will indeed eventually be adjustable. For now, however, its value is set at 4 ports, due to the fixed nv_full config. So, either the SDP unit itself is too narrow, or some other bottleneck exists in the pipeline between CONV and SDP. We surmise that, if such an external bottleneck exists, it is likely related to the pipeline components operating in independent, rather than fused, mode, with data needing to go back and forth on the DRAM interface instead of through the higher-throughput inter-stage FIFOs.

We saw that the scheduler sometimes subdivides ops to make them fit in the FUs. However, this means that some operations, even subdivided ops of the same "macro-op," may have different input data sizes. To resolve this potential imbalance, we now focus on the total amount of data processed by each FU throughout the course of execution of the entire network.

FU dataflow analysis

The number of data bytes consumed by each functional unit, across the execution of an entire network, is provided in tabular and graphical formats below. Because the networks have vastly-varying amounts of data (R-CNN and AlexNet much more than LeNet and CONV+SDP), they are plotted separately to maintain reasonable y-axis ranges.

Table 6: Bytes processed by each NVDLA FU

Network | CONV input | CONV weight | SDP | PDP | CDP | Total
AlexNet | 5,478,592 | 122,188,416 | 2,635,104 | 1,040,576 | 954,048 | 132,296,736
R-CNN | 5,478,592 | 115,634,816 | 2,635,104 | 1,040,576 | 226,496 | 125,015,584
LeNet | 37,376 | 861,184 | 47,136 | 45,056 | 0 | 990,752
CONV+SDP | 16,384 | 73,728 | 0 | 0 | 0 | 90,112

Figure 13: Bytes processed by each NVDLA FU

One interesting thing to note about these graphs is that, compared to the op counts chart, CONV (due to its weights) dominates an even larger portion of the overall percentages. This adds credence to the hypothesis that nv_full's SDP unit is undersized compared to the CONV unit that precedes it. Specifically, we notice that for CONV, there is a lot more weight data than there is input (i.e., activation) data. This makes sense, as the input dimensions for R-CNN and AlexNet are 227x227, while each of their .caffemodel (weights) files were around 200 megabytes in size. Note also that the final layers of all three "real" networks are fully-connected. In contrast to a true convolution, the matrix-matrix multiplication in the FC layers can blow up the output matrix dimensions by multiplying a smaller input matrix by a much larger weights matrix, further increasing the amount of weights data consumed.

Finally, we observe that in CONV+SDP, the SDP operation is really a no-op. This might be related to a scheduler issue that requires an SDP to follow a CONV, even if the SDP doesn't do anything (identity function). Here, however, it doesn't even act as an identity function – it simply takes in no data at all. CONV+SDP is a pre-packaged loadable intended to test only the CONV unit, so this isn't an issue.

Network simulation runtime

Unfortunately, the NVDLA SystemC virtual platform doesn't come with any tools for estimating cycles consumed. In the NVDLA literature, the authors suggest that an FPGA may be used instead for "limited cycle-counting performance evaluation" [Primer]. However, FPGA support for nv_full is still forthcoming (see https://github.com/nvdla/hw/issues/90). It may also be possible to instrument the SystemC in the virtual platform to count "virtual" cycles; we did not explore this option.

However, in order to get some time-based data, we simply looked at the simulation times required for the four networks. Obviously, simulation time corresponds very crudely to real execution time, but it does give some sense of the amount of computation necessary to run all the stages to completion.

Figure 14: Network simulation times

CONV+SDP and LeNet are significantly shallower networks than AlexNet and R-CNN, and exhibit correspondingly lower runtimes. Also recall that AlexNet and R-CNN operate on 227x227-pixel images, while LeNet's input is 28x28, and CONV+SDP uses an integrated 8x8 input image. As AlexNet and R-CNN both require 42 ops, in approximately the same order, it is reasonable that their runtimes are so similar.

The most interesting takeaway from the data is that simulation time seems to scale fairly well with network size; i.e., there wasn't a great deal of unavoidable overhead present for the smaller networks. We expect that moving the simulation to an FPGA can drastically reduce simulation times, and thus grant us the opportunity to sweep an even broader range of design parameters once the nv_full restriction is lifted.


Future directions

Our primary limitation with the NVDLA platform is that, for the time being, it doesn't give us cycle or timing information. Unfortunately, this prevented us from unifying the circuit-level work with the simulator work to the degree that we would have liked. Nonetheless, we were able to obtain valuable insights from each phase of the project, such as MAC parallelism vs. energy/area tradeoffs, and how neural net execution on a real hardware accelerator might be scheduled. Once the NVDC compiler is open-sourced and made more robust, we can look at a wider variety of networks; specifically, we could simulate VGG-19 on NVDLA to compare against our sequential MAC analysis. Finally, we eventually hope to synthesize and place-and-route the NVDLA design; this would give accurate clock cycle time estimates, as well as permit us to experimentally swap out the default 28nm Si CMOS logic + DRAM for CNFET + RRAM, which may provide significant EDP benefits [Aly15].

Conclusion

The rise of abundant-data computing, where a massive amount of structured and unstructured data is analyzed, has placed extreme demands on the energy efficiency of today's computing systems. With state-of-the-art CMOS circuits pushing the limits of Dennard scaling [Dennard74], improving the energy efficiency of general-purpose processor systems has become increasingly challenging. Several alternative hardware models have been explored as potential successors to general-purpose processor cores, including field-programmable gate arrays with bit-level configurability [Putnam14], domain-specific accelerators with limited programmability [Volta], and fixed-function accelerators [Jouppi17], each with its own tradeoffs. For example, field-programmable gate arrays are highly programmable, at the cost of higher energy/execution time per operation due to routing overheads and mapping logic primitives (e.g., ANDs, ORs, etc.) to multi-purpose lookup tables. On the other hand, fixed-function accelerators provide lower energy/execution time per operation due to optimized circuit implementations, at the cost of limited programmability. For this project, we focused on fixed-function deep learning inference accelerators for convolutional neural networks, as they represent a bound on the minimum energy/execution time that can be achieved in hardware for a popular class (i.e., deep learning inference) of abundant-data applications.

Our circuit-level design space exploration gave us a deep understanding of the crux of the design problem. We found that deep learning inference applications such as convolutional neural networks can be parallelized to a high degree. In particular, the area of the resultant accelerator becomes a difficult problem to manage as the application is aggressively parallelized. We found that a sequential MAC-based topology produces a design with a more reasonable area, at the cost of increased execution time and leakage energy. We show that naïve scheduling via aggressive parallelization (e.g., designing a large combinational circuit for the whole neural network inference application) is impractical, and that scheduling is an important and necessary overhead that must be characterized in order to develop accelerators that run applications at the optimal point on the roofline curve shown in Figure 1. We also show that pooling operations (e.g., maxpool) and activation functions (e.g., ReLU) do not comprise a significant fraction of the energy and execution time of a neural network accelerator. Given better NVDLA compiler support, we would have been able to analyze a wider range of neural network inference applications, and characterize the roofline model for the accelerator provided by the NVDLA framework. Using this information, we would have been able to characterize the scheduling overhead of the NVDLA dataflow. In the end, we were able to get some useful dataflow information out of NVDLA, even though we were not able to measure accurate cycle counts with it. In the future, we hope to add hooks into NVDLA to improve simulation accuracy, validate the simulator against physical design layouts of the accelerator, and unify the bottom-up and top-down approaches of Parts I and II, respectively.

Attributions

William performed in-depth circuit-level analysis and obtained adder, multiplier, and MAC energy-delay results in Part I. Andy brought up NVDLA and implemented DLISim on top of it in Part II. Both team members contributed equally to the Introduction and Conclusion, and to the project in general. All work done and data collected for the project was performed over the 10-week Winter 2017-2018 academic quarter.

DLISim source code is available at https://github.com/andrewbartolo/dlisim. In this repository, you'll find both a collection of logs from the four networks we were able to run, and Python scripts to process and annotate the logs.

Citations

[Abadi16] "TensorFlow: A system for large-scale machine learning." M. Abadi et al. 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI). 2016.
[Aly15] "Energy-efficient Abundant-data Computing: The N3XT 1,000x." M. M. S. Aly et al. Computer Magazine. 2015.
[Bell78] "The CRAY-1 Computer System." G. Bell et al. Communications of the ACM. 1978.
[Binkert14] "The gem5 simulator." N. Binkert et al. ACM SIGARCH Computer Architecture News.
[Chen16] "Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks." Y.-H. Chen et al. IEEE International Solid-State Circuits Conference. 2016.
[Dennard74] "Design of ion-implanted MOSFETs with very small physical dimensions." R. Dennard et al. IEEE Journal of Solid-State Circuits. 1974.
[Gelsinger00] 'The International Technology Roadmap for Semiconductors (ITRS): "Past, Present, and Future."' http://ieeexplore.ieee.org/document/906261/. Retrieved March 2018.
[Gonzalez96] "Energy Dissipation in General Purpose Microprocessors." R. Gonzalez, M. Horowitz. IEEE Journal of Solid-State Circuits. 1996.
[Han16] "EIE: Efficient Inference Engine on Compressed Deep Neural Network." S. Han et al. International Symposium on Computer Architecture. 2016.
[HMC] "Hybrid Memory Cube (HMC)." J. T. Pawlowski. Hot Chips 23. 2011.
[Jouppi17] "In-Datacenter Performance Analysis of a Tensor Processing Unit." N. Jouppi et al. International Symposium on Computer Architecture. 2017.
[Kocher18] "Spectre Attacks: Exploiting Speculative Execution." P. Kocher et al. https://spectreattack.com/spectre.pdf. 2018.
[Krizhevsky12] "ImageNet Classification with Deep Convolutional Neural Networks." A. Krizhevsky, I. Sutskever, and G. Hinton. Neural Information Processing Systems. 2012.
[Kung82] "Why Systolic Architectures?" H. T. Kung. IEEE Computer. 1982.
[Linaro] "Linaro – Leading software collaboration in the ARM ecosystem." https://linaro.org/downloads. Retrieved March 2018.
[Mangard18] "Meltdown and Spectre." S. Mangard et al. https://meltdownattack.com/meltdown.pdf. 2018.
[MCM] "Multi-Chip Module." Techopedia. https://www.techopedia.com/definition/11836/multi-chip-module-mcm. Retrieved March 2018.
[Miller10] "Graphite: A Distributed Parallel Simulator for Multicores." J. Miller et al. International Symposium on High-Performance Computer Architecture. 2010.
[ModelZoo] "BVLC CAFFE Model Zoo." https://github.com/BVLC/caffe/wiki/Model-Zoo. Retrieved March 2018.
[NECAurora] "A deep dive into NEC's Aurora vector engine." T. Morgan. The Next Platform. https://www.nextplatform.com/2017/11/22/deep-dive-necs-aurora-vector-engine/
[Nowatzki14] 'gem5, GPGPUSim, McPAT, GPUWattch, "Your favorite simulator here" Considered Harmful.' T. Nowatzki et al. 2014.
[NVDLA] "The NVIDIA Deep Learning Accelerator." http://nvdla.org. Retrieved March 2018.
[Primer] "NVDLA Primer." http://nvdla.org/primer.html. Retrieved March 2018.
[Putnam14] "A Reconfigurable Fabric for Accelerating Large-Scale Datacenter Services (Catapult)." A. Putnam et al. International Symposium on Computer Architecture. 2014.
[Roadmap] "NVDLA Open Source Roadmap." http://nvdla.org/roadmap.html. Retrieved March 2018.
[Sanchez13] "ZSim: Fast and Accurate Microarchitectural Simulation of Thousand-Core Systems." D. Sanchez and C. Kozyrakis. International Symposium on Computer Architecture. 2013.
[Simonyan14] "Very Deep Convolutional Networks for Large-Scale Image Recognition." K. Simonyan and A. Zisserman. International Conference on Learning Representations. 2014.
[Volta] "NVIDIA Tesla V100 GPU Architecture." http://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf. Retrieved March 2018.