Domain-Specific Accelerator Design & Profiling for Deep Learning Applications
From Circuits to Architecture

Andrew Bartolo & William Hwang
{bartolo, hwangw}@stanford.edu

Introduction

Where We've Been

The field of computer architecture is exiting an era of predictable gains, and entering an era of rapid change. For decades, the trends of Dennard scaling and Moore's law improved energy-delay product [Gonzalez96] while allowing ever-higher numbers of transistors to be integrated on a single chip. These trends set the pace for the entire hardware industry, and ultimately drove the economics of computation. Year over year, computer users could count on hardware that was faster, and oftentimes cheaper. If an application didn't work well on existing hardware, there was a decent chance that next year's refresh would bring about a processor that was up to the task. From the 2 MHz Intel 8080 in 1974, to the 3.8 GHz Pentium 4 Prescott in 2005, CPU clock speed increased nearly 2000X over three decades. (Dennard scaling enables smaller, faster logic to fit within existing power and thermal envelopes [Dennard74].)

From an architect's point of view, device scaling provided an increasing number of transistors to play with. The architect's challenge thus became one of how to organize these transistors cleverly, so as to increase performance. In the 1980s, schemes such as superscalar and out-of-order execution gained popularity, and have remained in general-purpose architectures ever since. Superscalar issue – i.e., issuing multiple instructions at once – promotes increased utilization of a processor's functional sub-units, and out-of-order execution allows processors to hide a good deal of memory latency. The imbalance between compute and memory remains a serious problem – perhaps the most fundamentally important problem facing computer architects today.

Following the development of superscalar and OoO, techniques such as branch prediction and speculative execution became popular. As clock speeds ratcheted up, execution pipelines needed to be decomposed into more stages, so that each stage's critical path would not exceed the clock period. Architects correctly surmised that keeping the pipeline full – even with instructions that weren't guaranteed to be the "right" ones – would lead to more instructions processed per cycle, with less energy wasted idling. Thus, branch prediction and speculative execution aimed to keep the pipeline as full as possible.

By the mid-2000s, Dennard scaling had come to a halt. However, Moore's law had granted architects such an abundance of transistors that it became possible to build two high-performance, superscalar, out-of-order cores together on a commodity chip. Intel's Pentium D shipped two such cores on a multi-chip module [MCM], and its successor, Core Duo, integrated these two cores onto a single die. By simply clone-stamping multiple cores onto one die, these CPUs did something unprecedented – they shifted the burden of extracting increased performance to the software layer. No longer could the average computer user buy this year's Pentium and hope for better performance – without a software rewrite for the multi-core paradigm, there was no performance increase to be had!

Like the first multicore machines, new domain-specific designs will require enhanced software and compiler support for efficient use. Designs such as GPUs, FPGAs, CGRAs, TPUs, tiled manycores, and others demand a fundamental rethinking of the software-hardware interface.
Frequently, an intermediate representation such as TensorFlow XLA is used to encode dataflow dependencies before data can actually be processed by hardware [Abadi16]. Therefore, it seems likely that tomorrow's computer architect will need to be as well-versed in software as she is in hardware.


Where We're Going

The mid-to-late 2010s will be remembered for their "Cambrian explosion" of new computer architectures. However, it turns out that many "new" architectures really aren't so new after all. Designs first introduced in the '70s and '80s, and that have languished since, are now poised to make a comeback. For instance, Google's Tensor Processing Unit (TPU) is, at its core, a large systolic matrix multiplier – a scheme that dates back to work done by H. T. Kung in 1982 [Kung82]. NEC's new Aurora vector processor heavily resembles the vector units of the Cray-1 from 1975 [Bell78, NECAurora]. The fundamental reason for these architectures' resurgence is that general-purpose CPUs are ill-equipped to process data in parallel at scale. And, with the dawn of machine learning and massive datasets collected from cheap and abundant sensors, the demand for parallel compute resources has never been higher.

One other reason for these architectures' newfound success is their simplicity – at least, compared to modern superscalar CPUs. Flaws such as Meltdown and Spectre [Mangard18, Kocher18] prove that CPU design carries an unsustainable amount of technical debt. By moving to simpler, yet highly parallel, hardware execution units, the field of computer architecture accomplishes two things: 1.) it shifts a good deal of design complexity from hardware to software, which enjoys much more rapid development, and 2.) it opens the playing field to a host of smaller, innovative participants.

Figure 0: Trends in chip manufacturing and test costs

On one hand, it seems likely that cheap general-purpose cores will displace simpler microcontrollers in all but the lowest-cost and lowest-power devices. (Why buy an Arduino when you can have a Raspberry Pi for the same price?) However, in areas where performance, or performance-per-watt, is crucial, domain-specific accelerators are poised to become the architecture of choice.

For these reasons, our project focuses on domain-specific accelerators for deep learning applications.

Part I: MAC and maxpool circuit-level analyses

Our project first considers domain-specific accelerators at the circuit level. To do this, we asked the following question: if we were to unroll the dataflow graph of a contemporary neural network (say, VGG-19), how much parallelism could we extract from this graph in the absence of energy and area constraints?

Experimentally-calibrated studies of conventional manycore processor architectures (e.g., Xeon Phi) have shown that a majority of energy and execution time (greater than ~90%) is spent accessing memory across a range of abundant-data applications (e.g., PageRank) [Aly15], due to limited connectivity between compute logic and off-chip memories (generally DRAM). For particular applications, fixed-function accelerators (e.g., Eyeriss [Chen16], EIE [Han16], etc.) can improve overall system energy efficiency through optimized dataflow implementations that maximize memory reuse while limiting off-chip memory accesses. Such implementations provide an isolated snapshot of the full architecture design space, and are not necessarily optimized to fully utilize all available compute and memory resources. As such, a key question remains: how does one design energy-efficient accelerators in the abundant-data era, which fully utilize all available compute and memory resources while maximizing computational throughput? Figure 1 graphically illustrates the crux of the design problem using the roofline model, where the y-axis refers to the computational throughput (in operations per second) of a given accelerator architecture, and the x-axis indicates the operational intensity of an application (in operations per byte of data accessed).
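To make the roofline bound concrete, the following minimal Python sketch computes the attainable throughput at a given operational intensity. The hardware parameters are placeholders for illustration, not numbers from any accelerator discussed here:

def roofline_ops_per_s(op_intensity, peak_ops_per_s, mem_bytes_per_s):
    """Attainable throughput (ops/s) at a given operational intensity
    (ops/byte): the lower of the compute roof and the bandwidth slope."""
    return min(peak_ops_per_s, mem_bytes_per_s * op_intensity)

# Placeholder machine: 1 Tops/s compute roof, 100 GB/s memory bandwidth.
PEAK, BW = 1e12, 100e9
ridge = PEAK / BW   # ops/byte where the two roofs meet (10 here): the "optimal point"

Kernels to the left of the ridge are memory-bound; kernels to the right are compute-bound and leave memory bandwidth idle.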

Figure 1: A graphical illustration of the design problem using the roofline model

Design space exploration

In order to understand the design space with greater depth, we constructed a simple analytical model to explore it. We first noted that many popular convolutional neural networks (e.g., VGG-19 [Simonyan14]) are comprised of a series of convolutional, fully connected, pooling, and ReLU (rectified nonlinearity [Krizhevsky12]) layers. In this design space exploration, we focused on the convolutional and fully connected layers for the following reasons:

1. Multiply-accumulate (MAC) operations comprise the bulk of the arithmetic operations. The underlying arithmetic kernel for convolutional and fully-connected (e.g., matrix multiply) layers is the MAC operation. The size of the MAC kernel can be parameterized in terms of the size of the convolutional or fully-connected layer, as summarized in Figure 2.

2. ReLU operations implement the function ReLU(x) = max(0, x). In hardware, this can be implemented as a bitwise AND operation, where one input is x, and the negation of the sign bit of x is broadcast to the other input. In this way, the output of the bitwise AND operation is x if x ≥ 0 and 0 otherwise (see the sketch after this list). Typically, every MAC operation is followed by a ReLU operation, and the energy and execution time of the MAC dominates. This assumption is later substantiated with detailed physical design studies.

3. Max pooling layers implement the function maxpool(x_1, ..., x_n) = max(x_1, ..., x_n). For popular networks (e.g., AlexNet, VGG-19, ResNet-152), n is typically 4 or 9, resulting in a reduction tree of depth 2 or 4, respectively. The energy and execution time of such operations is small relative to that of the MAC operations. This assumption is later substantiated with detailed physical design studies.
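As a sanity check of the sign-bit trick in item 2, here is a minimal Python sketch for 8-bit two's-complement values (the bit width follows our 8-bit fixed-point assumption; the function name is ours):

def relu_8bit(x):
    """ReLU via bitwise AND: broadcast the negated sign bit across all 8 bits."""
    x &= 0xFF                              # interpret as an 8-bit two's-complement word
    sign = (x >> 7) & 1                    # 1 if negative, 0 otherwise
    mask = 0xFF if sign == 0 else 0x00     # negated sign bit, broadcast to 8 bits
    return x & mask                        # passes x through when x >= 0, else 0

assert relu_8bit(0x05) == 0x05             # +5 -> +5
assert relu_8bit(0xFB) == 0x00             # -5 (two's complement) -> 0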

[Figure 1 plot: Performance (OP/s) vs. Operational Intensity (Operations/Byte), annotated with a "Compute-Bound (inefficient use of memory resources)" region and the "Optimal Point (compute & memory resources fully utilized)".]


Figure 2: The size of the MAC kernels for both convolutional and fully connected (e.g., matrix multiply) layers.

We make several simplifying assumptions to model the lower bound of the energy and execution time of a convolutional neural network inference accelerator, in order to understand the inherent parallelism available in the inference phase of convolutional neural networks:

1. Memory access is "free" (i.e., zero energy and delay), as these parameters are not inherent to the compute logic, but rather to the integration technique used to connect logic and memory components (e.g., off-chip memories, interposer-based 2.5D integration [Volta], die-stacked 3D-TSV integration [HMC], monolithic 3D [Aly15], etc.).

2. Compute logic consists of multipliers and adders only, for the reasons stated above.
3. One hardware MAC unit is instantiated per MAC operation in the neural network.
4. All operations are 8-bit fixed-point operations, consistent with [Jouppi17].

To estimate the minimum execution time and energy of the accelerator, we first performed detailed physical design studies of the multiplier and adder circuits. Following this, we analytically express the minimum system energy and delay using the MAC energy and delay derived from the circuit-level simulations, ignoring routing overheads.

Reduction-Tree MAC:

We explore the reduction-tree based MAC as our first MAC topology. Such an implementation represents the maximum parallelization that can be achieved at the circuit level, at the cost of increased chip area. A reduction-tree based MAC of size n is comprised of n multipliers and a reduction adder of depth log2(n) with (n − 1) adders, as shown in Figure 3.

Figure 3: Dataflow of a reduction-tree based MAC.

The energy, execution time, and area of such a MAC topology can be modeled with the following equations, where t, E_dyn, and P_leak refer to the delay, dynamic energy, and leakage power, respectively:

1. delay = t_mult + log2(n) × t_add
2. add_energy = (n − 1) × (E_add,dyn + delay × P_add,leak), where the add subscript denotes the delay, dynamic energy, and leakage power of a single adder. add_energy refers to the total energy of the reduction adder.
3. mult_energy = n × (E_mult,dyn + delay × P_mult,leak)
4. area = area_mult × #mult + area_add × #add, where #mult and #add denote the number of multipliers and adders, respectively.
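The model is simple enough to state directly in code. The Python sketch below evaluates equations 1-3 for an n-input reduction-tree MAC; the circuit parameters in the example are placeholders, not our post-layout 28nm values (those appear only in Figures 4 and 5):

import math

def reduction_tree_mac(n, t_mult, t_add, E_mult_dyn, E_add_dyn,
                       P_mult_leak, P_add_leak):
    """Return (delay, energy) for an n-input reduction-tree MAC (eqs. 1-3)."""
    depth = math.ceil(math.log2(n))                           # tree depth; ceil covers non-power-of-2 n
    delay = t_mult + depth * t_add                            # eq. 1
    add_energy = (n - 1) * (E_add_dyn + delay * P_add_leak)   # eq. 2
    mult_energy = n * (E_mult_dyn + delay * P_mult_leak)      # eq. 3
    return delay, add_energy + mult_energy

# Placeholder example (units: seconds, joules, watts):
d, e = reduction_tree_mac(9, t_mult=0.5e-9, t_add=0.2e-9,
                          E_mult_dyn=1e-13, E_add_dyn=2e-14,
                          P_mult_leak=1e-6, P_add_leak=2e-7)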

We performed detailed physical design studies using an industry 28nm process development kit (PDK) to obtain post-layout values for t, E_dyn, and P_leak. We used Synopsys tools to perform synthesis (Design Compiler), place & route (IC Compiler), and power simulation (PrimeTime). More specifically, we performed RC extraction on the circuits post place & route, prior to power simulation, such that the delay, energy, and power numbers we obtained account for effects present in realistic VLSI circuits (e.g., wire parasitics). The results are summarized in Figures 4 and 5 below. The Pareto frontier is obtained by specifying timing constraints during synthesis, place and route, and power simulation, allowing Design Compiler to optimize the circuit topology (e.g., ripple-carry vs. carry-lookahead adders) to meet the specified timing constraint. We select the energy-delay-product (EDP) minimum point on the Pareto frontier as the optimal circuit implementation, trading off energy and delay with equal weight.

Figure 4: The Pareto frontier of the energy-delay landscape for a 28nm 8-bit adder, derived from post-layout power simulations after RC extraction at the circuit level with an industry PDK.

Figure 5: The Pareto frontier of the energy-delay landscape for a 28nm 8-bit multiplier, derived from post-layout power simulations after RC extraction at the circuit level with an industry PDK.

Using the results from our circuit-level simulations, we can model the convolutional neural network inference accelerator as a fully combinational circuit and estimate its energy and execution time. We note that this approach does not take routing overheads into account, though these overheads can be ignored for now, as we are trying to understand the bounds of the problem. The total system-level energy and execution time can be estimated based on the number of MAC units per layer, and the number of multipliers per MAC. We show VGG-19 [Simonyan14] as an example application in Table 1 below.

Table 1: Estimated minimum energy and execution time of VGG-19 using the model discussed above.

It is important to note that these large EDP benefits come at a large area cost. We estimate the area of a single multiplier and adder to be approximately 270 μm² and 50 μm², respectively, based on post-place-and-route area estimates. To achieve the level of parallelism described here using the reduction-tree based MAC topology, the area cost is on the order of 6.28 m², or approximately 90 industry-standard 300mm silicon wafers! Such an area cost is astronomical and untenable in real-life systems.
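As a quick arithmetic check of the wafer count (the only inputs are the numbers just stated):

import math

wafer_area_m2 = math.pi * 0.150 ** 2   # one 300mm wafer: ~0.0707 m^2
print(6.28 / wafer_area_m2)            # ~88.8, i.e., approximately 90 wafers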

Thus, we explore the sequential MAC topology, which is more area-efficient than the reduction-tree based MAC.

Sequential MAC:

The sequential MAC topology instantiates a single adder and multiplier per MAC, and time-multiplexes the inputs to multiply and accumulate the input arguments over multiple clock cycles. The sequential MAC is typically more area-efficient than the reduction-tree based MAC, as each sequential MAC is comprised of a single adder and multiplier (vs. n − 1 adders and n multipliers in a reduction-tree based MAC) plus flip-flops on the inputs and outputs (Figure 6).

Figure 6: Dataflow of a sequential MAC unit.

The energy and execution time of a sequential MAC can be modeled with the following equations, where t_seq, E_seq,dyn, and P_seq,leak refer to the delay, dynamic energy, and leakage power of the sequential MAC, respectively:

1. delay = n × t_seq, where t_seq is defined as the time it takes to propagate a signal from the input of the multiplier to the output of the output flip-flop (i.e., one clock cycle).
2. energy = n × (E_seq,dyn + delay × P_seq,leak)
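A companion sketch to the reduction-tree model above evaluates these two equations as written, again with placeholder parameters:

def sequential_mac(n, t_seq, E_seq_dyn, P_seq_leak):
    """Return (delay, energy) for an n-input sequential MAC (eqs. 1-2)."""
    delay = n * t_seq                                # eq. 1: one clock cycle per input
    energy = n * (E_seq_dyn + delay * P_seq_leak)    # eq. 2, as stated above
    return delay, energy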

We performed detailed physical design studies of the sequential MAC unit using an industry 28nm PDK, similar to those described previously. For brevity, we refer the reader to the description in the reduction-tree MAC section. We select the EDP minimum point on the Pareto frontier as the optimal circuit implementation (Figure 7).

Figure 7: The Pareto frontier of the energy-delay landscape for a 28nm 8-bit sequential MAC unit, derived from post-layout power simulations after RC extraction at the circuit level with an industry PDK.

We then modeled the energy and execution time of an accelerator based on the sequential MAC unit, similar to the analysis in the previous section. The total system-level energy and execution time can be estimated based on the number of MAC units per layer, and the number of multipliers per MAC. We show the results for VGG-19 [Simonyan14] as an example in Table 2 below. We estimate the area of the sequential MAC to be approximately 444 μm² based on post-place-and-route results, which corresponds to a system-level area cost of approximately 6,600 mm², or approximately 0.1 industry-standard 300mm silicon wafers.

Table 2: Estimated minimum energy and execution time of VGG-19 using the sequential MAC.


Takeaways and Revisiting Assumptions:

Our design space exploration has led us to the following conclusions. First, there is a lot of parallelization left on the table; that is, neural network inference applications (e.g., VGG-19) have a significant amount of inherent parallelism. Without area constraints, the minimum bound for energy and execution time is very low when an accelerator is constructed with a fully-combinational network of reduction-tree based MAC units. However, area is an important constraint. A fully-parallelized network with the reduction-tree based MAC unit produces an equivalent area of ~90 industry-standard 300mm silicon wafers, which is simply unobtainable using today's technology. A sequential MAC, which time-multiplexes the inputs and reuses the same multiplier and adder over multiple clock cycles, achieves a significantly lower area compared to the reduction-tree based MAC (~890× lower). However, this area reduction comes at the cost of ~874× higher execution time, and ~675× higher energy. The higher execution time originates from the time multiplexing, and the increased energy consumption is a result of the leakage power integrated over an ~874× longer execution time. Thus, it is important to understand the cost of scheduling neural network applications onto a limited number of MAC units, as the naïve, fully-combinational implementation (which is easier to schedule) is untenable in terms of area cost.

We revisit an earlier assumption: that the maxpool layers were an insignificant contributor to the energy and execution time of a deep learning inference accelerator. We perform a physical design study of the maxpool kernel from VGG-19, where the maxpool kernel has a 2x2 window size with stride 2. More specifically, this is a 4-to-1 maxpool function that computes the maximum over 4 arguments. We perform the physical design study using the same tool flow as described in previous sections. The results are shown in Figure 8.

Figure 8: The Pareto frontier of the energy-delay landscape for a 28nm 8-bit maxpool unit with a window size of 2x2, derived from post-layout power simulations after RC extraction at the circuit level with an industry PDK.

At the network level, we can estimate the effect of the maxpool layers with the following equations:

1. delay = #maxpool_layers × t_maxpool, where #maxpool_layers refers to the number of maxpool layers, and t_maxpool refers to the EDP-optimal delay through a single maxpool kernel.
2. energy = #maxpool_ops × (E_maxpool,dyn + delay × P_maxpool,leak), where #maxpool_ops refers to the number of maxpool operations.
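These two estimates can be expressed in the same style as the MAC sketches above (placeholder parameters again):

def maxpool_cost(n_layers, n_ops, t_maxpool, E_mp_dyn, P_mp_leak):
    """Return (delay, energy) for a network's maxpool layers (eqs. 1-2)."""
    delay = n_layers * t_maxpool                       # eq. 1
    energy = n_ops * (E_mp_dyn + delay * P_mp_leak)    # eq. 2
    return delay, energy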

For VGG-19, the maxpool delay, energy, and area correspond to ~5 ns, ~25 µJ, and 80 mm², respectively, for the sequential MAC topology, which amount to less than ~0.1% of the sequential MAC delay and energy, and less than ~1% of the sequential MAC area. For this reason, our initial assumption that the maxpool layers were insignificant contributors was correct. We also assumed that the ReLU activation function was insignificant. Earlier, we asserted that the ReLU activation immediately follows a MAC operation and can be constructed in hardware with an 8-bit bitwise AND gate and an inverter. This translates to 9 logic gates. We inspect the post-place-and-route netlist of the sequential MAC unit (which has fewer gates than the reduction-tree based MAC), and note that there are ~400 gates. As such, the activation function accounts for roughly 2% of the logic gates, and would not contribute significantly to energy, as energy roughly correlates with the number of gates in a combinational circuit. The delay would not increase significantly either, as the ReLU activation would add ~1 gate delay to the critical path, which is not very significant considering the serial nature of the sequential MAC topology.

Part II: Top-down Deep Learning simulation

In search of a Deep Learning Accelerator simulator

Modern computer architecture is undeniably complex. Furthermore, the tremendous time and financial costs associated with taping out a chip dictate that most design space exploration must be done within simulations. Within the architecture research community, it is a commonly-held belief that, due to the often-unpredictable interactions between system subparts, analytical models by themselves are insufficient [Miller10, Sanchez13]. While some opinions differ [Nowatzki14], what is generally required are simulators that directly model the underlying components of the system, and the interactions that can occur between them.

Cycle-level simulators such as gem5 [Binkert14] suffer from two fundamental problems. The first is that they are prohibitively slow. These simulators do have a place – they can model individual functional units – but are so sluggish as to be almost useless in modeling chip-level behavior of entire applications (rather than just traces). The second is that they support only CPU-like architectures, and extending the base architectures involves writing and validating substantial amounts of SystemC or C++.

We need a simulator that, like zsim, takes in two things: 1.) a high-level accelerator configuration, and 2.) an off-the-shelf, prepackaged application. For zsim, 1.) takes the form of a simple .config file, and 2.) takes the form of a standard Linux ELF binary. For our deep learning inference simulator, 1.) should also be a .config file, and 2.) should be a high-level input representation such as a pre-trained TensorFlow or CAFFE model.

In light of these requirements, we successfully brought up Nvidia's new NVDLA deep learning inference accelerator and built a lightweight simulator, DLISim, on top of it. Our simulator runs entirely in software (though opportunities for hardware acceleration exist) and spits out rich architecture-level statistics, such as: "How much data was used across all convolution (matmul) stages?", "How many operations of each type (matrix multiply, activation function, pooling, etc.) occurred?", and "How were large operations subdivided and scheduled on the hardware?" NVDLA+DLISim is, to the best of our knowledge, the first such open-source simulator to accomplish the goals outlined in the previous paragraph.

NVDLA – what it gives us

The NVDLA project [NVDLA] is comprised of three distinct, complementary code repositories. The first of these is hw, the hardware repository. It contains Verilog RTL and a SystemC golden model, as well as testbenches. (The provided Verilog passed its testbenches when we ran them in VCS.) The second is sw. It contains the two drivers necessary for NVDLA to interact with its host system: a user-mode driver, UMD, for tasks such as .jpeg image decompression, and a kernel-mode driver, KMD, responsible for setting control registers and transferring data. In addition, sw contains the (as of now) black-box binary nvdla_compiler. Finally, vp provides an ARM QEMU execution environment for the sw drivers to run in. The overall system layout is below.


Figure 9: NVDLA platform overview

This gives us quite a bit to work with. It should be noted, however, that the entire platform is very much in an alpha-release stage, and that source code for some components is simply not available yet. For instance, Nvidia's open-source roadmap indicates that source code for the model compiler (NVDC, i.e., nvdla_compiler) will be available eventually, though no exact date is given – "The compiler will initially be only released in binary form, with a source release coming later" [Roadmap]. This presented a problem when several popular DL models failed to compile – we had no means of fixing the compiler ourselves. Nonetheless, Nvidia has, as of now (mid-March 2018), stuck to the release schedule at http://nvdla.org/roadmap.html, and has completed 2 of the 3 proposed stages (2017-Q3 and 2017-Q4, with 2018-H1 forthcoming). The hw and sw components were first made available to the public in October 2017, and the vp component in December 2017.

Like any processor, NVDLA contains different types of functional units. The following table details the six types of functional units present in NVDLA, their acronyms/abbreviations, and the specific operation within a neural net that each supports.

Table 3: NVDLA functional units

Acronym/Abbr. | Functional Unit | Deep learning functionality
CONV | Convolutional MAC | Matrix-matrix multiplication
SDP | Scalar Data Processor | Applies activation functions (sigmoid, tanh)
PDP | Planar Data Processor | Applies pooling functions (max, min, average)
CDP | Channel Data Processor | Applies normalization (centers mean, variance)
Rubik | Tensor reshape | Splits/slices/merges data to fit into compute units
BDMA | Bridge Direct Memory Access | Transfers data from DRAM to SRAM cache (not typically used)


The diagram below shows the layout of the individual functional units (left-hand side), as well as how the NVDLA core connects to the SRAM buffer ("Second DBB interface"), DRAM ("DBBIF"), and CPU ("IRQ" for a 1-bit synchronous interrupt signal, and CSB for 32-bit synchronous control signaling). Image and weight data is DMA'ed over the DBBIF interface. [Primer]

Figure 10: NVDLA core overview

The Convolution Buffer can be thought of as akin to an L1(-D) cache in a traditional CPU, and the SRAM buffer ("Second DBB interface") as an L2 cache. Finally, the DBB interface provides DMA access to DRAM.

Note that the first four functional units are arranged so that, dataflow scheduling permitting, data can move directly from one stage to another, and not have to make a round-trip to SRAM/DRAM and back. In the NVDLA literature, this is referred to as "fused" mode, in contrast to "independent" mode. Physically, fused-mode pipelining is made possible with a set of small FIFOs between each of the functional units [Primer]. However, in our experiments, the NVDLA KMD scheduler never pipelined any operations, even when it could have (a "fused_parent = [ dla_consumer =>" annotation should have appeared in the logs, but never did). In the neural networks we tested, activation functions nearly always followed convolutions, etc., so it seems that this early implementation of the KMD scheduler leaves a lot of performance on the table.

Regarding the functional units, CONV is really a general-purpose matrix multiply unit. In practice, it's often used to perform convolutions, but it also supports more general matrix-matrix multiplication for fully-connected layers. SDP currently supports sigmoid and tanh activations, as well as P/ReLU, but, since it's implemented as a simple LUT, it could be modified to handle other types of activations by simply changing the LUT ROM. BDMA is capable of transferring data directly from the "Primary" DRAM interface to the "secondary" SRAM cache. None of the four networks we looked at made use of the BDMA FU, and it's possible that nvdla_compiler doesn't support it yet anyway. Likewise, no network we tested made use of the Rubik reshape engine.

DLISim – what NVDLA lacks

Despite the amount of code available in the three repos, NVDLA's pre-compiled drivers grant very little visibility into the runtime workings of the system. After feeding in an NVDLA loadable and input .jpeg image, the provided nvdla_runtime binary simply indicates whether the network ran to completion or encountered an error. However, since the driver is open-source, we were able to instrument it to provide a lot of additional information.


To get useful profiling data, we modified the KMD driver to enable a series of op-triggered printouts. Each time one of the six operations runs on the SystemC model, a message is passed to the kernel via dmesg. Each of these messages contains information such as the number of input bytes processed, data precision, and stride length through the input array(s). Over the course of execution, these messages collect in the kernel's ring buffer, and can be dumped as a log file with timestamps upon completion. (For a driver, dmesg logging is considered cleaner than printing to stdout/stderr.)
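Once dumped, these logs are straightforward to mine. Below is a minimal Python sketch of the kind of per-FU tallying our parsing scripts (described shortly) perform; the message format in the regular expression is hypothetical, since the real format is defined by our KMD modifications:

import re
from collections import Counter

# Hypothetical message shape; the actual fields come from our KMD printouts.
OP_RE = re.compile(r"(?P<op>CONV|SDP|PDP|CDP|RUBIK|BDMA).*?input_bytes=(?P<nbytes>\d+)")

def summarize(log_path):
    """Tally op counts and input bytes consumed per functional unit."""
    ops, nbytes = Counter(), Counter()
    with open(log_path) as f:
        for line in f:
            m = OP_RE.search(line)
            if m:
                ops[m.group("op")] += 1
                nbytes[m.group("op")] += int(m.group("nbytes"))
    return ops, nbytes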

After modifying the source code to enable the printouts, we needed to recompile the driver. The KMD code compiles into a kernel module, which must be loaded in the QEMU emulator's BusyBox Linux environment via insmod before simulation can begin. Unfortunately, since the QEMU simulator runs AArch64, the modified KMD module must be cross-compiled. Furthermore, to cross-compile a kernel module, the kernel itself must be cross-compiled first. Fortunately, the Linaro toolchain [Linaro] was a big help in getting both the kernel compiled and the modified KMD running on AArch64.

In addition to the KMD source code modifications, we also created some Python scripts to parse the logs output by KMD, and boil the statistics down into a more digestible format. The scripts can be found at https://github.com/andrewbartolo/dlisim.

See below for a schematic of our modified NVDLA setup, with DLISim attached.

Figure 11: NVDLA overview with DLISim attached

To get a clear sense of the simulator and architecture's capabilities, we tested 13 networks from CAFFE's Model Zoo [ModelZoo], which are shown in the table below. Unfortunately, many of them didn't work with the given nvdla_compiler. Since Nvidia hasn't open-sourced the compiler yet, we were unable to do much to remedy this issue. What little insight we have into nvdla_compiler is granted by the fact that the provided binary is not stripped; after running it under GDB, we at least have a stack trace of what caused each segmentation fault. Ongoing discussion on the nvdla/sw GitHub issues tracker (https://github.com/nvdla/sw/issues) indicates that the NVDLA development team is aware of the issues, and is working to fix them.


Table 4: Neural nets and NVDLA compiler status

Network | Input Dataset | Status | Notes
AlexNet | ILSVRC2012 | WORKS |
CONV+SDP | n/a | WORKS |
R-CNN | ILSVRC2012 | WORKS |
GoogLeNet | ILSVRC2014 | FAIL | Compiler could not resolve dependency in dataflow graph (CONV layer)
HybridCNN | ILSVRC2012 | FAIL | Compiler segmentation fault (in parseCaffeNetwork())
LeNet | MNIST | WORKS |
MobileNet v1 | ILSVRC2012 | FAIL | Compiler .prototxt parser error (Pooling layer)
MobileNet v2 | ILSVRC2012 | FAIL | Compiler .prototxt parser error (Pooling layer)
Network-in-Network | ILSVRC2012 | FAIL | Compiler segmentation fault (in parseCaffeNetwork())
SqueezeNet v1.0 | ILSVRC2012 | FAIL | Compiler .prototxt parser error (Pooling layer)
SqueezeNet v1.1 | ILSVRC2012 | FAIL | Compiler .prototxt parser error (Pooling layer)
VGG-16 | ILSVRC2014 | FAIL | Compiler segmentation fault (in parseCaffeNetwork())
VGG-19 | ILSVRC2014 | FAIL | Compiler segmentation fault (in parseCaffeNetwork())

Note that all of these are convolutional neural networks (CNNs). Unfortunately, there aren't as many CAFFE models available for RNNs, such as LSTMs and GRUs. In the future, a TensorFlow frontend may facilitate running these networks on NVDLA. Google's TPU supports recurrent neural networks (in fact, Google claims that 90% of its deployed TPUs are running MLPs or LSTM RNNs, rather than CNNs [Jouppi17]). Since NVDLA is very similar to the TPU, it should be possible to add support for these networks. Also note that all of these are pre-trained networks, with trained weights provided in .caffemodel format. (In this project, we didn't consider the problem of NN training, though HW support for training remains an area of active research.)

One final NVDLA limitation is that it currently only supports one accelerator configuration, "nv_full" – see https://github.com/nvdla/hw/issues/94. Per hw/spec/defs/nv_full.spec, this configuration features 2048 multiply-accumulators in the CONV FU, and operates in 8-bit fixed-point (integer) mode for all functional units. The primary and secondary memory buses are each 512 bits wide. Discussion on the hw GitHub issue tracker indicates that different configurations, such as nv_small, will be supported in future NVDLA releases.

Results

For all evaluations, LeNet was run on a 28x28 MNIST .pgm input image. AlexNet and R-CNN were run on a 227x227 .jpeg input image, which was decompressed by the host before being sent to the accelerator. The CONV+SDP loadable was run with its integrated input data matrix. All inference operations were performed on 8-bit fixed-point (integer) representations.

Op counts

To get a high-level sense of the networks' compositions, we first consider the number of NVDLA operations necessary to run each of them to completion. The table and chart below show the relative makeup of each network, with the total number of operations listed atop each bar. Recall that none of the networks we tested use the BDMA or Rubik functional units.


Table 5: Neural net op counts

Network | CONV | SDP | PDP | CDP | Total
AlexNet | 15 | 22 | 3 | 2 | 42
R-CNN | 15 | 22 | 3 | 2 | 42
LeNet | 4 | 5 | 2 | 0 | 11
CONV+SDP | 1 | 1 | 0 | 0 | 2

Figure 12: Neural net op counts

At least in terms of number of operations, CONV and SDP ops dominate all other kinds. This makes sense, per the networks' namesake (CNN), and the fact that activations almost always follow convolutions. Recall that the CONV unit is also used for fully-connected layers, but that SDP activations usually follow these, too. What is more interesting is that the number of SDP ops is commonly greater than the number of CONV ops. Upon investigating the logs, we found that, for all three of the "real" networks, the scheduler was subdividing some – but not all – SDP stages into two separate ops, one immediately following the other. This behavior occurred most often in the early stages of the network, where the amount of data output by the preceding CONV layer was greatest.

This indicates that a wider SDP unit could potentially process all CONV outputs in one operation. Physically, this would be manifested in NVDLA as adding ports to the SDP element-wise LUT. The SDP_EW_THROUGHPUT parameter in the hw/spec/defs/nv_full.spec file indicates that SDP width will indeed eventually be adjustable. For now, however, its value is set at 4 ports, due to the fixed nv_full config. So, either the SDP unit itself is too narrow, or some other bottleneck exists in the pipeline between CONV and SDP. We surmise that, if such an external bottleneck exists, it is likely related to the pipeline components operating in independent, rather than fused, mode, with data needing to go back and forth on the DRAM interface instead of through the higher-throughput inter-stage FIFOs.

We saw that the scheduler sometimes subdivides ops to make them fit in the FUs. However, this means that some operations, even subdivided ops of the same "macro-op," may have different input data sizes. To resolve this potential imbalance, we now focus on the total amount of data processed by each FU throughout the course of execution of the entire network.

FU dataflow analysis

The number of data bytes consumed by each functional unit, across the execution of an entire network, is provided in tabular and graphical formats below. Because the networks have vastly-varying amounts of data (R-CNN and AlexNet much more than LeNet and CONV+SDP), they are plotted separately to maintain reasonable y-axis ranges.

Table 6: Bytes processed by each NVDLA FU

Network | CONV input | CONV weight | SDP | PDP | CDP | Total
AlexNet | 5,478,592 | 122,188,416 | 2,635,104 | 1,040,576 | 954,048 | 132,296,736
R-CNN | 5,478,592 | 115,634,816 | 2,635,104 | 1,040,576 | 226,496 | 125,015,584
LeNet | 37,376 | 861,184 | 47,136 | 45,056 | 0 | 990,752
CONV+SDP | 16,384 | 73,728 | 0 | 0 | 0 | 90,112

Figure 13: Bytes processed by each NVDLA FU

One interesting thing to note about these graphs is that, compared to the op counts chart, CONV (due to its weights) dominates an even larger portion of the overall percentages. This adds credence to the hypothesis that nv_full's SDP unit is undersized compared to the CONV unit that precedes it. Specifically, we notice that for CONV, there is a lot more weight data than there is input (i.e., activation) data. This makes sense, as the input dimensions for R-CNN and AlexNet are 227x227, while each of their .caffemodel (weights) files were around 200 megabytes in size. Note also that the final layers of all three "real" networks are fully-connected. In contrast to a true convolution, the matrix-matrix multiplication in the FC layers can blow up the output matrix dimensions by multiplying a smaller input matrix by a much larger weights matrix, further increasing the amount of weights data consumed.

Finally, we observe that in CONV+SDP, the SDP operation is really a no-op. This might be related to a scheduler issue that requires an SDP to follow a CONV, even if the SDP doesn't do anything (identity function). Here, however, it doesn't even act as an identity function – it simply takes in no data at all. CONV+SDP is a pre-packaged loadable intended to test only the CONV unit, so this isn't an issue.

Network simulation runtime

Unfortunately, the NVDLA SystemC virtual platform doesn't come with any tools for estimating cycles consumed. In the NVDLA literature, the authors suggest that an FPGA may be used instead for "limited cycle-counting performance evaluation" [Primer]. However, FPGA support for nv_full is still forthcoming (see https://github.com/nvdla/hw/issues/90). It may also be possible to instrument the SystemC in the virtual platform to count "virtual" cycles; we did not explore this option.

However, in order to get some time-based data, we simply looked at the simulation times required for the four networks. Obviously, simulation time corresponds very crudely to real execution time, but it does give some sense of the amount of computation necessary to run all the stages to completion.

Figure 14: Network simulation times

CONV+SDP and LeNet are significantly shallower networks than AlexNet and R-CNN, and exhibit correspondingly lower runtimes. Also recall that AlexNet and R-CNN operate on 227x227-pixel images, while LeNet's input is 28x28, and CONV+SDP uses an integrated 8x8 input image. As AlexNet and R-CNN both require 42 ops, in approximately the same order, it is reasonable that their runtimes are so similar.

The most interesting takeaway from the data is that simulation time seems to scale fairly well with network size; i.e., there wasn't a great deal of unavoidable overhead present for the smaller networks. We expect that moving the simulation to an FPGA can drastically reduce simulation times, and thus grant us the opportunity to sweep an even broader range of design parameters once the nv_full restriction is lifted.


Future directions

Our primary limitation with the NVDLA platform is that, for the time being, it doesn't give us cycle or timing information. Unfortunately, this prevented us from unifying the circuit-level work with the simulator work to the degree that we would have liked. Nonetheless, we were able to obtain valuable insights from each phase of the project, such as MAC parallelism vs. energy/area tradeoffs, and how neural net execution on a real hardware accelerator might be scheduled. Once the NVDC compiler is open-sourced and made more robust, we can look at a wider variety of networks; specifically, we could simulate VGG-19 on NVDLA to compare against our sequential MAC analysis. Finally, we eventually hope to synthesize and place-and-route the NVDLA design; this would give accurate clock cycle time estimates, as well as permit us to experimentally swap out the default 28nm Si CMOS logic + DRAM for CNFET + RRAM, which may provide significant EDP benefits [Aly15].

Conclusion

The rise of abundant-data computing, where a massive amount of structured and unstructured data is analyzed, has placed extreme demands on the energy efficiency of today's computing systems. With state-of-the-art CMOS circuits pushing the limits of Dennard scaling [Dennard74], improving the energy efficiency of general-purpose processor systems has become increasingly challenging. Several alternative hardware models have been explored as potential successors to general-purpose processor cores, including field-programmable gate arrays with bit-level configurability [Putnam14], domain-specific accelerators with limited programmability [Volta], and fixed-function accelerators [Jouppi17], each with its own tradeoffs. For example, field-programmable gate arrays are highly programmable, at the cost of higher energy/execution time per operation due to routing overheads and mapping logic primitives (e.g., ANDs, ORs, etc.) to multi-purpose lookup tables. On the other hand, fixed-function accelerators provide lower energy/execution time per operation due to optimized circuit implementations, at the cost of limited programmability. For this project, we focused on fixed-function deep learning inference accelerators for convolutional neural networks, as they represent a bound on the minimum energy/execution time that can be achieved in hardware for a popular class (i.e., deep learning inference) of abundant-data applications.

Our circuit-level design space exploration gave us a deep understanding of the crux of the design problem. We found that deep learning inference applications such as convolutional neural networks can be parallelized to a high degree. In particular, the area of the resultant accelerator becomes a difficult problem to manage as the application is aggressively parallelized. We found that a sequential MAC-based topology produces a design with a more reasonable area, at the cost of increased execution time and leakage energy. We show that naïve scheduling via aggressive parallelization (e.g., designing a large combinational circuit for the whole neural network inference application) is impractical, and that scheduling is an important and necessary overhead that must be characterized in order to develop accelerators that run applications at the optimal point on the roofline curve shown in Figure 1. We also show that pooling operations (e.g., maxpool) and activation functions (e.g., ReLU) do not comprise a significant fraction of the energy and execution time of a neural network accelerator. Given better NVDLA compiler support, we would have been able to analyze a wider range of neural network inference applications, and characterize the roofline model for the accelerator provided by the NVDLA framework. Using this information, we would have been able to characterize the scheduling overhead of the NVDLA dataflow. In the end, we were able to get some useful dataflow information out of NVDLA, even though we were not able to measure accurate cycle counts with it. In the future, we hope to add hooks into NVDLA to improve simulation accuracy, validate the simulator against physical design layouts of the accelerator, and unify the bottom-up and top-down approaches of Parts I and II, respectively.

Attributions

William performed in-depth circuit-level analysis and obtained adder, multiplier, and MAC energy-delay results in Part I. Andy brought up NVDLA and implemented DLISim on top of it in Part II. Both team members contributed equally to the Introduction and Conclusion, and to the project in general. All work done and data collected for the project was performed over the 10-week Winter 2017-2018 academic quarter.

DLISim source code is available at https://github.com/andrewbartolo/dlisim. In this repository, you'll find both a collection of logs from the four networks we were able to run, and Python scripts to process and annotate the logs.

Citations

[Abadi16] "TensorFlow: A system for large-scale machine learning." M. Abadi et al. 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI). 2016.
[Aly15] "Energy-efficient Abundant-data Computing: The N3XT 1,000x." M. M. S. Aly et al. Computer Magazine. 2015.
[Bell78] "The CRAY-1 Computer System." G. Bell et al. Communications of the ACM. 1978.
[Binkert14] "The gem5 simulator." N. Binkert et al. ACM SIGARCH Computer Architecture News.
[Chen16] "Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks." Y.-H. Chen et al. IEEE International Solid-State Circuits Conference. 2016.
[Dennard74] "Design of ion-implanted MOSFETs with very small physical dimensions." R. Dennard et al. IEEE Journal of Solid-State Circuits. 1974.
[Gelsinger00] 'The International Technology Roadmap for Semiconductors (ITRS): "Past, Present, and Future."' http://ieeexplore.ieee.org/document/906261/. Retrieved March 2018.
[Gonzalez96] "Energy Dissipation in General Purpose Microprocessors." R. Gonzalez, M. Horowitz. IEEE Journal of Solid-State Circuits. 1996.
[Han16] "EIE: Efficient Inference Engine on Compressed Deep Neural Network." S. Han et al. International Symposium on Computer Architecture. 2016.
[HMC] "Hybrid Memory Cube (HMC)." J. T. Pawlowski. Hot Chips 23. 2011.
[Jouppi17] "In-Datacenter Performance Analysis of a Tensor Processing Unit." N. Jouppi et al. International Symposium on Computer Architecture. 2017.
[Kocher18] "Spectre Attacks: Exploiting Speculative Execution." P. Kocher et al. https://spectreattack.com/spectre.pdf. 2018.
[Krizhevsky12] "ImageNet Classification with Deep Convolutional Neural Networks." A. Krizhevsky, I. Sutskever, and G. Hinton. Neural Information Processing Systems. 2012.
[Kung82] "Why Systolic Architectures?" H. T. Kung. IEEE Computer. 1982.
[Linaro] "Linaro – Leading software collaboration in the ARM ecosystem." https://linaro.org/downloads. Retrieved March 2018.
[Mangard18] "Meltdown and Spectre." S. Mangard et al. https://meltdownattack.com/meltdown.pdf. 2018.
[MCM] "Multi-Chip Module." Techopedia. https://www.techopedia.com/definition/11836/multi-chip-module-mcm. Retrieved March 2018.
[Miller10] "Graphite: A Distributed Parallel Simulator for Multicores." J. Miller et al. International Symposium on High-Performance Computer Architecture. 2010.
[ModelZoo] "BVLC CAFFE Model Zoo." https://github.com/BVLC/caffe/wiki/Model-Zoo. Retrieved March 2018.
[NECAurora] "A deep dive into NEC's Aurora vector engine." T. Morgan. The Next Platform. https://www.nextplatform.com/2017/11/22/deep-dive-necs-aurora-vector-engine/
[Nowatzki14] 'gem5, GPGPUSim, McPAT, GPUWattch, "Your favorite simulator here" Considered Harmful.' T. Nowatzki et al. 2014.
[NVDLA] "The NVIDIA Deep Learning Accelerator." http://nvdla.org. Retrieved March 2018.
[Primer] "NVDLA Primer." http://nvdla.org/primer.html. Retrieved March 2018.
[Putnam14] "A Reconfigurable Fabric for Accelerating Large-Scale Datacenter Services (Catapult)." A. Putnam et al. International Symposium on Computer Architecture. 2014.
[Roadmap] "NVDLA Open Source Roadmap." http://nvdla.org/roadmap.html. Retrieved March 2018.
[Sanchez13] "ZSim: Fast and Accurate Microarchitectural Simulation of Thousand-Core Systems." D. Sanchez and C. Kozyrakis. International Symposium on Computer Architecture. 2013.
[Simonyan14] "Very Deep Convolutional Networks for Large-Scale Image Recognition." K. Simonyan and A. Zisserman. International Conference on Learning Representations. 2014.
[Volta] "NVIDIA Tesla V100 GPU Architecture." http://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf. Retrieved March 2018.