

Appears in the Proceedings of the 41st International Symposium on Computer Architecture, 2014

General-Purpose Code Acceleration with Limited-Precision Analog Computation

Renée St. Amant*  Amir Yazdanbakhsh§  Jongse Park§  Bradley Thwaites§  Hadi Esmaeilzadeh§  Arjang Hassibi*  Luis Ceze†  Doug Burger‡

*University of Texas at Austin  §Georgia Institute of Technology  †University of Washington  ‡Microsoft Research

Abstract

As improvements in per-transistor speed and energy efficiency diminish, radical departures from conventional approaches are becoming critical to improving the performance and energy efficiency of general-purpose processors. We propose a solution—from circuit to compiler—that enables general-purpose use of limited-precision, analog hardware to accelerate "approximable" code—code that can tolerate imprecise execution. We utilize an algorithmic transformation that automatically converts approximable regions of code from a von Neumann model to an "analog" neural model. We outline the challenges of taking an analog approach, including restricted-range value encoding, limited precision in computation, circuit inaccuracies, noise, and constraints on supported topologies. We address these limitations with a combination of circuit techniques, a hardware/software interface, neural-network training techniques, and compiler support. Analog neural acceleration provides whole-application speedup of 3.7× and energy savings of 6.3× with quality loss less than 10% for all except one benchmark. These results show that using limited-precision analog circuits for code acceleration, through a neural approach, is both feasible and beneficial over a range of approximation-tolerant, emerging applications including financial analysis, signal processing, robotics, 3D gaming, compression, and image processing.

1. Introduction

Energy efficiency now fundamentally limits microprocessor performance gains. CMOS scaling no longer provides gains in efficiency commensurate with transistor density increases [16, 25]. As a result, both the semiconductor industry and the research community are increasingly focused on specialized accelerators, which provide large gains in efficiency and performance by restricting the workloads that benefit. The community is facing an "iron triangle": we can choose any two of performance, efficiency, and generality at the expense of the third. Before the effective end of Dennard scaling, we improved all three consistently for decades. Solutions that improve performance and efficiency, while retaining as much generality as possible, are highly desirable, hence the growing interest in GPGPUs and FPGAs. A growing body of recent work [13, 17, 44, 2, 35, 8, 28, 38, 18, 51, 46] has focused on approximation as a strategy for the iron triangle. Many classes of applications can tolerate small errors in their outputs with no discernible loss in QoR (Quality of Result).

Many conventional techniques in energy-efficient computing navigate a design space defined by the two dimensions of performance and energy, and traditionally trade one for the other. General-purpose approximate computing explores a third dimension—that of error.

Many design alternatives become possible once precision is relaxed. An obvious candidate is the use of analog circuits for computation. However, computation in the analog domain has several major challenges, even when small errors are permissible. First, analog circuits tend to be special purpose, good for only specific operations. Second, the bit widths they can accommodate are smaller than current floating-point standards (i.e., 32/64 bits), since the ranges must be represented by physical voltage or current levels. Another consideration is determining where the boundaries between digital and analog computation lie. Using individual analog operations will not be effective due to the overhead of A/D and D/A conversions. Finally, effective storage of temporary analog results is challenging in current CMOS technologies. These limitations have made it ineffective to design analog von Neumann processors that can be programmed with conventional languages.

Despite these challenges, the potential performance and energy gains from analog execution are highly attractive. An important challenge is thus to architect designs where a significant portion of the computation can be run in the analog domain, while also addressing the issues of value range, domain conversions, and relative error. Recent work on Neural Processing Units (NPUs) may provide a possible approach [18]. NPU-enabled systems rely on an algorithmic transformation that converts regions of approximable general-purpose code into a neural representation (specifically, multi-layer perceptrons) at compile time. At run time, the processor invokes the NPU instead of running the original code. NPUs have shown large performance and efficiency gains, since they subsume an entire code region (including all of the instruction fetch, decode, etc., overheads). They have an added advantage in that they convert many distinct code patterns into a common representation that can be run on a single physical accelerator, improving generality.

NPUs may be a good match for mixed-signal implementations for a number of reasons. First, prior research has shown that neural networks can be implemented in the analog domain to solve classes of domain-specific problems, such as pattern recognition [5, 47, 49, 32].

Second, the process of invoking a neural network and returning a result defines a clean, coarse-grained interface for D/A and A/D conversion. Third, the compile-time training of the network permits any analog-specific restrictions to be hidden from the programmer. The programmer simply specifies which region of the code can be approximated, without adding any neural-network-specific information. Thus, no additional changes to the programming model are necessary.

In this paper we evaluate an NPU design with mixed-signal components and develop a compilation workflow for utilizing the mixed-signal NPU for code acceleration. The goal of this study is to investigate challenges and define potential solutions to enable effective mixed-signal NPU execution. The objective is to both bound application error to sufficiently low levels and achieve worthwhile performance or efficiency gains for general-purpose approximable code. This study makes the following four findings:

1. Due to range limitations, it is necessary to limit the scope of the analog execution to a single neuron; inter-neuron communication should be in the digital domain.

2. Again due to range issues, there is an interplay between the bit widths (inputs and weights) that neurons can use and the number of inputs that they can process. We found that the best design limited weights and inputs to eight bits, while also restricting the number of inputs to each neuron to eight. The input count limitation restricts the topological space of feasible neural networks.

3. We found that using a customized continuous-discrete learning method (CDLM) [10], which accounts for limited-precision computation at training time, is necessary to reduce error due to analog range limitations.

4. Given the analog-imposed topology restrictions, we found that using a Resilient Back Propagation (RPROP) [30] training algorithm can further reduce error over a conventional backpropagation algorithm.

We found that exposing the analog limitations to the compiler allowed for the compensation of these shortcomings and produced sufficiently accurate results. The latter three findings were all used at training time; we trained networks at compile time using 8-bit values, topologies restricted to eight inputs per neuron, plus RPROP and CDLM for training. Using these techniques together, we were able to bound error on all applications but one to a 10% limit, which is commensurate with entirely digital approximation techniques. The average time required to compute a neural result was 3.3× better than a previous digital implementation, with an additional energy savings of 12.1×. The performance gains result in an average full-application-level improvement of 3.7× and 23.3× in performance and energy-delay product, respectively. This study shows that using limited-precision analog circuits for code acceleration, by converting regions of imperative code to neural networks and exposing the circuit limitations to the compiler, is both feasible and advantageous. While it may be possible to move more of the accelerator architecture design into the analog domain, the current mixed-signal design performs well enough that only 3% and 46% additional improvements in application-level energy consumption and performance are possible with improved accelerator designs. However, improving the performance of the analog NPU may lead to higher overall performance gains.

2. Overview and Background

Programming. We use a programming model similar to the one described in [18] to enable programmers to mark error-tolerant regions of code as candidates for transformation using a simple keyword, approximable. Explicit annotation of code for approximation is a common practice in approximate programming languages [45, 7]. A candidate region is an error-tolerant function of any size, containing function calls, loops, and complex control flow. Frequently executed functions provide a greater opportunity for gains. In addition to error tolerance, the candidate function must have well-defined inputs and outputs. That is, the number of inputs and outputs must be known at compile time. Additionally, the code region must not read any data other than its inputs, nor affect any data other than its outputs. No major changes are necessary to the programming language beyond adding the approximable keyword. A sketch of such a candidate function appears below.

Exposing analog circuits to the compiler. Although an analog accelerator presents the opportunity for gains in efficiency over a digital NPU, it suffers from reduced accuracy and flexibility, which results in limitations on possible network topologies and limited-precision computation, potentially resulting in a decreased range of applications that can utilize the acceleration. These shortcomings at the hardware level, however, can be exposed as a high-level model and considered in the training phase.
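To make the programming-model constraints above concrete, the sketch below shows a candidate function in the spirit of the benchmarks: a compile-time-known number of inputs and outputs, and no reads or writes outside of them. It is illustrative only; in the actual system the programmer would mark a C-level function with the approximable keyword rather than write Python.

```python
# Hypothetical approximable candidate: error-tolerant, pure, with a
# compile-time-known number of inputs (3) and outputs (1).
def rgb_to_luma(r, g, b):
    # Small numerical errors here only slightly perturb the output image,
    # so the region can tolerate imprecise (analog) execution.
    return 0.299 * r + 0.587 * g + 0.114 * b

# A function that reads or writes global state, or whose input/output count
# is unknown at compile time, would not qualify.
```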

Four characteristics need to be exposed: (1) limited precision for input and output encoding, (2) limited precision for encoding weights, (3) the behavior of the activation function (sigmoid), and (4) limited feasible neural topologies. Other low-level circuit behavior, such as response to noise, can also be exposed to the compiler. Section 5 describes this necessary hardware/software interface in more detail.

Analog neural accelerator circuit design. To extract the high-level model for the compiler and to be able to accelerate execution, we design mixed-signal neural hardware for multilayer perceptrons. The accelerator must support a large enough variety of neural network topologies to be useful over a wide range of applications. As we will show, each application requires a different topology for the neural network that is replacing its approximable regions of code. Section 4 describes a candidate A-NPU circuit design and outlines the challenges and tradeoffs present with an analog implementation.

Compiling for analog neural hardware. The compiler aims to mimic approximable regions of code with neural networks that can be executed on the mixed-signal accelerator. While considering the limitations of the analog hardware, the compiler searches the topology space of the neural networks and selects and trains a neural network to produce outputs that mimic those of the original code region.

Figure 1: Framework overview. (Figure panels: Programmer; Annotated Source Code; Programming; Design; A-NPU Circuit Design; A-NPU High-Level Model; Profiling Path for Training Data Collection.)

and not the individual neurons. The overhead of conversions in the ANUs significantly limits the potential efficiency gains of an analog approach toward neural computation. However, there is a tradeoff between efficiency, reconfigurability (generality), and accuracy in analog neural hardware design. Pushing more of the implementation into the analog domain gains efficiency at the expense of flexibility, limiting the scope of supported network topologies and, consequently, limiting potential network accuracy. The NPU approach targets code approximation, rather than typical, simpler neural tasks such as recognition and prediction, and imposes higher accuracy requirements. The main challenge is to manage this tradeoff to achieve acceptable accuracy for code acceleration, while delivering higher performance and efficiency when analog neural circuits are used for general-purpose code acceleration.

As prior work [18] has shown and we corroborate, regions of code from different applications require different topologies of neural networks. While a holistically analog neural hardware design with fixed-wire connections between neurons may be efficient, it effectively provides a fixed-topology network, limiting the scope of applications that can benefit from the neural accelerator, as the optimal network topology varies with application. Additionally, routing analog signals among neurons and the limited capability of analog circuits for buffering signals negatively impact accuracy and make the circuit susceptible to noise. In order to provide additional flexibility, we set the digital-analog boundary in conjunction with an algorithmic, sigmoid-activated neuron, where a set of digital inputs and weights are converted to the analog domain for efficient computation, producing a digital output that can be accurately routed to multiple consumers. We refer to this basic computation unit as an analog neural unit, or ANU. ANUs can be composed, in various physical configurations, along with digital control and storage, to form a reconfigurable mixed-signal NPU, or A-NPU.

One of the most prevalent limitations in analog design is the bounded range of currents and voltages within which the circuits can operate effectively. These range limitations restrict the bit-width of input and weight values and the network topologies that can be computed accurately and efficiently. We expose these limitations to the compiler, and our custom training algorithm and compilation workflow consider these restrictions when searching for optimal network topologies and training neural networks. As we will show, one of the insights from this work is that even with limited bit-width (≤ 8) and a restricted neural topology, many general-purpose approximate applications achieve acceptable accuracy and significantly benefit from mixed-signal neural acceleration.

Value representation and bit-width limitations. One of the fundamental design choices for an ANU is the bit-width of inputs and weights. Increasing the number of bits results in an exponential increase in the ADC and DAC energy dissipation and can significantly limit the benefits from analog acceleration. Furthermore, due to the fixed range of voltage and current levels, increasing the number of bits translates to quantizing this fixed value range at granularities too fine for practical ADCs to handle. In addition, the fine-granularity encoding makes the analog circuit significantly more susceptible to noise and to thermal, voltage, current, and process variations. In practice, these non-ideal effects can adversely affect the final accuracy when more bits are used for weights and inputs. We design our ANUs such that the granularity of the voltage and current levels used for information encoding is, to a large degree, robust to variations and noise.

Topology restrictions. Another important design choice is the number of inputs to the ANU. As with bit-width, increasing the number of ANU inputs translates to encoding a larger value range in a bounded voltage and current range, which, as discussed, becomes impractical. There is a tradeoff between accuracy and efficiency in choosing the number of ANU inputs. The larger the number of inputs, the larger the number of multiply and add operations that can be done in parallel in the analog domain, increasing efficiency. However, due to the bounded range of voltages and currents, increasing the number of inputs requires decreasing the number of bits for inputs and weights. Through circuit-level simulations, we empirically found that limiting the number of inputs to eight, with 8-bit inputs and weights, strikes a balance between accuracy and efficiency. A digital implementation does not impose such restrictions on the number of inputs to the hardware neuron, and it can potentially compute arbitrary topologies of neural networks. This ANU limitation, however, restricts the topology of the neural networks that can run on the analog accelerator. Our customized training algorithm and compilation workflow take this topology limitation into account and produce neural networks that can be computed on our mixed-signal accelerator.

Non-ideal sigmoid. The saturation behavior of the analog circuit that leads to sigmoid-like behavior after the summation stage represents an approximation of the ideal sigmoid. We measure this behavior at the circuit level and expose it to the compiler and the training algorithm.
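The following sketch captures, in software, the functional model of one ANU as these restrictions describe it: at most eight inputs, 8-bit inputs and weights, and a non-ideal sigmoid supplied from measurement. The quantization scheme and the default sigmoid are our assumptions for illustration, not measurements from the circuit.

```python
import numpy as np

N_INPUTS, BITS = 8, 8
LEVELS = 2 ** (BITS - 1) - 1                      # 127 magnitude levels plus a sign bit

def quantize(v):
    # Clamp to the representable range and snap to the nearest 8-bit level.
    return np.round(np.clip(v, -1.0, 1.0) * LEVELS) / LEVELS

def anu(x, w, sigmoid=lambda s: 1.0 / (1.0 + np.exp(-s))):
    # x, w: up to eight values each; sigmoid: ideally the measured circuit response.
    assert len(x) <= N_INPUTS and len(w) <= N_INPUTS
    s = np.dot(quantize(np.asarray(x)), quantize(np.asarray(w)))  # analog multiply-add
    return quantize(sigmoid(s))                   # the ADC re-digitizes the result
```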

4. Mixed-Signal Neural Accelerator (A-NPU)

This section describes a concrete ANU design and the mixed-signal neural accelerator, A-NPU.

    4.1. ANU Circuit Design

Figure 3 illustrates the design of a single analog neuron (ANU). The ANU performs the computation of one neuron, y ≈ sigmoid(∑_i w_i x_i). We place the analog-digital boundary at the ANU level, with computation in the analog domain and storage in the digital domain. Digital input and weight values are represented in sign-magnitude form. In the figure, sw_i and sx_i represent the sign bits, and w_i and x_i represent the magnitudes. Digital input values are converted to the analog domain through current-steering DACs that translate digital values to analog currents. Current-steering DACs are used for their speed and simplicity.

Figure 3: A single analog neuron (ANU).

In Figure 3, I(|x_i|) is the analog current that represents the magnitude of the input value x_i. Digital weight values control resistor-string ladders that create a variable resistance depending on the magnitude of each weight, R(|w_i|). We use a standard resistor ladder that consists of a set of resistors connected to a tree-structured set of switches. The digital weight bits control the switches, adjusting the effective resistance, R(|w_i|), seen by the input current I(|x_i|). These variable resistances scale the input currents by the digital weight values, effectively multiplying each input magnitude by its corresponding weight magnitude. The output of the resistor ladder is a voltage: V(|w_i x_i|) = I(|x_i|) × R(|w_i|). The resistor network requires 2^m resistors and approximately 2^(m+1) switches, where m is the number of digital weight bits. This resistor ladder design has been shown to work well for m ≤ 10. Our circuit simulations show that only minimally sized switches are necessary.
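As a concrete instance under the 8-bit weights used here, each ladder would need 2^8 = 256 resistors and roughly 2^9 = 512 switches; that switch count lines up with the per-ladder share of the transistor estimate in Table 2 (4,096 T across eight ladders), assuming minimally sized switch transistors.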

V(|w_i x_i|), as well as the XOR of the weight and input sign bits, feeds a differential pair that converts voltage values to two differential currents, I+(w_i x_i) and I−(w_i x_i), that capture the sign of the weighted input. These differential currents are proportional to the voltage applied to the differential pair, V(|w_i x_i|). If the voltage difference between the two gates is kept small, the current-voltage relationship is linear, producing I+(w_i x_i) = I_bias/2 + ΔI and I−(w_i x_i) = I_bias/2 − ΔI. Resistor ladder values are chosen such that the gate voltage remains in the range that produces linear outputs, and consequently a more accurate final result. Based on the sign of the computation, a switch steers either the current associated with a positive value or the current associated with a negative value to a single wire to be efficiently summed according to Kirchhoff's current law. The alternate current is steered to a second wire, retaining differential operation at later design stages. Differential operation combats environmental noise and increases gain, the latter being particularly important for mitigating the impact of analog range challenges at later stages.

Resistors convert the resulting pair of differential currents to voltages, V+(∑_i w_i x_i) and V−(∑_i w_i x_i), that represent the weighted sum of the inputs to the ANU. These voltages are used as input to an additional amplification stage, implemented as a current-mode differential amplifier with a diode-connected load.

Figure 4: Mixed-signal neural accelerator, A-NPU. Only four of the ANUs are shown. Each ANU processes eight 8-bit inputs.

The goal of this amplification stage is to significantly magnify the input voltage range of interest that maps to the linear output region of the desired sigmoid function. Our experiments show that neural networks are sensitive to the steepness of this non-linear function, losing accuracy with shallower, non-linear activation functions. This fact is relevant for an analog implementation because steeper functions increase range pressure in the analog domain, as a small range of interest must be mapped to a much larger output range in accordance with ADC input range requirements for accurate conversion. We magnify this range of interest, choosing circuit parameters that give the required gain, but also allowing for saturation with inputs outside of this range.

The amplified voltage is used as input to an ADC that converts the analog voltage to a digital value. We chose a flash ADC design (named for its speed), which consists of a set of reference voltages and comparators [1, 31]. The ADC requires 2^n comparators, where n is the number of digital output bits. Flash ADC designs are capable of converting 8 bits at a frequency on the order of one GHz. We require 2-3 mV between ADC quantization levels for accurate operation and noise tolerance. Typically, ADC reference voltages increase linearly; however, we use a non-linearly increasing set of reference voltages to capture the behavior of a sigmoid function, which also improves the accuracy of the analog sigmoid.
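One way to picture the non-linear reference spacing is to place each comparator threshold at the pre-sigmoid value whose sigmoid output falls on a code boundary, i.e., inverse-sigmoid (logit) spacing. The sketch below is our illustration of that idea; the normalization and the exact reference values are assumptions, not the paper's measured design.

```python
import numpy as np

N_BITS = 8
codes = np.arange(1, 2 ** N_BITS)        # 255 comparator thresholds for an 8-bit flash ADC
p = codes / float(2 ** N_BITS)           # target sigmoid value at each code boundary
v_ref = np.log(p / (1.0 - p))            # logit spacing: dense near the middle, sparse at the tails

# In hardware the references are scaled by the amplifier gain; the tightest spacing
# (near the sigmoid's linear region) must remain at least the 2-3 mV noted above,
# which is one reason the preceding amplification stage magnifies that region.
print(np.round(np.diff(v_ref)[:3], 4), np.round(np.diff(v_ref)[126:129], 4))
```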

    4.2. Reconfigurable Mixed-Signal A-NPU

We design a reconfigurable mixed-signal A-NPU that can perform the computation of a wide variety of neural topologies, since each application requires a different topology. Figure 4 illustrates the A-NPU design with some details omitted for clarity. The figure shows four ANUs, while the actual design has eight. The A-NPU is a time-multiplexed architecture where the algorithmic neurons are mapped to the ANUs based on a static scheduling algorithm, which is loaded to the A-NPU before invocation. The multi-layer perceptron consists of layers of neurons, where the inputs of each layer are the outputs of the previous layer. The A-NPU starts from the input layer and performs the computations of the neurons layer by layer. The Input Buffer always contains the inputs to the neurons, either

coming from the processor or from the previous layer's computation. The Output Buffer, which is a single-entry buffer, collects the outputs of the ANUs. When all of its columns are computed, the results are pushed back to the Input Buffer to enable calculation of the next layer. The Row Selector determines which entry of the input buffer will be fed to the ANUs. The output of the ANUs will be written to a single-entry output buffer. The Column Selector determines which column of the output buffer will be written by the ANUs. These selectors are FIFO buffers whose values are part of the preloaded A-NPU configuration. All the buffers are digital SRAM structures.

Each ANU has eight inputs. As shown in Figure 4, each ANU is augmented with a dedicated weight buffer, storing the 8-bit weights. The weight buffers simultaneously feed the weights to the ANUs. The weights, and the order in which they are fed to the ANUs, are part of the A-NPU configuration. The Input Buffer and Weight Buffers synchronously provide the inputs and weights for the ANUs based on the pre-loaded order.

A-NPU configuration. During code generation, the compiler produces an A-NPU configuration that constitutes the weights and the schedule. The static A-NPU scheduling algorithm first assigns an order to the neurons of the neural network, in which the neurons will be computed in the ANUs. The scheduler then takes the following steps for each layer of the neural network: (1) assign each neuron to one of the ANUs; (2) assign an order to the neurons; (3) assign an order to the weights; (4) generate the order for inputs to feed the ANUs; (5) generate the order in which the outputs will be written to the Output Buffer. The scheduler also assigns a unique order for the inputs and outputs of the neural network in which the core communicates data with the A-NPU.
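As a rough illustration of such a schedule (our simplification, not the paper's exact algorithm), the sketch below assigns each layer's neurons to the eight ANUs round-robin and derives the weight, input, and output orders from that assignment.

```python
N_ANUS = 8

def build_schedule(layer_sizes):
    """layer_sizes, e.g. [64, 16, 8, 64]: neurons per layer, input layer first."""
    schedule = []
    for layer, n_neurons in enumerate(layer_sizes[1:], start=1):
        anu_of_neuron = {n: n % N_ANUS for n in range(n_neurons)}   # step 1
        neuron_order = sorted(anu_of_neuron)                        # step 2
        schedule.append({
            "layer": layer,
            "anu_of_neuron": anu_of_neuron,
            "neuron_order": neuron_order,
            # steps 3-5: weight, input, and output orders follow neuron_order
            "weight_order": neuron_order,
            "output_order": neuron_order,
        })
    return schedule

print(build_schedule([64, 16, 8, 64])[1])   # schedule for the 16 -> 8 layer
```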

4.3. Architectural Interface for A-NPU

We adopt the same FIFO-based architectural interface through which a digital NPU communicates with the processor [18]. The A-NPU is tightly integrated into the pipeline. The processor communicates with the A-NPU only through the Input, Output, and Config FIFOs, shown in Figure 4. The processor ISA is extended with special instructions that enqueue and dequeue data to and from these FIFOs. When a data value is queued/dequeued to/from the Input/Output FIFO, the A-NPU converts the value to the appropriate representation for the A-NPU/processor.
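The sketch below models the invocation protocol implied by this interface. The operation names are hypothetical stand-ins for the ISA extensions; here they are simply queue operations on a software model of the FIFOs.

```python
from collections import deque

input_fifo, output_fifo, config_fifo = deque(), deque(), deque()

def configure(weights_and_schedule):
    config_fifo.extend(weights_and_schedule)      # one-time "enqueue config" before invocations

def invoke(inputs, anpu_forward, n_outputs):
    for x in inputs:                              # "enqueue data": send the region's inputs
        input_fifo.append(x)
    output_fifo.extend(anpu_forward(list(input_fifo)))   # A-NPU runs the trained network
    input_fifo.clear()
    return [output_fifo.popleft() for _ in range(n_outputs)]   # "dequeue data": read outputs
```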

5. Compilation for Analog Acceleration

As Figure 1 illustrates, the compilation for A-NPU execution consists of three stages: (1) profile-driven data collection, (2) training for a limited-precision A-NPU, and (3) code generation for hybrid analog-digital execution. In the profile-driven data collection stage, the compiler instruments the application to collect the inputs and outputs of approximable functions. The compiler then runs the application with representative inputs and collects the inputs and their corresponding outputs. These input-output pairs constitute the training data. Section 4 briefly discussed ISA extensions and code generation. While

compilation stages (1) and (3) are similar to the techniques presented for a digital implementation [18], the training phase is unique to an analog approach, accounting for analog-imposed topology restrictions and adjusting weight selection to account for limited-precision computation.

Hardware/software interface for exposing analog circuits to the compiler. As we discussed in Section 3, we expose the analog circuit restrictions to the compiler through a hardware/software interface that captures the following circuit characteristics: (1) input bit-width limitations, (2) weight bit-width limitations, (3) the limited number of inputs to each analog neuron (topology restriction), and (4) the non-ideal shape of the analog sigmoid. The compiler internally constructs a high-level model of the circuit based on these limitations and uses this model during the neural topology search and training, with the goal of limiting the impact of inaccuracies due to an analog implementation.

Training for limited bit widths and analog computation. Traditional training algorithms for multi-layer perceptron neural networks use a gradient descent approach to minimize the average network error over a set of training input-output pairs, backpropagating the output error through the network and iteratively adjusting the weight values to minimize that error. Traditional training techniques that do not consider limited-precision inputs, weights, and outputs, however, perform poorly when these values are saturated to adhere to the bit-width requirements that are feasible for an implementation in the analog domain. Simply limiting weight values during training is also detrimental to achieving quality outputs because the algorithm does not have sufficient precision to converge to a quality solution.
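A compiler-side view of this interface might look like the following record; the field names and defaults are illustrative, chosen to mirror the four characteristics listed above.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class AnalogCircuitModel:
    input_bits: int = 8                      # (1) limited precision for input/output encoding
    weight_bits: int = 8                     # (2) limited precision for weights
    max_inputs_per_neuron: int = 8           # (3) topology restriction
    sigmoid: Optional[Callable[[float], float]] = None   # (4) measured, non-ideal activation
```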

To incorporate bit-width limitations into the training algorithm, we use a customized continuous-discrete learning method (CDLM) [10]. This approach takes advantage of the availability of full-precision computation at training time and then adjusts slightly to optimize the network for errors due to limited-precision values. In an initial phase, CDLM first trains a fully precise network according to a standard training algorithm, such as backpropagation [43]. In a second phase, it discretizes the input, weight, and output values according to the exposed analog specification. The algorithm calculates the new error and backpropagates that error through the fully precise network using full-precision computation, and updates the weight values according to the algorithm also used in the first phase. This process repeats, backpropagating the 'discrete' errors through a precise network. The original CDLM training algorithm was developed to mitigate the impact of limited-precision weights. We customize this algorithm by incorporating the input bit-width limitation and the output bit-width limitation in addition to limited weight values. Additionally, this training scheme is advantageous for an analog implementation because it is general enough to also make up for errors that arise due to an analog implementation, such as a non-ideal sigmoid function and any other analog non-ideality

that behaves consistently.

In essence, after one round of full-precision training, the compiler models an analog-like version of the network. A second, CDLM-based training pass adjusts for these analog-imposed errors, enabling the inaccurate and limited A-NPU as an option for a beneficial NPU implementation by maintaining acceptable accuracy and generality.
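A minimal sketch of the second, analog-aware phase, assuming a tiny 8 -> 4 -> 1 network, an 8-bit discretization, and an ideal sigmoid standing in for the measured one: the forward pass uses the discretized values (what the A-NPU would compute), while the resulting error updates the full-precision weights. The initial full-precision training round is assumed to have already happened.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(8, 4)), rng.normal(size=(4, 1))   # full-precision weights
sigmoid = lambda s: 1.0 / (1.0 + np.exp(-s))
quant = lambda v: np.round(np.clip(v, -1, 1) * 127) / 127    # 8-bit discretization stand-in

def cdlm_step(x, target, lr=0.05):
    global W1, W2
    # Discretized forward pass: inputs, weights, and activations as the A-NPU sees them.
    h = quant(sigmoid(quant(x) @ quant(W1)))
    y = quant(sigmoid(h @ quant(W2)))
    # Backpropagate the *discrete* error, but apply updates to the full-precision weights.
    err = target - y
    d2 = err * y * (1.0 - y)
    d1 = (d2 @ W2.T) * h * (1.0 - h)
    W2 += lr * np.outer(h, d2)
    W1 += lr * np.outer(x, d1)

cdlm_step(rng.uniform(-1, 1, size=8), np.array([0.5]))
```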

Training with topology restrictions. In addition to determining weight values for a given network topology, the compiler searches the space of possible topologies to find an optimal network for a given approximable code region. Conventional multi-layer perceptron networks are fully connected, i.e., the output of each neuron in one layer is routed to the input of each neuron in the following layer. However, analog range limitations restrict the number of inputs that can be computed in a neuron (eight in our design). Consequently, network connections must be limited, and in many cases, the network cannot be fully connected.

We impose the circuit restriction on the connectivity between the neurons during the topology search, and we use a simple algorithm guided by the mean-squared error of the network to determine the best topology given the exposed restriction. The error evaluation uses a typical cross-validation approach: the compiler partitions the data collected during profiling into a training set (70% of the data) and a test set (the remaining 30%). The topology search algorithm trains many different neural-network topologies using the training set and chooses the one with the highest accuracy on the test set and the lowest latency on the A-NPU hardware (prioritizing accuracy). The space of possible topologies is large, so we restrict the search to neural networks with at most two hidden layers. We also limit the number of neurons per hidden layer to powers of two, up to 32. The numbers of neurons in the input and output layers are predetermined based on the number of inputs and outputs in the candidate function.
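The restricted search could be organized as below, a sketch under the stated restrictions (up to two hidden layers, power-of-two layer sizes up to 32, a 70/30 train/test split). Here `train_and_score` is a hypothetical placeholder that trains a candidate with the CDLM/RPROP flow and returns (test error, estimated A-NPU latency).

```python
from itertools import product

HIDDEN_SIZES = [2, 4, 8, 16, 32]

def search_topology(n_in, n_out, samples, train_and_score):
    split = int(0.7 * len(samples))
    train, test = samples[:split], samples[split:]
    candidates = [(n_in, h, n_out) for h in HIDDEN_SIZES]
    candidates += [(n_in, h1, h2, n_out) for h1, h2 in product(HIDDEN_SIZES, repeat=2)]
    # Rank by (test-set error, estimated latency): accuracy first, then speed.
    return min(candidates, key=lambda topo: train_and_score(topo, train, test))
```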

To further improve accuracy and compensate for topology-restricted networks, we utilize a Resilient Back Propagation (RPROP) [30] training algorithm as the base training algorithm in our CDLM framework. During training, instead of updating the weight values based on the backpropagated error (as in conventional backpropagation [43]), the RPROP algorithm increases or decreases the weight values by a predefined value based on the sign of the error. Our investigation showed that RPROP significantly outperforms conventional backpropagation for the selected network topologies, requiring only half the number of training epochs as backpropagation to converge on a quality solution. The main advantage of applying RPROP training to an analog approach to neural computing is its robustness to the sigmoid-function and topology restrictions imposed by the analog design. Backpropagation, for example, is extremely sensitive to the steepness of the sigmoid function, and allowing for a variety of steepness levels in a fixed, analog implementation is challenging. Additionally, backpropagation performs poorly with a shallow sigmoid function. The requirement of a steep sigmoid function exacerbates analog range challenges, possibly making the implementation infeasible. RPROP tolerates a shallower sigmoid activation steepness and performs consistently, utilizing a constant activation steepness over all applications. Our RPROP-based, customized CDLM training phase requires 5000 training epochs, with the analog-based CDLM phase adding roughly 10% to the training time of the baseline training algorithm.
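The sign-based update at the heart of RPROP can be sketched as follows; the step-size growth and shrink factors and bounds are the commonly used defaults, not values reported in the paper.

```python
import numpy as np

def rprop_update(w, grad, prev_grad, step,
                 eta_plus=1.2, eta_minus=0.5, step_min=1e-6, step_max=50.0):
    same = grad * prev_grad
    step = np.where(same > 0, np.minimum(step * eta_plus, step_max), step)   # same sign: grow the step
    step = np.where(same < 0, np.maximum(step * eta_minus, step_min), step)  # sign flipped: back off
    w = w - np.sign(grad) * step      # move each weight by its own step, using only the gradient's sign
    return w, step, grad
```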

6. Evaluations

Cycle-accurate simulation and energy modeling. We use the MARSSx86 x86-64 cycle-accurate simulator [39] to model the performance of the processor. The processor is modeled after a single-core Intel Nehalem to evaluate the performance benefits of A-NPU acceleration over an aggressive out-of-order architecture¹. We extended the simulator to include ISA-level support for A-NPU enqueue and dequeue instructions. We also augmented MARSSx86 with a cycle-accurate simulator for our A-NPU design and an 8-bit, fixed-point D-NPU with eight processing engines (PEs) as described in [18]. We use GCC v4.7.3 with -O3 to enable compiler optimization. The baseline in our experiments is the benchmark run solely on the processor without neural transformation. We use McPAT [33] for processor energy estimations. We model the energy of an 8-bit, fixed-point D-NPU using results from McPAT, CACTI 6.5 [37], and [22]. Both the D-NPU and the processor operate at 3.4 GHz at 0.9 V, while the A-NPU is clocked at one third of the digital clock frequency, 1.1 GHz at 1.2 V, to achieve acceptable accuracy.

Circuit design for ANU. We built a detailed transistor-level SPICE model of the analog neuron, ANU. We designed and simulated the 8-bit, 8-input ANU in the Cadence Analog Design Environment using predictive technology models at 45 nm [6]. We ran detailed Spectre SPICE simulations to understand circuit behavior and measure ANU energy consumption. We used CACTI to estimate the energy of the A-NPU buffers. Evaluations consider all A-NPU components, both digital and analog. For the analog parts, we used direct measurements from the transistor-level SPICE simulations. For SRAM accesses, we used CACTI. We built an A-NPU cycle-accurate simulator to evaluate the performance improvements. Similar to McPAT, we combined simulation statistics with measurements from SPICE and CACTI to calculate A-NPU energy. To avoid biasing our study toward analog designs, all energy and performance comparisons are to an 8-bit, fixed-point D-NPU (8-bit inputs/weights/multiply-adders).

¹Processor: Fetch/Issue Width: 4/5; INT ALUs/FPUs: 6/6; Load/Store FUs: 1/1; ROB Entries: 128; Issue Queue Entries: 36; INT/FP Physical Registers: 256/256; Branch Predictor: Tournament, 48 KB; BTB Sets/Ways: 1024/4; RAS Entries: 64; Load/Store Queue Entries: 48/48; Dependence Predictor: 4096-entry Bloom Filter; ITLB/DTLB Entries: 128/256. L1: 32 KB Instruction, 32 KB Data, Line Width: 64 bytes, 8-Way, Latency: 3 cycles. L2: 256 KB, Line Width: 64 bytes, 8-Way, Latency: 6 cycles. L3: 2 MB, Line Width: 64 bytes, 16-Way, Latency: 27 cycles. Memory Latency: 50 ns.

Table 1: The evaluated benchmarks, characterization of each offloaded function, training data, and the trained neural network.

blackscholes (Financial Analysis): mathematical model of a financial market. 5 function calls, 0 loops, 5 ifs/elses, 309 x86-64 instructions. Evaluation input set: 4096 data points from PARSEC; training input set: 16384 data points from PARSEC. Topology: 6 -> 8 -> 8 -> 1; fully digital NN MSE: 0.000011; analog NN MSE (8-bit): 0.00228. Error metric: average relative error; fully digital error: 6.02%; analog error: 10.2%

fft (Signal Processing): radix-2 Cooley-Tukey fast Fourier transform. 2 function calls, 0 loops, 0 ifs/elses, 34 x86-64 instructions. Evaluation input set: 2048 random floating-point numbers; training input set: 32768 random floating-point numbers. Topology: 1 -> 4 -> 4 -> 2; fully digital NN MSE: 0.00002; analog NN MSE (8-bit): 0.00194. Error metric: average relative error; fully digital error: 2.75%; analog error: 4.1%

inversek2j (Robotics): inverse kinematics for a 2-joint arm. 4 function calls, 0 loops, 0 ifs/elses, 100 x86-64 instructions. Evaluation input set: 10000 (x, y) random coordinates; training input set: 10000 (x, y) random coordinates. Topology: 2 -> 8 -> 2; fully digital NN MSE: 0.000341; analog NN MSE (8-bit): 0.00467. Error metric: average relative error; fully digital error: 6.2%; analog error: 9.4%

jmeint (3D Gaming): triangle intersection detection. 32 function calls, 0 loops, 23 ifs/elses, 1,079 x86-64 instructions. Evaluation input set: 10000 random pairs of 3D triangle coordinates; training input set: 10000 random pairs of 3D triangle coordinates. Topology: 18 -> 32 -> 8 -> 2; fully digital NN MSE: 0.05235; analog NN MSE (8-bit): 0.06729. Error metric: miss rate; fully digital error: 17.68%; analog error: 19.7%

jpeg (Compression): JPEG encoding. 3 function calls, 4 loops, 0 ifs/elses, 1,257 x86-64 instructions. Evaluation input set: one 220x200-pixel color image; training input set: three 512x512-pixel color images. Topology: 64 -> 16 -> 8 -> 64; fully digital NN MSE: 0.0000156; analog NN MSE (8-bit): 0.0000325. Error metric: image diff; fully digital error: 5.48%; analog error: 8.4%

kmeans (Machine Learning): k-means clustering. 1 function call, 0 loops, 0 ifs/elses, 26 x86-64 instructions. Evaluation input set: one 220x200-pixel color image; training input set: 50000 pairs of random (r, g, b) values. Topology: 6 -> 8 -> 4 -> 1; fully digital NN MSE: 0.00752; analog NN MSE (8-bit): 0.009589. Error metric: image diff; fully digital error: 3.21%; analog error: 7.3%

sobel (Image Processing): Sobel edge detector. 3 function calls, 2 loops, 1 if/else, 88 x86-64 instructions. Evaluation input set: one 220x200-pixel color image; training input set: one 512x512-pixel color image. Topology: 9 -> 8 -> 1; fully digital NN MSE: 0.000782; analog NN MSE (8-bit): 0.00405. Error metric: image diff; fully digital error: 3.89%; analog error: 5.2%

Table 2: Area estimates for the analog neuron (ANU).

8 × 8-bit DAC: 3,096 T*
8 × Resistor Ladder (8-bit weights): 4,096 T + 1 KΩ (≈ 450 T)
8 × Differential Pair: 48 T
I-to-V Resistors: 20 KΩ (≈ 30 T)
Differential Amplifier: 244 T
8-bit ADC: 2,550 T + 1 KΩ (≈ 450 T)
Total: ≈ 10,964 T
*T denotes a transistor with width/length = 1
For consistency with the available McPAT model for the baseline processor, we used McPAT and CACTI to estimate D-NPU energy. Even though we do not have a fabrication-ready layout for the design, in Table 2 we provide an estimate of the ANU area in terms of number of transistors. T denotes a transistor with width/length = 1. As shown, each ANU (which performs eight 8-bit analog multiply-adds in parallel followed by a sigmoid) requires about 10,964 transistors. An equivalent digital neuron that performs eight 8-bit multiply-adds and a sigmoid would require about 72,456 T, of which 56,000 T are for the eight 8-bit multiply-adds and 16,456 T for the sigmoid lookup. With the same compute capability, the analog neuron requires 6.6× fewer transistors than its equivalent digital implementation.

Benchmarks. We use the benchmarks in [18] and add one more, blackscholes. These benchmarks represent a diverse set of application domains, including financial analysis, signal processing, robotics, 3D gaming, compression, and image processing. Table 1 summarizes information about each benchmark: application domain, target code, neural-network topology, training/test data, and final application error levels for fully digital neural networks and analog neural networks using our customized RPROP-based CDLM training algorithm. The neural networks were trained using either typical program inputs, such as sample images, or a limited number of random inputs. Accuracy results are reported using an independent data set, e.g., an input image that is different from the image used during training. Each benchmark requires an application-specific error metric, which is used in our evaluations.


Figure 5: A-NPU with 8 ANUs vs. D-NPU with 8 PEs.

As shown in Table 1, each application benefits from a different neural network topology, so the ability to reconfigure the A-NPU is critical.

A-NPU vs 8-bit D-NPU. Figure 5 shows the average energy improvement and speedup for one invocation of an A-NPU over one invocation of an 8-bit D-NPU, where the A-NPU is clocked at 1/3 the D-NPU frequency. On average, the A-NPU is 12.1× more energy efficient and 3.3× faster than the D-NPU. While consuming significantly less energy, the A-NPU can perform 64 multiply-adds in parallel, whereas the D-NPU can only perform eight. This energy-efficient, parallel computation explains why jpeg, with the largest neural network (64→16→8→64), achieves the highest energy and performance improvements, 82.2× and 15.2×, respectively. The larger the network, the higher the benefits from the A-NPU. Compared to a D-NPU, an A-NPU offers a higher level of parallelism at low energy cost, which can potentially enable using larger neural networks to replace more complicated code.

Whole application speedup and energy savings. Figure 6 shows the whole-application speedup and energy savings when the processor is augmented with an 8-bit, 8-PE D-NPU, our 8-ANU A-NPU, and an ideal NPU, which takes zero cycles and consumes zero energy. Figure 6c shows the percentage of dynamic instructions subsumed by the neural transformation of the candidate code. The results show, following Amdahl's Law, that the larger the number of dynamic instructions subsumed, the larger the benefit from neural acceleration.
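To make the arithmetic behind these two trends concrete, the short Python sketch below (not part of the original evaluation) recomputes a per-invocation improvement and the Amdahl's Law bound on whole-application speedup. The function names and the 20× region speedup are illustrative assumptions; the jpeg cycle and energy values, the 1/3 clock ratio, and the subsumed-instruction fractions come from the data above.

```python
# Minimal sketch (not from the paper): how per-invocation gains and
# whole-application gains relate. Helper names and the example region
# speedup are illustrative assumptions; cycle counts, energies, the clock
# ratio, and subsumed fractions come from the measurements above.

def invocation_improvement(d_cycles, d_energy_nj, a_cycles, a_energy_nj,
                           a_clock_ratio=1.0 / 3.0):
    # The A-NPU runs at a_clock_ratio of the digital clock, so each of its
    # cycles lasts 1/a_clock_ratio digital cycles.
    speedup = d_cycles / (a_cycles / a_clock_ratio)
    energy_saving = d_energy_nj / a_energy_nj
    return speedup, energy_saving

def amdahl_speedup(subsumed_fraction, region_speedup):
    # Amdahl's Law bound: only the subsumed fraction of dynamic instructions
    # is accelerated by region_speedup; the rest runs at the original speed.
    return 1.0 / ((1.0 - subsumed_fraction) + subsumed_fraction / region_speedup)

# jpeg (64->16->8->64): 365 D-NPU cycles vs. 8 A-NPU cycles at 1/3 the clock,
# and 111.54 nJ vs. 1.36 nJ -> roughly 15.2x faster and 82x less energy.
print(invocation_improvement(365, 111.54, 8, 1.36))

# With ~97% of blackscholes' dynamic instructions subsumed, a 20x region
# speedup (illustrative) already yields a ~13x whole-application speedup;
# with kmeans' ~30% subsumed, the bound never exceeds ~1.42x.
print(amdahl_speedup(0.972, 20.0), amdahl_speedup(0.297, 20.0))
```

Consistent with this bound, blackscholes (97.2% of dynamic instructions subsumed) sees the largest whole-application benefit in Figure 6, while kmeans (29.7% subsumed) sees the smallest.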

[Figure 6(a) data: whole-application speedup with Core + D-NPU, Core + A-NPU, and Core + Ideal NPU (geometric means 2.5×, 3.8×, and 5.4×), along with the per-benchmark normalized dynamic-instruction breakdown into other instructions and NPU queue instructions.]

    (a) Whole application speedup.

[Figure 6(b) data: whole-application energy reduction with Core + D-NPU, Core + A-NPU, and Core + Ideal NPU (geometric means 5.1×, 6.4×, and 6.5×).]

    (b) Whole application energy saving.

[Figure 6(c) data and application-error data: percentage of dynamic instructions subsumed per benchmark (blackscholes 97.2%, fft 67.4%, inversek2j 95.9%, jmeint 95.1%, jpeg 56.3%, kmeans 29.7%, sobel 57.1%); application-level error with a fully precise digital sigmoid vs. the analog sigmoid (geometric means 6.02% vs. 7.32%, with jmeint highest at 18.41% vs. 19.67%); and application-level error (image diff) when each neuron is limited to at most 4, 8, or 16 incoming synapses (14.32%, 6.62%, 5.76%).]