
On Neural Architecture Search for Resource-Constrained Hardware Platforms

Qing Lu, University of Notre Dame

[email protected]

Weiwen Jiang, University of Notre Dame

[email protected]

Xiaowei Xu, University of Notre Dame

[email protected]

Yiyu Shi, University of Notre Dame

[email protected]

Jingtong Hu, University of Pittsburgh

[email protected]

ABSTRACT

In the recent past, the success of Neural Architecture Search (NAS) has enabled researchers to broadly explore the design space using learning-based methods. Apart from finding better neural network architectures, the idea of automation has also inspired efforts to improve their implementations on hardware. While some practices of hardware machine-learning automation have achieved remarkable performance, the traditional design concept is still followed: a network architecture is first structured for excellent test accuracy, and then compressed and optimized to fit into a target platform. Such a design flow easily leads to inferior local-optimal solutions. To address this problem, we propose a new framework to jointly explore the space of neural architecture, hardware implementation, and quantization. Our objective is to find the quantized architecture with the highest accuracy that is implementable on given hardware specifications. We employ FPGAs to implement and test our designs under limits on lookup tables (LUTs) and required throughput. Compared to separate design/search methods, our framework demonstrates much better performance under strict specifications and generates designs of higher accuracy by 18% to 68% in the task of classifying CIFAR10 images. With 30,000 LUTs, a light-weight design is found to achieve 82.98% accuracy and 1293 images/second throughput, under constraints for which the traditional method fails to find a valid solution at all.

1 INTRODUCTION

Machine learning has demonstrated great success in a variety of applications [10, 15, 19, 22], which leads to an ever-growing demand for off-the-shelf solutions to application-specific systems [5, 8, 14, 23]. Designing neural networks by hand, however, requires substantial expertise and labor. In response to this challenge, automated machine learning (Auto-ML) has been proposed to build neural networks without human intervention; in particular, Neural Architecture Search (NAS) [26] has been proposed to identify neural architectures with accuracy competitive with or even better than the best designs explored by experts.



On the other side, when deploying architectures explored by NAS to real-world platforms, such as AIoT [12] and mobile embedded platforms [2–4, 13, 17], they are inevitably limited by hardware constraints. As a result, hardware-aware machine learning [2, 6, 17, 25] has emerged to explore neural architectures with consideration of hardware efficiency on a fixed target hardware design. Most recently, the authors of [7] opened the hardware space in NAS to jointly explore architectures and hardware designs. However, almost all existing methods adopt a separated optimization flow [9]: a large network is first invented with excellent performance, and then compressed and optimized to fit into a target platform. Note that compression techniques, especially quantization [18–21], have to be considered to fit the model into resource-constrained hardware platforms, as they can tremendously reduce the hardware resource consumption and the related computation cost. Consequently, such approaches usually fail to find the overall optimal solution. For example, the best quantization scheme specifically tuned for one network may be significantly inferior when applied to another network, or even not implementable under certain hardware specifications.

In this paper, we delve into NAS-based methods of design automation for hardware-constrained platforms. We aim to answer a concrete question: for a specific task, what is the neural architecture with the highest accuracy that is implementable under a defined set of hardware specifications? In particular, a novel co-exploration framework is proposed to investigate the optimality of neural architectures with quantization. In our framework, we parameterize the layer-wise quantization and search these parameters jointly with the hyperparameters of the architecture. A hardware model is built by searching the hardware space and validated against the design specifications. We use FPGAs as the target platform and run experiments under various configurations and specifications. Compared with existing separate search methods, the proposed joint search is more robust, achieving 18% to 68% higher accuracy on commonly used data sets.

The remainder of this paper is organized as follows. In Section 2, we outline the progress in neural architecture search associated with hardware design. After that, we present the details of our design framework in Section 3. Section 4 then investigates the performance of our framework by experiment, compared with conventional methods of separate search. Finally, Section 5 remarks on the conclusion and future work.


Figure 1: The Pure-Software NAS framework. (The controller RNN samples an architecture CN with probability p; a child network with architecture CN is trained to get accuracy A; the gradient of p is scaled by A to update the controller.)

2 TODAY’S NAS: FROM PURE-SOFTWARE VIA HARDWARE-AWARE TO CO-DESIGN

Like the development of embedded system design, the evolution of today's NAS has gone through three phases: (1) exploring structure only, called pure-software NAS in this paper; (2) considering efficiency on fixed hardware while exploring structures, called hardware-aware NAS; (3) co-exploring hardware implementation and structures, called Co-Design NAS. In the following, we introduce each phase in detail, and then outline the development trend of NAS in the near future.

Pure-Software NAS. Figure 1 shows the NAS framework presented in [26]. In NAS, a controller (implemented as an RNN) iteratively generates a child network and obtains its accuracy A by training it on a held-out data set. Then, accuracy A is used as the reward signal for the controller's self-evolution in the next iteration. The search process stops when the controller converges to the maximum accuracy or a termination condition is satisfied. Existing work has demonstrated that automatically generated network architectures can achieve accuracy close to the best human-invented architectures on the image classification task [26, 27].
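For illustration, the search loop just described can be sketched in a few lines of Python. The controller object and train_child function below are hypothetical stand-ins for the RNN policy and the child-network training routine, not code from [26]:

    def nas_search(controller, train_child, max_episodes=1000):
        """Minimal sketch of the pure-software NAS loop."""
        best_arch, best_acc = None, 0.0
        for episode in range(max_episodes):
            arch = controller.sample()            # sample a child network with probability p
            acc = train_child(arch)               # train on held-out data to get accuracy A
            controller.update(arch, reward=acc)   # A is the reward for self-evolution
            if acc > best_acc:
                best_arch, best_acc = arch, acc
        return best_arch, best_acc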

Search Space: For image classification, a linear array is applied as the backbone of the network architecture. In [26], each cell is a normal convolution operation, and the per-cell search space is composed of the filter size, strides, and the number of filters. In [27], the authors propose to incorporate B blocks (B = 5 in the paper) in one cell, where each block is a 2-branch structure mapping 2 input tensors to 1 output tensor, and the controller determines the type of operation on each input tensor. The operations include different sizes of depthwise-separable convolution, atrous convolution, average pooling, max pooling, skip connection, etc.

Hardware-Aware NAS. Figure 2 illustrates the works on searching neural architectures targeting fixed hardware [2, 13, 17]. In these works, mobile phones are commonly employed as the testbed. In order to guarantee that the final system satisfies the timing specification, the framework tests the hardware efficiency (e.g., latency, energy consumption) of each child network. As shown in Figure 2, after training, the child network is sent to the target platform for execution. During execution, the hardware efficiency E is profiled. E together with the accuracy A is applied to update the controller to explore a better neural network architecture.

More specifically, the authors in [13] propose two optimization methods. Assume the hardware efficiency E stands for latency. Given the latency specification S, the first method is to maximize the accuracy A, subject to the constraint E ≤ S.

Figure 2: The Hardware-Aware NAS framework. (The controller RNN samples a child network, which is trained for accuracy A; its efficiency E is predicted on fixed hardware, e.g., a mobile phone; the gradient of p is scaled by A and E to update the controller.)

With this method, the search still has the mono-criterion of maximizing accuracy. It can guarantee that the hardware efficiency meets the specification, but it cannot provide Pareto-optimal solutions. Therefore, a weighted-product method to approximate Pareto-optimal solutions is proposed, revising the objective function to

max A × (E/S)^w,

where w is the weight factor. In this way, the controller can effectively approximate Pareto solutions near the specification S.
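As a sketch, the weighted-product reward can be written as below; the value of w is an assumption for illustration (MnasNet-style searches use small negative exponents so that exceeding the latency specification lowers the reward), not a number from [13]:

    def weighted_product_reward(accuracy, latency, spec, w=-0.07):
        # Objective A * (E/S)^w from above. With E the measured latency
        # and S the latency specification, a negative w trades accuracy
        # against latency around the spec; w = -0.07 is an assumed value.
        return accuracy * (latency / spec) ** w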

All the above approaches consider hardware efficiency during the search. However, they neglect the hardware design freedom that is commonly available in many AI applications (e.g., IoT, embedded systems). As a result, they can potentially lead to inferior solutions. A more elegant way to tailor the hardware design for the neural architecture needs to be exploited.

Co-Design NAS. Most recently, we proposed hardware/software co-design NAS to simultaneously optimize architecture accuracy and hardware efficiency. Interestingly, we observe that the hardware design space is tightly coupled with the architecture search space, i.e., the best neural architecture depends on the hardware (hardware-aware NAS), and the best hardware depends on the neural architecture. It is therefore best to jointly explore both spaces to push forward the Pareto frontier between hardware efficiency and test accuracy for better design tradeoffs.

Specifically, our architecture search space and hardware design space co-exploration framework is shown in Figure 3. The proposed co-exploration can be built on any existing NAS framework [1, 2, 11, 26] by expanding it to delve into the hardware design space, where a two-level (fast and slow) exploration is iteratively conducted. In the fast exploration, the best hardware design is identified for the sampled neural architectures without lengthy training. Architectures with inferior hardware efficiency are quickly pruned, which significantly accelerates the search process. Thereafter, the superior candidates are trained in the slow exploration for controller updates using policy-gradient reinforcement learning to explore the coupled architecture search space. The optimization objectives in the hardware design space can vary according to the design specifications, such as area, monetary cost, energy efficiency, reliability, resource utilization, etc.

Near Future. The essential objective of NAS is AI democratization. Although Co-Design NAS has already significantly pushed forward the progress of automatically implementing machine learning tasks on hardware, without considering the constrained hardware resources of edge computing, where tons of AI applications are waiting to be deployed, it can easily find inferior solutions or even fail to find feasible ones. In the following sections, we innovate on NAS to propose quantization search for resource-constrained hardware on the edge.


Figure 3: The Co-Design NAS framework. (The controller predicts an architecture from the architecture search space; a fast-level search selects a hardware design from the hardware design space and evaluates timing, monetary cost, utilization, etc.; only candidates that meet the timing constraint proceed to the slow-level training of the child network for accuracy, which updates the controller.)

3 TOMORROW’S NAS: LANDING ON EDGE WITH QUANTIZATION SEARCH FOR RESOURCE-CONSTRAINED HARDWARE

This section presents our framework for NAS and hardware co-design. Specifically, we target jointly optimizing neural architectures together with their quantization and hardware designs under multiple objectives, which can guarantee that the resultant implementation meets the given specifications.

The overall framework is illustrated in Figure 5. We use the controller to explore the architecture space and the hardware search tool to explore the hardware space. In each episode, the controller samples a child network architecture as well as its quantization scheme. Based on this network, the hardware builder performs a search through the hardware space for the model of an FPGA-based design. During this process, each candidate model is validated against the design specifications, and the result is used to generate the return to the controller. If any FPGA model is valid, the sampled quantized network is trained on a held-out dataset and its test accuracy is fed back to the controller; otherwise, the return is zero. The following sections reveal the details of this framework.

3.1 Design Space and Parameterization

This paper takes the widely used convolutional neural networks and their FPGA implementation to demonstrate the proposed framework, where a series of stacked convolutional layers is optimized and implemented on an FPGA. The proposed framework jointly considers three design spaces: the architecture space, the quantization space, and the hardware space.

3.1.1 Architecture Space. We consider a neural network layer composed of a convolutional operation followed by a pooling operation. For a convolutional operation, the exploration space can be parameterized by the number of filters (N), filter height (Fh), filter width (Fw), stride height (Sh), and stride width (Sw). For a pooling operation, we employ the size parameter Ps to indicate its length and stride. As a whole, each layer can be represented by a 6-element sequence (N, Fh, Fw, Sh, Sw, Ps), and the architecture space of each layer has size

|A| = ∏_{p∈A} |p|, where A = {N, Fh, Fw, Sh, Sw, Ps},

and |p| denotes the number of possible values of a parameter p.
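As a back-of-the-envelope check, using the per-parameter value sets later listed in Table 1 (so the numbers are specific to our experiment, not general):

    # Per-layer architecture space size |A|: the product of the number
    # of choices for each parameter, with value sets taken from Table 1.
    space = {
        "N":  (24, 36, 48, 64),  # number of filters
        "Fh": (1, 3, 5, 7),      # filter height
        "Fw": (1, 3, 5, 7),      # filter width
        "Sh": (1, 2, 3),         # stride height
        "Sw": (1, 2, 3),         # stride width
        "Ps": (1, 2),            # pooling size
    }
    size = 1
    for choices in space.values():
        size *= len(choices)
    print(size)  # 4 * 4 * 4 * 3 * 3 * 2 = 1152 candidates per layer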

3.1.2 Quantization Space. We apply quantization to all the trainable parameters and activations in each layer to trade off hardware size against test accuracy. In this paper, we consider a linear quantization with a fixed-point representation composed of integer and fractional parts, which are taken as separate parameters in our framework.

Figure 4: Overview of the proposed exploration framework. (The controller spans the architecture space and the hardware space; layers 1 to L are clustered into partitions 1 to G.)

Assuming the rectified linear unit (ReLU) as the activation function, the output A of a convolutional layer is non-negative, and we apply unsigned quantization as

Q(A) = clip(round(A/∆q) × ∆q, 0, B − ∆q),   (1)

where ∆q is the precision and B is the range amplitude, determined by the bit widths of the integer part Ai and the fractional part Af, respectively. Their relationship is B = 2^Ai and ∆q = 2^−Af.

For the weight and bias parameters W, signed quantization is applied, such that

Q(W) = clip(round(W/∆q) × ∆q, −B, B − ∆q),   (2)

where the relationship between ∆q, B, Wi, and Wf is B = 2^(Wi−1) and ∆q = 2^−Wf.
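A minimal NumPy sketch of Equations (1) and (2), assuming round-to-nearest semantics for round() and elementwise clipping for clip():

    import numpy as np

    def quantize_activation(a, ai, af):
        # Unsigned quantization, Eq. (1): B = 2^Ai, dq = 2^-Af.
        dq, b = 2.0 ** -af, 2.0 ** ai
        return np.clip(np.round(a / dq) * dq, 0.0, b - dq)

    def quantize_weight(w, wi, wf):
        # Signed quantization, Eq. (2): B = 2^(Wi-1), dq = 2^-Wf.
        dq, b = 2.0 ** -wf, 2.0 ** (wi - 1)
        return np.clip(np.round(w / dq) * dq, -b, b - dq)

    # Example: 2 integer and 3 fractional bits for activations.
    print(quantize_activation(np.array([0.3, 1.77, 5.0]), ai=2, af=3))
    # -> [0.25  1.75  3.875], since dq = 0.125 and B - dq = 3.875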

Similarly to the architecture parameterization, the quantization scheme can be represented by Q = {Ai, Af, Wi, Wf}, and thus the quantization space has size |Q| = ∏_{p∈Q} |p|.

3.1.3 Hardware Space. Given a determined architecture together with its quantization, the implementation varies in two aspects: intra-layer parallelism (single-layer accelerator design) and inter-layer parallelism (mapping accelerators onto an FPGA).

For the single-layer accelerator design, we adopt the widely used tile-based paradigm [24]. We represent the tiling parameters as functions of the layer dimensions: Tm(M) is the number of channels of the input tiles, Tn(N) the number of channels of the output tiles, Tr(R) the height of the input tile, and Tc(C) the width of the input tile, where M, R, and C are the number of channels, rows, and columns of the input feature maps, respectively.

For the mapping of single-layer accelerators onto an FPGA, we partition the layers and allocate hardware resources to the accelerators. The partition scheme P, as a function of L, is a selection from all the ways of clustering the L layers into any number of consecutive sections from 1 to L, which results in a space of 2^(L−1) candidates. The hardware space is then represented by

|H(L)| = 2^(L−1) ∏_{p∈H} |p|, where H = {Tm, Tn, Tc, Tr}.


Figure 5: Hardware-architecture co-design framework. (The controller samples a network architecture and quantization from the architecture space; the hardware builder searches the hardware space for an FPGA model satisfying the specifications; if one exists, the trainer returns the accuracy.)

3.1.4 Overall. The proposed framework jointly determines the architecture, quantization, and tiling parameters, together with the partition of layers, to identify the neural architecture and hardware implementation such that both test accuracy and hardware efficiency are maximized.

We use a reinforcement learning method to explore the architecture and quantization spaces, and develop a multi-objective search algorithm to explore the hardware space. More specifically, a controller directs the exploration, as shown in Figure 4. Details are discussed in the following sections.

3.2 Update the Controller

The architecture and quantization parameters are both optimized to generate high accuracy. As shown in Figure 5, we employ a reinforcement learning method to explore the spaces A and Q, where the controller interacts with an environment modeled as a Markov Decision Process (MDP). In each episode, the controller rolls out a sequence of actions under a stochastic policy. These actions, used as the architecture parameters A and quantization parameters Q, are mapped to a quantized child network. Next, we evaluate the sampled child network in two stages. In the first stage, a hardware search tool verifies whether the sampled network is implementable under the constraints of the design specifications. If the result is positive, i.e., there exists an implementable hardware model, the second stage launches to train and validate the child network on a held-out dataset. After the child network validation is finished, a reward signal

R(a, q) = { 0, if H(a, q) = ∅; Acc, otherwise }   (3)

is returned to the controller for updating. In the above formula, H(a, q) represents the hardware space given the sampled parameters a and q. We follow the Monte Carlo policy gradient algorithm [16] to update the controller using

∇J(θ) = (1/m) ∑_{k=1}^{m} ∑_{t=1}^{T} γ^(T−t) ∇_θ log π_θ(a_t | a_(t−1):1)(R_k − b),   (4)

where m is the batch size and T is the total number of steps in each trajectory. The rewards are discounted at every step by an exponential factor γ, and the baseline b is the exponential moving average of the rewards.
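A PyTorch-flavored sketch of the update in Equation (4); the trajectory bookkeeping, variable names, and the discount value are illustrative assumptions, not the paper's implementation:

    import torch

    def reinforce_update(optimizer, log_probs, rewards, baseline, gamma=0.99):
        # Eq. (4): log_probs is a list of m trajectories, each a list of T
        # autograd-tracked log pi_theta(a_t | a_(t-1):1) terms; rewards holds
        # the m episode rewards R_k; baseline b is an exponential moving
        # average of past rewards. gamma = 0.99 is an assumed discount.
        loss = 0.0
        for trajectory, r_k in zip(log_probs, rewards):
            T = len(trajectory)
            for t, log_p in enumerate(trajectory, start=1):
                # earlier steps receive heavier discounting via gamma^(T-t)
                loss = loss - (gamma ** (T - t)) * log_p * (r_k - baseline)
        loss = loss / len(log_probs)
        optimizer.zero_grad()
        loss.backward()    # autograd supplies grad_theta of log pi_theta
        optimizer.step()   # ascend J by descending -J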

3.3 Co-Explore Architecture and Quantization

Much like the architecture, the quantization also determines the overall performance and computational complexity of a network. Therefore, it is natural to automate the design of quantization together with the design of the architecture. Under constrained hardware resources, the joint exploration of the architecture and quantization spaces is essentially the optimization of the trade-off between structural complexity and data cleanness. The reward signal thus reflects how efficiently the hardware design freedom is utilized.

The implementation of architecture-quantization joint search may vary with different settings, but generally there are two types of methods, characterized by the number of controllers used: a single controller that predicts both architecture and quantization, or two controllers that predict them separately. In this paper, we focus on the single-controller method and extend the RNN-based controller in [26]. As displayed in Figure 6, we simply insert 4 additional steps into the controller, each step sampling one of the aforementioned quantization parameters.

For comparison, we list the plausible methods for automating the design of a neural architecture with quantization. The difference among these methods resides in the space to be explored.

- quantization search. This is the traditional method that employs the controller to search only the quantization for a given architecture.

- architecture search. This is the reverse procedure of quantization search, where the quantization is fixed and the objective is to find the architecture that best fits this quantization.

- joint search. This is the exploration of architecture and quantization as related spaces, and what we investigate in this work.

3.4 Explore Hardware Space

We revisit the hardware space discussed in Section 3.1.3, this time with consideration of our hardware model. In our specific model, the specifications are the LUT usage of the target FPGA and the throughput in frames per second, though the framework can be adapted to other metrics without much effort. In particular, only Tn and Tm are variable, while Tr and Tc are dropped due to their irrelevance to the target specifications. The other parameters, such as the quantizations, are given as constants rather than variables. As a result, the actual hardware space to explore is

|H(L)| = 2^(L−1) ∏_{i=1}^{L} N_{i−1} N_i,   (5)

whose size grows exponentially with L. To avoid exploring H exhaustively, we have developed an efficient search algorithm using dynamic programming. The hardware model and search algorithm are explained in detail as follows.

3.4.1 Tile-Based Implementation. In the tile-based model, the main processing unit is the quantized computation engine (QCE), which is composed of an array of multipliers, an adder tree, a truncator, and the accumulation registers (Figure 7). For implementation on FPGAs, the consumption of lookup tables (LUTs) by each QCE scales with Tn, Tm, and the bit-width of the layer, as these configure the size of the above components.


Figure 6: The controller in sync search samples both architecture and quantization parameters layer by layer. (For each layer, the sampled steps are: number of filters, filter height, filter width, stride height, stride width, pooling size, integer and fractional widths of the weights, and integer and fractional widths of the activations.)

Figure 7: RTL architecture of the quantized computing engine. (A Tm × Tn array of weight/activation multipliers feeds an aligner and adder tree, followed by truncators configured by the quantization information.)

Due to the data format inconsistency between activations and weights (inter-layer), and the inconsistency between the activations of consecutive layers (intra-layer), we customize the QCEs in each layer according to the 6 parameters Wi, Wf, Ai, Af, Ai′, and Af′, where Ai′ and Af′ are the bit widths of the activations from the previous layer. The inter-layer inconsistency is handled by informing the multipliers and adders of the number of integer and fractional bits of each operand. This makes no difference to the fixed-point multipliers, as the only effect is on the position of the decimal point. For the adders, this information means performing data alignment by extending the MSB (integer part) and LSB (fractional part) to certain widths, which does not incur any extra logic. The intra-layer inconsistency, on the other hand, involves truncating the partial sum produced by the adder tree and tailoring the registered result to a target format; it directly affects the size of the truncator.

With the above model, the total size and latency of a single-layer accelerator can be approximated. As mentioned, the size of layer i is a function of Tn, Tm, and the bit widths of the weights and the incoming and outgoing activations, i.e.,

Lut_i = qce(Tn_i, Tm_i, Ai_{i−1}, Af_{i−1}, Ai_i, Af_i, Wi_i, Wf_i),   (6)

where qce is the LUT approximator of the QCE for FPGAs and is predefined as a library. Besides, the latency of the single-layer accelerator can be explicitly approximated as

Lat_i ≈ ⌈M_i / Tm_i⌉ × ⌈N_i / Tn_i⌉ × R_i × C_i × Fh_i × Fw_i.   (7)

Equations (6) and (7) are used to calculate the LUT usage and latency of a single-layer accelerator, upon which our multi-layer accelerator model is based. If a multi-layer partition g contains consecutive layers from i to j operating in a pipelined fashion, the overall size and latency are

Lut_{g:i∼j} = ∑_{k=i}^{j} Lut_k,   Lat_{g:i∼j} = max_{i≤k≤j} Lat_k.   (8)

Suppose a number of G partitions covering a total of L layers iterate their operations on the same FPGA; the total LUT usage and latency are then

Lut_{1∼G:1∼L} = max_{1≤g≤G} Lut_{g:i∼j},   Lat_{1∼G:1∼L} = ∑_{g=1}^{G} Lat_{g:i∼j}.   (9)

3.4.2 Searching Algorithm. As implied by (5), searching the hardware space H involves deciding the parameters Tn ∈ [1, N] and Tm ∈ [1, N′], as well as partitioning the L layers into G ∈ [1, L] clusters. Let rL and rT be the LUT usage and throughput limits required by the design specifications. We introduce P_L^⟨rL,rT⟩ to represent both this problem and its solution set: given the specification pair ⟨rL, rT⟩, P_L^⟨rL,rT⟩ returns all the possible solutions for implementing the CNN accelerator of L layers. The task of our search algorithm is to verify whether P_L^⟨rL,rT⟩ = ∅. To address this task, we also introduce the single-layer search problem as the basic tool: p_l^⟨rL,rT⟩ represents the problem of searching for the hardware solutions to a single layer l under the constraint ⟨rL, rT⟩, and again its solution set. We solve P_L^⟨rL,rT⟩ by incrementally solving the basic problems p_i^⟨rL′,rT′⟩ from i = 1 to L.

For any solution s ∈ P_L^⟨rL,rT⟩, we define three functions f1(s), f2(s), and f3(s) as 1) the number of LUTs consumed by the last partition of s, 2) the overall latency of s, and 3) the sum of the latencies of all the partitions in s except the last one. For any two solutions s1 and s2, if f1(s1) ≤ f1(s2), f2(s1) ≤ f2(s2), and f3(s1) ≤ f3(s2), then s1 is considered superior to s2, and all the solutions that are not inferior to any other solution compose the Pareto frontier of the solution set. Our algorithm is based on the fact that the existence of a solution is equivalent to the existence of the frontier of the solution set. Following this observation, we search only for the frontier of P_L^⟨rL,rT⟩, so that the space is significantly pruned. If the network has a depth of l + 1, there are only two scenarios of how layer (l + 1) relates to the first l layers:


Algorithm 1 Dynamic Search in Hardware Space
Input: L, rL, rT
Output: solution set S
1: Initialize rl = rL, rt = rT
2: Initialize S_0 = {s} where f_i(s) = 0 for i = 1, 2, 3
3: for each l = 1, 2, ..., L do
4:   for each s ∈ S_{l−1} do
5:     compute rl and rt using Equation (10)
6:     compute s_l = Fr(p_l^⟨rl,rt⟩)
7:     for all s′ ∈ s_l do
8:       append s′ to the last partition of s and add the result to S_l
9:     end for
10:    compute rl and rt using Equation (11)
11:    compute s_l = Fr(p_l^⟨rl,rt⟩)
12:    for all s′′ ∈ s_l do
13:      set s′′ as a new partition of s and add the result to S_l
14:    end for
15:    update the frontier S_l = Fr(S_l)
16:  end for
17: end for
18: S = S_L

(1) layer (l + 1) is appended to the last partition of the previous l layers, or

(2) layer (l + 1) forms a partition by itself.

These two scenarios differ in the constraints imposed on the problem of searching the last layer, p_{l+1}^⟨rl,rt⟩. For every solution s ∈ Fr(P_l^⟨rl,rt⟩), where Fr(S) denotes the frontier of a solution set S, we extend layer l + 1 in both manners under updated constraints. In the first case,

rl = rL − f1(s),   rt = (1/rT − f3(s)/clock_rate)^(−1),   (10)

and in the second case,

rl = rL,   rt = (1/rT − f2(s)/clock_rate)^(−1).   (11)
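A sketch of this constraint update, with a partial solution summarized by its (f1, f2, f3) values as defined above; the function signature is illustrative:

    def update_constraints(rL, rT, f1, f2, f3, clock_rate, new_partition):
        # Eqs. (10)-(11): remaining LUT budget rl and throughput target rt
        # for the layer-(l+1) subproblem; a non-positive period would mean
        # the candidate is already infeasible and should be discarded.
        if new_partition:
            # Eq. (11): a fresh partition gets the full LUT budget, and the
            # whole previous latency f2 eats into the throughput period.
            return rL, 1.0 / (1.0 / rT - f2 / clock_rate)
        # Eq. (10): the layer joins the last partition, sharing the LUT
        # budget with f1; only the earlier partitions' latency f3 counts,
        # since the last partition's latency is recomputed with the layer.
        return rL - f1, 1.0 / (1.0 / rT - f3 / clock_rate)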

When p_{l+1}^⟨rl,rt⟩ is solved, it suffices to keep only its frontier with respect to f1 and f2. Then Fr(P_{l+1}^⟨rl,rt⟩) is obtained by combining Fr(P_l^⟨rl′,rt′⟩) and Fr(p_{l+1}^⟨rl,rt⟩) in the way corresponding to how layer l + 1 is derived. The full procedure is shown in Algorithm 1.
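The frontier operator Fr(·) used throughout Algorithm 1 can be sketched as below, treating each solution as its (f1, f2, f3) tuple under the dominance relation defined earlier:

    def dominates(s1, s2):
        # s1 is superior to s2 if it is no worse in f1, f2, and f3.
        return all(a <= b for a, b in zip(s1, s2)) and s1 != s2

    def frontier(solutions):
        # Fr(S): keep every solution not dominated by another. Pruning
        # each S_l to its frontier is what keeps the dynamic-programming
        # search tractable despite the 2^(L-1)-way partition space.
        return [s for s in solutions
                if not any(dominates(t, s) for t in solutions)]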

4 EXPERIMENT

We apply the joint architecture and quantization search method to design CNN accelerators. The design objective is the classification task on the CIFAR-10 dataset, subject to two constraints: the available LUTs and the throughput tolerance. This dataset provides 50,000 images for training and 10,000 for testing, and all of them are used in the search process. Augmentation techniques consisting of normalization, rotation, shifting, and random flips are applied to the training set when tuning a network.

The architecture and quantization space used in the experiment is listed in Table 1.

Table 1: The architecture and quantization space of each CNN layer used in the experiment

Parameter                  | Symbol | Values
# filters                  | N      | (24, 36, 48, 64)
filter height              | Fh     | (1, 3, 5, 7)
filter width               | Fw     | (1, 3, 5, 7)
stride height              | Sh     | (1, 2, 3)
stride width               | Sw     | (1, 2, 3)
pooling size               | Ps     | (1, 2)
activation integer bits    | Ai     | (0, 1, 2, 3)
activation fractional bits | Af     | (0, 1, 2, 3, 4, 5, 6)
weight integer bits        | Wi     | (0, 1, 2, 3)
weight fractional bits     | Wf     | (0, 1, 2, 3, 4, 5, 6)

Figure 8: Quantization details of the sampled designs. (Panels: (a) A1-d1, (b) A1-d2, (c) B1-d1, (d) B1-d2, (e) B1-d3, (f) D, (g) E, (h) F; each panel plots the integer and fractional bit widths of the activations and weights per layer.)

For the child network, we assume each layer is composed of a convolutional operation followed by rectified linear units, possibly with zero-padding before and max-pooling after it. After the convolutional layers, two fully connected layers output the prediction distributions; these are not included in our hardware model. We train the child CNN for 30 epochs with the Stochastic Gradient Descent (SGD) algorithm, taking batches of 128 images, a learning rate of 0.01, and a momentum of 0.9. Once training is finished, the reward is the test accuracy averaged over the last 5 epochs. After the search, the best samples are selected and tuned for 150 epochs with a batch size of 64 and learning rates decaying from 0.01 down to 0.0001. The highest accuracy along the tuning process is what is finally reported. On the other side, the controller consists of a two-layer LSTM cell with 35 hidden units at each layer, accompanied by an embedding and a fully connected layer of corresponding dimensions at each time step.


Table 2: Architectural information of the sampled designs. A1 and A2 are the best architectures found by NAS in 1000 episodes; their per-layer hyperparameters are given in the form (N, Fh, Fw, Sh, Sw, Ps). B1 and B2 result from removing the strides from the search; their per-layer parameters are listed as (N, Fh, Fw, Ps). D, E, and F are the results of our joint search process with the same architecture space but no strides.

Network | Layer 1        | Layer 2        | Layer 3        | Layer 4        | Layer 5        | Layer 6        | #paras  | Acc w/o BN | Acc w/ BN
A1      | (64,3,3,1,1,1) | (48,7,5,1,1,1) | (48,5,5,2,1,1) | (64,3,5,1,1,1) | (36,5,7,1,1,1) | (64,3,1,1,2,2) | 300,804 | 87.76%     | 88.96%
A2      | (24,3,3,1,1,1) | (36,5,5,1,1,1) | (64,5,5,2,1,1) | (64,5,5,1,1,1) | (24,5,5,1,2,1) | (64,3,3,1,2,1) | 234,748 | 87.46%     | 88.87%
B1      | (64,3,3,1)     | (64,3,5,1)     | (64,3,3,2)     | (64,5,5,2)     | (64,5,3,1)     | (64,7,7,1)     | 464,960 | 89.71%     | 90.30%
B2      | (64,5,3,1)     | (64,3,5,1)     | (64,3,5,2)     | (64,5,5,2)     | (64,5,3,1)     | (64,7,7,1)     | 490,688 | 89.38%     | 90.49%
D       | (48,5,3,1)     | (48,3,1,2)     | (36,1,7,2)     | (36,7,3,1)     | (24,5,5,1)     | (24,1,1,1)     | 70,776  | 83.65%     | 84.31%
E       | (48,5,1,1)     | (48,5,3,2)     | (36,1,5,1)     | (64,7,7,2)     | (64,7,3,2)     | (48,5,3,1)     | 289,220 | 86.99%     | 88.27%
F       | (64,1,5,1)     | (36,1,7,1)     | (64,5,7,2)     | (48,5,3,2)     | (48,7,7,1)     | (36,1,5,1)     | 265,640 | 87.03%     | 88.42%

Table 3: Implementation information of the sampled designs. For networks A and B, the designs are found by quantization search on the corresponding architectures in Table 2. For D, E, and F, the quantization and hardware implementation are designed together with their architectures. The quantization details are shown in Figure 8.

Design | rL      | rT   | Acc w/o quantization | Acc w/ quantization | #LUTs   | Throughput (frames/s) | Parameter size (kbits)
A1-d1  | 100,000 | 500  | 87.76%               | 80.23%              | 99,871  | 556                   | 1,867
A1-d2  | 100,000 | 1000 | 87.76%               | 25.79%              | 99,848  | 1157                  | 1,189
B1-d1  | 100,000 | 500  | 89.71%               | 87.64%              | 96,904  | 512                   | 3,463
B1-d2  | 100,000 | 1000 | 89.71%               | 64.35%              | 98,752  | 1020                  | 2,784
B1-d3  | 300,000 | 2000 | 89.71%               | 50.93%              | 285,441 | 2083                  | 2,835
D      | 30,000  | 1000 | 83.65%               | 82.98%              | 29,904  | 1293                  | 457
E      | 100,000 | 1000 | 86.99%               | 82.76%              | 94,496  | 1042                  | 1,923
F      | 300,000 | 2000 | 87.03%               | 84.92%              | 299,860 | 2089                  | 1,217

To train the controller, we apply the Stochastic Gradient Ascent (SGA) algorithm with a learning rate of 0.2 and a batch size of 5. The baseline is the exponential moving average of the previous rewards. Finally, we build the hardware model upon an Altera Cyclone IV FPGA platform, where we set the global clock rate to 100 MHz. To enable practical FPGA synthesis, we fix the depth of the child network at 6 layers and designate the allowable LUT usage at three scales: 30,000, 100,000, and 300,000.

For comparison, we first perform NAS to find architectures with good accuracy and then search for the best quantization to fit them under certain hardware specifications. The sampled networks with the highest accuracy on the test set and their best quantized designs are shown in Tables 2 and 4, respectively. We use the floating-point accuracy in Table 2 as the reference for our quantization designs. Note that the same architecture is about 1% less accurate without batch normalization (BN), but the BN-free accuracy is what we refer to. The reason is twofold: 1) BN involves operations that require additional hardware support for both computation and memory, and, more importantly, 2) it makes the quantized network extremely unstable, whose output may have a prohibitively large variance for the search algorithm. Next, we use the joint method to search architecture and quantization together under specifications for which the quantization search fails to provide a good result, in a general sense, with respect to the best architectures found. As a result, we obtain three designs of CNN accelerators with different hardware specifications. The details of these designs are reported in Table 3.

Table 4: Quantization search results for the sampled networks A and B in Table 2. The best accuracy on the CIFAR10 test set in 2000 episodes is reported ('x' denotes that no valid design was found).

Network | rT   | rL=30,000 | rL=100,000 | rL=300,000
A1      | 500  | 10.65%    | 80.23%     | 86.16%
A1      | 1000 | x         | 25.79%     | 84.90%
A1      | 2000 | x         | x          | x
A2      | 500  | 55.45%    | 85.92%     | 86.26%
A2      | 1000 | x         | 67.30%     | 76.51%
A2      | 2000 | x         | x          | x
B1      | 500  | 10.02%    | 87.64%     | 87.34%
B1      | 1000 | x         | 64.35%     | 87.43%
B1      | 2000 | x         | x          | 50.93%
B2      | 500  | 10.20%    | 85.31%     | 88.53%
B2      | 1000 | x         | 43.81%     | 86.50%
B2      | 2000 | x         | x          | 16.71%

Table 4 shows the quantization search results under throughput requirements of 500, 1000, and 2000 frames per second. The drop in accuracy with increasing throughput is clearly very sharp. For example, the accuracy of the quantized network B1 drops from 87.64% to 64.35% when the throughput requirement is doubled from 500 to 1000 frames per second at rL=100,000. This implies that although the architecture has excellent original performance, it is too sensitive to quantization to be suitable for resource-constrained hardware design.


Figure 9: The plotted training process: comparison between NAS and joint search. (Panels: (a) NAS with stride; (b) NAS without stride; (c) Joint: rL=30,000, rT=500; (d) Joint: rL=300,000, rT=500.)

On the other hand, our joint search method has found a solution that achieves 82.76% accuracy while meeting the 1000 frames/s throughput requirement. In contrast, the original accuracy of its architecture E is only 86.99%, worse than B1 by nearly 3%, but its accuracy is more robust to quantization, with only about 4% degradation. With the joint search, we even found a design using fewer than 30,000 LUTs that achieves 82.98% accuracy and 1293 frames/s throughput. Note that there is no valid design for almost every sampled architecture under quantization search, even with 300,000 available LUTs.

We further compare the quantization designs for A1 and B1 with those found by joint search for D, E, and F. As illustrated in Figure 8, the convolutional layers generally exhibit different patterns in their bit-width requirements. Another observation is that, with a fixed architecture, quantization search tends to spend more bits on the weights than on the activations, while joint search treats the two more evenly.

5 CONCLUSION AND CHALLENGE

In this paper, we overviewed the recent development of automated machine learning, identifying the trend towards hardware-software co-design using NAS. A hardware-aware co-design framework is proposed to jointly explore the architecture, quantization, and hardware design spaces. Experiments show that joint search provides much more flexibility in compressed designs and more robust performance compared to the traditional approach of quantizing a fixed, manually selected architecture.

In this project, however, difficulties in applying hardware-aware NAS were also identified. Compared to pure NAS, the controller bears the additional burden of learning the hardware constraints from the beginning of the search process, resulting in higher variance and early convergence to local optima (Figure 9). Moreover, the search process is computation-intensive and resource- and time-consuming. Lastly, the hardware exploration and design automation heavily rely on a hardware model that requires considerable effort to build. There remains much room for improvement on this topic.

ACKNOWLEDGEMENT

This work was supported in part by the National Science Foundation under Grants CCF-1820537 and CNS-1822099.

REFERENCES

[1] Gabriel Bender et al. 2018. Understanding and simplifying one-shot architecture search. In Int. Conf. on Machine Learning. 549–558.
[2] Han Cai et al. 2018. ProxylessNAS: Direct neural architecture search on target task and hardware. arXiv preprint arXiv:1812.00332 (2018).
[3] Weiwen Jiang et al. 2016. Optimal functional-unit assignment and buffer placement for probabilistic pipelines. In 2016 Int. Conf. on Hardware/Software Codesign and System Synthesis (CODES+ISSS). IEEE, 1–10.
[4] Weiwen Jiang et al. 2017. Optimal functional unit assignment and voltage selection for pipelined MPSoC with guaranteed probability on time performance. In ACM SIGPLAN Notices, Vol. 52. ACM, 41–50.
[5] Weiwen Jiang et al. 2018. Heterogeneous FPGA-based cost-optimal design for timing-constrained CNNs. IEEE Trans. Comput.-Aided Design of Integr. Circuits and Syst. 37, 11 (2018), 2542–2554.
[6] Weiwen Jiang et al. 2019. Accuracy vs. efficiency: Achieving both through FPGA-implementation aware neural architecture search. In Proc. 56th Annual Design Automation Conference. ACM, 5.
[7] Weiwen Jiang et al. 2019. Hardware/software co-exploration of neural architectures. arXiv preprint arXiv:1907.04650 (2019).
[8] Weiwen Jiang et al. 2019. XFER: A novel design to achieve super-linear performance on multiple FPGAs for real-time AI. In Proc. of Int. Symp. on FPGA. ACM, 305–305.
[9] David Koeplinger et al. 2016. Automatic generation of efficient accelerators for reconfigurable hardware. In 2016 ACM/IEEE 43rd Annual Int. Symp. on Comput. Archit. (ISCA). IEEE, 115–127.
[10] Boyang Li et al. 2019. Exploiting computation power of blockchain for biomedical image segmentation. In IEEE Conf. on Computer Vision and Pattern Recognition Workshops.
[11] Hanxiao Liu, Karen Simonyan, and Yiming Yang. 2018. DARTS: Differentiable architecture search. arXiv preprint arXiv:1806.09055 (2018).
[12] Research and Markets. 2018. Artificial intelligence in IoT: AIoT technology, platforms, applications and services by industry vertical 2018–2023. Report (2018).
[13] Mingxing Tan et al. 2018. MnasNet: Platform-aware neural architecture search for mobile. arXiv preprint arXiv:1807.11626 (2018).
[14] Tianchen Wang et al. 2019. MSU-Net: Multiscale statistical U-Net for real-time 3D cardiac MRI video segmentation. In Proc. of Medical Image Computing and Computer Assisted Interventions (MICCAI).
[15] Tianchen Wang et al. 2019. SCNN: A general distribution based statistical convolutional neural network with application to video object detection. In AAAI Conf. on AI.
[16] Ronald J. Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8, 3–4 (1992), 229–256.
[17] Bichen Wu et al. 2018. FBNet: Hardware-aware efficient ConvNet design via differentiable neural architecture search. arXiv preprint arXiv:1812.03443 (2018).
[18] Xiaowei Xu et al. 2017. Edge segmentation: Empowering mobile telemedicine with compressed cellular neural networks. In Proc. of the 36th Int. Conf. on Computer-Aided Design. IEEE Press, 880–887.
[19] Xiaowei Xu et al. 2018. Efficient hardware implementation of cellular neural networks with incremental quantization and early exit. ACM J. on Emerging Technologies in Computing Systems (JETC) 14, 4 (2018), 48.
[20] Xiaowei Xu et al. 2018. Quantization of fully convolutional networks for accurate biomedical image segmentation. arXiv preprint arXiv:1803.04907 (2018).
[21] Xiaowei Xu et al. 2018. Resource constrained cellular neural networks for real-time obstacle detection using FPGAs. In 2018 19th Int. Symp. on Quality Electronic Design. IEEE, 437–440.
[22] Xiaowei Xu et al. 2018. Scaling for edge inference of deep neural networks. Nature Electronics 1, 4 (2018), 216.
[23] Xiaowei Xu et al. 2019. Whole-heart and great vessel segmentation in congenital heart disease using deep neural networks and graph matching. In Proc. of Medical Image Computing and Computer Assisted Interventions (MICCAI).
[24] Chen Zhang et al. 2015. Optimizing FPGA-based accelerator design for deep convolutional neural networks. In Proc. of FPGA. ACM, 161–170.
[25] Xinyi Zhang et al. 2019. When neural architecture search meets hardware implementation: From hardware awareness to co-design. In Proc. of ISVLSI. 25–30.
[26] Barret Zoph et al. 2016. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578 (2016).
[27] Barret Zoph et al. 2018. Learning transferable architectures for scalable image recognition. In Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition. 8697–8710.