
Domino: A Tailored Network-on-Chip Architecture to Enable Highly Localized Inter- and Intra-Memory DNN Computing

Kaining Zhou*, Yangshuo He*, Rui Xiao* and Kejie Huang*

*Zhejiang University

ABSTRACT

The ever-increasing computation complexity of fast-growing Deep Neural Networks (DNNs) has called for new computing paradigms to overcome the memory wall in conventional Von Neumann computing architectures. The emerging Computing-In-Memory (CIM) architecture has been a promising candidate to accelerate neural network computing. However, the data movement between CIM arrays may still dominate the total power consumption in conventional designs. This paper proposes a flexible CIM processor architecture named Domino to enable stream computing and local data access to significantly reduce the data movement energy. Meanwhile, Domino employs tailored distributed instruction scheduling within a Network-on-Chip (NoC) to implement inter-memory computing and attain mapping flexibility. The evaluation with prevailing CNN models shows that Domino achieves 1.15-to-9.49× power efficiency over several state-of-the-art CIM accelerators and improves the throughput by 1.57-to-12.96×.

1. INTRODUCTION

The rapid development of Deep Neural Network (DNN) algorithms has led to high energy consumption due to millions of parameters and billions of operations in one inference [19, 37, 39]. Meanwhile, the increasing demand for Artificial Intelligence (AI) computing calls for flexible and power-efficient computing platforms to reduce inference energy and accelerate DNN processing. However, the shrinkage of Complementary Metal-Oxide-Semiconductor (CMOS) technology nodes following Moore's Law is nearing its end. Meanwhile, conventional Von Neumann architectures encounter a "memory wall": the delay and power dissipation of accessing data are much higher than those of an Arithmetic Logic Unit (ALU), which is caused by the physical separation of storage and computing components. Therefore, new computation paradigms are urgently needed to meet the computing power demand of AI devices in the post-Moore's-Law era.

One of the most promising solutions is to adopt the Computing-in-Memory (CIM) scheme to greatly increase the parallel computation speed with much lower computation power. Recently, both volatile and non-volatile memories have been proposed as computing memories for CIM. SRAM is a volatile memory that enables partially digital [10, 17, 23, 47] or fully digital [11, 22] CIM schemes and allows weight updating during inference. Resistive Random Access Memory (ReRAM), which has shown great advantages of high density, high resistance ratio, and low reading power, is one of the most promising candidates for CIM schemes [7, 29, 36, 44, 45].

However, most research focuses only on the design of the CIM array and lacks a flexible top-level architecture for configuring the storage and computing units of DNNs. Due to the high write power of ReRAM, weight updating should be minimized [35, 38]. Therefore, new flexible interconnect architectures and mapping strategies should be employed to meet the various requirements of DNNs while achieving high component utilization and energy efficiency. There are two main challenges in designing a low-power flexible CIM processor. The first challenge is that complicated 4-D tensors have to be mapped to 2-D CIM arrays. The conventional 4-D tensor flattening method has to duplicate data or weights, which increases the on-chip memory requirement. The second challenge comes from the flexibility requirement to support various neural networks. Therefore, partial-sums and feature maps are usually stored in external memory, and an external processor is generally required to maintain complicated data sequences and computing flow.

Network-on-Chip (NoC), with high parallelism and scalability, has attracted a lot of attention from industry and academia [3]. In particular, an NoC can optimize the computation of DNN algorithms by organizing multiple cores uniformly under specified hardware architectures [1, 6, 8, 9, 13]. This paper proposes a tailored NoC architecture called Domino to enable highly localized inter- and intra-memory computing for DNN inference. Intra-memory computing is the same as in conventional CIM arrays: Multiplication-and-Accumulation (MAC) operations are executed in memory. Inter-memory computing means that the rest of the computation (partial-sum addition, activation, and pooling) is performed in the network while data are moving between CIM arrays. Consequently, a "computing-on-the-move" dataflow is proposed to maximize data locality and significantly reduce the energy of data movement. The dataflow is controlled by distributed local instructions instead of an external/global controller or processor. A synchronization and weight duplication scheme is put forward to maximize parallel computing and throughput. A highly energy-efficient CIM array is also proposed. It is worth noting that various CIM schemes can be adopted in our proposed Domino architecture.


The contributions of this paper are as follows:

• We propose an NoC-based CIM processor architecture with distributed local memory and dedicated routers for input feature maps and output feature maps, which enables flexible dataflow and reduces data movement significantly. The evaluation results show that Domino improves power efficiency and throughput by more than 15% and 57%, respectively.

• We define a set of instructions for Domino. The distributed, static, and localized instruction schedules are pre-loaded to Domino to control specific actions at runtime. Consequently, our scheme avoids the overhead of instruction and address movement in the NoC.

• We design a "computing-on-the-move" dataflow that performs Convolutional Neural Network (CNN) related operations with our proposed NoC and local instruction tables. MAC operations are completed in CIMs, and other associated computations like partial-sum addition, pooling, and activation are executed in the network while data are moving between CIM arrays.

The rest of the paper is organized as follows: Section 2 introduces the background of CNNs, dataflow, and NoCs; Section 3 details the opportunities and innovations with respect to Domino; Section 4 describes the architecture and building blocks of Domino; Section 5 illustrates the computation and dataflow model; Section 6 defines the instructions and execution; Section 7 presents the evaluation setup, experimental results, and comparison; Section 8 introduces related works; finally, Section 9 concludes this work.

2. BACKGROUND

The high computation cost of DNNs has posed challenges to AI devices for power-efficient computing in real-time environments. Conventional processors such as CPUs and GPUs are power-hungry and inefficient for AI computation. Therefore, accelerators that improve computing efficiency are under intensive development to meet the power requirements of the post-Moore's-Law era.

2.1 CNN Basics

An essential computation operation of a CNN is convolution. Within a convolution layer (CONV layer), 3-D input activations are stacked as an Input Feature Map (IFM) to be convolved with a 4-D filter. The value and shape of an Output Feature Map (OFM) are determined by the convolution results and other configurations such as the convolution stride and padding type.

Given the parameters in Tab. 1, and letting O, I, W, and B denote the OFM tensor, IFM tensor, weight tensors (filters), and bias (linear) tensor, respectively, the OFM can be calculated as

O[m][x][y] = \sum_{c=0}^{C-1} \sum_{i=0}^{K-1} \sum_{j=0}^{K-1} I[c][Sx+i][Sy+j] \times W[m][c][i][j] + B[m],

0 \le m < M, \quad 0 \le x < E, \quad 0 \le y < F,

E = \lfloor (H + 2P - K + S)/S \rfloor, \quad F = \lfloor (W + 2P - K + S)/S \rfloor.    (1)
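To make the indexing of Eqn. 1 concrete, the following reference sketch (a minimal NumPy illustration, not Domino's hardware datapath; tensor shapes follow the notation of Tab. 1) computes one CONV layer directly from the triple sum:

```python
import numpy as np

def conv_layer(ifm, weights, bias, S=1, P=0):
    """Reference CONV layer following Eqn. 1 (no hardware mapping implied).

    ifm:     IFM tensor I of shape (C, H, W)
    weights: filter tensor W of shape (M, C, K, K)
    bias:    bias tensor B of shape (M,)
    """
    C, H, Wd = ifm.shape
    M, _, K, _ = weights.shape
    E = (H + 2 * P - K + S) // S          # OFM height, floor division as in Eqn. 1
    F = (Wd + 2 * P - K + S) // S         # OFM width
    padded = np.pad(ifm, ((0, 0), (P, P), (P, P)))
    ofm = np.zeros((M, E, F))
    for m in range(M):
        for x in range(E):
            for y in range(F):
                window = padded[:, S * x:S * x + K, S * y:S * y + K]
                ofm[m, x, y] = np.sum(window * weights[m]) + bias[m]
    return ofm
```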

Usually, a nonlinear activation is applied to the OFM, followed by pooling to reduce the spatial resolution and mitigate data variation and distortion. After a series of CONV and pooling operations, Fully Connected (FC) layers are applied to classify the target objects.

Parameter   Description
H / W       IFM height / width
C           Number of IFM / filter channels
P           Padding size
K           Convolution filter height / width
Kp          Pooling filter height / width
M           Number of filters / OFM channels
E / F       OFM height / width
S           Convolution stride
Sp          Pooling stride

Table 1: Shape Parameters of a CNN.

2.2 Dataflow

The dataflow defines how data are generated, calculated, and transmitted on an interconnected multi-core chip under a given topology. General dataflows for CNN accelerators can be categorized into four types: Weight Stationary (WS), Input Stationary (IS), Output Stationary (OS), and Row Stationary (RS), based on the taxonomy and terminology proposed in [42].

Kwon et al. propose MAERI, a flexible dataflow mapping for DNNs with reconfigurable interconnects [25, 27]. Chen et al. present a row-stationary dataflow adapted to their spatial architecture designed for CNN processing [9]. Based on that, [8] is further put forward as a more hierarchical and flexible architecture. FlexFlow is another dataflow model dealing with the mismatch of parallelism types between the computation and CNN workloads [31]. These works attempt to take full advantage of computation parallelism, data reuse, and flexibility [5, 16, 34].

2.3 Network-on-Chip

Current CIM-based DNN accelerators use a bus-based H-tree interconnect [33, 40], where most of the latency of each type of CNN is spent on communication [32]. A bus has limited address space and is forced to be synchronized on a complex chip. In contrast, an NoC can span synchronous and asynchronous clock domains, or use asynchronous logic that is not clock-bound, to improve a processor's scalability and power efficiency.

Unlike simple bus connections, routers and Processing Elements (PEs) are linked according to the network topology, which enables a variety of dataflows on an NoC. Given NNs' massive computation and heavy traffic requirements, the NoC is a prospective solution that provides an adaptive architectural basis. Kwon et al. propose an NoC generator that generates customized networks for dataflow within a chip [26]. Firuzan et al. present a reconfigurable NoC architecture for a 3-D memory-in-logic DNN accelerator [15]. Chen et al. put forward Eyeriss v2, which achieves high performance with a hierarchical mesh structure [8].


3. OPPORTUNITIES & INNOVATIONS

Aiming at the challenges mentioned in Section 1, we identify the opportunities lying in prior propositions.

3.1 Opportunities

Opportunity #1. CIM schemes significantly reduce the weight movement. However, they may still have to access the global buffer or external memory to complete operations such as tensor transformation, partial-sum addition, activation, and pooling. Methods like image-to-column and systolic matrices need to duplicate input data, leading to additional storage requirements and low data reuse. Therefore, the power consumption may still be dominated by data movement.

Opportunity #2. Some overhead still exists in instruction and address transmission on an NoC. Instructions and addresses need to be dispatched from a global controller and transmitted through the routers. These overheads cause extra bandwidth requirements and introduce long latency and synchronization issues.

3.2 Domino Innovations

In the face of the aforementioned opportunities and challenges, we employ the "computing-on-the-move" dataflow, localized static schedule tables, and tailored routers to avoid the cost of off-chip access and instruction transmission over the NoC. The following are our innovations in Domino's architecture.

Innovation #1. Domino adopts CIM arrays to minimize the energy cost of weight refreshing and transmission. The data are transmitted through a dedicated dual-router NoC structure, where one router handles input feature map processing and the other is equipped with a computation unit and a schedule table for localized control and partial-sum addition. The localized data processing greatly reduces the energy of data movement.

Innovation #2. Domino adopts the "computing-on-the-move" dataflow to reduce the excessive and complex input and partial-sum movement and tensor transformation. In prior CIM-based architectures, the data have to be duplicated or stored in extra buffers for future reuse, and the addition of partial-sums is performed either in an external accumulator or in a space-consuming adder tree outside the PE arrays [27]. The "computing-on-the-move" dataflow used in Domino aims at performing the extra computation during data movement and maximizing data locality. Domino reuses inputs by transferring them over the array of tiles and adds partial-sums along unified routers on the NoC. The overall computation can be completed on-chip, and no off-chip data accesses are required.

Innovation #3. Domino adopts localized instructions in routers to achieve flexible and distributed control. We observe that existing designs with "computing-on-the-move"-like dataflow, such as [21], rely on an external controller or on header information (including source and destination) carried in transmission packets. These introduce additional transmission delay, bandwidth requirements, and power consumption. A local instruction table enables distributed and self-controlled computation to reduce the energy and time consumed by external instructions or control signals. The instructions fit and support Domino's "computing-on-the-move" dataflow well.

4. DOMINO ARCHITECTURE

This section details the proposed architecture for DNN processing. To boost DNN computation while maintaining flexibility, we propose an architecture called Domino. The purpose of Domino's architecture is to enable "computing-on-the-move" within a chip, so Domino assigns hardware resources uniformly and identically to the building blocks at each level of the hierarchy.

From a top view, Domino consists of an input buffer and Ar×Ac tiles interconnected in a 2-D mesh NoC. The number of tiles is adjusted for different applications. The weights and configuration parameters are loaded initially. The input buffer is used to store the required input data temporarily. A Domino block is an array of tiles virtually split in the mesh NoC to serve a DNN layer. A tile contains a PE performing in-memory DNN computation and two routers that transmit the results within and between tiles. By this means, Domino achieves a high level of distributed computation and uniformity, making it a hierarchical, flexible, and easily reconfigurable DNN processor architecture.

4.1 Domino Block

Domino is virtually split into blocks corresponding to the layers of a neural network. Let mt and ma denote the number of rows and columns of the tile array in a block. In the initialization and configuration stages, an mt × ma array of tiles is assigned to form a block that deals with a CONV layer together with a pooling layer (if needed) or an FC layer. A block provides interconnection via bidirectional links between tiles in four directions, as shown in Fig. 1. A Domino block provides diverse organizations of tiles for various dataflows based on layer configurations, including DNN weight duplication and block reuse schemes. For instance, in pursuit of layer synchronization, a block adopts weight duplication so that it processes ma rows of pixels simultaneously. We will discuss the details of block variations in the dataflow section.

Figure 1: The top-level block diagram of the Domino architecture. A Domino block is an mt × ma array of tiles on the NoC for the computation of a DNN layer.

4.2 Domino Tile

A tile is the primary component of Domino used for DNN computation.

3

Page 4: Domino: ATailoredNetwork-on-ChipArchitecturetoEnable

Figure 2: A Domino tile contains two routers, Rifm and Rofm, and a computation center, the PE.

It contains a CIM array called the PE, a router transferring IFMs called Rifm, and a router transferring OFMs or their partial-sums in convolution computation called Rofm. The basic structure of a tile is illustrated in Fig. 2. Rifm receives input data from one of four directions in each tile and controls the input dataflow to the remote Rifm, the local PE, and the local Rofm. The in-memory computing starts from the Rifm buffer and ends at the Analog-to-Digital Converters (ADCs) in the PE. The outputs of the PE are sent to Rofm for temporary storage or partial-sum addition. Rofm is controlled by a series of periodic instructions to receive either computation results or input data via a shortcut from Rifm, and maintains the dataflow for adding up partial-sums.

4.3 Domino Rifm

As shown in Fig. 2, each Rifm possesses I/O ports in four directions to communicate with the Rifms in the adjacent tiles. It is also equipped with a buffer, called the Rifm buffer, to store the input data received in the current cycle. Moreover, Rifm has in-tile connections with the PE and Rofm. The bits in the Rifm buffer, selected by an 8-to-1 MUX, are sequentially sent to the PE for MAC computation. The buffer size is 8×Nc bits, where Nc is the number of rows in the PE (the PE consists of an Nc×Nm ReRAM array, which is discussed in Section 4.5). It supports an in-buffer shifting operation with a step size of 64 or a multiple of 64, which supports the case where the input channels of a layer are fewer than Nc. A shortcut connection from Rifm to Rofm is established to support the situation where MAC computation is skipped (e.g., the shortcut in a ResUnit). A counter and a controller in Rifm decide the input dataflow based on the initial configuration. Once Rifm receives input packets, the counter starts to increment. The controller chooses to activate MAC computation or sends "enable" signals to I/O ports to receive or transmit data, based on the initial configuration and the counter's value.

4.4 Domino Rofm

Rofm is the key component for the "computing-on-the-move" dataflow. It is controlled by instructions to manage I/O ports and buffers, add up partial-sum/group-sum results, and perform activation or pooling to obtain convolution results. Fig. 2 shows the micro-architecture of Rofm, which consists of a set of four-direction I/O ports, input/output registers, an instruction schedule table, a counter to generate instruction indices, a Rofm buffer to store partial computation results, a reusable adder, a computation unit with the necessary functions, and a decoder. The instructions are generated by the compiler based on the DNN configuration. The instructions in the schedule table are executed periodically based on the initial configuration. The internal counter starts functioning and keeps incrementing its value as soon as any input packet is received. The decoder reads the counter value every cycle and uses it as the index into the schedule table to fetch an instruction. Instructions are then decoded and split into control words to control the ports, buffers, and other computation circuits within Rofm. The partial-sums are added into group-sums while transferring between tiles. The group-sums are queued in the buffer until the other group-sums are ready, and then form a complete computation result.

A computation unit is equipped in each Rofm to cope with the non-linear operations in DNN computation, including activation and pooling, as shown in Fig. 2. Activation is only used in the last tile of a block. The comparator is used for max-pooling, which outputs the larger of the convolution results from adjacent Rofms. The multiplier and adder perform average-pooling.

Figure 3: Domino PE: (a) the block diagram of the Domino PE, (b) the neuron circuit in the Domino PE, and (c) the simulation waveform of the PE.

4.5 Domino PE

The PE is the computation center that links the Rifm and Rofm in the same tile. Fig. 3 (a) shows the architecture of the proposed Domino PE, which is composed of a 1T1R crossbar array, current mirrors, two integrators, and a Successive Approximation Register (SAR) ADC [50]. As shown in Fig. 3 (b), eight single-level 1T1R cells are used to represent one 8-bit weight w[7:0]. The voltage on the ReRAM is clamped by the gate voltage and threshold voltage, which generates two levels of current based on the status of the ReRAM and access transistors. The weight bits are divided into two sets (BL7−4 and


BL3−0). In each set, four current mirrors are used to provide the significance for each bit line (k/8, k/4, k/2, and k, where k is the gain that controls the integration speed). The outputs of the current mirrors are accumulated in the integrator. The higher four bits and lower four bits are joined by charge redistribution between the two capacitors in the two integrators with a ratio of 16:1. The significance of the input data is also realized by averaging the charge between the integrators and the ADC [50]. The simulation waveform of the integration and charge-sharing process is shown in Fig. 3 (c). In the initial configuration, the neural network weights are mapped to the ReRAM arrays in the PEs, which will be discussed in detail in Section 5.
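The arithmetic realized by this analog path can be summarized digitally. The sketch below is an illustrative functional model only (not the circuit): it applies the per-bit-line gains k/8, k/4, k/2, and k within each 4-bit set and combines the two integrator outputs with a 16:1 ratio, which together recover the binary value of the 8-bit weight.

```python
def pe_weight_value(w_bits, k=1.0):
    """Functional model of the bit-line weighting in the Domino PE (illustrative only).

    w_bits: list of 8 binary cell values [w0, ..., w7] (LSB first), one per 1T1R cell.
    Within each 4-bit set the current-mirror gains are k/8, k/4, k/2, k; the two
    integrator outputs are then combined with a 16:1 charge-redistribution ratio.
    """
    gains = [k / 8, k / 4, k / 2, k]                       # per-bit-line significance in a set
    low = sum(b * g for b, g in zip(w_bits[0:4], gains))   # BL3-0 integrator
    high = sum(b * g for b, g in zip(w_bits[4:8], gains))  # BL7-4 integrator
    analog = 16 * high + low                               # 16:1 charge redistribution
    # Up to the constant factor k/8, this equals the binary value of w[7:0]:
    digital = sum(b << i for i, b in enumerate(w_bits))
    assert abs(analog - (k / 8) * digital) < 1e-9
    return analog

# Example: weight 0b10110101 = 181 -> 181 * k/8 = 22.625 for k = 1
print(pe_weight_value([1, 0, 1, 0, 1, 1, 0, 1]))
```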

5. DATAFLOW MODEL

We propose the "computing-on-the-move" dataflow to reduce both data movement and data duplication. Below, we introduce the dataflow in FC layers, CONV layers, and pooling layers, and then show how it supports different types of DNNs and achieves high performance. Other "computing-on-the-way" schemes such as MAERI [27] and Active-routing [21] map computation elements onto the memory network for near-memory computing and aggregate intermediate results during transmission in a tree adder controlled by a host CPU. In contrast, Domino conducts MAC operations in memory. The partial-sums are added up into group-sums, which are stored in the buffer to wait for the other group-sums to become ready for addition. The group-sums are added up while moving along the routers, controlled by local schedule tables. Domino only transmits data, while MAERI/Active-routing carries extra complex information such as operators and flow states.

5.1 Dataflow in FC Layers

Figure 4: The proposed mapping and dataflow in FC layers: (a) the partitioned input vector and weight matrix; (b) the dataflow to transmit and add multiplication results into a complete MVM result.

In an FC layer, the IFM is flattened to a vector of dimension 1×Cin to perform a Matrix-Vector Multiplication (MVM) with the weights. The MVM can be formulated as y = xW, where x ∈ R^{1×Cin} and y ∈ R^{1×Cout} are the input and output vectors, respectively, and W ∈ R^{Cin×Cout} is the weight matrix.

In most cases, the input and output vector dimensions in FC layers are larger than the size of a single crossbar array in a PE. Therefore, the partitioned matrix multiplication described in Eqn. 2 is used to deal with this issue: the Cin×Cout weight matrix is divided into mt × ma smaller matrices Wij of dimension Nc×Nm. The input vector is divided into mt small slices xi to multiply the partitioned weight matrices. As shown in Fig. 4 (a), the sum of the multiplication results within a column of partitioned matrices is a slice of the output vector, yj. The slices are concatenated to produce the complete output.

x = [x_1, x_2, \cdots, x_{m_t}],

W = \begin{bmatrix}
W_{11} & W_{12} & \cdots & W_{1 m_a} \\
\vdots & \vdots & \ddots & \vdots \\
W_{m_t 1} & W_{m_t 2} & \cdots & W_{m_t m_a}
\end{bmatrix},

y = [y_1, y_2, \cdots, y_{m_a}], \quad y_j = \sum_{i=1}^{m_t} x_i W_{ij}.    (2)
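As a sanity check on Eqn. 2, the following sketch (a plain NumPy illustration of the partitioning, independent of the tile hardware) splits a weight matrix into Nc×Nm blocks, multiplies each block by its input slice, accumulates along columns, and verifies that concatenating the column results reproduces the full MVM:

```python
import numpy as np

def partitioned_mvm(x, W, Nc, Nm):
    """Block-partitioned MVM following Eqn. 2 (assumes Cin % Nc == 0 and Cout % Nm == 0)."""
    Cin, Cout = W.shape
    mt, ma = Cin // Nc, Cout // Nm
    y = np.zeros(Cout)
    for j in range(ma):                      # one column of partitioned matrices -> one slice y_j
        y_j = np.zeros(Nm)
        for i in range(mt):                  # partial results are accumulated along the column
            x_i = x[i * Nc:(i + 1) * Nc]
            W_ij = W[i * Nc:(i + 1) * Nc, j * Nm:(j + 1) * Nm]
            y_j += x_i @ W_ij
        y[j * Nm:(j + 1) * Nm] = y_j         # concatenate slices into the complete output
    return y

x = np.random.rand(8)
W = np.random.rand(8, 6)
assert np.allclose(partitioned_mvm(x, W, Nc=4, Nm=3), x @ W)
```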

We propose a mapping scheme to efficiently handle partitioned matrix multiplication in FC layers on our tile-based Domino. As shown in Fig. 4 (b), the partitioned blocks are mapped to mt × ma tiles (mt = ⌈Cin/Nc⌉, ma = ⌈Cout/Nm⌉), which in this example are divided into four columns and four rows. The input vectors are transmitted to all ma columns of tiles. The multiplication results, ① to ④, are added while being transmitted along a column. The final addition results in the last tiles of the four columns, U to Z, are small slices of the output vector. Concatenating the small slices of all columns gives the complete partitioned matrix multiplication result.

5.2 Dataflow in CONV Layers

Based on Eqn. 1, the computation of a pixel in the OFM is the sum of the point-wise MAC results in a sliding window. As shown in Fig. 5, we define the Nm point-wise and row-wise MAC results as the partial-sum and group-sum, respectively. The figure illustrates our distributed way of adding partial-sums and the timing control of the "computing-on-the-move" dataflow. Let P_a^{(l)} denote the partial-sum of the l-th pixel in the a-th sliding window and G_a^{(b)} denote the group-sum of the b-th row in the a-th sliding window:

P_a^{(l)} = \sum_{c=0}^{C-1} I[Sx+i][Sy+j][c] \times W[i][j][c],

G_a^{(b)} = \sum_{c=0}^{C-1} \sum_{j=0}^{K-1} I[Sx+b][Sy+j][c] \times W[b][j][c],

O[x][y] = \sum_{b=0}^{K-1} G_a^{(b)},    (3)

where l = iK + j and a = xF + y, and P_a^{(l)}, G_a^{(b)}, O[x][y], and W[i][j][c] ∈ R^M. As shown in Fig. 5 (a), K partial-sums (①, ②, and ③) are added up to form a group-sum (U1). Fig. 5 (b) illustrates how K group-sums (U1, U2, and U3) are added up to form a complete convolution result U (the same holds for X, Y, and Z). U1, U2, and U3 are sequentially generated and summed up one by one, at different times and in different tiles. A group-sum waits in the buffer until the other group-sums are ready, or is evicted once it is no longer needed.
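The decomposition in Eqn. 3 can be checked numerically. The sketch below (plain NumPy, no tile mapping implied) computes one output pixel both directly and as a sum of row-wise group-sums, each of which is itself a sum of point-wise partial-sums:

```python
import numpy as np

C, K, M, S = 3, 3, 4, 1
ifm = np.random.rand(8, 8, C)          # I[x][y][c]
w = np.random.rand(K, K, C, M)         # W[i][j][c] (a length-M vector per filter point)

x, y = 2, 3                            # output pixel under test
window = ifm[S * x:S * x + K, S * y:S * y + K, :]

# Partial-sum: point-wise MAC over channels; group-sum: sum of one row of partial-sums.
partial = lambda i, j: np.einsum('c,cm->m', window[i, j], w[i, j])
group = lambda b: sum(partial(b, j) for j in range(K))
o_pixel = sum(group(b) for b in range(K))

# Direct evaluation of Eqn. 1 (without bias) for the same pixel agrees.
assert np.allclose(o_pixel, np.einsum('ijc,ijcm->m', window, w))
```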

5

Page 6: Domino: ATailoredNetwork-on-ChipArchitecturetoEnable

Figure 5: (a) The defined partial-sum and group-sum; (b) the timing and location of how Domino generates the OFM with the naive dataflow.

As shown in Fig. 6 (a), the input channels and output channels of a CONV layer are mapped to the inputs and outputs of a tile, respectively. If Nc = C and Nm = M, the K² points of the filters are mapped to K² tiles. In the case that the dimension of a weight matrix exceeds the size of the ReRAM crossbar array in a PE (Nc ≤ C and Nm ≤ M), the filters are split and mapped to ⌈C/Nc⌉×⌈M/Nm⌉ tiles. The tiles are placed closely to minimize data transmission. Likewise, the input vector is split and carried in ⌈C/Nc⌉ input packets. Adding and concatenating the MAC results of these PEs generates the same result as the non-split case. This case is similar to the FC dataflow discussed earlier. When Nc > C, multiple points of a filter can be mapped to the same tile to improve the utilization of ReRAM cells and reduce the energy of data movement and partial-sum addition. In such a scenario, the CONV computation is accomplished by the in-buffer shifting operation. In case Nm ≥ 2M, the filters can be duplicated inside a tile to maximize parallel computing.
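The tile budget of a layer follows directly from the mapping rule above. The small helper below is illustrative only (it ignores the in-tile packing used when Nc > C and any weight duplication; Nc and Nm are the crossbar dimensions from Section 4.3) and estimates how many tiles a CONV layer occupies when each of the K² filter points is split across ⌈C/Nc⌉×⌈M/Nm⌉ tiles:

```python
from math import ceil

def conv_layer_tiles(C, M, K, Nc=256, Nm=256):
    """Tiles needed to map one CONV layer when every filter point is split over
    ceil(C/Nc) x ceil(M/Nm) tiles (no weight duplication or in-tile packing)."""
    return K * K * ceil(C / Nc) * ceil(M / Nm)

# Example: a 3x3 CONV layer with 512 input and 512 output channels on 256x256 arrays
print(conv_layer_tiles(C=512, M=512, K=3))   # -> 9 * 2 * 2 = 36 tiles
```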

Figure 6: The proposed mapping and dataflow in a CONV layer: (a) a K²×1 tile array mapping the weights of a CONV layer; (b) another type, a K×K tile array, mapping the weights of a CONV layer; (c) adding a group-sum while transmitting MAC results in a group of tiles.

As shown in Fig. 6 (a) and (b), each split block is allocated an array of tiles that maps the weights of a CONV layer. The array has multiple mapping topologies, such as K²×1 (mt = K², ma = 1) and K×K (mt = ma = K), in pursuit of flexible dataflow and full utilization of the tiles in the NoC array. As defined in Eqn. 3, a group-sum is the sum of K partial-sums. We therefore cluster the K tiles mapped to the same filter row into a group. We categorize the CONV dataflow into three types: the input dataflow, the partial-sum dataflow, and the group-sum dataflow.

The input dataflow. The blue arrows in Fig. 6 (a) illustrate the input dataflow in a K²×1 array of tiles. The inputs are transmitted to the array in rows and flow through the Rifms of the K² tiles with identical I/O directions. Another type, a zigzag dataflow, is demonstrated in Fig. 6 (b) for a K×K array of tiles. When the input data reach the last tile of a group, Rifm inverts the flow direction, and the input data are transmitted to the tile in the adjacent column.

The partial-sum dataflow. The partial-sum dataflow is shown in Fig. 6 (c). Here, a represents the first input data of the IFM. In each cycle, input packets are transmitted to the Rifm of the succeeding tile. Meanwhile, MAC computation in the PE is enabled if needed. Take the input data a as an example: in the first cycle, a is multiplied with weight A to produce a partial-sum ①, which is transmitted by Rofm to the next tile. In the second cycle, a is transmitted to the tile whose PE is mapped with weight B. The controller in the Rifm bypasses the MAC computation because a is only multiplied with A in the convolution. In the meanwhile, ① is stored in the Rofm buffer waiting for addition. In the third cycle, b is multiplied with B in the PE, and the partial-sum ② is received by Rofm. Then ① is popped out from the Rofm buffer and added to ②. In general, partial-sums are added up into a group-sum U1 while being transmitted along a group of tiles.

The group-sum dataflow. Adding group-sums into convolution results is also a process executed during data transmission. As shown in Fig. 6 (a) and (b), the orange arrows indicate the group-sum dataflow. Group-sums are added up to obtain the convolution result U in the last tile of each group. When U1 is generated, it waits in the third tile until U2 is generated in the sixth tile. Then, U1 is transferred to the sixth tile to be added to U2. Similarly, U1 + U2 waits in the sixth tile for U3 to generate the complete result U. After all the linear matrix computation, the computation unit in Rofm, controlled by instructions, takes effect: an activation function is applied to the complete convolution result in the Rofm of the last tile.

5.3 Synchronization

Because of down-sampling in CONV layers, the CONV steps change with layers. For example, if the pooling filter size Kp = 2 and the pooling stride Sp = 2, every four OFM pixels produce one pooling result. Therefore, the computing speed of the next layer has to be four times slower than that of the preceding layer, which wastes hardware resources and severely affects the computing throughput. To maximize computation parallelism and throughput, the CONV filters need to be duplicated. In other words, the weights of the preceding filters should be duplicated four times in the above example.

However, a DNN model usually has a few down-sampling layers. As a result, the number of duplicated tiles may exceed the total number of tiles in Domino. Therefore, a block reuse scheme is proposed for such situations to alleviate the heavy hardware requirement. Fig. 7 demonstrates our weight duplication and block reuse scheme for the VGG-11 model used in [23] (CIFAR-10 dataset).


Figure 7: The duplication and reuse scheme for the VGG-11 model used in [23]; there are three pooling layers before L5, L7, and L9. The left axis shows the number of tiles and duplications, and the right axis shows the number of reuses.

There are three pooling layers, and the CONV steps before these three layers are 64, 16, and 4. If all layers are synchronized, a total of 892 tiles are required to map the network. If we instead reuse the tiles of all CONV layers four times, the CONV layers run four times faster than the FC layers, and a total of 286 tiles are required to map the network. It can be concluded that there is a trade-off between chip size and throughput.
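The duplication factors themselves follow a simple rule: each pooling stage with stride Sp slows every later layer by Sp², so a layer must be duplicated by the product of Sp² over all pooling stages at or after its own output, and block reuse divides that requirement back down. The sketch below is illustrative bookkeeping only (the example network is hypothetical, not the exact VGG-11 configuration behind Fig. 7):

```python
def duplication_factors(pool_strides_after):
    """Relative computation rates (weight-duplication factors) for layer synchronization.

    pool_strides_after[i] is the pooling stride Sp applied to layer i's output
    (1 = no pooling). A pooling stage slows every later layer by Sp*Sp, so a layer
    is duplicated by the product of Sp*Sp over pooling stages at or after its output.
    """
    factors, running = [], 1
    for sp in reversed(pool_strides_after):
        running *= sp * sp
        factors.append(running)
    return list(reversed(factors))

# Hypothetical 8-CONV-layer network with 2x2 pooling after layers 4, 6, and 8,
# mirroring the three pooling stages discussed above:
print(duplication_factors([1, 1, 1, 2, 1, 2, 1, 2]))
# -> [64, 64, 64, 64, 16, 16, 4, 4]
```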

5.4 Dataflow with Weight Duplication

Figure 8: The proposed mapping and dataflow in CONV layers with weight duplication: (a) each column of tiles is assigned duplicated weights and receives different rows of the IFM; (b), (c) adding partial-sums into group-sums while transmitting, at two different cycles.

Taking weight duplication and reuse into consideration, the dataflow should be modified to match this scheme. The overall dataflow with weight duplication is illustrated in Fig. 8. In this scenario, a block has ma (ma = 4 is the number of duplications in this example) duplicated K²×1 (mt = K²) arrays of tiles.

The input dataflow. The input dataflow is demonstrated in Fig. 8 (a): four rows of the IFM are transmitted through four arrays of tiles in parallel. The massive data transmission is alleviated by leveraging spatial locality. Every ma rows of data are alternately transferred to the tiles in reverse order. As shown in Fig. 8 (a), the first four rows are transmitted to the block in increasing order. The fifth row of the IFM flows into the last column of the block. Group-sum T1 is computed in the third tile when the fourth row flows through the tile. When the fifth row flows through, group-sum T2 is computed in the sixth tile. Since T1 and T2 are generated in the same column of a block, inter-column data transfer is reduced.

The partial-sum dataflow. Fig. 8 (b) and (c) depict the partial-sum dataflow at a certain cycle and at the third cycle thereafter, respectively. In Fig. 8 (b), partial-sums ①, ④, and ⑦ are computed in the first tiles of their groups and stored in the Rofm buffers of their succeeding tiles. Two cycles later, partial-sums ②, ⑤, and ⑧ are generated in the second tiles of their groups and are then added to the stored partial-sums ①, ④, and ⑦, respectively. The computing process is similar to the dataflow without weight duplication, except that the K group-sums are computed in K columns.

The group-sum dataflow. The group-sum dataflow with weight duplication aims to reduce the data moving distance by adding up group-sums locally. Blue and pink arrows indicate the group-sum dataflow in Fig. 8 (a). We take the convolution results Q and T as instances. Based on the input dataflow and partial-sum dataflow, group-sums Q1 and Q2 are computed in the first two columns of a block. Thus Q1 is transmitted from the first column to the adjacent column when it reaches the sixth tile. Q2 is stored in the Rofm buffer waiting for Q1, and is popped out by an instruction when Rofm receives Q1. They are added up before being transmitted along the second column. The pink arrows represent another situation, in which the second group-sum is not waiting in Rofm. Based on the input dataflow, T1 and T2 are computed in the same column but in two different input passes. When T1 reaches the sixth tile, T2 is not ready yet. Therefore, T1 is stored in the Rofm buffer to wait for the fifth row of data to flow into the tiles. Once T2 is computed in the sixth tile, T1 and T2 are added up to obtain T1 + T2. The group-sum dataflow greatly reduces the data moving distance and frequency.

5.5 Pooling Dataflow

The computation of CONV and FC layers is processed within Domino blocks, while the computation of a pooling layer is performed during data transmission between blocks. If a pooling layer follows a CONV layer, with pooling filter size Kp = 2 and pooling stride Sp = 2, every four activation results produce one pooling result. The weight duplication case is shown in Fig. 9 (b): in every cycle a block produces four activation results, T to Y. While they are transmitted across tiles, the data are compared and the pooling result Z is computed. Fig. 9 (c) shows the block reuse case, in which activation results are computed and stored in the last tile. A comparison is performed when the next activation result is computed. The Rofm outputs a pooling result Z once the comparison over the pooling filter is completed. In this case, the computation frequency before the pooling layer is 4× higher than that of the succeeding blocks.
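Functionally, pooling-on-the-move is a running reduction applied while results stream out of a block. The sketch below is an illustrative model of the max-pooling case only (not the Rofm comparator circuit), and it assumes the Kp×Kp activations of one window arrive consecutively, mirroring the Kp = Sp = 2 example above:

```python
def max_pool_on_the_move(activations, Kp=2):
    """Streaming max-pooling: compare results as they arrive, emit one value per Kp*Kp inputs."""
    window, pooled = [], []
    for a in activations:
        window.append(a)
        if len(window) == Kp * Kp:       # the pooling window is complete
            pooled.append(max(window))   # comparator keeps the largest convolution result
            window.clear()
    return pooled

# Example: four activation results arriving from one block (hypothetical values)
print(max_pool_on_the_move([0.3, 0.9, 0.1, 0.7]))   # -> [0.9]
```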

6. INSTRUCTIONS AND WORKFLOW

Domino is a highly distributed and decentralized architecture and adopts self-controlled instructions, which are stored in each Rofm.


Figure 9: Output in the last tile: the weight duplication or block reuse scheme is used to deal with a pooling layer.

The reason is that Domino tries to reduce the bandwidth demand of transmitting data or instructions through the whole NoC and to avoid the delay and skew of long-distance signals; meanwhile, the localized instructions retain flexibility and compatibility for the different DNNs to be processed.

6.1 Instruction Set

A set of instructions is designed for computing and transmitting data automatically and flexibly when dealing with different DNNs. In Domino, Rofm is responsible for sending packets and accumulating partial-sums and group-sums correctly. Because the packets transmitted through the NoC only contain DNN payloads without any auxiliary information (e.g., a message header or tail), there must be an intuitive way to control the behavior of each Rofm on the chip. Consequently, we define an instruction set for Domino.

The instruction consists of several fields, each of which represents a set of control words for the relevant ports or buffers, as well as the designed actions, as shown in Tab. 2. The instruction length is 16 bits, comprising four distinct fields: an Rx field, a Function field, a Tx field, and an opcode field. The Rx field takes responsibility for receiving packets from different ports. The Function field contains sum and buffer control in C-type instructions, and activation, pooling, and FC-layer control in M-type instructions. The Tx field governs transmission from one Rofm to its four ports. There are two types of opcodes with two usages: C-type (convolution type) denotes that the instruction controls the process during convolution computation, while M-type is for miscellaneous operations other than convolution, such as activation and pooling.

Bits      15–11      10–7    6–5       4–1        0
C-type    Rx Ctrl.   Sum     Buffer    Tx Ctrl.   Opc.
M-type    Rx Ctrl.   Func (bits 10–5)  Tx Ctrl.   Opc.

Table 2: The instruction format for Domino. There are two basic types of instructions, generated by the compiler to handle the dataflow correctly.
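A compiler or simulator would pack and unpack these fields with plain bit manipulation. The sketch below is a hypothetical decoder based on the field boundaries in Tab. 2 (Rx in bits 15–11, Sum in 10–7, Buffer in 6–5, Tx in 4–1, opcode in bit 0 for C-type; Func occupying bits 10–5 for M-type); the opcode encoding (0 = C-type) is an assumption, and this is an illustration of the format, not Domino's actual decoder:

```python
def decode_instruction(word):
    """Decode a 16-bit Domino-style instruction word into its fields (illustrative)."""
    opc = word & 0x1                          # bit 0: 0 = C-type (assumed), 1 = M-type
    fields = {
        "rx_ctrl": (word >> 11) & 0x1F,       # bits 15-11
        "tx_ctrl": (word >> 1) & 0xF,         # bits 4-1
        "opcode": opc,
    }
    if opc == 0:                              # C-type: convolution control
        fields["sum"] = (word >> 7) & 0xF     # bits 10-7
        fields["buffer"] = (word >> 5) & 0x3  # bits 6-5
    else:                                     # M-type: activation / pooling / FC control
        fields["func"] = (word >> 5) & 0x3F   # bits 10-5
    return fields

print(decode_instruction(0b10100_1011_01_0110_0))
```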

6.2 Schedule Table

In Domino, the instructions in a Rofm should fit both intra- and inter-block dataflows to support the "computing-on-the-move" procedure. The compiler generates the instruction table and configuration information for each tile based on the initial input data and the DNN structure. This highly distributed and locally controlled structure avoids extra instruction transmission and external control signals when processing DNNs.

After cycle-accurate analysis and mathematical derivation, the instructions reveal an attribute of periodicity. During the convolution computation, C-type instructions are fetched from the schedule table and executed periodically. The period p (p = 2(P + W)) is related to the IFM. Within a period, the actions are fixed for the given IFM and DNN configuration. Furthermore, every port's behavior exhibits a period of p with a different starting time. The control words for each port are stored in Rofm, and they are generated under the assumption that the convolution stride is one. When the convolution stride is not one, the compiler shields certain bits in the control words to "skip" some actions in the corresponding cycles, so as to meet the required dataflow control. When a Rofm is mapped and configured to process the last row of a layer in a CNN, it also generates activation and pooling instructions, whose period is related to the pooling stride, p = 2Sp.

Once the instructions for each Rofm are received and stored, the Rofm is configured and prepared for computation. For a given Rofm, when a clock cycle begins, a counter provides an index for the Rofm to fetch the corresponding control words periodically. When the execution of an instruction ends, the state of the current Rofm is updated.

6.3 Workflow

With the technical details of Domino introduced above, the overall workflow and processing chain can now be constructed. Fig. 10 shows how Domino works, in sequence, to process a DNN. These building blocks cover both the ideas and the implementations mentioned above and integrate them into a whole architecture.

7. EVALUATION

This section evaluates Domino's characteristics and performance in detail, including energy efficiency and throughput. Meanwhile, we compare Domino against other state-of-the-art CIM-based architectures on several prevailing CNNs to show the advantages of the proposed architecture and dataflow.

7.1 Methodology

First, we specify the configuration of Domino under test; second, we describe the experiment setup and benchmarks; then we describe the normalization methods.

7.1.1 Configuration

The configuration of Domino and its tile is displayed in Tab. 3. Domino adopts the silicon-proven ADC in [20], the silicon-proven SRAM array in [43], and the transmission parameters in [4]. The rest of the components are simulated in SPICE or synthesized with Synopsys DC using a 45 nm CMOS Process Design Kit (PDK). Domino runs at a step frequency of 10 MHz for instruction updating. The data transmission frequency is 640 MHz, and the data width is 64 bits. The supply voltage is 1 V. The ReRAM area information is normalized from the silicon-proven result in [30]. In the initialization stage, each ReRAM cell in a PE is programmed with a 1-bit weight value; eight ReRAM cells make up an 8-bit weight.


Figure 10: The workflow of Domino. Domino takes the network model, dataset information, and area or power specification for compilation. The generated schedule tables and weights are mapped into the routers and CIM arrays, respectively. After initialization, Domino begins to execute instructions and infer.

Component       Descript.   Energy/Compo.   Area (µm²)
ADC             8b×256      1.76 pJ         351.5
Integrator      8b×256      2.13 pJ         568.3
Crossbar cell   8b          32.9 fJ         1.62
PE total        Nc×Nm       48.1 fJ/MAC     341632
Buffer          256B×1      281.3 pJ        826.5
Ctrl. circ.     1           4.1 pJ          1400.6
Rifm total      –           –               2227.1
Adder           8b×8×2      0.03 pJ/8b      0.07
Pooling         8b×8        7.6 fJ/8b       34.06
Activation      8b×8        0.9 fJ/8b       7.07
Data buffer     16 KiB      281.3 pJ        52896
Sched. table    16b×128     2.2 pJ/16b      826.5
Input buffer    64b×2       17.6 pJ         51.7
Output buffer   64b×2       17.6 pJ         51.7
Ctrl. circ.     1           28.5 pJ         2451.2
Rofm total      –           –               53867.1
Tile total      –           –               0.398 mm²

Table 3: The configuration information summary for Domino under evaluation.

At the same time, each Rifm and Rofm is configured, and the schedule tables are pre-loaded with instructions. Once the configuration of Domino is finished, the instructions are ready for execution, and Domino commences processing DNNs when the clock starts.

7.1.2 Experiment Setup

We select several CIM accelerator architectures as references, including three silicon-proven architectures and four simulated architectures. The performance data and configuration parameters of the baselines are taken from their papers. The following three experiments are conducted: (1) running different architectures on the benchmark DNNs to compare power efficiency; (2) evaluating the throughput of Domino under different DNNs and inputs; and (3) estimating PE utilization with various DNN models and PE sizes.

7.1.3 Benchmarks

We evaluate Domino using representative DNN models such as VGG-11 [23], VGG-16, VGG-19 [39], ResNet-18, and ResNet-50 [18] as the model benchmarks. The datasets CIFAR-10 [24] and ImageNet [14] are chosen as the dataset benchmarks for evaluation. We then run the benchmarks consistent with each comparison object, respectively.

7.1.4 Normalization

The variety of CIM models leads to discrepancies in many attributes that need normalization and scaling. To make a fair comparison, we normalize the technology node and supply voltage of the digital circuits in each design to 45 nm and 1 V according to the equations given in [41]. The activation and weight precision is scaled to 8-bit as well. The supply voltage of the analog circuits is also normalized to 1 V.

7.2 Performance Results

We evaluate Domino's Computational Efficiency (CE), power dissipation, energy consumption, throughput, and execution time for each experimental environment in the benchmarks, and compare them with the other architectures.

7.2.1 Overall Performance

We evaluate Domino on various DNN models to show that its performance generalizes to different model patterns. Tab. 4 shows Domino's system execution time, energy consumption, and computational efficiency when it runs the benchmarks with the specified configurations. The time and energy spent on initialization and compilation are not considered. We break the energy down into five parts: CIM, on-chip data moving, on-chip data memory, other computation, and off-chip data accessing. The first is the energy consumed in the PEs by MAC operations; the latter four are the energy of the peripheral circuits. Domino achieves a peak CE of 25.92 TOPS/W when running VGG-19 on ImageNet, and a valley CE of 19.99 TOPS/W when running ResNet-18 on CIFAR-10. This indicates that the larger the DNN model, the better the CE. The peripheral energy in VGG-19 occupies only one-third of the total energy. The throughput of Domino varies with the input data size and DNN model. VGG and ResNet models with a small dataset achieve a high throughput of 6.25E5 inferences/s, while VGG-19 with the ImageNet dataset still has a throughput of 1.276E4 inferences/s. The high throughput and low peripheral energy of Domino come from the pipelined "computing-on-the-move" dataflow, which fully exploits data locality.

7.2.2 Computational Efficiency

The comparison between Domino and other state-of-the-art CIM architectures and DNN accelerators is summarized in Tab. 4. The data in the table are either taken from or calculated based on the results provided in the corresponding papers. The silicon-proven designs usually have low throughput and power consumption due to their small chip sizes.


Dataset                            |            CIFAR-10           |                          ImageNet
Model                              |    VGG-11     |   ResNet-18   |         VGG-16         |         VGG-19         |   ResNet-50
Architecture                       | [23]  | Ours  | [48]  | Ours  | [46]*1 | [27]  | Ours  | [35]  | [12]  | Ours   | [28]  | Ours
CIM type                           | SRAM  | ReRAM | SRAM  | ReRAM | ReRAM  | n.a.  | ReRAM | ReRAM | ReRAM | ReRAM  | ReRAM | ReRAM
Tech. (nm)                         | 16    | 45    | 65    | 45    | 40     | 28    | 45    | 32    | 65    | 45     | 65    | 40
VDD (V)                            | 0.8   | 1     | 1     | 1     | 0.9    | n.a.  | 1     | 1     | 1     | 1      | 1.2   | 1
Frequency (MHz)                    | 200   | 10    | 100   | 10    | 100    | 200   | 10    | 1200  | 1200  | 10     | 40    | 10
Act. & W. precision                | 4     | 8     | 4     | 8     | 8      | 16    | 8     | 16    | 16    | 8      | 8     | 8
# of CIM arrays                    | 16    | 900   | 4     | 900   | 1      | n.a.  | 2500  | 2560  | 6400  | 2500   | 20352 | 900
CIM array size (kb)                | 288   | 512   | 16    | 512   | 64     | n.a.  | 512   | 256   | 4     | 512    | 256   | 512
Area (mm²)                         | 25    | 358.2 | 9     | 358.2 | 9      | 6     | 995   | 5.32  | 0.99  | 995    | 91    | 358.2
Tapeout/Simulation                 | T     | S     | T     | S     | T      | S     | S     | S     | S     | S      | S     | S
Exec. time (us)                    | 128   | 129.5 | 1890  | 203.5 | 0.67s  | n.a.  | 3471  | 6920  | n.a.  | 3557   | n.a.  | 2397
CIM energy (uJ)                    | 11.47 | 36.74 | 36.11 | 26.44 | 3700   | 0     | 744.1 | 25000 | 13000 | 944.3  | 130.2 | 168.3
On-chip data moving energy (uJ)    | 0.16  | 2.63  | n.a.  | 3.89  | n.a.   | 36741 | 46.39 | n.a.  | n.a.  | 52.81  | 0.64  | 16.97
On-chip memory energy (uJ)         | 5.11  | 25.41 | n.a.  | 24.21 | n.a.   | 27556 | 446.4 | 2310  | 2180  | 508.1  | 71.22 | 115.41
Other comp. energy (uJ)*2          | 2.67  | 0.48  | n.a.  | 0.46  | n.a.   | 50519 | 8.41  | n.a.  | 1040  | 9.59   | 44.96 | 1.68
Off-chip data accessing energy (uJ)| n.a.  | 0     | 123.89| 0     | 3690   | n.a.  | 0     | 5650  | 4000  | 0      | 85.98 | 0
Power (W)                          | 0.17  | 40.78 | 0.005 | 34.38 | 0.011  | 0.54  | 15.89 | 4.8   | 0.003 | 19.33  | 166   | 30.84
CE (TOPS/W)                        | 71.39 | 23.41 | 6.91  | 19.99 | 4.15   | 0.27  | 24.84 | 0.68  | 3.55  | 25.92  | 21    | 23.14
Normalized CE (TOPS/W)*3           | 9.53  | 23.41 | 2.82  | 19.99 | 9.24   | 0.36  | 24.84 | 2.73  | 12.98 | 25.92  | 22.46 | 23.14
Throughput (TOPS)                  | 12.14 | 954.66| 0.036 | 687.26| 0.046  | 0.15  | 394.7 | 3.28  | 0.1   | 501    | 3488  | 713.6
Throughput (TOPS/mm²)              | 0.49  | 2.67  | 0.004 | 1.92  | 0.005  | 0.025 | 0.4   | 0.62  | 0.10  | 0.5    | 38.33 | 1.99
Inferences/s                       | 7815  | 6.25E5| n.a.  | 6.25E5| n.a.   | n.a.  | 1.28E4| n.a.  | n.a.  | 1.28E4 | n.a.  | 1.02E5
Accuracy (%)                       | 91.51 | 89.85 | 91.15 | 91.57 | 46     | n.a.  | 70.71 | n.a.  | n.a.  | 72.38  | 76    | 74.89

*1 Adapted and normalized from average statistics. *2 Including CNN-related computation like partial-sum accumulation, activation functions, pooling operations, etc. *3 Voltage normalized to 1 V, precision normalized to 8-bit, and digital circuits normalized to 45 nm, referring to [41].

Table 4: Domino's evaluation results and comparison under different DNN models.

The typical bit width is only 4-bit because analog computing makes it difficult to achieve a high signal-to-noise ratio. Comparing the CIM computing efficiency with the system computing efficiency shows that the energy consumed by the peripheral circuits for data movement and memory access may still dominate the total power consumption in most designs. Benefiting from a 16 nm technology node and a tailored DNN model, Jia [23] achieves a smaller gap between the CIM computing efficiency (121 TOPS/W) and the system computing efficiency (71.39 TOPS/W) at the 4-bit configuration.

To make a fair comparison among the different designs, all circuits are normalized to 1 V and 8-bit, and the digital components are further normalized to 45 nm. From Fig. 11 we can see that Domino achieves the highest system CE (25.92 TOPS/W), which is 1.15-9.49× higher than its CIM counterparts. Moreover, Domino has a lower percentage of memory-access and data-transmission energy (Tab. 4). The energy breakdown shows that Domino effectively reduces the peripheral energy consumption, which is a vital factor in the overall system performance. The data locality and "computing-on-the-move" dataflow are very efficient in reducing the overall energy consumption of DNN inference.

As for DNN accelerators, Domino outperforms MAERI [27], a representative conventional digital architecture, in CE. The superiority comes not only from the use of CIM but also from the "computing-on-the-move" dataflow.

Figure 11: Performance comparisons with other architectures: (a) system computational efficiency after normalization; (b) throughput normalized to each 8-bit crossbar array cell.

7.2.3 Throughput

Domino has a distinct advantage over the other architectures in terms of throughput. The existing architectures need to store feature maps and partial-sums in external memory to support complex dataflows. Furthermore, some designs also have to access external memory to update the weights of different layers.


In Domino, the weights of the DNN are all stored in the crossbar arrays in the initial configuration. Layer synchronization with weight duplication and the block reuse scheme is further proposed to maximize the parallel computing speed of inference.

The chip size and the number of memory cells greatly affect the overall throughput of a system, and different memory types have different memory sizes (i.e., ReRAM and SRAM). As shown in Tab. 4, the area-normalized throughput of [48] is 0.004 TOPS/mm² due to its small array size and 65 nm technology node. With a larger crossbar array size and a 16 nm technology node, [23] achieves 0.486 TOPS/mm². Our proposed scheme achieves up to 2.67 TOPS/mm², which is better than the state-of-the-art schemes except [28]. The high density of [28] is due to its transistor-less crossbar array estimated at a 20 nm technology node, which often suffers from the write sneak-path issue and makes it very difficult to program the ReRAM cells to the desired state. Therefore, we normalize the throughput to an 8-bit cell, which reflects both throughput and PE utilization. As shown in Fig. 11 (b), our design achieves 16.19 MOPS/8-b-cell throughput, which is 3.10× higher than TIMELY and 270× higher than CASCADE. Although [47] supports zero-skipping and implements a ping-pong CIM for computing while refreshing, and [23] provides various mapping strategies to improve the throughput, Domino still outperforms them by 7.36× and 1.57×, respectively. The high throughput of Domino comes from the synchronization, the weight pre-loading, the data locality, and the pipelined dataflow, which reduce the computing latency while maximizing parallelism.

7.2.4 Utilization Rates

The size of the crossbar array greatly affects the CIM power efficiency and the cell utilization in PEs. Fig. 12 depicts the crossbar array utilization over all layers of four neural network models, VGG-11, VGG-16, ResNet-18, and ResNet-50, with three crossbar array configurations (128×128, 256×256, and 512×512). Although the mapping strategies in Domino improve the cell utilization in PEs, the first few layers still have low utilization because their input and output channels are much fewer than the side length of the crossbar array. The average utilization of the four models is 98%, 96%, 93%, and 92% when using a 128×128 crossbar array. The utilization is reduced to 90%, 89%, 86%, and 79% with a 256×256 crossbar array, and to 75%, 76%, 67%, and 54% with a 512×512 crossbar array. The lower utilization of ResNet comes from its architecture, in which layers with few channels are prevalent. Although a smaller crossbar array has higher utilization, it sacrifices CIM computing efficiency, which is 31.4 TOPS/W, 41.58 TOPS/W, and 49.38 TOPS/W for these three configurations. Therefore, there is a balance between utilization and computational efficiency.
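A first-order estimate of per-layer cell utilization follows from the mapping in Section 5.2: a CONV layer occupies ⌈C/Nc⌉×⌈M/Nm⌉ crossbars per filter point, of which only C×M cells hold weights. The helper below is an illustrative approximation (it ignores the in-tile packing used when Nc > C, and the layer list is hypothetical), but it reproduces the trend that smaller arrays yield higher utilization for narrow early layers:

```python
from math import ceil

def layer_utilization(C, M, Nc, Nm):
    """Fraction of allocated crossbar cells that actually hold weights for one CONV layer
    (approximation: ceil(C/Nc) x ceil(M/Nm) crossbars per filter point, no packing)."""
    allocated = ceil(C / Nc) * Nc * ceil(M / Nm) * Nm
    return (C * M) / allocated

# Hypothetical early/late layers of a VGG-like model on three crossbar sizes
for Nc, Nm in [(128, 128), (256, 256), (512, 512)]:
    utils = [layer_utilization(C, M, Nc, Nm) for C, M in [(3, 64), (64, 128), (256, 512), (512, 512)]]
    print((Nc, Nm), [round(u, 2) for u in utils])
```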

Figure 12: Utilization rates over all layers in four DNN models considering three PE crossbar configurations.

8. RELATED WORK

Various CIM architectures have been proposed, adopting different designs, computing strategies, and architectures. SRAM- or DRAM-based CIM chips usually need to update weights by accessing off-chip memory [10, 17, 23, 47], resulting in high energy consumption and latency. [23] utilizes 1152 input rows to fit a 3×3 convolution filter, which leads to low cell utilization in other DNN configurations. ISAAC [38] is a ReRAM-based accelerator that uses ReRAM crossbars to store weights and eDRAM to store feature maps, organized by routers and H-trees. Pipelayer [40] stores both filters and feature maps in ReRAM crossbars and computes them in a pipelined dataflow. The interface between the analog and digital domains also contributes to high power consumption. [50] proposes a binary input pattern for CIM, instead of a power-hungry DAC, to reduce the interface energy; in this way, both computing efficiency and reliability are improved. CASCADE [12] exploits bit-line significance in the analog domain to minimize A/D conversions, thus enhancing power efficiency. TIMELY [28] also relies on analog current adders to increase the CIM block size, TDCs and DTCs to reduce the power consumption, and input data locality to reduce data movement. However, the uncertainty caused by clock jitter and the timing variation caused by the complex ReRAM crossbar network greatly reduce the ENOB of the converters. Moreover, the large sub-chip and the weight duplication inside the crossbar array also significantly reduce the utilization of the CIM arrays. To date, most designs do not provide on-chip control circuits to maintain the complicated dataflow. Therefore, the feature maps or partial sums have to be stored in external memory and controlled by additional processors to steer the feature-map flow and synchronize the computing latency of different data paths. Accessing feature maps and partial sums has been the leading source of energy consumption in most designs, so reducing data movement energy has become one of the most important research topics in designing CIM processors.

9. CONCLUSION

This paper has presented a tailored Network-on-Chip architecture called Domino with highly localized inter- and intra-memory computing for DNNs. The key contributions and innovations are summarized as follows. Domino (1) changes the conventional NoC tile structure by using two routers for different usages in DNN processing and enables the architecture to substitute the PE with different types of CIM arrays, (2) constructs a highly distributed and self-controlled processor architecture and proposes an efficient matching dataflow to perform "computing-on-the-move", and (3) defines a set of instructions for the routers, which are executed periodically to process DNNs. Benefiting from this design, Domino (1) eliminates data access to external memory during a single inference, (2) minimizes on-chip data movement, and (3) achieves high computational efficiency and throughput. Compared with competitive architectures [2, 49, 30], Domino achieves a 1.15-to-9.49× power efficiency improvement and improves the throughput by 1.57-to-12.96×.

REFERENCES

[1] F. Akopyan, J. Sawada, A. Cassidy, R. Alvarez-Icaza, J. Arthur, P. Merolla, N. Imam, Y. Nakamura, P. Datta, G. Nam, B. Taba, M. Beakes, B. Brezzo, J. B. Kuang, R. Manohar, W. P. Risk, B. Jackson, and D. S. Modha, "Truenorth: Design and tool flow of a 65 mw 1 million neuron programmable neurosynaptic chip," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 34, no. 10, pp. 1537–1557, 2015.

[2] M. Bavandpour, M. R. Mahmoodi, and D. B. Strukov, "Energy-efficient time-domain vector-by-matrix multiplier for neurocomputing and beyond," IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 66, no. 9, pp. 1512–1516, 2019.

[3] M. Besta, S. M. Hassan, S. Yalamanchili, R. Ausavarungnirun, O. Mutlu, and T. Hoefler, "Slim noc: A low-diameter on-chip network topology for high energy efficiency and scalability," SIGPLAN Not., vol. 53, no. 2, pp. 43–55, Mar. 2018. [Online]. Available: https://doi.org/10.1145/3296957.3177158

[4] V. Catania, A. Mineo, S. Monteleone, M. Palesi, and D. Patti, "Cycle-accurate network on chip simulation with noxim," ACM Trans. Model. Comput. Simul., vol. 27, no. 1, Aug. 2016. [Online]. Available: https://doi.org/10.1145/2953878

[5] L. Chang, X. Ma, Z. Wang, Y. Zhang, Y. Ding, W. Zhao, and Y. Xie, "Dasm: Data-streaming-based computing in nonvolatile memory architecture for embedded system," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 27, no. 9, pp. 2046–2059, 2019.

[6] K.-C. J. Chen, M. Ebrahimi, T.-Y. Wang, and Y.-C. Yang, "Noc-based dnn accelerator: A future design paradigm," in Proceedings of the 13th IEEE/ACM International Symposium on Networks-on-Chip, ser. NOCS '19. New York, NY, USA: Association for Computing Machinery, 2019. [Online]. Available: https://doi.org/10.1145/3313231.3352376

[7] W. Chen, K. Li, W. Lin, K. Hsu, P. Li, C. Yang, C. Xue, E. Yang, Y. Chen, Y. Chang, T. Hsu, Y. King, C. Lin, R. Liu, C. Hsieh, K. Tang, and M. Chang, "A 65nm 1mb nonvolatile computing-in-memory reram macro with sub-16ns multiply-and-accumulate for binary dnn ai edge processors," in 2018 IEEE International Solid-State Circuits Conference (ISSCC), 2018, pp. 494–496.

[8] Y. Chen, T. Yang, J. Emer, and V. Sze, "Eyeriss v2: A flexible accelerator for emerging deep neural networks on mobile devices," IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 9, no. 2, pp. 292–308, 2019.

[9] Y.-H. Chen, J. Emer, and V. Sze, "Eyeriss: A spatial architecture for energy-efficient dataflow for convolutional neural networks," in Proceedings of the 43rd International Symposium on Computer Architecture, ser. ISCA '16. IEEE Press, 2016, pp. 367–379. [Online]. Available: https://doi.org/10.1109/ISCA.2016.40

[10] Z. Chen, X. Chen, and J. Gu, "15.3 a 65nm 3t dynamic analog ram-based computing-in-memory macro and cnn accelerator with retention enhancement, adaptive analog sparsity and 44tops/w system energy efficiency," in 2021 IEEE International Solid-State Circuits Conference (ISSCC), vol. 64, 2021, pp. 240–242.

[11] Y. D. Chih, P. H. Lee, H. Fujiwara, Y. C. Shih, C. F. Lee, R. Naous, Y. L. Chen, C. P. Lo, C. H. Lu, H. Mori, W. C. Zhao, D. Sun, M. E. Sinangil, Y. H. Chen, T. L. Chou, K. Akarvardar, H. J. Liao, Y. Wang, M. F. Chang, and T. Y. J. Chang, "16.4 an 89tops/w and 16.3tops/mm2 all-digital sram-based full-precision compute-in-memory macro in 22nm for machine-learning edge applications," in 2021 IEEE International Solid-State Circuits Conference (ISSCC), vol. 64, 2021, pp. 252–254.

[12] T. Chou, W. Tang, J. Botimer, and Z. Zhang, "Cascade: Connecting rrams to extend analog dataflow in an end-to-end in-memory processing paradigm," in Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO '52. New York, NY, USA: Association for Computing Machinery, 2019, pp. 114–125. [Online]. Available: https://doi.org/10.1145/3352460.3358328

[13] M. Davies, N. Srinivasa, T. Lin, G. Chinya, Y. Cao, S. H. Choday, G. Dimou, P. Joshi, N. Imam, S. Jain, Y. Liao, C. Lin, A. Lines, R. Liu, D. Mathaikutty, S. McCoy, A. Paul, J. Tse, G. Venkataramanan, Y. Weng, A. Wild, Y. Yang, and H. Wang, "Loihi: A neuromorphic manycore processor with on-chip learning," IEEE Micro, vol. 38, no. 1, pp. 82–99, 2018.

[14] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei, "Imagenet: A large-scale hierarchical image database," in 2009 IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 248–255.

[15] A. Firuzan, M. Modarressi, M. Daneshtalab, and M. Reshadi, "Reconfigurable network-on-chip for 3d neural network accelerators," in 2018 Twelfth IEEE/ACM International Symposium on Networks-on-Chip (NOCS), 2018, pp. 1–8.

[16] D. Fujiki, S. Mahlke, and R. Das, "In-memory data parallel processor," SIGPLAN Not., vol. 53, no. 2, pp. 1–14, Mar. 2018. [Online]. Available: https://doi.org/10.1145/3296957.3173171

[17] R. Guo, Z. Yue, X. Si, T. Hu, H. Li, L. Tang, Y. Wang, L. Liu, M. F. Chang, Q. Li, S. Wei, and S. Yin, "15.4 a 5.99-to-691.1tops/w tensor-train in-memory-computing processor using bit-level-sparsity-based optimization and variable-precision quantization," in 2021 IEEE International Solid-State Circuits Conference (ISSCC), vol. 64, 2021, pp. 242–244.

[18] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.

[19] J. Hu, L. Shen, and G. Sun, "Squeeze-and-excitation networks," in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 7132–7141.

[20] Y. Hu, C. Shih, H. Tai, H. Chen, and H. Chen, "A 0.6v 6.4fj/conversion-step 10-bit 150ms/s subranging sar adc in 40nm cmos," in 2014 IEEE Asian Solid-State Circuits Conference (A-SSCC), 2014, pp. 81–84.

[21] J. Huang, R. Reddy Puli, P. Majumder, S. Kim, R. Boyapati, K. H. Yum, and E. J. Kim, "Active-routing: Compute on the way for near-data processing," in 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA), 2019, pp. 674–686.

[22] M. Imani, S. Gupta, Y. Kim, and T. Rosing, "Floatpim: In-memory acceleration of deep neural network training with high precision," in 2019 ACM/IEEE 46th Annual International Symposium on Computer Architecture (ISCA), 2019, pp. 802–815.

[23] H. Jia, M. Ozatay, Y. Tang, H. Valavi, R. Pathak, J. Lee, and N. Verma, "15.1 a programmable neural-network inference accelerator based on scalable in-memory computing," in 2021 IEEE International Solid-State Circuits Conference (ISSCC), vol. 64, 2021, pp. 236–238.

[24] A. Krizhevsky, "Learning multiple layers of features from tiny images," University of Toronto, 05 2012.

[25] H. Kwon, A. Samajdar, and T. Krishna, "A communication-centric approach for designing flexible dnn accelerators," IEEE Micro, vol. 38, no. 6, pp. 25–35, 2018.

[26] H. Kwon, A. Samajdar, and T. Krishna, "Rethinking nocs for spatial neural network accelerators," in Proceedings of the Eleventh IEEE/ACM International Symposium on Networks-on-Chip, ser. NOCS '17. New York, NY, USA: Association for Computing Machinery, 2017. [Online]. Available: https://doi.org/10.1145/3130218.3130230

[27] H. Kwon, A. Samajdar, and T. Krishna, "Maeri: Enabling flexible dataflow mapping over dnn accelerators via reconfigurable interconnects," in Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS '18. New York, NY, USA: Association for Computing Machinery, 2018, pp. 461–475. [Online]. Available: https://doi.org/10.1145/3173162.3173176

[28] W. Li, P. Xu, Y. Zhao, H. Li, Y. Xie, and Y. Lin, "Timely: Pushing data movements and interfaces in pim accelerators towards local and in time domain," in Proceedings of the ACM/IEEE 47th Annual International Symposium on Computer Architecture, ser. ISCA '20. IEEE Press, 2020, pp. 832–845. [Online]. Available: https://doi.org/10.1109/ISCA45697.2020.00073

[29] Q. Liu, B. Gao, P. Yao, D. Wu, J. Chen, Y. Pang, W. Zhang, Y. Liao, C. Xue, W. Chen, J. Tang, Y. Wang, M. Chang, H. Qian, and H. Wu, "33.2 a fully integrated analog reram based 78.4tops/w compute-in-memory chip with fully parallel mac computing," in 2020 IEEE International Solid-State Circuits Conference (ISSCC), 2020, pp. 500–502.


[30] Q. Liu, B. Gao, P. Yao, D. Wu, J. Chen, Y. Pang, W. Zhang, Y. Liao, C.-X. Xue, W.-H. Chen, J. Tang, Y. Wang, M.-F. Chang, H. Qian, and H. Wu, "33.2 a fully integrated analog reram based 78.4tops/w compute-in-memory chip with fully parallel mac computing," in 2020 IEEE International Solid-State Circuits Conference (ISSCC), 2020, pp. 500–502.

[31] W. Lu, G. Yan, J. Li, S. Gong, Y. Han, and X. Li, "Flexflow: A flexible dataflow accelerator architecture for convolutional neural networks," in 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA), 2017, pp. 553–564.

[32] S. K. Mandal, G. Krishnan, C. Chakrabarti, J. S. Seo, Y. Cao, and U. Y. Ogras, "A latency-optimized reconfigurable noc for in-memory acceleration of dnns," IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 10, no. 3, pp. 362–375, 2020.

[33] M. Mao, X. Peng, R. Liu, J. Li, S. Yu, and C. Chakrabarti, "Max2: An reram-based neural network accelerator that maximizes data reuse and area utilization," IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 9, no. 2, pp. 398–410, 2019.

[34] X. Peng, R. Liu, and S. Yu, "Optimizing weight mapping and data flow for convolutional neural networks on rram based processing-in-memory architecture," in 2019 IEEE International Symposium on Circuits and Systems (ISCAS), 2019, pp. 1–5.

[35] X. Qiao, X. Cao, H. Yang, L. Song, and H. Li, "Atomlayer: A universal reram-based cnn accelerator with atomic layer computation," in 2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC), 2018, pp. 1–6.

[36] K. Qiu, N. Jao, M. Zhao, C. S. Mishra, G. Gudukbay, S. Jose, J. Sampson, M. T. Kandemir, and V. Narayanan, "Resirca: A resilient energy harvesting reram crossbar-based accelerator for intelligent embedded processors," in 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA), 2020, pp. 315–327.

[37] E. Real, A. Aggarwal, Y. Huang, and Q. V. Le, "Regularized evolution for image classifier architecture search," Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, no. 01, pp. 4780–4789, Jul. 2019. [Online]. Available: https://ojs.aaai.org/index.php/AAAI/article/view/4405

[38] A. Shafiee, A. Nag, N. Muralimanohar, R. Balasubramonian, J. P. Strachan, M. Hu, R. S. Williams, and V. Srikumar, "Isaac: A convolutional neural network accelerator with in-situ analog arithmetic in crossbars," in 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), 2016, pp. 14–26.

[39] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," in 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, Y. Bengio and Y. LeCun, Eds., 2015. [Online]. Available: http://arxiv.org/abs/1409.1556

[40] L. Song, X. Qian, H. Li, and Y. Chen, "Pipelayer: A pipelined reram-based accelerator for deep learning," in 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA), 2017, pp. 541–552.

[41] A. Stillmaker and B. Baas, "Scaling equations for the accurate prediction of cmos device performance from 180nm to 7nm," Integration, the VLSI Journal, vol. 58, 02 2017.

[42] V. Sze, Y. Chen, T. Yang, and J. S. Emer, "Efficient processing of deep neural networks: A tutorial and survey," Proceedings of the IEEE, vol. 105, no. 12, pp. 2295–2329, 2017.

[43] S. L. Wu, K. Y. Li, P. T. Huang, W. Hwang, M. H. Tu, S. C. Lung, W. S. Peng, H. S. Huang, K. D. Lee, Y. S. Kao, and C. T. Chuang, "A 0.5-v 28-nm 256-kb mini-array based 6t sram with vtrip-tracking write-assist," IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 64, no. 7, pp. 1791–1802, 2017.

[44] C. Xue, W. Chen, J. Liu, J. Li, W. Lin, W. Lin, J. Wang, W. Wei, T. Chang, T. Chang, T. Huang, H. Kao, S. Wei, Y. Chiu, C. Lee, C. Lo, Y. King, C. Lin, R. Liu, C. Hsieh, K. Tang, and M. Chang, "24.1 a 1mb multibit reram computing-in-memory macro with 14.6ns parallel mac computing time for cnn based ai edge processors," in 2019 IEEE International Solid-State Circuits Conference (ISSCC), 2019, pp. 388–390.

[45] C. Xue, T. Huang, J. Liu, T. Chang, H. Kao, J. Wang, T. Liu, S. Wei, S. Huang, W. Wei, Y. Chen, T. Hsu, Y. Chen, Y. Lo, T. Wen, C. Lo, R. Liu, C. Hsieh, K. Tang, and M. Chang, "15.4 a 22nm 2mb reram compute-in-memory macro with 121-28tops/w for multibit mac computing for tiny ai edge devices," in 2020 IEEE International Solid-State Circuits Conference (ISSCC), 2020, pp. 244–246.

[46] J. H. Yoon, M. Chang, W. S. Khwa, Y. D. Chih, M. F. Chang, and A. Raychowdhury, "29.1 a 40nm 64kb 56.67tops/w read-disturb-tolerant compute-in-memory/digital rram macro with active-feedback-based read and in-situ write verification," in 2021 IEEE International Solid-State Circuits Conference (ISSCC), vol. 64, 2021, pp. 404–406.

[47] J. Yue, X. Feng, Y. He, Y. Huang, Y. Wang, Z. Yuan, M. Zhan, J. Liu, J. W. Su, Y. L. Chung, P. C. Wu, L. Y. Hung, M. F. Chang, N. Sun, X. Li, H. Yang, and Y. Liu, "15.2 a 2.75-to-75.9tops/w computing-in-memory nn processor supporting set-associate block-wise zero skipping and ping-pong cim with simultaneous computation and weight updating," in 2021 IEEE International Solid-State Circuits Conference (ISSCC), vol. 64, 2021, pp. 238–240.

[48] J. Yue, Z. Yuan, X. Feng, Y. He, Z. Zhang, X. Si, R. Liu, M. Chang, X. Li, H. Yang, and Y. Liu, "14.3 a 65nm computing-in-memory-based cnn processor with 2.9-to-35.8tops/w system energy efficiency using dynamic-sparsity performance-scaling architecture and energy-efficient inter/intra-macro data reuse," in 2020 IEEE International Solid-State Circuits Conference (ISSCC), 2020, pp. 234–236.

[49] M. Zhang, Y. Zhu, C.-H. Chan, and R. P. Martins, "16.2 a 4× interleaved 10gs/s 8b time-domain adc with 16× interpolation-based inter-stage gain achieving >37.5db sndr at 18ghz input," in 2020 IEEE International Solid-State Circuits Conference (ISSCC), 2020, pp. 252–254.

[50] Y. Zhang, K. Huang, R. Xiao, and H. Shen, "An 8-bit in resistive memory computing core with regulated passive neuron and bit line weight mapping," arXiv preprint arXiv:2008.11669, 2020.
