
Efficient and Effective Sparse LSTM on FPGA with Bank-Balanced Sparsity

Shijie Cao∗ (Harbin Institute of Technology) [email protected]
Chen Zhang (Microsoft Research) [email protected]
Zhuliang Yao∗ (Tsinghua University) [email protected]
Wencong Xiao∗ (Beihang University) [email protected]
Lanshun Nie (Harbin Institute of Technology) [email protected]
Dechen Zhan (Harbin Institute of Technology) [email protected]
Yunxin Liu (Microsoft Research) [email protected]
Ming Wu (Microsoft Research) [email protected]
Lintao Zhang (Microsoft Research) [email protected]

ABSTRACT

Neural networks based on Long Short-Term Memory (LSTM) are widely deployed in latency-sensitive language and speech applications. To speed up LSTM inference, previous research proposes weight pruning techniques to reduce computational cost. Unfortunately, irregular computation and memory accesses in unrestricted sparse LSTM limit the realizable parallelism, especially when implemented on FPGA. To address this issue, some researchers propose block-based sparsity patterns to increase the regularity of sparse weight matrices, but these approaches suffer from deteriorated prediction accuracy.

This work presents Bank-Balanced Sparsity (BBS), a novel sparsity pattern that can maintain model accuracy at a high sparsity level while still enabling an efficient FPGA implementation. BBS partitions each weight matrix row into banks for parallel computing, while adopting fine-grained pruning inside each bank to maintain model accuracy. We develop a three-step software-hardware co-optimization approach to apply BBS in real FPGA hardware. First, we propose a bank-balanced pruning method to induce the BBS pattern on weight matrices. Then we introduce a decoding-free sparse matrix format, Compressed Sparse Banks (CSB), that transparently exposes the inter-bank parallelism in BBS to hardware. Finally, we design an FPGA accelerator that takes advantage of BBS to eliminate irregular computation and memory accesses. Implemented on an Intel Arria-10 FPGA, the BBS accelerator achieves 750.9 GOPs on sparse LSTM networks with a batch size of 1. Compared to state-of-the-art FPGA accelerators for LSTM with different compression techniques, the BBS accelerator achieves a 2.3x~3.7x improvement in

∗ Contribution during internship at Microsoft Research.


energy efficiency and a 7.0x~34.4x reduction in latency with negligible loss of model accuracy.

KEYWORDS

FPGA; Deep Neural Networks; LSTM; Weight Pruning; Inference; Bank-Balanced Sparsity

ACM Reference Format:
Shijie Cao, Chen Zhang, Zhuliang Yao, Wencong Xiao, Lanshun Nie, Dechen Zhan, Yunxin Liu, Ming Wu, and Lintao Zhang. 2019. Efficient and Effective Sparse LSTM on FPGA with Bank-Balanced Sparsity. In The 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA '19), February 24–26, 2019, Seaside, CA, USA. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/3289602.3293898

1 INTRODUCTION

Neural networks based on Long Short-Term Memory (LSTM) have been widely used in interactive and latency-sensitive applications such as machine translation, speech recognition and speech synthesis [13, 20, 24]. The size and computational cost of these LSTM models continue to grow in order to achieve better model accuracy. However, the stringent requirement on computational resources makes it challenging to achieve low inference latency for large LSTMs. The most time-consuming part of LSTM inference is matrix-vector multiplication (MxV). As the size of the LSTM network grows, the MxV cost grows quadratically, significantly increasing the inference cost.

Weight pruning is a model compression technique to reduce overall memory and computational costs. Early works [8, 10] discover that removing LSTM weights below a small threshold has negligible impact on model accuracy. By clamping a significant portion of the weights to 0, the weight pruning approach converts dense weight matrices to unstructured sparse matrices, thus reducing the computation and memory required to carry out inference.

After pruning, the most significant part of LSTM inference changes from dense MxV to sparse matrix-vector multiplication (SpMxV). Though requiring less computation, the irregularity of SpMxV limits the maximum performance and energy efficiency achievable on hardware accelerators [17, 19, 27]. Unstructured sparse matrices cannot efficiently utilize the underlying hardware resources for three reasons:


1) the unbalanced non-zero weight distribution might cause workload skew among processing elements (PEs); 2) concurrent irregular memory accesses to a dense vector lead to memory access conflicts, which could stall parallel execution; and 3) sparse matrix representations such as compressed sparse row (CSR) use indexes to track non-zero values, which require decoding before computation.

To address these issues, further works [17, 19] suggest using coarser-grained weight pruning methods to induce more structured sparsity patterns for better hardware acceleration. Coarse-grained pruning methods prune weights at the granularity of blocks. From the hardware perspective, blocks of non-zero weights enable contiguous memory accesses and better utilize parallel computation resources. Unfortunately, it becomes challenging to maintain the same model accuracy when block sparsity is applied. Block sparsity constrains the locality of the non-zero weights, and important weights could be mistakenly pruned, resulting in model accuracy loss. Furthermore, the block size (i.e., pruning granularity) is application-sensitive, making it another hyper-parameter to tune. Existing work often needs to search a range of block sizes to find a trade-off between model accuracy and hardware efficiency [13, 17].

This work presents Bank-Balanced Sparsity (BBS), a novel sparsity pattern for pruning LSTM. Bank-balanced pruning splits each weight matrix row into multiple equal-sized banks and applies fine-grained pruning to each bank independently to obtain identical sparsity among banks. BBS preserves the unstructured distribution of non-zero weights inside each bank, thus maintaining higher model accuracy than block sparsity. Experimental results in Section 6 demonstrate that BBS achieves almost the same model accuracy as unstructured sparsity and significantly outperforms block sparsity when pruning weights at the same sparsity level.

Importantly, BBS is also amenable to FPGA acceleration because it inherently provides a balanced matrix partitioning for parallel computing. We design an FPGA accelerator that takes advantage of BBS to eliminate the computational overheads that exist in unstructured sparsity. Specifically: 1) our accelerator utilizes the intrinsic bank-balanced property of BBS to achieve high parallelism in SpMxV with guaranteed load balance; 2) our accelerator supports concurrent random access requests to vector elements without conflicts in SpMxV by adopting banked scratchpad memory to buffer vectors; 3) to avoid the decoding overheads of sparse matrix formats, we introduce a novel format for BBS matrices that is decoding-free in our FPGA accelerator. Notably, the BBS accelerator is highly efficient even for inference with a batch size of 1, by exploiting fine-grained parallelism from a single sample, which is challenging for unstructured sparsity.

Overall, this paper makes the following contributions:

(1) We propose Bank-Balanced Sparsity, a novel sparsity pattern that can both maintain model accuracy and enable an efficient FPGA accelerator implementation.

(2) We design an FPGA-based accelerator for BBS that eliminates load imbalance, irregular memory accesses and decoding overheads, and achieves good efficiency for LSTM inference even at a batch size of 1.

(3) Implemented on an Intel Arria-10 FPGA, the BBS accelerator achieves 750.9 GOPs on large LSTMs without batching. Compared to state-of-the-art LSTM FPGA accelerators, we achieve a 2.3x~3.7x improvement in energy efficiency and a 7.0x~34.4x reduction in latency with negligible loss of model accuracy.

2 BACKGROUND

2.1 Long Short-Term Memory

LSTM is one of the most successful cells used in Recurrent Neural Networks (RNNs) [11]. An LSTM network computes a mapping from an input sequence X = (x_1, ..., x_T) to an output sequence Y = (y_1, ..., y_T) by applying the following equations iteratively from t = 1 to T:

i_t = σ(W_{ix} x_t + W_{ir} y_{t-1} + W_{ic} c_{t-1} + b_i)        (1)
f_t = σ(W_{fx} x_t + W_{fr} y_{t-1} + W_{fc} c_{t-1} + b_f)        (2)
g_t = σ(W_{cx} x_t + W_{cr} y_{t-1} + b_c)                         (3)
c_t = f_t ⊙ c_{t-1} + g_t ⊙ i_t                                    (4)
o_t = σ(W_{ox} x_t + W_{or} y_{t-1} + W_{oc} c_{t-1} + b_o)        (5)
m_t = o_t ⊙ h(c_t)                                                 (6)
y_t = W_{ym} m_t                                                   (7)

where the W terms denote weight matrices and the b terms denote bias vectors. The symbols i, f, o and c are respectively the input gate, forget gate, output gate and cell activation (long-term memory). The ⊙ operator denotes element-wise multiplication, and the + operator denotes element-wise addition. σ is the logistic activation function and h is the hyperbolic tangent (Tanh) activation function.
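For concreteness, one LSTM inference step per equations (1)-(7) can be transcribed as the following NumPy sketch; the dictionary keys and function names are our own shorthand for the subscripted weight matrices and biases, not identifiers from the paper's implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, y_prev, c_prev, W, b):
    """One LSTM inference step following equations (1)-(7).
    W and b are dicts of weight matrices and bias vectors keyed by subscript."""
    i_t = sigmoid(W['ix'] @ x_t + W['ir'] @ y_prev + W['ic'] @ c_prev + b['i'])  # (1)
    f_t = sigmoid(W['fx'] @ x_t + W['fr'] @ y_prev + W['fc'] @ c_prev + b['f'])  # (2)
    g_t = sigmoid(W['cx'] @ x_t + W['cr'] @ y_prev + b['c'])                     # (3)
    c_t = f_t * c_prev + g_t * i_t                                               # (4)
    o_t = sigmoid(W['ox'] @ x_t + W['or'] @ y_prev + W['oc'] @ c_prev + b['o'])  # (5)
    m_t = o_t * np.tanh(c_t)                                                     # (6)
    y_t = W['ym'] @ m_t                                                          # (7)
    return y_t, c_t
```

Nearly all of the arithmetic above sits in the matrix-vector products, which is why the remainder of the paper focuses on accelerating MxV.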

Among all operators in LSTM, matrix-vector multiplication (MxV) is the most memory-intensive and computation-intensive operator. The dimensions of x_t, y_t and c_t are often the same, say D. Therefore, the number of weights is 12 × D². In each step of the inference calculation, the number of operations in MxV is 24 × D², and the number of operations in element-wise operators (EWOP) is 9 × D. As a consequence, accelerating MxV is the key to low-latency LSTM inference.

2.2 Weight Pruning

It is widely observed that deep neural networks (DNNs) have a lot of redundancy in weights. Pruning away (forcing to zero) a proper number of unimportant weights does not affect model accuracy. Moreover, weight pruning can reduce the model size and computational complexity for energy-efficient hardware acceleration. Deep Compression [9, 10] provides a threshold-based weight pruning technique. This method prunes away small weights whose absolute values are less than a predefined threshold and retrains the remaining weights. Pruning and retraining are iteratively applied to generate the sparse DNN model.

As mentioned in the introduction, unrestricted pruning of weight matrices is unfriendly to hardware acceleration. Further work [17, 19] proposes coarse-grained pruning methods that prune blocks of weights. They pick the maximum magnitude or the average magnitude of the weights within a block as the representative of the entire block. If the representative magnitude is less than a predefined threshold, the entire block is pruned. However, the pruning granularity affects hardware efficiency as well as model accuracy, so deep neural network designers struggle to balance the two.
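As a rough sketch of this coarse-grained approach (our own illustration, not the exact procedure of [17, 19]), block pruning zeroes an entire block whenever its representative magnitude falls below a predefined threshold:

```python
import numpy as np

def block_prune(W, block=(4, 4), threshold=0.1, criterion="max"):
    """Coarse-grained block pruning sketch: zero out a block when its
    representative magnitude (max or mean of absolute values) is below
    the threshold. Block size and threshold are illustrative defaults."""
    Wp = W.copy()
    rows, cols = W.shape
    for r in range(0, rows, block[0]):
        for c in range(0, cols, block[1]):
            blk = np.abs(W[r:r + block[0], c:c + block[1]])
            rep = blk.max() if criterion == "max" else blk.mean()
            if rep < threshold:
                Wp[r:r + block[0], c:c + block[1]] = 0.0
    return Wp
```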


3 BANK-BALANCED SPARSITY

Our proposed sparsity pattern, Bank-Balanced Sparsity (BBS), achieves both high model accuracy and high hardware efficiency. In this section, we first describe the pattern of BBS and the motivation for designing it. Then, we present the detailed bank-balanced pruning algorithm to induce BBS on LSTM weight matrices. Finally, we analyze the pruning effectiveness of BBS in terms of achievable accuracy and sparsity. The efficient hardware acceleration design for BBS will be introduced in the next section.

3.1 Bank-Balanced Sparsity Pattern

For matrices represented in BBS, each matrix row is split into multiple equal-sized banks (i.e., sub-rows), and each bank has the same number of non-zero values. Figure 1 illustrates BBS with an example and compares it with unstructured sparsity and block sparsity. In this example, three sparse matrices with different sparsity patterns are all pruned from the dense example weight matrix in Figure 1(a) with a sparsity ratio of 50%. Fine-grained pruning globally sorts the weights and prunes the smallest 50% of weights, leading to an unstructured sparse matrix (Figure 1(b)); coarse-grained pruning induces a block sparse matrix (Figure 1(c)) by setting the block size to 2x2 and the block representative to the block average; our bank-balanced pruning induces a bank-balanced sparse matrix (Figure 1(d)) by splitting each matrix row into 2 equal-sized banks and applying fine-grained pruning inside each bank independently.

[Figure 1: Comparing BBS with unstructured sparsity and block sparsity by pruning a dense matrix with a sparsity ratio of 50%. (a) Original dense matrix; (b) unstructured sparse matrix by global pruning; (c) block sparse matrix by pruning 2x2 blocks according to the block average; (d) bank-balanced sparse matrix by local pruning inside each 1x4 bank.]

We design the BBS sparsity pattern with consideration for both hardware efficiency and model accuracy. In general, partitioning a weight matrix into multiple sub-matrices is mandatory for parallel computing. In BBS, each matrix row is split into multiple banks with the same size and same sparsity. This bank-balanced partitioning enables an efficient SpMxV design that exploits both inter-row parallelism and intra-row parallelism (i.e., inter-bank parallelism) with guaranteed load balance and no vector access conflicts. The detailed SpMxV design for BBS will be described in Section 4.1. In addition, since BBS applies fine-grained pruning within each bank independently, the relatively large weights that contribute more to model accuracy in each bank can be preserved.

Another potential design for a sparsity pattern would be to split weight matrices into 2-D blocks like block sparsity and apply fine-grained pruning within each 2-D block. Larger weights within each block can be preserved as well in this scheme. However, after pruning, each 2-D block is still an unstructured sparse matrix. It is still challenging to design an efficient hardware accelerator architecture due to the irregularity of sparse sub-matrices. For example, parallelizing SpMxV across 2-D blocks leads to concurrent irregular vector accesses.

3.2 Bank-Balanced Pruning Algorithm

To induce BBS on LSTM weight matrices, we adopt a bank-balanced pruning method that prunes each bank independently with the same threshold percentage to obtain the same sparsity ratio among banks.

Algorithm 1: Bank-Balanced Pruning
Input: the matrix to be pruned, M; the number of banks per row, BankNum; the expected sparsity, Sparsity.
Output: the pruned matrix, Mp.

for each row Mi in M.rows:
    divide the row Mi into BankNum banks
    for each bank in Mi:
        sort the elements in the bank by absolute value
        calculate the bank-internal threshold T in line with Sparsity
        for each element in the bank:
            prune the element if |element| < T
return the pruned matrix Mp

Like previous pruning methods, we apply the bank-balanced pruning method iteratively to a pre-trained network and fine-tune the network after each pruning iteration to restore the model accuracy. Algorithm 1 illustrates the detailed bank-balanced pruning method to induce BBS on LSTM weight matrices. In each pruning iteration, bank-balanced pruning first partitions each matrix row into multiple equal-sized banks and sorts the weights within each bank by their absolute values. The importance of a weight is represented by its ranking of absolute value within its bank. Iteratively, a percentage of weights with the smallest absolute values is pruned. We slowly increase the pruning percentage from 0% to the target sparsity, while the rate of increase decreases with each pruning iteration. During pruning, if the model accuracy drops significantly and cannot be recovered via fine-tuning, we withdraw this pruning iteration and stop the pruning procedure.
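As a reference, a single pruning pass of Algorithm 1 can be sketched in NumPy as follows (our own illustrative code, not the authors' implementation; in practice this pass is wrapped in the iterative schedule with fine-tuning described above):

```python
import numpy as np

def bank_balanced_prune(W, bank_num, sparsity):
    """One pruning pass inducing BBS: each row is split into `bank_num`
    equal-sized banks and the smallest weights (by absolute value) inside
    every bank are zeroed independently, so every bank keeps the same
    number of non-zeros."""
    Wp = W.copy()
    rows, cols = W.shape
    assert cols % bank_num == 0, "row length must divide evenly into banks"
    bank_size = cols // bank_num
    keep = bank_size - int(round(bank_size * sparsity))   # non-zeros kept per bank
    for r in range(rows):
        for b in range(bank_num):
            bank = Wp[r, b * bank_size:(b + 1) * bank_size]   # view into Wp
            # indices of the smallest-magnitude elements to prune in this bank
            drop_idx = np.argsort(np.abs(bank))[:bank_size - keep]
            bank[drop_idx] = 0.0
    return Wp
```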

3.3 Analysis of Our Pruning Method

Intuitively, a pruning method should remove only smaller weights and preserve larger weights that contribute more to model accuracy.


[Figure 2: Weight map visualization after pruning with (a) unstructured sparsity, (b) BBS, and (c) block sparsity (sparsity ratio = 90%). These weight maps are 64 × 64 sub-matrices of the whole 1500 × 1500 matrix.]

Fine-grained pruning clamps weights of small magnitude to zero and preserves large weights to maintain model accuracy. Bank-balanced pruning adopts fine-grained pruning inside each bank independently, so large weights inside each bank can be preserved. In contrast, coarse-grained pruning prunes blocks of weights, which constrains the locality of preserved non-zero weights; therefore some important weights can be mistakenly pruned while some unimportant weights are instead preserved. For the example in Figure 1, the bank-balanced sparse matrix in (d) preserves similar large weights as the unstructured sparse matrix in (b), but the block sparse matrix in (c) removes some large weights (e.g., 0.4 and 0.5) yet preserves some small weights (e.g., 0.1 and -0.1).

Table 1: Percentages of the largest weights that are preserved in various sparsity patterns (sparsity ratio = 90%).

Weight Matrix    Unstructured Sparsity    BBS       Block Sparsity
Wix              100.00%                  91.30%    42.76%
Wfx              100.00%                  81.39%    24.26%
Wcx              100.00%                  84.45%    24.24%
Wox              100.00%                  85.62%    22.97%

To verify the pruning effectiveness of BBS and compare it with unstructured sparsity and block sparsity, we analyze and visualize the weight matrices after the corresponding pruning methods in a real LSTM model [28]. The hidden size of this LSTM model is 1500. Table 1 shows the percentage of the largest weights that are preserved in various sparsity patterns. Here we show the results for Wix, Wfx, Wcx and Wox; other weight matrices show similar results. In this analysis, the sparsity ratios are all 90%, the bank size of BBS is 32 and the block size of block sparsity is 4 × 4. Unstructured sparsity by fine-grained pruning naturally preserves 100% of the largest weights because it globally prunes the weights with the smallest magnitudes. BBS preserves more than 80% of the largest weights by fine-grained pruning inside each bank, while block sparsity preserves less than half (or even a quarter) of the largest weights. Figure 2 visualizes these three kinds of sparse weight matrices for a 64 × 64 sub-matrix randomly selected from the whole 1500 × 1500 Wix. Grey grids indicate non-zero parameters and the grey level indicates the magnitude of the absolute value. For the second matrix, represented in BBS, each row has two banks (left and right sides of the dashed line), and each bank has 3 non-zero weights. We can see that the weight map of BBS is very similar to the weight map of unstructured sparsity, but the weight map of block sparsity is quite different because of the locality constraint.
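The preservation percentages in Table 1 can be reproduced with a simple check like the following (a hypothetical helper we add for illustration): given the dense matrix and its pruned counterpart, count how many of the top-k largest-magnitude dense weights survive, where k is the number of non-zeros kept by pruning.

```python
import numpy as np

def largest_weights_preserved(W_dense, W_pruned):
    """Fraction of the top-|nonzeros| largest-magnitude weights of the
    dense matrix that survive pruning (the metric reported in Table 1)."""
    k = int(np.count_nonzero(W_pruned))
    # flat indices of the k largest-magnitude weights in the dense matrix
    top_k = np.argsort(np.abs(W_dense), axis=None)[-k:]
    kept = np.flatnonzero(W_pruned)
    return len(np.intersect1d(top_k, kept)) / k
```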

In terms of achievable sparsity and accuracy, experimental results on two typical data sets [7, 18] demonstrate that BBS has almost the same effectiveness as unstructured sparsity and outperforms block sparsity, as described in Section 6.2.

4 SPARSE MATRIX COMPUTATION AND FORMAT FOR BBS

As mentioned, the irregularity of unstructured sparsity is not hardware friendly due to unbalanced computation, irregular memory accesses and decoding overheads. In contrast, the intrinsic bank-balanced property of BBS enables effective hardware designs to address these issues. For BBS, we introduce a highly parallel SpMxV design with guaranteed load balance and no vector access conflicts, and an associated decoding-free sparse matrix format for the SpMxV design.

4.1 Highly Parallel SpMxV Design

SpMxV consists of multiple dot product operations, one for each sparse matrix row and the dense vector. The standard practice of using multiple PEs to parallelize dot products across matrix rows can reduce computation time. However, irregular memory access patterns of unstructured sparse matrices restrict further parallelism within a dot product.

In addition to inter-row parallelism, BBS enables an efficient SpMxV design that exploits intra-row parallelism (i.e., inter-bank parallelism) through the bank-balanced partitioning. Figure 3 illustrates how to exploit inter-bank parallelism in computing a dot product of two vectors (i.e., a BBS matrix row and the dense vector). The multiplication for the non-zero elements inside each bank is performed serially, while the multiplications in different banks are performed in parallel. In this example, the sparse matrix row is divided into 4 banks, shown in different colors. The size of each bank is 3 and the sparsity is 1/3. The multiplied dense vector is divided into 4 banks accordingly. Our design computes the dot product of two vectors by accumulating dot products of sub-vectors whose sizes are all the number of banks (N).


Each bank of the sparse matrix row provides one non-zero element to form one sub-vector (e.g., (A, C, E, G)), while dense vector elements are fetched based on the indices of the non-zero values to form another sub-vector (e.g., (v0, v3, v7, v9)). For computing a dot product of sub-vectors, N pair-wise multiplications are executed in parallel. Multiple dot products of sub-vectors are calculated sequentially and accumulated to obtain the dot product of the complete vectors.

[Figure 3: Exploiting inter-bank parallelism in the dot product computation of one BBS matrix row and the dense vector.]

The bank-balanced property in BBS eliminates load imbalance and irregular memory accesses. In BBS matrices, every row (and every bank) has the same number of elements, which automatically guarantees load balance across rows and banks in SpMxV. When calculating a partial dot product, BBS ensures one and only one element is accessed in each bank. Therefore, storing each vector bank in an independently accessible block RAM can supply vector elements simultaneously with high bandwidth and without memory access conflicts. The detailed FPGA implementation is shown in Section 5.
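The following Python sketch mirrors this access pattern in software (illustrative only; on the FPGA the inner loop over banks is executed by N parallel multiplier lanes in a single cycle, and the data layout assumed here is our own):

```python
def bbs_row_dot(row_banks, vec_banks):
    """Dot product of one BBS row with a dense vector, organized the way
    the hardware consumes it.

    row_banks: list of N lists of (bank_internal_index, value) pairs,
               every bank holding the same number of non-zeros.
    vec_banks: list of N arrays, the dense vector split into N banks.
    """
    nnz_per_bank = len(row_banks[0])
    acc = 0.0
    for k in range(nnz_per_bank):                       # sequential "cycles"
        for bank_id, bank in enumerate(row_banks):      # N lanes, parallel on FPGA
            idx, val = bank[k]
            acc += val * vec_banks[bank_id][idx]        # one access per BRAM, no conflict
    return acc
```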

4.2 Decoding-Free Sparse Matrix Format

Various sparse matrix formats have been proposed to reduce the memory footprint of sparse matrices. However, existing formats introduce decoding overheads when performing sparse matrix multiplications. For an FPGA implementation, decoding sparse formats consumes hardware resources and incurs latency. In order to eliminate decoding overheads, we introduce a sparse matrix format called Compressed Sparse Banks (CSB) that is specifically designed for BBS.

Compressed Sparse Row (CSR) is a commonly used sparse matrix format [1]. We use CSR as a representative encoding of existing formats for explanation and comparison. Figure 4(a) shows a bank-balanced sparse matrix represented in dense format. Figure 4(b) shows its corresponding CSR encoding. CSR incurs two types of overhead for the SpMxV operation. First, CSR encodes all non-zero elements in row-major order, so rearranging the non-zero elements is inevitable in order to exploit inter-bank parallelism in SpMxV. Second, CSR stores column indices and row pointers to track the location of each non-zero value, so calculating memory addresses is required to fetch vector elements. Other encoding formats, such as CSC and COO, have similar limitations [1].

[Figure 4: The comparison between CSB and CSR. (a) Original densely represented matrix; (b) CSR-represented matrix (values, column indices, row pointers); (c) CSB-represented matrix (values rearranged for inter-bank parallelization, bank internal indices serving as physical BRAM addresses).]

The proposed CSB format takes advantage of the balanced property of BBS and eliminates the need for decoding. Figure 4(c) shows the CSB representation of the corresponding matrix. The CSB encoding uses two arrays to represent a bank-balanced sparse matrix. In the first array (values), all non-zero values are arranged in row-major order; inside each row, the first non-zero element of each bank (e.g., (A, C, E, G)) is listed first, then the second element of each bank, and so on. The purpose of this data rearrangement is to explicitly expose inter-bank parallelism, so every N successive elements in CSB can be directly fetched and computed upon in parallel. The second array (indices) lists the bank internal indices of the non-zero values, which are the column indices modulo the bank size K. When each of the N vector banks is stored in a separate BRAM block on the FPGA, the bank internal indices can be directly used as physical addresses to fetch the N corresponding vector elements from the BRAM blocks.
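A small encoder, assuming a matrix already pruned to BBS, illustrates how the two CSB arrays can be produced (the array names follow Figure 4(c); this is our own sketch rather than the paper's tooling):

```python
import numpy as np

def to_csb(W_bbs, bank_num):
    """Encode a bank-balanced sparse matrix into the two CSB arrays.

    Returns (values, bank_internal_indices), each a list with one entry per
    matrix row; within a row, every consecutive group of `bank_num` elements
    holds the next non-zero of each bank, ready for the N parallel lanes.
    """
    rows, cols = W_bbs.shape
    bank_size = cols // bank_num
    values, indices = [], []
    for r in range(rows):
        banks = [[] for _ in range(bank_num)]
        for c in np.flatnonzero(W_bbs[r]):
            banks[c // bank_size].append((c % bank_size, W_bbs[r, c]))
        nnz_per_bank = len(banks[0])          # equal across banks by BBS construction
        row_vals, row_idx = [], []
        for k in range(nnz_per_bank):         # interleave: k-th non-zero of every bank
            for b in range(bank_num):
                idx, val = banks[b][k]
                row_vals.append(val)
                row_idx.append(idx)           # bank internal index = column mod bank_size
        values.append(row_vals)
        indices.append(row_idx)
    return values, indices
```

Because every bank keeps the same number of non-zeros, the interleaving loop needs no per-bank bookkeeping, and each group of bank_num consecutive entries can be consumed directly by the N lanes.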

5 LSTM ACCELERATOR

In this section, we introduce the BBS accelerator, an FPGA-based accelerator for LSTM networks with bank-balanced pruning. The BBS accelerator is implemented as an accelerator on the PCIe I/O bus to serve LSTM inference requests from the host server. Our design specifically targets LSTM inference at a batch size of one to reduce latency, devoting the on-chip resources to exploiting as much parallelism as possible from one single sample.

5.1 Overall Architecture

Figure 5 shows the overall architecture of the BBS accelerator, which consists of a sparse matrix-vector multiplication unit (SpMxV Unit), an element-wise vector operation unit (EWOP Unit), a direct memory access module (DMA) for load/store operations, on-chip memories for matrices and vectors (Matrix Memory and Vector Memory), and a central controller.


[Figure 5: Overall architecture of the BBS accelerator, showing the SpMxV unit with its array of PEs and private vector buffers, the EWOP unit, the controller with instruction buffer, the DMA module, the on-chip matrix and vector memories, and the connections to off-chip DRAM and the host server over PCIe.]

Before hardware acceleration, the host server uses the bank-balanced pruning method to prune weight matrices and represents the sparse matrices in our proposed Compressed Sparse Banks (CSB) format; then a lightweight compiler generates instructions for the hardware accelerator to accomplish the computation of LSTM. The controller receives and stores instructions from the host server in the instruction buffer and dispatches them to their corresponding modules for execution.

The two important types of instructions are load/store instructions and computational instructions:

Load/Store Instructions. Load/store instructions are executed in the DMA module to transfer weight matrices and input/output vectors. A load instruction reads data (model weights and inputs) from host memory or off-chip DRAM to on-chip memories. A store instruction writes data (outputs) from on-chip memories to host memory or off-chip DRAM. In many cases, weight pruning can reduce the model size enough to fit in on-chip memories; for serving real-time LSTM with low latency, the default mode is to rely completely on on-chip memories. For large models that cannot completely fit into on-chip memories even with compression, the BBS accelerator uses load/store instructions to read/write weight matrices from/to off-chip DRAM.

Computational Instructions. As introduced in Section 2, all operations in sparse LSTM fall into two categories: SpMxV and EWOP (including addition, multiplication and three kinds of activations). Therefore, we design two kinds of computational instructions, the SpMxV instruction and the EWOP instruction, to fulfill the LSTM computation. The SpMxV instruction is executed in the SpMxV unit to read the required matrix and vector from on-chip memories, compute dot products for matrix rows, and write the result vector back to the vector memory. The EWOP instruction is executed in the EWOP unit to read the required vector(s) from the vector memory and write the resulting vector of element-wise addition/multiplication/activation back to the vector memory.

5.2 SpMxV Unit

The SpMxV unit implements the highly parallel design described in Section 4.1. It consists of M parallel processing elements (PEs) that compute dot products of distinct matrix rows and the dense vector concurrently to exploit inter-row parallelism, while each PE is designed to exploit intra-row (i.e., inter-bank) parallelism in a single dot product operation.

In the center of Figure 5, we show the detailed architecture of a PE. Each PE contains a private vector buffer (PVB) to buffer the dense vector being multiplied, because vector elements are randomly accessed multiple times across matrix rows in SpMxV. The PE computes the dot product of two vectors by accumulating dot products of sub-vectors. This computation includes five steps: (1) the PE reads N matrix row elements from the matrix memory and N vector elements, based on the sparse indices, from the private vector buffer; (2) N multipliers operate simultaneously to obtain N scalar products; (3) an N-input adder tree sums the N scalar products to calculate the partial dot product; (4) one more accumulator is used to obtain the complete dot product; (5) the dot product result is written back to the global vector memory. The PE is fully pipelined so that one operation can be processed per clock cycle.

With M PEs and N multipliers per PE, this PE array achieves M × N parallelism for a single SpMxV operation.

5.3 Private Vector Buffer

In each SpMxV PE, N weight elements can be accessed simultaneously in one clock cycle because the non-zero values have already been rearranged by the CSB encoding format and stored contiguously in the matrix memory. However, to access dense vector elements, the PVB needs to support N random memory accesses concurrently. Each BRAM in the FPGA provides only two read and/or write ports, so using a single BRAM to buffer dense vectors cannot supply N elements from random addresses concurrently. Multi-pumping [25] and vector replication [14] are two alternative solutions. Multi-pumping supplies N elements by running the PEs at a frequency N times lower than the BRAM, which decreases the clock rate significantly. Vector replication provides more ports by creating replicas of the entire vector.


Although this approach is simple to implement, it is difficult to scale due to the limited on-chip storage resources in FPGA and the generally large input/output/state vectors in LSTM.

In order to support random vector accesses at high bandwidth without replicas inside a PE, we adopt the banking approach to buffer vectors [5]. In this approach, the multiplied vector is also split into banks according to the bank partitioning of matrix rows in BBS. As shown in Figure 6, the N banks of vector elements are stored in N independently accessible BRAMs. Therefore, the PVB can provide N elements simultaneously given N bank internal indices (i.e., physical addresses for each BRAM). Weight matrices in LSTMs usually have the same size, so we use a unified N in pruning and configure N as the number of BRAMs in the PVB. However, for some LSTMs that have weight matrices of different sizes, different Ns are selected in pruning to find an optimal sparsity, and the largest N is configured as the number of BRAMs in the PVB.

[Figure 6: Banked private vector buffer. The dense vector is split across N independently accessible BRAMs, and the bank internal indices serve directly as the per-BRAM read addresses for the accessed vector elements.]

In some studies, banking is adopted to support random memory accesses at high memory bandwidth [5, 31]. However, due to the irregularity of data accesses, banked memory cannot handle imbalanced workloads across banks and concurrent access requests to the same BRAM; addressing these issues requires additional logic and clock cycles [5, 31]. The biggest difference in our banked private vector buffer is that balanced memory access requests and the absence of memory access conflicts are automatically guaranteed by the intrinsic bank-balanced property of BBS: the SpMxV PE accesses one and only one element in each BRAM per cycle.

Before a SpMxV operation, the vector to be multiplied needs to be duplicated in each PE's private vector buffer to exploit inter-row parallelism. This brings new challenges. First, broadcasting vector elements to the various PEs leads to high fan-out and thus a low achievable clock frequency; we use a systolic array structure to achieve a high clock frequency, similar to [22]. Second, duplication adds access latency; we double-buffer the private vector buffer to pipeline data transfer and computation.

5.4 EWOP Unit

The EWOP unit performs various element-wise operations on vectors based on the instruction opcode. Vector addition and multiplication generate one result vector by reading two source vectors. Activation functions read one source vector and apply a non-linear function to it to generate one result vector. The EWOP unit contains M operators operating in parallel for each kind of operation to reduce latency.

5.5 Controller

In the computation flow of LSTM, some SpMxV operations and EWOP operations among different gates can be performed simultaneously. The software compiler analyzes the dependencies and attaches the dependency information to the instructions. The controller parallelizes instructions according to the dependencies indicated by the software compiler. When the SpMxV unit or the EWOP unit is idle (which means an instruction has finished), the controller checks whether the next instruction depends on the instruction being executed on the other unit. If not, the controller dispatches the next instruction to the idle unit, so that the SpMxV unit and the EWOP unit can work simultaneously.

6 EVALUATION

Our evaluation centers around two aspects: the model accuracy of BBS and the hardware efficiency of the BBS accelerator.

6.1 Experimental Setup

We implemented the BBS accelerator in SystemVerilog, synthesized it with Quartus Prime 17.1, and evaluated it on a custom FPGA PCIe card with an Intel Arria 10 FPGA [3]. The FPGA has 4 GB of DDR3-1600 DRAM external memory. The host CPU is an Intel Xeon E5-2650 processor, which is only responsible for data pre-processing and result collection. The FPGA communicates with the host CPU through a PCIe Gen3 x8 bus, which supports up to 16 GB/s of bidirectional bandwidth.

We evaluate the system with an LSTM language model on the PTB dataset [18] and an LSTM speech recognition model on the TIMIT dataset [7]. The PTB dataset is widely used in Natural Language Processing (NLP) research. It consists of 929k training words, 73k validation words, and 82k test words, and it has a 10k-word vocabulary. We adopt the LSTM model in [28], which achieves very good quality on the PTB dataset; the small model has 200 hidden units per layer, the medium one has 650, and the large one has 1,500. The TIMIT corpus is designed to provide speech data for acoustic-phonetic studies. It contains broadband recordings of 630 speakers of eight major dialects of American English, each reading ten phonetically rich sentences. For the LSTM speech recognition model, we set the input size to 153, the hidden size to 1024, and the projection size to 512, which is consistent with previous studies [8, 21].

6.2 BBS Model Accuracy

6.2.1 Comparison with Unstructured Sparsity and Block Sparsity. We first evaluate the model accuracy of BBS and compare it with unstructured sparsity and block sparsity. Figure 7 and Figure 8 show the sparsity-accuracy trade-off results of various sparsity patterns on the PTB and TIMIT data sets, respectively. We use 64 banks in BBS and 4 × 4 blocks in block sparsity. For experiments with the LSTM language model, we use the large model with a hidden size of 1,500. Perplexity is a metric to quantify language model quality (lower is better). As shown in Figure 7, the perplexity curve of BBS is very close to the perplexity curve of unstructured sparsity. Both unstructured sparsity and BBS preserve the perplexity until 80% of the weights are pruned away.


These two patterns even achieve slightly better model accuracy at around 60% sparsity compared to the dense baseline. The perplexity of block sparsity starts to increase significantly at 40% sparsity. Experiments on the LSTM speech recognition model show similar results (Figure 8): BBS and unstructured sparsity can achieve 90% sparsity without accuracy loss, while block sparsity can only achieve 70% sparsity. These experimental results demonstrate that BBS has almost the same effectiveness as unstructured sparsity and outperforms block sparsity in terms of achievable accuracy or sparsity during pruning.

[Figure 7: Sparsity-Perplexity trade-off of various sparsity patterns (Block Sparsity, BBS, Unstructured Sparsity, Dense Baseline) on the PTB dataset; lower perplexity is better.]

[Figure 8: Sparsity - Phone Error Rate trade-off of various sparsity patterns (Block Sparsity, BBS, Unstructured Sparsity, Dense Baseline) on the TIMIT dataset; lower is better.]

6.2.2 Sensitivity to Bank Size. We further explore the accuracy sensitivity of BBS to the bank size. As a comparison, we also explore the accuracy sensitivity of block sparsity to the block size. Table 2 shows the model accuracy at varying block/bank sizes for the large LSTM language model. As shown, BBS achieves almost the same model accuracy regardless of the bank size. For block sparsity, however, increasing the block size adversely affects model accuracy.

6.2.3 Quantization on the Pruned Model. Quantization can achieve a higher compression rate and hardware efficiency for deep learning models by reducing the number of bits that represent a weight [9, 15]. In this work, we study the accuracy sensitivity of BBS to the quantization bit width. We apply linear quantization to LSTM models after bank-balanced pruning with 16-bit, 8-bit, and 4-bit fixed points. Both weights and activations are quantized.

Table 2: Perplexity sensitivity to the block size in block sparsity and the bank size in BBS.

Model                                  Perplexity at Sparsity
                                       60%      70%      80%
Block Sparsity   block size: 4×4       80.6     83.2     88.1
                 block size: 8×8       82.4     86.4     95.2
                 block size: 16×16     83.7     88.3     99.5
BBS              bank size: 25         78.3     78.6     79.4
                 bank size: 50         78.4     78.7     79.2
                 bank size: 100        78.4     78.6     79.2

Table 3 shows the effects of quantization at different bit widths on the large LSTM language model after bank-balanced pruning. The perplexity is 78.8 for the original dense model and increases slightly to 79.2 after pruning away 80% of the weights with BBS. 16-bit quantization on the pruned model maintains the same perplexity, while more aggressive quantization deteriorates perplexity.

Table 3: Language model perplexity after quantization at different bit widths.

Quantization Scheme      Perplexity
float-32 dense model     78.8
float-32 BBS model       79.2
fixed-16 BBS model       79.2
fixed-8 BBS model        79.8
fixed-4 BBS model        143.1
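The linear quantization referred to here can be sketched as a symmetric uniform fixed-point quantizer (the exact scaling used in the paper is not specified, so the per-tensor max scaling below is an assumption):

```python
import numpy as np

def linear_quantize(x, num_bits):
    """Symmetric linear (uniform) quantization to `num_bits`-bit fixed point,
    returning de-quantized values for accuracy evaluation."""
    qmax = 2 ** (num_bits - 1) - 1                      # e.g., 32767 for 16-bit
    scale = np.max(np.abs(x)) / qmax if np.any(x) else 1.0
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)   # integer codes
    return q * scale
```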

6.3 BBS Accelerator Efficiency

6.3.1 Resource Utilization, Clock Rate and Power Consumption. Table 4 shows the resource utilization, clock rate and power consumption of our BBS accelerator. The reported results are based on post-fit results from Quartus Prime 17.1. The operator width (i.e., data precision) is 16-bit, since 16 bits are accurate enough to maintain model accuracy. The BBS accelerator is configured with M = 64 and N = 64; thus the accelerator contains 64 PEs in the SpMxV unit, and each PE has 64 multipliers executing in parallel. The Intel Arria 10 FPGA contains 1518 DSPs, which can implement 3036 multipliers. The LSTM accelerator fully utilizes the DSPs for multipliers and uses additional ALMs for the extra multipliers. We use M20Ks for the matrix memory and MLABs for the private vector buffer, because the latter consists of relatively small memories that require independently accessible ports.

Table 4: Resource utilization, clock rate and power consumption.

ALMs (%)       M20Ks (%)     DSPs (%)
289k (68%)     2509 (92%)    1518 (100%)

Clock Rate (MHz)    Power (Watt)
200                 19.1


6.3.2 Latency and Throughput. Our accelerator is highly efficient even with a batch size of 1, so we measure the latency of the BBS accelerator without batching and calculate the corresponding throughput. For the small, medium and large LSTM language models on the PTB data set, we also use three different numbers of banks (16, 32, 64) to prune the models. Pruning away 80% of the weights has no effect on model accuracy. Table 5 shows the latency of one LSTM layer and its corresponding throughput. The achievable performance increases as the model scale or the number of banks increases because of higher utilization of the underlying PEs. In the case of the large model with 1,500 hidden units and 64 banks in matrix partitioning, our accelerator takes 4.8 us to finish a whole LSTM layer, corresponding to 750.9 GOPS at a batch size of one.

Table 5: Latency and throughput results of running LSTM language networks of various scales and various numbers of banks.

LSTM hidden size    Num of Banks    Latency (us)    Throughput (GOPS)
200 (small)         16              1.7             37.3
                    32              1.4             43.4
                    64              1.3             47.4
650 (medium)        16              4.3             158.8
                    32              2.8             238.0
                    64              2.1             318.5
1500 (large)        16              13.9            257.7
                    32              7.8             458.5
                    64              4.8             750.9

6.3.3 Comparison with State-of-the-art LSTM Accelerators. We compare the performance of our BBS accelerator with three state-of-the-art LSTM accelerators on FPGA: ESE [8], C-LSTM [21] and DeltaRNN [6]. These three studies adopt different optimization techniques to reduce computation requirements. ESE [8] uses a weight-pruning-based compression technique and improves inference efficiency by batching multiple samples, but lacks optimization of irregular memory accesses to reduce latency for a single-batch request. C-LSTM [21] represents weight matrices with block-circulant matrices and proposes an accelerator with an FFT-based computing kernel. DeltaRNN [6] uses the delta network algorithm to reduce MxV operations and the corresponding weight fetches by skipping dispensable neuron activation changes below a threshold.

Table 6 shows the comparison results. We apply BBS to the same LSTM model on the TIMIT dataset as ESE and C-LSTM adopt. We use the accuracy and performance numbers of ESE, C-LSTM and DeltaRNN reported in their papers. The performance numbers of DeltaRNN are based on a GRU, which is an optimistic estimate because a GRU is simpler than an LSTM. With the same model on the same data set, BBS achieves a comparable compression rate and model accuracy to ESE and C-LSTM, while our BBS accelerator achieves 2.3x and 3.7x improvements in energy efficiency, and 34.4x and 7.0x speedups in latency (or throughput at a batch size of one), compared to ESE and C-LSTM, respectively. The reason the BBS accelerator achieves better single-batch performance than ESE is that it enables an extra dimension of parallelism and addresses the low memory bandwidth caused by irregular memory accesses in SpMxV.

7 RELATED WORK

Network Compression. Network compression can reduce the memory and computation requirements of a neural network, increase its inference speed and save energy [9]. Compression algorithms mainly include pruning [10], sparsity-inducing regularization [23] and quantization [15]. Based on the original sparsity methods, further studies propose structured sparsity methods by adding constraints on the locality of non-zero weights [17, 19, 26]. Structured sparsity is more amenable to hardware acceleration than unstructured sparsity.

DNN accelerators. Hardware acceleration of DNNs has received significant attention from both industry and academia [4, 12, 16, 29]. Due to the widely adopted pruning-based compression techniques, many accelerators for sparse neural networks have been proposed [8, 21, 30]. These works explore specialized sparse matrix multiplication modules that directly operate on sparse neural networks. Although these accelerators achieve higher performance than general processors, the irregular computation and memory accesses in sparse neural networks still restrict the maximum parallelism achievable on customized accelerators.

SpMxV accelerators. SpMxV is the most computation-intensive and memory-intensive part of LSTM inference. Many FPGA and GPU accelerators for SpMxV have been proposed [2, 5]. However, SpMxV is hard to optimize due to its irregular memory access characteristics. By contrast, neural network pruning methods bring a restricted freedom to define the sparsity structure (e.g., hardware-friendly sparsity) of weight matrices. BBS is a structured sparsity pattern that increases hardware efficiency while incurring negligible loss of model accuracy.

8 CONCLUSION

This paper proposes a novel sparsity pattern, BBS (Bank-Balanced Sparsity), that achieves both high model accuracy when pruning LSTM and high hardware efficiency on FPGA. Our insight in designing BBS is to partition weight matrix rows into banks for parallel computing and to adopt fine-grained pruning inside each bank to maintain model accuracy. Evaluated on speech recognition and language model tasks, BBS achieves the same model accuracy as purely unstructured sparsity at various sparsity levels. Our BBS accelerator on FPGA takes advantage of the intrinsic bank-balanced property of BBS, achieving high efficiency even for a batch size of 1. Compared to state-of-the-art FPGA accelerators for LSTM with different compression techniques, the BBS accelerator achieves a 2.3x~3.7x improvement in energy efficiency and a 7.0x~34.4x reduction in latency with negligible loss of model accuracy.

9 ACKNOWLEDGEMENTS

We would like to thank Ningyi Xu, Wenqiang Wang, Bojie Li and Yun Wang for all the technical discussions and valuable suggestions on improving this paper. We thank the anonymous reviewers for their insightful feedback and comments. Shijie Cao was partly supported by the National Natural Science Foundation of China (No. 61772159).


Table 6: Speedup comparison with state-of-the-art LSTM accelerators

                                         ESE [8]     C-LSTM [21]   DeltaRNN [6]   Ours
Platform                                 XCKU060     Virtex-7      XC7Z100        Arria 10 GX1150
Frequency (MHz)                          200         200           125            200
Sparsity (%)                             88.7        87.5          -              87.5
Quantization                             fixed-12    fixed-16      fixed-16       fixed-16
Accuracy Degradation                     0.30%       0.32%         -              0.25%
Throughput (GOPS)                        282.2       131.1         192.0          304.1
Power (W)                                41.0        22.0          7.3            19.1
Energy Efficiency (GOPS/W)               6.9         6.0           26.3           15.9
Latency (us)                             82.7        16.7          -              2.4
Throughput at batch 1 (GOPS)             8.8         43.7          192.0          304.1
Effective Throughput at batch 1 (GOPS)   79.2        349.6         1198.0         2432.8

REFERENCES

[1] 2018. Sparse Matrix Formats. https://docs.scipy.org/doc/scipy/reference/sparse.html/. (2018).
[2] Nathan Bell and Michael Garland. 2008. Efficient sparse matrix-vector multiplication on CUDA. Technical Report NVR-2008-004, Nvidia Corporation.
[3] Adrian M Caulfield, Eric S Chung, Andrew Putnam, Hari Angepat, Jeremy Fowers, Michael Haselman, Stephen Heil, Matt Humphrey, Puneet Kaur, Joo-Young Kim, and others. 2016. A cloud-scale acceleration architecture. In The 49th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE Press, 7.
[4] Jeremy Fowers, Kalin Ovtcharov, Michael Papamichael, Todd Massengill, Ming Liu, Daniel Lo, Shlomi Alkalay, Michael Haselman, Logan Adams, Mahdi Ghandi, and others. 2018. A Configurable Cloud-Scale DNN Processor for Real-Time AI. In 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA). IEEE.
[5] Jeremy Fowers, Kalin Ovtcharov, Karin Strauss, Eric S Chung, and Greg Stitt. 2014. A high memory bandwidth FPGA accelerator for sparse matrix-vector multiplication. In Field-Programmable Custom Computing Machines (FCCM), 2014 IEEE 22nd Annual International Symposium on. IEEE, 36–43.
[6] Chang Gao, Daniel Neil, Enea Ceolini, Shih-Chii Liu, and Tobi Delbruck. 2018. DeltaRNN: A Power-efficient Recurrent Neural Network Accelerator. In Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, 21–30.
[7] John S Garofolo, Lori F Lamel, William M Fisher, Jonathan G Fiscus, and David S Pallett. 1993. DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM. NIST speech disc 1-1.1. NASA STI/Recon technical report n 93 (1993).
[8] Song Han, Junlong Kang, Huizi Mao, Yiming Hu, Xin Li, Yubin Li, Dongliang Xie, Hong Luo, Song Yao, Yu Wang, and others. 2017. ESE: Efficient speech recognition engine with sparse LSTM on FPGA. In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, 75–84.
[9] Song Han, Huizi Mao, and William J Dally. 2015. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149 (2015).
[10] Song Han, Jeff Pool, John Tran, and William Dally. 2015. Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems. 1135–1143.
[11] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9, 8 (1997), 1735–1780.
[12] Norman P Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, and others. 2017. In-datacenter performance analysis of a tensor processing unit. In Proceedings of the 44th Annual International Symposium on Computer Architecture. ACM, 1–12.
[13] Nal Kalchbrenner, Erich Elsen, Karen Simonyan, Seb Noury, Norman Casagrande, Edward Lockhart, Florian Stimberg, Aaron van den Oord, Sander Dieleman, and Koray Kavukcuoglu. 2018. Efficient Neural Audio Synthesis. arXiv preprint arXiv:1802.08435 (2018).
[14] Charles Eric LaForest, Ming G Liu, Emma Rae Rapati, and J Gregory Steffan. 2012. Multi-ported memories for FPGAs via XOR. In Proceedings of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays. ACM, 209–218.
[15] Darryl Lin, Sachin Talathi, and Sreekanth Annapureddy. 2016. Fixed point quantization of deep convolutional networks. In International Conference on Machine Learning. 2849–2858.
[16] Shaoli Liu, Zidong Du, Jinhua Tao, Dong Han, Tao Luo, Yuan Xie, Yunji Chen, and Tianshi Chen. 2016. Cambricon: An instruction set architecture for neural networks. In ACM SIGARCH Computer Architecture News, Vol. 44. IEEE Press, 393–405.
[17] Huizi Mao, Song Han, Jeff Pool, Wenshuo Li, Xingyu Liu, Yu Wang, and William J Dally. 2017. Exploring the regularity of sparse structure in convolutional neural networks. arXiv preprint arXiv:1705.08922 (2017).
[18] Mitchell Marcus, Beatrice Santorini, Mary Ann Marcinkiewicz, and Ann Taylor. 1999. Treebank-3 LDC99T42. CD-ROM. Philadelphia, Penn.: Linguistic Data Consortium (1999).
[19] Sharan Narang, Eric Undersander, and Gregory Diamos. 2017. Block-Sparse Recurrent Neural Networks. arXiv preprint arXiv:1711.02782 (2017).
[20] Haşim Sak, Andrew Senior, and Françoise Beaufays. 2014. Long short-term memory recurrent neural network architectures for large scale acoustic modeling. In Fifteenth Annual Conference of the International Speech Communication Association.
[21] Shuo Wang, Zhe Li, Caiwen Ding, Bo Yuan, Qinru Qiu, Yanzhi Wang, and Yun Liang. 2018. C-LSTM: Enabling Efficient LSTM using Structured Compression Techniques on FPGAs. In Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, 11–20.
[22] Xuechao Wei, Cody Hao Yu, Peng Zhang, Youxiang Chen, Yuxin Wang, Han Hu, Yun Liang, and Jason Cong. 2017. Automated systolic array architecture synthesis for high throughput CNN inference on FPGAs. In Design Automation Conference (DAC), 2017 54th ACM/EDAC/IEEE. IEEE, 1–6.
[23] Wei Wen, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. 2016. Learning structured sparsity in deep neural networks. In Advances in Neural Information Processing Systems. 2074–2082.
[24] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, and others. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144 (2016).
[25] Hasan Erdem Yantir, Salih Bayar, and Arda Yurdakul. 2013. Efficient implementations of multi-pumped multi-port register files in FPGAs. In Digital System Design (DSD), 2013 Euromicro Conference on. IEEE, 185–192.
[26] Zhuliang Yao, Shijie Cao, and Wencong Xiao. 2018. Balanced Sparsity for Efficient DNN Inference on GPU. arXiv preprint arXiv:1811.00206 (2018).
[27] Jiecao Yu, Andrew Lukefahr, David Palframan, Ganesh Dasika, Reetuparna Das, and Scott Mahlke. 2017. Scalpel: Customizing DNN pruning to the underlying hardware parallelism. In Proceedings of the 44th Annual International Symposium on Computer Architecture. ACM, 548–560.
[28] Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. 2014. Recurrent neural network regularization. arXiv preprint arXiv:1409.2329 (2014).
[29] Chen Zhang, Peng Li, Guangyu Sun, Yijin Guan, Bingjun Xiao, and Jason Cong. 2015. Optimizing FPGA-based accelerator design for deep convolutional neural networks. In Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, 161–170.
[30] Shijin Zhang, Zidong Du, Lei Zhang, Huiying Lan, Shaoli Liu, Ling Li, Qi Guo, Tianshi Chen, and Yunji Chen. 2016. Cambricon-X: An accelerator for sparse neural networks. In Microarchitecture (MICRO), 2016 49th Annual IEEE/ACM International Symposium on. IEEE, 1–12.
[31] Shijie Zhou, Rajgopal Kannan, Yu Min, and Viktor K Prasanna. 2018. FASTCF: FPGA-based Accelerator for STochastic-Gradient-Descent-based Collaborative Filtering. In Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, 259–268.