

Enabling Secure NVM-Based in-Memory Neural Network Computing by Sparse Fast Gradient Encryption

Yi Cai, Xiaoming Chen, Member, IEEE, Lu Tian, Yu Wang, Senior Member, IEEE, and Huazhong Yang, Fellow, IEEE

Abstract—Neural network (NN) computing is energy-consuming on traditional computing systems, owing to the inherent memory-wall bottleneck of the von Neumann architecture and Moore's Law approaching its end. Non-volatile memories (NVMs) have been demonstrated as promising alternatives for constructing computing-in-memory (CIM) systems to accelerate NN computing. However, NVM-based NN computing systems are vulnerable to confidentiality attacks because the weight parameters persist in memory when the system is powered off, enabling an adversary with physical access to extract the well-trained NN models. The goal of this article is to find a solution for thwarting these confidentiality attacks. We define and model the weight encryption problem. Then we propose an effective framework, containing a sparse fast gradient encryption (SFGE) method and a runtime encryption scheduling (RES) scheme, to guarantee the confidentiality of NN models with negligible performance overhead. Moreover, we improve the SFGE method by incrementally generating the encryption keys. Additionally, we provide variants of the encryption method to better fit quantized models and various mapping strategies. The experiments demonstrate that by encrypting only an extremely small proportion of the weights (e.g., 20 weights per layer in ResNet-101), the NN models can be strictly protected.

Index Terms—Non-volatile memory (NVM), compute-in-memory (CIM), neural network, security, encryption


1 INTRODUCTION

Deep learning has recently made significant advances in the field of artificial intelligence (AI) [1]. The growing computing capability gives more opportunities for the development of neural networks (NNs). However, the trend toward widening and deepening NN architectures has put tremendous pressure on computing hardware. The conventional von Neumann architecture is constrained by the inherent memory-wall bottleneck, i.e., it spends substantial time and energy moving data between the memory and the processors. Moreover, Moore's Law is approaching its end [2], restricting further optimization of CMOS technologies. Thus, many researchers have turned their attention to emerging devices and architectures.

Non-volatile memories (NVMs), e.g., resistive random-access memory (RRAM) and phase change memory (PCM), have emerged as promising alternatives for constructing future NN accelerators [3], [4]. Among the advantages of NVMs, non-volatility allows a system to restore quickly from hibernation and to tolerate power failures. The high density and low leakage power of NVMs also provide larger capacity and lower power consumption. Most importantly, NVMs can be organized into crossbars that perform matrix-vector multiplications at the location of the memory, avoiding data movement [5]; this is referred to as computing-in-memory (CIM). Many studies have explored CIM-based architectures for both NN inference [4] and training [6].

Despite the desirable characteristics of NVMs, there are also significant disadvantages and security vulnerabilities in CIM-based NN computing systems. One disadvantage is that the data persist in the memory even when the system is powered off, creating a security risk of leaking NN models. An adversary with physical access to the devices can simply read the memory and extract the weight parameters of the NN models even without powering up the system [7]. Another disadvantage is that many NVMs have disappointing endurance, typically ranging from 10^6 to 10^10 write cycles [8], [9]. Therefore, NVM-based systems are vulnerable to frequent and massive write operations. Even under normal use, the lifetime of NVM-based memory and/or computing systems rarely reaches expectations [10], [11].

The risk of confidentiality leakage motivates data encryption. There are two general encryption scheduling approaches to protect confidentiality. One approach is bulk encryption, which encrypts the entire memory when the system is powered down and decrypts everything when the system resumes working. However, such an approach incurs large energy overhead and long encryption/decryption latency. Another approach is incremental encryption, which argues that the amount of data involved in an application is much smaller than the entire data set, so only a small percentage of the memory needs to be decrypted when the program runs, and most inert data can be kept encrypted [7]. However, the computation of an NN requires all the weight parameters because the inputs are propagated through all the layers. Thus, the system needs to decrypt all the weights before starting work, and encrypt them again after the work completes. Taking the VGG-16 network [12] as an example, approximately 138M parameters need to be encrypted/decrypted. Because each encryption or decryption writes the NVM cells to tune the conductance and change the stored weight values, bulk encryption of all weights incurs intensive write operations. Such tremendous write traffic is unacceptable and also challenges the endurance. Therefore, there must be a solution that substantially reduces the complexity and amount of weight encryption.

Yi Cai, Yu Wang, and Huazhong Yang are with the Department of Electronic Engineering, Tsinghua University, and the Beijing National Research Center for Information Science and Technology (BNRist), Beijing 100084, China. E-mail: [email protected], {yu-wang, yanghz}@tsinghua.edu.cn.
Xiaoming Chen is with the State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100864, China. E-mail: [email protected].
Lu Tian is with Xilinx Inc., Beijing, China. E-mail: [email protected].

Manuscript received 15 Jan. 2020; revised 5 June 2020; accepted 26 July 2020. Date of publication 19 Aug. 2020; date of current version 8 Oct. 2020. (Corresponding author: Yu Wang.) Digital Object Identifier no. 10.1109/TC.2020.3017870

1596 IEEE TRANSACTIONS ON COMPUTERS, VOL. 69, NO. 11, NOVEMBER 2020
0018-9340 © 2020 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See https://www.ieee.org/publications/rights/index.html for more information.
Authorized licensed use limited to: Tsinghua University. Downloaded on October 08, 2020 at 09:01:39 UTC from IEEE Xplore. Restrictions apply.

Some researchers have designed encryption techniques to thwart data confidentiality attacks on NN weights. For instance, at the hardware level, P3M [13] has been proposed based on physical unclonable functions (PUFs), aiming to protect NN models in edge accelerators embedded with eDRAMs. With this approach, only an authorized device can decrypt the model and make it work normally. However, two drawbacks prevent P3M from being transferred to NVM-based NN accelerators: 1) P3M is dedicated to eDRAM-based accelerators; and 2) the encryption/decryption still operates on all weights. At the algorithm level, encryption methods such as homomorphic encryption [14] have also been proposed to protect the privacy of NN models. However, these methods are inappropriate for normal NNs and NVM-based accelerators, and usually come with high complexity.

The goal of this work is to find an efficient solution for protecting CIM- and NVM-based NN accelerators from the vulnerability of lingering NN models. The contributions are summarized as follows.

- We analyze the principles of designing the protective solution. To search for optimal solutions, we also define and model the weight encryption problem.

- We propose a sparse fast gradient encryption (SFGE) method for encrypting the weights with negligible overhead to strictly protect their confidentiality.

- We introduce incremental SFGE (i-SFGE) to achieve more efficient encryption. We also specify the approach and provide variants for adapting to quantized models and different mapping strategies.

- We propose a runtime encryption scheduling (RES) method that disperses the encryption/decryption times of different layers, to ensure the security of NN models at all times and hide the latency.

- We propose an efficient and robust protection framework for thwarting NN confidentiality attacks based on SFGE and RES. Thorough experiments demonstrate that encrypting only an extremely small proportion of the weights can prevent attackers from obtaining the NN models.

2 PRELIMINARIES AND RELATED WORK

2.1 NVM-Based Neural Computing

An NVM cell has multiple conductance levels, and multiple cells can construct crossbar arrays. As shown in Fig. 1, by mapping a matrix onto the conductances of the cells in the crossbar and a vector onto the input voltages, the crossbar can perform matrix-vector multiplications (MVMs) with extremely high parallelism, without any movement of the matrix data. Assuming a crossbar of size $R \times C$, according to Ohm's Law and Kirchhoff's Current Law, the relationship between the input voltages and the output currents can be formulated as

$i_{\mathrm{out}}(c) = \sum_{r=1}^{R} g(r, c) \cdot v_{\mathrm{in}}(r)$,

where $v_{\mathrm{in}}$ denotes the input voltage vector (indexed by $r = 1, 2, \ldots, R$), $i_{\mathrm{out}}$ denotes the output current vector (indexed by $c = 1, 2, \ldots, C$), and $g(r, c)$ denotes the matrix entry (i.e., the conductance of the cell) located in the $r$th row and $c$th column of the crossbar [5].

MVMs dominate the operations in NN computing because both convolutional and fully-connected layers can be decomposed into multiple MVMs. For a fully-connected layer, the computation is exactly an MVM. For a convolutional layer, the weight kernel can be unfolded and the computation decomposed into multiple MVMs. By mapping the weights onto the conductances of the cells and the feature maps onto the input voltages, NVM crossbars can efficiently implement the operations in an NN. This opens tremendous opportunities for NN acceleration using NVM crossbars.
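As a concrete illustration, the crossbar MVM formula above and the conv-to-MVM unfolding can be written as a minimal NumPy sketch (function names are ours, not the paper's):

```python
import numpy as np

def crossbar_mvm(g, v_in):
    """Analog MVM on an R x C crossbar: i_out(c) = sum_r g(r, c) * v_in(r).

    g    : (R, C) conductance matrix holding the mapped weights
    v_in : (R,) input voltage vector
    Returns the (C,) output current vector (Ohm's + Kirchhoff's laws).
    """
    return v_in @ g  # one crossbar read performs the whole MVM

def conv_as_mvm(kernel, patches):
    """Unfold a convolution into MVMs, as described in the text.

    kernel  : (C_out, C_in, K, K) convolution weights
    patches : (N, C_in * K * K) im2col-unfolded input patches
    Each patch vector times the flattened kernel matrix is one MVM.
    """
    w = kernel.reshape(kernel.shape[0], -1).T   # (C_in*K*K, C_out)
    return patches @ w                          # N MVMs on the crossbar
```

On hardware, `v_in @ g` happens in a single analog step; the sketch only mirrors the arithmetic the crossbar realizes physically.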

2.2 AI Hardware Attacks

With AI's real-world deployment, the security of AI has attracted increasing attention, especially in safety-critical applications, e.g., autonomous driving. On the algorithm side, numerous studies have explored adversarial examples to fool NNs and discussed corresponding countermeasures [15], [16]. On the hardware side, there are also studies that discover vulnerabilities in AI hardware, mainly covering two categories of threat models: integrity attacks and confidentiality attacks.

The integrity attacks aim at undermining the integrity of the deployed models and rendering them unavailable. Hardware Trojans can be injected into the hardware to achieve this goal. Hu-Fu [17] injects hidden neurons into the network that work as a backdoor; when the backdoor is triggered, the hidden neurons are activated and the NN models output wrong predictions. Similarly, [18] designs a memory Trojan and a trigger based on carefully designed input images. When the trigger is detected, the memory controller activates the payload and produces wrong outputs. Meanwhile, fault injection can be utilized to make NN models misclassify by injecting faults into the memory via laser beams or row-hammer attacks [19].

Fig. 1. The structure of the NVM crossbar and neural network mapping, including (a) a semantic diagram of neurons and connections in neural networks; (b) mapping the neurons and connection weights onto the NVM crossbar.

CAI ET AL.: ENABLING SECURE NVM-BASED IN-MEMORY NEURAL NETWORK COMPUTING BY SPARSE FAST GRADIENT ENCRYPTION 1597

The confidentiality attacks aim at extracting the deployed NN models on a variety of hardware accelerators. Side-channel attacks (SCAs) are frequently utilized to obtain NN architectures. For example, Hua et al. [20] utilize the memory access pattern during NN inference to reveal the network structure. DeepSniffer [21] is the first work that explicitly and quantitatively evaluates model extraction; it demonstrates on GPU platforms that the extracted models significantly boost adversarial attack effectiveness, sounding an alarm for model protection. Similar approaches have also emerged to reverse-engineer NN architectures by counting GEMM calls via cache SCA [22], observing the patterns and timing of operations [23], etc.

The confidentiality attacks can also be conducted by exploiting data persistence. In 2003, MIT researchers acquired 158 discarded hard disks from eBay and successfully recovered old data from 117 of the 158 disks [24]. Later, another group of researchers found that, through a cold boot, data stored in DRAM could be retained for several minutes; within such a short time, attackers are able to dump the DRAM and extract private information [25]. These studies demonstrate that data remanence leads to severe security vulnerabilities.

Protecting the confidentiality of neural network models is of paramount importance, as the leakage of NN models leads to severe consequences. First, from a business perspective, NN models are treated as core intellectual property by the companies that run the algorithms as key products. If the deployed NN models are cracked, attackers are able to duplicate the models without authorization, which greatly harms the interests of model providers. Second, well-trained models are usually trained on private and sensitive data; confidential information may therefore be encoded into the NN models, and attackers may recover sensitive information if they obtain the models. Third, adversarial example attacks [26] have been demonstrated to pose a huge risk to neural network security. If attackers get the exact architectures and parameters of the NN models, they can launch white-box adversarial attacks to force the NN to make wrong decisions, which is much easier than black-box attacks. Therefore, confidentiality protection techniques are in high demand for NVM-based NN computing systems.

2.3 Compensation for the NVM Vulnerabilities

Recall that the vulnerabilities of NVMs mainly include the risk of data leakage and limited endurance. Previous studies have tried to compensate for these vulnerabilities.

An active research area of NVM is dealing with the limited endurance problem, with two main types of solutions. One type reduces the write frequency, such as the Flip-N-Write scheme [27]. The other type uses wear-leveling techniques to make writes uniform across the entire memory, e.g., the Start-Gap wear-leveling scheme [11]. There are also studies aiming at the NN application, in particular NN training. For example, the work in [10] proposes a structured gradient sparsification (SGS) scheme that reduces write frequency, together with an aging-aware row swapping (ARS) method for wear leveling. These approaches provide satisfactory compensation for the limited endurance of NVM under normal use. In addition, there are also approaches that protect NVM systems from malicious wearing-out attacks [11], [28]. So far, limited endurance is not a major vulnerability threatening NVM-based system security.

There have also been many attempts at protecting the lingering confidential data in NVMs. However, to the best of our knowledge, most prior approaches are proposed for main-memory applications and are not adapted to the CIM architecture and the NN application. Encryption techniques are widely applied to protect confidential data remaining in NVM. Generally, an encryption solution ought to include three basic components, controlling how, when, and where to encrypt, respectively. For the first component, typical approaches include the advanced encryption standard (AES) [29] and counter mode encryption (CME) [30], which ensure that only the owners holding the secret keys can access the plaintext data. For the last two components, as introduced before, bulk and incremental [7] encryption methods give the location and timing of the encryption. However, as NN computation involves heavy weights, the above conventional encryption approaches incur tremendous performance overhead. Therefore, the amount of encrypted weights should be substantially reduced. As shown in Fig. 2, the goal is to encrypt the fewest possible weights that disable the NN model. Then, even if the weights are stolen, the attackers obtain only a bunch of meaningless numbers.
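To make the counter-mode idea concrete, the following is a minimal conceptual sketch. Note the assumption: SHA-256 over (key || counter) stands in for the AES block cipher that a real CME engine would use, so this illustrates only the scheduling structure of CTR mode, not a production cipher:

```python
import hashlib

def ctr_keystream(key: bytes, counter0: int, nbytes: int) -> bytes:
    """Counter-mode keystream: run a keyed block function over successive
    counter values and concatenate the outputs. SHA-256(key || counter)
    stands in here for the AES block encryption of a real CME design."""
    out, ctr = b"", counter0
    while len(out) < nbytes:
        out += hashlib.sha256(key + ctr.to_bytes(16, "big")).digest()
        ctr += 1
    return out[:nbytes]

def cme_xor(data: bytes, key: bytes, counter0: int) -> bytes:
    """Encryption and decryption are the same XOR with the keystream,
    which is why CTR-style schemes allow cheap random-access decryption."""
    ks = ctr_keystream(key, counter0, len(data))
    return bytes(a ^ b for a, b in zip(data, ks))
```

Because the keystream depends only on (key, counter), any memory block can be decrypted independently, which is what makes incremental encryption practical.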

3 ATTACK MODEL

3.1 Goal of Security Protection

Owing to the data persistence of NVM, an adversary with physical access to a CIM system can extract the NN weights and infer the architecture by bypassing the OS protection and physically reading the memory [7], [20]. Thus, the attack model in this work is a confidentiality attack on the NN weights. With AI devices becoming increasingly ubiquitous and mobile, attackers have many opportunities to obtain physical access to the CIM hardware. Therefore, it is necessary to find solutions for protecting the confidential NN weights. The goal is to thwart the threat of leaking the deployed NN models at negligible overhead.

Fig. 2. The goal of the confidentiality protection for NN models. When the weights are mapped onto the NVM crossbars in plaintext form, the system is vulnerable to confidentiality attacks. Our goal is to encrypt the fewest possible weights that make the NN model misclassify. Note that each circle represents a connection weight in the figure.

Algorithm 1. Exhaustive Analysis of Channel-Wise Encryption

Input: Validation dataset X
Input: Encryption function Encrypt
Input: Neural network weights W
1: W_back ← W
2: L ← W.Layer_num
3: for l = 1, 2, ..., L do
4:   C ← W(l).Channel_num
5:   for c = 1, 2, ..., C do
6:     W ← W_back
7:     W(l, c) ← Encrypt(W(l, c))
8:     Validate the encrypted model
9:   end for
10: end for
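Algorithm 1 can be sketched in plain Python as follows; `encrypt_channel` and `validate` are hypothetical stand-ins for the encryption function and the validation round:

```python
import copy

def channel_sensitivity(weights, encrypt_channel, validate):
    """Exhaustive channel-wise sensitivity analysis (Algorithm 1).

    weights         : list of per-layer containers, layer l holding C_l channels
    encrypt_channel : returns an encrypted version of one channel
                      (e.g., encrypt-at-0 / encrypt-at-1 / random)
    validate        : maps a weight set to validation accuracy
    Returns {(layer, channel): accuracy with only that channel encrypted},
    i.e., each channel's sensitivity.
    """
    backup = copy.deepcopy(weights)          # W_back <- W
    scores = {}
    for l, layer in enumerate(backup):       # for l = 1..L
        for c in range(len(layer)):          # for c = 1..C_l
            w = copy.deepcopy(backup)        # W <- W_back (restore)
            w[l][c] = encrypt_channel(w[l][c])
            scores[(l, c)] = validate(w)     # validate the encrypted model
    return scores
```

The double `deepcopy` mirrors line 6 of the algorithm: every trial starts from the clean backup so encryptions never accumulate.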

3.2 Principles of Security Solution

A satisfactory solution for protecting the NN models in CIM systems should satisfy the following principles:

1) Functionality. The functionality of the NN models shall be guaranteed under normal use, which means that when performing the computation at some memory locations, all corresponding weights should be decrypted.

2) Fast Restore. The solution must preserve the instant-on benefit of non-volatility, i.e., once powered up, the system must restore quickly and start working instantly. Since the NN computation proceeds from the front layers to the end, it is preferable to encrypt the fewest weights in the front layers.

3) Low Overhead. The solution shall not incur large performance overhead. The steps of an en/decryption are: reading the weight from the memory, sending it to the cryptographic engine, executing the en/decryption, and writing the weight back to the memory. Each en/decryption thus incurs one read, two data movements, and one write, so the amount of encrypted weights should be restricted.

4) No Vulnerability Window. The solution shall keep the system secure at all times, i.e., at any moment, most of the weights should be encrypted, eliminating the attack window. Whenever attackers interrupt the system, they are unable to obtain the entire weights.

5) Hard to Crack. The solution should be strong enough to resist being easily cracked. Therefore, two basic requirements must be satisfied. One is that the encrypted elements should be sufficiently concealed to make the encrypted weights undetectable. The other is that the encryption should disable the NN models and ensure that the attacker cannot recover the weights.

4 MOTIVATIONAL EXAMPLES

4.1 Where and How to Encrypt

Encrypting all weights would intuitively be more secure. However, as mentioned before, it is inefficient due to the unacceptable overhead. Moreover, it also widens the vulnerability window, because when the system is powered off, it takes much more time to re-encrypt all the weights. Therefore, we must identify the most significant weights and partially encrypt them to reduce overhead. A straightforward idea for identifying the significant weights is analyzing the sensitivity of each weight by exhaustive search, where sensitivity is defined as the impact on accuracy. However, exhaustive search has the following drawbacks.

On one hand, exhaustive search incurs high complexity. Assuming that we divide the weights into G groups and encrypt them independently to observe their sensitivities, the complexity of the analysis approaches O(G) · O(Test), where O(Test) represents the complexity of one validation round and increases linearly with the number of instances in the test dataset. As the group number G increases, the time required for the search grows linearly. For instance, to analyze VGG-16 on the ImageNet dataset, each validation round needs to test 50,000 pictures. Assuming the throughput of the validation system (e.g., a GPU) is 200 FPS, the analysis consumes 250G seconds. Even in the coarser grouping case where each group contains 1,000 weights of VGG-16, without considering overlapping, the analysis consumes 250 × 138k = 34.5 million seconds, approximately 399 days. Moreover, coarsely grouping the weights introduces quantities of ineffective encryption on insignificant weights. The situation becomes more complicated when considering irregular and cross-layer grouping. Thus, such an exhaustive approach is impractical for identifying significant weights.
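The cost estimate above can be reproduced with a small helper (our own arithmetic sketch, using the figures quoted in the text):

```python
def exhaustive_search_days(num_weights, group_size, val_images=50_000, fps=200):
    """Estimated wall-clock cost of sensitivity analysis by exhaustive search.

    One validation round tests val_images pictures at fps images/s
    (50,000 / 200 = 250 s per round); one round is needed per group.
    """
    seconds_per_round = val_images / fps       # 250 s for the quoted setup
    groups = num_weights / group_size          # G validation rounds
    return seconds_per_round * groups / 86_400 # seconds -> days

# VGG-16 with ~138M weights and 1,000 weights per group:
# 250 s x 138k groups = 34.5M s, roughly 399 days
```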

On the other hand, the encryption effectiveness is not satisfactory. We conducted experiments encrypting a ResNet-18 network [31] trained on the CIFAR-10 dataset, as shown in Fig. 3. The adopted encryption methods include encrypt-at-0, encrypt-at-1, and random, which set the target weights to zero, the maximum value, and random values, respectively. Due to the sparse nature of NNs, encrypting a single channel at zero has little impact on the recognition results. The encrypt-at-1 method identifies the sensitivities of different channels most distinctly. However, two problems arise. First, encrypting an entire channel is easily detected by attackers. Second, the analysis does not reveal the sensitivity of the back layers, because the channel number increases with the layer depth, so each channel contributes only a limited share of the computation. Therefore, we must find an efficient method to identify the significant weights.

Fig. 3. The sensitivity of different channels in ResNet-18 versus the channel indexes. The validation accuracy presents the performance of the encrypted models. The attached algorithm shows the process of the sensitivity analysis.

4.2 When to Encrypt

Another critical problem is when to encrypt the weights. We consider a typical application scenario often encountered in the Internet of Things (IoT), wearable devices, and edge computing with an intermittent working mode. When an external signal wakes up the system, it begins to restore and handle the incoming tasks. As shown in Fig. 4, in volatile-memory-based systems, at almost any given time, the NN models are protected: when powered up, a software security solution protects the data; when powered down, the data do not linger. An encrypted NVM-based system, in contrast, needs to first decrypt the data, then execute the task, and encrypt the data again after the work is done. One drawback is that the fast-restore benefit is lost. Another drawback is that there exists an attack window when the system starts working, since the weights remain in plaintext form at that time. Therefore, an adversary still has an opportunity to obtain the network weights. An ideal encrypted CIM can stay secure at all times while preserving the instant-on property. Because the computation of different layers is staggered, the encryption can be scheduled at run-time to ensure security at all times.
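A toy simulation of this scheduling idea follows, under the simplifying assumptions (ours, not the paper's) that layers execute strictly in sequence and en/decryption is modeled as an untimed event. A real scheduler would additionally overlap the decryption of layer l+1 with the computation of layer l to hide latency:

```python
def runtime_encryption_schedule(num_layers):
    """Sketch of run-time scheduling: because layers execute in sequence,
    layer l can be decrypted just before it computes and re-encrypted
    right after, so at most one layer is ever in plaintext.
    Returns the event trace and checks the single-plaintext invariant.
    """
    trace, plaintext = [], set()
    for l in range(num_layers):
        plaintext.add(l)
        trace.append(f"decrypt L{l}")
        trace.append(f"compute L{l}")    # all other layers stay encrypted
        plaintext.discard(l)
        trace.append(f"encrypt L{l}")
        assert len(plaintext) <= 1       # no vulnerability window
    return trace
```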

5 METHODOLOGY

Motivated by the aforementioned observations, we design an efficient solution for protecting the NN models in NVM-based NN computing systems. The overall framework is shown in Fig. 5 and contains two main parts: the sparse fast gradient encryption method, deciding where and how to encrypt, and the runtime encryption scheduling, deciding when to encrypt. The whole process goes as follows. Before deploying the NN models, we first perform SFGE to generate the encryption keys, which contain the significant weight locations and the corresponding fast gradient signs. For each NN model, the key generation needs to be performed only once, offline. Then, the keys are kept in the key store. These keys can be doubly encrypted with conventional encryption solutions, e.g., AES or CME, to enhance the security level. At run-time, the RES module controls the timing of encryption/decryption, takes the keys from the key store, and performs the cryptographic operations. The following introduces the solution in more detail.

5.1 SFGE: Sparse Fast Gradient Encryption

Inspiration. The fast gradient sign method (FGSM) [26] was first proposed to generate misclassified adversarial examples. An intriguing discovery has been made that a wide variety of NN models are vulnerable to adversarial perturbation on the input because of their linear nature. By adding a small vector whose elements are equal to the sign of the gradient of the cost function with respect to the input, the NN can reliably be made to misclassify the target. Inspired by that, it is reasonable to argue that the NN models are also vulnerable to adversarial perturbation added on the weight parameters. Recall that a critical problem of the encryption is to identify the key weights, then make small changes on these weights to cause a rapid deterioration of the NN performance. The fast gradient method can help find the most significant gradient direction.

Another interesting discovery is the sparse nature of the gradient with respect to the weights. Many studies have explored gradient sparsification approaches [32] to mitigate the bandwidth requirements in distributed NN training systems. For example, deep gradient compression (DGC) [32] demonstrates that one can preserve only approximately 0.1 percent of the gradients and still achieve accuracy comparable with normal training. This suggests that the models can also be greatly impacted by sparse perturbations. Recall that the third principle is not to incur a large overhead. It is therefore necessary to apply sparsification to the fast gradient.

Problem Formulation. Let Θ be the weight parameters of an NN model, Θ̃ be the perturbation matrix (whose elements are also the encryption keys) added on the original weights, χ be the validation dataset, x be an input to the model sampled from χ, y be the corresponding label associated with x, and J(Θ, x, y) be the cost function used to train the NN. There are also constraints proposed by the design

Fig. 4. Contrasting the working order and protection window of a DRAM-based processor, bulk/incremental encrypted CIM, and ideal encrypted CIM.

Fig. 5. The overall framework of the encryption solution, which contains two main parts: the sparse fast gradient encryption (SFGE) for deciding where and how to encrypt, and the runtime encryption scheduling (RES) for deciding when to encrypt.

1600 IEEE TRANSACTIONS ON COMPUTERS, VOL. 69, NO. 11, NOVEMBER 2020

Authorized licensed use limited to: Tsinghua University. Downloaded on October 08,2020 at 09:01:39 UTC from IEEE Xplore. Restrictions apply.


principles. First, the number of encrypted weights should be within an acceptable bound owing to the constrained encryption budget. Here we denote the encryption budget as N. We set a selection matrix Mask, which contains only 0s and 1s, to generate the sparse encryption key matrix Θ̃. Thus, the number of 1s in Mask should be no larger than N.

Second, the encrypted weights should not fall significantly outside the normal distribution range, otherwise they will become outliers and be easily detected by the adversaries. Conventional encryption solutions ensure that no attack can be more efficient than brute-force collision, e.g., the cracking complexity of a 256-bit AES key approaches 2^256. In our solution, as we only encrypt an extremely small proportion of the weights, the cracking complexity depends on the difficulty of finding where the encrypted weights are located. For example, when encrypting N weights out of M weights, the complexity of finding the locations by brute force will be C(M, N). Therefore, the encrypted weights should appear indistinguishable from the other weights; otherwise, the adversaries can crack the keys more easily by identifying the outliers. As we apply a perturbation on each encrypted weight, the values of the encrypted weights Θ̃ should be constrained, and we denote the maximum perturbation intensity as ε. In summary, since our goal is to find the optimal Θ̃ to degrade performance, the encryption problem can be modeled as

$$\max_{\tilde{\Theta}} \sum_{(x,y)\in\chi} J(\Theta + \tilde{\Theta}, x, y)
\quad \text{s.t.} \quad
\begin{cases}
\tilde{\Theta} = \tilde{\Theta} \odot \mathrm{Mask} \\
\|\mathrm{Mask}\|_{1} \le N \\
\max(|\tilde{\Theta}|) \le \epsilon
\end{cases}
\tag{1}$$

Fast Gradient. Due to the black-box nature of the NN, it is very difficult to find an optimal solution for the above optimization problem. Therefore, we give an approximate solution. In NN optimization, the most widely used method is gradient backpropagation. The fast gradient with respect to the weights can be obtained by the following equation:

$$\Theta^{\ast} = \sum_{(x,y)\in\chi} \nabla_{\Theta} J(\Theta, x, y). \tag{2}$$

However, the fast gradient Θ* is still dense. Since there is a constraint on the encryption amount, we further sparsify the gradients to preserve only a small portion of them.

Sparsification. A critical problem of sparsification is how to find the significant gradients that impact the performance most. Because the partial derivative with respect to a variable in Θ gives the rate of change of the cost function J(Θ) along that direction, the magnitude of the partial gradient reflects how fast the cost function changes along the corresponding variable. Therefore, preserving the gradients with the largest magnitudes enables a sparse gradient to enlarge the loss. We sort the fast gradients by their absolute values and preserve the top-N for each layer. Let thr be the threshold of the top-N gradients. Finally, to reduce the complexity of the keys, we only preserve the sign of the corresponding fast gradient. Therefore, the preserved fast gradients become

$$\mathrm{Mask} = |\Theta^{\ast}| \ge thr, \tag{3}$$

$$\tilde{\Theta} = \epsilon \cdot \mathrm{sign}\left(\sum_{(x,y)\in\chi} \nabla_{\Theta} J(\Theta, x, y)\right) \odot \mathrm{Mask}. \tag{4}$$

Opposite to normal neural network training, which aims at minimizing the loss function, the encryption goal is to enlarge the loss to make the NN misclassify. Therefore, we add the generated sparse fast gradient on the vanilla parameters, and the encryption is done. We refer to this method as the "sparse fast gradient encryption". Algorithm 2 summarizes the overall algorithm and process.

Algorithm 2. SFGE: Sparse Fast Gradient Encryption

Input: Validation dataset χ and size b
Input: Neural network weights Θ (L layers)
Input: Cost function J(Θ, x, y)
Input: Constraints: encryption amount per layer N, encryption intensity ε
Output: Encryption keys Θ̃
1: Θ* ← 0
2: for (x, y) in enumerate(χ) do
3:   Θ* ← Θ* + ∇_Θ J(Θ, x, y)
4: end for
5: for Θ*_i in enumerate(Θ*) (i = 1, 2, ..., L) do
6:   Select threshold: thr_i ← top-N of |Θ*_i|
7:   Mask ← |Θ*_i| ≥ thr_i
8:   Θ̃_i ← ε · sign(Θ*_i ⊙ Mask)
9: end for
10: for Θ_i in enumerate(Θ) (i = 1, 2, ..., L) do
11:   Θ_i ← Θ_i + Θ̃_i
12: end for
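The per-layer selection and encryption steps of Algorithm 2 can be sketched in a few lines of NumPy. This is a minimal sketch under our own naming (`sfge_keys`, `sfge_encrypt` are illustrative, not from the paper); the accumulated per-layer gradients are assumed to have been computed already by forward-backward passes over the validation set.

```python
import numpy as np

def sfge_keys(grads, n, eps):
    """Sketch of SFGE key generation (Algorithm 2, lines 5-9): for each
    layer, keep the flat indices of the top-n gradients by magnitude and
    store eps * sign(gradient) as the signed perturbation."""
    keys = []
    for g in grads:
        idx = np.argsort(np.abs(g).ravel())[-n:]   # top-n gradient magnitudes
        pert = eps * np.sign(g.ravel()[idx])       # key = location + signed step
        keys.append((idx, pert))
    return keys

def sfge_encrypt(weights, keys):
    """Add the sparse perturbation (Algorithm 2, lines 10-12).
    Decryption subtracts the same perturbation."""
    out = []
    for w, (idx, pert) in zip(weights, keys):
        w = w.copy()
        w.ravel()[idx] += pert                     # in-place on the copy's buffer
        out.append(w)
    return out
```

On a toy layer, only the weights with the two largest gradient magnitudes are perturbed by ±eps; all other weights stay untouched.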

5.2 i-SFGE: Incremental SFGE

The SFGE scheme generates all the keys in one forward-backpropagation round. While the SFGE method is fast and efficient, its encryption effectiveness is not the most prominent, because the model sensitivity may vary when subtle changes are applied to the weights. When modifying a weight, the sensitivities of the overall weights will probably differ from those of the vanilla weights. Motivated by this, we introduce an incremental SFGE, referred to as i-SFGE, to generate a stronger group of keys that can more effectively disable the NN models, by generating the keys incrementally, one by one. The process goes as follows. The nth key for each layer is generated based on the model that has already been partially encrypted by the previously generated keys:

$$\Theta_{n} = \Theta_{n-1} + \tilde{\Theta}_{n-1}. \tag{5}$$

Then the historical gradients are zeroed, and another round of gradient generation is performed on the current model. Note that each gradient generation round produces only one key for each layer; thus the maximum gradient of every layer will be selected:

$$\Theta^{\ast} = \sum_{(x,y)\in\chi} \nabla_{\Theta_{n}} J(\Theta_{n}, x, y), \tag{6}$$

$$\mathrm{Mask} = \left(|\Theta^{\ast}| = \max(|\Theta^{\ast}|)\right). \tag{7}$$




Finally, we take the sign of the selected gradient as the key, multiplied by the configured encryption intensity, following the same rule as SFGE. We thus obtain the nth key:

$$\tilde{\Theta}_{n} = \epsilon \cdot \mathrm{sign}(\Theta^{\ast} \odot \mathrm{Mask}). \tag{8}$$

We summarize the process in pseudo-code in Algorithm 3. Overall, the gradient generation process is performed N times to obtain the expected number of keys, which increases the key generation complexity by a factor of N. Meanwhile, a suppression mechanism is set to prevent a weight from being selected twice or more. i-SFGE greedily selects the most sensitive points at the present model state, which promises to eliminate ineffective encryption and achieve better encryption effectiveness.

Algorithm 3. i-SFGE: Incremental Sparse Fast Gradient Encryption

Input: Validation dataset χ and size b
Input: Neural network weights Θ (L layers)
Input: Cost function J(Θ, x, y)
Input: Constraints: encryption amount per layer N, encryption intensity ε
Output: Encryption keys K
1: for n in range(N) do
2:   Θ* ← 0
3:   for (x, y) in enumerate(χ) do
4:     Θ* ← Θ* + ∇_Θ J(Θ, x, y)
5:   end for
6:   for Θ*_i in enumerate(Θ*) (i = 1, 2, ..., L) do
7:     Select key: ind_i ← argmax of |Θ*_i|
8:     while ind_i in K_i do
9:       Θ*_i(ind_i) ← 0
10:      Reselect key: ind_i ← argmax of |Θ*_i|
11:    end while
12:    Mask ← Zeros(Θ*_i.shape)
13:    Mask(ind_i) ← 1
14:    Θ̃_i ← ε · sign(Θ*_i ⊙ Mask)
15:    K_i.append(ind_i, sign(Θ*_i(ind_i)))
16:  end for
17:  for Θ_i in enumerate(Θ) (i = 1, 2, ..., L) do
18:    Θ_i ← Θ_i + Θ̃_i
19:  end for
20: end for
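The incremental loop can be sketched as follows (our naming; `grad_fn` is a stand-in for the accumulated forward-backward gradient computation on the current, partially encrypted weights, which the algorithm assumes is available):

```python
import numpy as np

def isfge_keys(grad_fn, weights, n, eps):
    """Sketch of i-SFGE (Algorithm 3): one key per layer per round,
    generated on the already partially encrypted model, with the
    suppression mechanism that forbids reselecting a weight."""
    weights = [w.copy() for w in weights]
    keys = [[] for _ in weights]
    for _ in range(n):
        grads = grad_fn(weights)                  # fresh gradients each round
        for i, g in enumerate(grads):
            mag = np.abs(g).ravel().copy()
            mag[[k for k, _ in keys[i]]] = 0.0    # suppression: zero used weights
            ind = int(np.argmax(mag))
            s = float(np.sign(g.ravel()[ind]))
            keys[i].append((ind, s))
            weights[i].ravel()[ind] += eps * s    # encrypt before the next round
    return keys, weights
```

Because each round re-evaluates the gradients on the perturbed model, the keys reflect the sensitivities of the current model state rather than those of the vanilla weights.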

5.3 Details of the Encryption

The Composition of the SFGE Keys. The keys of the encryption are composed of two parts: the encrypted location and the encryption sign. Each key requires ⌈log2(L)⌉ + ⌈log2(M)⌉ + 1 bits, where L represents the number of layers of the NN model, M represents the number of weights in the layer, and 1 bit indicates the encryption direction of the weight (+ or -).
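The per-key storage follows directly from that formula; the layer sizes below are hypothetical, chosen only to illustrate the arithmetic.

```python
from math import ceil, log2

def key_bits(num_layers, weights_in_layer):
    """Bits per SFGE key: layer index + intra-layer weight index + 1 sign bit."""
    return ceil(log2(num_layers)) + ceil(log2(weights_in_layer)) + 1

# Example: a 101-layer network whose largest layer holds 3*3*64*64 weights
# needs 7 bits (layer) + 16 bits (location) + 1 bit (sign) = 24 bits per key.
```

Even with dozens of keys per layer, the total key material remains a few kilobytes, which is easy to keep in a small protected key store.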

Decryption. The only operation introduced by the encryption is the addition on the weights. Therefore, the decryption only needs to subtract the sparse fast gradient perturbations from the encrypted weights, and the NN model will work normally again. Thus, the decryption keys can be obtained by simply flipping the last bit (the sign bit) of the encryption keys.

Complexity Analysis. The key generation is a one-shot process for each model, i.e., for a well-trained model to be deployed, the SFGE or i-SFGE process only needs to be performed once to generate the keys. When the system runs, the keys are strictly protected and used to encrypt or decrypt the weights. Two basic operations are involved: the forward-backward propagation to generate the gradients, and the gradient sorting to select the most significant weights. Therefore, the complexity is O(ST) + O(M log2(N)), where S denotes the number of instances in the sampled dataset, T denotes the time overhead of processing one instance, and N represents the number of keys for each layer. The former term is the complexity of generating the gradients for the weights, and the latter is the complexity of the sorting operations that select the gradients with the largest absolute values. As for i-SFGE, because the key generation process is performed N times, the complexity increases to O(NST) + O(NM).

Trade-Offs. Two constraints need to be considered. One is the encryption budget N, i.e., the maximum number of weights that are allowed to be encrypted. The other is the perturbation intensity limit ε, which should not be exceeded, so as to preserve concealment. There exist trade-offs between the overhead and the encryption effectiveness. An increasing N incurs more overhead, because each en/decryption needs to write the corresponding weight locations, while encrypting more weights certainly results in a higher security level. Concurrently, the encryption intensity ε also affects the encryption effectiveness. A larger ε has a greater impact on the performance, but it also increases the probability of being detected, because the encrypted weights may exceed the original weight distribution and become outliers. Therefore, ε must be carefully chosen based on the weight distribution.

5.4 Variants of the Encryption

The above encryption design describes the general case of floating-point models. We can further adapt the approach to the characteristics of NN accelerators to make the encryption more practical and efficient. For example, in quantized NN models where a single NVM device represents a multi-bit value, the perturbation can also be quantized and discretized, tuning the conductance from one level to another. In the following, we discuss the more general bit-wise encryption variant.

The most common accelerators, including the NVM-based CIM designs [33], [34] and CMOS-based FPGAs [35] and ASICs [36], [37], generally use fixed-point numbers with a bit-width of 8 or less to implement the computations. Meanwhile, to achieve higher reliability, many CIM architectures [38] propose to build accelerators on single-bit NVM devices rather than multi-bit devices, distributing a single weight across multiple cells for the value representation. Therefore, we further reduce the granularity of the encryption and implement bit-wise operations.

Instead of adding floating-point perturbations, we only perform the encryption on the first two bits of the selected weights. We assume that the binary numbers are encoded in 2's complement; the encryption transformations are shown in Table 1. Assuming the scale of the quantization is 2^s, the encryption is actually equivalent to




adding a perturbation with an intensity of 2^(s-1). Such bit-wise encryption operations bring two benefits. First, the encryption only modifies two bits for each encrypted weight, which decreases the write overhead because the number of operated cells is substantially reduced. Second, the encryption becomes intensity-flexible, because the distribution ranges of the quantized weights in different layers are always reflected in the quantization scale. Therefore, as we always operate on the first two bits, the intensity becomes adaptive. This enhances the adaptability and effectiveness of the encryption. On one hand, it fully utilizes the distribution space to apply the largest possible perturbation. On the other hand, it also constrains the perturbation intensity and prevents the encrypted weights from jumping out of the normal range and being easily detected.

Note that the encryption can be implemented by simple logic or look-up tables. Denoting the two bits as B0B1 and the gradient sign bit as P (P = 1 represents positive and P = 0 represents negative), the logic for the non-exempted cases is B̃0 = (B0 + B̄1)·P̄ + B0·B̄1 and B̃1 = B̄1. Meanwhile, as shown in Table 1, when excluding the cases "01" and "10" respectively, the transformations under a positive sign are exactly the reverse of those under a negative sign. Therefore, the decryption process is the same as illustrated in Section 5.3: the decryption logic is exactly the encryption logic with the sign bit flipped. Moreover, we follow the same process as SFGE or i-SFGE to generate the significant weight locations and their gradient signs. Therefore, the SFGE keys and the corresponding key generation process and complexity remain unchanged.
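The transformation is small enough to check exhaustively as a look-up table. The sketch below encodes the map from Table 1 (our naming); the round trip covers the non-exempted cases, matching the reverse symmetry noted above.

```python
# Top-two-bit encryption map from Table 1; the starred overflow cases
# ("01" under +1, "10" under -1) are exempted, i.e., left unchanged.
ENC = {
    +1: {"00": "01", "01": "01", "10": "11", "11": "00"},
    -1: {"00": "11", "01": "00", "10": "10", "11": "10"},
}
SKIP = {+1: "01", -1: "10"}

def encrypt_bits(b, sign):
    return ENC[sign][b]

def decrypt_bits(b, sign):
    # Decryption applies the map of the opposite sign (flipped sign bit).
    return ENC[-sign][b]
```

For every non-exempted bit pattern, decrypting with the flipped sign restores the original bits.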

5.5 RES: Runtime Encryption Scheduling

A runtime encryption scheduling is highly desirable to keep the CIM system secure at all times, while simultaneously narrowing the vulnerability window when re-encrypting the weights. The conventional way, which decrypts before starting work and encrypts after the work ends, leaves a vulnerability window: attackers still have opportunities to steal the model by interrupting the system during run-time or during the re-encryption window. NN computing always starts at the first layer and ends at the last layer, and dependencies exist between the layers, i.e., the input of a layer is the output of its previous layer. Commonly used schedules across the layers include layer-by-layer scheduling and cross-layer co-scheduling.

The layer-by-layer scheduling performs the computation of each layer in sequence, i.e., a layer starts computing when its previous layer has finished. Such a working order is convenient for the runtime encryption scheduling, because the encryption can also simply be done layer by layer. The whole process is shown in Fig. 5: a layer is only decrypted when the program

comes to it, and is re-encrypted after its work is done. Pre-decryption can also hide the decryption latency during run-time. For most of the running time, only one layer is in plaintext form; only in the very short interleaved moments are two adjacent layers in plaintext form simultaneously.
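The layer-by-layer schedule can be sketched as follows. This is a simplified model of ours, not the paper's implementation: a "layer" is a weight matrix plus its SFGE key, and a matrix multiplication stands in for the real layer computation.

```python
import numpy as np

def decrypt(layer):
    idx, pert = layer["key"]
    layer["w"].ravel()[idx] -= pert       # remove the sparse perturbation

def encrypt(layer):
    idx, pert = layer["key"]
    layer["w"].ravel()[idx] += pert       # restore the sparse perturbation

def run_inference_res(layers, x):
    """Layer-by-layer RES: each layer is in plaintext only while it
    computes, so at most one layer is ever decrypted at a time."""
    for layer in layers:
        decrypt(layer)
        x = layer["w"] @ x                # stand-in for the real layer compute
        encrypt(layer)
    return x
```

After the call returns, every layer is back in ciphertext form, so an adversary reading the NVM between inferences only ever sees encrypted weights.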

Another common method is cross-layer scheduling, which fully utilizes the parallelism across the layers [39]. Because sliding windows are used to convolve the feature maps, a layer can start computing once it has fetched a window of the outputs from the previous layer. Although the parallelism of different layers can be exploited, their computation time slices are always staggered due to the dependencies. Therefore, we can profile the working and idle cycles, then fully utilize the idle cycles to perform the cryptographic operations. We only consider the layer-by-layer scheduling in the following experiments.

Discussion. RES incurs write operations for the en/decryption in every inference round, which raises a concern about the lifetime of the system. Because the SFGE keys are fixed, frequently operating on the same cells will certainly wear them out. Two solutions can help overcome this problem: 1) applying the RES only in the intermittent working mode with infrequent activities, such as energy-harvesting edge devices or some embedded applications (e.g., the face identification module in phones); and 2) applying wear-leveling techniques to spread the writes uniformly across the whole memory. It is not difficult to design the wear-leveling strategy because the writing behavior is fully predictable.

Another security concern about RES is that the adversaries may still have the opportunity to obtain the whole weights by disrupting the system multiple times. The prerequisites are that: 1) the adversaries have prior knowledge of the layer scheduling, so that they can interrupt the system exactly when the desired layer is running; and 2) once the system is powered off, the weights of the running layer remain in plaintext form. Three ways can compensate for this vulnerability. First, introduce obfuscation in the layer scheduling to hide the running information, as similarly discussed in [21]. Second, circuit-level innovations can prevent the adversary from correctly reading the data without authorization [13], though they require additional circuit support. Third, set up an emergency encryption mechanism with a built-in temporary battery to tolerate malicious interruptions or unexpected power failures. Once an interruption is detected, the module can be activated to encrypt the running layer immediately. Because the number of encrypted weights in a single layer is only around 20 to 30, the emergency encryption can be finished in a very short time, and the equipped temporary battery capacity can be small.

6 EXPERIMENTS

6.1 Experiment Settings

We investigate the accuracy influence and protection effectiveness of our solution. The experiments are conducted on the ResNet [31] (with 18, 50, and 101 layers) and VGG-16 [12] models with the ImageNet dataset, and the SSD [40] model with the VOC dataset [41]. The evaluation metrics include: 1) the accuracy influence; 2) the security analysis; and 3) the overhead. Two main parameters are involved in the experiments: the encryption amount per layer N and

TABLE 1
Bit-Wise Encryption Variant Schematic Table

Gradient Sign | Encryption
+1            | 00→01, 01→01*, 10→11, 11→00
-1            | 00→11, 01→00, 10→10*, 11→10

*To avoid overflow, the encryption ignores the case "01" when the gradient sign is positive and the case "10" when the gradient sign is negative, and skips the encryption.




the encryption intensity ε added on the weights. Considering that the number of weights in the front layers is usually smaller than in the back layers, we set the encryption amount as min(0.1% × M, N), where M represents the number of weights in the corresponding layer. By default, the encryption keys are generated by the SFGE method on 32-bit floating-point models. The conclusions also hold for the fixed-point models.

6.2 Accuracy Influence of Encryption

Encrypting the NN models by SFGE has a significant impact on the accuracy. Recall that our goal is to destroy the prediction ability of the NN models, so a larger influence on the accuracy indicates a more effective encryption.

Classification Models. Fig. 6 shows the validation accuracy of the NN models versus the encryption amount per layer N. Four interesting conclusions can be drawn from the results. First, the accuracy influence increases with N: as the encryption amount increases, the encrypted weights deviate increasingly from the original weights, and more computational errors are introduced. Second, there exists a turning point on the curves. For example, in the curve of ResNet-101, when N approaches 15, continuing to enlarge N is no longer profitable because the performance has already fully deteriorated. Third, the trend of the accuracy drop is closely related to the NN depth. Among ResNet-18, -50, and -101, the ResNet-101 curve demonstrates the fastest accuracy decline, and the decline becomes slower as the layer number decreases. A

reasonable explanation is that the errors caused by the encryption grow when propagating through the layers. Therefore, deeper NNs are influenced more because the errors accumulate explosively. Fourth, heavier models show less sensitivity to the encrypting perturbation. ResNet-18 and VGG-16 have similar layer numbers, while VGG-16 shows much better resilience than ResNet-18. This may result from VGG-16 having roughly 10x more weights than ResNet-18. In this scenario, increasing N or the intensity ε can improve the encryption effects.

Fig. 7 shows the validation accuracy of the encrypted ResNet-18 versus the encryption intensity ε. The performance of encrypted models with larger ε degrades more quickly than that of models with smaller ε. However, there still exists an intensity limit imposed by the encryption concealment, as will be discussed below.

Contrasting SFGE and i-SFGE. We compare the encryption effectiveness of the SFGE and i-SFGE methods introduced in Section 5. Fig. 8 shows the accuracy when encrypting the VGG-16 network with keys from the two generation methods. As can be seen, i-SFGE demonstrates much better performance than SFGE. Under the same encryption budget, the keys generated by i-SFGE degrade the accuracy more significantly, i.e., they better identify the significant weights.

Object Detection Model. We also evaluate the effectiveness of our solution on object detection models. We select a single-shot multibox detector (SSD) [40] and test it on the widely used Pascal VOC dataset [41] to demonstrate the encryption effect. As shown in Table 2, with an encryption amount N of 30 or 40 and an intensity of 0.1 or 0.2, the evaluated mAP demonstrates a significant drop. Therefore, our solution also works for object detection applications.

Visualization. We also visualize the predictions of both classification and object detection models in plaintext form and ciphertext form. As can be seen in Fig. 9, the encryption completely disables the NN models. The predictions are absolutely incorrect and cannot provide any useful information. Therefore, even if the encrypted models are leaked, the attackers are still unable to make use of them.

Evaluating Bit-Wise Encryption. For the fixed-point weights that are frequently used in NN accelerators, we apply the bit-wise encryption to protect the deployed models. As shown in Table 3, two conclusions can be drawn. First, under the same encryption number, bit-wise

Fig. 6. The validation accuracy of the encrypted models versus N (under ε = 0.1). The accuracy first declines as N increases, then saturates after reaching a turning point. The declining trend also becomes faster as the NN depth increases. Moreover, VGG-16 demonstrates better adaptability under encryption than the ResNets due to its heavier weights.

Fig. 7. The validation accuracy of the encrypted VGG-16 models versus N under different intensities ε. As ε increases, the accuracy drops much faster.

Fig. 8. Contrasting the encryption effectiveness of SFGE and i-SFGE. We show the top-1 accuracy of the encrypted models, of which the keys are generated based on SFGE and i-SFGE, respectively. The i-SFGE method far outperforms SFGE, as it yields much lower accuracy under the same configured encryption number and intensity.




encryption provides an adaptive intensity, which fully utilizes the dynamic range and degrades the accuracy more. Second, the perturbation ranges are limited, which ensures that the encrypted weights do not fall outside the normal range. Most importantly, the bit-wise encryption can greatly reduce the write complexity and overhead, especially in the NVM architectures that map a single weight to multiple cells. By encrypting less than 0.0001 percent of the overall bits, the accuracy can drop to near zero. Moreover, the conclusions drawn in the floating-point cases reasonably still hold under bit-wise encryption, because the essence of both is disturbing the key weights.

Effectiveness Under Variation. As multi-level NVM cells usually suffer from process variation, the mapped values will deviate from their expected values. We demonstrate that the encryption solution is still effective under such variation. We

consider a generic variation model [42], i.e., a cell has multiple conductance levels and the variations follow a Gaussian distribution with zero mean and specific standard deviations. As can be seen in Fig. 10, the accuracy becomes increasingly unstable as the standard deviation of the variation increases. Nevertheless, the encryption remains effective on the deployed models under variation. The underlying reason is that although the process variation may change the deployed weight values, the significance property of the weights still remains.

6.3 Security Analysis

SFGE encrypts the weight parameters of neural network models with the generated SFG keys to protect the well-trained models from confidentiality attacks. To analyze the security, we consider the following aspects.

TABLE 2
Single-Shot Multi-Box Detector (SSD) [40] With the Backbone of VGG-16 is Used to Demonstrate the Effectiveness of SFGE in Object Detection Tasks

Model* | N         | εb  | εh    | backbone | head | mAP
SSD    | Plaintext | -   | -     | -        | -    | 77.43
SSD    | 30        | 0.1 | -     | ✓        | -    | 62.80
SSD    | 30        | 0.1 | 0.025 | ✓        | ✓    | 51.07
SSD    | 30        | 0.2 | -     | ✓        | -    | 12.25
SSD    | 30        | 0.2 | 0.05  | ✓        | ✓    | 10.02
SSD    | 40        | 0.1 | -     | ✓        | -    | 49.13
SSD    | 40        | 0.1 | 0.025 | ✓        | ✓    | 36.15
SSD    | 40        | 0.2 | -     | ✓        | -    | 5.47
SSD    | 40        | 0.2 | 0.05  | ✓        | ✓    | 3.15

*We partition the network into two parts: "backbone" for the main part that produces the features, and "head" for the part that implements the predictions. We find that the dynamic range of the head layers is smaller than that of the backbone, so we decay the encryption intensity for the head. εb represents the intensity for encrypting the backbone, and εh for encrypting the head. We evaluate the mean average precision (mAP) on the VOC2007 test set [41].

Fig. 9. Visualization of the predictions based on the models in plaintext form and ciphertext form. (a) Classification task. The ResNet-18 network is encrypted under the configuration N = 20, ε = 0.2. (b) Object detection task. The visualization confidence threshold is set to 0.6. The SSD network is encrypted with N = 40, ε = 0.2. In plaintext form, the network can precisely predict the locations and categories of the instances, while in encrypted form, the predictions are completely wrong.

TABLE 3
The Accuracy of Encrypted Models Under Different Encryption Number N With Bit-Wise Encryption

Model*    | Param.  | N  | Encrypted Bits | Top-1 | Top-5
ResNet-18 | 11.7 MB | 0  | 0              | 67.93 | 87.91
ResNet-18 | 11.7 MB | 1  | 48b            | 15.61 | 33.57
ResNet-18 | 11.7 MB | 2  | 96b            | 0.242 | 1.562
ResNet-18 | 11.7 MB | 3  | 144b           | 0.124 | 0.716
ResNet-18 | 11.7 MB | 4  | 192b           | 0.148 | 0.520
VGG-16    | 138 MB  | 0  | 0              | 71.35 | 90.15
VGG-16    | 138 MB  | 5  | 0.16Kb         | 66.40 | 86.93
VGG-16    | 138 MB  | 10 | 0.32Kb         | 52.59 | 76.31
VGG-16    | 138 MB  | 15 | 0.48Kb         | 24.49 | 45.68
VGG-16    | 138 MB  | 20 | 0.64Kb         | 6.268 | 16.11
VGG-16    | 138 MB  | 25 | 0.80Kb         | 1.570 | 4.952
VGG-16    | 138 MB  | 30 | 0.96Kb         | 0.512 | 1.928

*The weights of the ResNet-18 and VGG-16 networks are both uniformly quantized to 8-bit numbers.




Concealment of the Encrypted Weights. The SFG keys should be tightly concealed to make the encrypted weights undetectable. The concealment is crucial, because once the attackers know the exact locations of the encrypted weights, they can crack the encryption with a much lower complexity. An important indicator of concealment is that the values of the encrypted weights should be within the distribution range of the original weights. We show the weights of ResNet-18 in Fig. 11, in which each weight is plotted as one point. With a small encryption intensity, such as ε = 0.1, the encrypted weights still fall within the original range. It is almost impossible for the attackers to detect the encrypted weights among the chaotic weight parameters. However, while a larger ε brings a better encryption effect, it also increases the risk of being detected. As shown in Fig. 11, when ε reaches 0.5, many encrypted weights jump out of the normal range and become outliers. Thus, when ε lies in a reasonable range, the adversaries have to search for the encrypted weights by brute force. Recovering 20 weights from the 3 x 3 x 64 x 64 weights of a single convolution layer already takes approximately 8 x 10^72 tries, and since each key additionally carries a sign bit and every layer must be searched jointly, the overall complexity far exceeds 2^256 (the complexity of AES with a 256-bit key).
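The combinatorics can be checked directly. This is a quick sketch using the 3×3×64×64 layer size quoted in the text; the compounding across layers and sign bits is our own illustration of why the joint search space dwarfs a 256-bit key.

```python
from math import comb

M = 3 * 3 * 64 * 64      # weights in one convolution layer: 36,864
N = 20                   # encrypted weights in that layer

tries = comb(M, N)       # ways to pick the encrypted locations in one layer
assert 8.0e72 < tries < 9.0e72       # roughly 8.8e72 location guesses

# With the sign of each key included and multiple layers searched jointly,
# the space compounds multiplicatively; two such layers alone exceed 2^256.
assert (tries * 2 ** N) ** 2 > 2 ** 256
```

A single layer's location search is thus already astronomically expensive, and the joint search over all layers is far harder still.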

Impact on the Statistical Distribution. A robust encryption requires that the weight distribution remains insignificantly

changed. We plot the statistical probability distribution of the original and encrypted weights of layer2.0.conv2 in ResNet-18, as shown in Fig. 12. There are only extremely small fluctuations in the mean μ and variance σ. Moreover, we calculate the squared norm of the difference between the original histogram and the encrypted one, which reaches only 5.43 x 10^-5. Therefore, the adversary cannot distinguish the encrypted weights by observing the distribution difference.

Recall of the Encrypted Weights. Another indicator of the concealment is the recall of the encrypted weights. The "recall" is defined as the rate of encrypted weights found when performing the fast gradient generation again on the encrypted model. Besides, we define the top-100 (top-1000) recall, which limits the search range to the weights with the top-100 (top-1000) largest gradients, because it is meaningless to continue enlarging the search space: the search complexity within the top-1000 already approaches C(1000, 20) ≈ 3.4 x 10^41. This concern is raised because the encrypted models may still maintain the same sensitivity as the plaintext models, so the attackers may collide with the encrypted weights by performing the gradient generation again. The mean recall represents the mean value of the recall rates over all layers. As shown in Table 4, the recall of the ResNet models is low. Although VGG-16 shows a much larger recall, it is still difficult to restore the vanilla weights. Thus, it is almost impossible to regenerate the same gradient keys from the encrypted models.
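The quoted top-1000 search complexity can likewise be verified directly (standard library only):

```python
import math

# Combinations an attacker must try when restricting guesses to the
# top-1000 largest-gradient weights and needing to hit all N = 20 keys.
top1000_space = math.comb(1000, 20)
print(f"{top1000_space:.2e}")   # roughly 3.4 x 10^41 combinations
```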

Defence Against Adversarial Examples. An important goal of our protection is to defend against white-box adversarial attacks.

Fig. 10. The validation accuracy of ResNet-18 models with and without encryption under variation. We quantize the weights to 6-bit (64 levels), and the encryption configurations are N = 20, ε = 8 levels. Ten independent experiments are made for each standard deviation. The bold line indicates the average, and the colored area indicates the max-min range.

Fig. 11. The weight distribution of the ResNet-18 model, in which each weight is plotted as a point. We select the top-5 encrypted weights as examples. When ε = 0.1, the encrypted weights still fall in the original range and are extremely hard to detect; when ε = 0.5, many encrypted weights jump out of the normal range and become outliers, increasing the risk of being detected.

Fig. 12. The weight distribution of layer2.0.conv2 in ResNet-18 before and after the encryption. The encryption configurations are N = 20, ε = 0.2. The mean and variance of the weight distribution remain insignificantly changed after encryption.

1606 IEEE TRANSACTIONS ON COMPUTERS, VOL. 69, NO. 11, NOVEMBER 2020

Authorized licensed use limited to: Tsinghua University. Downloaded on October 08,2020 at 09:01:39 UTC from IEEE Xplore. Restrictions apply.


Thus, we evaluate the defence effectiveness against the adversarial examples respectively generated from the encrypted weights and from the original weights using the FGSM. The intensity applied in the attacks is set to 0.05. As shown in Table 4, the NN models are vulnerable to adversarial examples generated by performing FGSM on the weights in plaintext form. The situation is greatly improved under adversarial examples generated from the encrypted weights. The performance still degrades somewhat, which mainly results from two aspects: 1) the transfer ability of the adversarial examples; and 2) the partially preserved NN characteristics. Therefore, white-box adversarial attacks can be defended under SFGE.
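For readers unfamiliar with FGSM [26], the attack perturbs the input along the sign of the input gradient of the loss. The following is a minimal sketch on a toy logistic model (illustrative only; the paper attacks full CNNs, and all names here are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=10)   # stand-in "model weights"
x = rng.normal(size=10)   # one input sample
y = 1.0                   # true label in {0, 1}

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# For a logistic model p = sigmoid(w @ x) with binary cross-entropy loss,
# the gradient of the loss w.r.t. the input x is (p - y) * w.
p = sigmoid(w @ x)
grad_x = (p - y) * w

eps = 0.05                          # attack intensity used in the text
x_adv = x + eps * np.sign(grad_x)   # FGSM: step along the gradient sign

# The adversarial example moves the prediction away from the true label.
print(sigmoid(w @ x), sigmoid(w @ x_adv))
```

A white-box attacker needs the plaintext weights to compute `grad_x`; computing it from the encrypted weights yields a different, much weaker perturbation, which is exactly what Table 4 measures.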

Security of Runtime Encryption Scheduling. To ensure computational correctness, the weights must be decrypted back to plaintext form when performing the corresponding computations. Owing to the proposed RES method, at most one layer of weights is decrypted at any time. Therefore, even if the attackers disrupt the system while it is working, they only obtain a model with a single layer decrypted. We evaluate the accuracy when decrypting one of the layers in Fig. 13. As can be seen, the accuracy still remains at an extremely low level when the model has one layer decrypted. However, as the depth and sensitivity of the layers vary, there are slight differences in the accuracy when decrypting different layers.
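The invariant above can be made concrete with a small sketch. The functions here are hypothetical stand-ins for the real crossbar encrypt/decrypt/compute operations, and the sequential loop simplifies away the pipelining that RES actually performs:

```python
# Illustrative RES property: weights are decrypted layer by layer, so at
# any instant at most one layer sits in plaintext.

def run_inference(layers, x, decrypt, compute, encrypt):
    for layer in layers:
        decrypt(layer)          # restore this layer's keys -> plaintext
        x = compute(layer, x)   # compute with the plaintext weights
        encrypt(layer)          # re-apply keys before the next layer
    return x

# Toy driver that checks the "at most one plaintext layer" invariant.
state = {name: "encrypted" for name in ("conv1", "conv2", "fc")}

def decrypt(l): state[l] = "plain"
def encrypt(l): state[l] = "encrypted"

def compute(l, x):
    assert sum(v == "plain" for v in state.values()) <= 1
    return x + 1                # placeholder computation

out = run_inference(list(state), 0, decrypt, compute, encrypt)
print(out, state)               # all layers end up re-encrypted
```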

6.4 Encryption Time Overhead

We further evaluate the encryption/decryption latency and the time complexity of the key generation process.

Pre-Deployment: Key Generation Complexity. To distinguish the significance of all the weights, gradients are utilized to obtain the keys. Therefore, to generate the gradients, we need to sample a subset of data from the training or validation dataset to execute the forward and backward propagations. There exists a trade-off between the sampled dataset size and the encryption effectiveness. On one hand, as each data instance must be processed, the complexity of gradient generation increases linearly with the amount of sampled data. On the other hand, to obtain more generalized gradients that distinguish the important weights, the sampled dataset should generalize well, which requires a sufficient number of data instances. As shown in Fig. 14, when increasing the data amount, the encryption effectiveness becomes better as the accuracy drops more stably, at the cost of a longer key generation time. According to the experimental observations, 10,000 samples are adequate to obtain generalized gradients for the ImageNet dataset. We test the generation time of various models on an NVIDIA GeForce RTX 2080 Ti platform. As Table 5 shows, generating the keys for one model consumes from 22.5 seconds to 77.8 seconds when using the SFGE scheme. i-SFGE consumes much more time because the keys are generated one by one, but still finishes within 1 hour.
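The selection step this process feeds can be sketched as follows: accumulate the weight gradients over the sampled subset, then take the N largest-magnitude positions per layer as key locations. This is our hedged reading of the text; the function name, shapes, and the random gradients are illustrative stand-ins:

```python
import numpy as np

def select_keys(per_sample_grads, n_keys=20):
    """Pick the n_keys flat positions with the largest accumulated
    |gradient| in one layer; per_sample_grads is one gradient array
    per sampled data instance, all with the layer's weight shape."""
    g = sum(per_sample_grads)             # accumulate over samples
    flat = np.abs(g).ravel()
    return np.argsort(flat)[-n_keys:]     # top-N |gradient| positions

# Hypothetical gradients for a 64x64x3x3 layer over 8 sampled inputs.
rng = np.random.default_rng(0)
grads = [rng.normal(size=(64, 64, 3, 3)) for _ in range(8)]
keys = select_keys(grads, n_keys=20)
print(keys.shape)                         # 20 flat key positions
```

More samples make the accumulated gradient less noisy, which is the generalization/cost trade-off Fig. 14 measures.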

Runtime: Encryption Latency. Considering the layer-by-layer scheduling, the latency is mainly introduced by the decryption of the first layer of the NN when the system starts, because RES performs the decryption of the following layers during runtime to hide the additional latency. As

TABLE 4
Experimental Results of the Encryption for Different Neural Network Models

Fig. 13. The accuracy of encrypted ResNet-18 when decrypting one of the layers. The "baseline" represents the accuracy of the fully encrypted model. The tag of the x-axis represents the name of the decrypted layer, and the bar shows the accuracy.

Fig. 14. The average validation accuracy versus the number of samples when generating the keys for the floating-point ResNet-18 model. The encryption configurations are N = 20, ε = 0.2. Each point is averaged over the results of 10 independent experiments. The data are randomly sampled from the ImageNet training dataset, which includes 1.28 million images in total.

CAI ET AL.: ENABLING SECURE NVM-BASED IN-MEMORY NEURAL NETWORK COMPUTING BY SPARSE FAST GRADIENT ENCRYPTION 1607



presented in NVSim [43], the write latencies of PCM and RRAM reach 416.2 ns and 100.6 ns, respectively. The decryption only incurs one write on each encrypted weight; therefore, the overall decryption latency is expected to be 416.2N ns (PCM-based) and 100.6N ns (RRAM-based), respectively. As shown in Table 5, taking ResNet-101 as an example, the encryption amount of the first layer is min(7 x 7 x 3 x 64 x 0.1%, 20) = 10. Thus, the latency is 4.16 μs and 1.01 μs, respectively. Such latency overhead is negligible in most applications; e.g., the video frame rate usually ranges from 30 to 200 frames per second, consuming at least 5 ms per frame. Compared to 5 ms/frame, the encryption latency (< 5 μs) takes less than 0.1 percent of the time.
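The arithmetic above can be reproduced directly (rounding the 0.1% budget up to a whole weight is an assumption on our part; the write latencies are the NVSim figures quoted in the text):

```python
import math

# First-layer decryption-latency estimate for ResNet-101.
write_ns = {"PCM": 416.2, "RRAM": 100.6}              # per-write latency (NVSim)
n_first = min(math.ceil(7 * 7 * 3 * 64 * 0.001), 20)  # encrypted weights: 10

for mem, t_ns in write_ns.items():
    latency_us = n_first * t_ns / 1e3
    share = n_first * t_ns / 5e6                      # vs. a 5 ms video frame
    print(f"{mem}: {latency_us:.2f} us, {100 * share:.3f}% of a frame")
```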

7 FURTHER DISCUSSION

While we have discussed and demonstrated the advantages of our solution and its application to NVM-based in-memory NN computing systems, we still need to address the potential application to CMOS-based accelerators and the drawbacks of our solution.

The Drawback: Good Weight Initialization. Although we have demonstrated the effectiveness and security of the proposed solution, there are also drawbacks that should be considered. Because only an extremely small proportion of the weights is modified, the structure and distribution of the weights are preserved; thus, the encrypted models still provide a good weight initialization. As shown in Fig. 15, when models are initialized with the encrypted weights, the losses drop much faster than with random initialization. This means that the attackers can still extract some useful information from the stolen encrypted models, as they can help to accelerate and improve the convergence of training on similar or identical tasks. Our future work will further improve the encryption effectiveness and make the encrypted models closely equivalent to randomly initialized models.

Potential Application to CMOS Accelerators. The main motivation of our work starts from the vulnerability of NVM systems rendered by their non-volatility. However, volatile-memory-based accelerators may also be threatened by adversaries. Taking FPGAs as an example, when powered down, the configuration file (bitstream), weight file, and other necessary files are stored in non-volatile memory (e.g., disk or Flash). When booting the system, the bitstream is loaded onto the FPGA chip for configuration, and the other files are loaded into a faster, volatile memory (usually DRAM) to accommodate the intermediate data and weights during computing, because the on-chip memory capacity is far from sufficient. However, we generally regard only the FPGA chips as trusted, while both the peripheral Flash and DDR are untrusted. Current security approaches protect the designs and models by applying encryption to the bitstream [44] and model files in Flash, while the weights remain in plaintext form in DRAM, which is still vulnerable to memory attacks such as the cold-boot attack [25]. To ensure a higher security level, the data in DDR should also be encrypted. While the bitstream decryption is one-shot, the weight decryption must be done repeatedly, as the FPGA cores only fetch a small proportion of the weights each time. This brings the challenge that the encryption incurs considerable latency. Therefore, reducing the encryption amount is also beneficial for reducing the encryption/decryption overhead, and our solution can also be transferred to protect model security in such accelerators.

8 CONCLUSION

We have modeled the NN encryption problem and presented an efficient protection solution to thwart the confidentiality attacks that threaten the privacy of well-trained NN models deployed in CIM- and NVM-based computing systems. An efficient framework has been proposed, based on the SFGE method for the efficient encryption of the weights and the RES scheme for the runtime scheduling of the weight encryption. Further improvements have also been made to enhance the encryption strength and efficiency by incrementally generating the keys or performing bit-wise encryption. Experimental results have demonstrated the effectiveness and robustness of our solution.

ACKNOWLEDGMENTS

This work was supported in part by the National Key Research and Development Program of China under Grant 2018YFB0105000 and Grant 2017YFA0207600, in part by the National Natural Science Foundation of China under Grant U19B2019, Grant 61832007, Grant 61621091, and Grant 61720106013, in part by the Beijing National Research Center for Information Science and Technology (BNRist), and in part by the Beijing Innovation Center for Future Chips. The

TABLE 5
Encryption Time Overhead Evaluation

Model      | Key Num. | Generation Time     | Runtime Latency
           |          | SFGE    | i-SFGE    | PCM     | RRAM
ResNet-18  | 20       | 22.5 s  | 434.8 s   | 4.16 μs | 1.01 μs
ResNet-50  | 30       | 50.6 s  | 1447.9 s  | 4.16 μs | 1.01 μs
ResNet-101 | 20       | 77.8 s  | 1534.9 s  | 4.16 μs | 1.01 μs
VGG-16     | 30       | 73.6 s  | 2246.8 s  | 416 ns  | 101 ns

Fig. 15. The loss curves of training models based on randomly initialized weights, pre-trained weights, or the encrypted weights. Top: training ResNet-18 on the ImageNet dataset based on randomly initialized weights and the encrypted weights. Bottom: training SSD on the VOC2007 dataset based on the VGG-16 pretrained weights and the encrypted weights.




work of Xiaoming Chen was supported by the Beijing Academy of Artificial Intelligence (BAAI).

REFERENCES

[1] Y. LeCun et al., "Deep learning," Nature, vol. 521, no. 7553, 2015, Art. no. 436.

[2] M. M. Waldrop, "The chips are down for Moore's law," Nat. News, vol. 530, no. 7589, 2016, Art. no. 144.

[3] S. Ambrogio et al., "Equivalent-accuracy accelerated neural-network training using analogue memory," Nature, vol. 558, no. 7708, 2018, Art. no. 60.

[4] P. Chi et al., "PRIME: A novel processing-in-memory architecture for neural network computation in ReRAM-based main memory," in Proc. ACM/IEEE 43rd Annu. Int. Symp. Comput. Archit., 2016, vol. 43, pp. 27–39.

[5] L. Xia et al., "Technological exploration of RRAM crossbar array for matrix-vector multiplication," J. Comput. Sci. Technol., vol. 31, no. 1, pp. 3–19, 2016.

[6] M. Cheng et al., "TIME: A training-in-memory architecture for memristor-based deep neural networks," in Proc. 54th ACM/EDAC/IEEE Des. Autom. Conf., 2017, pp. 1–6.

[7] S. Chhabra and Y. Solihin, "i-NVMM: A secure non-volatile main memory system with incremental encryption," in Proc. 38th Annu. Int. Symp. Comput. Archit., 2011, pp. 177–188.

[8] K. Beckmann et al., "Nanoscale hafnium oxide RRAM devices exhibit pulse dependent behavior and multi-level resistance capability," MRS Advances, vol. 1, pp. 1–6, 2016.

[9] C. H. Cheng, A. Chin, and F. S. Yeh, "Novel ultra-low power RRAM with good endurance and retention," in Proc. Symp. VLSI Technol., 2010, pp. 85–86.

[10] Y. Cai et al., "Long live TIME: Improving lifetime for training-in-memory engines by structured gradient sparsification," in Proc. 55th ACM/ESDA/IEEE Des. Autom. Conf., 2018, pp. 1–6.

[11] M. K. Qureshi, J. Karidis, M. Franceschini, V. Srinivasan, L. Lastras, and B. Abali, "Enhancing lifetime and security of phase change memories via start-gap wear leveling," in Proc. 42nd Annu. IEEE/ACM Int. Symp. Microarchit., 2009, pp. 14–23.

[12] K. Simonyan et al., "Very deep convolutional networks for large-scale image recognition," in Proc. Int. Conf. Learn. Representations, 2015.

[13] W. Li et al., "P3M: A PIM-based neural network model protection scheme for deep learning accelerator," in Proc. 24th Asia South Pacific Des. Autom. Conf., 2019, pp. 633–638.

[14] C. Orlandi et al., "Oblivious neural network computing via homomorphic encryption," EURASIP J. Inf. Secur., vol. 2007, no. 1, 2007, Art. no. 037343.

[15] X. Yuan, P. He, Q. Zhu, and X. Li, "Adversarial examples: Attacks and defenses for deep learning," IEEE Trans. Neural Netw. Learn. Syst., vol. 30, no. 9, pp. 2805–2824, Sep. 2019.

[16] N. Akhtar and A. Mian, "Threat of adversarial attacks on deep learning in computer vision: A survey," IEEE Access, vol. 6, pp. 14410–14430, 2018.

[17] W. Li et al., "Hu-Fu: Hardware and software collaborative attack framework against neural networks," in Proc. IEEE Comput. Soc. Annu. Symp. VLSI, 2018, pp. 482–487.

[18] Y. Zhao et al., "Memory trojan attack on neural network accelerators," in Proc. Des. Autom. Test Eur. Conf. Exhib., 2019, pp. 1415–1420.

[19] Y. Liu, L. Wei, B. Luo, and Q. Xu, "Fault injection attack on deep neural network," in Proc. IEEE/ACM Int. Conf. Comput.-Aided Des., 2017, pp. 131–138.

[20] W. Hua, Z. Zhang, and G. E. Suh, "Reverse engineering convolutional neural networks through side-channel information leaks," in Proc. 55th ACM/ESDA/IEEE Des. Autom. Conf., 2018, pp. 1–6.

[21] X. Hu et al., "DeepSniffer: A DNN model extraction framework based on learning architectural hints," in Proc. 25th Int. Conf. Architectural Support Program. Lang. Operating Syst., 2020, pp. 385–399.

[22] M. Yan, C. W. Fletcher, and J. Torrellas, "Cache telepathy: Leveraging shared resource attacks to learn DNN architectures," in Proc. 29th USENIX Secur. Symp., 2020, pp. 2003–2020.

[23] L. Batina et al., "CSI neural network: Using side-channels to recover your artificial neural network information," 2018, arXiv:1810.09076.

[24] P. Roberts, "MIT: Discarded hard drives yield private info," ComputerWorld, vol. 16, 2003. [Online]. Available: https://www.computerworld.com/article/2580013/mit–discarded-hard-drives-yield-private-info.html

[25] J. A. Halderman et al., "Lest we remember: Cold-boot attacks on encryption keys," Commun. ACM, vol. 52, no. 5, pp. 91–98, 2009.

[26] I. J. Goodfellow et al., "Explaining and harnessing adversarial examples," in Proc. Int. Conf. Learn. Representations, 2015.

[27] S. Cho and H. Lee, "Flip-N-Write: A simple deterministic technique to improve PRAM write performance, energy and endurance," in Proc. 42nd Annu. IEEE/ACM Int. Symp. Microarchit., 2009, pp. 347–357.

[28] N. H. Seong et al., "Security refresh: Prevent malicious wear-out and increase durability for phase-change memory with dynamically randomized address mapping," ACM SIGARCH Comput. Archit. News, vol. 38, no. 3, pp. 383–394, 2010.

[29] J. Daemen and V. Rijmen, The Design of Rijndael: AES - The Advanced Encryption Standard. Berlin, Germany: Springer, 2013.

[30] H. Lipmaa et al., "CTR-mode encryption," in Proc. 1st NIST Workshop Modes Operation, 2000, vol. 39, pp. 1–4.

[31] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 770–778.

[32] Y. Lin et al., "Deep gradient compression: Reducing the communication bandwidth for distributed training," in Proc. Int. Conf. Learn. Representations, 2018.

[33] W.-H. Chen et al., "A 65 nm 1 MB nonvolatile computing-in-memory ReRAM macro with sub-16 ns multiply-and-accumulate for binary DNN AI edge processors," in Proc. IEEE Int. Solid-State Circuits Conf., 2018, pp. 494–496.

[34] Y. Cai, T. Tang, L. Xia, B. Li, Y. Wang, and H. Yang, "Low bit-width convolutional neural network on RRAM," IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., vol. 39, no. 7, pp. 1414–1427, Jul. 2020.

[35] J. Qiu et al., "Going deeper with embedded FPGA platform for convolutional neural network," in Proc. ACM/SIGDA Int. Symp. Field-Programmable Gate Arrays, 2016, pp. 26–35.

[36] X. Si et al., "24.5 A twin-8T SRAM computation-in-memory macro for multiple-bit CNN-based machine learning," in Proc. IEEE Int. Solid-State Circuits Conf., 2019, pp. 396–398.

[37] N. P. Jouppi et al., "In-datacenter performance analysis of a tensor processing unit," in Proc. ACM/IEEE 44th Annu. Int. Symp. Comput. Archit., 2017, pp. 1–12.

[38] Z. Zhu et al., "A configurable multi-precision CNN computing framework based on single bit RRAM," in Proc. 56th ACM/IEEE Des. Autom. Conf., 2019, pp. 1–6.

[39] T. Tang, L. Xia, B. Li, Y. Wang, and H. Yang, "Binary convolutional neural network on RRAM," in Proc. 22nd Asia South Pacific Des. Autom. Conf., 2017, pp. 782–787.

[40] W. Liu et al., "SSD: Single shot multibox detector," in Proc. Eur. Conf. Comput. Vis., 2016, pp. 21–37.

[41] M. Everingham et al., "The pascal visual object classes (VOC) challenge," Int. J. Comput. Vis., vol. 88, no. 2, pp. 303–338, 2010.

[42] P. Yao et al., "Fully hardware-implemented memristor convolutional neural network," Nature, vol. 577, no. 7792, pp. 641–646, 2020.

[43] X. Dong, C. Xu, Y. Xie, and N. P. Jouppi, "NVSim: A circuit-level performance, energy, and area model for emerging nonvolatile memory," IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., vol. 31, no. 7, pp. 994–1007, Jul. 2012.

[44] Xilinx, "Using encryption and authentication to secure an UltraScale/UltraScale+ FPGA bitstream," Accessed: Oct. 12, 2018. [Online]. Available: https://www.xilinx.com/support/documentation/application_notes/xapp1267-encryp-efuse-program.pdf

Yi Cai received the BS degree in electronic engineering from Tsinghua University, Beijing, China, in 2017. He is currently working toward the PhD degree with the Department of Electronic Engineering, Tsinghua University, Beijing, China. His research mainly focuses on deep learning acceleration and emerging non-volatile memory technology.




Xiaoming Chen (Member, IEEE) received the BS and PhD degrees in electronic engineering from Tsinghua University, Beijing, China, in 2009 and 2014, respectively. He is now an associate professor with the Institute of Computing Technology, Chinese Academy of Sciences, China. His current research interests include electronic design automation and computer architecture design for processing-in-memory systems. He served on the Organization Committee of the Asia and South Pacific Design Automation Conference (ASP-DAC) 2020 and also served on the Technical Program Committees (TPCs) of the Design Automation Conference 2020, International Conference on Computer Aided Design (ICCAD) 2019, ASP-DAC 2019, International Conference on VLSI Design 2019 & 2020, Asian Hardware Oriented Security and Trust Symposium (AsianHOST) 2018 & 2019, and IEEE Computer Society Annual Symposium on VLSI (ISVLSI) 2018 & 2019. He was a recipient of the 2015 European Design and Automation Association (EDAA) Outstanding Dissertation Award and the 2018 DAMO Academy Young Fellow Award.

Lu Tian received the BS and PhD degrees in electronic engineering from Tsinghua University, China, in 2011 and 2017, respectively. From 2017 to 2019, she was a post-doc with the NICS Lab at Tsinghua University, China. Since 2019, she has been a researcher at the AI group, Xilinx Inc., China. Her current research interests include pattern recognition and machine learning.

Yu Wang (Senior Member, IEEE) received the BS degree and PhD degree (with honor) from Tsinghua University, Beijing, China, in 2002 and 2007, respectively. He is currently a tenured professor and chair with the Department of Electronic Engineering, Tsinghua University, China. His research interests include application-specific hardware computing, parallel circuit analysis, and power/reliability-aware system design methodology. He has authored and coauthored more than 250 papers in refereed journals and conferences. He has received the Best Paper Award in ASPDAC 2019, FPGA 2017, NVMSA17, and ISVLSI 2012, and the Best Poster Award in HEART 2012, with nine Best Paper Nominations. He is a recipient of the DAC Under-40 Innovator Award, in 2018. He served as TPC chair for ICFPT 2019, ISVLSI 2018, and ICFPT 2011, finance chair of ISLPED 2012-2016, and program committee member for leading conferences in the EDA/FPGA area. Currently he is serving as an associate editor for the IEEE Transactions on Circuits and Systems for Video Technology, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, and ACM Transactions on Embedded Computing Systems. He is an ACM senior member. He is the co-founder of Deephi Tech (acquired by Xilinx in 2018), which is a leading deep learning computing platform provider.

Huazhong Yang (Fellow, IEEE) received the BS degree in microelectronics, in 1989, and the MS and PhD degrees in electronic engineering, in 1993 and 1998, respectively, all from Tsinghua University, Beijing, China. In 1993, he joined the Department of Electronic Engineering, Tsinghua University, Beijing, China, where he has been a professor since 1998. He was awarded Distinguished Young Researcher by NSFC, in 2000, Cheung Kong Scholar by the Chinese Ministry of Education (CME), in 2012, the Science and Technology Award first prize by the China Highway and Transportation Society, in 2016, and the Technological Invention Award first prize by CME, in 2019. He has been in charge of several projects, including projects sponsored by the national science and technology major project, the 863 program, NSFC, and several international research projects. He has authored and co-authored more than 500 technical papers, 7 books, and more than 180 granted Chinese patents. His current research interests include wireless sensor networks, data converters, energy-harvesting circuits, nonvolatile processors, and brain-inspired computing. He has also served as the chair of the Northern China ACM SIGDA Chapter since 2014, general co-chair of ASPDAC'20, navigating committee member of AsianHOST'18, and TPC member for ASP-DAC'05, APCCAS'06, ICCCAS'07, ASQED'09, and ICGCS'10.


