chet: compiler and runtime for homomorphic ... - arxiv

CHET: Compiler and Runtime for Homomorphic Evaluation of TensorPrograms

Roshan Dathathri*, Olli Saarikivi†, Hao Chen†, Kim Laine†, Kristin Lauter†, Saeed Maleki†, MadanlalMusuvathi†, and Todd Mytkowicz†

*Department of Computer Science, University of Texas at Austin, [email protected]

†Microsoft Research, USA{olsaarik,haoche,kilai,klauter,saemal,madanm,toddm}@microsoft.com

AbstractFully Homomorphic Encryption (FHE) refers to a set of

encryption schemes that allow computations to be applieddirectly on encrypted data without requiring a secret key.This enables novel application scenarios where a client cansafely offload storage and computation to a third-party cloudprovider without having to trust the software and the hard-ware vendors with the decryption keys. Recent advances inboth FHE schemes and implementations have moved suchapplications from theoretical possibilities into the realm ofpracticalities.

This paper proposes a compact and well-reasoned inter-face called the Homomorphic Instruction Set Architecture(HISA) for developing FHE applications. Just as the hard-ware ISA interface enabled hardware advances to proceedindependent of software advances in the compiler and lan-guage runtimes, HISA decouples compiler optimizations andruntimes for supporting FHE applications from advancementsin the underlying FHE schemes.

This paper demonstrates the capabilities of HISA by build-ing an end-to-end software stack for evaluating neural networkmodels on encrypted data. Our stack includes an end-to-endcompiler, runtime, and a set of optimizations. Our approachshows generated code, on a set of popular neural networkarchitectures, is faster than hand-optimized implementations.

1. IntroductionMany applications benefit from storing and computing on datain the cloud. Cloud providers centralize and manage computeand storage resources, which lets developers focus on the logicof their application and not on how to deploy and manage theinfrastructure that surrounds it. A large class of applications,however, have strict data secrecy and privacy requirementsranging from government regulations to gaining competitiveadvantage from consumer protection. In these cases, data se-crecy and privacy issues end up limiting otherwise convenientcloud adoption.

Fully Homomorphic Encryption (FHE) allows computa-tions to be applied directly on encrypted data and thus enables

a wide variety of cloud applications that may not be availabletoday. A cloud based FHE application encrypts its plaintextinput a and stores it in encrypted form in cloud storage as α .At a later time, the application requests the cloud to perform anFHE operation f on the encrypted data α and to store the (stillencrypted) result β = f (α). The application can downloadβ from cloud storage at any point and decrypt the result toobtain a plaintext result b = f (a).

FHE provides a simple trust model: the owner of the datadoes not need to trust the cloud software provider or the hard-ware vendor. Homomorphic encryption schemes rely on well-studied mathematical hardness assumptions, i.e. if there is apolynomial time algorithm for breaking the encryption thenthere is also a polynomial time algorithm for solving these hardmathematical problems. Other cryptographic solutions suchas Secure Multi-Party Computation (MPC) [29, 30] requirecomplicated trust models and typically have larger communi-cation costs. Non-cryptographic technologies such as secureenclaves [24] (such as Intel SGX [18]) requires one to trust ahardware vendor.

Of course, there is no free lunch: FHE is usually consid-ered impractical due to its performance overhead. However,recent advances in FHE schemes and implementations dramat-ically improve performance, which makes many applicationspractical today. For example, initial FHE schemes [22] onlyoperated on bits, requiring logic circuits to perform arithmetic.Modern FHE schemes [10, 20, 16] organically support integerand fixed-precision arithmetic dramatically reducing the sizeof the circuits required. Similarly, optimized implementationsof FHE schemes such as [27] have further reduced the cost ofperforming individual FHE operations.

Despite these improvements, FHE operations are still sev-eral orders of magnitude slower than their plaintext counter-parts. Nevertheless, certain privacy-sensitive applications arewilling to pay the cost. For instance, a bank based in Europeis unlikely to use secure enclaves for sensitive computation asthat requires trusting Intel, an US company. For such appli-cations, it is important to build FHE applications that are asefficient as feasible, despite plaintext overhead.

arX

iv:1

810.

0084

5v1

[cs

.LG

] 1

Oct

201

8

Today, building an efficient FHE application still requiresmanual and error-prone work. FHE encryption schemes usu-ally require a careful setting of encryption parameters thattrade off performance and security. Moreover, current FHEschemes introduce noise when operating on encrypting data.Managing the growth of this noise while satisfying the preci-sion requirements of the application currently requires labo-rious effort from the developer. Also, popular FHE schemessupport vectorized operations (called batching in the FHE liter-ature) with very large vector sizes (of thousands). Applicationefficiency crucially relies on maximizing the parallelism avail-able in the application to use the vector capabilities. Lastly,FHE schemes limit the operations allowed in an application(i.e., no branches or control flow) and thus require a devel-oper to creatively structure their program to work around suchlimitations.

In many respects, programming FHE applications directlyon FHE schemes today is akin to low-level assembly program-ming on modern architectures. Our central hypothesis behindthis paper is future FHE applications will benefit from a com-piler and runtime that targets a compact and well-reasonedinterface, which we call the Homomorphic Instruction SetArchitecture (HISA). Just as the hardware ISA interface en-abled hardware advances to proceed independent of softwareadvances in the compiler and language runtimes, we posit thatHISA isolates future advances in FHE schemes from the com-piler optimizations and runtime improvements for supportingFHE applications.

To demonstrate and evaluate the utility behind such an in-terface, this paper focuses on the task of fully-homomorphicdeep-neural network (DNN) inference. This application isparticularly suited as the first domain of study. First, theDNN inference computation does not involve discrete controlflow which is a perfect match for FHE. While DNN inferenceinvolve non-polynomial activation functions, they can be effec-tively replaced with polynomail approximations [23, 13]. Theability of DNNs to tolerate errors provides us the flexibility ofquantizing and reducing the precision of the computation forperformance.

This paper presents, to the best of our knowledge, the firstcompiler and runtime called CHET for developing FHE DNNinference. CHET guarantees sound semantics by selecting theright encryption parameters that maximize performance whileguaranteeing security and application-intended precision ofthe computed results. CHET builds on the recent encyrptionscheme [16] that provides approximate arithmetic to emulatefixed-point arithmetic required for DNN inference. Finally,CHET explores a large space of optimization choices, suchas the multiple ways to pack data into ciphertexts as well asmultiple ways to implement computational kernels for eachlayout, that are currently done manually by experts today.

In addition to a compiler, CHET includes a runtime, akinto the linear algebra libraries used in unencrypted evaluationof neural networks. We have developed a set of layouts and

a unified metadata representation for them. For these newlayouts we have developed a set of computational kernels thatimplement the common operations found in convolutionalneural networks (CNNs). All of these kernels were designedto use the SIMD-style capabilities of modern FHE schemes.

We evaluate CHET with a set of real-world CNN mod-els and show how different optimization choices available inour compiler can significantly improve latencies. We furtherdemonstrate how the optimized kernel implementations inCHET allow neural networks of unprecedented depth to behomomorphically evaluated.

In summary our main contributions are as follows:• A homomorphic instruction set architecture (HISA) that

cleanly abstracts and exposes features of FHE libraries.• A set of tiled layouts and a metadata format for packing

tensors into ciphertexts.• Novel kernel implementations of homomorphic tensor oper-

ations.• A compiler and runtime for homomorphic evaluation of

tensor programs, known as CHET.• An evaluation of CHET on a set of real-world CNN models,

including the homomorphic evaluation of the deepest CNNto date.

2. Background

This Section provides a concise background of homomorphicencryption with of its limitations and capabilities, followed bya brief introduction into tensor programs.

2.1. Homomorphic Encryption

We will begin by describing the properties of leveled fullyhomomorphic encryption (FHE) that are essential for a pro-grammer. We first introduce general concepts followed by adescription of the HEAAN scheme [16], which is used in ourevaluations.

Leveled FHE can evaluate an arithmetic circuit of a pre-determined multiplicative depth L. The reason for such arestriction is that typically in homomorphic encryption eachciphertext carries a inherent “noise” which increases whencomputations are performed. Once the noise reaches a certainbound the ciphertext becomes corrupted and the message can-not be recovered anymore, even with the correct decryptionkey. In Leveled FHE, the parameters of the encryption schemeare chosen so that the noise remains below this threshold forcircuits of depth up to the chosen value L. Unfortunately, dueto the increase in the parameters of the scheme, the complexityof each homomorphic operation also grows with L.

Encryption Keys: FHE schemes have two sets of keys: publickeys and private keys. Private keys are used for decryptionand public keys are used for encryption and in homomorphicoperations. As long as a user does not reveal their private

2

key, FHE schemes leak no information about the user’s data.1

Public keys, on the other hand, are meant to be shared forsecure computation.Supported Operations: Existing FHE schemes on integerssupport only two basic algebraic operations: addition andmultiplication. The operands can be any combination of ci-phertexts and plaintexts. Techniques for fixed point operandsare discussed later in this Section. However, all other op-erations need to be approximated in terms of additions andmultiplications. For example, functions such as exp, log, andtanh can be approximated by polynomials using Taylor expan-sion. On the other hand, branching is not possible in FHE andthe only possible solution is executing all cases and maskingthe result.Noise and Multiplicative Depth: We denote the amount ofnoise associated with ciphertext α by noise(α). The amountof noise for a ciphertext depends on its operands and the op-erator and excessive noise completely destroys the encryptedmessage. Any freshly encrypted value has a small amount ofnoise. For any given ciphertexts, α and β and plaintext c, thenoise contribution is roughly as follows:• noise(α× c)≈ noise(α + c)≈ noise(α)• noise(α +β )≈max{noise(α),noise(β )}• noise(α×β )≈max{noise(α),noise(β )}·rwhere r is a constant number that depends on parameter se-lection of the encryption. The multiplicative noise increasebetween two ciphertexts greatly affects the design of a FHEcomputation. Multiplicative depth of a ciphertext α is the max-imum number of ciphertext multiplications that contributedto the computation of α and they are all a part of a chain.For example, for ciphertext α = β ∗ (γ ∗ γ +a+λ ) where β ,γ and λ are ciphertexts and a is a plaintext, the multiplica-tive depth is 2. For a multiplicative depth of L, the noise isO(rL). The maximum noise tolerance can be adjusted by theencryption parameters but that negatively impacts computa-tional complexity of operations. Therefore, it is crucial to keepthe multiplicative depth minimal to avoid the costs.Floating Point: Due to the nature of the homomorphic encryp-tion schemes, it is not easy to support floating point operations.Instead, we can use fixed point arithmetic by combining in-tegers with a scaling factor. To avoid accumulating scalingfactor, fixed point arithmetic requires scaling the result downafter each multiplication. However, the BFV/BGV schemesdo not support division, and hence applications using theseschemes will see an accumulation of scaling factors. Thisrequires increasing space to store the intermediate results, andthe required encryption parameters becomes exponential in themultiplicative depth of the circuit, which makes deep circuitevaluations infeasible. Section 2.2 gives background on anapproximate encryption scheme that avoids this problem.FHE Vectorization: FHE multiplication and addition are ex-tremely costly and usually not practical if one pair of operands

1Encryption schemes in general base their security on the hardness ofsome underlying problem, e.g., learning with errors [28].

are considered at a time. Fortunately, in the BFV [20],BGV [10] and HEAAN schemes, multiple ciphertexts canbe packed into a single ciphertext to benefit from Single In-struction Multiple Data (SIMD) parallelism [21]. This tech-nique allows packing multiple plaintext values into a singleciphertext. Given two α and β , that encode the plaintext vec-tors (a1,a2, . . . ,an) and (b1,b2, . . . ,bn), respectively, α×β orα +β correspond to elementwise multiplication and additionof the underlying individual ai and bi. Similar to shufflinginstructions in the SIMD units of modern processors, the ele-ments of a vector can be rotated in a packed ciphertext. How-ever, random access is not directly supported. Instead randomaccess can be implemented by multiplying with a plaintextmask, followed by a rotation. This unfortunately introducesadditional multiplicative depth and should thus be avoided ifpossible.Bootstrapping: True Fully Homomorphic Encryption startswith a Leveled FHE scheme and introduces an additional noisemanagement technique traditionally referred to as bootstrap-ping. Bootstrapping reduces noise, and when performed fre-quently enough it allows in principle an unlimited number ofoperations to be performed. A recent work by Cheon et al. [15]describes a bootstrapping algorithm for the HEAAN [16]scheme, and best results takes about 1 second to bootstrapa single number. Therefore, we leave using bootstrapping forfuture work once it is more practical.

2.2. HEAAN

There are a few state-of-the-art leveled FHE schemes, forexample the Brakerski-Gentry-Vaikuthanathan scheme [10]and the Fan-Vercauteren variant of the Brakerski scheme [20],which we will refer to as BGV and BFV. These two schemesare similar in many aspects. In particular, both scheme sup-ports vector plaintext type Fk where F is some finite field, andvectorized additions/multiplications. More recently, Cheon etal. proposed the HEAAN scheme [16], which supports betterapproximate computation over encrypted data.

The HEAAN scheme uses a novel idea to mix the encryp-tion noise with the message. Then, one can use the modulusswitching techniques for BGV scheme to achieve the function-ality of “scaling down”, i.e., given an encryption of m and apublic integer K, output an encryption of m/K + ε for someε > 0 of fixed size. This gives the ability to control the scalingfactor and hence the precision throughout the homomorphiccircuit evaluation, and allows encryption parameters that arelinear in multiplicative depth. However, the HEAAN schemeonly supports approximate computations: first, even a freshlyencrypted ciphertext decrypts to a perturbed message; also,homomorphic operations insert noise into the result. There-fore, one needs to carefully control these noises in order tomaintain a desired precision of the computation output.

In the HEAAN scheme, two operations require additionalpublic key materials. After each multiplication between cipher-texts, the output ciphertext is actually larger than the two input

3

ciphertexts, and it’s common to run an operation called re-linearization to reduce the size of output. Both re-linearizationand rotation require certain evaluation keys. These keys canbe generated from the secret key, and each rotation key canperform rotation by a fixed amount. In the public implementa-tion of HEAAN, by default only power-of-2 rotation keys aregenerated, and general rotations are written as a compositionof power-of-2 rotations. This presents an interesting problemof choosing the best rotation keys for a given program, whichbalances evaluation time with storage/communication costs ofadditional rotation keys.

To instantiate the HEAAN scheme, we need to select twointeger parameters N (the degree of polynomial modulus, apower of 2) and Q (the ciphertext modulus). The security ofthe scheme is based on the ring learning-with-errors problem,and for a fixed Q, the security level increases with N. Oneadditional complication is: the noise generated by each ho-momorphic operation increases with N, which could increaserequired precision and thus Q, possibly requiring a larger Nthan the initial guess.

The number of elements that can be packed into a singleciphertext scales linearly with the parameter N. In particular,HEAAN supports encrypting N/2 complex numbers in a sin-gle ciphertext. We note that the vector encryption in HEAANcomes with an additional encoding error, i.e., even a freshencoding of a vector will lose some precision of the inputdue to a rounding process. In HEAAN, the encoding error isO(√

N), hence we need to scale up the input numbers to makesure they are sufficiently larger compared to this error.

2.3. Tensor Programs

A tensor is a multidimensional array with regular dimensions.A tensor t has a data type dtype(t), which defines the repre-sentation of each element, and a shape shape(t), which is alist of the tensor’s dimensions. For example, a single 32 by32 image with 3 channels of color values between 0 and 255could be represented by a tensor I with dtype(I) = int8 andshape(I) = [3,32,32].

In machine learning, neural networks are commonly ex-pressed as programs operating on tensors. Some commontensor operations in neural networks include:Convolution calculates a cross correlation between an inputand a filter tensor.

Matrix multiplication is used to represent neurons that areconnected to all inputs, which are found in dense layerstypically at the end of a convolutional neural network.

Pooling combines adjacent elements using a reduction opera-tion such as maximum or average.

Elementwise operations such as ReLUs and batch normal-ization.

Reshaping reiterprets the shape of a tensor, for example, toflatten preceding a matrix multiplication.Tensor programs are used as models for large datasets in

regression and classification tasks. To define a space of models

Encryptor &

Decryptor

OptimizedHomomorphicTensor Circuit

CHETCompilerTensor Circuit

Schema of Input & Weights

Desired OutputPrecision

Figure 1: Overview of the CHET system at compile-time.

some constant tensors in the program are designated as train-able weights. Since operations are differentiable (almost every-where), backpropagation can be used to update weights. Thisis used as a step in an optimization algorithm, e.g., stochasticgradient descent.

Tensor programs have attractive properties for executionon homomorphic encryption: typically no branching, veryregular data access patterns making vectorization possible. Weconsider the tensor program as a circuit of tensor operationsand this circuit is a Directed Acyclic Graph (DAG).

3. Software Stack for Homomorphic Evaluationof Tensor Programs

There are many open problems in building a full software stackfor FHE applications and the systems, OS, and PL communitycan help bridge the gap between an application programmerand the underlying crypto library that implements secure prim-itives. This section presents an overview of an end-to-endsoftware stack for evaluating tensor programs on homomor-phically encrypted data. The target architecture for this stackis a Fully Homomorphic Encryption (FHE) scheme and ourtensor compiler, CHET, interacts with this scheme using theHomomorphic Instruction Set Architecture (HISA) defined inSection 4. At the top of stack rests the user-provided tensorprogram or circuit. CHET consists of a compiler and runtimeto bridge the gap in-between them. We describe the runtimeand the compiler in detail in Sections 5 and 6, respectively.In this Section, we present an overview of the end-to-endsoftware stack.

Figure 1 shows the overview of the CHET compiler. Inaddition to the tensor circuit, CHET requires the schema ofthe input and weights to the circuit. The schema specifies thedimensions of the tensors as well the floating-point precisionrequired of the values in those tensors. CHET also requiresthe desired floating-point precision of the output of the circuit.Using these constraints, CHET generates an equivalent, op-timized homomorphic tensor circuit as well as an encryptorand decryptor. Both of these executables encode the choicesmade by the compiler to make the homomorphic computationefficient.

To evaluate the tensor circuit on an image, the client firstgenerates a private key and encrypts the image using the en-cryptor (which can also generate private keys) provided by thecompiler, as shown in Figure 2. The encrypted image is then

4

Encryptor &

Decryptor

Image EncryptedImage

Encryptor &

Decryptor

EncryptedPrediction Prediction

Client ClientOptimized

HomomorphicTensor Circuit

CHET Runtime

FHE Scheme

EncryptedImage

EncryptedPrediction

Server

WeightsWeights

Private Key Public Keys Public Keys Private Key

Figure 2: Overview of the CHET system at runtime.

sent to the server along with unencrypted weights and pub-lic keys required for evaluating homomorphic operations (i.e.,multiplication and rotation). The server executes the optimizedhomomorphic tensor circuit generated by the CHET compiler.The homomorphic tensor operations in the circuit are exe-cuted using the CHET runtime, which uses an underlying FHEscheme to execute homomorphic computations on encrypteddata. The circuit produces an encrypted prediction, which itthen sends to the client. The client decrypts the encryptedprediction with its private keys using the compiler generateddecryptor. In this way, the client runs tensor programs likeneural networks on the server without the server being privyto the data, the output (prediction), or any intermediate state.

While Figure 2 presents the flow in CHET for homo-morphically evaluating a tensor program on a single image,CHET supports evaluating the program on multiple images si-multaneously, which is known as batching in image inference.Batching increases the throughput of image inference. In con-trast, CHET ’s focus is decreasing image inference latency. Inthe rest of this paper, we consider a batch size of 1, althoughCHET trivially supports larger batch sizes.

The FHE scheme exposes several policies or heuristics suchas the encryption parameters. Similarly, the CHET runtimealso exposes several policies that could be specific to or inde-pendent of the FHE scheme. This is analogous to Intel MKLlibraries having different implementations of the same oper-ation, where the most performant implementation dependson the size of the input or the target architecture. In the caseof homomorphic computation, these policies not only affectperformance but can also affect security and accuracy. Akey design principle of CHET is the separation of concernbetween the policies of choosing the secure, accurate, andmost efficient homomorphic operation from the mechanismsof executing those policies. Some of these policies are inde-pendent of the input or weights schema, so they are entrustedto the CHET runtime. On the other hand, most policies mayrequire either the input or weights schema or global analysisof the program. In such cases, the CHET compiler choosesthe appropriate policy and the CHET runtime implements themechanisms for that policy. We describe the policies exploredand mechanisms exposed by the CHET runtime (Section 5)and then describe how the CHET compiler (Section 6) choosesthe appropriate policy.

4. Homomorphic Instruction Set ArchitectureThis section introduces a design for a Homomorphic Instruc-tion Set Architecture (HISA), that provides a useful interfaceto fully homomorphic encryption libraries. We have designedthe HISA with the following objectives in mind:• Abstract details of HE schemes, such as use of evaluation

keys and management of moduli.• Provide a common interface to shared functionality in FHE

libraries, while exposing unique features.• Not to include the complexity of plaintext evaluation. In-

stead the HISA is intended to be embedded into a hostlanguage.The HISA is split into multiple profiles and all libraries are

expected to implement at least the Encryption profile. EachFHE library implementing the HISA provides two types, ptfor plaintexts and ct for ciphertexts, and additional ones de-pending on the profiles implemented by the library.

Figure 3 presents instructions available in the HISA. TheEncryption profile provides core functionality for encryption,decryption, and memory management. Usage of encryptionand evaluation keys is left as a responsibility of FHE librariesand is not exposed in the HISA.

The Integers profile provides operations for encoding andcomputing on integers. The profile covers both FHE schemeswith batched encodings and ones without by having a config-urable number of slots s, which is provided as an additionalparameter during initialization of the FHE library.

The Division profile exposes extra functionality provided bythe HEAAN family of encryption schemes. For libraries imple-menting the base HEAAN scheme [16], maxScalarDiv(c,ub)will return the largest power-of-two d such that d ≤ ub andd ≤ q, where q is the modulus of c. For a variation of HEAANbased on residue number systems (RNS) [7], on the otherhand, maxScalarDiv(c,ub) would return the largest coprimemodulus of c less than ub, or 1 if there is none. This way theDivision profile cleanly exposes the division functionality ofboth variants of the HEAAN encryption scheme.

The Relin profile provides the capability to separate mul-tiplication from re-linearization. While re-linearization is se-mantically a no-op, it is useful to expose relinearize calls tothe compiler, as their proper placement is a highly non-trivial(in fact NP-complete) problem [14].

Finally, the Bootstrap profile exposes bootstrapping in FHElibraries that support it. While bootstrapping is semanticallya no-op, it must be exposed in the HISA because selection of

5

Instruction Signature Semantics Profile

encrypt(p) pt→ ct Encrypt plaintext p into a ciphertext. Encryptiondecrypt(c) ct→ pt Decrypt ciphertext c into a plaintext. Encryptioncopy(c) ct→ ct Make a copy of ciphertext c. Encryptionfree(h) ct∪pt→ void Free any resources associated with handle h. Encryption

encode(m) Zs→ pt Encode vector of integers m into a plaintext. Integersdecode(p) pt→ Zs Decode plaintext p into a vector of integers. IntegersrotLeft(c,x), rotLeftAssign(c,x) ct,Z→ ct Rotate ciphertext c left x slots. IntegersrotRight(c,x), rotRightAssign(c,x) ct,Z→ ct Rotate ciphertext c right x slots. Integers

add(c,c′),addAssign(c,c′) ct,ct→ ct

Add ciphertext, plaintext, or scalar to ciphertext c.Integers

addPlain(c, p),addPlainAssign(c, p) ct,pt→ ct IntegersaddScalar(c,x),addScalarAssign(c,x) ct,Z→ ct Integers

sub(c,c′),subAssign(c,c′) ct,ct→ ct Subtract ciphertext, plaintext, or scalar from ciphertextc.

IntegerssubPlain(c, p),subPlainAssign(c, p) ct,pt→ ct IntegerssubScalar(c,x),subScalarAssign(c,x) ct,Z→ ct Integers

mul(c,c′),mulAssign(c,c′) ct,ct→ ct Multiply ciphertext c with ciphertext, plaintext, orscalar.

IntegersmulPlain(c, p),mulPlainAssign(c, p) ct,pt→ ct IntegersmulScalar(c,x),mulScalarAssign(c,x) ct,Z→ ct Integers

divScalar(c,x),divScalarAssign(c,x) ct,Z→ ct Divide ciphertext c with scalar x. Undefined if x 6=maxScalarDiv(c,ub) for some ub ∈ Z.

Division

maxScalarDiv(c,ub) ct,Z→ Z Gives largest valid divisor d s.t. 1≤ d ≤ ub. Division

mulNoRelin(c,c′),mulNoRelinAssign(c,c′) ct,ct→ ct Ciphertext-ciphertext multiplication without re-linearization.

Relin

relinearize(c) ct→ void No-op, FHE library performs re-linearization. Relin

bootstrap(c) ct→ void No-op, FHE library performs bootstrapping. Bootstrap

Figure 3: Instructions of the HISAencryption parameters depends on when bootstrapping is per-formed. Furthermore, the optimal placement of bootstrappingoperations [8] depends on the program being evaluated: “wide”programs with a lot of concurrent should bootstrap at lowerdepths than “narrow” programs that are mostly sequential.

The HISA may be implemented either with precise or ap-proximate semantics. With approximate semantics all opera-tions that return a pt or ct may introduce an error term. Itis expected that this error is from some random distributionwhich may be bounded given a confidence level. However,the HISA does not offer instructions for estimating the error,as we expect these estimates to be required during selectionof fixed point precisions and encryption parameters. Insteadwe recommend that approximate FHE libraries provide imple-mentations of the HISA with no actual encryption and either:a way to query safe estimates of errors accumulated in eachpt and ct, or sampling of error from the same distribution asencrypted evaluation would produce. Both approaches may beused by a compiler for selecting fixed point precisions. Thesampling approach is more flexible for applications where itis hard to quantify acceptable bounds for error, such as neuralnetwork classification, where only maintaining the order ofresults matters.

Initialization of FHE libraries is concerns that falls outsideof the HISA, and is specific to the encryption and encodingscheme used. This step will include at least generating orimporting encryption and evaluation keys. FHE libraries im-plementing leveled encryption schemes also require selectingencryption parameters that are large enough to allow the com-

putation to succeed. Such FHE libraries should provide analy-sis passes as additional implementations of the HISA to helpwith parameter selection. We have implemented such a passfor the HEAAN library [16], which we present in Section 6.

HISA is not an IR: The HISA does not and nor is it meantto provide a unified interface to FHE libraries. To draw aparallel to processors implementing the x86 instruction set, ifa program includes instructions from Advanced Vector Exten-sions (AVX) then that program will not run on any processorbought before 2011. Conversely, running a program that doesnot use AVX on a processor that does may be leaving a lotof potential performance unused. Similarly, a program us-ing the Division HISA profile will only run on FHE librariesimplementing a HEAAN style encryption scheme [16] andagain, conversely, a program using HEAAN for fixed pointarithmetic without ever calling divScalar would likely achievemuch better performance (through smaller encryption parame-ters). The role of a unified interface should instead be playedby an intermediate representation (IR). For the HISA, thisIR would include fixed point datatypes and specifications ofrequired precisions. This kind of an IR could then be loweredto target FHE libraries that support different subsets of profiles.This is similar to how for example the LLVM intermediaterepresentation can be lowered both to processors that supportAVX and ones that do not.

6

5. Runtime for Executing Homomorphic TensorOperations

Intel MKL libraries provide efficient implementations ofBLAS operations. In the same way, we design the CHET run-time to provide efficient implementations for homomorphictensor primitives. While the interface for BLAS and tensoroperations are well-defined, the interface for a tensor operationon encrypted data is not apparent because the encrypted datais an encryption of a vector, whereas the tensor operation ison higher-dimensional tensors. The types need to be recon-ciled to define a clean interface. In the rest of this section, weconsider a 4-dimensional tensor (batch, channel, height, andwidth dimensions) to illustrate this, but the same concepts ap-ply to other higher-dimensional tensors. We first describe ourcipher tensor datatype. Each tensor operation on unencryptedtensors corresponds to an equivalent homomorphic tensor op-eration on cipher tensor(s) and plain tensor(s), if any. We thenbriefly describe implementations of some homomorphic tensoroperations using this clean interface and datatype.

5.1. Cipher Tensor Datatype:

A naive way to encode a 4-d tensor as a vector is to lay out allelements in the tensor contiguously in the vector and encryptit. This approach may not be feasible for a few reasons: (i) thevector size is determined by the encryption parameters (likeN in HEAAN) and the 4-d tensor might not fit in the vector,or (ii) tensor operations like convolution and matrix multi-plication on this input vector (or to produce such an outputvector) become complicated and very inefficient because onlypoint-wise operations or rotations are supported by the HISA.Another option is to encrypt each inner dimension (width)element in a separate cipher. This creates a vector of vec-tors, where the inner vector is a cipher and the outer vectorcontiguously stores pointers to these ciphers. The number ofhomomorphic operations would increase significantly becausea cipher operation may be required for each element, therebynot utilizing the vector width and making it very inefficient.

The problem of encoding a 4-d tensor as a vector of vectorsis similar to tiling or blocking the data but with constraintsthat differ from that of locality. For example, a 4-d tensor canbe blocked as a 2-d tensor of 2-d tensors. This correspondsto the 4-d tensor being laid out as a vector of vectors, whereeach inner vector has the inner two dimensions (height andwidth) encrypted. Similar issues as stated earlier may ariseif a particular data layout is fixed. Some tensor operationsmight be more efficient in some layouts than in others. Moreimportantly, the most efficient layout may be dependent onthe schema of the tensor or on future tensor operations, sodetermining that is best left to the compiler. Thus, we wanta uniform cipher tensor datatype that is parametric to differ-ent data layouts, so that the CHET compiler can choose theappropriate data layout.

Some tensor operations enforce more constraints on the

cipher tensor datatype, primarily for performance reasons.For example, to perform convolution with same padding, theinput 4-d tensor is expected to be padded. Such a paddingon unencrypted data is trivial, but adding such padding toan encrypted cipher on-the-fly involves several rotation andpoint-wise multiplication operations. Similarly, a reshape ofthe tensor is trivial on unencrypted data but very inefficient onencrypted data. All these constraints arise because using onlypoint-wise operations or rotations on the vector may be highlyinefficient to perform certain tensor operations. Therefore,the cipher tensor datatype needs to allow padding or logicalre-shaping without changing the elements of the tensor.

To satisfy these requirements, we define an cipher tensordatatype, that we term CipherTensor, as a vector of ciphertexts(vectors) with associated metadata that captures the way tointerpret the vector of ciphertexts as the corresponding unen-crypted tensor. The metadata includes: (i) physical dimensionsof the (outer) vector and those of the (inner) ciphertext, (ii) log-ical dimensions of the equivalent unencrypted tensor, and (iii)physical strides for each dimension of the (inner) ciphertext.

The metadata is stored as plain integers and can be modifiedeasily. Nevertheless, the metadata does not leak any informa-tion about the data because it is agnostic to the values of thetensor and is solely reliant on the schema or dimensions of thetensor. The metadata can be used to satisfy all the requiredconstraints of the datatype:• The metadata are parameters that the compiler can choose

to instantiate and use a specific data layout. For example,blocking or tiling the inner dimension (height and width) ofa 4-d tensor corresponds to choosing 2 dimensions for the(outer) vector and 2 dimensions for the (inner) ciphertext,where the logical dimensions match the physical dimen-sions. We term this as HW-tiling. Another option is to blockthe channel dimension too, such that multiple, but not all,channels may be in a cipher. This corresponds to choosing2 dimensions for the (outer) vector and 3 dimensions forthe (inner) ciphertext with only 4 logical dimensions. Weterm this as CHW-tiling.

• The physical strides of each dimension can be specifiedto include sufficient padding in-between elements of thatdimension. For example, for an image of height (row) andwidth (column) of 28, a stride of 1 for the width dimensionand a stride of 30 for the height dimension allows a paddingof 2 (zero or invalid) elements between the rows.

• Reshaping the cipher tensor only involves updating themetadata to change the logical dimensions and does notperform any homomorphic operations.

5.2. Homomorphic Tensor Operations:

The interface for a homormorphic tensor operation is mostlysimilar to that of its unencrypted tensor operation counterpart,except that the types are replaced with cipher tensors or plaintensors, which we call CipherTensor and PlainTensor, respec-tively. The homomorphic tensor operation also exposes the

7

Algorithm 1: HW-Tiled Homomorphic Convolution 2-dof a 4-d tensor (with valid padding)

Input :CipherTensor inputInput :PlainTensor filterOutput :CipherTensor output

1 foreach [b, oc] in output.getOuterDims() do2 output.ciphers[b, oc] = zeroCipher3 foreach ic in input.getChannelDims() do4 hStride, wStride = input.getCipherStrides()5 foreach [fh, fw] in filter.getHeightWidth() do6 weight = filter[fh, fw, ic, oc]7 weightFP = FixedPrecision(weight, plainLogP)8 rotate = fh * hStride + fw * wStride9 temp = input.ciphers[b, ic]

10 FHE.rotLeftAssign(temp, rotate)11 FHE.mulScalarAssign(temp, weightFP)12 FHE.addAssign(output.ciphers[b, oc], temp)13 end14 end15 FHE.divScalarAssign(output.ciphers[b, oc], plainLogP)16 end

output CipherTensor in the interface as a pass-by-referenceparameter. This enables the compiler to specify the data layoutof both the input and output CipherTensors. In addition, theinterface exposes parameters to specify the scaling factors touse for the input CipherTensor(s) and PlainTensor(s), if any.When a compiler specifies all these parameters, it chooses animplementation specialized for this specification. There mightbe multiple implementations for the same specification thatare tuned for specific FHE libraries (for those that have divS-calar and for those that do not). In such cases, the compileralso chooses the implementation to use. Nevertheless, thereare several algorithm choices for a given specification that havesignificant impact on the performance of the operation. Thesecould be quite involved (similar to MKL implementations).We explore some of these briefly next.

HW-tiled Homomorphic Convolution 2-d with VALIDpadding : Consider a HW-tiled CipherTensor that repre-sents a 4-d tensor. Algorithm 1 shows pseudocode for a 2-dconvolution with valid padding of this CipherTensor into an-other HW-tiled CipherTensor. Zeros (of the size of the cipher-text) encrypted into a ciphertext is assumed to be available.In convolution, the first element in a HW cipher needs to beadded with some of its neighboring elements and before theaddition, each of these elements need to be multiplied withdifferent filter weights. To do so, we can left rotate each ofthese neighboring elements to the same position as the first el-ement. Rotating the ciphertext vector in such a way moves theposition appropriately for all HW elements. For each rotatedvector, the same weight needs to be multiplied for all HWelements, so mulScalar of the ciphertext is sufficient. Beforemultiplication, the filter weight needs to be scaled using thecompiler-provided scaling factor (plainLogP). These cipher-text vectors can then be added for different input channels toproduce a ciphertext for an output channel. The rotations ofthe input ciphertexts are invariant to the output batch and chan-

nel, so they can be code motioned out in the implementationto reduce the number of rotations (omitted in the pseudocodefor the sake of exposition).

HW-tiled Homomorphic Convolution 2-d with SAMEpadding : If a similar 2-d convolution has to be imple-mented, but with same padding, then each HW ciphertextis expected to be padded by the compiler so that the amount torotate (left or right) varies but the number of rotations remainthe same. However, after one convolution, there are invalid ele-ments where the padded zeros existed earlier because those areadded to the neighboring elements as well. The runtime keepstrack of this using an additional metadata on the CipherTen-sor. The next time a convolution (or some other operation)is called on the CipherTensor, if the CipherTensor containsinvalid elements where zeros are expected to be present, thenthe implementation can mask out all invalid elements withone mulPlain operation (the plaintext vector contains 1 wherevalid elements exist and 0 otherwise). This not only increasesthe time for the convolution operation, but it also increases themodulus Q required because divScalar may need to be calledafter such a masking operation. For security reasons, a largerQ can increase the N that needs to be used during encryption,thereby increasing the cost of all homomorphic operations.

CHW-tiled Homormorphic Convolution 2-d : Let us nowconsider a CHW-tiled CipherTensor that represents a 4-d ten-sor. To perform a 2-d convolution on this, even with VALIDpadding, mulPlain is required because different weights needto be multiplied to different channels that are all in the samecipher. In some FHE libraries like HEAAN, the asymptoticcomplexity of mulPlain is much higher than that of mulScalar,thereby increasing the cost of the convolution operation. More-over, after multiplication, the multiple input channels in theciphertext need to be summed up into a single output channel.Such a reduction can be done by rotating every channel tothe position of the first one in the ciphertext and adding themup one by one. If C is the number of channels in the cipher,this involves C−1 rotations. However, such a reduction canbe done more efficiently by exploiting the fact that the stridebetween the input channels is the same. This requires at themost 2log(C) rotations with additions in-between rotations,similar to vectorized reduction on unencrypted data. To pro-duce an output ciphertext with multiple channels C, the inputciphertext needs to be replicated C−1 times. Instead of C−1rotations serially, this can also be done in 2log(C) rotations,by adding the ciphertexts in-between.

Homomorphic matmul : Different FHE operations havedifferent latencies. Although rotLeft and mulPlain have simi-lar algorithmic complexity in HEAAN, the constants may varyand we observe that, mulPlain is more expensive than rotLeft.Due to this, it may be worth trading multiplications for rota-tions. This trade-off is most evident in homormorphic matmuloperation. The number of mulPlain required reduce propor-tional to the number of replicas of the data we can add in the

8

TransformedHomomorphicTensor Circuit

SymbolicCHET Runtime

SymbolicAnalyser

Transformer AnalysisResults

Figure 4: Overview of an analysis and transformation pass inthe CHET compiler.

same cipher. Adding replicas increase the number of rotationsbut decrease the number of multiplications. This yields muchmore benefit because replicas can be added in log number ofrotations instead of the linear number of multiplications.

6. Compiler For Transforming HomomorphicTensor Operations

In this section, we describe the CHET compiler, which is acritical component of the end-to-end software stack to ho-momorphically evaluate tensor programs. The compiler isresponsible for generating an optimized homomorphic tensorcircuit that not only produces accurate results but also guar-antees security of the data being computed. We describe theanalysis and transformation framework used by the CHETcompiler and then use this framework to describe a few trans-formations that are required for accuracy and security, anda few optimizations that improve performance significantly.We present the analysis and transformations specifically forHEAAN but it can extended for other FHE libraries trivially.

6.1. Analysis and transformation framework

In contrast to most traditional optimizing compilers, the inputtensor circuit has two key properties that the CHET compilercan exploit: the data flow graph is a Directed Acyclic Graph(DAG) and the tensor dimensions are known at compile-timefrom the schema provided by the user (similar to High Per-formance Fortran compilers). The CHET compiler must alsoanalyze multiple data flow graphs corresponding to differentimplementations. Fortunately, the design space is not thatlarge and we can explore it exhaustively, one at a time.

The CHET compiler only needs to determine the policy toexecute homomorphic computation while the CHET runtimehandles the mechanisms of efficient execution. This separa-tion of concerns simplifies the code generation tasks of thecompiler but it could complicate the analysis required. Thecompiler generates high-level homomorphic tensor operationsbut needs to analysis the low-level HISA instructions exe-cuted. To resolve this, we exploit the CHET runtime directlyto perform the analysis.

Figure 4 shows the flow of a single analysis and transfor-mation pass in the compiler. The transformer is a simple toolthat specifies the parameters to use in the homomorphic ten-sor circuit (the input tensor circuit is directly mapped to an

equivalent homomorphic tensor circuit with parameters leftunspecified). This transformed homomorphic tensor circuitcan be symbolically executed using the CHET runtime. Thiswould execute the same HISA instructions that would be ex-ecuted at runtime because the tensor dimensions are known.Instead of executing these HISA instructions using the FHEscheme, the HISA instructions invoke the symoblic analyser.This tracks the data that flows through the circuit. Repeatediteration to a fix-point is not needed since the circuit is a DAG.The DAG is unrolled on-the-fly to dynamically track the dataflow, without having to construct an explicit DAG. Differentanalysers can track different information flows. The analyserreturns its results as the output of the symoblic execution.The transformer then uses these results to instantiate a new,transformed specification of the homomorphic tensor circuit.

6.2. Parameter Selection

To choose encryption parameters such that the output is accu-rate, the compiler needs to analyse the depth of the circuit. InHEAAN, divScalar consumes modulus Q from the ciphertextit is called on. The output should have sufficient modulusQ left to capture accurate results. We implement a symoblicanalyser with a dummy ciphertext datatype that incrementsthe modulus Q of the input ciphertext and copies into the out-put ciphertext whenever divScalar is called. All other HISAinstructions only copy the modolus from intput to output. Themodulus Q in the dummy output ciphertext is the depth ofthe circuit. The input ciphertext should be encrypted withmodulus Q that is at least the sum of the depth of the circuitand the desired output precision. This ensures accurate results.For the input modolus Q, a large enough N must be chosen toguaratee security of the data. This is a deterministic map fromQ to N. In some cases, Q might require prohibitively large N,in which case the compiler needs to introduce bootstrappingin-between. We do not explore this in this paper.

6.3. Padding Selection

This analysis does not need to track the data flow in the HISAinstructions. It requires analysing the metadata in the Ci-pherTensor that flows through the homomorphic tensor cir-cuit. This is quite straight-forward. If tensor operations likeconvolution need padding, then the previous operations untilthe input should maintain that padding. Some tensor opera-tions may change strides, in which case the padding requiredscales by that factor. The input padding (in each ciphertextdimension) selected is the maximum padding required for anyhomomorphic tensor operation in the circuit.

6.4. Optimization: Rotation Keys Selection

By default, HEAAN inserts public evaluation keys for power-of-2 left and right rotations, and all rotations are performedusing a combination of power-of-2 rotations. The ciphertextsize is N/2, so HEAAN stores 2log(N)−2 rotation keys bydefault. The rotation keys consume significant memory, so

9

No. of layersNetwork Conv FC Act # FP operations

LeNet-5-small 2 2 4 159960LeNet-5-medium 2 2 4 5791168LeNet-5-large 2 2 4 21385674Industrial 5 2 6 -SqueezeNet-CIFAR 10 0 9 37759754

Figure 5: DNNs used in our evaluation.

this is trade-off between space and performance. By storingonly these keys, any rotation can be performed. However, thisis too conservative. In a given homomorphic tensor circuit,the distinct slots to rotate would not be in the order of N. Weuse the analyser to track the distinct slots to rotate used inthe homomorphic tensor circuit. All rotate HISA instructionsstore the constant slots to rotate in the analyser (right rotationsare converted to left rotations) and other HISA instructions areignored. The analyser returns the distinct slots to rotate thatwere used. The compiler generates the encryptor such that itwould generate evaluation keys for these rotations using theprivate key on the client (Figure 2). We do not need power-of-2 rotation keys and we observe that the rotation keys chosenby the compiler are a constant factor of log(N).

6.5. Optimization: Data Layout Selection

Different homomorphic operations have different costs, evenwithin the same FHE scheme. The cost of homomorphic op-erations could also vary a lot based on small variations ofthe scheme. For example, by using a full-RNS variant of theHEAAN scheme and storing the ciphertexts and plaintextsin the number-theoretic-transform (NTT) domain, we can de-crease the complexity of mulPlain* from O(n log(n)) to O(n).The compiler can encode the cost of each operation eitherfrom asymptotic complexity or from microbenchmarking eachoperation. These costs can then be used to decide which imple-mentation or data layout to choose for a given homomorphictensor circuit. The analyser in this case counts the number ofoccurrences of each operation and then uses the asymptoticcomplexity to determine to the total cost of the circuit. Thetransformer creates a homomorphic tensor circuit correspond-ing to a data layout and gets its cost using the analyser. Itrepeats this for different possible data layout options and thenchooses the one with the lowest cost.

7. Evaluation

Our evaluation targets a set of neural network architectures forimage classification tasks described in Figure 5.LeNet-5-like is a series of networks for the MNIST [3]

dataset. We use three versions with different number of neu-rons: LeNet-5-small, LeNet-5-medium, and LeNet-5-large.The largest one matches the one used in the TensorFlow’stutorials [5]. These networks have two convolutional layers,each followed by ReLU activation and max pooling, and twofully connected layers with a ReLU in between.

Model CHET Hand-written

LeNet-5-small 8 14LeNet-5-medium 51 140LeNet-5-large 265 -Industrial 312 2413SqueezeNet-CIFAR10 1342 -

Figure 6: Average latency (in seconds) of CHET and hand-written versions.

Model log(N) log(Q) log(Pc) log(Pp)

LeNet-5-small 14 240 30 16LeNet-5-medium 14 240 30 16LeNet-5-large 15 400 40 20Industrial 16 705 35 25SqueezeNet-CIFAR10 16 940 30 20

Figure 7: Encryption parameters selected by CHET and theuser-provided precisions for each model.

SqueezeNet-CIFAR is a network for the CIFAR-10dataset [1] that follows the SqueezeNet [25] architecture.This version has 4 Fire-modules [4] for a total of 10 convolu-tional layers.

Industrial is an pretrained HE-compatible network from anindustry partner for a privacy-sensitive image classificationtask. We are unable to reveal the details of the network otherthan the fact that it has 5 convolutional layers and 2 fullyconnected layers.All networks other than Industrial use ReLUs and max-

pooling, which are not compatible with homomorphic eval-uation. For these networks, we modified the activation func-tions to a second-degree polynomial [23, 13]. The key dif-ference with prior work is that our activation functions aref (x) = ax2 + bx with learnable parameters a and b. Duringthe training phase, the DNN adjusts these parameters automat-ically to appropriately approximate the ReLU function. Toavoid exploding the gradients during training (which usuallyhappens during the initial parts of training), we initialized a tozero and clipped the gradients when large. We also replacedmax-pooling with average-pooling.

We trained LeNet-5-large to an accuracy of 0.993, whichmatches that of the network in the TensorFlow’s tuto-rial [5]. For SqueezeNet-CIFAR, we additionally introduceL2-regularization with a scale of 1×10−5 to improve perfor-mance. Our resulting accuracy is 0.815 which is close to theaccuracy of 0.84 of the non-HE compatible model. To the bestof our knowledge SqueezeNet-CIFAR is the deepest neuralnetwork that has been homomorphically evaluated.

All experiments were run on a dual socket Intel Xeon [email protected] with 224 GiB of memory. Hyperthreadingwas off for a total of 16 hardware threads. All runtimes arereported as averages over 20 different images. We present theaverage latency of image inference with a batch size of 1.

Comparison with hand-written: Figure 6 compares hand-written implementations and CHET with all optimizations.CHET clearly outperforms hand-written implementations. The

10

Model HW CHW HW-conv CHW-fcCHW-rest HW-before

LeNet-5-small 8 12 8 8LeNet-5-medium 82 91 52 51LeNet-5-large 325 423 270 265Industrial 330 312 379 381SqueezeNet-CIFAR 1342 1620 1550 1342

Figure 8: Average latency (in seconds) with different layouts.

Model Unoptimized Optimized

LeNet-5-small 14 8LeNet-5-medium 73 51LeNet-5-large 426 265Industrial 645 312SqueezeNet-CIFAR 2648 1342

Figure 9: Average latency (in seconds) with and without rota-tion keys optimization.

hand-written implementations lack some of the optimizationsin the CHET compiler and runtime. Moreover, it is diffi-cult to scale these hand-written implementations to large net-works like LeNet-5-large and SqueezeNet-CIFAR10, so we donot have hand-written implementations for these to compareagainst.

Parameter Selection: In Figure 7, the last columns showprecision (the number of decimal digits) required for the imageor ciphertext (Pc) and the weights or the plaintext (Pp), thatare provided by the user. The precision provided is used bythe CHET compiler to select the encryption parameters N andQ. The values of these parameters grow with the depth ofthe circuit, as shown in the figure. With these parameters,CHET generated homomorphic tensor circuits achieve thesame accuracy as the unencrypted circuits. In addition, thedifference between the output values of these circuits is withinthe desired precision of the output.

Data Layout Selection: We evaluate four different datatiling layouts choices for the three largest networks: (i) HW:each ciphertext has all height and width elements of a singlechannel, (ii) CHW: each ciphertext has multiple channels (allheight and width elements of each), (iii) HW-conv and CHW-rest: same has CHW, but move to HW before each convolutionand back to CHW after each convolution, and (iv) CHW-fcand HW-before: same as HW, but switch to CHW duringthe first fully connected layer and CHW thereafter. Figure 8presents the average latency of each network for each layout.We can see that for each network, a different layout providesthe lowest latency. It is very difficult for the user to determinewhich is the best data layout and more importantly, it is dif-ficult to implement each network manually using a differentdata layout. This highlights how the compiler should searchthe space of possible layouts and kernel implementations foreach program separately, while entrusting the runtime to im-plement it efficiently. In this case, the compiler chooses thebest performing data layout for each network based on the costmodel of HEAAN.

Rotation Keys Selection: We evaluate the efficacy of ourrotation keys optimization on the three largest networks. Fig-ure 9 presents the average latencies with the optimization onor off. The optimization provides significantly improved per-formance for all networks and should be always used.2 Havingthis optimization implemented as an automatic compiler passremoves the burden of adding proper rotation keys in eachprogram separately.

8. Related Work

FHE is currently an active area of research. See Acar et al [6]for more details. Many have observed the need and suitabilityof FHE for machine learning tasks. We survey the most relatedwork below.

Cryptonets [23] was the first tool to demonstrate a fully-homomorphic inference of a DNN by replacing the (FHEincompatible) RELU activations with a quadratic function.They demonstrated an accuracy of 98.95% for MNIST with anetwork with two hidden layers, with a latency of 250 secondsper image. Chabanne et al. [13] improve upon Cryptonetsby using batch normalization [26] to bound the inputs to theactivations into a small interval where the quadratic approx-imation is valid. They report improvement in accuracy forMNIST to 99.30% with a network with 6 activation layers.Our method of retraining the quadratic activation function isinspired by these works. Our emphasis in this paper is pri-marily on automating the manual and error-prone hand tuningrequired to ensure that the networks are secure, correct, andefficient.

Bourse et al. [9] use the TFHE library [17] for DNN infer-ence. TFHE operates on bits and is thus slow for multi-bitinteger arithmetic. To overcome this difficulty, Bourse et al.instead use a discrete binary neural network (where activa-tion functions output either -1 or 1) with a hidden layer usingthe sign function as the activation. Our goal in this paper isto build a framework for compiling larger general-purposeDNNs.

Similarly, Cingulata [12, 2] is a compiler for converting C++programs into a Boolean circuit, which is then evaluated usinga backend FHE library. Despite various optimizations [11],this approach is unlikely to scale for large DNNs.

DeepSecure [30] and SecureML [29] use secure multi-partycomputation techniques to respectively perform DNN infer-ence and training. Such techniques employ communicationbetween multiple entities that are assumed to not collude andthus do not have the simple trust model of FHE. On the otherhand, such approaches are more flexible in the operations theycan perform and are computationally cheaper.

Finally, prior work [31, 19] have used partially homomor-phic encryption schemes (which support addition or multipli-cation of encrypted data but not both) to determine the encryp-tion schemes to use for different data items so as to execute

2Barring very memory constrained environments.

11

a given program. While they are able to use computationallyefficient schemes, such techniques are not applicable for eval-uating DNNs that require both multiplication and addition tobe done on the same input.

9. Conclusion

Good abstractions separate concerns to either side of thatabstraction. For example, x86 lets hardware manufacturersinnovate independently of the software developers and com-piler writers that target that abstraction. Likewise, cuDNN letsmachine learning experts target various hardware implemen-tations and reap the benefits of new GPU hardware withoutchanging their software.

This paper introduces such an abstraction for FHE appli-cations that cleanly abstracts and exposes features of FHEimplementations. We demonstrate a compiler and runtimethat targets FHE tensor programs, evaluate that compiler andruntime on real-world CNN models, and demonstrate our com-piler is able to significantly optimize the performance of FHEtensor programs. Because of these optimizations, this paperdemonstrates the deepest FHE based CNNs to date.

This paper demonstrates the fruitful application of systemsand compiler research for FHE applications. We posit there aremany open problems that such areas can help solve (i.e., com-piler support that turns expressions into FHE compatible ones,or more optimization passes specific to encryption parame-ters). We expect these opportunities excites other researchersas much as it does the authors.

References[1] The CIFAR-10 dataset. https://www.cs.toronto.edu/~kriz/

cifar.html.[2] Cingulata. https://github.com/CEA-LIST/Cingulata/.[3] The MNIST database of handwritten digits. http://yann.lecun.

com/exdb/mnist/.[4] Squeezenet for CIFAR-10. https://github.com/kaizouman/

tensorsandbox/tree/master/cifar10/models/squeeze.[5] TensorFlow’s LeNet-5-like convolutional MNIST model exam-

ple. https://github.com/tensorflow/models/blob/v1.9.0/tutorials/image/mnist/convolutional.py.

[6] Abbas Acar, Hidayet Aksu, A. Selcuk Uluagac, and Mauro Conti. Asurvey on homomorphic encryption schemes: Theory and implementa-tion. CoRR, abs/1704.03578, 2017.

[7] Jean-Claude Bajard, Julien Eynard, M. Anwar Hasan, and VincentZucca. A full RNS variant of FV like somewhat homomorphic encryp-tion schemes. In Roberto Avanzi and Howard Heys, editors, SelectedAreas in Cryptography – SAC 2016, pages 423–442. Springer, 2017.

[8] Fabrice Benhamouda, Tancrède Lepoint, Claire Mathieu, and HangZhou. Optimization of bootstrapping in circuits. In Proceedings of theTwenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms,pages 2423–2433. SIAM, 2017.

[9] Florian Bourse, Michele Minelli, Matthias Minihold, and Pascal Pail-lier. Fast homomorphic evaluation of deep discretized neural net-works. Cryptology ePrint Archive, Report 2017/1114, 2017. https://eprint.iacr.org/2017/1114.

[10] Zvika Brakerski, Craig Gentry, and Vinod Vaikuntanathan. (Leveled)fully homomorphic encryption without bootstrapping. ACM Transac-tions on Computation Theory (TOCT), 6(3):13, 2014.

[11] Sergiu Carpov, Pascal Aubry, and Renaud Sirdey. A multi-start heuris-tic for multiplicative depth minimization of boolean circuits. In LjiljanaBrankovic, Joe Ryan, and William F. Smyth, editors, CombinatorialAlgorithms, pages 275–286. Springer International Publishing, 2018.

[12] Sergiu Carpov, Paul Dubrulle, and Renaud Sirdey. Armadillo: acompilation chain for privacy preserving applications. CryptologyePrint Archive, Report 2014/988, 2014. https://eprint.iacr.org/2014/988.

[13] Hervé Chabanne, Amaury de Wargny, Jonathan Milgram, ConstanceMorel, and Emmanuel Prouff. Privacy-preserving classification ondeep neural network. IACR Cryptology ePrint Archive, 2017:35, 2017.

[14] Hao Chen. Optimizing relinearization in circuits for homomorphicencryption. CoRR, abs/1711.06319, 2017. https://arxiv.org/abs/1711.06319.

[15] Jung Hee Cheon, Kyoohyung Han, Andrey Kim, Miran Kim, and Yong-soo Song. Bootstrapping for approximate homomorphic encryption.In Jesper Buus Nielsen and Vincent Rijmen, editors, Advances in Cryp-tology – EUROCRYPT 2018, pages 360–384, Cham, 2018. SpringerInternational Publishing.

[16] Jung Hee Cheon, Andrey Kim, Miran Kim, and Yongsoo Song. Ho-momorphic encryption for arithmetic of approximate numbers. InTsuyoshi Takagi and Thomas Peyrin, editors, Advances in Cryptology– ASIACRYPT 2017, LNCS 10624, pages 409–437. Springer, 2017.https://doi.org/10.1007/978-3-319-70694-8_15.

[17] Ilaria Chillotti, Nicolas Gama, Mariya Georgieva, and Malika Iz-abachène. Tfhe: Fast fully homomorphic encryption over the torus.Cryptology ePrint Archive, Report 2018/421, 2018. https://eprint.iacr.org/2018/421.

[18] Intel Corp. Intel software guard extensions (Intel SGX), ref. 332680-002. https://software.intel.com/sites/default/files/332680-002.pdf, jun 2015.

[19] Yao Dong, Ana Milanova, and Julian Dolby. Jcrypt: Towards compu-tation over encrypted data. In Proceedings of the 13th InternationalConference on Principles and Practices of Programming on the JavaPlatform: Virtual Machines, Languages, and Tools, Lugano, Switzer-land, August 29 - September 2, 2016, pages 8:1–8:12, 2016.

[20] Junfeng Fan and Frederik Vercauteren. Somewhat practical fully ho-momorphic encryption. IACR Cryptology ePrint Archive, 2012:144,2012.

[21] Michael J Flynn. Some computer organizations and their effectiveness.IEEE transactions on computers, 100(9):948–960, 1972.

[22] Craig Gentry and Dan Boneh. A fully homomorphic encryption scheme,volume 20. Stanford University Stanford, 2009.

[23] Ran Gilad-Bachrach, Nathan Dowlin, Kim Laine, Kristin E. Lauter,Michael Naehrig, and John Wernsing. Cryptonets: Applying neuralnetworks to encrypted data with high throughput and accuracy. InProceedings of the 33nd International Conference on Machine Learn-ing, ICML 2016, New York City, NY, USA, June 19-24, 2016, pages201–210, 2016.

[24] Matthew Hoekstra, Reshma Lal, Pradeep Pappachan, Vinay Phegade,and Juan Del Cuvillo. Using innovative instructions to create trust-worthy software solutions. In Proceedings of the 2Nd InternationalWorkshop on Hardware and Architectural Support for Security andPrivacy, HASP ’13, pages 11:1–11:1, New York, NY, USA, 2013.ACM.

[25] Forrest N. Iandola, Matthew W. Moskewicz, Khalid Ashraf, SongHan, William J. Dally, and Kurt Keutzer. Squeezenet: Alexnet-levelaccuracy with 50x fewer parameters and <1mb model size. CoRR,abs/1602.07360, 2016. https://arxiv.org/abs/1602.07360.

[26] Sergey Ioffe and Christian Szegedy. Batch normalization: Acceleratingdeep network training by reducing internal covariate shift. In Proceed-ings of the 32nd International Conference on Machine Learning, ICML2015, Lille, France, 6-11 July 2015, pages 448–456, 2015.

[27] Kim Laine. Simple encrypted arithmetic library 2.3.1.https://www.microsoft.com/en-us/research/uploads/prod/2017/11/sealmanual-2-3-1.pdf.

[28] Vadim Lyubashevsky, Chris Peikert, and Oded Regev. On ideal latticesand learning with errors over rings. In Annual International Conferenceon the Theory and Applications of Cryptographic Techniques, pages1–23. Springer, 2010.

[29] P. Mohassel and Y. Zhang. Secureml: A system for scalable privacy-preserving machine learning. In 2017 IEEE Symposium on Securityand Privacy (SP), pages 19–38, May 2017.

[30] Bita Darvish Rouhani, M. Sadegh Riazi, and Farinaz Koushan-far. Deepsecure: Scalable provably-secure deep learning. CoRR,abs/1705.08963, 2017.

[31] Sai Deep Tetali, Mohsen Lesani, Rupak Majumdar, and Todd D. Mill-stein. Mrcrypt: static analysis for secure cloud computations. InProceedings of the 2013 ACM SIGPLAN International Conferenceon Object Oriented Programming Systems Languages & Applications,OOPSLA 2013, part of SPLASH 2013, Indianapolis, IN, USA, October26-31, 2013, pages 271–286, 2013.

12

https://www.cs.toronto.edu/~kriz/cifar.html

https://www.cs.toronto.edu/~kriz/cifar.html

https://github.com/CEA-LIST/Cingulata/

http://yann.lecun.com/exdb/mnist/

http://yann.lecun.com/exdb/mnist/

https://github.com/kaizouman/tensorsandbox/tree/master/cifar10/models/squeeze

https://github.com/kaizouman/tensorsandbox/tree/master/cifar10/models/squeeze

https://github.com/tensorflow/models/blob/v1.9.0/tutorials/image/mnist/convolutional.py

https://github.com/tensorflow/models/blob/v1.9.0/tutorials/image/mnist/convolutional.py

https://eprint.iacr.org/2017/1114




https://arxiv.org/abs/1711.06319


https://doi.org/10.1007/978-3-319-70694-8_15



https://software.intel.com/sites/default/files/332680-002.pdf

https://software.intel.com/sites/default/files/332680-002.pdf


https://www.microsoft.com/en-us/research/uploads/prod/2017/11/sealmanual-2-3-1.pdf

https://www.microsoft.com/en-us/research/uploads/prod/2017/11/sealmanual-2-3-1.pdf

chet: compiler and runtime for homomorphic ... - arxiv

Documents