

arXiv:2001.03622v2 [quant-ph] 10 Feb 2020

Quantum embeddings for machine learning

Seth Lloyd,¹,² Maria Schuld,² Aroosa Ijaz,² Josh Izaac,² and Nathan Killoran²

¹Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, MA 02139, USA
²Xanadu, Toronto, Canada
(Dated: February 11, 2020)

Quantum classifiers are trainable quantum circuits used as machine learning models. The first part of the circuit implements a quantum feature map that encodes classical inputs into quantum states, embedding the data in a high-dimensional Hilbert space; the second part of the circuit executes a quantum measurement interpreted as the output of the model. Usually, the measurement is trained to distinguish quantum-embedded data. We propose to instead train the first part of the circuit, the embedding, with the objective of maximally separating data classes in Hilbert space, a strategy we call quantum metric learning. As a result, the measurement minimizing a linear classification loss is already known and depends on the metric used: for embeddings separating data using the ℓ1 or trace distance, this is the Helstrøm measurement, while for the ℓ2 or Hilbert-Schmidt distance, it is a simple overlap measurement. This approach provides a powerful analytic framework for quantum machine learning and eliminates a major component in current models, freeing up more precious resources to best leverage the capabilities of near-term quantum information processors.

Machine learning is a potential application for near-term intermediate-scale quantum computers [1–3]. Quantum machine learning algorithms have been shown to provide speed-ups over their classical counterparts for a variety of tasks, including principal component analysis, topological data analysis, and support vector machines [4–6]. This paper investigates quantum machine learning using variational quantum classifiers, which are parametrized quantum circuits that embed input data in Hilbert space and perform quantum measurements to discriminate between classes [7–9]. The variational strategy can be extended to more complex classification tasks, hybrid quantum-classical models, and generative models [10, 11]. The training of the circuit is performed by optimizing the parameters of quantum gates – such as the angles of Pauli rotations – with a hybrid quantum-classical optimization procedure [12, 13]. It has recently been shown [14, 15] that this strategy represents a quantization of classical kernel methods such as support vector machines [16], which implicitly embed data in a high-dimensional Hilbert space and find decision hyperplanes that separate the data according to their classes.

In [14, 15], the quantum feature map that embeds the data is taken to be a fixed circuit, and the adaptive training is performed on a variational circuit that adapts the measurement basis. By contrast, here we note that if the data is “well-separated” in Hilbert space, the best measurements to distinguish between classes of data are known and can be performed with shallow quantum circuits. The metric under which the data is separated defines which measurement should be performed: the best measurement for data separated by the trace distance (ℓ1) is the Helstrøm minimum error measurement [17], and the best measurement for the Hilbert-Schmidt (ℓ2) distance is obtained by measuring the fidelity or overlaps between embedded data, for example with a simple swap test. We present efficient circuits to implement both measurements; in the case of the fidelity measurement, the circuit is extremely short.

FIG. 1. Illustration of quantum metric learning. a. The embedding is trained to maximize the distance of the data clusters in the Hilbert space of quantum states. b. The measurement used to classify new inputs depends on the distance measure used. The simple decision boundary in Hilbert space can correspond to a highly complex decision boundary in the original data space.

Knowing the optimal quantum measurements for discriminating between clusters of embedded data has important consequences: the common approach of training a variational circuit after the embedding and before the measurement – and thereby the bulk of the computational resources spent on the quantum classifier – becomes obsolete. We argue that instead, the adaptive training of the quantum circuit should be focused on training a quantum feature map that carries out a maximally separating embedding (see Figure 1). This approach is known as “metric learning” in the classical machine learning literature [18, 19], where feature maps, and thereby a metric on the original data space, are learned with models such as deep neural networks. Note that while deep metric learning extracts low-dimensional representations of the data, quantum computing allows us to learn high-dimensional representations without explicitly invoking a kernel function.

We numerically investigate adaptive methods for training quantum embeddings using the PennyLane software package for hybrid optimization [20], and explore the performance of the measurements for small examples.


We note that due to potential quantum advantages [21], the embeddings performed by quantum circuits can be inaccessible to classical computers. We close with a brief discussion of the experimental feasibility of implementing quantum embeddings on existing and near-term quantum computers.

I. QUANTUM EMBEDDINGS

Kernel methods for learning operate by embedding data points x as vectors x⃗ ∈ H in a Hilbert space H, a vector space with an inner product. The data is then analyzed by performing linear algebra on the embedded vectors. Standard results on metric spaces imply that any finite metric space can be embedded in a sufficiently high-dimensional Hilbert space in such a way that the metric on the Hilbert space faithfully approximates the metric of the original space (e.g., [22]). The goal of the embedding process is to find a representation of the data such that the known metric of the Hilbert space faithfully reproduces the unknown metric of the original data, for example, the human-perceived “distance” between pictures of ants and pictures of bees. For such a faithful embedding to be possible, the dimension of the Hilbert space may have to be large – at least of the order of the number of data points. If one can find a faithful embedding, the computations required to compare data vectors and assign them to clusters can be performed using standard linear algebraic techniques.

Quantum computers are uniquely positioned to perform kernel methods. The states of quantum systems are vectors in a high-dimensional Hilbert space, and standard linear algebraic operations on that Hilbert space can often be performed in time poly-logarithmic in the dimension of the space – an exponential speed-up over the corresponding classical operations. The Hilbert space of an n-qubit quantum computer is the vector space C^N with N = 2^n, and the states ψ, ϕ are written in Dirac notation as |ψ⟩, |ϕ⟩ with the inner product ψ†ϕ = ⟨ψ|ϕ⟩. Because of the probabilistic interpretation of quantum mechanics, state vectors are normalized to 1. A measurement to verify that a vector |ϕ⟩ is equal to |ψ⟩ corresponds to a projection P_ψ = |ψ⟩⟨ψ|, which yields the answer "yes" with probability ⟨ϕ|P_ψ|ϕ⟩ = |⟨ψ|ϕ⟩|². The process of uniformly sampling M quantum states from a set {|ψ_i⟩} is described by the density matrix ρ = (1/M) ∑_i |ψ_i⟩⟨ψ_i|. The ℓ1 distance between the statistical ensemble ρ and another ensemble σ = (1/M′) ∑_j |φ_j⟩⟨φ_j| is the widely studied trace distance D_tr(ρ, σ) = (1/2) tr|ρ − σ|, while the ℓ2 distance is known as the Hilbert-Schmidt distance D_hs(ρ, σ) = tr((ρ − σ)²).

Quantum computers also naturally map data into Hilbert space. The map that performs the embedding has been termed a quantum feature map [14, 15]. A quantum embedding is then the representation of classical data points x as quantum states |x⟩, facilitated by the feature map. We assume here that the quantum feature map is enacted by a quantum circuit Φ(x, θ) which associates physical parameters in the preparation of the state |x⟩ = Φ(x, θ)|0...0⟩ with the input x. Another set θ of physical parameters is used as free variables that can be adapted via optimization. All parameters of the embedding are classical objects – they are simply numbers in a suitable high-dimensional vector space and can be trained using classical stochastic gradient descent methods [23, 24]. For example, in a superconducting or ion-trap quantum computer, training could specify the time-dependent microwave or laser pulses which are used to address the interacting qubits in the device; in an optical quantum computer the parameters could be squeezing amplitudes or beam splitter angles.
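To make the construction concrete, a trainable embedding circuit of this kind takes only a few lines in PennyLane. The following is a minimal sketch under assumed choices (two qubits, a length-2 feature vector, RX feature encoding, one trainable RY layer plus an entangler); it is an illustration, not the specific circuit used in the experiments below.

    import pennylane as qml

    n_qubits = 2
    dev = qml.device("default.qubit", wires=n_qubits)

    def feature_map(x, theta):
        # Feature-encoding gates: the input x sets the rotation angles
        for w in range(n_qubits):
            qml.RX(x[w], wires=w)
        # Trainable gates: theta are the free parameters of the embedding
        for w in range(n_qubits):
            qml.RY(theta[w], wires=w)
        qml.CNOT(wires=[0, 1])

    @qml.qnode(dev)
    def embedded_state(x, theta):
        # |x> = Phi(x, theta)|0...0>
        feature_map(x, theta)
        return qml.state()

The parameters θ can then be optimized with any classical (stochastic) gradient descent routine, while x is supplied by the data.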

In the following, we will limit ourselves to binary classification for simplicity¹. In this task, we are given a dataset with examples from two classes A and B of labels 1 and −1, respectively, and have to predict the label of a new input x, which may or may not be from the training data set of examples. Uniformly sampling M_a inputs from class A and embedding them into Hilbert space prepares the ensemble ρ = (1/M_a) ∑_{a∈A} |a⟩⟨a|, while sampling M_b inputs from class B prepares σ = (1/M_b) ∑_{b∈B} |b⟩⟨b|.

II. OPTIMAL MEASUREMENTS

A quantum classifier predicts a label for an input x mapped to |x⟩ via a quantum measurement. In general, the result of the classifier can be written as f(x) = ⟨x|M|x⟩ = tr(|x⟩⟨x|M), where M is the quantum observable corresponding to the measurement, represented by a Hermitian operator (see Appendix A). The continuous expectation can be turned into a binary label by thresholding on a value τ. The decision boundary of a quantum classifier on the space of density matrices is the hyperplane defined by tr(|x⟩⟨x|M) = τ. In the following we assume τ = 0. This separating hyperplane is the quantum analogue of a classical support vector machine.
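In matrix form, the classifier output and the thresholding rule amount to a one-liner. The sketch below uses NumPy with a placeholder observable and embedded state, both chosen purely for illustration:

    import numpy as np

    def classify(x_state, M_op, tau=0.0):
        # f(x) = <x|M|x> = tr(|x><x| M); real-valued for Hermitian M
        f = np.real(np.vdot(x_state, M_op @ x_state))
        return 1 if f >= tau else -1

    # Toy example: M = sigma_z on a single qubit, |x> = |0>
    sigma_z = np.diag([1.0, -1.0])
    ket0 = np.array([1.0, 0.0])
    print(classify(ket0, sigma_z))  # f(x) = 1, so the label is 1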

According to statistical learning theory [25], the quality of a classifier f is measured by the expected risk. Given a training set {(x_m, y_m)}, m = 1, ..., M, of input-output data pairs from classes A and B, the expected risk is typically approximated by the empirical risk I[f] = (1/M) ∑_m L(f(x_m), y_m), where L is a loss function indicating how similar the output f(x_m) is to the target label y_m. Choosing the linear loss function² L(f(x), y) = −f(x)y, the empirical risk of a quantum classifier is given by I[f] = −tr((ρ − σ)M), where ρ is the ensemble of inputs from class A with label 1, and σ is the ensemble of inputs from class B with label −1.

¹ Note that any binary classifier can be turned into a multi-class classifier, for example using a one-versus-all strategy.

² The more common quadratic loss has additional terms f(x)², which are of order 1/M² and punish large values of f. For typical quantum classifiers f(x) lies in a bounded interval, and the linear loss is a more natural choice. Other loss functions often depend on the linear loss.



We consider two well-known measurements from quantum information theory, a fidelity as well as a Helstrøm measurement. The fidelity measurement can be implemented by a SWAP test, which is a simple quantum procedure that estimates the overlap or fidelity F = |⟨ϕ|ψ⟩|² of two quantum states |ϕ⟩, |ψ⟩. Performing the SWAP test on |ϕ⟩, |ψ⟩ k times allows one to estimate the fidelity to an accuracy ±√(F(1 − F)/k). The SWAP test on n-qubit states can be performed using a quantum circuit that adjoins a single ancilla qubit to the 2n qubits used to represent |ϕ⟩ and |ψ⟩, and by performing n controlled-SWAP operations, followed by measurement of the ancilla qubit. For example, if we embed our data in a 2^50 ≈ 10^15 dimensional Hilbert space, each SWAP test requires 50 controlled-SWAP operations to be performed on 101 qubits. As shown in Ref. [15], if the embedding circuit Φ(x) can be inverted, one can alternatively implement the circuit Φ(x′)†Φ(x)|0...0⟩ on a register of n qubits for two inputs x, x′, and measure the overlap with the |0...0⟩ state. While requiring the same number of repetitions, this “inversion test” scheme only needs n qubits overall.
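The inversion test is simple to sketch in PennyLane. The toy embedding Φ below (a single RX rotation on one qubit) is an assumed stand-in for a real feature map:

    import pennylane as qml

    dev = qml.device("default.qubit", wires=1)

    def Phi(x):
        qml.RX(x, wires=0)  # placeholder one-parameter embedding

    @qml.qnode(dev)
    def inversion_test(x1, x2):
        # Prepare Phi(x2)^dagger Phi(x1)|0> and measure in the computational basis
        Phi(x1)
        qml.adjoint(Phi)(x2)
        return qml.probs(wires=0)

    overlap = inversion_test(0.3, 1.1)[0]  # probability of |0> equals |<x2|x1>|^2
    print(overlap)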

A fidelity classifier (see Appendix C, but also [26, 27]) uses either the SWAP or inversion test to assign a state |x⟩ to A if the average probability of projecting the state onto the training states in A is higher than the probability of projecting it onto the training states in B. Hence, f_fid(x) = ⟨x|(ρ − σ)|x⟩. The empirical risk of the fidelity classifier (using linear loss) is given by I[f_fid] = −tr((ρ − σ)²) = −D_hs(ρ, σ). Therefore, maximizing the Hilbert-Schmidt distance of the embedding minimizes the empirical risk of the fidelity classifier under linear loss.

The second measurement we consider is a Helstrøm measurement. Helstrøm measurements are known to be optimal for the task of quantum state discrimination, in which a single measurement has to assign a quantum state to one of two clusters [17, 28], a fact that has been used in the context of machine learning before [29–31]. The classification problem we consider here differs from this setting, since the classifying measurement can be repeated, and the objective is to minimize the expected risk (see Appendix B). A Helstrøm measurement projects onto the positive and negative subspaces of ρ − σ, using projection operators Π₊, Π₋, so that f_hel(x) = ⟨x|Π₊ − Π₋|x⟩.

As shown in [4], the minimum error probability measurement can be performed efficiently (in time O(Rn)) on an n-qubit quantum computer as long as ρ and σ have a rank R that does not grow with the dimension of the Hilbert space. In Appendix D we outline a quantum circuit implementation and show that in a single run of the measurement circuit, ρ and σ are pure states, so that R ≤ 2. The empirical risk of the corresponding Helstrøm classifier under linear loss is given by the trace distance, I[f_hel] = −tr|ρ − σ| = −2D_tr(ρ, σ), and hence maximizing the trace distance of the embedding minimizes the risk of the Helstrøm classifier under linear loss.

FIG. 2. Decision boundary on a 2-d moons dataset for the Helstrøm and fidelity classifiers, training the embedding with the trace and Hilbert-Schmidt distance, respectively. The embedding was trained for 500 steps with an RMSProp optimizer and batch size 5, using a 2-qubit QAOA feature map of 4 layers, and reaching a final cost of 0.28 for ℓ1 and 0.55 for ℓ2 training (see Section IV). Although the details of the plots vary with the hyperparameters of the training, the example illustrates that both classifiers give rise to an overall similar, but not identical, decision boundary. Consistent with state discrimination theory, one can also see that the Helstrøm classifier has outputs closer to the extremes 1 and −1. The embedding was trained in PennyLane, and the classifiers were simulated analytically.

In summary, once one has decided which Hilbert space metric to use when training the embedding, the optimal measurement for state assignment with respect to the linear loss is known. Figure 2 shows the decision boundaries of the two classifiers for a 2-d moons dataset, for which quantum embeddings were separately trained with the ℓ1 and ℓ2 distance.

III. TRAINING THE EMBEDDING IN PRACTICE

While the trace distance plays a central role in quantum information theory, the Hilbert-Schmidt distance can be measured and hence optimized by a much smaller quantum circuit, which is significant for near-term quantum computing. Accordingly, in our analysis we focus on the Hilbert-Schmidt distance only. Note that the distances are closely related by [32] (1/2)D_hs ≤ D_tr² ≤ r D_hs, where r = rank(ρ)rank(σ)/(rank(ρ) + rank(σ)). If rank(ρ) = rank(σ) = 1, we have the equality D_tr² = (1/2)D_hs.
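This relation is easy to verify numerically. The sketch below samples random pure states, so that rank(ρ) = rank(σ) = 1 and the equality D_tr² = (1/2)D_hs should hold up to floating-point error:

    import numpy as np

    def random_pure_density(dim):
        psi = np.random.randn(dim) + 1j * np.random.randn(dim)
        psi /= np.linalg.norm(psi)
        return np.outer(psi, psi.conj())

    rho, sigma = random_pure_density(4), random_pure_density(4)
    diff = rho - sigma
    # Trace (l1) distance: half the sum of the absolute eigenvalues of rho - sigma
    d_tr = 0.5 * np.sum(np.abs(np.linalg.eigvalsh(diff)))
    # Hilbert-Schmidt (l2) distance
    d_hs = np.real(np.trace(diff @ diff))
    print(d_tr**2, 0.5 * d_hs)  # the two numbers coincide for rank-1 ensembles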

In Hilbert-Schmidt optimization, the cost to be minimized during training is given by C = 1 − (1/2)D_hs. The different terms tr ρσ, tr σ², and tr ρ² in D_hs can be estimated on a quantum computer by performing repeated SWAP or inversion tests, either between inputs of the same class, or between inputs of different classes.


FIG. 3. Ansatz for a single trainable layer of the embedding used in the experiments, showing 3 inputs x1, x2, x3, as well as trainable ZZ-entanglers and RY rotations. Latent qubits, as shown with the fourth qubit (red), can be added to increase the dimension of the Hilbert space. The quantum embedding consists of several such layers, and a final repetition of the feature-encoding rotations.

The terms in C have a simple intuitive meaning. The term tr ρσ measures the distance between the two ensembles in Hilbert space via the inter-cluster overlap; tr ρσ = 1 indicates that the ensembles are constructed from the same set of pure states, while tr ρσ = 0 means that all embedded data points are orthogonal. The ‘purity’ terms tr ρ², tr σ² are measures of the intra-cluster overlap, which is closely related to the rank of the respective density matrices. When tr ρ² = 1, the embedding maps all inputs a to the same pure state |a⟩, and hence rank(ρ) = 1; this means that the cluster of class-A states in Hilbert space is maximally tight. For tr ρ² = 1/2^n (n being the number of qubits representing |x⟩) the density matrix has full rank and the states |a⟩ are maximally spread in Hilbert space. Overall, the cost lies in the interval [0, 1].
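Assuming the pairwise overlaps are available (e.g., from inversion tests, or computed directly in simulation as below), the three terms and the cost C can be estimated with a few lines of NumPy; A and B here are hypothetical lists of embedded state vectors:

    import numpy as np

    def mean_squared_overlap(S1, S2):
        # Estimates tr(rho1 rho2) for ensembles of pure states S1, S2
        return np.mean([np.abs(np.vdot(s1, s2)) ** 2 for s1 in S1 for s2 in S2])

    def hs_cost(A, B):
        tr_rho_sigma = mean_squared_overlap(A, B)  # inter-cluster overlap
        tr_rho2 = mean_squared_overlap(A, A)       # purity of the class-A ensemble
        tr_sigma2 = mean_squared_overlap(B, B)     # purity of the class-B ensemble
        d_hs = tr_rho2 + tr_sigma2 - 2 * tr_rho_sigma
        return 1 - 0.5 * d_hs

For perfectly trained embeddings that map each class to a single state and the two classes to orthogonal states, tr ρσ = 0 and tr ρ² = tr σ² = 1, giving the minimal cost C = 0.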

Optimizing according to the Hilbert-Schmidt distance has some useful features: it automatically leads to low-rank embeddings, which correspond to tight clusters in Hilbert space. In addition to reducing the difference between ℓ1 (Helstrøm) and ℓ2 (Hilbert-Schmidt) optimization, low-rank ρ, σ are favourable for generalization, since they increase the similarity of an unseen data point to its cluster. Low-rank data clusters also reduce the number of samples S required to execute the measurement: the probability of assigning a class-A input to class A is given by p = tr ρ², while the probability of assigning it to class B is q = tr ρσ. To distinguish between the probabilities p, q we require √(p(1 − p)/S) < p − q as an upper bound for the error, so that S > O(1/tr ρ²).

Lastly, the Hilbert-Schmidt distance is equivalent to the maximum mean discrepancy between the two distributions from which the data of the two classes was sampled [33]. This measure has proven a powerful tool in training Generative Adversarial Networks [34, 35], and opens up a wealth of statistical results for training [36].

IV. NUMERICAL EXPERIMENTS

We demonstrate the ideas of trainable quantum embeddings with the PennyLane software framework [20], and an embedding circuit ansatz that is inspired by the Quantum Approximate Optimization Algorithm (QAOA) [37]³. Repetitions or “layers” of the ansatz (see also [38]) can implement classically intractable [39] feature maps that are universal for quantum computing. The ansatz uses one- and two-qubit quantum logic operations, such as single-qubit rotations e^(−iθ1 σ_j) and two-qubit ZZ interactions e^(−iθ2 σ_z⊗σ_z), where the axis of rotation j and the angles of rotation θ1, θ2 are individual parameters that we can select. Here we will consider a circuit consisting of layers of individually programmable single-qubit Pauli-X rotations Rx, and a chain of individually programmable pairwise ZZ or “Ising” interactions (which commute with each other). We have empirically found it favourable to also include a block of single-qubit Pauli-Y rotations Ry as “local fields”.⁴ The feature-encoding gates are repeated once more after the last layer. The input to the circuit is taken to be the state |0...0⟩, and one can increase the feature-space dimension by adding “latent qubits” as shown in Figure 3.

To perform the embedding, we designate the Rx parameters to encode the input features x = (x1, ..., xN)^T, and the remainder to encode the trainable parameters θ. The overall unitary transformation Φ(x, θ) is then a function of the weights and the input, and the embedding takes the form x → |x⟩ = Φ(x, θ)|0...0⟩. The overlap between two embedded states is ⟨x1|x2⟩. Because of the universality of the circuit ansatz in the limit of many layers, evaluating this overlap for different values of θ, x1, x2 is equivalent to evaluating the outcome of an arbitrary quantum computation over n qubits, and so is inaccessible to a classical computer unless BQP=P. Even very shallow circuits taken from this circuit model suffice to produce embeddings that cannot be produced classically. A three-layer circuit consisting of X rotations, followed by an Ising layer and a second layer of X rotations, furthermore implements instantaneous quantum polynomial-time computation (IQP) [40]. The fact that quantum systems can explore a larger class of embeddings than classical systems gives a motivation to explore whether this larger class of embeddings might allow quantum computers to perform better on some set of classification problems.
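A sketch of the layer structure in PennyLane follows; the parameter shapes and the padding of features onto latent qubits are our own illustrative assumptions (recent PennyLane versions also provide a built-in QAOAEmbedding template along these lines):

    import pennylane as qml

    def qaoa_layer(x, theta, wires):
        # Feature-encoding single-qubit X rotations (latent qubits recycle features)
        for i, w in enumerate(wires):
            qml.RX(x[i % len(x)], wires=w)
        # Chain of trainable, mutually commuting ZZ ("Ising") interactions
        for i in range(len(wires) - 1):
            qml.IsingZZ(theta[i], wires=[wires[i], wires[i + 1]])
        # Trainable Pauli-Y rotations acting as "local fields"
        for i, w in enumerate(wires):
            qml.RY(theta[len(wires) - 1 + i], wires=w)

    def embedding(x, thetas, wires):
        # thetas: one parameter vector of length 2*len(wires) - 1 per layer
        for theta in thetas:
            qaoa_layer(x, theta, wires)
        # Final repetition of the feature-encoding rotations
        for i, w in enumerate(wires):
            qml.RX(x[i % len(x)], wires=w)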

Figure 4 shows a toy model of embedding 1-dimensional data into a single qubit using the “QAOA feature map”. The trained feature map embeds the data into tight and linearly separable clusters on the Bloch sphere, and a linear decision boundary in the 2-dimensional Hilbert space corresponds to two linear decision boundaries in the original space. Figure 5 shows that the quantum embedding can be used on larger datasets by a hybrid approach, adapting an experiment from [10] that combines a classical ResNet with a quantum circuit.

³ This has also – and more fittingly – been termed the “Quantum Alternating Operator Ansatz”.

⁴ Note that fully-connected nearest-neighbor coupled pairwise ZZ interactions in one or more dimensions suffice for universality [39].


FIG. 4. Illustrative example of a quantum embedding and classification for a non-overlapping, but not linearly separable, one-dimensional dataset. Training was done using the cost C defined in Section III, and an RMSProp optimizer with initial learning rate 0.01 and batch size 2. a. The untrained feature map distributes the data arbitrarily on the Bloch sphere, while after 200 steps of training the classes are well-separated. b. The fidelity classifier draws a linear decision boundary on the Bloch sphere, which translates to two linear decision boundaries in the original space. The simulations were done in the PennyLane software framework [20].

PennyLane allows the joint optimization of parameters for the classical and quantum model. Interestingly, one sees that the neural network learns to map the data to a periodic structure, which is a very natural input to a quantum circuit with Rx feature encoding.

While these demonstrations show that the training of quantum embeddings is possible, in this study we do not attempt to answer the question of whether quantum embeddings are better than classical machine learning methods, which is in some sense as hard a question as that of the general power of quantum machine learning. However, our approach offers a fruitful analytical framework: the power of a quantum classifier translates into the ability to find embeddings that maximize the distance between data clusters in Hilbert space.

V. CAPACITY OF EMBEDDINGS ON NEAR-TERM QUANTUM DEVICES

Lastly, we briefly discuss the feasibility of implementing quantum embeddings on near-term quantum computers such as superconducting or ion-trap quantum information processors. We require 2n + 1 qubits to embed two n-qubit states and to perform a SWAP test. Each n-qubit state is created by applying a pulse sequence to our qubits, specified by parameters x, θ, as above, where x specifies the input data and θ determines the form of the embedding. Applying this pulse sequence to an initial state |0...0⟩ yields an embedded state |x⟩ = Φ(x, θ)|0...0⟩. Let us consider the physical requirements for embedding a single n-qubit state.

Assume that we can address each of the n qubits in our device individually by time-varying electromagnetic pulses with bandwidth Ω. The number of classical bits of information that the pulses can embed in the quantum state of the computer within the coherence time t is then 2Ωtnb, where b is the number of bits required to specify the strength of the electromagnetic field at each sample time. To satisfy the requirements of quantum intractability, we ask that the circuit be of depth O(n), to spread quantum information throughout the device via qubit-qubit interactions.

To put in some current numbers: for 100 superconducting qubits with Ω equal to 10 GHz, 10 bits required to specify the strength of the field, and a coherence time of 10^−3 s, we can embed O(10^10) classical bits within the decoherence time. An interaction strength of 100 MHz suffices to enact a quantum circuit of depth 100 gates, which is deep enough to allow quantum information to spread through the system. Other proposed solid-state quantum computing platforms, e.g., silicon quantum dots or NV-diamond centers, exhibit similar numbers to superconducting systems.


FIG. 5. Hybrid quantum-classical embedding. An image is fed into a pre-trained ResNet. The last layer is replaced by a linear layer, transforming the 512-dimensional output to a 2-dimensional feature vector, which is fed into a parametrized quantum circuit. The circuit parameters are trained together with the final classical layer (red). After 1500 training steps with an Adagrad optimizer, the classical weights learn to map the two classes to periodically arranged points in the intermediate feature space, which allows the quantum circuit to perfectly separate the two classes in quantum feature space. Each optimization step uses a batch of 2 randomly sampled training points. The simulations were done in PennyLane.

For 100 ion-trap qubits with Ω = 100 MHz, 10 bits per sample, and a coherence time of 1 s, we can embed O(10^11) classical bits within the decoherence time. The typical interaction strength of 10 kHz suffices to enact a quantum circuit with depth 10^4, which is more than adequate for a quantum embedding that is unattainable classically. Other proposed atom-optical quantum computing platforms, e.g., optical lattices, exhibit numbers similar to ion traps. In conclusion, high-dimensional embeddings of large data sets for quantum machine learning should be accessible on current devices.
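The back-of-the-envelope estimate 2Ωtnb can be reproduced directly; the numbers below are the ones quoted in the text:

    # Classical bits embeddable within the coherence time: 2 * Omega * t * n * b
    def capacity_bits(omega_hz, t_coh_s, n_qubits, bits_per_sample):
        return 2 * omega_hz * t_coh_s * n_qubits * bits_per_sample

    # Superconducting: 100 qubits, 10 GHz bandwidth, 1 ms coherence, 10 bits/sample
    print(capacity_bits(1e10, 1e-3, 100, 10))  # 2e10, i.e. O(10^10) bits

    # Ion trap: 100 qubits, 100 MHz bandwidth, 1 s coherence, 10 bits/sample
    print(capacity_bits(1e8, 1.0, 100, 10))    # 2e11, i.e. O(10^11) bits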

VI. CONCLUSION

This work proposes “quantum metric learning” as an experimentally accessible and analytically promising approach to quantum machine learning. Classical data points are embedded as quantum states, and then compared using optimal quantum measurements. The embedding is adjusted adaptively to train the quantum classifiers. Both the embedding and the measurements can be performed on the relatively small, shallow quantum circuits afforded by intermediate-term quantum computing platforms. While the results of quantum embeddings can in general not be reproduced on classical computers, the existence of a useful quantum advantage is an open question.

Acknowledgments

The authors want to thank Juan Miguel Arrazola, Joonwoo Bae, Guillaume Dauphinais, Yann LeCun, and Nicolas Quesada for helpful comments.

[1] P. Wittek, Quantum Machine Learning: What Quantum Computing Means to Data Mining (Academic Press, 2014).
[2] J. Biamonte, P. Wittek, N. Pancotti, P. Rebentrost, N. Wiebe, and S. Lloyd, Nature 549, 195 (2017).
[3] M. Schuld and F. Petruccione, Supervised Learning with Quantum Computers, Vol. 17 (Springer, 2018).
[4] S. Lloyd, M. Mohseni, and P. Rebentrost, Nature Physics 10, 631 (2014).
[5] S. Lloyd, S. Garnerone, and P. Zanardi, Nature Communications 7, 10138 (2016).
[6] P. Rebentrost, M. Mohseni, and S. Lloyd, Physical Review Letters 113, 130503 (2014).
[7] E. Farhi and H. Neven, arXiv preprint arXiv:1802.06002 (2018).
[8] M. Schuld, A. Bocharov, K. Svore, and N. Wiebe, arXiv preprint arXiv:1804.00633 (2018).
[9] M. Benedetti, E. Lloyd, S. Sack, and M. Fiorentini, Quantum Science and Technology (2019).
[10] A. Mari, T. R. Bromley, J. Izaac, M. Schuld, and N. Killoran, Coming soon (2019).
[11] M. Benedetti, D. Garcia-Pintos, O. Perdomo, V. Leyton-Ortega, Y. Nam, and A. Perdomo-Ortiz, npj Quantum Information 5, 45 (2019).
[12] K. Mitarai, M. Negoro, M. Kitagawa, and K. Fujii, Physical Review A 98, 032309 (2018).
[13] M. Schuld, V. Bergholm, C. Gogolin, J. Izaac, and N. Killoran, Physical Review A 99, 032331 (2019).
[14] M. Schuld and N. Killoran, Physical Review Letters 122, 040504 (2019).
[15] V. Havlíček, A. D. Córcoles, K. Temme, A. W. Harrow, A. Kandala, J. M. Chow, and J. M. Gambetta, Nature 567, 209 (2019).
[16] B. Schölkopf and A. J. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond (MIT Press, 2001).
[17] C. W. Helstrom, Quantum Detection and Estimation Theory (Academic Press, 1976).
[18] J. Bromley, I. Guyon, Y. LeCun, E. Säckinger, and R. Shah, in Advances in Neural Information Processing Systems (1994) pp. 737–744.
[19] S. Chopra, R. Hadsell, Y. LeCun, et al., in CVPR (1) (2005) pp. 539–546.
[20] V. Bergholm, J. Izaac, M. Schuld, C. Gogolin, and N. Killoran, arXiv preprint arXiv:1811.04968 (2018).
[21] A. W. Harrow and A. Montanaro, Nature 549, 203 (2017).
[22] J. Bourgain, Israel Journal of Mathematics 52, 46 (1985).
[23] J. M. Kübler, A. Arrasmith, L. Cincio, and P. J. Coles, arXiv preprint arXiv:1909.09083 (2019).
[24] R. Sweke, F. Wilde, J. Meyer, M. Schuld, P. K. Fährmann, B. Meynard-Piganeau, and J. Eisert, arXiv preprint arXiv:1910.01155 (2019).
[25] V. N. Vapnik, IEEE Transactions on Neural Networks 10, 988 (1999).
[26] C. Blank, D. K. Park, J.-K. K. Rhee, and F. Petruccione, arXiv preprint arXiv:1909.02611 (2019).
[27] M. Schuld, M. Fingerhuth, and F. Petruccione, arXiv preprint arXiv:1703.10793 (2017).
[28] A. Chefles, Contemporary Physics 41, 401 (2000).
[29] S. Gambs, arXiv preprint arXiv:0809.0444 (2008).
[30] G. Sergioli, G. M. Bosyk, E. Santucci, and R. Giuntini, International Journal of Theoretical Physics 56, 3880 (2017).
[31] G. Sentís Herrera, A. Monràs Blasi, R. Muñoz Tapia, J. Calsamiglia Costa, and E. Bagan Capella, Physical Review X 9.
[32] P. J. Coles, M. Cerezo, and L. Cincio, arXiv preprint arXiv:1903.11738 (2019).
[33] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola, Journal of Machine Learning Research 13, 723 (2012).
[34] Y. Li, K. Swersky, and R. Zemel, in International Conference on Machine Learning (2015) pp. 1718–1727.
[35] C.-L. Li, W.-C. Chang, Y. Cheng, Y. Yang, and B. Póczos, in Advances in Neural Information Processing Systems (2017) pp. 2203–2213.
[36] I. O. Tolstikhin, B. K. Sriperumbudur, and B. Schölkopf, in Advances in Neural Information Processing Systems (2016) pp. 1930–1938.
[37] E. Farhi, J. Goldstone, and S. Gutmann, arXiv preprint arXiv:1411.4028 (2014).
[38] A. Pérez-Salinas, A. Cervera-Lierta, E. Gil-Fuster, and J. I. Latorre, arXiv preprint arXiv:1907.02085 (2019).
[39] S. Lloyd, arXiv preprint arXiv:1812.11075 (2018).
[40] M. J. Bremner, R. Jozsa, and D. J. Shepherd, Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences 467, 459 (2010).
[41] E. Grant, M. Benedetti, S. Cao, A. Hallam, J. Lockhart, V. Stojevic, A. Green, and S. Severini, npj Quantum Information 4 (2018).
[42] J. Bae and L.-C. Kwek, Journal of Physics A: Mathematical and Theoretical 48, 083001 (2015).
[43] C. W. Helstrom, Journal of Statistical Physics 1, 231 (1969).
[44] S. Kimmel, C. Y.-Y. Lin, G. H. Low, M. Ozols, and T. J. Yoder, npj Quantum Information 3, 13 (2017).
[45] M. A. Nielsen and I. Chuang, Quantum Computation and Quantum Information (2002).

Appendix A: Classification with quantum circuits

Here we provide a more technical definition of quantum classifiers in the context of binary classification. Like in the main paper, we consider the following problem setup:

Definition 1. Let X be a data domain. Assume we are given data samples a1, ..., a_{Ma} from class A ⊆ X with label 1, and data samples b1, ..., b_{Mb} from class B ⊆ X with label −1, as well as a new input x ∈ X. The problem of binary classification is to predict the label of x, assigning it to either class A or class B.

A classifier is a model or algorithm solving the problem of binary classification:

Definition 2. A classifier is a map from the data domain to the real numbers, f : X → R. The classifier assigns a binary label to x according to the thresholding rule

y = −1 if f(x) < τ,  and  y = 1 if f(x) ≥ τ,

with τ ∈ R. If no other information is provided, τ is assumed to be zero.

Quantum classifiers are specific models that use quantum theory to solve classification tasks. They are based on a representation of data as states of a quantum system.

Definition 3. A quantum embedding is a quantum state |x⟩ that represents a data input x ∈ X. It is facilitated by a quantum feature map φ : x → |x⟩. For example, the quantum feature map can be executed by a quantum circuit Φ(x) whose gate parameters depend on x.

A quantum classifier combines an embedding with a measurement.


Definition 4. Let |x⟩ be an embedding of a data input x ∈ X. A quantum classifier is a classifier where f(x) is an expectation ⟨x|M|x⟩ of a quantum measurement of the observable M.

Typical quantum classifier models proposed in the literature [7, 8, 14, 15, 41] fit into this framework. Most choose M = σ_z ⊗ 1 ⊗ ... ⊗ 1, so that the measurement queries the computational basis state of the first qubit, and train a variational ansatz after the embedding by optimizing a circuit U(θ). This can be interpreted as selecting the optimal measurement basis by implementing an adjustable measurement U(θ)†MU(θ). However, U is a linear transformation on the embedded inputs |x⟩, which means that most of the nonlinear discriminating power is determined by the embedding, not the measurement. Note that the quadratic form ⟨x|M|x⟩ of a measurement introduces a square nonlinearity in |x⟩.
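For concreteness, a minimal PennyLane sketch of this standard construction follows, with placeholder gates standing in for the embedding Φ(x) and the variational block U(θ):

    import pennylane as qml

    dev = qml.device("default.qubit", wires=2)

    @qml.qnode(dev)
    def variational_classifier(x, theta):
        # Embedding Phi(x): placeholder feature-encoding rotations
        qml.RX(x[0], wires=0)
        qml.RX(x[1], wires=1)
        # Variational circuit U(theta): effectively rotates the measurement basis
        qml.RY(theta[0], wires=0)
        qml.RY(theta[1], wires=1)
        qml.CNOT(wires=[0, 1])
        # Readout of M = sigma_z x 1: f(x) = <x|U(theta)^dag M U(theta)|x>
        return qml.expval(qml.PauliZ(0))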

To draw connections to quantum state discrimination, one can also define the classification problem in the Hilbert space of the embedding, a version we will call quantum binary classification:

Definition 5. For a given embedding Φ, quantum binary classification is the problem of assigning |x⟩ to one of two “data ensembles” of quantum feature states, ρ = (1/M_a) ∑_a |a⟩⟨a| or σ = (1/M_b) ∑_b |b⟩⟨b|.

Here, ρ and σ are mixed states that describe the process of selecting an embedded data point |a⟩, |b⟩ with uniform probability from a training set.

In the main paper, we discuss two different quantum classifiers (which have also been discussed in other contexts, see [26, 27, 29, 30]).

Definition 6. Let ρ − σ = ∑_j λ_j |d_j⟩⟨d_j| be the difference of two data ensembles ρ, σ expressed in the diagonal basis |d_j⟩ with eigenvalues λ_j, and let

Π₊ = ∑_{λ_j>0} |d_j⟩⟨d_j|,  Π₋ = ∑_{λ_j<0} |d_j⟩⟨d_j|

be projection operators, where the sums run over all j for which the eigenvalue λ_j is positive or negative, respectively. We define the Helstrøm classifier as the quantum classifier f_hel(x) = ⟨x|Π₊ − Π₋|x⟩.

Definition 7. For two data ensembles ρ, σ, the fidelity classifier is defined as f_fid(x) = ⟨x|ρ − σ|x⟩. It measures the fidelity or overlap between data states, since ⟨x|ρ − σ|x⟩ = (1/M_a) ∑_a |⟨a|x⟩|² − (1/M_b) ∑_b |⟨b|x⟩|².

Appendix B: Quantum classifiers and state discrimination

The problem of quantum classification in Definition 5 has been extensively studied in the context of quantum state discrimination (e.g., [42]). Two techniques are popularly used in this regard: minimum-error state discrimination and unambiguous state discrimination. In the first case, an assignment of |x⟩ to either ρ or σ has a non-zero probability of being erroneous. In the second case, the assignment is promised to be either correct or inconclusive [28].

While the problem underlying quantum classification and state discrimination is essentially the same, the two frameworks pose very different requirements on what constitutes a desirable solution. For example, most of the literature in state discrimination asks for the discrimination power of a single measurement. To identify an unknown state from two sets of states, the optimal measurement for single-shot state discrimination is known to be a Helstrøm measurement [43], and the smallest probability of making a wrong guess is governed by the Holevo-Helstrøm theorem.

In contrast, a quantum classifier can prepare and measure quantum states multiple times, which requires running a fixed embedding circuit on a quantum computer, an operation that is relatively cheap to repeat on typical quantum computing platforms. With multiple runs, we can estimate the deterministic expectation ⟨x|M|x⟩ of the measurement observable M, which renders single-shot or minimum-error considerations meaningless. Conversely, the goal of classification is to generalize, which means to minimize the probability of misclassifying new data points. As a proxy, one minimizes a loss on a set of training examples to find a decision boundary between classes. For a fixed embedding, different measurements correspond to different decision boundaries. As a result, the notion of a “best” or “optimal” measurement for classification depends on the chosen embedding and loss.


FIG. 6. Two different routines to measure the overlap of two quantum states. Left: a SWAP test. The third register contains embedded inputs c = a, b sampled from either class A or B. Right: if one can implement an inverse embedding Φ†, the “inversion test” measures the overlap of Φ(c)†Φ(x)|0...0⟩ with |0...0⟩, using a projector P_{0...0} onto the |0...0⟩ subspace (and c = a or c = b) [15].

In the remainder of the Appendix we present quantum circuits for the fidelity and Helstrøm classifiers. While both can be efficiently implemented (i.e., the number of gates does not grow with the dimension of the Hilbert space in which the data is embedded), we motivate in Appendix D how an implementation of the Helstrøm circuit with optimal sample complexity still requires a large number of copies of registers containing the training data, and is therefore not feasible for near-term quantum computers.

Appendix C: The fidelity classifier

Since

⟨x|ρ − σ|x⟩ = (1/M_a) ∑_a |⟨x|a⟩|² − (1/M_b) ∑_b |⟨x|b⟩|²,

the fidelity classifier can be constructed from evaluating the overlaps |⟨x|a⟩|², |⟨x|b⟩|² – an operation that requires very few resources on a quantum computer.

Figure 6 shows two different routines to measure overlaps of two quantum states |x⟩ and |c⟩ represented by n qubits, where |c⟩ = |a⟩, |b⟩ in this application. The left circuit implements a SWAP test, which evaluates overlaps of two quantum states using 2n + 1 qubits and has no general requirements for the feature map Φ. Measuring the ancilla multiple times allows us to estimate the expectation of the first qubit, ⟨σ_z⟩ = |⟨c|x⟩|². As mentioned in Ref. [15], if one can implement the inverse Φ(x, θ)† of the embedding circuit Φ(x, θ), the right “inversion test” circuit can be used to significantly save resources.
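A minimal SWAP test for single-qubit registers can be sketched in PennyLane as follows; the RY state preparations are placeholders for the embedded states |x⟩ and |c⟩:

    import pennylane as qml
    import numpy as np

    dev = qml.device("default.qubit", wires=3)  # ancilla + two one-qubit registers

    @qml.qnode(dev)
    def swap_test(angle_x, angle_c):
        qml.RY(angle_x, wires=1)    # prepare |x> (placeholder embedding)
        qml.RY(angle_c, wires=2)    # prepare |c>
        qml.Hadamard(wires=0)
        qml.CSWAP(wires=[0, 1, 2])  # one controlled-SWAP per register qubit pair
        qml.Hadamard(wires=0)
        return qml.expval(qml.PauliZ(0))  # <sigma_z> on the ancilla = |<c|x>|^2

    print(swap_test(0.4, 1.2), np.cos((1.2 - 0.4) / 2) ** 2)  # analytic check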

Using ρ − σ = ∑_j λ_j |d_j⟩⟨d_j|, the expression

⟨x|ρ − σ|x⟩ = ∑_j λ_j |⟨d_j|x⟩|²

reveals that the fidelity classifier has the same structure as the Helstrøm classifier (compare Eq. (D1)), except that instead of only considering the sign of the eigenvalues λ_j, it also considers their magnitude.

Appendix D: The Helstrøm classifier

An implementation of a Helstrøm measurement has been suggested in the context of quantum principal component analysis (QPCA) [4], a routine that “simulates” a density matrix to extract its eigenvalues. The scheme requires roughly O(Rn) gates, where R is the rank of the exponentiated matrix, and n is the number of qubits used to represent the eigenstates. QPCA was shown to be optimal with respect to the required number of copies of the simulated density matrix [44]. Ref. [44] also shows that the framework easily extends to simulating the difference of two density matrices. This can be used to exponentiate m = ρ − σ, and to extract the sign of m's eigenvalues λ_j with optimal sample complexity.

Whether Helstrøm measurements can be efficiently implemented was an open question in [29]. Using the QPCA routine, an efficient implementation would require the rank R to grow at most polynomially with the number of qubits n. We show here that this is indeed the case for classification in our framework. As explained in more detail below, in every run of the circuit one constructs the measurement using two pure states |a⟩⟨a|, |b⟩⟨b| sampled from the data ensembles, and |a⟩⟨a| − |b⟩⟨b| has a maximum rank of 2. However, the QPCA implementation of the Helstrøm classifier is unlikely to be used on near-term quantum computers, since typical implementations of the phase estimation subroutine all require a large number of registers coherently prepared in states |a⟩, |b⟩.


1. Concept

The Helstrøm classifier can be implemented by a quantum circuit that results in the following expectation of a single-qubit Pauli-Z observable:

⟨σ_z⟩ = ∑_j sign(λ_j) |⟨d_j|x⟩|².  (D1)

Since

∑_j sign(λ_j) |⟨d_j|x⟩|² = ∑_{λ_j>0} |⟨d_j|x⟩|² − ∑_{λ_j<0} |⟨d_j|x⟩|² = tr(Π₊|x⟩⟨x|) − tr(Π₋|x⟩⟨x|),

the expectation ⟨σ_z⟩ carries all necessary information for the assignment: if the expectation is positive – or equivalently, the corresponding qubit is more likely to be measured in state |0⟩ than |1⟩ – we assign the new input to ρ, and otherwise to σ. The challenge of implementing this result in a quantum circuit is to evaluate the sign function, which is non-smooth.
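The quantity in Eq. (D1) is straightforward to check numerically. The NumPy sketch below diagonalizes ρ − σ for a pure-state toy example and evaluates the signed overlap sum:

    import numpy as np

    def helstrom_expectation(rho, sigma, x):
        # <sigma_z> = sum_j sign(lambda_j) |<d_j|x>|^2 for the eigensystem of rho - sigma
        lam, vecs = np.linalg.eigh(rho - sigma)
        overlaps = np.abs(vecs.conj().T @ x) ** 2
        return np.sum(np.sign(lam) * overlaps)

    # Toy example on one qubit: rho = |0><0|, sigma = |+><+|
    a = np.array([1.0, 0.0])
    b = np.array([1.0, 1.0]) / np.sqrt(2)
    rho, sigma = np.outer(a, a.conj()), np.outer(b, b.conj())
    print(helstrom_expectation(rho, sigma, a))  # positive: x = a is assigned to rho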

The QPCA routine suggests a method to implement the Helstrøm measurement using a combination of density-matrix exponentiation (DME) and quantum phase estimation (QPE). Roughly speaking, DME prepares a state

e^{−iδ(ρ−σ)} |x⟩|0...0⟩ = ∑_j ⟨d_j|x⟩ e^{−iδλ_j} |d_j⟩|0...0⟩

up to error O(δ²). DME can be simulated for longer times kδ, up to error O(kδ²), if k data copies of ρ, σ are used. By choosing specific values in [0, 2π] for kδ, introducing a phase shift (to resolve the sign of the phase), and controlling the operation on L ancilla qubits, one can use DME as a subroutine of standard QPE [45]. QPE creates a superposition of phases encoded in the computational basis states of the ancillas. The first qubit in the jth basis state is zero if the jth eigenvalue is negative, and one otherwise. The expectation of this qubit is consequently proportional to the desired value of ⟨σ_z⟩.

Measuring the first qubit in the QPE register hence resolves the sign of the eigenvalues of ρ − σ, weighted by the overlap |⟨d_j|x⟩|², as specified in Eq. (D1). Note that one cannot make use of slimmer iterative QPE schemes (see [44]), since the result of the computation is not a single phase, but the expectation of a qubit which depends on a “superposition of extracted phases”.

2. Sampling pure training states

The observable Π₊ − Π₋ of a Helstrøm measurement is formally constructed from ρ − σ. However, when implementing the QPCA-based measurement circuit, each run uses samples a, b from the two classes, so that in every run the measurement is based on the pure states |a⟩⟨a| − |b⟩⟨b|. One then averages the expectation values estimated for every |a⟩, |b⟩ pair, ⟨σ_z⟩_{a,b}, to get the overall expectation from Eq. (D1),

⟨σ_z⟩ = (1/(M_a M_b)) ∑_a ∑_b ⟨σ_z⟩_{a,b}.

To see that averaging over pure-state expectations is formally equivalent to the mixed-state expectation, consider a circuit W applied to an initial state |ψ⟩ = |...⟩ ⊗ |a⟩ ⊗ |b⟩. Using density-matrix notation η = |ψ⟩⟨ψ|, the final expectation value of an observable M is given by tr(WηW†M). Averaging over several runs of the circuit with a, b sampled with probabilities p_a, p_b, we get the “classical expectation of the quantum expectation”

∑_a ∑_b p_a p_b tr(WηW†M) = tr(W (∑_a ∑_b p_a p_b η) W†M),

where

∑_a ∑_b p_a p_b η = |...⟩⟨...| ⊗ ∑_a p_a |a⟩⟨a| ⊗ ∑_b p_b |b⟩⟨b| = |...⟩⟨...| ⊗ ρ ⊗ σ.
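The same linearity argument can be checked numerically in a simpler setting, comparing the mixed-state expectation tr((ρ − σ)M) with the average of pure-state expectations; the observable below is an arbitrary Hermitian placeholder:

    import numpy as np

    rng = np.random.default_rng(0)

    def rand_state(dim):
        v = rng.standard_normal(dim) + 1j * rng.standard_normal(dim)
        return v / np.linalg.norm(v)

    A = [rand_state(2) for _ in range(3)]       # pure training states of class A
    B = [rand_state(2) for _ in range(4)]       # pure training states of class B
    M_op = np.array([[1.0, 0.5], [0.5, -1.0]])  # arbitrary Hermitian observable

    rho = sum(np.outer(a, a.conj()) for a in A) / len(A)
    sigma = sum(np.outer(b, b.conj()) for b in B) / len(B)

    mixed = np.real(np.trace((rho - sigma) @ M_op))
    averaged = np.real(np.mean([np.vdot(a, M_op @ a) for a in A])
                       - np.mean([np.vdot(b, M_op @ b) for b in B]))
    print(mixed, averaged)  # agree up to floating-point error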


Note that reconstructing the ensembles by averaging over pure states means that we cannot use approximate schemes that scale the result of a single run by an arbitrary value. For example, a single-qubit pointer method evaluates µ⟨σ_z⟩_{a,b} up to an unknown factor µ, which skews the overall average. This is the reason why many “shortcuts” for quantum phase estimation cannot be used to reduce the resources of the QPCA-based Helstrøm measurement circuit.

3. The Helstrøm measurement is efficient

The matrix m = |a⟩⟨a| − |b⟩⟨b| has properties that allow for a particularly efficient implementation of the Helstrøm measurement with the QPCA strategy. If |a⟩ ≠ |b⟩, m is of rank 2. Furthermore, since tr(|a⟩⟨a| − |b⟩⟨b|) = tr(|a⟩⟨a|) − tr(|b⟩⟨b|) = 0, m has trace zero. The only two non-zero eigenvalues are therefore ±λ. Their magnitude depends on the overlap ⟨a|b⟩; for ⟨a|b⟩ = 0, λ = 1, and for ⟨a|b⟩ = 1 we have λ = 0.
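These properties are quick to confirm numerically. For pure states the two non-zero eigenvalues are in fact ±√(1 − |⟨a|b⟩|²), which reproduces the two limiting cases above; a sketch:

    import numpy as np

    rng = np.random.default_rng(1)

    def rand_state(dim):
        v = rng.standard_normal(dim) + 1j * rng.standard_normal(dim)
        return v / np.linalg.norm(v)

    a, b = rand_state(4), rand_state(4)
    m = np.outer(a, a.conj()) - np.outer(b, b.conj())
    lam = np.linalg.eigvalsh(m)
    print(np.round(lam, 6))                         # eigenvalues +/- lambda, rest ~0
    print(np.sqrt(1 - np.abs(np.vdot(a, b)) ** 2))  # matches lambda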

As a result, the rank of the matrix we exponentiate is constant and small, while the average size of the eigenvalues depends on how well the embedding is trained. A successful embedding decreases the inter-class overlaps and therefore increases the absolute value of the eigenvalue on average, making it easier to estimate its sign via quantum phase estimation.

However, and reflecting the huge gap between traditional quantum computing and the requirements of near-term hardware, even if the Helstrøm measurement can be efficiently implemented, the required number of registers prepared in |a⟩, |b⟩ to simulate |a⟩⟨a| − |b⟩⟨b| for times in [0, 2π] as a subroutine to quantum phase estimation is still prohibitive for near-term quantum computing. For example, for time 1, δ = 0.01, and L = 10 QPE qubits, we need k = 100 copies of |a⟩, |b⟩, which translates into at least 2000n qubits for embeddings that use n qubits.