arxiv:2102.06963v1 [quant-ph] 13 feb 2021

53
Classical algorithms for Forrelation Sergey Bravyi * David Gosset †‡ Daniel Grier †§ Luke Schaeffer †‡ Abstract We study the forrelation problem: given a pair of n-bit Boolean functions f and g, estimate the correlation between f and the Fourier transform of g. This problem is known to provide the largest possible quantum speedup in terms of its query complexity and achieves the landmark oracle separation between the complexity class BQP and the Polynomial Hierarchy. Our first result is a classical algorithm for the forrelation problem which has runtime O(n2 n/2 ). This is a nearly quadratic improvement over the best previously known algorithm. Secondly, we show that quantum query algorithm that makes t queries to an n-bit oracle can be simulated by classical query algorithm making only O(2 n(1-1/2t) ) queries. This fixes a gap in the literature arising from a recently discovered critical error in a previous proof; it matches recently established lower bounds (up to poly(n, t)) factors) and thus characterizes the maximal separation in query complexity between quantum and classical algorithms. Finally, we introduce a graph-based forrelation problem where n binary variables live at vertices of some fixed graph and the functions f,g are products of terms describing interactions between nearest-neighbor variables. We show that the graph-based forrelation problem can be solved on a classical computer in time O(n) for any bipartite graph, any planar graph, or, more generally, any graph which can be partitioned into two subgraphs of constant treewidth. The graph-based forrelation is simply related to the variational energy achieved by the Quantum Approximate Optimization Algorithm (QAOA) with two entangling layers and Ising-type cost functions. By exploiting the connection between QAOA and the graph-based forrelation we were able to simulate the recently proposed Recursive QAOA with two entangling layers and 225 qubits on a laptop computer. 1 Introduction Much of what is known about the power of quantum computers has been learned by studying the black-box model of quantum query complexity. Here one considers a kind of computa- tional problem where the input is a binary string x ∈ {-1, 1} N and the goal is to compute some property of the input by accessing or querying as few bits of x as possible. In a classical algorithm we may query the input bits one at a time, whereas in a quantum algorithm each * IBM Quantum, IBM T.J. Watson Research Center Institute for Quantum Computing, University of Waterloo, Canada Department of Combinatorics and Optimization, University of Waterloo, Canada § Cheriton School of Computer Science, University of Waterloo, Canada 1 arXiv:2102.06963v2 [quant-ph] 31 Oct 2021

Upload: others

Post on 23-Feb-2022

1 views

Category:

Documents


0 download

TRANSCRIPT

Classical algorithms for Forrelation

Sergey Bravyi∗ David Gosset†‡ Daniel Grier†§ Luke Schaeffer†‡

Abstract

We study the forrelation problem: given a pair of n-bit Boolean functions f andg, estimate the correlation between f and the Fourier transform of g. This problem isknown to provide the largest possible quantum speedup in terms of its query complexityand achieves the landmark oracle separation between the complexity class BQP andthe Polynomial Hierarchy. Our first result is a classical algorithm for the forrelationproblem which has runtime O(n2n/2). This is a nearly quadratic improvement over thebest previously known algorithm. Secondly, we show that quantum query algorithmthat makes t queries to an n-bit oracle can be simulated by classical query algorithmmaking only O(2n(1−1/2t)) queries. This fixes a gap in the literature arising from arecently discovered critical error in a previous proof; it matches recently establishedlower bounds (up to poly(n, t)) factors) and thus characterizes the maximal separationin query complexity between quantum and classical algorithms. Finally, we introduce agraph-based forrelation problem where n binary variables live at vertices of some fixedgraph and the functions f, g are products of terms describing interactions betweennearest-neighbor variables. We show that the graph-based forrelation problem can besolved on a classical computer in time O(n) for any bipartite graph, any planar graph,or, more generally, any graph which can be partitioned into two subgraphs of constanttreewidth. The graph-based forrelation is simply related to the variational energyachieved by the Quantum Approximate Optimization Algorithm (QAOA) with twoentangling layers and Ising-type cost functions. By exploiting the connection betweenQAOA and the graph-based forrelation we were able to simulate the recently proposedRecursive QAOA with two entangling layers and 225 qubits on a laptop computer.

1 Introduction

Much of what is known about the power of quantum computers has been learned by studyingthe black-box model of quantum query complexity. Here one considers a kind of computa-tional problem where the input is a binary string x ∈ −1, 1N and the goal is to computesome property of the input by accessing or querying as few bits of x as possible. In a classicalalgorithm we may query the input bits one at a time, whereas in a quantum algorithm each

∗IBM Quantum, IBM T.J. Watson Research Center†Institute for Quantum Computing, University of Waterloo, Canada‡Department of Combinatorics and Optimization, University of Waterloo, Canada§Cheriton School of Computer Science, University of Waterloo, Canada

1

arX

iv:2

102.

0696

3v2

[qu

ant-

ph]

31

Oct

202

1

query is performed by applying a unitary oracle Ox satisfying Ox|i〉 = xi|i〉 for i ∈ [N ].Many computational problems have been shown to admit quantum speedups as measuredby query complexity, including e.g., quantum search [1], period finding [2], and the hiddensubgroup problem [3, 4]. These and other provable speedups in quantum query complexityinform the search for real-world quantum advantage. For example, Shor’s algorithm for inte-ger factorization is based on the quantum algorithm for period finding [2]. The conjecturedreal-world speedup for this problem is predicated on the belief that the additional structurepresent in the integer factorization problem does not make it significantly easier for classicalcomputers.

In this paper we study the forrelation problem which was introduced in Ref. [5]. Forrela-tion is a powerful computational primitive: it achieves an almost maximal quantum speedupas measured by query complexity [6, 7, 8, 9] and it underlies the dramatic oracle separationbetween the complexity classes BQP and PH [10]. Moreover, a generalized k-fold forrelationproblem is BQP-complete [6]. We note that the primary focus of these previous works wasto lower bound the classical query complexity of forrelation and related problems. In thispaper we develop new classical algorithms for forrelation. Firstly in the black-box setting,and then in an explicit graph-based setting which is related to the performance of certainquantum algorithms for optimization [11].

To describe our results in more detail, let us now define the forrelation problem, followingAaronson and Ambainis [5, 6]. The input to the problem is a pair of Boolean functionsf, g : 0, 1n → 1,−1. The forrelation between f and g is defined as

Φ(f, g) = 2−3n/2∑

x,y∈0,1nf(x)g(y)(−1)x·y.

The forrelation quantifies the correlations between f and the binary Fourier transform of g.Approximating Φ(f, g) with a small additive error is known as the forrelation problem.

Forrelation problem. Given f, g : 0, 1n → 1,−1 and ε > 0, compute an estimate µsuch that |µ− Φ(f, g)| ≤ ε.

A simple quantum algorithm can approximate Φ(f, g) by querying oracles Uf and Ugimplementing diagonal unitary operators Uf |x〉 = f(x)|x〉 and Ug|x〉 = g(x)|x〉, as can beseen from the identity

Φ(f, g) = 〈0n|H⊗nUfH⊗nUgH⊗n|0n〉.Using this identity, one can express the forrelation as Φ(f, g) = 1− 2p where p is the outputprobability of a quantum circuit that makes only 1 query to an (n + 1)-bit oracle thatcomputes either f or g depending on the value of an ancilla bit. This output probabilitycan be estimated to within additive error ε using amplitude estimation [12]. In this way wecan solve the forrelation problem on a quantum computer using only O(ε−1) queries to theoracles Uf and Ug.

While this algorithm approximates Φ(f, g) to within a small constant error using O(1)quantum queries, Aaronson and Ambainis have shown that any classical algorithm achievingthe same requires Ω(n−12n/2) queries. It is also known that the output probability of any1-query quantum algorithm can be approximated to a given constant error by a classicalrandomized algorithm using O(2n/2) queries [6, 13]. In other words—among tasks that can

2

be solved with a bounded error using 1 quantum query—forrelation has an almost maximalrandomized query complexity.

In fact, a very simple classical algorithm suggested by Aaronson in Ref. [5] (see Section 5of that paper) is almost optimal in terms of query complexity. One first samples randomn-bit strings x1, . . . , xL and y1, . . . , yL and then computes the sample forrelation

Z = L−22n/2L∑

i,j=1

f(xi)g(yj)(−1)xi·yj (1)

It is easy to see that E[Z] = Φ(f, g) and one can also show that using L = O(ε−12n/2)samples suffices to ensure that Z approximates the forrelation within an additive error εwith high probability. The total number of queries required by this algorithm is 2L whichalmost matches the lower bound on classical query complexity. However, computing Eq. (1)seems to require summing L2 terms, and so the runtime of this algorithm is quadraticallyworse than the query complexity. Our first contribution provides the following improvement:

Theorem 1. There is a classical randomized algorithm that solves the forrelation problemwith probability at least 99%. The algorithm has query complexity O(ε−12n/2) and runtimeO(nε−12n/2).

To prove the theorem, we approximate Φ(f, g) using a variant of the sample-based for-relation Eq. (1) in which the sets x1, . . . , xL ⊆ Fn2 and y1, . . . , yL ⊆ Fn2 are chosen to berandom affine subspaces with L = 2` for suitably chosen ` u n/2. To establish the claimedruntime bound we show that this estimator can equivalently be expressed as an amplitudeof a quantum circuit acting on ` qubits and computed using a runtime ∼ 2` using sparsematrix-vector multiplication.

Ref. [6] also introduces a more general k-fold forrelation of Boolean functions f1, f2, . . . , fk :0, 1n → −1, 1 defined as

Φ(f1, f2, . . . , fk) = 〈0n|H⊗nUf1H⊗nUf2H⊗n . . . UfkH⊗n|0n〉. (2)

As in the case k = 2 considered above, the k-fold forrelation can be expressed as 1 − 2pwhere p is the output probability of a quantum circuit that makes dk/2e queries to theoracles. This output probability, and the forrelation, can be additively approximated bya quantum computation that makes O(kε−1) queries. Using the well-known connectionbetween quantum query algorithms and multilinear polynomials over F2 [14], Aaronson andAmbainis claimed that any amplitude of a k-query quantum circuit—and consequently, the k-fold forrelation—can be additively approximated by a classical randomized algorithm whichuses O(2kε−2+ 2

k 2n(1− 1k

)) queries. It was recently discovered, however, that the proof of thisstatement given in Ref. [6] contained a critical error as discussed in the recent note [13] andin the blog post [15].1

In the same work, Aaronson and Ambainis conjectured that a matching lower boundholds for each even k = O(1) and constant error ε. That is, their conjecture asserts that the

1The authors of Ref. [13] provided a new proof of the upper bound for the special case k = 1.

3

k-fold forrelation should have maximal classical query complexity among quantities that canbe expressed as the output probability of a k/2 query quantum algorithm. This conjecturewas recently established in two breakthrough works Refs. [8, 9]. These works aimed toestablish that indeed k-fold forrelation [8] and variants thereof [9] achieve an almost maximalseparation in quantum query complexity. For example, Bansal and Sinha show that forevery positive integer k, approximating the k-fold forrelation to within an error ε satisfyingε = 2−Θ(k) requires Ω(2n(1− 1

k)/poly(n, k)) oracle queries using a classical computer [8].

Here we provide a simple classical algorithm that provides a matching upper bound. Inother words we prove the result claimed by Aaronson and Ambainis in Ref. [6].2

Theorem 2 (Informal). Any amplitude of a quantum circuit that makes k quantum queriesto an n-bit oracle and may include non-unitary gates, can be approximated to within additiveerror ε by a classical randomized algorithm that makes O(kε−2/k2n(1− 1

k)) oracle queries.

Since the acceptance probability of a t-query quantum algorithm can be expressed assuch an amplitude with k = 2t we directly obtain the following.

Theorem 3 (Classical simulation of quantum query algorithms). Let p be the ac-ceptance probability of a quantum query algorithm that makes t quantum queries to an n-bit oracle. There is a classical randomized algorithm that outputs an estimate p satisfying|p− p| ≤ ε and uses O(tε−1/t2n(1−1/2t)) oracle queries.

We may also directly apply Theorem 2 to the forrelation problem.

Theorem 4. There exists a classical randomized algorithm which approximates Φ(f1, f2, . . . , fk)

to within additive error ε with probability at least 99%, and uses O(kε−2/k2n(1− 1k

)) oraclequeries.

Note that if we choose ε = 2−Θ(k) as prescribed in Refs. [8, 9], then for each k ourrandomized simulation matches their lower bound up to factors polynomial in n and k. Inthis way our Theorem 3 together with the results of Refs. [8, 9] completes the characterizationof the maximal separation in query complexity between quantum and classical computers.

Next we consider a graph-based forrelation problem with explicitly specified functions fand g that possess certain local features. Suppose G = (V,E) is a fixed n-vertex graph. Weplace a binary variable xj at each vertex j ∈ V . We say that a function f : 0, 1n → S1 istwo-local on G if it is a product of terms describing interactions between nearest-neighborvariables, as well as terms that depend on a single variable, that is,

f(x) =∏

u,v∈E

fuv(xu, xv)∏u∈V

fu(xu)

for some functions fuv : 0, 12 → S1 and fu : 0, 1 → S1. Here S1 is the unit circle in thecomplex plane. We shall use the term two-local without specifying the underlying graphwhenever it is clear from the context. A function f(x) is said to be 1-local if it is a productof terms that depend on a single variable.

2In fact, our upper bound achieves an exponential improvement in terms of the scaling with k andimproves the dependence on ε, as compared with the one claimed in Ref. [6].

4

Suppose O1, . . . , On are single-qubit operators normalized such that ‖Oj‖ ≤ 1 for all j.A graph-based forrelation associated with the graph G and two-local functions f, g is definedas a complex number

Φ = 〈0n|H⊗nUg(O1 ⊗O2 ⊗ · · · ⊗On)UfH⊗n|0n〉. (3)

As before, Uf and Ug are diagonal n-qubit unitary operators such that Uf |x〉 = f(x)|x〉 andUg|x〉 = g(x)|x〉 for all x. Specializing Eq. (3) toO1 = . . . = On = H and functions f, g takingvalues ±1 one obtains the forrelation quantity Φ(f, g) defined above. The consideration ofcomplex-valued functions f, g and more general operators Oj is motivated by applicationsof the graph-based forrelation that we discuss below. Note that the operators Uf , Ug can beimplemented efficiently by quantum circuits for any two-local functions f, g. Furthermore,the operators Oj can be extended to two-qubit unitary operators using a block encoding,see e.g. Lemma 5 of [16]. Thus the graph-based forrelation can be efficiently approximatedto a given additive error using a quantum computer. A natural question is whether or notit is classically hard to approximate a graph-based forrelation to within a given additiveerror. Unfortunately the exponential lower bound on the query complexity established inthe black-box setting does not rule out an efficient classical algorithm that can “look inside”the black box.

Our interest in graph-based forrelation is partly motivated by its connection with algo-rithms for near-term quantum computers and restricted models of quantum computation.For example, one can show that the expected value of any tensor product operator on theoutput state of an n-qubit Clifford circuit can be expressed as a graph-based forrelation3

for a suitable n-vertex graph G and two-local functions f = g (in particular, an outputprobability of a measurement based quantum computation [18] can be expressed as a graph-based forrelation). Similarly, an amplitude of any IQP circuit [19] composed of one- andtwo-qubit gates can be expressed as a graph-based forrelation with Oj = H for all j. In Ap-pendix F we provide more details and ascertain as a simple consequence that approximatinga graph-based forrelation with a given relative error is #P-hard, even when restricted toplanar graphs.

A graph-based forrelation appears naturally in the study of Quantum Approximate Op-timization Algorithm (QAOA) [11]. Namely, consider n qubits located at vertices of thegraph G and a cost function Hamiltonian C =

∑u,v∈E Ju,vZuZv, where Jp,q are arbitrary

coefficients. Level-k QAOA maximizes the expected energy 〈ψ|C|ψ〉 over variational states

|ψ〉 = e−iβkBe−iγkC · · · e−iβ1Be−iγ1CH⊗n|0n〉,

where B = X1 + . . . + Xn and β, γ ∈ Rk are variational parameters. In Section 5 we showthat the variational energy 〈ψ|C|ψ〉 of level-2 QAOA states can be expressed as a linearcombination of a few graph-based forrelations with suitable single-qubit operators Oj and2-local functions f = g simply related to C. Thus a polynomial-time classical algorithm forgraph-based forrelation can be used to efficiently estimate the variational energy of level-2QAOA states. Prior to our work such an algorithm was only known for the case of constant-degree graphs, or for level-1 QAOA states on general graphs [20].

3This claim follows from the well-known fact [17] that the output state of any Clifford circuit acting on n-qubits is locally equivalent to a graph state

∏(u,v)∈E CZu,vH

⊗n|0n〉 for a suitable n-vertex graph G = (V,E).

5

Our main result is a classical algorithm for approximating graph-based forrelation with agiven additive error. The algorithm is efficient whenever the vertices of G can be partitionedinto two disjoint subsets A and B such that the subgraphs GA and GB induced by A andB have a small treewidth. Let w be the maximum width of these tree decompositions. Ouralgorithm is described in the following theorem.

Theorem 5. There is a classical randomized algorithm that outputs an estimate µ satisfying|µ− Φ| ≤ ε with probability at least 99%. The algorithm has runtime O(n4wε−2).

For example, suppose G is a bipartite graph. Then one can partition the vertices of G asV = A ∪ B such that every edge of G connects a vertex in A and a vertex in B. Note thatthe induced subgraphs GA and GB are trivial in this case (each subgraph has an empty setof edges). Such graphs have treewidth w = 1. Thus our algorithm has a runtime O(nε−2).A more sophisticated application is obtained by considering planar graphs G. As discussedin Section 4.1, there is an efficiently computable partition of vertices such that w = 2 forany planar graph G and thus our algorithm has runtime O(nε−2). In fact, the same runtimescaling holds for any family of graphs with a fixed forbidden minor [21].

Thus Theorem 5 provides an efficient classical method for calculating variational energiesof the level-2 QAOA for any bipartite or any planar graph. As a demonstration, we reporta classical simulation4 of the recently proposed Recursive QAOA [22] on planar graphs withthe level-2 variational states, see Section 5 for details. For example, simulating the RecursiveQAOA with n = 225 qubits on a laptop computer took less than one day. To accomplish thissimulation we had to solve about 300, 000 instances of the graph-based forrelation problem.

In order to convey the main ideas, let us now describe how the algorithm claimed inTheorem 5 works in the simplest case where G is a bipartite graph and all operators Oj areunitary. Partition the vertices of G as V = A ∪B such that the induced subgraphs GA andGB have no edges. Let OA and OB be the tensor products of operators Oj over all qubitsj ∈ A and j ∈ B respectively. Then O1 ⊗ O2 ⊗ · · · ⊗ On = OA ⊗ OB. Define normalizedn-qubit states

|α〉 = (OA ⊗ IB)UfH⊗n|0n〉 and |β〉 = (IA ⊗O†B)U †gH

⊗n|0n〉

such that Φ = 〈β|α〉. We claim that the states |α〉 and |β〉 are computationally tractable [23]in the sense that their amplitudes in the standard basis are easy to compute classically.Indeed, consider some amplitude 〈x|α〉. Write x = xAxB where xA and xB are the restrictionsof x onto A and B. Then

〈x|α〉 = 〈xAxB|α〉 = 2−n/2∑yA

f(yAxB)〈xA|OA|yA〉,

where the sum ranges over yA ∈ 0, 1|A|. Note that fixing the variables xB transformsf(yAxB) into a 1-local function. Indeed, each two-local term in f depends on some variablein A and some variable inB. Since the latter is fixed, f becomes a product of terms dependingon a single variable each, that is, f is 1-local. Likewise, for a fixed xA the matrix element

4Our simulation will actually use a simpler version of the algorithm in Theorem 3 whose runtime suffersfrom an additional factor of n.

6

〈xA|OA|yA〉 is a 1-local function of yA. Thus 〈x|α〉 =∑

yAh(yA), where h(yA) is a 1-local

function. Such a sum can be computed exactly in linear time. By the same reasoning, anyamplitude 〈x|β〉 is easy to compute classically. A slight generalization of the above argumentshows that a classical computer can sample the probability distribution P (x) ≡ |〈x|α〉|2 inlinear time. Now the forrelation Φ = 〈β|α〉 can be approximated by a Monte Carlo methoddue to Van den Nest [23] which is based on the identity

Φ = 〈β|α〉 =∑

x∈0,1nP (x)R(x), R(x) ≡ 〈β|x〉

〈α|x〉.

Therefore it suffices to approximate the mean value of a random variable R(x) with x sampledfrom P (x). Fix an integer S 1 and let x1, . . . , xS ∈ 0, 1n be independent samples fromP (x). Define the desired estimate of Φ as an empirical mean value of R(x) over the observedsamples, µ = (R(x1) + R(x2) + . . . + R(xS))/S. The random variable R(x) has the meanvalue Φ and its variance is bounded as

Var(R) ≤∑

x∈0,1nP (x)|R(x)|2 =

∑x∈0,1n

|〈x|β〉|2 = 1. (4)

By Chebyshev’s inequality, |µ−Φ| ≤ ε with probability at least 0.99 if we choose S = 100ε−2.The above algorithm is efficient whenever efficient subroutines are available for computing

amplitudes of the states |α〉, |β〉 and for sampling the probability distribution |〈x|α〉|2. InSection 4 we prove Theorem 5 by constructing the desired subroutines for more general(non-bipartite) graphs that admit a partition into two subgraphs with a small treewidth.We leave as an open question complexity of the graph-based forrelation problem on generalgraphs.

To sample from |〈x|α〉|2 in linear time for (non-bipartite) planar graphs, we will provethat there is an efficient subroutine for the following general problem on tensor networksthat may be of independent interest:

Problem 1. Given an initial state χ = χ1⊗ · · ·⊗χn of n d-dimensional qudits, a collectionof m diagonal gates U1, . . . , Um (not necessarily unitary, possibly multi-qudit), and single-qudit operators O1, . . . , On, sample a string x ∈ 1, 2, . . . , dn from the distribution5 P (x) ∝〈x|OUχU †O† |x〉, where

U =m∏j=1

Uj, O =n⊗i=1

Oi.

A notable special case of this problem is sampling a Markov random field which is aubiquitous problem in statistics [24]. Indeed, any Markov random field P (x) with n discretevariables xi ∈ 1, 2, . . . , d can be written as P (x) ∝ 〈x|UχU †|x〉, where Uj are nonuni-tary diagonal multi-qudit gates with real non-negative entries on the diagonal and χ is themaximally mixed state.

5We note that there are choices for the operators Oaa∈[n] for which P cannot be normalized to adistribution (i.e., is always 0). In such cases we simply report that there are no valid outputs x.

7

Connectivity of the gates Uj can be described by a graph G with n vertices such that eachvertex represents a qudit and a pair of vertices is connected by an edge if the correspondingpair of qudits participates in some gate Uj. Let w be the threewidth of G. In Appendix Awe give an algorithm that solves Problem 1 in time O(nd2w +mdw).

The remainder of the paper is organized as follows. Our results concerning oracle-basedforrelation, i.e., the proofs of Theorems 1 and 2, are provided in Sections 2 and 3, respectively.Section 4 describes our classical algorithm for graph-based forrelation. We first analyze asimplified version of the algorithm which has runtime scaling quadratically with the numberof qubits, see Section 4. To achieve the runtime stated in Theorem 5, we augment thealgorithm of Section 4 by a linear-time subroutine solving Problem 1. This subroutineis described in Appendix A and illustrated by an example in Appendix B. We apply thealgorithm for graph-based forrelation to estimate the variational energy of level-2 QAOAand report a numerical simulation RQAOA in Section 5. Further details concerning thissimulation can be found in Appendices C,D,E. We conclude by discussing implications ofour work and some open questions concerning graph-based forrelation in Section 6. Finally,the problem of approximating the graph-based forrelation with a small relative error isdiscussed in Appendix F.

2 Oracle-based forrelation

In this section we consider the forrelation problem in which functions f, g are accessed viaoracle queries. We begin by proving Theorem 1, restated below.

Theorem 1. There is a classical randomized algorithm that solves the forrelation problemwith probability at least 99%. The algorithm has query complexity O(ε−12n/2) and runtimeO(nε−12n/2).

Proof. Let ε > 0 be given. If ε ≤ 2−n/2 then we use a very naive algorithm to computeΦ(f, g) exactly using the desired total runtime O(n2n) and O(2n) queries. Indeed, in thiscase it suffices to exactly compute a length 2n vector H⊗nUgH

⊗nUfH⊗n|0n〉 using matrix

vector multiplication. Applying each single-qubit Hadamard gate using sparse matrix-vectormultiplication requires a runtime O(2n) 6 while applying each gate Uf or Ug uses O(2n)queries. The forrelation Φ(f, g) is then obtained as the 0n entry of this vector.

In the following we consider the case

ε ≥ 2−n/2. (5)

Let An,k be the set of all affine spaces S ⊆ Fn2 with |S| = 2k. For each S ∈ An,k define anormalized n-qubit stabilizer state

|S〉 = 2−k/2∑x∈S

|x〉.

6To see this, note that for any n-qubit state |v〉, each entry of (H ⊗ In−1) |v〉 is a sum of two entries of|v〉.

8

For any pair S, T ∈ An,k, define

µ(S, T ) = 2n−k〈S|UfH⊗nUg|T 〉. (6)

Claim 1. Suppose S, T ∈ An,k are drawn uniformly at random. Then

E(µ(S, T )) = Φ(f, g) and Var(µ(S, T )) ≤ 2n−2k +2

2k.

Proof. For any x ∈ Fn2 , let χS(x) = 1 if x ∈ S and χS(x) = 0 otherwise. Then

ES(χS(x)) =|S|2n

= 2k−n. (7)

Likewise, for any pair of binary strings x, y ∈ Fn2 such that x 6= y we have

σ2 ≡ ES(χS(x)χS(y)) =|S| (|S| − 1)

2n(2n − 1)=

4k−n(1− 2−k)

1− 2−n≤ 4k−n. (8)

Using Eq. (7) we getES(|S〉) = 2(k−n)/2|+n〉

where |+n〉 = H⊗n|0n〉, and consequently

ESETµ(S, T ) = Φ(f, g). (9)

Using Eq. (8) we get

ES(|S〉〈S|) = a · I + b · |+n〉〈+n| a ≡ 2−n − 2−kσ2 b ≡ 2n−kσ2. (10)

and

ESET (µ(S, T ))2 = 4n−kESETTr(|S〉〈S|UfH⊗nUg|T 〉〈T |U †gH⊗nU

†f

)= 23n−2ka2 + 4n−kb2Φ(f, g)2 + 2 · 4n−kab

≤ 2n−2k + Φ(f, g)2 +2

2k. (11)

where we used the upper bounds a ≤ 2−n and b ≤ 2k−n. Combining Eqs. (9, 11) gives

Var(µ(S, T )) = ESET (µ(S, T ))2 − (ESETµ(S, T ))2 ≤ 2n−2k +2

2k.

9

Now let us fix k to be an integer satisfying

218 · 2nε−2 ≥ 22k ≥ 216 · 2nε−2. (12)

We shall choose our estimate µ ≡ µ(S, T ) where the affine subspaces S, T ∈ An,k are drawnuniformly at random. Note that the random selection of these affine spaces can be performedusing a runtime O(poly(n)), resulting in a parameterization

S = Ax+ b : x ∈ 0, 1k T = Cx+ d : x ∈ 0, 1k (13)

where A and C are n× k binary matrices and b, d ∈ 0, 1n, and addition is performed overF2.

Using Claim 1 and Eq. (12) we get

Var(µ(S, T )) ≤ ε2

216+

ε

272−n/2 ≤ ε2

(1

216+

1

27

)≤ ε2

100,

where in the second inequality we used Eq. (5). Applying Chebyshev’s inequality we seethat

Pr (|µ(S, T )− Φ(f, g)| ≥ ε) ≤ 1/100,

as desired. Below we show that there is a classical algorithm which, given S, T ∈ An,k (as inEq. (13)), computes µ(S, T ) exactly using a runtime O(k2n+n2k) and O(2k) queries. UsingEq. (12) we see that the total runtime is O(n2n/2ε−1) and the total number of queries isO(2n/2ε−1).

Let us now describe how to compute µ(S, T ) using the stated runtime and number ofqueries. Using Eqs. (13,6) we may write

µ(S, T ) = 2n/2−2k∑

x,y∈0,1kf(Ax+ b)g(Cy + d)(−1)y

TCTAx+xTAT d+bTCy+bT d

To compute this sum on a classical computer we first initialize the following vector of length2k

|ψ1〉 =∑

x∈0,1kf(Ax+ b)(−1)x

TAT d|x〉.

We compute the entries of this vector one at a time using a Gray code ordering x1, x2, . . . , x2k

of the set of binary strings 0, 1k, so that xi+1 differs from xi in only one bit. With thischoice, we can compute Axi+1 + b from Axi + b using O(n) binary operations. We need onlycompute ATd once (using a runtime O(nk)) and then each phase factor (−1)x

TAT d can alsobe computed using O(k) runtime. From this we see that initializing ψ1 can be performedusing O(nk + n2k) runtime and O(2k) queries.

In the next step we compute a vector

|ψ2〉 =∑

x∈0,1kf(Ax+ b)(−1)x

TAT d|CTAx〉.

10

We first compute the matrix CTA which takes a runtime O(k2n). We initialize |ψ2〉 as avector of length 2k with every entry equal to zero. Then, using our Gray code order wecompute CTAx1, C

TAx2, . . . , CTAx2k . Each time we compute a vector zi = ATCxi we set

〈zi|ψ2〉 ← 〈zi|ψ2〉+ 〈xi|ψ1〉.

The total runtime for this step is O(k2n+ k2k) since CTA is a k × k matrix.Viewing ψ2 as a k-qubit state, we then apply a Hadamard gate on each qubit to obtain

|ψ3〉 = H⊗k|ψ2〉 = 2−k/2∑

x,y∈0,1kf(Ax+ b)(−1)x

TAT d(−1)yTCTAx|y〉.

Applying a single Hadamard gate to a k-qubit state stored in classical computer memorytakes a runtime O(2k) using sparse matrix-vector multiplication. The total runtime to com-pute ψ3 using a sequential application of the k Hadamard gates is then upper bounded asO(k2k).

Finally, we compute a vector ψ4 defined by

〈y|ψ4〉 = 〈y|ψ3〉 · 2n/2−3k/2(−1)bT dg(Cy + d)(−1)b

TCy y ∈ 0, 1k.

Performing this step using a Gray code ordering can be done using O(nk+n2k) total runtimeand O(2k) queries (similar to the procedure used to initialize ψ1). Here

|ψ4〉 = 2n/2−2k∑

x,y∈0,1kf(Ax+ b)(−1)x

TAT d(−1)yTCTAxg(Cy + d)(−1)b

TCy(−1)bT d|y〉.

Finally, µ(S, T ) is obtained by summing all entries of the vector ψ4 using a runtime O(2k).In total, this algorithm uses O(2k) queries and O(k2n+ n2k) runtime, as claimed.

3 Quantum query complexity and k-fold Forrelation

Now let us turn our attention to quantum query algorithms and the k-fold forrelationΦ(f1, f2, . . . , fk) defined in Eq. (2). Here f1, f2, . . . , fk : 0, 1n → −1, 1n are n-bit booleanfunctions. In this Section we prove Theorem 2.

Letting m ≥ n we shall describe a randomized classical algorithm which is capable ofestimating an m-qubit amplitude of the form

q = 〈0m|M1Uf1M2Uf2 . . . UfkMk+1|0m〉. (14)

where M1, . . . ,Mk+1 are m-qubit operators such that ‖Mj‖ ≤ 1 for all 1 ≤ j ≤ k + 1, andUfj |x〉|y〉 = fj(x)|x〉|y〉 for each x ∈ 0, 1n, y ∈ 0, 1m−n. Clearly the k-fold forrelationΦ(f1, f2, . . . , fk) can be expressed as in Eq. (14) with m = n. Amplitudes of the form Eq. (14)can also express output probabilities of quantum query algorithms. Indeed, suppose p isthe probability of measuring an output qubit in the state |1〉 at the output of a quantumalgorithm that makes t quantum queries to an oracle. Then p can be expressed as as inEq. (14) with k = 2t.

11

Theorem 6. There exists a classical randomized algorithm which, given ε > 0, outputs anestimate q such that

|q − q| ≤ ε

with probability at least 99%. The algorithm uses O(kε−2/k2n(1− 1k

)) oracle queries.

Proof. In the following we shall write

Π(z) = |z〉〈z| ⊗ I z ∈ 0, 1n.

where the identity acts on m − n qubits. For any nonzero complex vector α of length 2m,define a probability distribution over n-bit strings

pα(z) = ‖α‖−2‖Π(z)|α〉‖2 z ∈ 0, 1n. (15)

Define |φ0〉 = |0m〉 and then for each 1 ≤ j ≤ k construct a (possibly un-normalized) stateφj as follows. First sample binary strings z1, z2, . . . , zL independently from the distributionpM†j φj−1

(see Eq. (15)). We will fix L later. Then define

|φj〉 =1

L

L∑i=1

fj(zi)Π(zi)M†j |φj−1〉

pM†j φj−1(zi)

(16)

Note thatE(|φj〉) = UfjM

†j |φj−1〉 (17)

and we have the operator inequality

E(|φj〉〈φj|) (18)

=

(L− 1

L

)UfjM

†j |φj−1〉〈φj−1|MjUfj +

‖φj−1‖2

L

∑z∈0,1n

Π(z)M †j |φj−1〉〈φj−1|MjΠ(z)

〈φj−1|MjΠ(z)M †j |φj−1〉

(19)

≤ UfjM†j |φj−1〉〈φj−1|MjUfj +

‖φj−1‖2

L· I. (20)

Using Eq. (19) we see that

E(‖φj‖2) = Tr (E(|φj〉〈φj|)) ≤(

1 +2n

L

)‖φj−1‖2 (21)

where to bound the first term we used the fact that ‖Mj‖ ≤ 1. For each 0 ≤ r ≤ k define acomplex-valued random variable

Xr = 〈φr|Mr+1Ufr+1Mr+2 . . . UfkMk+1|0m〉.

Note that X0 = q is the desired amplitude. Using Eq. (17) we see that

E(Xk) = E(Xk−1) = . . . = E(X0) = q.

Note that Xk can be computed using L queries to each of the given oracles f1, f2, . . . , fk.

12

Lemma 1.

E(|Xk|2)− |q|2 ≤ 2n(k−1)L−k(

1 +L

2n

)k. (22)

Proof. Let us write Ej for the expectation over all random variables used in generating thestates φ1, φ2, . . . , φj. Then, for all 1 ≤ r ≤ k we have

E(|Xr|2) = Er(|Xr|2) = Er−1Eφr(|〈φr|Mr+1Ufr+1Mr+2 . . . UfkMk+1|0m〉|2

)≤ Er−1

(|Xr−1|2

)+

1

LEr−1

(‖φr−1‖2

)

where in the last line we used Eq. (20). The second term above can be computed using arepeated application of Eq. (21), which gives, for each 1 ≤ r ≤ k,

E(|Xr|2)− E(|Xr−1|2

)≤ 1

L

(1 +

2n

L

)r−1

Since X0 = q we get

E(|Xk|2)− |q|2 =k∑r=1

(E(|Xr|2)− E

(|Xr−1|2

))≤ 2−n

((1 +

2n

L

)k− 1

)

Discarding the subtracted term in the parentheses, we arrive at Eq. (22).

Now let us fix B as follows

B = 2k ·⌈2n(1−1/k)(ε2/400)−1/k

⌉(23)

This will be our query budget. There are two cases to consider. If B ≥ k · 2n then we cansimply query each oracle at all 2n possible binary strings and output an exact computationof q. If instead B ≤ k · 2n then we compute the estimator q = Xk with the choice L = B/k.In this case we have L ≤ 2n and therefore Eq. (22) gives

E(|Xk|2)− |q|2 ≤ 2n(k−1)L−k2k ≤ ε2/400

where in the second inequality we lower bounded L = B/k using Eq. (23). Therefore

Var(Re(q)) + Var(Im(q)) = E(|Xk|2)− |q|2 ≤ ε2

400,

By Chebyshev’s inequality we see that |Re(q) − Re(q)| ≤ ε/√

2 with probability at least0.995. The same arguments show that |Im(q) − Im(q)| ≤ ε/

√2 with probability at least

0.995. By a union bound, we have |q − q| ≤ ε with probability at least 0.99.

13

4 Graph-based forrelation

The algorithm for estimating a graph-based forrelation Φ described in the Introduction hasa polynomial runtime whenever efficient subroutines are available for computing amplitudesof the states |α〉 and |β〉 such that Φ = 〈β|α〉 and for sampling the probability distribution|〈x|α〉|2.

In this section we will construct such subroutines for more general (non-bipartite) graphsthat admit a partition into two subgraphs with a small treewidth. We will actually givetwo separate algorithms for sampling the probability distribution |〈x|α〉|2. We give the firstsuch algorithm in this section because it is conceptually simpler, and contained within thesame framework as the algorithm for computing amplitudes. However, it will suffer from aquadratically worse dependence on n. The second algorithm, which solves Problem 1, willbe described in full in Appendix A and will lead to the runtime given in Theorem 5.

Section 4.1 provides relevant background concerning tree decompositions. Section 4.2describes an efficient subroutine for computing a sum

∑x h(x), where h(x) is a two-local

function on a graph with a small treewidth. Section 4.3 combines these ingredients andcompletes the proof of Theorem 5.

4.1 Low treewidth partitions of planar graphs

Definition 1 (Tree decomposition). Let G = (V,E) be a graph. A tree decomposition of Gconsists of a tree T = (W,F ), along with a bag Bi ⊆ V for each node i ∈ W , such that thefollowing properties hold:

• For each v ∈ V , there exists a node i ∈ W such that v ∈ Bi.

• For each u, v ∈ E, there exists a node i ∈ W such that u ∈ Bi and v ∈ Bi.

• For each v ∈ V , the subgraph Tv of T induced by the set of nodes i ∈ W such thatv ∈ Bi is a tree.

We shall use the shorthand T,B to denote a tree decomposition, where B = Bi : i ∈ Wis the set of bags. The width w of a tree decomposition is defined as

w = maxi∈W|Bi| − 1.

The treewidth of a graph G is defined as the minimal width of any tree decomposition of it.

Definition 2. A graph G is said to be outerplanar if it admits a planar embedding in whichall vertices are incident to the outer face.

Boedlaender has shown that outerplanar graphs have tree decompositions of width atmost 2 that can be computed in linear time.

Theorem 7 (Bodlaender [25, 26]). If G is an outerplanar graph then its treewidth is ≤ 2and a tree decomposition of width ≤ 2 can be computed in linear time.

14

We refer the reader to Ref. [25] which provides a straightforward inductive proof thatthe treewidth is at most 2, and Ref .[26] which gives a linear time algorithm to compute sucha tree decomposition. In this paper it will suffice for us to use the slower quadratic timeAlgorithm 1 (implicit in Ref. [25]).

The following theorem states that the vertices of any planar graph can be partitioned intotwo subsets, each of which induce a graph of low treewidth. For completeness, we includethe simple proof below, following Ref. [27]. An example is depicted in Fig. 1.

Theorem 8 (Chartrand, Geller, Hedetniemi 1971 [27]). If G = (V,E) is a planar graphthen there exists a partition V = A ∪ B such that the induced subgraphs GA and GB areouterplanar graphs. Moreover, the partition can be computed in linear time from a planarembedding of G.

Proof. We begin by defining a sequence of planar graphs Gk = (Vk, Ek) for 0 ≤ k ≤ p, wherep ≤ |V | depends on G. Let V0 = V,E0 = E, and G0 = G. For each k let us write V out

k ⊆ Vkfor the vertices incident to the outer face of Gk. If Vk 6= V out

k we define

Vk+1 = Vk \ V outk

and let Gk+1 be the subgraph of Gk induced by Vk+1. The process terminates at some graphGp for which Vp = V out

p (an outerplanar graph). Note that Gk+1 has fewer vertices than Gk,so the process stops after p ≤ |V | steps.

The vertex set of G can be partitioned as V = A ∪B where

A = V out0 ∪ V out

2 ∪ V out4 . . . B = V out

1 ∪ V out3 ∪ V out

5 . . . .

Finally, we show that the induced subgraphs GA and GB are outerplanar. For each j =0, 1, . . . , p let Hj be the induced subgraph of G on vertices V out

j . By construction, GA

consists of disconnected copies of H0, H2, H4, . . . and GB consists of disconnected copies ofH1, H3, H5, . . .. To complete the proof we show that each graph Hj is outerplanar. Observethat Hj is the induced subgraph of the graph Gj on the vertices incident to its outer face.Therefore all vertices of Hj are on its outer face and we are done.

4.2 Summation of two-local functions

Suppose G = (V,E) is a graph with m = |V | vertices and Γ is a finite set. We shall say thath : Γm → C is a two-local function on G if it can be written as

h(x) =∏

u,v∈E

huv(xu, xv)∏u∈V

hu(xu) (24)

for some functions huv : Γ2 → C and hu : Γ → C. Here x = x1x2 . . . xm with xj ∈ Γ for1 ≤ j ≤ m. Note that if G has no isolated vertices then the vertex terms hu can be foldedinto the edge terms huv and need not explicitly appear in Eq. (24).

Lemma 2. Suppose we are given a two-local function h : Γm → C on an m-vertex tree T .There is a classical algorithm which computes

∑x∈Γm h(x) using a runtime O(m|Γ|2).

15

Figure 1: Illustration of Theorem 8. A planar graph and a partition of its vertices intosubsets A and B shown in red and blue. The induced subgraphs GA and GB are outerplanargraphs.

hu

hv

hq

hw

huv hqvhqv

hwv

Contract leaf u

(a)

h′v

hq

hw

hqvhqv

hwv

(b)

Figure 2: (a) Depiction of a two-local function h on a tree with 4 vertices. (b) The two-localfunction resulting from contraction of a leaf vertex u. Here h′v(xv) =

∑xuhuv(xu, xv)hv(xv)

can be computed in time O(|Γ|2).

16

Proof. Write T = (W,F ). Suppose u ∈ W is a leaf of the tree and is incident to v ∈ W . LetT ′ = (W ′, F ′) where W ′ = W \ u and F ′ = F \ u, v be the tree obtained by removing uand its incident edge from T . We may compute a new two-local function h′ on T ′ such that∑

y∈Γm−1 h′(y) =∑

x∈Γm h(x). This is depicted in Fig. 2. Indeed, define h′r,s = hr,s for alledges r, s ∈ F ′, h′r = hr for all vertices r ∈ W ′ \ v, and

h′v(xv) =∑xu∈Γ

huv(xu, xv)hv(xv). (25)

We can compute the new function h′ from h by evaluating the sum Eq. (25) for each xv ∈ Γ.Each such evaluation requires O(|Γ|) additions and multiplications, so the total runtime isO(|Γ|2). To evaluate the sum

∑x h(x) we repeat this process m times until we have removed

all vertices of T . The total runtime is O(m|Γ|2).

Lemma 3. Let G = (V,E) be an m-vertex graph. Suppose we are given a two-local functionh : 0, 1m → C on G, along with a tree decomposition of G of width w and containing Nnodes. There is a classical algorithm which computes∑

x∈0,1mh(x) (26)

using a runtime O(Nw2w).

Proof. Let us write T = (W,F ) and B = Bi : i ∈ W for the given tree decomposition.With this notation we have N = |W | and w + 1 = maxi |Bi|. To simplify notation in thefollowing we redefine w ← w + 1 to be the width of the tree decomposition plus one.

For each bag Bi with i ∈ W we define a variable yi ∈ 0, 1w. Here yi contains a bitlabeled by each vertex of G in bag i (i.e., each element of Bi), and also contains w − |Bi|ancilla bits. These ancilla bits are only present to ensure, for convenience, that all variablesyi are of the same length. We say that yi and yj are consistent if, (a) for each v ∈ Bi ∩ Bj

the corresponding bit of yi agrees with that of yj, and (b) all ancilla bits of yi are 0 and allancilla bits of yj are 0.

Let Γ = 0, 1w. We now show how to compute a function Q : ΓN → C which is two-localon T and such that ∑

y∈ΓN

Q(y) =∑

x∈0,1mh(x). (27)

By definition, such a two-local function is of the form

Q(y) =∏i,j∈F

Qij(yi, yj)∏i∈W

Qi(yi).

We choose

Qij(yi, yj) =

1 yi and yj are consistent

0 otherwise.

With this choice, the binary strings y such that∏i,j∈F

Qij(yi, yj) = 1

17

are in one-to-one correspondence with assignments x ∈ 0, 1m of bits to each vertex of G.Let us write x ∼ y for this correspondence. We choose the functions at each node so that∏

i∈W

Qi(yi) = h(x) =∏

u,v∈E

huv(xu, xv)∏u∈V

hu(xu) for x ∼ y. (28)

which then ensures Eq. (27). To satisfy Eq. (28), note that (by definition of a tree decompo-sition) for each edge u, v ∈ E, there is a node i such that u ∈ Bi and v ∈ Bi. The weighthuv(xu, xv) can then be incoporated into the function Qi. Likewise, the weight hu(xu) canbe incorporated into any function Qi such that u ∈ Bi.

Let’s now upper bound the time needed to compute the truth table of each Qi. First,we can compute Qi(0 . . . 0) in O(w2) time since there are at most O(w2) terms of h onw-many vertices. Notice that if we compute Qi(0 . . . 01), there are O(w) terms of h whosecontribution to Qi changes. That is, given Qi(0 · · · 0), computing Qi(0 . . . 01) can be done inO(w) time. Of course, this argument generalizes for any two inputs which differ in a singleinput bit. Therefore, we sequentially compute Qi(yi) for yi ∈ 0, 1w in Gray code order (i.e.,each yi differs in at most 1 bit from the previously computed bit), leading to an O(Nw2w)time algorithm to compute Qi(yi) for all i ∈ W and yi ∈ 0, 1w.

Since Q is two-local on the tree T , we may then apply Lemma 2 which gives a classicalalgorithm to evaluate the sum Eq. (27) in time O(N4w). However, we can speed up thisalgorithm using the fact that Qij(yi, yj) is particularly simple. Using the same algorithmfrom Lemma 2, we must compute

Q′v(yv) = Q(yv)∑

yu∈0,1wQuv(yu, yv)Qu(yu)

when we are contracting a child u to its parent v in the tree. Suppose that node v and nodeu share k many vertices, and for simplicity, let’s assume that these are the first k bits ofboth yu and yv. These k bits partition

∑yu∈0,1w Quv(yu, yv)Qu(yu) into 2k non-overlapping

sums. For example, if the first k bits of yv are 0k, then∑yu∈0,1w

Quv(yu, yv)Qu(yu) =∑

x∈0,1w−k

Qu(0kx)

where we have ignored the ancilla bits since we can always just set those bits to be 0. Ifthe first k bits of yv are z 6= 0k, then we get the sum

∑x∈0,1w−k Qu(zx) instead. Therefore,

computing∑

yu∈0,1w Quv(yu, yv)Qu(yu) for all the possible k-bit prefixes of yv can be done

in time O(2w). Finally, to compute Q′(yv) we must also multiply by the factor Q(yv). Onceagain, we iterate over all yv in time O(2w). Repeating this process for each node, we get atotal runtime of O(N2w), which completes the proof.

4.3 Classical algorithm for graph-based forrelation

In this section we prove Theorem 5. We first consider the special case in which all operatorsOj are unitary; at the end of this section we show how to handle the more general case. As

18

discussed in Section 4, it suffices to give efficient algorithms for computing amplitudes of astate

|α〉 = (OA ⊗ IB)UfH⊗n|0n〉.

and for sampling the probability distribution |〈x|α〉|2. This is accomplished in Lemmas 4,5below. Let GA = (A,EA) be the subgraph of G induced by A. By assumption, we are givena tree decomposition T,B of GA of width w such that T has O(n) nodes.

Lemma 4. There is a classical algorithm which, given x ∈ 0, 1n, computes an amplitude〈x|α〉 using a runtime O(nw2w).

Proof. Below we use notation introduced in Section 4.1. Let us begin by writing

〈x|α〉 = 2−n/2∑yA

〈xA|OA|yA〉f(yAxB). (29)

Partition the edges of G as E = EA∪Ecut∪EB where EA are the edges with both endpointsin A, EB are the edges with no endpoints in A, and Ecut are the edges with exactly oneendpoint in A. Since f is a two-local function on G we may write

f(yAxB) =∏

u,v∈EA

fuv(yu, yv)∏

u,v∈Ecut

fuv(yu, xv)∏

u,v∈EB

fuv(xu, xv)∏u∈A

fu(yu)∏u∈B

fu(xu).

We claim that for any fixed x = xAxB,

h(yA) ≡ 2−n/2〈xA|OA|yA〉f(yAxB), yA ∈ 0, 1|A|

defines a two-local function on GA. Indeed, we may write

h(yA) = ω∏

u,v∈EA

huv(yu, yv)∏u∈A

hu(yu) (30)

wherehuv(yu, yv) = fuv(yu, yv),

hu(yu) = 〈xu|Ou|yu〉fu(yu)∏

v:u,v∈Ecut

fuv(yu, xv)

and ω = 2−n/2∏u,v∈EB

fuv(xu, xv)∏

u∈B fu(xu) ∈ C can be absorbed into (say) one of the

functions hu to make Eq. (30) of the form Eq. (24). We then apply Lemma 3 which completesthe proof.

Lemma 5. There is a classical algorithm which samples a binary string x ∈ 0, 1n fromthe distribution P (x) = |〈x|α〉|2 using a runtime O(n2w4w).

Proof. It will be convenient to order the bits of x as xBxA, that is, all bits of B appear first.We shall use the well-known linear time reduction from sampling to computing marginalprobabilities. That is, for each ` ∈ [n] and binary string x1x2 . . . x`, we define a marginalprobability

P`(x1x2 . . . x`) =∑

x`+1x`+2...xn

P (x). (31)

19

To sample from P (x) we first sample x1 ∈ 0, 1 from P1(x1), then we sample x2 fromP2(x1x2)/P1(x1), then x3 from P3(x1x2x3)/P2(x1x2), and so on. To complete the proof,below we show that each marginal Eq. (31) can be computed using a runtime O(nw4w).

If ` ≤ |B|, then

P`(x1x2 . . . x`) = 〈0n|H⊗nU †f (|x1x2 . . . x`〉〈x1x2 . . . x`| ⊗ I|B|−` ⊗O†AOA)UfH⊗n|0n〉.

Since all operators Oj are unitary, one can replace O†AOA with the identity. Then U †f andUf cancel each other since Uf commutes with any diagonal operator. Thus

P`(x1x2 . . . x`) = 〈0n|H⊗n(|x1x2 . . . x`〉〈x1x2 . . . x`| ⊗ In−`)H⊗n|0n〉 =1

2`.

Suppose now that ` ≥ |B|+ 1. Define a partition A = CD, where C = A ∩ 1, 2, . . . , `includes all bits of A that contribute to the marginal probability and D includes all bits ofA that are traced out. Then x1x2 . . . x` = xBxC . Using Eq. (29) we may write

P`(xBxC) = 2−n∑xD

∣∣∣∣∣∑yA

〈xCxD|OA|yA〉f(xByA)

∣∣∣∣∣2

= 2−n∑

xD,yA,zA

〈xCxD|OA|yA〉∗〈xCxD|OA|zA〉f ∗(xByA)f(xBzA). (32)

For any subset of qubits M we shall write OM for the tensor product of Oj over j ∈ M .Then OA = OC ⊗OD and thus∑xD

〈xCxD|OA|yA〉∗〈xCxD|OA|zA〉 = (〈xC |OC |yC〉∗〈xC |OC |zC〉)∑xD

〈xD|OD|yD〉∗〈xD|OD|zD〉.

The unitarity of Oj implies∑xD

〈xD|OD|yD〉∗〈xD|OD|zD〉 = 〈yD|O†DOD|zD〉 = 〈yD|zD〉 = δyD,zD .

We arrive at

P`(xBxC) = 2−n∑

yC ,y′C ,yD

〈xC |OC |yC〉∗〈xC |OC |y′C〉f ∗(xByCyD)f(xBy′CyD)

=∑

yC ,y′C ,yD

h(yC , y′C , yD), (33)

whereh(yC , y

′C , yD) = 2−n〈xC |OC |yC〉∗〈xC |OC |y′C〉f ∗(xByCyD)f(xBy

′CyD).

We claim that h is a two-local function on a certain augmented graph G = (V , E) withvertex set V = C ∪ C ′ ∪ D, where C ′ is a second copy of C. To construct this graph,start with a disjoint union of GA and GC′ (the induced subgraphs of G on vertices in A,C ′

respectively). For each edge u, v ∈ EA with u ∈ C and v ∈ D, we then add a new edge

20

u′, v between v and the copy u′ ∈ C ′. Since f is two-local on G, a direct inspection showsthat each two-local term in f ∗(xByCyD) or f(xBy

′CyD) that depends on the variables y or

y′ involves nearest-neighbor variables in the augmented graph G. The remaining two-localterms in f ∗(xByCyD) or f(xBy

′CyD) that depend on at least one variable x become 1-local

since xB is fixed. Furthermore, the matrix elements 〈xC |OC |yC〉 and 〈xC |OC |y′C〉 are 1-localfunctions of yC and y′C . Thus h is a two-local function on the augmented graph G.

We claim that the treewidth w of G satisfies w ≤ 2w + 1. Indeed, let T,B be the giventree decomposition of GA of width w. Define a tree decomposition T , B of G as follows.First set T = T . Then, for each node i of T , let Si = Bi ∩C be the set of vertices of C thatappear in Bi. Let S ′i contain the corresponding vertices of C ′, and define

Bi = Bi ∪ S ′i.

One can straightforwardly check that this defines a tree decomposition T , B of G. The upperbound w ≤ 2w + 1 follows from the fact that each bag can at most double in size.

To complete the proof, we apply Lemma 3 which allows us to compute the sum Eq. (33)using a runtime O(n(2w + 1)22w+1) = O(nw4w). Since sampling x from P (x) requirescomputation of n marginal probabilities the overall runtime of the algorithm is O(n2w4w).

Using the algorithm from Lemma 5 suffices to give an efficient algorithm for the graph-based forrelation problem as we will describe below. Indeed, this is the algorithm that willbe used in our numerical calculations of Section 5.2. Nevertheless, an asymptotically moreefficient algorithm exists which we prove in Appendix A:

Lemma 6. There is a classical algorithm which samples a binary string x ∈ 0, 1n fromthe distribution P (x) = |〈x|α〉|2 using a runtime O(n4w).

In either case, let us now describe how to obtain Theorem 5 from the lemmas above. LetA be a randomized algorithm that samples a bit string x ∈ 0, 1n from the distribution|〈x|α〉|2 and outputs the quantity µ = 〈β|x〉/〈α|x〉. We have E(µ) = 〈β|α〉 = Φ andE(|µ|2) = 〈β|β〉 ≤ 1. Thus µ is an unbiased estimator of Φ with variance at most one.Lemmas 4 and 6 will imply that imply thatA has runtime O(n4w). By generating S = 100ε−2

independent samples of µ using S calls to the algorithm A and computing the sample meanvalue one can approximate Φ with an additive error ε in time O(n4wε−2), as claimed.

So far we have assumed that Oj are unitary operators.7 Now let us consider the moregeneral case where we only require ‖Oj‖ ≤ 1 for all j. Using the singular value decompositionone can write any 2× 2 matrix M as

M = ‖M‖ · U[

1 00 s

]V

for some 0 ≤ s ≤ 1 and some 2× 2 unitary matrices U, V . The identity[1 00 s

]=

(1 + s)

2I +

(1− s)2

Z

7While this is not technically required for Lemma 6, our running will not suffer from making this assump-tion, and the following argument can be made for both Lemma 5 and Lemma 6.

21

gives M = ‖M‖(q0M0 +q1M1), where M0 = UV and M1 = UZV are unitary, q0 = (1+s)/2,and q1 = (1− s)/2. Applying this decomposition to each operator Oj one gets

Φ = ω∑

z∈0,1nq(z)Φ(z),

where ω =∏n

j=1 ‖Oj‖ ≤ 1 is the normalization constant, q(z) is a product distribution, and

Φ(z) = 〈0n|H⊗nUg(O1,z1 ⊗O2,z2 ⊗ · · · ⊗On,zn)UfH⊗n|0n〉,

for some unitary operators Oj,0 and Oj,1. As shown above, there exists a randomized algo-rithm A that takes as input a bit string z ∈ 0, 1n and outputs a random variable µ(z) ∈ Csuch that E(µ(z)) = Φ(z) and E(|µ(z)|2) ≤ 1. Here the mean values are taken over theinternal randomness of A. We can now estimate Φ by Monte Carlo method. GenerateS = 100ε−2 independent samples z1, . . . , zS from the distribution q(z). For each sample zj

call the algorithm A to obtain the estimator µ(z). Finally, approximate Φ by the quantity

η =ω

S

S∑j=1

µ(zj).

One can straightforwardly check that E(η) = Φ. Furthermore,

Var(η) =ω2

SVar(µ(z1)) ≤ 1

S

∑z

q(z)E(|µ(z)|2) ≤ 1

S=

ε2

100.

By the Chebyshev inequality, |η − Φ| ≤ ε with probability at least 0.99.

5 Applications to quantum approximate optimization

In this section we discuss the connection between the classical algorithm for graph-basedforrelation and certain quantum approximate optimization algorithms (QAOA) [11]. Inparticular, in Section 5.1 we show how graph-based forrelation can be be used to calculatethe variational energies for the level-2 QAOA algorithm on planar and bipartite graphs. InSection 5.2, we describe an implementation of this algorithm that can accurately estimatethese energies. Finally, in Section 5.3, we discuss the recursive QAOA algorithm [22] forwhich our graph-based forrelation approach can give a complete classical simulation. Wethen combine our implementation for computing level-2 QAOA energies with the recursiveQAOA algorithm to show that our algorithm can be leveraged for solving large optimizationproblems.

5.1 Quantum mean value problem for level-2 QAOA

Suppose G = (V,E) is a graph with n vertices. We place a qubit at each vertex of G.Consider a diagonal n-qubit Hamiltonian

C =∑p,q∈E

Jp,qZpZq

22

where Jp,q are real coefficients. It describes a classical Ising-type cost function defined on thegraph G. Note that 〈z|C|z〉 is a 2-local function on G. The Quantum Approximate Opti-mization Algorithm (QAOA) introduced in [11] maximizes the expected value 〈ψ|C|ψ〉 overa suitable class of n-qubit variational states ψ that depend on a few parameters. Once theoptimal variational state ψ is found, a bit string x ∈ 0, 1n is sampled from the distribution|〈x|ψ〉|2 obtained by measuring every qubit of ψ. The expected value of the cost function〈x|C|x〉 coincides with the optimal variational energy 〈ψ|C|ψ〉.

Here we focus on the level-2 version of QAOA. The corresponding variational states aregenerated by a quantum circuit with two entangling layers,

|ψ〉 = e−iβ2Be−iγ2Ce−iβ1Be−iγ1C |+n〉.

Here, β`, γ` ∈ R are variational parameters, |+n〉 = H⊗n|0n〉, and B = X1 + X2 + . . . + Xn.Consider the first step of QAOA – computing the expected value 〈ψ|C|ψ〉. By linearity,it suffices to compute the expected value 〈ψ|ZsZt|ψ〉 for some fixed edge s, t ∈ E. Thestandard lightcone argument shows that 〈ψ|ZsZt|ψ〉 only depends on coefficients Jp,q suchthat the edge p, q is incident to s, or t, or one of nearest neighbors of s, t. The remainingedges are irrelevant and can be removed from the graph. Below we assume that the graphhas been truncated such that all irrelevant edges are removed.

Inserting the identity decompositions on qubits s, t between the first B-layer and thesecond C-layer one gets

〈ψ|ZsZt|ψ〉 =∑

x∈0,14µ(x)

where

µ(x) = 〈+n|eiγ1Ceiβ1B|x1x2〉〈x1x2|s,teiγ2Ceiβ2BZsZte−iβ2Be−iγ2C |x3x4〉〈x3x4|s,te−iβ1Be−iγ1C |+n〉

Using the fact that X⊗n commutes with B,C, ZsZt and X⊗n|+n〉 = |+n〉 one can easilycheck that µ(x) has symmetries µ(x⊕ 1111) = µ(x) and µ(x3x4x1x2) = µ∗(x1x2x3x4). Since〈ψ|ZsZt|ψ〉 is real, one can write

〈ψ|ZsZt|ψ〉 = Re [2µ(0000) + 4µ(0010) + 4µ(0001) + 2µ(0011) + 2µ(0110) + 2µ(0101)] .(34)

Let us show that the quantity µ(x) can be expressed as a graph-based forrelation, namely,

µ(x) = 〈x1x2|U |x3x4〉 · 〈+n|eiγ1CO1(x)⊗O2(x)⊗ · · · ⊗On(x)e−iγ1C |+n〉 (35)

for some single-qubit operators Oj(x) such that ‖Oj(x)‖ ≤ 1 and a two-qubit operator

U = eiγ2Js,tZ⊗Zeiβ2(X⊗I+I⊗X)Z ⊗ Ze−iβ2(X⊗I+I⊗X)e−iγ2Js,tZ⊗Z . (36)

Indeed, a simple algebra shows that

µ(x) = 〈x1x2|U |x3x4〉 · 〈+n|eiγ1Ceiβ1Beiγ2C′ |x1x2〉〈x3x4|s,te−iγ2C′e−iβ1Be−iγ1C |+n〉

where C ′ is obtained from C by retaining only the edges incident to exactly one of thevertices s, t. In other words,

C ′ =∑

p∈V \s,t

Js,pZsZp + Jt,pZtZp.

23

Replacing Zs and Zt in C ′ by their eigenvalues leads to Eq. (35) where

Os(x) = eiβ1X |x1〉〈x3|e−iβ1X , (37)

Ot(x) = eiβ1X |x2〉〈x4|e−iβ1X , (38)

andOp(x) = eiβ1Xeiγ2(Js,p(−1)x1+Jt,p(−1)x2−Js,p(−1)x3−Jt,p(−1)x4 )Ze−iβ1X (39)

for all p /∈ s, t. Combining Eqs. (34,35,36,37,38,39) shows that computing the mean value〈ψ|ZsZt|ψ〉 can be reduced to solving six instances of the graph-based forrelation problem.

Finally, although there are efficient classical algorithms for maximizing Ising cost func-tions over planar graphs [28], we note that for Ising cost functions with non-zero magneticfields (i.e., the Hamiltonian contains JaZa terms) no such polynomial-time algorithm isknown. Furthermore, the above reduction from level-2 QAOA to graph-based forrelationcan be easily extended to these more general cost functions C that contain both quadraticand linear terms.

As a consequence, our classical algorithm for graph-based forrelation problems can bedeployed to efficiently compute mean values 〈ψ|C|ψ〉 for level-2 QAOA on planar and bi-partite graphs. This allows one to study the performance of these QAOA algorithms usinga classical computer. Note that it does not provide a classical simulation of the full QAOAalgorithm—a quantum computer is still needed to sample from the variational state |ψ〉 oncea suitable choice of variational parameters has been found.

5.2 Numerical estimation of quantum mean values

We implemented our graph-based forrelation algorithm8 and used it to calculate level-2QAOA variational energies. To test our implementation, we calculated 〈ψ|C |ψ〉 where C isthe 2-local Hamiltonian with randomly-selected ±1 couplings over the 10-qubit triangularlattice shown below:

A green edge corresponds to a +1 coupling, while a red edge corresponds to a−1 coupling.The triangular lattice was chosen over the usual square lattice due to the fact that it is notbipartite, and therefore requires the more sophisticated partitioning of vertices used in thegraph-based forrelation algorithm for general planar graphs.

In Figure 3, we graph the accuracy of the estimate as a function of ε where ε−2 sampleswere taken for each unitary graph-based forrelation instance. Recall, however, that weestimate each local term of the Hamiltonian separately (of which there are 18 in the example).

8To be clear, the running time of our algorithm will scale quadratically in n due to the fact that we usethe sampling algorithm from Lemma 5

24

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0 0.01 0.02 0.03 0.04 0.05

Ave

rage

addit

ive

erro

r

ε

Figure 3: Additive estimation error for the energy of the triangular lattice as a function ofaccuracy parameter ε. Each data point is the average of 10 trials with variance shown by theerror bars. The parameters β1 = 1.44433, β2 = 3.56786, γ1 = 0.937498, and γ2 = 4.93861were fixed for all ε, and the exact energy −4.35917 was calculated by brute force. We notethat we expect linear scaling with ε.

We randomly chose the variational angles in the range [0, 2π) (β1 = 1.44433, β2 = 3.56786,γ1 = 0.937498, and γ2 = 4.93861). These choices were fixed for all ε. The exact energy(−4.35917) was calculated explicitly using Mathematica.

5.3 The level-2 RQAOA algorithm and implementation

Recall that the ability to optimize the variational parameters of the QAOA algorithm doesnot immediately confer the ability to carry out the entire QAOA heuristic due to the factthat a quantum computer is needed to sample from the variational state |ψ〉. Instead, weturn to the Recursive QAOA algorithm (RQAOA) of Bravyi, Kliesch, Konig, and Tang [22]which can be simulated entirely given an algorithm to compute expected values and optimizevariational angles. The algorithm has also been shown to perform well in certain cases inwhich the usual QAOA is provably suboptimal.

Let us briefly describe this algorithm. Suppose we have some cost function C : −1, 1n →R we would like to maximize. For our purposes, let us also suppose that C is an Ising costfunction of the form

C(z) =∑p,q∈E

Jp,qzpzq

where E is the edge set of some graph, and the Jp,q are real coefficients. As in the QAOAalgorithm, we promote the cost function to a Hamiltonian, i.e., C =

∑p,q∈E Jp,qZpZq, and

optimize the variational parameters of the state

|ψ〉 = e−iβ2Be−iγ2Ce−iβ1Be−iγ1C |+n〉

to maximize 〈ψ|C |ψ〉. Given these parameters, we compute the values

Mp,q = 〈ψ|ZpZq |ψ〉

25

for each edge p, q ∈ E. We then choose whichever edge p, q showed the strongestcorrelation, that is,

p, q = arg maxa,b∈E

|Ma,b|.

Finally, we impose the constraint zp = sgn(Mp,q)zq and eliminate zp from the cost functionso that we obtain a new Ising cost function on n − 1 variables. That is, after eliminatingvariable zp, the term Jr,pzrzp in the cost function becomes Jr,psgn(Mp,q)zrzq. We recurseon this smaller instance (finding the new optimal parameters, choosing an edge, etc.) untilsome threshold number of variables is met, after which we simply brute force the optimalsolution.

The RQAOA algorithm has an important property of preserving planarity of the under-lying graph G. Namely, the variable elimination step of the algorithm, i.e., constraining zpand zq to be (anti)correlated, corresponds to a contraction of the edge p, q in the graphG —the term Jp,qzpzq in the cost function disappears and becomes a constant energy shift,and all edges incident to those two vertices become incident to the merged node. It is well-known that edge contractions preserve planarity.9 Thus, recursive steps in RQAOA generatea family of Ising cost functions defined on planar graphs, as long as the initial cost functionis defined on a planar graph.

Let us now describe our implementation of the level-2 RQAOA algorithm on 2D gridgraphs. We optimize the variational angles in two phases. First, we set β2 = γ2 = 0 andoptimize β1 and γ1 using the solution for level-1 QAOA [22]. We fix the optimal β1 andγ1 obtained at this step. Second, we search over a grid of possible β2 and γ2 values andchoose whichever angles yield the largest energy using our algorithm for calculating 2-levelvariational energies. We note that this approach ensures that the energy computed at eachstep is at least the energy computed by the level-1 QAOA algorithm. Indeed, we observeempirically that the energy we obtain is almost always strictly greater than the level-1 energy.That said, our approach does not guarantee we obtain the largest possible level-2 variationalenergy.

The second step above is aided by the following observation: for fixed angles β1, γ1, γ2,the value of β2 which maximizes the variational energy can be computed by calculatingthe energy at only three different values of β2. Therefore, the two-dimensional search inthe second step above essentially reduces to a one-dimensional search. We describe thisreduction in Appendix C.

Finally, in order to more efficiently probe larger and more complex optimization problemsusing level-2 RQAOA, we supplement our graph-based forrelation algorithm for calculatingthe variational energies with an explicit brute force solution. It is simple to see that theexpected value of a particular ZsZt term only depends on neighbors of the neighbors of theedge s, t—that is, the lightcone of the ZsZt term. When this set of qubits is small, wecan brute force the expected values by explicitly writing out the state of the system. On theother hand, when this set of qubits is large, the exponential running time of the brute forcealgorithm becomes prohibitive and we must turn to our graph-based forrelation algorithm.

9For example, one can appeal to Wagner’s theorem which states that a finite graph is planar iff it excludesK5 and K3,3 as minors. By definition, taking minors involves contracting edges, and so if the minor wasexcluded before then it is clearly excluded after contracting an edge.

26

0

5

10

15

20

25

30

50 100 150 200

Tim

e(m

inute

s)

Step of RQAOA

05

101520253035404550

50 100 150 200For

rela

tion

Cal

ls(h

undre

ds)

Step of RQAOA

Figure 4: The graphs above depict a single run of the level-2 RQAOA algorithm on a 15×15grid with random ±1 couplings. We set the parameters ε = .03, brute force the solution when10 vertices remain, and search over 30 possible values of γ2 to optimize the level-2 variationalangles. The algorithm returned an optimal assignment to the vertices to maximize the costfunction. Left: The running time of the algorithm at each step of the RQOA algorithm.Right: The number of calls we make to our graph-based forrelation algorithm (e.g., a valueof 0 implies that only the brute-force algorithm was used).

We describe our exact brute force algorithm (which is more sophisticated than the one hintedat above) in Appendix D.

Notice that although the 2D grid graph has low degree, the graph obtained by contractingvarious edges does not in general remain low degree. Therefore, while the earlier steps ofthe RQAOA algorithm can be computed by brute force, the later steps must be computedby our graph-based forrelation algorithm. This phenomenon can be observed in Figure 4.

We test this implementation of the level-2 RQAOA on 10 × 10 and 15× 15 grid graphswith random ±1 couplings. In both tests, we set ε = .03 (once again, taking ε−2 manysamples), brute force the solution when 10 vertices remain, and search over a grid of 30points to optimize the level-2 variational angles. In ten random instances on the 10 × 10grid, there was only one for which our implementation did not yield an optimal solution (andeven in this case, the solution obtained was 98% of the optimal value). While impressive,we note that the level-1 RQAOA algorithm also performs very well on such instances, andso we do not attempt to make a qualitative comparison between the two. That said, westress that we know of no faster method for calculating the level-2 variational energies andthat the energies given by our level-2 algorithm are greater than those given by the level-1algorithm. Finally, we tested a single random instance on the 15×15 grid and give a detailedbreakdown of its runtime in Figure 4. The algorithm found an optimal assignment to thevertices to maximizes the cost function, which we computed explicitly using Gurobi’s integerquadratic programming software [29].

27

6 Conclusions

Quantum computers available today may already be capable of certain specialized tasksthat are too hard for existing classical computers. A convincing demonstration of a quan-tum computational advantage requires evidence that the task is hard for classical computers.Even if we focus on problems which admit asymptotic quantum speedups, one would liketo ensure that the problem size is large enough to defeat existing classical machines. Thisrequires an accurate estimate of the runtime required by classical algorithms. Our work pro-vides such estimates for the forrelation problem – a powerful computational primitive thatachieves nearly maximal quantum speedup, as measured by query complexity. We devel-oped classical algorithms both for the oracle-based and for an explicitly defined graph-basedforrelation problems. In the oracular setting our algorithm is almost optimal as its runtimenearly matches the query complexity lower bound. This provides a nearly quadratic speedup(2n/2 vs 2n) compared with the best previously known algorithms reported in Refs. [5, 6].In the graph-based setting our algorithm has a linear runtime for a large family of graphsincluding any bipartite or any planar graph. This extends a line of work studying classicalalgorithms for problems that promise a quantum advantage such as boson sampling [30],random circuit sampling [31], or IQP circuits [32]. Our work also provides a new tool tostudy the performance of quantum approximate optimization algorithms. The ability tocalculate the variational energy of QAOA states classically eliminates the need to preparethe state on a quantum device while optimizing the QAOA parameters. Alternatively, ourclassical simulator can be used to benchmark QAOA on large problem sizes that are not yetaccessible in the experiment.

While we have provided strong evidence that a relative-error version of graph-basedforrelation is intractable, an intriguing open question is whether or not the standard additive-error version is hard for classical machines. Alternatively, one may attempt to improve orextend the reach of our efficient classical algorithms. A modest goal would be to improve theerror dependence of the classical algorithm for graph-based forrelation on planar graphs fromε−2 to ε−1, matching the error scaling of our algorithm for oracle-based forrelation. Such animprovement would likely provide a substantial enhancement of the practical performanceof QAOA simulation.

A related open question is whether some special instances of the graph-based forrelationproblem can be used to demonstrate a computational quantum advantage with an efficientclassical verification. For example, suppose G = (V,E) is an n-vertex graph that admitsa partition of vertices V = AB such that the subgraphs GA and GB induced by A andB have treewidth O(log n). Suppose also that finding a low-treewidth partition as abovefrom scratch is a classically hard problem. The forrelation Φ based on such graph G can beefficiently (in time polynomial in n) approximated by the classical algorithm of Section 4 onlyif a low-treewidth partition AB is explicitly specified. Meanwhile, a quantum computer canefficiently approximate the forrelation Φ in all cases. Imagine a classical verifier in possessionof a secret low-treewidth partition V = AB and a prover who claims to have a quantumcomputer. The verifier can challenge the prover to approximate the forrelation Φ withoutaccess to the secret partition AB. Once the prover provides an estimate of Φ, the verifiercan efficiently test whether this estimate is accurate enough using the algorithm of Section 4.The ability to pass this test may serve as a proof of quantumness, assuming that graphs G

28

with the desired properties indeed exist.

7 Acknowledgments

D. Gosset and D. Grier are supported in part by IBM Research. D. Gosset also acknowl-edges the support of the Natural Sciences and Engineering Research Council of Canadaand the Canadian Institute for Advanced Research. This work is supported in part by theArmy Research Office under Grant Number W911NF-20-1-0014. The views and conclusionscontained in this document are those of the authors and should not be interpreted as repre-senting the official policies, either expressed or implied, of the Army Research Office or theU.S. Government. The U.S. Government is authorized to reproduce and distribute reprintsfor Government purposes notwithstanding any copyright notation herein.

References

[1] Lov K Grover. Quantum mechanics helps in searching for a needle in a haystack.Physical review letters, 79(2):325, 1997.

[2] Peter W Shor. Polynomial-time algorithms for prime factorization and discrete loga-rithms on a quantum computer. SIAM review, 41(2):303–332, 1999.

[3] Michele Mosca and Artur Ekert. The hidden subgroup problem and eigenvalue es-timation on a quantum computer. In NASA International Conference on QuantumComputing and Quantum Communications, pages 174–188. Springer, 1998.

[4] Mark Ettinger, Peter Høyer, and Emanuel Knill. The quantum query complexity of thehidden subgroup problem is polynomial. Information Processing Letters, 91(1):43–48,2004.

[5] Scott Aaronson. BQP and the polynomial hierarchy. In Proceedings of the forty-secondACM symposium on Theory of computing, pages 141–150, 2010.

[6] Scott Aaronson and Andris Ambainis. Forrelation: A problem that optimally separatesquantum from classical computing. SIAM Journal on Computing, 47(3):982–1038, 2018.

[7] Avishay Tal. Towards optimal separations between quantum and randomized querycomplexities. arXiv preprint arXiv:1912.12561, 2019.

[8] Nikhil Bansal and Makrand Sinha. k-forrelation optimally separates quantum and clas-sical query complexity. arXiv eprint arXiv:2008.07003, 2020.

[9] Alexander A Sherstov, Andrey A Storozhenko, and Pei Wu. An optimal separation ofrandomized and quantum query complexity. arXiv preprint arXiv:2008.10223, 2020.

[10] Ran Raz and Avishay Tal. Oracle separation of bqp and ph. In Proceedings of the 51stAnnual ACM SIGACT Symposium on Theory of Computing, pages 13–23, 2019.

29

[11] Edward Farhi, Jeffrey Goldstone, and Sam Gutmann. A quantum approximate opti-mization algorithm. arXiv preprint arXiv:1411.4028, 2014.

[12] Gilles Brassard, Peter Hoyer, Michele Mosca, and Alain Tapp. Quantum amplitudeamplification and estimation. Contemporary Mathematics, 305:53–74, 2002.

[13] Scott Aaronson, Andris Ambainis, Andrej Bogdanov, Krishnamoorthy Dinesh, and Che-ung Tsun Ming. On quantum versus classical query complexity. Electron. ColloquiumComput. Complex., page 115, 2021.

[14] Robert Beals, Harry Buhrman, Richard Cleve, Michele Mosca, and Ronald De Wolf.Quantum lower bounds by polynomials. Journal of the ACM (JACM), 48(4):778–797,2001.

[15] Scott Aaronson. “Yet more mistakes in papers”, blog post: https://www.

scottaaronson.com/blog/?p=5706.

[16] Sergey Bravyi, David Gosset, and Ramis Movassagh. Classical algorithms for quantummean values. arXiv preprint arXiv:1909.11485, 2019.

[17] Maarten Van den Nest, Jeroen Dehaene, and Bart De Moor. Graphical descriptionof the action of local Clifford transformations on graph states. Physical Review A,69(2):022316, 2004.

[18] Robert Raussendorf and Hans J Briegel. A one-way quantum computer. Physical ReviewLetters, 86(22):5188, 2001.

[19] Michael J Bremner, Richard Jozsa, and Dan J Shepherd. Classical simulation ofcommuting quantum computations implies collapse of the polynomial hierarchy. Pro-ceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences,467(2126):459–472, 2011.

[20] Zhihui Wang, Stuart Hadfield, Zhang Jiang, and Eleanor G Rieffel. Quantum ap-proximate optimization algorithm for MaxCut: A fermionic view. Physical Review A,97(2):022304, 2018.

[21] Matt DeVos, Guoli Ding, Bogdan Oporowski, Daniel P Sanders, Bruce Reed, PaulSeymour, and Dirk Vertigan. Excluding any graph as a minor allows a low tree-width2-coloring. Journal of Combinatorial Theory, Series B, 91(1):25–41, 2004.

[22] Sergey Bravyi, Alexander Kliesch, Robert Koenig, and Eugene Tang. Obstacles tovariational quantum optimization from symmetry protection. Physical Review Letters,125:260505, 2020.

[23] M Nest. Simulating quantum computers with probabilistic methods. arXiv preprintarXiv:0911.1624, 2009.

[24] Peter Clifford. Markov random fields in statistics. Disorder in physical systems: Avolume in honour of John M. Hammersley, pages 19–32, 1990.

30

[25] Hans Bodlaender. Planar graphs with bounded treewidth. Technical Report RUUCS-88-14, Department of Computer Science, Utrecht University, the Netherlands (1988).

[26] Hans L Bodlaender. A linear-time algorithm for finding tree-decompositions of smalltreewidth. SIAM Journal on computing, 25(6):1305–1317, 1996.

[27] Gary Chartrand, Dennis Geller, and Stephen Hedetniemi. Graphs with forbidden sub-graphs. Journal of Combinatorial Theory, Series B, 10(1):12–41, 1971.

[28] Nicol Schraudolph and Dmitry Kamenetsky. Efficient exact inference in planar isingmodels. Advances in Neural Information Processing Systems, 21:1417–1424, 2008.

[29] LLC Gurobi Optimization. Gurobi optimizer reference manual, 2021.

[30] Peter Clifford and Raphael Clifford. The classical complexity of boson sampling. In Pro-ceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms,pages 146–155. SIAM, 2018.

[31] Scott Aaronson and Lijie Chen. Complexity-theoretic foundations of quantumsupremacy experiments. arXiv preprint arXiv:1612.05903, 2016.

[32] Gregory D Kahanamoku-Meyer. Forging quantum data: classically defeating an IQP-based quantum test. arXiv preprint arXiv:1912.05547, 2019.

[33] David Gosset, Daniel Grier, Alex Kerzner, and Luke Schaeffer. Fast simulation of planarclifford circuits, 2020.

[34] Scott Aaronson and Daniel Gottesman. Improved simulation of stabilizer circuits. Phys-ical Review A, 70(5):052328, 2004.

[35] Leslie Ann Goldberg and Heng Guo. The complexity of approximating complex-valuedIsing and Tutte partition functions. computational complexity, 26(4):765–833, 2017.

A Linear-time graph-based forrelation algorithm

In this appendix, we prove Lemma 6 which improves the running time of Lemma 5 to linearin n. The key to the proof is Theorem 9 stated below, which solves a more general problemthan is necessary for Lemma 6:

• Qubits are generalized to qudits of dimension d.

• The starting state is a tensor product of mixed states, χ1 ⊗ · · · ⊗ χn, rather thanH⊗n |0n〉.

• There are operators Oa for all qudits, and they need not have bounded norm.

• The gates U1, . . . , Um may act on a set S of qudits, and induce edges i, j in G for alli, j ∈ S. We do not require the product of the gates to be unitary.

31

We recall the problem from the introduction.

Problem 1. Given an initial state χ = χ1⊗· · ·⊗χn, a collection of diagonal gates U1, . . . , Um,and single-qudit operators Oaa∈[n] on the qudits, sample a string x from the distribution Pwhere P (x) ∝ 〈x|OUχU †O† |x〉 and

U =m∏j=1

Uj, O =n⊗i=1

Oi.

As before, there is graph G associated with an instance of this problem.

Definition 3. We define the connectivity graph G = (V,E) where V is the set of all quditsand

E := (i, j) ∈ V × V : some Uk acts on both i and j.

Our main result for this appendix is the following improvement of Lemma 5:

Theorem 9. Given an instance of Problem 1, and a rooted tree decomposition T of width wfor the connectivity graph G = (V,E), there is a classical algorithm sampling a solution intime O(nd2w +mdw + `) where ` is the length of the input, m is the number of gates, and nis the number of qudits.

Before we go any further, we note that the algorithm shares some similarities with aresult of Gosset et al. [33]. That result describes how to use a tree decomposition to helpsample Pauli measurements on a graph state. Since a graph state is just the tensor productstate |+〉⊗n with diagonal CZ gates applied on the edges, the graph state sampling problemthey introduce is actually a special case of Problem 1. However, Theorem 9 gives a muchworse bound (exponential in treewidth, rather than polynomial) because it does not assumethe state is Clifford.

We divide the proof into three lemmas. First, we use a tree decomposition for G tobuild a tensor network for OUχU †O†, and prove correctness of the network in Lemma 7.Second, we present an algorithm in Lemma 8 to sample measurement outcomes from thetensor network, exploiting its tree structure. Third, we separate out the runtime analysis ofthe algorithm into Lemma 9.

We begin with the tensor network, N , for the state OUχU †O†. Actually, it is easier ifOUχU †O† is a vector, so let vec be the linear map which flattens operators to vectors withrespect to the standard basis:

vec(|i〉 〈j|) = |i〉 |j〉 for all i, j.

That is, operators on a Hilbert space H1⊗· · ·⊗Hk map to vectors in⊗k

i=1(Hi⊗Hi). Thenthe goal is for N to be a tensor network for

vec(OUχU †O†) = (O ⊗O)(U ⊗ U) vec(χ) =⊗i

(Oi ⊗Oi)∏j

(Uj ⊗ Uj)⊗k

vecχk. (40)

For example, one possible tensor network for vec(OUχU †O†) looks like a normal circuit: aninitial state followed by a sequence of gate tensors Uj⊗Uj for each gate Uj, and finally Oi⊗Oi

32

χb

χb Mb χbU

Mb = Mb

U=

UMb

Figure 5: Identities with the merge tensor. (1) Two copies of χb are merged into one. (2) Agate tensor for diagonal U can be moved to any one of the three arms.

operator tensors for the operators. The wires are d2-dimensional, the states are flatteneddensity matrices, gates and operators are doubled as Uj ⊗ Uj or Oi ⊗Oi, but otherwise it isjust like a normal circuit.

Of course, we could readily simulate a circuit on a mixed state without the formalism ofvec and tensor networks, but we will also be using merge tensors Mb to combine two wiresrepresenting the same qudit b. If vec(χb) =

∑i αi |i〉 (note: this is a classical basis of size

d2) then we define the merge tensor Mb as

Mb :=∑i:αi 6=0

α−1i |i〉 (〈i| ⊗ 〈i|).

We need three key properties of these tensors. See Figure 5 for a diagram of the first two.

Proposition 1. For any qudit b, the merge tensor Mb has the following properties:

1. Applying Mb to two copies of vec(χb) produces vec(χb).

Mb(vec(χb)⊗ vec(χb)) = vec(χb).

2. Any diagonal gate U commutes through from either input of Mb to the output.

(Mb ⊗ I)(I ⊗ U) = U(Mb ⊗ I).

3. Mb(|u〉 ⊗ |v〉) = |c〉 |u〉 |v〉 where c is a constant vector and denotes entrywisemultiplication.

Proof. For the first property, we have the calculation

Mb(vec(χb)⊗ vec(χb)) =∑i:αi 6=0

α−1i (〈i| vec(χb))

2 |i〉 =∑i:αi 6=0

αi |i〉 = vec(χb).

Now consider a diagonal gate U and decompose it as U =∑

j |j〉 〈j| ⊗ Uj. Then

(Mb ⊗ I)(I ⊗ U) =

(∑i:αi 6=0

α−1i |i〉 (〈i| ⊗ 〈i|)⊗ I

)(∑j

I ⊗ |j〉 〈j| ⊗ Uj

)

=

(∑i:αi 6=0

∑j

δijα−1i |i〉 (〈i| ⊗ 〈j|)⊗ AUj

)

=

(∑j

|j〉 〈j| ⊗ Uj

)(∑i:αi 6=0

α−1i |i〉 (〈i| ⊗ 〈i|)

)= U(Mb ⊗ I).

33

The final property is straightforward when we take |c〉 =∑

i ci |i〉 where

ci =

α−1i if αi 6= 0

0 if αi = 0.

Finally, we can define the tensor networkN . We constructN from the tree decompositionT by creating a subnetwork NB for each node B in T , and connecting subnetworks ofneighbouring nodes in T . For organizational purposes, each wire in the network is directed,and labeled by a qudit. Then for each child C of B in the tree, NB has incoming wires fromNC for each qudit in B ∩ C. There are outgoing wires for each qudit in B; the parent Aof B takes wires A ∩ B and the remaining B\A are free. Since each vertex “leaves” a treedecomposition (appears in some node, but not its parent) exactly once, N has one free legfor each vertex (qudit). See Figure 9 for an example tensor network.

Within each subnetworkNB, there are four kinds of tensors that can appear: state tensors(i.e., vec(χi)), gate tensors (i.e., Uj ⊗Uj), operator tensors (i.e., Oi⊗Oi), and merge tensors(Mb). These are split across four stages appearing in the following order.

Merge stage: We have incoming wires from child subnetworks NCi, and there may be more

than one wire for some qudit b ∈ B. Use the merge tensor Mb to combine wires for bin pairs until there is only one wire for qudit b.

Introduction stage: If some qudit b in B does not appear in any child of B, then add thetensor vec(χb), so there is a wire for that qudit leading into the next stage.

Gate stage: At this point, there is exactly one incoming wire per qudit in B. Apply gatetensors for all gates which (i) act only on qudits in B, and (ii) touch some qudit inB\A where A is B’s parent (or A = ∅ if B is the root). That is, apply

∏j∈S Uj ⊗ Uj

for the prescribed set S.

Operator stage: Wires for qudits in A ∩ B go to the parent; the remaining B\A quditsare due to be measured, so we apply operator tensors to these qudits.

Lemma 7. The tensor network N contracts to vec(OUχU †O†).

Proof. We have already established that N has a free wire for each qudit, so the tensornetwork has the right dimension.

Next, note that the edges between subnetworks are directed towards the root of the tree,and edges within subnetworks are directed through the sequence of stages (→ merge →introduction → gate → operator →), so the full network is a directed acyclic graph. As aresult, individual merge tensors are partially ordered. If there are any merge tensors in N ,then there is an “earliest” merge tensor Mb, preceded only by gate tensors and state tensors(operator tensors are last under this ordering, so they do not interfere). By the secondproperty of Proposition 1, we can commute the gate tensors past Mb, until both input wiresto Mb come directly from state tensors.

Recall that all wires are labeled, and note that all the wires of a merge tensor Mb arelabeled b in the original network. This is preserved when we commute gates through with

34

Proposition 1. It follows that the inputs to Mb are labeled b, and therefore are vec(χb) statetensors. By the first property of Proposition 1, we replace Mb and its inputs with vec(χb).Repeat the whole process with the next earliest merge tensor, until we eliminate all mergetensors.

The remaining network has only state tensors, gate tensors, and operator tensors, but itcontracts to the same tensor as the original network, N . The new network is essentially acircuit: we can follow each output wire back through the circuit to the input, a state tensor.There is a free wire and operator tensor for each qudit i at the end of the circuit, so theremust be a matching state tensor vec(χi) at the start of the circuit. It only remains to checkthat we have the right set of gate tensors. Since the gate tensors commute, it will followthat the network is correct by (40).

Let Uk be an arbitrary gate, on some subset of qudits S ⊆ V . For each i, let Ti be thenon-empty subtree of T with all nodes containing i. Note that any i, j ∈ S define an edge inG, so i, j appear together somewhere in T , i.e., Ti ∩ Tj 6= ∅. The intersection of a collectionof pairwise intersecting subtrees is non-empty, so there is some highest node B in

⋂s∈S Ts.

We conclude that the gate tensor Uk⊗Uk appears in the gate stage of NB because it containsall the necessary vertices S (it is in Ti for all i ∈ S), but its parent does not (or B would notbe highest).

Finally, we claim this occurrence is unique. Uk cannot appear in some B′ below B becausethere is no vertex b that leaves (i.e., isn’t in the parent of B′) since b is in B, B′, and all thenodes in between by tree decomposition definitions. On the other hand, Uk cannot appearoutside the subtree rooted at B, since there is some vertex b ∈ S missing from B’s parent,and hence in the rest of the tree.

We encourage the reader to work through the steps on the example in Figure 9 to helpvisualize the process.

It is perhaps not surprising that a tensor network N derived from a bounded-treewidthgraph can be sampled efficiently, but we could not find a blackbox way to get the runtimewe want from existing algorithms.10 We describe a new contraction-based algorithm, whilearguing its correctness, in Lemma 8, and then analyze its running time in Lemma 9.

The algorithm works with subtrees of T and corresponding subnetworks of N , so let usintroduce some notation for this. For a node B, let TB be the subtree of T rooted at B.Given an arbitrary subtree T ′ of T , let N (T ′) be the restriction of N to nodes appearing inT ′.

Lemma 8. There is a classical algorithm (described within) which solves Problem 1 givenan instance and an accompanying tree decomposition T .

Proof. Assume the tree decomposition T is rooted and ordered, and we have constructed thetensor network N . The algorithm traverses T (and the corresponding pieces of the network)twice: first to compute tensors ρB for all nodes B, then again to compute tensors σB, ΠB,and the sampled outcomes |xb〉.

1. Tensor ρB is the contraction of N (TB). This represents the initial state of the subtreenetwork before any sampling.

10We depend too much on the gates being diagonal, among other things.

35

2. Tensor σB is the contraction of N (TB) with the sampled outcomes vec(|xb〉 〈xb|) for allqudits measured in the subtree TB. This represents the final state of the subtree asthe second traversal leaves it.

3. Tensor ΠB is the contraction of N (T\TB) with sampled outcomes vec(|xb〉 〈xb|) for allqudits measured in nodes visited before B in the second traversal. ΠB represents therest of the network during the (second) traversal of TB: the contraction everythingoutside TB, postselected on the current sampled outcomes at the moment the traversaldescends into B and TB.

These three tensors are explicit, i.e., stored as vectors of complex numbers at some point inthe algorithm. As an essential performance optimization, we restrict ρB to qudits B, andrestrict σB and ΠB to qudits A ∩ B where A is the parent of B—we trace out all otherqudits in these tensors. Since tracing out a qudit is itself a tensor contraction, and theorder of contractions in a tensor network does not affect the result, we can reason about theunrestricted versions of ρB, σB, and ΠB, with the understanding that we still have to traceout the qudits at the end.

In the first traversal, the algorithm computes ρB for each B from the bottom up. Wesimply take ρC1 , . . . , ρCk

for B’s children C1, . . . , Ck (if they exist), contract them with NB,call the result ρB and pass it up the tree.

The second traversal works from the top down. See Figure 10 for a diagram of the stepsin node B. In this pass, we assume the algorithm gets ΠB from the parent when it first visitsB, and returns σB to the parent after traversing the subtree (TB). The base case is the rootR, where ΠR is an empty tensor with no wires. Then for an arbitrary node B in the tree,we notice that ΠB and ρB represent two halves of the network, and both are available tothe algorithm when it first arrives at B. Thus, the algorithm contracts ρB and ΠB to getN (restricted to qudits B), and this is enough to sample measurement outcomes for quditsB\A in B which do not connect to the parent, A.

Next, the algorithm visits the children C1, . . . , Ck of B. Since we need to provide ΠCi

to traverse Ci, we contract ΠB, NB, projectors onto the measurement outcomes for B, andeither ρCj

(if j > i) or σCj(if j < i) for each sibling of Ci to get ΠCi

. Then the algorithmrecursively traverses TCi

, samples measurement outcomes within the subtree, and returnsσCi

.Finally, once we have traversed all the subtrees of B, we visit B once more to compute

σB by contracting NB, projections onto the sampled outcomes for qudits measured in B,and σCi

for each child of B. The result is σB, and we pass it to the parent.Let us consider correctness. Consider an arbitrary qudit b (measured in some node B),

and suppose inductively that the qudits sampled before it were sampled with the correctdistribution. The algorithm samples b by contracting NB with ΠB and possibly ρC1 , . . . , ρCk

.We have already argued that the tensors ΠB, ρC1 , . . . , ρCk

represent the contractions ofN (T\TB), N (TC1), . . . ,N (TCk

) claimed at the start of this proof, so the main contractiongives the state OUχU †O† post-selected on the measurement outcomes so far. Unless thisstate is zero11, the algorithm can renormalize the state and sample b from it. Hence, b has

11If the post-selected state is zero then OUχU†O† must have been zero from the start, since we have onlypost-selected on outcomes with positive probability.

36

the correct distribution, and therefore the algorithm correctly samples.

A significant portion of the runtime analysis depends on the cost of contracting tensorstogether. We will need the following fact about tensor contraction.

Fact 1. Given two tensors X and Y explicitly, we can combine them in O(Dc+f ) arithmeticops where c is the number of wires between X and Y , f is the number of wires in the combinedtensor, and D is the dimension of each wire.

We will take D = d2 when we apply this fact, since recall that all our wires have dimensiond2. Note that we are counting the number of arithmetic operations (arithmetic ops), meaningcomplex additions, subtractions, multiplications and divisions.

Lemma 9. Given gates U1, . . . , Um (as dense vectors of their diagonal elements) on a systemof n qudits, operators Oii∈[n] (as dense matrices), and a tree decomposition T of width w,then the algorithm described above (Lemma 8) is dominated by the cost of O(nd2w+mdw+`)arithmetic ops, where ` is the length of the input.

Proof. We now show that we may assume T is a binary tree with O(n) nodes. First, wecontract any edge where the parent contains all the vertices of the child. In the remainingtree, every edge has some qudit b that appears in the child but not the parent. However, thereis only one such edge for each qudit b (the least common ancestor of all nodes containingb) by tree decomposition axioms. Hence, there are at most n edges and n − 1 nodes aftercontraction. Next, we replace any node B with three or more children C1, . . . , Ck by a binarytree having C1, . . . , Ck at the leaves and B at all internal nodes, at most doubling the numberof nodes. The resulting tree decomposition has the same treewidth as the original, at most2n− 2 nodes.

The main algorithm computes tensors ρB, σB, and ΠB for each node B in the treedecomposition. In Lemma 8, we presented the operations on ρB, σB, and ΠB abstractly astensor contractions, mostly between NB and ρ, σ, and Π for various nodes. We explicitlystore ρ, σ, and Π tensors as vectors of complex numbers, but not NB because it would be toolarge (up to w wires to the parent and w from each child is up to 3w). Instead, every timewe contract tensors to NB, we expand it as a network of individual state, gate, operator,and merge tensors, and contract the network strategically.

Note that there are up to four tensor computations that involve NB, which compute ρB(in the first traversal), ΠC1 , ΠC2 , σB. Each one contracts all the individual tensors of NB(for essentially the same cost in each contraction). We analyze the cost of contraction foreach stage in the forward and reverse directions (i.e., whether we’re given a tensor on theinput wires or output wires to contract with).

Merge tensors: We wish to combine tensors from the two children of B, namely C1 andC2, which overlap on qudits B ∩ (C1 ∪ C2). We apply the merge tensor MB :=⊗b∈B∩(C1∪C2)Mb on the duplicates. Fortunately, each individual Mb is an entry-wiseproduct of the two input tensors and a constant vector (by Proposition 1), and sois the full merge tensor MB. On w qudits, the vectors all have dimension d2w, andtherefore we use at most O(d2w) arithmetic ops. It is similarly O(d2w) ops to contractthe tensor given a tensor for the output and one of the inputs.

37

We can only merge as many times as there are internal nodes of T , i.e., at most O(n).Hence, O(nd2w) arithmetic ops are spent on merge tensors.

State tensors: The tensor product of vec(χi) with an existing tensor vec(ρ) takes O(d2f )arithmetic ops (by Fact 1) where f is the final number of wires. If we build up a largeproduct vec(χ1)⊗ ·⊗ vec(χk) iteratively then since the final number of wires increasesin each step, the number of arithmetic ops is at worst a geometric series,

O(d2) +O(d4) + · · ·+O(d2w−4) +O(d2w−2) +O(d2w) = O(d2w).

The series dominated by its largest term, which is at most O(d2w). In the otherdirection, contracting state tensor vec(χi) with an arbitrary tensor is O(d2f ) (again byFact 1) where f is the number of wires in the arbitrary tensor. Since this decreasesafter the contraction, the number of arithmetic ops required for a sequence of theseoperations is again at worst a geometric series totalling O(d2w).

The cost for the whole tree is O(d2w) per node times O(n) nodes: O(nd2w) arithmeticops.

Gate tensors: The gate tensors are linear maps Uj ⊗ Uj, where Uj is diagonal. To updatean explicit input tensor, it suffices to multiply entry-wise by the diagonal of Uj ⊗ Uj.Clearly this is O(d2w) arithmetic ops, and it is similar in reverse.

Now let UB be the product of all the gates applied in B. Rather than apply gate tensorsfor the individual gates, we compute the product in time O(dw) times the number ofgates, and then use the gate tensor UB ⊗UB in time O(d2w). This way there are O(n)gate tensor contractions using O(d2w) arithmetic ops each, plus O(dw) operations pergate times O(m) gates. Thus, gate tensors cost us O(nd2w +mdw).

Operator tensors: Unlike gates, operators may be arbitrary dense matrices, Oi. For thisreason, it is cheaper to apply the operators one at a time rather than grouping them likegates. From Fact 1, we need only O(d2w+2) = O(d2w) arithmetic ops to contract onewire of the operator tensor the main w wire tensor. There are at most O(n) operatortensors, one per qudit, so the total cost is O(d2wn) arithmetic ops across the tree.

There is one final step which does not involve NB: contracting ρB and ΠB and thensampling outcomes. The two tensors have A ∩B qudits in common, and B\A wires are leftat the end, so again the cost of the contraction is O(d2w) arithmetic ops by Fact 1. The resultis a tensor for the qudits to be measured, and since we are measuring in the classical basisthe distribution can be read off of the tensor. Hence we can sample in O(dw) time per node,which is negligible in comparison to contraction. Hence, the cost is O(nd2w) arithmetic opsacross the entire tree.

We present a way to tweak the algorithm to compute amplitudes as well, although itdoes not improve on the earlier algorithm given in Lemma 4.

Corollary 1. Given an initial pure state |χ〉 = |χ1〉 ⊗ · · · ⊗ |χn〉, a collection of diagonalgates U1, . . . , Um, single-qudit operators Oaa∈[n], and classical state |x〉, there is a classical

38

algorithm computing 〈x|α〉 where

|α〉 =⊗i

Oi

∏j

Uj |χ〉 .

The algorithm runs in time O((n+m)dw + `) where ` is the length of the input.

Proof. Re-use the same tensor network, but specializing it to pure states. That is, wires aredimension d, gate tensors become Uj, operator tensors become Oi, state tensors are |χi〉.Merge tensors are still

Mb =∑i:βi 6=0

β−1i |i〉 (〈i| ⊗ 〈i|),

but over d dimensions for i, and where |χb〉 =∑

i βi |i〉. We claim that this tensor networkcontracts to the vector |α〉, so if we attach 〈xb| to the free wire b for all qudits, then we cancompute 〈x|α〉.

We leave it as an exercise to check that we can contract this network in such a waythat all intermediate tensors have at most O(dw) entries, and hence we have the desiredruntime.

Recall that Theorem 5 partitions the graph into two halves, and splits the operators Oa

between them. We sample from one half, and evaluate an amplitude for the other half, soin each calculation, some of the operators are the identity. Crucially, we get to drop verticeswhere Oa = I from the graph, and the treewidth is computed for the remaining vertices.Since we have generalized the problem, we need to verify that this is still true: we can samplethe easy qudits and reduce to an instance on the rest. However, we require the product ofthe gates to be unitary.

Lemma 10. Given an instance of Problem 1 where the product U is unitary. Let

Veasy = a ∈ V : Oa = I,

be the set of vertices/qudits with trivial operators Oa. Then we can reduce to an instance onV \Veasy in linear time, i.e., sample outcomes for Veasy and simplify the instance to verticesV \Veasy.

Proof. Take an arbitrary qudit i ∈ Veasy. Ordinarily, the probability of measuring qudit i in

state |xi〉 〈xi| would be Tr((|xi〉 〈xi| ⊗ I)OiUχU†O†i ), but since Oi is the identity, we have

Tr((|xi〉 〈xi| ⊗ I)UχU †) = Tr(U †U(|xi〉 〈xi| ⊗ I)χ) = Tr((|xi〉 〈xi| ⊗ I)χ) = 〈xi|χi |xi〉 ,

where we used the fact that U is unitary (so U †U = I) and diagonal (so it commutes withother diagonal matrices, e.g., |x〉 〈x| ⊗ I). It is easy to sample an outcome |xi〉 proportionalto 〈xi|χi |xi〉; no calculation is necessary since the distribution is on the diagonal of χi. Oncewe have |xi〉, we are interested in the state Tri((|xi〉 〈xi|⊗I)UχU †), where we have projectedonto the outcome and traced out qudit i. Since each factor of U commutes with the projector,this is equivalent to replacing each Uj that touches qudit i with U ′j := Tri((|xi〉 〈xi| ⊗ I)Uj).We leave it as an exercise to check that this is a new instance of Problem 1.

39

Let us sample all qudits Veasy first, then project onto |xVeasy〉 and trace out Veasy. Sam-pling from a given discrete distribution on d elements requires at most O(d) arithmetic ops.Projecting onto a standard basis state |xi〉 and tracing out simply restricts Uj to a subma-trix U ′j with rows and columns matching |xi〉. We pick out this submatrix in linear time,completing the reduction to an instance on V \Veasy.

Finally, we recall Lemma 6 from Section 4.3:

Lemma 6. There is a classical algorithm which samples a binary string x ∈ 0, 1n fromthe distribution P (x) = |〈x|α〉|2 in time O(n4w). Recall that |α〉 = (OA ⊗ IB)UfH

⊗n |0n〉,and w is the treewidth of the graph restricted to qubits A (i.e., to the qubits having non-trivial operators). We additionally assume that Uf is unitary, and that we are given a treedecomposition of size O(n) and width w.

Proof. First, we use Lemma 10 to sample the qubits from B (i.e., qubits without operators)in linear time. The graph for the remaining qubits A and gates has treewidth w, and weassume we are given a tree decomposition T of size O(n) for it. Note that since our gatesare 2-local, there can be at most w2 gates per node of the tree decomposition, and thusm = O(nw2) gates total. Applying Theorem 9 samples the remaining qubits in time

O(nd2w +mdw) = O(n4w + nw22w) = O(n4w).

B Example Tensor Network NConsider a circuit on eight qudits labeled a through h, with diagonal gates on the followingsubsets of qudits:

a, b, a, c, b, b, c, b, e, b, f, b, g, b, h, c, d, e, d, e, e, g, h, f, g.

We wish to initialize the qudits in state χ = χa ⊗ · · · ⊗ χh, apply these gates, and measurequdits after applying operators Oa, . . . , Oh.

On closer inspection, we notice that d, e is a subset of c, d, e, so the two gates may becombined. Similarly, b can be combined with any of the other gates on b. Also, supposeOh = I, so we can remove h entirely: sample a classical outcome for h based on χh andrestrict Uegh to that outcome, creating a new gate Ueg. The other operators, Oa, . . . , Og, areall nontrivial, so at this point we construct the graph G in Figure 6. The figure shows theremaining qudits, and edges between pairs of qudits that appear together in some gate.

At this point, we could sample outcomes for the rest of the qudits with a naıve simulationof a circuit like in Figure 7. That is, construct the full density matrix for χ1⊗· · ·⊗χg, applythe gates one at a time, and then measure the qudits a through g. The problem is that thedensity matrix grows exponentially with the number of qudits, so it is extremely inefficientto store all the qudits.

Depending on the graph, we may be able to avoid storing all the qudits at once, bycarefully sequencing the gates, and scheduling the introduction/measurement of qudits. For

40

a b

c

d e

f

g

h

Figure 6: A graph on qudits a through g, where there is an edge u, v whenever some gateacts on u and v. The grayed out node h and edges e, h and g, h show what the graphwould be if we had not removed h, i.e., if h was measured in a different basis.

χaUab Uac

Oa

χbUbc

Ube Ubf

Ubg

Ob

χc

Ucde

Oc

χd Od

χeUeg

Oe

χfUfg

Of

χg Og

Figure 7: A straightforward circuit for the example.

41

bce

abc

cde

beg bfg

Figure 8: A tree decomposition for G

χa

χc

χb

χe

χc

χd

Uab

Ucde

Uac

Od

Oa

Mc

Ucb

Oc

χgUeg

UbeOe

χfUbf

UfgUbg

Ob

Of

Og

Figure 9: A tensor network for the example. For brevity, χb represents the state tensor vecχb,Uxy represents a gate tensor Uxy ⊗ Uxy, and Oz represents the operator tensor Oz ⊗Oz.

example, separate the graph into two overlapping halves, a, b, c, d, e and b, e, f, g. Wecan simulate the qudits and gates on a, b, c, d, e, measure a, c, d leaving the overlap,b, e, and then introduce f, g and finish the circuit. The density matrix never has morethan five qudits by this strategy, but otherwise does the same operations as a full seven quditsimulation. In fact, we can make do with only four qudits at a time, e.g., in groups like this

a, b, c, b, c, d, e, b, e, g, b, f, g.

This grouping is a tree decomposition of G, and in fact we can do even better with thewidth-2 decomposition in Figure 8.

We turn the tree decomposition into a tensor network, Figure 9. Each node becomes asubnetwork, shown as a dotted box in the figure. Recall that we justify the correctness ofthe network by pushing merge gadgets to the beginning (left, in Figure 9) of the network,and eliminating them. Without merge tensors, there is one line through the network foreach qudit (like a circuit), and it is easy to check that all the gates appear. Correctness ofN follows.

42

Sample

ΠB

C1 C2

ΠC1σC1 ΠC2

σC2

σB

Oc

Ubc

Mc

Figure 10: The second traversal at a node B: (1) ΠB is given, (2) sample outcomes for quditsmeasured in the node, (3) compute ΠC1 , (4) recurse on C1 and get σC1 , (5) compute ΠC2 ,(6) recurse on C2 and get σC2 , (7) compute σB and return.

43

Now we just need to contract the tensor network and sample outcomes, as described inLemma 8. Recall the tensors computed by the algorithm.

• ρB is the contraction of the subtree rooted at B before any measurements, with allqudits traced out except those in B itself.

• σB is the contraction of the subtree rooted at B after all measurements in the subtreehave been made.

• ΠB is the contraction of the subtree “above” B, i.e., everything in the network exceptthe subtree. All qudits not in B are traced out or measured, depending on what partsof the tree the algorithm traversed before B.

We compute all the ρB in a bottom-up traversal, i.e., given ρC1 and ρC2 for the children,trace out any qudits not in B, contract the tensors with N , and the result is ρB.

The more interesting half is the second traversal from the top down. Figure 10 shows theflow of this phase of the algorithm. We are given ΠB, representing the state of the tree aboveus. Using ρB and ΠB, we can get the state of the qudits in B, and sample measurementoutcomes for them. Then we recurse on the children, but need to provide ΠC1 and ΠC2

respectively. So, we use ΠB, the outcomes we just sampled, and ρC2 to compute ΠC1 , andrecurse on the first child. We get σC1 back, and combine with ΠB and the samples to getΠC2 . We recurse on C2 and get σC2 back. Then we use σC1 , σC2 , and the outcomes within B,to compute σB, the post-measurement state of the subtree (restricted to B). This completesthe traversal of B (and the subtree TB).

C Simplifying level-2 QAOA variational optimization

Let G = (V,E) be a graph with n vertices,

B =∑p∈V

Xp and C =∑p,q∈E

Jp,qZpZq.

For level-2 QAOA, we would like to set β1, β2, γ1, γ2 to maximize the energy

E(β1, β2, γ1, γ2) = 〈+n|W †CW |+n〉 ,

where W = e−iβ2Be−iγ2Ce−iβ1Be−iγ1C is the QAOA circuit with two entangling layers.Our goal for this section will be to show that if you fix the values of β1, γ1 and γ2, then

the value of β2 that maximizes the energy E can be computed analytically using the energyat three distinct values of β2.

To show this, let us first focus on the contribution to the energy by single ZsZt term ofC. Since e−iβ2B is a product of single-qubit operations, the conjugation of ZsZt by e−iβ2B

only depends on the gates on qubits s and t. We get

eiβ2BZsZte−iβ2B = eiβ2(Xs+Xt)ZsZte

−iβ2(Xs+Xt) = e2iβ2(XI+IX)ZZ

=(cos2(2β2)II + i sin(2β2) cos(2β2)(IX +XI)− sin2(2β2)XX

)ZZ

=1 + cos(4β2)

2ZZ +

sin(4β2)

2(ZY + Y Z) +

1− cos(4β2)

2Y Y.

44

By linearity, this implies that we can write the expectation as

E(β1, β2, γ1, γ2) = a cos(4β2) + b sin(4β2) + c

for some real coefficients a, b, and c which are complicated functions of β1, γ1, and γ2. We cancompute these coefficients exactly using only three evaluations of the expectation function:

E(β1, π/8, γ1, γ2) = b+ c

E(β1,−π/8, γ1, γ2) = −b+ c

E(β1, 0, γ1, γ2) = a+ c.

So, a, b, and c can be calculated easily by solving the system of equations. Once the valuesare known, we can calculate the max energy over β2 using some elementary trigonometry:

maxβ2

E(β1, β2, γ1, γ2) = c+√a2 + b2

where the optimal β2 is found by solving

tan(4β2) = b/a

a cos(4β2) ≥ 0

b sin(4β2) ≥ 0.

As a final remark, we note that this calculation also shows that the optimal value for β2 isbetween [0, π/2) since arctan(b/a) ∈ [0, 2π).

D Simulation of level-2 QAOA for low-degree graphs

Let G = (V,E) be a graph with n vertices,

B =∑p∈V

Xp and C =∑p,q∈E

Jp,qZpZq.

Consider some fixed edge s, t ∈ E. Our goal is to compute a quantum mean value

µ = 〈+n|W †ZsZtW |+n〉, (41)

where W = e−iβ2Be−iγ2Ce−iβ1Be−iγ1C is the QAOA circuit with two entangling layers. In thissection we give two classical algorithms A′ and A′′ that compute the mean value µ using theSchrodinger and the Heisenberg pictures respectively. The runtime of these algorithms scalesexponentially with the size of certain local neighborhoods of the vertices s and t defined asfollows. Suppose M ⊆ V is a subset of vertices and r ≥ 0 is an integer. Let Nr(M) bethe set of all vertices i ∈ V such that the graph distance between i and M is at most r.Let nr(M) = |Nr(M)|. The runtime of our algorithms depends on n2(s), n2(t), n1(s, t), andn2(s, t). For example, suppose each vertex of G has at most d neighbors. Then n2(j) ≤ 1+d2

for any vertex j, n1(s, t) ≤ 2d, and n2(s, t) ≤ 2(d2− d+ 1). These bounds are tight if G is ad-regular tree.

45

The unitary operator W can be implemented by a quantum circuit with two layersof single-qubit X-rotations and two entangling layers composed of nearest-neighbor ZZ-rotations. Such circuit can propagate information from a qubit j ∈ V only within thelightcone N2(j). Accordingly, the mean value µ can be computed by restricting the circuitW onto the lightcone N2(s, t). The restricted version of W acting on n2(s, t) qubits containsO(n2(s, t)) single-qubit gates and two diagonal unitary operators describing a time evolutionunder the restricted version of C. Such circuit can be simulated using the standard methodsin time O(n2(s, t)2n2(s,t)). Once the state W |+n〉 has been obtained, computing the expectedvalue of ZsZt on this state takes time O(2n2(s,t)). Thus a naive algorithm for computing µthat exploits the lightcone structure has runtime O(n2(s, t)2n2(s,t)). Below we show how toimprove upon this naive algorithm.

Lemma 11. There exist classical algorithms A′ and A′′ that compute the mean value Eq. (41)in time

T ′ = O(n2(s)2n2(s) + n2(t)2n2(t)) and T ′′ = O(n1(s, t)4n1(s,t) + n2(s, t)3n1(s,t)

)(42)

respectively.

For simplicity here we assume that arithmetic operations with real numbers and evalua-tion of the standard trigonometric functions can be performed with an infinite precision ina unit time. Under this assumption the algorithms A′ and A′′ compute the mean value µexactly. For a given problem instance one can estimate the runtimes using Eq. (42) and pickthe algorithm with the smallest runtime. In the rest of this section we prove Lemma 11 byexplicitly describing the two algorithms.Algorithm A′: Consider a vertex j ∈ s, t. Write C = Cloc(j) +Celse(j), where Cloc(j) is thesum of all terms Jp,qZpZq with p, q ∈ N2(j) and Celse(j) is the sum of all remaining terms.Since all terms in C pairwise commute, the standard lightcone argument shows that

W †ZjW = Wloc(j)†ZjWloc(j)

whereWloc(j) = e−iβ2Be−iγ2Cloc(j)e−iβ1Be−iγ1Cloc(j).

Note that Wloc(j) acts non-trivially only on N2(j). Thus W †ZjW |+n〉 = |ψj〉 ⊗ |+n−n2(j)〉,where

|ψj〉 = Wloc(j)†ZjWloc(j)|+n2(j)〉

is a state supported in N2(j). One can compute a lookup table of the function Cloc(j) :0, 1n2(j) → R in time O(n2(j)2n2(j)) by iterating over the set 0, 1n2(j) using Gray code.Indeed, since Cloc(j) is a quadratic function, updating its value upon flipping a single bit ofthe input bit string takes time O(n2(j)). The quantum circuit Wloc(j)

†ZjWloc(j) containsO(n2(j)) single-qubit X-rotations and O(1) diagonal unitary operators implementing thetime evolution under Cloc(j). The latter can be simulated in time O(2n2(j)) since the lookuptable of Cloc(j) is available. Each single-qubit X-rotation can be simulated in time O(2n2(j))

46

using a sparse matrix-vector multiplication. Thus the full vector specifying the state |ψj〉can be computed in time O(n2(j)2n2(j)). We have

µ = 〈+n|(W †ZsW )(W †ZtW )|+n〉 = 〈ψ′s|ψ′t〉

where |ψ′j〉 is a state obtained from |ψj〉 by projecting all qubits i /∈ N2(s) ∩ N2(t) onto the|+〉 state. Once a qubit is projected onto the |+〉 state, it is discarded. The state |ψ′j〉 can

be computed in time O(2n2(j)), as follows from the proposition below.

Proposition 2. Given a state |ψ〉 ∈ (C2)⊗n and an integer k ∈ 1, 2, . . . , n, let |ψ′〉 ∈(C2)⊗n−k be a (possibly unnormalized) state obtained from |ψ〉 by projecting the first k qubitsonto the |+〉 state, that is,

〈y|ψ′〉 = 2−k/2∑

x∈0,1k〈x, y|ψ〉, y ∈ 0, 1n−k. (43)

Then one can compute |ψ′〉 in time O(2n).

Proof. If k = 1 then compute all amplitudes 〈y|ψ′〉 by iterating over y ∈ 0, 1n−1 and usingEq. (43). This takes time O(2n). Applying the same step inductively k times takes time

k∑i=1

O(2n−i) ≤ O(2n)∞∑i=1

2−i = O(2n).

Here we noted that the number of qubits is reduced by one at each step.

Computing the inner product µ = 〈ψ′s|ψ′t〉 takes time 2|N2(s)∩N2(t)| which is negligiblecompared with the earlier steps of the algorithm. The overall runtime of algorithm A′ istherefore T ′ = O(n2(s)2n2(s) + n2(t)2n2(t)).Algorithm A′′: Our starting point is the expression for the mean value µ derived in Sec-tion 5.1, namely

µ = Re [2µ(0000) + 4µ(0010) + 4µ(0001) + 2µ(0011) + 2µ(0110) + 2µ(0101)] , (44)

where

µ(v) = 〈v1v2|U |v3v4〉 · 〈+n|eiγ1CO1(v)⊗O2(v)⊗ · · · ⊗On(v)e−iγ1C |+n〉, (45)

U = eiγ2Js,tZ⊗Zeiβ2(X⊗I+I⊗X)Z ⊗ Ze−iβ2(X⊗I+I⊗X)e−iγ2Js,tZ⊗Z , (46)

and Oj(v) are single-qubit operators defined by

Os(v) = eiβ1X |v1〉〈v3|e−iβ1X , (47)

Ot(v) = eiβ1X |v2〉〈v4|e−iβ1X , (48)

Op(v) = eiβ1Xeiγ2(Js,p(−1)v1+Jt,p(−1)v2−Js,p(−1)v3−Jt,p(−1)v4 )Ze−iβ1X (49)

47

for p /∈ s, t. Below we use shorthand notations Nr ≡ Nr(s, t) and nr = |Nr(s, t)|, wherer = 1, 2. We begin by observing that

Op(v) = I for all p /∈ N1. (50)

Indeed, if p /∈ N1 then Js,p = Jt,p = 0 and Op(v) = eiβ1Xe−iβ1X = I, see Eq. (49). Fix abit string v ∈ 0, 14 and let ON1 be the tensor product of all operators Op(v) with p ∈ N1.From Eqs. (45,50) one gets µ(v) = 〈v1v2|U |v3v4〉η(v), where

η(v) = 〈+n|eiγ1C(ON1 ⊗ Ielse)e−iγ1C |+n〉. (51)

It suffices to show that η(v) can be computed in time T ′′ defined in Eq. (42). The standardlightcone argument shows that η(v) depends only on the subgraph GN2 induced by N2 andthe restricted cost function that includes only the terms Jp,qZpZq with p, q ∈ N2. To easethe notations, below we ignore all the remaining terms and assume that G = GN2 , n = n2,and C =

∑p,q∈N2

Jp,qZpZq. Let us write

C = Cloc + Cent + Celse, (52)

where Cloc includes all terms Jp,qZpZq with p, q ∈ N1, Cent includes all terms Jp,qZpZq withp ∈ N1, q /∈ N1, and Celse includes all the remaining terms. Using the fact that all terms inC pairwise commute and that Celse commutes with ON1 ⊗ Ielse, one can rewrite Eq. (51) as

η(v) = Tr(ON1e

−iγ1Clocρ eiγ1Cloc)

(53)

where ρ is a (mixed) n1-qubit state defined as

ρ = TrN c1

(e−iγ1Cent |+n〉〈+n|eiγ1Cent

). (54)

Here N c1 is the complement of N1 and TrN c

1denotes the partial trace over N c

1 .

Proposition 3. There exists a classical algorithm that computes the matrix of ρ in thestandard basis in time O(n14n1 + n3n1).

Proof. Assume wlog that N1 = 1, 2, . . . , n1. Let m = |N c1 | and p(j) be the j-th qubit of

N c1 , where j = 1, . . . ,m. Define n1-qubit states ρ0, ρ1, . . . , ρm such that ρ0 = |+n1〉〈+n1 | and

ρj = Tranc

[e−iγ1

∑q∈N1

Jp(j),qZqZanc (ρj−1 ⊗ |+〉〈+|) eiγ1∑

q∈N1Jp(j),qZqZanc

](55)

for all j = 1, 2, . . . ,m. Here we consider a system of n1 + 1 qubits with the first n1 qubitsdescribing N1 and the last qubit serving as ancilla. The notation Tranc means the partialtrace over the ancilla. We claim that ρm = ρ is the state defined in Eq. (54). Indeed, onecan consider the righthand side of Eq. (54) as the output state of a quantum channel thattakes as input the state ρ0 = |+n1〉〈+n1|, introduces m = n− n1 ancillary qubits initializedin the |+〉〈+| state, couples N1 with the ancillary qubits by e−iγ1Cent , and finally traces outthe ancillas. Since all terms in Cent pairwise commute, the same quantum channel can beimplemented sequentially such that the j-th step introduces a single ancilla qubit p(j) ∈ N c

1

48

initialized in the state |+〉〈+|, applies all interactions in Cent that act non-trivially on p(j),and traces out p(j). This is described by Eq. (55).

A simple algebra shows that Eq. (55) is equivalent to

〈x|ρj|y〉 = 〈x|ρj−1|y〉 · cos

(2γ1

∑q∈N1

Jp(j),q(xq − yq)

)(56)

for all x, y ∈ 0, 1n1 . Recalling that ρ0 = |+n1〉〈+n1| and ρ = ρm one arrives at

〈x|ρ|y〉 = 2−n1

m∏j=1

cos

(2γ1

∑q∈N1

Jp(j),q(xq − yq)

). (57)

Define a function f : −1, 0,+1n1 → R such that

f(z) = 2−n1

m∏j=1

cos

(2γ1

∑q∈N1

Jp(j),qzq

).

From Eq. (57) one infers that 〈x|ρ|y〉 = f(x− y) for all x, y ∈ 0, 1n1 . Let us compute thematrix of ρ by iterating over strings z ∈ −1, 0,+1n1 using ternary Gray code. In otherwords, each iteration changes only one component of z. We shall maintain a list of values ofm linear functions

`j(z) = 2γ1

∑q∈N1

Jp(j),qzq, j = 1, 2, . . . ,m.

Updating the value of each function `j(z) upon changing a single component zq takes aconstant time. Thus updating the value of f(z) = 2−n1

∏mj=1 cos (`j(z)) takes time O(m).

The number of pairs x, y ∈ 0, 1n1 such that z = x − y is equal to 2w(z), where w(z) isthe number of zeros in z. Identifying all such pairs x, y and recording the value f(z) to thematrix elements 〈x|ρ|y〉 takes time O(n12w(z)) for a fixed z. We conclude that computingthe full matrix of ρ takes time

O(m3n1) +∑

z∈−1,0,+1n1

O(n12w(z)) = O(m3n1 + n14n1) = O(n3n1 + n14n1).

In the last equality we noted that m = n− n1 ≤ n.

From Eq. (53) one gets η(v) = Tr(ON1 ρ), where

ρ = e−iγ1Clocρ eiγ1Cloc .

The next step is to compute the matrix of ρ in the standard basis. We have

〈x|ρ|y〉 = 〈x|ρ|y〉 · e−iγ1〈x|Cloc|x〉+iγ1〈y|Cloc|y〉. (58)

A lookup table of the function 〈x|Cloc|x〉 with x ∈ 0, 1n1 can be computed in time O(n12n1)by iterating over x’s using Gray code. Then one can compute the matrix of ρ using Eq. (58)in time O(4n1) by the brute force method.

49

It remains to compute η(v) = Tr(ON1 ρ). We claim that this can be done in time O(4n1),assuming that the matrix of ρ is already available. Indeed, let N1 = AB, where A and B aresubsets of size at most (n1 + 1)/2. Let OA and OB be the tensor products of the single-qubitoperators Op(v) with p ∈ A and p ∈ B respectively. Then ON1 = OA ⊗ OB. Compute thematrix of OA and OB in the standard basis. This takes time O(|A|4|A|+ |B|4|B|) = O(n12n1)if one uses the brute force method. Finally, we have

η(v) = Tr(ON1 ρ) =∑

α,α′∈0,1|A|

∑β,β′∈0,1|B|

〈α′|OA|α〉 · 〈β′|OB|β〉 · 〈α, β|ρ|α′, β′〉.

This sum can be computed by the brute force method in time O(4|A|+|B|) = O(4n1) since thematrices of OA, OB, and ρ are already available.

Summing up the runtime of each step in the algorithm and recalling that n = n2 afterrestricting the problem to the subgraph induced by N2 gives the overall runtime T ′′ inEq. (42).

Comment: Numerical data for RQAOA reported in Section 5.3 was generated by combin-ing three algorithms for computing expected values of observables ZZ on the QAOA states:the forrelation-based algorithm described in Section 4 and Section 5.1, algorithm A′, and asimplified version of algorithm A′′. The latter computes the matrix ρ defined in Eq. (57)in time O(n2(s, t)4n1(s,t)) by iterating over bit strings x, y in Eq. (57) using the binary Graycode. This is only a minor slowdown compared with the algorithm based on ternary Graycode described above which computes ρ in time O(n1(s, t)4n1(s,t)+n2(s, t)3n1(s,t)). AlgorithmsA′ or A′′ were selected only if their estimated runtime is below a cutoff value of 0.1 second.Otherwise, the forrelation-based algorithm was selected.

50

E Tree decomposition algorithm

Algorithm 1 Tree decomposition of an outerplanar graph [25]

1: function OuterplanarTD(G)2: if all vertices of G have degree zero then3: return T = (V, ∅), B = Bv = v : v ∈ V 4: else if G has a degree-1 vertex then5: Choose any degree-1 vertex v ∈ V .6: Let u ∈ V be the unique neighbor of v.7: V ′ ← V \ v8: E ′ ← E \ v, u9: G′ ← (V ′, E ′)

10: T = (W,F ),B = Bi : i ∈ VT ← OuterplanarTD(G′)11: Find a node i ∈ W such that u ∈ Bi.12: W ′ ← W ∪ α13: Bα ← u, v14: F ′ ← F ∪ i, α return T ′ = (W ′, F ′), B′ = B ∪Bα

15: else if G has a degree-2 vertex then16: Choose any degree-2 vertex v ∈ V .17: Let u,w ∈ V be the two distinct neighbors of v.18: V ′ ← V \ v19: E ′ ← (E \ v, u, v, w) ∪ u,w20: G′ ← (V ′, E ′)21: T = (W,F ),B = Bi : i ∈ VT ← OuterplanarTD(G′)22: Find a node i ∈ W such that w ∈ Bi and u ∈ Bi.23: W ′ ← W ∪ α24: Bα ← u, v, w25: F ′ ← F ∪ i, α return T ′ = (W ′, F ′), B′ = B ∪Bα

26: end if27: end function

F Hardness of approximation for a small relative error

In the following we consider the problem of approximating a forrelation Φ(f, g) to within agiven relative error, where f, g are two-local functions.

Firstly, let us see how to establish hardness of approximation for graph-based forrelationsarising from stabilizer states [34]. As was shown in [17], for any n-qubit stabilizer state |ψ〉where exists an n-vertex graph G = (V,E) and single-qubit Clifford operators C1, . . . , Cnsuch that

|ψ〉 = (C1 ⊗ C2 ⊗ · · · ⊗ Cn)UfH⊗n|0n〉, f(x) =

∏u,v∈E

(−1)xuxv .

Clearly, f is a two-local function on G. Thus the expected value of any tensor productoperator on any n-qubit stabilizer state can be expressed as a graph-based forrelation for

51

a suitable n-vertex graph and two-local functions f = g. As a corollary, approximating agraph-based forrelation with a small relative error is #P-hard, even on a two-dimensionalgrid graph G. This follows from the facts that (a) any output probability of a quantumcircuit can be expressed as the expected value of a tensor product observable on a 2D clusterstate using measurement-based quantum computing [18], and (b) estimating the outputprobabilities of quantum circuits with a small relative error is #P-hard [35].

In the remainder of this section we show that this hardness persists for the standardforrelation problem with Oj = H for all j. In the following we write S1 for the unit circle inthe complex plane.

To this end, we consider so-called IQP circuits [19]. In Ref.[19] it is shown that given anN -qubit, m-gate quantum circuit C we may efficiently compute n ≤ N +m and a two-localfunction h : 0, 1n → S1 on a two-dimensional grid graph such that

〈0n|H⊗nUhH⊗n|0n〉 =√

2n−N〈0N |C|0N〉.

This is used in Ref.[19] to show that postselected IQP circuits on planar graphs are as pow-erful as postBQP. Hardness of approximation for IQP circuit amplitudes then follows fromthe well known fact that a quantum circuit amplitude 〈0N |C|0N〉 is #P-hard to approximateto a given relative error.

Lemma 12 ([19]). Given a two-local function h : 0, 1n → S1 on a planar graph G andε > 0, it is #P-hard to compute an approximation Z such that

(1− ε)〈0n|H⊗nUhH⊗n|0n〉 ≤ Z ≤ (1 + ε)〈0n|H⊗nUhH⊗n|0n〉.

To establish the following lemma, we show that approximating forrelation of two-localfunctions on a planar graph is as hard as approximating IQP circuit amplitudes.

Lemma 13. Given ε > 0 and two-local functions f, g : 0, 1n → S1 on a planar graph G,it is #P-hard to compute an approximation Φ such that

(1− ε)Φ(f, g) ≤ Φ ≤ (1 + ε)Φ(f, g).

Proof. Suppose we are given a two-local function h : 0, 1n → S1 on a planar graph G.Below we (efficiently) construct two-local functions f, g : 0, 1n → S1 on G such that

〈0n|H⊗nUfH⊗nUgH⊗n|0n〉 = 〈0n|H⊗nUhH⊗n|0n〉. (59)

The claimed hardness of approximation then follows immediately from Lemma 12.To prove Eq. (59) we use the one-qubit identity

e−iπ/4SHS = HS†H, (60)

where S = diag(1, i) is the phase gate. Define functions f, g such that Ug = (S†)⊗n andUf = eiπn/4UhUg. Note that both f and g are two-local on G. Then

〈0n|H⊗nUfH⊗nUgH⊗n|0n〉 = e−iπn/4〈0n|H⊗nUfS⊗nH⊗n|0n〉 = 〈0n|H⊗nUhH⊗n|0n〉,

52

where in the first equality we used Eq. (60). This establishes Eq. (59) and completes theproof.

The graph G in lemmas 12 and 13 may be taken to be a two-dimensional grid graphwithout loss of generality.

53