
The Value Function Polytope in Reinforcement Learning

Robert Dadashi 1 Adrien Ali Taıga 1 2 Nicolas Le Roux 1 Dale Schuurmans 1 3 Marc G. Bellemare 1

Abstract

We establish geometric and topological properties of the space of value functions in finite state-action Markov decision processes. Our main contribution is the characterization of the nature of its shape: a general polytope (Aigner et al., 2010). To demonstrate this result, we exhibit several properties of the structural relationship between policies and value functions, including the line theorem, which shows that the value functions of policies constrained on all but one state describe a line segment. Finally, we use this novel perspective to introduce visualizations to enhance the understanding of the dynamics of reinforcement learning algorithms.

1. Introduction

The notion of value function is central to reinforcement learning (RL). It arises directly in the design of algorithms such as value iteration (Bellman, 1957), policy gradient (Sutton et al., 2000), policy iteration (Howard, 1960), and evolutionary strategies (e.g. Szita & Lőrincz, 2006), which either predict it directly or estimate it from samples, while also seeking to maximize it. The value function is also a useful tool for the analysis of approximation errors (Bertsekas & Tsitsiklis, 1996; Munos, 2003).

In this paper we study the map π ↦ V^π from stationary policies, which are typically used to describe the behaviour of RL agents, to their respective value functions. Specifically, we vary π over the joint simplex describing all policies and show that the resulting image forms a polytope, albeit one that is possibly self-intersecting and non-convex.

We provide three results, all based on the notion of "policy agreement", whereby we study the behaviour of the map π ↦ V^π as we only allow the policy to vary at a subset of all states.

1 Google Brain  2 Mila, Université de Montréal  3 Department of Computing Science, University of Alberta. Correspondence to: Robert Dadashi <[email protected]>.

Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

Line theorem. We show that policies that agree on all but one state generate a line segment within the value function polytope, and that this segment is monotone (all state values increase or decrease along it).

Relationship between faces and semi-deterministic policies. We show that d-dimensional faces of this polytope are mapped one-to-many to policies which behave deterministically in at least d states.

Sub-polytope characterization. We use this result to generalize the line theorem to higher dimensions, and demonstrate that varying a policy along d states generates a d-dimensional sub-polytope.

Although our "line theorem" may not be completely surprising or novel to expert practitioners, we believe we are the first to highlight its existence. In turn, it forms the basis of the other two results, which require additional technical machinery which we develop in this paper, leaning on results from convex analysis and topology.

While our characterization is interesting in and of itself, it also opens up new perspectives on the dynamics of learning algorithms. We use the value polytope to visualize the expected behaviour and pitfalls of common algorithms: value iteration, policy iteration, policy gradient, natural policy gradient (Kakade, 2002), and finally the cross-entropy method (De Boer et al., 2004).

Figure 1. Mapping between policies and value functions.

2. Preliminaries

We are in the reinforcement learning setting (Sutton & Barto, 2018). We consider a Markov decision process M := 〈S, A, r, P, γ〉 with S the finite state space, A the finite action space, r the reward function, P the transition function, and γ the discount factor, for which we assume γ ∈ [0, 1). We denote the number of states by |S| and the number of actions by |A|.
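As a concrete illustration of this setup (not taken from Appendix A, whose MDPs are not reproduced here), a small two-state, two-action MDP can be written down with numpy arrays; the transition probabilities, rewards, and discount factor below are arbitrary choices of ours.

import numpy as np

n_states, n_actions = 2, 2
gamma = 0.9  # assumed discount factor, gamma in [0, 1)

# P[s, a, s'] are transition probabilities, r[s, a] expected rewards.
# These numbers are purely illustrative.
P = np.array([[[0.9, 0.1],
               [0.2, 0.8]],
              [[0.3, 0.7],
               [0.99, 0.01]]])
r = np.array([[0.1, -0.5],
              [1.0,  0.0]])

assert np.allclose(P.sum(axis=-1), 1.0)  # each P[s, a, :] is a distribution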

A stationary policy π is a mapping from states to distributions over actions; we denote the space of all policies by P(A)^S. Taken with the transition function P, a policy defines a state-to-state transition function P^π:

P^π(s′ | s) = Σ_{a∈A} π(a | s) P(s′ | s, a).

The value function V^π is defined as the expected cumulative reward from starting in a particular state and acting according to π:

V^π(s) = E_{P^π} [ Σ_{i=0}^∞ γ^i r(s_i, a_i) | s_0 = s ].

The Bellman equation (Bellman, 1957) connects the value function V^π at a state s with the value function at the subsequent states when following π:

V^π(s) = E_{P^π} [ r(s, a) + γ V^π(s′) ].    (1)

Throughout we will make use of vector notation (e.g. Puterman, 1994). Specifically, we view (with some abuse of notation) P^π as an |S| × |S| matrix, V^π as an |S|-dimensional vector, and write r^π for the vector of expected rewards under π. In this notation, the Bellman equation for a policy π is

V^π = r^π + γP^π V^π = (I − γP^π)^{−1} r^π.

In this work we study how the value function V^π changes as we continuously vary the policy π. As such, we will find it convenient to also view this value function as the functional

f_v : P(A)^S → R^S

π ↦ V^π = (I − γP^π)^{−1} r^π.

We will use the notation V^π when the emphasis is on the vector itself, and f_v when the emphasis is on the mapping from policies to value functions.
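The closed-form expression for f_v can be evaluated directly. The sketch below is ours (it assumes numpy; the random MDP, the rng seed, and the helper name value_of are illustrative choices): it builds P^π and r^π for a given policy and solves the linear system (I − γP^π) V^π = r^π.

import numpy as np

n_states, n_actions, gamma = 2, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
r = rng.uniform(-1.0, 1.0, size=(n_states, n_actions))            # r[s, a]

def value_of(pi):
    """f_v(pi): solve (I - gamma P^pi) V = r^pi for a policy pi[s, a]."""
    P_pi = np.einsum('sa,sat->st', pi, P)   # P^pi(s' | s)
    r_pi = np.einsum('sa,sa->s', pi, r)     # r^pi(s)
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

pi_uniform = np.full((n_states, n_actions), 1.0 / n_actions)
print(value_of(pi_uniform))                 # value function of the uniform policy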

Finally, we will use ≼ and ≽ for element-wise vector inequalities, and for a function f and a subset F of its domain we write f(F) to mean the image of F under f.

2.1. Polytopes in R^n

Central to our work will be the result that the image of the functional f_v applied to the space of policies forms a polytope, possibly nonconvex and self-intersecting, with certain structural properties. This section lays down some of the necessary definitions and notations. For a complete overview on the topic, we refer the reader to Grünbaum et al. (1967); Ziegler (2012); Brøndsted (2012).

We begin by characterizing what it means for a subset P ⊆ R^n to be a convex polytope or polyhedron. In what follows we write Conv(x_1, . . . , x_k) to denote the convex hull of the points x_1, . . . , x_k.

Definition 1 (Convex Polytope). P is a convex polytope iff there are k ∈ N points x_1, x_2, ..., x_k ∈ R^n such that P = Conv(x_1, . . . , x_k).

Definition 2 (Convex Polyhedron). P is a convex polyhe-dron iff there are k ∈ N half-spaces H1, H2, ..., Hk whoseintersection is P , that is

P = ∩ki=1Hk.

A celebrated result from convex analysis relates these two definitions: a bounded, convex polyhedron is a convex polytope (Ziegler, 2012).

The next two definitions generalize convex polytopes and polyhedra to non-convex bodies.

Definition 3 (Polytope). A (possibly non-convex) polytope is a finite union of convex polytopes.

Definition 4 (Polyhedron). A (possibly non-convex) polyhedron is a finite union of convex polyhedra.

We will make use of another, recursive characterization based on the notion that the boundaries of a polytope should be "flat" in a topological sense (Klee, 1959).

For an affine subspace K ⊆ R^n, V_x ⊂ K is a relative neighbourhood of x in K if x ∈ V_x and V_x is open in K. For P ⊂ K, the relative interior of P in K, denoted relint_K(P), is then the set of points in P which have a relative neighbourhood in K ∩ P. The notion of "open in K" is key here: a point that lies on an edge of the unit square does not have a relative neighbourhood in the square, but it has a relative neighbourhood in that edge. The relative boundary ∂_K P is defined as the set of points in P not in the relative interior of P, that is

∂_K P = P \ relint_K(P).

Finally, we recall that H ⊆ K is a hyperplane if H is an affine subspace of K of dimension dim(K) − 1.

Proposition 1. P is a polyhedron in an affine subspace K ⊆ R^n if

(i) P is closed;

(ii) There are k ∈ N hyperplanes H_1, .., H_k in K whose union contains the boundary of P in K: ∂_K P ⊂ ∪_{i=1}^{k} H_i; and

(iii) For each of these hyperplanes, P ∩ H_i is a polyhedron in H_i.

All proofs may be found in the appendix.

3. The Space of Value Functions

We now turn to the main object of our study, the space of value functions V. The space of value functions is the set of all value functions that are attained by some policy. As noted earlier, this corresponds to the image of P(A)^S under the mapping f_v:

V = f_v(P(A)^S) = { f_v(π) | π ∈ P(A)^S }.    (2)

As a warm-up, Figure 2 depicts the space V corresponding to four 2-state MDPs; each set is made of value functions corresponding to 50,000 policies sampled uniformly at random from P(A)^S. The specifics of all MDPs depicted in this work can be found in Appendix A.
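A Figure 2-style picture can be reproduced with a few lines of numpy: sample policies uniformly over the product of simplices P(A)^S (one Dirichlet(1, . . . , 1) draw per state) and map each through f_v. The sketch below is ours; the MDP is a random one of our own choosing (the paper's MDPs are specified in Appendix A), and we use 5,000 samples rather than 50,000 to keep it fast.

import numpy as np

n_states, n_actions, gamma = 2, 2, 0.9
rng = np.random.default_rng(1)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
r = rng.uniform(-1.0, 1.0, size=(n_states, n_actions))

def value_of(pi):
    P_pi = np.einsum('sa,sat->st', pi, P)
    r_pi = np.einsum('sa,sa->s', pi, r)
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

# Uniform sampling over P(A)^S: one Dirichlet(1, ..., 1) draw per state.
policies = rng.dirichlet(np.ones(n_actions), size=(5_000, n_states))
values = np.array([value_of(pi) for pi in policies])
# Scattering values[:, 0] against values[:, 1] traces out the value function polytope.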

[Figure 2 shows four panels a) to d), each plotting V^π(s_2) against V^π(s_1); the optimal value function V* is marked in each panel.]
Figure 2. Space of value functions for various two-state MDPs.

While the space of policies P(A)^S is easily described (it is the Cartesian product of |S| simplices), value function spaces arise as complex polytopes. Of note, they may be non-convex, justifying our more intricate definition.

In passing, we remark that the polytope gives a clear illustration of the following classic results regarding MDPs (e.g. Bertsekas & Tsitsiklis, 1996):

• (Dominance of V*) The optimal value function V* is the unique dominating vertex of V;

• (Monotonicity) The edges of V are oriented with the positive orthant;

• (Continuity) The space V is connected.

The next sections will formalize these and other, less-understood properties of the space of value functions.

3.1. Basic Shape from Topology

We begin with a first result on how the functional f_v transforms the space of policies into the space of value functions (Figure 1). Recall that

f_v(π) = (I − γP^π)^{−1} r^π.

Hence f_v is infinitely differentiable everywhere on P(A)^S (Appendix C). The following is a topological consequence of this property, along with the fact that P(A)^S is a compact and connected set.

Lemma 1. The space of value functions V is compact and connected.

The interested reader may find more details on this topological argument in (Engelking, 1989).

3.2. Policy Agreement and Policy Determinism

Two notions play a central role in our analysis: policy agreement and policy determinism.

Definition 5 (Policy Agreement). Two policies π_1, π_2 agree on states s_1, .., s_k ∈ S if π_1(· | s_i) = π_2(· | s_i) for each s_i, i = 1, . . . , k.

For a given policy π, we denote by Y^π_{s_1,...,s_k} ⊆ P(A)^S the set of policies which agree with π on s_1, . . . , s_k; we will also write Y^π_{S\{s}} to describe the set of policies that agree with π on all states except s. Note that policy agreement does not imply disagreement; in particular, π ∈ Y^π_{S′} for any subset of states S′ ⊆ S.

Definition 6 (Policy Determinism). A policy π is

(i) s-deterministic for s ∈ S if π(a | s) ∈ {0, 1}.

(ii) semi-deterministic if it is s-deterministic for at least one s ∈ S.

(iii) deterministic if it is s-deterministic for all states s ∈ S .

We will denote by D_{s,a} the set of semi-deterministic policies that take action a when in state s.

Lemma 2. Consider two policies π_1, π_2 that agree on s_1, . . . , s_k ∈ S. Then the vector r^{π_1} − r^{π_2} has zeros in the components corresponding to s_1, . . . , s_k, and the matrix P^{π_1} − P^{π_2} has zeros in the corresponding rows.

This lemma highlights that when two policies agree on a given state they have the same immediate dynamics at that state, i.e. they receive the same expected reward and have the same next-state transition probabilities. Lemma 3 in Section 3.3 will be a direct consequence of this property.
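Lemma 2 is straightforward to check numerically. In the sketch below (ours, with an arbitrary random MDP and numpy only), two policies agree on the first state and disagree on the second; the induced rewards and transition rows coincide on the agreeing state.

import numpy as np

n_states, n_actions, gamma = 2, 2, 0.9
rng = np.random.default_rng(2)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
r = rng.uniform(-1.0, 1.0, size=(n_states, n_actions))

pi1 = rng.dirichlet(np.ones(n_actions), size=n_states)
pi2 = pi1.copy()
pi2[1] = rng.dirichlet(np.ones(n_actions))       # disagree only on the second state

def induced(pi):
    return np.einsum('sa,sat->st', pi, P), np.einsum('sa,sa->s', pi, r)

(P1, r1), (P2, r2) = induced(pi1), induced(pi2)
assert np.isclose(r1[0], r2[0])                  # same expected reward on the agreeing state
assert np.allclose(P1[0], P2[0])                 # same transition row on the agreeing state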

3.3. Value Functions and Policy Agreement

We begin our characterization by considering the subsets of value functions that are generated when the action probabilities are kept fixed at certain states, that is: when we restrict the functional f_v to the set of policies that agree with some base policy π on these states.

Something special arises when we keep the probabilities fixed at all but one state s: the functional f_v draws a line segment which is oriented in the positive orthant (that is, one end dominates the other end). Furthermore, the extremes of this line segment can be taken to be s-deterministic policies. This is the main result of this section, which we now state more formally.

Theorem 1. [Line Theorem] Let s be a state and π a policy. Then there are two s-deterministic policies in Y^π_{S\{s}}, denoted π_l, π_u, which bracket the value of all other policies π′ ∈ Y^π_{S\{s}}:

f_v(π_l) ≼ f_v(π′) ≼ f_v(π_u).

Furthermore, the image of f_v restricted to Y^π_{S\{s}} is a line segment, and the following three sets are equivalent:

(i) f_v(Y^π_{S\{s}}),

(ii) { f_v(απ_l + (1 − α)π_u) | α ∈ [0, 1] },

(iii) { αf_v(π_l) + (1 − α)f_v(π_u) | α ∈ [0, 1] }.

The second part of Theorem 1 states that one can generate the set of value functions f_v(Y^π_{S\{s}}) in two ways: either by drawing the line segment in value space, from f_v(π_l) to f_v(π_u), or by drawing the line segment in policy space, from π_l to π_u, and then mapping to value space. Note that this result is a consequence of the Sherman-Morrison formula, which has been used in reinforcement learning for efficient sequential matrix inverse estimation (Bradtke & Barto, 1996). While somewhat technical, this characterization of the line segment is needed to prove some of our later results. Figure 3 illustrates the path drawn by interpolating between two policies that agree on state s_2.
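The line theorem can also be checked numerically. The sketch below is ours (random two-state MDP, numpy only): it varies the base policy at a single state by interpolating between the two s-deterministic extremes and verifies that the resulting value functions are collinear and componentwise monotone.

import numpy as np

n_states, n_actions, gamma = 2, 2, 0.9
rng = np.random.default_rng(3)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
r = rng.uniform(-1.0, 1.0, size=(n_states, n_actions))

def value_of(pi):
    P_pi = np.einsum('sa,sat->st', pi, P)
    r_pi = np.einsum('sa,sa->s', pi, r)
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

base = rng.dirichlet(np.ones(n_actions), size=n_states)
s = 0                                            # the only state allowed to vary
vals = []
for alpha in np.linspace(0.0, 1.0, 21):
    pi = base.copy()
    pi[s] = np.array([alpha, 1.0 - alpha])       # mixture of the two s-deterministic extremes
    vals.append(value_of(pi))
vals = np.array(vals)

# Collinearity (2-D cross-product test, valid since n_states == 2).
d = vals[-1] - vals[0]
rel = vals - vals[0]
assert np.allclose(rel[:, 0] * d[1] - rel[:, 1] * d[0], 0.0, atol=1e-8)
# Monotonicity: all state values move in one direction along the segment.
diffs = np.diff(vals, axis=0)
assert np.all(diffs >= -1e-8) or np.all(diffs <= 1e-8)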

Theorem 1 depends on two lemmas, which we now provide in turn. Consider a policy π and k states s_1, . . . , s_k, and write C^π_{k+1}, . . . , C^π_{|S|} for the columns of the matrix (I − γP^π)^{−1} corresponding to states other than s_1, . . . , s_k. Define the affine vector space

H^π_{s_1,...,s_k} = V^π + Span(C^π_{k+1}, . . . , C^π_{|S|}).

Figure 3. Illustration of Theorem 1. The orange points are the value functions of mixtures of policies that agree everywhere but one state.

Lemma 3. Consider a policy π and k states s_1, . . . , s_k. Then the value functions generated by Y^π_{s_1,..,s_k} are contained in the affine vector space H^π_{s_1,..,s_k}:

f_v(Y^π_{s_1,..,s_k}) = V ∩ H^π_{s_1,..,s_k}.

Put another way, Lemma 3 shows that if we fix the policy on k states, the induced space of value functions loses at least k degrees of freedom; specifically, it lies in an (|S| − k)-dimensional affine vector space.

For k = |S| − 1, Lemma 3 implies that the value functions lie on a line; however, the following is necessary to expose the full structure of V within this line.

Lemma 4. Consider the ensemble Y^π_{S\{s}} of policies that agree with a policy π everywhere but on s ∈ S. For π_0, π_1 ∈ Y^π_{S\{s}}, define the function g : [0, 1] → V by

g(µ) = f_v(µπ_1 + (1 − µ)π_0).

Then the following hold regarding g:

(i) g is continuously differentiable;

(ii) (Total order) g(0) ≼ g(1) or g(0) ≽ g(1);

(iii) If g(0) = g(1) then g(µ) = g(0) for all µ ∈ [0, 1];

(iv) (Monotone interpolation) If g(0) ≠ g(1) there is a ρ : [0, 1] → R such that g(µ) = ρ(µ)g(1) + (1 − ρ(µ))g(0), and ρ is a strictly monotonic rational function of µ.

The result (ii) in Lemma 4 was established in (Mansour & Singh, 1999) for deterministic policies. Note that in general, ρ(µ) ≠ µ in the above, as the following example demonstrates.

Example 1. Suppose S = {s_1, s_2}, with s_2 terminal and with no reward associated to it, and A = {a_1, a_2}. The transitions and rewards are defined by P(s_2 | s_1, a_2) = 1, P(s_1 | s_1, a_1) = 1, r(s_1, a_1) = 0, r(s_1, a_2) = 1. Define two deterministic policies π_1, π_2 such that π_1(a_1 | s_1) = 1, π_2(a_2 | s_1) = 1. We have

f_v((1 − µ)π_1 + µπ_2) = ( µ / (1 − γ(1 − µ)), 0 ).
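For concreteness, the closed form above follows directly from the Bellman equation (1); the short derivation below is ours and is not in the original text. Under the mixture π_µ := (1 − µ)π_1 + µπ_2, in state s_1 the agent takes a_2 (reward 1, transition to the terminal state s_2) with probability µ, and a_1 (reward 0, self-transition) with probability 1 − µ. Equation (1) then gives

V^{π_µ}(s_1) = µ · 1 + (1 − µ)(0 + γ V^{π_µ}(s_1)),   hence   V^{π_µ}(s_1) = µ / (1 − γ(1 − µ)),

and V^{π_µ}(s_2) = 0. Since g(0) = f_v(π_1) = (0, 0) and g(1) = f_v(π_2) = (1, 0), the interpolation coefficient of Lemma 4 is here ρ(µ) = µ / (1 − γ(1 − µ)), which differs from µ whenever γ > 0 and µ ∈ (0, 1).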

Remarkably, Theorem 1 shows that policies agreeing on all but one state draw line segments irrespective of the size of the action space; this may be of particular interest in the context of continuous action problems. Second, this structure is unique, in the sense that the paths traced by interpolating between two arbitrary policies may be neither linear nor monotonic (Figure 4 depicts two examples).

Figure 4. Value functions of mixtures of two policies in the general case. The orange points describe the value functions of mixtures of two policies.

3.4. Convex Consequences of Theorem 1

Some consequences arise immediately from Theorem 1. First, the result suggests a recursive application from the value function V^π of a policy π into its deterministic constituents.

Corollary 1. For any set of states s_1, .., s_k ∈ S and a policy π, V^π can be expressed as a convex combination of value functions of {s_1, .., s_k}-deterministic policies. In particular, V is included in the convex hull of the value functions of deterministic policies.

This result indicates a relationship between the vertices of V and deterministic policies. Nevertheless, we observe in Figure 5 that the value functions of deterministic policies are not necessarily the vertices of V, and that the vertices of V are not necessarily attained by value functions of deterministic policies.
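Corollary 1 can be illustrated numerically: enumerate the |A|^|S| deterministic policies, compute their value functions, and check that the value function of an arbitrary stochastic policy lies inside their convex hull. The sketch below is ours; it assumes numpy and scipy are available and that the four deterministic value functions are in general position (otherwise the Delaunay construction degenerates).

import itertools
import numpy as np
from scipy.spatial import Delaunay

n_states, n_actions, gamma = 2, 2, 0.9
rng = np.random.default_rng(4)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
r = rng.uniform(-1.0, 1.0, size=(n_states, n_actions))

def value_of(pi):
    P_pi = np.einsum('sa,sat->st', pi, P)
    r_pi = np.einsum('sa,sa->s', pi, r)
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

# Value functions of all |A|^|S| deterministic policies.
det_values = []
for choice in itertools.product(range(n_actions), repeat=n_states):
    pi = np.zeros((n_states, n_actions))
    pi[np.arange(n_states), choice] = 1.0
    det_values.append(value_of(pi))
det_values = np.array(det_values)

# A random stochastic policy's value function lies in their convex hull.
hull = Delaunay(det_values)
v = value_of(rng.dirichlet(np.ones(n_actions), size=n_states))
assert hull.find_simplex(v) >= 0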

Figure 5. Visual representation of Corollary 1. The space of value functions is included in the convex hull of value functions of deterministic policies (red dots).

The space of value functions is in general not convex. However, it does possess a weaker structural property regarding paths between value functions which is reminiscent of policy iteration-type results.

Corollary 2. Let V^π and V^π′ be two value functions. Then there exists a sequence of k ≤ |S| policies π_1, . . . , π_k such that V^π = V^{π_1}, V^{π′} = V^{π_k}, and for every i ∈ {1, . . . , k − 1} the set

{ f_v(απ_i + (1 − α)π_{i+1}) | α ∈ [0, 1] }

forms a line segment.

3.5. The Boundary of V

We are almost ready to show that V is a polytope. To do so, however, we need to show that the boundary of the space of value functions is described by semi-deterministic policies.

While at first glance reasonable given our earlier topological analysis, the result is complicated by the many-to-one mapping from policies to value functions, and requires additional tooling not provided by the line theorem. Recall from Lemma 3 the use of the affine vector space H^π_{s_1,..,s_k} to constrain the value functions generated by fixing certain action probabilities.

Theorem 2. Consider the ensemble of policies Y^π_{s_1,..,s_k} that agree with π on the states S = {s_1, .., s_k}. Suppose that for all s ∉ S and all a ∈ A there is no π′ ∈ Y^π_{s_1,..,s_k} ∩ D_{s,a} such that f_v(π′) = f_v(π). Then f_v(π) has a relative neighborhood in V ∩ H^π_{s_1,..,s_k}.

Theorem 2 demonstrates by contraposition that the boundary of the space of value functions is a subset of the ensemble of value functions of semi-deterministic policies. Figure 6 shows that the latter can be a proper subset.

Corollary 3. Consider a policy π ∈ P(A)^S, the states S = {s_1, .., s_k}, and the ensemble Y^π_{s_1,..,s_k} of policies that agree with π on s_1, .., s_k. Define V_y = f_v(Y^π_{s_1,..,s_k}). Then the relative boundary of V_y in H^π_{s_1,..,s_k} is included in the value functions spanned by policies in Y^π_{s_1,..,s_k} that are s-deterministic for some s ∉ S:

∂V_y ⊂ ∪_{s∉S} ∪_{a∈A} f_v(Y^π_{s_1,..,s_k} ∩ D_{s,a}),

where ∂ refers to ∂_{H^π_{s_1,..,s_k}}.

Figure 6. Visual representation of Corollary 3. The orange points are the value functions of semi-deterministic policies.

3.6. The Polytope of Value Functions

We are now in a position to combine the results of the previous section to arrive at our main contribution: V is a polytope in the sense of Def. 3 and Prop. 1. Our result is in fact stronger: we show that any subset of policies Y^π_{s_1,..,s_k} generates a sub-polytope of V.

Theorem 3. Consider a policy π ∈ P(A)^S, the states s_1, .., s_k ∈ S, and the ensemble Y^π_{s_1,..,s_k} of policies that agree with π on s_1, .., s_k. Then f_v(Y^π_{s_1,..,s_k}) is a polytope and, in particular, V = f_v(Y^π_∅) is a polytope.

Despite the evidence gathered in the previous section in favour of the above theorem, the result is surprising given the fundamental non-linearity of the functional f_v: again, mixtures of policies can describe curves (Figure 4), and even the mapping g in Lemma 4 is nonlinear in µ.

That the polytope can be non-convex is obvious from the preceding figures. As Figure 6 (right) shows, this can happen when value functions along two different line segments cross. At that intersection, something interesting occurs: there are two policies with the same value function but that do not agree on either state. We will illustrate the effect of this structure on learning dynamics in Section 5.

Finally, there is a natural sub-polytope structure in the space of value functions. If policies are free to vary only on a subset of states of cardinality k, then there is a polytope of dimension k associated with the induced space of value functions. This makes sense since constraining policies on a subset of states is equivalent to defining a new MDP, where the transitions associated with the complement of this subset of states are not dependent on policy decisions.

4. Related Work

The link between geometry and reinforcement learning has so far been fairly limited. However, we note the following prior uses of convex polyhedra:

Simplex Method and Policy Iteration. The policy iteration algorithm (Howard, 1960) closely relates to the simplex algorithm (Dantzig, 1948). In fact, when the number of states where the policy can be updated is at most one, it is exactly the simplex method, sometimes referred to as simple policy iteration. Despite the limitations of the simplex algorithm (Littman et al., 1995), namely worst-case convergence in exponential time, it was demonstrated that the simplex algorithm applied to MDPs with an adequate pivot rule converges in polynomial time (Ye, 2011).

Linear Programming. Finding the optimal value function of an MDP can be formulated as a linear program (Puterman, 1994; Bertsekas & Tsitsiklis, 1996; De Farias & Van Roy, 2003; Wang et al., 2007). In the primal form, the feasible constraints are defined by {V ∈ R^{|S|} | V ≥ T* V}, where T* is the optimality Bellman operator. Notice that there is a unique value function V ∈ V that is feasible, which is exactly the optimal value function V*.

The dual formulation consists of maximizing the expected return for a given initial state distribution, as a function of the discounted state-action visit frequency distribution. Contrary to the primal form, any feasible discounted state-action visit frequency distribution maps to an actual policy (Wang et al., 2007).

5. Dynamics in the Polytope

In this section we study how the behaviour of common reinforcement learning algorithms is reflected in the value function polytope. We consider two value-based methods, value iteration and policy iteration, three variants of the policy gradient method, and an evolutionary strategy.

Our experiments use the two-state, two-action MDP depicted elsewhere in this paper (details in Appendix A). Value-based methods are parametrized directly in terms of the value vector in R^2; policy-based methods are parametrized using the softmax distribution, with one parameter per state. We initialize all methods at the same starting value functions (indicated on Figure 7): near a vertex (V_{i_1}), near a boundary (V_{i_2}), and in the interior of the polytope (V_{i_3}).¹

We are chiefly interested in three aspects of the different algorithms' learning dynamics: 1) the path taken through the value polytope, 2) the speed at which they traverse the polytope, and 3) any accumulation points that occur along this path. As such, we compute model-based versions of all relevant updates; in the case of evolutionary strategies, we use large population sizes (De Boer et al., 2004).

¹The use of the softmax precludes initializing policy-based methods exactly at boundaries.


5.1. Value Iteration

Value iteration (Bellman, 1957) consists of the repeated application of the optimality Bellman operator T*:

V_{k+1} := T* V_k.

In all cases, V_0 is initialized to the relevant starting value function.

Figure 7. Value iteration dynamics for three initialization points.

Figure 7 depicts the paths in value space taken by value iteration, from the starting point to the optimal value function. We observe that the path does not remain within the polytope: value iteration generates a sequence of vectors that may not map to any policy. Our visualization also highlights results by Bertsekas (1994) showing that value iteration spends most of its time along the constant (1, 1) vector, and that the "real" convergence rate is governed by the second largest eigenvalue of P.
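To make these dynamics concrete, the following is a minimal NumPy sketch of value iteration on the two-state, two-action MDP used in this section (rewards and transitions as listed in Appendix A); the variable names and the fixed iteration budget are our own choices rather than part of the experimental setup:

import numpy as np

# Two-state, two-action MDP of Section 5 (Appendix A), gamma = 0.9.
# Rows of P and entries of r are indexed by state-action pairs: i * |A| + j.
gamma = 0.9
r = np.array([-0.45, -0.1, 0.5, 0.5])
P = np.array([[0.7, 0.3], [0.99, 0.01],
              [0.2, 0.8], [0.99, 0.01]])
n_states, n_actions = 2, 2

def bellman_optimality(V):
    # Apply the optimality Bellman operator T* to a value vector V.
    q = r + gamma * P @ V                          # one Q-value per (s, a) pair
    return q.reshape(n_states, n_actions).max(axis=1)

V = np.zeros(n_states)                             # V_0: any starting vector works
for _ in range(200):
    V = bellman_optimality(V)                      # V_{k+1} := T* V_k

The iterates V_k are arbitrary vectors of R^2 and, as noted above, need not be the value function of any policy.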

5.2. Policy Iteration

Policy iteration (Howard, 1960) consists of the repeated application of a policy improvement step and a policy evaluation step until convergence to the optimal policy. The policy improvement step updates the policy by acting greedily according to the current value function; the value function of the new policy is then evaluated. The algorithm is based on the following update rule

π_{k+1} := greedy(V_k)
V_{k+1} := evaluate(π_{k+1}),

with V_0 initialized as in value iteration.
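For comparison, a sketch of policy iteration on the same MDP, where evaluate solves the linear system (I − γP^π)V^π = r^π exactly, matching the model-based setting of our experiments; the names and the restriction to deterministic policies are ours:

import numpy as np

gamma, n_states, n_actions = 0.9, 2, 2
r = np.array([-0.45, -0.1, 0.5, 0.5])
P = np.array([[0.7, 0.3], [0.99, 0.01], [0.2, 0.8], [0.99, 0.01]])

def evaluate(pi):
    # Exact evaluation of a deterministic policy: V = (I - gamma * P_pi)^(-1) r_pi.
    idx = np.arange(n_states) * n_actions + pi     # (s, a) row selected in each state
    return np.linalg.solve(np.eye(n_states) - gamma * P[idx], r[idx])

def greedy(V):
    # Greedy improvement with respect to the value vector V.
    q = (r + gamma * P @ V).reshape(n_states, n_actions)
    return q.argmax(axis=1)

pi = np.zeros(n_states, dtype=int)                 # arbitrary initial deterministic policy
while True:
    V = evaluate(pi)                               # policy evaluation step
    new_pi = greedy(V)                             # policy improvement step
    if np.array_equal(new_pi, pi):
        break                                      # policy stable: optimal
    pi = new_pi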

Figure 8. Policy iteration. The red arrows show the sequence of value functions (blue) generated by the algorithm.

The sequence of value functions visited by policy iteration (Figure 8) corresponds to value functions of deterministic policies, which in this specific MDP correspond to vertices of the polytope.

5.3. Policy Gradient

Policy gradient is a popular approach for directly optimizing the value function via parametrized policies (Williams, 1992; Konda & Tsitsiklis, 2000; Sutton et al., 2000). For a policy π_θ with parameters θ, the policy gradient is

∇_θ J(θ) = E_{s∼d_π, a∼π(·|s)} [ ∇_θ log π(a | s) ( r(s, a) + γ E V(s′) ) ],

where d_π is the discounted stationary distribution; here we assume a uniformly random initial distribution over the states. The policy gradient update is then (η ∈ [0, 1])

θ_{k+1} := θ_k + η ∇_θ J(θ_k).
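The following sketch implements this update in its model-based form on the MDP of this section. With two actions, "one parameter per state" is realized here as a sigmoid probability of the first action, which is our simplification of the softmax parametrization; d_π and Q^π are computed exactly, and the constant 1/(1 − γ) is folded into the step size η:

import numpy as np

gamma, nS, nA = 0.9, 2, 2
r = np.array([-0.45, -0.1, 0.5, 0.5])
P = np.array([[0.7, 0.3], [0.99, 0.01], [0.2, 0.8], [0.99, 0.01]])

def policy(theta):
    # pi(a_0 | s) = sigmoid(theta[s]): one parameter per state.
    p0 = 1.0 / (1.0 + np.exp(-theta))
    return np.stack([p0, 1.0 - p0], axis=1)        # shape (nS, nA)

def exact_quantities(pi):
    # Exact V^pi, Q^pi and discounted state distribution d_pi (uniform start).
    P_pi = np.array([pi[s] @ P[s * nA:(s + 1) * nA] for s in range(nS)])
    r_pi = np.array([pi[s] @ r[s * nA:(s + 1) * nA] for s in range(nS)])
    V = np.linalg.solve(np.eye(nS) - gamma * P_pi, r_pi)
    Q = (r + gamma * P @ V).reshape(nS, nA)
    d = (1 - gamma) * np.linalg.solve(np.eye(nS) - gamma * P_pi.T, np.ones(nS) / nS)
    return V, Q, d

theta, eta = np.zeros(nS), 0.05
for _ in range(3000):
    pi = policy(theta)
    V, Q, d = exact_quantities(pi)
    # Exact gradient for this parametrization: d(s) * (Q(s,a0) - Q(s,a1)) * sigma'(theta_s).
    grad = d * (Q[:, 0] - Q[:, 1]) * pi[:, 0] * pi[:, 1]
    theta += eta * grad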


Figure 9. Value functions generated by policy gradient.

Figure 9 shows that the convergence rate of policy gradient strongly depends on the initial condition. In particular, Figures 9a) and 9b) show accumulation points along the update path (not shown here, the method does eventually converge to V*). This behaviour is sensible given the dependence of ∇_θ J(θ) on π(· | s), with gradients vanishing at the boundary of the polytope.

5.4. Entropy Regularized Policy Gradient

Entropy regularization adds an entropy term to the objective (Williams & Peng, 1991). The new policy gradient becomes

∇_θ J_ent(θ) = ∇_θ J(θ) + ∇_θ E_{s∼d_π} H(π(· | s)),

where H(·) denotes the Shannon entropy. The entropy term encourages policies to move away from the boundary of the polytope. Consistent with our previous observation regarding policy gradient, we find that this improves the convergence rate of the optimization procedure (Figure 10). One trade-off is that the method converges to a sub-optimal policy, which is not deterministic.
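As an illustration, the gradient step of the previous sketch can be augmented with an entropy bonus. We treat d_π as fixed when differentiating the entropy term, a common simplification, and the bonus coefficient tau is our own choice rather than a value used in this paper:

# Entropy-regularized variant of the policy gradient step above (sigmoid parametrization).
# pi, Q, d, grad, theta, eta are as in the previous sketch; tau weights the entropy bonus.
tau = 0.1
# d/dtheta_s of H(pi(.|s)) = pi0 * pi1 * log(pi1 / pi0), holding d_pi fixed.
grad_entropy = d * pi[:, 0] * pi[:, 1] * np.log(pi[:, 1] / pi[:, 0])
theta += eta * (grad + tau * grad_entropy)         # pushes away from deterministic boundaries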


Figure 10. Value functions generated by policy gradient with entropy, for three different initialization points.

5.5. Natural Policy Gradient

Natural policy gradient (Kakade, 2002) is a second-order policy optimization method. The gradient updates precondition the standard policy gradient with the inverse Fisher information matrix F (Kakade, 2002), leading to the following update rule:

θ_{k+1} := θ_k + η F^{-1} ∇_θ J(θ_k).

This causes the gradient steps to follow the steepest ascent direction in the underlying structure of the parameter space.
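For the sigmoid parametrization of the earlier sketches, the Fisher information matrix is diagonal, with F_ss = d_π(s) π(a_0|s) π(a_1|s), so the preconditioning reduces to an elementwise division; the small damping constant below is our addition to avoid division by zero near deterministic policies:

# Natural policy gradient step, replacing the plain gradient step of the earlier sketch.
# The Fisher information is diagonal here: F_ss = d(s) * pi(a0|s) * pi(a1|s).
fisher = d * pi[:, 0] * pi[:, 1] + 1e-8            # damping term (our choice)
theta += eta * grad / fisher                       # theta_{k+1} := theta_k + eta * F^{-1} grad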

Figure 11. Natural policy gradient.

In our experiment, we observe that natural policy gradient is less prone to accumulation than policy gradient (Fig. 11), in part because the step-size is better conditioned. Figure 11b) shows that, surprisingly enough, the unregularized policy gradient does not take the "shortest path" through the polytope to the optimal value function: instead, it moves from one vertex to the next, similar to policy iteration.

5.6. Cross-Entropy Method

Gradient-free optimization methods have shown impressive performance on complex control tasks (De Boer et al., 2004; Salimans et al., 2017). We present the dynamics of the cross-entropy method (CEM), without noise and with a constant noise factor (CEM-CN) (Szita & Lorincz, 2006). The mechanics of the algorithm are threefold: (i) sample a population of size N of policy parameters from a Gaussian distribution with mean θ and covariance C; (ii) evaluate the returns of the population; (iii) select the top K members, and fit a new Gaussian onto them. In the CEM-CN variant, we inject additional isotropic noise at each iteration. We use N = 500, K = 50, an initial covariance of 0.1I, where I is the identity matrix of size 2, and a constant noise of 0.05I.
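A sketch of this procedure with the parameters above (N = 500, K = 50, initial covariance 0.1I, constant noise 0.05I), again evaluating candidates by exact policy evaluation of a sigmoid policy with one parameter per state; the number of outer iterations and the random seed are arbitrary choices of ours:

import numpy as np

gamma, nS, nA = 0.9, 2, 2
r = np.array([-0.45, -0.1, 0.5, 0.5])
P = np.array([[0.7, 0.3], [0.99, 0.01], [0.2, 0.8], [0.99, 0.01]])

def expected_return(theta):
    # Exact expected return of the sigmoid policy theta under a uniform start distribution.
    p0 = 1.0 / (1.0 + np.exp(-theta))
    pi = np.stack([p0, 1.0 - p0], axis=1)
    P_pi = np.array([pi[s] @ P[s * nA:(s + 1) * nA] for s in range(nS)])
    r_pi = np.array([pi[s] @ r[s * nA:(s + 1) * nA] for s in range(nS)])
    return np.linalg.solve(np.eye(nS) - gamma * P_pi, r_pi).mean()

N, K = 500, 50
mean, cov = np.zeros(nS), 0.1 * np.eye(nS)         # initial Gaussian over parameters
rng = np.random.default_rng(0)
for _ in range(50):
    population = rng.multivariate_normal(mean, cov, size=N)        # (i) sample N candidates
    scores = np.array([expected_return(th) for th in population])  # (ii) evaluate returns
    elite = population[np.argsort(scores)[-K:]]                    # (iii) keep the top K
    mean, cov = elite.mean(axis=0), np.cov(elite.T)
    cov += 0.05 * np.eye(nS)                                       # CEM-CN: constant added noise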


Figure 12. The cross-entropy method without noise (CEM) (a, b, c); with constant noise (CEM-CN) (d, e, f).

As observed in the original work (Szita & Lorincz, 2006), the covariance of CEM without noise collapses (Figure 12a,b,c), and the algorithm therefore converges to a suboptimal policy. However, the addition of noise at each iteration prevents this undesirable behaviour (Figure 12d,e,f): the algorithm converges to the optimal value functions for all three initialization points.

6. Discussion and Concluding Remarks

In this work, we characterized the shape of the space of value functions and established its surprising geometric nature: a possibly non-convex polytope. This result was based on the line theorem, which provides guarantees of monotonic improvement as well as a line-like variation in the space of value functions. This structural property raises the question of new learning algorithms based on a single-state change, and of what this might mean in the context of function approximation.

We noticed the existence of self-intersecting spaces of value functions, which have a bottleneck. However, from our simple study of learning dynamics over a class of reinforcement learning methods, it does not seem that this bottleneck leads to any particular learning slowdown.

Some questions remain open. Although those geometric concepts make sense for finite state-action spaces, it is not clear how they generalize to the continuous case. There is a connection between representation learning and the polytopal structure of value functions that we have started exploring (Bellemare et al., 2019). Another exciting research direction is the relationship between the geometry of value functions and function approximation.


7. Acknowledgements

The authors would like to thank their colleagues at Google Brain for their help; Carles Gelada, Doina Precup, Georg Ostrovski, Marco Cuturi, Marek Petrik, Matthieu Geist, Olivier Pietquin, Pablo Samuel Castro, Remi Munos, Remi Tachet, Saurabh Kumar, and Zafarali Ahmed for useful discussion and feedback; Jake Levinson and Mathieu Guay-Paquet for their insights on the proof of Proposition 1; Mark Rowland for providing invaluable feedback on two earlier versions of this manuscript.

References

Aigner, M., Ziegler, G. M., Hofmann, K. H., and Erdos, P. Proofs from the Book, volume 274. Springer, 2010.

Bellemare, M. G., Dabney, W., Dadashi, R., Taiga, A. A., Castro, P. S., Roux, N. L., Schuurmans, D., Lattimore, T., and Lyle, C. A geometric perspective on optimal representations for reinforcement learning. arXiv preprint arXiv:1901.11530, 2019.

Bellman, R. Dynamic Programming. Dover Publications, 1957.

Bertsekas, D. P. Generic rank-one corrections for value iteration in Markovian decision problems. Technical report, M.I.T., 1994.

Bertsekas, D. P. and Tsitsiklis, J. N. Neuro-Dynamic Programming. Athena Scientific, 1996.

Bradtke, S. J. and Barto, A. G. Linear least-squares algorithms for temporal difference learning. Machine Learning, 22(1-3):33–57, 1996.

Brondsted, A. An Introduction to Convex Polytopes, volume 90. Springer Science & Business Media, 2012.

Dantzig, G. B. Programming in a linear structure. Washington, DC, 1948.

De Boer, P.-T., Kroese, D. P., Mannor, S., and Rubinstein, R. Y. A tutorial on the cross-entropy method. Annals of Operations Research, 2004.

De Farias, D. P. and Van Roy, B. The linear programming approach to approximate dynamic programming. Operations Research, 51(6):850–865, 2003.

Engelking, R. General Topology. Heldermann, 1989.

Grunbaum, B., Klee, V., Perles, M. A., and Shephard, G. C. Convex Polytopes. Springer, 1967.

Howard, R. A. Dynamic Programming and Markov Processes. MIT Press, 1960.

Kakade, S. M. A natural policy gradient. In Advances in Neural Information Processing Systems, pp. 1531–1538, 2002.

Klee, V. Some characterizations of convex polyhedra. Acta Mathematica, 102(1-2):79–107, 1959.

Konda, V. R. and Tsitsiklis, J. N. Actor-critic algorithms. In Advances in Neural Information Processing Systems, pp. 1008–1014, 2000.

Littman, M. L., Dean, T. L., and Kaelbling, L. P. On the complexity of solving Markov decision problems. In Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, pp. 394–402. Morgan Kaufmann Publishers Inc., 1995.

Mansour, Y. and Singh, S. On the complexity of policy iteration. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, pp. 401–408. Morgan Kaufmann Publishers Inc., 1999.

Munos, R. Error bounds for approximate policy iteration. In Proceedings of the International Conference on Machine Learning, 2003.

Puterman, M. L. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., 1994.

Salimans, T., Ho, J., Chen, X., Sidor, S., and Sutskever, I. Evolution strategies as a scalable alternative to reinforcement learning. arXiv preprint arXiv:1703.03864, 2017.

Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. MIT Press, 2nd edition, 2018.

Sutton, R. S., McAllester, D. A., Singh, S. P., and Mansour, Y. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, pp. 1057–1063, 2000.

Szita, I. and Lorincz, A. Learning Tetris using the noisy cross-entropy method. Neural Computation, 2006.

Wang, T., Bowling, M., and Schuurmans, D. Dual representations for dynamic programming and reinforcement learning. In Approximate Dynamic Programming and Reinforcement Learning (ADPRL 2007), IEEE International Symposium on, pp. 44–51. IEEE, 2007.

Williams, R. J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256, 1992.

Williams, R. J. and Peng, J. Function optimization using connectionist reinforcement learning algorithms. Connection Science, 3(3):241–268, 1991.


Ye, Y. The simplex and policy-iteration methods are strongly polynomial for the Markov decision problem with a fixed discount rate. Mathematics of Operations Research, 36(4):593–603, 2011.

Ziegler, G. M. Lectures on Polytopes, volume 152. Springer Science & Business Media, 2012.


Appendix

A. Details of Markov Decision Processes

In this section we give the specifics of the Markov decision processes presented in this work. We will use the following convention:

r(s_i, a_j) = r[i × |A| + j]
P(s_k | s_i, a_j) = P[i × |A| + j][k],

where P, r are the vectors given below.
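For instance, the flat vectors of MDP (a) below (Figure 2) can be unpacked into explicit reward and transition arrays as follows; the array names are illustrative only:

import numpy as np

# MDP (a) of Figure 2: |S| = 2, |A| = 2, in the flat convention above.
r_flat = [0.06, 0.38, -0.13, 0.64]
P_flat = [[0.01, 0.99], [0.92, 0.08], [0.08, 0.92], [0.70, 0.30]]

nS, nA = 2, 2
R = np.array(r_flat).reshape(nS, nA)        # R[i, j] = r(s_i, a_j)
T = np.array(P_flat).reshape(nS, nA, nS)    # T[i, j, k] = P(s_k | s_i, a_j)
assert np.allclose(T.sum(axis=-1), 1.0)     # each (s, a) row is a probability distribution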

In Section 3, Figure 2: (a) |A| = 2, γ = 0.9

r = [0.06, 0.38,−0.13, 0.64]

P = [[0.01, 0.99], [0.92, 0.08], [0.08, 0.92], [0.70, 0.30]]

(b) |A| = 2, γ = 0.9

r = [0.88,−0.02,−0.98, 0.42]

P = [[0.96, 0.04], [0.19, 0.81], [0.43, 0.57], [0.72, 0.28]]

(c) |A| = 3, γ = 0.9

r = [−0.93,−0.49, 0.63, 0.78, 0.14, 0.41]

P = [[0.52, 0.48], [0.5, 0.5], [0.99, 0.01], [0.85, 0.15], [0.11, 0.89], [0.1, 0.9]]

(d) |A| = 2, γ = 0.9

r = [−0.45,−0.1, 0.5, 0.5]

P = [[0.7, 0.3], [0.99, 0.01], [0.2, 0.8], [0.99, 0.01]]

In Section 3, Figure 3, 4, 5, 6: (left) |A| = 3, γ = 0.8

r = [−0.1,−1., 0.1, 0.4, 1.5, 0.1]

P = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.3], [0.05, 0.95], [0.25, 0.75], [0.3, 0.7]]

(right) |A| = 2, γ = 0.9

r = [−0.45,−0.1, 0.5, 0.5]

P = [[0.7, 0.3], [0.99, 0.01], [0.2, 0.8], [0.99, 0.01]]

In Section 5: |A| = 2, γ = 0.9

r = [−0.45,−0.1, 0.5, 0.5]

P = [[0.7, 0.3], [0.99, 0.01], [0.2, 0.8], [0.99, 0.01]]


B. Notation for the proofs

In this section we present the notation that we use to establish the results in the main text. The space of policies P(A)^S describes a Cartesian product of simplices that we can express as a space of |S| × |A| matrices. However, we will adopt for policies, as well as for the other components of the MDP, a convenient matrix form similar to (Wang et al., 2007).

• The transition matrix P is a |S||A| × |S| matrix denoting the probability of going to state s′ when taking action a in state s.

• A policy π is represented by a block-diagonal |S| × |S||A| matrix M_π. Suppose the state s is indexed by i and the action a is indexed by j in the matrix form; then we have M_π(i, i × |A| + j) = π(a|s). The rest of the entries of M_π are 0. From now on, we identify π with M_π to enhance readability.

• The transition matrix P^π = πP induced by a policy π is a |S| × |S| matrix denoting the probability of going from state s to state s′ when following the policy π.

• The reward vector r is a |S||A| × 1 matrix denoting the expected reward when taking action a in state s. The reward vector of a policy, r^π = πr, is a |S| × 1 vector.

• The value function V^π of a policy π is a |S| × 1 matrix.

• We write C^π_i for the i-th column of (I − γP^π)^{-1}.

Under these notations, we can define the Bellman operator T^π and the optimality Bellman operator T* as follows:

T^π V^π = r^π + γ P^π V^π = π (r + γ P V^π),

∀s ∈ S,  T* V^π(s) = max_{π′ ∈ P(A)^S} [ r^{π′}(s) + γ (P^{π′} V^π)(s) ].
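In code, this matrix form yields f_v(π) = (I − γπP)^{-1} πr directly by building the block-diagonal matrix M_π; a small sketch (names of our choosing), illustrated on the Section 5 MDP of Appendix A:

import numpy as np

def value_function(pi, P, r, gamma):
    # f_v(pi): build the block-diagonal M_pi, then solve (I - gamma * pi P) V = pi r.
    nS = pi.shape[0]
    nA = P.shape[0] // nS
    M = np.zeros((nS, nS * nA))
    for i in range(nS):
        M[i, i * nA:(i + 1) * nA] = pi[i]   # M_pi(i, i * |A| + j) = pi(a_j | s_i)
    P_pi, r_pi = M @ P, M @ r               # |S| x |S| transition matrix, |S| reward vector
    return np.linalg.solve(np.eye(nS) - gamma * P_pi, r_pi)

P = np.array([[0.7, 0.3], [0.99, 0.01], [0.2, 0.8], [0.99, 0.01]])
r = np.array([-0.45, -0.1, 0.5, 0.5])
print(value_function(np.full((2, 2), 0.5), P, r, 0.9))   # value of the uniform random policy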

C. Supplementary Results

Lemma 5. f_v is infinitely differentiable on P(A)^S.

Proof. We have that

f_v(π) = (I − γπP)^{-1} πr = [1 / det(I − γπP)] adj(I − γπP) πr,

where det is the determinant and adj is the adjugate. For all π ∈ P(A)^S, det(I − γπP) ≠ 0; therefore f_v is infinitely differentiable.

Lemma 6. Let π ∈ P(A)^S, s_1, .., s_k ∈ S, and π′ ∈ Y^π_{s_1,..,s_k}. We have

Span(C^π_{k+1}, .., C^π_{|S|}) = Span(C^{π′}_{k+1}, .., C^{π′}_{|S|}).

Proof. As P^π and P^{π′} are equal on their first k rows, we also have that (I − γP^π) and (I − γP^{π′}) are equal on their first k rows. We denote these k rows by L_1, ..., L_k.

Since (I − γP^π) C^π_j = e_j and (I − γP^{π′}) C^{π′}_j = e_j, where e_j denotes the j-th standard basis vector, we have that

∀i ∈ {1, . . . , k}, ∀j ∈ {k + 1, . . . , |S|},   L_i C^π_j = 0  and  L_i C^{π′}_j = 0,


which we can rewrite as

Span(C^π_{k+1}, . . . , C^π_{|S|}) ⊂ Span(L_1, . . . , L_k)^⊥
Span(C^{π′}_{k+1}, . . . , C^{π′}_{|S|}) ⊂ Span(L_1, . . . , L_k)^⊥.

Now, using dim Span(C^π_{k+1}, . . . , C^π_{|S|}) = dim Span(C^{π′}_{k+1}, . . . , C^{π′}_{|S|}) = dim Span(L_1, . . . , L_k)^⊥ = |S| − k, we have:

Span(C^π_{k+1}, . . . , C^π_{|S|}) = Span(L_1, . . . , L_k)^⊥
Span(C^{π′}_{k+1}, . . . , C^{π′}_{|S|}) = Span(L_1, . . . , L_k)^⊥.

D. Proofs

Lemma 1. The space of value functions V is compact and connected.

Proof. P(A)^S is connected since it is a convex space, and it is compact because it is closed and bounded in a finite-dimensional real vector space. Since f_v is continuous (Lemma 5), we have that f_v(P(A)^S) = V is compact and connected.

Lemma 2. Consider two policies π_1, π_2 that agree on s_1, . . . , s_k ∈ S. Then the vector r^{π_1} − r^{π_2} has zeros in the components corresponding to s_1, . . . , s_k, and the matrix P^{π_1} − P^{π_2} has zeros in the corresponding rows.

Proof. Suppose without loss of generality that {s_1, .., s_k} are the first k states in the matrix form notation. We have

r^{π_1} = π_1 r,   r^{π_2} = π_2 r,   P^{π_1} = π_1 P,   P^{π_2} = π_2 P.

Since π_1(· | s) = π_2(· | s) for all s ∈ {s_1, .., s_k}, the first k rows of π_1 and π_2 are identical in the matrix form notation. Therefore, the first k elements of r^{π_1} and r^{π_2} are identical, and the first k rows of P^{π_1} and P^{π_2} are identical, hence the result.

Lemma 3. Consider a policy π and k states s_1, . . . , s_k. Then the value functions generated by Y^π_{s_1,..,s_k} are contained in the affine vector space H^π_{s_1,..,s_k}:

f_v(Y^π_{s_1,..,s_k}) = V ∩ H^π_{s_1,..,s_k}.

Proof. Let us first show that f_v(Y^π_{s_1,..,s_k}) ⊂ V ∩ H^π_{s_1,..,s_k}.

Let π′ ∈ Y^π_{s_1,..,s_k}, i.e. π′ agrees with π on s_1, .., s_k. Using Bellman's equation, we have:

V^{π′} − V^π = r^{π′} − r^π + γ P^{π′} V^{π′} − γ P^π V^π
             = r^{π′} − r^π + γ (P^{π′} − P^π) V^{π′} + γ P^π (V^{π′} − V^π),

hence

V^{π′} − V^π = (I − γP^π)^{-1} ( r^{π′} − r^π + γ (P^{π′} − P^π) V^{π′} ).   (3)

Since the policies π′ and π agree on the states s_1, . . . , s_k, we have, using Lemma 2:

• r^{π′} − r^π is zero on its first k elements;
• P^{π′} − P^π is zero on its first k rows.

Hence, the right-hand side of Eq. 3 is the product of a matrix with a vector whose first k elements are 0. Therefore

V^{π′} ∈ V^π + Span(C^π_{k+1}, .., C^π_{|S|}).


We shall now show that V ∩ H^π_{s_1,..,s_k} ⊂ f_v(Y^π_{s_1,..,s_k}).

Suppose V^{π̄} ∈ V ∩ H^π_{s_1,..,s_k}, for some policy π̄. We want to show that there is a policy π′ ∈ Y^π_{s_1,..,s_k} such that V^{π′} = V^{π̄}. We construct π′ the following way:

π′(· | s) = π(· | s) if s ∈ {s_1, .., s_k},  and  π′(· | s) = π̄(· | s) otherwise.

Therefore, using the result of the first implication of this proof:

V^{π̄} − V^{π′} ∈ Span(C^π_{k+1}, .., C^π_{|S|})   by assumption,
V^{π̄} − V^{π′} ∈ Span(C^{π′}_1, .., C^{π′}_k)   since π̄ and π′ agree on s_{k+1}, . . . , s_{|S|}.

However, as π, π′ ∈ Y^π_{s_1,..,s_k}, we have, using Lemma 6:

Span(C^π_{k+1}, .., C^π_{|S|}) = Span(C^{π′}_{k+1}, .., C^{π′}_{|S|}).

Therefore, V^{π̄} − V^{π′} ∈ Span(C^{π′}_1, .., C^{π′}_k) ∩ Span(C^{π′}_{k+1}, .., C^{π′}_{|S|}) = {0}, meaning that V^{π̄} = V^{π′} ∈ f_v(Y^π_{s_1,..,s_k}).

Lemma 4. Consider the ensemble Y^π_{S\{s}} of policies that agree with a policy π everywhere but on s ∈ S. For π_0, π_1 ∈ Y^π_{S\{s}} define the function g : [0, 1] → V,

g(µ) = f_v(µπ_1 + (1 − µ)π_0).

Then the following hold regarding g:

(i) g is continuously differentiable;
(ii) (Total order) g(0) ≼ g(1) or g(0) ≽ g(1);
(iii) If g(0) = g(1), then g(µ) = g(0) for all µ ∈ [0, 1];
(iv) (Monotone interpolation) If g(0) ≠ g(1), there is a ρ : [0, 1] → R such that g(µ) = ρ(µ)g(1) + (1 − ρ(µ))g(0), and ρ is a strictly monotonic rational function of µ.

Proof. (i) g is continuously differentiable as a composition of two continuously differentiable functions.

(ii) We want to show that we have either V^{π_1} ≼ V^{π_0} or V^{π_1} ≽ V^{π_0}.

Suppose, without loss of generality, that s is the first state in the matrix form. Using Lemma 3, we have:

V^{π_0} = V^{π_1} + α C^{π_1}_1, with α ∈ R.

As (I − γP^{π_1})^{-1} = Σ_{i=0}^∞ (γP^{π_1})^i, whose entries are all nonnegative, C^{π_1}_1 is a vector with nonnegative entries. Therefore we have V^{π_1} ≼ V^{π_0} or V^{π_1} ≽ V^{π_0}, depending on the sign of α.

(iii) We have, using Equation (3),

V^{π_0} − V^{π_µ} = (I − γP^{π_µ})^{-1} ( r^{π_0} − r^{π_µ} + γ(P^{π_0} − P^{π_µ}) V^{π_0} ),
V^{π_0} − V^{π_1} = (I − γP^{π_1})^{-1} ( r^{π_0} − r^{π_1} + γ(P^{π_0} − P^{π_1}) V^{π_0} ).


Now, using

r^{π_0} = π_0 r,
r^{π_µ} = π_µ r = π_0 r + µ(π_1 − π_0) r,
P^{π_0} = π_0 P,
P^{π_µ} = π_µ P = π_0 P + µ(π_1 − π_0) P,

we have

V^{π_0} − V^{π_µ} = µ (I − γP^{π_µ})^{-1} ( r^{π_0} − r^{π_1} + γ(P^{π_0} − P^{π_1}) V^{π_0} )
                 = µ (I − γP^{π_µ})^{-1} (I − γP^{π_1}) (V^{π_0} − V^{π_1}).   (4)

Therefore, g(0) = g(1) ⇒ V^{π_0} − V^{π_1} = 0 ⇒ V^{π_0} − V^{π_µ} = 0 ⇒ g(µ) = g(0).

(iv) If g(0) = g(1), the result is true since we can take ρ = 0 using (iii).

Suppose g(0) ≠ g(1); let us prove the existence of ρ and that it is a rational function of µ. Reusing Equation (4), we have

V^{π_0} − V^{π_µ} = µ (I − γP^{π_µ})^{-1} (I − γP^{π_1}) (V^{π_0} − V^{π_1})
                 = µ (I − γ(P^{π_0} + µ(P^{π_1} − P^{π_0})))^{-1} (I − γP^{π_1}) (V^{π_0} − V^{π_1})
                 = µ (I − γP^{π_1} − γ(1 − µ)(P^{π_0} − P^{π_1}))^{-1} (I − γP^{π_1}) (V^{π_0} − V^{π_1}).

By Lemma 2, P^{π_0} − P^{π_1} is a rank-one matrix, which we can express as P^{π_0} − P^{π_1} = u v^T with u, v ∈ R^{|S|}. From the Sherman–Morrison formula:

(I − γP^{π_1} − γ(1 − µ)(P^{π_0} − P^{π_1}))^{-1}
   = (I − γP^{π_1})^{-1} + γ(1 − µ) [ (I − γP^{π_1})^{-1} (P^{π_0} − P^{π_1}) (I − γP^{π_1})^{-1} ] / [ 1 − γ(1 − µ) v^T (I − γP^{π_1})^{-1} u ].

Define ω_{π_1,π_0} = v^T (I − γP^{π_1})^{-1} u; we have

V^{π_0} − V^{π_µ} = µ V^{π_0} − µ V^{π_1} + [ γµ(1 − µ) / (1 − ω_{π_1,π_0} γ(1 − µ)) ] (I − γP^{π_1})^{-1} (P^{π_0} − P^{π_1}) (V^{π_0} − V^{π_1}).

As in (i), (P^{π_0} − P^{π_1})(V^{π_0} − V^{π_1}) is zero on its last |S| − 1 elements, by an argument similar to Lemma 2; therefore

(I − γP^{π_1})^{-1} (P^{π_0} − P^{π_1}) (V^{π_0} − V^{π_1}) = β_{π_0,π_1} C^{π_1}_1, with β_{π_0,π_1} ∈ R.

Now recall from the proof of (i) that, similarly,

V^{π_0} − V^{π_1} = α_{π_0,π_1} C^{π_1}_1.

As by assumption V^{π_0} − V^{π_1} ≠ 0, we have

(I − γP^{π_1})^{-1} (P^{π_0} − P^{π_1}) (V^{π_0} − V^{π_1}) = (β_{π_0,π_1} / α_{π_0,π_1}) (V^{π_0} − V^{π_1}).

Finally, we have

V^{π_0} − V^{π_µ} = µ V^{π_0} − µ V^{π_1} + [ γµ(1 − µ) / (1 − ω_{π_1,π_0} γ(1 − µ)) ] (β_{π_0,π_1} / α_{π_0,π_1}) (V^{π_0} − V^{π_1})
                 = ( µ + [ γµ(1 − µ) / (1 − ω_{π_1,π_0} γ(1 − µ)) ] (β_{π_0,π_1} / α_{π_0,π_1}) ) (V^{π_0} − V^{π_1}).


Therefore, ρ is a rational function of µ, hence continuous, that we can express as

ρ(µ) = µ + [ γµ(1 − µ) / (1 − ω_{π_1,π_0} γ(1 − µ)) ] (β_{π_0,π_1} / α_{π_0,π_1}).

Now let us prove that $\rho$ is strictly monotonic. Suppose that $\rho$ is not strictly monotonic. As $\rho$ is continuous, $\rho$ is not injective. Hence there are distinct $\mu_0, \mu_1 \in [0,1]$, with associated mixtures of policies $\pi_{\mu_0}, \pi_{\mu_1}$, such that

$$g(\mu_0) = g(\mu_1) \Leftrightarrow V^{\pi_{\mu_0}} = V^{\pi_{\mu_1}}$$
$$\Leftrightarrow T^{\pi_{\mu_0}}V^{\pi_{\mu_0}} = T^{\pi_{\mu_1}}V^{\pi_{\mu_1}}$$
$$\Leftrightarrow T^{\pi_{\mu_0}}V^{\pi_{\mu_0}} = T^{\pi_{\mu_1}}V^{\pi_{\mu_0}}$$
$$\Leftrightarrow (\mu_0 T^{\pi_0} + (1-\mu_0)T^{\pi_1})V^{\pi_{\mu_0}} = (\mu_1 T^{\pi_0} + (1-\mu_1)T^{\pi_1})V^{\pi_{\mu_0}}$$
$$\Leftrightarrow \mu_0(T^{\pi_0} - T^{\pi_1})V^{\pi_{\mu_0}} = \mu_1(T^{\pi_0} - T^{\pi_1})V^{\pi_{\mu_0}}$$
$$\Leftrightarrow T^{\pi_0}V^{\pi_{\mu_0}} = T^{\pi_1}V^{\pi_{\mu_0}} \quad (\text{since } \mu_0 \neq \mu_1).$$

Therefore we have

$$V^{\pi_{\mu_0}} = T^{\pi_{\mu_0}}V^{\pi_{\mu_0}} = (\mu_0 T^{\pi_0} + (1-\mu_0)T^{\pi_1})V^{\pi_{\mu_0}} = T^{\pi_1}V^{\pi_{\mu_0}}.$$

Therefore

$$T^{\pi_0}V^{\pi_{\mu_0}} = T^{\pi_1}V^{\pi_{\mu_0}} = V^{\pi_{\mu_0}}.$$

However, each Bellman operator has a unique fixed point; therefore

$$V^{\pi_0} = V^{\pi_1} = V^{\pi_{\mu_0}},$$

which contradicts our assumption that $g(0) \neq g(1)$.
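
The algebra above is easy to check numerically. The sketch below is an illustration only: the random MDP, the discount factor, and helper names such as value_function are all placeholders introduced here, not objects from the paper. For two policies that differ on a single state, it verifies the two geometric consequences of Lemma 4: the value functions of their mixtures are collinear with the endpoints, and the interpolation coefficient varies monotonically with $\mu$.

```python
import numpy as np

def value_function(P, r, pi, gamma=0.9):
    """V^pi = (I - gamma P^pi)^{-1} r^pi for a finite MDP with tabular policy pi."""
    n_states = P.shape[0]
    P_pi = np.einsum('sa,sat->st', pi, P)   # P^pi[s, s'] = sum_a pi(a|s) P(s'|s, a)
    r_pi = np.einsum('sa,sa->s', pi, r)     # r^pi[s]    = sum_a pi(a|s) r(s, a)
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

rng = np.random.default_rng(0)
n_states, n_actions = 4, 3
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a] is a distribution over next states
r = rng.uniform(-1.0, 1.0, size=(n_states, n_actions))

# Two policies pi0, pi1 that agree everywhere except on state 0.
pi0 = rng.dirichlet(np.ones(n_actions), size=n_states)
pi1 = pi0.copy()
pi1[0] = rng.dirichlet(np.ones(n_actions))

V0, V1 = value_function(P, r, pi0), value_function(P, r, pi1)
d = V1 - V0                                # direction of the segment (assumed nonzero here)
rhos = []
for mu in np.linspace(0.0, 1.0, 11):
    V_mu = value_function(P, r, (1.0 - mu) * pi0 + mu * pi1)
    rho = np.dot(V_mu - V0, d) / np.dot(d, d)
    assert np.allclose(V_mu, V0 + rho * d, atol=1e-8)   # collinearity: V^{pi_mu} lies on the segment
    rhos.append(rho)
assert np.all(np.diff(rhos) > 0) or np.all(np.diff(rhos) < 0)  # monotone interpolation coefficient
print(np.round(rhos, 3))                   # rho(0) = 0, rho(1) = 1, monotone in between
```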

Theorem 1. [Line Theorem] Let $s$ be a state and $\pi$ a policy. Then there are two $s$-deterministic policies in $Y^{\pi}_{\mathcal{S}\setminus\{s\}}$, denoted $\pi_l, \pi_u$, which bracket the value of all other policies $\pi' \in Y^{\pi}_{\mathcal{S}\setminus\{s\}}$:

$$f_v(\pi_l) \preceq f_v(\pi') \preceq f_v(\pi_u).$$

Furthermore, the image of $f_v$ restricted to $Y^{\pi}_{\mathcal{S}\setminus\{s\}}$ is a line segment, and the following three sets are equivalent:

(i) $f_v(Y^{\pi}_{\mathcal{S}\setminus\{s\}})$,

(ii) $\{f_v(\alpha\pi_l + (1-\alpha)\pi_u) \mid \alpha \in [0,1]\}$,

(iii) $\{\alpha f_v(\pi_l) + (1-\alpha)f_v(\pi_u) \mid \alpha \in [0,1]\}$.

Proof. Let us start by proving the first statement of the theorem, namely the existence of two $s$-deterministic policies $\pi_u, \pi_l$ in $Y^{\pi}_{\mathcal{S}\setminus\{s\}}$ that respectively dominate and are dominated by all other policies.

The existence of $\pi_l$ and $\pi_u$ (without enforcing their $s$-determinism), whose value functions are respectively dominated by or dominate all other value functions of policies of $Y^{\pi}_{\mathcal{S}\setminus\{s\}}$, is given by:

• $f_v(Y^{\pi}_{\mathcal{S}\setminus\{s\}})$ is compact, as an intersection of a compact set and an affine plane (Lemma 3).

• There is a total order on this compact space ((ii) in Lemma 4).


Suppose $\pi_l$ is not $s$-deterministic. Then there is $a \in \mathcal{A}$ such that $\pi_l(a|s) = \mu^* \in (0,1)$. Hence we can write $\pi_l$ as a mixture of $\pi_1, \pi_2$ defined as follows:

$$\forall s' \in \mathcal{S}, a' \in \mathcal{A}, \quad \pi_1(a'|s') = \begin{cases} 1 & \text{if } s' = s, a' = a \\ 0 & \text{if } s' = s, a' \neq a \\ \pi_l(a'|s') & \text{otherwise,} \end{cases} \tag{5}$$

$$\pi_2(a'|s') = \begin{cases} 0 & \text{if } s' = s, a' = a \\ \frac{1}{1-\mu^*}\pi_l(a'|s') & \text{if } s' = s, a' \neq a \\ \pi_l(a'|s') & \text{otherwise.} \end{cases} \tag{6}$$

Therefore $\pi_l = \mu^*\pi_1 + (1-\mu^*)\pi_2$. We can use (iii) and (iv) in Lemma 4, which give that $g : \mu \mapsto f_v(\mu\pi_1 + (1-\mu)\pi_2)$ is either strictly monotonic or constant. If $g$ were strictly monotonic, we would have a contradiction with $f_v(\pi_l)$ being minimal. Therefore $g$ is constant, and in particular

$$f_v(\pi_l) = g(1) = f_v(\pi_1),$$

with $\pi_1$ $s$-deterministic.

Similarly, we can show there is an $s$-deterministic policy that has the same value function as $\pi_u$, hence proving the result.

Now let us prove the equivalence between (i), (ii) and (iii).

• Let $\pi' \in Y^{\pi}_{\mathcal{S}\setminus\{s\}}$. We have $f_v(\pi_l) \preceq f_v(\pi') \preceq f_v(\pi_u)$, and $f_v(\pi_l), f_v(\pi'), f_v(\pi_u)$ are on the same line (Lemma 3). Therefore, $f_v(\pi')$ is a convex combination of $f_v(\pi_l)$ and $f_v(\pi_u)$, hence (i) ⊂ (iii).

• By definition, (ii) ⊂ (i).

• Lemma 4 gives $\mu \mapsto f_v(\mu\pi_u + (1-\mu)\pi_l) = f_v(\pi_l) + \rho(\mu)(f_v(\pi_u) - f_v(\pi_l))$ with $\rho$ continuous and $\rho(0) = 0$, $\rho(1) = 1$. By the intermediate value theorem, $\rho$ takes all values between 0 and 1. Therefore (iii) ⊂ (ii).

We have (iii) ⊂ (ii) ⊂ (i) ⊂ (iii). Therefore (i) = (ii) = (iii).
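
As an illustration of the theorem (not part of the proof), one can sweep the action distribution at a single state and check that every resulting value function is bracketed by, and lies on the segment between, the value functions of two $s$-deterministic policies. The sketch below continues the numerical example introduced after Lemma 4 and reuses its P, r, rng, and value_function; all of these are assumptions of the sketch, not objects from the paper.

```python
# Continues the sketch above (same P, r, rng, n_states, n_actions, value_function).
s = 0                                      # the single state where the policy may vary
base = rng.dirichlet(np.ones(n_actions), size=n_states)

# Value functions of the s-deterministic policies that agree with `base` elsewhere.
det_vals = []
for a in range(n_actions):
    pi = base.copy()
    pi[s] = np.eye(n_actions)[a]
    det_vals.append(value_function(P, r, pi))
det_vals = np.array(det_vals)
V_lo = det_vals[np.argmin(det_vals.sum(axis=1))]   # componentwise minimum (total order)
V_hi = det_vals[np.argmax(det_vals.sum(axis=1))]   # componentwise maximum

for _ in range(200):                       # random policies that agree with `base` except on s
    pi = base.copy()
    pi[s] = rng.dirichlet(np.ones(n_actions))
    V = value_function(P, r, pi)
    assert np.all(V >= V_lo - 1e-8) and np.all(V <= V_hi + 1e-8)   # bracketing by the extremes
    seg = V_hi - V_lo
    alpha = np.dot(V - V_lo, seg) / np.dot(seg, seg)
    assert -1e-8 <= alpha <= 1 + 1e-8
    assert np.allclose(V, V_lo + alpha * seg, atol=1e-8)           # V lies on the segment [V_lo, V_hi]
```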

Corollary 1. For any set of states $s_1, \ldots, s_k \in \mathcal{S}$ and a policy $\pi$, $V^{\pi}$ can be expressed as a convex combination of value functions of $\{s_1, \ldots, s_k\}$-deterministic policies. In particular, $\mathcal{V}$ is included in the convex hull of the value functions of deterministic policies.

Proof. We prove the result by induction on the number of states k. If k = 1, the result is true by Theorem 1.

Suppose the result is true for $k$ states. Let $s_1, \ldots, s_{k+1} \in \mathcal{S}$; by the induction hypothesis we have

$$\exists n \in \mathbb{N},\ \exists \pi_1, \ldots, \pi_n \in \mathcal{P}(\mathcal{A})^{\mathcal{S}} \text{ that are } \{s_1, \ldots, s_k\}\text{-deterministic},\ \exists \alpha_1, \ldots, \alpha_n \in [0,1], \text{ s.t. } \begin{cases} V^{\pi} = \sum_{i=1}^{n} \alpha_i V^{\pi_i} \\ \sum_{i=1}^{n} \alpha_i = 1. \end{cases}$$

However, using Theorem 1, we have

$$\forall i \in \{1, \ldots, n\},\ \exists \pi_{i,l}, \pi_{i,u} \in \mathcal{P}(\mathcal{A})^{\mathcal{S}}, \quad \begin{cases} \pi_{i,l}, \pi_{i,u} \text{ are } s_{k+1}\text{-deterministic} \\ \pi_{i,l}, \pi_{i,u} \text{ agree with } \pi_i \text{ on } s_1, \ldots, s_k \\ \exists \beta_i \in [0,1],\ V^{\pi_i} = \beta_i V^{\pi_{i,l}} + (1-\beta_i)V^{\pi_{i,u}}. \end{cases}$$

Therefore,

$$V^{\pi} = \sum_{i=1}^{n} \alpha_i\big(\beta_i V^{\pi_{i,l}} + (1-\beta_i)V^{\pi_{i,u}}\big),$$

which expresses $V^{\pi}$ as a convex combination of value functions of $\{s_1, \ldots, s_{k+1}\}$-deterministic policies, thus concluding the proof.
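
Corollary 1 also lends itself to a quick numerical sanity check: enumerate the deterministic policies of a small MDP, compute their value functions, and ask a linear program whether an arbitrary policy's value function is a convex combination of them. The sketch below continues the running example above (reusing its P, r, rng, n_states, n_actions, and value_function) and uses scipy.optimize.linprog as a feasibility solver; it is an illustration under those assumptions, not a proof.

```python
# Continues the sketch above (same P, r, rng, n_states, n_actions, value_function).
from itertools import product
from scipy.optimize import linprog

# Value functions of all |A|^|S| deterministic policies.
det_vals = np.array([value_function(P, r, np.eye(n_actions)[list(acts)])
                     for acts in product(range(n_actions), repeat=n_states)])

pi = rng.dirichlet(np.ones(n_actions), size=n_states)   # an arbitrary stochastic policy
V = value_function(P, r, pi)

# Feasibility LP: find weights alpha >= 0 with sum_i alpha_i = 1 and sum_i alpha_i V_i = V.
m = det_vals.shape[0]
A_eq = np.vstack([det_vals.T, np.ones((1, m))])          # |S| matching constraints + normalization
b_eq = np.concatenate([V, [1.0]])
res = linprog(c=np.zeros(m), A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * m, method="highs")
print("V^pi is a convex combination of deterministic value functions:", res.success)
```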

Proposition 1. $P$ is a polyhedron in an affine subspace $K \subseteq \mathbb{R}^n$ if


(i) $P$ is closed;

(ii) there are $k \in \mathbb{N}$ hyperplanes $H_1, \ldots, H_k$ in $K$ whose union contains the boundary of $P$ in $K$: $\partial_K P \subset \cup_{i=1}^{k} H_i$; and

(iii) for each of these hyperplanes, $P \cap H_i$ is a polyhedron in $H_i$.

Proof. We will show the result by induction on the dimension of K.

For $\dim(K) = 1$, the proposition is true since a closed set $P$ is a polyhedron if and only if its boundary is a finite set of points.

Suppose the proposition is true for $\dim(K) = n$; let us show that it is true for $\dim(K) = n+1$.

We can verify that if $P$ is a polyhedron, then:

• $P$ is closed.

• There is a finite number of hyperplanes covering its boundary (the boundaries of the half-spaces defining each convex polyhedron composing $P$).

• The intersections of $P$ with these hyperplanes are still polyhedra.

Now let us consider the other direction of the implication, i.e. suppose that $P$ is closed, $\partial_K P \subset \cup_{i=1}^{k} H_i$, and for all $i$, $P \cap H_i$ is a polyhedron. We will show that we can express $P$ as a finite union of polyhedra.

Suppose $x \in P$ and $x \notin \cup_{i=1}^{k} H_i$. By assumption, we have that $x \in \operatorname{relint}_K(P)$. We will show that $x$ is in an intersection of closed half-spaces defined by the hyperplanes $H_1, \ldots, H_k$, and that any other vector in the relative interior of this intersection is also in $P$ (otherwise we would have a contradiction with the boundary assumption).

A hyperplane $H_i$ defines two closed half-spaces denoted by $H_i^{+1}$ and $H_i^{-1}$ (the signs being arbitrary), and the intersections of these half-spaces cover $K$; therefore

$$\exists \delta \in \{-1, 1\}^k, \quad x \in \cap_{i=1}^{k} H_i^{\delta(i)} = P_\delta.$$

By assumption, $x \in \operatorname{relint}_K(P_\delta)$, since we assumed that $x \notin \cup_{i=1}^{k} H_i$. Now suppose there exists $y \in \operatorname{relint}_K(P_\delta)$ s.t. $y \notin P$; we have

$$\exists \lambda \in [0,1], \quad \lambda x + (1-\lambda)y = z \in \partial_K P.$$

However, $z \in \operatorname{relint}_K(P_\delta)$ because $\operatorname{relint}_K(P_\delta)$ is convex. Therefore $z \notin \cup_{i=1}^{k} H_i$ since $z \in \operatorname{relint}_K(P_\delta)$, which gives a contradiction. We thus have either $\operatorname{relint}_K(P_\delta) \subset P$ or $\operatorname{relint}_K(P_\delta) \cap P = \emptyset$.

Now suppose $\operatorname{relint}_K(P_\delta) \subset P$ and $\operatorname{relint}_K(P_\delta)$ is nonempty. We have that $\operatorname{cl}_K(\operatorname{relint}_K(P_\delta)) = P_\delta$ (Brondsted, 2012, Theorem 3.3) and $P$ is closed, meaning that $P_\delta \subset P$. Therefore, we have

$$\exists \delta_1, \ldots, \delta_j \in \{-1, 1\}^k \text{ s.t. } P = \Big(\cup_{i=1}^{k} P \cap H_i\Big) \cup \Big(\cup_{i=1}^{j} P_{\delta_i}\Big).$$

$P$ is thus a finite union of polyhedra, as the $P \cap H_i$ are polyhedra by assumption, and the $P_{\delta_i}$ are convex polyhedra by definition.

Corollary 2. Let $V^{\pi}$ and $V^{\pi'}$ be two value functions. Then there exists a sequence of $k \leq |\mathcal{S}|$ policies, $\pi_1, \ldots, \pi_k$, such that $V^{\pi} = V^{\pi_1}$, $V^{\pi'} = V^{\pi_k}$, and for every $i \in \{1, \ldots, k-1\}$, the set

$$\{f_v(\alpha\pi_i + (1-\alpha)\pi_{i+1}) \mid \alpha \in [0,1]\}$$

forms a line segment.


Proof. We can define the policies $\pi_2, \ldots, \pi_{|\mathcal{S}|-1}$ in the following way:

$$\forall i \in \{2, \ldots, |\mathcal{S}|-1\}, \quad \begin{cases} \pi_i(\cdot \mid s_j) = \pi'(\cdot \mid s_j) & \text{if } s_j \in \{s_1, \ldots, s_{i-1}\} \\ \pi_i(\cdot \mid s_j) = \pi(\cdot \mid s_j) & \text{if } s_j \in \{s_i, \ldots, s_{|\mathcal{S}|}\}. \end{cases}$$

Therefore, two consecutive policies $\pi_i, \pi_{i+1}$ only differ on one state. We can apply Theorem 1 and thus conclude the proof.
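
The path of policies used in the proof can be written out explicitly. The sketch below continues the running numerical example (same P, r, rng, and value_function; all names are illustrative): it switches from one policy to the other one state at a time, so that consecutive policies differ on at most one state and, by Theorem 1, consecutive value functions are joined by line segments inside the polytope.

```python
# Continues the sketch above (same P, r, rng, n_states, n_actions, value_function).
def policy_path(pi_start, pi_end):
    """Intermediate policies that adopt pi_end's action distribution one state at a time."""
    path = [pi_start.copy()]
    for s in range(pi_start.shape[0]):
        nxt = path[-1].copy()
        nxt[s] = pi_end[s]
        path.append(nxt)
    return path

pi_a = rng.dirichlet(np.ones(n_actions), size=n_states)
pi_b = rng.dirichlet(np.ones(n_actions), size=n_states)
path = policy_path(pi_a, pi_b)

# Consecutive policies differ on at most one state, so by Theorem 1 consecutive
# value functions are joined by line segments inside the value polytope.
for p, q in zip(path[:-1], path[1:]):
    assert (np.abs(p - q).sum(axis=1) > 1e-12).sum() <= 1
values = [value_function(P, r, pi) for pi in path]
```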

Theorem 2. Consider the ensemble $Y^{\pi}_{s_1,\ldots,s_k}$ of policies that agree with $\pi$ on the states $\{s_1, \ldots, s_k\}$. Suppose that for all $s \notin \{s_1,\ldots,s_k\}$ and all $a \in \mathcal{A}$, there is no $\pi' \in Y^{\pi}_{s_1,\ldots,s_k} \cap D_{s,a}$ s.t. $f_v(\pi') = f_v(\pi)$; then $f_v(\pi)$ has a relative neighborhood in $\mathcal{V} \cap H^{\pi}_{s_1,\ldots,s_k}$.

Proof. We will prove the result by showing that $V^{\pi}$ lies on $|\mathcal{S}| - k$ linearly independent segments, by applying the line theorem to a policy $\bar{\pi}$ that has the same value function as $\pi$. We will then be able to conclude using the regularity of $f_v$.

We can find a policy $\bar{\pi} \in Y^{\pi}_{s_1,\ldots,s_k}$ that has the same value function as $\pi$ by applying Theorem 1 recursively on the states $\{s_{k+1}, \ldots, s_{|\mathcal{S}|}\}$, such that

$$\exists a_{k+1,l}, a_{k+1,u}, \ldots, a_{|\mathcal{S}|,l}, a_{|\mathcal{S}|,u} \in \mathcal{A},\ \forall i \in \{k+1, \ldots, |\mathcal{S}|\}, \quad \bar{\pi}(a_{i,l}|s_i) = 1 - \bar{\pi}(a_{i,u}|s_i) = \mu_i \in (0,1).$$

Note that $\mu_i \notin \{0, 1\}$ because we assumed that no $s$-deterministic policy has the same value function as $\pi$. We define $\bar{\mu} = (\mu_{k+1}, \ldots, \mu_{|\mathcal{S}|}) \in (0,1)^{|\mathcal{S}|-k}$ and the function $g : (0,1)^{|\mathcal{S}|-k} \to H^{\pi}_{s_1,\ldots,s_k}$ such that

$$g(\mu) = f_v(\pi_\mu), \quad \text{with} \quad \begin{cases} \pi_\mu(a_{i,l}|s_i) = 1 - \pi_\mu(a_{i,u}|s_i) = \mu_i & \text{if } i \in \{k+1, \ldots, |\mathcal{S}|\} \\ \pi_\mu(\cdot \mid s_i) = \pi(\cdot \mid s_i) & \text{otherwise.} \end{cases}$$

We have that:

1. $g$ is continuously differentiable;

2. $g(\bar{\mu}) = f_v(\bar{\pi}) = f_v(\pi)$;

3. $\frac{\partial g}{\partial \mu_i}$ is non-zero at $\bar{\mu}$ (Lemma 4, (iv));

4. $\frac{\partial g}{\partial \mu_i}$ is along the $i$-th column of $(I - \gamma P^{\bar{\pi}})^{-1}$ (Lemma 3).

Therefore, the Jacobian of $g$ is invertible at $\bar{\mu}$ since the columns of $(I - \gamma P^{\bar{\pi}})^{-1}$ are independent; therefore, by the inverse function theorem, there is a neighborhood of $g(\bar{\mu}) = f_v(\pi)$ in the image space, which gives the result.
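
The key step of the proof, invertibility of the Jacobian of $g$, can also be checked numerically. The sketch below continues the running example with $k = 0$ (no states held fixed); the choice of the two mixed actions per state and all names are assumptions of the sketch. It estimates the Jacobian of $\mu \mapsto V^{\pi_\mu}$ by finite differences, checks that it has full rank for a generic random MDP, and checks that each of its columns is parallel to the corresponding column of $(I - \gamma P^{\pi_\mu})^{-1}$, as Lemma 3 predicts.

```python
# Continues the sketch above (same P, r, rng, n_states, n_actions, value_function), with k = 0.
a_lo, a_hi = 0, 1                           # the two actions mixed at every state (arbitrary choice)

def mixed_policy(mu):
    """Policy putting probability mu_i on a_lo and 1 - mu_i on a_hi at state i."""
    pi = np.zeros((n_states, n_actions))
    pi[np.arange(n_states), a_lo] = mu
    pi[np.arange(n_states), a_hi] = 1.0 - mu
    return pi

mu_bar = rng.uniform(0.2, 0.8, size=n_states)            # an interior point of (0, 1)^{|S|}
eps = 1e-6
jac = np.zeros((n_states, n_states))
for i in range(n_states):                                 # central finite differences of mu -> V^{pi_mu}
    e = np.zeros(n_states); e[i] = eps
    jac[:, i] = (value_function(P, r, mixed_policy(mu_bar + e))
                 - value_function(P, r, mixed_policy(mu_bar - e))) / (2 * eps)

print("Jacobian rank:", np.linalg.matrix_rank(jac), "of", n_states)

# Each column should be parallel to the matching column of (I - gamma P^{pi_mu})^{-1} (Lemma 3);
# since those columns are linearly independent, the Jacobian is invertible for a generic MDP.
P_bar = np.einsum('sa,sat->st', mixed_policy(mu_bar), P)
C = np.linalg.inv(np.eye(n_states) - 0.9 * P_bar)         # gamma = 0.9, as in value_function
for i in range(n_states):
    cos = np.dot(jac[:, i], C[:, i]) / (np.linalg.norm(jac[:, i]) * np.linalg.norm(C[:, i]))
    assert abs(cos) > 1 - 1e-4
```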

Corollary 3. Consider a policy $\pi \in \mathcal{P}(\mathcal{A})^{\mathcal{S}}$, the states $s_1, \ldots, s_k \in \mathcal{S}$, and the ensemble $Y^{\pi}_{s_1,\ldots,s_k}$ of policies that agree with $\pi$ on $s_1, \ldots, s_k$. Define $\mathcal{V}_y = f_v(Y^{\pi}_{s_1,\ldots,s_k})$. Then the relative boundary of $\mathcal{V}_y$ in $H^{\pi}_{s_1,\ldots,s_k}$ is included in the value functions spanned by policies in $Y^{\pi}_{s_1,\ldots,s_k}$ that are $s$-deterministic for some $s \notin \{s_1, \ldots, s_k\}$:

$$\partial \mathcal{V}_y \subset \bigcup_{s \notin \{s_1,\ldots,s_k\}} \bigcup_{a \in \mathcal{A}} f_v\big(Y^{\pi}_{s_1,\ldots,s_k} \cap D_{s,a}\big),$$

where $\partial$ refers to the relative boundary in $H^{\pi}_{s_1,\ldots,s_k}$.

Proof. Let $V^{\pi'} \in \mathcal{V}_y$, where $\pi' \in Y^{\pi}_{s_1,\ldots,s_k}$; from Theorem 2, we have that

$$V^{\pi'} \notin \bigcup_{i=k+1}^{|\mathcal{S}|} \bigcup_{j=1}^{|\mathcal{A}|} f_v\big(Y^{\pi}_{s_1,\ldots,s_k} \cap D_{s_i,a_j}\big) \;\Rightarrow\; V^{\pi'} \in \operatorname{relint}(\mathcal{V}_y),$$


where $\operatorname{relint}$ refers to the relative interior in $H^{\pi}_{s_1,\ldots,s_k}$.

Therefore,

$$\partial \mathcal{V}_y \subset \bigcup_{i=k+1}^{|\mathcal{S}|} \bigcup_{j=1}^{|\mathcal{A}|} f_v\big(Y^{\pi}_{s_1,\ldots,s_k} \cap D_{s_i,a_j}\big).$$

Theorem 3. Consider a policy $\pi \in \mathcal{P}(\mathcal{A})^{\mathcal{S}}$, the states $s_1, \ldots, s_k \in \mathcal{S}$, and the ensemble $Y^{\pi}_{s_1,\ldots,s_k}$ of policies that agree with $\pi$ on $s_1, \ldots, s_k$. Then $f_v(Y^{\pi}_{s_1,\ldots,s_k})$ is a polytope and, in particular, $\mathcal{V} = f_v(Y^{\pi}_{\emptyset})$ is a polytope.

Proof. We prove the result by induction on the number of states $k$ on which policies agree. If $k = |\mathcal{S}|$, then $f_v(Y^{\pi}_{s_1,\ldots,s_k}) = \{f_v(\pi)\}$, which is a polytope.

Suppose that the result is true for $k+1$; let us show that it is still true for $k$.

Let $\pi \in \Pi$ and $s_1, \ldots, s_k \in \mathcal{S}$; define the ensemble $Y^{\pi}_{s_1,\ldots,s_k}$ of policies that agree with $\pi$ on $\{s_1, \ldots, s_k\}$ and $\mathcal{V}_y = f_v(Y^{\pi}_{s_1,\ldots,s_k})$. Using Lemma 3, we have that $\mathcal{V}_y = \mathcal{V} \cap H^{\pi}_{s_1,\ldots,s_k}$.

Now, using Corollary 3, we have that

$$\partial \mathcal{V}_y \subset \bigcup_{i=k+1}^{|\mathcal{S}|} \bigcup_{j=1}^{|\mathcal{A}|} f_v\big(Y^{\pi}_{s_1,\ldots,s_k} \cap D_{s_i,a_j}\big) = \bigcup_{i=k+1}^{|\mathcal{S}|} \bigcup_{j=1}^{|\mathcal{A}|} \mathcal{V}_y \cap H_{i,j},$$

where $\partial$ refers to the relative boundary in $H^{\pi}_{s_1,\ldots,s_k}$, and $H_{i,j}$ is an affine hyperplane of $H^{\pi}_{s_1,\ldots,s_k}$ (Lemma 3).

Therefore we have that:

1. $\mathcal{V}_y = \mathcal{V} \cap H^{\pi}_{s_1,\ldots,s_k}$ is closed, since it is an intersection of two closed sets;

2. $\partial \mathcal{V}_y \subset \bigcup_{i=k+1}^{|\mathcal{S}|} \bigcup_{j=1}^{|\mathcal{A}|} H_{i,j}$, a finite union of affine hyperplanes in $H^{\pi}_{s_1,\ldots,s_k}$;

3. $\mathcal{V}_y \cap H_{i,j} = f_v\big(Y^{\pi}_{s_1,\ldots,s_k} \cap D_{s_i,a_j}\big)$ is a polyhedron (induction hypothesis).

$\mathcal{V}_y$ thus satisfies (i), (ii), (iii) in Proposition 1, and therefore it is a polyhedron. Moreover, $\mathcal{V}_y$ is bounded, since $\mathcal{V}_y \subset \mathcal{V}$ and $\mathcal{V}$ is bounded. Therefore, $\mathcal{V}_y$ is a polytope.
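
Theorem 3 is what underlies the two-dimensional pictures of the value polytope used in the paper. A simple way to reproduce such a picture for an arbitrary small MDP (a self-contained sketch with a random two-state MDP of our own choosing, not one of the paper's figures) is to sample many policies, compute their value functions, and plot them together with the value functions of the deterministic policies, whose convex hull contains the polytope (Corollary 1).

```python
import numpy as np
import matplotlib.pyplot as plt
from itertools import product

rng = np.random.default_rng(1)
n_states, n_actions, gamma = 2, 2, 0.9
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))   # random transition kernel
r = rng.uniform(-1.0, 1.0, size=(n_states, n_actions))             # random rewards

def value_function(pi):
    P_pi = np.einsum('sa,sat->st', pi, P)
    r_pi = np.einsum('sa,sa->s', pi, r)
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

# Value functions of many random policies fill the polytope...
samples = np.array([value_function(rng.dirichlet(np.ones(n_actions), size=n_states))
                    for _ in range(5000)])
# ...and the deterministic policies' value functions contain its vertices (Corollary 1).
corners = np.array([value_function(np.eye(n_actions)[list(acts)])
                    for acts in product(range(n_actions), repeat=n_states)])

plt.scatter(samples[:, 0], samples[:, 1], s=2, alpha=0.3, label='random policies')
plt.scatter(corners[:, 0], corners[:, 1], color='red', label='deterministic policies')
plt.xlabel('V(s_0)'); plt.ylabel('V(s_1)'); plt.legend(); plt.show()
```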