
A General Safety Framework for Learning-Based Control in Uncertain Robotic Systems

Jaime F. Fisac*1, Anayo K. Akametalu*1, Melanie N. Zeilinger2, Shahab Kaynama3, Jeremy Gillula4, and Claire J. Tomlin1

Abstract—The proven efficacy of learning-based control schemes strongly motivates their application to robotic systems operating in the physical world. However, guaranteeing correct operation during the learning process is currently an unresolved issue, which is of vital importance in safety-critical systems.

We propose a general safety framework based on Hamilton-Jacobi reachability methods that can work in conjunction with an arbitrary learning algorithm. The method exploits approximate knowledge of the system dynamics to guarantee constraint satisfaction while minimally interfering with the learning process. We further introduce a Bayesian mechanism that refines the safety analysis as the system acquires new evidence, reducing initial conservativeness when appropriate while strengthening guarantees through real-time validation. The result is a least-restrictive, safety-preserving control law that intervenes only when (a) the computed safety guarantees require it, or (b) confidence in the computed guarantees decays in light of new observations.

We prove theoretical safety guarantees combining probabilistic and worst-case analysis and demonstrate the proposed framework experimentally on a quadrotor vehicle. Even though safety analysis is based on a simple point-mass model, the quadrotor successfully arrives at a suitable controller by policy-gradient reinforcement learning without ever crashing, and safely retracts away from a strong external disturbance introduced during flight.

I. INTRODUCTION

Learning-based methods in control and artificial intelligence are generating a considerable amount of excitement in the research community. The auspicious results of deep reinforcement learning schemes in virtual environments such as arcade videogames [37] and physics simulators [44] make these techniques extremely attractive for robotics applications, in which complex dynamics and hard-to-model environments limit the effectiveness of purely model-based approaches. However, the difficulty of interpreting the inner workings of

* The first two authors contributed equally.
1 Department of Electrical Engineering and Computer Sciences, University of California, Berkeley. Cory Hall, Berkeley, CA 94720, United States.
2 Department of Mechanical and Process Engineering, ETH Zurich. Rämistrasse 101, 8092 Zürich, Switzerland.
3 Clearpath Robotics. 1425 Strasburg Rd, Suite 2A, Kitchener, ON N2R 1H2, Canada.
4 Electronic Frontier Foundation. 815 Eddy St, San Francisco, CA 94109, United States.
{jfisac, kakametalu, tomlin}@eecs.berkeley.edu, [email protected], [email protected], [email protected]

This work is supported by the NSF CPS project ActionWebs under grant number 0931843, NSF CPS project FORCES under grant number 1239166, by ONR under the HUNT, SMARTS and Embedded Humans MURIs, and by AFOSR under the CHASE MURI. The research of J. F. Fisac has received funding from the "la Caixa" Foundation. The research of A. K. Akametalu has received funding from the NSF Bridge to Doctorate program. The research of M. N. Zeilinger has received funding from the EU FP7 (FP7/2007-2013) under grant agreement no. PIOFGA-2011-301436-COGENT.

(a) With online guarantee validation (b) Without online guarantee validation

Fig. 1: Hummingbird quadrotor learning a vertical flight policy under the requirement of not colliding. When the fan is turned on, the system experiences an unmodeled disturbance that it has not previously encountered. This can lead to a ground collision even under robust safety policies (right). The proposed Bayesian validation method detects the inconsistency and prevents the vehicle from entering the uncertain region (left).

many machine learning algorithms (notably in the case of deep neural networks) makes it challenging to make meaningful statements about the behavior of a system during the learning process, especially while the system has not yet converged to a suitable control policy. While this may not be a critical issue in a simulated reality, it can quickly become a limiting factor when attempting to put such an algorithm in control of a system in the physical world, where certain failures, such as collisions, can result in damage that would severely hinder or even terminate the learning process, in addition to material loss or human injury. We refer to systems in which certain failure states are unacceptable as safety-critical.

In the last decade, learning-based control schemes have been successfully demonstrated in robotics applications in which the safety-critical aspects were effectively removed or mitigated, typically by providing a manual fallback mechanism or retrofitting the environment to allow safe failure. In [1, 12] a trained pilot was able to remotely take over control of the autonomous helicopter at any time; the power slide car maneuvers in [31] were performed on an empty test track; and the aerobatic quadrotor in [33] was enclosed in a safety net. While mostly effective, these ad hoc methods tend to come


with their own issues (pilot handoffs, for instance, are notoriously prone to result in accidents [25]) and do not generalize well beyond the context of the particular demonstration. It therefore seems imperative to develop principled and provably correct approaches to safety, attuned to the exploration-intense needs of learning-based algorithms, that can be built into the autonomous operation of learning robotic systems.

Current efforts in policy transfer learning propose training an initial control policy in simulation and then carrying it over to the physical system [11]. While progress in this direction is likely to reduce overall training time, it does not eliminate the risk of catastrophic system misbehavior. State-of-the-art neural network policies have been shown to be vulnerable to small changes between training and testing conditions [26], which inevitably arise between simulated and real systems. Guaranteeing correct behavior of simulation-trained schemes in the real world thus remains an important unsolved problem.

Providing guarantees about a system's evolution inevitably requires some form of knowledge about the causal mechanisms that govern it. Fortunately, in practice it is never the case that the designer of a robotic system has no knowledge whatsoever of its dynamics: making use of approximate knowledge is both possible and, we argue, advantageous for safety. Yet, perfect knowledge of the dynamics can hardly if ever be safely assumed either. This motivates searching for points of rapprochement between data-driven and model-based techniques.

We identify three key properties that we believe any general safe learning framework should satisfy:

• High confidence. The framework should be able to keep the system safe with high probability given the available knowledge about the system and the environment.

• Modularity. The framework should work in conjunction with an arbitrary learning-based control algorithm, without requiring modifications to said algorithm.

• Minimal intervention. The framework should not interfere with the learning process unless deemed strictly necessary to ensure safety, and should return control to the learning algorithm as soon as possible.

We can use these criteria to evaluate the strengths and shortcomings of existing approaches to safety in intelligent systems, and place our work in the context of prior research.

Related Work

Early proposals of safe learning date back to the turn of the century. Lyapunov-based reinforcement learning [40] allowed a learning agent to switch between a number of pre-computed "base-level" controllers with desirable safety and performance properties; this enabled solid theoretical guarantees at the expense of substantially constraining the agent's behavior; in a similar spirit, later work has considered constraining policy search to the space of stabilizing controllers [43].

In risk-sensitive reinforcement learning [19], the expected return was heuristically weighted with the probability (risk) of reaching an "error state"; while this allowed for more general learning strategies, no guarantees could be derived from the heuristic effort. Nonetheless, the ideal problem formulation proposed in the paper, to maximize performance subject to some maximum allowable risk, inspired later work (see [18] for a survey) and is very much aligned with our own goals.

More recently, [38] proposed an ergodicity-based safe exploration policy for Markov decision processes (MDPs) with uncertain transition measures, which imposed a constraint on the probability, under the current belief, of being able to return to the starting state. While practical online methods for updating the system's belief on the transition dynamics are not discussed, and the toy grid-world demonstrations fall short of capturing the criticality of dynamics in many real-world safety problems, the probabilistic safety analysis is extremely powerful, and our work certainly takes inspiration from it. Later safe exploration methods have assumed fully known deterministic dynamics with uncertain safety features [46], or allowed for uncertain dynamics but restricted the class of constraints to local stability (i.e. region of attraction) [7].

The robust model-predictive control approach in [4] learns about system dynamics for performance only, while enforcing constraints based on a robust nominal model. The method was successfully demonstrated on problems with nontrivial dynamics, including quadrotor flight. However, using an a priori model for safety at best constrains the system's ability to explore, and at worst may fail to keep the real system safe.

To explicitly account for model uncertainty, the safety problem can be studied as a differential game [35], in which the controller must keep the system within the specified state constraints (i.e. away from failure states) in spite of the actions of an adversarial disturbance: the optimal solution to this reachability game, obtainable through Hamilton-Jacobi methods [36, 39], has been used to guarantee safety in a variety of engineering problems [10, 15, 24]. This robust, worst-case analysis determines a safe region in the state space and a control policy to remain inside it; a related approach involves ensuring invariance through "barrier functions" [3, 41, 45]. A key advantage is that in the interior of this safe set one can execute any desired action, as long as the safe control is applied at the boundary: in this sense, the technique yields a least-restrictive control law, which naturally lends itself to minimally constrained learning-based control. Initial work exploring this was presented in [22, 23].

The above methods are subject, however, to the fundamental limitation of any model-based safety analysis, namely, the contingency of all guarantees upon the validity of the model. This presents designers with a difficult tradeoff. On the one hand, if, in order to guarantee safety under large or poorly understood uncertainty, they assume conservative bounds on model error, this will reduce the computed safe set, and thereby restrict the freedom of the learning algorithm. If, on the other hand, the assumed bounds fail to fully capture the true evolution of the state, the theoretical guarantees derived from the model may not in fact apply to the real system.

Contribution

In this work we propose a novel general safety framework that combines model-based control-theoretical analysis with data-driven Bayesian inference to construct and maintain high-probability guarantees around an arbitrary learning-based control algorithm. Drawing on Hamilton-Jacobi robust optimal control techniques, it defines a least-restrictive supervisory control law, which allows the system to freely execute its learning-based policy almost everywhere, but imposes a computed action at states where it is deemed critical for safety. The safety analysis is refined through Bayesian inference in light of newly gathered evidence, both avoiding excessive conservativeness and improving reliability by rapidly imposing the computed safe actions if confidence in model-based guarantees decreases due to unexpected observations.

This paper consolidates the preliminary work presented in [2, 22, 23], extending the theoretical results to provide a unified treatment of model learning and guarantee validation, and presenting significant novel experimental results. To our knowledge this is the first work in the area of reachability analysis that reasons online about the validity of computed guarantees and uses a resilient mechanism to continue exploiting them under inaccurate prior assumptions on model error.

Our framework relies on reachability analysis for the model-based safety guarantees, and on Gaussian processes for the online Bayesian analysis. It is important to acknowledge that both of these techniques are computationally intensive and scale poorly with the dimensionality of the underlying continuous spaces, which can generally limit their applicability to complex dynamical systems. However, recent compositional approaches have dramatically increased the tractability of lightly coupled high-dimensional systems [8–10, 17, 27–29], while new analytic solutions entirely overcome the "curse of dimensionality" in some relevant cases [14, 30]. The key contribution of this work is in the principled methodology for incorporating safety into learning-based systems: we thus focus our examples on problems of low dimensionality, implicitly bypassing the computational issues, and note that our method can readily be used in conjunction with these decomposition techniques to extend its application to more complex systems.

We demonstrate our method on a quadrotor vehicle learning to track a vertical trajectory close to the ground (Fig. 1), using a policy gradient algorithm [32]. The reliability of our method is evidenced under uninformative policy initializations, inaccurate safe set estimation and strong unmodeled disturbances.

The remainder of the paper is organized as follows: In Section II we introduce the modeling framework and formally state the safe learning problem. Section III presents the differential game analysis and derives some important properties. The proposed methodology is described in Section IV with the proofs of its fundamental guarantees, as well as a computationally tractable alternative with weaker, but practically useful, properties. Lastly, in Section V we present the experimental results.

Notation

Throughout the paper, we use boldface symbols to denote time signals or trajectories. For any nonempty set $\mathcal{M} \subset \mathbb{R}^m$, $s_{\mathcal{M}} : \mathbb{R}^m \to \mathbb{R}$ denotes the signed distance function to $\mathcal{M}$:

$$s_{\mathcal{M}}(z) := \begin{cases} \inf_{y \in \mathcal{M}} |z - y|, & z \in \mathbb{R}^m \setminus \mathcal{M}, \\ -\inf_{y \in \mathbb{R}^m \setminus \mathcal{M}} |z - y|, & z \in \mathcal{M}, \end{cases}$$

where $|\cdot|$ denotes a norm on $\mathbb{R}^m$.
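As a concrete instance of this definition, the following sketch computes the signed distance for an axis-aligned box under the Euclidean norm; the box set and query points are illustrative assumptions, not part of the paper's setup.

```python
# Signed distance s_M(z) for an axis-aligned box M = [lo, hi]:
# positive outside M, negative inside, zero on the boundary.
import numpy as np

def signed_distance_box(z, lo, hi):
    z, lo, hi = (np.asarray(a, dtype=float) for a in (z, lo, hi))
    outside = np.maximum(np.maximum(lo - z, z - hi), 0.0)  # per-axis excess
    if outside.any():
        return np.linalg.norm(outside)          # distance from z to the box
    return -np.min(np.minimum(z - lo, hi - z))  # minus distance to nearest face

print(signed_distance_box([0.5, 0.5], [0.0, 0.0], [1.0, 1.0]))  # -0.5 (inside)
print(signed_distance_box([2.0, 0.5], [0.0, 0.0], [1.0, 1.0]))  #  1.0 (outside)
```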

II. PROBLEM FORMULATION

A. System Model and State-Dependent Uncertainty

The analysis in this paper considers a fully observable system whose underlying dynamics are assumed deterministic, but only partially known. This modeling framework can in practice be applied to a wide range of systems for which an approximate dynamic model is available but the exact behavior is hard to model a priori (due to manufacturing tolerances, aerodynamic effects, uncertain environments, etc.).

We can formalize this as a dynamical system with state $x \in \mathbb{R}^n$ and two inputs, $u \in \mathcal{U} \subset \mathbb{R}^{n_u}$ and $d \in \mathcal{D} \subset \mathbb{R}^{n_d}$ (with $\mathcal{U}$ and $\mathcal{D}$ compact), which we will refer to as the controller and the disturbance:

$$\dot{x} = f(x, u, d). \tag{1}$$

In this context, however, $d$ is thought of as a deterministic state-dependent disturbance capturing unmodeled dynamics, given by an unknown Lipschitz function $d : \mathbb{R}^n \to \mathcal{D}$. That is, we could in principle write the unknown dynamics as $F(x, u) = f(x, u, d(x))$. Unlike $F$, $f$ is a known function, with all uncertainty captured by $d(\cdot)$. The flow field $f : \mathbb{R}^n \times \mathcal{U} \times \mathcal{D} \to \mathbb{R}^n$ is assumed uniformly continuous and bounded, as well as Lipschitz in $x$ and $d$ for all $u$: this ensures that the unknown dynamics $F$ are Lipschitz in $x$.

Letting $\mathbb{U}$ and $\mathbb{D}$ denote the collections of measurable¹ functions $\mathbf{u} : [0,\infty) \to \mathcal{U}$ and $\mathbf{d} : [0,\infty) \to \mathcal{D}$ respectively, and allowing the controller and disturbance to choose any such signals, the evolution of the system from any initial state $x$ is determined (see for example [13], Ch. 2, Theorems 1.1, 2.1) by the unique continuous trajectory $\xi : [0,\infty) \to \mathbb{R}^n$ solving

$$\dot{\xi}(s) = f\big(\xi(s), \mathbf{u}(s), \mathbf{d}(s)\big), \quad \text{a.e. } s \ge 0, \qquad \xi(0) = x. \tag{2}$$

Note that this is a solution in Caratheodory's extended sense, that is, it satisfies the differential equation almost everywhere (i.e. except on a subset of Lebesgue measure zero).

Since $d(x)$ is unknown, we attempt to bound it at each state by a compact set $D(x) \subseteq \mathcal{D}$, which is allowed to vary in the state space. This can be a conservative a priori bound on the system disturbance, or can be obtained through any form of system identification procedure. In Section IV-B, we present a Bayesian approach to find a bound $D(x)$ based on a Gaussian process model of $d(x)$; alternative methods are possible as long as they meet some basic conditions to ensure that the dynamical system resulting from (2) with the restriction $d \in D(x)$ remains well defined. For the interested reader, sufficient conditions on $D(x)$ are discussed in the Appendix.

Any model-based safety guarantees for the system will require that the bound $D(x)$ correctly captures the unknown part of the dynamics given by $d(x)$, at least at some critical set of states $x$ (discussed in Section III). One key insight in this work is that the system should take action to ensure safety not only when the model predicts that this action may be

¹ A function $f : X \to Y$ between two measurable spaces $(X, \Sigma_X)$ and $(Y, \Sigma_Y)$ is said to be measurable if the preimage of a measurable set in $Y$ is a measurable set in $X$, that is: $\forall V \in \Sigma_Y,\ f^{-1}(V) \in \Sigma_X$, with $\Sigma_X, \Sigma_Y$ $\sigma$-algebras on $X, Y$.


necessary, but also when the system detects that the model itself may become unreliable in the predictable future.

We state here a preliminary result that will be useful later on in the paper, and introduce the notion of local model reliability.

Proposition 1. If $d(x) \in \mathrm{int}\, D(x)$ and the set-valued map $D : \mathbb{R}^n \to 2^{\mathcal{D}}$ is Lipschitz-continuous under the Hausdorff metric², then there exists $\Delta t > 0$ such that all possible trajectories followed by the system starting at $x$ will satisfy $d\big(\xi(\tau)\big) \in D\big(\xi(\tau)\big)$ for all $\tau \in [t, t + \Delta t]$.

Proof. Let $L_D$ be the Lipschitz (Hausdorff) constant of $D$, $L_d$ the Lipschitz constant of $d$, and $C_f$ a norm bound on the dynamics $f$. We then have that over an arbitrary time interval $[t, t + \Delta t]$, regardless of the control and disturbance signals $\mathbf{u}(\cdot), \mathbf{d}(\cdot)$, any system trajectory starting at $\xi(t) = x$ satisfies $|\xi(\tau) - x| \le C_f \Delta t$, $\forall \tau \in [t, t + \Delta t]$. This implies both $|d(\xi(\tau)) - d(x)| \le L_d C_f \Delta t$ and $d_H\big(D(\xi(\tau)), D(x)\big) \le L_D C_f \Delta t$. Requiring that the open ball $B\big(d(x), (L_d + L_D) C_f \Delta t\big)$ be contained in $D(x)$ ensures $d(\xi(\tau)) \in D(\xi(\tau))$. Since $d(x) \in \mathrm{int}\, D(x)$, there must exist a small enough $\Delta t > 0$ for which this condition is met.

Corollary 1. If the Lipschitz constants are known, then $d(x) \in \mathrm{int}\, D(x)$ implies $d(\xi(\tau)) \in D(\xi(\tau))$ for all times $\tau \in [t, t + \Delta t]$, with

$$\Delta t = \frac{-s_{D(x)}\big(d(x)\big)}{(L_d + L_D)\, C_f}.$$
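For illustration, the horizon from Corollary 1 is a one-line computation once the constants are known; all numeric values below are hypothetical.

```python
# Reliability horizon from Corollary 1.
def reliability_horizon(s_D_of_d, L_d, L_D, C_f):
    """Delta t over which d(xi(tau)) is guaranteed to remain inside D(xi(tau)).

    s_D_of_d is the signed distance s_{D(x)}(d(x)); it is negative whenever
    d(x) lies in the interior of D(x), which makes the horizon positive.
    """
    assert s_D_of_d < 0, "model must be locally reliable at x"
    return -s_D_of_d / ((L_d + L_D) * C_f)

print(reliability_horizon(s_D_of_d=-0.2, L_d=1.0, L_D=0.5, C_f=2.0))  # ~0.067
```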

The disturbance bounds $D$ derived in this paper satisfy the hypothesis of Proposition 1 (see Appendix for details), and we thus refer to the condition $d(x) \in \mathrm{int}\, D(x)$ as the model being locally reliable at $x$.

Finally, we assume that the effect of the disturbance on the dynamics is independent of the action applied by the controller:

$$\dot{x} = f\big(x, u, d(x)\big) = g(x, u) + h\big(d(x)\big), \tag{3}$$

with $g : \mathbb{R}^n \times \mathbb{R}^{n_u} \to \mathbb{R}^n$, $h : \mathbb{R}^{n_d} \to \mathbb{R}^n$, where $g$ and $h$ inherit Lipschitz continuity in their first argument from $f$ and $h$ is injective onto its image. This decoupling assumption, made for ease of exposition, is not strictly necessary, and the theoretical results in this paper can be easily adapted to the coupled case.

Throughout our analysis, we will use the notation $\xi^{\mathbf{u},\mathbf{d}}_{x,D}(\cdot)$ to denote the state trajectory corresponding to the initial condition $x \in \mathbb{R}^n$, the control signal $\mathbf{u} \in \mathbb{U}$ and the disturbance signal $\mathbf{d} \in \mathbb{D}$, subjecting the latter to satisfy $\mathbf{d}(t) \in D\big(\xi^{\mathbf{u},\mathbf{d}}_{x,D}(t)\big)$ for all $t \ge 0$.

B. State Constraints

A central element in our problem is the constraint set, which defines a region $\mathcal{K} \subseteq \mathbb{R}^n$ of the state space, typically the complement of all unacceptable failure states, where the system is required to remain throughout the learning process. This set is assumed closed and time-invariant; no further assumptions (boundedness, connectedness, convexity, etc.) are needed.

² The Hausdorff metric (or Hausdorff distance) between two sets $A, B$ in a metric space $(M, d_M)$ is defined as $d_H(A, B) = \max\{\, \sup_{a \in A} \inf_{b \in B} d_M(a, b),\ \sup_{b \in B} \inf_{a \in A} d_M(a, b) \,\}$.

From closedness, we can implicitly characterize $\mathcal{K}$ as the zero superlevel set of a Lipschitz surface function $l : \mathbb{R}^n \to \mathbb{R}$:

$$x \in \mathcal{K} \iff l(x) \ge 0. \tag{4}$$

This function always exists, since we can simply choose $l(x) = -s_{\mathcal{K}}(x)$, which is Lipschitz continuous by definition.

To express whether a given trajectory ever violates the constraints, let the functional $\mathcal{V} : \mathbb{R}^n \times \mathbb{U} \times \mathbb{D} \to \mathbb{R}$ assign to each initial state $x$ and input signals $\mathbf{u}(\cdot), \mathbf{d}(\cdot)$ the lowest value of $l(\cdot)$ achieved by trajectory $\xi^{\mathbf{u},\mathbf{d}}_{x,D}(\cdot)$ over all times $t \ge 0$:

$$\mathcal{V}\big(x, \mathbf{u}(\cdot), \mathbf{d}(\cdot)\big) := \inf_{t \ge 0}\ l\big(\xi^{\mathbf{u},\mathbf{d}}_{x,D}(t)\big). \tag{5}$$

This outcome $\mathcal{V}$ will be strictly smaller than zero if there exists any $t \in [0, \infty)$ at which the trajectory leaves the constraint set, and will be nonnegative if the system remains in the constraint set for all $t \ge 0$. Denoting $\mathcal{V}_{\mathbf{u},\mathbf{d}}(x) = \mathcal{V}\big(x, \mathbf{u}(\cdot), \mathbf{d}(\cdot)\big)$, the following statement follows from (4) and (5) by construction.

Proposition 2. The set of points $x$ from which the system trajectory $\xi^{\mathbf{u},\mathbf{d}}_{x,D}(\cdot)$ under given inputs $\mathbf{u}(\cdot) \in \mathbb{U}$, $\mathbf{d}(\cdot) \in \mathbb{D}$ will remain in the constraint set $\mathcal{K}$ at all times $t \ge 0$ is equal to the zero superlevel set of $\mathcal{V}_{\mathbf{u},\mathbf{d}}(\cdot)$:

$$\{x \in \mathbb{R}^n : \forall t \ge 0,\ \xi^{\mathbf{u},\mathbf{d}}_{x,D}(t) \in \mathcal{K}\} = \{x \in \mathbb{R}^n : \mathcal{V}_{\mathbf{u},\mathbf{d}}(x) \ge 0\}.$$
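As a numerical sanity check on these definitions, the outcome functional (5) can be approximated by integrating (2) forward and tracking the running minimum of $l$. The dynamics, input signals, and constraint function in this sketch are illustrative stand-ins.

```python
# Approximate V(x0, u, d) = inf_{t in [0, T]} l(xi(t)) by forward Euler.
import numpy as np

def rollout_outcome(f, l, x0, u_sig, d_sig, T=3.0, dt=1e-3):
    x = np.asarray(x0, dtype=float)
    worst = l(x)
    for k in range(int(T / dt)):
        x = x + dt * f(x, u_sig(k * dt), d_sig(k * dt))  # Euler step of (2)
        worst = min(worst, l(x))                         # running minimum of l
    return worst  # nonnegative iff the sampled trajectory stayed in K

# Example: double integrator that must stay above the ground, l(x) = height.
f = lambda x, u, d: np.array([x[1], u + d])
l = lambda x: x[0]
print(rollout_outcome(f, l, x0=[1.0, 0.0],
                      u_sig=lambda t: 0.5, d_sig=lambda t: -1.0))  # negative
```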

Guaranteeing safe evolution from a given point $x \in \mathbb{R}^n$ given an uncertainty bound $D$ requires determining whether there exists a control input $\mathbf{u}(\cdot) \in \mathbb{U}$ such that, for all disturbance inputs $\mathbf{d}(\cdot) \in \mathbb{D}$ satisfying $\mathbf{d}(t) \in D\big(\xi^{\mathbf{u},\mathbf{d}}_{x,D}(t)\big)$, the evolution of the system remains in $\mathcal{K}$, or equivalently $\mathcal{V}_{\mathbf{u},\mathbf{d}}(x) \ge 0$. In Section III, we review how to answer this question using differential game theory and state some important properties of the associated solution.

C. Objective: Safe Learning

Learning-based control aims to achieve desirable system behavior by autonomously improving a policy $\kappa_l : \mathbb{R}^n \to \mathcal{U}$, typically seeking to optimize an objective function. Safe learning additionally requires that certain constraints $\mathcal{K}$ remain satisfied while searching for such a policy. With full knowledge of the system dynamics $F(x, u) = f\big(x, u, d(x)\big)$, we would like to find a safe control policy $\kappa^* : \mathbb{R}^n \to \mathcal{U}$ producing trajectories $\xi(t) \in \mathcal{K}$, $\forall t \ge 0$, for the largest set of initial states $x = \xi(0)$, then restrict any learned policy so that $\kappa_l(x) = \kappa^*(x)$ wherever required to ensure safety. When $d(x)$ is not known exactly, however, this problem cannot be solved.

Instead, given an estimated disturbance set $D(x)$, we can find an inner approximation of the set of safe states by considering all the possible trajectories that can be produced under the bounded uncertainty $d(x) \in D(x)$. Our goal, then, is to find the set of robustly safe states $x$ for which there exists a control policy $\kappa^*$ that can keep the closed-loop system evolution in $\mathcal{K}$, and consistently limit $\kappa_l$ to ensure that $\kappa^*$ is applied when necessary. To formally state this, we introduce an important notion from robust control theory.


Definition 1. A subset $\mathcal{M} \subset \mathbb{R}^n$ is a robust controlled invariant set under uncertain dynamics $\dot{x} = f(x, u, d)$, $d \in D(x)$, if for every point $x \in \mathcal{M}$ there exists a feedback control policy $\kappa : \mathbb{R}^n \to \mathcal{U}$ such that, for all possible realizations, the closed-loop system trajectory satisfies $\xi(t) \in \mathcal{M}$ for all time $t \ge 0$.

Given that trajectories are continuous, the system state can only leave $\mathcal{M}$ by crossing its boundary $\partial\mathcal{M}$. Hence if $\mathcal{M}$ is closed, applying the feedback policy $\kappa(x)$ for $x \in \partial\mathcal{M}$ is enough to render $\mathcal{M}$ robust controlled invariant, allowing an arbitrary control action to be applied in the interior of $\mathcal{M}$.

Definition 2. The safe set $\Omega_D$ is the maximal robust controlled invariant set $\mathcal{M}$ under uncertain dynamics $\dot{x} = f(x, u, d)$, $d \in D(x)$, that is fully contained in the constraint set $\mathcal{K}$.

Our goal is clearly dependent on the estimated model of the system: a tighter bound $D(x)$ on $d(x)$ yields a less conservative safe set $\Omega_D$, which in turn reduces the restrictions on the learning process. However, an estimated bound that fails to fully capture $d(x)$ may allow the system to execute control actions resulting in a constraint violation. The disturbance bound should thus be as tight as possible, to allow the system greater freedom in learning, yet wide enough to confidently capture the unknown dynamics, in order to ensure safety.

In the following two sections, we formalize this tradeoff and propose a framework to reason about safety guarantees under uncertainty. Section III poses the safety problem as a differential game between the controller and an adversarial disturbance, presenting a stronger result than commonly used in the reachability safety literature, which exploits the entire value function of the game rather than only its zero level set. Section IV leverages this result to provide a principled approach to global safety under model uncertainty, as well as a fast local alternative that may often be useful in practice.

III. SAFETY AS A DIFFERENTIAL GAME

Following the classical approach in the literature [34, 35], the safety problem can be posed as a two-player zero-sum differential game between the system controller and the disturbance. Intuitively, we are requiring the controller to keep the system from violating the constraints for all possible disturbance inputs within a certain family: in particular, it must be able to do so for the worst possible disturbance in the set $D(x)$ for each visited state $x$. By conducting a worst-case analysis assuming an optimally adversarial disturbance, one is implicitly protecting the system against all "suboptimal" disturbances as well.

A. Game Formulation and Safe Set

To obtain a safe set of the state space from which the system can indefinitely be kept from violating the constraints, we formulate a game whose outcome is determined by the functional $\mathcal{V}\big(x, \mathbf{u}(\cdot), \mathbf{d}(\cdot)\big)$ introduced in (5) (which will be negative if the associated trajectory $\xi^{\mathbf{u},\mathbf{d}}_{x,D}(\cdot)$ ever leaves the constraints $\mathcal{K}$).

In the robust safety problem, the controller seeks to maximize the outcome of the game, while the disturbance tries to minimize it: that is, the disturbance is trying to drive the system out of the constraint set, and the controller wants to prevent it from succeeding. Following [16, 47], we define the set of nonanticipative strategies for the disturbance as

$$\mathcal{B} = \big\{\beta : \mathbb{U} \to \mathbb{D} \ \big|\ \forall t \ge 0,\ \forall \mathbf{u}(\cdot), \hat{\mathbf{u}}(\cdot) \in \mathbb{U},\ \big(\mathbf{u}(\tau) = \hat{\mathbf{u}}(\tau) \text{ a.e. } \tau \le t\big) \Rightarrow \big(\beta[\mathbf{u}](\tau) = \beta[\hat{\mathbf{u}}](\tau) \text{ a.e. } \tau \le t\big)\big\}.$$

Since the disturbance and the control inputs are decoupled in the system dynamics, Isaacs' minimax condition holds³ and the value of the game is well defined as:

$$V(x) := \inf_{\beta[\mathbf{u}](\cdot) \in \mathcal{B}}\ \sup_{\mathbf{u}(\cdot) \in \mathbb{U}}\ \mathcal{V}\big(x, \mathbf{u}(\cdot), \beta[\mathbf{u}](\cdot)\big). \tag{6}$$

We can then introduce the discriminating kernel from viability theory, for the infinite time horizon case that we are concerned with.

Definition 3. We say that a point $x$ is in the discriminating kernel $\mathrm{Disc}_D(\mathcal{K})$ of the constraints $\mathcal{K}$ if the system trajectory $\xi^{\mathbf{u},\mathbf{d}}_{x,D}$ starting at $x$, with both players acting optimally, remains in $\mathcal{K}$ for all time $t \ge 0$:

$$\mathrm{Disc}_D(\mathcal{K}) := \{x \in \mathbb{R}^n : \forall \beta(\cdot) \in \mathcal{B},\ \exists \mathbf{u}(\cdot) \in \mathbb{U},\ \forall t \ge 0,\ \xi^{\mathbf{u},\beta[\mathbf{u}]}_{x,D}(t) \in \mathcal{K}\}.$$

The following classical result follows from Proposition 2.

Proposition 3. The discriminating kernel of the constraint set $\mathcal{K}$ is the zero superlevel set of the value function $V$:

$$\mathrm{Disc}_D(\mathcal{K}) = \{x \in \mathbb{R}^n : V(x) \ge 0\}.$$

Remark 1. From Definitions 2 and 3, it can be seen that the safe set $\Omega_D$ is identical to the discriminating kernel $\mathrm{Disc}_D(\mathcal{K})$.

In the following section we will see how the value function and discriminating kernel can be computed by solving a Hamilton-Jacobi partial differential equation.

B. Hamilton-Jacobi-Isaacs Equation

It has been shown that the value function for minimum payoff games of the form presented in Section III-A (i.e. games in which the payoff is the minimum of a state function over time) can be characterized as the unique viscosity solution to a variational inequality involving an appropriate Hamiltonian [5, 6]; an alternative formulation involves a modified partial differential equation [35]. In a finite-horizon setting, with the game taking place over the compact time interval $[0, T]$, the value function $V(x, t)$ can be computed by solving the Hamilton-Jacobi-Isaacs (HJI) equation:

$$-\frac{\partial V}{\partial t}(x, t) = \min\Big\{0,\ \max_{u \in \mathcal{U}}\ \min_{d \in D(x)}\ \frac{\partial V}{\partial x}(x, t)\, f(x, u, d)\Big\}, \tag{7a}$$

$$V(x, T) = l(x). \tag{7b}$$

As long as there exists a nonempty safe set in the problem, $V(x, t)$ becomes independent of $t$ inside of this set as $T \to \infty$.

³ This means that we could have alternatively let the controller use nonanticipative strategies, without affecting the solution of the game.


We accordingly drop the dependence on $t$ and recover $V(x)$ as defined in (6), which we refer to as the safety function.
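To make the computation concrete, here is a deliberately schematic grid-based sketch of solving (7) for a double integrator. It uses central differences and naive time stepping, whereas practical solvers rely on upwind level-set schemes with careful stability control, so it should be read as an illustration of the update rule rather than a usable implementation; all parameter values are hypothetical.

```python
# Schematic solver for the variational inequality (7) on a double integrator
# z' = v, v' = u + d, with u in [-u_max, u_max], d in [-d_max, d_max], and
# constraint set K = {z >= 0} (clipped to the grid), so l(x) = z.
import numpy as np

u_max, d_max = 1.0, 0.3
z = np.linspace(0.0, 2.0, 101)
v = np.linspace(-2.0, 2.0, 101)
Z, Vel = np.meshgrid(z, v, indexing="ij")

V = Z.copy()                 # terminal condition (7b): V(x, T) = l(x)
dt = 0.005

for _ in range(2000):        # march backward in time until near-convergence
    dV_dz = np.gradient(V, z, axis=0)
    dV_dv = np.gradient(V, v, axis=1)
    # max_u min_d <dV/dx, f>: optimal inputs are bang-bang in sign(dV/dv),
    # giving the Hamiltonian the closed form below.
    ham = dV_dz * Vel + (u_max - d_max) * np.abs(dV_dv)
    V_new = V + dt * np.minimum(0.0, ham)   # (7a): the value can only decrease
    if np.max(np.abs(V_new - V)) < 1e-6:
        break
    V = V_new

safe_set = V >= 0            # estimate of the discriminating kernel (Prop. 3)
print(f"fraction of grid states deemed safe: {safe_set.mean():.2f}")
```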

Definition 4. The optimal safe policy $\kappa^*(\cdot)$ is the solution to the optimization⁴:

$$\kappa^*(x) = \arg\max_{u \in \mathcal{U}}\ \min_{d \in D(x)}\ \frac{\partial V}{\partial x}(x)\, f(x, u, d).$$

Policy $\kappa^*(x)$ attempts to drive the system to the safest possible state, always assuming an adversarial disturbance. If the disturbance bound $D(x)$ is correct everywhere, then one can allow the system to execute any desired control while in the interior of $\Omega_D$, as long as the safety preserving action $\kappa^*(x)$ is taken whenever the state reaches the boundary $\partial\Omega_D$; the system is then guaranteed to remain inside $\Omega_D$ for all time. This least-restrictive control law can be used in conjunction with an arbitrary learning-based control policy $\kappa_l(x)$ (which may be repeatedly updated by the corresponding learning algorithm), to produce a safe learning policy:

$$\kappa(x) = \begin{cases} \kappa_l(x), & \text{if } V(x) > 0, \\ \kappa^*(x), & \text{otherwise}. \end{cases} \tag{8}$$
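A minimal sketch of the switching law (8) is given below. The value function and its gradient are assumed to be interpolated from the numerical reachability computation, and the candidate input sets are discretized stand-ins for the continuous $\mathcal{U}$ and $D(x)$; all names here are hypothetical.

```python
# Least-restrictive supervisory law (8) over discretized input sets.
import numpy as np

def safe_policy(x, grad_V, f, U_candidates, D_bound):
    """kappa*(x): control maximizing worst-case alignment of f with grad V."""
    def worst_case(u):
        return min(np.dot(grad_V(x), f(x, u, d)) for d in D_bound(x))
    return max(U_candidates, key=worst_case)   # discretized max-min of Def. 4

def supervised_action(x, V, kappa_learn, grad_V, f, U_candidates, D_bound):
    if V(x) > 0:
        return kappa_learn(x)                  # learner acts freely inside
    return safe_policy(x, grad_V, f, U_candidates, D_bound)  # override
```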

C. Invariance Properties of Level Sets

Traditionally, the implicit hypothesis made to guarantee safety using a least-restrictive law in the form of (8) has been correctness of the estimated disturbance bound $D$ everywhere in the state space (i.e. $d(x) \in D(x)\ \forall x \in \mathbb{R}^n$), or at least everywhere in the constraint set $\mathcal{K}$ [23, 35]. We will now argue that the necessary hypothesis for safety is in fact much less stringent, by proving an important result that we will use in the following section to strengthen the proposed safety framework and retain safety guarantees under partially incorrect models.

Proposition 4. Any nonnegative superlevel set of $V(x)$ is a robust controlled invariant set with respect to $d \in D(x)$.

Proof. By Lipschitz continuity of $f$ and $l$, we have that $V$ is Lipschitz continuous [16] and hence, by Rademacher's theorem, almost everywhere differentiable. The convergence of $V(x, t)$ to $V(x)$ as $T \to \infty$ implies that at the limit $\frac{\partial V}{\partial t}(x, t) = 0$. Therefore, given any $\alpha \ge 0$, for any point $x \in \{x \mid V(x) \ge \alpha\}$ there must exist a control action $u^*$ such that $\forall d \in D(x)$, $\frac{\partial V}{\partial x}(x) f(x, u^*, d) \ge 0$; otherwise the right hand side of (7a) would be strictly negative for $T \to \infty$, contradicting convergence. Then, the value of $V$ from any such state $x$ can always be kept from decreasing, so $\{x \mid V(x) \ge \alpha\}$ is a robust controlled invariant set with respect to $d \in D(x)$.

Proposition 5. Consider two disturbance sets $D_1(x)$ and $D_2(x)$, and a closed set $\mathcal{M} \subset \mathbb{R}^n$ that is robustly controlled invariant under $D_1(x)$. If $D_2(x) \subseteq D_1(x)\ \forall x \in \partial\mathcal{M}$, then $\mathcal{M}$ is robustly controlled invariant also under $D_2(x)$.

⁴ Typically the optimal solution will be a singleton, although in general the arg max need not be unique, which leads to $\kappa^* : \mathbb{R}^n \to 2^{\mathcal{U}}$. However, since we can always choose one element of the set arbitrarily, we will assume for simplicity a policy $\kappa^* : \mathbb{R}^n \to \mathcal{U}$, that is, a unique mapping from states to control inputs.

Proof. Consider an arbitrary trajectory $\xi^{\mathbf{u},\mathbf{d}}_{x_0,D_2} \in \mathcal{X}_{D_2}$ under the disturbance set $D_2(x)$, starting at $x_0 \in \mathcal{M}$, such that for some $\tau < \infty$, $\xi(\tau) \notin \mathcal{M}$. Since trajectories in $\mathcal{X}_{D_2}$ are continuous, there must then exist $s \in [t_0, \tau]$ such that $\xi^{\mathbf{u},\mathbf{d}}_{x_0,D_2}(s) \in \partial\mathcal{M}$. On the other hand, because $\mathcal{M}$ is robustly controlled invariant under $D_1(x)$, we know that there exists $\kappa : \mathbb{R}^n \to \mathcal{U}$ such that no possible disturbance $d \in D_1(x)$ can drive the system out of $\mathcal{M}$. Since $D_2(x) \subseteq D_1(x)\ \forall x \in \partial\mathcal{M}$, the same control policy $\kappa(x)$ on the boundary guarantees that no disturbance $d \in D_2(x) \subseteq D_1(x)$ can drive the system out of $\mathcal{M}$. Hence, for $\xi^{\mathbf{u},\mathbf{d}}_{x_0,D_2}$, switching to policy $\kappa$ at time $s$ guarantees that the system will remain in $\mathcal{M}$. Therefore $\mathcal{M}$ is a robust controlled invariant set under $D_2(x)$.

Corollary 2. Let $Q_\alpha = \{x \in \mathbb{R}^n : V(x) = \alpha\}$ with $\alpha \ge 0$ be any nonnegative level set of the safety function $V$, computed for some disturbance set $D(x)$. If $d(x) \in D(x)\ \forall x \in Q_\alpha$, then the superlevel set $\{x \in \mathbb{R}^n : V(x) \ge \alpha\}$ is an invariant set under the computed optimal safe control policy $\kappa^*(x)$.

This corollary, which follows from Propositions 4 and 5 by considering the singleton $\{d(x)\}$, is an important result that will be at the core of our data-driven safety enhancement. It provides a sufficient condition for safety, but unlike the standard HJI solution, it does not readily prescribe a least-restrictive control law to exploit it: how should one determine what candidate $\alpha \ge 0$ to choose, or whether a valid $Q_\alpha$ exists at all? Deciding when the safe controller should intervene and what guarantees are possible is nontrivial and requires additional analysis.

The next section proposes a Bayesian approach enabling the safety controller to reason about its confidence in the model-based guarantees described in this section. If this confidence reaches a prescribed minimum value in light of the observed data, the controller can intervene early to ensure that safety will be maintained with high probability.

IV. BAYESIAN SAFETY ANALYSIS

A. Learning-Based Safe Learning

As we have seen, robust optimal control and dynamic game theory provide powerful analytical tools to study the safety of a dynamical model. However, it is important to realize that the applicability of any theoretically derived guarantee to the real system is contingent upon the validity of the underlying modeling assumptions; in the formulation considered here, this amounts to the state disturbance function $d(x)$ being captured by the bound $D(x)$ on at least a certain subset of the state space. The system designer therefore faces an inevitable tradeoff between risk and conservativeness, due to the impossibility of accounting for every aspect of the real system in a tractable model.

In many cases, choosing a parametric model a priori forces one to become overly conservative in order to ensure that the system behavior will be adequately captured: this results in a large bound $D(x)$ on the disturbance, which typically leads to a small safe set $\Omega_D$, limiting the learning agent's ability to explore and perform the assigned tasks. In other cases, insufficient caution in the definition of the model can lead to an estimated disturbance set $D(x)$ that fails to contain the actual model error $d(x)$, and therefore the computed safe set $\Omega_D$ may not in fact be controlled invariant in practice, which can end all safety guarantees.

In order to avoid excessive conservativeness and keep theoretical guarantees valid, it is imperative to have both a principled method for refining one's model with acquired measurements and a reliable mechanism to detect and react to model discrepancies with the real system's behavior; both of these elements are necessarily data-driven. We thus arrive at what is perhaps the most important insight in this work: the relation between safety and learning is reciprocal. Not only is safety a key requirement for learning in autonomous systems; learning about the real system's behavior is itself indispensable to provide practical safety guarantees.

In the remainder of this section we propose a method for reasoning about the uncertain system dynamics using Gaussian processes, and introduce a Bayesian approach for online validation of model-based guarantees. We then define an adaptive safety control strategy based on this real-time validation, which leverages the theoretical results from Hamilton-Jacobi analysis to provide stronger guarantees for safe learning under possible model inaccuracies.

B. Gaussian Process

To estimate the disturbance function $d(x)$ over the state space, we model it as being drawn from a Gaussian process. Gaussian processes are a powerful abstraction that extends multivariate Gaussian regression to the infinite-dimensional space of functions, allowing Bayesian inference based on (possibly noisy) observations of a function's value at finitely many points⁵.

A Gaussian process is a random process or field defined by a mean function $\mu : \mathbb{R}^n \to \mathbb{R}$ and a positive semidefinite covariance kernel function $k : \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}$. We will treat each component $d^j$, $j \in \{1, \dots, n_d\}$, of the disturbance function as an independent Gaussian process:

$$d^j(x) \sim \mathcal{GP}\big(\mu^j(x), k^j(x, x')\big). \tag{9}$$

A defining characteristic of a Gaussian process is that the marginal probability distribution of the function value at any finite number of points is a multivariate Gaussian. This will allow us to obtain the disturbance bound $D(x)$ as a Cartesian product of confidence intervals for the components of $d(x)$ at each state $x$, choosing the bound to capture a desired degree of confidence.

Gaussian processes allow incorporating new observations in a nonparametric Bayesian setting. First, assume a prior Gaussian process distribution over the $j$-th component of $d(\cdot)$, with mean $\mu^j(\cdot)$ and covariance kernel $k^j(\cdot, \cdot)$. The class of the prior mean function and covariance kernel function is chosen to capture the characteristics of the model (linearity, periodicity, etc.), and is associated to a set of hyperparameters $\theta_p$. These are typically set to maximize the marginal likelihood of an available set of training data, or possibly to reflect some prior belief about the system.

⁵ We give here an overview of Gaussian process regression and direct the interested reader to [42] for a more comprehensive introduction.

Next, consider $N$ measurements $\mathbf{d}^j = [d^j_1, \dots, d^j_N]$, observed with independent Gaussian noise $\varepsilon^j_i \sim \mathcal{N}(0, (\sigma^j_n)^2)$ at the points $X = [x_1, \dots, x_N]$, i.e. $d^j_i = d^j(x_i) + \varepsilon^j_i$. Combined with the prior distribution (9), this new evidence induces a Gaussian process posterior; in particular, the value of $d^j$ at finitely many points $X_*$ is distributed as a multivariate normal:

$$\mathbb{E}[d^j(X_*) \mid \mathbf{d}^j, X] = \mu^j(X_*) + K^j(X_*, X)\big(K^j(X, X) + (\sigma^j_n)^2 I\big)^{-1}\big(\mathbf{d}^j - \mu^j(X)\big), \tag{10a}$$

$$\mathrm{cov}[d^j(X_*) \mid X] = K^j(X_*, X_*) - K^j(X_*, X)\big(K^j(X, X) + (\sigma^j_n)^2 I\big)^{-1} K^j(X, X_*), \tag{10b}$$

where $d^j_i(X) = d^j(x_i)$, $\mu^j_i(X) = \mu^j(x_i)$, and for any $X, X'$ the matrix $K^j(X, X')$ is defined component-wise as $K^j_{ik}(X, X') = k^j(x_i, x'_k)$. If a single query point is considered, i.e. $X_* = \{x_*\}$, the marginalized Gaussian process posterior becomes a univariate normal distribution quantifying both the expected value of the disturbance function, $\hat{d}^j(x_*)$, and the uncertainty of this estimate, $\big(\sigma^j(x_*)\big)^2$:

$$\hat{d}^j(x_*) = \mathbb{E}[d^j(x_*) \mid \mathbf{d}^j, X], \tag{11a}$$

$$\big(\sigma^j(x_*)\big)^2 = \mathrm{cov}[d^j(x_*) \mid X]. \tag{11b}$$
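The posterior (10) is a few lines of linear algebra. The sketch below implements it for one disturbance component, with a squared-exponential kernel and zero prior mean as illustrative choices; hyperparameters are assumed to have been fit already, e.g. by marginal-likelihood maximization.

```python
# Gaussian process posterior (10) for one disturbance component.
import numpy as np

def sq_exp_kernel(A, B, length=0.5, sig_f=1.0):
    """k(x, x') = sig_f^2 exp(-|x - x'|^2 / (2 length^2)), rows of A vs rows of B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return sig_f ** 2 * np.exp(-0.5 * d2 / length ** 2)

def gp_posterior(X, d_obs, X_star, sig_n=0.05):
    """Posterior mean (10a) and covariance (10b) at query points X_star."""
    K = sq_exp_kernel(X, X) + sig_n ** 2 * np.eye(len(X))
    K_star = sq_exp_kernel(X_star, X)
    alpha = np.linalg.solve(K, d_obs)          # zero prior mean assumed
    mean = K_star @ alpha
    cov = sq_exp_kernel(X_star, X_star) - K_star @ np.linalg.solve(K, K_star.T)
    return mean, cov

X = np.array([[0.0], [0.5], [1.0]])     # states where residuals were observed
d_obs = np.array([0.10, -0.05, 0.20])   # disturbance residuals at those states
mean, cov = gp_posterior(X, d_obs, X_star=np.array([[0.25], [0.75]]))
print(mean, np.sqrt(np.diag(cov)))      # d_hat^j(x_*) and sigma^j(x_*) as in (11)
```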

We can use the Bayesian machinery of Gaussian process regression to compute a likely bound $D(x)$ on the disturbance function $d(x)$ based on the history of commanded inputs $u_i$ and state measurements $x_i$, $i \in \{1, \dots, N\}$. To this effect, we assume that a method for approximately measuring the state derivatives is available (e.g. by numerical differentiation), and denote each of these measurements by $\hat{f}_i$. Based on (3), we can obtain measurements of $d(x_i)$ from the residuals between the observed dynamics and the model's prediction:

$$d(x_i) = h^{-1}\big(\hat{f}_i - g(x_i, u_i)\big). \tag{12}$$

The residuals $\mathbf{d} = [d(x_1), \dots, d(x_N)]$ are processed through (10) to infer the marginal distribution of $d(x_*)$ for an arbitrary point $x_*$, specified by the expected value $\hat{d}^j(x_*)$ and the standard deviation $\sigma^j(x_*)$ of each component of the disturbance. This distribution can be used to construct a disturbance set $D(x_*) \subseteq \mathcal{D}$ at any point $x_*$; in practice, this will be done at finitely many points $x_i$ on a grid, and used in the numerical reachability computation to obtain the safety function and the safe control policy.
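A sketch of forming these residual measurements from logged data, under the additional simplifying assumption that $h$ in (3) is the identity on the disturbance channel:

```python
# Residual measurements (12) from logged states and inputs; state derivatives
# are estimated by finite differences, as the text suggests.
import numpy as np

def disturbance_residuals(xs, us, g, dt):
    """d(x_i) = f_hat_i - g(x_i, u_i), with f_hat_i from numerical differentiation."""
    xs = np.asarray(xs, dtype=float)
    f_hat = np.gradient(xs, dt, axis=0)   # approximate state derivatives
    return np.array([fi - g(xi, ui) for fi, xi, ui in zip(f_hat, xs, us)])
```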

We now introduce the design parameter $p$ as the desired marginal probability that the disturbance function $d(x)$ will belong to the bound $D(x)$ at each point $x$; typically, $p$ should be chosen to be close to 1. The set $D(x)$ is then chosen for each $x$ as follows. Let $z = \sqrt{2}\, \mathrm{erf}^{-1}\big(p^{1/n_d}\big)$, where $\mathrm{erf}(\cdot)$ denotes the Gauss error function; that is, define $z$ so that the probability that a sample from a standard normal distribution $\mathcal{N}(0, 1)$ lies within $[-z, z]$ is $p^{1/n_d}$. We construct $D(x)$ by taking a Cartesian product of confidence intervals:

$$D(x) = \prod_{j=1}^{n_d} \big[\hat{d}^j(x) - z\sigma^j(x),\ \hat{d}^j(x) + z\sigma^j(x)\big]. \tag{13}$$


Since each component $d^j(x)$ is given by an independent Gaussian $\mathcal{N}\big(\hat{d}^j(x), (\sigma^j(x))^2\big)$, the probability of $d(x)$ lying within the above hyperrectangle is by construction $\big(p^{1/n_d}\big)^{n_d} = p$.
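Constructing (13) reduces to one quantile computation per state; for instance (with illustrative numeric values):

```python
# Confidence hyperrectangle D(x) from (13).
import numpy as np
from scipy.special import erfinv

def disturbance_bound(d_mean, d_std, p=0.95):
    """Hyperrectangle D(x) with P(d(x) in D(x)) = p by construction."""
    d_mean, d_std = np.asarray(d_mean), np.asarray(d_std)
    z = np.sqrt(2.0) * erfinv(p ** (1.0 / d_mean.size))  # per-component quantile
    return d_mean - z * d_std, d_mean + z * d_std        # lower, upper corners

lo, hi = disturbance_bound(d_mean=[0.10, -0.05], d_std=[0.02, 0.03])
print(lo, hi)
```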

Remark 2. It is commonplace to use Gaussian distributions to capture beliefs on variables that are otherwise known to be bounded. While one might object that the unbounded support of (11) contradicts our problem formulation (in which the disturbance $d$ took values from some compact set $\mathcal{D} \subset \mathbb{R}^{n_d}$), the hyperrectangle $D(x)$ in (13) is always a compact set. Note that the theoretical input set $\mathcal{D}$ is never needed in practice, so it can always be assumed to contain $D(x)$ for all $x$.

Under Lipschitz continuous prior means $\mu^j$ and covariance kernels $k^j$, the disturbance bound (13) varies (Hausdorff) Lipschitz-continuously in $x$, satisfying the hypotheses of Proposition 1. This is formalized and proved in the Appendix.

The safety analysis described in Section III can be carried out by solving the Hamilton-Jacobi equation (7) for $D(x)$ given by (13), which, based on the information available at the time of computation, will be a correct disturbance bound at any single state $x$ with probability $p$. As the system goes on to gather new information, however, the posterior probability for $d(x) \in D(x)$ will change at each $x$ (and will typically no longer equal $p$). More generally, we have the following result.

Proposition 6. Let $q$ be the probability that $d(x) \in D(x)$ for a given state $x$ with $V(x) \ge 0$. Then the probability that $\mathcal{L}^*_f V(x) := D_x V(x) \cdot f\big(x, \kappa^*(x), d(x)\big) \ge 0$ is at least $q$.

Proof. Omitting $x$ for conciseness, we have:

$$P\big(\mathcal{L}^*_f V \ge 0\big) = P\big(\mathcal{L}^*_f V \ge 0 \mid d \in D\big)\, P\big(d \in D\big) + P\big(\mathcal{L}^*_f V \ge 0 \mid d \notin D\big)\, P\big(d \notin D\big).$$

By Corollary 2, the first term evaluates to $1 \cdot q$; the second term is nonnegative (and will typically be positive, since not all values of $d \notin D$ will be unfavorable for safety, and there may be some for which the input $\kappa^*(x)$ leads the system to locally increase $V$).

Based on this result, we can begin to reason about the guarantees of the reachability analysis applied to the real system in a Bayesian framework, inherited from the Gaussian process model.

C. Online Safety Guarantee Validation

In order to ensure safety under the possibility of model mismatch with the real system, it may become necessary to intervene not only on the boundary of the computed safe set, but also whenever the observed evolution of the system indicates that the model-based safety guarantees may lose validity. Indeed, failure to take a safe action in time may lead to complete loss of guarantees if the system enters a region of the state space where the model is consistently incorrect.

While the estimated model $D$ and the associated safety guarantees should certainly be recomputed as frequently as possible in the light of new evidence, this process can typically take seconds or minutes, and in some cases may even require offline computation. This motivates the need to augment model-based guarantees through a data-driven safety mechanism while new, improved guarantees are computed.

Bayesian analysis allows us to update our belief on the disturbance function as new observations are obtained. This in turn can be used to provide a probabilistic guarantee on the validity of the safety results obtained from the robust dynamical model generated from the older observations. In the remainder of this section, we will discuss how to update the belief on the disturbance function, and then provide two different theoretical criteria for safety intervention. The first criterion provides global probabilistic guarantees, but has computational challenges associated to its practical implementation. The alternative method only provides a local guarantee, but can more easily be applied in real time.

Let us denote by $X_{\text{old}}$ and $\mathbf{d}^j_{\text{old}}$ the evidence used in computing the disturbance set $D(x)$, and by $X_{\text{new}}$ and $\mathbf{d}^j_{\text{new}}$ the evidence acquired online after the disturbance set is computed. Conditioned on the old evidence, the function $d^j(x)$ is normally distributed with mean and variance given by (10) with $X = X_{\text{old}}$ and $\mathbf{d}^j = \mathbf{d}^j_{\text{old}}$, and the disturbance set is given by (13). If we also condition on the new evidence, then the mean and variance are updated by modifying (10) with $X = [X_{\text{old}}, X_{\text{new}}]$ and $\mathbf{d}^j = [\mathbf{d}^j_{\text{old}}, \mathbf{d}^j_{\text{new}}]$.

Remark 3. Performing the update requires inverting $K^j([X_{\text{old}}, X_{\text{new}}], [X_{\text{old}}, X_{\text{new}}])$. This can be done efficiently employing standard techniques: since $K^j(X_{\text{old}}, X_{\text{old}})$ has already been inverted (in order to compute the disturbance bound $D$), all that is needed is inverting the Schur complement of $K^j(X_{\text{old}}, X_{\text{old}})$ in $K^j([X_{\text{old}}, X_{\text{new}}], [X_{\text{old}}, X_{\text{new}}])$, which has the same size as $K^j(X_{\text{new}}, X_{\text{new}})$.
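A sketch of this block-inverse update, with a small numerical check against direct inversion; the kernel blocks are assumed given.

```python
# Blockwise inverse via the Schur complement: reuse the stored inverse of
# K(X_old, X_old) and invert only a matrix the size of K(X_new, X_new).
import numpy as np

def blockwise_inverse(A_inv, B, C):
    """Inverse of [[A, B], [B^T, C]] given A^{-1}."""
    S = C - B.T @ A_inv @ B                  # Schur complement of A
    S_inv = np.linalg.inv(S)                 # only a new-data-sized inverse
    TL = A_inv + A_inv @ B @ S_inv @ B.T @ A_inv
    TR = -A_inv @ B @ S_inv
    return np.block([[TL, TR], [TR.T, S_inv]])

# Quick check against direct inversion on a random positive definite matrix.
rng = np.random.default_rng(0)
A = np.eye(4) + 0.1 * rng.standard_normal((4, 4)); A = A @ A.T
B = 0.1 * rng.standard_normal((4, 2)); C = np.eye(2)
M = np.block([[A, B], [B.T, C]])
assert np.allclose(blockwise_inverse(np.linalg.inv(A), B, C), np.linalg.inv(M))
```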

Based on the new Gaussian distribution, we can reason about the posterior confidence in the safety guarantees produced by our original safety analysis, which relied on the prior Gaussian distribution resulting from measurements $\mathbf{d}^j_{\text{old}}$ at states $X_{\text{old}}$.

1) Global Bayesian safety analysis: The strongest result available for guaranteeing safety under the present framework is Corollary 2, which allows the system to exploit any superzero level set $Q_\alpha$ ($\alpha \ge 0$) of the safety function $V$ throughout which the model is locally correct; all that is needed is for such a $Q_\alpha$ to exist for some $\alpha \in [0, V(x)]$ given the current state $x$.

It is possible to devise a safety policy to fully exploit the sufficient condition in Corollary 2 in a Bayesian setting: if the posterior probability that the corollary's hypotheses will hold drops to some arbitrary global confidence threshold $\gamma_0$, the safe controller can override the learning agent. With probability $\gamma_0$, the corollary will still apply, in which case the system is guaranteed to remain safe for all time; even if Corollary 2 does not apply at this time (which could happen with probability $1 - \gamma_0$), it is still possible that the disturbance $d(x)$ will not consistently take adversarial values that force the computed safety function $V(x)$ to decrease, in which case the system may still evolve safely. Therefore, this policy guarantees a lower bound on the probability of maintaining safety for all time.

In order to apply this safety criterion, the autonomous system needs to maintain a Bayesian posterior of the sufficient condition in Corollary 2. We refer to this posterior probability as the global safety confidence $\gamma(x; X, \mathbf{d}^j)$, or $\gamma(x)$ for conciseness:

$$\gamma(x; X, \mathbf{d}^j) := P\big(\exists \alpha \in [0, V(x)],\ \forall \tilde{x} \in Q_\alpha : d(\tilde{x}) \in D(\tilde{x}) \mid X, \mathbf{d}^j\big). \tag{14}$$

Based on this, we propose the following least-restrictive law for the control strategy:

$$\kappa(x) = \begin{cases} \kappa_l(x), & \text{if } \big(\gamma(x) > \gamma_0\big) \wedge \big(V(x) > 0\big), \\ \kappa^*(x), & \text{otherwise}, \end{cases} \tag{15}$$

so the system applies any action it desires if the global safety confidence is above the threshold, but applies the safe controller once this is no longer the case.

Note that if confidence in the safety guarantees is restored after applying the safety action, the learning algorithm will be allowed to resume control of the system. This can happen by multiple mechanisms: moving to a region with higher $V(x)$ will tend to increase the probability that some lower level set may satisfy the hypotheses of Corollary 2; moving to a region with less inconsistency between expected and observed dynamics will typically lead to higher posterior belief that nearby level sets will satisfy the hypotheses of Corollary 2; and generally acquiring new data may, in some cases, increase the posterior confidence that Corollary 2 may apply.

Computing the joint probability that the bound $D(x)$ captures the Gaussian process $d(x)$ throughout a level set $Q_\alpha$ is not generally possible, since the Gaussian process posterior over a continuum of points does not have a closed form. Similarly, evaluating the joint probability for a continuum of level sets $Q_\alpha$ for $\alpha \in [0, V(x)]$ is not feasible. However, it is possible to approximate the sought probability $\gamma(x)$ by its marginal distribution over a sufficiently dense set of sample points on each $Q_\alpha$ and a sufficiently dense collection of level sets between 0 and $V(x)$.

One can then use numerical methods [20, 21] to compute the multivariate normal cumulative distribution function and estimate the marginal probability

$$\gamma(x) \approx P\bigg( \bigvee_{s=1}^{S}\ \bigwedge_{i=1}^{I}\ d(x_{s,i}) \in D(x_{s,i}) \bigg), \tag{16}$$

over $S$ level sets $0 = \alpha_0 < \dots < \alpha_S = V(x)$ and $I$ sample points from each level set $Q_{\alpha_s}$. As the density of samples increases with larger $S$ and $I$, the marginal probability (16) asymptotically approaches the Gaussian process probability (14). Unfortunately, however, current numerical methods can only efficiently approximate these probabilities for multivariate Gaussians of about 25 dimensions [20, 21], which drastically limits the number of sample points ($S \times I \approx 25$) that the marginal probability can be evaluated over, making it difficult to obtain a useful estimate. In view of this, a viable approach may be to bound (14) below as follows:

$$\gamma(x) \ge \underline{\gamma}(x) := \max_{\alpha \in [0, V(x)]} P\big(\forall \tilde{x} \in Q_\alpha : d(\tilde{x}) \in D(\tilde{x})\big), \tag{17}$$

and approximately compute this as

$$\underline{\gamma}(x) \approx \max_{s \in \{1, \dots, S\}} P\bigg( \bigwedge_{i=1}^{I}\ d(x_{s,i}) \in D(x_{s,i}) \bigg), \tag{18}$$

with the advantage that a separate multivariate Gaussian evaluation can now be done for each level set ($I \approx 25$). Computing this approximate probability as the system explores its state space provides a decision mechanism to guarantee safe operation of the system with a desired degree of confidence, which the system designer or operator can adjust through the $\gamma_0$ parameter.
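As a stand-in for the specialized multivariate normal CDF routines of [20, 21], the rectangle probability inside (18) can also be estimated by plain Monte Carlo, as in this sketch; the posterior mean/covariance at the $I$ sampled points and the bound corners from (13) are assumed to come from the earlier computations.

```python
# Monte Carlo estimate of P(lo <= d <= hi component-wise) for d ~ N(mean, cov),
# i.e. the joint containment probability for one level set in (18).
import numpy as np

def rectangle_probability(mean, cov, lo, hi, n_samples=100_000, seed=0):
    rng = np.random.default_rng(seed)
    samples = rng.multivariate_normal(mean, cov, size=n_samples)
    inside = np.all((samples >= lo) & (samples <= hi), axis=1)
    return inside.mean()

def gamma_lower_bound(level_set_data):
    """Approximation (18): max over sampled level sets of the joint probability."""
    return max(rectangle_probability(m, c, lo, hi)
               for m, c, lo, hi in level_set_data)
```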

2) Local Bayesian safety analysis: Evaluating the expression in (18) is still computationally intensive, which can limit the practicality of this method for real-time validation of safety guarantees in some applications, such as mobile robots relying on on-board processing. An alternative is to replace the global safety analysis with a local criterion that offers much faster computation traded off with a weaker safety guarantee.

Instead of relying on Corollary 2, this lighter method exploits Propositions 1 and 6. The system is allowed to explore the computed safe set freely as long as the probability of the estimated model $D$ being locally reliable remains above a certain threshold $\lambda_0$; if this threshold is reached, the safe controller intervenes, and the system is guaranteed to locally maintain or increase the computed safety value $V(x)$ with probability no less than $\lambda_0$. While this local guarantee does not ensure safety globally, it does constitute a useful heuristic effort to prevent the system from entering unexplored and potentially unsafe regions of the state space. Further, although the method is not explicitly tracking the hypotheses of Corollary 2, the local result becomes a global guarantee if these hypotheses do indeed hold.

We define the local safety confidence $\lambda(x; X, \mathbf{d}_j)$, more concisely $\lambda(x)$, as the posterior probability that $d(x)$ will be contained in $D(x)$ at the current state $x$, given all observations made until now:

$$\lambda(x; X, \mathbf{d}_j) := P\big(d(x) \in D(x) \,\big|\, X, \mathbf{d}_j\big) \qquad (19)$$

We then have the following local safety certificate.

Proposition 7. Let the disturbance $d(\cdot)$ be distributed component-wise as $n_d$ independent Gaussian processes (9). The safety policy $\kappa^*(\cdot)$ is guaranteed to locally maintain or increase the system's computed safety $V(\cdot)$ with probability greater than or equal to the local safety confidence $\lambda(x)$.

Proof. The proof follows directly from Propositions 1 and 6, and the definition of $\lambda(x)$, noting that the boundary of $D(x)$ has zero Lebesgue measure and thus under any Gaussian distribution $P\big(d \in \operatorname{int} D(x) \,\big|\, d \in D(x)\big) = 1$.

A local confidence threshold $\lambda_0 \in (0, p)$ can be established such that whenever $\lambda(x) < \lambda_0$ the model is considered insufficiently reliable (reachability guarantees may fail locally with probability greater than $1 - \lambda_0$), and the safety control is applied. The proposed safety control strategy is therefore as follows:

$$\kappa(x) = \begin{cases} \kappa_l(x), & \text{if } \big(V(x) > 0\big) \wedge \big(\lambda(x) > \lambda_0\big),\\ \kappa^*(x), & \text{otherwise.} \end{cases} \qquad (20)$$

Similarly to (15), under this least-restrictive law, if confidence in the local reliability of the model is restored after applying the safe action and making new observations, the system will be allowed to resume its learning process, as long as it is in the interior of the computed safe set.

After generating a new Gaussian process model and defining $D(x)$ as described in Section IV-B, the prior probability with which the disturbance function $d(x)$ belongs to the set $D(x)$ is by design $p$ everywhere in the state space. As the system evolves, more evidence is gathered in the form of measurements of the disturbance along the system trajectory, so that the belief that $d(x) \in D(x)$ is updated for each $x$. In particular, in the Gaussian process model, this additional evidence amounts to augmenting the covariance matrix $K_j$ in (10) with additional data points and reevaluating the mean and variance of the posterior distribution of $d(x)$. Based on the new Gaussian distribution, $\lambda(x; X, \mathbf{d}_j)$ can readily be evaluated for each $x$ as

$$\lambda(x) = \prod_{j=1}^{n_d} \frac{1}{2}\left[\operatorname{erf}\!\left(\frac{d_j^+(x) - m_j(x)}{s_j(x)\sqrt{2}}\right) - \operatorname{erf}\!\left(\frac{d_j^-(x) - m_j(x)}{s_j(x)\sqrt{2}}\right)\right] \qquad (21)$$

with $d_j^+(x) = \bar{d}_j(x) + z\sigma_j(x)$, $d_j^-(x) = \bar{d}_j(x) - z\sigma_j(x)$, $m_j(x) = \mathbb{E}[d_j(x) \mid X, \mathbf{d}_j]$, and $s_j(x) = \sqrt{\operatorname{var}(d_j(x) \mid X)}$; recall that $z$ was defined to yield the desired probability mass $p$ in $D(x)$ at the time of safety computation, as per (13).
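The following sketch (an illustration under the stated Gaussian process assumptions, not the flight code) shows how one component's posterior can be re-evaluated from augmented data, per (10)-(11), and how (21) is then computed; the kernel `k`, the data `(X, y)`, and the frozen bound endpoints `d_minus`, `d_plus` are placeholders.

```python
import numpy as np
from scipy.special import erf

def gp_posterior(X, y, x_star, k, sigma_n):
    """Posterior mean and std of one disturbance component at query x_star;
    appending new measurements to (X, y) realizes the covariance-matrix
    augmentation described above."""
    K = k(X, X) + sigma_n**2 * np.eye(len(X))     # augmented covariance K_j
    k_star = k(X, x_star[None, :]).ravel()
    mean = k_star @ np.linalg.solve(K, y)
    var = k(x_star[None, :], x_star[None, :])[0, 0] - k_star @ np.linalg.solve(K, k_star)
    return mean, np.sqrt(max(var, 0.0))

def local_confidence(m, s, d_minus, d_plus):
    """Evaluate (21): product over components of the posterior mass assigned
    to the fixed interval [d_j^-, d_j^+] by a Gaussian with mean m_j, std s_j."""
    m, s, d_minus, d_plus = map(np.atleast_1d, (m, s, d_minus, d_plus))
    phi = lambda b: erf((b - m) / (s * np.sqrt(2.0)))
    return float(np.prod(0.5 * (phi(d_plus) - phi(d_minus))))
```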

Parameters $p$ and $\lambda_0$ (or, as the case may be, $\gamma_0$) allow the system designer to choose the degree of conservativeness in the system: while $p$ regulates the amount of uncertainty accounted for by the robust model-based safety computation, $\lambda_0$ ($\gamma_0$) determines the acceptable degradation in the resulting certificate's posterior confidence before a safety intervention is initiated. A value of $p$ close to 1 will lead to a large, high-confidence $D(x)$ throughout the state space, but this analysis may result in a small or even empty safe set; on the other hand, if $p$ is low, $D(x)$ will be smaller and the computed safe set will be larger, but guarantees are more likely to be deemed unreliable (as per $\lambda_0$ or $\gamma_0$) in light of later observations.

In the case of local safety analysis, immediately after computing a new model $D$, $\lambda(x)$ is by construction equal to $p$ everywhere in the state space. As more measurements are obtained, the posterior distribution over the disturbance changes, as illustrated in Fig. 2, which can result in $\lambda(x)$ locally increasing or decreasing. If $\lambda_0$ is chosen close to $p$, it is likely that the safety override will take place under minor deviations from the model's prediction; as $\lambda_0$ becomes lower, however, the probability that the disturbance will violate the modeling assumptions before the safety controller intervenes increases. This reflects the fundamental tradeoff between risk and conservativeness in safety-critical decision making under uncertainty. The proposed framework therefore allows the system designer to adjust the degree of conservativeness according to the needs and characteristics of the system at hand.

V. EXPERIMENTAL RESULTS

We demonstrate our framework on a practical application with an autonomous quadrotor helicopter learning a flight controller in different scenarios. Our method is tested on the Stanford-Berkeley Testbed of Autonomous Rotorcraft for Multi-Agent Control (STARMAC), using Ascending Technologies Pelican and Hummingbird quadrotors (Fig. 1). The system receives full state feedback from a VICON motion capture system. For the purpose of this series of experiments, the vehicle's dynamics are approximately decoupled through an on-board controller responsible for providing lateral stability around hover and vertical flight; our framework is then used to learn the feedback gains for a hybrid vertical flight controller. The learning and safety controllers were implemented and executed in MATLAB, on a Lenovo ThinkPad with an Intel i7 processor that communicated wirelessly with the vehicle's 1.99 GHz quad-core Intel Atom processor. This was all done using the Indigo version of the Robot Operating System (ROS) framework.

Fig. 2: Evolution of the probability distribution of the disturbance $d(x)$ at a particular state $x$. The prior distribution is used to compute the bound $D(x)$ using confidence intervals, such that it contains a specified probability mass $p$. As more data are obtained, the distribution may shift, leading to a different posterior probability mass contained within $D(x)$.

The purpose of the results presented here is not to advance the state of the art of quadrotor flight control or reinforcement learning techniques, but to illustrate how the proposed method can allow safe execution of an arbitrary learning-based controller without requiring any particular convergence rate guarantees. To fully demonstrate the reliability of our safe learning framework, in our first setup we let the vehicle begin its online learning in mid-air, starting with a completely untrained controller. The general functioning of the framework can be observed in the second flight experiment, in which the vehicle starts with a conservative model and iteratively computes empirical estimates of the disturbance, gradually expanding its computed safe set while avoiding overreliance on poor predictions. Finally, we include an experiment in which an unexpected disturbance is introduced into the system. The vehicle reacts by immediately applying the safe action dictated by its local safety policy and retracting from the perturbed region, successfully maintaining safety throughout its trajectory. We show how the absence of online guarantee validation in the same scenario can result in loss of safety.

We use an affine dynamical model of quadrotor vertical flight, with state equations:

$$\dot{x}_1 = x_2, \qquad \dot{x}_2 = k_T u + g + k_0 + d(x) \qquad (22)$$

where $x_1$ is the vehicle's altitude, $x_2$ is its vertical velocity, and $u \in [0, 1]$ is the normalized motor thrust command. The gravitational acceleration is $g = -9.8\ \text{m/s}^2$. The parameters of the affine model, $k_T$ and $k_0$, are determined for the Pelican and the Hummingbird vehicles through a simple on-the-ground experimental procedure: a scale is used to measure the normal force reduction for different values of $u$. Finally, $d$ is an unknown, state-dependent scalar disturbance term representing unmodeled forces in the system. The state constraint $\mathcal{K} = \{x : 0\ \text{m} \le x_1 \le 2.8\ \text{m}\}$ encodes the position of the floor and the ceiling, which must be avoided.
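For reference, the model (22) reads directly as a two-state ODE; `k_T` and `k_0` below stand for the experimentally identified parameters.

```python
def vertical_dynamics(x, u, d, k_T, k_0, g=-9.8):
    """Affine vertical-flight model (22): x = (altitude, vertical velocity),
    u in [0, 1] the normalized thrust command, d the unmodeled acceleration."""
    x1_dot = x[1]
    x2_dot = k_T * u + g + k_0 + d
    return (x1_dot, x2_dot)
```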

As the learning-based controller, we choose an easily implementable policy gradient reinforcement learning algorithm [32], which learns the weights for a linear mapping from state features to control commands. Following [23], we define different features for positive and negative velocities and position errors, since the (unmodeled) rotor dynamics may be different in ascending and descending flight. This can be seen as the policy gradient algorithm learning the feedback gains for a hybrid proportional-integral-derivative (PID) controller.
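One plausible parameterization is sketched below; the exact feature set used on the vehicle is not detailed here, so the sign-split terms and bias are illustrative assumptions.

```python
import numpy as np

def features(err, err_dot, err_int):
    """Hypothetical sign-split feature vector: separate proportional and
    derivative features for positive and negative errors (hybrid PID gains),
    plus an integral term and a bias absorbing the hover thrust."""
    return np.array([
        max(err, 0.0), min(err, 0.0),          # P terms, split by sign
        max(err_dot, 0.0), min(err_dot, 0.0),  # D terms, split by sign
        err_int,                               # I term
        1.0,                                   # bias / hover offset
    ])

def policy(w, err, err_dot, err_int):
    """Linear policy u = w . phi, clipped to the thrust range [0, 1]; the
    policy gradient algorithm adjusts the weight vector w online."""
    return float(np.clip(w @ features(err, err_dot, err_int), 0.0, 1.0))
```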

A. From Fall to Flight

To demonstrate the strength of Hamilton-Jacobi-based guarantees for safely performing learning-based control on a physical system, we first require a Pelican quadrotor to learn an effective vertical trajectory tracking controller with an arbitrarily poor initialization. To do this, the policy gradient algorithm is initialized with all feature weights set to 0. The pre-computed safety controller (numerically obtained using [36]) is based on a conservative uncertainty bound of $\pm 1.5\ \text{m/s}^2$ everywhere in the state space; no new bounds are learned during this experiment. The reference trajectory requires the quadrotor to aggressively alternate between hovering at two altitudes, one of them (1.5 m) near the center of the room, the other (0.1 m) close to the floor.

This first experiment illustrates the interplay between the learning controller and the safety policy. The iterative safety re-computation and Bayesian guarantee validation components of the framework are not active here. Consistently, this demonstration uses (8) as the least-restrictive safe policy.

The experiment, shown in Fig. 3, is initialized with the vehicle in mid-air. Since all feature weights are initially set to zero, the vehicle's initial action is to enter free fall. However, as the quadrotor is accelerated by gravity towards the floor, the boundary of the computed safe set is reached, triggering the intervention of the safety controller, which automatically overrides the learning controller and commands the maximum available thrust to the motors ($u = 1$), causing the vehicle to decelerate and hover at a small distance from the ground. For the next few seconds, there is some chattering near the boundary of the safe set, and the policy gradient algorithm has some occasions to attempt to control the vehicle when it is momentarily pushed into the interior of the safe set.

Fig. 3: Vehicle altitude and reference trajectory over time. Initial feedback gains are set to zero. When the learning controller (green) lets the vehicle drop, the safety control (red) takes over, preventing a collision. Within a few seconds, the learned feedback gains allow rough trajectory tracking and are subsequently tuned as the vehicle attempts to minimize error.

Initially it has little success, which leads the safety controller to continually intervene to prevent the quadrotor from colliding with the floor; this has the undesirable effect of slowing down the learning process, since observations under this interference are uninformative about the behavior of the vehicle when actually executing the commands produced by the learning controller (which is an "on-policy" algorithm). However, at approximately $t = 40$ s, the learning controller is able to make the vehicle ascend towards its tracking reference, retaining control of the vehicle for a longer span of time and accelerating the learning process. By $t = 60$ s, the quadrotor is approximately tracking the reference, with the safety controller only intervening during the aggressive descent phase of the repeated trajectory, to ensure (under its conservative model) that there is no risk of a ground collision. The controller continues to learn in subsequent iterations, overall improving its tracking accuracy.

The remarkable result of this experiment is not the quality of the learned tracking controller after only a few seconds of active exploration (a merit that corresponds to the reinforcement learning method [32]), but the system's ability to achieve competent performance at its task from an extremely poor initial policy while remaining safe at all times.

B. When in Doubt

In the second experiment, we demonstrate the iterative updating of the safe set and safety policy using observations of the system dynamics gathered over time, as well as the online validation of the resulting guarantees. All components of the framework are active during the test, namely the learning controller, safety policy, iterative safety re-computation, and Bayesian guarantee validation, with the main focus being on the latter two.

Here, the Pelican quadrotor attempts to safely track the same reference trajectory, while using the gathered information about the system's evolution to refine its notion of safety. In this case, the policy gradient learning algorithm is initialized to a hand-tuned set of parameter values. The initial dynamic model available to the safety algorithm is identical to the one used in the previous experiment, with a uniform uncertainty bound of $\pm 1.5\ \text{m/s}^2$. However, the system is now allowed to update this bound, throughout the state space, based on the disturbance posterior computed using a Gaussian process model.

To learn the disturbance function, the system starts with a Gaussian process prior over $d(\cdot)$ defined by a zero mean function and a squared exponential covariance function:

$$k(x, x') = \sigma_f^2 \exp\left(-\frac{(x - x')^T L^{-1} (x - x')}{2}\right) \qquad (23)$$

where $L$ is a diagonal matrix, with $L_i$ as the $i$-th diagonal element, and $\theta_p = (\sigma_f^2, \sigma_n^2, L_1, L_2)$ are the hyperparameters, $\sigma_f^2$ being the signal variance, $\sigma_n^2$ the measurement noise variance, and the $L_i$ the squared exponential's characteristic lengths for position and velocity, respectively. The hyperparameters are chosen to maximize the marginal likelihood of the training data set, and are recomputed for each new batch of data when a new disturbance model $D(x)$ is generated for safety analysis. Finally, the chosen prior mean and covariance kernel classes are both Lipschitz continuous, ensuring that all required technical conditions for the theoretical results hold (proofs are presented in the Appendix).
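As an illustrative sketch (not the authors' MATLAB implementation), the same kernel structure and marginal-likelihood fit can be reproduced with scikit-learn; the training arrays below are placeholders.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel, RBF, WhiteKernel

# Squared-exponential kernel (23) with separate characteristic lengths for
# position and velocity, a signal variance, and a measurement-noise term;
# fit() maximizes the marginal likelihood over these hyperparameters.
kernel = ConstantKernel(1.0) * RBF(length_scale=[0.5, 0.5]) + WhiteKernel(1e-2)
gp = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=5)

X_train = np.random.rand(50, 2)      # visited states (altitude, velocity): placeholder
d_train = 0.1 * np.random.randn(50)  # measured disturbances: placeholder
gp.fit(X_train, d_train)

m, s = gp.predict(X_train, return_std=True)  # posterior mean and std, as in (10)-(11)
```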

The expressions (10), (11) give the marginal Gaussian process posterior on $d(x^*)$ for a query point $x^*$. To numerically compute the safe set, the system first evaluates (13) to obtain the disturbance bound $D(x_i)$ at every point $x_i$ on a state-space grid, as the 95% confidence interval ($p = 0.95$) of the Gaussian process posterior over $d(x)$; next, it performs the robust safety analysis by numerically solving the Hamilton-Jacobi-Isaacs equation (7) on this grid (using [36]) and obtaining the safety function $V(x)$.
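The confidence-interval endpoints of (13) follow directly from the Gaussian posterior; this short sketch shows the computation for posterior mean and standard deviation arrays `m`, `s` over the grid points.

```python
import numpy as np
from scipy.special import erfinv

def disturbance_bound(m, s, p=0.95):
    """Bound D(x_i) per (13): the central interval containing probability
    mass p of a Gaussian with mean m and std s; z = sqrt(2)*erfinv(p) is
    approximately 1.96 for p = 0.95."""
    z = np.sqrt(2.0) * erfinv(p)
    return m - z * s, m + z * s
```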

The trajectory followed by the quadrotor in this experiment is shown in Fig. 4. The vehicle starts off with an a priori conservative global bound on $d(x)$ and computes an initial conservative safe set $\Omega_1$ (Fig. 5). It then attempts to track the reference trajectory, avoiding the unsafe regions by transitioning to the safe control $u^*(x)$ on $\partial\Omega_1$. The disturbance is measured and monitored online during this test, under the local safety confidence criterion, and found to be locally consistent with the initial conservative bound. After collecting 10 s of data, a new disturbance bound $D(x)$ is constructed using the corresponding Gaussian process posterior, from which a second safety function $V_2(x)$ and safe set $\Omega_2$ are computed via Hamilton-Jacobi reachability analysis. This process takes roughly 2 seconds, and at approximately $t = 12$ s the new safety guarantees and policy are substituted in.

The Pelican continues its flight under the results of this new safety analysis; however, shortly after, the vehicle measures values of $d$ that consistently approach the boundary of $D(x)$, and reacts by applying the safe control policy and locally climbing the computed safety function. This confidence-based intervention takes place several times during the test run, as the vehicle measures disturbances that lower its confidence in the local model bounds, effectively preventing the vehicle from approaching the ground.

After a few seconds, a new Gaussian process posterior is computed based on the first 20 s of flight data, resulting in an estimated safe set $\Omega_3$, an intermediate result between the initial conservative $\Omega_1$ and the overly permissive $\Omega_2$ (Fig. 5). The learning algorithm is then allowed to resume tracking under this new safety analysis, and no further safety overrides take place due to loss of safety confidence.

Fig. 4: Vehicle altitude and reference trajectory over time. After flying with an initial conservative model, the vehicle computes a first Gaussian process model of the disturbance with only a few data points, resulting in an insufficiently accurate bound. The safety policy detects the low confidence and refuses to follow the reference to low altitudes. Once a more accurate disturbance bound is computed, tracking is resumed, with a less restrictive safe set than the original one.

This experiment demonstrates the algorithm's ability to safely refine its notion of safety as more data become available, without requiring the process to consist of a series of strictly conservative under-approximations.

C. Gone with the Wind

In this last experimental result, we display the efficacy of online safety guarantee validation in handling alterations in operating conditions unforeseen by the system designer. All components of the framework are active, except for the iterative safety re-computation, which is not relevant to this demonstration.

This experiment is performed using the lighter Hummingbird quadrotor, which is more agile than the Pelican but also more susceptible to wind. We initialize the disturbance set to a conservative range of $\pm 2\ \text{m/s}^2$, which amply captures the error in the double-integrator model for vertical flight. The vehicle begins tracking a slow sinusoidal trajectory, using policy gradient [32] to improve upon the manually initialized controller parameters. At approximately $t = 45$ s an unmodeled disturbance is introduced by activating a fan aimed laterally at the quadrotor. The fan is positioned on the ground and angled slightly upward, so that its effect increases as the quadrotor flies closer to the ground. The presence of the airflow causes the attitude and lateral position controllers to use additional control authority to stabilize the quadrotor, which couples into the vertical dynamics as an unmodeled force.

Fig. 5: Safe sets computed online by the safety algorithm as it gathers data and successively updates its Gaussian process disturbance model. The vehicle's trajectory eventually leaves the initial, conservative $\Omega_1$, but remains in the converged safe set ($\Omega_4$) at all times, even before this set is computed. While the intermediate set $\Omega_2$ would have been overly permissive, this is remedied by the intervention of the safety controller as soon as the model is observed to behave poorly.

The experiment is performed with and without the Bayesian guarantee validation component, with resulting trajectories shown in Fig. 6. Without validation, the quadrotor violates the constraints, repeatedly striking the ground. With validation, the fan's airflow is quickly detected as a discrepancy with the model near the floor, and the safety controller override is triggered. The vehicle avoids entering the affected region for the remainder of the flight. Although only the local confidence method is used, providing a strictly local safety guarantee, the safe controller succeeds in maintaining safety throughout the experiment. This provides strong evidence suggesting that, beyond its theoretical guarantees, the local Bayesian analysis also constitutes an effective best-effort approach to safety in more general conditions, given limited computational resources and available knowledge about the system.

Fig. 6: Vehicle altitude and reference trajectory over time, shown with and without online model validation. After the fan is turned on, the vehicle checking local model reliability detects the inconsistency and overrides the learning controller, avoiding the region with unmodeled dynamics; the vehicle without model validation enters this region and collides with the ground multiple times. The behavior is repeated when the reference trajectory enters the perturbed region a second time.

VI. CONCLUSION

We have introduced a safe learning framework that combines robust reachability guarantees from control theory with Bayesian analysis based on empirical observations. This results in a minimally restrictive supervisory controller that can allow an arbitrary learning algorithm to safely explore its state and strategy spaces. As more data are gathered online, the framework allows the system to probabilistically reason about the validity of its robust model-based safety guarantees in light of the latest empirical evidence.

We firmly believe that providing strong and practically useful safety guarantees for systems that navigate unstructured environments requires a rapprochement between model-based and data-driven techniques, often regarded as a dichotomy by both theoreticians and practitioners. With this work we intend to provide mathematical arguments and empirical proof of the potential that these two approaches hold when used in conjunction.

Scaling up safety certificates as intelligent systems achieve increasing complexity poses an important open research problem. Our prediction is that, as autonomous systems interact increasingly closely with human beings, the graceful interplay of safety and learning, combining theoretical guarantees with empirical grounding, will become central to their success.

APPENDIX

Restricting one of the control inputs, $d$, to a state-dependent bound $D(x)$ raises the question of whether a unique Carathéodory solution to (2) continues to exist. The basic existence and uniqueness theorems [13] assume fixed control sets. If the variation in the control sets can instead be expressed through the dynamic equation itself without breaking the continuity conditions, then it is easy to extend the classical result to at least a class of state-dependent control sets. We introduce two technical assumptions, which give sufficient conditions for the existence and uniqueness of a solution to the dynamical equations, and prove that any disturbance set $D(x)$ obtained from a Gaussian process model with Lipschitz prior mean and covariance kernel satisfies these assumptions.


Assumption 1. For all $x \in \mathbb{R}^n$, $D(x)$ is a closed deformation retract of $D$; that is, there exists a continuous map $H_x : D \times [0, 1] \to D$ such that for every $d \in D$ and $\bar{d} \in D(x)$: $H_x(d, 0) = d$, $H_x(d, 1) \in D(x)$, and $H_x(\bar{d}, 1) = \bar{d}$.

Assumption 2. Let $r : \mathbb{R}^n \times D \to D$ be such that $r(x, d) = H_x(d, 1)$, as defined above. Then $r$ is Lipschitz continuous in $x$, and uniformly continuous in $d$.

Intuitively, the first assumption means that $D$ can be continuously deformed into $D(x)$ for any $x$, while the second prevents the disturbance bound $D(x)$ from changing abruptly as one moves through the state space. The retraction map $r$ allows us to absorb the state-dependence of the disturbance bound into the system dynamics, enabling us to use the standard analysis for differential games, which considers measurable time signals drawn from fixed input sets. This is formalized in the following result.

Proposition 8. The saturated system dynamics $f_D(x, u, d) := f\big(x, u, r(x, d)\big)$ are bounded and uniformly continuous in all variables, and Lipschitz in $x$.

Proof. Noting that boundedness and uniform continuity of $f_D$ in $u$ are trivially inherited from $f$, we focus on $d$ and $x$.

First, since $r$ is uniformly continuous in $d$, and $f$ is Lipschitz (hence uniformly continuous) in its third argument, we have by composition that $f_D$ is uniformly continuous in $d$.

Lipschitz continuity in $x$ is less immediate due to its appearance in both the first and third arguments of $f$. Again by composition, Lipschitz continuity of $r$ in $x$ implies that $f\big(y, u, r(\cdot, d)\big)$ is also Lipschitz for all $y \in \mathbb{R}^n$, $u \in U$, and $d \in D$. Letting $L_r$ be the Lipschitz constant of $r$, and $L_x$ and $L_d$ the Lipschitz constants of $f$ in its first and third arguments, we have, for any $x, \bar{x} \in \mathbb{R}^n$:

$$|f(x, u, r(x, d)) - f(\bar{x}, u, r(\bar{x}, d))| \le |f(x, u, r(x, d)) - f(\bar{x}, u, r(x, d))| + |f(\bar{x}, u, r(x, d)) - f(\bar{x}, u, r(\bar{x}, d))| \le (L_x + L_d L_r)\,|x - \bar{x}|.$$

Thus $f_D(\cdot, u, d) = f(\cdot, u, r(\cdot, d))$ is indeed Lipschitz in $x$.

Corollary 3. The dynamical system given by (2), with $d$ constrained to lie in a state-dependent set $D(x)$ satisfying Assumptions 1 and 2, has a unique continuous solution in the extended (Carathéodory) sense.

The above result is important in that it guarantees existence and uniqueness of system trajectories for any state-dependent disturbance bound $D(\cdot)$ that satisfies Assumptions 1 and 2. Moreover, the above construction allows us to transform the system dynamics $f(x, u, d)$ with $d \in D(x)$ into the standard form with fixed input sets (i.e. $f_D(x, u, d)$ with $u \in U$, $d \in D$), so that all results from the differential games literature can readily be applied to our formulation.

Let us now see that the posterior mean and standard deviation of the components of $d(x)$ are Lipschitz continuous functions of the state $x$ under our Gaussian process framework.

Proposition 9. Let the prior mean function $\mu_j$ be Lipschitz continuous, and the covariance kernel $k_j$ be jointly Lipschitz continuous in its two variables, for the $j$-th component of the disturbance function $d(x)$. Then the posterior mean $\bar{d}_j(x)$ and standard deviation $\sigma_j(x)$, as given by (10), (11), are Lipschitz continuous in $x$.

Proof. The result follows from applying the hypotheses to (10), (11). Note that the standard deviation $\sigma_j(x)$ is the square root of the variance in (11); the square root function is Lipschitz everywhere except at 0, and Bayesian inference under a nondegenerate prior and likelihood never results in 0 posterior variance. Thus $\sigma_j(\cdot)$ is also Lipschitz.

The following proposition relates the state-dependent bound $D(x)$ obtained from Gaussian process regression to Assumptions 1 and 2, ensuring that the dynamical system (2) is well-defined, and therefore that the associated dynamic game can be solved using the methods presented in Section III.

Proposition 10. Let the prior mean function $\mu_j$ be Lipschitz continuous, and the covariance kernel $k_j$ be jointly Lipschitz continuous in its two variables, for all components $j$ of the disturbance function $d(x)$. Then the disturbance bound $D(x)$, as defined in (13), satisfies Assumptions 1 and 2.

Proof. Assumption 1 holds independently of the Lipschitz condition. The bound $D(x)$ given by (13) is a compact convex set in $D$. As a result, the retraction map $\pi_x : D \to D(x)$ assigning to every $d \in D$ its (unique) nearest point in $D(x)$ is a Lipschitz continuous function (with Lipschitz constant equal to 1), with, of course, $\pi_x(\bar{d}) = \bar{d}$ for all $\bar{d} \in D(x)$. Then the function $H_x(d, t) := (1 - t)d + t\pi_x(d)$ is continuous by composition and further satisfies $H_x(d, 0) = d$, $H_x(d, 1) \in D(x)$, and $H_x(\bar{d}, 1) = \bar{d}$ for all $d \in D$ and $\bar{d} \in D(x)$.

Assumption 2 can be shown to hold by noting that the extrema of each of the intervals in (13) are affine in $\bar{d}_j(x)$ and $\sigma_j(x)$, which are Lipschitz continuous in $x$ by Proposition 9. This implies that the position of all vertices of the hyperrectangle in (13) varies as a Lipschitz continuous function of $x$, and so does, as a result, the nearest point in $D(x)$ to any fixed $d \in D$. The map $r(x, d) = \pi_x(d)$ is hence Lipschitz continuous in $x$. Finally, since $\pi_x$ is Lipschitz continuous in $d$, $r$ is also uniformly continuous in $d$.
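Since $D(x)$ is an axis-aligned hyperrectangle, the nearest-point retraction $\pi_x$ used in the proof above reduces to component-wise clipping, as this short sketch illustrates (the array arguments are the interval endpoints from (13)).

```python
import numpy as np

def retract(d, d_lower, d_upper):
    """Nearest-point retraction pi_x onto the box D(x) = prod_j [d_j^-, d_j^+]:
    for an axis-aligned hyperrectangle this is component-wise clipping,
    which is 1-Lipschitz in d, as used in the proof of Proposition 10."""
    return np.clip(d, d_lower, d_upper)
```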

Finally, we can show that, under the same Lipschitz assumptions on the Gaussian process prior, the disturbance bound $D(x)$ is Lipschitz continuous under the Hausdorff distance, which we required in Proposition 1.

Proposition 11. Let the prior mean function $\mu_j$ be Lipschitz continuous, and the covariance kernel $k_j$ be jointly Lipschitz continuous in its two variables, for all components $j$ of the disturbance function $d(x)$. Then the set-valued map $D$ is Lipschitz continuous under the Hausdorff distance.

Proof. Since the disturbance set $D(x)$ given by (13) is a hyperrectangle, it is easy to see that the Hausdorff distance between the disturbance sets $D(x_1)$ and $D(x_2)$ is upper-bounded by the maximum distance between any pair of corners:

$$\delta(x_1, x_2) := d_H\big(D(x_1), D(x_2)\big) \le \max_i \max_k |c_i - c_k|,$$

with $c_i$, $c_k$ used to enumerate all corners of each of the two hyperrectangles. For simplicity of exposition, we use the equivalence of all norms in $\mathbb{R}^{n_d}$ to upper-bound the above, arbitrary norm in $\mathbb{R}^{n_d}$ by the infinity norm, up to a constant factor $m$, which in combination with (13) gives:

$$\delta(x_1, x_2) \le m \cdot \max_j \big(|\bar{d}_j(x_1) - \bar{d}_j(x_2)| + |z\sigma_j(x_1) - z\sigma_j(x_2)|\big).$$

Now, by Proposition 9, $\bar{d}_j(x)$ and $\sigma_j(x)$ are Lipschitz continuous in $x$; let their respective constants be $L_\mu^j$ and $L_\sigma^j$. We then have

$$d_H\big(D(x_1), D(x_2)\big) \le m \cdot \max_j \big(L_\mu^j + z L_\sigma^j\big)\,|x_1 - x_2|,$$

which proves Hausdorff Lipschitz continuity of the set-valued map $D$, with a Lipschitz constant $L_D$ upper-bounded by $m \cdot \max_j \big(L_\mu^j + z L_\sigma^j\big)$.

REFERENCES

[1] P. Abbeel et al. "An application of reinforcement learning to aerobatic helicopter flight". Advances in Neural Information Processing Systems (2007).
[2] A. K. Akametalu et al. "Reachability-based safe learning with Gaussian processes". Conference on Decision and Control (2014).
[3] A. Ames, J. Grizzle, and P. Tabuada. "Control Barrier Function based Quadratic Programs with Application to Adaptive Cruise Control". Conference on Decision and Control (2014).
[4] A. Aswani et al. "Provably safe and robust learning-based model predictive control". Automatica 49.5 (2013).
[5] E. Barron. "Differential Games with Maximum Cost". Nonlinear Analysis: Theory, Methods & Applications (1990).
[6] E. Barron and H. Ishii. "The Bellman equation for minimizing the maximum cost". Nonlinear Analysis: Theory, Methods & Applications (1989).
[7] F. Berkenkamp et al. "Safe learning of regions of attraction for uncertain, nonlinear systems with Gaussian processes". Conference on Decision and Control (2016).
[8] M. Chen, S. Herbert, and C. J. Tomlin. "Fast Reachable Set Approximations via State Decoupling Disturbances". Conference on Decision and Control (2016).
[9] M. Chen, S. Herbert, and C. J. Tomlin. "Exact and Efficient Hamilton-Jacobi-based Guaranteed Safety Analysis via System Decomposition". International Conference on Robotics and Automation (2017).
[10] M. Chen et al. "Safe sequential path planning of multi-vehicle systems via double-obstacle Hamilton-Jacobi-Isaacs variational inequality". European Control Conference (2015).
[11] P. Christiano et al. "Transfer from Simulation to Real World through Learning Deep Inverse Dynamics Model". arXiv preprint (2016).
[12] A. Coates, P. Abbeel, and A. Y. Ng. "Learning for control from multiple demonstrations". International Conference on Machine Learning (2008).
[13] E. A. Coddington and N. Levinson. Theory of Ordinary Differential Equations. Tata McGraw-Hill, 1955.
[14] J. Darbon and S. Osher. "Algorithms for overcoming the curse of dimensionality for certain Hamilton-Jacobi equations arising in control theory and elsewhere". Research in the Mathematical Sciences 3.1 (2016).
[15] J. Ding et al. "Toward reachability-based controller design for hybrid systems in robotics". International Symposium on Artificial Intelligence, Robotics and Automation in Space (2011).
[16] L. C. Evans and P. E. Souganidis. "Differential games and representation formulas for solutions of Hamilton-Jacobi-Isaacs equations". Indiana University Mathematics Journal 33.5 (1984).
[17] J. F. Fisac and S. S. Sastry. "The Pursuit-Evasion-Defense Differential Game in Dynamic Constrained Environments". Conference on Decision and Control (2015).
[18] J. García and F. Fernández. "A Comprehensive Survey on Safe Reinforcement Learning". Journal of Machine Learning Research 16 (2015).
[19] P. Geibel and F. Wysotzki. "Risk-sensitive reinforcement learning applied to control under constraints". Journal of Artificial Intelligence Research 24 (2005).
[20] A. Genz. "Numerical computation of multivariate normal probabilities". Journal of Computational and Graphical Statistics 1.2 (1992).
[21] A. Genz. "Numerical computation of rectangular bivariate and trivariate normal and t probabilities". Statistics and Computing 14.3 (2004).
[22] J. H. Gillula and C. J. Tomlin. "Guaranteed Safe Online Learning via Reachability: tracking a ground target using a quadrotor". International Conference on Robotics and Automation (2012).
[23] J. H. Gillula and C. J. Tomlin. "Reducing Conservativeness in Safety Guarantees by Learning Disturbances Online: Iterated Guaranteed Safe Online Learning". Robotics: Science and Systems (2012).
[24] S. L. Herbert et al. "FaSTrack: a Modular Framework for Fast and Guaranteed Safe Motion Planning". Conference on Decision and Control (submitted) (2017).
[25] A. Hobbs. "Unmanned aircraft systems". Human Factors in Aviation. Ed. by E. Salas and D. Maurino. 2nd ed. Elsevier, 2010.
[26] S. Huang et al. "Adversarial Attacks on Neural Network Policies". arXiv preprint (2017).
[27] S. Kaynama and M. Oishi. "Complexity reduction through a Schur-based decomposition for reachability analysis of linear time-invariant systems". International Journal of Control 84.1 (2011).
[28] S. Kaynama and M. Oishi. "A modified Riccati transformation for decentralized computation of the viability kernel under LTI dynamics". IEEE Transactions on Automatic Control 58.11 (2013).
[29] S. Kaynama et al. "Scalable safety-preserving robust control synthesis for continuous-time linear systems". IEEE Transactions on Automatic Control 60.11 (2015).
[30] M. R. Kirchner et al. "Time-Optimal Collaborative Guidance using the Generalized Hopf Formula". Conference on Decision and Control (submitted) (2017).
[31] J. Z. Kolter et al. "A probabilistic approach to mixed open-loop and closed-loop control, with application to extreme autonomous driving". International Conference on Robotics and Automation (2010).
[32] J. Z. Kolter and A. Y. Ng. "Policy search via the signed derivative". Robotics: Science and Systems (2009).
[33] S. Lupashin et al. "A simple learning strategy for high-speed quadrocopter multi-flips". International Conference on Robotics and Automation (2010).
[34] J. Lygeros. "On reachability and minimum cost optimal control". Automatica 40.6 (2004).
[35] I. M. Mitchell, A. M. Bayen, and C. J. Tomlin. "A time-dependent Hamilton-Jacobi formulation of reachable sets for continuous dynamic games". IEEE Transactions on Automatic Control 50.7 (2005).
[36] I. Mitchell and J. Templeton. "A Toolbox of Hamilton-Jacobi Solvers for Analysis of Nondeterministic Continuous and Hybrid Systems". Hybrid Systems: Computation and Control (2005).
[37] V. Mnih et al. "Human-level control through deep reinforcement learning". Nature 518.7540 (2015).
[38] T. M. Moldovan and P. Abbeel. "Safe exploration in Markov decision processes". International Conference on Machine Learning (2012).
[39] S. Osher and R. Fedkiw. Level Set Methods and Dynamic Implicit Surfaces. Vol. 153. Springer Science & Business Media, 2003.
[40] T. J. Perkins and A. G. Barto. "Lyapunov design for safe reinforcement learning". The Journal of Machine Learning Research 3 (2003).
[41] S. Prajna and A. Jadbabaie. "Safety verification of hybrid systems using barrier certificates". Hybrid Systems: Computation and Control (2004).
[42] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.
[43] J. W. Roberts, I. R. Manchester, and R. Tedrake. "Feedback controller parameterizations for Reinforcement Learning". Symposium on Adaptive Dynamic Programming and Reinforcement Learning (2011).
[44] J. Schulman et al. "Trust Region Policy Optimization". International Conference on Machine Learning (2015).
[45] C. Sloth, G. J. Pappas, and R. Wisniewski. "Compositional safety analysis using barrier certificates". International Conference on Hybrid Systems: Computation and Control (2012).
[46] M. Turchetta, F. Berkenkamp, and A. Krause. "Safe Exploration in Finite Markov Decision Processes with Gaussian Processes". Advances in Neural Information Processing Systems (2016).
[47] P. Varaiya. "On the existence of solutions to a differential game". SIAM Journal on Control 5.1 (1967).

Jaime F. Fisac is a Ph.D. candidate in Electrical Engineering and Computer Sciences at the University of California, Berkeley. He received a B.S./M.S. degree in Electrical Engineering from the Universidad Politécnica de Madrid, Spain, in 2012, and an M.Sc. in Autonomous Vehicle Dynamics and Control from Cranfield University, UK, in 2013. He is a recipient of the "la Caixa" Foundation Fellowship (2012-2015). His research interests lie in control theory, artificial intelligence, and cognitive science, with a focus on safety for human-centered robotic systems.

Anayo K. Akametalu is a Ph.D. candidate in Electrical Engineering and Computer Sciences at the University of California, Berkeley. He obtained his B.S. degree in Electrical Engineering from the University of California, Santa Barbara in 2012. His research interests lie at the intersection of control theory and reinforcement learning. He has been funded through the National Science Foundation Bridge to Doctorate Fellowship, the UC Berkeley Chancellor's Fellowship, and the GEM Fellowship. In 2016 he took a break from his graduate studies to work at Apple Inc.

Melanie N. Zeilinger is an Assistant Professor at the Department of Mechanical and Process Engineering, ETH Zurich, Switzerland. She received a Diploma in Engineering Cybernetics (2006) from the University of Stuttgart, Germany, and a Ph.D. in Electrical Engineering (2011) from ETH Zurich. She was a Marie Curie Fellow and held Postdoctoral Researcher and Fellow appointments at the Max Planck Institute for Intelligent Systems, Germany (until 2015), UC Berkeley, CA (2012-14), and EPFL, Switzerland (2011-12). Her research interests include distributed control and optimization, as well as safe learning-based control, with applications to energy distribution systems and human-in-the-loop control.

Shahab Kaynama is an Autonomy Manager at Clearpath Robotics. From 2014 to 2016 he was a Senior Autonomy Engineer and later a Team Lead, Navigation & Controls. Previously, he was a Postdoctoral Research Scholar at UC Berkeley (since 2012) and at the University of British Columbia (since 2014). Dr. Kaynama received a Ph.D. (2012) in Electrical and Computer Engineering from the University of British Columbia, Canada, and an M.Sc. (2006) in Advanced Control and Systems Engineering from the University of Manchester, UK.

Jeremy Gillula received a B.S. in Computer Science from Caltech, CA, in 2006, and an M.S. and Ph.D. in Computer Science from Stanford University, CA, in 2011 and 2013. After spending eight months as a Postdoctoral Researcher at UC Berkeley, he became a Staff Technologist at the Electronic Frontier Foundation, a San Francisco-based nonprofit organization dedicated to defending civil liberties in the digital world. Dr. Gillula is the recipient of the Olin E. Teague Memorial Scholarship from the National Space Club (2002), a Caltech Associates Endowed Undergraduate Scholarship (2004), a Carnation Merit Award (2005), a National Science Foundation Graduate Fellowship (2006-2009), and a National Defense Science and Engineering Graduate Fellowship (2009-2010).

Claire J. Tomlin is the Charles A. Desoer Professor of Engineering in Electrical Engineering and Computer Sciences at the University of California, Berkeley. She was an Assistant, Associate, and Full Professor in Aeronautics and Astronautics at Stanford from 1998 to 2007, and in 2005 joined Berkeley. Claire works in the area of control theory and hybrid systems, with applications to air traffic management, UAV systems, energy, robotics, and systems biology. She is a MacArthur Foundation Fellow (2006) and an IEEE Fellow (2010), and in 2010 held the Tage Erlander Professorship of the Swedish Research Council at KTH in Stockholm.