An Inductive Synthesis Framework for Verifiable Reinforcement Learning

He Zhu, Galois, Inc.
Zikang Xiong, Purdue University
Stephen Magill, Galois, Inc.
Suresh Jagannathan, Purdue University

arXiv:1907.07273v1 [cs.LG] 16 Jul 2019

Abstract
Despite the tremendous advances that have been made in the last decade on developing useful machine-learning applications, their wider adoption has been hindered by the lack of strong assurance guarantees that can be made about their behavior. In this paper, we consider how formal verification techniques developed for traditional software systems can be repurposed for verification of reinforcement learning-enabled ones, a particularly important class of machine learning systems. Rather than enforcing safety by examining and altering the structure of a complex neural network implementation, our technique uses blackbox methods to synthesize deterministic programs: simpler, more interpretable approximations of the network that can nonetheless guarantee desired safety properties are preserved, even when the network is deployed in unanticipated or previously unobserved environments. Our methodology frames the problem of neural network verification in terms of a counterexample and syntax-guided inductive synthesis procedure over these programs. The synthesis procedure searches for both a deterministic program and an inductive invariant over an infinite state transition system that represents a specification of an application's control logic. Additional specifications defining environment-based constraints can also be provided to further refine the search space. Synthesized programs deployed in conjunction with a neural network implementation dynamically enforce safety conditions by monitoring and preventing potentially unsafe actions proposed by neural policies. Experimental results over a wide range of cyber-physical applications demonstrate that software-inspired formal verification techniques can be used to realize trustworthy reinforcement learning systems with low overhead.

1 Introduction
Neural networks have proven to be a promising software architecture for expressing a variety of machine learning applications. However, non-linearity and stochasticity inherent in their design greatly complicate reasoning about their behavior. Many existing approaches to verifying [17, 23, 31, 37] and testing [43, 45, 48] these systems typically attempt to tackle implementations head-on, reasoning directly over the structure of activation functions, hidden layers, weights, biases, and other kinds of low-level artifacts that are far removed from the specifications they are intended to satisfy. Moreover, the notion of safety verification that is typically considered in these efforts ignores effects induced by the actual environment in which the network is deployed, significantly weakening the utility of any safety claims that are actually proven. Consequently, effective verification methodologies in this important domain remain an open problem.

To overcome these difficulties, we define a new verification toolchain that reasons about correctness extensionally, using a syntax-guided synthesis framework [5] that generates a simpler and more malleable deterministic program guaranteed to represent a safe control policy of a reinforcement learning (RL)-based neural network, an important class of machine learning systems commonly used to govern cyber-physical systems such as autonomous vehicles, where high assurance is particularly important.
Our synthesis procedure is designed with verification in mind, and is thus structured to incorporate formal safety constraints drawn from a logical specification of the control system the network purports to implement, along with additional salient environment properties relevant to the deployment context. Our synthesis procedure treats the neural network as an oracle, extracting a deterministic program P intended to approximate the policy actions implemented by the network. Moreover, our procedure ensures that a synthesized program P is formally verified safe. To this end, we realize our synthesis procedure via a counterexample guided inductive synthesis (CEGIS) loop [5] that eliminates any counterexamples to safety of P. More importantly, rather than repairing the network directly to satisfy the constraints governing P, we instead treat P as a safety shield that operates in tandem with the network, overriding network-proposed actions whenever such actions can be shown to lead to a potentially unsafe state. Our shielding mechanism thus retains performance, provided by the neural policy, while maintaining safety, provided by the program. Our approach naturally generalizes to infinite state systems with a continuous underlying action space. Taken together, these properties enable safety enforcement of RL-based neural networks without having to suffer a loss in performance to achieve high assurance. We show that over a range of cyber-physical applications defining various kinds of control systems, the overhead of runtime assurance is nominal, less than a few percent, compared to running an unproven, and thus potentially unsafe, network with no shield support.

This paper makes the following contributions:

1. We present a verification toolchain for ensuring that the control policies learned by an RL-based neural network are safe. Our notion of safety is defined in terms of a specification of an infinite state transition system that captures, for example, the system dynamics of a cyber-physical controller.

2. We develop a counterexample-guided inductive synthesis framework that treats the neural control policy as an oracle to guide the search for a simpler deterministic program that approximates the behavior of the network but which is more amenable to verification. The synthesis procedure differs from prior efforts [9, 47] because the search procedure is bounded by safety constraints defined by the specification (i.e., the state transition system) as well as a characterization of specific environment conditions defining the application's deployment context.

3. We use a verification procedure that guarantees actions proposed by the synthesized program always lead to a state consistent with an inductive invariant of the original specification and deployed environment context. This invariant defines an inductive property that separates all reachable (safe) and unreachable (unsafe) states expressible in the transition system.

4. We develop a runtime monitoring framework that treats the synthesized program as a safety shield [4], overriding proposed actions of the network whenever such actions can cause the system to enter into an unsafe region.

We present a detailed experimental study over a wide range of cyber-physical control systems that justifies the utility of our approach. These results indicate that the cost of ensuring verification is low, typically on the order of a few percent. The remainder of the paper is structured as follows. In the next section, we present a detailed overview of our approach. Sec. 3 formalizes the problem and the context. Details about the synthesis and verification procedure are given in Sec. 4. A detailed evaluation study is provided in Sec. 5. Related work and conclusions are given in Secs. 6 and 7, respectively.

2 Motivation and Overview
To motivate the problem and to provide an overview of our approach, consider the definition of a learning-enabled controller that operates an inverted pendulum. While the specification of this system is simple, it is nonetheless representative of a number of practical control systems, such as Segway transporters and autonomous drones, that have thus far proven difficult to verify, but for which high assurance is very desirable.

Figure 1: Inverted Pendulum State Transition System. The pendulum has mass m and length l. A system state is s = [η, ω]^T, where η is its angle and ω is its angular velocity. A continuous control action a maintains the pendulum upright.

2.1 State Transition System
We model an inverted pendulum system as an infinite state transition system with continuous actions in Fig. 1. Here, the pendulum has mass m and length l. A system state is s = [η, ω]^T, where η is the pendulum's angle and ω is its angular velocity. A controller can use a 1-dimensional continuous control action a to maintain the pendulum upright.

Since modern controllers are typically implemented digitally (using digital-to-analog converters for interfacing between the analog system and a digital controller), we assume that the pendulum is controlled in discrete time instants kt, where k = 0, 1, 2, ···; i.e., the controller uses the system dynamics, the change of rate of s denoted as ṡ, to transition every t time period, with the conjecture that the control action a is a constant during each discrete time interval. Using Euler's method, for example, a transition from state s_k = s(kt) at time kt to time kt + t is approximated as s(kt + t) = s(kt) + ṡ(kt) × t. We specify the change of rate ṡ using the differential equation shown in Fig. 1.¹ Intuitively, the control action a is allowed to affect the change of rate of η and ω to balance a pendulum. Thus, small values of a result in small swing and velocity changes of the pendulum, actions that are useful when the pendulum is upright (or nearly so), while large values of a contribute to larger changes in swing and velocity, actions that are necessary when the pendulum enters a state where it may risk losing balance. In case the system dynamics are unknown, we can use known algorithms to infer dynamics from online experiments [1].
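As a concrete illustration, the following minimal Python sketch implements one Euler transition step of this kind; the dynamics function below is only a small-angle stand-in for the actual equations of Fig. 1 (which are not reproduced in this text), and all names and constants are illustrative assumptions rather than the paper's implementation.

import numpy as np

def pendulum_rate(s, a, m=1.0, l=1.0, g=9.8):
    # Stand-in for the Fig. 1 dynamics: rate of change of s = [eta, omega]
    # under control action a, using a small-angle (Taylor) approximation.
    eta, omega = s
    return np.array([omega, (g / l) * eta + a / (m * l * l)])

def euler_step(s, a, t=0.01, f=pendulum_rate):
    # One discrete transition: s(kt + t) = s(kt) + sdot(kt) * t,
    # holding the action a constant over the time interval t.
    return s + f(s, a) * t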

Assume the state transition system of the inverted pendulum starts from a set of initial states S0:

S0 = {(η, ω) | −20 ≤ η ≤ 20 ∧ −20 ≤ ω ≤ 20}

The global safety property we wish to preserve is that the pendulum never falls down. We define a set of unsafe states Su of the transition system (colored in yellow in Fig. 1):

Su = {(η, ω) | ¬(−90 < η < 90 ∧ −90 < ω < 90)}

¹We derive the control dynamics equations assuming that an inverted pendulum satisfies general Lagrangian mechanics, and approximate non-polynomial expressions with their Taylor expansions.


Figure 2: The Framework of Our Approach.

We assume the existence of a neural network control policy πw : R² → R that executes actions over the pendulum, whose weight values w are learned from training episodes. This policy is a state-dependent function mapping a 2-dimensional state s (η and ω) to a control action a. At each transition, the policy mitigates uncertainty through feedback over state variables in s.

Environment Context. An environment context C[·] defines the behavior of the application, where [·] is left open to deploy a reasonable neural controller πw. The actions of the controller are dictated by constraints imposed by the environment. In its simplest form, the environment is simply a state transition system. In the pendulum example, this would be the equations given in Fig. 1, parameterized by pendulum mass and length. In general, however, the environment may include additional constraints (e.g., a constraining bounding box that restricts the motion of the pendulum beyond the specification given by the transition system in Fig. 1).

2.2 Synthesis, Verification, and Shielding
In this paper, we assume that a neural network is trained using a state-of-the-art reinforcement learning strategy [28, 40]. Even though the resulting neural policy may appear to work well in practice, the complexity of its implementation makes it difficult to assert any strong and provable claims about its correctness, since the neurons, layers, weights, and biases are far removed from the intent of the actual controller. We found that state-of-the-art neural network verifiers [17, 23] are ineffective for verification of a neural controller over an infinite time horizon with complex system dynamics.

Framework. We construct a policy interpretation mechanism to enable verification, inspired by prior work on imitation learning [36, 38] and interpretable machine learning [9, 47]. Fig. 2 depicts the high-level framework of our approach. Our idea is to synthesize a deterministic policy program from a neural policy πw, approximating πw (which we call an oracle) with a simpler structural program P. Like πw, P takes as input a system state and generates a control action a. To this end, P is simulated in the environment used to train the neural policy πw to collect feasible states. Guided by πw's actions on such collected states, P is further improved to resemble πw.

The goal of the synthesis procedure is to search for a deterministic program P* satisfying both (1) a quantitative specification, such that it bears reasonably close resemblance to its oracle, so that allowing it to serve as a potential substitute is a sensible notion, and (2) a desired logical safety property, such that, when in operation, the unsafe states defined in the environment C cannot be reached. Formally,

P* = argmax_{P ∈ Safe(C, ⟦H⟧)} d(πw, P, C)    (1)

where d(πw, P, C) measures proximity of P with its neural oracle in an environment C, ⟦H⟧ defines a search space for P with prior knowledge on the shape of target deterministic programs, and Safe(C, ⟦H⟧) restricts the solution space to a set of safe programs. A program P is safe if the safety of the transition system C[P], the deployment of P in the environment C, can be formally verified.

The novelty of our approach against prior work on neural policy interpretation [9, 47] is thus two-fold:

1. We bake in the concept of safety and formal safety verification into the synthesis of a deterministic program from a neural policy, as depicted in Fig. 2. If a candidate program is not safe, we rely on a counterexample-guided inductive synthesis loop to improve our synthesis outcome to enforce the safety conditions imposed by the environment.

2. We allow P to operate in tandem with the high-performing neural policy. P can be viewed as capturing an inductive invariant of the state transition system, which can be used as a shield to describe a boundary of safe states within which the neural policy is free to make optimal control decisions. If the system is about to be driven out of this safety boundary, the synthesized program is used to take an action that is guaranteed to stay within the space subsumed by the invariant. By allowing the synthesis procedure to treat the neural policy as an oracle, we constrain the search space of feasible programs to be those whose actions reside within a proximate neighborhood of actions undertaken by the neural policy.

Synthesis. Reducing a complex neural policy to a simpler yet safe deterministic program is possible because we do not require other properties from the oracle; specifically, we do not require that the deterministic program precisely mirrors the performance of the neural policy. For example, experiments described in [35] show that while a linear-policy-controlled robot can effectively stand up, it is unable to learn an efficient walking gait, unlike a sufficiently-trained neural policy. However, if we just care about the safety of the neural network, we posit that a linear reduction can be sufficiently expressive to describe necessary safety constraints. Based on this hypothesis, for our inverted pendulum example, we can explore a linear program space from which a deterministic program Pθ can be drawn, expressed in terms of the following program sketch:


def P[θ1, θ2](η, ω): return θ1η + θ2ω

Here, θ = [θ1, θ2] are unknown parameters that need to be synthesized. Intuitively, the program weights the importance of η and ω at a particular state to provide a feedback control action to mitigate the deviation of the inverted pendulum from (η = 0, ω = 0).

Our search-based synthesis sets θ to 0 initially. It runs the deterministic program Pθ instead of the oracle neural policy πw within the environment C defined in Fig. 1 (in this case, the state transition system represents the differential equation specification of the controller) to collect a batch of trajectories. A run of the state transition system of C[Pθ] produces a finite trajectory s0, s1, ···, sT. We find θ from the following optimization task that realizes (1):

max_{θ∈R²} E[Σ_{t=0}^{T} d(πw, Pθ, st)]    (2)

where d(πw, Pθ, st) ≡ −(Pθ(st) − πw(st))² if st ∉ Su, and −MAX if st ∈ Su.

This equation aims to search for a program Pθ at minimal distance from the neural oracle πw along sampled trajectories, while simultaneously maximizing the likelihood that Pθ is safe.

Our synthesis procedure, described in Sec. 4.1, is a random search-based optimization algorithm [30]. We sample a new position of θ iteratively from its hypersphere of a given small radius surrounding the current position of θ and move to the new position (w.r.t. a learning rate) as dictated by Equation (2). For the running example, our search synthesizes:

def P(η, ω): return −1205η − 587ω

The synthesized program can be used to intuitively interpret how the neural oracle works. For example, if a pendulum with a positive angle η > 0 leans towards the right (ω > 0), the controller will need to generate a large negative control action to force the pendulum to move left.

Verification. Since our method synthesizes a deterministic program P, we can leverage off-the-shelf formal verification algorithms to verify its safety with respect to the state transition system C defined in Fig. 1. To ensure that P is safe, we must ascertain that it can never transition to an unsafe state, i.e., a state that causes the pendulum to fall. When framed as a formal verification problem, answering such a question is tantamount to discovering an inductive invariant φ that represents all safe states over the state transition system:

1. Safe: φ is disjoint with all unsafe states Su.
2. Init: φ includes all initial states S0.
3. Induction: Any state in φ transitions to another state in φ and hence cannot reach an unsafe state.

Inspired by template-based constraint solving approaches on inductive invariant inference [19, 20, 24, 34], the verification algorithm described in Sec. 4.2 uses a constraint solver to look for an inductive invariant in the form of a convex barrier certificate [34], E(s) ≤ 0, that maps all the states in the (safe) reachable set to non-positive reals and all the states in the unsafe set to positive reals. The basic idea is to identify a polynomial function E : Rⁿ → R such that 1) E(s) > 0 for any state s ∈ Su; 2) E(s) ≤ 0 for any state s ∈ S0; and 3) E(s′) − E(s) ≤ 0 for any state s that transitions to s′ in the state transition system C[P]. The second and third conditions collectively guarantee that E(s) ≤ 0 for any state s in the reachable set, thus implying that an unsafe state in Su can never be reached.

Fig. 3(a) draws the discovered invariant in blue for C[P],

given the initial and unsafe states, where P is the synthesized program for the inverted pendulum system. We can conclude that the safety property is satisfied by the P-controlled system, as all reachable states do not overlap with unsafe states. In case verification fails, our approach conducts a counterexample-guided loop (Sec. 4.2) to iteratively synthesize safe deterministic programs until verification succeeds.

Shielding. Keep in mind that a safety proof of a reduced deterministic program of a neural network does not automatically lift to a safety argument of the neural network from which it was derived, since the network may exhibit behaviors not fully captured by the simpler deterministic program. To bridge this divide, we propose to recover soundness at runtime by monitoring system behaviors of a neural network in its actual environment (deployment) context. Fig. 4 depicts our runtime shielding approach, with more details given in Sec. 4.3. The inductive invariant φ learnt for a deterministic program P of a neural network πw can serve as a shield to protect πw at runtime, under the environment context and safety property used to synthesize P. An observation about a current state is sent to both πw and the shield. A high-performing neural network is allowed to take any actions it feels are optimal, as long as the next state it proposes is still within the safety boundary formally characterized by the inductive invariant of C[P]. Whenever a neural policy proposes an action that steers its controlled system out of the state space defined by the inductive invariant we have learned as part of deterministic program synthesis, our shield can instead take a safe action proposed by the deterministic program P. The action given by P is guaranteed safe because φ defines an inductive invariant of C[P]: taking the action allows the system to stay within the provably safe region identified by φ. Our shielding mechanism is sound due to formal verification. Because the deterministic program was synthesized using πw as an oracle, it is expected that the shield derived from the program will not frequently interrupt the neural network's decision, allowing the combined system to perform (close-to) optimally.

In the inverted pendulum example, since the 90 bound

given as a safety constraint is rather conservative, we do not expect a well-trained neural network to violate this boundary. Indeed, in Fig. 3(a), even though the inductive invariant of the synthesized program defines a substantially smaller state space than what is permissible, in our simulation results we find that the neural policy is never interrupted by the deterministic program when governing the pendulum.


Figure 3: Invariant Inference on Inverted Pendulum.

Figure 4: The Framework of Neural Network Shielding.

Note that the non-shaded areas in Fig. 3(a), while appearing safe, presumably define states from which the trajectory of the system can be led to an unsafe state and would thus not be inductive.

Our synthesis approach is critical to ensuring safety when the neural policy is used to predict actions in an environment different from the one used during training. Consider a neural network suitably protected by a shield that now operates safely. The effectiveness of this shield would be greatly diminished if the network had to be completely retrained from scratch whenever it was deployed in a new environment which imposes different safety constraints.

In our running example, suppose we wish to operate the inverted pendulum in an environment such as a Segway transporter, in which the model is prohibited from swinging significantly and whose angular velocity must be suitably restricted. We might specify the following new constraints on state parameters to enforce these conditions:

Su = {(η, ω) | ¬(−30 < η < 30 ∧ −30 < ω < 30)}

Because of the dependence of a neural network on the quality of training data used to build it, environment changes that deviate from assumptions made at training time could result in a costly retraining exercise, because the network must learn a new way to penalize unsafe actions that were previously safe. However, training a neural network from scratch requires substantial, non-trivial effort, involving fine-grained tuning of training parameters or even network structures.

In our framework, the existing network provides a reasonable approximation to the desired behavior. To incorporate the additional constraints defined by the new environment

C′, we attempt to synthesize a new deterministic program P′ for C′, a task that, based on our experience, is substantially easier to achieve than training a new neural network policy from scratch. This new program can be used to protect the original network, provided that we can use the aforementioned verification approach to formally verify that C′[P′] is safe by identifying a new inductive invariant φ′. As depicted in Fig. 4, we simply build a new shield S′ that is composed of the program P′ and the safety boundary captured by φ′. The shield S′ can ensure the safety of the neural network in the environment context C′ with a strengthened safety condition, despite the fact that the neural network was trained in a different environment context C.

Fig. 3(b) depicts the new unsafe states in C′ (colored in

red). It extends the unsafe range drawn in Fig. 3(a), so the inductive invariant learned there is unsafe for the new one. Our approach synthesizes a new deterministic program for which we learn a more restricted inductive invariant, depicted in green in Fig. 3(b). To characterize the effectiveness of our shielding approach, we examined 1000 simulation trajectories of this restricted version of the inverted pendulum system, each of which is comprised of 5000 simulation steps, with the safety constraints defined by C′. Without the shield S′, the pendulum entered the unsafe region Su 41 times. All of these violations were prevented by S′. Notably, the intervention rate of S′ to interrupt the neural network's decision was extremely low: out of a total of 5000×1000 decisions, we only observed 414 instances (0.00828%) where the shield interfered with (i.e., overrode) the decision made by the network.

3 Problem Setup
We model the context C of a control policy as an environment state transition system C[·] = (X, A, S, S0, Su, Tt[·], f, r). Note that · is intentionally left open to deploy neural control policies. Here, X is a finite set of variables interpreted over the reals R, and S = R^X is the set of all valuations of the variables X. We denote s ∈ S to be an n-dimensional environment state and a ∈ A to be a control action, where A is an infinite set of m-dimensional continuous actions that a learning-enabled controller can perform. We use S0 ⊆ S to specify a set of initial environment states and Su ⊆ S to specify a set of unsafe environment states that a safe controller should avoid. The transition relation Tt[·] defines how one state transitions to another, given an action by an unknown policy. We assume that Tt[·] is governed by a standard differential equation f defining the relationship between a continuously varying state s(t) and action a(t) and its rate of change ṡ(t) over time t:

ṡ(t) = f(s(t), a(t))

We assume f is defined by equations of the form Rⁿ × Rᵐ → Rⁿ, such as the example in Fig. 1. In the following, we often omit the time variable t for simplicity. A deterministic neural


network control policy πw : Rⁿ → Rᵐ, parameterized by a set of weight values w, is a function mapping an environment state s to a control action a, where

a = πw(s).

By providing a feedback action, a policy can alter the rate of change of state variables to realize optimal system control.

The transition relation Tt[·] is parameterized by a control policy π that is deployed in C. We explicitly model this deployment as Tt[π]. Given a control policy πw, we use Tt[πw] ⊆ S × S to specify all possible state transitions allowed by the policy. We assume that a system transitions in discrete time instants kt, where k = 0, 1, 2, ··· and t is a fixed time step (t > 0). A state s transitions to a next state s′ after time t, with the assumption that a control action a(τ) at time τ is a constant during the time period τ ∈ [0, t). Using Euler's method², we discretize the continuous dynamics f with a finite difference approximation so it can be used in the discretized transition relation Tt[πw]. Then Tt[πw] can compute these estimates by the following difference equation:

Tt[πw] = {(s, s′) | s′ = s + f(s, πw(s)) × t}

Environment Disturbance. Our model allows bounded external properties (e.g., additional environment-imposed constraints) by extending the definition of change of rate: ṡ = f(s, a) + d, where d is a vector of random disturbances. We use d to encode environment disturbances in terms of bounded nondeterministic values. We assume that tight upper and lower bounds of d can be accurately estimated at runtime using multivariate normal distribution fitting methods.

Trajectory. A trajectory h of a state transition system C[πw], which we denote as h ∈ C[πw], is a sequence of states s0, ···, si, si+1, ···, where s0 ∈ S0 and (si, si+1) ∈ Tt[πw] for all i ≥ 0. We use C ⊆ C[πw] to denote a set of trajectories. The reward that a control policy receives on performing an action a in a state s is given by the reward function r(s, a).

Reinforcement Learning. The goal of reinforcement learning is to maximize the reward that can be collected by a neural control policy in a given environment context C. Such problems can be abstractly formulated as:

max_{w∈Rⁿ} J(w),  J(w) = E[r(πw)]    (3)

Assume that s0, s1, ..., sT is a trajectory of length T of the state transition system, and r(πw) = lim_{T→∞} Σ_{k=0}^{T} r(sk, πw(sk)) is the cumulative reward achieved by the policy πw from this trajectory. Thus, this formulation uses simulations of the transition system with finite-length rollouts to estimate the expected cumulative reward collected over T time steps and aims to maximize this reward.

²Euler's method may sometimes poorly approximate the true system transition function when f is highly nonlinear. More precise higher-order approaches such as Runge-Kutta methods exist to compensate for loss of precision in this case.

Reinforcement learning assumes policies are expressible in some executable structure (e.g., a neural network) and allows samples generated from one policy to influence the estimates made for others. The two main approaches for reinforcement learning are value function estimation and direct policy search. We refer readers to [27] for an overview.

Safety Verification. Since reinforcement learning only considers finite-length rollouts, we wish to determine if a control policy is safe to use under an infinite time horizon. Given a state transition system, the safety verification problem is concerned with verifying that no trajectories contained in S, starting from an initial state in S0, reach an unsafe state in Su. However, a neural network is a representative of a class of deep and sophisticated models that challenges the capability of state-of-the-art verification techniques. This level of complexity is exacerbated in our work because we consider the long-term safety of a neural policy deployed within a nontrivial environment context C that is, in turn, described by a complex infinite-state transition system.

4 Verification Procedure
To verify a neural network control policy πw with respect to an environment context C, we first synthesize a deterministic policy program P from the neural policy. We require that P both (a) broadly resembles its neural oracle and (b) additionally satisfies a desired safety property when it is deployed in C. We conjecture that a safety proof of C[P] is easier to construct than that of C[πw]. More importantly, we leverage the safety proof of C[P] to ensure the safety of C[πw].

4.1 Synthesis
Fig. 5 defines a search space for a deterministic policy program P, where E and φ represent the basic syntax of (polynomial) program expressions and inductive invariants, respectively. Here, v ranges over a universe of numerical constants, x represents variables, and ⊕ is a basic operator including + and ×. A deterministic program P also features conditional statements using φ as branching predicates.

We allow the user to define a sketch [41, 42] to describe the shape of target policy programs using the grammar in Fig. 5. We use P[θ] to represent a sketch, where θ represents unknowns that need to be filled in by the synthesis procedure. We use Pθ to represent a synthesized program with known values of θ. Similarly, the user can define a sketch of an inductive invariant φ[·] that is used (in Sec. 4.2) to learn a safety proof to verify a synthesized program Pθ.

We do not require the user to explicitly define conditional statements in a program sketch. Our synthesis algorithm uses verification counterexamples to lazily add branch predicates φ under which a program performs different computations, depending on whether φ evaluates to true or false. The end user simply writes a sketch over basic expressions. For example, a sketch that defines a family of linear functions over


E = v | x | ⊕(E1, . . . , Ek)
φ = E ≤ 0
P = return E | if φ then return E else P

Figure 5: Syntax of the Policy Programming Language.

Algorithm 1 Synthesize(πw, P[θ], C[·])
1: θ ← 0
2: do
3:   Sample δ from a zero-mean Gaussian vector
4:   Sample a set of trajectories C1 using C[Pθ+νδ]
5:   Sample a set of trajectories C2 using C[Pθ−νδ]
6:   θ ← θ + α[(d(πw, Pθ+νδ, C1) − d(πw, Pθ−νδ, C2)) / ν]δ
7: while θ is not converged
8: return Pθ

a collection of variables can be expressed as

P[θ](X) = return θ1x1 + θ2x2 + ··· + θnxn + θn+1    (4)

Here, X = (x1, x2, ···) are system variables in which the coefficient constants θ = (θ1, θ2, ···) are unknown.

The goal of our synthesis algorithm is to find unknown values of θ that maximize the likelihood that Pθ resembles the neural oracle πw, while still being a safe program with respect to the environment context C:

θ* = argmax_{θ∈R^{n+1}} d(πw, Pθ, C)    (5)

where d measures the distance between the outputs of the estimate program Pθ and the neural policy πw, subject to safety constraints. To avoid the computational complexity that arises if we consider a solution to this equation analytically as an optimization problem, we instead approximate d(πw, Pθ, C) by randomly sampling a set of trajectories C that are encountered by Pθ in the environment state transition system C[Pθ]:

d(πw, Pθ, C) ≈ d(πw, Pθ, C), s.t. C ⊆ C[Pθ].

We estimate θ* using these sampled trajectories C and define

d(πw, Pθ, C) = Σ_{h∈C} d(πw, Pθ, h)

Since each trajectory h ∈ C is a finite rollout s0, ..., sT of length T, we have:

d(πw, Pθ, h) = Σ_{t=0}^{T} { −‖Pθ(st) − πw(st)‖ if st ∉ Su; −MAX if st ∈ Su }

where ‖·‖ is a suitable norm. As illustrated in Sec. 2.2, we aim to minimize the distance between a synthesized program Pθ from a sketch space and its neural oracle along sampled trajectories encountered by Pθ in the environment context, but put a large penalty on states that are unsafe.
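A minimal Python sketch of this per-trajectory distance is shown below; neural_policy, program, and is_unsafe are assumed callables, MAX_PENALTY stands in for the constant MAX, and the Euclidean norm is used as the suitable norm.

import numpy as np

MAX_PENALTY = 1e6  # stand-in for the large penalty MAX

def trajectory_distance(neural_policy, program, trajectory, is_unsafe):
    # d(pi_w, P_theta, h): negative action gap on safe states,
    # a large negative penalty on unsafe states.
    total = 0.0
    for s in trajectory:
        if is_unsafe(s):
            total -= MAX_PENALTY
        else:
            total -= np.linalg.norm(program(s) - neural_policy(s))
    return total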

Random Search. We implement the idea encoded in equation (5) in Algorithm 1, which depicts the pseudocode of our policy interpretation approach. It takes as inputs a neural policy, a policy program sketch parameterized by θ, and an environment state transition system, and outputs a synthesized policy program Pθ.

An efficient way to solve equation (5) is to directly perturb θ in the search space by adding random noise, and then update θ based on the effect of this perturbation [30]. We choose a direction uniformly at random on the sphere in parameter space and then optimize the goal function along this direction. To this end, in line 3 of Algorithm 1, we sample Gaussian noise δ to be added to policy parameters θ in both directions, where ν is a small positive real number. In line 4 and line 5, we sample trajectories C1 and C2 from the state transition systems obtained by running the perturbed policy programs Pθ+νδ and Pθ−νδ in the environment context C.

We evaluate the proximity of these two policy programs to the neural oracle and, in line 6 of Algorithm 1, to improve the policy program, we also optimize equation (5) by updating θ with a finite difference approximation along this direction:

θ ← θ + α[(d(Pθ+νδ, πw, C1) − d(Pθ−νδ, πw, C2)) / ν]δ    (6)

where α is a predefined learning rate. Such an update increment corresponds to an unbiased estimator of the gradient of θ [29, 33] for equation (5). The algorithm iteratively updates θ until convergence.
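Putting Algorithm 1 together for the linear sketch of equation (4), a minimal Python sketch of the random-search loop might look as follows; rollout (which simulates C[Pθ] and returns a batch of trajectories), is_unsafe, and traj_distance (for instance, the trajectory_distance sketched earlier) are assumed helpers, and the convergence test is replaced by a fixed iteration count.

import numpy as np

def make_linear_program(theta):
    # P_theta(X) = theta_1*x_1 + ... + theta_n*x_n + theta_{n+1}, as in equation (4).
    return lambda s: float(np.dot(theta[:-1], s) + theta[-1])

def synthesize(neural_policy, rollout, traj_distance, is_unsafe, dim,
               alpha=1e-2, nu=1e-2, iters=1000):
    theta = np.zeros(dim)                                  # line 1: theta <- 0
    for _ in range(iters):                                 # stand-in for "until converged"
        delta = np.random.randn(dim)                       # line 3: zero-mean Gaussian direction
        plus, minus = theta + nu * delta, theta - nu * delta
        d_plus = sum(traj_distance(neural_policy, make_linear_program(plus), h, is_unsafe)
                     for h in rollout(plus))               # line 4: trajectories of C[P_{theta+nu*delta}]
        d_minus = sum(traj_distance(neural_policy, make_linear_program(minus), h, is_unsafe)
                      for h in rollout(minus))             # line 5: trajectories of C[P_{theta-nu*delta}]
        theta = theta + alpha * ((d_plus - d_minus) / nu) * delta  # line 6: finite-difference step
    return theta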

4.2 Verification
A synthesized policy program P is verified with respect to an environment context given as an infinite state transition system defined in Sec. 3, C[P] = (X, A, S, S0, Su, Tt[P], f, r). Our verification algorithm learns an inductive invariant φ over the transition relation Tt[P], formally proving that all possible system behaviors are encapsulated in φ, and φ is required to be disjoint with all unsafe states Su.

We follow template-based constraint solving approaches

for inductive invariant inference [19, 20, 24, 34] to discover this invariant. The basic idea is to identify a function E : Rⁿ → R that serves as a barrier [20, 24, 34] between reachable system states (evaluated to be nonpositive by E) and unsafe states (evaluated positive by E). Using the invariant syntax in Fig. 5, the user can define an invariant sketch:

φ[c](X) = E[c](X) ≤ 0    (7)

over variables X and unknown coefficients c intended to be synthesized. Fig. 5 carefully restricts an invariant sketch E[c] to be postulated only as a polynomial function, as there exist efficient SMT solvers [13] and constraint solvers [26] for nonlinear polynomial reals. Formally, assume real coefficients c = (c0, ···, cp) are used to parameterize E[c] in an affine manner:

E[c](X) = Σ_{i=0}^{p} ci·bi(X)


where the bi(X)'s are monomials in variables X. As a heuristic, the user can simply determine an upper bound on the degree of E[c] and then include in the sketch all monomials whose degrees are no greater than the bound. Large values of the bound enable verification of more complex safety conditions but impose greater demands on the constraint solver; small values capture coarser safety properties but are easier to solve.

Example 4.1. Consider the inverted pendulum system in Sec. 2.2. To discover an inductive invariant for the transition system, the user might choose to define an upper bound of 4, which results in the following polynomial invariant sketch φ[c](η, ω) = E[c](η, ω) ≤ 0, where

E[c](η, ω) = c0η⁴ + c1η³ω + c2η²ω² + c3ηω³ + c4ω⁴ + c5η³ + ··· + cp

The sketch includes all monomials over η and ω whose degrees are no greater than 4. The coefficients c = [c0, ···, cp] are unknown and need to be synthesized.
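For instance, a minimal Python sketch of enumerating such a monomial basis up to a degree bound (here instantiated for the two pendulum variables with bound 4, as in Example 4.1) might look as follows; the names are illustrative.

from itertools import combinations_with_replacement

def monomial_exponents(num_vars, max_degree):
    # Exponent vectors of all monomials b_i(X) with total degree <= max_degree.
    exps = []
    for degree in range(max_degree + 1):
        for combo in combinations_with_replacement(range(num_vars), degree):
            e = [0] * num_vars
            for var in combo:
                e[var] += 1
            exps.append(tuple(e))
    return exps

# Degree bound 4 over (eta, omega): 15 monomials, hence 15 unknown coefficients c_i.
basis = monomial_exponents(num_vars=2, max_degree=4)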

To synthesize these unknowns, we require that E[c] must satisfy the following verification conditions:

∀s ∈ Su. E[c](s) > 0    (8)
∀s ∈ S0. E[c](s) ≤ 0    (9)
∀(s, s′) ∈ Tt[P]. E[c](s′) − E[c](s) ≤ 0    (10)

We claim that φ = E[c](x) ≤ 0 defines an inductive invariant because verification condition (9) ensures that any initial state s0 ∈ S0 satisfies φ, since E[c](s0) ≤ 0; verification condition (10) asserts that along the transition from a state s ∈ φ (so E[c](s) ≤ 0) to a resulting state s′, E[c] cannot become positive, so s′ satisfies φ as well. Finally, according to verification condition (8), φ does not include any unsafe state su ∈ Su, as E[c](su) is positive.
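Before handing these conditions to a solver, a candidate E[c] can be sanity-checked on random samples; the Python sketch below tests conditions (8)-(10) empirically, which is not a proof (the actual verification discharges them symbolically, as described next), and the samplers and one-step transition function are assumed to be provided.

def check_barrier_on_samples(E, sample_unsafe, sample_init, sample_state, step, n=10000):
    # Empirically test the three verification conditions on n random samples.
    for _ in range(n):
        if not E(sample_unsafe()) > 0:        # (8): E positive on unsafe states
            return False
        if not E(sample_init()) <= 0:         # (9): E non-positive on initial states
            return False
        s = sample_state()
        if not E(step(s)) - E(s) <= 0:        # (10): E non-increasing along a transition
            return False
    return True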

Verification conditions (8), (9), (10) are polynomials over reals. Synthesizing unknown coefficients can be left to an SMT solver [13] after universal quantifiers are eliminated using a variant of Farkas' Lemma as in [20]. However, observe that E[c] is convex³. We can gain efficiency by finding unknown coefficients c using off-the-shelf convex constraint solvers following [34]. Encoding verification conditions (8), (9), (10) as polynomial inequalities, we search for c that can prove non-negativity of these constraints via an efficient and convex sum-of-squares programming solver [26]. Additional technical details are provided in the supplementary material [49].

Counterexample-guided Inductive Synthesis (CEGIS). Given the environment state transition system C[P] deployed with a synthesized program P, the verification approach above can compute an inductive invariant over the state transition system, or tell if there is no feasible solution in the given set of candidates. Note, however, that the latter does not necessarily imply that the system is unsafe.

³For arbitrary E1(x) and E2(x) satisfying the verification conditions and α ∈ [0, 1], E(x) = αE1(x) + (1 − α)E2(x) satisfies the conditions as well.

Algorithm 2 CEGIS(πw, P[θ], C[·])
1: policies ← ∅
2: covers ← ∅
3: while C.S0 ⊈ covers do
4:   search s0 such that s0 ∈ C.S0 and s0 ∉ covers
5:   r* ← Diameter(C.S0)
6:   do
7:     φbound ← {s | s ∈ [s0 − r*, s0 + r*]}
8:     C ← C where C.S0 = (C.S0 ∩ φbound)
9:     θ ← Synthesize(πw, P[θ], C[·])
10:    φ ← Verify(C[Pθ])
11:    if φ is False then
12:      r* ← r*/2
13:    else
14:      covers ← covers ∪ {s | φ(s)}
15:      policies ← policies ∪ (Pθ, φ)
16:      break
17:  while True
18: end
19: return policies

Since our goal is to learn a safe deterministic program from a neural network, we develop a counterexample-guided inductive program synthesis approach. A CEGIS algorithm in our context is challenging because safety verification is necessarily incomplete and may not be able to produce a counterexample that serves as an explanation for why a verification attempt is unsuccessful.

We solve the incompleteness challenge by leveraging the fact that we can simultaneously synthesize and verify a program. Our CEGIS approach is given in Algorithm 2. A counterexample is an initial state on which our synthesized program is not yet proved safe. Driven by these counterexamples, our algorithm synthesizes a set of programs from a sketch, along with the conditions under which we can switch from one synthesized program to another.

Algorithm 2 takes as input a neural policy πw, a program sketch P[θ], and an environment context C. It maintains synthesized policy programs in policies (line 1), each of which is inductively verified safe in a partition of the universe state space that is maintained in covers (line 2). For soundness, the state space covered by such partitions must be able to include all initial states, checked in line 3 of Algorithm 2 by an SMT solver. In the algorithm, we use C.S0 to access a field of C, such as its initial state space.

The algorithm iteratively samples a counterexample initial state s0 that is currently not covered by covers (line 4). Since covers is empty at the beginning, this choice is uniformly random; initially, we synthesize a presumably safe policy program (line 9 of Algorithm 2) that resembles the neural policy πw, considering all possible initial states S0 of the given environment model C, using Algorithm 1.


Figure 6: CEGIS for Verifiable Reinforcement Learning.

If verification fails in line 11, our approach simply reduces the initial state space, hoping that a safe policy program is easier to synthesize if only a subset of initial states is considered. The algorithm in line 12 gradually shrinks the radius r* of the initial state space around the sampled initial state s0. The synthesis algorithm in the next iteration synthesizes and verifies a candidate using the reduced initial state space. The main idea is that if the initial state space is shrunk to a restricted area around s0 but a safe policy program still cannot be found, it is quite possible that either s0 points to an unsafe initial state of the neural oracle, or the sketch is not sufficiently expressive.

If verification at this stage succeeds with an inductive invariant φ, a new policy program Pθ is synthesized that can be verified safe in the state space covered by φ. We add the inductive invariant and the policy program into covers and policies in lines 14 and 15, respectively, and then move to another iteration of counterexample-guided synthesis. This iterative synthesize-and-verify process continues until the entire initial state space is covered (lines 3 to 18). The output of Algorithm 2 is [(Pθ1, φ1), (Pθ2, φ2), ···], which essentially defines conditional statements in a synthesized program, performing different actions depending on whether a specified invariant condition evaluates to true or false.

Theorem 4.2. If CEGIS(πw, P[θ], C[·]) = [(Pθ1, φ1), (Pθ2, φ2), ···] (as defined in Algorithm 2), then the deterministic program

P ≜ λX. if φ1(X) return Pθ1(X) else if φ2(X) return Pθ2(X) ···

is safe in the environment C, meaning that φ1 ∨ φ2 ∨ ··· is an inductive invariant of C[P], proving that C.Su is unreachable.

Example 4.3. We illustrate the proposed counterexample-guided inductive synthesis method by means of a simple example, the Duffing oscillator [22], a nonlinear second-order environment. The transition relation of the environment system C is described with the differential equation:

ẋ = y
ẏ = −0.6y − x − x³ + a

where x, y are the state variables and a the continuous control action given by a well-trained neural feedback control policy π such that a = π(x, y). The control objective is to regulate the state to the origin from a set of initial states C.S0 = {x, y | −2.5 ≤ x ≤ 2.5 ∧ −2 ≤ y ≤ 2}. To be safe, the controller must be able to avoid a set of unsafe states C.Su = {x, y | ¬(−5 ≤ x ≤ 5 ∧ −5 ≤ y ≤ 5)}. Given a program sketch as in the pendulum example, that is, P[θ](x, y) = θ1x + θ2y, the user can ask the constraint solver to reason over a small (say 4) order polynomial invariant sketch for a synthesized program as in Example 4.1.

Algorithm 2 initially samples an initial state s as x = −0.46, y = −0.36 from S0. The inner do-while loop of the algorithm can discover a sub-region of initial states in the dotted box of Fig. 6(a) that can be leveraged to synthesize a verified deterministic policy P1(x, y) = 0.39x − 1.41y from the sketch. We also obtain an inductive invariant showing that the synthesized policy can always maintain the controller within the invariant set drawn in purple in Fig. 6(a): φ1 ≡ 20.9x⁴ + 2.9x³y + 1.4x²y² + 0.4xy³ + 29.6x³ + 20.1x²y + 11.3xy² + 1.6y³ + 25.2x² + 39.2xy + 53.7y² − 68.0 ≤ 0.

This invariant explains why the initial state space used for verification does not include the entire C.S0: a counterexample initial state s′ = {x = 2.249, y = 2} is not covered by the invariant, for which the synthesized policy program above is not verified safe. The CEGIS loop in line 3 of Algorithm 2 uses s′ to synthesize another deterministic policy P2(x, y) = 0.88x − 2.34y from the sketch, whose learned inductive invariant is depicted in blue in Fig. 6(b): φ2 ≡ 12.8x⁴ + 0.9x³y − 0.2x²y² − 5.9x³ − 1.5xy² − 0.3y³ + 2.2x² + 4.7xy + 40.4y² − 61.9 ≤ 0. Algorithm 2 then terminates because φ1 ∨ φ2 covers C.S0. Our system interprets the two synthesized deterministic policies as the following deterministic program Poscillator, using the syntax in Fig. 5:

def Poscillator(x, y)
  if 20.9x⁴ + 2.9x³y + 1.4x²y² + ··· + 53.7y² − 68.0 ≤ 0   (φ1)
    return 0.39x − 1.41y
  else if 12.8x⁴ + 0.9x³y − 0.2x²y² + ··· + 40.4y² − 61.9 ≤ 0   (φ2)
    return 0.88x − 2.34y
  else abort   (unsafe)

Neither of the two deterministic policies enforces safety by itself on all initial states, but they do guarantee safety when combined together, because, by construction, φ = φ1 ∨ φ2 is an inductive invariant of Poscillator in the environment C.

Although Algorithm 2 is sound, it may not terminate, as r* in the algorithm can become arbitrarily small, and there is also no restriction on the size of potential counterexamples. Nonetheless, our experimental results indicate that the algorithm performs well in practice.

4.3 Shielding
A safety proof of a synthesized deterministic program of a neural network does not automatically lift to a safety argument of the neural network from which it was derived,


Algorithm 3 Shield(s, C[πw], P, φ)
1: Predict s′ such that (s, s′) ∈ C[πw].Tt(s)
2: if φ(s′) then return πw(s)
3: else return P(s)

since the network may exhibit behaviors not captured by the simpler deterministic program. To bridge this divide, we recover soundness at runtime by monitoring system behaviors of a neural network in its environment context, using the synthesized policy program and its inductive invariant as a shield. The pseudo-code for using a shield is given in Algorithm 3. In line 1 of Algorithm 3, for a current state s, we use the state transition system of our environment context model to predict the next state s′. If s′ is not within φ, we are unsure whether entering s′ would inevitably make the system unsafe, as we lack a proof that the neural oracle is safe. However, we do have a guarantee that if we follow the synthesized program P, the system will stay within the safety boundary defined by φ that Algorithm 2 has formally proved. We do so in line 3 of Algorithm 3, using P and φ as shields, intervening only if necessary so as to restrict the shield from unnecessarily interrupting the neural policy.⁴
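In code, the shield of Algorithm 3 amounts to a small wrapper around the two policies; in the Python sketch below, predict_next stands in for the environment model's one-step prediction under the neural action, phi is the verified inductive invariant, and all names are illustrative.

def shield_step(s, neural_policy, program, phi, predict_next):
    # Algorithm 3: defer to the neural policy only if its predicted successor
    # stays inside the verified invariant phi; otherwise fall back to P.
    a_nn = neural_policy(s)
    if phi(predict_next(s, a_nn)):   # lines 1-2: predicted next state within phi
        return a_nn
    return program(s)                # line 3: safe action from the synthesized program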

5 Experimental Results
We have applied our framework to a number of challenging control- and cyber-physical-system benchmarks. We consider the utility of our approach for verifying the safety of trained neural network controllers. We use the deep policy gradient algorithm [28] for neural network training, the Z3 SMT solver [13] to check convergence of the CEGIS loop, and the Mosek constraint solver [6] to generate inductive invariants of synthesized programs from a sketch. All of our benchmarks are verified using the program sketch defined in equation (4) and the invariant sketch defined in equation (7). We report simulation results on our benchmarks over 1000 runs (each run consists of 5000 simulation steps). Each simulation time step is fixed at 0.01 second. Our experiments were conducted on a standard desktop machine consisting of Intel(R) Core(TM) i7-8700 CPU cores and 64GB memory.

Case Study on Inverted Pendulum. We first give more details on the evaluation result of our running example, the inverted pendulum. Here, we consider a more restricted safety condition that deems the system to be unsafe when the pendulum's angle is more than 23° from the origin (i.e., significant swings are further prohibited). The controller is a 2-hidden-layer neural model (240×200). Our tool interprets the neural network as a program containing three conditional branches:

def P(η, ω)
  if 17533η⁴ + 13732η³ω + 3831η²ω² − 5472ηω³ + 8579ω⁴ + 6813η³ + 9634η²ω + 3947ηω² − 120ω³ + 1928η² + 1915ηω + 1104ω² − 313 ≤ 0
    return −1728176866η − 1009441768ω
  else if 2485η⁴ + 826η³ω − 351η²ω² + 581ηω³ + 2579ω⁴ + 591η³ + 9η²ω + 243ηω² − 189ω³ + 484η² + 170ηω + 287ω² − 82 ≤ 0
    return −1734281984η − 1073944835ω
  else if 115496η⁴ + 64763η³ω + 85376η²ω² + 21365ηω³ + 7661ω⁴ − 111271η³ − 54416η²ω − 66684ηω² − 8742ω³ + 33701η² + 11736ηω + 12503ω² − 1185 ≤ 0
    return −2578835525η − 1625056971ω
  else abort

⁴We also extended our approach to synthesize deterministic programs which can guarantee stability; see the supplementary material [49].

Over 1000 simulation runs, we found 60 unsafe cases when running the neural controller alone. Importantly, running the neural controller in tandem with the above verified and synthesized program prevented all unsafe neural decisions, with only 65 interventions from the program on the neural network. Our results demonstrate that CEGIS is important to ensure a synthesized program is safe to use. We report more details, including synthesis times, below.

One may ask why we do not directly learn a deterministic program to control the device (without appealing to the neural policy at all), using reinforcement learning to synthesize its unknown parameters. Even for an example as simple as the inverted pendulum, our discussion above shows that it is difficult (if not impossible) to derive a single straight-line (linear) program that is safe to control the system. Even for scenarios in which a straight-line program suffices, using existing RL methods to directly learn unknown parameters in our sketches may still fail to obtain a safe program. For example, we considered whether a linear control policy can be learned to (1) prevent the pendulum from falling down, and (2) require a pendulum control action to be strictly within the range [−1, 1] (e.g., operating the controller in an environment with low power constraints). We found that, despite many experiments on tuning learning rates and rewards, directly training a linear control program to conform to this restriction with either reinforcement learning (e.g., policy gradient) or random search [29] was unsuccessful because of undesirable overfitting. In contrast, neural networks work much better for these RL algorithms and can be used to guide the synthesis of a deterministic program policy. Indeed, by treating the neural policy as an oracle, we were able to quickly discover a straight-line linear deterministic program that in fact satisfies this additional motion constraint.

Safety Verification. Our verification results are given in Table 1. In the table, Vars represents the number of variables in a control system (this number serves as a proxy for application complexity); Size, the number of neurons in hidden layers of the network; Training, the training time for the network; and Failures, the number of times the network failed to satisfy the safety property in simulation. The table also gives the Size of a synthesized program in terms of the number of policies found by Algorithm 2 (used to generate conditional statements in the program), its Synthesis time, and the Overhead of our approach in terms of the additional cost


Table 1. Experimental Results on Deterministic Program Synthesis, Verification, and Shielding

Benchmarks          | Vars | NN Size         | Training | Failures | Shield Size | Synthesis | Overhead | Interventions | NN    | Program
Satellite           |  2   | 240 × 200       |   957s   |    0     |      1      |   160s    |  3.37%   |       0       |   57  |    97
DCMotor             |  3   | 240 × 200       |   944s   |    0     |      1      |    68s    |  2.03%   |       0       |  119  |   122
Tape                |  3   | 240 × 200       |   980s   |    0     |      1      |    42s    |  2.63%   |       0       |   30  |    36
Magnetic Pointer    |  3   | 240 × 200       |   992s   |    0     |      1      |    85s    |  2.92%   |       0       |   83  |    88
Suspension          |  4   | 240 × 200       |   960s   |    0     |      1      |    41s    |  8.71%   |       0       |   47  |    61
Biology             |  3   | 240 × 200       |   978s   |    0     |      1      |   168s    |  5.23%   |       0       | 2464  |  2599
DataCenter Cooling  |  3   | 240 × 200       |   968s   |    0     |      1      |   168s    |  4.69%   |       0       |  146  |   401
Quadcopter          |  2   | 300 × 200       |   990s   |   182    |      2      |    67s    |  6.41%   |     185       |   72  |    98
Pendulum            |  2   | 240 × 200       |   962s   |    60    |      3      |  1107s    |  9.65%   |      65       |  442  |   586
CartPole            |  4   | 300 × 200       |   990s   |    47    |      4      |   998s    |  5.62%   |    1799       | 6813  | 19126
Self-Driving        |  4   | 300 × 200       |   990s   |    61    |      1      |   185s    |  4.66%   |     236       | 1459  |  5136
Lane Keeping        |  4   | 240 × 200       |   895s   |    36    |      1      |   183s    |  8.65%   |      64       | 3753  |  6435
4-Car platoon       |  8   | 500 × 400 × 300 |  1160s   |    8     |      4      |   609s    |  3.17%   |       8       |   76  |    96
8-Car platoon       | 16   | 500 × 400 × 300 |  1165s   |    40    |      1      |  1217s    |  6.05%   |    1080       |  385  |   554
Oscillator          | 18   | 240 × 200       |  1023s   |   371    |      1      |   618s    | 21.31%   |   93703       | 6935  | 11353

(compared to the non-shielded variant) in running time to use a shield, and the number of Interventions, i.e., the number of times the shield was invoked across all simulation runs. We also report the performance gap between a shielded neural policy (NN) and a purely programmatic policy (Program), in terms of the number of steps, on average, that a controlled system spends in reaching a steady state of the system (i.e., a convergence state).

The first five benchmarks are linear time-invariant control systems adapted from [15]. The safety property is that the reach set has to be within a safe rectangle. Benchmark Biology defines a minimal model of glycemic control in diabetic patients such that the dynamics of glucose and insulin interaction in the blood system are defined by polynomials [10]. For safety, we verify that the neural controller ensures that the level of plasma glucose concentration is above a certain threshold. Benchmark DataCenter Cooling is a model of a collection of three server racks, each with their own cooling devices, which also shed heat to their neighbors. The safety property is that a learned controller must keep the data center below a certain temperature. In these benchmarks, the cost to query the network oracle constitutes the dominant time for generating the safety shield. Given the simplicity of these benchmarks, the neural network controllers did not violate the safety condition in our trials, and moreover there were no interventions from the safety shields that affected performance.

The next three benchmarks, Quadcopter, (Inverted) Pendulum, and Cartpole, are selected from classic control applications and have more sophisticated safety conditions. We have discussed the inverted pendulum example at length earlier. The Quadcopter environment tests whether a controlled quadcopter can realize stable flight. The environment of Cartpole consists of a pole attached to an unactuated joint connected to a cart that moves along a frictionless track. The system is unsafe when the pole's angle is more than 30° from being upright or the cart moves by more than 0.3 meters from the origin. We observed safety violations in each of these benchmarks that were eliminated using our verification methodology. Notably, the number of interventions is remarkably low as a percentage of the overall number of simulation steps.

Benchmark Self-driving defines a single car navigation problem. The neural controller is responsible for preventing the car from veering into canals found on either side of the road. Benchmark Lane Keeping models another safety-related car-driving problem. The neural controller aims to maintain a vehicle between lane markers and keep it centered in a possibly curved lane. The curvature of the road is considered as a disturbance input. Environment disturbances of this kind can be conservatively specified in our model, accounting for noise and unmodeled dynamics. Our verification approach supports these disturbances (verification condition (10)). Benchmarks n-Car platoon model multiple (n) vehicles forming a platoon, maintaining a safe relative distance among one another [39]. Each of these benchmarks exhibited some number of violations that were remediated by our verification methodology. Benchmark Oscillator consists of a two-dimensional switched oscillator plus a 16th-order filter. The filter smooths the input signals and has a single output signal. We verify that the output signal is below a safe threshold. Because of the model complexity of this benchmark, it exhibited significantly more violations than the others. Indeed, the neural-network controlled system often oscillated between the safe and unsafe boundary in many runs. Consequently, the overhead in this benchmark is high because a large number of shield interventions was required to ensure safety.


In other words, the synthesized shield trades performance for safety to guarantee that the threshold boundary is never violated.

For all benchmarks, our tool successfully generated safe, interpretable deterministic programs and inductive invariants as shields. When a neural controller takes an unsafe action, the synthesized shield correctly prevents this action from executing by providing an alternative, provably safe action proposed by the verified deterministic program. In terms of performance, Table 1 shows that a shielded neural policy is a feasible approach to drive a controlled system into a steady state. For each of the benchmarks studied, the programmatic policy is less performant than the shielded neural policy, sometimes by a factor of two or more (e.g., Cartpole, Self-Driving, and Oscillator). Our results demonstrate that executing a neural policy in tandem with a program distilled from it can retain the performance provided by the neural policy while maintaining the safety provided by the verified program.

Although our synthesis algorithm does not guarantee convergence to the global minimum when applied to nonconvex RL problems, the results given in Table 1 indicate that our algorithm can often produce high-quality control programs, with respect to a provided sketch, that converge reasonably fast. In some of our benchmarks, however, the number of interventions is significantly higher than the number of neural controller failures, e.g., 8-Car platoon and Oscillator. The high number of interventions is not primarily caused by non-optimality of the synthesized programmatic controller; instead, inherent safety issues in the neural network models are the main culprit that triggers shield interventions. In 8-Car platoon, after corrections made by our deterministic program, the neural model again takes an unsafe action in the next execution step, so that the deterministic program has to make another correction. It is only after applying a number of such shield interventions that the system navigates into a part of the state space that can be safely operated on by the neural model. For this benchmark, all system states where shield interventions occur are indeed unsafe. We also examined the unsafe simulation runs made by executing the neural controller alone in Oscillator. Among the large number of shield interventions (as reflected in Table 1), 74% of them are effective and indeed prevent an unsafe neural decision. Ineffective interventions in Oscillator are due to the fact that, when optimizing equation (5), a large penalty is given to unsafe states, causing the synthesized programmatic policy to weigh safety more than proximity when there exist a large number of unsafe neural decisions.

Suitable Sketches. Providing a suitable sketch may require domain knowledge. To help the user more easily tune the shape of a sketch, our approach provides algorithmic support by not requiring conditional statements in a program sketch, and offers syntactic sugar for invariant sketches, i.e., the user can simply provide an upper bound on the degree of an invariant sketch.
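As an illustration of this syntactic sugar, the sketch generator only needs a degree bound to enumerate the monomials whose coefficients become the unknowns of the invariant template; the helper below is a minimal sketch of that expansion, not the tool's actual implementation.

    from itertools import combinations_with_replacement

    def invariant_sketch_terms(variables, max_degree):
        # Enumerate all monomials over `variables` up to `max_degree`; each monomial
        # receives an unknown coefficient, so the invariant template is the
        # polynomial inequality  sum_i c_i * monomial_i <= 0.
        monomials = [()]  # the constant term
        for d in range(1, max_degree + 1):
            monomials += list(combinations_with_replacement(variables, d))
        return monomials

    # A degree-2 template over the pendulum state (eta, omega) has six terms:
    # 1, eta, omega, eta^2, eta*omega, omega^2.
    print(invariant_sketch_terms(["eta", "omega"], 2))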

Table 2. Experimental Results on Tuning Invariant Degrees. TO means that an adequate inductive invariant cannot be found within 2 hours.

Benchmarks    | Degree | Verification | Interventions | Overhead
Pendulum      |   2    |      TO      |       -       |    -
Pendulum      |   4    |     226s     |     40542     |  7.82%
Pendulum      |   8    |     236s     |     30787     |  8.79%
Self-Driving  |   2    |      TO      |       -       |    -
Self-Driving  |   4    |      24s     |    128851     |  6.97%
Self-Driving  |   8    |     251s     |    123671     | 26.85%
8-Car platoon |   2    |    1729s     |     43952     |  8.36%
8-Car platoon |   4    |    5402s     |     37990     |  9.19%
8-Car platoon |   8    |      TO      |       -       |    -

Our experimental results are collected using the invariant sketch defined in equation (7), and we chose an upper bound of 4 on the degree of all monomials included in the sketch. Recall that invariants may also be used as conditional predicates as part of a synthesized program. We adjust the invariant degree upper bound to evaluate its effect on our synthesis procedure. The results are given in Table 2.

Generally, high-degree invariants lead to fewer interventions because they tend to be more permissive than low-degree ones. However, high-degree invariants take more time to synthesize and verify. This is particularly true for high-dimensional models such as 8-Car platoon. Moreover, although high-degree invariants tend to cause fewer interventions, they have larger overhead. For example, using a shield of degree 8 in the Self-Driving benchmark caused an overhead of 26.85%. This is because high-degree polynomial computations are time-consuming. On the other hand, an insufficient degree upper bound may not be permissive enough to obtain a valid invariant. It is therefore essential to consider the tradeoff between overhead and permissiveness when choosing the degree of an invariant.

Handling Environment Changes. We consider the effectiveness of our tool when previously trained neural network controllers are deployed in environment contexts different from the environment used for training. Here we consider neural network controllers of larger size (two hidden layers with 1200 × 900 neurons) than in the above experiments. This is because we want to ensure that a neural policy is trained to be near optimal in the environment context used for training. These larger networks were in general more difficult to train, requiring at least 1500 seconds to converge. Our results are summarized in Table 3.

When the underlying environment slightly changes, learning a new safety shield takes substantially less time than training a new network. For Cartpole, we simulated the trained controller in a new environment by increasing the length of the pole by 0.15 meters. The neural controller failed 3 times in our 1000-episode simulation; the shield interfered with the network's operation only 8 times to prevent these unsafe behaviors.


Table 3. Experimental Results on Handling Environment Changes

Benchmarks   | Environment Change                     | NN Size    | Failures | Shield Size | Synthesis | Overhead | Shield Interventions
Cartpole     | Increased pole length by 0.15m         | 1200 × 900 |     3    |      1      |   239s    |  2.91%   |      8
Pendulum     | Increased pendulum mass by 0.3kg       | 1200 × 900 |    77    |      1      |   581s    |  8.11%   |   8748
Pendulum     | Increased pendulum length by 0.15m     | 1200 × 900 |    76    |      1      |   483s    |  6.53%   |   7060
Self-driving | Added an obstacle that must be avoided | 1200 × 900 |   203    |      1      |   392s    |  8.42%   |  108320

The new shield was synthesized in 239s, significantly faster than retraining a new neural network for the new environment. For (Inverted) Pendulum, we deployed the trained neural network in an environment in which the pendulum's mass is increased by 0.3kg. The neural controller exhibits noticeably higher failure rates than in the previous experiment; we were able to synthesize a safety shield adapted to this new environment in 581 seconds that prevented these violations. The shield intervened with the operation of the network only 8.7 times per episode on average. Similar results were observed when we increased the pendulum's length by 0.15m. For Self-driving, we additionally required the car to avoid an obstacle. The synthesized shield provided safe actions to ensure collision-free motion.

6 Related Work

Verification of Neural Networks. While neural networks have historically found only limited application in safety- and security-related contexts, recent advances have defined suitable verification methodologies that are capable of providing stronger assurance guarantees. For example, Reluplex [23] is an SMT solver that supports linear real arithmetic with ReLU constraints and has been used to verify safety properties in a collision avoidance system. AI2 [17, 31] is an abstract interpretation tool that can reason about robustness specifications for deep convolutional networks. Robustness verification is also considered in [8, 21]. Systematic testing techniques such as [43, 45, 48] are designed to automatically generate test cases to increase a coverage metric, i.e., explore different parts of the neural network architecture by generating test inputs that maximize the number of activated neurons. These approaches ignore effects induced by the actual environment in which the network is deployed. They are typically ineffective for safety verification of a neural network controller over an infinite time horizon with complex system dynamics. Unlike these whitebox efforts that reason over the architecture of the network, our verification framework is fully blackbox, using the network only as an oracle to learn a simpler deterministic program. This gives us the ability to reason about safety entirely in terms of the synthesized program, using a shielding mechanism derived from the program to ensure that the neural policy can only explore safe regions.

Safe Reinforcement Learning. The machine learning community has explored various techniques to develop safe reinforcement learning algorithms in contexts similar to ours, e.g., [32, 46]. In some cases, verification is used as a safety validation oracle [2, 11]. In contrast to these efforts, our inductive synthesis algorithm can be freely incorporated into any existing reinforcement learning framework and applied to any legacy machine learning model. More importantly, our design allows us to synthesize new shields from existing ones without requiring retraining of the network under reasonable changes to the environment.

Syntax-Guided Synthesis for Machine Learning. An important reason underlying the success of program synthesis is that sketches of the desired program [41, 42], often provided along with example data, can be used to effectively restrict a search space and allow users to provide additional insight about the desired output [5]. Our sketch-based inductive synthesis approach is a realization of this idea, applicable to the continuous and infinite-state control systems central to many machine learning tasks. A high-level policy language grammar is used in our approach to constrain the shape of possible synthesized deterministic policy programs. The program parameters in our sketches are continuous because they naturally fit continuous control problems. However, our synthesis algorithm (line 9 of Algorithm 2) can call a derivative-free random search method (which does not require the gradient or its approximate finite difference) for synthesizing functions whose domain is disconnected, (mixed-)integer, or non-smooth. Our synthesizer thus may be applied to other domains that are not continuous or differentiable, like most modern program synthesis algorithms [41, 42]. In a similar vein, recent efforts on interpretable machine learning generate interpretable models such as program code [47] or decision trees [9] as output, after which traditional symbolic verification techniques can be leveraged to prove program properties. Our approach novelly ensures that only safe programs are synthesized via a CEGIS procedure and can provide safety guarantees for the original high-performing neural networks via invariant inference. A detailed comparison was provided on page 3.

Controller Synthesis. Random search [30] has long been utilized as an effective approach for controller synthesis. In [29], several extensions of random search have been proposed that substantially improve its performance when applied to reinforcement learning.


However, [29] learns a single linear controller that, in our experience, is insufficient to guarantee safety for our benchmarks. We often need to learn a program that involves a family of linear controllers conditioned on different input spaces to guarantee safety. The novelty of our approach is that it can automatically synthesize programs with conditional statements as needed, which is critical to ensure safety.

In Sec. 5, linear sketches were used in our experiments for shield synthesis. LQR-Tree based approaches such as [44] can synthesize a controller consisting of a set of linear quadratic regulator (LQR [14]) controllers, each applicable in a different portion of the state space. These approaches focus on stability, while our approach primarily addresses safety. In our experiments, we observed that because LQR does not take safe/unsafe regions into consideration, synthesized LQR controllers can regularly violate safety constraints.

Runtime Shielding. Shield synthesis has been used in runtime enforcement for reactive systems [12, 25]. Safety shielding in reinforcement learning was first introduced in [4]. Because their verification approach is not symbolic, however, it can only work over finite discrete state and action systems. Since states in infinite-state and continuous-action systems are not enumerable, using these methods requires the end user to provide a finite abstraction of complex environment dynamics; such abstractions are typically too coarse to be useful (because they result in too much over-approximation) or otherwise have too many states to be analyzable [15]. Indeed, automated environment abstraction tools such as [16] often take hours even for simple 4-dimensional systems. Our approach embraces the nature of infinity in control systems by learning a symbolic shield from an inductive invariant for a synthesized program that includes an infinite number of environment states under which the program can be guaranteed to be safe. We therefore believe our framework provides a more promising verification pathway for machine learning in high-dimensional control systems. In general, these efforts focused on shielding of discrete and finite systems, with no obvious generalization to effectively deal with the continuous and infinite systems that our approach can monitor and protect. Shielding in continuous control systems was introduced in [3, 18]. These approaches are based on HJ reachability (Hamilton-Jacobi Reachability [7]). However, current formulations of HJ reachability are limited to systems with approximately five dimensions, making high-dimensional system verification intractable. Unlike their least restrictive control law framework that decouples safety and learning, our approach only considers state spaces that a learned controller attempts to explore and is therefore capable of synthesizing safety shields even in high dimensions.

7 Conclusions

This paper presents an inductive synthesis-based toolchain that can verify neural network control policies within an environment formalized as an infinite-state transition system. The key idea is a novel synthesis framework capable of synthesizing a deterministic policy program, based on a user-given sketch, that resembles a neural policy oracle and simultaneously satisfies a safety specification, using a counterexample-guided synthesis loop. Verification soundness is achieved at runtime by using the learned deterministic program, along with its learned inductive invariant, as a shield to protect the neural policy, correcting the neural policy's action only if the chosen action can cause violation of the inductive invariant. Experimental results demonstrate that our approach can be used to realize fully trustworthy reinforcement learning systems with low overhead.

References

[1] M. Ahmadi, C. Rowley, and U. Topcu. 2018. Control-Oriented Learning of Lagrangian and Hamiltonian Systems. In American Control Conference.
[2] Anayo K. Akametalu, Shahab Kaynama, Jaime F. Fisac, Melanie Nicole Zeilinger, Jeremy H. Gillula, and Claire J. Tomlin. 2014. Reachability-based safe learning with Gaussian processes. In 53rd IEEE Conference on Decision and Control, CDC 2014, 1424–1431.
[3] Anayo K. Akametalu, Shahab Kaynama, Jaime F. Fisac, Melanie Nicole Zeilinger, Jeremy H. Gillula, and Claire J. Tomlin. 2014. Reachability-based safe learning with Gaussian processes. In 53rd IEEE Conference on Decision and Control, CDC 2014, Los Angeles, CA, USA, December 15-17, 2014, 1424–1431.
[4] Mohammed Alshiekh, Roderick Bloem, Rüdiger Ehlers, Bettina Könighofer, Scott Niekum, and Ufuk Topcu. 2018. Safe Reinforcement Learning via Shielding. AAAI (2018).
[5] Rajeev Alur, Rastislav Bodík, Eric Dallal, Dana Fisman, Pranav Garg, Garvit Juniwal, Hadas Kress-Gazit, P. Madhusudan, Milo M. K. Martin, Mukund Raghothaman, Shambwaditya Saha, Sanjit A. Seshia, Rishabh Singh, Armando Solar-Lezama, Emina Torlak, and Abhishek Udupa. 2015. Syntax-Guided Synthesis. In Dependable Software Systems Engineering. NATO Science for Peace and Security Series D: Information and Communication Security, Vol. 40. IOS Press, 1–25.
[6] MOSEK ApS. 2018. The MOSEK optimization software. http://www.mosek.com
[7] Somil Bansal, Mo Chen, Sylvia Herbert, and Claire J. Tomlin. 2017. Hamilton-Jacobi Reachability: A Brief Overview and Recent Advances. (2017).
[8] Osbert Bastani, Yani Ioannou, Leonidas Lampropoulos, Dimitrios Vytiniotis, Aditya V. Nori, and Antonio Criminisi. 2016. Measuring Neural Net Robustness with Constraints. In Proceedings of the 30th International Conference on Neural Information Processing Systems (NIPS'16), 2621–2629.
[9] Osbert Bastani, Yewen Pu, and Armando Solar-Lezama. 2018. Verifiable Reinforcement Learning via Policy Extraction. CoRR abs/1805.08328 (2018).
[10] Richard N. Bergman, Diane T. Finegood, and Marilyn Ader. 1985. Assessment of insulin sensitivity in vivo. Endocrine Reviews 6, 1 (1985), 45–86.
[11] Felix Berkenkamp, Matteo Turchetta, Angela P. Schoellig, and Andreas Krause. 2017. Safe Model-based Reinforcement Learning with Stability Guarantees. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 908–919.
[12] Roderick Bloem, Bettina Könighofer, Robert Könighofer, and Chao Wang. 2015. Shield Synthesis: Runtime Enforcement for Reactive Systems. In Tools and Algorithms for the Construction and Analysis of Systems - 21st International Conference, TACAS 2015, 533–548.


[13] Leonardo De Moura and Nikolaj Bjørner. 2008. Z3: An Efficient SMT Solver. In Proceedings of the Theory and Practice of Software, 14th International Conference on Tools and Algorithms for the Construction and Analysis of Systems (TACAS'08/ETAPS'08), 337–340.
[14] Peter Dorato, Vito Cerone, and Chaouki Abdallah. 1994. Linear-Quadratic Control: An Introduction. Simon & Schuster, Inc., New York, NY, USA.
[15] Chuchu Fan, Umang Mathur, Sayan Mitra, and Mahesh Viswanathan. 2018. Controller Synthesis Made Real: Reach-Avoid Specifications and Linear Dynamics. In Computer Aided Verification - 30th International Conference, CAV 2018, 347–366.
[16] Ioannis Filippidis, Sumanth Dathathri, Scott C. Livingston, Necmiye Ozay, and Richard M. Murray. 2016. Control design for hybrid systems with TuLiP: The Temporal Logic Planning toolbox. In 2016 IEEE Conference on Control Applications, CCA 2016, 1030–1041.
[17] Timon Gehr, Matthew Mirman, Dana Drachsler-Cohen, Petar Tsankov, Swarat Chaudhuri, and Martin T. Vechev. 2018. AI2: Safety and Robustness Certification of Neural Networks with Abstract Interpretation. In 2018 IEEE Symposium on Security and Privacy, SP 2018, 3–18.
[18] Jeremy H. Gillula and Claire J. Tomlin. 2012. Guaranteed Safe Online Learning via Reachability: tracking a ground target using a quadrotor. In IEEE International Conference on Robotics and Automation, ICRA 2012, 14-18 May 2012, St. Paul, Minnesota, USA, 2723–2730.
[19] Sumit Gulwani, Saurabh Srivastava, and Ramarathnam Venkatesan. 2008. Program Analysis As Constraint Solving. In Proceedings of the 29th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '08), 281–292.
[20] Sumit Gulwani and Ashish Tiwari. 2008. Constraint-based approach for analysis of hybrid systems. In International Conference on Computer Aided Verification, 190–203.
[21] Xiaowei Huang, Marta Kwiatkowska, Sen Wang, and Min Wu. 2017. Safety Verification of Deep Neural Networks. In CAV (1) (Lecture Notes in Computer Science), Vol. 10426, 3–29.
[22] Dominic W. Jordan and Peter Smith. 1987. Nonlinear ordinary differential equations. (1987).
[23] Guy Katz, Clark W. Barrett, David L. Dill, Kyle Julian, and Mykel J. Kochenderfer. 2017. Reluplex: An Efficient SMT Solver for Verifying Deep Neural Networks. In Computer Aided Verification - 29th International Conference, CAV 2017, 97–117.
[24] Hui Kong, Fei He, Xiaoyu Song, William N. N. Hung, and Ming Gu. 2013. Exponential-Condition-Based Barrier Certificate Generation for Safety Verification of Hybrid Systems. In Proceedings of the 25th International Conference on Computer Aided Verification (CAV'13), 242–257.
[25] Bettina Könighofer, Mohammed Alshiekh, Roderick Bloem, Laura Humphrey, Robert Könighofer, Ufuk Topcu, and Chao Wang. 2017. Shield Synthesis. Form. Methods Syst. Des. 51, 2 (Nov. 2017), 332–361.
[26] Benoît Legat, Chris Coey, Robin Deits, Joey Huchette, and Amelia Perry. 2017 (accessed Nov. 16, 2018). Sum-of-squares optimization in Julia. https://www.juliaopt.org/meetings/mit2017/legat.pdf
[27] Yuxi Li. 2017. Deep Reinforcement Learning: An Overview. CoRR abs/1701.07274 (2017).
[28] Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. 2016. Continuous control with deep reinforcement learning. In 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings.
[29] Horia Mania, Aurelia Guy, and Benjamin Recht. 2018. Simple random search of static linear policies is competitive for reinforcement learning. In Advances in Neural Information Processing Systems 31, 1800–1809.
[30] J. Matyas. 1965. Random optimization. Automation and Remote Control 26, 2 (1965), 246–253.
[31] Matthew Mirman, Timon Gehr, and Martin T. Vechev. 2018. Differentiable Abstract Interpretation for Provably Robust Neural Networks. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, 3575–3583.
[32] Teodor Mihai Moldovan and Pieter Abbeel. 2012. Safe Exploration in Markov Decision Processes. In Proceedings of the 29th International Conference on Machine Learning (ICML'12), 1451–1458.
[33] Yurii Nesterov and Vladimir Spokoiny. 2017. Random Gradient-Free Minimization of Convex Functions. Found. Comput. Math. 17, 2 (2017), 527–566.
[34] Stephen Prajna and Ali Jadbabaie. 2004. Safety Verification of Hybrid Systems Using Barrier Certificates. In Hybrid Systems: Computation and Control, 7th International Workshop, HSCC 2004, 477–492.
[35] Aravind Rajeswaran, Kendall Lowrey, Emanuel Todorov, and Sham M. Kakade. 2017. Towards Generalization and Simplicity in Continuous Control. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 6553–6564.
[36] Stéphane Ross, Geoffrey J. Gordon, and Drew Bagnell. 2011. A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2011, 627–635.
[37] Wenjie Ruan, Xiaowei Huang, and Marta Kwiatkowska. 2018. Reachability Analysis of Deep Neural Networks with Provable Guarantees. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI 2018, 2651–2659.
[38] Stefan Schaal. 1999. Is imitation learning the route to humanoid robots? Trends in Cognitive Sciences 3, 6 (1999), 233–242.
[39] Bastian Schürmann and Matthias Althoff. 2017. Optimal control of sets of solutions to formally guarantee constraints of disturbed linear systems. In 2017 American Control Conference, ACC 2017, 2522–2529.
[40] David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Riedmiller. 2014. Deterministic Policy Gradient Algorithms. In Proceedings of the 31st International Conference on Machine Learning - Volume 32, I-387–I-395.
[41] Armando Solar-Lezama. 2008. Program Synthesis by Sketching. Ph.D. Dissertation. Berkeley, CA, USA. Advisor(s): Bodik, Rastislav.
[42] Armando Solar-Lezama. 2009. The Sketching Approach to Program Synthesis. In Proceedings of the 7th Asian Symposium on Programming Languages and Systems (APLAS '09), 4–13.
[43] Youcheng Sun, Min Wu, Wenjie Ruan, Xiaowei Huang, Marta Kwiatkowska, and Daniel Kroening. 2018. Concolic Testing for Deep Neural Networks. In Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering (ASE 2018), 109–119.
[44] Russ Tedrake. 2009. LQR-trees: Feedback motion planning on sparse randomized trees. In Robotics: Science and Systems V, University of Washington, Seattle, USA, June 28 - July 1, 2009.
[45] Yuchi Tian, Kexin Pei, Suman Jana, and Baishakhi Ray. 2018. DeepTest: Automated Testing of Deep-neural-network-driven Autonomous Cars. In Proceedings of the 40th International Conference on Software Engineering (ICSE '18), 303–314.
[46] Matteo Turchetta, Felix Berkenkamp, and Andreas Krause. 2016. Safe Exploration in Finite Markov Decision Processes with Gaussian Processes. In Proceedings of the 30th International Conference on Neural Information Processing Systems (NIPS'16), 4312–4320.
[47] Abhinav Verma, Vijayaraghavan Murali, Rishabh Singh, Pushmeet Kohli, and Swarat Chaudhuri. 2018. Programmatically Interpretable Reinforcement Learning. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, 5052–5061.
[48] Matthew Wicker, Xiaowei Huang, and Marta Kwiatkowska. 2018. Feature-Guided Black-Box Safety Testing of Deep Neural Networks. In Tools and Algorithms for the Construction and Analysis of Systems - 24th International Conference, TACAS 2018, 408–426.


[49] He Zhu, Zikang Xiong, Stephen Magill, and Suresh Jagannathan. 2018. An Inductive Synthesis Framework for Verifiable Reinforcement Learning. Technical Report. Galois, Inc. http://herowanzhu.github.io/herowanzhu.github.io/VerifiableLearning.pdf



thus potentially unsafe network with no shield support. This paper makes the following contributions:

1. We present a verification toolchain for ensuring that the control policies learned by an RL-based neural network are safe. Our notion of safety is defined in terms of a specification of an infinite state transition system that captures, for example, the system dynamics of a cyber-physical controller.

2. We develop a counterexample-guided inductive synthesis framework that treats the neural control policy as an oracle to guide the search for a simpler deterministic program that approximates the behavior of the network but which is more amenable to verification. The synthesis procedure differs from prior efforts [9, 47] because the search procedure is bounded by safety constraints defined by the specification (i.e., state transition system) as well as a characterization of specific environment conditions defining the application's deployment context.

3. We use a verification procedure that guarantees actions proposed by the synthesized program always lead to a state consistent with an inductive invariant of the original specification and deployed environment context. This invariant defines an inductive property that separates all reachable (safe) and unreachable (unsafe) states expressible in the transition system.

4. We develop a runtime monitoring framework that treats the synthesized program as a safety shield [4], overriding proposed actions of the network whenever such actions can cause the system to enter into an unsafe region.

We present a detailed experimental study over a wide range of cyber-physical control systems that justifies the utility of our approach. These results indicate that the cost of ensuring verification is low, typically on the order of a few percent. The remainder of the paper is structured as follows. In the next section, we present a detailed overview of our approach. Sec. 3 formalizes the problem and the context. Details about the synthesis and verification procedure are given in Sec. 4. A detailed evaluation study is provided in Sec. 5. Related work and conclusions are given in Secs. 6 and 7, respectively.

2 Motivation and Overview

To motivate the problem and to provide an overview of our approach, consider the definition of a learning-enabled controller that operates an inverted pendulum. While the specification of this system is simple, it is nonetheless representative of a number of practical control systems, such as Segway transporters and autonomous drones, that have thus far proven difficult to verify, but for which high assurance is very desirable.

Figure 1. Inverted Pendulum State Transition System. The pendulum has mass m and length l. A system state is s = [η, ω]ᵀ, where η is its angle and ω is its angular velocity. A continuous control action a maintains the pendulum upright.

2.1 State Transition System

We model an inverted pendulum system as an infinite state transition system with continuous actions in Fig. 1. Here the pendulum has mass m and length l. A system state is s = [η, ω]ᵀ, where η is the pendulum's angle and ω is its angular velocity. A controller can use a 1-dimensional continuous control action a to maintain the pendulum upright.

Since modern controllers are typically implemented digitally (using digital-to-analog converters for interfacing between the analog system and a digital controller), we assume that the pendulum is controlled in discrete time instants kt, where k = 0, 1, 2, ..., i.e., the controller uses the system dynamics, the rate of change of s denoted as ṡ, to transition every t time period, with the conjecture that the control action a is a constant during each discrete time interval. Using Euler's method, for example, a transition from state sₖ = s(kt) at time kt to time kt + t is approximated as s(kt + t) = s(kt) + ṡ(kt) × t. We specify the rate of change ṡ using the differential equation shown in Fig. 1.¹ Intuitively, the control action a is allowed to affect the rate of change of η and ω to balance the pendulum. Thus, small values of a result in small swing and velocity changes of the pendulum, actions that are useful when the pendulum is upright (or nearly so), while large values of a contribute to larger changes in swing and velocity, actions that are necessary when the pendulum enters a state where it may risk losing balance. In case the system dynamics are unknown, we can use known algorithms to infer dynamics from online experiments [1].

Assume the state transition system of the inverted pendulum starts from a set of initial states S0:

S0 := {(η, ω) | −20° ≤ η ≤ 20° ∧ −20° ≤ ω ≤ 20°}

The global safety property we wish to preserve is that the pendulum never falls down. We define a set of unsafe states Su of the transition system (colored in yellow in Fig. 1):

Su := {(η, ω) | ¬(−90° < η < 90° ∧ −90° < ω < 90°)}
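To make the running example concrete, the following minimal sketch discretizes a small-angle approximation of the pendulum dynamics with Euler's method and encodes the initial and unsafe sets above; the mass, length, and gravity constants are hypothetical, and the true differential equation of Fig. 1 is not reproduced here.

    import math

    M, L, G, DT = 1.0, 1.0, 9.8, 0.01   # hypothetical mass, length, gravity; 0.01s time step

    def step(s, a):
        # One Euler step of the Taylor-approximated dynamics near the upright
        # position: eta' = omega, omega' ~ (g/l)*eta + a/(m*l^2).
        eta, omega = s
        d_eta = omega
        d_omega = (G / L) * eta + a / (M * L * L)
        return (eta + DT * d_eta, omega + DT * d_omega)

    def is_initial(s):
        eta, omega = s
        b = math.radians(20)
        return -b <= eta <= b and -b <= omega <= b

    def is_unsafe(s):
        eta, omega = s
        b = math.radians(90)
        return not (-b < eta < b and -b < omega < b)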

¹We derive the control dynamics equations assuming that an inverted pendulum satisfies general Lagrangian mechanics, and approximate non-polynomial expressions with their Taylor expansions.


Figure 2. The Framework of Our Approach

We assume the existence of a neural network control policy πw : R² → R that executes actions over the pendulum, whose weight values w are learned from training episodes. This policy is a state-dependent function, mapping a 2-dimensional state s (η and ω) to a control action a. At each transition, the policy mitigates uncertainty through feedback over the state variables in s.

Environment Context. An environment context C[·] defines the behavior of the application, where [·] is left open to deploy a reasonable neural controller πw. The actions of the controller are dictated by constraints imposed by the environment. In its simplest form, the environment is simply a state transition system. In the pendulum example, this would be the equations given in Fig. 1, parameterized by pendulum mass and length. In general, however, the environment may include additional constraints (e.g., a constraining bounding box that restricts the motion of the pendulum beyond the specification given by the transition system in Fig. 1).

2.2 Synthesis, Verification, and Shielding

In this paper, we assume that a neural network is trained using a state-of-the-art reinforcement learning strategy [28, 40]. Even though the resulting neural policy may appear to work well in practice, the complexity of its implementation makes it difficult to assert any strong and provable claims about its correctness, since the neurons, layers, weights, and biases are far-removed from the intent of the actual controller. We found that state-of-the-art neural network verifiers [17, 23] are ineffective for verification of a neural controller over an infinite time horizon with complex system dynamics.

Framework. We construct a policy interpretation mechanism to enable verification, inspired by prior work on imitation learning [36, 38] and interpretable machine learning [9, 47]. Fig. 2 depicts the high-level framework of our approach. Our idea is to synthesize a deterministic policy program from a neural policy πw, approximating πw (which we call an oracle) with a simpler structural program P. Like πw, P takes as input a system state and generates a control action a. To this end, P is simulated in the environment used to train the neural policy πw to collect feasible states. Guided by πw's actions on such collected states, P is further improved to resemble πw.

The goal of the synthesis procedure is to search for a deterministic program P* satisfying both (1) a quantitative specification such that it bears reasonably close resemblance to its oracle, so that allowing it to serve as a potential substitute is a sensible notion, and (2) a desired logical safety property such that, when in operation, the unsafe states defined in the environment C cannot be reached. Formally,

P* = argmax_{P ∈ Safe(C, ⟦H⟧)} d(πw, P, C)    (1)

where d(πw, P, C) measures proximity of P with its neural oracle in an environment C, ⟦H⟧ defines a search space for P with prior knowledge on the shape of target deterministic programs, and Safe(C, ⟦H⟧) restricts the solution space to a set of safe programs. A program P is safe if the safety of the transition system C[P], the deployment of P in the environment C, can be formally verified.

The novelty of our approach against prior work on neural policy interpretation [9, 47] is thus two-fold:

1. We bake in the concept of safety and formal safety verification into the synthesis of a deterministic program from a neural policy, as depicted in Fig. 2. If a candidate program is not safe, we rely on a counterexample-guided inductive synthesis loop to improve our synthesis outcome to enforce the safety conditions imposed by the environment.

2. We allow P to operate in tandem with the high-performing neural policy. P can be viewed as capturing an inductive invariant of the state transition system, which can be used as a shield to describe a boundary of safe states within which the neural policy is free to make optimal control decisions. If the system is about to be driven out of this safety boundary, the synthesized program is used to take an action that is guaranteed to stay within the space subsumed by the invariant. By allowing the synthesis procedure to treat the neural policy as an oracle, we constrain the search space of feasible programs to be those whose actions reside within a proximate neighborhood of actions undertaken by the neural policy.

Synthesis. Reducing a complex neural policy to a simpler yet safe deterministic program is possible because we do not require other properties from the oracle; specifically, we do not require that the deterministic program precisely mirrors the performance of the neural policy. For example, experiments described in [35] show that while a linear-policy controlled robot can effectively stand up, it is unable to learn an efficient walking gait, unlike a sufficiently-trained neural policy. However, if we just care about the safety of the neural network, we posit that a linear reduction can be sufficiently expressive to describe necessary safety constraints. Based on this hypothesis, for our inverted pendulum example we can explore a linear program space from which a deterministic program Pθ can be drawn, expressed in terms of the following program sketch:


def P[θ1, θ2](η, ω): return θ1η + θ2ω

Here θ = [θ1, θ2] are unknown parameters that need to be synthesized. Intuitively, the program weights the importance of η and ω at a particular state to provide a feedback control action to mitigate the deviation of the inverted pendulum from (η = 0, ω = 0).

Our search-based synthesis sets θ to 0 initially. It runs the deterministic program Pθ, instead of the oracle neural policy πw, within the environment C defined in Fig. 1 (in this case, the state transition system represents the differential equation specification of the controller) to collect a batch of trajectories. A run of the state transition system of C[Pθ] produces a finite trajectory s0, s1, ..., sT. We find θ from the following optimization task that realizes (1):

max_{θ∈R²} E[Σ_{t=0}^{T} d(πw, Pθ, st)]    (2)

where d(πw, Pθ, st) ≡ −(Pθ(st) − πw(st))² if st ∉ Su, and −MAX if st ∈ Su.

This equation aims to search for a program Pθ at minimal distance from the neural oracle πw along sampled trajectories, while simultaneously maximizing the likelihood that Pθ is safe.

Our synthesis procedure, described in Sec. 4.1, is a random search-based optimization algorithm [30]. We sample a new position of θ iteratively from a hypersphere of a given small radius surrounding the current position of θ and move to the new position (w.r.t. a learning rate) as dictated by Equation (2). For the running example, our search synthesizes:

def P(η, ω): return −1205η − 587ω

The synthesized program can be used to intuitively interpret how the neural oracle works. For example, if a pendulum with a positive angle η > 0 leans towards the right (ω > 0), the controller will need to generate a large negative control action to force the pendulum to move left.

Verification. Since our method synthesizes a deterministic program P, we can leverage off-the-shelf formal verification algorithms to verify its safety with respect to the state transition system C defined in Fig. 1. To ensure that P is safe, we must ascertain that it can never transition to an unsafe state, i.e., a state that causes the pendulum to fall. When framed as a formal verification problem, answering such a question is tantamount to discovering an inductive invariant φ that represents all safe states over the state transition system:

1. Safe: φ is disjoint with all unsafe states Su.
2. Init: φ includes all initial states S0.
3. Induction: Any state in φ transitions to another state in φ and hence cannot reach an unsafe state.

Inspired by template-based constraint solving approaches on inductive invariant inference [19, 20, 24, 34], the verification algorithm described in Sec. 4.2 uses a constraint solver to look for an inductive invariant in the form of a convex barrier certificate [34], E(s) ≤ 0, that maps all the states in the (safe) reachable set to non-positive reals and all the states in the unsafe set to positive reals. The basic idea is to identify a polynomial function E : Rⁿ → R such that 1) E(s) > 0 for any state s ∈ Su; 2) E(s) ≤ 0 for any state s ∈ S0; and 3) E(s′) − E(s) ≤ 0 for any state s that transitions to s′ in the state transition system C[P]. The second and third conditions collectively guarantee that E(s) ≤ 0 for any state s in the reachable set, thus implying that an unsafe state in Su can never be reached.
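Before invoking the constraint solver, a candidate certificate can be cheaply falsified on sampled states; the sketch below simply tests the three conditions numerically (all helper names are illustrative, and passing this check is of course not a proof).

    def check_barrier_on_samples(E, step, P, init_samples, unsafe_samples, reach_samples):
        # Numerically test the three barrier-certificate conditions on finite samples:
        # (1) E > 0 on unsafe states, (2) E <= 0 on initial states,
        # (3) E does not increase along one transition of C[P].
        ok_unsafe = all(E(s) > 0 for s in unsafe_samples)
        ok_init = all(E(s) <= 0 for s in init_samples)
        ok_induct = all(E(step(s, P(s))) - E(s) <= 0 for s in reach_samples)
        return ok_unsafe and ok_init and ok_induct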

Fig. 3(a) draws the discovered invariant in blue for C[P], given the initial and unsafe states, where P is the synthesized program for the inverted pendulum system. We can conclude that the safety property is satisfied by the P-controlled system, as all reachable states do not overlap with unsafe states. In case verification fails, our approach conducts a counterexample-guided loop (Sec. 4.2) to iteratively synthesize safe deterministic programs until verification succeeds.

Shielding. Keep in mind that a safety proof of a reduced deterministic program of a neural network does not automatically lift to a safety argument of the neural network from which it was derived, since the network may exhibit behaviors not fully captured by the simpler deterministic program. To bridge this divide, we propose to recover soundness at runtime by monitoring system behaviors of a neural network in its actual environment (deployment) context. Fig. 4 depicts our runtime shielding approach, with more details given in Sec. 4.3. The inductive invariant φ learnt for a deterministic program P of a neural network πw can serve as a shield to protect πw at runtime under the environment context and safety property used to synthesize P. An observation about a current state is sent to both πw and the shield. A high-performing neural network is allowed to take any actions it feels are optimal as long as the next state it proposes is still within the safety boundary formally characterized by the inductive invariant of C[P]. Whenever a neural policy proposes an action that steers its controlled system out of the state space defined by the inductive invariant we have learned as part of deterministic program synthesis, our shield can instead take a safe action proposed by the deterministic program P. The action given by P is guaranteed safe because φ defines an inductive invariant of C[P]; taking the action allows the system to stay within the provably safe region identified by φ. Our shielding mechanism is sound due to formal verification. Because the deterministic program was synthesized using πw as an oracle, it is expected that the shield derived from the program will not frequently interrupt the neural network's decisions, allowing the combined system to perform (close-to) optimally.

In the inverted pendulum example, since the 90° bound given as a safety constraint is rather conservative, we do not expect a well-trained neural network to violate this boundary. Indeed, in Fig. 3(a), even though the inductive invariant of the synthesized program defines a substantially smaller state space than what is permissible,


Figure 3. Invariant Inference on Inverted Pendulum

Figure 4. The Framework of Neural Network Shielding

in our simulation results we find that the neural policy is never interrupted by the deterministic program when governing the pendulum. Note that the non-shaded areas in Fig. 3(a), while appearing safe, presumably define states from which the trajectory of the system can be led to an unsafe state and would thus not be inductive.
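Operationally, the shield is a thin wrapper around the network's action proposal; a minimal sketch is shown below, assuming a transition function step, the synthesized program P, and the certificate E (all hypothetical names) are available at runtime.

    def shielded_action(s, neural_policy, P, E, step):
        # Accept the network's proposal only if the successor state stays inside
        # the invariant E <= 0; otherwise fall back to the verified program P.
        a = neural_policy(s)
        if E(step(s, a)) <= 0:
            return a
        return P(s)   # provably safe action from the deterministic program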

Our synthesis approach is critical to ensuring safety when the neural policy is used to predict actions in an environment different from the one used during training. Consider a neural network suitably protected by a shield that now operates safely. The effectiveness of this shield would be greatly diminished if the network had to be completely retrained from scratch whenever it was deployed in a new environment which imposes different safety constraints.

In our running example, suppose we wish to operate the inverted pendulum in an environment such as a Segway transporter, in which the model is prohibited from swinging significantly and whose angular velocity must be suitably restricted. We might specify the following new constraints on state parameters to enforce these conditions:

Su := {(η, ω) | ¬(−30° < η < 30° ∧ −30° < ω < 30°)}

Because of the dependence of a neural network on the quality of training data used to build it, environment changes that deviate from assumptions made at training time could result in a costly retraining exercise, because the network must learn a new way to penalize unsafe actions that were previously safe. However, training a neural network from scratch requires substantial non-trivial effort, involving fine-grained tuning of training parameters or even network structures.

In our framework, the existing network provides a reasonable approximation of the desired behavior. To incorporate the additional constraints defined by the new environment C′, we attempt to synthesize a new deterministic program P′ for C′, a task that, based on our experience, is substantially easier to achieve than training a new neural network policy from scratch. This new program can be used to protect the original network provided that we can use the aforementioned verification approach to formally verify that C′[P′] is safe by identifying a new inductive invariant φ′. As depicted in Fig. 4, we simply build a new shield S′ that is composed of the program P′ and the safety boundary captured by φ′. The shield S′ can ensure the safety of the neural network in the environment context C′ with a strengthened safety condition, despite the fact that the neural network was trained in a different environment context C.

Fig. 3(b) depicts the new unsafe states in C′ (colored in red). It extends the unsafe range drawn in Fig. 3(a), so the inductive invariant learned there is unsafe for the new environment. Our approach synthesizes a new deterministic program for which we learn a more restricted inductive invariant, depicted in green in Fig. 3(b). To characterize the effectiveness of our shielding approach, we examined 1000 simulation trajectories of this restricted version of the inverted pendulum system, each of which is comprised of 5000 simulation steps, with the safety constraints defined by C′. Without the shield S′, the pendulum entered the unsafe region Su 41 times. All of these violations were prevented by S′. Notably, the intervention rate of S′ to interrupt the neural network's decisions was extremely low: out of a total of 5000 × 1000 decisions, we only observed 414 instances (0.00828%) where the shield interfered with (i.e., overrode) the decision made by the network.

3 Problem Setup

We model the context C of a control policy as an environment state transition system C[·] = (X, A, S, S0, Su, Tt[·], f, r). Note that [·] is intentionally left open to deploy neural control policies. Here, X is a finite set of variables interpreted over the reals R, and S = R^X is the set of all valuations of the variables X. We denote s ∈ S to be an n-dimensional environment state and a ∈ A to be a control action, where A is an infinite set of m-dimensional continuous actions that a learning-enabled controller can perform. We use S0 ⊆ S to specify a set of initial environment states and Su ⊆ S to specify a set of unsafe environment states that a safe controller should avoid. The transition relation Tt[·] defines how one state transitions to another given an action by an unknown policy. We assume that Tt[·] is governed by a standard differential equation f defining the relationship between a continuously varying state s(t) and action a(t) and its rate of change ṡ(t) over time t:

ṡ(t) = f(s(t), a(t))

We assume f is defined by equations of the form Rⁿ × Rᵐ → Rⁿ, such as the example in Fig. 1. In the following, we often omit the time variable t for simplicity.


A deterministic neural network control policy πw : Rⁿ → Rᵐ, parameterized by a set of weight values w, is a function mapping an environment state s to a control action a, where

a = πw(s)

By providing a feedback action, a policy can alter the rate of change of state variables to realize optimal system control.

The transition relation Tt[·] is parameterized by a control policy π that is deployed in C. We explicitly model this deployment as Tt[π]. Given a control policy πw, we use Tt[πw] ⊆ S × S to specify all possible state transitions allowed by the policy. We assume that a system transitions in discrete time instants kt, where k = 0, 1, 2, ... and t is a fixed time step (t > 0). A state s transitions to a next state s′ after time t, with the assumption that the control action a(τ) at time τ is a constant during the time period τ ∈ [0, t). Using Euler's method,² we discretize the continuous dynamics f with a finite difference approximation so it can be used in the discretized transition relation Tt[πw]. Then Tt[πw] can compute these estimates by the following difference equation:

Tt[πw] = {(s, s′) | s′ = s + f(s, πw(s)) × t}

Environment Disturbance. Our model allows bounded external properties (e.g., additional environment-imposed constraints) by extending the definition of the rate of change to ṡ = f(s, a) + d, where d is a vector of random disturbances. We use d to encode environment disturbances in terms of bounded nondeterministic values. We assume that tight upper and lower bounds of d can be accurately estimated at runtime using multivariate normal distribution fitting methods.

Trajectory. A trajectory h of a state transition system C[πw], which we denote as h ∈ C[πw], is a sequence of states s0, ..., si, si+1, ..., where s0 ∈ S0 and (si, si+1) ∈ Tt[πw] for all i ≥ 0. We use 𝒞 ⊆ C[πw] to denote a set of trajectories. The reward that a control policy receives on performing an action a in a state s is given by the reward function r(s, a).

Reinforcement Learning. The goal of reinforcement learning is to maximize the reward that can be collected by a neural control policy in a given environment context C. Such problems can be abstractly formulated as

max_{w∈Rⁿ} J(w),  where J(w) = E[r(πw)]    (3)

Assume that s0, s1, ..., sT is a trajectory of length T of the state transition system, and r(πw) = lim_{T→∞} Σ_{k=0}^{T} r(sk, πw(sk)) is the cumulative reward achieved by the policy πw from this trajectory. Thus, this formulation uses simulations of the transition system with finite-length rollouts to estimate the expected cumulative reward collected over T time steps and aims to maximize this reward.

²Euler's method may sometimes poorly approximate the true system transition function when f is highly nonlinear. More precise higher-order approaches such as Runge-Kutta methods exist to compensate for loss of precision in this case.

Reinforcement learning assumes policies are expressible in some executable structure (e.g., a neural network) and allows samples generated from one policy to influence the estimates made for others. The two main approaches for reinforcement learning are value function estimation and direct policy search. We refer readers to [27] for an overview.

Safety Verification. Since reinforcement learning only considers finite-length rollouts, we wish to determine if a control policy is safe to use under an infinite time horizon. Given a state transition system, the safety verification problem is concerned with verifying that no trajectories contained in S starting from an initial state in S0 reach an unsafe state in Su. However, a neural network is a representative of a class of deep and sophisticated models that challenge the capability of state-of-the-art verification techniques. This level of complexity is exacerbated in our work because we consider the long-term safety of a neural policy deployed within a nontrivial environment context C that in turn is described by a complex infinite-state transition system.

4 Verification Procedure

To verify a neural network control policy πw with respect to an environment context C, we first synthesize a deterministic policy program P from the neural policy. We require that P both (a) broadly resembles its neural oracle and (b) additionally satisfies a desired safety property when it is deployed in C. We conjecture that a safety proof of C[P] is easier to construct than that of C[πw]. More importantly, we leverage the safety proof of C[P] to ensure the safety of C[πw].

4.1 Synthesis

Fig. 5 defines a search space for a deterministic policy program P, where E and φ represent the basic syntax of (polynomial) program expressions and inductive invariants, respectively. Here v ranges over a universe of numerical constants, x represents variables, and ⊕ is a basic operator including + and ×. A deterministic program P also features conditional statements using φ as branching predicates.

We allow the user to define a sketch [41, 42] to describe the shape of target policy programs using the grammar in Fig. 5. We use P[θ] to represent a sketch, where θ represents unknowns that need to be filled in by the synthesis procedure. We use Pθ to represent a synthesized program with known values of θ. Similarly, the user can define a sketch of an inductive invariant φ[·] that is used (in Sec. 4.2) to learn a safety proof to verify a synthesized program Pθ.

We do not require the user to explicitly define conditional statements in a program sketch. Our synthesis algorithm uses verification counterexamples to lazily add branch predicates φ under which a program performs different computations depending on whether φ evaluates to true or false. The end user simply writes a sketch over basic expressions. For example, a sketch that defines a family of linear functions over


E = v | x | ⊕(E1, . . . , Ek)
φ = E ≤ 0
P = return E | if φ then return E else P

Figure 5 Syntax of the Policy Programming Language

Algorithm 1 Synthesize(πw, P[θ], C[·])
1: θ ← 0
2: do
3:   Sample δ from a zero-mean Gaussian vector
4:   Sample a set of trajectories Ĉ1 using C[Pθ+νδ]
5:   Sample a set of trajectories Ĉ2 using C[Pθ−νδ]
6:   θ ← θ + α[(d(πw, Pθ+νδ, Ĉ1) − d(πw, Pθ−νδ, Ĉ2)) / ν]δ
7: while θ is not converged
8: return Pθ

a collection of variables can be expressed as

P[θ](X) = return θ1x1 + θ2x2 + · · · + θnxn + θn+1    (4)

Here X = (x1, x2, · · · ) are system variables, in which the coefficient constants θ = (θ1, θ2, · · · ) are unknown. The goal of our synthesis algorithm is to find values of θ that maximize the likelihood that Pθ resembles the neural oracle πw while still being a safe program with respect to the environment context C:

θ* = argmax_{θ∈Rⁿ⁺¹} d(πw, Pθ, C)    (5)

where d measures the distance between the outputs of the estimated program Pθ and the neural policy πw, subject to safety constraints. To avoid the computational complexity that arises if we consider a solution to this equation analytically as an optimization problem, we instead approximate d(πw, Pθ, C) by randomly sampling a set of trajectories Ĉ that are encountered by Pθ in the environment state transition system C[Pθ]:

d(πw, Pθ, C) ≈ d(πw, Pθ, Ĉ)   s.t.   Ĉ ⊆ C[Pθ]

We estimate θ* using these sampled trajectories Ĉ and define

d(πw, Pθ, Ĉ) = Σ_{h∈Ĉ} d(πw, Pθ, h)

Since each trajectory h ∈ Ĉ is a finite rollout s0, . . . , sT of length T, we have

d(πw, Pθ, h) = Σ_{t=0}^{T}  { −‖Pθ(st) − πw(st)‖   if st ∉ Su
                              { −MAX                 if st ∈ Su

where ‖·‖ is a suitable norm. As illustrated in Sec. 2.2, we aim to minimize the distance between a synthesized program Pθ from a sketch space and its neural oracle along sampled trajectories encountered by Pθ in the environment context, but put a large penalty on states that are unsafe.
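The sampled distance can be computed directly from the two definitions above; the short sketch below (an illustration, with the unsafe-state predicate is_unsafe and the penalty magnitude MAX as assumed inputs) sums the negated action gap along each sampled trajectory and substitutes the −MAX penalty on unsafe states.

import numpy as np

MAX = 1e4   # magnitude of the penalty placed on unsafe states (an assumption)

def traj_distance(pi_w, P_theta, trajectory, is_unsafe):
    """d(pi_w, P_theta, h): negated action gap, with a large penalty on unsafe states."""
    total = 0.0
    for s in trajectory:
        if is_unsafe(s):
            total += -MAX
        else:
            total += -np.linalg.norm(P_theta(s) - pi_w(s))
    return total

def distance(pi_w, P_theta, trajectories, is_unsafe):
    """d(pi_w, P_theta, C_hat): sum over all sampled trajectories h in C_hat."""
    return sum(traj_distance(pi_w, P_theta, h, is_unsafe) for h in trajectories)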

Random Search. We implement the idea encoded in equation (5) in Algorithm 1, which depicts the pseudocode of our policy interpretation approach. It takes as inputs a neural policy, a policy program sketch parameterized by θ, and an environment state transition system, and outputs a synthesized policy program Pθ.

An efficient way to solve equation (5) is to directly perturb θ in the search space by adding random noise and then update θ based on the effect of this perturbation [30]. We choose a direction uniformly at random on the sphere in parameter space and then optimize the goal function along this direction. To this end, in line 3 of Algorithm 1 we sample Gaussian noise δ to be added to the policy parameters θ in both directions, where ν is a small positive real number. In lines 4 and 5 we sample trajectories Ĉ1 and Ĉ2 from the state transition systems obtained by running the perturbed policy programs Pθ+νδ and Pθ−νδ in the environment context C.

We evaluate the proximity of these two policy programs to the neural oracle and, in line 6 of Algorithm 1, improve the policy program by optimizing equation (5), updating θ with a finite difference approximation along the direction

θ ← θ + α [ (d(πw, Pθ+νδ, Ĉ1) − d(πw, Pθ−νδ, Ĉ2)) / ν ] δ    (6)

where α is a predefined learning rate. Such an update increment corresponds to an unbiased estimator of the gradient [29, 33] for equation (5). The algorithm iteratively updates θ until convergence.
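The random-search loop of Algorithm 1 can be written compactly as below (an illustrative sketch, not the actual tool). The linear sketch of equation (4) is represented by a coefficient vector θ, simulate_trajectories stands in for sampling rollouts of C[Pθ], and distance is the sampled distance sketched earlier with the unsafe-state predicate already bound (e.g., via functools.partial); the learning rate α, exploration radius ν, and iteration budget are assumptions.

import numpy as np

def P(theta, x):
    """Linear program sketch of equation (4): theta_1*x_1 + ... + theta_n*x_n + theta_{n+1}."""
    return float(np.dot(theta[:-1], x) + theta[-1])

def synthesize(pi_w, n_vars, simulate_trajectories, distance,
               alpha=0.01, nu=0.05, iters=2000):
    theta = np.zeros(n_vars + 1)
    for _ in range(iters):
        delta = np.random.randn(n_vars + 1)                            # zero-mean Gaussian direction
        p_plus  = lambda x, d=delta: P(theta + nu * d, x)
        p_minus = lambda x, d=delta: P(theta - nu * d, x)
        d1 = distance(pi_w, p_plus,  simulate_trajectories(p_plus))    # d(pi_w, P_{theta+nu*delta}, C1)
        d2 = distance(pi_w, p_minus, simulate_trajectories(p_minus))   # d(pi_w, P_{theta-nu*delta}, C2)
        theta = theta + alpha * ((d1 - d2) / nu) * delta               # update of equation (6)
    return theta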

4.2 Verification
A synthesized policy program P is verified with respect to an environment context given as an infinite-state transition system defined in Sec. 3, C[P] = (X, A, S, S0, Su, Tt[P], f, r). Our verification algorithm learns an inductive invariant φ over the transition relation Tt[P], formally proving that all possible system behaviors are encapsulated in φ, where φ is required to be disjoint from all unsafe states Su.

We follow template-based constraint solving approaches for inductive invariant inference [19, 20, 24, 34] to discover this invariant. The basic idea is to identify a function E: Rⁿ → R that serves as a barrier [20, 24, 34] between reachable system states (evaluated to be nonpositive by E) and unsafe states (evaluated positive by E). Using the invariant syntax in Fig. 5, the user can define an invariant sketch

φ[c](X) = E[c](X) ≤ 0    (7)

over variables X and unknown coefficients c intended to be synthesized. Fig. 5 carefully restricts an invariant sketch E[c] to be a polynomial function, as there exist efficient SMT solvers [13] and constraint solvers [26] for nonlinear polynomial reals. Formally, assume real coefficients c = (c0, · · · , cp) are used to parameterize E[c] in an affine manner:

E[c](X) = Σ_{i=0}^{p} ci·bi(X)


where the bi(X)'s are monomials in the variables X. As a heuristic, the user can simply determine an upper bound on the degree of E[c] and then include in the sketch all monomials whose degrees are no greater than the bound. Large values of the bound enable verification of more complex safety conditions, but impose greater demands on the constraint solver; small values capture coarser safety properties, but are easier to solve.

Example 4.1. Consider the inverted pendulum system in Sec. 2.2. To discover an inductive invariant for the transition system, the user might choose to define an upper bound of 4, which results in the following polynomial invariant sketch φ[c](η, ω) = E[c](η, ω) ≤ 0 where

E[c](η, ω) = c0η⁴ + c1η³ω + c2η²ω² + c3ηω³ + c4ω⁴ + c5η³ + · · · + cp

The sketch includes all monomials over η and ω whose degrees are no greater than 4. The coefficients c = [c0, · · · , cp] are unknown and need to be synthesized.
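To illustrate how such a sketch is assembled mechanically, the snippet below enumerates all monomials up to a chosen degree bound and evaluates E[c] at a state for a given coefficient vector; it only shows the shape of the search space, since the coefficients themselves are the unknowns found by the solver described next.

from itertools import combinations_with_replacement

def monomials(num_vars, max_degree):
    """Exponent tuples of all monomials over num_vars variables with total degree <= max_degree."""
    monos = []
    for d in range(max_degree, -1, -1):
        for combo in combinations_with_replacement(range(num_vars), d):
            monos.append(tuple(combo.count(i) for i in range(num_vars)))
    return monos

def eval_sketch(coeffs, monos, state):
    """E[c](state) = sum_i c_i * prod_j state_j ** e_ij."""
    total = 0.0
    for c, exps in zip(coeffs, monos):
        term = c
        for s_j, e in zip(state, exps):
            term *= s_j ** e
        total += term
    return total

monos = monomials(2, 4)     # degree bound 4 over (eta, omega), as in Example 4.1
print(len(monos))           # 15 monomials, hence 15 unknown coefficients c_0..c_p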

To synthesize these unknowns, we require that E[c] satisfy the following verification conditions:

∀ s ∈ Su.   E[c](s) > 0    (8)
∀ s ∈ S0.   E[c](s) ≤ 0    (9)
∀ (s, s′) ∈ Tt[P].   E[c](s′) − E[c](s) ≤ 0    (10)

We claim that φ = E[c](X) ≤ 0 defines an inductive invariant: verification condition (9) ensures that any initial state s0 ∈ S0 satisfies φ since E[c](s0) ≤ 0; verification condition (10) asserts that along the transition from a state s ∈ φ (so E[c](s) ≤ 0) to a resulting state s′, E[c] cannot become positive, so s′ satisfies φ as well; and, according to verification condition (8), φ does not include any unsafe state su ∈ Su, as E[c](su) is positive.
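For intuition, the sketch below asks an SMT solver for counterexamples to conditions (8)-(10) for one fixed candidate E[c]; this is only an illustration of what the conditions mean, since the actual tool synthesizes the coefficients with a convex sum-of-squares solver rather than checking candidates one at a time. The dynamics, policy, state sets, barrier coefficients, and time step used here are hypothetical placeholders.

from z3 import Reals, Solver, And, Not, sat

x, y = Reals('x y')
dt = 0.01

def P(x, y):    return -1.2 * x - 0.6 * y          # a candidate linear policy (assumed)
def f(x, y, a): return (y, -x + a)                 # hypothetical dynamics: x' = y, y' = -x + a
def E(x, y):    return 2*x*x + x*y + 2*y*y - 1     # candidate barrier E[c] (assumed coefficients)

S0     = And(x >= -0.1, x <= 0.1, y >= -0.1, y <= 0.1)     # initial states S0 (assumed)
unsafe = Not(And(x >= -1, x <= 1, y >= -1, y <= 1))        # unsafe states Su (assumed)

fx, fy = f(x, y, P(x, y))
x1, y1 = x + fx * dt, y + fy * dt                  # Euler step: (s, s') in Tt[P]

def counterexample(violation):
    """Return a model violating a condition, or None if the condition holds."""
    s = Solver()
    s.add(violation)
    return s.model() if s.check() == sat else None

print(counterexample(And(unsafe, E(x, y) <= 0)))    # condition (8): unsafe states must have E > 0
print(counterexample(And(S0, E(x, y) > 0)))         # condition (9): initial states must have E <= 0
print(counterexample(E(x1, y1) - E(x, y) > 0))      # condition (10): E must not increase along transitions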

Verification conditions (8), (9), (10) are polynomials over reals. Synthesizing the unknown coefficients can be left to an SMT solver [13] after universal quantifiers are eliminated using a variant of Farkas' Lemma, as in [20]. However, observe that E[c] is convex³. We can gain efficiency by finding the unknown coefficients c using off-the-shelf convex constraint solvers, following [34]. Encoding verification conditions (8), (9), (10) as polynomial inequalities, we search for c that can prove nonnegativity of these constraints via an efficient and convex sum-of-squares programming solver [26]. Additional technical details are provided in the supplementary material [49].

Counterexample-guided Inductive Synthesis (CEGIS). Given the environment state transition system C[P] deployed with a synthesized program P, the verification approach above can compute an inductive invariant over the state transition system or tell that there is no feasible solution in the given set of candidates. Note, however, that the latter does not necessarily imply that the system is unsafe.

³For arbitrary E1(x) and E2(x) satisfying the verification conditions and α ∈ [0, 1], E(x) = αE1(x) + (1 − α)E2(x) satisfies the conditions as well.

Algorithm 2 CEGIS(πw, P[θ], C[·])
1: policies ← ∅
2: covers ← ∅
3: while C.S0 ⊈ covers do
4:   search s0 such that s0 ∈ C.S0 and s0 ∉ covers
5:   r* ← Diameter(C.S0)
6:   do
7:     φbound ← {s | s ∈ [s0 − r*, s0 + r*]}
8:     C ← C where C.S0 = (C.S0 ∩ φbound)
9:     θ ← Synthesize(πw, P[θ], C[·])
10:    φ ← Verify(C[Pθ])
11:    if φ is False then
12:      r* ← r*/2
13:    else
14:      covers ← covers ∪ {s | φ(s)}
15:      policies ← policies ∪ {(Pθ, φ)}
16:      break
17:  while True
18: end while
19: return policies

Since our goal is to learn a safe deterministic program from a neural network, we develop a counterexample-guided inductive program synthesis approach. A CEGIS algorithm in our context is challenging because safety verification is necessarily incomplete and may not be able to produce a counterexample that serves as an explanation for why a verification attempt is unsuccessful.

We solve the incompleteness challenge by leveraging the fact that we can simultaneously synthesize and verify a program. Our CEGIS approach is given in Algorithm 2. A counterexample is an initial state on which our synthesized program is not yet proved safe. Driven by these counterexamples, our algorithm synthesizes a set of programs from a sketch, along with the conditions under which we can switch from one synthesized program to another.

Algorithm 2 takes as input a neural policy πw, a program sketch P[θ], and an environment context C. It maintains synthesized policy programs in policies (line 1), each of which is inductively verified safe in a partition of the universe state space that is maintained in covers (line 2). For soundness, the state space covered by such partitions must include all initial states, checked in line 3 of Algorithm 2 by an SMT solver. In the algorithm, we use C.S0 to access a field of C, such as its initial state space.

The algorithm iteratively samples a counterexample initial state s0 that is currently not covered by covers (line 4). Since covers is empty at the beginning, this choice is uniformly random initially; we synthesize a presumably safe policy program in line 9 of Algorithm 2 that resembles the neural policy πw, considering all possible initial states S0 of the given environment model C, using Algorithm 1.


Figure 6 CEGIS for Verifiable Reinforcement Learning

If verification fails in line 11, our approach simply reduces the initial state space, hoping that a safe policy program is easier to synthesize if only a subset of initial states is considered. The algorithm in line 12 gradually shrinks the radius r* of the initial state space around the sampled initial state s0. The synthesis algorithm in the next iteration synthesizes and verifies a candidate using the reduced initial state space. The main idea is that if the initial state space is shrunk to a restricted area around s0 but a safe policy program still cannot be found, it is quite possible that either s0 points to an unsafe initial state of the neural oracle or the sketch is not sufficiently expressive.

If verification at this stage succeeds with an inductive invariant φ, a new policy program Pθ is synthesized that can be verified safe in the state space covered by φ. We add the inductive invariant and the policy program into covers and policies in lines 14 and 15 respectively, and then move to another iteration of counterexample-guided synthesis. This iterative synthesize-and-verify process continues until the entire initial state space is covered (lines 3 to 18). The output of Algorithm 2 is [(Pθ1, φ1), (Pθ2, φ2), · · · ], which essentially defines conditional statements in a synthesized program performing different actions depending on whether a specified invariant condition evaluates to true or false.

Theorem 4.2. If CEGIS(πw, P[θ], C[·]) = [(Pθ1, φ1), (Pθ2, φ2), · · · ] (as defined in Algorithm 2), then the deterministic program

P ≜ λX. if φ1(X) return Pθ1(X) else if φ2(X) return Pθ2(X) · · ·

is safe in the environment C, meaning that φ1 ∨ φ2 ∨ · · · is an inductive invariant of C[P] proving that C.Su is unreachable.
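Operationally, the pairs returned by Algorithm 2 can be folded into the guarded program of Theorem 4.2 as in the small sketch below, where policies is the list [(Pθ1, φ1), (Pθ2, φ2), · · ·] and each invariant is treated as a predicate over states (an illustration, not the tool's code generator).

def compose(policies):
    """Build the program of Theorem 4.2 from [(P_theta_i, phi_i), ...]."""
    def P(state):
        for program, phi in policies:
            if phi(state):                  # first verified region that covers the state
                return program(state)
        raise RuntimeError("abort: state outside every verified region")
    return P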

Example 4.3. We illustrate the proposed counterexample-guided inductive synthesis method by means of a simple example, the Duffing oscillator [22], a nonlinear second-order environment. The transition relation of the environment system C is described with the differential equation

ẋ = y
ẏ = −0.6y − x − x³ + a

where x, y are the state variables and a the continuous control action given by a well-trained neural feedback control policy π such that a = π(x, y). The control objective is to regulate the state to the origin from a set of initial states C.S0: {x, y | −2.5 ≤ x ≤ 2.5 ∧ −2 ≤ y ≤ 2}. To be safe, the controller must avoid a set of unsafe states C.Su: {x, y | ¬(−5 ≤ x ≤ 5 ∧ −5 ≤ y ≤ 5)}. Given a program sketch as in the pendulum example, that is P[θ](x, y) = θ1x + θ2y, the user can ask the constraint solver to reason over a small (say 4) order polynomial invariant sketch for a synthesized program as in Example 4.1.

Algorithm 2 initially samples an initial state s with x = −0.46, y = −0.36 from S0. The inner do-while loop of the algorithm can discover a sub-region of initial states in the dotted box of Fig. 6(a) that can be leveraged to synthesize a verified deterministic policy P1(x, y) = 0.39x − 1.41y from the sketch. We also obtain an inductive invariant showing that the synthesized policy can always maintain the controller within the invariant set drawn in purple in Fig. 6(a): φ1 ≡ 209x⁴ + 29x³y + 14x²y² + 0.4xy³ + 296x³ + 201x²y + 113xy² + 16y³ + 252x² + 392xy + 537y² − 680 ≤ 0.

This invariant explains why the initial state space used for verification does not include the entire C.S0: a counterexample initial state s′ = (x = 2.249, y = 2) is not covered by the invariant, and for it the synthesized policy program above is not verified safe. The CEGIS loop in line 3 of Algorithm 2 uses s′ to synthesize another deterministic policy P2(x, y) = 0.88x − 2.34y from the sketch, whose learned inductive invariant is depicted in blue in Fig. 6(b): φ2 ≡ 128x⁴ + 0.9x³y − 0.2x²y² − 59x³ − 15xy² − 0.3y³ + 22x² + 47xy + 404y² − 619 ≤ 0. Algorithm 2 then terminates because φ1 ∨ φ2 covers C.S0. Our system interprets the two synthesized deterministic policies as the following deterministic program Poscillator using the syntax in Fig. 5:

def Poscillator (x, y)
    if 209x⁴ + 29x³y + 14x²y² + · · · + 537y² − 680 ≤ 0    ▷ φ1
        return 0.39x − 1.41y
    else if 128x⁴ + 0.9x³y − 0.2x²y² + · · · + 404y² − 619 ≤ 0    ▷ φ2
        return 0.88x − 2.34y
    else abort    ▷ unsafe

Neither of the two deterministic policies enforces safety by itself on all initial states, but they do guarantee safety when combined, because by construction φ = φ1 ∨ φ2 is an inductive invariant of Poscillator in the environment C.
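As a quick numerical sanity check (not a proof; the guarantee comes from the inductive invariants), one can roll out the Duffing dynamics from the sampled initial state under the first synthesized policy and confirm the trajectory never leaves the safe box; the step size and horizon below are assumptions.

def duffing_step(x, y, a, dt=0.01):
    """One Euler step of the Duffing oscillator of Example 4.3."""
    return x + y * dt, y + (-0.6 * y - x - x**3 + a) * dt

def P1(x, y):
    return 0.39 * x - 1.41 * y                  # first policy synthesized in Example 4.3

x, y = -0.46, -0.36                             # initial state sampled by Algorithm 2
for _ in range(5000):
    x, y = duffing_step(x, y, P1(x, y))
    assert -5 <= x <= 5 and -5 <= y <= 5, "left the safe region"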

Although Algorithm 2 is sound, it may not terminate: r* in the algorithm can become arbitrarily small, and there is also no restriction on the size of potential counterexamples. Nonetheless, our experimental results indicate that the algorithm performs well in practice.

4.3 Shielding
A safety proof of a synthesized deterministic program of a neural network does not automatically lift to a safety argument of the neural network from which it was derived,


Algorithm 3 Shield(s, C[πw], P, φ)
1: Predict s′ such that (s, s′) ∈ C[πw].Tt(s)
2: if φ(s′) then return πw(s)
3: else return P(s)

since the network may exhibit behaviors not captured by the simpler deterministic program. To bridge this divide, we recover soundness at runtime by monitoring the system behaviors of a neural network in its environment context, using the synthesized policy program and its inductive invariant as a shield. The pseudo-code for using a shield is given in Algorithm 3. In line 1 of Algorithm 3, for a current state s, we use the state transition system of our environment context model to predict the next state s′. If s′ is not within φ, we are unsure whether entering s′ would inevitably make the system unsafe, as we lack a proof that the neural oracle is safe. However, we do have a guarantee that if we follow the synthesized program P, the system will stay within the safety boundary defined by φ, which Algorithm 2 has formally proved. We do so in line 3 of Algorithm 3, using P and φ as shields, intervening only if necessary so as to keep the shield from unnecessarily overriding the neural policy.⁴
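A minimal runtime sketch of Algorithm 3 follows; predict_next stands in for the one-step prediction made with the environment context model C[πw].Tt, and πw, P, and φ are exactly what training and Algorithm 2 produce.

def shield(s, pi_w, P, phi, predict_next):
    """Run the neural policy, falling back to the verified program when the
    predicted next state would leave the inductive invariant phi."""
    a = pi_w(s)
    s_next = predict_next(s, a)     # (s, s') in C[pi_w].Tt, as in line 1 of Algorithm 3
    if phi(s_next):                 # neural action keeps the system inside phi
        return a
    return P(s)                     # otherwise take the provably safe action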

5 Experimental Results
We have applied our framework on a number of challenging control- and cyberphysical-system benchmarks. We consider the utility of our approach for verifying the safety of trained neural network controllers. We use the deep policy gradient algorithm [28] for neural network training, the Z3 SMT solver [13] to check convergence of the CEGIS loop, and the Mosek constraint solver [6] to generate inductive invariants of synthesized programs from a sketch. All of our benchmarks are verified using the program sketch defined in equation (4) and the invariant sketch defined in equation (7). We report simulation results on our benchmarks over 1000 runs (each run consists of 5000 simulation steps). Each simulation time step is fixed at 0.01 seconds. Our experiments were conducted on a standard desktop machine consisting of Intel(R) Core(TM) i7-8700 CPU cores and 64GB memory.

Case Study on Inverted Pendulum. We first give more details on the evaluation result of our running example, the inverted pendulum. Here we consider a more restricted safety condition that deems the system to be unsafe when the pendulum's angle is more than 23° from the origin (i.e., significant swings are further prohibited). The controller is a 2-hidden-layer neural model (240 × 200). Our tool interprets the neural network as a program containing three conditional branches:

⁴We also extended our approach to synthesize deterministic programs which can guarantee stability; see the supplementary material [49].

def P (η, ω)

    if 17533η⁴ + 13732η³ω + 3831η²ω² − 5472ηω³ + 8579ω⁴ + 6813η³ + 9634η²ω + 3947ηω² − 120ω³ + 1928η² + 1915ηω + 1104ω² − 313 ≤ 0
        return −1.728176866η − 1.009441768ω
    else if 2485η⁴ + 826η³ω − 351η²ω² + 581ηω³ + 2579ω⁴ + 591η³ + 9η²ω + 243ηω² − 189ω³ + 484η² + 170ηω + 287ω² − 82 ≤ 0
        return −1.734281984η − 1.073944835ω
    else if 115496η⁴ + 64763η³ω + 85376η²ω² + 21365ηω³ + 7661ω⁴ − 111271η³ − 54416η²ω − 66684ηω² − 8742ω³ + 33701η² + 11736ηω + 12503ω² − 1185 ≤ 0
        return −2.578835525η − 1.625056971ω
    else abort

Over 1000 simulation runs, we found 60 unsafe cases when running the neural controller alone. Importantly, running the neural controller in tandem with the above verified and synthesized program prevents all unsafe neural decisions, with only 65 interventions from the program on the neural network. Our results demonstrate that CEGIS is important to ensure a synthesized program is safe to use. We report more details, including synthesis times, below.

One may ask why we do not directly learn a deterministic program to control the device (without appealing to the neural policy at all), using reinforcement learning to synthesize its unknown parameters. Even for an example as simple as the inverted pendulum, our discussion above shows that it is difficult (if not impossible) to derive a single straight-line (linear) program that is safe to control the system. Even for scenarios in which a straight-line program suffices, using existing RL methods to directly learn unknown parameters in our sketches may still fail to obtain a safe program. For example, we considered whether a linear control policy can be learned to (1) prevent the pendulum from falling down and (2) require the pendulum control action to be strictly within the range [−1, 1] (e.g., operating the controller in an environment with low power constraints). We found that, despite many experiments on tuning learning rates and rewards, directly training a linear control program to conform to this restriction with either reinforcement learning (e.g., policy gradient) or random search [29] was unsuccessful because of undesirable overfitting. In contrast, neural networks work much better with these RL algorithms and can be used to guide the synthesis of a deterministic program policy. Indeed, by treating the neural policy as an oracle, we were able to quickly discover a straight-line linear deterministic program that in fact satisfies this additional motion constraint.

Safety Verification. Our verification results are given in Table 1. In the table, Vars represents the number of variables in a control system; this number serves as a proxy for application complexity. Size is the number of neurons in the hidden layers of the network, Training the training time for the network, and Failures the number of times the network failed to satisfy the safety property in simulation. The table also gives the Size of a synthesized program in terms of the number of policies found by Algorithm 2 (used to generate conditional statements in the program), its Synthesis time, the Overhead of our approach in terms of the additional cost


Table 1. Experimental Results on Deterministic Program Synthesis, Verification, and Shielding

                            Neural Network                      Deterministic Program as Shield           Performance
Benchmarks          Vars    Size             Training  Failures  Size  Synthesis  Overhead  Interventions  NN     Program
Satellite           2       240 × 200        957s      0         1     160s       337       0              57     97
DCMotor             3       240 × 200        944s      0         1     68s        203       0              119    122
Tape                3       240 × 200        980s      0         1     42s        263       0              30     36
Magnetic Pointer    3       240 × 200        992s      0         1     85s        292       0              83     88
Suspension          4       240 × 200        960s      0         1     41s        871       0              47     61
Biology             3       240 × 200        978s      0         1     168s       523       0              2464   2599
DataCenter Cooling  3       240 × 200        968s      0         1     168s       469       0              146    401
Quadcopter          2       300 × 200        990s      182       2     67s        641       185            72     98
Pendulum            2       240 × 200        962s      60        3     1107s      965       65             442    586
CartPole            4       300 × 200        990s      47        4     998s       562       1799           6813   19126
Self-Driving        4       300 × 200        990s      61        1     185s       466       236            1459   5136
Lane Keeping        4       240 × 200        895s      36        1     183s       865       64             3753   6435
4-Car platoon       8       500 × 400 × 300  1160s     8         4     609s       317       8              76     96
8-Car platoon       16      500 × 400 × 300  1165s     40        1     1217s      605       1080           385    554
Oscillator          18      240 × 200        1023s     371       1     618s       2131      93703          6935   11353

(compared to the non-shielded variant) in running time to use a shield, and the number of Interventions, the number of times the shield was invoked across all simulation runs. We also report the performance gap between a shielded neural policy (NN) and a purely programmatic policy (Program), in terms of the number of steps on average that a controlled system spends in reaching a steady state of the system (i.e., a convergence state).

The first five benchmarks are linear time-invariant control systems adapted from [15]. The safety property is that the reach set has to be within a safe rectangle. Benchmark Biology defines a minimal model of glycemic control in diabetic patients, such that the dynamics of glucose and insulin interaction in the blood system are defined by polynomials [10]. For safety, we verify that the neural controller ensures that the level of plasma glucose concentration is above a certain threshold. Benchmark DataCenter Cooling is a model of a collection of three server racks, each with their own cooling devices, which also shed heat to their neighbors. The safety property is that a learned controller must keep the data center below a certain temperature. In these benchmarks, the cost to query the network oracle constitutes the dominant time for generating the safety shield. Given the simplicity of these benchmarks, the neural network controllers did not violate the safety condition in our trials, and moreover there were no interventions from the safety shields that affected performance.

The next three benchmarks, Quadcopter, (Inverted) Pendulum, and Cartpole, are selected from classic control applications and have more sophisticated safety conditions. We have discussed the inverted pendulum example at length earlier. The Quadcopter environment tests whether a controlled quadcopter can realize stable flight. The environment of Cartpole consists of a pole attached to an unactuated joint connected to a cart that moves along a frictionless track. The system is unsafe when the pole's angle is more than 30° from being upright or the cart moves by more than 0.3 meters from the origin. We observed safety violations in each of these benchmarks that were eliminated using our verification methodology. Notably, the number of interventions is remarkably low as a percentage of the overall number of simulation steps.

Benchmark Self-driving defines a single-car navigation problem. The neural controller is responsible for preventing the car from veering into canals found on either side of the road. Benchmark Lane Keeping models another safety-related car-driving problem: the neural controller aims to maintain a vehicle between lane markers and keep it centered in a possibly curved lane. The curvature of the road is considered as a disturbance input. Environment disturbances of this kind can be conservatively specified in our model, accounting for noise and unmodeled dynamics. Our verification approach supports these disturbances (verification condition (10)). Benchmarks n-Car platoon model multiple (n) vehicles forming a platoon, maintaining a safe relative distance among one another [39]. Each of these benchmarks exhibited some number of violations that were remediated by our verification methodology. Benchmark Oscillator consists of a two-dimensional switched oscillator plus a 16-order filter. The filter smoothens the input signals and has a single output signal. We verify that the output signal is below a safe threshold. Because of the model complexity of this benchmark, it exhibited significantly more violations than the others. Indeed, the neural-network controlled system often oscillated between the safe and unsafe boundary in many runs. Consequently, the overhead in this benchmark is high, because a large number of shield interventions was required to ensure safety. In other words, the synthesized


shield trades performance for safety to guarantee that the threshold boundary is never violated.

For all benchmarks, our tool successfully generated safe, interpretable, deterministic programs and inductive invariants as shields. When a neural controller takes an unsafe action, the synthesized shield correctly prevents this action from executing by providing an alternative, provably safe action proposed by the verified deterministic program. In terms of performance, Table 1 shows that a shielded neural policy is a feasible approach to drive a controlled system into a steady state. For each of the benchmarks studied, the programmatic policy is less performant than the shielded neural policy, sometimes by a factor of two or more (e.g., Cartpole, Self-Driving, and Oscillator). Our results demonstrate that executing a neural policy in tandem with a program distilled from it can retain the performance provided by the neural policy while maintaining the safety provided by the verified program.

Although our synthesis algorithm does not guarantee convergence to the global minimum when applied to nonconvex RL problems, the results given in Table 1 indicate that our algorithm can often produce high-quality control programs, with respect to a provided sketch, that converge reasonably fast. In some of our benchmarks, however, the number of interventions is significantly higher than the number of neural controller failures, e.g., 8-Car platoon and Oscillator. The high number of interventions is not primarily because of non-optimality of the synthesized programmatic controller; instead, inherent safety issues in the neural network models are the main culprit that triggers shield interventions. In 8-Car platoon, after corrections made by our deterministic program, the neural model again takes an unsafe action in the next execution step, so that the deterministic program has to make another correction. It is only after applying a number of such shield interventions that the system navigates into a part of the state space that can be safely operated on by the neural model. For this benchmark, all system states where shield interventions occur are indeed unsafe. We also examined the unsafe simulation runs made by executing the neural controller alone in Oscillator. Among the large number of shield interventions (as reflected in Table 1), 74% of them are effective and indeed prevent an unsafe neural decision. Ineffective interventions in Oscillator are due to the fact that, when optimizing equation (5), a large penalty is given to unsafe states, causing the synthesized programmatic policy to weigh safety more than proximity when there exist a large number of unsafe neural decisions.

Suitable Sketches. Providing a suitable sketch may need domain knowledge. To help the user more easily tune the shape of a sketch, our approach provides algorithmic support by not requiring conditional statements in a program sketch, and syntactic sugar: the user can simply provide an upper bound on the degree of an invariant sketch.

Table 2. Experimental Results on Tuning Invariant Degrees. TO means that an adequate inductive invariant cannot be found within 2 hours.

Benchmarks       Degree   Verification   Interventions   Overhead
Pendulum         2        TO             -               -
                 4        226s           40542           782
                 8        236s           30787           879
Self-Driving     2        TO             -               -
                 4        24s            128851          697
                 8        251s           123671          2685
8-Car-platoon    2        1729s          43952           836
                 4        5402s          37990           919
                 8        TO             -               -

Our experimental results are collected using the invariant sketch defined in equation (7), and we chose an upper bound of 4 on the degree of all monomials included in the sketch. Recall that invariants may also be used as conditional predicates as part of a synthesized program. We adjust the invariant degree upper bound to evaluate its effect in our synthesis procedure. The results are given in Table 2.

Generally, high-degree invariants lead to fewer interventions because they tend to be more permissive than low-degree ones. However, high-degree invariants take more time to synthesize and verify. This is particularly true for high-dimension models such as 8-Car platoon. Moreover, although high-degree invariants tend to have fewer interventions, they have larger overhead. For example, using a shield of degree 8 in the Self-Driving benchmark caused an overhead of 2685. This is because high-degree polynomial computations are time-consuming. On the other hand, an insufficient degree upper bound may not be permissive enough to obtain a valid invariant. It is therefore essential to consider the tradeoff between overhead and permissiveness when choosing the degree of an invariant.

Handling Environment Changes. We consider the effectiveness of our tool when previously trained neural network controllers are deployed in environment contexts different from the environment used for training. Here we consider neural network controllers of larger size (two hidden layers with 1200 × 900 neurons) than in the above experiments. This is because we want to ensure that a neural policy is trained to be near optimal in the environment context used for training. These larger networks were in general more difficult to train, requiring at least 1500 seconds to converge. Our results are summarized in Table 3.

When the underlying environment slightly changes, learning a new safety shield takes substantially less time than training a new network. For Cartpole, we simulated the trained controller in a new environment by increasing the length of the pole by 0.15 meters. The neural controller failed 3 times in our 1000-episode simulation; the shield interfered with the network operation only 8 times to prevent these unsafe behaviors.


Table 3 Experimental Results on Handling Environment Changes

Benchmarks     Environment Change                        NN Size      NN Failures   Program Size   Synthesis   Overhead   Shield Interventions
Cartpole       Increased pole length by 0.15m            1200 × 900   3             1              239s        291        8
Pendulum       Increased pendulum mass by 0.3kg          1200 × 900   77            1              581s        811        8748
Pendulum       Increased pendulum length by 0.15m        1200 × 900   76            1              483s        653        7060
Self-driving   Added an obstacle that must be avoided    1200 × 900   203           1              392s        842        108320

The new shield was synthesized in 239s, significantly faster than retraining a new neural network for the new environment. For (Inverted) Pendulum, we deployed the trained neural network in an environment in which the pendulum's mass is increased by 0.3kg. The neural controller exhibits noticeably higher failure rates than in the previous experiment; we were able to synthesize a safety shield adapted to this new environment in 581 seconds that prevented these violations. The shield intervened with the operation of the network only 8.7 times per episode on average. Similar results were observed when we increased the pendulum's length by 0.15m. For Self-driving, we additionally required the car to avoid an obstacle. The synthesized shield provided safe actions to ensure collision-free motion.

6 Related Work
Verification of Neural Networks. While neural networks have historically found only limited application in safety- and security-related contexts, recent advances have defined suitable verification methodologies that are capable of providing stronger assurance guarantees. For example, Reluplex [23] is an SMT solver that supports linear real arithmetic with ReLU constraints and has been used to verify safety properties in a collision avoidance system. AI2 [17, 31] is an abstract interpretation tool that can reason about robustness specifications for deep convolutional networks. Robustness verification is also considered in [8, 21]. Systematic testing techniques such as [43, 45, 48] are designed to automatically generate test cases to increase a coverage metric, i.e., to explore different parts of a neural network architecture by generating test inputs that maximize the number of activated neurons. These approaches ignore effects induced by the actual environment in which the network is deployed. They are typically ineffective for safety verification of a neural network controller over an infinite time horizon with complex system dynamics. Unlike these whitebox efforts that reason over the architecture of the network, our verification framework is fully blackbox, using a network only as an oracle to learn a simpler deterministic program. This gives us the ability to reason about safety entirely in terms of the synthesized program, using a shielding mechanism derived from the program to ensure that the neural policy can only explore safe regions.

Safe Reinforcement Learning. The machine learning community has explored various techniques to develop safe reinforcement learning algorithms in contexts similar to ours, e.g., [32, 46]. In some cases, verification is used as a safety validation oracle [2, 11]. In contrast to these efforts, our inductive synthesis algorithm can be freely incorporated into any existing reinforcement learning framework and applied to any legacy machine learning model. More importantly, our design allows us to synthesize new shields from existing ones without requiring retraining of the network under reasonable changes to the environment.

Syntax-Guided Synthesis for Machine Learning. An important reason underlying the success of program synthesis is that sketches of the desired program [41, 42], often provided along with example data, can be used to effectively restrict a search space and allow users to provide additional insight about the desired output [5]. Our sketch-based inductive synthesis approach is a realization of this idea, applicable to continuous and infinite-state control systems central to many machine learning tasks. A high-level policy language grammar is used in our approach to constrain the shape of possible synthesized deterministic policy programs. Our program parameters in sketches are continuous because they naturally fit continuous control problems. However, our synthesis algorithm (line 9 of Algorithm 2) can call a derivative-free random search method (which does not require the gradient or its approximative finite difference) for synthesizing functions whose domain is disconnected, (mixed-)integer, or non-smooth. Our synthesizer thus may be applied to other domains that are not continuous or differentiable, like most modern program synthesis algorithms [41, 42].

In a similar vein, recent efforts on interpretable machine learning generate interpretable models such as program code [47] or decision trees [9] as output, after which traditional symbolic verification techniques can be leveraged to prove program properties. Our approach novelly ensures that only safe programs are synthesized via a CEGIS procedure and can provide safety guarantees on the original high-performing neural networks via invariant inference. A detailed comparison was provided on page 3.

Controller Synthesis. Random search [30] has long been utilized as an effective approach for controller synthesis. In [29], several extensions of random search have been proposed that substantially improve its performance when applied to reinforcement learning. However, [29] learns a single


linear controller that, in our experience, we found to be insufficient to guarantee safety for our benchmarks. We often need to learn a program that involves a family of linear controllers conditioned on different input spaces to guarantee safety. The novelty of our approach is that it can automatically synthesize programs with conditional statements by need, which is critical to ensure safety.

In Sec. 5, linear sketches were used in our experiments for shield synthesis. LQR-Tree based approaches such as [44] can synthesize a controller consisting of a set of linear quadratic regulator (LQR [14]) controllers, each applicable in a different portion of the state space. These approaches focus on stability, while our approach primarily addresses safety. In our experiments, we observed that because LQR does not take safe/unsafe regions into consideration, synthesized LQR controllers can regularly violate safety constraints.

Runtime Shielding. Shield synthesis has been used in runtime enforcement for reactive systems [12, 25]. Safety shielding in reinforcement learning was first introduced in [4]. Because their verification approach is not symbolic, however, it can only work over finite discrete state and action systems. Since states in infinite-state and continuous-action systems are not enumerable, using these methods requires the end user to provide a finite abstraction of complex environment dynamics; such abstractions are typically too coarse to be useful (because they result in too much over-approximation) or otherwise have too many states to be analyzable [15]. Indeed, automated environment abstraction tools such as [16] often take hours even for simple 4-dimensional systems. Our approach embraces the nature of infinity in control systems by learning a symbolic shield from an inductive invariant for a synthesized program that includes an infinite number of environment states under which the program can be guaranteed to be safe. We therefore believe our framework provides a more promising verification pathway for machine learning in high-dimensional control systems. In general, these efforts have focused on shielding of discrete and finite systems, with no obvious generalization to effectively deal with the continuous and infinite systems that our approach can monitor and protect. Shielding in continuous control systems was introduced in [3, 18]. These approaches are based on HJ reachability (Hamilton-Jacobi Reachability [7]). However, current formulations of HJ reachability are limited to systems with approximately five dimensions, making high-dimensional system verification intractable. Unlike their least restrictive control law framework that decouples safety and learning, our approach only considers state spaces that a learned controller attempts to explore and is therefore capable of synthesizing safety shields even in high dimensions.

7 Conclusions
This paper presents an inductive synthesis-based toolchain that can verify neural network control policies within an environment formalized as an infinite-state transition system. The key idea is a novel synthesis framework capable of synthesizing a deterministic policy program, based on a user-given sketch, that resembles a neural policy oracle and simultaneously satisfies a safety specification, using a counterexample-guided synthesis loop. Verification soundness is achieved at runtime by using the learned deterministic program, along with its learned inductive invariant, as a shield to protect the neural policy, correcting the neural policy's action only if the chosen action can cause violation of the inductive invariant. Experimental results demonstrate that our approach can be used to realize fully trustworthy reinforcement learning systems with low overhead.

References
[1] M. Ahmadi, C. Rowley, and U. Topcu. 2018. Control-Oriented Learning of Lagrangian and Hamiltonian Systems. In American Control Conference.
[2] Anayo K. Akametalu, Shahab Kaynama, Jaime F. Fisac, Melanie Nicole Zeilinger, Jeremy H. Gillula, and Claire J. Tomlin. 2014. Reachability-based safe learning with Gaussian processes. In 53rd IEEE Conference on Decision and Control, CDC 2014, 1424–1431.
[3] Anayo K. Akametalu, Shahab Kaynama, Jaime F. Fisac, Melanie Nicole Zeilinger, Jeremy H. Gillula, and Claire J. Tomlin. 2014. Reachability-based safe learning with Gaussian processes. In 53rd IEEE Conference on Decision and Control, CDC 2014, Los Angeles, CA, USA, December 15-17, 2014, 1424–1431.
[4] Mohammed Alshiekh, Roderick Bloem, Rüdiger Ehlers, Bettina Könighofer, Scott Niekum, and Ufuk Topcu. 2018. Safe Reinforcement Learning via Shielding. AAAI (2018).
[5] Rajeev Alur, Rastislav Bodík, Eric Dallal, Dana Fisman, Pranav Garg, Garvit Juniwal, Hadas Kress-Gazit, P. Madhusudan, Milo M. K. Martin, Mukund Raghothaman, Shambwaditya Saha, Sanjit A. Seshia, Rishabh Singh, Armando Solar-Lezama, Emina Torlak, and Abhishek Udupa. 2015. Syntax-Guided Synthesis. In Dependable Software Systems Engineering, NATO Science for Peace and Security Series D: Information and Communication Security, Vol. 40. IOS Press, 1–25.
[6] MOSEK ApS. 2018. The mosek optimization software. http://www.mosek.com
[7] Somil Bansal, Mo Chen, Sylvia Herbert, and Claire J. Tomlin. 2017. Hamilton-Jacobi Reachability: A Brief Overview and Recent Advances. (2017).
[8] Osbert Bastani, Yani Ioannou, Leonidas Lampropoulos, Dimitrios Vytiniotis, Aditya V. Nori, and Antonio Criminisi. 2016. Measuring Neural Net Robustness with Constraints. In Proceedings of the 30th International Conference on Neural Information Processing Systems (NIPS'16), 2621–2629.
[9] Osbert Bastani, Yewen Pu, and Armando Solar-Lezama. 2018. Verifiable Reinforcement Learning via Policy Extraction. CoRR abs/1805.08328 (2018).
[10] Richard N. Bergman, Diane T. Finegood, and Marilyn Ader. 1985. Assessment of insulin sensitivity in vivo. Endocrine reviews 6, 1 (1985), 45–86.
[11] Felix Berkenkamp, Matteo Turchetta, Angela P. Schoellig, and Andreas Krause. 2017. Safe Model-based Reinforcement Learning with Stability Guarantees. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 908–919.
[12] Roderick Bloem, Bettina Könighofer, Robert Könighofer, and Chao Wang. 2015. Shield Synthesis - Runtime Enforcement for Reactive Systems. In Tools and Algorithms for the Construction and Analysis of Systems - 21st International Conference, TACAS 2015, 533–548.

[13] Leonardo De Moura and Nikolaj Bjørner. 2008. Z3: An Efficient SMT Solver. In Proceedings of the Theory and Practice of Software, 14th International Conference on Tools and Algorithms for the Construction and Analysis of Systems (TACAS'08/ETAPS'08), 337–340.
[14] Peter Dorato, Vito Cerone, and Chaouki Abdallah. 1994. Linear-Quadratic Control: An Introduction. Simon & Schuster, Inc., New York, NY, USA.
[15] Chuchu Fan, Umang Mathur, Sayan Mitra, and Mahesh Viswanathan. 2018. Controller Synthesis Made Real: Reach-Avoid Specifications and Linear Dynamics. In Computer Aided Verification - 30th International Conference, CAV 2018, 347–366.
[16] Ioannis Filippidis, Sumanth Dathathri, Scott C. Livingston, Necmiye Ozay, and Richard M. Murray. 2016. Control design for hybrid systems with TuLiP: The Temporal Logic Planning toolbox. In 2016 IEEE Conference on Control Applications, CCA 2016, 1030–1041.
[17] Timon Gehr, Matthew Mirman, Dana Drachsler-Cohen, Petar Tsankov, Swarat Chaudhuri, and Martin T. Vechev. 2018. AI2: Safety and Robustness Certification of Neural Networks with Abstract Interpretation. In 2018 IEEE Symposium on Security and Privacy, SP 2018, 3–18.
[18] Jeremy H. Gillula and Claire J. Tomlin. 2012. Guaranteed Safe Online Learning via Reachability: tracking a ground target using a quadrotor. In IEEE International Conference on Robotics and Automation, ICRA 2012, 14-18 May 2012, St. Paul, Minnesota, USA, 2723–2730.
[19] Sumit Gulwani, Saurabh Srivastava, and Ramarathnam Venkatesan. 2008. Program Analysis As Constraint Solving. In Proceedings of the 29th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '08), 281–292.
[20] Sumit Gulwani and Ashish Tiwari. 2008. Constraint-based approach for analysis of hybrid systems. In International Conference on Computer Aided Verification, 190–203.
[21] Xiaowei Huang, Marta Kwiatkowska, Sen Wang, and Min Wu. 2017. Safety Verification of Deep Neural Networks. In CAV (1) (Lecture Notes in Computer Science), Vol. 10426, 3–29.
[22] Dominic W. Jordan and Peter Smith. 1987. Nonlinear ordinary differential equations. (1987).
[23] Guy Katz, Clark W. Barrett, David L. Dill, Kyle Julian, and Mykel J. Kochenderfer. 2017. Reluplex: An Efficient SMT Solver for Verifying Deep Neural Networks. In Computer Aided Verification - 29th International Conference, CAV 2017, 97–117.
[24] Hui Kong, Fei He, Xiaoyu Song, William N. N. Hung, and Ming Gu. 2013. Exponential-Condition-Based Barrier Certificate Generation for Safety Verification of Hybrid Systems. In Proceedings of the 25th International Conference on Computer Aided Verification (CAV'13), 242–257.
[25] Bettina Könighofer, Mohammed Alshiekh, Roderick Bloem, Laura Humphrey, Robert Könighofer, Ufuk Topcu, and Chao Wang. 2017. Shield Synthesis. Form. Methods Syst. Des. 51, 2 (Nov. 2017), 332–361.
[26] Benoît Legat, Chris Coey, Robin Deits, Joey Huchette, and Amelia Perry. 2017 (accessed Nov 16, 2018). Sum-of-squares optimization in Julia. https://www.juliaopt.org/meetings/mit2017/legat.pdf
[27] Yuxi Li. 2017. Deep Reinforcement Learning: An Overview. CoRR abs/1701.07274 (2017).
[28] Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. 2016. Continuous control with deep reinforcement learning. In 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings.
[29] Horia Mania, Aurelia Guy, and Benjamin Recht. 2018. Simple random search of static linear policies is competitive for reinforcement learning. In Advances in Neural Information Processing Systems 31, 1800–1809.
[30] J. Matyas. 1965. Random optimization. Automation and Remote Control 26, 2 (1965), 246–253.
[31] Matthew Mirman, Timon Gehr, and Martin T. Vechev. 2018. Differentiable Abstract Interpretation for Provably Robust Neural Networks. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, 3575–3583.
[32] Teodor Mihai Moldovan and Pieter Abbeel. 2012. Safe Exploration in Markov Decision Processes. In Proceedings of the 29th International Conference on International Conference on Machine Learning (ICML'12), 1451–1458.
[33] Yurii Nesterov and Vladimir Spokoiny. 2017. Random Gradient-Free Minimization of Convex Functions. Found. Comput. Math. 17, 2 (2017), 527–566.

[34] Stephen Prajna and Ali Jadbabaie. 2004. Safety Verification of Hybrid Systems Using Barrier Certificates. In Hybrid Systems: Computation and Control, 7th International Workshop, HSCC 2004, 477–492.
[35] Aravind Rajeswaran, Kendall Lowrey, Emanuel Todorov, and Sham M. Kakade. 2017. Towards Generalization and Simplicity in Continuous Control. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 6553–6564.
[36] Stéphane Ross, Geoffrey J. Gordon, and Drew Bagnell. 2011. A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2011, 627–635.
[37] Wenjie Ruan, Xiaowei Huang, and Marta Kwiatkowska. 2018. Reachability Analysis of Deep Neural Networks with Provable Guarantees. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI 2018, 2651–2659.
[38] Stefan Schaal. 1999. Is imitation learning the route to humanoid robots? Trends in cognitive sciences 3, 6 (1999), 233–242.
[39] Bastian Schürmann and Matthias Althoff. 2017. Optimal control of sets of solutions to formally guarantee constraints of disturbed linear systems. In 2017 American Control Conference, ACC 2017, 2522–2529.
[40] David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Riedmiller. 2014. Deterministic Policy Gradient Algorithms. In Proceedings of the 31st International Conference on International Conference on Machine Learning - Volume 32, I-387–I-395.
[41] Armando Solar-Lezama. 2008. Program Synthesis by Sketching. Ph.D. Dissertation. Berkeley, CA, USA. Advisor(s): Bodik, Rastislav.
[42] Armando Solar-Lezama. 2009. The Sketching Approach to Program Synthesis. In Proceedings of the 7th Asian Symposium on Programming Languages and Systems (APLAS '09), 4–13.
[43] Youcheng Sun, Min Wu, Wenjie Ruan, Xiaowei Huang, Marta Kwiatkowska, and Daniel Kroening. 2018. Concolic Testing for Deep Neural Networks. In Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering (ASE 2018), 109–119.
[44] Russ Tedrake. 2009. LQR-trees: Feedback motion planning on sparse randomized trees. In Robotics: Science and Systems V, University of Washington, Seattle, USA, June 28 - July 1, 2009.
[45] Yuchi Tian, Kexin Pei, Suman Jana, and Baishakhi Ray. 2018. DeepTest: Automated Testing of Deep-neural-network-driven Autonomous Cars. In Proceedings of the 40th International Conference on Software Engineering (ICSE '18), 303–314.
[46] Matteo Turchetta, Felix Berkenkamp, and Andreas Krause. 2016. Safe Exploration in Finite Markov Decision Processes with Gaussian Processes. In Proceedings of the 30th International Conference on Neural Information Processing Systems (NIPS'16), 4312–4320.
[47] Abhinav Verma, Vijayaraghavan Murali, Rishabh Singh, Pushmeet Kohli, and Swarat Chaudhuri. 2018. Programmatically Interpretable Reinforcement Learning. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, 5052–5061.
[48] Matthew Wicker, Xiaowei Huang, and Marta Kwiatkowska. 2018. Feature-Guided Black-Box Safety Testing of Deep Neural Networks. In Tools and Algorithms for the Construction and Analysis of Systems - 24th International Conference, TACAS 2018, 408–426.
[49] He Zhu, Zikang Xiong, Stephen Magill, and Suresh Jagannathan. 2018. An Inductive Synthesis Framework for Verifiable Reinforcement Learning. Technical Report. Galois, Inc. http://herowanzhu.github.io/herowanzhu.github.io/VerifiableLearning.pdf

Page 3: An Inductive Synthesis Framework for Verifiable

Figure 2 The Framework of Our Approach

We assume the existence of a neural network control pol-icy πw R2 rarr R that executes actions over the pendu-lum whose weight values of w are learned from trainingepisodes This policy is a state-dependent function mappinga 2-dimensional state s (η and ω) to a control action a Ateach transition the policy mitigates uncertainty throughfeedback over state variables in s Environment Context An environment context C[middot] de-fines the behavior of the application where [middot] is left opento deploy a reasonable neural controller πw The actions ofthe controller are dictated by constraints imposed by the en-vironment In its simplest form the environment is simply astate transition system In the pendulum example this wouldbe the equations given in Fig 1 parameterized by pendulummass and length In general however the environment mayinclude additional constraints (eg a constraining boundingbox that restricts the motion of the pendulum beyond thespecification given by the transition system in Fig 1)

22 Synthesis Verification and ShieldingIn this paper we assume that a neural network is trainedusing a state-of-the-art reinforcement learning strategy [2840] Even though the resulting neural policy may appear towork well in practice the complexity of its implementationmakes it difficult to assert any strong and provable claimsabout its correctness since the neurons layers weights andbiases are far-removed from the intent of the actual controllerWe found that state-of-the-art neural network verifiers [1723] are ineffective for verification of a neural controller overan infinite time horizon with complex system dynamicsFramework We construct a policy interpretation mecha-nism to enable verification inspired by prior work on im-itation learning [36 38] and interpretable machine learn-ing [9 47] Fig 2 depicts the high-level framework of ourapproach Our idea is to synthesize a deterministic policyprogram from a neural policy πw approximating πw (whichwe call an oracle) with a simpler structural program P Likeπw P takes as input a system state and generates a controlaction a To this end P is simulated in the environmentused to train the neural policy πw to collect feasible statesGuided by πwrsquos actions on such collected states P is furtherimproved to resemble πw

The goal of the synthesis procedure is to search for a de-terministic program Plowast satisfying both (1) a quantitativespecification such that it bears reasonably close resemblanceto its oracle so that allowing it to serve as a potential sub-stitute is a sensible notion and (2) a desired logical safetyproperty such that when in operation the unsafe states de-fined in the environment C cannot be reached Formally

Plowast = argmaxPisinSafe(CJHK)

d(πwPC) (1)

where d(πwPC) measures proximity of P with its neuraloracle in an environment C JHK defines a search space forP with prior knowledge on the shape of target deterministicprograms and Safe(C JHK) restricts the solution space toa set of safe programs A program P is safe if the safetyof the transition system C[P] the deployment of P in theenvironment C can be formally verified

The novelty of our approach against prior work on neuralpolicy interpretation [9 47] is thus two-fold1 We bake in the concept of safety and formal safety verifi-

cation into the synthesis of a deterministic program froma neural policy as depicted in Fig 2 If a candidate programis not safe we rely on a counterexample-guided inductivesynthesis loop to improve our synthesis outcome to en-force the safety conditions imposed by the environment

2 We allowP to operate in tandemwith the high-performingneural policy P can be viewed as capturing an inductiveinvariant of the state transition system which can be usedas a shield to describe a boundary of safe states withinwhich the neural policy is free to make optimal control de-cisions If the system is about to be driven out of this safetyboundary the synthesized program is used to take an ac-tion that is guaranteed to stay within the space subsumedby the invariant By allowing the synthesis procedureto treat the neural policy as an oracle we constrain thesearch space of feasible programs to be those whose ac-tions reside within a proximate neighborhood of actionsundertaken by the neural policy

Synthesis Reducing a complex neural policy to a simpleryet safe deterministic program is possible because we donot require other properties from the oracle specificallywe do not require that the deterministic program preciselymirrors the performance of the neural policy For exampleexperiments described in [35] show that while a linear-policycontrolled robot can effectively stand up it is unable to learnan efficient walking gait unlike a sufficiently-trained neuralpolicy However if we just care about the safety of the neuralnetwork we posit that a linear reduction can be sufficientlyexpressive to describe necessary safety constraints Based onthis hypothesis for our inverted pendulum example we canexplore a linear program space from which a deterministicprogramPθ can be drawn expressed in terms of the followingprogram sketch

3

def P[θ1 θ2](η ω ) return θ1η + θ2ω

Here θ = [θ1θ2] are unknown parameters that need to besynthesized Intuitively the programweights the importanceof η and ω at a particular state to provide a feedback controlaction to mitigate the deviation of the inverted pendulumfrom (η = 0ω = 0)Our search-based synthesis sets θ to 0 initially It runs

the deterministic program Pθ instead of the oracle neuralpolicy πw within the environment C defined in Fig 1 (in thiscase the state transition system represents the differentialequation specification of the controller) to collect a batch oftrajectories A run of the state transition system of C[Pθ ]produces a finite trajectory s0 s1 middot middot middot sT We find θ from thefollowing optimization task that realizes (1)

\[ \max_{\theta \in \mathbb{R}^2} \; \mathbb{E}\Big[\sum_{t=0}^{T} d(\pi_w, P_\theta, s_t)\Big] \tag{2} \]

where

\[ d(\pi_w, P_\theta, s_t) \equiv \begin{cases} -(P_\theta(s_t) - \pi_w(s_t))^2 & s_t \notin S_u \\ -\text{MAX} & s_t \in S_u \end{cases} \]

This equation aims to search for a program P_θ at minimal distance from the neural oracle π_w along sampled trajectories, while simultaneously maximizing the likelihood that P_θ is safe.
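To make the proximity measure in equation (2) concrete, the following is a minimal sketch of how the penalized distance could be evaluated along one sampled rollout. The callables neural_policy, program, and is_unsafe, and the constant MAX_PENALTY, are illustrative assumptions rather than part of our implementation.

import numpy as np

MAX_PENALTY = 1e4  # hypothetical stand-in for the MAX constant in equation (2)

def step_distance(neural_policy, program, state, is_unsafe):
    # d(pi_w, P_theta, s_t): unsafe states receive a large fixed penalty,
    # safe states the negative squared gap between the two proposed actions.
    if is_unsafe(state):
        return -MAX_PENALTY
    gap = np.asarray(program(state)) - np.asarray(neural_policy(state))
    return -float(np.dot(gap, gap))

def trajectory_distance(neural_policy, program, trajectory, is_unsafe):
    # Sum of per-step distances along one sampled rollout s_0, ..., s_T.
    return sum(step_distance(neural_policy, program, s, is_unsafe) for s in trajectory)

Summing trajectory_distance over a batch of rollouts gives a sampled estimate of the objective in (2).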

Our synthesis procedure, described in Sec. 4.1, is a random search-based optimization algorithm [30]. We sample a new position of θ iteratively from a hypersphere of a given small radius surrounding the current position of θ and move to the new position (w.r.t. a learning rate) as dictated by Equation (2). For the running example, our search synthesizes:

def P(η, ω) : return −1205·η − 587·ω

The synthesized program can be used to intuitively interpret how the neural oracle works. For example, if a pendulum with a positive angle η > 0 leans towards the right (ω > 0), the controller will need to generate a large negative control action to force the pendulum to move left.
Verification. Since our method synthesizes a deterministic program P, we can leverage off-the-shelf formal verification algorithms to verify its safety with respect to the state transition system C defined in Fig. 1. To ensure that P is safe, we must ascertain that it can never transition to an unsafe state, i.e., a state that causes the pendulum to fall. When framed as a formal verification problem, answering such a question is tantamount to discovering an inductive invariant φ that represents all safe states over the state transition system:

1. Safe: φ is disjoint with all unsafe states S_u.
2. Init: φ includes all initial states S_0.
3. Induction: Any state in φ transitions to another state in φ, and hence cannot reach an unsafe state.

Inspired by template-based constraint solving approaches on inductive invariant inference [19, 20, 24, 34], the verification algorithm described in Sec. 4.2 uses a constraint solver to look for an inductive invariant in the form of a convex barrier certificate [34], E(s) ≤ 0, that maps all the states in the (safe)

reachable set to non-positive reals and all the states in the unsafe set to positive reals. The basic idea is to identify a polynomial function E : R^n → R such that (1) E(s) > 0 for any state s ∈ S_u; (2) E(s) ≤ 0 for any state s ∈ S_0; and (3) E(s′) − E(s) ≤ 0 for any state s that transitions to s′ in the state transition system C[P]. The second and third conditions collectively guarantee that E(s) ≤ 0 for any state s in the reachable set, thus implying that an unsafe state in S_u can never be reached.
Fig. 3(a) draws the discovered invariant in blue for C[P], given the initial and unsafe states, where P is the synthesized program for the inverted pendulum system. We can conclude that the safety property is satisfied by the P-controlled system, as all reachable states do not overlap with unsafe states. In case verification fails, our approach conducts a counterexample-guided loop (Sec. 4.2) to iteratively synthesize safe deterministic programs until verification succeeds.
Shielding. Keep in mind that a safety proof of a reduced deterministic program of a neural network does not automatically lift to a safety argument of the neural network from which it was derived, since the network may exhibit behaviors not fully captured by the simpler deterministic program. To bridge this divide, we propose to recover soundness at runtime by monitoring system behaviors of a neural network in its actual environment (deployment) context.
Fig. 4 depicts our runtime shielding approach, with more details given in Sec. 4.3. The inductive invariant φ learnt for a deterministic program P of a neural network π_w can serve as a shield to protect π_w at runtime under the environment context and safety property used to synthesize P. An observation about a current state is sent to both π_w and the shield. A high-performing neural network is allowed to take any actions it feels are optimal, as long as the next state it proposes is still within the safety boundary formally characterized by the inductive invariant of C[P]. Whenever a neural policy proposes an action that steers its controlled system out of the state space defined by the inductive invariant we have learned as part of deterministic program synthesis, our shield can instead take a safe action proposed by the deterministic program P. The action given by P is guaranteed safe because φ defines an inductive invariant of C[P]: taking the action allows the system to stay within the provably safe region identified by φ. Our shielding mechanism is sound due to formal verification. Because the deterministic program was synthesized using π_w as an oracle, it is expected that the shield derived from the program will not frequently interrupt the neural network's decision, allowing the combined system to perform (close-to) optimally.
In the inverted pendulum example, since the 90° bound

given as a safety constraint is rather conservative, we do not expect a well-trained neural network to violate this boundary. Indeed, in Fig. 3(a), even though the inductive invariant of the synthesized program defines a substantially smaller state space than what is permissible, in our simulation results we find that the neural policy is never interrupted by the deterministic program when governing the pendulum. Note that the non-shaded areas in Fig. 3(a), while appearing safe, presumably define states from which the trajectory of the system can be led to an unsafe state and would thus not be inductive.

Figure 3. Invariant Inference on Inverted Pendulum

Figure 4. The Framework of Neural Network Shielding

Our synthesis approach is critical to ensuring safety when the neural policy is used to predict actions in an environment different from the one used during training. Consider a neural network suitably protected by a shield that now operates safely. The effectiveness of this shield would be greatly diminished if the network had to be completely retrained from scratch whenever it was deployed in a new environment which imposes different safety constraints.
In our running example, suppose we wish to operate the inverted pendulum in an environment such as a Segway transporter in which the model is prohibited from swinging significantly and whose angular velocity must be suitably restricted. We might specify the following new constraints on state parameters to enforce these conditions:

S_u : {(η, ω) | ¬(−30 < η < 30 ∧ −30 < ω < 30)}

Because of the dependence of a neural network on the quality of training data used to build it, environment changes that deviate from assumptions made at training time could result in a costly retraining exercise, because the network must learn a new way to penalize unsafe actions that were previously safe. However, training a neural network from scratch requires substantial non-trivial effort, involving fine-grained tuning of training parameters or even network structures.

In our framework, the existing network provides a reasonable approximation to the desired behavior. To incorporate the additional constraints defined by the new environment C′, we attempt to synthesize a new deterministic program P′ for C′, a task that, based on our experience, is substantially easier to achieve than training a new neural network policy from scratch. This new program can be used to protect the original network provided that we can use the aforementioned verification approach to formally verify that C′[P′] is safe by identifying a new inductive invariant φ′. As depicted in Fig. 4, we simply build a new shield S′ that is composed of the program P′ and the safety boundary captured by φ′. The shield S′ can ensure the safety of the neural network in the environment context C′ with a strengthened safety condition, despite the fact that the neural network was trained in a different environment context C.
Fig. 3(b) depicts the new unsafe states in C′ (colored in red). It extends the unsafe range drawn in Fig. 3(a), so the inductive invariant learned there is unsafe for the new one. Our approach synthesizes a new deterministic program for which we learn a more restricted inductive invariant, depicted in green in Fig. 3(b). To characterize the effectiveness of our shielding approach, we examined 1000 simulation trajectories of this restricted version of the inverted pendulum system, each of which is comprised of 5000 simulation steps, with the safety constraints defined by C′. Without the shield S′, the pendulum entered the unsafe region S_u 41 times. All of these violations were prevented by S′. Notably, the intervention rate of S′ to interrupt the neural network's decision was extremely low. Out of a total of 5000 × 1000 decisions, we observed only 414 instances (0.00828%) where the shield interfered with (i.e., overrode) the decision made by the network.

3 Problem Setup

We model the context C of a control policy as an environment state transition system C[·] = (X, A, S, S_0, S_u, T_t[·], f, r). Note that · is intentionally left open to deploy neural control policies. Here X is a finite set of variables interpreted over the reals R, and S = R^X is the set of all valuations of the variables X. We denote s ∈ S to be an n-dimensional environment state and a ∈ A to be a control action, where A is an infinite set of m-dimensional continuous actions that a learning-enabled controller can perform. We use S_0 ⊆ S to specify a set of initial environment states and S_u ⊆ S to specify a set of unsafe environment states that a safe controller should avoid. The transition relation T_t[·] defines how one state transitions to another, given an action by an unknown policy. We assume that T_t[·] is governed by a standard differential equation f defining the relationship between a continuously varying state s(t) and action a(t) and its rate of change ṡ(t) over time t:

\[ \dot{s}(t) = f(s(t), a(t)) \]

We assume f is defined by equations of the form R^n × R^m → R^n, such as the example in Fig. 1. In the following, we often omit the time variable t for simplicity. A deterministic neural


network control policy π_w : R^n → R^m parameterized by a set of weight values w is a function mapping an environment state s to a control action a, where

\[ a = \pi_w(s) \]

By providing a feedback action, a policy can alter the rate of change of state variables to realize optimal system control.
The transition relation T_t[·] is parameterized by a control policy π that is deployed in C. We explicitly model this deployment as T_t[π]. Given a control policy π_w, we use T_t[π_w] ⊆ S × S to specify all possible state transitions allowed by the policy. We assume that a system transitions in discrete time instants kt, where k = 0, 1, 2, · · · and t is a fixed time step (t > 0). A state s transitions to a next state s′ after time t, with the assumption that a control action a(τ) at time τ is a constant over the time period τ ∈ [0, t). Using Euler's method², we discretize the continuous dynamics f with a finite difference approximation so it can be used in the discretized transition relation T_t[π_w]. Then T_t[π_w] can compute these estimates by the following difference equation:

\[ T_t[\pi_w] = \{(s, s') \mid s' = s + f(s, \pi_w(s)) \times t\} \]

²Euler's method may sometimes poorly approximate the true system transition function when f is highly nonlinear. More precise higher-order approaches such as Runge-Kutta methods exist to compensate for loss of precision in this case.

Environment Disturbance. Our model allows bounded external properties (e.g., additional environment-imposed constraints) by extending the definition of the rate of change: ṡ = f(s, a) + d, where d is a vector of random disturbances. We use d to encode environment disturbances in terms of bounded nondeterministic values. We assume that tight upper and lower bounds of d can be accurately estimated at runtime using multivariate normal distribution fitting methods.
Trajectory. A trajectory h of a state transition system C[π_w], which we denote as h ∈ C[π_w], is a sequence of states s_0, · · · , s_i, s_{i+1}, · · · where s_0 ∈ S_0 and (s_i, s_{i+1}) ∈ T_t[π_w] for all i ≥ 0. We use Ĉ ⊆ C[π_w] to denote a set of trajectories. The reward that a control policy receives on performing an action a in a state s is given by the reward function r(s, a).
Reinforcement Learning. The goal of reinforcement learning is to maximize the reward that can be collected by a neural control policy in a given environment context C. Such problems can be abstractly formulated as

\[ \max_{w \in \mathbb{R}^n} J(w), \qquad J(w) = \mathbb{E}[r(\pi_w)] \tag{3} \]

Assume that s_0, s_1, . . . , s_T is a trajectory of length T of the state transition system and

\[ r(\pi_w) = \lim_{T\to\infty} \sum_{k=0}^{T} r(s_k, \pi_w(s_k)) \]

is the cumulative reward achieved by the policy π_w from this trajectory. Thus, this formulation uses simulations of the transition system with finite-length rollouts to estimate the expected cumulative reward collected over T time steps and aims to maximize this reward. Reinforcement learning

assumes policies are expressible in some executable structure (e.g., a neural network) and allows samples generated from one policy to influence the estimates made for others. The two main approaches for reinforcement learning are value function estimation and direct policy search. We refer readers to [27] for an overview.
Safety Verification. Since reinforcement learning only considers finite-length rollouts, we wish to determine if a control policy is safe to use under an infinite time horizon. Given a state transition system, the safety verification problem is concerned with verifying that no trajectories contained in S, starting from an initial state in S_0, reach an unsafe state in S_u. However, a neural network is a representative of a class of deep and sophisticated models that challenges the capability of state-of-the-art verification techniques. This level of complexity is exacerbated in our work because we consider the long-term safety of a neural policy deployed within a nontrivial environment context C that in turn is described by a complex infinite-state transition system.
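As a concrete illustration of the environment model in this section, the sketch below rolls out a finite trajectory of C[π] under the Euler discretization described above. The pendulum dynamics (gravity, length, mass) and the fixed time step are illustrative assumptions, not the exact model of Fig. 1.

import numpy as np

DT = 0.01  # fixed time step t

def pendulum_f(state, action, g=9.8, length=1.0, mass=1.0):
    # Assumed continuous dynamics ds/dt = f(s, a) for an inverted pendulum.
    eta, omega = state
    return np.array([omega, (g / length) * np.sin(eta) + action / (mass * length ** 2)])

def euler_step(f, state, action, dt=DT):
    # One discretized transition (s, s') in T_t[pi]: s' = s + f(s, a) * t.
    return state + f(state, action) * dt

def rollout(f, policy, s0, horizon=5000, dt=DT):
    # A finite trajectory s_0, s_1, ..., s_T of C[policy].
    states = [np.asarray(s0, dtype=float)]
    for _ in range(horizon):
        s = states[-1]
        states.append(euler_step(f, s, policy(s), dt))
    return states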

4 Verification Procedure

To verify a neural network control policy π_w with respect to an environment context C, we first synthesize a deterministic policy program P from the neural policy. We require that P both (a) broadly resembles its neural oracle and (b) additionally satisfies a desired safety property when it is deployed in C. We conjecture that a safety proof of C[P] is easier to construct than that of C[π_w]. More importantly, we leverage the safety proof of C[P] to ensure the safety of C[π_w].

4.1 Synthesis

Fig. 5 defines a search space for a deterministic policy program P, where E and φ represent the basic syntax of (polynomial) program expressions and inductive invariants, respectively. Here v ranges over a universe of numerical constants, x represents variables, and ⊕ is a basic operator including + and ×. A deterministic program P also features conditional statements using φ as branching predicates.
We allow the user to define a sketch [41, 42] to describe the shape of target policy programs using the grammar in Fig. 5. We use P[θ] to represent a sketch, where θ represents unknowns that need to be filled in by the synthesis procedure. We use P_θ to represent a synthesized program with known values of θ. Similarly, the user can define a sketch of an inductive invariant φ[·] that is used (in Sec. 4.2) to learn a safety proof to verify a synthesized program P_θ.

We do not require the user to explicitly define conditional statements in a program sketch. Our synthesis algorithm uses verification counterexamples to lazily add branch predicates φ under which a program performs different computations depending on whether φ evaluates to true or false. The end user simply writes a sketch over basic expressions. For example, a sketch that defines a family of linear functions over a collection of variables can be expressed as

\[ P[\theta](X) = \text{return } \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n + \theta_{n+1} \tag{4} \]

Here X = (x_1, x_2, · · ·) are system variables, in which the coefficient constants θ = (θ_1, θ_2, · · ·) are unknown.

E ::= v | x | ⊕(E_1, . . . , E_k)
φ ::= E ≤ 0
P ::= return E | if φ then return E else P

Figure 5. Syntax of the Policy Programming Language

Algorithm 1 Synthesize (π_w, P[θ], C[·])
1: θ ← 0
2: do
3:   Sample δ from a zero-mean Gaussian
4:   Sample a set of trajectories Ĉ_1 using C[P_{θ+νδ}]
5:   Sample a set of trajectories Ĉ_2 using C[P_{θ−νδ}]
6:   θ ← θ + α[(d(π_w, P_{θ+νδ}, Ĉ_1) − d(π_w, P_{θ−νδ}, Ĉ_2)) / ν]δ
7: while θ is not converged
8: return P_θ

The goal of our synthesis algorithm is to find unknown values of θ that maximize the likelihood that P_θ resembles the neural oracle π_w while still being a safe program with respect to the environment context C:

\[ \theta^\star = \mathop{\mathrm{argmax}}_{\theta \in \mathbb{R}^{n+1}} d(\pi_w, P_\theta, C) \tag{5} \]

where d measures the distance between the outputs of the estimate program P_θ and the neural policy π_w, subject to safety constraints. To avoid the computational complexity that arises if we consider a solution to this equation analytically as an optimization problem, we instead approximate d(π_w, P_θ, C) by randomly sampling a set of trajectories Ĉ that are encountered by P_θ in the environment state transition system C[P_θ]:

\[ d(\pi_w, P_\theta, C) \approx d(\pi_w, P_\theta, \hat{C}) \ \ \text{s.t.} \ \ \hat{C} \subseteq C[P_\theta] \]

We estimate θ* using these sampled trajectories Ĉ and define

\[ d(\pi_w, P_\theta, \hat{C}) = \sum_{h \in \hat{C}} d(\pi_w, P_\theta, h) \]

Since each trajectory h ∈ Ĉ is a finite rollout s_0, . . . , s_T of length T, we have

\[ d(\pi_w, P_\theta, h) = \sum_{t=0}^{T} \begin{cases} -\lVert P_\theta(s_t) - \pi_w(s_t) \rVert & s_t \notin S_u \\ -\text{MAX} & s_t \in S_u \end{cases} \]

where ∥·∥ is a suitable norm. As illustrated in Sec. 2.2, we aim to minimize the distance between a synthesized program P_θ from a sketch space and its neural oracle along sampled trajectories encountered by P_θ in the environment context, but put a large penalty on states that are unsafe.

Random Search. We implement the idea encoded in equation (5) in Algorithm 1, which depicts the pseudocode of our policy interpretation approach. It takes as inputs a neural policy, a policy program sketch parameterized by θ, and an environment state transition system, and outputs a synthesized policy program P_θ.
An efficient way to solve equation (5) is to directly perturb θ in the search space by adding random noise and then update θ based on the effect of this perturbation [30]. We choose a direction uniformly at random on the sphere in parameter space and then optimize the goal function along this direction. To this end, in line 3 of Algorithm 1, we sample Gaussian noise δ to be added to the policy parameters θ in both directions, where ν is a small positive real number. In lines 4 and 5, we sample trajectories Ĉ_1 and Ĉ_2 from the state transition systems obtained by running the perturbed policy programs P_{θ+νδ} and P_{θ−νδ} in the environment context C.
We evaluate the proximity of these two policy programs to the neural oracle and, in line 6 of Algorithm 1, to improve the policy program, we optimize equation (5) by updating θ with a finite difference approximation along this direction:

\[ \theta \leftarrow \theta + \alpha\left[\frac{d(P_{\theta+\nu\delta}, \pi_w, \hat{C}_1) - d(P_{\theta-\nu\delta}, \pi_w, \hat{C}_2)}{\nu}\right]\delta \tag{6} \]

where α is a predefined learning rate. Such an update increment corresponds to an unbiased estimator of the gradient of θ [29, 33] for equation (5). The algorithm iteratively updates θ until convergence.
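A compact sketch of Algorithm 1 follows. Here sketch, sample_trajectories, and distance are assumed callables (instantiating P_θ from parameters, rolling out C[P_θ], and evaluating the sampled objective over those rollouts, respectively), and the fixed iteration budget stands in for the convergence test; ν and α are hyperparameters.

import numpy as np

def synthesize(neural_policy, sketch, sample_trajectories, distance,
               dim, nu=0.05, alpha=0.01, iters=1000):
    theta = np.zeros(dim)                                   # line 1: theta <- 0
    for _ in range(iters):                                  # stand-in for "until converged"
        delta = np.random.randn(dim)                        # line 3: zero-mean Gaussian direction
        p_plus, p_minus = sketch(theta + nu * delta), sketch(theta - nu * delta)
        d_plus = distance(neural_policy, p_plus, sample_trajectories(p_plus))    # line 4
        d_minus = distance(neural_policy, p_minus, sample_trajectories(p_minus)) # line 5
        theta = theta + alpha * ((d_plus - d_minus) / nu) * delta                # line 6
    return sketch(theta)

For the pendulum, sketch could instantiate the linear template of Sec. 2.2, for example lambda th: (lambda s: th[0] * s[0] + th[1] * s[1]).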

4.2 Verification

A synthesized policy program P is verified with respect to an environment context given as an infinite state transition system defined in Sec. 3, C[P] = (X, A, S, S_0, S_u, T_t[P], f, r). Our verification algorithm learns an inductive invariant φ over the transition relation T_t[P], formally proving that all possible system behaviors are encapsulated in φ, and φ is required to be disjoint with all unsafe states S_u.
We follow template-based constraint solving approaches for inductive invariant inference [19, 20, 24, 34] to discover this invariant. The basic idea is to identify a function E : R^n → R that serves as a barrier [20, 24, 34] between reachable system states (evaluated to be nonpositive by E) and unsafe states (evaluated positive by E). Using the invariant syntax in Fig. 5, the user can define an invariant sketch

\[ \varphi[c](X) = E[c](X) \le 0 \tag{7} \]

over variables X and unknown coefficients c intended to be synthesized. Fig. 5 carefully restricts an invariant sketch E[c] to be a polynomial function, as there exist efficient SMT solvers [13] and constraint solvers [26] for nonlinear polynomial reals. Formally, assume real coefficients c = (c_0, · · · , c_p) are used to parameterize E[c] in an affine manner:

\[ E[c](X) = \sum_{i=0}^{p} c_i\, b_i(X) \]


where the bi (X )rsquos are some monomials in variables X As aheuristic the user can simply determine an upper bound onthe degree of E[c] and then include all monomials whosedegrees are no greater than the bound in the sketch Largevalues of the bound enable verification of more complexsafety conditions but impose greater demands on the con-straint solver small values capture coarser safety propertiesbut are easier to solve

Example 4.1. Consider the inverted pendulum system in Sec. 2.2. To discover an inductive invariant for the transition system, the user might choose to define an upper bound of 4, which results in the following polynomial invariant sketch φ[c](η, ω) = E[c](η, ω) ≤ 0, where

\[ E[c](\eta, \omega) = c_0\eta^4 + c_1\eta^3\omega + c_2\eta^2\omega^2 + c_3\eta\omega^3 + c_4\omega^4 + c_5\eta^3 + \cdots + c_p \]

The sketch includes all monomials over η and ω whose degrees are no greater than 4. The coefficients c = [c_0, · · · , c_p] are unknown and need to be synthesized.

To synthesize these unknowns, we require that E[c] satisfy the following verification conditions:

\[ \forall s \in S_u.\ E[c](s) > 0 \tag{8} \]
\[ \forall s \in S_0.\ E[c](s) \le 0 \tag{9} \]
\[ \forall (s, s') \in T_t[P].\ E[c](s') - E[c](s) \le 0 \tag{10} \]

We claim that φ = E[c](x) ≤ 0 defines an inductive invariant because verification condition (9) ensures that any initial state s_0 ∈ S_0 satisfies φ since E[c](s_0) ≤ 0; verification condition (10) asserts that along a transition from a state s ∈ φ (so E[c](s) ≤ 0) to a resulting state s′, E[c] cannot become positive, so s′ satisfies φ as well. Finally, according to verification condition (8), φ does not include any unsafe state s_u ∈ S_u, as E[c](s_u) is positive.
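For a fixed candidate E, conditions (8)-(10) can be checked by asking an SMT solver for a counterexample to each one. The sketch below does this with Z3 on the Duffing oscillator environment used later in Example 4.3 under a linear policy; the candidate certificate E and the policy coefficients are illustrative placeholders (in our toolchain the coefficients of E come from the constraint solver), so these particular checks may well fail.

from z3 import Reals, Solver, And, Not, BoolVal, unsat

x, y = Reals('x y')
DT = 0.01
t1, t2 = 0.39, -1.41                       # assumed linear policy P(x, y) = t1*x + t2*y
a = t1 * x + t2 * y
x_next = x + y * DT                        # Euler-discretized transition T_t[P]
y_next = y + (-0.6 * y - x - x * x * x + a) * DT

def E(u, v):
    return u * u + v * v - 30              # placeholder candidate, for illustration only

S0 = And(-2.5 <= x, x <= 2.5, -2.0 <= y, y <= 2.0)
Su = Not(And(-5.0 <= x, x <= 5.0, -5.0 <= y, y <= 5.0))

def holds(premise, requirement):
    # The condition holds iff no state satisfies the premise but violates the
    # requirement (Z3 may also answer "unknown" on nonlinear constraints).
    s = Solver()
    s.add(premise, Not(requirement))
    return s.check() == unsat

print(holds(Su, E(x, y) > 0))                                   # condition (8)
print(holds(S0, E(x, y) <= 0))                                  # condition (9)
print(holds(BoolVal(True), E(x_next, y_next) - E(x, y) <= 0))   # condition (10)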

Verification conditions (8), (9), (10) are polynomials over reals. Synthesizing unknown coefficients can be left to an SMT solver [13] after universal quantifiers are eliminated using a variant of Farkas' Lemma as in [20]. However, observe that E[c] is convex³. We can gain efficiency by finding the unknown coefficients c using off-the-shelf convex constraint solvers following [34]. Encoding verification conditions (8), (9), (10) as polynomial inequalities, we search for c that can prove non-negativity of these constraints via an efficient and convex sum-of-squares programming solver [26]. Additional technical details are provided in the supplementary material [49].
Counterexample-guided Inductive Synthesis (CEGIS). Given the environment state transition system C[P] deployed with a synthesized program P, the verification approach above can compute an inductive invariant over the state transition system or tell if there is no feasible solution in the given set of candidates. Note, however, that the latter does not necessarily imply that the system is unsafe.

³For arbitrary E_1(x) and E_2(x) satisfying the verification conditions and α ∈ [0, 1], E(x) = αE_1(x) + (1 − α)E_2(x) satisfies the conditions as well.

Algorithm 2 CEGIS (π_w, P[θ], C[·])
1: policies ← ∅
2: covers ← ∅
3: while C.S_0 ⊈ covers do
4:   search s_0 such that s_0 ∈ C.S_0 and s_0 ∉ covers
5:   r* ← Diameter(C.S_0)
6:   do
7:     φ_bound ← {s | s ∈ [s_0 − r*, s_0 + r*]}
8:     C ← C where C.S_0 = (C.S_0 ∩ φ_bound)
9:     θ ← Synthesize(π_w, P[θ], C[·])
10:    φ ← Verify(C[P_θ])
11:    if φ is False then
12:      r* ← r*/2
13:    else
14:      covers ← covers ∪ {s | φ(s)}
15:      policies ← policies ∪ {(P_θ, φ)}
16:      break
17:  while True
18: end while
19: return policies

Since our goal is to learn a safe deterministic program from a neural network, we develop a counterexample-guided inductive program synthesis approach. A CEGIS algorithm in our context is challenging because safety verification is necessarily incomplete and may not be able to produce a counterexample that serves as an explanation for why a verification attempt is unsuccessful.
We solve the incompleteness challenge by leveraging the fact that we can simultaneously synthesize and verify a program. Our CEGIS approach is given in Algorithm 2. A counterexample is an initial state on which our synthesized program is not yet proved safe. Driven by these counterexamples, our algorithm synthesizes a set of programs from a sketch, along with the conditions under which we can switch from one synthesized program to another.
Algorithm 2 takes as input a neural policy π_w, a program sketch P[θ], and an environment context C. It maintains synthesized policy programs in policies (line 1), each of which is inductively verified safe in a partition of the universe state space that is maintained in covers (line 2). For soundness, the state space covered by such partitions must be able to include all initial states, which is checked in line 3 of Algorithm 2 by an SMT solver. In the algorithm, we use C.S_0 to access a field of C, such as its initial state space.
The algorithm iteratively samples a counterexample initial state s_0 that is currently not covered by covers (line 4). Since covers is empty at the beginning, this choice is uniformly random initially; we synthesize a presumably safe policy program in line 9 of Algorithm 2 that resembles the neural policy π_w, considering all possible initial states S_0 of the given environment model C, using Algorithm 1.


Figure 6. CEGIS for Verifiable Reinforcement Learning

If verification fails in line 11, our approach simply reduces the initial state space, hoping that a safe policy program is easier to synthesize if only a subset of initial states is considered. The algorithm in line 12 gradually shrinks the radius r* of the initial state space around the sampled initial state s_0. The synthesis algorithm in the next iteration synthesizes and verifies a candidate using the reduced initial state space. The main idea is that if the initial state space is shrunk to a restricted area around s_0 but a safe policy program still cannot be found, it is quite possible that either s_0 points to an unsafe initial state of the neural oracle or the sketch is not sufficiently expressive.
If verification at this stage succeeds with an inductive invariant φ, a new policy program P_θ is synthesized that can be verified safe in the state space covered by φ. We add the inductive invariant and the policy program into covers and policies in lines 14 and 15, respectively, and then move to another iteration of counterexample-guided synthesis. This iterative synthesize-and-verify process continues until the entire initial state space is covered (lines 3 to 18). The output of Algorithm 2 is [(P_θ1, φ_1), (P_θ2, φ_2), · · ·], which essentially defines conditional statements in a synthesized program, performing different actions depending on whether a specified invariant condition evaluates to true or false.
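The following is a minimal sketch of this loop. Here synthesize plays the role of Algorithm 1, verify returns an inductive invariant (a predicate on states) or None, sample_uncovered finds an initial state outside the current covers (e.g., via an SMT query), and env.diameter and env.restrict are assumed helpers for manipulating C.S_0; none of these names are part of the tool.

def cegis(neural_policy, sketch, env, synthesize, verify, sample_uncovered):
    policies, covers = [], []
    while True:
        s0 = sample_uncovered(env, covers)       # counterexample initial state
        if s0 is None:                           # C.S0 is fully covered: done
            return policies
        radius = env.diameter()
        while True:
            restricted = env.restrict(s0, radius)    # shrink S0 to a box around s0
            P = synthesize(neural_policy, sketch, restricted)
            phi = verify(restricted, P)
            if phi is None:
                radius /= 2                      # verification failed: shrink and retry
            else:
                covers.append(phi)
                policies.append((P, phi))
                break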

Theorem 4.2. If CEGIS(π_w, P[θ], C[·]) = [(P_θ1, φ_1), (P_θ2, φ_2), · · ·] (as defined in Algorithm 2), then the deterministic program

P : λX. if φ_1(X) return P_θ1(X) else if φ_2(X) return P_θ2(X) · · ·

is safe in the environment C, meaning that φ_1 ∨ φ_2 ∨ · · · is an inductive invariant of C[P] proving that C.S_u is unreachable.
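Read operationally, the output of Algorithm 2 composes into the single guarded program of Theorem 4.2. A minimal sketch of that composition (ours, not the tool's concrete representation) is:

def compose(policies):
    # Try each synthesized policy in order, guarded by its inductive invariant;
    # states outside every invariant are unreachable from C.S0 by construction.
    def P(state):
        for P_i, phi_i in policies:
            if phi_i(state):
                return P_i(state)
        raise RuntimeError("abort: state outside phi_1 or phi_2 or ...")
    return P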

Example 4.3. We illustrate the proposed counterexample-guided inductive synthesis method by means of a simple example, the Duffing oscillator [22], a nonlinear second-order environment. The transition relation of the environment system C is described with the differential equation

\[ \dot{x} = y, \qquad \dot{y} = -0.6y - x - x^3 + a \]

where x, y are the state variables and a the continuous control action given by a well-trained neural feedback control policy π, such that a = π(x, y). The control objective is to regulate the state to the origin from a set of initial states C.S_0 : {x, y | −2.5 ≤ x ≤ 2.5 ∧ −2 ≤ y ≤ 2}. To be safe, the controller must avoid a set of unsafe states C.S_u : {x, y | ¬(−5 ≤ x ≤ 5 ∧ −5 ≤ y ≤ 5)}. Given a program sketch as in the pendulum example, that is, P[θ](x, y) = θ_1x + θ_2y, the user can ask the constraint solver to reason over a small (say 4) order polynomial invariant sketch for a synthesized program as in Example 4.1.
Algorithm 2 initially samples an initial state s with x = −0.46, y = −0.36 from S_0. The inner do-while loop of the algorithm can discover a sub-region of initial states in the dotted box of Fig. 6(a) that can be leveraged to synthesize a verified deterministic policy P_1(x, y) = 0.39x − 1.41y from the sketch. We also obtain an inductive invariant showing that the synthesized policy can always maintain the controller within the invariant set drawn in purple in Fig. 6(a): φ_1 ≡ 209x^4 + 29x^3y + 14x^2y^2 + 0.4xy^3 + 296x^3 + 201x^2y + 113xy^2 + 16y^3 + 252x^2 + 392xy + 537y^2 − 680 ≤ 0.
This invariant explains why the initial state space used for verification does not include the entire C.S_0: a counterexample initial state s′ = {x = 2.249, y = 2} is not covered by the invariant, for which the synthesized policy program above is not verified safe. The CEGIS loop in line 3 of Algorithm 2 uses s′ to synthesize another deterministic policy P_2(x, y) = 0.88x − 2.34y from the sketch, whose learned inductive invariant is depicted in blue in Fig. 6(b): φ_2 ≡ 128x^4 + 0.9x^3y − 0.2x^2y^2 − 59x^3 − 15xy^2 − 0.3y^3 + 22x^2 + 47xy + 404y^2 − 619 ≤ 0. Algorithm 2 then terminates because φ_1 ∨ φ_2 covers C.S_0. Our system interprets the two synthesized deterministic policies as the following deterministic program P_oscillator using the syntax in Fig. 5:

def P_oscillator (x, y) :

    if 209x^4 + 29x^3y + 14x^2y^2 + · · · + 537y^2 − 680 ≤ 0 :   (φ_1)
        return 0.39x − 1.41y
    else if 128x^4 + 0.9x^3y − 0.2x^2y^2 + · · · + 404y^2 − 619 ≤ 0 :   (φ_2)
        return 0.88x − 2.34y
    else abort   (unsafe)

Neither of the two deterministic policies enforces safety by itself on all initial states, but they do guarantee safety when combined, because by construction φ = φ_1 ∨ φ_2 is an inductive invariant of P_oscillator in the environment C.
Although Algorithm 2 is sound, it may not terminate, as r* in the algorithm can become arbitrarily small, and there is also no restriction on the size of potential counterexamples. Nonetheless, our experimental results indicate that the algorithm performs well in practice.

4.3 Shielding

A safety proof of a synthesized deterministic program of a neural network does not automatically lift to a safety argument of the neural network from which it was derived, since the network may exhibit behaviors not captured by the simpler deterministic program. To bridge this divide, we recover soundness at runtime by monitoring system behaviors of a neural network in its environment context, using the synthesized policy program and its inductive invariant as a shield. The pseudo-code for using a shield is given in Algorithm 3.

Algorithm 3 Shield (s, C[π_w], P, φ)
1: Predict s′ such that (s, s′) ∈ C[π_w].T_t(s)
2: if φ(s′) then return π_w(s)
3: else return P(s)

In line 1 of Algorithm 3, for a current state s, we use the state transition system of our environment context model to predict the next state s′. If s′ is not within φ, we are unsure whether entering s′ would inevitably make the system unsafe, as we lack a proof that the neural oracle is safe. However, we do have a guarantee that if we follow the synthesized program P, the system will stay within the safety boundary defined by φ, which Algorithm 2 has formally proved. We do so in line 3 of Algorithm 3, using P and φ as shields, intervening only if necessary so as to keep the shield from unnecessarily interrupting the neural policy.⁴
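A minimal sketch of this runtime check follows; predict_next is an assumed one-step model of the environment (for instance, the Euler step sketched in Sec. 3) and phi is the verified inductive invariant of C[P].

def shielded_action(state, neural_policy, program, predict_next, phi):
    proposed = neural_policy(state)              # the action the network wants to take
    if phi(predict_next(state, proposed)):       # lines 1-2: look ahead under that action
        return proposed                          # next state stays inside phi: defer to pi_w
    return program(state)                        # line 3: fall back to the verified program P

Calling shielded_action at every control step implements the combined system whose safety follows from the invariant of C[P].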

5 Experimental Results

We have applied our framework on a number of challenging control- and cyberphysical-system benchmarks. We consider the utility of our approach for verifying the safety of trained neural network controllers. We use the deep policy gradient algorithm [28] for neural network training, the Z3 SMT solver [13] to check convergence of the CEGIS loop, and the Mosek constraint solver [6] to generate inductive invariants of synthesized programs from a sketch. All of our benchmarks are verified using the program sketch defined in equation (4) and the invariant sketch defined in equation (7). We report simulation results on our benchmarks over 1000 runs (each run consists of 5000 simulation steps). Each simulation time step is fixed at 0.01 second. Our experiments were conducted on a standard desktop machine consisting of Intel(R) Core(TM) i7-8700 CPU cores and 64GB memory.
Case Study on Inverted Pendulum. We first give more details on the evaluation result of our running example, the inverted pendulum. Here we consider a more restricted safety condition that deems the system to be unsafe when the pendulum's angle is more than 23° from the origin (i.e., significant swings are further prohibited). The controller is a 2-hidden-layer neural model (240 × 200). Our tool interprets the neural network as a program containing three conditional branches:

def P (η, ω) :
    if 17533η^4 + 13732η^3ω + 3831η^2ω^2 − 5472ηω^3 + 8579ω^4 + 6813η^3 + 9634η^2ω + 3947ηω^2 − 120ω^3 + 1928η^2 + 1915ηω + 1104ω^2 − 313 ≤ 0 :
        return −1728176866η − 1009441768ω
    else if 2485η^4 + 826η^3ω − 351η^2ω^2 + 581ηω^3 + 2579ω^4 + 591η^3 + 9η^2ω + 243ηω^2 − 189ω^3 + 484η^2 + 170ηω + 287ω^2 − 82 ≤ 0 :
        return −1734281984η − 1073944835ω
    else if 115496η^4 + 64763η^3ω + 85376η^2ω^2 + 21365ηω^3 + 7661ω^4 − 111271η^3 − 54416η^2ω − 66684ηω^2 − 8742ω^3 + 33701η^2 + 11736ηω + 12503ω^2 − 1185 ≤ 0 :
        return −2578835525η − 1625056971ω
    else abort

⁴We also extended our approach to synthesize deterministic programs which can guarantee stability; see the supplementary material [49].

Over 1000 simulation runs, we found 60 unsafe cases when running the neural controller alone. Importantly, running the neural controller in tandem with the above verified and synthesized program prevented all unsafe neural decisions. There were only 65 interventions from the program on the neural network. Our results demonstrate that CEGIS is important to ensure a synthesized program is safe to use. We report more details, including synthesis times, below.
One may ask why we do not directly learn a deterministic program to control the device (without appealing to the neural policy at all), using reinforcement learning to synthesize its unknown parameters. Even for an example as simple as the inverted pendulum, our discussion above shows that it is difficult (if not impossible) to derive a single straight-line (linear) program that is safe to control the system. Even for scenarios in which a straight-line program suffices, using existing RL methods to directly learn unknown parameters in our sketches may still fail to obtain a safe program. For example, we considered whether a linear control policy can be learned to (1) prevent the pendulum from falling down, and (2) require a pendulum control action to be strictly within the range [−1, 1] (e.g., operating the controller in an environment with low power constraints). We found that, despite many experiments on tuning learning rates and rewards, directly training a linear control program to conform to this restriction with either reinforcement learning (e.g., policy gradient) or random search [29] was unsuccessful because of undesirable overfitting. In contrast, neural networks work much better for these RL algorithms and can be used to guide the synthesis of a deterministic program policy. Indeed, by treating the neural policy as an oracle, we were able to quickly discover a straight-line linear deterministic program that in fact satisfies this additional motion constraint.

Safety Verification. Our verification results are given in Table 1. In the table, Vars represents the number of variables in a control system (this number serves as a proxy for application complexity); Size, the number of neurons in hidden layers of the network; Training, the training time for the network; and Failures, the number of times the network failed to satisfy the safety property in simulation. The table also gives the Size of a synthesized program in terms of the number of policies found by Algorithm 2 (used to generate conditional statements in the program), its Synthesis time, the Overhead of our approach in terms of the additional cost (compared to the non-shielded variant) in running time to use a shield, and the number of Interventions, the number of times the shield was invoked across all simulation runs. We also report the performance gap between a shielded neural policy (NN) and a purely programmatic policy (Program) in terms of the number of steps, on average, that a controlled system spends in reaching a steady state (i.e., a convergence state).

Table 1. Experimental Results on Deterministic Program Synthesis, Verification, and Shielding

Benchmarks | Vars | NN Size | Training | Failures | Program Size | Synthesis | Overhead | Interventions | NN | Program
Satellite | 2 | 240 × 200 | 957s | 0 | 1 | 160s | 337 | 0 | 57 | 97
DCMotor | 3 | 240 × 200 | 944s | 0 | 1 | 68s | 203 | 0 | 119 | 122
Tape | 3 | 240 × 200 | 980s | 0 | 1 | 42s | 263 | 0 | 30 | 36
Magnetic Pointer | 3 | 240 × 200 | 992s | 0 | 1 | 85s | 292 | 0 | 83 | 88
Suspension | 4 | 240 × 200 | 960s | 0 | 1 | 41s | 871 | 0 | 47 | 61
Biology | 3 | 240 × 200 | 978s | 0 | 1 | 168s | 523 | 0 | 2464 | 2599
DataCenter Cooling | 3 | 240 × 200 | 968s | 0 | 1 | 168s | 469 | 0 | 146 | 401
Quadcopter | 2 | 300 × 200 | 990s | 182 | 2 | 67s | 641 | 185 | 72 | 98
Pendulum | 2 | 240 × 200 | 962s | 60 | 3 | 1107s | 965 | 65 | 442 | 586
CartPole | 4 | 300 × 200 | 990s | 47 | 4 | 998s | 562 | 1799 | 6813 | 19126
Self-Driving | 4 | 300 × 200 | 990s | 61 | 1 | 185s | 466 | 236 | 1459 | 5136
Lane Keeping | 4 | 240 × 200 | 895s | 36 | 1 | 183s | 865 | 64 | 3753 | 6435
4-Car platoon | 8 | 500 × 400 × 300 | 1160s | 8 | 4 | 609s | 317 | 8 | 76 | 96
8-Car platoon | 16 | 500 × 400 × 300 | 1165s | 40 | 1 | 1217s | 605 | 1080 | 385 | 554
Oscillator | 18 | 240 × 200 | 1023s | 371 | 1 | 618s | 2131 | 93703 | 6935 | 11353

The first five benchmarks are linear time-invariant control systems adapted from [15]. The safety property is that the reach set has to be within a safe rectangle. Benchmark Biology defines a minimal model of glycemic control in diabetic patients, such that the dynamics of glucose and insulin interaction in the blood system are defined by polynomials [10]. For safety, we verify that the neural controller ensures that the level of plasma glucose concentration is above a certain threshold. Benchmark DataCenter Cooling is a model of a collection of three server racks, each with their own cooling devices; the racks also shed heat to their neighbors. The safety property is that a learned controller must keep the data center below a certain temperature. In these benchmarks, the cost to query the network oracle constitutes the dominant time for generating the safety shield. Given the simplicity of these benchmarks, the neural network controllers did not violate the safety condition in our trials, and moreover, there were no interventions from the safety shields that affected performance.
The next three benchmarks, Quadcopter, (Inverted) Pendulum, and Cartpole, are selected from classic control applications and have more sophisticated safety conditions. We have discussed the inverted pendulum example at length earlier. The Quadcopter environment tests whether a controlled quadcopter can realize stable flight. The environment of Cartpole consists of a pole attached to an unactuated joint connected to a cart that moves along a frictionless track. The system is unsafe when the pole's angle is more than 30° from being upright or the cart moves by more than 0.3 meters from the origin. We observed safety violations in each of these benchmarks that were eliminated using our verification methodology. Notably, the number of interventions is remarkably low as a percentage of the overall number of simulation steps.
Benchmark Self-driving defines a single car navigation problem. The neural controller is responsible for preventing the car from veering into canals found on either side of the road. Benchmark Lane Keeping models another safety-related car-driving problem. The neural controller aims to maintain a vehicle between lane markers and keep it centered in a possibly curved lane. The curvature of the road is considered as a disturbance input. Environment disturbances of this kind can be conservatively specified in our model, accounting for noise and unmodeled dynamics. Our verification approach supports these disturbances (verification condition (10)). Benchmarks n-Car platoon model multiple (n) vehicles forming a platoon, maintaining a safe relative distance among one another [39]. Each of these benchmarks exhibited some number of violations that were remediated by our verification methodology. Benchmark Oscillator consists of a two-dimensional switched oscillator plus a 16-order filter. The filter smoothens the input signals and has a single output signal. We verify that the output signal is below a safe threshold. Because of the model complexity of this benchmark, it exhibited significantly more violations than the others. Indeed, the neural-network controlled system often oscillated between the safe and unsafe boundary in many runs. Consequently, the overhead in this benchmark is high, because a large number of shield interventions was required to ensure safety. In other words, the synthesized


shield trades performance for safety to guarantee that the threshold boundary is never violated.
For all benchmarks, our tool successfully generated safe, interpretable deterministic programs and inductive invariants as shields. When a neural controller takes an unsafe action, the synthesized shield correctly prevents this action from executing by providing an alternative, provably safe action proposed by the verified deterministic program. In terms of performance, Table 1 shows that a shielded neural policy is a feasible approach to drive a controlled system into a steady state. For each of the benchmarks studied, the programmatic policy is less performant than the shielded neural policy, sometimes by a factor of two or more (e.g., Cartpole, Self-Driving, and Oscillator). Our result demonstrates that executing a neural policy in tandem with a program distilled from it can retain the performance provided by the neural policy while maintaining the safety provided by the verified program.
Although our synthesis algorithm does not guarantee convergence to the global minimum when applied to nonconvex RL problems, the results given in Table 1 indicate that our algorithm can often produce high-quality control programs with respect to a provided sketch that converge reasonably fast. In some of our benchmarks, however, the number of interventions is significantly higher than the number of neural controller failures, e.g., 8-Car platoon and Oscillator. However, the high number of interventions is not primarily because of non-optimality of the synthesized programmatic controller. Instead, inherent safety issues in the neural network models are the main culprit that triggers shield interventions. In 8-Car platoon, after corrections made by our deterministic program, the neural model again takes an unsafe action in the next execution step, so that the deterministic program has to make another correction. It is only after applying a number of such shield interventions that the system navigates into a part of the state space that can be safely operated on by the neural model. For this benchmark, all system states where shield interventions occur are indeed unsafe. We also examined the unsafe simulation runs made by executing the neural controller alone in Oscillator. Among the large number of shield interventions (as reflected in Table 1), 74% of them are effective and indeed prevent an unsafe neural decision. Ineffective interventions in Oscillator are due to the fact that, when optimizing equation (5), a large penalty is given to unsafe states, causing the synthesized programmatic policy to weigh safety more than proximity when there exist a large number of unsafe neural decisions.
Suitable Sketches. Providing a suitable sketch may need domain knowledge. To help the user more easily tune the shape of a sketch, our approach provides algorithmic support by not requiring conditional statements in a program sketch and by offering syntactic sugar, i.e., the user can simply provide an upper bound on the degree of an invariant sketch.

Table 2. Experimental Results on Tuning Invariant Degrees. TO means that an adequate inductive invariant cannot be found within 2 hours.

Benchmarks | Degree | Verification | Interventions | Overhead
Pendulum | 2 | TO | - | -
Pendulum | 4 | 226s | 40542 | 782
Pendulum | 8 | 236s | 30787 | 879
Self-Driving | 2 | TO | - | -
Self-Driving | 4 | 24s | 128851 | 697
Self-Driving | 8 | 251s | 123671 | 2685
8-Car-platoon | 2 | 1729s | 43952 | 836
8-Car-platoon | 4 | 5402s | 37990 | 919
8-Car-platoon | 8 | TO | - | -

Our experimental results are collected using the invariant sketch defined in equation (7), and we chose an upper bound of 4 on the degree of all monomials included in the sketch. Recall that invariants may also be used as conditional predicates as part of a synthesized program. We adjust the invariant degree upper bound to evaluate its effect in our synthesis procedure. The results are given in Table 2.
Generally, high-degree invariants lead to fewer interventions because they tend to be more permissive than low-degree ones. However, high-degree invariants take more time to synthesize and verify. This is particularly true for high-dimension models such as 8-Car platoon. Moreover, although high-degree invariants tend to have fewer interventions, they have larger overhead. For example, using a shield of degree 8 in the Self-Driving benchmark caused an overhead of 2685. This is because high-degree polynomial computations are time-consuming. On the other hand, an insufficient degree upper bound may not be permissive enough to obtain a valid invariant. It is therefore essential to consider the tradeoff between overhead and permissiveness when choosing the degree of an invariant.

Handling Environment Changes. We consider the effectiveness of our tool when previously trained neural network controllers are deployed in environment contexts different from the environment used for training. Here we consider neural network controllers of larger size (two hidden layers with 1200 × 900 neurons) than in the above experiments. This is because we want to ensure that a neural policy is trained to be near optimal in the environment context used for training. These larger networks were in general more difficult to train, requiring at least 1500 seconds to converge.
Our results are summarized in Table 3. When the underlying environment slightly changes, learning a new safety shield takes substantially shorter time than training a new network. For Cartpole, we simulated the trained controller in a new environment by increasing the length of the pole by 0.15 meters. The neural controller failed 3 times in our 1000-episode simulation; the shield interfered with the network operation only 8 times to prevent these unsafe behaviors.


Table 3. Experimental Results on Handling Environment Changes

Benchmarks | Environment Change | NN Size | Failures | Program Size | Synthesis | Overhead | Shield Interventions
Cartpole | Increased pole length by 0.15m | 1200 × 900 | 3 | 1 | 239s | 291 | 8
Pendulum | Increased pendulum mass by 0.3kg | 1200 × 900 | 77 | 1 | 581s | 811 | 8748
Pendulum | Increased pendulum length by 0.15m | 1200 × 900 | 76 | 1 | 483s | 653 | 7060
Self-driving | Added an obstacle that must be avoided | 1200 × 900 | 203 | 1 | 392s | 842 | 108320

The new shield was synthesized in 239s, significantly faster than retraining a new neural network for the new environment. For (Inverted) Pendulum, we deployed the trained neural network in an environment in which the pendulum's mass is increased by 0.3kg. The neural controller exhibits noticeably higher failure rates than in the previous experiment; we were able to synthesize a safety shield adapted to this new environment in 581 seconds that prevented these violations. The shield intervened with the operation of the network only 8.7 times per episode on average. Similar results were observed when we increased the pendulum's length by 0.15m. For Self-driving, we additionally required the car to avoid an obstacle. The synthesized shield provided safe actions to ensure collision-free motion.

6 Related Work

Verification of Neural Networks. While neural networks have historically found only limited application in safety- and security-related contexts, recent advances have defined suitable verification methodologies that are capable of providing stronger assurance guarantees. For example, Reluplex [23] is an SMT solver that supports linear real arithmetic with ReLU constraints and has been used to verify safety properties in a collision avoidance system. AI2 [17, 31] is an abstract interpretation tool that can reason about robustness specifications for deep convolutional networks. Robustness verification is also considered in [8, 21]. Systematic testing techniques such as [43, 45, 48] are designed to automatically generate test cases to increase a coverage metric, i.e., explore different parts of a neural network architecture by generating test inputs that maximize the number of activated neurons. These approaches ignore effects induced by the actual environment in which the network is deployed. They are typically ineffective for safety verification of a neural network controller over an infinite time horizon with complex system dynamics. Unlike these whitebox efforts that reason over the architecture of the network, our verification framework is fully blackbox, using a network only as an oracle to learn a simpler deterministic program. This gives us the ability to reason about safety entirely in terms of the synthesized program, using a shielding mechanism derived from the program to ensure that the neural policy can only explore safe regions.
Safe Reinforcement Learning. The machine learning community has explored various techniques to develop safe reinforcement learning algorithms in contexts similar to ours, e.g., [32, 46]. In some cases, verification is used as a safety validation oracle [2, 11]. In contrast to these efforts, our inductive synthesis algorithm can be freely incorporated into any existing reinforcement learning framework and applied to any legacy machine learning model. More importantly, our design allows us to synthesize new shields from existing ones without requiring retraining of the network under reasonable changes to the environment.
Syntax-Guided Synthesis for Machine Learning. An important reason underlying the success of program synthesis is that sketches of the desired program [41, 42], often provided along with example data, can be used to effectively restrict a search space and allow users to provide additional insight about the desired output [5]. Our sketch-based inductive synthesis approach is a realization of this idea, applicable to continuous and infinite-state control systems central to many machine learning tasks. A high-level policy language grammar is used in our approach to constrain the shape of possible synthesized deterministic policy programs. Our program parameters in sketches are continuous because they naturally fit continuous control problems. However, our synthesis algorithm (line 9 of Algorithm 2) can call a derivative-free random search method (which does not require the gradient or its approximative finite difference) for synthesizing functions whose domain is disconnected, (mixed-)integer, or non-smooth. Our synthesizer thus may be applied to other domains that are not continuous or differentiable, like most modern program synthesis algorithms [41, 42].
In a similar vein, recent efforts on interpretable machine learning generate interpretable models such as program code [47] or decision trees [9] as output, after which traditional symbolic verification techniques can be leveraged to prove program properties. Our approach novelly ensures that only safe programs are synthesized via a CEGIS procedure and can provide safety guarantees on the original high-performing neural networks via invariant inference. A detailed comparison was provided on page 3.
Controller Synthesis. Random search [30] has long been utilized as an effective approach for controller synthesis. In [29], several extensions of random search have been proposed that substantially improve its performance when applied to reinforcement learning. However, [29] learns a single


linear controller that, in our experience, we found to be insufficient to guarantee safety for our benchmarks. We often need to learn a program that involves a family of linear controllers conditioned on different input spaces to guarantee safety. The novelty of our approach is that it can automatically synthesize programs with conditional statements by need, which is critical to ensure safety.
In Sec. 5, linear sketches were used in our experiments for shield synthesis. LQR-Tree based approaches such as [44] can synthesize a controller consisting of a set of linear quadratic regulator (LQR [14]) controllers, each applicable in a different portion of the state space. These approaches focus on stability, while our approach primarily addresses safety. In our experiments, we observed that because LQR does not take safe/unsafe regions into consideration, synthesized LQR controllers can regularly violate safety constraints.
Runtime Shielding. Shield synthesis has been used in runtime enforcement for reactive systems [12, 25]. Safety shielding in reinforcement learning was first introduced in [4]. Because their verification approach is not symbolic, however, it can only work over finite discrete state and action systems. Since states in infinite-state and continuous-action systems are not enumerable, using these methods requires the end user to provide a finite abstraction of complex environment dynamics; such abstractions are typically too coarse to be useful (because they result in too much over-approximation) or otherwise have too many states to be analyzable [15]. Indeed, automated environment abstraction tools such as [16] often take hours even for simple 4-dimensional systems. Our approach embraces the nature of infinity in control systems by learning a symbolic shield from an inductive invariant for a synthesized program that includes an infinite number of environment states under which the program can be guaranteed to be safe. We therefore believe our framework provides a more promising verification pathway for machine learning in high-dimensional control systems. In general, these efforts focused on shielding of discrete and finite systems, with no obvious generalization to effectively deal with the continuous and infinite systems that our approach can monitor and protect. Shielding in continuous control systems was introduced in [3, 18]. These approaches are based on HJ reachability (Hamilton-Jacobi Reachability [7]). However, current formulations of HJ reachability are limited to systems with approximately five dimensions, making high-dimensional system verification intractable. Unlike their least restrictive control law framework that decouples safety and learning, our approach only considers state spaces that a learned controller attempts to explore and is therefore capable of synthesizing safety shields even in high dimensions.

7 Conclusions

This paper presents an inductive synthesis-based toolchain that can verify neural network control policies within an environment formalized as an infinite-state transition system. The key idea is a novel synthesis framework capable of synthesizing a deterministic policy program, based on a user-given sketch, that resembles a neural policy oracle and simultaneously satisfies a safety specification, using a counterexample-guided synthesis loop. Verification soundness is achieved at runtime by using the learned deterministic program, along with its learned inductive invariant, as a shield to protect the neural policy, correcting the neural policy's action only if the chosen action can cause violation of the inductive invariant. Experimental results demonstrate that our approach can be used to realize fully trustworthy reinforcement learning systems with low overhead.

References[1] M Ahmadi C Rowley and U Topcu 2018 Control-Oriented Learning

of Lagrangian and Hamiltonian Systems In American Control Confer-ence

[2] Anayo K Akametalu Shahab Kaynama Jaime F Fisac Melanie NicoleZeilinger Jeremy H Gillula and Claire J Tomlin 2014 Reachability-based safe learning with Gaussian processes In 53rd IEEE Conferenceon Decision and Control CDC 2014 1424ndash1431

[3] Anayo K Akametalu Shahab Kaynama Jaime F Fisac Melanie NicoleZeilinger Jeremy H Gillula and Claire J Tomlin 2014 Reachability-based safe learning with Gaussian processes In 53rd IEEE Conferenceon Decision and Control CDC 2014 Los Angeles CA USA December15-17 2014 1424ndash1431

[4] Mohammed Alshiekh Roderick Bloem Ruumldiger Ehlers BettinaKoumlnighofer Scott Niekum and Ufuk Topcu 2018 Safe Reinforce-ment Learning via Shielding AAAI (2018)

[5] Rajeev Alur Rastislav Bodiacutek Eric Dallal Dana Fisman Pranav GargGarvit Juniwal Hadas Kress-Gazit P Madhusudan Milo M K MartinMukund Raghothaman Shambwaditya Saha Sanjit A Seshia RishabhSingh Armando Solar-Lezama Emina Torlak and Abhishek Udupa2015 Syntax-Guided Synthesis In Dependable Software Systems Engi-neering NATO Science for Peace and Security Series D Informationand Communication Security Vol 40 IOS Press 1ndash25

[6] MOSEK ApS. 2018. The MOSEK optimization software. http://www.mosek.com
[7] Somil Bansal, Mo Chen, Sylvia Herbert, and Claire J. Tomlin. 2017. Hamilton-Jacobi Reachability: A Brief Overview and Recent Advances. (2017).
[8] Osbert Bastani, Yani Ioannou, Leonidas Lampropoulos, Dimitrios Vytiniotis, Aditya V. Nori, and Antonio Criminisi. 2016. Measuring Neural Net Robustness with Constraints. In Proceedings of the 30th International Conference on Neural Information Processing Systems (NIPS'16), 2621–2629.
[9] Osbert Bastani, Yewen Pu, and Armando Solar-Lezama. 2018. Verifiable Reinforcement Learning via Policy Extraction. CoRR abs/1805.08328 (2018).
[10] Richard N. Bergman, Diane T. Finegood, and Marilyn Ader. 1985. Assessment of insulin sensitivity in vivo. Endocrine Reviews 6, 1 (1985), 45–86.
[11] Felix Berkenkamp, Matteo Turchetta, Angela P. Schoellig, and Andreas Krause. 2017. Safe Model-based Reinforcement Learning with Stability Guarantees. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 908–919.

[12] Roderick Bloem, Bettina Könighofer, Robert Könighofer, and Chao Wang. 2015. Shield Synthesis - Runtime Enforcement for Reactive Systems. In Tools and Algorithms for the Construction and Analysis of Systems - 21st International Conference, TACAS 2015, 533–548.
[13] Leonardo De Moura and Nikolaj Bjørner. 2008. Z3: An Efficient SMT Solver. In Proceedings of the Theory and Practice of Software, 14th International Conference on Tools and Algorithms for the Construction and Analysis of Systems (TACAS'08/ETAPS'08), 337–340.
[14] Peter Dorato, Vito Cerone, and Chaouki Abdallah. 1994. Linear-Quadratic Control: An Introduction. Simon & Schuster, Inc., New York, NY, USA.
[15] Chuchu Fan, Umang Mathur, Sayan Mitra, and Mahesh Viswanathan. 2018. Controller Synthesis Made Real: Reach-Avoid Specifications and Linear Dynamics. In Computer Aided Verification - 30th International Conference, CAV 2018, 347–366.

[16] Ioannis Filippidis, Sumanth Dathathri, Scott C. Livingston, Necmiye Ozay, and Richard M. Murray. 2016. Control design for hybrid systems with TuLiP: The Temporal Logic Planning toolbox. In 2016 IEEE Conference on Control Applications, CCA 2016, 1030–1041.
[17] Timon Gehr, Matthew Mirman, Dana Drachsler-Cohen, Petar Tsankov, Swarat Chaudhuri, and Martin T. Vechev. 2018. AI2: Safety and Robustness Certification of Neural Networks with Abstract Interpretation. In 2018 IEEE Symposium on Security and Privacy, SP 2018, 3–18.
[18] Jeremy H. Gillula and Claire J. Tomlin. 2012. Guaranteed Safe Online Learning via Reachability: tracking a ground target using a quadrotor. In IEEE International Conference on Robotics and Automation, ICRA 2012, 14-18 May 2012, St. Paul, Minnesota, USA, 2723–2730.
[19] Sumit Gulwani, Saurabh Srivastava, and Ramarathnam Venkatesan. 2008. Program Analysis As Constraint Solving. In Proceedings of the 29th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '08), 281–292.
[20] Sumit Gulwani and Ashish Tiwari. 2008. Constraint-based approach for analysis of hybrid systems. In International Conference on Computer Aided Verification, 190–203.
[21] Xiaowei Huang, Marta Kwiatkowska, Sen Wang, and Min Wu. 2017. Safety Verification of Deep Neural Networks. In CAV (1) (Lecture Notes in Computer Science), Vol. 10426, 3–29.

[22] Dominic W. Jordan and Peter Smith. 1987. Nonlinear ordinary differential equations. (1987).
[23] Guy Katz, Clark W. Barrett, David L. Dill, Kyle Julian, and Mykel J. Kochenderfer. 2017. Reluplex: An Efficient SMT Solver for Verifying Deep Neural Networks. In Computer Aided Verification - 29th International Conference, CAV 2017, 97–117.
[24] Hui Kong, Fei He, Xiaoyu Song, William N. N. Hung, and Ming Gu. 2013. Exponential-Condition-Based Barrier Certificate Generation for Safety Verification of Hybrid Systems. In Proceedings of the 25th International Conference on Computer Aided Verification (CAV'13), 242–257.
[25] Bettina Könighofer, Mohammed Alshiekh, Roderick Bloem, Laura Humphrey, Robert Könighofer, Ufuk Topcu, and Chao Wang. 2017. Shield Synthesis. Form. Methods Syst. Des. 51, 2 (Nov. 2017), 332–361.
[26] Benoît Legat, Chris Coey, Robin Deits, Joey Huchette, and Amelia Perry. 2017 (accessed Nov 16, 2018). Sum-of-squares optimization in Julia. https://www.juliaopt.org/meetings/mit2017/legat.pdf
[27] Yuxi Li. 2017. Deep Reinforcement Learning: An Overview. CoRR abs/1701.07274 (2017).

[28] Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. 2016. Continuous control with deep reinforcement learning. In 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings.
[29] Horia Mania, Aurelia Guy, and Benjamin Recht. 2018. Simple random search of static linear policies is competitive for reinforcement learning. In Advances in Neural Information Processing Systems 31, 1800–1809.
[30] J. Matyas. 1965. Random optimization. Automation and Remote Control 26, 2 (1965), 246–253.
[31] Matthew Mirman, Timon Gehr, and Martin T. Vechev. 2018. Differentiable Abstract Interpretation for Provably Robust Neural Networks. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, 3575–3583.
[32] Teodor Mihai Moldovan and Pieter Abbeel. 2012. Safe Exploration in Markov Decision Processes. In Proceedings of the 29th International Conference on Machine Learning (ICML'12), 1451–1458.
[33] Yurii Nesterov and Vladimir Spokoiny. 2017. Random Gradient-Free Minimization of Convex Functions. Found. Comput. Math. 17, 2 (2017), 527–566.

[34] Stephen Prajna and Ali Jadbabaie. 2004. Safety Verification of Hybrid Systems Using Barrier Certificates. In Hybrid Systems: Computation and Control, 7th International Workshop, HSCC 2004, 477–492.
[35] Aravind Rajeswaran, Kendall Lowrey, Emanuel Todorov, and Sham M. Kakade. 2017. Towards Generalization and Simplicity in Continuous Control. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 6553–6564.
[36] Stéphane Ross, Geoffrey J. Gordon, and Drew Bagnell. 2011. A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2011, 627–635.
[37] Wenjie Ruan, Xiaowei Huang, and Marta Kwiatkowska. 2018. Reachability Analysis of Deep Neural Networks with Provable Guarantees. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI 2018, 2651–2659.
[38] Stefan Schaal. 1999. Is imitation learning the route to humanoid robots? Trends in Cognitive Sciences 3, 6 (1999), 233–242.
[39] Bastian Schürmann and Matthias Althoff. 2017. Optimal control of sets of solutions to formally guarantee constraints of disturbed linear systems. In 2017 American Control Conference, ACC 2017, 2522–2529.

[40] David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Riedmiller. 2014. Deterministic Policy Gradient Algorithms. In Proceedings of the 31st International Conference on Machine Learning - Volume 32, I-387–I-395.
[41] Armando Solar-Lezama. 2008. Program Synthesis by Sketching. Ph.D. Dissertation. Berkeley, CA, USA. Advisor(s): Rastislav Bodik.
[42] Armando Solar-Lezama. 2009. The Sketching Approach to Program Synthesis. In Proceedings of the 7th Asian Symposium on Programming Languages and Systems (APLAS '09), 4–13.
[43] Youcheng Sun, Min Wu, Wenjie Ruan, Xiaowei Huang, Marta Kwiatkowska, and Daniel Kroening. 2018. Concolic Testing for Deep Neural Networks. In Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering (ASE 2018), 109–119.
[44] Russ Tedrake. 2009. LQR-trees: Feedback motion planning on sparse randomized trees. In Robotics: Science and Systems V, University of Washington, Seattle, USA, June 28 - July 1, 2009.
[45] Yuchi Tian, Kexin Pei, Suman Jana, and Baishakhi Ray. 2018. DeepTest: Automated Testing of Deep-neural-network-driven Autonomous Cars. In Proceedings of the 40th International Conference on Software Engineering (ICSE '18), 303–314.

[46] Matteo Turchetta, Felix Berkenkamp, and Andreas Krause. 2016. Safe Exploration in Finite Markov Decision Processes with Gaussian Processes. In Proceedings of the 30th International Conference on Neural Information Processing Systems (NIPS'16), 4312–4320.
[47] Abhinav Verma, Vijayaraghavan Murali, Rishabh Singh, Pushmeet Kohli, and Swarat Chaudhuri. 2018. Programmatically Interpretable Reinforcement Learning. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, 5052–5061.
[48] Matthew Wicker, Xiaowei Huang, and Marta Kwiatkowska. 2018. Feature-Guided Black-Box Safety Testing of Deep Neural Networks. In Tools and Algorithms for the Construction and Analysis of Systems - 24th International Conference, TACAS 2018, 408–426.
[49] He Zhu, Zikang Xiong, Stephen Magill, and Suresh Jagannathan. 2018. An Inductive Synthesis Framework for Verifiable Reinforcement Learning. Technical Report, Galois, Inc. http://herowanzhu.github.io/herowanzhu.github.io/VerifiableLearning.pdf



def P[θ1, θ2](η, ω) : return θ1·η + θ2·ω

Here θ = [θ1, θ2] are unknown parameters that need to be synthesized. Intuitively, the program weights the importance of η and ω at a particular state to provide a feedback control action that mitigates the deviation of the inverted pendulum from (η = 0, ω = 0). Our search-based synthesis sets θ to 0 initially. It runs

the deterministic program Pθ, instead of the oracle neural policy πw, within the environment C defined in Fig. 1 (in this case the state transition system represents the differential equation specification of the controller) to collect a batch of trajectories. A run of the state transition system of C[Pθ] produces a finite trajectory s0, s1, · · · , sT. We find θ from the following optimization task that realizes (1):

    max_{θ∈R²}  E[ Σ_{t=0}^{T} d(πw, Pθ, st) ]        (2)

where

    d(πw, Pθ, st) ≡ −(Pθ(st) − πw(st))²  if st ∉ Su,   and   d(πw, Pθ, st) ≡ −MAX  if st ∈ Su.

This equation aims to search for a program Pθ at minimal distance from the neural oracle πw along sampled trajectories, while simultaneously maximizing the likelihood that Pθ is safe.

Our synthesis procedure, described in Sec. 4.1, is a random search-based optimization algorithm [30]. We iteratively sample a new position of θ from a hypersphere of a given small radius surrounding the current position of θ and move to the new position (w.r.t. a learning rate) as dictated by Equation (2). For the running example, our search synthesizes:

def P(η, ω) : return −1.205·η − 5.87·ω

The synthesized program can be used to intuitively interpret how the neural oracle works. For example, if a pendulum with a positive angle η > 0 leans towards the right (ω > 0), the controller will need to generate a large negative control action to force the pendulum to move left.

Verification. Since our method synthesizes a deterministic program P, we can leverage off-the-shelf formal verification algorithms to verify its safety with respect to the state transition system C defined in Fig. 1. To ensure that P is safe, we must ascertain that it can never transition to an unsafe state, i.e., a state that causes the pendulum to fall. When framed as a formal verification problem, answering such a question is tantamount to discovering an inductive invariant φ that represents all safe states over the state transition system:

1. Safe: φ is disjoint with all unsafe states Su.
2. Init: φ includes all initial states S0.
3. Induction: Any state in φ transitions to another state in φ and hence cannot reach an unsafe state.

Inspired by template-based constraint-solving approaches on inductive invariant inference [19, 20, 24, 34], the verification algorithm described in Sec. 4.2 uses a constraint solver to look for an inductive invariant in the form of a convex barrier certificate [34], E(s) ≤ 0, that maps all the states in the (safe) reachable set to non-positive reals and all the states in the unsafe set to positive reals. The basic idea is to identify a polynomial function E : Rⁿ → R such that 1) E(s) > 0 for any state s ∈ Su; 2) E(s) ≤ 0 for any state s ∈ S0; and 3) E(s′) − E(s) ≤ 0 for any state s that transitions to s′ in the state transition system C[P]. The second and third conditions collectively guarantee that E(s) ≤ 0 for any state s in the reachable set, thus implying that an unsafe state in Su can never be reached.

Fig. 3(a) draws the discovered invariant in blue for C[P],

given the initial and unsafe states, where P is the synthesized program for the inverted pendulum system. We can conclude that the safety property is satisfied by the P-controlled system, as all reachable states do not overlap with unsafe states. In case verification fails, our approach conducts a counterexample-guided loop (Sec. 4.2) to iteratively synthesize safe deterministic programs until verification succeeds.

Shielding. Keep in mind that a safety proof of a reduced deterministic program of a neural network does not automatically lift to a safety argument of the neural network from which it was derived, since the network may exhibit behaviors not fully captured by the simpler deterministic program. To bridge this divide, we propose to recover soundness at runtime by monitoring system behaviors of a neural network in its actual environment (deployment) context. Fig. 4 depicts our runtime shielding approach, with more

details given in Sec. 4.3. The inductive invariant φ learnt for a deterministic program P of a neural network πw can serve as a shield to protect πw at runtime under the environment context and safety property used to synthesize P. An observation about a current state is sent to both πw and the shield. A high-performing neural network is allowed to take any actions it feels are optimal, as long as the next state it proposes is still within the safety boundary formally characterized by the inductive invariant of C[P]. Whenever a neural policy proposes an action that steers its controlled system out of the state space defined by the inductive invariant we have learned as part of deterministic program synthesis, our shield can instead take a safe action proposed by the deterministic program P. The action given by P is guaranteed safe because φ defines an inductive invariant of C[P]; taking the action allows the system to stay within the provably safe region identified by φ. Our shielding mechanism is sound due to formal verification. Because the deterministic program was synthesized using πw as an oracle, it is expected that the shield derived from the program will not frequently interrupt the neural network's decision, allowing the combined system to perform (close-to) optimally.

In the inverted pendulum example, since the 90° bound

given as a safety constraint is rather conservative, we do not expect a well-trained neural network to violate this boundary. Indeed, in Fig. 3(a), even though the inductive invariant of the synthesized program defines a substantially smaller state space than what is permissible, in our simulation results


Figure 3 Invariant Inference on Inverted Pendulum

Figure 4 The Framework of Neural Network Shielding

we find that the neural policy is never interrupted by the deterministic program when governing the pendulum. Note that the non-shaded areas in Fig. 3(a), while appearing safe, presumably define states from which the trajectory of the system can be led to an unsafe state and would thus not be inductive.

Our synthesis approach is critical to ensuring safety when the neural policy is used to predict actions in an environment different from the one used during training. Consider a neural network suitably protected by a shield that now operates safely. The effectiveness of this shield would be greatly diminished if the network had to be completely retrained from scratch whenever it was deployed in a new environment which imposes different safety constraints.

In our running example, suppose we wish to operate the inverted pendulum in an environment such as a Segway transporter, in which the model is prohibited from swinging significantly and whose angular velocity must be suitably restricted. We might specify the following new constraints on state parameters to enforce these conditions:

    Su : {(η, ω) | ¬(−30 < η < 30 ∧ −30 < ω < 30)}

Because of the dependence of a neural network on the quality of the training data used to build it, environment changes that deviate from assumptions made at training time could result in a costly retraining exercise, because the network must learn a new way to penalize unsafe actions that were previously safe. However, training a neural network from scratch requires substantial, non-trivial effort, involving fine-grained tuning of training parameters or even network structures.

In our framework, the existing network provides a reasonable approximation to the desired behavior. To incorporate the additional constraints defined by the new environment

C′, we attempt to synthesize a new deterministic program P′ for C′, a task that, based on our experience, is substantially easier to achieve than training a new neural network policy from scratch. This new program can be used to protect the original network, provided that we can use the aforementioned verification approach to formally verify that C′[P′] is safe by identifying a new inductive invariant φ′. As depicted in Fig. 4, we simply build a new shield S′ that is composed of the program P′ and the safety boundary captured by φ′. The shield S′ can ensure the safety of the neural network in the environment context C′ with a strengthened safety condition, despite the fact that the neural network was trained in a different environment context C.

Fig. 3(b) depicts the new unsafe states in C′ (colored in

red). It extends the unsafe range drawn in Fig. 3(a), so the inductive invariant learned there is unsafe for the new one. Our approach synthesizes a new deterministic program for which we learn a more restricted inductive invariant, depicted in green in Fig. 3(b). To characterize the effectiveness of our shielding approach, we examined 1000 simulation trajectories of this restricted version of the inverted pendulum system, each of which is comprised of 5000 simulation steps, with the safety constraints defined by C′. Without the shield S′, the pendulum entered the unsafe region Su 41 times. All of these violations were prevented by S′. Notably, the intervention rate of S′ to interrupt the neural network's decision was extremely low. Out of a total of 5000×1000 decisions, we only observed 414 instances (0.00828%) where the shield interfered with (i.e., overrode) the decision made by the network.

3 Problem Setup

We model the context C of a control policy as an environment state transition system C[·] = (X, A, S, S0, Su, Tt[·], f, r). Note that · is intentionally left open to deploy neural control policies. Here, X is a finite set of variables interpreted over the reals R, and S = R^X is the set of all valuations of the variables X. We denote s ∈ S to be an n-dimensional environment state and a ∈ A to be a control action, where A is an infinite set of m-dimensional continuous actions that a learning-enabled controller can perform. We use S0 ⊆ S to specify a set of initial environment states and Su ⊆ S to specify a set of unsafe environment states that a safe controller should avoid. The transition relation Tt[·] defines how one state transitions to another, given an action by an unknown policy. We assume that Tt[·] is governed by a standard differential equation f defining the relationship between a continuously varying state s(t) and action a(t) and its rate of change ṡ(t) over time t:

    ṡ(t) = f(s(t), a(t))

We assume f is defined by equations of the form Rⁿ × Rᵐ → Rⁿ, such as the example in Fig. 1. In the following, we often omit the time variable t for simplicity.


A deterministic neural network control policy πw : Rⁿ → Rᵐ, parameterized by a set of weight values w, is a function mapping an environment state s to a control action a, where

    a = πw(s)

By providing a feedback action, a policy can alter the rate of change of state variables to realize optimal system control.

The transition relation Tt[·] is parameterized by a control policy π that is deployed in C. We explicitly model this deployment as Tt[π]. Given a control policy πw, we use Tt[πw] ⊆ S × S to specify all possible state transitions allowed by the policy. We assume that a system transitions in discrete time instants kt, where k = 0, 1, 2, · · · and t is a fixed time step (t > 0). A state s transitions to a next state s′ after time t, with the assumption that a control action a(τ) at time τ is constant during the time period τ ∈ [0, t). Using Euler's method², we discretize the continuous dynamics f with a finite difference approximation so it can be used in the discretized transition relation Tt[πw]. Then Tt[πw] can compute these estimates by the following difference equation:

    Tt[πw] = {(s, s′) | s′ = s + f(s, πw(s)) × t}

Environment Disturbance. Our model allows bounded external properties (e.g., additional environment-imposed constraints) by extending the definition of the rate of change to ṡ = f(s, a) + d, where d is a vector of random disturbances. We use d to encode environment disturbances in terms of bounded nondeterministic values. We assume that tight upper and lower bounds of d can be accurately estimated at runtime using multivariate normal distribution fitting methods.

Trajectory. A trajectory h of a state transition system C[πw], which we denote as h ∈ C[πw], is a sequence of states s0, · · · , si, si+1, · · · , where s0 ∈ S0 and (si, si+1) ∈ Tt[πw] for all i ≥ 0. We use C ⊆ C[πw] to denote a set of trajectories. The reward that a control policy receives on performing an action a in a state s is given by the reward function r(s, a).

Reinforcement Learning. The goal of reinforcement learning is to maximize the reward that can be collected by a neural control policy in a given environment context C. Such problems can be abstractly formulated as

    max_{w∈Rⁿ} J(w),   where J(w) = E[r(πw)]        (3)

Assume that s0, s1, . . . , sT is a trajectory of length T of the state transition system, and

    r(πw) = lim_{T→∞} Σ_{k=0}^{T} r(sk, πw(sk))

is the cumulative reward achieved by the policy πw from this trajectory. Thus this formulation uses simulations of the transition system with finite-length rollouts to estimate the expected cumulative reward collected over T time steps, and aims to maximize this reward.

²Euler's method may sometimes poorly approximate the true system transition function when f is highly nonlinear. More precise higher-order approaches such as Runge-Kutta methods exist to compensate for loss of precision in this case.

Reinforcement learning

assumes policies are expressible in some executable structure (e.g., a neural network) and allows samples generated from one policy to influence the estimates made for others. The two main approaches for reinforcement learning are value function estimation and direct policy search. We refer readers to [27] for an overview.

Safety Verification. Since reinforcement learning only considers finite-length rollouts, we wish to determine if a control policy is safe to use under an infinite time horizon. Given a state transition system, the safety verification problem is concerned with verifying that no trajectories contained in S, starting from an initial state in S0, reach an unsafe state in Su. However, a neural network is a representative of a class of deep and sophisticated models that challenges the capability of state-of-the-art verification techniques. This level of complexity is exacerbated in our work because we consider the long-term safety of a neural policy deployed within a nontrivial environment context C that in turn is described by a complex infinite-state transition system.
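To make this model concrete, the sketch below (our own illustration in Python, not part of the paper's artifact) shows one way to organize an environment context C with Euler-discretized dynamics, a bounded disturbance d, and finite-length trajectory rollouts. The class name, the pendulum-like dynamics function, and all numeric bounds are hypothetical placeholders.

import numpy as np

class EnvironmentContext:
    """Minimal stand-in for C = (X, A, S, S0, Su, Tt, f, r) with Euler discretization."""

    def __init__(self, f, dt, s0_low, s0_high, unsafe, disturbance_bound=0.0):
        self.f = f                                       # continuous dynamics: ds/dt = f(s, a)
        self.dt = dt                                     # fixed time step t > 0
        self.s0_low = np.asarray(s0_low, dtype=float)
        self.s0_high = np.asarray(s0_high, dtype=float)  # box of initial states S0
        self.unsafe = unsafe                             # predicate: s -> True iff s is in Su
        self.d_bound = disturbance_bound                 # bound on the disturbance vector d

    def sample_initial_state(self, rng):
        return rng.uniform(self.s0_low, self.s0_high)

    def step(self, s, a, rng):
        # Tt[pi]: s' = s + (f(s, a) + d) * dt, with d a bounded nondeterministic disturbance
        d = rng.uniform(-self.d_bound, self.d_bound, size=s.shape)
        return s + (self.f(s, a) + d) * self.dt

    def rollout(self, policy, horizon, rng):
        """Finite-length trajectory s0, ..., sT under the deployed policy."""
        s = self.sample_initial_state(rng)
        trajectory = [s]
        for _ in range(horizon):
            s = self.step(s, policy(s), rng)
            trajectory.append(s)
        return trajectory

# Hypothetical pendulum-like instantiation (placeholder dynamics and bounds):
def pendulum_f(s, a):
    eta, omega = s
    return np.array([omega, 9.8 * np.sin(eta) + a])

env = EnvironmentContext(pendulum_f, dt=0.01,
                         s0_low=[-0.1, -0.1], s0_high=[0.1, 0.1],
                         unsafe=lambda s: abs(s[0]) > np.pi / 2)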

4 Verification Procedure

To verify a neural network control policy πw with respect to an environment context C, we first synthesize a deterministic policy program P from the neural policy. We require that P both (a) broadly resembles its neural oracle and (b) additionally satisfies a desired safety property when it is deployed in C. We conjecture that a safety proof of C[P] is easier to construct than that of C[πw]. More importantly, we leverage the safety proof of C[P] to ensure the safety of C[πw].

4.1 Synthesis

Fig. 5 defines a search space for a deterministic policy program P, where E and φ represent the basic syntax of (polynomial) program expressions and inductive invariants, respectively. Here, v ranges over a universe of numerical constants, x represents variables, and ⊕ is a basic operator including + and ×. A deterministic program P also features conditional statements using φ as branching predicates.

We allow the user to define a sketch [41, 42] to describe

the shape of target policy programs using the grammar in Fig. 5. We use P[θ] to represent a sketch, where θ represents unknowns that need to be filled in by the synthesis procedure. We use Pθ to represent a synthesized program with known values of θ. Similarly, the user can define a sketch of an inductive invariant φ[·] that is used (in Sec. 4.2) to learn a safety proof to verify a synthesized program Pθ.

We do not require the user to explicitly define conditional statements in a program sketch. Our synthesis algorithm uses verification counterexamples to lazily add branch predicates φ under which a program performs different computations depending on whether φ evaluates to true or false. The end user simply writes a sketch over basic expressions. For example, a sketch that defines a family of linear functions over


E  =  v | x | ⊕(E1, . . . , Ek)
φ  =  E ≤ 0
P  =  return E | if φ then return E else P

Figure 5 Syntax of the Policy Programming Language

Algorithm 1 Synthesize (πw, P[θ], C[·])
1: θ ← 0
2: do
3:   Sample δ from a zero-mean Gaussian vector
4:   Sample a set of trajectories C1 using C[Pθ+νδ]
5:   Sample a set of trajectories C2 using C[Pθ−νδ]
6:   θ ← θ + α[(d̂(πw, Pθ+νδ, C1) − d̂(πw, Pθ−νδ, C2)) / ν]δ
7: while θ is not converged
8: return Pθ

a collection of variables can be expressed as

    P[θ](X) = return θ1·x1 + θ2·x2 + · · · + θn·xn + θn+1        (4)

Here, X = (x1, x2, · · · ) are system variables, in which the coefficient constants θ = (θ1, θ2, · · · ) are unknown.

The goal of our synthesis algorithm is to find unknown

values of θ that maximize the likelihood that Pθ resembles the neural oracle πw while still being a safe program with respect to the environment context C:

    θ* = argmax_{θ∈R^{n+1}} d(πw, Pθ, C)        (5)

where d measures the distance between the outputs of the estimated program Pθ and the neural policy πw, subject to safety constraints. To avoid the computational complexity that arises if we consider a solution to this equation analytically as an optimization problem, we instead approximate d(πw, Pθ, C) by randomly sampling a set of trajectories Ĉ that are encountered by Pθ in the environment state transition system C[Pθ]:

    d(πw, Pθ, C) ≈ d̂(πw, Pθ, Ĉ)   s.t.   Ĉ ⊆ C[Pθ]

We estimate θ* using these sampled trajectories Ĉ and define

    d̂(πw, Pθ, Ĉ) = Σ_{h∈Ĉ} d̂(πw, Pθ, h)

Since each trajectory h ∈ Ĉ is a finite rollout s0, . . . , sT of length T, we have

    d̂(πw, Pθ, h) = Σ_{t=0}^{T}  { −‖Pθ(st) − πw(st)‖  if st ∉ Su ;   −MAX  if st ∈ Su }

where ‖·‖ is a suitable norm. As illustrated in Sec. 2.2, we aim to minimize the distance between a synthesized program Pθ from a sketch space and its neural oracle along sampled trajectories encountered by Pθ in the environment context, but put a large penalty on states that are unsafe.
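Read operationally, this distance scores a single sampled rollout against the neural oracle; the small sketch below is our own rendering (the norm choice and the MAX constant are placeholder assumptions, not the paper's settings).

import numpy as np

MAX_PENALTY = 1e6   # stand-in for the large constant MAX

def trajectory_distance(pi_w, P_theta, trajectory, is_unsafe):
    """d_hat(pi_w, P_theta, h): negative deviation of the program from the oracle,
    with a fixed large penalty for every state of the rollout that falls in S_u."""
    total = 0.0
    for s in trajectory:
        if is_unsafe(s):
            total += -MAX_PENALTY
        else:
            total += -np.linalg.norm(P_theta(s) - pi_w(s))
    return total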

Random Search. We implement the idea encoded in equation (5) in Algorithm 1, which depicts the pseudocode of our policy interpretation approach. It takes as inputs a neural policy, a policy program sketch parameterized by θ, and an environment state transition system, and outputs a synthesized policy program Pθ.

An efficient way to solve equation (5) is to directly perturb θ in the search space by adding random noise and then update θ based on the effect of this perturbation [30]. We choose a direction uniformly at random on the sphere in parameter space and then optimize the goal function along this direction. To this end, in line 3 of Algorithm 1 we sample Gaussian noise δ to be added to the policy parameters θ in both directions, where ν is a small positive real number. In line 4 and line 5, we sample trajectories C1 and C2 from the state transition systems obtained by running the perturbed policy programs Pθ+νδ and Pθ−νδ in the environment context C.

We evaluate the proximity of these two policy programs to the neural oracle and, in line 6 of Algorithm 1, to improve the policy program, we optimize equation (5) by updating θ with a finite difference approximation along the direction:

    θ ← θ + α[ (d̂(Pθ+νδ, πw, C1) − d̂(Pθ−νδ, πw, C2)) / ν ]δ        (6)

where α is a predefined learning rate. Such an update increment corresponds to an unbiased estimator of the gradient of θ [29, 33] for equation (5). The algorithm iteratively updates θ until convergence.
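Algorithm 1 can be rendered compactly on top of the sketches above; the following Python skeleton is our own paraphrase, assuming an environment object with a rollout method and a linear sketch as in equation (4). The hyperparameters (ν, α, rollout counts, and the fixed iteration budget standing in for the convergence test) are illustrative, not the paper's settings.

import numpy as np

def linear_sketch(theta):
    """P[theta](X) = theta_1*x_1 + ... + theta_n*x_n + theta_{n+1}, as in equation (4)."""
    theta = np.asarray(theta, dtype=float)
    return lambda s: float(np.dot(theta[:-1], s) + theta[-1])

def synthesize(pi_w, env, n_params, distance, is_unsafe, nu=0.02, alpha=0.01,
               rollouts=8, horizon=200, iters=2000, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros(n_params)                      # line 1: theta <- 0
    for _ in range(iters):                          # do ... while theta is not converged
        delta = rng.standard_normal(n_params)       # line 3: zero-mean Gaussian direction
        scores = {}
        for sign in (+1, -1):                       # lines 4-5: rollouts for theta +/- nu*delta
            P = linear_sketch(theta + sign * nu * delta)
            scores[sign] = sum(
                distance(pi_w, P, env.rollout(P, horizon, rng), is_unsafe)
                for _ in range(rollouts))
        # line 6: finite-difference update along the sampled direction
        theta = theta + alpha * ((scores[+1] - scores[-1]) / nu) * delta
    return linear_sketch(theta), theta

With distance bound to the trajectory-distance sketch shown earlier, this update directly mirrors equation (6).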

4.2 Verification

A synthesized policy program P is verified with respect to an environment context given as an infinite-state transition system defined in Sec. 3: C[P] = (X, A, S, S0, Su, Tt[P], f, r). Our verification algorithm learns an inductive invariant φ over the transition relation Tt[P], formally proving that all possible system behaviors are encapsulated in φ, and φ is required to be disjoint with all unsafe states Su.

We follow template-based constraint-solving approaches

for inductive invariant inference [19, 20, 24, 34] to discover this invariant. The basic idea is to identify a function E : Rⁿ → R that serves as a barrier [20, 24, 34] between reachable system states (evaluated to be nonpositive by E) and unsafe states (evaluated positive by E). Using the invariant syntax in Fig. 5, the user can define an invariant sketch

    φ[c](X) = E[c](X) ≤ 0        (7)

over variables X and unknown coefficients c intended to be synthesized. Fig. 5 carefully restricts that an invariant sketch E[c] can only be postulated as a polynomial function, as there exist efficient SMT solvers [13] and constraint solvers [26] for nonlinear polynomial reals. Formally, assume real coefficients c = (c0, · · · , cp) are used to parameterize E[c] in an affine manner:

    E[c](X) = Σ_{i=0}^{p} ci · bi(X)


where the bi(X)'s are some monomials in variables X. As a heuristic, the user can simply determine an upper bound on the degree of E[c] and then include all monomials whose degrees are no greater than the bound in the sketch. Large values of the bound enable verification of more complex safety conditions, but impose greater demands on the constraint solver; small values capture coarser safety properties, but are easier to solve.

Example 4.1. Consider the inverted pendulum system in Sec. 2.2. To discover an inductive invariant for the transition system, the user might choose to define an upper bound of 4, which results in the following polynomial invariant sketch φ[c](η, ω) = (E[c](η, ω) ≤ 0), where

    E[c](η, ω) = c0·η⁴ + c1·η³ω + c2·η²ω² + c3·ηω³ + c4·ω⁴ + c5·η³ + · · · + cp

The sketch includes all monomials over η and ω whose degrees are no greater than 4. The coefficients c = [c0, · · · , cp] are unknown and need to be synthesized.
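The monomial basis bi(X) implied by such a degree bound can be enumerated mechanically; the short helper below is our own utility, using only the Python standard library.

from itertools import combinations_with_replacement

def monomial_basis(num_vars, max_degree):
    """All monomials over num_vars variables with total degree <= max_degree,
    returned as exponent tuples; e.g. (4, 0) stands for eta^4 and (0, 0) for the constant."""
    basis = []
    for degree in range(max_degree + 1):
        for combo in combinations_with_replacement(range(num_vars), degree):
            basis.append(tuple(combo.count(i) for i in range(num_vars)))
    return basis

print(monomial_basis(2, 4))   # 15 monomials over (eta, omega), matching Example 4.1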

To synthesize these unknowns, we require that E[c] must satisfy the following verification conditions:

    ∀ s ∈ Su.  E[c](s) > 0        (8)
    ∀ s ∈ S0.  E[c](s) ≤ 0        (9)
    ∀ (s, s′) ∈ Tt[P].  E[c](s′) − E[c](s) ≤ 0        (10)

We claim that φ = (E[c](x) ≤ 0) defines an inductive invariant because verification condition (9) ensures that any initial state s0 ∈ S0 satisfies φ, since E[c](s0) ≤ 0; verification condition (10) asserts that along the transition from a state s ∈ φ (so E[c](s) ≤ 0) to a resulting state s′, E[c] cannot become positive, so s′ satisfies φ as well. Finally, according to verification condition (8), φ does not include any unsafe state su ∈ Su, as E[c](su) is positive.
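Before handing conditions (8)-(10) to a solver, they can be spot-checked numerically on sampled states; the sketch below is our own sanity check, not a proof procedure, since the paper discharges these conditions symbolically with a convex sum-of-squares solver. The sampling functions and the one-step transition step are assumed inputs.

import numpy as np

def check_barrier_on_samples(E, sample_S0, sample_S, sample_Su, step, n=10000, seed=0):
    """Falsification-style check of the barrier conditions: (8) E > 0 on sampled
    unsafe states, (9) E <= 0 on sampled initial states, and (10) E non-increasing
    across one transition of C[P] from sampled states. Passing is not a proof."""
    rng = np.random.default_rng(seed)
    for _ in range(n):
        su = sample_Su(rng)
        if E(su) <= 0:
            return ("condition (8) violated", su)
        s0 = sample_S0(rng)
        if E(s0) > 0:
            return ("condition (9) violated", s0)
        s = sample_S(rng)                 # an arbitrary state paired with its successor
        if E(step(s)) - E(s) > 0:
            return ("condition (10) violated", s)
    return ("no counterexample found on samples", None)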

Verification conditions (8), (9), (10) are polynomials over reals. Synthesizing the unknown coefficients can be left to an SMT solver [13] after universal quantifiers are eliminated using a variant of Farkas' Lemma, as in [20]. However, observe that E[c] is convex³. We can gain efficiency by finding the unknown coefficients c using off-the-shelf convex constraint solvers, following [34]. Encoding verification conditions (8), (9), (10) as polynomial inequalities, we search for c that can prove non-negativity of these constraints via an efficient and convex sum-of-squares programming solver [26]. Additional technical details are provided in the supplementary material [49].

Counterexample-guided Inductive Synthesis (CEGIS). Given the environment state transition system C[P] deployed with a synthesized program P, the verification approach above can compute an inductive invariant over the state transition system or tell if there is no feasible solution in the given set of candidates. Note, however, that the latter does not necessarily imply that the system is unsafe.

³For arbitrary E1(x) and E2(x) satisfying the verification conditions and α ∈ [0, 1], E(x) = αE1(x) + (1 − α)E2(x) satisfies the conditions as well.

Algorithm 2 CEGIS (πw, P[θ], C[·])
1: policies ← ∅
2: covers ← ∅
3: while C.S0 ⊈ covers do
4:   search s0 such that s0 ∈ C.S0 and s0 ∉ covers
5:   r* ← Diameter(C.S0)
6:   do
7:     φbound ← {s | s ∈ [s0 − r*, s0 + r*]}
8:     C ← C where C.S0 = (C.S0 ∩ φbound)
9:     θ ← Synthesize(πw, P[θ], C[·])
10:    φ ← Verify(C[Pθ])
11:    if φ is False then
12:      r* ← r*/2
13:    else
14:      covers ← covers ∪ {s | φ(s)}
15:      policies ← policies ∪ {(Pθ, φ)}
16:      break
17:  while True
18: end
19: return policies

Since our goal is to learn a safe deterministic program from a neural network, we develop a counterexample-guided inductive program synthesis approach. A CEGIS algorithm in our context is challenging because safety verification is necessarily incomplete and may not be able to produce a counterexample that serves as an explanation for why a verification attempt is unsuccessful.

We solve the incompleteness challenge by leveraging the fact that we can simultaneously synthesize and verify a program. Our CEGIS approach is given in Algorithm 2. A counterexample is an initial state on which our synthesized program is not yet proved safe. Driven by these counterexamples, our algorithm synthesizes a set of programs from a sketch, along with the conditions under which we can switch from one synthesized program to another.

Algorithm 2 takes as input a neural policy πw, a program sketch P[θ], and an environment context C. It maintains synthesized policy programs in policies (line 1), each of which is inductively verified safe in a partition of the universe state space that is maintained in covers (line 2). For soundness, the state space covered by such partitions must be able to include all initial states, checked in line 3 of Algorithm 2 by an SMT solver. In the algorithm, we use C.S0 to access a field of C, such as its initial state space.

The algorithm iteratively samples a counterexample initial state s0 that is currently not covered by covers (line 4). Since covers is empty at the beginning, this choice is uniformly random initially; we synthesize a presumably safe policy program in line 9 of Algorithm 2 that resembles the neural policy πw, considering all possible initial states S0 of the given environment model C, using Algorithm 1.


Figure 6 CEGIS for Verifiable Reinforcement Learning

If verification fails in line 11, our approach simply reduces the initial state space, hoping that a safe policy program is easier to synthesize if only a subset of initial states is considered. The algorithm in line 12 gradually shrinks the radius r* of the initial state space around the sampled initial state s0. The synthesis algorithm in the next iteration synthesizes and verifies a candidate using the reduced initial state space. The main idea is that if the initial state space is shrunk to a restricted area around s0 but a safe policy program still cannot be found, it is quite possible that either s0 points to an unsafe initial state of the neural oracle or the sketch is not sufficiently expressive.

If verification at this stage succeeds with an inductive

invariant φ, a new policy program Pθ is synthesized that can be verified safe in the state space covered by φ. We add the inductive invariant and the policy program into covers and policies in lines 14 and 15, respectively, and then move to another iteration of counterexample-guided synthesis. This iterative synthesize-and-verify process continues until the entire initial state space is covered (lines 3 to 18). The output of Algorithm 2 is [(Pθ1, φ1), (Pθ2, φ2), · · ·], which essentially defines conditional statements in a synthesized program performing different actions depending on whether a specified invariant condition evaluates to true or false.
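The counterexample-guided loop can be phrased as a small skeleton; the version below is our own paraphrase of Algorithm 2 in Python, with the synthesis call, the verifier, the SMT coverage check, the uncovered-state sampler, and the initial-state restriction all left abstract as assumed callables.

def cegis(pi_w, sketch, env, synthesize, verify, covered_by, sample_uncovered, diameter):
    """Skeleton of Algorithm 2. `verify` returns an inductive invariant phi or None;
    `covered_by(S0, covers)` plays the role of the SMT check on line 3."""
    policies, covers = [], []                          # lines 1-2
    while not covered_by(env.S0, covers):              # line 3
        s0 = sample_uncovered(env.S0, covers)          # line 4: counterexample initial state
        r = diameter(env.S0)                           # line 5
        while True:                                    # lines 6-17 (do-while)
            restricted = env.restrict_initial_states(s0 - r, s0 + r)   # lines 7-8
            P_theta = synthesize(pi_w, sketch, restricted)             # line 9
            phi = verify(restricted, P_theta)                          # line 10
            if phi is None:                            # line 11: verification failed
                r = r / 2                              # line 12: shrink the initial region
            else:
                covers.append(phi)                     # line 14
                policies.append((P_theta, phi))        # line 15
                break                                  # line 16
    return policies                                    # line 19

The returned list corresponds to the conditional program of Theorem 4.2 below: evaluate the φi in order and run the first Pθi whose invariant holds.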

Theorem 4.2. If CEGIS(πw, P[θ], C[·]) = [(Pθ1, φ1), (Pθ2, φ2), · · ·] (as defined in Algorithm 2), then the deterministic program

    P ≜ λX. if φ1(X) return Pθ1(X) else if φ2(X) return Pθ2(X) · · ·

is safe in the environment C, meaning that φ1 ∨ φ2 ∨ · · · is an inductive invariant of C[P], proving that C.Su is unreachable.

Example 4.3. We illustrate the proposed counterexample-guided inductive synthesis method by means of a simple example, the Duffing oscillator [22], a nonlinear second-order environment. The transition relation of the environment system C is described with the differential equation

    ẋ = y,    ẏ = −0.6y − x − x³ + a

where x, y are the state variables and a the continuous control action given by a well-trained neural feedback control policy π such that a = π(x, y). The control objective is to regulate the state to the origin from a set of initial states C.S0 : {x, y | −2.5 ≤ x ≤ 2.5 ∧ −2 ≤ y ≤ 2}. To be safe, the controller must be able to avoid a set of unsafe states C.Su : {x, y | ¬(−5 ≤ x ≤ 5 ∧ −5 ≤ y ≤ 5)}. Given a program sketch as in the pendulum example, that is P[θ](x, y) = θ1·x + θ2·y, the user can ask the constraint solver to reason over a small (say, degree 4) polynomial invariant sketch for a synthesized program, as in Example 4.1.

Algorithm 2 initially samples an initial state s with x = −0.46, y = −0.36 from S0. The inner do-while loop of the algorithm can discover a sub-region of initial states in the dotted box of Fig. 6(a) that can be leveraged to synthesize a verified deterministic policy P1(x, y) = 0.39x − 1.41y from the sketch. We also obtain an inductive invariant showing that the synthesized policy can always maintain the controller within the invariant set drawn in purple in Fig. 6(a): φ1 ≡ 20.9x⁴ + 2.9x³y + 1.4x²y² + 0.4xy³ + 29.6x³ + 20.1x²y + 11.3xy² + 1.6y³ + 25.2x² + 39.2xy + 53.7y² − 68.0 ≤ 0.

This invariant explains why the initial state space used for verification does not include the entire C.S0: a counterexample initial state s′ = [x = 2.249, y = 2] is not covered by the invariant, for which the synthesized policy program above is not verified safe. The CEGIS loop in line 3 of Algorithm 2 uses s′ to synthesize another deterministic policy P2(x, y) = 0.88x − 2.34y from the sketch, whose learned inductive invariant is depicted in blue in Fig. 6(b): φ2 ≡ 12.8x⁴ + 0.9x³y − 0.2x²y² − 5.9x³ − 1.5xy² − 0.3y³ + 2.2x² + 4.7xy + 40.4y² − 61.9 ≤ 0. Algorithm 2 then terminates because φ1 ∨ φ2 covers C.S0. Our system interprets the two synthesized deterministic policies as the following deterministic program Poscillator using the syntax in Fig. 5:

def Poscillator (x, y) :

    if 20.9x⁴ + 2.9x³y + 1.4x²y² + · · · + 53.7y² − 68.0 ≤ 0 :   ⊲ φ1
        return 0.39x − 1.41y
    else if 12.8x⁴ + 0.9x³y − 0.2x²y² + · · · + 40.4y² − 61.9 ≤ 0 :   ⊲ φ2
        return 0.88x − 2.34y
    else : abort   ⊲ unsafe

Neither of the two deterministic policies enforces safety by itself on all initial states, but they do guarantee safety when combined together, because by construction φ = φ1 ∨ φ2 is an inductive invariant of Poscillator in the environment C.

Although Algorithm 2 is sound, it may not terminate, as r* in the algorithm can become arbitrarily small, and there is also no restriction on the size of potential counterexamples. Nonetheless, our experimental results indicate that the algorithm performs well in practice.

4.3 Shielding

A safety proof of a synthesized deterministic program of a neural network does not automatically lift to a safety argument of the neural network from which it was derived,


Algorithm 3 Shield (s, C[πw], P, φ)
1: Predict s′ such that (s, s′) ∈ C[πw].Tt(s)
2: if φ(s′) then return πw(s)
3: else return P(s)

since the network may exhibit behaviors not captured by the simpler deterministic program. To bridge this divide, we recover soundness at runtime by monitoring system behaviors of a neural network in its environment context, using the synthesized policy program and its inductive invariant as a shield. The pseudo-code for using a shield is given in Algorithm 3. In line 1 of Algorithm 3, for a current state s, we use the state transition system of our environment context model to predict the next state s′. If s′ is not within φ, we are unsure whether entering s′ would inevitably make the system unsafe, as we lack a proof that the neural oracle is safe. However, we do have a guarantee that if we follow the synthesized program P, the system would stay within the safety boundary defined by φ, which Algorithm 2 has formally proved. We do so in line 3 of Algorithm 3, using P and φ as shields, intervening only if necessary so as to restrict the shield from unnecessarily interfering with the neural policy.⁴
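At runtime the shield is a thin wrapper around the two policies; a possible rendering is sketched below (our own illustration, where predict_next stands for one application of the discretized transition relation Tt and is an assumption rather than an API from the paper's artifact).

def shield(s, pi_w, P, phi, predict_next):
    """Runtime shield (Algorithm 3): keep the neural action whenever the predicted
    successor stays inside the proved invariant phi; otherwise fall back to the
    verified deterministic program P, whose action provably keeps the system in phi."""
    proposed = pi_w(s)
    s_next = predict_next(s, proposed)    # line 1: predicted successor under the neural action
    if phi(s_next):                       # line 2: still within the safe region
        return proposed
    return P(s)                           # line 3: provably safe fallback action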

5 Experimental Results

We have applied our framework on a number of challenging control- and cyberphysical-system benchmarks. We consider the utility of our approach for verifying the safety of trained neural network controllers. We use the deep policy gradient algorithm [28] for neural network training, the Z3 SMT solver [13] to check convergence of the CEGIS loop, and the Mosek constraint solver [6] to generate inductive invariants of synthesized programs from a sketch. All of our benchmarks are verified using the program sketch defined in equation (4) and the invariant sketch defined in equation (7). We report simulation results on our benchmarks over 1000 runs (each run consists of 5000 simulation steps). Each simulation time step is fixed at 0.01 second. Our experiments were conducted on a standard desktop machine consisting of Intel(R) Core(TM) i7-8700 CPU cores and 64GB memory.

Case Study on Inverted Pendulum. We first give more details on the evaluation result of our running example, the inverted pendulum. Here we consider a more restricted safety condition that deems the system to be unsafe when the pendulum's angle is more than 23° from the origin (i.e., significant swings are further prohibited). The controller is a 2-hidden-layer neural model (240 × 200). Our tool interprets the neural network as a program containing three conditional branches:

def P (η, ω) :

⁴We also extended our approach to synthesize deterministic programs which can guarantee stability in the supplementary material [49].

    if 17533η⁴ + 13732η³ω + 3831η²ω² − 5472ηω³ + 8579ω⁴ + 6813η³ + 9634η²ω + 3947ηω² − 120ω³ + 1928η² + 1915ηω + 1104ω² − 313 ≤ 0 :
        return −1728176866η − 1009441768ω
    else if 2485η⁴ + 826η³ω − 351η²ω² + 581ηω³ + 2579ω⁴ + 591η³ + 9η²ω + 243ηω² − 189ω³ + 484η² + 170ηω + 287ω² − 82 ≤ 0 :
        return −1734281984η − 1073944835ω
    else if 115496η⁴ + 64763η³ω + 85376η²ω² + 21365ηω³ + 7661ω⁴ − 111271η³ − 54416η²ω − 66684ηω² − 8742ω³ + 33701η² + 11736ηω + 12503ω² − 1185 ≤ 0 :
        return −2578835525η − 1625056971ω
    else abort

Over 1000 simulation runs, we found 60 unsafe cases when running the neural controller alone. Importantly, running the neural controller in tandem with the above verified and synthesized program can prevent all unsafe neural decisions. There were only 65 interventions from the program on the neural network. Our results demonstrate that CEGIS is important to ensure a synthesized program is safe to use. We report more details, including synthesis times, below.

One may ask why we do not directly learn a deterministic program to control the device (without appealing to the neural policy at all), using reinforcement learning to synthesize its unknown parameters. Even for an example as simple as the inverted pendulum, our discussion above shows that it is difficult (if not impossible) to derive a single straight-line (linear) program that is safe to control the system. Even for scenarios in which a straight-line program suffices, using existing RL methods to directly learn unknown parameters in our sketches may still fail to obtain a safe program. For example, we considered whether a linear control policy can be learned to (1) prevent the pendulum from falling down, and (2) require a pendulum control action to be strictly within the range [−1, 1] (e.g., operating the controller in an environment with low power constraints). We found that, despite many experiments on tuning learning rates and rewards, directly training a linear control program to conform to this restriction with either reinforcement learning (e.g., policy gradient) or random search [29] was unsuccessful because of undesirable overfitting. In contrast, neural networks work much better for these RL algorithms and can be used to guide the synthesis of a deterministic program policy. Indeed, by treating the neural policy as an oracle, we were able to quickly discover a straight-line linear deterministic program that in fact satisfies this additional motion constraint.

Safety Verification. Our verification results are given in Table 1. In the table, Vars represents the number of variables in a control system (this number serves as a proxy for application complexity), Size the number of neurons in hidden layers of the network, Training the training time for the network, and Failures the number of times the network failed to satisfy the safety property in simulation. The table also gives the Size of a synthesized program in terms of the number of policies found by Algorithm 2 (used to generate conditional statements in the program), its Synthesis time, the Overhead of our approach in terms of the additional cost


Table 1 Experimental Results on Deterministic Program Synthesis Verification and Shielding

Benchmarks | Vars | NN Size | NN Training | NN Failures | Shield Size | Shield Synthesis | Overhead (%) | Interventions | Perf. NN | Perf. Program
Satellite | 2 | 240 × 200 | 957s | 0 | 1 | 160s | 3.37 | 0 | 57 | 97
DCMotor | 3 | 240 × 200 | 944s | 0 | 1 | 68s | 2.03 | 0 | 119 | 122
Tape | 3 | 240 × 200 | 980s | 0 | 1 | 42s | 2.63 | 0 | 30 | 36
Magnetic Pointer | 3 | 240 × 200 | 992s | 0 | 1 | 85s | 2.92 | 0 | 83 | 88
Suspension | 4 | 240 × 200 | 960s | 0 | 1 | 41s | 8.71 | 0 | 47 | 61
Biology | 3 | 240 × 200 | 978s | 0 | 1 | 168s | 5.23 | 0 | 2464 | 2599
DataCenter Cooling | 3 | 240 × 200 | 968s | 0 | 1 | 168s | 4.69 | 0 | 146 | 401
Quadcopter | 2 | 300 × 200 | 990s | 182 | 2 | 67s | 6.41 | 185 | 72 | 98
Pendulum | 2 | 240 × 200 | 962s | 60 | 3 | 1107s | 9.65 | 65 | 442 | 586
CartPole | 4 | 300 × 200 | 990s | 47 | 4 | 998s | 5.62 | 1799 | 6813 | 19126
Self-Driving | 4 | 300 × 200 | 990s | 61 | 1 | 185s | 4.66 | 236 | 1459 | 5136
Lane Keeping | 4 | 240 × 200 | 895s | 36 | 1 | 183s | 8.65 | 64 | 3753 | 6435
4-Car platoon | 8 | 500 × 400 × 300 | 1160s | 8 | 4 | 609s | 3.17 | 8 | 76 | 96
8-Car platoon | 16 | 500 × 400 × 300 | 1165s | 40 | 1 | 1217s | 6.05 | 1080 | 385 | 554
Oscillator | 18 | 240 × 200 | 1023s | 371 | 1 | 618s | 21.31 | 93703 | 6935 | 11353

(compared to the non-shielded variant) in running time to use a shield, and the number of Interventions, the number of times the shield was invoked across all simulation runs. We also report the performance gap between a shielded neural policy (NN) and a purely programmatic policy (Program), in terms of the number of steps on average that a controlled system spends in reaching a steady state of the system (i.e., a convergence state).

The first five benchmarks are linear time-invariant control systems adapted from [15]. The safety property is that the reach set has to be within a safe rectangle. Benchmark Biology defines a minimal model of glycemic control in diabetic patients, such that the dynamics of glucose and insulin interaction in the blood system are defined by polynomials [10]. For safety, we verify that the neural controller ensures that the level of plasma glucose concentration is above a certain threshold. Benchmark DataCenter Cooling is a model of a collection of three server racks, each with their own cooling devices, which also shed heat to their neighbors. The safety property is that a learned controller must keep the data center below a certain temperature. In these benchmarks, the cost to query the network oracle constitutes the dominant time for generating the safety shield. Given the simplicity of these benchmarks, the neural network controllers did not violate the safety condition in our trials and, moreover, there were no interventions from the safety shields that affected performance.

The next three benchmarks, Quadcopter, (Inverted) Pendulum, and Cartpole, are selected from classic control applications and have more sophisticated safety conditions. We have discussed the inverted pendulum example at length earlier. The Quadcopter environment tests whether a controlled quadcopter can realize stable flight. The environment of Cartpole consists of a pole attached to an unactuated joint

connected to a cart that moves along a frictionless track. The system is unsafe when the pole's angle is more than 30° from being upright or the cart moves by more than 0.3 meters from the origin. We observed safety violations in each of these benchmarks that were eliminated using our verification methodology. Notably, the number of interventions is remarkably low as a percentage of the overall number of simulation steps.

Benchmark Self-driving defines a single-car navigation

problem. The neural controller is responsible for preventing the car from veering into canals found on either side of the road. Benchmark Lane Keeping models another safety-related car-driving problem. The neural controller aims to maintain a vehicle between lane markers and keep it centered in a possibly curved lane. The curvature of the road is considered as a disturbance input. Environment disturbances of this kind can be conservatively specified in our model, accounting for noise and unmodeled dynamics. Our verification approach supports these disturbances (verification condition (10)). Benchmarks n-Car platoon model multiple (n) vehicles forming a platoon, maintaining a safe relative distance among one another [39]. Each of these benchmarks exhibited some number of violations that were remediated by our verification methodology. Benchmark Oscillator consists of a two-dimensional switched oscillator plus a 16th-order filter. The filter smoothens the input signals and has a single output signal. We verify that the output signal is below a safe threshold. Because of the model complexity of this benchmark, it exhibited significantly more violations than the others. Indeed, the neural-network controlled system often oscillated between the safe and unsafe boundary in many runs. Consequently, the overhead in this benchmark is high because a large number of shield interventions was required to ensure safety. In other words, the synthesized


shield trades performance for safety to guarantee that the threshold boundary is never violated.

For all benchmarks, our tool successfully generated safe,

interpretable deterministic programs and inductive invariants as shields. When a neural controller takes an unsafe action, the synthesized shield correctly prevents this action from executing by providing an alternative, provably safe action proposed by the verified deterministic program. In terms of performance, Table 1 shows that a shielded neural policy is a feasible approach to drive a controlled system into a steady state. For each of the benchmarks studied, the programmatic policy is less performant than the shielded neural policy, sometimes by a factor of two or more (e.g., Cartpole, Self-Driving, and Oscillator). Our results demonstrate that executing a neural policy in tandem with a program distilled from it can retain performance provided by the neural policy while maintaining safety provided by the verified program.

Although our synthesis algorithm does not guarantee convergence to the global minimum when applied to nonconvex RL problems, the results given in Table 1 indicate that our algorithm can often produce high-quality control programs, with respect to a provided sketch, that converge reasonably fast. In some of our benchmarks, however, the number of interventions is significantly higher than the number of neural controller failures, e.g., 8-Car platoon and Oscillator. However, the high number of interventions is not primarily because of non-optimality of the synthesized programmatic controller. Instead, inherent safety issues in the neural network models are the main culprit that triggers shield interventions. In 8-Car platoon, after corrections made by our deterministic program, the neural model again takes an unsafe action in the next execution step, so that the deterministic program has to make another correction. It is only after applying a number of such shield interventions that the system navigates into a part of the state space that can be safely operated on by the neural model. For this benchmark, all system states where shield interventions occur are indeed unsafe. We also examined the unsafe simulation runs made by executing the neural controller alone in Oscillator. Among the large number of shield interventions (as reflected in Table 1), 74% of them are effective and indeed prevent an unsafe neural decision. Ineffective interventions in Oscillator are due to the fact that, when optimizing equation (5), a large penalty is given to unsafe states, causing the synthesized programmatic policy to weigh safety more than proximity when there exist a large number of unsafe neural decisions.

Suitable Sketches. Providing a suitable sketch may need domain knowledge. To help the user more easily tune the shape of a sketch, our approach provides algorithmic support by not requiring conditional statements in a program sketch, and syntactic sugar, i.e., the user can simply provide an upper bound on the degree of an invariant sketch.

Table 2. Experimental Results on Tuning Invariant Degrees. TO means that an adequate inductive invariant cannot be found within 2 hours.

Benchmarks | Degree | Verification | Interventions | Overhead (%)
Pendulum | 2 | TO | - | -
Pendulum | 4 | 226s | 40542 | 7.82
Pendulum | 8 | 236s | 30787 | 8.79
Self-Driving | 2 | TO | - | -
Self-Driving | 4 | 24s | 128851 | 6.97
Self-Driving | 8 | 251s | 123671 | 26.85
8-Car-platoon | 2 | 1729s | 43952 | 8.36
8-Car-platoon | 4 | 5402s | 37990 | 9.19
8-Car-platoon | 8 | TO | - | -

Our experimental results are collected using the invariant sketch defined in equation (7), and we chose an upper bound of 4 on the degree of all monomials included in the sketch. Recall that invariants may also be used as conditional predicates as part of a synthesized program. We adjust the invariant degree upper bound to evaluate its effect in our synthesis procedure. The results are given in Table 2.

Generally, high-degree invariants lead to fewer interventions because they tend to be more permissive than low-degree ones. However, high-degree invariants take more time to synthesize and verify. This is particularly true for high-dimension models such as 8-Car platoon. Moreover, although high-degree invariants tend to have fewer interventions, they have larger overhead. For example, using a shield of degree 8 in the Self-Driving benchmark caused an overhead of 26.85%. This is because high-degree polynomial computations are time-consuming. On the other hand, an insufficient degree upper bound may not be permissive enough to obtain a valid invariant. It is therefore essential to consider the tradeoff between overhead and permissiveness when choosing the degree of an invariant.

Handling Environment Changes. We consider the effectiveness of our tool when previously trained neural network controllers are deployed in environment contexts different from the environment used for training. Here we consider neural network controllers of larger size (two hidden layers with 1200 × 900 neurons) than in the above experiments. This is because we want to ensure that a neural policy is trained to be near optimal in the environment context used for training. These larger networks were in general more difficult to train, requiring at least 1500 seconds to converge. Our results are summarized in Table 3. When the underlying environment slightly changes, learning a new safety shield takes a substantially shorter time than training a new network. For Cartpole, we simulated the trained controller in a new environment by increasing the length of the pole by 0.15 meters. The neural controller failed 3 times in our 1000-episode simulation; the shield interfered with the network operation only 8 times to prevent these unsafe behaviors.


Table 3. Experimental Results on Handling Environment Changes

Benchmarks     Environment Change                        Neural Network            Deterministic Program as Shield
                                                         Size         Failures     Size   Synthesis   Overhead   Shield Interventions
Cartpole       Increased pole length by 0.15 m           1200 × 900   3            1      239s        291        8
Pendulum       Increased pendulum mass by 0.3 kg         1200 × 900   77           1      581s        811        8748
Pendulum       Increased pendulum length by 0.15 m       1200 × 900   76           1      483s        653        7060
Self-driving   Added an obstacle that must be avoided    1200 × 900   203          1      392s        842        108320

The new shield was synthesized in 239s, significantly faster than retraining a new neural network for the new environment. For (Inverted) Pendulum, we deployed the trained neural network in an environment in which the pendulum's mass is increased by 0.3 kg. The neural controller exhibits noticeably higher failure rates than in the previous experiment; we were able to synthesize a safety shield adapted to this new environment in 581 seconds that prevented these violations. The shield intervened with the operation of the network only about 8.7 times per episode. Similar results were observed when we increased the pendulum's length by 0.15 m. For Self-driving, we additionally required the car to avoid an obstacle. The synthesized shield provided safe actions to ensure collision-free motion.

6 Related Work
Verification of Neural Networks. While neural networks have historically found only limited application in safety- and security-related contexts, recent advances have defined suitable verification methodologies that are capable of providing stronger assurance guarantees. For example, Reluplex [23] is an SMT solver that supports linear real arithmetic with ReLU constraints and has been used to verify safety properties in a collision avoidance system. AI2 [17, 31] is an abstract interpretation tool that can reason about robustness specifications for deep convolutional networks. Robustness verification is also considered in [8, 21]. Systematic testing techniques such as [43, 45, 48] are designed to automatically generate test cases to increase a coverage metric, i.e., explore different parts of the neural network architecture by generating test inputs that maximize the number of activated neurons. These approaches ignore effects induced by the actual environment in which the network is deployed, and they are typically ineffective for safety verification of a neural network controller over an infinite time horizon with complex system dynamics. Unlike these whitebox efforts that reason over the architecture of the network, our verification framework is fully blackbox, using a network only as an oracle to learn a simpler deterministic program. This gives us the ability to reason about safety entirely in terms of the synthesized program, using a shielding mechanism derived from the program to ensure that the neural policy can only explore safe regions.

Safe Reinforcement Learning. The machine learning community has explored various techniques to develop safe reinforcement learning algorithms in contexts similar to ours, e.g., [32, 46]. In some cases, verification is used as a safety validation oracle [2, 11]. In contrast to these efforts, our inductive synthesis algorithm can be freely incorporated into any existing reinforcement learning framework and applied to any legacy machine learning model. More importantly, our design allows us to synthesize new shields from existing ones without requiring retraining of the network under reasonable changes to the environment.

Syntax-Guided Synthesis for Machine Learning. An important reason underlying the success of program synthesis is that sketches of the desired program [41, 42], often provided along with example data, can be used to effectively restrict a search space and allow users to provide additional insight about the desired output [5]. Our sketch-based inductive synthesis approach is a realization of this idea, applicable to continuous and infinite state control systems central to many machine learning tasks. A high-level policy language grammar is used in our approach to constrain the shape of possible synthesized deterministic policy programs. Our program parameters in sketches are continuous because they naturally fit continuous control problems. However, our synthesis algorithm (line 9 of Algorithm 2) can call a derivative-free random search method (which does not require the gradient or its approximative finite difference) for synthesizing functions whose domain is disconnected, (mixed-)integer, or non-smooth. Our synthesizer thus may be applied to other domains that are not continuous or differentiable, like most modern program synthesis algorithms [41, 42]. In a similar vein, recent efforts on interpretable machine learning generate interpretable models such as program code [47] or decision trees [9] as output, after which traditional symbolic verification techniques can be leveraged to prove program properties. Our approach novelly ensures that only safe programs are synthesized via a CEGIS procedure and can provide safety guarantees on the original high-performing neural networks via invariant inference. A detailed comparison was provided on page 3.

Controller Synthesis. Random search [30] has long been utilized as an effective approach for controller synthesis. In [29], several extensions of random search have been proposed that substantially improve its performance when applied to reinforcement learning. However, [29] learns a single linear controller, which in our experience is insufficient to guarantee safety for our benchmarks. We often need to learn a program that involves a family of linear controllers conditioned on different input spaces to guarantee safety. The novelty of our approach is that it can automatically synthesize programs with conditional statements as needed, which is critical to ensure safety.

In Sec. 5, linear sketches were used in our experiments for shield synthesis. LQR-Tree based approaches such as [44] can synthesize a controller consisting of a set of linear quadratic regulator (LQR [14]) controllers, each applicable in a different portion of the state space. These approaches focus on stability, while our approach primarily addresses safety. In our experiments, we observed that because LQR does not take safe/unsafe regions into consideration, synthesized LQR controllers can regularly violate safety constraints.

Runtime Shielding. Shield synthesis has been used in runtime enforcement for reactive systems [12, 25]. Safety shielding in reinforcement learning was first introduced in [4]. Because their verification approach is not symbolic, however, it can only work over finite discrete state and action systems. Since states in infinite state and continuous action systems are not enumerable, using these methods requires the end user to provide a finite abstraction of complex environment dynamics; such abstractions are typically too coarse to be useful (because they result in too much over-approximation) or otherwise have too many states to be analyzable [15]. Indeed, automated environment abstraction tools such as [16] often take hours even for simple 4-dimensional systems. In general, these efforts focused on shielding of discrete and finite systems, with no obvious generalization to effectively deal with the continuous and infinite systems that our approach can monitor and protect. Our approach embraces the nature of infinity in control systems by learning a symbolic shield from an inductive invariant for a synthesized program that includes an infinite number of environment states under which the program can be guaranteed to be safe. We therefore believe our framework provides a more promising verification pathway for machine learning in high-dimensional control systems. Shielding in continuous control systems was introduced in [3, 18]. These approaches are based on HJ reachability (Hamilton-Jacobi Reachability [7]). However, current formulations of HJ reachability are limited to systems with approximately five dimensions, making high-dimensional system verification intractable. Unlike their least restrictive control law framework that decouples safety and learning, our approach only considers state spaces that a learned controller attempts to explore, and is therefore capable of synthesizing safety shields even in high dimensions.

7 Conclusions
This paper presents an inductive synthesis-based toolchain that can verify neural network control policies within an environment formalized as an infinite-state transition system. The key idea is a novel synthesis framework capable of synthesizing a deterministic policy program, based on a user-given sketch, that resembles a neural policy oracle and simultaneously satisfies a safety specification, using a counterexample-guided synthesis loop. Verification soundness is achieved at runtime by using the learned deterministic program, along with its learned inductive invariant, as a shield to protect the neural policy, correcting the neural policy's action only if the chosen action can cause violation of the inductive invariant. Experimental results demonstrate that our approach can be used to realize fully trustworthy reinforcement learning systems with low overhead.

References
[1] M. Ahmadi, C. Rowley, and U. Topcu. 2018. Control-Oriented Learning of Lagrangian and Hamiltonian Systems. In American Control Conference.
[2] Anayo K. Akametalu, Shahab Kaynama, Jaime F. Fisac, Melanie Nicole Zeilinger, Jeremy H. Gillula, and Claire J. Tomlin. 2014. Reachability-based safe learning with Gaussian processes. In 53rd IEEE Conference on Decision and Control, CDC 2014. 1424–1431.
[3] Anayo K. Akametalu, Shahab Kaynama, Jaime F. Fisac, Melanie Nicole Zeilinger, Jeremy H. Gillula, and Claire J. Tomlin. 2014. Reachability-based safe learning with Gaussian processes. In 53rd IEEE Conference on Decision and Control, CDC 2014, Los Angeles, CA, USA, December 15-17, 2014. 1424–1431.
[4] Mohammed Alshiekh, Roderick Bloem, Rüdiger Ehlers, Bettina Könighofer, Scott Niekum, and Ufuk Topcu. 2018. Safe Reinforcement Learning via Shielding. AAAI (2018).
[5] Rajeev Alur, Rastislav Bodík, Eric Dallal, Dana Fisman, Pranav Garg, Garvit Juniwal, Hadas Kress-Gazit, P. Madhusudan, Milo M. K. Martin, Mukund Raghothaman, Shambwaditya Saha, Sanjit A. Seshia, Rishabh Singh, Armando Solar-Lezama, Emina Torlak, and Abhishek Udupa. 2015. Syntax-Guided Synthesis. In Dependable Software Systems Engineering. NATO Science for Peace and Security Series, D: Information and Communication Security, Vol. 40. IOS Press, 1–25.
[6] MOSEK ApS. 2018. The MOSEK optimization software. http://www.mosek.com
[7] Somil Bansal, Mo Chen, Sylvia Herbert, and Claire J. Tomlin. 2017. Hamilton-Jacobi Reachability: A Brief Overview and Recent Advances. (2017).
[8] Osbert Bastani, Yani Ioannou, Leonidas Lampropoulos, Dimitrios Vytiniotis, Aditya V. Nori, and Antonio Criminisi. 2016. Measuring Neural Net Robustness with Constraints. In Proceedings of the 30th International Conference on Neural Information Processing Systems (NIPS'16). 2621–2629.
[9] Osbert Bastani, Yewen Pu, and Armando Solar-Lezama. 2018. Verifiable Reinforcement Learning via Policy Extraction. CoRR abs/1805.08328 (2018).
[10] Richard N. Bergman, Diane T. Finegood, and Marilyn Ader. 1985. Assessment of insulin sensitivity in vivo. Endocrine Reviews 6, 1 (1985), 45–86.
[11] Felix Berkenkamp, Matteo Turchetta, Angela P. Schoellig, and Andreas Krause. 2017. Safe Model-based Reinforcement Learning with Stability Guarantees. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017. 908–919.
[12] Roderick Bloem, Bettina Könighofer, Robert Könighofer, and Chao Wang. 2015. Shield Synthesis - Runtime Enforcement for Reactive Systems. In Tools and Algorithms for the Construction and Analysis of Systems - 21st International Conference, TACAS 2015. 533–548.
[13] Leonardo De Moura and Nikolaj Bjørner. 2008. Z3: An Efficient SMT Solver. In Proceedings of the Theory and Practice of Software, 14th International Conference on Tools and Algorithms for the Construction and Analysis of Systems (TACAS'08/ETAPS'08). 337–340.
[14] Peter Dorato, Vito Cerone, and Chaouki Abdallah. 1994. Linear-Quadratic Control: An Introduction. Simon & Schuster, Inc., New York, NY, USA.
[15] Chuchu Fan, Umang Mathur, Sayan Mitra, and Mahesh Viswanathan. 2018. Controller Synthesis Made Real: Reach-Avoid Specifications and Linear Dynamics. In Computer Aided Verification - 30th International Conference, CAV 2018. 347–366.
[16] Ioannis Filippidis, Sumanth Dathathri, Scott C. Livingston, Necmiye Ozay, and Richard M. Murray. 2016. Control design for hybrid systems with TuLiP: The Temporal Logic Planning toolbox. In 2016 IEEE Conference on Control Applications, CCA 2016. 1030–1041.
[17] Timon Gehr, Matthew Mirman, Dana Drachsler-Cohen, Petar Tsankov, Swarat Chaudhuri, and Martin T. Vechev. 2018. AI2: Safety and Robustness Certification of Neural Networks with Abstract Interpretation. In 2018 IEEE Symposium on Security and Privacy, SP 2018. 3–18.
[18] Jeremy H. Gillula and Claire J. Tomlin. 2012. Guaranteed Safe Online Learning via Reachability: tracking a ground target using a quadrotor. In IEEE International Conference on Robotics and Automation, ICRA 2012, 14-18 May 2012, St. Paul, Minnesota, USA. 2723–2730.
[19] Sumit Gulwani, Saurabh Srivastava, and Ramarathnam Venkatesan. 2008. Program Analysis As Constraint Solving. In Proceedings of the 29th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '08). 281–292.
[20] Sumit Gulwani and Ashish Tiwari. 2008. Constraint-based approach for analysis of hybrid systems. In International Conference on Computer Aided Verification. 190–203.
[21] Xiaowei Huang, Marta Kwiatkowska, Sen Wang, and Min Wu. 2017. Safety Verification of Deep Neural Networks. In CAV (1) (Lecture Notes in Computer Science), Vol. 10426. 3–29.
[22] Dominic W. Jordan and Peter Smith. 1987. Nonlinear ordinary differential equations. (1987).
[23] Guy Katz, Clark W. Barrett, David L. Dill, Kyle Julian, and Mykel J. Kochenderfer. 2017. Reluplex: An Efficient SMT Solver for Verifying Deep Neural Networks. In Computer Aided Verification - 29th International Conference, CAV 2017. 97–117.
[24] Hui Kong, Fei He, Xiaoyu Song, William N. N. Hung, and Ming Gu. 2013. Exponential-Condition-Based Barrier Certificate Generation for Safety Verification of Hybrid Systems. In Proceedings of the 25th International Conference on Computer Aided Verification (CAV'13). 242–257.
[25] Bettina Könighofer, Mohammed Alshiekh, Roderick Bloem, Laura Humphrey, Robert Könighofer, Ufuk Topcu, and Chao Wang. 2017. Shield Synthesis. Form. Methods Syst. Des. 51, 2 (Nov. 2017), 332–361.
[26] Benoît Legat, Chris Coey, Robin Deits, Joey Huchette, and Amelia Perry. 2017 (accessed Nov. 16, 2018). Sum-of-squares optimization in Julia. https://www.juliaopt.org/meetings/mit2017/legat.pdf
[27] Yuxi Li. 2017. Deep Reinforcement Learning: An Overview. CoRR abs/1701.07274 (2017).
[28] Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. 2016. Continuous control with deep reinforcement learning. In 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings.
[29] Horia Mania, Aurelia Guy, and Benjamin Recht. 2018. Simple random search of static linear policies is competitive for reinforcement learning. In Advances in Neural Information Processing Systems 31. 1800–1809.
[30] J. Matyas. 1965. Random optimization. Automation and Remote Control 26, 2 (1965), 246–253.
[31] Matthew Mirman, Timon Gehr, and Martin T. Vechev. 2018. Differentiable Abstract Interpretation for Provably Robust Neural Networks. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018. 3575–3583.
[32] Teodor Mihai Moldovan and Pieter Abbeel. 2012. Safe Exploration in Markov Decision Processes. In Proceedings of the 29th International Conference on Machine Learning (ICML'12). 1451–1458.
[33] Yurii Nesterov and Vladimir Spokoiny. 2017. Random Gradient-Free Minimization of Convex Functions. Found. Comput. Math. 17, 2 (2017), 527–566.
[34] Stephen Prajna and Ali Jadbabaie. 2004. Safety Verification of Hybrid Systems Using Barrier Certificates. In Hybrid Systems: Computation and Control, 7th International Workshop, HSCC 2004. 477–492.
[35] Aravind Rajeswaran, Kendall Lowrey, Emanuel Todorov, and Sham M. Kakade. 2017. Towards Generalization and Simplicity in Continuous Control. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017. 6553–6564.
[36] Stéphane Ross, Geoffrey J. Gordon, and Drew Bagnell. 2011. A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2011. 627–635.
[37] Wenjie Ruan, Xiaowei Huang, and Marta Kwiatkowska. 2018. Reachability Analysis of Deep Neural Networks with Provable Guarantees. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI 2018. 2651–2659.
[38] Stefan Schaal. 1999. Is imitation learning the route to humanoid robots? Trends in Cognitive Sciences 3, 6 (1999), 233–242.
[39] Bastian Schürmann and Matthias Althoff. 2017. Optimal control of sets of solutions to formally guarantee constraints of disturbed linear systems. In 2017 American Control Conference, ACC 2017. 2522–2529.
[40] David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Riedmiller. 2014. Deterministic Policy Gradient Algorithms. In Proceedings of the 31st International Conference on Machine Learning - Volume 32. I–387–I–395.
[41] Armando Solar-Lezama. 2008. Program Synthesis by Sketching. Ph.D. Dissertation. Berkeley, CA, USA. Advisor(s): Bodik, Rastislav.
[42] Armando Solar-Lezama. 2009. The Sketching Approach to Program Synthesis. In Proceedings of the 7th Asian Symposium on Programming Languages and Systems (APLAS '09). 4–13.
[43] Youcheng Sun, Min Wu, Wenjie Ruan, Xiaowei Huang, Marta Kwiatkowska, and Daniel Kroening. 2018. Concolic Testing for Deep Neural Networks. In Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering (ASE 2018). 109–119.
[44] Russ Tedrake. 2009. LQR-trees: Feedback motion planning on sparse randomized trees. In Robotics: Science and Systems V, University of Washington, Seattle, USA, June 28 - July 1, 2009.
[45] Yuchi Tian, Kexin Pei, Suman Jana, and Baishakhi Ray. 2018. DeepTest: Automated Testing of Deep-neural-network-driven Autonomous Cars. In Proceedings of the 40th International Conference on Software Engineering (ICSE '18). 303–314.
[46] Matteo Turchetta, Felix Berkenkamp, and Andreas Krause. 2016. Safe Exploration in Finite Markov Decision Processes with Gaussian Processes. In Proceedings of the 30th International Conference on Neural Information Processing Systems (NIPS'16). 4312–4320.
[47] Abhinav Verma, Vijayaraghavan Murali, Rishabh Singh, Pushmeet Kohli, and Swarat Chaudhuri. 2018. Programmatically Interpretable Reinforcement Learning. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018. 5052–5061.
[48] Matthew Wicker, Xiaowei Huang, and Marta Kwiatkowska. 2018. Feature-Guided Black-Box Safety Testing of Deep Neural Networks. In Tools and Algorithms for the Construction and Analysis of Systems - 24th International Conference, TACAS 2018. 408–426.
[49] He Zhu, Zikang Xiong, Stephen Magill, and Suresh Jagannathan. 2018. An Inductive Synthesis Framework for Verifiable Reinforcement Learning. Technical Report. Galois, Inc. http://herowanzhu.github.io/herowanzhu.github.io/VerifiableLearning.pdf


Figure 3 Invariant Inference on Inverted Pendulum

Figure 4 The Framework of Neural Network Shielding

we find that the neural policy is never interrupted by thedeterministic program when governing the pendulum Notethat the non-shaded areas in Fig 3(a) while appearing safepresumably define states from which the trajectory of thesystem can be led to an unsafe state and would thus not beinductive

Our synthesis approach is critical to ensuring safety whenthe neural policy is used to predict actions in an environmentdifferent from the one used during training Consider a neu-ral network suitably protected by a shield that now operatessafely The effectiveness of this shield would be greatly di-minished if the network had to be completely retrained fromscratch whenever it was deployed in a new environmentwhich imposes different safety constraints

In our running example suppose we wish to operate theinverted pendulum in an environment such as a Segwaytransporter in which the model is prohibited from swingingsignificantly and whose angular velocity must be suitablyrestricted We might specify the following new constraintson state parameters to enforce these conditions

Su (ηω) |not(minus30 lt η lt 30 and minus30 lt ω lt 30)Because of the dependence of a neural network to the qualityof training data used to build it environment changes thatdeviate from assumptions made at training-time could resultin a costly retraining exercise because the network mustlearn a new way to penalize unsafe actions that were previ-ously safe However training a neural network from scratchrequires substantial non-trivial effort involving fine-grainedtuning of training parameters or even network structures

In our framework the existing network provides a reason-able approximation to the desired behavior To incorporatethe additional constraints defined by the new environment

Cprime we attempt to synthesize a new deterministic programP prime for Cprime a task based on our experience is substantiallyeasier to achieve than training a new neural network policyfrom scratch This new program can be used to protect theoriginal network provided that we can use the aforemen-tioned verification approach to formally verify that Cprime[P prime] issafe by identifying a new inductive invariant φ prime As depictedin Fig 4 we simply build a new shield Sprime that is composedof the program P prime and the safety boundary captured by φ primeThe shield Sprime can ensure the safety of the neural network inthe environment context Cprime with a strengthened safety con-dition despite the fact that the neural network was trainedin a different environment context CFig 3(b) depicts the new unsafe states in Cprime (colored in

red) It extends the unsafe range draw in Fig 3(a) so theinductive invariant learned there is unsafe for the new oneOur approach synthesizes a new deterministic program forwhich we learn a more restricted inductive invariant de-picted in green in Fig 3(b) To characterize the effectivenessof our shielding approach we examined 1000 simulationtrajectories of this restricted version of the inverted pendu-lum system each of which is comprised of 5000 simulationsteps with the safety constraints defined by Cprime Without theshield Sprime the pendulum entered the unsafe region Su 41times All of these violations were prevented by Sprime Notablythe intervention rate of Sprime to interrupt the neural networkrsquosdecision was extremely low Out of a total of 5000times1000 de-cisions we only observed 414 instances (000828) wherethe shield interfered with (ie overrode) the decision madeby the network

3 Problem SetupWemodel the context C of a control policy as an environmentstate transition system C[middot] = (X ASS0Su Tt [middot] f r )Note that middot is intentionally left open to deploy neural controlpolicies Here X is a finite set of variables interpreted overthe reals R and S = RX is the set of all valuations of thevariables X We denote s isin S to be an n-dimensional envi-ronment state and a isin A to be a control action where Ais an infinite set ofm-dimensional continuous actions thata learning-enabled controller can perform We use S0 isin Sto specify a set of initial environment states and Su isin S tospecify a set of unsafe environment states that a safe con-troller should avoid The transition relation Tt [middot] defines howone state transitions to another given an action by an un-known policyWe assume thatTt [middot] is governed by a standarddifferential equation f defining the relationship between acontinuously varying state s(t) and action a(t) and its rateof change Ucircs(t) over time t

Ucircs(t) = f (s(t)a(t))We assume f is defined by equations of the formRntimesRm rarrRn such as the example in Fig 1 In the following we oftenomit the time variable t for simplicity A deterministic neural

5

network control policy πw Rn rarr Rm parameterized by aset of weight valuesw is a function mapping an environmentstate s to a control action a where

a = πw(s)By providing a feedback action a policy can alter the rate ofchange of state variables to realize optimal system controlThe transition relation Tt [middot] is parameterized by a con-

trol policy π that is deployed in C We explicitly modelthis deployment as Tt [π ] Given a control policy πw weuse Tt [πw] S times S to specify all possible state transitionsallowed by the policy We assume that a system transitionsin discrete time instants kt where k = 0 1 2 middot middot middot and t is afixed time step (t gt 0) A state s transitions to a next state s primeafter time t with the assumption that a control action a(τ ) attime τ is a constant between the time period τ isin [0 t) Us-ing Eulerrsquos method2 we discretize the continuous dynamicsf with finite difference approximation so it can be used inthe discretized transition relation Tt [πw] Then Tt [πw] cancompute these estimates by the following difference equation

Tt [πw] = (s s prime) | s prime = s + f (sπw(s)) times tEnvironment Disturbance Our model allows boundedexternal properties (eg additional environment-imposedconstraints) by extending the definition of change of rate Ucircs =f (sa)+d whered is a vector of random disturbances We used to encode environment disturbances in terms of boundednondeterministic values We assume that tight upper andlower bounds of d can be accurately estimated at runtimeusing multivariate normal distribution fitting methodsTrajectoryA trajectoryh of a state transition system C[πw]which we denote as h isin C[πw] is a sequence of statess0 middot middot middot si si+1 middot middot middot where s0 isin S0 and (si si+1) isin Tt [πw] forall i ge 0 We use C sube C[πw] to denote a set of trajectoriesThe reward that a control policy receives on performing anaction a in a state s is given by the reward function r (sa)Reinforcement Learning The goal of reinforcement learn-ing is to maximize the reward that can be collected by a neu-ral control policy in a given environment context C Suchproblems can be abstractly formulated as

maxwisinRn

J (w)

J (w) = E[r (πw)](3)

Assume that s0 s1 sT is a trajectory of length T of thestate transition system and r (πw) = lim

TrarrinfinΣTk=0r (sk πw(sk ))

is the cumulative reward achieved by the policy πw fromthis trajectory Thus this formulation uses simulations ofthe transition system with finite length rollouts to estimatethe expected cumulative reward collected over T time stepsand aim to maximize this reward Reinforcement learning2Eulerrsquos method may sometimes poorly approximate the true system tran-sition function when f is highly nonlinear More precise higher-order ap-proaches such as Runge-Kutta methods exist to compensate for loss ofprecision in this case

assumes polices are expressible in some executable structure(eg a neural network) and allows samples generated fromone policy to influence the estimates made for others Thetwo main approaches for reinforcement learning are valuefunction estimation and direct policy searchWe refer readersto [27] for an overviewSafety Verification Since reinforcement learning only con-siders finite length rollouts we wish to determine if a controlpolicy is safe to use under an infinite time horizon Givena state transition system the safety verification problem isconcerned with verifying that no trajectories contained in Sstarting from an initial state inS0 reach an unsafe state inSu However a neural network is a representative of a class ofdeep and sophisticated models that challenges the capabilityof the state-of-the-art verification techniques This level ofcomplexity is exacerbated in our work because we considerthe long term safety of a neural policy deployed within anontrivial environment context C that in turn is describedby a complex infinite-state transition system

4 Verification ProcedureTo verify a neural network control policy πw with respect toan environment context C we first synthesize a deterministicpolicy program P from the neural policy We require that Pboth (a) broadly resembles its neural oracle and (b) addition-ally satisfies a desired safety property when it is deployedin C We conjecture that a safety proof of C[P] is easier toconstruct than that of C[πw] More importantly we leveragethe safety proof of C[P] to ensure the safety of C[πw]

41 SynthesisFig 5 defines a search space for a deterministic policy pro-gram P where E and φ represent the basic syntax of (polyno-mial) program expressions and inductive invariants respec-tively Here v ranges over a universe of numerical constantsx represents variables and oplus is a basic operator including +and times A deterministic program P also features conditionalstatements using φ as branching predicatesWe allow the user to define a sketch [41 42] to describe

the shape of target policy programs using the grammar inFig 5 We use P[θ ] to represent a sketch where θ representsunknowns that need to be filled-in by the synthesis proce-dure We use Pθ to represent a synthesized program withknown values of θ Similarly the user can define a sketch ofan inductive invariant φ[middot] that is used (in Sec 42) to learna safety proof to verify a synthesized program Pθ

We do not require the user to explicitly define conditionalstatements in a program sketch Our synthesis algorithmusesverification counterexamples to lazily add branch predicatesφ under which a program performs different computationsdepending on whether φ evaluates to true or false The enduser simply writes a sketch over basic expressions For ex-ample a sketch that defines a family of linear function over

6

E = v | x | oplus (E1 Ek )φ = E le 0P = return E | if φ then return E else P

Figure 5 Syntax of the Policy Programming Language

Algorithm 1 Synthesize (πw P[θ ] C[middot])1 θ larr 02 do3 Sample δ from a zero mean Gaussian vector4 Sample a set of trajectories C1 using C[Pθ+νδ ]5 Sample a set of trajectories C2 using C[Pθminusνδ ]6 θ larr θ + α[d (πwPθ+νδ C1)minusd (πwPθminusνδ C2)

ν ]δ 7 while θ is not converged8 return Pθ

a collection of variables can be expressed as

P[θ ](X ) = return θ1x1 + θ2x2 + middot middot middot + θnxn + θn+1 (4)

Here X = (x1x2 middot middot middot ) are system variables in which thecoefficient constants θ = (θ1θ2 middot middot middot ) are unknownThe goal of our synthesis algorithm is to find unknown

values of θ that maximize the likelihood that Pθ resemblesthe neural oracle πw while still being a safe program withrespect to the environment context C

θ lowast = argmaxθ isinRn+1

d(πwPθ C) (5)

where d measures the distance between the outputs of theestimate program Pθ and the neural policy πw subject tosafety constraints To avoid the computational complexitythat arises if we consider a solution to this equation analyti-cally as an optimization problem we instead approximated(πwPθ C) by randomly sampling a set of trajectories Cthat are encountered by Pθ in the environment state transi-tion system C[Pθ ]

d(πwPθ C) asymp d(πwPθ C) s t C sube C[Pθ ]We estimate θ lowast using these sampled trajectoriesC and define

d(πwPθ C) =sumhisinC

d(πwPθ h)

Since each trajectory h isin C is a finite rollout s0 sT oflength T we have

d(πwPθ h) =Tsumt=0

minus ∥(Pθ (st ) minus πw(st ))∥ st lt SuminusMAX st isin Su

where ∥middot∥ is a suitable norm As illustrated in Sec 22 weaim to minimize the distance between a synthesized programPθ from a sketch space and its neural oracle along sampledtrajectories encountered by Pθ in the environment contextbut put a large penalty on states that are unsafe

Random SearchWe implement the idea encoded in equa-tion (5) in Algorithm 1 that depicts the pseudocode of ourpolicy interpretation approach It take as inputs a neuralpolicy a policy program sketch parameterized by θ and anenvironment state transition system and outputs a synthe-sized policy program Pθ

An efficient way to solve equation (5) is to directly perturbθ in the search space by adding random noise and thenupdate θ based on the effect on this perturbation [30] Wechoose a direction uniformly at random on the sphere inparameter space and then optimize the goal function alongthis direction To this end in line 3 of Algorithm 1 we sampleGaussian noise δ to be added to policy parameters θ in bothdirections where ν is a small positive real number In line 4and line 5 we sample trajectories C1 and C2 from the statetransition systems obtained by running the perturbed policyprogram Pθ+νδ and Pθminusνδ in the environment context C

We evaluate the proximity of these two policy programs tothe neural oracle and in line 6 of Algorithm 1 to improve thepolicy program we also optimize equation (5) by updating θwith a finite difference approximation along the direction

θ larr θ + α[d(Pθ+νδ πwC1) minus d(Pθminusνδ πwC2)ν

]δ (6)

where α is a predefined learning rate Such an update incre-ment corresponds to an unbiased estimator of the gradient ofθ [29 33] for equation (5) The algorithm iteratively updatesθ until convergence

42 VerificationA synthesized policy program P is verified with respect to anenvironment context given as an infinite state transition sys-tem defined in Sec 3 C[P] = (X ASS0Su Tt [P] f r )Our verification algorithm learns an inductive invariant φover the transition relation Tt [P] formally proving that allpossible system behaviors are encapsulated in φ and φ isrequired to be disjoint with all unsafe states Su We follow template-based constraint solving approaches

for inductive invariant inference [19 20 24 34] to discoverthis invariant The basic idea is to identify a function E Rn rarr R that serves as a barrier [20 24 34] between reach-able system states (evaluated to be nonpositive by E) andunsafe states (evaluated positive by E) Using the invariantsyntax in Fig 5 the user can define an invariant sketch

φ[c](X ) = E[c](X ) le 0 (7)

over variables X and c unknown coefficients intended to besynthesized Fig 5 carefully restricts that an invariant sketchE[c] can only be postulated as a polynomial function as thereexist efficient SMT solvers [13] and constraint solvers [26]for nonlinear polynomial reals Formally assume real coef-ficients c = (c0 middot middot middot cp ) are used to parameterize E[c] in anaffine manner

E[c](X ) = Σpi=0cibi (X )

7

where the bi (X )rsquos are some monomials in variables X As aheuristic the user can simply determine an upper bound onthe degree of E[c] and then include all monomials whosedegrees are no greater than the bound in the sketch Largevalues of the bound enable verification of more complexsafety conditions but impose greater demands on the con-straint solver small values capture coarser safety propertiesbut are easier to solve

Example 41 Consider the inverted pendulum system inSec 22 To discover an inductive invariant for the transitionsystem the user might choose to define an upper bound of 4which results in the following polynomial invariant sketchφ[c](ηω) = E[c](ηω) le 0 where

E[c](ηω) = c0η4+c1η3ω+c2η2ω2+c3ηω3+c4ω

4+c5η3+middot middot middot+cp

The sketch includes all monomials over η and ω whose de-grees are no greater than 4 The coefficients c = [c0 middot middot middot cp ]are unknown and need to be synthesized

To synthesize these unknowns we require that E[c] mustsatisfy the following verification conditions

forall(s) isin Su E[c](s) gt 0 (8)forall(s) isin S0 E[c](s) le 0 (9)forall(s s prime) isin Tt [P] E[c](s prime) minus E[c](s) le 0 (10)

We claim that φ = E[c](x) le 0 defines an inductive in-variant because verification condition (9) ensures that anyinitial state s0 isin S0 satisfies φ since E[c](s0) le 0 verificationcondition (10) asserts that along the transition from a states isin φ (so E[c](s) le 0) to a resulting state s prime E[c] cannotbecome positive so s prime satisfies φ as well Finally accordingto verification condition (8) φ does not include any unsafestate su isin Su as E[c](su ) is positive

Verification conditions (8) (9) (10) are polynomials over re-als Synthesizing unknown coefficients can be left to an SMTsolver [13] after universal quantifiers are eliminated using avariant of Farkas Lemma as in [20] However observe thatE[c] is convex3 We can gain efficiency by finding unknowncoefficients c using off-the-shelf convex constraint solversfollowing [34] Encoding verification conditions (8) (9) (10)as polynomial inequalities we search c that can prove non-negativity of these constraints via an efficient and convexsum of squares programming solver [26] Additional techni-cal details are provided in the supplementary material [49]Counterexample-guided Inductive Synthesis (CEGIS)Given the environment state transition system C[P] de-ployed with a synthesized program P the verification ap-proach above can compute an inductive invariant over thestate transition system or tell if there is no feasible solutionin the given set of candidates Note however that the latterdoes not necessarily imply that the system is unsafe

3For arbitrary E1(x ) and E2(x ) satisfying the verification conditions andα isin [0 1] E(x ) = αE1(x ) + (1 minus α )E2(x ) satisfies the conditions as well

Algorithm 2 CEGIS (πw P[θ ] C[middot])1 policieslarr empty2 coverslarr empty3 while CS0 ⊈ covers do4 search s0 such that s0 isin CS0 and s0 lt covers5 rlowastlarr Diameter(CS0)6 do7 ϕbound larr s | s isin s0 minus rlowast s0 + rlowast8 C larr C where CS0 = (CS0 cap ϕbound )9 θ larr Synthesize(πw P[θ ] C[middot])

10 φlarr Verify(C[Pθ ])11 if φ is False then12 rlowastlarr rlowast213 else14 coverslarr covers cup s | φ(s)15 policieslarr policies cup (Pθ φ)16 break17 while True18 end19 return policies

Since our goal is to learn a safe deterministic programfrom a neural network we develop a counterexample guidedinductive program synthesis approach A CEGIS algorithmin our context is challenging because safety verification isnecessarily incomplete and may not be able to produce acounterexample that serves as an explanation for why averification attempt is unsuccessful

We solve the incompleteness challenge by leveraging thefact that we can simultaneously synthesize and verify aprogram Our CEGIS approach is given in Algorithm 2 Acounterexample is an initial state on which our synthesizedprogram is not yet proved safe Driven by these counterex-amples our algorithm synthesizes a set of programs from asketch along with the conditions under which we can switchfrom from one synthesized program to another

Algorithm 2 takes as input a neural policy πw a programsketch P[θ ] and an environment context C It maintains syn-thesized policy programs in policies in line 1 each of whichis inductively verified safe in a partition of the universe statespace that is maintained in covers in line 2 For soundnessthe state space covered by such partitions must be able toinclude all initial states checked in line 3 of Algorithm 2 byan SMT solver In the algorithm we use CS0 to access afield of C such as its initial state space

The algorithm iteratively samples a counterexample initialstate s0 that is currently not covered by covers in line 4 Sincecovers is empty at the beginning this choice is uniformlyrandom initially we synthesize a presumably safe policyprogram in line 9 of Algorithm 2 that resembles the neuralpolicy πw considering all possible initial states S0 of thegiven environment model C using Algorithm 1

8

Figure 6 CEGIS for Verifiable Reinforcement Learning

If verification fails in line 11 our approach simply reducesthe initial state space hoping that a safe policy program iseasier to synthesize if only a subset of initial states are con-sidered The algorithm in line 12 gradually shrinks the radiusr lowast of the initial state space around the sampled initial states0 The synthesis algorithm in the next iteration synthesizesand verifies a candidate using the reduced initial state spaceThe main idea is that if the initial state space is shrunk toa restricted area around s0 but a safety policy program stillcannot be found it is quite possible that either s0 points toan unsafe initial state of the neural oracle or the sketch isnot sufficiently expressiveIf verification at this stage succeeds with an inductive

invariant φ a new policy program Pθ is synthesized thatcan be verified safe in the state space covered by φ We addthe inductive invariant and the policy program into coversand policies in line 14 and 15 respectively and then move toanother iteration of counterexample-guided synthesis Thisiterative synthesize-and-verify process continues until theentire initial state space is covered (line 3 to 18) The outputof Algorithm 2 is [(Pθ 1φ1) (Pθ 2φ2) middot middot middot ] that essentiallydefines conditional statements in a synthesized program per-forming different actions depending on whether a specifiedinvariant condition evaluates to true or false

Theorem 42 IfCEGIS(πw P[θ ]C[middot]) = [(Pθ 1φ1) (Pθ 2φ2) middot middot middot ] (as defined in Algorithm 2) then the deterministicprogram PλX if φ1(X ) return Pθ 1(X ) else if φ2(X ) return Pθ 2(X ) middot middot middot

is safe in the environment C meaning that φ1 or φ2 or middot middot middot is aninductive invariant of C[P] proving that CSu is unreachable

Example 43 We illustrate the proposed counterexampleguided inductive synthesis method by means of a simple ex-ample the Duffing oscillator [22] a nonlinear second-orderenvironment The transition relation of the environmentsystem C is described with the differential equation

Ucircx = yUcircy = minus06y minus x minus x3 + a

where x y are the state variables and a the continuous con-trol action given by a well-trained neural feedback controlpolicy π such that a = π (x y) The control objective is toregulate the state to the origin from a set of initial statesCS0 x y | minus 25 le x le 25 and minus2 le y le 2 To besafe the controller must be able to avoid a set of unsafestates CSu x y | not(minus5 le x le 5 and minus5 le y le 5)Given a program sketch as in the pendulum example that isP[θ ](x y) = θ1x+θ2y the user can ask the constraint solverto reason over a small (say 4) order polynomial invariantsketch for a synthesized program as in Example 41Algorithm 2 initially samples an initial state s as x =minus046y = minus036 from S0 The inner do-while loop of thealgorithm can discover a sub-region of initial states in thedotted box of Fig 6(a) that can be leveraged to synthesize averified deterministic policy P1(x y) = 039x minus 141y fromthe sketch We also obtain an inductive invariant showingthat the synthesized policy can always maintain the con-troller within the invariant set drawn in purple in Fig 6(a)φ1 equiv 209x4 + 29x3y + 14x2y2 + 04xy3 + 296x3 + 201x2y +113xy2 + 16y3 + 252x2 + 392xy + 537y2 minus 680 le 0

This invariant explains why the initial state space usedfor verification does not include the entire CS0 a coun-terexample initial state s prime = x = 2249y = 2 is not cov-ered by the invariant for which the synthesized policy pro-gram above is not verified safe The CEGIS loop in line 3of Algorithm 2 uses s prime to synthesize another deterministicpolicy P2(x y) = 088x minus 234y from the sketch whoselearned inductive invariant is depicted in blue in Fig 2(b)φ2 equiv 128x4 + 09x3y minus 02x2y2 minus 59x3 minus 15xy2 minus 03y3 +22x2 + 47xy + 404y2 minus 619 le 0 Algorithm 2 then termi-nates because φ1 or φ2 covers CS0 Our system interpretsthe two synthesized deterministic policies as the followingdeterministic program Poscillator using the syntax in Fig 5def Poscillator (x y)

if 209x 4 + 29x 3y + 14x 2y2 + middot middot middot + 537y2 minus 680 le 0 φ1return 039x minus 141y

else if 128x 4 + 09x 3y minus 02x 2y2 + middot middot middot + 404y2 minus 619 le 0 φ2return 088x minus 234y

else abort unsafe

Neither of the two deterministic policies enforce safety bythemselves on all initial states but do guarantee safety whencombined together because by construction φ = φ1 or φ2 isan inductive invariant of Poscillator in the environment C

Although Algorithm 2 is sound it may not terminate asr lowast in the algorithm can become arbitrarily small and thereis also no restriction on the size of potential counterexam-ples Nonetheless our experimental results indicate that thealgorithm performs well in practice

43 ShieldingA safety proof of a synthesized deterministic program ofa neural network does not automatically lift to a safety ar-gument of the neural network from which it was derived

9

Algorithm 3 Shield (s C[πw] P φ)1 Predict s prime such that (s s prime) isin C[πw]Tt (s)2 if φ(s prime) then return πw(s) 3 else return P(s)

since the network may exhibit behaviors not captured bythe simpler deterministic program To bridge this divide werecover soundness at runtime by monitoring system behav-iors of a neural network in its environment context by usingthe synthesized policy program and its inductive invariantas a shield The pseudo-code for using a shield is given inAlgorithm 3 In line 1 of Algorithm 3 for a current state s weuse the state transition system of our environment contextmodel to predict the next state s prime If s prime is not within φ weare unsure whether entering s prime would inevitably make thesystem unsafe as we lack a proof that the neural oracle issafe However we do have a guarantee that if we follow thesynthesized program P the system would stay within thesafety boundary defined by φ that Algorithm 2 has formallyproved We do so in line 3 of Algorithm 3 using P and φas shields intervening only if necessary so as to restrict theshield from unnecessarily intervening the neural policy 4

5 Experimental ResultsWe have applied our framework on a number of challengingcontrol- and cyberphysical-system benchmarks We considerthe utility of our approach for verifying the safety of trainedneural network controllers We use the deep policy gradi-ent algorithm [28] for neural network training the Z3 SMTsolver [13] to check convergence of the CEGIS loop andthe Mosek constraint solver [6] to generate inductive in-variants of synthesized programs from a sketch All of ourbenchmarks are verified using the program sketch defined inequation (4) and the invariant sketch defined in equation (7)We report simulation results on our benchmarks over 1000runs (each run consists of 5000 simulation steps) Each simu-lation time step is fixed 001 second Our experiments wereconducted on a standard desktop machine consisting of In-tel(R) Core(TM) i7-8700 CPU cores and 64GB memory

Case Study on Inverted Pendulum We first give more de-tails on the evaluation result of our running example theinverted pendulumHerewe consider amore restricted safetycondition that deems the system to be unsafe when the pen-dulumrsquos angle is more than 23 from the origin (ie signif-icant swings are further prohibited) The controller is a 2hidden-layer neural model (240times200) Our tool interprets theneural network as a program containing three conditionalbranches

def P (η ω)

4We also extended our approach to synthesize deterministic programswhichcan guarantee stability in the supplementary material [49]

if 17533η4 + 13732η3ω + 3831η2ω2 minus 5472ηω3 + 8579ω4 + 6813η3+9634η2ω + 3947ηω2 minus 120ω3 + 1928η2 + 1915ηω + 1104ω2 minus 313 le 0

return minus1728176866η minus 1009441768ωelse if 2485η4 + 826η3ω minus 351η2ω2 + 581ηω3 + 2579ω4 + 591η3+

9η2ω + 243ηω2 minus 189ω3 + 484η2 + 170ηω + 287ω2 minus 82 le 0return minus1734281984x minus 1073944835y

else if 115496η4 + 64763η3ω + 85376η2ω2 + 21365ηω3 + 7661ω4minus111271η3 minus 54416η2ω minus 66684ηω2 minus 8742ω3 + 33701η2+11736ηω + 12503ω2 minus 1185 le 0

return minus2578835525η minus 1625056971ωelse abort

Over 1000 simulation runs we found 60 unsafe cases whenrunning the neural controller alone Importantly runningthe neural controller in tandem with the above verified andsynthesized program can prevent all unsafe neural decisionsThere were only with 65 interventions from the program onthe neural network Our results demonstrate that CEGIS isimportant to ensure a synthesized program is safe to use Wereport more details including synthesis times below

One may ask why we do not directly learn a deterministicprogram to control the device (without appealing to the neu-ral policy at all) using reinforcement learning to synthesizeits unknown parameters Even for an example as simple asthe inverted pendulum our discussion above shows that itis difficult (if not impossible) to derive a single straight-line(linear) program that is safe to control the system Even forscenarios in which a straight-line program suffices using ex-isting RL methods to directly learn unknown parameters inour sketches may still fail to obtain a safe program For exam-ple we considered if a linear control policy can be learned to(1) prevent the pendulum from falling down (2) and requirea pendulum control action to be strictly within the range[minus1 1] (eg operating the controller in an environment withlow power constraints) We found that despite many experi-ments on tuning learning rates and rewards directly traininga linear control program to conform to this restriction witheither reinforcement learning (eg policy gradient) or ran-dom search [29] was unsuccessful because of undesirableoverfitting In contrast neural networks work much betterfor these RL algorithms and can be used to guide the syn-thesis of a deterministic program policy Indeed by treatingthe neural policy as an oracle we were able to quickly dis-cover a straight-line linear deterministic program that in factsatisfies this additional motion constraint

Safety Verification Our verification results are given inTable 1 In the table Vars represents the number of variablesin a control system - this number serves as a proxy for ap-plication complexity Size the number of neurons in hiddenlayers of the network the Training time for the networkand its Failures the number of times the network failed tosatisfy the safety property in simulation The table also givesthe Size of a synthesized program in term of the numberof polices found by Algorithm 2 (used to generate condi-tional statements in the program) its Synthesis time theOverhead of our approach in terms of the additional cost

10

Table 1 Experimental Results on Deterministic Program Synthesis Verification and Shielding

Neural Network Deterministic Program as Shield PerformanceBenchmarks Vars Size Training Failures Size Synthesis Overhead Interventions NN ProgramSatellite 2 240 times 200 957s 0 1 160s 337 0 57 97DCMotor 3 240 times 200 944s 0 1 68s 203 0 119 122Tape 3 240 times 200 980s 0 1 42s 263 0 30 36Magnetic Pointer 3 240 times 200 992s 0 1 85s 292 0 83 88Suspension 4 240 times 200 960s 0 1 41s 871 0 47 61Biology 3 240 times 200 978s 0 1 168s 523 0 2464 2599DataCenter Cooling 3 240 times 200 968s 0 1 168s 469 0 146 401Quadcopter 2 300 times 200 990s 182 2 67s 641 185 72 98Pendulum 2 240 times 200 962s 60 3 1107s 965 65 442 586CartPole 4 300 times 200 990s 47 4 998s 562 1799 6813 19126Self-Driving 4 300 times 200 990s 61 1 185s 466 236 1459 5136Lane Keeping 4 240 times 200 895s 36 1 183s 865 64 3753 64354-Car platoon 8 500 times 400 times 300 1160s 8 4 609s 317 8 76 968-Car platoon 16 500 times 400 times 300 1165s 40 1 1217s 605 1080 385 554Oscillator 18 240 times 200 1023s 371 1 618s 2131 93703 6935 11353

(compared to the non-shielded variant) in running time touse a shield and the number of Interventions the numberof times the shield was invoked across all simulation runsWe also report performance gap between of a shield neuralpolicy (NN) and a purely programmatic policy (Program)in terms of the number of steps on average that a controlledsystem spends in reaching a steady state of the system (iea convergence state)

The first five benchmarks are linear time-invariant controlsystems adapted from [15] The safety property is that thereach set has to be within a safe rectangle Benchmark Biol-ogy defines a minimal model of glycemic control in diabeticpatients such that the dynamics of glucose and insulin inter-action in the blood system are defined by polynomials [10]For safety we verify that the neural controller ensures thatthe level of plasma glucose concentration is above a certainthreshold Benchmark DataCenter Cooling is a model of acollection of three server racks each with their own coolingdevices and they also shed heat to their neighbors The safetyproperty is that a learned controller must keep the data cen-ter below a certain temperature In these benchmarks thecost to query the network oracle constitutes the dominanttime for generating the safety shield in these benchmarksGiven the simplicity of these benchmarks the neural net-work controllers did not violate the safety condition in ourtrials and moreover there were no interventions from thesafety shields that affected performanceThe next three benchmarksQuadcopter (Inverted) Pen-

dulum and Cartpole are selected from classic control appli-cations and have more sophisticated safety conditions Wehave discussed the inverted pendulum example at lengthearlier The Quadcopter environment tests whether a con-trolled quadcopter can realize stable flight The environmentof Cartpole consists of a pole attached to an unactuated joint

connected to a cart that moves along a frictionless trackThe system is unsafe when the polersquos angle is more than30 from being upright or the cart moves by more than 03meters from the origin We observed safety violations in eachof these benchmarks that were eliminated using our verifi-cation methodology Notably the number of interventionsis remarkably low as a percentage of the overall number ofsimulation stepsBenchmark Self-driving defines a single car navigation

problem The neural controller is responsible for prevent-ing the car from veering into canals found on either side ofthe road Benchmark Lane Keeping models another safety-related car-driving problem The neural controller aims tomaintain a vehicle between lane markers and keep it cen-tered in a possibly curved lane The curvature of the road isconsidered as a disturbance input Environment disturbancesof such kind can be conservatively specified in our modelaccounting for noise and unmodeled dynamics Our verifi-cation approach supports these disturbances (verificationcondition (10)) Benchmarks n-Car platoon model multiple(n) vehicles forming a platoon maintaining a safe relativedistance among one another [39] Each of these benchmarksexhibited some number of violations that were remediatedby our verification methodology Benchmark Oscillator con-sists of a two-dimensional switched oscillator plus a 16-orderfilter The filter smoothens the input signals and has a sin-gle output signal We verify that the output signal is belowa safe threshold Because of the model complexity of thisbenchmark it exhibited significantly more violations thanthe others Indeed the neural-network controlled systemoften oscillated between the safe and unsafe boundary inmany runs Consequently the overhead in this benchmarkis high because a large number of shield interventions wasrequired to ensure safety In other words the synthesized

11

shield trades performance for safety to guarantee that thethreshold boundary is never violatedFor all benchmarks our tool successfully generated safe

interpretable deterministic programs and inductive invari-ants as shields When a neural controller takes an unsafeaction the synthesized shield correctly prevents this actionfrom executing by providing an alternative provable safeaction proposed by the verified deterministic program Interm of performance Table 1 shows that a shielded neuralpolicy is a feasible approach to drive a controlled system intoa steady state For each of the benchmarks studied the pro-grammatic policy is less performant than the shielded neuralpolicy sometimes by a factor of two or more (eg CartpoleSelf-Driving and Oscillator) Our result demonstrates thatexecuting a neural policy in tandem with a program distilledfrom it can retain performance provided by the neural policywhile maintaining safety provided by the verified program

Although our synthesis algorithm does not guarantee convergence to the global minimum when applied to nonconvex RL problems, the results given in Table 1 indicate that our algorithm can often produce high-quality control programs, with respect to a provided sketch, that converge reasonably fast. In some of our benchmarks, however, the number of interventions is significantly higher than the number of neural controller failures, e.g., 8-Car platoon and Oscillator. The high number of interventions is not primarily due to non-optimality of the synthesized programmatic controller; instead, inherent safety issues in the neural network models are the main culprit that triggers shield interventions. In 8-Car platoon, after a correction made by our deterministic program, the neural model again takes an unsafe action in the next execution step, so the deterministic program has to make another correction. It is only after applying a number of such shield interventions that the system navigates into a part of the state space that can be safely operated on by the neural model. For this benchmark, all system states at which shield interventions occur are indeed unsafe. We also examined the unsafe simulation runs made by executing the neural controller alone in Oscillator. Among the large number of shield interventions (as reflected in Table 1), 74% of them are effective and indeed prevent an unsafe neural decision. Ineffective interventions in Oscillator are due to the fact that, when optimizing equation (5), a large penalty is given to unsafe states, causing the synthesized programmatic policy to weigh safety more than proximity when there exist a large number of unsafe neural decisions.

Suitable Sketches. Providing a suitable sketch may require domain knowledge. To help the user more easily tune the shape of a sketch, our approach provides algorithmic support by not requiring conditional statements in a program sketch, as well as syntactic sugar: the user can simply provide an upper bound on the degree of an invariant sketch.
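To make this syntactic sugar concrete: a degree bound expands mechanically into the monomial basis b_i(X) of the invariant sketch E[c](X) = Σ_i c_i b_i(X) from equation (7). The helper below is a small illustration we provide here; it is not part of the tool's interface.

```python
from itertools import combinations_with_replacement

def monomial_basis(num_vars, max_degree):
    """All monomials over x_0 .. x_{num_vars-1} of total degree <= max_degree,
    each encoded as a tuple of per-variable exponents."""
    basis = [(0,) * num_vars]                       # the constant monomial
    for degree in range(1, max_degree + 1):
        for combo in combinations_with_replacement(range(num_vars), degree):
            exponents = [0] * num_vars
            for var in combo:
                exponents[var] += 1                 # multiply in one more variable
            basis.append(tuple(exponents))
    return basis

# A 2-variable system (e.g., the pendulum's angle and angular velocity) with a
# degree bound of 4 yields 15 monomials, i.e., 15 unknown coefficients c_i.
assert len(monomial_basis(2, 4)) == 15
```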

Table 2. Experimental Results on Tuning Invariant Degrees. TO means that an adequate inductive invariant cannot be found within 2 hours.

Benchmarks      Degree   Verification   Interventions   Overhead

Pendulum           2         TO               -             -
                   4        226s           40542          7.82%
                   8        236s           30787          8.79%

Self-Driving       2         TO               -             -
                   4         24s          128851          6.97%
                   8        251s          123671         26.85%

8-Car-platoon      2       1729s           43952          8.36%
                   4       5402s           37990          9.19%
                   8         TO               -             -

Our experimental results are collected using the invariant sketch defined in equation (7), and we chose an upper bound of 4 on the degree of all monomials included in the sketch. Recall that invariants may also be used as conditional predicates as part of a synthesized program. We adjust the invariant degree upper bound to evaluate its effect on our synthesis procedure. The results are given in Table 2.

Generally, high-degree invariants lead to fewer interventions because they tend to be more permissive than low-degree ones. However, high-degree invariants take more time to synthesize and verify. This is particularly true for high-dimension models such as 8-Car platoon. Moreover, although high-degree invariants tend to cause fewer interventions, they incur larger overhead. For example, using a shield of degree 8 in the Self-Driving benchmark caused an overhead of 26.85%. This is because high-degree polynomial computations are time-consuming. On the other hand, an insufficient degree upper bound may not be permissive enough to obtain a valid invariant. It is therefore essential to consider the tradeoff between overhead and permissiveness when choosing the degree of an invariant.
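The tradeoff can be quantified by counting unknowns: an invariant sketch over n state variables with degree bound d has one coefficient per monomial of total degree at most d, i.e.

\[
\binom{n+d}{d} \ \text{coefficients}, \qquad \text{e.g.}\ \binom{16+4}{4} = 4845 \ \text{but} \ \binom{16+8}{8} = 735471 \ \text{for the 16-variable 8-Car platoon},
\]

which is consistent with the degree-8 timeout for 8-Car platoon in Table 2 and with the growing runtime cost of evaluating the shield's invariant at every step.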

Handling Environment Changes. We consider the effectiveness of our tool when previously trained neural network controllers are deployed in environment contexts different from the environment used for training. Here we consider neural network controllers of larger size (two hidden layers with 1200 × 900 neurons) than in the above experiments. This is because we want to ensure that a neural policy is trained to be near optimal in the environment context used for training. These larger networks were in general more difficult to train, requiring at least 1500 seconds to converge. Our results are summarized in Table 3. When the underlying environment slightly changes, learning a new safety shield takes substantially less time than training a new network. For Cartpole, we simulated the trained controller in a new environment by increasing the length of the pole by 0.15 meters. The neural controller failed 3 times in our 1000-episode simulation; the shield interfered with the network's operation only 8 times to prevent these unsafe behaviors.


Table 3. Experimental Results on Handling Environment Changes.

                                                        Neural Network            Deterministic Program as Shield
Benchmarks     Environment Change                       Size          Failures    Size   Synthesis   Overhead   Shield Interventions
Cartpole       Increased pole length by 0.15m           1200 × 900    3           1      239s        2.91%      8
Pendulum       Increased pendulum mass by 0.3kg         1200 × 900    77          1      581s        8.11%      8748
Pendulum       Increased pendulum length by 0.15m       1200 × 900    76          1      483s        6.53%      7060
Self-driving   Added an obstacle that must be avoided   1200 × 900    203         1      392s        8.42%      108320

The new shield was synthesized in 239s, significantly faster than retraining a new neural network for the new environment. For (Inverted) Pendulum, we deployed the trained neural network in an environment in which the pendulum's mass is increased by 0.3kg. The neural controller exhibits noticeably higher failure rates than in the previous experiment; we were able to synthesize a safety shield adapted to this new environment in 581 seconds that prevented these violations. The shield intervened with the operation of the network only 8.7 times per episode on average. Similar results were observed when we increased the pendulum's length by 0.15m. For Self-driving, we additionally required the car to avoid an obstacle. The synthesized shield provided safe actions to ensure collision-free motion.

6 Related Work
Verification of Neural Networks. While neural networks have historically found only limited application in safety- and security-related contexts, recent advances have defined suitable verification methodologies that are capable of providing stronger assurance guarantees. For example, Reluplex [23] is an SMT solver that supports linear real arithmetic with ReLU constraints and has been used to verify safety properties in a collision avoidance system. AI2 [17, 31] is an abstract interpretation tool that can reason about robustness specifications for deep convolutional networks. Robustness verification is also considered in [8, 21]. Systematic testing techniques such as [43, 45, 48] are designed to automatically generate test cases to increase a coverage metric, i.e., explore different parts of the neural network architecture by generating test inputs that maximize the number of activated neurons. These approaches ignore effects induced by the actual environment in which the network is deployed, and they are typically ineffective for safety verification of a neural network controller over an infinite time horizon with complex system dynamics. Unlike these whitebox efforts that reason over the architecture of the network, our verification framework is fully blackbox, using a network only as an oracle to learn a simpler deterministic program. This gives us the ability to reason about safety entirely in terms of the synthesized program, using a shielding mechanism derived from the program to ensure that the neural policy can only explore safe regions.

Safe Reinforcement Learning. The machine learning community has explored various techniques to develop safe reinforcement learning algorithms in contexts similar to ours, e.g., [32, 46]. In some cases, verification is used as a safety validation oracle [2, 11]. In contrast to these efforts, our inductive synthesis algorithm can be freely incorporated into any existing reinforcement learning framework and applied to any legacy machine learning model. More importantly, our design allows us to synthesize new shields from existing ones without requiring retraining of the network under reasonable changes to the environment.

Syntax-Guided Synthesis for Machine Learning. An important reason underlying the success of program synthesis is that sketches of the desired program [41, 42], often provided along with example data, can be used to effectively restrict a search space and allow users to provide additional insight about the desired output [5]. Our sketch-based inductive synthesis approach is a realization of this idea, applicable to continuous and infinite-state control systems central to many machine learning tasks. A high-level policy language grammar is used in our approach to constrain the shape of possible synthesized deterministic policy programs. Our program parameters in sketches are continuous because they naturally fit continuous control problems. However, our synthesis algorithm (line 9 of Algorithm 2) can call a derivative-free random search method (which does not require the gradient or its finite-difference approximation) to synthesize functions whose domain is disconnected, (mixed-)integer, or non-smooth. Our synthesizer thus may be applied to other domains that are not continuous or differentiable, like most modern program synthesis algorithms [41, 42].

In a similar vein, recent efforts on interpretable machine learning generate interpretable models such as program code [47] or decision trees [9] as output, after which traditional symbolic verification techniques can be leveraged to prove program properties. Our approach novelly ensures that only safe programs are synthesized, via a CEGIS procedure, and can provide safety guarantees on the original high-performing neural networks via invariant inference. A detailed comparison was provided on page 3.

Controller Synthesis. Random search [30] has long been utilized as an effective approach for controller synthesis. In [29], several extensions of random search have been proposed that substantially improve its performance when applied to reinforcement learning.


However, [29] learns a single linear controller that, in our experience, is insufficient to guarantee safety for our benchmarks. We often need to learn a program that involves a family of linear controllers conditioned on different input spaces to guarantee safety. The novelty of our approach is that it can automatically synthesize programs with conditional statements by need, which is critical to ensure safety.

In Sec. 5, linear sketches were used in our experiments for shield synthesis. LQR-tree based approaches such as [44] can synthesize a controller consisting of a set of linear quadratic regulator (LQR [14]) controllers, each applicable in a different portion of the state space. These approaches focus on stability, while our approach primarily addresses safety. In our experiments, we observed that because LQR does not take safe/unsafe regions into consideration, synthesized LQR controllers can regularly violate safety constraints.

Runtime Shielding. Shield synthesis has been used in runtime enforcement for reactive systems [12, 25]. Safety shielding in reinforcement learning was first introduced in [4]. Because their verification approach is not symbolic, however, it can only work over finite discrete state and action systems. Since states in infinite-state and continuous-action systems are not enumerable, using these methods requires the end user to provide a finite abstraction of complex environment dynamics; such abstractions are typically too coarse to be useful (because they result in too much over-approximation) or otherwise have too many states to be analyzable [15]. Indeed, automated environment abstraction tools such as [16] often take hours even for simple 4-dimensional systems. In general, these efforts focused on shielding of discrete and finite systems, with no obvious generalization to effectively deal with the continuous and infinite systems that our approach can monitor and protect. Our approach embraces the nature of infinity in control systems by learning a symbolic shield from an inductive invariant for a synthesized program; the invariant includes an infinite number of environment states under which the program can be guaranteed to be safe. We therefore believe our framework provides a more promising verification pathway for machine learning in high-dimensional control systems. Shielding in continuous control systems was introduced in [3, 18]. These approaches are based on HJ reachability (Hamilton-Jacobi reachability [7]). However, current formulations of HJ reachability are limited to systems with approximately five dimensions, making high-dimensional system verification intractable. Unlike their least restrictive control law framework that decouples safety and learning, our approach only considers state spaces that a learned controller attempts to explore, and is therefore capable of synthesizing safety shields even in high dimensions.

7 Conclusions
This paper presents an inductive synthesis-based toolchain that can verify neural network control policies within an environment formalized as an infinite-state transition system.

The key idea is a novel synthesis framework capable of synthesizing a deterministic policy program, based on a user-given sketch, that resembles a neural policy oracle and simultaneously satisfies a safety specification, using a counterexample-guided synthesis loop. Verification soundness is achieved at runtime by using the learned deterministic program, along with its learned inductive invariant, as a shield to protect the neural policy, correcting the neural policy's action only if the chosen action can cause violation of the inductive invariant. Experimental results demonstrate that our approach can be used to realize fully trustworthy reinforcement learning systems with low overhead.

References
[1] M. Ahmadi, C. Rowley, and U. Topcu. 2018. Control-Oriented Learning of Lagrangian and Hamiltonian Systems. In American Control Conference.
[2] Anayo K. Akametalu, Shahab Kaynama, Jaime F. Fisac, Melanie Nicole Zeilinger, Jeremy H. Gillula, and Claire J. Tomlin. 2014. Reachability-based safe learning with Gaussian processes. In 53rd IEEE Conference on Decision and Control, CDC 2014, 1424–1431.
[3] Anayo K. Akametalu, Shahab Kaynama, Jaime F. Fisac, Melanie Nicole Zeilinger, Jeremy H. Gillula, and Claire J. Tomlin. 2014. Reachability-based safe learning with Gaussian processes. In 53rd IEEE Conference on Decision and Control, CDC 2014, Los Angeles, CA, USA, December 15-17, 2014, 1424–1431.
[4] Mohammed Alshiekh, Roderick Bloem, Rüdiger Ehlers, Bettina Könighofer, Scott Niekum, and Ufuk Topcu. 2018. Safe Reinforcement Learning via Shielding. AAAI (2018).
[5] Rajeev Alur, Rastislav Bodík, Eric Dallal, Dana Fisman, Pranav Garg, Garvit Juniwal, Hadas Kress-Gazit, P. Madhusudan, Milo M. K. Martin, Mukund Raghothaman, Shambwaditya Saha, Sanjit A. Seshia, Rishabh Singh, Armando Solar-Lezama, Emina Torlak, and Abhishek Udupa. 2015. Syntax-Guided Synthesis. In Dependable Software Systems Engineering, NATO Science for Peace and Security Series D: Information and Communication Security, Vol. 40. IOS Press, 1–25.
[6] MOSEK ApS. 2018. The MOSEK optimization software. http://www.mosek.com

[7] Somil Bansal, Mo Chen, Sylvia Herbert, and Claire J. Tomlin. 2017. Hamilton-Jacobi Reachability: A Brief Overview and Recent Advances. (2017).
[8] Osbert Bastani, Yani Ioannou, Leonidas Lampropoulos, Dimitrios Vytiniotis, Aditya V. Nori, and Antonio Criminisi. 2016. Measuring Neural Net Robustness with Constraints. In Proceedings of the 30th International Conference on Neural Information Processing Systems (NIPS'16), 2621–2629.
[9] Osbert Bastani, Yewen Pu, and Armando Solar-Lezama. 2018. Verifiable Reinforcement Learning via Policy Extraction. CoRR abs/1805.08328 (2018).
[10] Richard N. Bergman, Diane T. Finegood, and Marilyn Ader. 1985. Assessment of insulin sensitivity in vivo. Endocrine Reviews 6, 1 (1985), 45–86.
[11] Felix Berkenkamp, Matteo Turchetta, Angela P. Schoellig, and Andreas Krause. 2017. Safe Model-based Reinforcement Learning with Stability Guarantees. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 908–919.
[12] Roderick Bloem, Bettina Könighofer, Robert Könighofer, and Chao Wang. 2015. Shield Synthesis - Runtime Enforcement for Reactive Systems. In Tools and Algorithms for the Construction and Analysis of Systems - 21st International Conference, TACAS 2015, 533–548.
[13] Leonardo De Moura and Nikolaj Bjørner. 2008. Z3: An Efficient SMT Solver. In Proceedings of the Theory and Practice of Software, 14th International Conference on Tools and Algorithms for the Construction and Analysis of Systems (TACAS'08/ETAPS'08), 337–340.

[14] Peter Dorato, Vito Cerone, and Chaouki Abdallah. 1994. Linear-Quadratic Control: An Introduction. Simon & Schuster, Inc., New York, NY, USA.
[15] Chuchu Fan, Umang Mathur, Sayan Mitra, and Mahesh Viswanathan. 2018. Controller Synthesis Made Real: Reach-Avoid Specifications and Linear Dynamics. In Computer Aided Verification - 30th International Conference, CAV 2018, 347–366.
[16] Ioannis Filippidis, Sumanth Dathathri, Scott C. Livingston, Necmiye Ozay, and Richard M. Murray. 2016. Control design for hybrid systems with TuLiP: The Temporal Logic Planning toolbox. In 2016 IEEE Conference on Control Applications, CCA 2016, 1030–1041.
[17] Timon Gehr, Matthew Mirman, Dana Drachsler-Cohen, Petar Tsankov, Swarat Chaudhuri, and Martin T. Vechev. 2018. AI2: Safety and Robustness Certification of Neural Networks with Abstract Interpretation. In 2018 IEEE Symposium on Security and Privacy, SP 2018, 3–18.
[18] Jeremy H. Gillula and Claire J. Tomlin. 2012. Guaranteed Safe Online Learning via Reachability: tracking a ground target using a quadrotor. In IEEE International Conference on Robotics and Automation, ICRA 2012, 14-18 May 2012, St. Paul, Minnesota, USA, 2723–2730.
[19] Sumit Gulwani, Saurabh Srivastava, and Ramarathnam Venkatesan. 2008. Program Analysis As Constraint Solving. In Proceedings of the 29th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '08), 281–292.
[20] Sumit Gulwani and Ashish Tiwari. 2008. Constraint-based approach for analysis of hybrid systems. In International Conference on Computer Aided Verification, 190–203.
[21] Xiaowei Huang, Marta Kwiatkowska, Sen Wang, and Min Wu. 2017. Safety Verification of Deep Neural Networks. In CAV (1) (Lecture Notes in Computer Science), Vol. 10426, 3–29.
[22] Dominic W. Jordan and Peter Smith. 1987. Nonlinear ordinary differential equations. (1987).
[23] Guy Katz, Clark W. Barrett, David L. Dill, Kyle Julian, and Mykel J. Kochenderfer. 2017. Reluplex: An Efficient SMT Solver for Verifying Deep Neural Networks. In Computer Aided Verification - 29th International Conference, CAV 2017, 97–117.
[24] Hui Kong, Fei He, Xiaoyu Song, William N. N. Hung, and Ming Gu. 2013. Exponential-Condition-Based Barrier Certificate Generation for Safety Verification of Hybrid Systems. In Proceedings of the 25th International Conference on Computer Aided Verification (CAV'13), 242–257.
[25] Bettina Könighofer, Mohammed Alshiekh, Roderick Bloem, Laura Humphrey, Robert Könighofer, Ufuk Topcu, and Chao Wang. 2017. Shield Synthesis. Form. Methods Syst. Des. 51, 2 (Nov. 2017), 332–361.
[26] Benoît Legat, Chris Coey, Robin Deits, Joey Huchette, and Amelia Perry. 2017 (accessed Nov. 16, 2018). Sum-of-squares optimization in Julia. https://www.juliaopt.org/meetings/mit2017/legat.pdf

[27] Yuxi Li. 2017. Deep Reinforcement Learning: An Overview. CoRR abs/1701.07274 (2017).
[28] Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. 2016. Continuous control with deep reinforcement learning. In 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings.
[29] Horia Mania, Aurelia Guy, and Benjamin Recht. 2018. Simple random search of static linear policies is competitive for reinforcement learning. In Advances in Neural Information Processing Systems 31, 1800–1809.
[30] J. Matyas. 1965. Random optimization. Automation and Remote Control 26, 2 (1965), 246–253.
[31] Matthew Mirman, Timon Gehr, and Martin T. Vechev. 2018. Differentiable Abstract Interpretation for Provably Robust Neural Networks. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, 3575–3583.
[32] Teodor Mihai Moldovan and Pieter Abbeel. 2012. Safe Exploration in Markov Decision Processes. In Proceedings of the 29th International Conference on International Conference on Machine Learning (ICML'12), 1451–1458.
[33] Yurii Nesterov and Vladimir Spokoiny. 2017. Random Gradient-Free Minimization of Convex Functions. Found. Comput. Math. 17, 2 (2017), 527–566.
[34] Stephen Prajna and Ali Jadbabaie. 2004. Safety Verification of Hybrid Systems Using Barrier Certificates. In Hybrid Systems: Computation and Control, 7th International Workshop, HSCC 2004, 477–492.
[35] Aravind Rajeswaran, Kendall Lowrey, Emanuel Todorov, and Sham M. Kakade. 2017. Towards Generalization and Simplicity in Continuous Control. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 6553–6564.
[36] Stéphane Ross, Geoffrey J. Gordon, and Drew Bagnell. 2011. A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2011, 627–635.
[37] Wenjie Ruan, Xiaowei Huang, and Marta Kwiatkowska. 2018. Reachability Analysis of Deep Neural Networks with Provable Guarantees. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI 2018, 2651–2659.
[38] Stefan Schaal. 1999. Is imitation learning the route to humanoid robots? Trends in Cognitive Sciences 3, 6 (1999), 233–242.
[39] Bastian Schürmann and Matthias Althoff. 2017. Optimal control of sets of solutions to formally guarantee constraints of disturbed linear systems. In 2017 American Control Conference, ACC 2017, 2522–2529.
[40] David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Riedmiller. 2014. Deterministic Policy Gradient Algorithms. In Proceedings of the 31st International Conference on International Conference on Machine Learning - Volume 32, I-387–I-395.

[41] Armando Solar-Lezama. 2008. Program Synthesis by Sketching. Ph.D. Dissertation. Berkeley, CA, USA. Advisor(s): Bodik, Rastislav.
[42] Armando Solar-Lezama. 2009. The Sketching Approach to Program Synthesis. In Proceedings of the 7th Asian Symposium on Programming Languages and Systems (APLAS '09), 4–13.
[43] Youcheng Sun, Min Wu, Wenjie Ruan, Xiaowei Huang, Marta Kwiatkowska, and Daniel Kroening. 2018. Concolic Testing for Deep Neural Networks. In Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering (ASE 2018), 109–119.
[44] Russ Tedrake. 2009. LQR-trees: Feedback motion planning on sparse randomized trees. In Robotics: Science and Systems V, University of Washington, Seattle, USA, June 28 - July 1, 2009.
[45] Yuchi Tian, Kexin Pei, Suman Jana, and Baishakhi Ray. 2018. DeepTest: Automated Testing of Deep-neural-network-driven Autonomous Cars. In Proceedings of the 40th International Conference on Software Engineering (ICSE '18), 303–314.
[46] Matteo Turchetta, Felix Berkenkamp, and Andreas Krause. 2016. Safe Exploration in Finite Markov Decision Processes with Gaussian Processes. In Proceedings of the 30th International Conference on Neural Information Processing Systems (NIPS'16), 4312–4320.
[47] Abhinav Verma, Vijayaraghavan Murali, Rishabh Singh, Pushmeet Kohli, and Swarat Chaudhuri. 2018. Programmatically Interpretable Reinforcement Learning. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, 5052–5061.

[48] Matthew Wicker, Xiaowei Huang, and Marta Kwiatkowska. 2018. Feature-Guided Black-Box Safety Testing of Deep Neural Networks. In Tools and Algorithms for the Construction and Analysis of Systems - 24th International Conference, TACAS 2018, 408–426.
[49] He Zhu, Zikang Xiong, Stephen Magill, and Suresh Jagannathan. 2018. An Inductive Synthesis Framework for Verifiable Reinforcement Learning. Technical Report. Galois, Inc. http://herowanzhu.github.io/herowanzhu.github.io/VerifiableLearning.pdf



The first five benchmarks are linear time-invariant controlsystems adapted from [15] The safety property is that thereach set has to be within a safe rectangle Benchmark Biol-ogy defines a minimal model of glycemic control in diabeticpatients such that the dynamics of glucose and insulin inter-action in the blood system are defined by polynomials [10]For safety we verify that the neural controller ensures thatthe level of plasma glucose concentration is above a certainthreshold Benchmark DataCenter Cooling is a model of acollection of three server racks each with their own coolingdevices and they also shed heat to their neighbors The safetyproperty is that a learned controller must keep the data cen-ter below a certain temperature In these benchmarks thecost to query the network oracle constitutes the dominanttime for generating the safety shield in these benchmarksGiven the simplicity of these benchmarks the neural net-work controllers did not violate the safety condition in ourtrials and moreover there were no interventions from thesafety shields that affected performanceThe next three benchmarksQuadcopter (Inverted) Pen-

dulum and Cartpole are selected from classic control appli-cations and have more sophisticated safety conditions Wehave discussed the inverted pendulum example at lengthearlier The Quadcopter environment tests whether a con-trolled quadcopter can realize stable flight The environmentof Cartpole consists of a pole attached to an unactuated joint

connected to a cart that moves along a frictionless trackThe system is unsafe when the polersquos angle is more than30 from being upright or the cart moves by more than 03meters from the origin We observed safety violations in eachof these benchmarks that were eliminated using our verifi-cation methodology Notably the number of interventionsis remarkably low as a percentage of the overall number ofsimulation stepsBenchmark Self-driving defines a single car navigation

problem The neural controller is responsible for prevent-ing the car from veering into canals found on either side ofthe road Benchmark Lane Keeping models another safety-related car-driving problem The neural controller aims tomaintain a vehicle between lane markers and keep it cen-tered in a possibly curved lane The curvature of the road isconsidered as a disturbance input Environment disturbancesof such kind can be conservatively specified in our modelaccounting for noise and unmodeled dynamics Our verifi-cation approach supports these disturbances (verificationcondition (10)) Benchmarks n-Car platoon model multiple(n) vehicles forming a platoon maintaining a safe relativedistance among one another [39] Each of these benchmarksexhibited some number of violations that were remediatedby our verification methodology Benchmark Oscillator con-sists of a two-dimensional switched oscillator plus a 16-orderfilter The filter smoothens the input signals and has a sin-gle output signal We verify that the output signal is belowa safe threshold Because of the model complexity of thisbenchmark it exhibited significantly more violations thanthe others Indeed the neural-network controlled systemoften oscillated between the safe and unsafe boundary inmany runs Consequently the overhead in this benchmarkis high because a large number of shield interventions wasrequired to ensure safety In other words the synthesized

11

shield trades performance for safety to guarantee that thethreshold boundary is never violatedFor all benchmarks our tool successfully generated safe

interpretable deterministic programs and inductive invari-ants as shields When a neural controller takes an unsafeaction the synthesized shield correctly prevents this actionfrom executing by providing an alternative provable safeaction proposed by the verified deterministic program Interm of performance Table 1 shows that a shielded neuralpolicy is a feasible approach to drive a controlled system intoa steady state For each of the benchmarks studied the pro-grammatic policy is less performant than the shielded neuralpolicy sometimes by a factor of two or more (eg CartpoleSelf-Driving and Oscillator) Our result demonstrates thatexecuting a neural policy in tandem with a program distilledfrom it can retain performance provided by the neural policywhile maintaining safety provided by the verified program

Although our synthesis algorithm does not guarantee con-vergence to the global minimum when applied to nonconvexRL problems the results given in Table 1 indicate that ouralgorithm can often produce high-quality control programswith respect to a provided sketch that converge reasonablyfast In some of our benchmarks however the number ofinterventions are significantly higher than the number ofneural controller failures eg 8-Car platoon and OscillatorHowever the high number of interventions is not primar-ily because of non-optimality of the synthesized program-matic controller Instead inherent safety issues in the neuralnetwork models are the main culprit that triggers shieldinterventions In 8-Car platoon after corrections made byour deterministic program the neural model again takes anunsafe action in the next execution step so that the deter-ministic program has to make another correction It is onlyafter applying a number of such shield interventions that thesystem navigates into a part of the state space that can besafely operated on by the neural model For this benchmarkall system states where there occur shield interventions areindeed unsafe We also examined the unsafe simulation runsmade by executing the neural controller alone in OscillatorAmong the large number of shield interventions (as reflectedin Table 1) 74 of them are effective and indeed prevent anunsafe neural decision Ineffective interventions inOscillatorare due to the fact that when optimizing equation (5) a largepenalty is given to unsafe states causing the synthesizedprogrammatic policy to weigh safety more than proximitywhen there exist a large number of unsafe neural decisions

Suitable Sketches. Providing a suitable sketch may require domain knowledge. To help the user more easily tune the shape of a sketch, our approach provides algorithmic support (conditional statements need not be given in a program sketch) and syntactic sugar (the user can simply provide an upper bound on the degree of an invariant sketch).
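
For example, given only a degree bound, the monomial basis of the invariant sketch in equation (7) can be enumerated mechanically; the helper below is a minimal sketch of that expansion (in our tool the resulting template is handed to a constraint solver, which is not shown here).

from itertools import combinations_with_replacement

def monomial_basis(num_vars, degree_bound):
    """All monomials of degree <= degree_bound over num_vars variables.
    Each monomial is a tuple of variable indices, e.g. (0, 0, 1) denotes x0^2 * x1;
    the empty tuple is the constant term."""
    basis = []
    for degree in range(degree_bound + 1):
        basis.extend(combinations_with_replacement(range(num_vars), degree))
    return basis

# A 2-variable sketch with degree bound 4, as used for the inverted pendulum, has 15 such terms.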

Table 2. Experimental Results on Tuning Invariant Degrees. TO means that an adequate inductive invariant cannot be found within 2 hours.

Benchmarks       Degree   Verification   Interventions   Overhead
Pendulum         2        TO             -               -
                 4        226s           40542           782
                 8        236s           30787           879
Self-Driving     2        TO             -               -
                 4        24s            128851          697
                 8        251s           123671          2685
8-Car-platoon    2        1729s          43952           836
                 4        5402s          37990           919
                 8        TO             -               -

Our experimental results are collected using the invariant sketch defined in equation (7), and we chose an upper bound of 4 on the degree of all monomials included in the sketch. Recall that invariants may also be used as conditional predicates as part of a synthesized program. We adjust the invariant degree upper bound to evaluate its effect in our synthesis procedure. The results are given in Table 2.

Generally, high-degree invariants lead to fewer interventions because they tend to be more permissive than low-degree ones. However, high-degree invariants take more time to synthesize and verify. This is particularly true for high-dimensional models such as 8-Car platoon. Moreover, although high-degree invariants tend to have fewer interventions, they have larger overhead. For example, using a shield of degree 8 in the Self-Driving benchmark caused an overhead of 2685. This is because high-degree polynomial computations are time-consuming. On the other hand, an insufficient degree upper bound may not be permissive enough to obtain a valid invariant. It is therefore essential to consider the tradeoff between overhead and permissiveness when choosing the degree of an invariant.
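
A back-of-the-envelope count (an illustration of the scaling, not a measurement from our experiments) shows why the bound matters: the number of monomials of degree at most d over n variables is C(n+d, d), so a sketch for the 16-variable 8-Car platoon grows from 4845 candidate terms at degree 4 to 735471 at degree 8, inflating both the constraint-solving effort and the cost of evaluating the invariant at runtime.

from math import comb

def sketch_terms(num_vars, degree_bound):
    """Number of monomials of degree <= degree_bound over num_vars variables."""
    return comb(num_vars + degree_bound, degree_bound)

print(sketch_terms(16, 4))  # 4845
print(sketch_terms(16, 8))  # 735471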

Handling Environment Changes. We consider the effectiveness of our tool when previously trained neural network controllers are deployed in environment contexts different from the environment used for training. Here we consider neural network controllers of larger size (two hidden layers with 1200 × 900 neurons) than in the above experiments. This is because we want to ensure that a neural policy is trained to be near optimal in the environment context used for training. These larger networks were in general more difficult to train, requiring at least 1500 seconds to converge. Our results are summarized in Table 3. When the underlying environment slightly changes, learning a new safety shield takes substantially shorter time than training a new network. For Cartpole, we simulated the trained controller in a new environment by increasing the length of the pole by 0.15 meters. The neural controller failed 3 times in our 1000-episode simulation; the shield interfered with the network's operation only 8 times to prevent these unsafe behaviors.


Table 3. Experimental Results on Handling Environment Changes

                                                          Neural Network         Deterministic Program as Shield
Benchmarks     Environment Change                         Size        Failures   Size   Synthesis   Overhead   Shield Interventions
Cartpole       Increased pole length by 0.15m             1200 × 900  3          1      239s        291        8
Pendulum       Increased pendulum mass by 0.3kg           1200 × 900  77         1      581s        811        8748
Pendulum       Increased pendulum length by 0.15m         1200 × 900  76         1      483s        653        7060
Self-driving   Added an obstacle that must be avoided     1200 × 900  203        1      392s        842        108320

The new shield was synthesized in 239s, significantly faster than retraining a new neural network for the new environment. For (Inverted) Pendulum, we deployed the trained neural network in an environment in which the pendulum's mass is increased by 0.3kg. The neural controller exhibits noticeably higher failure rates than in the previous experiment; we were able to synthesize a safety shield adapted to this new environment in 581 seconds that prevented these violations. The shield intervened with the operation of the network only about 8.7 times per episode. Similar results were observed when we increased the pendulum's length by 0.15m. For Self-driving, we additionally required the car to avoid an obstacle. The synthesized shield provided safe actions to ensure collision-free motion.
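
Operationally, adapting to such a change amounts to rerunning the synthesis-and-verification loop against the modified environment model while reusing the existing sketch; the outline below is a schematic of that workflow, where synthesize and verify are hypothetical stand-ins for Algorithms 1 and 2 rather than a published API.

def adapt_shield(new_env, neural_policy, program_sketch, synthesize, verify):
    """Re-synthesize a shield after an environment change without retraining the network.
    Returns a (program, invariant) pair verified safe in new_env, or raises if no
    adequate inductive invariant is found."""
    program = synthesize(neural_policy, program_sketch, new_env)   # distill the existing policy
    invariant = verify(new_env, program)                           # attempt an inductive safety proof
    if invariant is None:
        raise RuntimeError("no inductive invariant found; adjust the sketch or the degree bound")
    return program, invariant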

6 Related Work
Verification of Neural Networks. While neural networks have historically found only limited application in safety- and security-related contexts, recent advances have defined suitable verification methodologies that are capable of providing stronger assurance guarantees. For example, Reluplex [23] is an SMT solver that supports linear real arithmetic with ReLU constraints and has been used to verify safety properties in a collision avoidance system. AI2 [17, 31] is an abstract interpretation tool that can reason about robustness specifications for deep convolutional networks. Robustness verification is also considered in [8, 21]. Systematic testing techniques such as [43, 45, 48] are designed to automatically generate test cases to increase a coverage metric, i.e., explore different parts of the neural network architecture by generating test inputs that maximize the number of activated neurons. These approaches ignore effects induced by the actual environment in which the network is deployed. They are typically ineffective for safety verification of a neural network controller over an infinite time horizon with complex system dynamics. Unlike these whitebox efforts that reason over the architecture of the network, our verification framework is fully blackbox, using a network only as an oracle to learn a simpler deterministic program. This gives us the ability to reason about safety entirely in terms of the synthesized program, using a shielding mechanism derived from the program to ensure that the neural policy can only explore safe regions.

Safe Reinforcement Learning. The machine learning community has explored various techniques to develop safe reinforcement learning algorithms in contexts similar to ours, e.g., [32, 46]. In some cases, verification is used as a safety validation oracle [2, 11]. In contrast to these efforts, our inductive synthesis algorithm can be freely incorporated into any existing reinforcement learning framework and applied to any legacy machine learning model. More importantly, our design allows us to synthesize new shields from existing ones without requiring retraining of the network under reasonable changes to the environment.

Syntax-Guided Synthesis for Machine Learning. An important reason underlying the success of program synthesis is that sketches of the desired program [41, 42], often provided along with example data, can be used to effectively restrict a search space and allow users to provide additional insight about the desired output [5]. Our sketch-based inductive synthesis approach is a realization of this idea, applicable to the continuous and infinite-state control systems central to many machine learning tasks. A high-level policy language grammar is used in our approach to constrain the shape of possible synthesized deterministic policy programs. Our program parameters in sketches are continuous because they naturally fit continuous control problems. However, our synthesis algorithm (line 9 of Algorithm 2) can call a derivative-free random search method (which does not require the gradient or its finite-difference approximation) for synthesizing functions whose domain is disconnected, (mixed-)integer, or non-smooth. Our synthesizer may thus be applied, like most modern program synthesis algorithms [41, 42], to other domains that are not continuous or differentiable.

In a similar vein, recent efforts on interpretable machine learning generate interpretable models such as program code [47] or decision trees [9] as output, after which traditional symbolic verification techniques can be leveraged to prove program properties. Our approach is novel in ensuring that only safe programs are synthesized, via a CEGIS procedure, and in providing safety guarantees on the original high-performing neural networks via invariant inference. A detailed comparison was provided on page 3.

Controller Synthesis. Random search [30] has long been utilized as an effective approach for controller synthesis. In [29], several extensions of random search have been proposed that substantially improve its performance when applied to reinforcement learning. However, [29] learns a single linear controller, which in our experience we found to be insufficient to guarantee safety for our benchmarks. We often need to learn a program that involves a family of linear controllers conditioned on different input spaces to guarantee safety. The novelty of our approach is that it can automatically synthesize programs with conditional statements as needed, which is critical to ensure safety.

In Sec. 5, linear sketches were used in our experiments for shield synthesis. LQR-Tree based approaches such as [44] can synthesize a controller consisting of a set of linear quadratic regulator (LQR [14]) controllers, each applicable in a different portion of the state space. These approaches focus on stability, while our approach primarily addresses safety. In our experiments, we observed that because LQR does not take safe/unsafe regions into consideration, synthesized LQR controllers can regularly violate safety constraints.

Runtime Shielding. Shield synthesis has been used in runtime enforcement for reactive systems [12, 25]. Safety shielding in reinforcement learning was first introduced in [4]. Because their verification approach is not symbolic, however, it can only work over finite discrete state and action systems. Since states in infinite state and continuous action systems are not enumerable, using these methods requires the end user to provide a finite abstraction of complex environment dynamics; such abstractions are typically too coarse to be useful (because they result in too much over-approximation) or otherwise have too many states to be analyzable [15]. Indeed, automated environment abstraction tools such as [16] often take hours even for simple 4-dimensional systems. Our approach embraces the nature of infinity in control systems by learning a symbolic shield from an inductive invariant for a synthesized program that includes an infinite number of environment states under which the program can be guaranteed to be safe. We therefore believe our framework provides a more promising verification pathway for machine learning in high-dimensional control systems. In general, these efforts focused on shielding of discrete and finite systems, with no obvious generalization to effectively deal with the continuous and infinite systems that our approach can monitor and protect. Shielding in continuous control systems was introduced in [3, 18]. These approaches are based on HJ reachability (Hamilton-Jacobi Reachability [7]). However, current formulations of HJ reachability are limited to systems with approximately five dimensions, making high-dimensional system verification intractable. Unlike their least restrictive control law framework that decouples safety and learning, our approach only considers state spaces that a learned controller attempts to explore, and is therefore capable of synthesizing safety shields even in high dimensions.

7 Conclusions
This paper presents an inductive synthesis-based toolchain that can verify neural network control policies within an environment formalized as an infinite-state transition system. The key idea is a novel synthesis framework capable of synthesizing a deterministic policy program, based on a user-given sketch, that resembles a neural policy oracle and simultaneously satisfies a safety specification, using a counterexample-guided synthesis loop. Verification soundness is achieved at runtime by using the learned deterministic program, along with its learned inductive invariant, as a shield to protect the neural policy, correcting the neural policy's action only if the chosen action can cause violation of the inductive invariant. Experimental results demonstrate that our approach can be used to realize fully trustworthy reinforcement learning systems with low overhead.

References
[1] M. Ahmadi, C. Rowley, and U. Topcu. 2018. Control-Oriented Learning of Lagrangian and Hamiltonian Systems. In American Control Conference.
[2] Anayo K. Akametalu, Shahab Kaynama, Jaime F. Fisac, Melanie Nicole Zeilinger, Jeremy H. Gillula, and Claire J. Tomlin. 2014. Reachability-based safe learning with Gaussian processes. In 53rd IEEE Conference on Decision and Control, CDC 2014. 1424–1431.
[3] Anayo K. Akametalu, Shahab Kaynama, Jaime F. Fisac, Melanie Nicole Zeilinger, Jeremy H. Gillula, and Claire J. Tomlin. 2014. Reachability-based safe learning with Gaussian processes. In 53rd IEEE Conference on Decision and Control, CDC 2014, Los Angeles, CA, USA, December 15-17, 2014. 1424–1431.
[4] Mohammed Alshiekh, Roderick Bloem, Rüdiger Ehlers, Bettina Könighofer, Scott Niekum, and Ufuk Topcu. 2018. Safe Reinforcement Learning via Shielding. AAAI (2018).
[5] Rajeev Alur, Rastislav Bodík, Eric Dallal, Dana Fisman, Pranav Garg, Garvit Juniwal, Hadas Kress-Gazit, P. Madhusudan, Milo M. K. Martin, Mukund Raghothaman, Shambwaditya Saha, Sanjit A. Seshia, Rishabh Singh, Armando Solar-Lezama, Emina Torlak, and Abhishek Udupa. 2015. Syntax-Guided Synthesis. In Dependable Software Systems Engineering. NATO Science for Peace and Security Series, D: Information and Communication Security, Vol. 40. IOS Press, 1–25.
[6] MOSEK ApS. 2018. The mosek optimization software. http://www.mosek.com
[7] Somil Bansal, Mo Chen, Sylvia Herbert, and Claire J. Tomlin. 2017. Hamilton-Jacobi Reachability: A Brief Overview and Recent Advances. (2017).
[8] Osbert Bastani, Yani Ioannou, Leonidas Lampropoulos, Dimitrios Vytiniotis, Aditya V. Nori, and Antonio Criminisi. 2016. Measuring Neural Net Robustness with Constraints. In Proceedings of the 30th International Conference on Neural Information Processing Systems (NIPS'16). 2621–2629.
[9] Osbert Bastani, Yewen Pu, and Armando Solar-Lezama. 2018. Verifiable Reinforcement Learning via Policy Extraction. CoRR abs/1805.08328 (2018).
[10] Richard N. Bergman, Diane T. Finegood, and Marilyn Ader. 1985. Assessment of insulin sensitivity in vivo. Endocrine reviews 6, 1 (1985), 45–86.
[11] Felix Berkenkamp, Matteo Turchetta, Angela P. Schoellig, and Andreas Krause. 2017. Safe Model-based Reinforcement Learning with Stability Guarantees. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017. 908–919.
[12] Roderick Bloem, Bettina Könighofer, Robert Könighofer, and Chao Wang. 2015. Shield Synthesis - Runtime Enforcement for Reactive Systems. In Tools and Algorithms for the Construction and Analysis of Systems - 21st International Conference, TACAS 2015. 533–548.
[13] Leonardo De Moura and Nikolaj Bjørner. 2008. Z3: An Efficient SMT Solver. In Proceedings of the Theory and Practice of Software, 14th International Conference on Tools and Algorithms for the Construction and Analysis of Systems (TACAS'08/ETAPS'08). 337–340.
[14] Peter Dorato, Vito Cerone, and Chaouki Abdallah. 1994. Linear-Quadratic Control: An Introduction. Simon & Schuster, Inc., New York, NY, USA.
[15] Chuchu Fan, Umang Mathur, Sayan Mitra, and Mahesh Viswanathan. 2018. Controller Synthesis Made Real: Reach-Avoid Specifications and Linear Dynamics. In Computer Aided Verification - 30th International Conference, CAV 2018. 347–366.
[16] Ioannis Filippidis, Sumanth Dathathri, Scott C. Livingston, Necmiye Ozay, and Richard M. Murray. 2016. Control design for hybrid systems with TuLiP: The Temporal Logic Planning toolbox. In 2016 IEEE Conference on Control Applications, CCA 2016. 1030–1041.
[17] Timon Gehr, Matthew Mirman, Dana Drachsler-Cohen, Petar Tsankov, Swarat Chaudhuri, and Martin T. Vechev. 2018. AI2: Safety and Robustness Certification of Neural Networks with Abstract Interpretation. In 2018 IEEE Symposium on Security and Privacy, SP 2018. 3–18.
[18] Jeremy H. Gillula and Claire J. Tomlin. 2012. Guaranteed Safe Online Learning via Reachability: tracking a ground target using a quadrotor. In IEEE International Conference on Robotics and Automation, ICRA 2012, 14-18 May 2012, St. Paul, Minnesota, USA. 2723–2730.
[19] Sumit Gulwani, Saurabh Srivastava, and Ramarathnam Venkatesan. 2008. Program Analysis As Constraint Solving. In Proceedings of the 29th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '08). 281–292.
[20] Sumit Gulwani and Ashish Tiwari. 2008. Constraint-based approach for analysis of hybrid systems. In International Conference on Computer Aided Verification. 190–203.
[21] Xiaowei Huang, Marta Kwiatkowska, Sen Wang, and Min Wu. 2017. Safety Verification of Deep Neural Networks. In CAV (1) (Lecture Notes in Computer Science), Vol. 10426. 3–29.
[22] Dominic W. Jordan and Peter Smith. 1987. Nonlinear ordinary differential equations. (1987).
[23] Guy Katz, Clark W. Barrett, David L. Dill, Kyle Julian, and Mykel J. Kochenderfer. 2017. Reluplex: An Efficient SMT Solver for Verifying Deep Neural Networks. In Computer Aided Verification - 29th International Conference, CAV 2017. 97–117.
[24] Hui Kong, Fei He, Xiaoyu Song, William N. N. Hung, and Ming Gu. 2013. Exponential-Condition-Based Barrier Certificate Generation for Safety Verification of Hybrid Systems. In Proceedings of the 25th International Conference on Computer Aided Verification (CAV'13). 242–257.
[25] Bettina Könighofer, Mohammed Alshiekh, Roderick Bloem, Laura Humphrey, Robert Könighofer, Ufuk Topcu, and Chao Wang. 2017. Shield Synthesis. Form. Methods Syst. Des. 51, 2 (Nov. 2017), 332–361.
[26] Benoît Legat, Chris Coey, Robin Deits, Joey Huchette, and Amelia Perry. 2017 (accessed Nov. 16, 2018). Sum-of-squares optimization in Julia. https://www.juliaopt.org/meetings/mit2017/legat.pdf
[27] Yuxi Li. 2017. Deep Reinforcement Learning: An Overview. CoRR abs/1701.07274 (2017).
[28] Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. 2016. Continuous control with deep reinforcement learning. In 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings.
[29] Horia Mania, Aurelia Guy, and Benjamin Recht. 2018. Simple random search of static linear policies is competitive for reinforcement learning. In Advances in Neural Information Processing Systems 31. 1800–1809.
[30] J. Matyas. 1965. Random optimization. Automation and Remote Control 26, 2 (1965), 246–253.
[31] Matthew Mirman, Timon Gehr, and Martin T. Vechev. 2018. Differentiable Abstract Interpretation for Provably Robust Neural Networks. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018. 3575–3583.
[32] Teodor Mihai Moldovan and Pieter Abbeel. 2012. Safe Exploration in Markov Decision Processes. In Proceedings of the 29th International Conference on International Conference on Machine Learning (ICML'12). 1451–1458.
[33] Yurii Nesterov and Vladimir Spokoiny. 2017. Random Gradient-Free Minimization of Convex Functions. Found. Comput. Math. 17, 2 (2017), 527–566.
[34] Stephen Prajna and Ali Jadbabaie. 2004. Safety Verification of Hybrid Systems Using Barrier Certificates. In Hybrid Systems: Computation and Control, 7th International Workshop, HSCC 2004. 477–492.
[35] Aravind Rajeswaran, Kendall Lowrey, Emanuel Todorov, and Sham M. Kakade. 2017. Towards Generalization and Simplicity in Continuous Control. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017. 6553–6564.
[36] Stéphane Ross, Geoffrey J. Gordon, and Drew Bagnell. 2011. A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2011. 627–635.
[37] Wenjie Ruan, Xiaowei Huang, and Marta Kwiatkowska. 2018. Reachability Analysis of Deep Neural Networks with Provable Guarantees. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI 2018. 2651–2659.
[38] Stefan Schaal. 1999. Is imitation learning the route to humanoid robots? Trends in Cognitive Sciences 3, 6 (1999), 233–242.
[39] Bastian Schürmann and Matthias Althoff. 2017. Optimal control of sets of solutions to formally guarantee constraints of disturbed linear systems. In 2017 American Control Conference, ACC 2017. 2522–2529.
[40] David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Riedmiller. 2014. Deterministic Policy Gradient Algorithms. In Proceedings of the 31st International Conference on International Conference on Machine Learning - Volume 32. I-387–I-395.
[41] Armando Solar-Lezama. 2008. Program Synthesis by Sketching. Ph.D. Dissertation. Berkeley, CA, USA. Advisor(s): Bodik, Rastislav.
[42] Armando Solar-Lezama. 2009. The Sketching Approach to Program Synthesis. In Proceedings of the 7th Asian Symposium on Programming Languages and Systems (APLAS '09). 4–13.
[43] Youcheng Sun, Min Wu, Wenjie Ruan, Xiaowei Huang, Marta Kwiatkowska, and Daniel Kroening. 2018. Concolic Testing for Deep Neural Networks. In Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering (ASE 2018). 109–119.
[44] Russ Tedrake. 2009. LQR-trees: Feedback motion planning on sparse randomized trees. In Robotics: Science and Systems V, University of Washington, Seattle, USA, June 28 - July 1, 2009.
[45] Yuchi Tian, Kexin Pei, Suman Jana, and Baishakhi Ray. 2018. DeepTest: Automated Testing of Deep-neural-network-driven Autonomous Cars. In Proceedings of the 40th International Conference on Software Engineering (ICSE '18). 303–314.
[46] Matteo Turchetta, Felix Berkenkamp, and Andreas Krause. 2016. Safe Exploration in Finite Markov Decision Processes with Gaussian Processes. In Proceedings of the 30th International Conference on Neural Information Processing Systems (NIPS'16). 4312–4320.
[47] Abhinav Verma, Vijayaraghavan Murali, Rishabh Singh, Pushmeet Kohli, and Swarat Chaudhuri. 2018. Programmatically Interpretable Reinforcement Learning. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018. 5052–5061.
[48] Matthew Wicker, Xiaowei Huang, and Marta Kwiatkowska. 2018. Feature-Guided Black-Box Safety Testing of Deep Neural Networks. In Tools and Algorithms for the Construction and Analysis of Systems - 24th International Conference, TACAS 2018. 408–426.
[49] He Zhu, Zikang Xiong, Stephen Magill, and Suresh Jagannathan. 2018. An Inductive Synthesis Framework for Verifiable Reinforcement Learning. Technical Report. Galois, Inc. http://herowanzhu.github.io/herowanzhu.github.io/VerifiableLearning.pdf


environment formalized as an infinite-state transition sys-tem The key idea is a novel synthesis framework capableof synthesizing a deterministic policy program based onan user-given sketch that resembles a neural policy oracleand simultaneously satisfies a safety specification using acounterexample-guided synthesis loop Verification sound-ness is achieved at runtime by using the learned determin-istic program along with its learned inductive invariant asa shield to protect the neural policy correcting the neuralpolicyrsquos action only if the chosen action can cause violationof the inductive invariant Experimental results demonstratethat our approach can be used to realize fully trustworthyreinforcement learning systems with low overhead

References[1] M Ahmadi C Rowley and U Topcu 2018 Control-Oriented Learning

of Lagrangian and Hamiltonian Systems In American Control Confer-ence

[2] Anayo K Akametalu Shahab Kaynama Jaime F Fisac Melanie NicoleZeilinger Jeremy H Gillula and Claire J Tomlin 2014 Reachability-based safe learning with Gaussian processes In 53rd IEEE Conferenceon Decision and Control CDC 2014 1424ndash1431

[3] Anayo K Akametalu Shahab Kaynama Jaime F Fisac Melanie NicoleZeilinger Jeremy H Gillula and Claire J Tomlin 2014 Reachability-based safe learning with Gaussian processes In 53rd IEEE Conferenceon Decision and Control CDC 2014 Los Angeles CA USA December15-17 2014 1424ndash1431

[4] Mohammed Alshiekh Roderick Bloem Ruumldiger Ehlers BettinaKoumlnighofer Scott Niekum and Ufuk Topcu 2018 Safe Reinforce-ment Learning via Shielding AAAI (2018)

[5] Rajeev Alur Rastislav Bodiacutek Eric Dallal Dana Fisman Pranav GargGarvit Juniwal Hadas Kress-Gazit P Madhusudan Milo M K MartinMukund Raghothaman Shambwaditya Saha Sanjit A Seshia RishabhSingh Armando Solar-Lezama Emina Torlak and Abhishek Udupa2015 Syntax-Guided Synthesis In Dependable Software Systems Engi-neering NATO Science for Peace and Security Series D Informationand Communication Security Vol 40 IOS Press 1ndash25

[6] MOSEK ApS 2018 The mosek optimization software httpwwwmosekcom

[7] Somil Bansal Mo Chen Sylvia Herbert and Claire J Tomlin 2017Hamilton-Jacobi Reachability A Brief Overview and Recent Advances(2017)

[8] Osbert Bastani Yani Ioannou Leonidas Lampropoulos Dimitrios Vy-tiniotis Aditya V Nori and Antonio Criminisi 2016 Measuring NeuralNet Robustness with Constraints In Proceedings of the 30th Interna-tional Conference on Neural Information Processing Systems (NIPSrsquo16)2621ndash2629

[9] Osbert Bastani Yewen Pu and Armando Solar-Lezama 2018 VerifiableReinforcement Learning via Policy Extraction CoRR abs180508328(2018)

[10] Richard N Bergman Diane T Finegood and Marilyn Ader 1985 As-sessment of insulin sensitivity in vivo Endocrine reviews 6 1 (1985)45ndash86

[11] Felix Berkenkamp Matteo Turchetta Angela P Schoellig and AndreasKrause 2017 Safe Model-based Reinforcement Learning with StabilityGuarantees In Advances in Neural Information Processing Systems 30Annual Conference on Neural Information Processing Systems 2017 908ndash919

[12] Roderick Bloem Bettina Koumlnighofer Robert Koumlnighofer and ChaoWang 2015 Shield Synthesis - Runtime Enforcement for ReactiveSystems In Tools and Algorithms for the Construction and Analysis of

14

Systems - 21st International Conference TACAS 2015 533ndash548[13] Leonardo De Moura and Nikolaj Bjoslashrner 2008 Z3 An Efficient SMT

Solver In Proceedings of the Theory and Practice of Software 14th Inter-national Conference on Tools and Algorithms for the Construction andAnalysis of Systems (TACASrsquo08ETAPSrsquo08) 337ndash340

[14] Peter Dorato Vito Cerone and Chaouki Abdallah 1994 Linear-Quadratic Control An Introduction Simon amp Schuster Inc New YorkNY USA

[15] Chuchu Fan Umang Mathur Sayan Mitra and Mahesh Viswanathan2018 Controller Synthesis Made Real Reach-Avoid Specifications andLinear Dynamics In Computer Aided Verification - 30th InternationalConference CAV 2018 347ndash366

[16] Ioannis Filippidis Sumanth Dathathri Scott C Livingston NecmiyeOzay and Richard M Murray 2016 Control design for hybrid sys-tems with TuLiP The Temporal Logic Planning toolbox In 2016 IEEEConference on Control Applications CCA 2016 1030ndash1041

[17] Timon Gehr Matthew Mirman Dana Drachsler-Cohen Petar TsankovSwarat Chaudhuri and Martin T Vechev 2018 AI2 Safety and Robust-ness Certification of Neural Networks with Abstract Interpretation In2018 IEEE Symposium on Security and Privacy SP 2018 3ndash18

[18] Jeremy H Gillula and Claire J Tomlin 2012 Guaranteed Safe OnlineLearning via Reachability tracking a ground target using a quadrotorIn IEEE International Conference on Robotics and Automation ICRA2012 14-18 May 2012 St Paul Minnesota USA 2723ndash2730

[19] Sumit Gulwani Saurabh Srivastava and Ramarathnam Venkatesan2008 Program Analysis As Constraint Solving In Proceedings of the29th ACM SIGPLAN Conference on Programming Language Design andImplementation (PLDI rsquo08) 281ndash292

[20] Sumit Gulwani and Ashish Tiwari 2008 Constraint-based approachfor analysis of hybrid systems In International Conference on ComputerAided Verification 190ndash203

[21] Xiaowei Huang Marta Kwiatkowska Sen Wang and Min Wu 2017Safety Verification of Deep Neural Networks In CAV (1) (Lecture Notesin Computer Science) Vol 10426 3ndash29

[22] Dominic W Jordan and Peter Smith 1987 Nonlinear ordinary differ-ential equations (1987)

[23] Guy Katz Clark W Barrett David L Dill Kyle Julian and Mykel JKochenderfer 2017 Reluplex An Efficient SMT Solver for VerifyingDeep Neural Networks In Computer Aided Verification - 29th Interna-tional Conference CAV 2017 97ndash117

[24] Hui Kong Fei He Xiaoyu Song William N N Hung and Ming Gu2013 Exponential-Condition-Based Barrier Certificate Generationfor Safety Verification of Hybrid Systems In Proceedings of the 25thInternational Conference on Computer Aided Verification (CAVrsquo13) 242ndash257

[25] Bettina Koumlnighofer Mohammed Alshiekh Roderick Bloem LauraHumphrey Robert Koumlnighofer Ufuk Topcu and Chao Wang 2017Shield Synthesis Form Methods Syst Des 51 2 (Nov 2017) 332ndash361

[26] Benoicirct Legat Chris Coey Robin Deits Joey Huchette and AmeliaPerry 2017 (accessed Nov 16 2018) Sum-of-squares optimization inJulia httpswwwjuliaoptorgmeetingsmit2017legatpdf

[27] Yuxi Li 2017 Deep Reinforcement Learning An Overview CoRRabs170107274 (2017)

[28] Timothy P Lillicrap Jonathan J Hunt Alexander Pritzel Nicolas HeessTom Erez Yuval Tassa David Silver and Daan Wierstra 2016 Contin-uous control with deep reinforcement learning In 4th InternationalConference on Learning Representations ICLR 2016 San Juan PuertoRico May 2-4 2016 Conference Track Proceedings

[29] Horia Mania Aurelia Guy and Benjamin Recht 2018 Simple randomsearch of static linear policies is competitive for reinforcement learningIn Advances in Neural Information Processing Systems 31 1800ndash1809

[30] J Matyas 1965 Random optimization Automation and Remote Control26 2 (1965) 246ndash253

[31] Matthew Mirman Timon Gehr and Martin T Vechev 2018 Differen-tiable Abstract Interpretation for Provably Robust Neural Networks InProceedings of the 35th International Conference on Machine LearningICML 2018 3575ndash3583

[32] Teodor Mihai Moldovan and Pieter Abbeel 2012 Safe Exploration inMarkov Decision Processes In Proceedings of the 29th InternationalCoference on International Conference on Machine Learning (ICMLrsquo12)1451ndash1458

[33] Yurii Nesterov and Vladimir Spokoiny 2017 Random Gradient-FreeMinimization of Convex Functions Found Comput Math 17 2 (2017)527ndash566

[34] Stephen Prajna and Ali Jadbabaie 2004 Safety Verification of HybridSystems Using Barrier Certificates In Hybrid Systems Computationand Control 7th International Workshop HSCC 2004 477ndash492

[35] Aravind Rajeswaran Kendall Lowrey Emanuel Todorov and Sham MKakade 2017 Towards Generalization and Simplicity in Continu-ous Control In Advances in Neural Information Processing Systems30 Annual Conference on Neural Information Processing Systems 20176553ndash6564

[36] Steacutephane Ross Geoffrey J Gordon and Drew Bagnell 2011 A Re-duction of Imitation Learning and Structured Prediction to No-RegretOnline Learning In Proceedings of the Fourteenth International Confer-ence on Artificial Intelligence and Statistics AISTATS 2011 627ndash635

[37] Wenjie Ruan Xiaowei Huang and Marta Kwiatkowska 2018 Reacha-bility Analysis of Deep Neural Networks with Provable GuaranteesIn Proceedings of the Twenty-Seventh International Joint Conference onArtificial Intelligence IJCAI 2018 2651ndash2659

[38] Stefan Schaal 1999 Is imitation learning the route to humanoid robotsTrends in cognitive sciences 3 6 (1999) 233ndash242

[39] Bastian Schuumlrmann and Matthias Althoff 2017 Optimal control ofsets of solutions to formally guarantee constraints of disturbed linearsystems In 2017 American Control Conference ACC 2017 2522ndash2529

[40] David Silver Guy Lever Nicolas Heess Thomas Degris DaanWierstraandMartin Riedmiller 2014 Deterministic Policy Gradient AlgorithmsIn Proceedings of the 31st International Conference on InternationalConference on Machine Learning - Volume 32 Indash387ndashIndash395

[41] Armando Solar-Lezama 2008 Program Synthesis by Sketching PhDDissertation Berkeley CA USA Advisor(s) Bodik Rastislav

[42] Armando Solar-Lezama 2009 The Sketching Approach to ProgramSynthesis In Proceedings of the 7th Asian Symposium on ProgrammingLanguages and Systems (APLAS rsquo09) 4ndash13

[43] Youcheng Sun Min Wu Wenjie Ruan Xiaowei Huang MartaKwiatkowska and Daniel Kroening 2018 Concolic Testing for DeepNeural Networks In Proceedings of the 33rd ACMIEEE InternationalConference on Automated Software Engineering (ASE 2018) 109ndash119

[44] Russ Tedrake 2009 LQR-trees Feedback motion planning on sparserandomized trees In Robotics Science and Systems V University ofWashington Seattle USA June 28 - July 1 2009

[45] Yuchi Tian Kexin Pei Suman Jana and Baishakhi Ray 2018 DeepTestAutomated Testing of Deep-neural-network-driven Autonomous CarsIn Proceedings of the 40th International Conference on Software Engi-neering (ICSE rsquo18) 303ndash314

[46] Matteo Turchetta Felix Berkenkamp and Andreas Krause 2016 SafeExploration in Finite Markov Decision Processes with Gaussian Pro-cesses In Proceedings of the 30th International Conference on NeuralInformation Processing Systems (NIPSrsquo16) 4312ndash4320

[47] Abhinav Verma Vijayaraghavan Murali Rishabh Singh PushmeetKohli and Swarat Chaudhuri 2018 Programmatically InterpretableReinforcement Learning In Proceedings of the 35th International Con-ference on Machine Learning ICML 2018 5052ndash5061

[48] Matthew Wicker Xiaowei Huang and Marta Kwiatkowska 2018Feature-Guided Black-Box Safety Testing of Deep Neural NetworksIn Tools and Algorithms for the Construction and Analysis of Systems -24th International Conference TACAS 2018 408ndash426

15

[49] He Zhu Zikang Xiong Stephen Magill and Suresh Jagannathan 2018An Inductive Synthesis Framework for Verifiable Reinforcement Learn-ing Technical Report Galois Inc httpherowanzhugithubio

herowanzhugithubioVerifiableLearningpdf

where the bi(X)'s are monomials over the variables X. As a heuristic, the user can simply fix an upper bound on the degree of E[c] and include in the sketch all monomials whose degrees are no greater than that bound. Large values of the bound enable verification of more complex safety conditions but impose greater demands on the constraint solver; small values capture coarser safety properties but are easier to solve.

Example 4.1. Consider the inverted pendulum system in Sec. 2.2. To discover an inductive invariant for the transition system, the user might choose to define an upper bound of 4, which results in the following polynomial invariant sketch φ[c](η, ω) = (E[c](η, ω) ≤ 0) where

E[c](η, ω) = c0η^4 + c1η^3ω + c2η^2ω^2 + c3ηω^3 + c4ω^4 + c5η^3 + · · · + cp

The sketch includes all monomials over η and ω whose degrees are no greater than 4. The coefficients c = [c0, · · · , cp] are unknown and need to be synthesized.
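Such a degree-bounded sketch can be generated mechanically. The snippet below is a minimal illustration of the idea, not part of the authors' tool; the function name and the use of sympy are our own assumptions.

import itertools
import sympy as sp

def invariant_sketch(variables, max_degree):
    # Build E[c](X) = sum_i c_i * b_i(X), where the b_i range over all
    # monomials in `variables` with degree at most `max_degree`.
    monomials = []
    for degree in range(max_degree, -1, -1):
        for combo in itertools.combinations_with_replacement(variables, degree):
            monomials.append(sp.Mul(*combo) if combo else sp.Integer(1))
    coeffs = sp.symbols('c0:%d' % len(monomials))
    return sum(c * m for c, m in zip(coeffs, monomials)), coeffs

eta, omega = sp.symbols('eta omega')
E, cs = invariant_sketch([eta, omega], 4)   # 15 monomials for a degree bound of 4
# The invariant sketch is then phi[c](eta, omega) = (E <= 0).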

To synthesize these unknowns, we require that E[c] satisfy the following verification conditions:

∀s ∈ Su:  E[c](s) > 0        (8)
∀s ∈ S0:  E[c](s) ≤ 0        (9)
∀(s, s′) ∈ Tt[P]:  E[c](s′) − E[c](s) ≤ 0        (10)

We claim that φ = (E[c](x) ≤ 0) defines an inductive invariant. Verification condition (9) ensures that any initial state s0 ∈ S0 satisfies φ, since E[c](s0) ≤ 0. Verification condition (10) asserts that along a transition from a state s ∈ φ (so E[c](s) ≤ 0) to a resulting state s′, E[c] cannot become positive, so s′ satisfies φ as well. Finally, according to verification condition (8), φ does not include any unsafe state su ∈ Su, as E[c](su) is positive.
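Before attempting an exact solve, it can be helpful to sanity-check a candidate E[c] against conditions (8)-(10) numerically. The sketch below does this by sampling; it is only a falsification-style check under our own naming (sample_unsafe, sample_init, sample_state, and step are placeholders), not the sound procedure described next.

import numpy as np

def check_candidate(E, step, sample_unsafe, sample_init, sample_state, n=10000, seed=0):
    # Empirically test conditions (8)-(10) for a candidate E : state -> real.
    # `step` maps a state s to its successor s' under the synthesized program P.
    rng = np.random.default_rng(seed)
    for _ in range(n):
        if E(sample_unsafe(rng)) <= 0:        # (8): E must be positive on unsafe states
            return False
        if E(sample_init(rng)) > 0:           # (9): E must be non-positive on initial states
            return False
        s = sample_state(rng)                 # (10): E must not increase along transitions
        if E(step(s)) - E(s) > 0:
            return False
    return True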

Verification conditions (8), (9), (10) are polynomials over reals. Synthesizing the unknown coefficients can be left to an SMT solver [13] after universal quantifiers are eliminated using a variant of Farkas' Lemma as in [20]. However, observe that E[c] is convex³. We can gain efficiency by finding the unknown coefficients c using off-the-shelf convex constraint solvers, following [34]. Encoding verification conditions (8), (9), (10) as polynomial inequalities, we search for c that proves non-negativity of these constraints via an efficient convex sum-of-squares programming solver [26]. Additional technical details are provided in the supplementary material [49].

Counterexample-guided Inductive Synthesis (CEGIS). Given the environment state transition system C[P] deployed with a synthesized program P, the verification approach above can compute an inductive invariant over the state transition system or tell if there is no feasible solution in the given set of candidates. Note, however, that the latter does not necessarily imply that the system is unsafe.

³ For arbitrary E1(x) and E2(x) satisfying the verification conditions and α ∈ [0, 1], E(x) = αE1(x) + (1 − α)E2(x) satisfies the conditions as well.

Algorithm 2 CEGIS(πw, P[θ], C[·])
1:  policies ← ∅
2:  covers ← ∅
3:  while C.S0 ⊈ covers do
4:      search s0 such that s0 ∈ C.S0 and s0 ∉ covers
5:      r* ← Diameter(C.S0)
6:      do
7:          φbound ← {s | s ∈ [s0 − r*, s0 + r*]}
8:          C ← C where C.S0 = (C.S0 ∩ φbound)
9:          θ ← Synthesize(πw, P[θ], C[·])
10:         φ ← Verify(C[Pθ])
11:         if φ is False then
12:             r* ← r*/2
13:         else
14:             covers ← covers ∪ {s | φ(s)}
15:             policies ← policies ∪ {(Pθ, φ)}
16:             break
17:     while True
18: end
19: return policies
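For readers who prefer code, the control flow of Algorithm 2 can be transliterated roughly as follows. This is a schematic sketch under our own naming; initial_states_covered, sample_uncovered, synthesize, verify, and the environment interface stand in for the components described in the text and are not the authors' implementation.

def cegis(pi_w, sketch, env):
    # Counterexample-guided synthesis of (program, invariant) pairs (Algorithm 2).
    policies = []          # verified (P_theta, phi) pairs
    covers = []            # inductive invariants proved so far
    while not initial_states_covered(env, covers):         # line 3 (SMT check)
        s0 = sample_uncovered(env, covers)                  # line 4
        r = diameter(env.S0)                                # line 5
        while True:
            bound = box(s0, r)                              # {s | s in [s0 - r, s0 + r]}
            restricted = env.restrict_initial(bound)        # line 8
            theta = synthesize(pi_w, sketch, restricted)    # line 9 (Algorithm 1)
            phi = verify(restricted, sketch, theta)         # line 10 (Sec. 4.2)
            if phi is None:
                r = r / 2.0                                 # line 12: shrink around s0
            else:
                covers.append(phi)                          # line 14
                policies.append((sketch.instantiate(theta), phi))  # line 15
                break
    return policies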

Since our goal is to learn a safe deterministic program from a neural network, we develop a counterexample-guided inductive program synthesis approach. A CEGIS algorithm in our context is challenging because safety verification is necessarily incomplete and may not be able to produce a counterexample that serves as an explanation for why a verification attempt is unsuccessful.

We solve the incompleteness challenge by leveraging the fact that we can simultaneously synthesize and verify a program. Our CEGIS approach is given in Algorithm 2. A counterexample is an initial state on which our synthesized program is not yet proved safe. Driven by these counterexamples, our algorithm synthesizes a set of programs from a sketch, along with the conditions under which we can switch from one synthesized program to another.

Algorithm 2 takes as input a neural policy πw, a program sketch P[θ], and an environment context C. It maintains synthesized policy programs in policies (line 1), each of which is inductively verified safe in a partition of the universe state space that is maintained in covers (line 2). For soundness, the state space covered by such partitions must include all initial states; this is checked in line 3 of Algorithm 2 by an SMT solver. In the algorithm, we use C.S0 to access a field of C, such as its initial state space.

The algorithm iteratively samples a counterexample initial state s0 that is currently not covered by covers (line 4). Since covers is empty at the beginning, this choice is uniformly random initially. We then synthesize a presumably safe policy program in line 9 of Algorithm 2 that resembles the neural policy πw, considering all possible initial states S0 of the given environment model C, using Algorithm 1.
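As a rough picture of what the Synthesize call on line 9 might do, the sketch below uses derivative-free random search to fit the parameters of a linear sketch against the neural oracle, with a large penalty on unsafe states in the spirit of the objective discussed later (equation (5)). All names are hypothetical and the objective is simplified; this is not the paper's Algorithm 1.

import numpy as np

def synthesize_by_random_search(pi_w, rollout, unsafe, dim,
                                iters=2000, step=0.05, n_dirs=8,
                                penalty=1e4, seed=0):
    # Search for theta such that P[theta](s) = theta . s imitates the neural
    # oracle pi_w on rollout states while avoiding unsafe states.
    rng = np.random.default_rng(seed)

    def loss(theta):
        states = rollout(lambda s: float(np.dot(theta, s)))
        imitation = np.mean([(np.dot(theta, s) - pi_w(s)) ** 2 for s in states])
        violations = sum(1 for s in states if unsafe(s))
        return imitation + penalty * violations

    theta, best = np.zeros(dim), float('inf')
    for _ in range(iters):
        for d in rng.standard_normal((n_dirs, dim)):
            cand = theta + step * d
            l = loss(cand)
            if l < best:
                theta, best = cand, l
    return theta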


Figure 6 CEGIS for Verifiable Reinforcement Learning

If verification fails in line 11, our approach simply reduces the initial state space, hoping that a safe policy program is easier to synthesize if only a subset of initial states is considered. The algorithm in line 12 gradually shrinks the radius r* of the initial state space around the sampled initial state s0. The synthesis algorithm in the next iteration synthesizes and verifies a candidate using the reduced initial state space. The main idea is that if the initial state space is shrunk to a restricted area around s0 but a safe policy program still cannot be found, it is quite possible that either s0 points to an unsafe initial state of the neural oracle or the sketch is not sufficiently expressive.

If verification at this stage succeeds with an inductive invariant φ, a new policy program Pθ is synthesized that can be verified safe in the state space covered by φ. We add the inductive invariant and the policy program into covers and policies in lines 14 and 15, respectively, and then move to another iteration of counterexample-guided synthesis. This iterative synthesize-and-verify process continues until the entire initial state space is covered (lines 3 to 18). The output of Algorithm 2 is [(Pθ1, φ1), (Pθ2, φ2), · · ·], which essentially defines conditional statements in a synthesized program performing different actions depending on whether a specified invariant condition evaluates to true or false.

Theorem 4.2. If CEGIS(πw, P[θ], C[·]) = [(Pθ1, φ1), (Pθ2, φ2), · · ·] (as defined in Algorithm 2), then the deterministic program

P = λX. if φ1(X) return Pθ1(X) else if φ2(X) return Pθ2(X) · · ·

is safe in the environment C, meaning that φ1 ∨ φ2 ∨ · · · is an inductive invariant of C[P] proving that C.Su is unreachable.
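In code, the guarded program of Theorem 4.2 is a simple dispatch over the pairs returned by CEGIS (a minimal sketch; names are ours):

def compose(policies):
    # Turn [(P1, phi1), (P2, phi2), ...] into the deterministic program of
    # Theorem 4.2: dispatch on the first invariant that holds, abort otherwise.
    def program(state):
        for P, phi in policies:
            if phi(state):
                return P(state)
        raise RuntimeError("abort: state outside every inductive invariant")
    return program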

Example 4.3. We illustrate the proposed counterexample-guided inductive synthesis method by means of a simple example, the Duffing oscillator [22], a nonlinear second-order environment. The transition relation of the environment system C is described by the differential equation

ẋ = y,    ẏ = −0.6y − x − x^3 + a

where x, y are the state variables and a is the continuous control action given by a well-trained neural feedback control policy π such that a = π(x, y). The control objective is to regulate the state to the origin from a set of initial states C.S0: {x, y | −2.5 ≤ x ≤ 2.5 ∧ −2 ≤ y ≤ 2}. To be safe, the controller must avoid a set of unsafe states C.Su: {x, y | ¬(−5 ≤ x ≤ 5 ∧ −5 ≤ y ≤ 5)}. Given a program sketch as in the pendulum example, that is, P[θ](x, y) = θ1x + θ2y, the user can ask the constraint solver to reason over a low-order (say degree-4) polynomial invariant sketch for a synthesized program, as in Example 4.1.

Algorithm 2 initially samples an initial state s with x = −0.46, y = −0.36 from S0. The inner do-while loop of the algorithm can discover a sub-region of initial states (the dotted box of Fig. 6(a)) that can be leveraged to synthesize a verified deterministic policy P1(x, y) = 0.39x − 1.41y from the sketch. We also obtain an inductive invariant showing that the synthesized policy always keeps the controller within the invariant set drawn in purple in Fig. 6(a): φ1 ≡ 20.9x^4 + 2.9x^3y + 1.4x^2y^2 + 0.4xy^3 + 29.6x^3 + 20.1x^2y + 11.3xy^2 + 1.6y^3 + 25.2x^2 + 39.2xy + 53.7y^2 − 68.0 ≤ 0.

This invariant explains why the initial state space used for verification does not include the entire C.S0: a counterexample initial state s′ = (x = 2.249, y = 2) is not covered by the invariant, and for it the synthesized policy program above is not verified safe. The CEGIS loop in line 3 of Algorithm 2 uses s′ to synthesize another deterministic policy P2(x, y) = 0.88x − 2.34y from the sketch, whose learned inductive invariant is depicted in blue in Fig. 6(b): φ2 ≡ 12.8x^4 + 0.9x^3y − 0.2x^2y^2 − 5.9x^3 − 1.5xy^2 − 0.3y^3 + 2.2x^2 + 4.7xy + 40.4y^2 − 61.9 ≤ 0. Algorithm 2 then terminates because φ1 ∨ φ2 covers C.S0. Our system interprets the two synthesized deterministic policies as the following deterministic program Poscillator, using the syntax in Fig. 5:

def Poscillator(x, y):
    if 20.9x^4 + 2.9x^3y + 1.4x^2y^2 + · · · + 53.7y^2 − 68.0 ≤ 0:    # φ1
        return 0.39x − 1.41y
    else if 12.8x^4 + 0.9x^3y − 0.2x^2y^2 + · · · + 40.4y^2 − 61.9 ≤ 0:    # φ2
        return 0.88x − 2.34y
    else: abort    # unsafe

Neither of the two deterministic policies enforces safety by itself on all initial states, but they do guarantee safety when combined, because by construction φ = φ1 ∨ φ2 is an inductive invariant of Poscillator in the environment C.
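To make the example executable, one can pair the composed program with a simple Euler-integrated model of the Duffing dynamics reconstructed above, using the 0.01s step from Sec. 5. The sketch is illustrative only; the Euler integrator and the function names are our own assumptions.

import numpy as np

def duffing_step(state, a, dt=0.01):
    # One Euler step of the Duffing dynamics: x' = y, y' = -0.6*y - x - x**3 + a.
    x, y = state
    return np.array([x + dt * y,
                     y + dt * (-0.6 * y - x - x ** 3 + a)])

def simulate(program, x0, steps=5000, dt=0.01):
    # Roll out a (possibly shielded) policy from an initial state.
    s = np.array(x0, dtype=float)
    trajectory = [s.copy()]
    for _ in range(steps):
        s = duffing_step(s, program(s), dt)
        trajectory.append(s.copy())
    return np.array(trajectory)

# e.g., simulate(compose([(P1, phi1), (P2, phi2)]), x0=[-0.46, -0.36])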

Although Algorithm 2 is sound, it may not terminate: r* in the algorithm can become arbitrarily small, and there is also no restriction on the size of potential counterexamples. Nonetheless, our experimental results indicate that the algorithm performs well in practice.

4.3 Shielding

A safety proof for a synthesized deterministic program of a neural network does not automatically lift to a safety argument about the neural network from which it was derived, since the network may exhibit behaviors not captured by the simpler deterministic program. To bridge this divide, we recover soundness at runtime by monitoring the system behavior of the neural network in its environment context, using the synthesized policy program and its inductive invariant as a shield. The pseudo-code for using a shield is given in Algorithm 3. In line 1 of Algorithm 3, for a current state s, we use the state transition system of our environment context model to predict the next state s′. If s′ is not within φ, we are unsure whether entering s′ would inevitably make the system unsafe, as we lack a proof that the neural oracle is safe. However, we do have a guarantee that if we follow the synthesized program P, the system will stay within the safety boundary defined by φ, which Algorithm 2 has formally proved. We do so in line 3 of Algorithm 3, using P and φ as shields, intervening only when necessary so that the shield does not unnecessarily override the neural policy.⁴

Algorithm 3 Shield(s, C[πw], P, φ)
1: Predict s′ such that (s, s′) ∈ C[πw].Tt(s)
2: if φ(s′) then return πw(s)
3: else return P(s)
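In code, the shield of Algorithm 3 is a thin wrapper around the neural policy. A minimal sketch, where predict stands in for the one-step model C[πw].Tt and is our own naming:

def shielded_policy(pi_w, P, phi, predict):
    # Algorithm 3: use the neural action unless it would leave the inductive
    # invariant phi, in which case fall back to the verified program P.
    def act(s):
        a_nn = pi_w(s)
        s_next = predict(s, a_nn)       # line 1: model-based one-step prediction
        return a_nn if phi(s_next) else P(s)
    return act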

5 Experimental Results

We have applied our framework to a number of challenging control- and cyberphysical-system benchmarks. We consider the utility of our approach for verifying the safety of trained neural network controllers. We use the deep policy gradient algorithm [28] for neural network training, the Z3 SMT solver [13] to check convergence of the CEGIS loop, and the Mosek constraint solver [6] to generate inductive invariants of synthesized programs from a sketch. All of our benchmarks are verified using the program sketch defined in equation (4) and the invariant sketch defined in equation (7). We report simulation results on our benchmarks over 1000 runs (each run consists of 5000 simulation steps). Each simulation time step is fixed at 0.01 seconds. Our experiments were conducted on a standard desktop machine with Intel(R) Core(TM) i7-8700 CPU cores and 64GB memory.

Case Study on Inverted Pendulum. We first give more details on the evaluation result of our running example, the inverted pendulum. Here we consider a more restricted safety condition that deems the system to be unsafe when the pendulum's angle is more than 23° from the origin (i.e., significant swings are further prohibited). The controller is a 2-hidden-layer neural model (240 × 200). Our tool interprets the neural network as a program containing three conditional branches:

def P(η, ω):
    if 17533η^4 + 13732η^3ω + 3831η^2ω^2 − 5472ηω^3 + 8579ω^4 + 6813η^3 + 9634η^2ω + 3947ηω^2 − 120ω^3 + 1928η^2 + 1915ηω + 1104ω^2 − 313 ≤ 0:
        return −1728176866η − 1009441768ω
    else if 2485η^4 + 826η^3ω − 351η^2ω^2 + 581ηω^3 + 2579ω^4 + 591η^3 + 9η^2ω + 243ηω^2 − 189ω^3 + 484η^2 + 170ηω + 287ω^2 − 82 ≤ 0:
        return −1734281984η − 1073944835ω
    else if 115496η^4 + 64763η^3ω + 85376η^2ω^2 + 21365ηω^3 + 7661ω^4 − 111271η^3 − 54416η^2ω − 66684ηω^2 − 8742ω^3 + 33701η^2 + 11736ηω + 12503ω^2 − 1185 ≤ 0:
        return −2578835525η − 1625056971ω
    else: abort

⁴ We also extended our approach to synthesize deterministic programs which can guarantee stability; see the supplementary material [49].

Over 1000 simulation runs, we found 60 unsafe cases when running the neural controller alone. Importantly, running the neural controller in tandem with the above verified and synthesized program prevented all unsafe neural decisions, with only 65 interventions from the program on the neural network. Our results demonstrate that CEGIS is important to ensure that a synthesized program is safe to use. We report more details, including synthesis times, below.

One may ask why we do not directly learn a deterministic program to control the device (without appealing to the neural policy at all), using reinforcement learning to synthesize its unknown parameters. Even for an example as simple as the inverted pendulum, our discussion above shows that it is difficult (if not impossible) to derive a single straight-line (linear) program that is safe to control the system. Even for scenarios in which a straight-line program suffices, using existing RL methods to directly learn unknown parameters in our sketches may still fail to obtain a safe program. For example, we considered whether a linear control policy can be learned to (1) prevent the pendulum from falling down and (2) require the pendulum control action to be strictly within the range [−1, 1] (e.g., operating the controller in an environment with low power constraints). We found that, despite many experiments tuning learning rates and rewards, directly training a linear control program to conform to this restriction with either reinforcement learning (e.g., policy gradient) or random search [29] was unsuccessful because of undesirable overfitting. In contrast, neural networks work much better with these RL algorithms and can be used to guide the synthesis of a deterministic program policy. Indeed, by treating the neural policy as an oracle, we were able to quickly discover a straight-line linear deterministic program that in fact satisfies this additional motion constraint.

Safety Verification. Our verification results are given in Table 1. In the table, Vars represents the number of variables in a control system; this number serves as a proxy for application complexity. Size is the number of neurons in the hidden layers of the network, Training the training time for the network, and Failures the number of times the network failed to satisfy the safety property in simulation. The table also gives the Size of a synthesized program in terms of the number of policies found by Algorithm 2 (used to generate conditional statements in the program), its Synthesis time, the Overhead of our approach in terms of the additional cost (compared to the non-shielded variant) in running time to use a shield, and the number of Interventions, the number of times the shield was invoked across all simulation runs. We also report the performance gap between a shielded neural policy (NN) and a purely programmatic policy (Program) in terms of the average number of steps a controlled system takes to reach a steady state (i.e., a convergence state).

Table 1. Experimental Results on Deterministic Program Synthesis, Verification, and Shielding

Benchmarks | Vars | NN Size | Training | NN Failures | Program Size | Synthesis | Overhead | Interventions | NN | Program
Satellite | 2 | 240 × 200 | 957s | 0 | 1 | 160s | 337 | 0 | 57 | 97
DCMotor | 3 | 240 × 200 | 944s | 0 | 1 | 68s | 203 | 0 | 119 | 122
Tape | 3 | 240 × 200 | 980s | 0 | 1 | 42s | 263 | 0 | 30 | 36
Magnetic Pointer | 3 | 240 × 200 | 992s | 0 | 1 | 85s | 292 | 0 | 83 | 88
Suspension | 4 | 240 × 200 | 960s | 0 | 1 | 41s | 871 | 0 | 47 | 61
Biology | 3 | 240 × 200 | 978s | 0 | 1 | 168s | 523 | 0 | 2464 | 2599
DataCenter Cooling | 3 | 240 × 200 | 968s | 0 | 1 | 168s | 469 | 0 | 146 | 401
Quadcopter | 2 | 300 × 200 | 990s | 182 | 2 | 67s | 641 | 185 | 72 | 98
Pendulum | 2 | 240 × 200 | 962s | 60 | 3 | 1107s | 965 | 65 | 442 | 586
CartPole | 4 | 300 × 200 | 990s | 47 | 4 | 998s | 562 | 1799 | 6813 | 19126
Self-Driving | 4 | 300 × 200 | 990s | 61 | 1 | 185s | 466 | 236 | 1459 | 5136
Lane Keeping | 4 | 240 × 200 | 895s | 36 | 1 | 183s | 865 | 64 | 3753 | 6435
4-Car platoon | 8 | 500 × 400 × 300 | 1160s | 8 | 4 | 609s | 317 | 8 | 76 | 96
8-Car platoon | 16 | 500 × 400 × 300 | 1165s | 40 | 1 | 1217s | 605 | 1080 | 385 | 554
Oscillator | 18 | 240 × 200 | 1023s | 371 | 1 | 618s | 2131 | 93703 | 6935 | 11353

The first five benchmarks are linear time-invariant control systems adapted from [15]. The safety property is that the reach set has to be within a safe rectangle. Benchmark Biology defines a minimal model of glycemic control in diabetic patients, such that the dynamics of glucose and insulin interaction in the blood system are defined by polynomials [10]. For safety, we verify that the neural controller ensures that the level of plasma glucose concentration is above a certain threshold. Benchmark DataCenter Cooling is a model of a collection of three server racks, each with their own cooling devices, which also shed heat to their neighbors. The safety property is that a learned controller must keep the data center below a certain temperature. In these benchmarks, the cost to query the network oracle constitutes the dominant time for generating the safety shield. Given the simplicity of these benchmarks, the neural network controllers did not violate the safety condition in our trials, and moreover there were no interventions from the safety shields that affected performance.

The next three benchmarks, Quadcopter, (Inverted) Pendulum, and Cartpole, are selected from classic control applications and have more sophisticated safety conditions. We have discussed the inverted pendulum example at length earlier. The Quadcopter environment tests whether a controlled quadcopter can realize stable flight. The environment of Cartpole consists of a pole attached to an unactuated joint connected to a cart that moves along a frictionless track. The system is unsafe when the pole's angle is more than 30° from being upright or the cart moves by more than 0.3 meters from the origin. We observed safety violations in each of these benchmarks that were eliminated using our verification methodology. Notably, the number of interventions is remarkably low as a percentage of the overall number of simulation steps.

Benchmark Self-driving defines a single car navigation problem. The neural controller is responsible for preventing the car from veering into canals found on either side of the road. Benchmark Lane Keeping models another safety-related car-driving problem. The neural controller aims to maintain a vehicle between lane markers and keep it centered in a possibly curved lane. The curvature of the road is considered as a disturbance input. Environment disturbances of this kind can be conservatively specified in our model, accounting for noise and unmodeled dynamics; our verification approach supports these disturbances (verification condition (10)). Benchmarks n-Car platoon model multiple (n) vehicles forming a platoon, maintaining a safe relative distance among one another [39]. Each of these benchmarks exhibited some number of violations that were remediated by our verification methodology. Benchmark Oscillator consists of a two-dimensional switched oscillator plus a 16th-order filter. The filter smooths the input signals and has a single output signal. We verify that the output signal is below a safe threshold. Because of the model complexity of this benchmark, it exhibited significantly more violations than the others. Indeed, the neural-network controlled system often oscillated between the safe and unsafe boundary in many runs. Consequently, the overhead in this benchmark is high because a large number of shield interventions was required to ensure safety. In other words, the synthesized shield trades performance for safety to guarantee that the threshold boundary is never violated.

For all benchmarks, our tool successfully generated safe, interpretable, deterministic programs and inductive invariants as shields. When a neural controller takes an unsafe action, the synthesized shield correctly prevents this action from executing by providing an alternative, provably safe action proposed by the verified deterministic program. In terms of performance, Table 1 shows that a shielded neural policy is a feasible approach to drive a controlled system into a steady state. For each of the benchmarks studied, the programmatic policy is less performant than the shielded neural policy, sometimes by a factor of two or more (e.g., Cartpole, Self-Driving, and Oscillator). Our results demonstrate that executing a neural policy in tandem with a program distilled from it can retain the performance provided by the neural policy while maintaining the safety provided by the verified program.

Although our synthesis algorithm does not guarantee convergence to the global minimum when applied to nonconvex RL problems, the results given in Table 1 indicate that our algorithm can often produce high-quality control programs, with respect to a provided sketch, that converge reasonably fast. In some of our benchmarks, however, the number of interventions is significantly higher than the number of neural controller failures, e.g., 8-Car platoon and Oscillator. The high number of interventions is not primarily due to non-optimality of the synthesized programmatic controller; instead, inherent safety issues in the neural network models are the main culprit that triggers shield interventions. In 8-Car platoon, after corrections made by our deterministic program, the neural model again takes an unsafe action in the next execution step, so that the deterministic program has to make another correction. It is only after applying a number of such shield interventions that the system navigates into a part of the state space that can be safely operated on by the neural model. For this benchmark, all system states at which shield interventions occur are indeed unsafe. We also examined the unsafe simulation runs made by executing the neural controller alone in Oscillator. Among the large number of shield interventions (as reflected in Table 1), 74% of them are effective and indeed prevent an unsafe neural decision. Ineffective interventions in Oscillator are due to the fact that, when optimizing equation (5), a large penalty is given to unsafe states, causing the synthesized programmatic policy to weigh safety more than proximity when there exist a large number of unsafe neural decisions.

Suitable Sketches. Providing a suitable sketch may require domain knowledge. To help the user more easily tune the shape of a sketch, our approach provides algorithmic support by not requiring conditional statements in a program sketch, and syntactic sugar, i.e., the user can simply provide an upper bound on the degree of an invariant sketch.

Table 2. Experimental Results on Tuning Invariant Degrees. TO means that an adequate inductive invariant cannot be found within 2 hours.

Benchmarks | Degree | Verification | Interventions | Overhead
Pendulum | 2 | TO | - | -
Pendulum | 4 | 226s | 40542 | 782
Pendulum | 8 | 236s | 30787 | 879
Self-Driving | 2 | TO | - | -
Self-Driving | 4 | 24s | 128851 | 697
Self-Driving | 8 | 251s | 123671 | 2685
8-Car-platoon | 2 | 1729s | 43952 | 836
8-Car-platoon | 4 | 5402s | 37990 | 919
8-Car-platoon | 8 | TO | - | -

Our experimental results are collected using the invariant sketch defined in equation (7), and we chose an upper bound of 4 on the degree of all monomials included in the sketch. Recall that invariants may also be used as conditional predicates in a synthesized program. We adjust the invariant degree upper bound to evaluate its effect on our synthesis procedure. The results are given in Table 2.

Generally, high-degree invariants lead to fewer interventions because they tend to be more permissive than low-degree ones. However, high-degree invariants take more time to synthesize and verify. This is particularly true for high-dimension models such as 8-Car platoon. Moreover, although high-degree invariants tend to have fewer interventions, they have larger overhead; for example, using a shield of degree 8 in the Self-Driving benchmark caused an overhead of 2685. This is because high-degree polynomial computations are time-consuming. On the other hand, an insufficient degree upper bound may not be permissive enough to obtain a valid invariant. It is therefore essential to consider the tradeoff between overhead and permissiveness when choosing the degree of an invariant.
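One way to see why the degree bound matters is to count the unknown coefficients in the invariant sketch: the monomial basis for n variables and degree bound d has C(n + d, n) elements, which is what the solver must search over. A small illustration (the counts are exact; the link to solver time is only indicative):

from math import comb

def sketch_size(n_vars, degree):
    # Number of monomials (unknown coefficients) with degree <= `degree`
    # over `n_vars` variables: C(degree + n_vars, n_vars).
    return comb(degree + n_vars, n_vars)

# Pendulum (2 variables):       degree 4 -> 15 coefficients, degree 8 -> 45
# 8-Car platoon (16 variables): degree 2 -> 153,             degree 4 -> 4845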

Handling Environment Changes. We consider the effectiveness of our tool when previously trained neural network controllers are deployed in environment contexts different from the environment used for training. Here we consider neural network controllers of larger size (two hidden layers with 1200 × 900 neurons) than in the above experiments. This is because we want to ensure that a neural policy is trained to be near-optimal in the environment context used for training. These larger networks were in general more difficult to train, requiring at least 1500 seconds to converge. Our results are summarized in Table 3.

When the underlying environment slightly changes, learning a new safety shield takes substantially less time than training a new network. For Cartpole, we simulated the trained controller in a new environment by increasing the length of the pole by 0.15 meters. The neural controller failed 3 times in our 1000-episode simulation; the shield interfered with the network operation only 8 times to prevent these unsafe behaviors.

Table 3. Experimental Results on Handling Environment Changes

Benchmarks | Environment Change | NN Size | NN Failures | Program Size | Synthesis | Overhead | Shield Interventions
Cartpole | Increased pole length by 0.15m | 1200 × 900 | 3 | 1 | 239s | 291 | 8
Pendulum | Increased pendulum mass by 0.3kg | 1200 × 900 | 77 | 1 | 581s | 811 | 8748
Pendulum | Increased pendulum length by 0.15m | 1200 × 900 | 76 | 1 | 483s | 653 | 7060
Self-driving | Added an obstacle that must be avoided | 1200 × 900 | 203 | 1 | 392s | 842 | 108320

The new shield was synthesized in 239s, significantly faster than retraining a new neural network for the new environment. For (Inverted) Pendulum, we deployed the trained neural network in an environment in which the pendulum's mass is increased by 0.3kg. The neural controller exhibits noticeably higher failure rates than in the previous experiment; we were able to synthesize a safety shield adapted to this new environment in 581 seconds that prevented these violations. The shield intervened with the operation of the network only 8.7 times per episode on average. Similar results were observed when we increased the pendulum's length by 0.15m. For Self-driving, we additionally required the car to avoid an obstacle. The synthesized shield provided safe actions to ensure collision-free motion.
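In terms of the sketches given earlier, handling such environment changes amounts to re-running CEGIS against a perturbed environment model with the same program and invariant sketches, leaving the expensive neural policy untouched. A hypothetical usage sketch (with_params and the environment objects are assumptions of ours, not the authors' API):

# Reuse the trained neural policy and the original sketch; only the model changes.
heavier = pendulum_env.with_params(mass=pendulum_env.mass + 0.3)
new_policies = cegis(pi_w, sketch, heavier)            # new shield, no retraining
phi_all = lambda s: any(phi(s) for _, phi in new_policies)
shield = shielded_policy(pi_w, compose(new_policies), phi_all, heavier.predict)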

6 Related Work

Verification of Neural Networks. While neural networks have historically found only limited application in safety- and security-related contexts, recent advances have defined suitable verification methodologies that are capable of providing stronger assurance guarantees. For example, Reluplex [23] is an SMT solver that supports linear real arithmetic with ReLU constraints and has been used to verify safety properties in a collision avoidance system. AI2 [17, 31] is an abstract interpretation tool that can reason about robustness specifications for deep convolutional networks. Robustness verification is also considered in [8, 21]. Systematic testing techniques such as [43, 45, 48] are designed to automatically generate test cases to increase a coverage metric, i.e., explore different parts of the neural network architecture by generating test inputs that maximize the number of activated neurons. These approaches ignore effects induced by the actual environment in which the network is deployed. They are typically ineffective for safety verification of a neural network controller over an infinite time horizon with complex system dynamics. Unlike these whitebox efforts that reason over the architecture of the network, our verification framework is fully blackbox, using a network only as an oracle to learn a simpler deterministic program. This gives us the ability to reason about safety entirely in terms of the synthesized program, using a shielding mechanism derived from the program to ensure that the neural policy can only explore safe regions.

Safe Reinforcement Learning. The machine learning community has explored various techniques to develop safe reinforcement learning algorithms in contexts similar to ours, e.g., [32, 46]. In some cases, verification is used as a safety validation oracle [2, 11]. In contrast to these efforts, our inductive synthesis algorithm can be freely incorporated into any existing reinforcement learning framework and applied to any legacy machine learning model. More importantly, our design allows us to synthesize new shields from existing ones without requiring retraining of the network under reasonable changes to the environment.

Syntax-Guided Synthesis for Machine Learning. An important reason underlying the success of program synthesis is that sketches of the desired program [41, 42], often provided along with example data, can be used to effectively restrict a search space and allow users to provide additional insight about the desired output [5]. Our sketch-based inductive synthesis approach is a realization of this idea, applicable to the continuous and infinite state control systems central to many machine learning tasks. A high-level policy language grammar is used in our approach to constrain the shape of possible synthesized deterministic policy programs. Our program parameters in sketches are continuous because they naturally fit continuous control problems. However, our synthesis algorithm (line 9 of Algorithm 2) can call a derivative-free random search method (which does not require the gradient or its finite-difference approximation) for synthesizing functions whose domain is disconnected, (mixed-)integer, or non-smooth. Our synthesizer thus may be applied to other domains that are not continuous or differentiable, like most modern program synthesis algorithms [41, 42]. In a similar vein, recent efforts on interpretable machine learning generate interpretable models such as program code [47] or decision trees [9] as output, after which traditional symbolic verification techniques can be leveraged to prove program properties. Our approach novelly ensures that only safe programs are synthesized via a CEGIS procedure and can provide safety guarantees on the original high-performing neural networks via invariant inference. A detailed comparison was provided on page 3.

Controller Synthesis. Random search [30] has long been utilized as an effective approach for controller synthesis. In [29], several extensions of random search have been proposed that substantially improve its performance when applied to reinforcement learning. However, [29] learns a single linear controller that, in our experience, we found to be insufficient to guarantee safety for our benchmarks. We often need to learn a program that involves a family of linear controllers conditioned on different input spaces to guarantee safety. The novelty of our approach is that it can automatically synthesize programs with conditional statements as needed, which is critical to ensure safety.

In Sec. 5, linear sketches were used in our experiments for shield synthesis. LQR-Tree based approaches such as [44] can synthesize a controller consisting of a set of linear quadratic regulator (LQR [14]) controllers, each applicable in a different portion of the state space. These approaches focus on stability, while our approach primarily addresses safety. In our experiments, we observed that because LQR does not take safe/unsafe regions into consideration, synthesized LQR controllers can regularly violate safety constraints.

Runtime Shielding. Shield synthesis has been used in runtime enforcement for reactive systems [12, 25]. Safety shielding in reinforcement learning was first introduced in [4]. Because their verification approach is not symbolic, however, it can only work over finite discrete state and action systems. Since states in infinite state and continuous action systems are not enumerable, using these methods requires the end user to provide a finite abstraction of complex environment dynamics; such abstractions are typically too coarse to be useful (because they result in too much over-approximation) or otherwise have too many states to be analyzable [15]. Indeed, automated environment abstraction tools such as [16] often take hours even for simple 4-dimensional systems. Our approach embraces the nature of infinity in control systems by learning a symbolic shield from an inductive invariant for a synthesized program that includes an infinite number of environment states under which the program can be guaranteed to be safe. We therefore believe our framework provides a more promising verification pathway for machine learning in high-dimensional control systems. In general, these efforts focused on shielding of discrete and finite systems, with no obvious generalization to effectively deal with the continuous and infinite systems that our approach can monitor and protect. Shielding in continuous control systems was introduced in [3, 18]. These approaches are based on HJ reachability (Hamilton-Jacobi Reachability [7]). However, current formulations of HJ reachability are limited to systems with approximately five dimensions, making high-dimensional system verification intractable. Unlike their least restrictive control law framework that decouples safety and learning, our approach only considers state spaces that a learned controller attempts to explore and is therefore capable of synthesizing safety shields even in high dimensions.

7 Conclusions

This paper presents an inductive synthesis-based toolchain that can verify neural network control policies within an environment formalized as an infinite-state transition system. The key idea is a novel synthesis framework capable of synthesizing a deterministic policy program, based on a user-given sketch, that resembles a neural policy oracle and simultaneously satisfies a safety specification, using a counterexample-guided synthesis loop. Verification soundness is achieved at runtime by using the learned deterministic program, along with its learned inductive invariant, as a shield to protect the neural policy, correcting the neural policy's action only if the chosen action can cause violation of the inductive invariant. Experimental results demonstrate that our approach can be used to realize fully trustworthy reinforcement learning systems with low overhead.

References

[1] M. Ahmadi, C. Rowley, and U. Topcu. 2018. Control-Oriented Learning of Lagrangian and Hamiltonian Systems. In American Control Conference.
[2] Anayo K. Akametalu, Shahab Kaynama, Jaime F. Fisac, Melanie Nicole Zeilinger, Jeremy H. Gillula, and Claire J. Tomlin. 2014. Reachability-based safe learning with Gaussian processes. In 53rd IEEE Conference on Decision and Control, CDC 2014, 1424–1431.
[3] Anayo K. Akametalu, Shahab Kaynama, Jaime F. Fisac, Melanie Nicole Zeilinger, Jeremy H. Gillula, and Claire J. Tomlin. 2014. Reachability-based safe learning with Gaussian processes. In 53rd IEEE Conference on Decision and Control, CDC 2014, Los Angeles, CA, USA, December 15-17, 2014, 1424–1431.
[4] Mohammed Alshiekh, Roderick Bloem, Rüdiger Ehlers, Bettina Könighofer, Scott Niekum, and Ufuk Topcu. 2018. Safe Reinforcement Learning via Shielding. AAAI (2018).
[5] Rajeev Alur, Rastislav Bodík, Eric Dallal, Dana Fisman, Pranav Garg, Garvit Juniwal, Hadas Kress-Gazit, P. Madhusudan, Milo M. K. Martin, Mukund Raghothaman, Shambwaditya Saha, Sanjit A. Seshia, Rishabh Singh, Armando Solar-Lezama, Emina Torlak, and Abhishek Udupa. 2015. Syntax-Guided Synthesis. In Dependable Software Systems Engineering, NATO Science for Peace and Security Series D: Information and Communication Security, Vol. 40. IOS Press, 1–25.
[6] MOSEK ApS. 2018. The MOSEK optimization software. http://www.mosek.com
[7] Somil Bansal, Mo Chen, Sylvia Herbert, and Claire J. Tomlin. 2017. Hamilton-Jacobi Reachability: A Brief Overview and Recent Advances. (2017).
[8] Osbert Bastani, Yani Ioannou, Leonidas Lampropoulos, Dimitrios Vytiniotis, Aditya V. Nori, and Antonio Criminisi. 2016. Measuring Neural Net Robustness with Constraints. In Proceedings of the 30th International Conference on Neural Information Processing Systems (NIPS'16), 2621–2629.
[9] Osbert Bastani, Yewen Pu, and Armando Solar-Lezama. 2018. Verifiable Reinforcement Learning via Policy Extraction. CoRR abs/1805.08328 (2018).
[10] Richard N. Bergman, Diane T. Finegood, and Marilyn Ader. 1985. Assessment of insulin sensitivity in vivo. Endocrine Reviews 6, 1 (1985), 45–86.
[11] Felix Berkenkamp, Matteo Turchetta, Angela P. Schoellig, and Andreas Krause. 2017. Safe Model-based Reinforcement Learning with Stability Guarantees. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 908–919.
[12] Roderick Bloem, Bettina Könighofer, Robert Könighofer, and Chao Wang. 2015. Shield Synthesis: Runtime Enforcement for Reactive Systems. In Tools and Algorithms for the Construction and Analysis of Systems - 21st International Conference, TACAS 2015, 533–548.
[13] Leonardo De Moura and Nikolaj Bjørner. 2008. Z3: An Efficient SMT Solver. In Proceedings of the Theory and Practice of Software, 14th International Conference on Tools and Algorithms for the Construction and Analysis of Systems (TACAS'08/ETAPS'08), 337–340.
[14] Peter Dorato, Vito Cerone, and Chaouki Abdallah. 1994. Linear-Quadratic Control: An Introduction. Simon & Schuster, Inc., New York, NY, USA.
[15] Chuchu Fan, Umang Mathur, Sayan Mitra, and Mahesh Viswanathan. 2018. Controller Synthesis Made Real: Reach-Avoid Specifications and Linear Dynamics. In Computer Aided Verification - 30th International Conference, CAV 2018, 347–366.
[16] Ioannis Filippidis, Sumanth Dathathri, Scott C. Livingston, Necmiye Ozay, and Richard M. Murray. 2016. Control design for hybrid systems with TuLiP: The Temporal Logic Planning toolbox. In 2016 IEEE Conference on Control Applications, CCA 2016, 1030–1041.
[17] Timon Gehr, Matthew Mirman, Dana Drachsler-Cohen, Petar Tsankov, Swarat Chaudhuri, and Martin T. Vechev. 2018. AI2: Safety and Robustness Certification of Neural Networks with Abstract Interpretation. In 2018 IEEE Symposium on Security and Privacy, SP 2018, 3–18.
[18] Jeremy H. Gillula and Claire J. Tomlin. 2012. Guaranteed Safe Online Learning via Reachability: tracking a ground target using a quadrotor. In IEEE International Conference on Robotics and Automation, ICRA 2012, 14-18 May 2012, St. Paul, Minnesota, USA, 2723–2730.
[19] Sumit Gulwani, Saurabh Srivastava, and Ramarathnam Venkatesan. 2008. Program Analysis As Constraint Solving. In Proceedings of the 29th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '08), 281–292.
[20] Sumit Gulwani and Ashish Tiwari. 2008. Constraint-based approach for analysis of hybrid systems. In International Conference on Computer Aided Verification, 190–203.
[21] Xiaowei Huang, Marta Kwiatkowska, Sen Wang, and Min Wu. 2017. Safety Verification of Deep Neural Networks. In CAV (1) (Lecture Notes in Computer Science), Vol. 10426, 3–29.
[22] Dominic W. Jordan and Peter Smith. 1987. Nonlinear ordinary differential equations. (1987).
[23] Guy Katz, Clark W. Barrett, David L. Dill, Kyle Julian, and Mykel J. Kochenderfer. 2017. Reluplex: An Efficient SMT Solver for Verifying Deep Neural Networks. In Computer Aided Verification - 29th International Conference, CAV 2017, 97–117.
[24] Hui Kong, Fei He, Xiaoyu Song, William N. N. Hung, and Ming Gu. 2013. Exponential-Condition-Based Barrier Certificate Generation for Safety Verification of Hybrid Systems. In Proceedings of the 25th International Conference on Computer Aided Verification (CAV'13), 242–257.
[25] Bettina Könighofer, Mohammed Alshiekh, Roderick Bloem, Laura Humphrey, Robert Könighofer, Ufuk Topcu, and Chao Wang. 2017. Shield Synthesis. Form. Methods Syst. Des. 51, 2 (Nov. 2017), 332–361.
[26] Benoît Legat, Chris Coey, Robin Deits, Joey Huchette, and Amelia Perry. 2017 (accessed Nov 16, 2018). Sum-of-squares optimization in Julia. https://www.juliaopt.org/meetings/mit2017/legat.pdf
[27] Yuxi Li. 2017. Deep Reinforcement Learning: An Overview. CoRR abs/1701.07274 (2017).
[28] Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. 2016. Continuous control with deep reinforcement learning. In 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings.
[29] Horia Mania, Aurelia Guy, and Benjamin Recht. 2018. Simple random search of static linear policies is competitive for reinforcement learning. In Advances in Neural Information Processing Systems 31, 1800–1809.
[30] J. Matyas. 1965. Random optimization. Automation and Remote Control 26, 2 (1965), 246–253.
[31] Matthew Mirman, Timon Gehr, and Martin T. Vechev. 2018. Differentiable Abstract Interpretation for Provably Robust Neural Networks. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, 3575–3583.
[32] Teodor Mihai Moldovan and Pieter Abbeel. 2012. Safe Exploration in Markov Decision Processes. In Proceedings of the 29th International Conference on International Conference on Machine Learning (ICML'12), 1451–1458.
[33] Yurii Nesterov and Vladimir Spokoiny. 2017. Random Gradient-Free Minimization of Convex Functions. Found. Comput. Math. 17, 2 (2017), 527–566.
[34] Stephen Prajna and Ali Jadbabaie. 2004. Safety Verification of Hybrid Systems Using Barrier Certificates. In Hybrid Systems: Computation and Control, 7th International Workshop, HSCC 2004, 477–492.
[35] Aravind Rajeswaran, Kendall Lowrey, Emanuel Todorov, and Sham M. Kakade. 2017. Towards Generalization and Simplicity in Continuous Control. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 6553–6564.
[36] Stéphane Ross, Geoffrey J. Gordon, and Drew Bagnell. 2011. A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2011, 627–635.
[37] Wenjie Ruan, Xiaowei Huang, and Marta Kwiatkowska. 2018. Reachability Analysis of Deep Neural Networks with Provable Guarantees. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI 2018, 2651–2659.
[38] Stefan Schaal. 1999. Is imitation learning the route to humanoid robots? Trends in Cognitive Sciences 3, 6 (1999), 233–242.
[39] Bastian Schürmann and Matthias Althoff. 2017. Optimal control of sets of solutions to formally guarantee constraints of disturbed linear systems. In 2017 American Control Conference, ACC 2017, 2522–2529.
[40] David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Riedmiller. 2014. Deterministic Policy Gradient Algorithms. In Proceedings of the 31st International Conference on International Conference on Machine Learning - Volume 32, I–387–I–395.
[41] Armando Solar-Lezama. 2008. Program Synthesis by Sketching. Ph.D. Dissertation. Berkeley, CA, USA. Advisor(s) Bodik, Rastislav.
[42] Armando Solar-Lezama. 2009. The Sketching Approach to Program Synthesis. In Proceedings of the 7th Asian Symposium on Programming Languages and Systems (APLAS '09), 4–13.
[43] Youcheng Sun, Min Wu, Wenjie Ruan, Xiaowei Huang, Marta Kwiatkowska, and Daniel Kroening. 2018. Concolic Testing for Deep Neural Networks. In Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering (ASE 2018), 109–119.
[44] Russ Tedrake. 2009. LQR-trees: Feedback motion planning on sparse randomized trees. In Robotics: Science and Systems V, University of Washington, Seattle, USA, June 28 - July 1, 2009.
[45] Yuchi Tian, Kexin Pei, Suman Jana, and Baishakhi Ray. 2018. DeepTest: Automated Testing of Deep-neural-network-driven Autonomous Cars. In Proceedings of the 40th International Conference on Software Engineering (ICSE '18), 303–314.
[46] Matteo Turchetta, Felix Berkenkamp, and Andreas Krause. 2016. Safe Exploration in Finite Markov Decision Processes with Gaussian Processes. In Proceedings of the 30th International Conference on Neural Information Processing Systems (NIPS'16), 4312–4320.
[47] Abhinav Verma, Vijayaraghavan Murali, Rishabh Singh, Pushmeet Kohli, and Swarat Chaudhuri. 2018. Programmatically Interpretable Reinforcement Learning. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, 5052–5061.

[48] Matthew Wicker Xiaowei Huang and Marta Kwiatkowska 2018Feature-Guided Black-Box Safety Testing of Deep Neural NetworksIn Tools and Algorithms for the Construction and Analysis of Systems -24th International Conference TACAS 2018 408ndash426

15

[49] He Zhu Zikang Xiong Stephen Magill and Suresh Jagannathan 2018An Inductive Synthesis Framework for Verifiable Reinforcement Learn-ing Technical Report Galois Inc httpherowanzhugithubio

herowanzhugithubioVerifiableLearningpdf

16

Figure 6. CEGIS for Verifiable Reinforcement Learning.

If verification fails in line 11, our approach simply reduces the initial state space, hoping that a safe policy program is easier to synthesize if only a subset of initial states is considered. The algorithm in line 12 gradually shrinks the radius r* of the initial state space around the sampled initial state s0. The synthesis algorithm in the next iteration synthesizes and verifies a candidate using the reduced initial state space. The main idea is that if the initial state space is shrunk to a restricted area around s0 but a safe policy program still cannot be found, it is quite possible that either s0 points to an unsafe initial state of the neural oracle or the sketch is not sufficiently expressive.

If verification at this stage succeeds with an inductive invariant φ, a new policy program Pθ is synthesized that can be verified safe in the state space covered by φ. We add the inductive invariant and the policy program into covers and policies in lines 14 and 15, respectively, and then move to another iteration of counterexample-guided synthesis. This iterative synthesize-and-verify process continues until the entire initial state space is covered (lines 3 to 18). The output of Algorithm 2 is [(Pθ1, φ1), (Pθ2, φ2), · · · ], which essentially defines conditional statements in a synthesized program, performing different actions depending on whether a specified invariant condition evaluates to true or false.

Theorem 4.2. If CEGIS(πw, P[θ], C[·]) = [(Pθ1, φ1), (Pθ2, φ2), · · · ] (as defined in Algorithm 2), then the deterministic program

  P = λX. if φ1(X) return Pθ1(X) else if φ2(X) return Pθ2(X) · · ·

is safe in the environment C, meaning that φ1 ∨ φ2 ∨ · · · is an inductive invariant of C[P] proving that C.Su is unreachable.

Example 4.3. We illustrate the proposed counterexample-guided inductive synthesis method by means of a simple example, the Duffing oscillator [22], a nonlinear second-order environment. The transition relation of the environment system C is described by the differential equation

  ẋ = y,   ẏ = −0.6y − x − x³ + a,

where x, y are the state variables and a is the continuous control action given by a well-trained neural feedback control policy π, such that a = π(x, y). The control objective is to regulate the state to the origin from a set of initial states C.S0 = {x, y | −2.5 ≤ x ≤ 2.5 ∧ −2 ≤ y ≤ 2}. To be safe, the controller must be able to avoid a set of unsafe states C.Su = {x, y | ¬(−5 ≤ x ≤ 5 ∧ −5 ≤ y ≤ 5)}. Given a program sketch as in the pendulum example, that is P[θ](x, y) = θ1x + θ2y, the user can ask the constraint solver to reason over a low-degree (say 4) polynomial invariant sketch for a synthesized program, as in Example 4.1.

Algorithm 2 initially samples an initial state s with x = −0.46, y = −0.36 from S0. The inner do-while loop of the algorithm can discover a sub-region of initial states, shown in the dotted box of Fig. 6(a), that can be leveraged to synthesize a verified deterministic policy P1(x, y) = 0.39x − 1.41y from the sketch. We also obtain an inductive invariant showing that the synthesized policy can always maintain the controller within the invariant set drawn in purple in Fig. 6(a):
φ1 ≡ 209x⁴ + 29x³y + 14x²y² + 0.4xy³ + 296x³ + 201x²y + 113xy² + 16y³ + 252x² + 392xy + 537y² − 680 ≤ 0

This invariant explains why the initial state space used for verification does not include the entire C.S0: a counterexample initial state s′ = (x = 2.249, y = 2) is not covered by the invariant, for which the synthesized policy program above is not verified safe. The CEGIS loop in line 3 of Algorithm 2 uses s′ to synthesize another deterministic policy P2(x, y) = 0.88x − 2.34y from the sketch, whose learned inductive invariant is depicted in blue in Fig. 6(b):
φ2 ≡ 128x⁴ + 0.9x³y − 0.2x²y² − 59x³ − 15xy² − 0.3y³ + 22x² + 47xy + 404y² − 619 ≤ 0
Algorithm 2 then terminates because φ1 ∨ φ2 covers C.S0. Our system interprets the two synthesized deterministic policies as the following deterministic program Poscillator, using the syntax in Fig. 5:

def Poscillator (x, y)
  if 209x⁴ + 29x³y + 14x²y² + · · · + 537y² − 680 ≤ 0   (φ1)
    return 0.39x − 1.41y
  else if 128x⁴ + 0.9x³y − 0.2x²y² + · · · + 404y² − 619 ≤ 0   (φ2)
    return 0.88x − 2.34y
  else abort   (unsafe)

Neither of the two deterministic policies enforces safety by itself on all initial states, but they do guarantee safety when combined, because by construction φ = φ1 ∨ φ2 is an inductive invariant of Poscillator in the environment C.
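To make the example concrete, the short script below (a sketch only) rolls out the Duffing dynamics under the first synthesized policy P1 from the sampled initial state, using forward Euler integration with the 0.01-second step reported in Section 5, and checks at every step that the state stays inside the safe rectangle; the decimal placement of the policy gains follows our reading of P1 above.

# Sketch: simulate the Duffing oscillator under the synthesized policy
# P1(x, y) = 0.39x - 1.41y and check that the trajectory never enters the
# unsafe set C.Su = { (x, y) | not(-5 <= x <= 5 and -5 <= y <= 5) }.
# Forward Euler with dt = 0.01s, the simulation step used in Section 5.

def p1(x, y):
    return 0.39 * x - 1.41 * y

def duffing_step(x, y, a, dt=0.01):
    # x' = y,  y' = -0.6y - x - x^3 + a
    return x + dt * y, y + dt * (-0.6 * y - x - x ** 3 + a)

def is_safe(x, y):
    return -5.0 <= x <= 5.0 and -5.0 <= y <= 5.0

x, y = -0.46, -0.36          # the initial state sampled by Algorithm 2
for step in range(5000):
    if not is_safe(x, y):
        print("unsafe at step", step)
        break
    x, y = duffing_step(x, y, p1(x, y))
else:
    print("trajectory stayed safe; final state", round(x, 3), round(y, 3))

The same loop, with p1 replaced by the full Poscillator program, exercises the conditional structure synthesized by Algorithm 2.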

Although Algorithm 2 is sound, it may not terminate: r* in the algorithm can become arbitrarily small, and there is also no restriction on the size of potential counterexamples. Nonetheless, our experimental results indicate that the algorithm performs well in practice.

4.3 Shielding

Algorithm 3 Shield(s, C[πw], P, φ)
1: Predict s′ such that (s, s′) ∈ C[πw].Tt(s)
2: if φ(s′) then return πw(s)
3: else return P(s)

A safety proof of a synthesized deterministic program of a neural network does not automatically lift to a safety argument of the neural network from which it was derived, since the network may exhibit behaviors not captured by the simpler deterministic program. To bridge this divide, we recover soundness at runtime by monitoring the system behaviors of a neural network in its environment context, using the synthesized policy program and its inductive invariant as a shield. The pseudo-code for using a shield is given in Algorithm 3. In line 1 of Algorithm 3, for a current state s, we use the state transition system of our environment context model to predict the next state s′. If s′ is not within φ, we are unsure whether entering s′ would inevitably make the system unsafe, as we lack a proof that the neural oracle is safe. However, we do have a guarantee that if we follow the synthesized program P, the system will stay within the safety boundary defined by φ that Algorithm 2 has formally proved. We do so in line 3 of Algorithm 3, using P and φ as shields, intervening only if necessary so as to restrict the shield from unnecessarily intervening on the neural policy.⁴
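A minimal Python rendering of this runtime check is sketched below; neural_policy, program_policy, invariant, and env_predict are hypothetical stand-ins for πw, P, φ, and the environment-model prediction used in line 1 of Algorithm 3, not our actual implementation.

# Sketch of Algorithm 3: use the neural action whenever the predicted next
# state remains inside the invariant phi, and otherwise fall back to the
# verified program.  All four arguments are placeholder callables.
def shield(s, neural_policy, program_policy, invariant, env_predict):
    a_nn = neural_policy(s)
    s_next = env_predict(s, a_nn)   # line 1: predict s' under the neural action
    if invariant(s_next):           # line 2: s' provably stays within phi
        return a_nn
    return program_policy(s)        # line 3: intervene with the verified program

Because the fallback program was verified against φ, an intervention can never drive the system outside the proven safety boundary.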

5 Experimental Results
We have applied our framework to a number of challenging control- and cyberphysical-system benchmarks. We consider the utility of our approach for verifying the safety of trained neural network controllers. We use the deep policy gradient algorithm [28] for neural network training, the Z3 SMT solver [13] to check convergence of the CEGIS loop, and the Mosek constraint solver [6] to generate inductive invariants of synthesized programs from a sketch. All of our benchmarks are verified using the program sketch defined in equation (4) and the invariant sketch defined in equation (7). We report simulation results on our benchmarks over 1000 runs (each run consists of 5000 simulation steps). Each simulation time step is fixed at 0.01 second. Our experiments were conducted on a standard desktop machine with an Intel(R) Core(TM) i7-8700 CPU and 64GB memory.
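As an illustration of the convergence check, the snippet below uses the z3-solver Python bindings to ask whether a disjunction of invariants covers an initial state box by searching for an uncovered initial state; the two quadratic invariants are illustrative placeholders rather than the degree-4 invariants reported in Example 4.3.

# Sketch: the CEGIS loop terminates when phi_1 or phi_2 or ... covers C.S0.
# We check coverage by asking Z3 for an initial state satisfied by neither
# invariant.  The invariants below are simple placeholders, not the actual
# degree-4 invariants synthesized for the Duffing oscillator.
from z3 import Reals, Solver, And, Or, Not, sat

x, y = Reals('x y')
S0   = And(-2.5 <= x, x <= 2.5, -2 <= y, y <= 2)   # initial states of Example 4.3
phi1 = x * x + 2 * y * y - 12 <= 0                  # placeholder invariant
phi2 = 3 * x * x + y * y - 24 <= 0                  # placeholder invariant

solver = Solver()
solver.add(S0, Not(Or(phi1, phi2)))   # any initial state covered by neither?
if solver.check() == sat:
    print("not yet covered; counterexample:", solver.model())
else:
    print("the disjunction covers C.S0; the CEGIS loop can terminate")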

Case Study on Inverted Pendulum. We first give more details on the evaluation result of our running example, the inverted pendulum. Here we consider a more restricted safety condition that deems the system to be unsafe when the pendulum's angle is more than 23° from the origin (i.e., significant swings are further prohibited). The controller is a 2-hidden-layer neural model (240 × 200). Our tool interprets the neural network as a program containing three conditional branches:

def P (η, ω)
  if 17533η⁴ + 13732η³ω + 3831η²ω² − 5472ηω³ + 8579ω⁴ + 6813η³ + 9634η²ω + 3947ηω² − 120ω³ + 1928η² + 1915ηω + 1104ω² − 313 ≤ 0
    return −1.728176866η − 1.009441768ω
  else if 2485η⁴ + 826η³ω − 351η²ω² + 581ηω³ + 2579ω⁴ + 591η³ + 9η²ω + 243ηω² − 189ω³ + 484η² + 170ηω + 287ω² − 82 ≤ 0
    return −1.734281984η − 1.073944835ω
  else if 115496η⁴ + 64763η³ω + 85376η²ω² + 21365ηω³ + 7661ω⁴ − 111271η³ − 54416η²ω − 66684ηω² − 8742ω³ + 33701η² + 11736ηω + 12503ω² − 1185 ≤ 0
    return −2.578835525η − 1.625056971ω
  else abort

⁴ We also extended our approach to synthesize deterministic programs which can guarantee stability; see the supplementary material [49].

Over 1000 simulation runs, we found 60 unsafe cases when running the neural controller alone. Importantly, running the neural controller in tandem with the above verified and synthesized program prevents all unsafe neural decisions, with only 65 interventions from the program on the neural network. Our results demonstrate that CEGIS is important to ensure a synthesized program is safe to use. We report more details, including synthesis times, below.

One may ask why we do not directly learn a deterministic program to control the device (without appealing to the neural policy at all), using reinforcement learning to synthesize its unknown parameters. Even for an example as simple as the inverted pendulum, our discussion above shows that it is difficult (if not impossible) to derive a single straight-line (linear) program that is safe to control the system. Even for scenarios in which a straight-line program suffices, using existing RL methods to directly learn unknown parameters in our sketches may still fail to obtain a safe program. For example, we considered whether a linear control policy can be learned to (1) prevent the pendulum from falling down and (2) require a pendulum control action to be strictly within the range [−1, 1] (e.g., operating the controller in an environment with low power constraints). We found that, despite many experiments on tuning learning rates and rewards, directly training a linear control program to conform to this restriction with either reinforcement learning (e.g., policy gradient) or random search [29] was unsuccessful because of undesirable overfitting. In contrast, neural networks work much better with these RL algorithms and can be used to guide the synthesis of a deterministic program policy. Indeed, by treating the neural policy as an oracle, we were able to quickly discover a straight-line linear deterministic program that in fact satisfies this additional motion constraint.
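To sketch how a neural oracle can guide parameter search in a sketch, the code below runs a simple derivative-free random search (in the spirit of [29, 30]) over the two parameters of P[θ](η, ω) = θ1η + θ2ω, fitting the oracle's actions on sampled states while clipping actions to [−1, 1]; the oracle function here is a stand-in for the trained network, and the objective is a simplified, penalty-free version of the synthesis objective in equation (5).

# Sketch: derivative-free random search for the two unknown parameters of the
# sketch P[theta](eta, omega) = theta1*eta + theta2*omega, guided by a neural
# oracle.  "oracle" is a placeholder for the trained network, and the loss is
# a simplified, penalty-free stand-in for the synthesis objective.
import random

def oracle(eta, omega):                      # placeholder for the neural policy
    return max(-1.0, min(1.0, -1.7 * eta - 1.0 * omega))

states = [(random.uniform(-0.5, 0.5), random.uniform(-0.5, 0.5)) for _ in range(200)]

def loss(theta):
    t1, t2 = theta
    total = 0.0
    for eta, omega in states:
        a = max(-1.0, min(1.0, t1 * eta + t2 * omega))   # action clipped to [-1, 1]
        total += (a - oracle(eta, omega)) ** 2
    return total

theta, best = [0.0, 0.0], loss([0.0, 0.0])
for _ in range(2000):
    cand = [theta[0] + random.gauss(0, 0.5), theta[1] + random.gauss(0, 0.5)]
    c = loss(cand)
    if c < best:
        theta, best = cand, c
print("synthesized parameters:", theta, "loss:", best)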

Safety Verification. Our verification results are given in Table 1. In the table, Vars represents the number of variables in a control system; this number serves as a proxy for application complexity. Network Size gives the number of neurons in the hidden layers of the network, Training the training time for the network, and Failures the number of times the network failed to satisfy the safety property in simulation. The table also gives the Size of a synthesized program in terms of the number of policies found by Algorithm 2 (used to generate conditional statements in the program), its Synthesis time, the Overhead of our approach in terms of the additional cost (compared to the non-shielded variant) in running time to use a shield, and the number of Interventions, the number of times the shield was invoked across all simulation runs. We also report the performance gap between a shielded neural policy (NN) and a purely programmatic policy (Program) in terms of the number of steps, on average, that a controlled system spends in reaching a steady state of the system (i.e., a convergence state).

Table 1. Experimental Results on Deterministic Program Synthesis, Verification, and Shielding

Benchmarks         | Vars | Network Size    | Training | Failures | Program Size | Synthesis | Overhead | Interventions | NN    | Program
Satellite          | 2    | 240 × 200       | 957s     | 0        | 1            | 160s      | 337      | 0             | 57    | 97
DCMotor            | 3    | 240 × 200       | 944s     | 0        | 1            | 68s       | 203      | 0             | 119   | 122
Tape               | 3    | 240 × 200       | 980s     | 0        | 1            | 42s       | 263      | 0             | 30    | 36
Magnetic Pointer   | 3    | 240 × 200       | 992s     | 0        | 1            | 85s       | 292      | 0             | 83    | 88
Suspension         | 4    | 240 × 200       | 960s     | 0        | 1            | 41s       | 871      | 0             | 47    | 61
Biology            | 3    | 240 × 200       | 978s     | 0        | 1            | 168s      | 523      | 0             | 2464  | 2599
DataCenter Cooling | 3    | 240 × 200       | 968s     | 0        | 1            | 168s      | 469      | 0             | 146   | 401
Quadcopter         | 2    | 300 × 200       | 990s     | 182      | 2            | 67s       | 641      | 185           | 72    | 98
Pendulum           | 2    | 240 × 200       | 962s     | 60       | 3            | 1107s     | 965      | 65            | 442   | 586
CartPole           | 4    | 300 × 200       | 990s     | 47       | 4            | 998s      | 562      | 1799          | 6813  | 19126
Self-Driving       | 4    | 300 × 200       | 990s     | 61       | 1            | 185s      | 466      | 236           | 1459  | 5136
Lane Keeping       | 4    | 240 × 200       | 895s     | 36       | 1            | 183s      | 865      | 64            | 3753  | 6435
4-Car platoon      | 8    | 500 × 400 × 300 | 1160s    | 8        | 4            | 609s      | 317      | 8             | 76    | 96
8-Car platoon      | 16   | 500 × 400 × 300 | 1165s    | 40       | 1            | 1217s     | 605      | 1080          | 385   | 554
Oscillator         | 18   | 240 × 200       | 1023s    | 371      | 1            | 618s      | 2131     | 93703         | 6935  | 11353

The first five benchmarks are linear time-invariant control systems adapted from [15]. The safety property is that the reach set has to be within a safe rectangle. Benchmark Biology defines a minimal model of glycemic control in diabetic patients, such that the dynamics of glucose and insulin interaction in the blood system are defined by polynomials [10]. For safety, we verify that the neural controller ensures that the level of plasma glucose concentration is above a certain threshold. Benchmark DataCenter Cooling is a model of a collection of three server racks, each with their own cooling devices, which also shed heat to their neighbors. The safety property is that a learned controller must keep the data center below a certain temperature. In these benchmarks, the cost to query the network oracle constitutes the dominant time for generating the safety shield. Given the simplicity of these benchmarks, the neural network controllers did not violate the safety condition in our trials, and moreover there were no interventions from the safety shields that affected performance.

The next three benchmarks, Quadcopter, (Inverted) Pendulum, and Cartpole, are selected from classic control applications and have more sophisticated safety conditions. We have discussed the inverted pendulum example at length earlier. The Quadcopter environment tests whether a controlled quadcopter can realize stable flight. The environment of Cartpole consists of a pole attached to an unactuated joint connected to a cart that moves along a frictionless track. The system is unsafe when the pole's angle is more than 30° from being upright or the cart moves by more than 0.3 meters from the origin. We observed safety violations in each of these benchmarks that were eliminated using our verification methodology. Notably, the number of interventions is remarkably low as a percentage of the overall number of simulation steps.

Benchmark Self-driving defines a single-car navigation problem. The neural controller is responsible for preventing the car from veering into canals found on either side of the road. Benchmark Lane Keeping models another safety-related car-driving problem. The neural controller aims to maintain a vehicle between lane markers and keep it centered in a possibly curved lane. The curvature of the road is considered as a disturbance input. Environment disturbances of this kind can be conservatively specified in our model, accounting for noise and unmodeled dynamics. Our verification approach supports these disturbances (verification condition (10)). Benchmarks n-Car platoon model multiple (n) vehicles forming a platoon, maintaining a safe relative distance among one another [39]. Each of these benchmarks exhibited some number of violations that were remediated by our verification methodology. Benchmark Oscillator consists of a two-dimensional switched oscillator plus a 16th-order filter. The filter smoothens the input signals and has a single output signal. We verify that the output signal is below a safe threshold. Because of the model complexity of this benchmark, it exhibited significantly more violations than the others. Indeed, the neural-network controlled system often oscillated between the safe and unsafe boundary in many runs. Consequently, the overhead in this benchmark is high, because a large number of shield interventions was required to ensure safety. In other words, the synthesized shield trades performance for safety, to guarantee that the threshold boundary is never violated.

For all benchmarks, our tool successfully generated safe, interpretable, deterministic programs and inductive invariants as shields. When a neural controller takes an unsafe action, the synthesized shield correctly prevents this action from executing by providing an alternative, provably safe action proposed by the verified deterministic program. In terms of performance, Table 1 shows that a shielded neural policy is a feasible approach to drive a controlled system into a steady state. For each of the benchmarks studied, the programmatic policy is less performant than the shielded neural policy, sometimes by a factor of two or more (e.g., Cartpole, Self-Driving, and Oscillator). Our result demonstrates that executing a neural policy in tandem with a program distilled from it can retain the performance provided by the neural policy while maintaining the safety provided by the verified program.

Although our synthesis algorithm does not guarantee convergence to the global minimum when applied to nonconvex RL problems, the results given in Table 1 indicate that our algorithm can often produce high-quality control programs, with respect to a provided sketch, that converge reasonably fast. In some of our benchmarks, however, the number of interventions is significantly higher than the number of neural controller failures, e.g., 8-Car platoon and Oscillator. The high number of interventions is not primarily because of non-optimality of the synthesized programmatic controller; instead, inherent safety issues in the neural network models are the main culprit that triggers shield interventions. In 8-Car platoon, after corrections made by our deterministic program, the neural model again takes an unsafe action in the next execution step, so that the deterministic program has to make another correction. It is only after applying a number of such shield interventions that the system navigates into a part of the state space that can be safely operated on by the neural model. For this benchmark, all system states at which shield interventions occur are indeed unsafe. We also examined the unsafe simulation runs made by executing the neural controller alone in Oscillator. Among the large number of shield interventions (as reflected in Table 1), 74% of them are effective and indeed prevent an unsafe neural decision. Ineffective interventions in Oscillator are due to the fact that, when optimizing equation (5), a large penalty is given to unsafe states, causing the synthesized programmatic policy to weigh safety more than proximity when there exist a large number of unsafe neural decisions.
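To put these counts in perspective, a quick calculation over the 1000 runs of 5000 steps each reported above (5,000,000 simulated steps per benchmark) expresses the interventions in Table 1 as a fraction of all steps and, for Oscillator, the number of effective interventions implied by the 74% figure:

# Intervention counts from Table 1 as a fraction of the
# 1000 runs x 5000 steps = 5,000,000 simulated steps per benchmark.
total_steps = 1000 * 5000
for name, interventions in [("Pendulum", 65), ("CartPole", 1799),
                            ("8-Car platoon", 1080), ("Oscillator", 93703)]:
    print(name, round(100.0 * interventions / total_steps, 4), "% of steps")
print("Oscillator effective interventions ~", round(0.74 * 93703))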

Suitable Sketches. Providing a suitable sketch may need domain knowledge. To help the user more easily tune the shape of a sketch, our approach provides algorithmic support by not requiring conditional statements in a program sketch, and syntactic sugar, i.e., the user can simply provide an upper bound on the degree of an invariant sketch.
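For instance, a degree bound elaborates mechanically into an invariant template whose unknown coefficients are left to the constraint solver; the helper below, a simplified illustration rather than the template generator in our tool, enumerates the monomials of the state variables up to a given degree.

# Sketch: elaborate a degree bound into an invariant template
# c_1*m_1 + c_2*m_2 + ... + c_k*m_k <= 0, where the m_i are all monomials of
# the state variables up to the given degree and the c_i are left as unknowns
# for the constraint solver.  Illustrative only.
from itertools import combinations_with_replacement

def monomials(variables, max_degree):
    terms = ["1"]
    for d in range(1, max_degree + 1):
        for combo in combinations_with_replacement(variables, d):
            terms.append("*".join(combo))
    return terms

print(monomials(["x", "y"], 2))
# ['1', 'x', 'y', 'x*x', 'x*y', 'y*y']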

Table 2. Experimental Results on Tuning Invariant Degrees. TO means that an adequate inductive invariant cannot be found within 2 hours.

Benchmarks    | Degree | Verification | Interventions | Overhead
Pendulum      | 2      | TO           | -             | -
Pendulum      | 4      | 226s         | 40542         | 782
Pendulum      | 8      | 236s         | 30787         | 879
Self-Driving  | 2      | TO           | -             | -
Self-Driving  | 4      | 24s          | 128851        | 697
Self-Driving  | 8      | 251s         | 123671        | 2685
8-Car-platoon | 2      | 1729s        | 43952         | 836
8-Car-platoon | 4      | 5402s        | 37990         | 919
8-Car-platoon | 8      | TO           | -             | -

Our experimental results are collected using the invariant sketch defined in equation (7), and we chose an upper bound of 4 on the degree of all monomials included in the sketch. Recall that invariants may also be used as conditional predicates as part of a synthesized program. We adjust the invariant degree upper bound to evaluate its effect in our synthesis procedure. The results are given in Table 2.

Generally, high-degree invariants lead to fewer interventions because they tend to be more permissive than low-degree ones. However, high-degree invariants take more time to synthesize and verify. This is particularly true for high-dimension models such as 8-Car platoon. Moreover, although high-degree invariants tend to have fewer interventions, they have larger overhead. For example, using a shield of degree 8 in the Self-Driving benchmark caused an overhead of 2685. This is because high-degree polynomial computations are time-consuming. On the other hand, an insufficient degree upper bound may not be permissive enough to obtain a valid invariant. It is therefore essential to consider the tradeoff between overhead and permissiveness when choosing the degree of an invariant.

Handling Environment Changes. We consider the effectiveness of our tool when previously trained neural network controllers are deployed in environment contexts different from the environment used for training. Here we consider neural network controllers of larger size (two hidden layers with 1200 × 900 neurons) than in the above experiments. This is because we want to ensure that a neural policy is trained to be near optimal in the environment context used for training. These larger networks were in general more difficult to train, requiring at least 1500 seconds to converge.

Our results are summarized in Table 3. When the underlying environment slightly changes, learning a new safety shield takes substantially less time than training a new network. For Cartpole, we simulated the trained controller in a new environment by increasing the length of the pole by 0.15 meters. The neural controller failed 3 times in our 1000-episode simulation; the shield interfered with the network operation only 8 times to prevent these unsafe behaviors.


Table 3. Experimental Results on Handling Environment Changes

Benchmarks   | Environment Change                     | Network Size | Failures | Program Size | Synthesis | Overhead | Shield Interventions
Cartpole     | Increased pole length by 0.15m         | 1200 × 900   | 3        | 1            | 239s      | 291      | 8
Pendulum     | Increased pendulum mass by 0.3kg       | 1200 × 900   | 77       | 1            | 581s      | 811      | 8748
Pendulum     | Increased pendulum length by 0.15m     | 1200 × 900   | 76       | 1            | 483s      | 653      | 7060
Self-driving | Added an obstacle that must be avoided | 1200 × 900   | 203      | 1            | 392s      | 842      | 108320

The new shield was synthesized in 239s, significantly faster than retraining a new neural network for the new environment. For (Inverted) Pendulum, we deployed the trained neural network in an environment in which the pendulum's mass is increased by 0.3kg. The neural controller exhibits noticeably higher failure rates than in the previous experiment; we were able to synthesize a safety shield adapted to this new environment in 581 seconds that prevented these violations. The shield intervened with the operation of the network only 8.7 times per episode on average. Similar results were observed when we increased the pendulum's length by 0.15m. For Self-driving, we additionally required the car to avoid an obstacle. The synthesized shield provided safe actions to ensure collision-free motion.

6 Related Work
Verification of Neural Networks. While neural networks have historically found only limited application in safety- and security-related contexts, recent advances have defined suitable verification methodologies that are capable of providing stronger assurance guarantees. For example, Reluplex [23] is an SMT solver that supports linear real arithmetic with ReLU constraints and has been used to verify safety properties in a collision avoidance system. AI2 [17, 31] is an abstract interpretation tool that can reason about robustness specifications for deep convolutional networks. Robustness verification is also considered in [8, 21]. Systematic testing techniques such as [43, 45, 48] are designed to automatically generate test cases to increase a coverage metric, i.e., explore different parts of the neural network architecture by generating test inputs that maximize the number of activated neurons. These approaches ignore effects induced by the actual environment in which the network is deployed. They are typically ineffective for safety verification of a neural network controller over an infinite time horizon with complex system dynamics. Unlike these whitebox efforts that reason over the architecture of the network, our verification framework is fully blackbox, using a network only as an oracle to learn a simpler deterministic program. This gives us the ability to reason about safety entirely in terms of the synthesized program, using a shielding mechanism derived from the program to ensure that the neural policy can only explore safe regions.

Safe Reinforcement Learning. The machine learning community has explored various techniques to develop safe reinforcement learning algorithms in contexts similar to ours, e.g., [32, 46]. In some cases, verification is used as a safety validation oracle [2, 11]. In contrast to these efforts, our inductive synthesis algorithm can be freely incorporated into any existing reinforcement learning framework and applied to any legacy machine learning model. More importantly, our design allows us to synthesize new shields from existing ones without requiring retraining of the network under reasonable changes to the environment.

Syntax-Guided Synthesis for Machine Learning. An important reason underlying the success of program synthesis is that sketches of the desired program [41, 42], often provided along with example data, can be used to effectively restrict a search space and allow users to provide additional insight about the desired output [5]. Our sketch-based inductive synthesis approach is a realization of this idea, applicable to continuous and infinite-state control systems central to many machine learning tasks. A high-level policy language grammar is used in our approach to constrain the shape of possible synthesized deterministic policy programs. Our program parameters in sketches are continuous because they naturally fit continuous control problems. However, our synthesis algorithm (line 9 of Algorithm 2) can call a derivative-free random search method (which does not require the gradient or its approximative finite difference) for synthesizing functions whose domain is disconnected, (mixed-)integer, or non-smooth. Our synthesizer thus may be applied to other domains that are not continuous or differentiable, like most modern program synthesis algorithms [41, 42].

In a similar vein, recent efforts on interpretable machine learning generate interpretable models, such as program code [47] or decision trees [9], as output, after which traditional symbolic verification techniques can be leveraged to prove program properties. Our approach, by contrast, ensures that only safe programs are synthesized, via a CEGIS procedure, and can provide safety guarantees on the original high-performing neural networks via invariant inference. A detailed comparison was provided on page 3.

Controller Synthesis. Random search [30] has long been utilized as an effective approach for controller synthesis. In [29], several extensions of random search have been proposed that substantially improve its performance when applied to reinforcement learning. However, [29] learns a single linear controller, which we found to be insufficient to guarantee safety for our benchmarks in our experience. We often need to learn a program that involves a family of linear controllers conditioned on different input spaces to guarantee safety. The novelty of our approach is that it can automatically synthesize programs with conditional statements as needed, which is critical to ensure safety.

In Sec. 5, linear sketches were used in our experiments for shield synthesis. LQR-Tree based approaches such as [44] can synthesize a controller consisting of a set of linear quadratic regulator (LQR [14]) controllers, each applicable in a different portion of the state space. These approaches focus on stability, while our approach primarily addresses safety. In our experiments, we observed that because LQR does not take safe/unsafe regions into consideration, synthesized LQR controllers can regularly violate safety constraints.

Runtime Shielding. Shield synthesis has been used in runtime enforcement for reactive systems [12, 25]. Safety shielding in reinforcement learning was first introduced in [4]. Because their verification approach is not symbolic, however, it can only work over finite discrete state and action systems. Since states in infinite-state and continuous-action systems are not enumerable, using these methods requires the end user to provide a finite abstraction of complex environment dynamics; such abstractions are typically too coarse to be useful (because they result in too much over-approximation) or otherwise have too many states to be analyzable [15]. Indeed, automated environment abstraction tools such as [16] often take hours even for simple 4-dimensional systems. Our approach embraces the nature of infinity in control systems by learning a symbolic shield from an inductive invariant for a synthesized program that includes an infinite number of environment states under which the program can be guaranteed to be safe. We therefore believe our framework provides a more promising verification pathway for machine learning in high-dimensional control systems. In general, these efforts focused on shielding of discrete and finite systems, with no obvious generalization to effectively deal with the continuous and infinite systems that our approach can monitor and protect. Shielding in continuous control systems was introduced in [3, 18]. These approaches are based on HJ reachability (Hamilton-Jacobi reachability [7]). However, current formulations of HJ reachability are limited to systems with approximately five dimensions, making high-dimensional system verification intractable. Unlike their least restrictive control law framework that decouples safety and learning, our approach only considers state spaces that a learned controller attempts to explore and is therefore capable of synthesizing safety shields even in high dimensions.

7 Conclusions
This paper presents an inductive synthesis-based toolchain that can verify neural network control policies within an environment formalized as an infinite-state transition system. The key idea is a novel synthesis framework capable of synthesizing a deterministic policy program, based on a user-given sketch, that resembles a neural policy oracle and simultaneously satisfies a safety specification, using a counterexample-guided synthesis loop. Verification soundness is achieved at runtime by using the learned deterministic program, along with its learned inductive invariant, as a shield to protect the neural policy, correcting the neural policy's action only if the chosen action can cause violation of the inductive invariant. Experimental results demonstrate that our approach can be used to realize fully trustworthy reinforcement learning systems with low overhead.

References
[1] M. Ahmadi, C. Rowley, and U. Topcu. 2018. Control-Oriented Learning of Lagrangian and Hamiltonian Systems. In American Control Conference.
[2] Anayo K. Akametalu, Shahab Kaynama, Jaime F. Fisac, Melanie Nicole Zeilinger, Jeremy H. Gillula, and Claire J. Tomlin. 2014. Reachability-based safe learning with Gaussian processes. In 53rd IEEE Conference on Decision and Control, CDC 2014. 1424–1431.
[3] Anayo K. Akametalu, Shahab Kaynama, Jaime F. Fisac, Melanie Nicole Zeilinger, Jeremy H. Gillula, and Claire J. Tomlin. 2014. Reachability-based safe learning with Gaussian processes. In 53rd IEEE Conference on Decision and Control, CDC 2014, Los Angeles, CA, USA, December 15-17, 2014. 1424–1431.
[4] Mohammed Alshiekh, Roderick Bloem, Rüdiger Ehlers, Bettina Könighofer, Scott Niekum, and Ufuk Topcu. 2018. Safe Reinforcement Learning via Shielding. AAAI (2018).
[5] Rajeev Alur, Rastislav Bodík, Eric Dallal, Dana Fisman, Pranav Garg, Garvit Juniwal, Hadas Kress-Gazit, P. Madhusudan, Milo M. K. Martin, Mukund Raghothaman, Shambwaditya Saha, Sanjit A. Seshia, Rishabh Singh, Armando Solar-Lezama, Emina Torlak, and Abhishek Udupa. 2015. Syntax-Guided Synthesis. In Dependable Software Systems Engineering. NATO Science for Peace and Security Series D: Information and Communication Security, Vol. 40. IOS Press, 1–25.
[6] MOSEK ApS. 2018. The MOSEK optimization software. http://www.mosek.com
[7] Somil Bansal, Mo Chen, Sylvia Herbert, and Claire J. Tomlin. 2017. Hamilton-Jacobi Reachability: A Brief Overview and Recent Advances. (2017).
[8] Osbert Bastani, Yani Ioannou, Leonidas Lampropoulos, Dimitrios Vytiniotis, Aditya V. Nori, and Antonio Criminisi. 2016. Measuring Neural Net Robustness with Constraints. In Proceedings of the 30th International Conference on Neural Information Processing Systems (NIPS'16). 2621–2629.
[9] Osbert Bastani, Yewen Pu, and Armando Solar-Lezama. 2018. Verifiable Reinforcement Learning via Policy Extraction. CoRR abs/1805.08328 (2018).
[10] Richard N. Bergman, Diane T. Finegood, and Marilyn Ader. 1985. Assessment of insulin sensitivity in vivo. Endocrine Reviews 6, 1 (1985), 45–86.
[11] Felix Berkenkamp, Matteo Turchetta, Angela P. Schoellig, and Andreas Krause. 2017. Safe Model-based Reinforcement Learning with Stability Guarantees. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017. 908–919.
[12] Roderick Bloem, Bettina Könighofer, Robert Könighofer, and Chao Wang. 2015. Shield Synthesis: Runtime Enforcement for Reactive Systems. In Tools and Algorithms for the Construction and Analysis of Systems, 21st International Conference, TACAS 2015. 533–548.
[13] Leonardo De Moura and Nikolaj Bjørner. 2008. Z3: An Efficient SMT Solver. In Proceedings of the Theory and Practice of Software, 14th International Conference on Tools and Algorithms for the Construction and Analysis of Systems (TACAS'08/ETAPS'08). 337–340.

[14] Peter Dorato, Vito Cerone, and Chaouki Abdallah. 1994. Linear-Quadratic Control: An Introduction. Simon & Schuster, Inc., New York, NY, USA.
[15] Chuchu Fan, Umang Mathur, Sayan Mitra, and Mahesh Viswanathan. 2018. Controller Synthesis Made Real: Reach-Avoid Specifications and Linear Dynamics. In Computer Aided Verification, 30th International Conference, CAV 2018. 347–366.
[16] Ioannis Filippidis, Sumanth Dathathri, Scott C. Livingston, Necmiye Ozay, and Richard M. Murray. 2016. Control design for hybrid systems with TuLiP: The Temporal Logic Planning toolbox. In 2016 IEEE Conference on Control Applications, CCA 2016. 1030–1041.
[17] Timon Gehr, Matthew Mirman, Dana Drachsler-Cohen, Petar Tsankov, Swarat Chaudhuri, and Martin T. Vechev. 2018. AI2: Safety and Robustness Certification of Neural Networks with Abstract Interpretation. In 2018 IEEE Symposium on Security and Privacy, SP 2018. 3–18.
[18] Jeremy H. Gillula and Claire J. Tomlin. 2012. Guaranteed Safe Online Learning via Reachability: tracking a ground target using a quadrotor. In IEEE International Conference on Robotics and Automation, ICRA 2012, 14-18 May 2012, St. Paul, Minnesota, USA. 2723–2730.
[19] Sumit Gulwani, Saurabh Srivastava, and Ramarathnam Venkatesan. 2008. Program Analysis As Constraint Solving. In Proceedings of the 29th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '08). 281–292.
[20] Sumit Gulwani and Ashish Tiwari. 2008. Constraint-based approach for analysis of hybrid systems. In International Conference on Computer Aided Verification. 190–203.
[21] Xiaowei Huang, Marta Kwiatkowska, Sen Wang, and Min Wu. 2017. Safety Verification of Deep Neural Networks. In CAV (1) (Lecture Notes in Computer Science), Vol. 10426. 3–29.
[22] Dominic W. Jordan and Peter Smith. 1987. Nonlinear ordinary differential equations. (1987).
[23] Guy Katz, Clark W. Barrett, David L. Dill, Kyle Julian, and Mykel J. Kochenderfer. 2017. Reluplex: An Efficient SMT Solver for Verifying Deep Neural Networks. In Computer Aided Verification, 29th International Conference, CAV 2017. 97–117.
[24] Hui Kong, Fei He, Xiaoyu Song, William N. N. Hung, and Ming Gu. 2013. Exponential-Condition-Based Barrier Certificate Generation for Safety Verification of Hybrid Systems. In Proceedings of the 25th International Conference on Computer Aided Verification (CAV'13). 242–257.
[25] Bettina Könighofer, Mohammed Alshiekh, Roderick Bloem, Laura Humphrey, Robert Könighofer, Ufuk Topcu, and Chao Wang. 2017. Shield Synthesis. Form. Methods Syst. Des. 51, 2 (Nov. 2017), 332–361.
[26] Benoît Legat, Chris Coey, Robin Deits, Joey Huchette, and Amelia Perry. 2017 (accessed Nov. 16, 2018). Sum-of-squares optimization in Julia. https://www.juliaopt.org/meetings/mit2017/legat.pdf
[27] Yuxi Li. 2017. Deep Reinforcement Learning: An Overview. CoRR abs/1701.07274 (2017).
[28] Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. 2016. Continuous control with deep reinforcement learning. In 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings.
[29] Horia Mania, Aurelia Guy, and Benjamin Recht. 2018. Simple random search of static linear policies is competitive for reinforcement learning. In Advances in Neural Information Processing Systems 31. 1800–1809.
[30] J. Matyas. 1965. Random optimization. Automation and Remote Control 26, 2 (1965), 246–253.
[31] Matthew Mirman, Timon Gehr, and Martin T. Vechev. 2018. Differentiable Abstract Interpretation for Provably Robust Neural Networks. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018. 3575–3583.

[32] Teodor Mihai Moldovan and Pieter Abbeel. 2012. Safe Exploration in Markov Decision Processes. In Proceedings of the 29th International Conference on International Conference on Machine Learning (ICML'12). 1451–1458.
[33] Yurii Nesterov and Vladimir Spokoiny. 2017. Random Gradient-Free Minimization of Convex Functions. Found. Comput. Math. 17, 2 (2017), 527–566.
[34] Stephen Prajna and Ali Jadbabaie. 2004. Safety Verification of Hybrid Systems Using Barrier Certificates. In Hybrid Systems: Computation and Control, 7th International Workshop, HSCC 2004. 477–492.
[35] Aravind Rajeswaran, Kendall Lowrey, Emanuel Todorov, and Sham M. Kakade. 2017. Towards Generalization and Simplicity in Continuous Control. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017. 6553–6564.
[36] Stéphane Ross, Geoffrey J. Gordon, and Drew Bagnell. 2011. A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2011. 627–635.
[37] Wenjie Ruan, Xiaowei Huang, and Marta Kwiatkowska. 2018. Reachability Analysis of Deep Neural Networks with Provable Guarantees. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI 2018. 2651–2659.
[38] Stefan Schaal. 1999. Is imitation learning the route to humanoid robots? Trends in Cognitive Sciences 3, 6 (1999), 233–242.
[39] Bastian Schürmann and Matthias Althoff. 2017. Optimal control of sets of solutions to formally guarantee constraints of disturbed linear systems. In 2017 American Control Conference, ACC 2017. 2522–2529.
[40] David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Riedmiller. 2014. Deterministic Policy Gradient Algorithms. In Proceedings of the 31st International Conference on International Conference on Machine Learning, Volume 32. I-387–I-395.
[41] Armando Solar-Lezama. 2008. Program Synthesis by Sketching. Ph.D. Dissertation. Berkeley, CA, USA. Advisor(s): Bodik, Rastislav.
[42] Armando Solar-Lezama. 2009. The Sketching Approach to Program Synthesis. In Proceedings of the 7th Asian Symposium on Programming Languages and Systems (APLAS '09). 4–13.
[43] Youcheng Sun, Min Wu, Wenjie Ruan, Xiaowei Huang, Marta Kwiatkowska, and Daniel Kroening. 2018. Concolic Testing for Deep Neural Networks. In Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering (ASE 2018). 109–119.
[44] Russ Tedrake. 2009. LQR-trees: Feedback motion planning on sparse randomized trees. In Robotics: Science and Systems V, University of Washington, Seattle, USA, June 28 - July 1, 2009.
[45] Yuchi Tian, Kexin Pei, Suman Jana, and Baishakhi Ray. 2018. DeepTest: Automated Testing of Deep-neural-network-driven Autonomous Cars. In Proceedings of the 40th International Conference on Software Engineering (ICSE '18). 303–314.
[46] Matteo Turchetta, Felix Berkenkamp, and Andreas Krause. 2016. Safe Exploration in Finite Markov Decision Processes with Gaussian Processes. In Proceedings of the 30th International Conference on Neural Information Processing Systems (NIPS'16). 4312–4320.
[47] Abhinav Verma, Vijayaraghavan Murali, Rishabh Singh, Pushmeet Kohli, and Swarat Chaudhuri. 2018. Programmatically Interpretable Reinforcement Learning. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018. 5052–5061.
[48] Matthew Wicker, Xiaowei Huang, and Marta Kwiatkowska. 2018. Feature-Guided Black-Box Safety Testing of Deep Neural Networks. In Tools and Algorithms for the Construction and Analysis of Systems, 24th International Conference, TACAS 2018. 408–426.


[49] He Zhu, Zikang Xiong, Stephen Magill, and Suresh Jagannathan. 2018. An Inductive Synthesis Framework for Verifiable Reinforcement Learning. Technical Report. Galois, Inc. http://herowanzhu.github.io/herowanzhu.github.io/VerifiableLearning.pdf

16

  • Abstract
  • 1 Introduction
  • 2 Motivation and Overview
    • 21 State Transition System
    • 22 Synthesis Verification and Shielding
      • 3 Problem Setup
      • 4 Verification Procedure
        • 41 Synthesis
        • 42 Verification
        • 43 Shielding
          • 5 Experimental Results
          • 6 Related Work
          • 7 Conclusions
          • References
Page 10: An Inductive Synthesis Framework for Verifiable

Algorithm 3 Shield (s C[πw] P φ)1 Predict s prime such that (s s prime) isin C[πw]Tt (s)2 if φ(s prime) then return πw(s) 3 else return P(s)

since the network may exhibit behaviors not captured bythe simpler deterministic program To bridge this divide werecover soundness at runtime by monitoring system behav-iors of a neural network in its environment context by usingthe synthesized policy program and its inductive invariantas a shield The pseudo-code for using a shield is given inAlgorithm 3 In line 1 of Algorithm 3 for a current state s weuse the state transition system of our environment contextmodel to predict the next state s prime If s prime is not within φ weare unsure whether entering s prime would inevitably make thesystem unsafe as we lack a proof that the neural oracle issafe However we do have a guarantee that if we follow thesynthesized program P the system would stay within thesafety boundary defined by φ that Algorithm 2 has formallyproved We do so in line 3 of Algorithm 3 using P and φas shields intervening only if necessary so as to restrict theshield from unnecessarily intervening the neural policy 4

5 Experimental ResultsWe have applied our framework on a number of challengingcontrol- and cyberphysical-system benchmarks We considerthe utility of our approach for verifying the safety of trainedneural network controllers We use the deep policy gradi-ent algorithm [28] for neural network training the Z3 SMTsolver [13] to check convergence of the CEGIS loop andthe Mosek constraint solver [6] to generate inductive in-variants of synthesized programs from a sketch All of ourbenchmarks are verified using the program sketch defined inequation (4) and the invariant sketch defined in equation (7)We report simulation results on our benchmarks over 1000runs (each run consists of 5000 simulation steps) Each simu-lation time step is fixed 001 second Our experiments wereconducted on a standard desktop machine consisting of In-tel(R) Core(TM) i7-8700 CPU cores and 64GB memory

Case Study on Inverted Pendulum We first give more de-tails on the evaluation result of our running example theinverted pendulumHerewe consider amore restricted safetycondition that deems the system to be unsafe when the pen-dulumrsquos angle is more than 23 from the origin (ie signif-icant swings are further prohibited) The controller is a 2hidden-layer neural model (240times200) Our tool interprets theneural network as a program containing three conditionalbranches

def P (η ω)

4We also extended our approach to synthesize deterministic programswhichcan guarantee stability in the supplementary material [49]

if 17533η4 + 13732η3ω + 3831η2ω2 minus 5472ηω3 + 8579ω4 + 6813η3+9634η2ω + 3947ηω2 minus 120ω3 + 1928η2 + 1915ηω + 1104ω2 minus 313 le 0

return minus1728176866η minus 1009441768ωelse if 2485η4 + 826η3ω minus 351η2ω2 + 581ηω3 + 2579ω4 + 591η3+

9η2ω + 243ηω2 minus 189ω3 + 484η2 + 170ηω + 287ω2 minus 82 le 0return minus1734281984x minus 1073944835y

else if 115496η4 + 64763η3ω + 85376η2ω2 + 21365ηω3 + 7661ω4minus111271η3 minus 54416η2ω minus 66684ηω2 minus 8742ω3 + 33701η2+11736ηω + 12503ω2 minus 1185 le 0

return minus2578835525η minus 1625056971ωelse abort

Over 1000 simulation runs we found 60 unsafe cases whenrunning the neural controller alone Importantly runningthe neural controller in tandem with the above verified andsynthesized program can prevent all unsafe neural decisionsThere were only with 65 interventions from the program onthe neural network Our results demonstrate that CEGIS isimportant to ensure a synthesized program is safe to use Wereport more details including synthesis times below

One may ask why we do not directly learn a deterministicprogram to control the device (without appealing to the neu-ral policy at all) using reinforcement learning to synthesizeits unknown parameters Even for an example as simple asthe inverted pendulum our discussion above shows that itis difficult (if not impossible) to derive a single straight-line(linear) program that is safe to control the system Even forscenarios in which a straight-line program suffices using ex-isting RL methods to directly learn unknown parameters inour sketches may still fail to obtain a safe program For exam-ple we considered if a linear control policy can be learned to(1) prevent the pendulum from falling down (2) and requirea pendulum control action to be strictly within the range[minus1 1] (eg operating the controller in an environment withlow power constraints) We found that despite many experi-ments on tuning learning rates and rewards directly traininga linear control program to conform to this restriction witheither reinforcement learning (eg policy gradient) or ran-dom search [29] was unsuccessful because of undesirableoverfitting In contrast neural networks work much betterfor these RL algorithms and can be used to guide the syn-thesis of a deterministic program policy Indeed by treatingthe neural policy as an oracle we were able to quickly dis-cover a straight-line linear deterministic program that in factsatisfies this additional motion constraint

Safety Verification Our verification results are given inTable 1 In the table Vars represents the number of variablesin a control system - this number serves as a proxy for ap-plication complexity Size the number of neurons in hiddenlayers of the network the Training time for the networkand its Failures the number of times the network failed tosatisfy the safety property in simulation The table also givesthe Size of a synthesized program in term of the numberof polices found by Algorithm 2 (used to generate condi-tional statements in the program) its Synthesis time theOverhead of our approach in terms of the additional cost

10

Table 1 Experimental Results on Deterministic Program Synthesis Verification and Shielding

Neural Network Deterministic Program as Shield PerformanceBenchmarks Vars Size Training Failures Size Synthesis Overhead Interventions NN ProgramSatellite 2 240 times 200 957s 0 1 160s 337 0 57 97DCMotor 3 240 times 200 944s 0 1 68s 203 0 119 122Tape 3 240 times 200 980s 0 1 42s 263 0 30 36Magnetic Pointer 3 240 times 200 992s 0 1 85s 292 0 83 88Suspension 4 240 times 200 960s 0 1 41s 871 0 47 61Biology 3 240 times 200 978s 0 1 168s 523 0 2464 2599DataCenter Cooling 3 240 times 200 968s 0 1 168s 469 0 146 401Quadcopter 2 300 times 200 990s 182 2 67s 641 185 72 98Pendulum 2 240 times 200 962s 60 3 1107s 965 65 442 586CartPole 4 300 times 200 990s 47 4 998s 562 1799 6813 19126Self-Driving 4 300 times 200 990s 61 1 185s 466 236 1459 5136Lane Keeping 4 240 times 200 895s 36 1 183s 865 64 3753 64354-Car platoon 8 500 times 400 times 300 1160s 8 4 609s 317 8 76 968-Car platoon 16 500 times 400 times 300 1165s 40 1 1217s 605 1080 385 554Oscillator 18 240 times 200 1023s 371 1 618s 2131 93703 6935 11353

(compared to the non-shielded variant) in running time touse a shield and the number of Interventions the numberof times the shield was invoked across all simulation runsWe also report performance gap between of a shield neuralpolicy (NN) and a purely programmatic policy (Program)in terms of the number of steps on average that a controlledsystem spends in reaching a steady state of the system (iea convergence state)

The first five benchmarks are linear time-invariant controlsystems adapted from [15] The safety property is that thereach set has to be within a safe rectangle Benchmark Biol-ogy defines a minimal model of glycemic control in diabeticpatients such that the dynamics of glucose and insulin inter-action in the blood system are defined by polynomials [10]For safety we verify that the neural controller ensures thatthe level of plasma glucose concentration is above a certainthreshold Benchmark DataCenter Cooling is a model of acollection of three server racks each with their own coolingdevices and they also shed heat to their neighbors The safetyproperty is that a learned controller must keep the data cen-ter below a certain temperature In these benchmarks thecost to query the network oracle constitutes the dominanttime for generating the safety shield in these benchmarksGiven the simplicity of these benchmarks the neural net-work controllers did not violate the safety condition in ourtrials and moreover there were no interventions from thesafety shields that affected performanceThe next three benchmarksQuadcopter (Inverted) Pen-

dulum and Cartpole are selected from classic control appli-cations and have more sophisticated safety conditions Wehave discussed the inverted pendulum example at lengthearlier The Quadcopter environment tests whether a con-trolled quadcopter can realize stable flight The environmentof Cartpole consists of a pole attached to an unactuated joint

connected to a cart that moves along a frictionless trackThe system is unsafe when the polersquos angle is more than30 from being upright or the cart moves by more than 03meters from the origin We observed safety violations in eachof these benchmarks that were eliminated using our verifi-cation methodology Notably the number of interventionsis remarkably low as a percentage of the overall number ofsimulation stepsBenchmark Self-driving defines a single car navigation

problem The neural controller is responsible for prevent-ing the car from veering into canals found on either side ofthe road Benchmark Lane Keeping models another safety-related car-driving problem The neural controller aims tomaintain a vehicle between lane markers and keep it cen-tered in a possibly curved lane The curvature of the road isconsidered as a disturbance input Environment disturbancesof such kind can be conservatively specified in our modelaccounting for noise and unmodeled dynamics Our verifi-cation approach supports these disturbances (verificationcondition (10)) Benchmarks n-Car platoon model multiple(n) vehicles forming a platoon maintaining a safe relativedistance among one another [39] Each of these benchmarksexhibited some number of violations that were remediatedby our verification methodology Benchmark Oscillator con-sists of a two-dimensional switched oscillator plus a 16-orderfilter The filter smoothens the input signals and has a sin-gle output signal We verify that the output signal is belowa safe threshold Because of the model complexity of thisbenchmark it exhibited significantly more violations thanthe others Indeed the neural-network controlled systemoften oscillated between the safe and unsafe boundary inmany runs Consequently the overhead in this benchmarkis high because a large number of shield interventions wasrequired to ensure safety In other words the synthesized

11

shield trades performance for safety to guarantee that thethreshold boundary is never violatedFor all benchmarks our tool successfully generated safe

interpretable deterministic programs and inductive invari-ants as shields When a neural controller takes an unsafeaction the synthesized shield correctly prevents this actionfrom executing by providing an alternative provable safeaction proposed by the verified deterministic program Interm of performance Table 1 shows that a shielded neuralpolicy is a feasible approach to drive a controlled system intoa steady state For each of the benchmarks studied the pro-grammatic policy is less performant than the shielded neuralpolicy sometimes by a factor of two or more (eg CartpoleSelf-Driving and Oscillator) Our result demonstrates thatexecuting a neural policy in tandem with a program distilledfrom it can retain performance provided by the neural policywhile maintaining safety provided by the verified program

Although our synthesis algorithm does not guarantee convergence to the global minimum when applied to nonconvex RL problems, the results given in Table 1 indicate that our algorithm can often produce high-quality control programs, with respect to a provided sketch, that converge reasonably fast. In some of our benchmarks, however, the number of interventions is significantly higher than the number of neural controller failures, e.g., 8-Car platoon and Oscillator. The high number of interventions is not primarily caused by non-optimality of the synthesized programmatic controller. Instead, inherent safety issues in the neural network models are the main culprit that triggers shield interventions. In 8-Car platoon, after a correction made by our deterministic program, the neural model again takes an unsafe action in the next execution step, so the deterministic program has to make another correction. It is only after applying a number of such shield interventions that the system navigates into a part of the state space that can be safely operated on by the neural model. For this benchmark, all system states where shield interventions occur are indeed unsafe. We also examined the unsafe simulation runs made by executing the neural controller alone in Oscillator. Among the large number of shield interventions (as reflected in Table 1), 74% of them are effective and indeed prevent an unsafe neural decision. Ineffective interventions in Oscillator are due to the fact that, when optimizing equation (5), a large penalty is given to unsafe states, causing the synthesized programmatic policy to weigh safety more than proximity when there are a large number of unsafe neural decisions.

Suitable Sketches. Providing a suitable sketch may require domain knowledge. To help the user more easily tune the shape of a sketch, our approach provides algorithmic support: conditional statements are not required in a program sketch, and syntactic sugar is available, i.e., the user can simply provide an upper bound on the degree of an invariant sketch.
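As a concrete illustration of this syntactic sugar, the following sketch enumerates the monomial terms of a polynomial invariant template over the state variables up to a user-given degree bound, leaving the coefficients as unknowns for the synthesizer to instantiate. The function names and the tuple-of-indices representation are our own illustrative choices, not the tool's internal representation.

from itertools import combinations_with_replacement

def invariant_template(num_vars, max_degree):
    """Monomials of total degree <= max_degree over variables x_0 ... x_{num_vars-1}.

    Each monomial is a tuple of variable indices; the full template is
    sum_j c_j * monomial_j <= 1 with the coefficients c_j left unknown.
    """
    monomials = [()]  # the constant term
    for degree in range(1, max_degree + 1):
        monomials.extend(combinations_with_replacement(range(num_vars), degree))
    return monomials

def evaluate_candidate(coeffs, monomials, state):
    """Evaluate a concrete invariant candidate (fixed coefficients) on a state vector."""
    total = 0.0
    for c, mono in zip(coeffs, monomials):
        term = c
        for idx in mono:
            term *= state[idx]
        total += term
    return total

# A 2-variable (pendulum-style) state with a degree bound of 4 yields 15 candidate terms.
print(len(invariant_template(2, 4)))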

Table 2. Experimental Results on Tuning Invariant Degrees. TO means that an adequate inductive invariant cannot be found within 2 hours.

Benchmarks     Degree  Verification  Interventions  Overhead
Pendulum       2       TO            -              -
Pendulum       4       226s          40542          7.82%
Pendulum       8       236s          30787          8.79%
Self-Driving   2       TO            -              -
Self-Driving   4       24s           128851         6.97%
Self-Driving   8       251s          123671         26.85%
8-Car-platoon  2       1729s         43952          8.36%
8-Car-platoon  4       5402s         37990          9.19%
8-Car-platoon  8       TO            -              -

Our experimental results are collected using the invariant sketch defined in equation (7), and we chose an upper bound of 4 on the degree of all monomials included in the sketch. Recall that invariants may also be used as conditional predicates as part of a synthesized program. We adjust the invariant degree upper bound to evaluate its effect on our synthesis procedure. The results are given in Table 2.

Generally, high-degree invariants lead to fewer interventions because they tend to be more permissive than low-degree ones. However, high-degree invariants take more time to synthesize and verify. This is particularly true for high-dimensional models such as 8-Car platoon. Moreover, although high-degree invariants tend to have fewer interventions, they have larger overhead; for example, using a shield of degree 8 in the Self-Driving benchmark caused an overhead of 26.85%. This is because high-degree polynomial computations are time-consuming. On the other hand, an insufficient degree upper bound may not be permissive enough to obtain a valid invariant. It is therefore essential to consider the tradeoff between overhead and permissiveness when choosing the degree of an invariant.
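The cost of a high-degree shield can be seen directly from the size of the polynomial evaluated at every control step: the number of monomials of total degree at most d over n state variables is C(n+d, d), so raising the degree bound inflates both the verification problem and the per-step shield check. The short calculation below is a back-of-the-envelope illustration using the variable counts reported in Table 1, not output of our tool.

from math import comb

def num_monomials(num_vars, max_degree):
    """Number of monomials of total degree <= max_degree in num_vars variables."""
    return comb(num_vars + max_degree, max_degree)

# Pendulum has 2 state variables; 8-Car platoon has 16 (see Table 1).
for n in (2, 16):
    for d in (2, 4, 8):
        print(f"n={n:2d}, degree <= {d}: {num_monomials(n, d)} terms")

# For n=16 the template grows from 153 terms at degree 2 to 735471 terms at degree 8,
# consistent with the degree-8 timeout and the growing overhead reported in Table 2.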

Handling Environment Changes. We consider the effectiveness of our tool when previously trained neural network controllers are deployed in environment contexts different from the environment used for training. Here we consider neural network controllers of larger size (two hidden layers with 1200 × 900 neurons) than in the above experiments. This is because we want to ensure that a neural policy is trained to be near optimal in the environment context used for training. These larger networks were in general more difficult to train, requiring at least 1500 seconds to converge. Our results are summarized in Table 3. When the underlying environment slightly changes, learning a new safety shield takes substantially less time than training a new network. For Cartpole, we simulated the trained controller in a new environment by increasing the length of the pole by 0.15 meters. The neural controller failed 3 times in our 1000-episode simulation; the shield interfered with the network operation only 8 times to prevent these unsafe behaviors.


Table 3. Experimental Results on Handling Environment Changes.

Benchmarks    Environment Change                      NN Size     NN Failures  Shield Size  Synthesis  Overhead  Shield Interventions
Cartpole      Increased pole length by 0.15 m         1200 × 900  3            1            239s       2.91%     8
Pendulum      Increased pendulum mass by 0.3 kg       1200 × 900  77           1            581s       8.11%     8748
Pendulum      Increased pendulum length by 0.15 m     1200 × 900  76           1            483s       6.53%     7060
Self-driving  Added an obstacle that must be avoided  1200 × 900  203          1            392s       8.42%     108320

The new shield was synthesized in 239s, significantly faster than retraining a new neural network for the new environment. For (Inverted) Pendulum, we deployed the trained neural network in an environment in which the pendulum's mass is increased by 0.3 kg. The neural controller exhibits noticeably higher failure rates than in the previous experiment; we were able to synthesize a safety shield adapted to this new environment in 581 seconds that prevented these violations. The shield intervened in the operation of the network only 8.7 times per episode on average. Similar results were observed when we increased the pendulum's length by 0.15 m. For Self-driving, we additionally required the car to avoid an obstacle. The synthesized shield provided safe actions to ensure collision-free motion.

6 Related Work

Verification of Neural Networks. While neural networks have historically found only limited application in safety- and security-related contexts, recent advances have defined suitable verification methodologies that are capable of providing stronger assurance guarantees. For example, Reluplex [23] is an SMT solver that supports linear real arithmetic with ReLU constraints and has been used to verify safety properties in a collision avoidance system. AI2 [17, 31] is an abstract interpretation tool that can reason about robustness specifications for deep convolutional networks. Robustness verification is also considered in [8, 21]. Systematic testing techniques such as [43, 45, 48] are designed to automatically generate test cases to increase a coverage metric, i.e., explore different parts of the neural network architecture by generating test inputs that maximize the number of activated neurons. These approaches ignore effects induced by the actual environment in which the network is deployed. They are typically ineffective for safety verification of a neural network controller over an infinite time horizon with complex system dynamics. Unlike these whitebox efforts that reason over the architecture of the network, our verification framework is fully blackbox, using a network only as an oracle to learn a simpler deterministic program. This gives us the ability to reason about safety entirely in terms of the synthesized program, using a shielding mechanism derived from the program to ensure that the neural policy can only explore safe regions.

Safe Reinforcement Learning. The machine learning community has explored various techniques to develop safe reinforcement learning algorithms in contexts similar to ours, e.g., [32, 46]. In some cases, verification is used as a safety validation oracle [2, 11]. In contrast to these efforts, our inductive synthesis algorithm can be freely incorporated into any existing reinforcement learning framework and applied to any legacy machine learning model. More importantly, our design allows us to synthesize new shields from existing ones, without requiring retraining of the network, under reasonable changes to the environment.

Syntax-Guided Synthesis for Machine Learning. An important reason underlying the success of program synthesis is that sketches of the desired program [41, 42], often provided along with example data, can be used to effectively restrict a search space and allow users to provide additional insight about the desired output [5]. Our sketch-based inductive synthesis approach is a realization of this idea, applicable to continuous and infinite-state control systems central to many machine learning tasks. A high-level policy language grammar is used in our approach to constrain the shape of possible synthesized deterministic policy programs. Our program parameters in sketches are continuous because they naturally fit continuous control problems. However, our synthesis algorithm (line 9 of Algorithm 2) can call a derivative-free random search method (which does not require the gradient or its finite-difference approximation) to synthesize functions whose domain is disconnected, (mixed-)integer, or non-smooth. Our synthesizer may thus be applied to other domains that are not continuous or differentiable, like most modern program synthesis algorithms [41, 42]. In a similar vein, recent efforts on interpretable machine learning generate interpretable models such as program code [47] or decision trees [9] as output, after which traditional symbolic verification techniques can be leveraged to prove program properties. Our approach is novel in ensuring that only safe programs are synthesized, via a CEGIS procedure, and in providing safety guarantees for the original high-performing neural networks via invariant inference. A detailed comparison was provided on page 3.

Controller Synthesis. Random search [30] has long been utilized as an effective approach for controller synthesis. In [29], several extensions of random search have been proposed that substantially improve its performance when applied to reinforcement learning. However, [29] learns a single linear controller, which we found to be insufficient to guarantee safety for our benchmarks. We often need to learn a program that involves a family of linear controllers conditioned on different input spaces to guarantee safety. The novelty of our approach is that it can automatically synthesize programs with conditional statements as needed, which is critical to ensure safety.

In Sec. 5, linear sketches were used in our experiments for shield synthesis. LQR-Tree-based approaches such as [44] can synthesize a controller consisting of a set of linear quadratic regulator (LQR [14]) controllers, each applicable in a different portion of the state space. These approaches focus on stability, while our approach primarily addresses safety. In our experiments, we observed that, because LQR does not take safe/unsafe regions into consideration, synthesized LQR controllers can regularly violate safety constraints.

Runtime Shielding. Shield synthesis has been used in runtime enforcement for reactive systems [12, 25]. Safety shielding in reinforcement learning was first introduced in [4]. Because their verification approach is not symbolic, however, it can only work over finite discrete state and action systems. Since states in infinite-state and continuous-action systems are not enumerable, using these methods requires the end user to provide a finite abstraction of complex environment dynamics; such abstractions are typically too coarse to be useful (because they result in too much over-approximation) or otherwise have too many states to be analyzable [15]. Indeed, automated environment abstraction tools such as [16] often take hours even for simple 4-dimensional systems. In general, these efforts focus on shielding of discrete and finite systems, with no obvious generalization to effectively deal with the continuous and infinite systems that our approach can monitor and protect. Our approach embraces the infinite nature of control systems by learning a symbolic shield from an inductive invariant for a synthesized program, covering an infinite number of environment states under which the program can be guaranteed to be safe. We therefore believe our framework provides a more promising verification pathway for machine learning in high-dimensional control systems. Shielding in continuous control systems was introduced in [3, 18]. These approaches are based on HJ reachability (Hamilton-Jacobi reachability [7]). However, current formulations of HJ reachability are limited to systems with approximately five dimensions, making high-dimensional system verification intractable. Unlike their least-restrictive control law framework that decouples safety and learning, our approach only considers state spaces that a learned controller attempts to explore, and is therefore capable of synthesizing safety shields even in high dimensions.

7 Conclusions

This paper presents an inductive synthesis-based toolchain that can verify neural network control policies within an environment formalized as an infinite-state transition system. The key idea is a novel synthesis framework capable of synthesizing a deterministic policy program, based on a user-given sketch, that resembles a neural policy oracle and simultaneously satisfies a safety specification, using a counterexample-guided synthesis loop. Verification soundness is achieved at runtime by using the learned deterministic program, along with its learned inductive invariant, as a shield to protect the neural policy, correcting the neural policy's action only if the chosen action could cause a violation of the inductive invariant. Experimental results demonstrate that our approach can be used to realize fully trustworthy reinforcement learning systems with low overhead.

References

[1] M. Ahmadi, C. Rowley, and U. Topcu. 2018. Control-Oriented Learning of Lagrangian and Hamiltonian Systems. In American Control Conference.
[2] Anayo K. Akametalu, Shahab Kaynama, Jaime F. Fisac, Melanie Nicole Zeilinger, Jeremy H. Gillula, and Claire J. Tomlin. 2014. Reachability-based safe learning with Gaussian processes. In 53rd IEEE Conference on Decision and Control, CDC 2014. 1424–1431.
[3] Anayo K. Akametalu, Shahab Kaynama, Jaime F. Fisac, Melanie Nicole Zeilinger, Jeremy H. Gillula, and Claire J. Tomlin. 2014. Reachability-based safe learning with Gaussian processes. In 53rd IEEE Conference on Decision and Control, CDC 2014, Los Angeles, CA, USA, December 15-17, 2014. 1424–1431.
[4] Mohammed Alshiekh, Roderick Bloem, Rüdiger Ehlers, Bettina Könighofer, Scott Niekum, and Ufuk Topcu. 2018. Safe Reinforcement Learning via Shielding. AAAI (2018).
[5] Rajeev Alur, Rastislav Bodík, Eric Dallal, Dana Fisman, Pranav Garg, Garvit Juniwal, Hadas Kress-Gazit, P. Madhusudan, Milo M. K. Martin, Mukund Raghothaman, Shambwaditya Saha, Sanjit A. Seshia, Rishabh Singh, Armando Solar-Lezama, Emina Torlak, and Abhishek Udupa. 2015. Syntax-Guided Synthesis. In Dependable Software Systems Engineering, NATO Science for Peace and Security Series, D: Information and Communication Security, Vol. 40. IOS Press, 1–25.
[6] MOSEK ApS. 2018. The MOSEK optimization software. http://www.mosek.com
[7] Somil Bansal, Mo Chen, Sylvia Herbert, and Claire J. Tomlin. 2017. Hamilton-Jacobi Reachability: A Brief Overview and Recent Advances. (2017).
[8] Osbert Bastani, Yani Ioannou, Leonidas Lampropoulos, Dimitrios Vytiniotis, Aditya V. Nori, and Antonio Criminisi. 2016. Measuring Neural Net Robustness with Constraints. In Proceedings of the 30th International Conference on Neural Information Processing Systems (NIPS'16). 2621–2629.
[9] Osbert Bastani, Yewen Pu, and Armando Solar-Lezama. 2018. Verifiable Reinforcement Learning via Policy Extraction. CoRR abs/1805.08328 (2018).
[10] Richard N. Bergman, Diane T. Finegood, and Marilyn Ader. 1985. Assessment of insulin sensitivity in vivo. Endocrine Reviews 6, 1 (1985), 45–86.
[11] Felix Berkenkamp, Matteo Turchetta, Angela P. Schoellig, and Andreas Krause. 2017. Safe Model-based Reinforcement Learning with Stability Guarantees. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017. 908–919.
[12] Roderick Bloem, Bettina Könighofer, Robert Könighofer, and Chao Wang. 2015. Shield Synthesis - Runtime Enforcement for Reactive Systems. In Tools and Algorithms for the Construction and Analysis of Systems - 21st International Conference, TACAS 2015. 533–548.
[13] Leonardo De Moura and Nikolaj Bjørner. 2008. Z3: An Efficient SMT Solver. In Proceedings of the Theory and Practice of Software, 14th International Conference on Tools and Algorithms for the Construction and Analysis of Systems (TACAS'08/ETAPS'08). 337–340.
[14] Peter Dorato, Vito Cerone, and Chaouki Abdallah. 1994. Linear-Quadratic Control: An Introduction. Simon & Schuster, Inc., New York, NY, USA.
[15] Chuchu Fan, Umang Mathur, Sayan Mitra, and Mahesh Viswanathan. 2018. Controller Synthesis Made Real: Reach-Avoid Specifications and Linear Dynamics. In Computer Aided Verification - 30th International Conference, CAV 2018. 347–366.
[16] Ioannis Filippidis, Sumanth Dathathri, Scott C. Livingston, Necmiye Ozay, and Richard M. Murray. 2016. Control design for hybrid systems with TuLiP: The Temporal Logic Planning toolbox. In 2016 IEEE Conference on Control Applications, CCA 2016. 1030–1041.
[17] Timon Gehr, Matthew Mirman, Dana Drachsler-Cohen, Petar Tsankov, Swarat Chaudhuri, and Martin T. Vechev. 2018. AI2: Safety and Robustness Certification of Neural Networks with Abstract Interpretation. In 2018 IEEE Symposium on Security and Privacy, SP 2018. 3–18.
[18] Jeremy H. Gillula and Claire J. Tomlin. 2012. Guaranteed Safe Online Learning via Reachability: tracking a ground target using a quadrotor. In IEEE International Conference on Robotics and Automation, ICRA 2012, 14-18 May 2012, St. Paul, Minnesota, USA. 2723–2730.
[19] Sumit Gulwani, Saurabh Srivastava, and Ramarathnam Venkatesan. 2008. Program Analysis As Constraint Solving. In Proceedings of the 29th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '08). 281–292.
[20] Sumit Gulwani and Ashish Tiwari. 2008. Constraint-based approach for analysis of hybrid systems. In International Conference on Computer Aided Verification. 190–203.
[21] Xiaowei Huang, Marta Kwiatkowska, Sen Wang, and Min Wu. 2017. Safety Verification of Deep Neural Networks. In CAV (1) (Lecture Notes in Computer Science), Vol. 10426. 3–29.
[22] Dominic W. Jordan and Peter Smith. 1987. Nonlinear ordinary differential equations. (1987).
[23] Guy Katz, Clark W. Barrett, David L. Dill, Kyle Julian, and Mykel J. Kochenderfer. 2017. Reluplex: An Efficient SMT Solver for Verifying Deep Neural Networks. In Computer Aided Verification - 29th International Conference, CAV 2017. 97–117.
[24] Hui Kong, Fei He, Xiaoyu Song, William N. N. Hung, and Ming Gu. 2013. Exponential-Condition-Based Barrier Certificate Generation for Safety Verification of Hybrid Systems. In Proceedings of the 25th International Conference on Computer Aided Verification (CAV'13). 242–257.
[25] Bettina Könighofer, Mohammed Alshiekh, Roderick Bloem, Laura Humphrey, Robert Könighofer, Ufuk Topcu, and Chao Wang. 2017. Shield Synthesis. Form. Methods Syst. Des. 51, 2 (Nov. 2017), 332–361.
[26] Benoît Legat, Chris Coey, Robin Deits, Joey Huchette, and Amelia Perry. 2017 (accessed Nov. 16, 2018). Sum-of-squares optimization in Julia. https://www.juliaopt.org/meetings/mit2017/legat.pdf
[27] Yuxi Li. 2017. Deep Reinforcement Learning: An Overview. CoRR abs/1701.07274 (2017).
[28] Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. 2016. Continuous control with deep reinforcement learning. In 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings.
[29] Horia Mania, Aurelia Guy, and Benjamin Recht. 2018. Simple random search of static linear policies is competitive for reinforcement learning. In Advances in Neural Information Processing Systems 31. 1800–1809.
[30] J. Matyas. 1965. Random optimization. Automation and Remote Control 26, 2 (1965), 246–253.
[31] Matthew Mirman, Timon Gehr, and Martin T. Vechev. 2018. Differentiable Abstract Interpretation for Provably Robust Neural Networks. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018. 3575–3583.
[32] Teodor Mihai Moldovan and Pieter Abbeel. 2012. Safe Exploration in Markov Decision Processes. In Proceedings of the 29th International Conference on International Conference on Machine Learning (ICML'12). 1451–1458.
[33] Yurii Nesterov and Vladimir Spokoiny. 2017. Random Gradient-Free Minimization of Convex Functions. Found. Comput. Math. 17, 2 (2017), 527–566.
[34] Stephen Prajna and Ali Jadbabaie. 2004. Safety Verification of Hybrid Systems Using Barrier Certificates. In Hybrid Systems: Computation and Control, 7th International Workshop, HSCC 2004. 477–492.
[35] Aravind Rajeswaran, Kendall Lowrey, Emanuel Todorov, and Sham M. Kakade. 2017. Towards Generalization and Simplicity in Continuous Control. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017. 6553–6564.
[36] Stéphane Ross, Geoffrey J. Gordon, and Drew Bagnell. 2011. A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2011. 627–635.
[37] Wenjie Ruan, Xiaowei Huang, and Marta Kwiatkowska. 2018. Reachability Analysis of Deep Neural Networks with Provable Guarantees. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI 2018. 2651–2659.
[38] Stefan Schaal. 1999. Is imitation learning the route to humanoid robots? Trends in Cognitive Sciences 3, 6 (1999), 233–242.
[39] Bastian Schürmann and Matthias Althoff. 2017. Optimal control of sets of solutions to formally guarantee constraints of disturbed linear systems. In 2017 American Control Conference, ACC 2017. 2522–2529.
[40] David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Riedmiller. 2014. Deterministic Policy Gradient Algorithms. In Proceedings of the 31st International Conference on International Conference on Machine Learning - Volume 32. I-387–I-395.
[41] Armando Solar-Lezama. 2008. Program Synthesis by Sketching. Ph.D. Dissertation. Berkeley, CA, USA. Advisor(s) Bodik, Rastislav.
[42] Armando Solar-Lezama. 2009. The Sketching Approach to Program Synthesis. In Proceedings of the 7th Asian Symposium on Programming Languages and Systems (APLAS '09). 4–13.
[43] Youcheng Sun, Min Wu, Wenjie Ruan, Xiaowei Huang, Marta Kwiatkowska, and Daniel Kroening. 2018. Concolic Testing for Deep Neural Networks. In Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering (ASE 2018). 109–119.
[44] Russ Tedrake. 2009. LQR-trees: Feedback motion planning on sparse randomized trees. In Robotics: Science and Systems V, University of Washington, Seattle, USA, June 28 - July 1, 2009.
[45] Yuchi Tian, Kexin Pei, Suman Jana, and Baishakhi Ray. 2018. DeepTest: Automated Testing of Deep-neural-network-driven Autonomous Cars. In Proceedings of the 40th International Conference on Software Engineering (ICSE '18). 303–314.
[46] Matteo Turchetta, Felix Berkenkamp, and Andreas Krause. 2016. Safe Exploration in Finite Markov Decision Processes with Gaussian Processes. In Proceedings of the 30th International Conference on Neural Information Processing Systems (NIPS'16). 4312–4320.
[47] Abhinav Verma, Vijayaraghavan Murali, Rishabh Singh, Pushmeet Kohli, and Swarat Chaudhuri. 2018. Programmatically Interpretable Reinforcement Learning. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018. 5052–5061.
[48] Matthew Wicker, Xiaowei Huang, and Marta Kwiatkowska. 2018. Feature-Guided Black-Box Safety Testing of Deep Neural Networks. In Tools and Algorithms for the Construction and Analysis of Systems - 24th International Conference, TACAS 2018. 408–426.
[49] He Zhu, Zikang Xiong, Stephen Magill, and Suresh Jagannathan. 2018. An Inductive Synthesis Framework for Verifiable Reinforcement Learning. Technical Report. Galois, Inc. http://herowanzhu.github.io/herowanzhu.github.io/VerifiableLearning.pdf

The first five benchmarks are linear time-invariant controlsystems adapted from [15] The safety property is that thereach set has to be within a safe rectangle Benchmark Biol-ogy defines a minimal model of glycemic control in diabeticpatients such that the dynamics of glucose and insulin inter-action in the blood system are defined by polynomials [10]For safety we verify that the neural controller ensures thatthe level of plasma glucose concentration is above a certainthreshold Benchmark DataCenter Cooling is a model of acollection of three server racks each with their own coolingdevices and they also shed heat to their neighbors The safetyproperty is that a learned controller must keep the data cen-ter below a certain temperature In these benchmarks thecost to query the network oracle constitutes the dominanttime for generating the safety shield in these benchmarksGiven the simplicity of these benchmarks the neural net-work controllers did not violate the safety condition in ourtrials and moreover there were no interventions from thesafety shields that affected performanceThe next three benchmarksQuadcopter (Inverted) Pen-

dulum and Cartpole are selected from classic control appli-cations and have more sophisticated safety conditions Wehave discussed the inverted pendulum example at lengthearlier The Quadcopter environment tests whether a con-trolled quadcopter can realize stable flight The environmentof Cartpole consists of a pole attached to an unactuated joint

connected to a cart that moves along a frictionless trackThe system is unsafe when the polersquos angle is more than30 from being upright or the cart moves by more than 03meters from the origin We observed safety violations in eachof these benchmarks that were eliminated using our verifi-cation methodology Notably the number of interventionsis remarkably low as a percentage of the overall number ofsimulation stepsBenchmark Self-driving defines a single car navigation

problem The neural controller is responsible for prevent-ing the car from veering into canals found on either side ofthe road Benchmark Lane Keeping models another safety-related car-driving problem The neural controller aims tomaintain a vehicle between lane markers and keep it cen-tered in a possibly curved lane The curvature of the road isconsidered as a disturbance input Environment disturbancesof such kind can be conservatively specified in our modelaccounting for noise and unmodeled dynamics Our verifi-cation approach supports these disturbances (verificationcondition (10)) Benchmarks n-Car platoon model multiple(n) vehicles forming a platoon maintaining a safe relativedistance among one another [39] Each of these benchmarksexhibited some number of violations that were remediatedby our verification methodology Benchmark Oscillator con-sists of a two-dimensional switched oscillator plus a 16-orderfilter The filter smoothens the input signals and has a sin-gle output signal We verify that the output signal is belowa safe threshold Because of the model complexity of thisbenchmark it exhibited significantly more violations thanthe others Indeed the neural-network controlled systemoften oscillated between the safe and unsafe boundary inmany runs Consequently the overhead in this benchmarkis high because a large number of shield interventions wasrequired to ensure safety In other words the synthesized

11

shield trades performance for safety to guarantee that thethreshold boundary is never violatedFor all benchmarks our tool successfully generated safe

interpretable deterministic programs and inductive invari-ants as shields When a neural controller takes an unsafeaction the synthesized shield correctly prevents this actionfrom executing by providing an alternative provable safeaction proposed by the verified deterministic program Interm of performance Table 1 shows that a shielded neuralpolicy is a feasible approach to drive a controlled system intoa steady state For each of the benchmarks studied the pro-grammatic policy is less performant than the shielded neuralpolicy sometimes by a factor of two or more (eg CartpoleSelf-Driving and Oscillator) Our result demonstrates thatexecuting a neural policy in tandem with a program distilledfrom it can retain performance provided by the neural policywhile maintaining safety provided by the verified program

Although our synthesis algorithm does not guarantee con-vergence to the global minimum when applied to nonconvexRL problems the results given in Table 1 indicate that ouralgorithm can often produce high-quality control programswith respect to a provided sketch that converge reasonablyfast In some of our benchmarks however the number ofinterventions are significantly higher than the number ofneural controller failures eg 8-Car platoon and OscillatorHowever the high number of interventions is not primar-ily because of non-optimality of the synthesized program-matic controller Instead inherent safety issues in the neuralnetwork models are the main culprit that triggers shieldinterventions In 8-Car platoon after corrections made byour deterministic program the neural model again takes anunsafe action in the next execution step so that the deter-ministic program has to make another correction It is onlyafter applying a number of such shield interventions that thesystem navigates into a part of the state space that can besafely operated on by the neural model For this benchmarkall system states where there occur shield interventions areindeed unsafe We also examined the unsafe simulation runsmade by executing the neural controller alone in OscillatorAmong the large number of shield interventions (as reflectedin Table 1) 74 of them are effective and indeed prevent anunsafe neural decision Ineffective interventions inOscillatorare due to the fact that when optimizing equation (5) a largepenalty is given to unsafe states causing the synthesizedprogrammatic policy to weigh safety more than proximitywhen there exist a large number of unsafe neural decisions

Suitable Sketches Providing a suitable sketch may needdomain knowledge To help the user more easily tune theshape of a sketch our approach provides algorithmic supportby not requiring conditional statements in a program sketchand syntactic sugar ie the user can simply provide an upperbound on the degree of an invariant sketch

Table 2 Experimental Results on Tuning Invariant DegreesTO means that an adequate inductive invariant cannot befound within 2 hours

Benchmarks Degree Verification Interventions Overhead

Pendulum2 TO - -4 226s 40542 7828 236s 30787 879

Self-Driving2 TO - -4 24s 128851 6978 251s 123671 2685

8-Car-platoon2 1729s 43952 8364 5402s 37990 9198 TO - -

Our experimental results are collected using the invari-ant sketch defined in equation (7) and we chose an upperbound of 4 on the degree of all monomials included in thesketch Recall that invariants may also be used as conditionalpredicates as part of a synthesized program We adjust theinvariant degree upper bound to evaluate its effect in oursynthesis procedure The results are given in Table 2

Generally high-degree invariants lead to fewer interven-tions because they tend to be more permissive than low-degree ones However high-degree invariants take moretime to synthesize and verify This is particularly true forhigh-dimension models such as 8-Car platoon Moreoveralthough high-degree invariants tend to have fewer inter-ventions they have larger overhead For example using ashield of degree 8 in the Self-Driving benchmark causedan overhead of 2685 This is because high-degree polyno-mial computations are time-consuming On the other handan insufficient degree upper bound may not be permissiveenough to obtain a valid invariant It is therefore essential toconsider the tradeoff between overhead and permissivenesswhen choosing the degree of an invariant

Handling Environment Changes We consider the effec-tiveness of our tool when previously trained neural networkcontrollers are deployed in environment contexts differentfrom the environment used for training Here we considerneural network controllers of larger size (two hidden layerswith 1200 times 900 neurons) than in the above experimentsThis is because we want to ensure that a neural policy istrained to be near optimal in the environment context usedfor training These larger networks were in general moredifficult to train requiring at least 1500 seconds to convergeOur results are summarized in Table 3 When the under-

lying environment sightly changes learning a new safetyshield takes substantially shorter time than training a newnetwork For Cartpole we simulated the trained controllerin a new environment by increasing the length of the pole by015 meters The neural controller failed 3 times in our 1000episode simulation the shield interfered with the networkoperation only 8 times to prevent these unsafe behaviors

12

Table 3 Experimental Results on Handling Environment Changes

Neural Network Deterministic Program as ShieldBenchmarks Environment Change Size Failures Size Synthesis Overhead Shield InterventionsCartpole Increased Pole length by 015m 1200 times 900 3 1 239s 291 8Pendulum Increased Pendulum mass by 03kg 1200 times 900 77 1 581s 811 8748Pendulum Increased Pendulum length by 015m 1200 times 900 76 1 483s 653 7060Self-driving Added an obstacle that must be avoided 1200 times 900 203 1 392s 842 108320

The new shield was synthesized in 239s significantly fasterthan retraining a new neural network for the new environ-ment For (Inverted) Pendulum we deployed the trainedneural network in an environment in which the pendulumrsquosmass is increased by 03kg The neural controller exhibitsnoticeably higher failure rates than in the previous experi-ment we were able to synthesize a safety shield adapted tothis new environment in 581 seconds that prevented theseviolations The shield intervened with the operation of thenetwork only 87 number of times per episode Similar resultswere observed when we increased the pendulumrsquos lengthby 015m For Self-driving we additionally required the carto avoid an obstacle The synthesized shield provided safeactions to ensure collision-free motion

6 Related WorkVerification of Neural NetworksWhile neural networkshave historically found only limited application in safety-and security-related contexts recent advances have definedsuitable verification methodologies that are capable of pro-viding stronger assurance guarantees For example Relu-plex [23] is an SMT solver that supports linear real arithmeticwith ReLU constraints and has been used to verify safetyproperties in a collision avoidance system AI2 [17 31] is anabstract interpretation tool that can reason about robustnessspecifications for deep convolutional networks Robustnessverification is also considered in [8 21] Systematic testingtechniques such as [43 45 48] are designed to automati-cally generate test cases to increase a coverage metric ieexplore different parts of neural network architecture by gen-erating test inputs that maximize the number of activatedneurons These approaches ignore effects induced by theactual environment in which the network is deployed Theyare typically ineffective for safety verification of a neuralnetwork controller over an infinite time horizon with com-plex system dynamics Unlike these whitebox efforts thatreason over the architecture of the network our verificationframework is fully blackbox using a network only as anoracle to learn a simpler deterministic program This givesus the ability to reason about safety entirely in terms of thesynthesized program using a shielding mechanism derivedfrom the program to ensure that the neural policy can onlyexplore safe regions

SafeReinforcement LearningThemachine learning com-munity has explored various techniques to develop safe rein-forcement machine learning algorithms in contexts similarto ours eg [32 46] In some cases verification is used as asafety validation oracle [2 11] In contrast to these effortsour inductive synthesis algorithm can be freely incorporatedinto any existing reinforcement learning framework andapplied to any legacy machine learning model More impor-tantly our design allows us to synthesize new shields fromexisting ones without requiring retraining of the networkunder reasonable changes to the environmentSyntax-Guided Synthesis forMachine LearningAn im-portant reason underlying the success of program synthesisis that sketches of the desired program [41 42] often pro-vided along with example data can be used to effectivelyrestrict a search space and allow users to provide additionalinsight about the desired output [5] Our sketch based induc-tive synthesis approach is a realization of this idea applicableto continuous and infinite state control systems central tomany machine learning tasks A high-level policy languagegrammar is used in our approach to constrain the shapeof possible synthesized deterministic policy programs Ourprogram parameters in sketches are continuous becausethey naturally fit continuous control problems Howeverour synthesis algorithm (line 9 of Algorithm 2) can call aderivate-free random search method (which does not requirethe gradient or its approximative finite difference) for syn-thesizing functions whose domain is disconnected (mixed-)integer or non-smooth Our synthesizer thus may be appliedto other domains that are not continuous or differentiablelike most modern program synthesis algorithms [41 42]In a similar vein recent efforts on interpretable machine

learning generate interpretable models such as programcode [47] or decision trees [9] as output after which tra-ditional symbolic verification techniques can be leveragedto prove program properties Our approach novelly ensuresthat only safe programs are synthesized via a CEGIS pro-cedure and can provide safety guarantees on the originalhigh-performing neural networks via invariant inference Adetailed comparison was provided on page 3Controller Synthesis Random search [30] has long beenutilized as an effective approach for controller synthesisIn [29] several extensions of random search have been pro-posed that substantially improve its performance when ap-plied to reinforcement learning However [29] learns a single

13

linear controller that we found to be insufficient to guaranteesafety for our benchmarks in our experience We often needto learn a program that involves a family of linear controllersconditioned on different input spaces to guarantee safetyThe novelty of our approach is that it can automaticallysynthesize programs with conditional statements by needwhich is critical to ensure safety

In Sec 5 linear sketches were used in our experiments forshield synthesis LQR-Tree based approaches such as [44]can synthesize a controller consisting of a set of linear qua-dratic regulator (LQR [14]) controllers each applicable in adifferent portion of the state space These approaches focuson stability while our approach primarily addresses safetyIn our experiments we observed that because LQR does nottake safeunsafe regions into consideration synthesized LQRcontrollers can regularly violate safety constraintsRuntime Shielding Shield synthesis has been used in run-time enforcement for reactive systems [12 25] Safety shield-ing in reinforcement learning was first introduced in [4]Because their verification approach is not symbolic howeverit can only work over finite discrete state and action systemsSince states in infinite state and continuous action systemsare not enumerable using these methods requires the enduser to provide a finite abstraction of complex environmentdynamics such abstractions are typically too coarse to beuseful (because they result in too much over-approximation)or otherwise have too many states to be analyzable [15] In-deed automated environment abstraction tools such as [16]often take hours even for simple 4-dimensional systems Ourapproach embraces the nature of infinity in control systemsby learning a symbolic shield from an inductive invariant fora synthesized program that includes an infinite number ofenvironment states under which the program can be guaran-teed to be safe We therefore believe our framework providesa more promising verification pathway for machine learningin high-dimensional control systems In general these effortsfocused on shielding of discrete and finite systems with noobvious generalization to effectively deal with continuousand infinite systems that our approach can monitor and pro-tect Shielding in continuous control systems was introducedin [3 18] These approaches are based on HJ reachability(Hamilton-Jacobi Reachability [7]) However current formu-lations of HJ reachability are limited to systems with approx-imately five dimensions making high-dimensional systemverification intractable Unlike their least restrictive controllaw framework that decouples safety and learning our ap-proach only considers state spaces that a learned controllerattempts to explore and is therefore capable of synthesizingsafety shields even in high dimensions

7 ConclusionsThis paper presents an inductive synthesis-based toolchainthat can verify neural network control policies within an

environment formalized as an infinite-state transition sys-tem The key idea is a novel synthesis framework capableof synthesizing a deterministic policy program based onan user-given sketch that resembles a neural policy oracleand simultaneously satisfies a safety specification using acounterexample-guided synthesis loop Verification sound-ness is achieved at runtime by using the learned determin-istic program along with its learned inductive invariant asa shield to protect the neural policy correcting the neuralpolicyrsquos action only if the chosen action can cause violationof the inductive invariant Experimental results demonstratethat our approach can be used to realize fully trustworthyreinforcement learning systems with low overhead

References[1] M Ahmadi C Rowley and U Topcu 2018 Control-Oriented Learning

of Lagrangian and Hamiltonian Systems In American Control Confer-ence

[2] Anayo K Akametalu Shahab Kaynama Jaime F Fisac Melanie NicoleZeilinger Jeremy H Gillula and Claire J Tomlin 2014 Reachability-based safe learning with Gaussian processes In 53rd IEEE Conferenceon Decision and Control CDC 2014 1424ndash1431

[3] Anayo K Akametalu Shahab Kaynama Jaime F Fisac Melanie NicoleZeilinger Jeremy H Gillula and Claire J Tomlin 2014 Reachability-based safe learning with Gaussian processes In 53rd IEEE Conferenceon Decision and Control CDC 2014 Los Angeles CA USA December15-17 2014 1424ndash1431

[4] Mohammed Alshiekh Roderick Bloem Ruumldiger Ehlers BettinaKoumlnighofer Scott Niekum and Ufuk Topcu 2018 Safe Reinforce-ment Learning via Shielding AAAI (2018)

[5] Rajeev Alur Rastislav Bodiacutek Eric Dallal Dana Fisman Pranav GargGarvit Juniwal Hadas Kress-Gazit P Madhusudan Milo M K MartinMukund Raghothaman Shambwaditya Saha Sanjit A Seshia RishabhSingh Armando Solar-Lezama Emina Torlak and Abhishek Udupa2015 Syntax-Guided Synthesis In Dependable Software Systems Engi-neering NATO Science for Peace and Security Series D Informationand Communication Security Vol 40 IOS Press 1ndash25

[6] MOSEK ApS 2018 The mosek optimization software httpwwwmosekcom

[7] Somil Bansal Mo Chen Sylvia Herbert and Claire J Tomlin 2017Hamilton-Jacobi Reachability A Brief Overview and Recent Advances(2017)

[8] Osbert Bastani Yani Ioannou Leonidas Lampropoulos Dimitrios Vy-tiniotis Aditya V Nori and Antonio Criminisi 2016 Measuring NeuralNet Robustness with Constraints In Proceedings of the 30th Interna-tional Conference on Neural Information Processing Systems (NIPSrsquo16)2621ndash2629

[9] Osbert Bastani Yewen Pu and Armando Solar-Lezama 2018 VerifiableReinforcement Learning via Policy Extraction CoRR abs180508328(2018)

[10] Richard N Bergman Diane T Finegood and Marilyn Ader 1985 As-sessment of insulin sensitivity in vivo Endocrine reviews 6 1 (1985)45ndash86

[11] Felix Berkenkamp Matteo Turchetta Angela P Schoellig and AndreasKrause 2017 Safe Model-based Reinforcement Learning with StabilityGuarantees In Advances in Neural Information Processing Systems 30Annual Conference on Neural Information Processing Systems 2017 908ndash919

[12] Roderick Bloem Bettina Koumlnighofer Robert Koumlnighofer and ChaoWang 2015 Shield Synthesis - Runtime Enforcement for ReactiveSystems In Tools and Algorithms for the Construction and Analysis of

14

Systems - 21st International Conference TACAS 2015 533ndash548[13] Leonardo De Moura and Nikolaj Bjoslashrner 2008 Z3 An Efficient SMT

Solver In Proceedings of the Theory and Practice of Software 14th Inter-national Conference on Tools and Algorithms for the Construction andAnalysis of Systems (TACASrsquo08ETAPSrsquo08) 337ndash340

[14] Peter Dorato Vito Cerone and Chaouki Abdallah 1994 Linear-Quadratic Control An Introduction Simon amp Schuster Inc New YorkNY USA

[15] Chuchu Fan Umang Mathur Sayan Mitra and Mahesh Viswanathan2018 Controller Synthesis Made Real Reach-Avoid Specifications andLinear Dynamics In Computer Aided Verification - 30th InternationalConference CAV 2018 347ndash366

[16] Ioannis Filippidis Sumanth Dathathri Scott C Livingston NecmiyeOzay and Richard M Murray 2016 Control design for hybrid sys-tems with TuLiP The Temporal Logic Planning toolbox In 2016 IEEEConference on Control Applications CCA 2016 1030ndash1041

[17] Timon Gehr Matthew Mirman Dana Drachsler-Cohen Petar TsankovSwarat Chaudhuri and Martin T Vechev 2018 AI2 Safety and Robust-ness Certification of Neural Networks with Abstract Interpretation In2018 IEEE Symposium on Security and Privacy SP 2018 3ndash18

[18] Jeremy H Gillula and Claire J Tomlin 2012 Guaranteed Safe OnlineLearning via Reachability tracking a ground target using a quadrotorIn IEEE International Conference on Robotics and Automation ICRA2012 14-18 May 2012 St Paul Minnesota USA 2723ndash2730

[19] Sumit Gulwani Saurabh Srivastava and Ramarathnam Venkatesan2008 Program Analysis As Constraint Solving In Proceedings of the29th ACM SIGPLAN Conference on Programming Language Design andImplementation (PLDI rsquo08) 281ndash292

[20] Sumit Gulwani and Ashish Tiwari 2008 Constraint-based approachfor analysis of hybrid systems In International Conference on ComputerAided Verification 190ndash203

[21] Xiaowei Huang Marta Kwiatkowska Sen Wang and Min Wu 2017Safety Verification of Deep Neural Networks In CAV (1) (Lecture Notesin Computer Science) Vol 10426 3ndash29

[22] Dominic W Jordan and Peter Smith 1987 Nonlinear ordinary differ-ential equations (1987)

[23] Guy Katz Clark W Barrett David L Dill Kyle Julian and Mykel JKochenderfer 2017 Reluplex An Efficient SMT Solver for VerifyingDeep Neural Networks In Computer Aided Verification - 29th Interna-tional Conference CAV 2017 97ndash117

[24] Hui Kong Fei He Xiaoyu Song William N N Hung and Ming Gu2013 Exponential-Condition-Based Barrier Certificate Generationfor Safety Verification of Hybrid Systems In Proceedings of the 25thInternational Conference on Computer Aided Verification (CAVrsquo13) 242ndash257

[25] Bettina Koumlnighofer Mohammed Alshiekh Roderick Bloem LauraHumphrey Robert Koumlnighofer Ufuk Topcu and Chao Wang 2017Shield Synthesis Form Methods Syst Des 51 2 (Nov 2017) 332ndash361

[26] Benoicirct Legat Chris Coey Robin Deits Joey Huchette and AmeliaPerry 2017 (accessed Nov 16 2018) Sum-of-squares optimization inJulia httpswwwjuliaoptorgmeetingsmit2017legatpdf

[27] Yuxi Li 2017 Deep Reinforcement Learning An Overview CoRRabs170107274 (2017)

[28] Timothy P Lillicrap Jonathan J Hunt Alexander Pritzel Nicolas HeessTom Erez Yuval Tassa David Silver and Daan Wierstra 2016 Contin-uous control with deep reinforcement learning In 4th InternationalConference on Learning Representations ICLR 2016 San Juan PuertoRico May 2-4 2016 Conference Track Proceedings

[29] Horia Mania Aurelia Guy and Benjamin Recht 2018 Simple randomsearch of static linear policies is competitive for reinforcement learningIn Advances in Neural Information Processing Systems 31 1800ndash1809

[30] J Matyas 1965 Random optimization Automation and Remote Control26 2 (1965) 246ndash253

[31] Matthew Mirman Timon Gehr and Martin T Vechev 2018 Differen-tiable Abstract Interpretation for Provably Robust Neural Networks InProceedings of the 35th International Conference on Machine LearningICML 2018 3575ndash3583

[32] Teodor Mihai Moldovan and Pieter Abbeel 2012 Safe Exploration inMarkov Decision Processes In Proceedings of the 29th InternationalCoference on International Conference on Machine Learning (ICMLrsquo12)1451ndash1458

[33] Yurii Nesterov and Vladimir Spokoiny 2017 Random Gradient-FreeMinimization of Convex Functions Found Comput Math 17 2 (2017)527ndash566

[34] Stephen Prajna and Ali Jadbabaie 2004 Safety Verification of HybridSystems Using Barrier Certificates In Hybrid Systems Computationand Control 7th International Workshop HSCC 2004 477ndash492

[35] Aravind Rajeswaran Kendall Lowrey Emanuel Todorov and Sham MKakade 2017 Towards Generalization and Simplicity in Continu-ous Control In Advances in Neural Information Processing Systems30 Annual Conference on Neural Information Processing Systems 20176553ndash6564

[36] Steacutephane Ross Geoffrey J Gordon and Drew Bagnell 2011 A Re-duction of Imitation Learning and Structured Prediction to No-RegretOnline Learning In Proceedings of the Fourteenth International Confer-ence on Artificial Intelligence and Statistics AISTATS 2011 627ndash635

[37] Wenjie Ruan Xiaowei Huang and Marta Kwiatkowska 2018 Reacha-bility Analysis of Deep Neural Networks with Provable GuaranteesIn Proceedings of the Twenty-Seventh International Joint Conference onArtificial Intelligence IJCAI 2018 2651ndash2659

[38] Stefan Schaal 1999 Is imitation learning the route to humanoid robotsTrends in cognitive sciences 3 6 (1999) 233ndash242

[39] Bastian Schuumlrmann and Matthias Althoff 2017 Optimal control ofsets of solutions to formally guarantee constraints of disturbed linearsystems In 2017 American Control Conference ACC 2017 2522ndash2529

[40] David Silver Guy Lever Nicolas Heess Thomas Degris DaanWierstraandMartin Riedmiller 2014 Deterministic Policy Gradient AlgorithmsIn Proceedings of the 31st International Conference on InternationalConference on Machine Learning - Volume 32 Indash387ndashIndash395

[41] Armando Solar-Lezama 2008 Program Synthesis by Sketching PhDDissertation Berkeley CA USA Advisor(s) Bodik Rastislav

[42] Armando Solar-Lezama 2009 The Sketching Approach to ProgramSynthesis In Proceedings of the 7th Asian Symposium on ProgrammingLanguages and Systems (APLAS rsquo09) 4ndash13

[43] Youcheng Sun Min Wu Wenjie Ruan Xiaowei Huang MartaKwiatkowska and Daniel Kroening 2018 Concolic Testing for DeepNeural Networks In Proceedings of the 33rd ACMIEEE InternationalConference on Automated Software Engineering (ASE 2018) 109ndash119

[44] Russ Tedrake 2009 LQR-trees Feedback motion planning on sparserandomized trees In Robotics Science and Systems V University ofWashington Seattle USA June 28 - July 1 2009

[45] Yuchi Tian Kexin Pei Suman Jana and Baishakhi Ray 2018 DeepTestAutomated Testing of Deep-neural-network-driven Autonomous CarsIn Proceedings of the 40th International Conference on Software Engi-neering (ICSE rsquo18) 303ndash314

[46] Matteo Turchetta Felix Berkenkamp and Andreas Krause 2016 SafeExploration in Finite Markov Decision Processes with Gaussian Pro-cesses In Proceedings of the 30th International Conference on NeuralInformation Processing Systems (NIPSrsquo16) 4312ndash4320

[47] Abhinav Verma Vijayaraghavan Murali Rishabh Singh PushmeetKohli and Swarat Chaudhuri 2018 Programmatically InterpretableReinforcement Learning In Proceedings of the 35th International Con-ference on Machine Learning ICML 2018 5052ndash5061

[48] Matthew Wicker Xiaowei Huang and Marta Kwiatkowska 2018Feature-Guided Black-Box Safety Testing of Deep Neural NetworksIn Tools and Algorithms for the Construction and Analysis of Systems -24th International Conference TACAS 2018 408ndash426

15

[49] He Zhu Zikang Xiong Stephen Magill and Suresh Jagannathan 2018An Inductive Synthesis Framework for Verifiable Reinforcement Learn-ing Technical Report Galois Inc httpherowanzhugithubio

herowanzhugithubioVerifiableLearningpdf

16

  • Abstract
  • 1 Introduction
  • 2 Motivation and Overview
    • 21 State Transition System
    • 22 Synthesis Verification and Shielding
      • 3 Problem Setup
      • 4 Verification Procedure
        • 41 Synthesis
        • 42 Verification
        • 43 Shielding
          • 5 Experimental Results
          • 6 Related Work
          • 7 Conclusions
          • References
Page 12: An Inductive Synthesis Framework for Verifiable

shield trades performance for safety to guarantee that thethreshold boundary is never violatedFor all benchmarks our tool successfully generated safe

interpretable deterministic programs and inductive invari-ants as shields When a neural controller takes an unsafeaction the synthesized shield correctly prevents this actionfrom executing by providing an alternative provable safeaction proposed by the verified deterministic program Interm of performance Table 1 shows that a shielded neuralpolicy is a feasible approach to drive a controlled system intoa steady state For each of the benchmarks studied the pro-grammatic policy is less performant than the shielded neuralpolicy sometimes by a factor of two or more (eg CartpoleSelf-Driving and Oscillator) Our result demonstrates thatexecuting a neural policy in tandem with a program distilledfrom it can retain performance provided by the neural policywhile maintaining safety provided by the verified program

Although our synthesis algorithm does not guarantee con-vergence to the global minimum when applied to nonconvexRL problems the results given in Table 1 indicate that ouralgorithm can often produce high-quality control programswith respect to a provided sketch that converge reasonablyfast In some of our benchmarks however the number ofinterventions are significantly higher than the number ofneural controller failures eg 8-Car platoon and OscillatorHowever the high number of interventions is not primar-ily because of non-optimality of the synthesized program-matic controller Instead inherent safety issues in the neuralnetwork models are the main culprit that triggers shieldinterventions In 8-Car platoon after corrections made byour deterministic program the neural model again takes anunsafe action in the next execution step so that the deter-ministic program has to make another correction It is onlyafter applying a number of such shield interventions that thesystem navigates into a part of the state space that can besafely operated on by the neural model For this benchmarkall system states where there occur shield interventions areindeed unsafe We also examined the unsafe simulation runsmade by executing the neural controller alone in OscillatorAmong the large number of shield interventions (as reflectedin Table 1) 74 of them are effective and indeed prevent anunsafe neural decision Ineffective interventions inOscillatorare due to the fact that when optimizing equation (5) a largepenalty is given to unsafe states causing the synthesizedprogrammatic policy to weigh safety more than proximitywhen there exist a large number of unsafe neural decisions

Suitable Sketches. Providing a suitable sketch may require domain knowledge. To help the user more easily tune the shape of a sketch, our approach provides algorithmic support by not requiring conditional statements in a program sketch, and by offering syntactic sugar, i.e., the user can simply provide an upper bound on the degree of an invariant sketch.

Table 2. Experimental Results on Tuning Invariant Degrees. TO means that an adequate inductive invariant cannot be found within 2 hours.

Benchmark        Degree   Verification   Interventions   Overhead

Pendulum         2        TO             -               -
Pendulum         4        226s           40542           782
Pendulum         8        236s           30787           879

Self-Driving     2        TO             -               -
Self-Driving     4        24s            128851          697
Self-Driving     8        251s           123671          2685

8-Car-platoon    2        1729s          43952           836
8-Car-platoon    4        5402s          37990           919
8-Car-platoon    8        TO             -               -

Our experimental results were collected using the invariant sketch defined in equation (7), with an upper bound of 4 on the degree of all monomials included in the sketch. Recall that invariants may also be used as conditional predicates as part of a synthesized program. We adjust the invariant degree upper bound to evaluate its effect on our synthesis procedure. The results are given in Table 2.
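
The degree-bound syntactic sugar can be viewed as a template generator: given only an upper bound, it enumerates the monomials whose coefficients the synthesizer must instantiate. The Python sketch below illustrates this idea under assumed names and an assumed sign convention; it is not the exact template of equation (7).

from itertools import combinations_with_replacement

import numpy as np

def monomial_exponents(n_vars, max_degree):
    # Enumerate exponent vectors of all monomials over n_vars state variables
    # whose total degree is at most max_degree (the user-supplied upper bound).
    exponents = []
    for degree in range(max_degree + 1):
        for combo in combinations_with_replacement(range(n_vars), degree):
            e = [0] * n_vars
            for var in combo:
                e[var] += 1
            exponents.append(tuple(e))
    return exponents

def invariant_candidate(coeffs, exponents):
    # Polynomial candidate phi(x) = sum_i c_i * prod_j x_j^{e_ij}; a state is
    # treated as covered when phi(x) <= 0 (sign convention assumed here).
    def phi(x):
        x = np.asarray(x, dtype=float)
        return sum(c * float(np.prod(x ** np.asarray(e))) for c, e in zip(coeffs, exponents))
    return phi

if __name__ == "__main__":
    exps = monomial_exponents(n_vars=2, max_degree=4)
    print(len(exps), "monomials for a 2-dimensional state with degree bound 4")
    phi = invariant_candidate(np.ones(len(exps)), exps)
    print("phi([0.1, -0.2]) =", phi([0.1, -0.2]))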

Generally, high-degree invariants lead to fewer interventions because they tend to be more permissive than low-degree ones. However, high-degree invariants take more time to synthesize and verify. This is particularly true for high-dimensional models such as 8-Car platoon. Moreover, although high-degree invariants tend to cause fewer interventions, they have larger overhead; for example, using a shield of degree 8 in the Self-Driving benchmark caused an overhead of 2685. This is because high-degree polynomial computations are time-consuming. On the other hand, an insufficient degree upper bound may not be permissive enough to obtain a valid invariant. It is therefore essential to consider the tradeoff between overhead and permissiveness when choosing the degree of an invariant.
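
The trend follows directly from the size of the template: a polynomial invariant over an n-dimensional state with degree bound d has C(n+d, d) candidate monomials, so both verification effort and per-step shield evaluation grow rapidly with d. The small Python calculation below illustrates the growth; the 15-dimensional figure stands in for a platoon-like state and is an assumption used only to show the trend.

from math import comb

def num_monomials(n_vars, max_degree):
    # Number of monomials of total degree <= max_degree over n_vars variables:
    # C(n_vars + max_degree, max_degree).
    return comb(n_vars + max_degree, max_degree)

if __name__ == "__main__":
    # 2 dimensions resembles a pendulum-like model; 15 dimensions is an assumed
    # stand-in for a platoon-like state.
    for degree in (2, 4, 8):
        print("degree", degree, ":", num_monomials(2, degree), "vs", num_monomials(15, degree))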

Handling Environment Changes. We consider the effectiveness of our tool when previously trained neural network controllers are deployed in environment contexts different from the environment used for training. Here we consider neural network controllers of larger size (two hidden layers with 1200 × 900 neurons) than in the above experiments. This is because we want to ensure that a neural policy is trained to be near optimal in the environment context used for training. These larger networks were in general more difficult to train, requiring at least 1500 seconds to converge. Our results are summarized in Table 3.

When the underlying environment slightly changes, learning a new safety shield takes substantially less time than training a new network. For Cartpole, we simulated the trained controller in a new environment obtained by increasing the length of the pole by 0.15 meters. The neural controller failed 3 times in our 1000-episode simulation; the shield interfered with the network's operation only 8 times to prevent these unsafe behaviors.


Table 3. Experimental Results on Handling Environment Changes.

Benchmark      Environment Change                        NN Size      NN Failures   Program Size   Synthesis   Overhead   Shield Interventions
Cartpole       Increased pole length by 0.15 m           1200 × 900   3             1              239s        291        8
Pendulum       Increased pendulum mass by 0.3 kg         1200 × 900   77            1              581s        811        8748
Pendulum       Increased pendulum length by 0.15 m       1200 × 900   76            1              483s        653        7060
Self-driving   Added an obstacle that must be avoided    1200 × 900   203           1              392s        842        108320
(The NN columns describe the neural network; the remaining columns describe the deterministic program used as a shield.)

The new shield was synthesized in 239s, significantly faster than retraining a new neural network for the new environment. For (Inverted) Pendulum, we deployed the trained neural network in an environment in which the pendulum's mass is increased by 0.3 kg. The neural controller exhibits noticeably higher failure rates than in the previous experiment; we were able to synthesize a safety shield adapted to this new environment in 581 seconds that prevented these violations. The shield intervened with the operation of the network only about 8.7 times per episode. Similar results were observed when we increased the pendulum's length by 0.15 m. For Self-driving, we additionally required the car to avoid an obstacle. The synthesized shield provided safe actions to ensure collision-free motion.
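
As a rough illustration of why re-synthesizing the shield (rather than retraining the network) is the natural response to such a change, the Python sketch below samples states to falsify the old invariant under perturbed dynamics. This sampling check is only a plausible first step added here for illustration; it is not the verification procedure used by our tool, and all dynamics and invariants below are placeholders.

import numpy as np

def invariant_holds(state):
    # Placeholder invariant learned for the original environment.
    return float(np.dot(state, state)) <= 1.0

def program_policy(state):
    # Placeholder verified deterministic program.
    return -1.0 * state

def changed_dynamics(state, action):
    # Placeholder dynamics after the environment change (e.g., a longer pole or
    # heavier pendulum would alter these coefficients).
    return 0.97 * state + 0.1 * action

def shield_still_valid(samples):
    # Cheap falsification pass: if any sampled state inside the invariant is
    # driven outside of it by the program under the changed dynamics, the old
    # shield is no longer known to be safe and a new one must be synthesized.
    for state in samples:
        successor = changed_dynamics(state, program_policy(state))
        if invariant_holds(state) and not invariant_holds(successor):
            return False
    return True

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    samples = rng.uniform(-1.0, 1.0, size=(1000, 2))
    print("re-synthesis needed:", not shield_still_valid(samples))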

6 Related Work
Verification of Neural Networks. While neural networks have historically found only limited application in safety- and security-related contexts, recent advances have defined suitable verification methodologies that are capable of providing stronger assurance guarantees. For example, Reluplex [23] is an SMT solver that supports linear real arithmetic with ReLU constraints and has been used to verify safety properties in a collision avoidance system. AI2 [17, 31] is an abstract interpretation tool that can reason about robustness specifications for deep convolutional networks. Robustness verification is also considered in [8, 21]. Systematic testing techniques such as [43, 45, 48] are designed to automatically generate test cases to increase a coverage metric, i.e., to explore different parts of a neural network's architecture by generating test inputs that maximize the number of activated neurons. These approaches ignore effects induced by the actual environment in which the network is deployed, and they are typically ineffective for safety verification of a neural network controller over an infinite time horizon with complex system dynamics. Unlike these whitebox efforts that reason over the architecture of the network, our verification framework is fully blackbox, using a network only as an oracle from which to learn a simpler deterministic program. This gives us the ability to reason about safety entirely in terms of the synthesized program, using a shielding mechanism derived from the program to ensure that the neural policy can only explore safe regions.

Safe Reinforcement Learning. The machine learning community has explored various techniques to develop safe reinforcement learning algorithms in contexts similar to ours, e.g., [32, 46]. In some cases, verification is used as a safety validation oracle [2, 11]. In contrast to these efforts, our inductive synthesis algorithm can be freely incorporated into any existing reinforcement learning framework and applied to any legacy machine learning model. More importantly, our design allows us to synthesize new shields from existing ones, without requiring retraining of the network, under reasonable changes to the environment.
Syntax-Guided Synthesis for Machine Learning. An important reason underlying the success of program synthesis is that sketches of the desired program [41, 42], often provided along with example data, can be used to effectively restrict a search space and allow users to provide additional insight about the desired output [5]. Our sketch-based inductive synthesis approach is a realization of this idea, applicable to the continuous and infinite state control systems central to many machine learning tasks. A high-level policy language grammar is used in our approach to constrain the shape of possible synthesized deterministic policy programs. Our program parameters in sketches are continuous because they naturally fit continuous control problems. However, our synthesis algorithm (line 9 of Algorithm 2) can call a derivative-free random search method (which does not require the gradient or its approximative finite difference) for synthesizing functions whose domain is disconnected, (mixed-)integer, or non-smooth. Our synthesizer may thus be applied to other domains that are not continuous or differentiable, like most modern program synthesis algorithms [41, 42].

In a similar vein, recent efforts on interpretable machine learning generate interpretable models, such as program code [47] or decision trees [9], as output, after which traditional symbolic verification techniques can be leveraged to prove program properties. Our approach novelly ensures that only safe programs are synthesized, via a CEGIS procedure, and can provide safety guarantees for the original high-performing neural networks via invariant inference. A detailed comparison was provided on page 3.
Controller Synthesis. Random search [30] has long been utilized as an effective approach for controller synthesis. In [29], several extensions of random search have been proposed that substantially improve its performance when applied to reinforcement learning.


However, [29] learns a single linear controller that, in our experience, is insufficient to guarantee safety for our benchmarks. We often need to learn a program that involves a family of linear controllers conditioned on different input spaces to guarantee safety. The novelty of our approach is that it can automatically synthesize programs with conditional statements by need, which is critical to ensure safety.

In Sec. 5, linear sketches were used in our experiments for shield synthesis. LQR-Tree based approaches such as [44] can synthesize a controller consisting of a set of linear quadratic regulator (LQR [14]) controllers, each applicable in a different portion of the state space. These approaches focus on stability, while our approach primarily addresses safety. In our experiments, we observed that because LQR does not take safe/unsafe regions into consideration, synthesized LQR controllers can regularly violate safety constraints.
Runtime Shielding. Shield synthesis has been used in runtime enforcement for reactive systems [12, 25]. Safety shielding in reinforcement learning was first introduced in [4]. Because their verification approach is not symbolic, however, it can only work over finite discrete state and action systems. Since states in infinite state and continuous action systems are not enumerable, using these methods requires the end user to provide a finite abstraction of complex environment dynamics; such abstractions are typically too coarse to be useful (because they result in too much over-approximation) or otherwise have too many states to be analyzable [15]. Indeed, automated environment abstraction tools such as [16] often take hours even for simple 4-dimensional systems. In general, these efforts focused on shielding of discrete and finite systems, with no obvious generalization to effectively deal with the continuous and infinite systems that our approach can monitor and protect. Our approach embraces the nature of infinity in control systems by learning a symbolic shield from an inductive invariant for a synthesized program, covering an infinite number of environment states under which the program can be guaranteed to be safe. We therefore believe our framework provides a more promising verification pathway for machine learning in high-dimensional control systems. Shielding in continuous control systems was introduced in [3, 18]. These approaches are based on HJ reachability (Hamilton-Jacobi Reachability [7]). However, current formulations of HJ reachability are limited to systems with approximately five dimensions, making high-dimensional system verification intractable. Unlike their least restrictive control law framework that decouples safety and learning, our approach only considers state spaces that a learned controller attempts to explore and is therefore capable of synthesizing safety shields even in high dimensions.

7 Conclusions
This paper presents an inductive synthesis-based toolchain that can verify neural network control policies within an environment formalized as an infinite-state transition system.

The key idea is a novel synthesis framework capable of synthesizing a deterministic policy program, based on a user-given sketch, that resembles a neural policy oracle and simultaneously satisfies a safety specification, using a counterexample-guided synthesis loop. Verification soundness is achieved at runtime by using the learned deterministic program, along with its learned inductive invariant, as a shield to protect the neural policy, correcting the neural policy's action only if the chosen action can cause a violation of the inductive invariant. Experimental results demonstrate that our approach can be used to realize fully trustworthy reinforcement learning systems with low overhead.

References
[1] M. Ahmadi, C. Rowley, and U. Topcu. 2018. Control-Oriented Learning of Lagrangian and Hamiltonian Systems. In American Control Conference.
[2] Anayo K. Akametalu, Shahab Kaynama, Jaime F. Fisac, Melanie Nicole Zeilinger, Jeremy H. Gillula, and Claire J. Tomlin. 2014. Reachability-based safe learning with Gaussian processes. In 53rd IEEE Conference on Decision and Control, CDC 2014, 1424–1431.
[3] Anayo K. Akametalu, Shahab Kaynama, Jaime F. Fisac, Melanie Nicole Zeilinger, Jeremy H. Gillula, and Claire J. Tomlin. 2014. Reachability-based safe learning with Gaussian processes. In 53rd IEEE Conference on Decision and Control, CDC 2014, Los Angeles, CA, USA, December 15-17, 2014, 1424–1431.
[4] Mohammed Alshiekh, Roderick Bloem, Rüdiger Ehlers, Bettina Könighofer, Scott Niekum, and Ufuk Topcu. 2018. Safe Reinforcement Learning via Shielding. AAAI (2018).
[5] Rajeev Alur, Rastislav Bodík, Eric Dallal, Dana Fisman, Pranav Garg, Garvit Juniwal, Hadas Kress-Gazit, P. Madhusudan, Milo M. K. Martin, Mukund Raghothaman, Shambwaditya Saha, Sanjit A. Seshia, Rishabh Singh, Armando Solar-Lezama, Emina Torlak, and Abhishek Udupa. 2015. Syntax-Guided Synthesis. In Dependable Software Systems Engineering, NATO Science for Peace and Security Series D: Information and Communication Security, Vol. 40. IOS Press, 1–25.
[6] MOSEK ApS. 2018. The MOSEK optimization software. http://www.mosek.com
[7] Somil Bansal, Mo Chen, Sylvia Herbert, and Claire J. Tomlin. 2017. Hamilton-Jacobi Reachability: A Brief Overview and Recent Advances. (2017).
[8] Osbert Bastani, Yani Ioannou, Leonidas Lampropoulos, Dimitrios Vytiniotis, Aditya V. Nori, and Antonio Criminisi. 2016. Measuring Neural Net Robustness with Constraints. In Proceedings of the 30th International Conference on Neural Information Processing Systems (NIPS'16), 2621–2629.
[9] Osbert Bastani, Yewen Pu, and Armando Solar-Lezama. 2018. Verifiable Reinforcement Learning via Policy Extraction. CoRR abs/1805.08328 (2018).
[10] Richard N. Bergman, Diane T. Finegood, and Marilyn Ader. 1985. Assessment of insulin sensitivity in vivo. Endocrine Reviews 6, 1 (1985), 45–86.
[11] Felix Berkenkamp, Matteo Turchetta, Angela P. Schoellig, and Andreas Krause. 2017. Safe Model-based Reinforcement Learning with Stability Guarantees. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 908–919.
[12] Roderick Bloem, Bettina Könighofer, Robert Könighofer, and Chao Wang. 2015. Shield Synthesis: Runtime Enforcement for Reactive Systems. In Tools and Algorithms for the Construction and Analysis of Systems - 21st International Conference, TACAS 2015, 533–548.
[13] Leonardo De Moura and Nikolaj Bjørner. 2008. Z3: An Efficient SMT Solver. In Proceedings of the Theory and Practice of Software, 14th International Conference on Tools and Algorithms for the Construction and Analysis of Systems (TACAS'08/ETAPS'08), 337–340.
[14] Peter Dorato, Vito Cerone, and Chaouki Abdallah. 1994. Linear-Quadratic Control: An Introduction. Simon & Schuster, Inc., New York, NY, USA.
[15] Chuchu Fan, Umang Mathur, Sayan Mitra, and Mahesh Viswanathan. 2018. Controller Synthesis Made Real: Reach-Avoid Specifications and Linear Dynamics. In Computer Aided Verification - 30th International Conference, CAV 2018, 347–366.
[16] Ioannis Filippidis, Sumanth Dathathri, Scott C. Livingston, Necmiye Ozay, and Richard M. Murray. 2016. Control design for hybrid systems with TuLiP: The Temporal Logic Planning toolbox. In 2016 IEEE Conference on Control Applications, CCA 2016, 1030–1041.
[17] Timon Gehr, Matthew Mirman, Dana Drachsler-Cohen, Petar Tsankov, Swarat Chaudhuri, and Martin T. Vechev. 2018. AI2: Safety and Robustness Certification of Neural Networks with Abstract Interpretation. In 2018 IEEE Symposium on Security and Privacy, SP 2018, 3–18.
[18] Jeremy H. Gillula and Claire J. Tomlin. 2012. Guaranteed Safe Online Learning via Reachability: tracking a ground target using a quadrotor. In IEEE International Conference on Robotics and Automation, ICRA 2012, 14-18 May 2012, St. Paul, Minnesota, USA, 2723–2730.
[19] Sumit Gulwani, Saurabh Srivastava, and Ramarathnam Venkatesan. 2008. Program Analysis As Constraint Solving. In Proceedings of the 29th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '08), 281–292.
[20] Sumit Gulwani and Ashish Tiwari. 2008. Constraint-based approach for analysis of hybrid systems. In International Conference on Computer Aided Verification, 190–203.
[21] Xiaowei Huang, Marta Kwiatkowska, Sen Wang, and Min Wu. 2017. Safety Verification of Deep Neural Networks. In CAV (1) (Lecture Notes in Computer Science), Vol. 10426, 3–29.
[22] Dominic W. Jordan and Peter Smith. 1987. Nonlinear ordinary differential equations. (1987).
[23] Guy Katz, Clark W. Barrett, David L. Dill, Kyle Julian, and Mykel J. Kochenderfer. 2017. Reluplex: An Efficient SMT Solver for Verifying Deep Neural Networks. In Computer Aided Verification - 29th International Conference, CAV 2017, 97–117.
[24] Hui Kong, Fei He, Xiaoyu Song, William N. N. Hung, and Ming Gu. 2013. Exponential-Condition-Based Barrier Certificate Generation for Safety Verification of Hybrid Systems. In Proceedings of the 25th International Conference on Computer Aided Verification (CAV'13), 242–257.
[25] Bettina Könighofer, Mohammed Alshiekh, Roderick Bloem, Laura Humphrey, Robert Könighofer, Ufuk Topcu, and Chao Wang. 2017. Shield Synthesis. Form. Methods Syst. Des. 51, 2 (Nov. 2017), 332–361.
[26] Benoît Legat, Chris Coey, Robin Deits, Joey Huchette, and Amelia Perry. 2017 (accessed Nov. 16, 2018). Sum-of-squares optimization in Julia. https://www.juliaopt.org/meetings/mit2017/legat.pdf
[27] Yuxi Li. 2017. Deep Reinforcement Learning: An Overview. CoRR abs/1701.07274 (2017).
[28] Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. 2016. Continuous control with deep reinforcement learning. In 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings.
[29] Horia Mania, Aurelia Guy, and Benjamin Recht. 2018. Simple random search of static linear policies is competitive for reinforcement learning. In Advances in Neural Information Processing Systems 31, 1800–1809.
[30] J. Matyas. 1965. Random optimization. Automation and Remote Control 26, 2 (1965), 246–253.
[31] Matthew Mirman, Timon Gehr, and Martin T. Vechev. 2018. Differentiable Abstract Interpretation for Provably Robust Neural Networks. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, 3575–3583.
[32] Teodor Mihai Moldovan and Pieter Abbeel. 2012. Safe Exploration in Markov Decision Processes. In Proceedings of the 29th International Conference on International Conference on Machine Learning (ICML'12), 1451–1458.
[33] Yurii Nesterov and Vladimir Spokoiny. 2017. Random Gradient-Free Minimization of Convex Functions. Found. Comput. Math. 17, 2 (2017), 527–566.
[34] Stephen Prajna and Ali Jadbabaie. 2004. Safety Verification of Hybrid Systems Using Barrier Certificates. In Hybrid Systems: Computation and Control, 7th International Workshop, HSCC 2004, 477–492.
[35] Aravind Rajeswaran, Kendall Lowrey, Emanuel Todorov, and Sham M. Kakade. 2017. Towards Generalization and Simplicity in Continuous Control. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 6553–6564.
[36] Stéphane Ross, Geoffrey J. Gordon, and Drew Bagnell. 2011. A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2011, 627–635.
[37] Wenjie Ruan, Xiaowei Huang, and Marta Kwiatkowska. 2018. Reachability Analysis of Deep Neural Networks with Provable Guarantees. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI 2018, 2651–2659.
[38] Stefan Schaal. 1999. Is imitation learning the route to humanoid robots? Trends in Cognitive Sciences 3, 6 (1999), 233–242.
[39] Bastian Schürmann and Matthias Althoff. 2017. Optimal control of sets of solutions to formally guarantee constraints of disturbed linear systems. In 2017 American Control Conference, ACC 2017, 2522–2529.
[40] David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Riedmiller. 2014. Deterministic Policy Gradient Algorithms. In Proceedings of the 31st International Conference on International Conference on Machine Learning - Volume 32, I-387–I-395.
[41] Armando Solar-Lezama. 2008. Program Synthesis by Sketching. Ph.D. Dissertation. Berkeley, CA, USA. Advisor(s): Bodik, Rastislav.
[42] Armando Solar-Lezama. 2009. The Sketching Approach to Program Synthesis. In Proceedings of the 7th Asian Symposium on Programming Languages and Systems (APLAS '09), 4–13.
[43] Youcheng Sun, Min Wu, Wenjie Ruan, Xiaowei Huang, Marta Kwiatkowska, and Daniel Kroening. 2018. Concolic Testing for Deep Neural Networks. In Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering (ASE 2018), 109–119.
[44] Russ Tedrake. 2009. LQR-trees: Feedback motion planning on sparse randomized trees. In Robotics: Science and Systems V, University of Washington, Seattle, USA, June 28 - July 1, 2009.
[45] Yuchi Tian, Kexin Pei, Suman Jana, and Baishakhi Ray. 2018. DeepTest: Automated Testing of Deep-neural-network-driven Autonomous Cars. In Proceedings of the 40th International Conference on Software Engineering (ICSE '18), 303–314.
[46] Matteo Turchetta, Felix Berkenkamp, and Andreas Krause. 2016. Safe Exploration in Finite Markov Decision Processes with Gaussian Processes. In Proceedings of the 30th International Conference on Neural Information Processing Systems (NIPS'16), 4312–4320.
[47] Abhinav Verma, Vijayaraghavan Murali, Rishabh Singh, Pushmeet Kohli, and Swarat Chaudhuri. 2018. Programmatically Interpretable Reinforcement Learning. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, 5052–5061.
[48] Matthew Wicker, Xiaowei Huang, and Marta Kwiatkowska. 2018. Feature-Guided Black-Box Safety Testing of Deep Neural Networks. In Tools and Algorithms for the Construction and Analysis of Systems - 24th International Conference, TACAS 2018, 408–426.
[49] He Zhu, Zikang Xiong, Stephen Magill, and Suresh Jagannathan. 2018. An Inductive Synthesis Framework for Verifiable Reinforcement Learning. Technical Report. Galois, Inc. http://herowanzhu.github.io/herowanzhu.github.io/VerifiableLearning.pdf

  • Abstract
  • 1 Introduction
  • 2 Motivation and Overview
    • 21 State Transition System
    • 22 Synthesis Verification and Shielding
      • 3 Problem Setup
      • 4 Verification Procedure
        • 41 Synthesis
        • 42 Verification
        • 43 Shielding
          • 5 Experimental Results
          • 6 Related Work
          • 7 Conclusions
          • References
Page 13: An Inductive Synthesis Framework for Verifiable

Table 3 Experimental Results on Handling Environment Changes

Neural Network Deterministic Program as ShieldBenchmarks Environment Change Size Failures Size Synthesis Overhead Shield InterventionsCartpole Increased Pole length by 015m 1200 times 900 3 1 239s 291 8Pendulum Increased Pendulum mass by 03kg 1200 times 900 77 1 581s 811 8748Pendulum Increased Pendulum length by 015m 1200 times 900 76 1 483s 653 7060Self-driving Added an obstacle that must be avoided 1200 times 900 203 1 392s 842 108320

The new shield was synthesized in 239s significantly fasterthan retraining a new neural network for the new environ-ment For (Inverted) Pendulum we deployed the trainedneural network in an environment in which the pendulumrsquosmass is increased by 03kg The neural controller exhibitsnoticeably higher failure rates than in the previous experi-ment we were able to synthesize a safety shield adapted tothis new environment in 581 seconds that prevented theseviolations The shield intervened with the operation of thenetwork only 87 number of times per episode Similar resultswere observed when we increased the pendulumrsquos lengthby 015m For Self-driving we additionally required the carto avoid an obstacle The synthesized shield provided safeactions to ensure collision-free motion

6 Related WorkVerification of Neural NetworksWhile neural networkshave historically found only limited application in safety-and security-related contexts recent advances have definedsuitable verification methodologies that are capable of pro-viding stronger assurance guarantees For example Relu-plex [23] is an SMT solver that supports linear real arithmeticwith ReLU constraints and has been used to verify safetyproperties in a collision avoidance system AI2 [17 31] is anabstract interpretation tool that can reason about robustnessspecifications for deep convolutional networks Robustnessverification is also considered in [8 21] Systematic testingtechniques such as [43 45 48] are designed to automati-cally generate test cases to increase a coverage metric ieexplore different parts of neural network architecture by gen-erating test inputs that maximize the number of activatedneurons These approaches ignore effects induced by theactual environment in which the network is deployed Theyare typically ineffective for safety verification of a neuralnetwork controller over an infinite time horizon with com-plex system dynamics Unlike these whitebox efforts thatreason over the architecture of the network our verificationframework is fully blackbox using a network only as anoracle to learn a simpler deterministic program This givesus the ability to reason about safety entirely in terms of thesynthesized program using a shielding mechanism derivedfrom the program to ensure that the neural policy can onlyexplore safe regions

SafeReinforcement LearningThemachine learning com-munity has explored various techniques to develop safe rein-forcement machine learning algorithms in contexts similarto ours eg [32 46] In some cases verification is used as asafety validation oracle [2 11] In contrast to these effortsour inductive synthesis algorithm can be freely incorporatedinto any existing reinforcement learning framework andapplied to any legacy machine learning model More impor-tantly our design allows us to synthesize new shields fromexisting ones without requiring retraining of the networkunder reasonable changes to the environmentSyntax-Guided Synthesis forMachine LearningAn im-portant reason underlying the success of program synthesisis that sketches of the desired program [41 42] often pro-vided along with example data can be used to effectivelyrestrict a search space and allow users to provide additionalinsight about the desired output [5] Our sketch based induc-tive synthesis approach is a realization of this idea applicableto continuous and infinite state control systems central tomany machine learning tasks A high-level policy languagegrammar is used in our approach to constrain the shapeof possible synthesized deterministic policy programs Ourprogram parameters in sketches are continuous becausethey naturally fit continuous control problems Howeverour synthesis algorithm (line 9 of Algorithm 2) can call aderivate-free random search method (which does not requirethe gradient or its approximative finite difference) for syn-thesizing functions whose domain is disconnected (mixed-)integer or non-smooth Our synthesizer thus may be appliedto other domains that are not continuous or differentiablelike most modern program synthesis algorithms [41 42]In a similar vein recent efforts on interpretable machine

learning generate interpretable models such as programcode [47] or decision trees [9] as output after which tra-ditional symbolic verification techniques can be leveragedto prove program properties Our approach novelly ensuresthat only safe programs are synthesized via a CEGIS pro-cedure and can provide safety guarantees on the originalhigh-performing neural networks via invariant inference Adetailed comparison was provided on page 3Controller Synthesis Random search [30] has long beenutilized as an effective approach for controller synthesisIn [29] several extensions of random search have been pro-posed that substantially improve its performance when ap-plied to reinforcement learning However [29] learns a single

13

linear controller that we found to be insufficient to guaranteesafety for our benchmarks in our experience We often needto learn a program that involves a family of linear controllersconditioned on different input spaces to guarantee safetyThe novelty of our approach is that it can automaticallysynthesize programs with conditional statements by needwhich is critical to ensure safety

In Sec 5 linear sketches were used in our experiments forshield synthesis LQR-Tree based approaches such as [44]can synthesize a controller consisting of a set of linear qua-dratic regulator (LQR [14]) controllers each applicable in adifferent portion of the state space These approaches focuson stability while our approach primarily addresses safetyIn our experiments we observed that because LQR does nottake safeunsafe regions into consideration synthesized LQRcontrollers can regularly violate safety constraintsRuntime Shielding Shield synthesis has been used in run-time enforcement for reactive systems [12 25] Safety shield-ing in reinforcement learning was first introduced in [4]Because their verification approach is not symbolic howeverit can only work over finite discrete state and action systemsSince states in infinite state and continuous action systemsare not enumerable using these methods requires the enduser to provide a finite abstraction of complex environmentdynamics such abstractions are typically too coarse to beuseful (because they result in too much over-approximation)or otherwise have too many states to be analyzable [15] In-deed automated environment abstraction tools such as [16]often take hours even for simple 4-dimensional systems Ourapproach embraces the nature of infinity in control systemsby learning a symbolic shield from an inductive invariant fora synthesized program that includes an infinite number ofenvironment states under which the program can be guaran-teed to be safe We therefore believe our framework providesa more promising verification pathway for machine learningin high-dimensional control systems In general these effortsfocused on shielding of discrete and finite systems with noobvious generalization to effectively deal with continuousand infinite systems that our approach can monitor and pro-tect Shielding in continuous control systems was introducedin [3 18] These approaches are based on HJ reachability(Hamilton-Jacobi Reachability [7]) However current formu-lations of HJ reachability are limited to systems with approx-imately five dimensions making high-dimensional systemverification intractable Unlike their least restrictive controllaw framework that decouples safety and learning our ap-proach only considers state spaces that a learned controllerattempts to explore and is therefore capable of synthesizingsafety shields even in high dimensions

7 ConclusionsThis paper presents an inductive synthesis-based toolchainthat can verify neural network control policies within an

environment formalized as an infinite-state transition sys-tem The key idea is a novel synthesis framework capableof synthesizing a deterministic policy program based onan user-given sketch that resembles a neural policy oracleand simultaneously satisfies a safety specification using acounterexample-guided synthesis loop Verification sound-ness is achieved at runtime by using the learned determin-istic program along with its learned inductive invariant asa shield to protect the neural policy correcting the neuralpolicyrsquos action only if the chosen action can cause violationof the inductive invariant Experimental results demonstratethat our approach can be used to realize fully trustworthyreinforcement learning systems with low overhead

References[1] M Ahmadi C Rowley and U Topcu 2018 Control-Oriented Learning

of Lagrangian and Hamiltonian Systems In American Control Confer-ence

[2] Anayo K Akametalu Shahab Kaynama Jaime F Fisac Melanie NicoleZeilinger Jeremy H Gillula and Claire J Tomlin 2014 Reachability-based safe learning with Gaussian processes In 53rd IEEE Conferenceon Decision and Control CDC 2014 1424ndash1431

[3] Anayo K Akametalu Shahab Kaynama Jaime F Fisac Melanie NicoleZeilinger Jeremy H Gillula and Claire J Tomlin 2014 Reachability-based safe learning with Gaussian processes In 53rd IEEE Conferenceon Decision and Control CDC 2014 Los Angeles CA USA December15-17 2014 1424ndash1431

[4] Mohammed Alshiekh Roderick Bloem Ruumldiger Ehlers BettinaKoumlnighofer Scott Niekum and Ufuk Topcu 2018 Safe Reinforce-ment Learning via Shielding AAAI (2018)

[5] Rajeev Alur Rastislav Bodiacutek Eric Dallal Dana Fisman Pranav GargGarvit Juniwal Hadas Kress-Gazit P Madhusudan Milo M K MartinMukund Raghothaman Shambwaditya Saha Sanjit A Seshia RishabhSingh Armando Solar-Lezama Emina Torlak and Abhishek Udupa2015 Syntax-Guided Synthesis In Dependable Software Systems Engi-neering NATO Science for Peace and Security Series D Informationand Communication Security Vol 40 IOS Press 1ndash25

[6] MOSEK ApS 2018 The mosek optimization software httpwwwmosekcom

[7] Somil Bansal Mo Chen Sylvia Herbert and Claire J Tomlin 2017Hamilton-Jacobi Reachability A Brief Overview and Recent Advances(2017)

[8] Osbert Bastani Yani Ioannou Leonidas Lampropoulos Dimitrios Vy-tiniotis Aditya V Nori and Antonio Criminisi 2016 Measuring NeuralNet Robustness with Constraints In Proceedings of the 30th Interna-tional Conference on Neural Information Processing Systems (NIPSrsquo16)2621ndash2629

[9] Osbert Bastani Yewen Pu and Armando Solar-Lezama 2018 VerifiableReinforcement Learning via Policy Extraction CoRR abs180508328(2018)

[10] Richard N Bergman Diane T Finegood and Marilyn Ader 1985 As-sessment of insulin sensitivity in vivo Endocrine reviews 6 1 (1985)45ndash86

[11] Felix Berkenkamp Matteo Turchetta Angela P Schoellig and AndreasKrause 2017 Safe Model-based Reinforcement Learning with StabilityGuarantees In Advances in Neural Information Processing Systems 30Annual Conference on Neural Information Processing Systems 2017 908ndash919

[12] Roderick Bloem Bettina Koumlnighofer Robert Koumlnighofer and ChaoWang 2015 Shield Synthesis - Runtime Enforcement for ReactiveSystems In Tools and Algorithms for the Construction and Analysis of

14

Systems - 21st International Conference TACAS 2015 533ndash548[13] Leonardo De Moura and Nikolaj Bjoslashrner 2008 Z3 An Efficient SMT

Solver In Proceedings of the Theory and Practice of Software 14th Inter-national Conference on Tools and Algorithms for the Construction andAnalysis of Systems (TACASrsquo08ETAPSrsquo08) 337ndash340

[14] Peter Dorato Vito Cerone and Chaouki Abdallah 1994 Linear-Quadratic Control An Introduction Simon amp Schuster Inc New YorkNY USA

[15] Chuchu Fan Umang Mathur Sayan Mitra and Mahesh Viswanathan2018 Controller Synthesis Made Real Reach-Avoid Specifications andLinear Dynamics In Computer Aided Verification - 30th InternationalConference CAV 2018 347ndash366

[16] Ioannis Filippidis Sumanth Dathathri Scott C Livingston NecmiyeOzay and Richard M Murray 2016 Control design for hybrid sys-tems with TuLiP The Temporal Logic Planning toolbox In 2016 IEEEConference on Control Applications CCA 2016 1030ndash1041

[17] Timon Gehr Matthew Mirman Dana Drachsler-Cohen Petar TsankovSwarat Chaudhuri and Martin T Vechev 2018 AI2 Safety and Robust-ness Certification of Neural Networks with Abstract Interpretation In2018 IEEE Symposium on Security and Privacy SP 2018 3ndash18

[18] Jeremy H Gillula and Claire J Tomlin 2012 Guaranteed Safe OnlineLearning via Reachability tracking a ground target using a quadrotorIn IEEE International Conference on Robotics and Automation ICRA2012 14-18 May 2012 St Paul Minnesota USA 2723ndash2730

[19] Sumit Gulwani Saurabh Srivastava and Ramarathnam Venkatesan2008 Program Analysis As Constraint Solving In Proceedings of the29th ACM SIGPLAN Conference on Programming Language Design andImplementation (PLDI rsquo08) 281ndash292

[20] Sumit Gulwani and Ashish Tiwari 2008 Constraint-based approachfor analysis of hybrid systems In International Conference on ComputerAided Verification 190ndash203

[21] Xiaowei Huang Marta Kwiatkowska Sen Wang and Min Wu 2017Safety Verification of Deep Neural Networks In CAV (1) (Lecture Notesin Computer Science) Vol 10426 3ndash29

[22] Dominic W Jordan and Peter Smith 1987 Nonlinear ordinary differ-ential equations (1987)

[23] Guy Katz Clark W Barrett David L Dill Kyle Julian and Mykel JKochenderfer 2017 Reluplex An Efficient SMT Solver for VerifyingDeep Neural Networks In Computer Aided Verification - 29th Interna-tional Conference CAV 2017 97ndash117

[24] Hui Kong Fei He Xiaoyu Song William N N Hung and Ming Gu2013 Exponential-Condition-Based Barrier Certificate Generationfor Safety Verification of Hybrid Systems In Proceedings of the 25thInternational Conference on Computer Aided Verification (CAVrsquo13) 242ndash257

[25] Bettina Koumlnighofer Mohammed Alshiekh Roderick Bloem LauraHumphrey Robert Koumlnighofer Ufuk Topcu and Chao Wang 2017Shield Synthesis Form Methods Syst Des 51 2 (Nov 2017) 332ndash361

[26] Benoicirct Legat Chris Coey Robin Deits Joey Huchette and AmeliaPerry 2017 (accessed Nov 16 2018) Sum-of-squares optimization inJulia httpswwwjuliaoptorgmeetingsmit2017legatpdf

[27] Yuxi Li 2017 Deep Reinforcement Learning An Overview CoRRabs170107274 (2017)

[28] Timothy P Lillicrap Jonathan J Hunt Alexander Pritzel Nicolas HeessTom Erez Yuval Tassa David Silver and Daan Wierstra 2016 Contin-uous control with deep reinforcement learning In 4th InternationalConference on Learning Representations ICLR 2016 San Juan PuertoRico May 2-4 2016 Conference Track Proceedings

[29] Horia Mania Aurelia Guy and Benjamin Recht 2018 Simple randomsearch of static linear policies is competitive for reinforcement learningIn Advances in Neural Information Processing Systems 31 1800ndash1809

[30] J Matyas 1965 Random optimization Automation and Remote Control26 2 (1965) 246ndash253

[31] Matthew Mirman Timon Gehr and Martin T Vechev 2018 Differen-tiable Abstract Interpretation for Provably Robust Neural Networks InProceedings of the 35th International Conference on Machine LearningICML 2018 3575ndash3583

[32] Teodor Mihai Moldovan and Pieter Abbeel 2012 Safe Exploration inMarkov Decision Processes In Proceedings of the 29th InternationalCoference on International Conference on Machine Learning (ICMLrsquo12)1451ndash1458

[33] Yurii Nesterov and Vladimir Spokoiny 2017 Random Gradient-FreeMinimization of Convex Functions Found Comput Math 17 2 (2017)527ndash566

[34] Stephen Prajna and Ali Jadbabaie 2004 Safety Verification of HybridSystems Using Barrier Certificates In Hybrid Systems Computationand Control 7th International Workshop HSCC 2004 477ndash492

[35] Aravind Rajeswaran Kendall Lowrey Emanuel Todorov and Sham MKakade 2017 Towards Generalization and Simplicity in Continu-ous Control In Advances in Neural Information Processing Systems30 Annual Conference on Neural Information Processing Systems 20176553ndash6564

[36] Steacutephane Ross Geoffrey J Gordon and Drew Bagnell 2011 A Re-duction of Imitation Learning and Structured Prediction to No-RegretOnline Learning In Proceedings of the Fourteenth International Confer-ence on Artificial Intelligence and Statistics AISTATS 2011 627ndash635

[37] Wenjie Ruan Xiaowei Huang and Marta Kwiatkowska 2018 Reacha-bility Analysis of Deep Neural Networks with Provable GuaranteesIn Proceedings of the Twenty-Seventh International Joint Conference onArtificial Intelligence IJCAI 2018 2651ndash2659

[38] Stefan Schaal 1999 Is imitation learning the route to humanoid robotsTrends in cognitive sciences 3 6 (1999) 233ndash242

[39] Bastian Schuumlrmann and Matthias Althoff 2017 Optimal control ofsets of solutions to formally guarantee constraints of disturbed linearsystems In 2017 American Control Conference ACC 2017 2522ndash2529

[40] David Silver Guy Lever Nicolas Heess Thomas Degris DaanWierstraandMartin Riedmiller 2014 Deterministic Policy Gradient AlgorithmsIn Proceedings of the 31st International Conference on InternationalConference on Machine Learning - Volume 32 Indash387ndashIndash395

[41] Armando Solar-Lezama 2008 Program Synthesis by Sketching PhDDissertation Berkeley CA USA Advisor(s) Bodik Rastislav

[42] Armando Solar-Lezama 2009 The Sketching Approach to ProgramSynthesis In Proceedings of the 7th Asian Symposium on ProgrammingLanguages and Systems (APLAS rsquo09) 4ndash13

[43] Youcheng Sun Min Wu Wenjie Ruan Xiaowei Huang MartaKwiatkowska and Daniel Kroening 2018 Concolic Testing for DeepNeural Networks In Proceedings of the 33rd ACMIEEE InternationalConference on Automated Software Engineering (ASE 2018) 109ndash119

[44] Russ Tedrake 2009 LQR-trees Feedback motion planning on sparserandomized trees In Robotics Science and Systems V University ofWashington Seattle USA June 28 - July 1 2009

[45] Yuchi Tian Kexin Pei Suman Jana and Baishakhi Ray 2018 DeepTestAutomated Testing of Deep-neural-network-driven Autonomous CarsIn Proceedings of the 40th International Conference on Software Engi-neering (ICSE rsquo18) 303ndash314

[46] Matteo Turchetta Felix Berkenkamp and Andreas Krause 2016 SafeExploration in Finite Markov Decision Processes with Gaussian Pro-cesses In Proceedings of the 30th International Conference on NeuralInformation Processing Systems (NIPSrsquo16) 4312ndash4320

[47] Abhinav Verma Vijayaraghavan Murali Rishabh Singh PushmeetKohli and Swarat Chaudhuri 2018 Programmatically InterpretableReinforcement Learning In Proceedings of the 35th International Con-ference on Machine Learning ICML 2018 5052ndash5061

[48] Matthew Wicker Xiaowei Huang and Marta Kwiatkowska 2018Feature-Guided Black-Box Safety Testing of Deep Neural NetworksIn Tools and Algorithms for the Construction and Analysis of Systems -24th International Conference TACAS 2018 408ndash426

15

[49] He Zhu Zikang Xiong Stephen Magill and Suresh Jagannathan 2018An Inductive Synthesis Framework for Verifiable Reinforcement Learn-ing Technical Report Galois Inc httpherowanzhugithubio

herowanzhugithubioVerifiableLearningpdf

16

  • Abstract
  • 1 Introduction
  • 2 Motivation and Overview
    • 21 State Transition System
    • 22 Synthesis Verification and Shielding
      • 3 Problem Setup
      • 4 Verification Procedure
        • 41 Synthesis
        • 42 Verification
        • 43 Shielding
          • 5 Experimental Results
          • 6 Related Work
          • 7 Conclusions
          • References
Page 14: An Inductive Synthesis Framework for Verifiable

linear controller that we found to be insufficient to guaranteesafety for our benchmarks in our experience We often needto learn a program that involves a family of linear controllersconditioned on different input spaces to guarantee safetyThe novelty of our approach is that it can automaticallysynthesize programs with conditional statements by needwhich is critical to ensure safety

In Sec 5 linear sketches were used in our experiments forshield synthesis LQR-Tree based approaches such as [44]can synthesize a controller consisting of a set of linear qua-dratic regulator (LQR [14]) controllers each applicable in adifferent portion of the state space These approaches focuson stability while our approach primarily addresses safetyIn our experiments we observed that because LQR does nottake safeunsafe regions into consideration synthesized LQRcontrollers can regularly violate safety constraintsRuntime Shielding Shield synthesis has been used in run-time enforcement for reactive systems [12 25] Safety shield-ing in reinforcement learning was first introduced in [4]Because their verification approach is not symbolic howeverit can only work over finite discrete state and action systemsSince states in infinite state and continuous action systemsare not enumerable using these methods requires the enduser to provide a finite abstraction of complex environmentdynamics such abstractions are typically too coarse to beuseful (because they result in too much over-approximation)or otherwise have too many states to be analyzable [15] In-deed automated environment abstraction tools such as [16]often take hours even for simple 4-dimensional systems Ourapproach embraces the nature of infinity in control systemsby learning a symbolic shield from an inductive invariant fora synthesized program that includes an infinite number ofenvironment states under which the program can be guaran-teed to be safe We therefore believe our framework providesa more promising verification pathway for machine learningin high-dimensional control systems In general these effortsfocused on shielding of discrete and finite systems with noobvious generalization to effectively deal with continuousand infinite systems that our approach can monitor and pro-tect Shielding in continuous control systems was introducedin [3 18] These approaches are based on HJ reachability(Hamilton-Jacobi Reachability [7]) However current formu-lations of HJ reachability are limited to systems with approx-imately five dimensions making high-dimensional systemverification intractable Unlike their least restrictive controllaw framework that decouples safety and learning our ap-proach only considers state spaces that a learned controllerattempts to explore and is therefore capable of synthesizingsafety shields even in high dimensions

7 ConclusionsThis paper presents an inductive synthesis-based toolchainthat can verify neural network control policies within an

environment formalized as an infinite-state transition sys-tem The key idea is a novel synthesis framework capableof synthesizing a deterministic policy program based onan user-given sketch that resembles a neural policy oracleand simultaneously satisfies a safety specification using acounterexample-guided synthesis loop Verification sound-ness is achieved at runtime by using the learned determin-istic program along with its learned inductive invariant asa shield to protect the neural policy correcting the neuralpolicyrsquos action only if the chosen action can cause violationof the inductive invariant Experimental results demonstratethat our approach can be used to realize fully trustworthyreinforcement learning systems with low overhead

References[1] M Ahmadi C Rowley and U Topcu 2018 Control-Oriented Learning

of Lagrangian and Hamiltonian Systems In American Control Confer-ence

[2] Anayo K Akametalu Shahab Kaynama Jaime F Fisac Melanie NicoleZeilinger Jeremy H Gillula and Claire J Tomlin 2014 Reachability-based safe learning with Gaussian processes In 53rd IEEE Conferenceon Decision and Control CDC 2014 1424ndash1431

[3] Anayo K Akametalu Shahab Kaynama Jaime F Fisac Melanie NicoleZeilinger Jeremy H Gillula and Claire J Tomlin 2014 Reachability-based safe learning with Gaussian processes In 53rd IEEE Conferenceon Decision and Control CDC 2014 Los Angeles CA USA December15-17 2014 1424ndash1431

[4] Mohammed Alshiekh Roderick Bloem Ruumldiger Ehlers BettinaKoumlnighofer Scott Niekum and Ufuk Topcu 2018 Safe Reinforce-ment Learning via Shielding AAAI (2018)

[5] Rajeev Alur Rastislav Bodiacutek Eric Dallal Dana Fisman Pranav GargGarvit Juniwal Hadas Kress-Gazit P Madhusudan Milo M K MartinMukund Raghothaman Shambwaditya Saha Sanjit A Seshia RishabhSingh Armando Solar-Lezama Emina Torlak and Abhishek Udupa2015 Syntax-Guided Synthesis In Dependable Software Systems Engi-neering NATO Science for Peace and Security Series D Informationand Communication Security Vol 40 IOS Press 1ndash25

[6] MOSEK ApS 2018 The mosek optimization software httpwwwmosekcom

[7] Somil Bansal Mo Chen Sylvia Herbert and Claire J Tomlin 2017Hamilton-Jacobi Reachability A Brief Overview and Recent Advances(2017)

[8] Osbert Bastani Yani Ioannou Leonidas Lampropoulos Dimitrios Vy-tiniotis Aditya V Nori and Antonio Criminisi 2016 Measuring NeuralNet Robustness with Constraints In Proceedings of the 30th Interna-tional Conference on Neural Information Processing Systems (NIPSrsquo16)2621ndash2629

[9] Osbert Bastani Yewen Pu and Armando Solar-Lezama 2018 VerifiableReinforcement Learning via Policy Extraction CoRR abs180508328(2018)

[10] Richard N Bergman Diane T Finegood and Marilyn Ader 1985 As-sessment of insulin sensitivity in vivo Endocrine reviews 6 1 (1985)45ndash86

[11] Felix Berkenkamp Matteo Turchetta Angela P Schoellig and AndreasKrause 2017 Safe Model-based Reinforcement Learning with StabilityGuarantees In Advances in Neural Information Processing Systems 30Annual Conference on Neural Information Processing Systems 2017 908ndash919

[12] Roderick Bloem Bettina Koumlnighofer Robert Koumlnighofer and ChaoWang 2015 Shield Synthesis - Runtime Enforcement for ReactiveSystems In Tools and Algorithms for the Construction and Analysis of

14

Systems - 21st International Conference TACAS 2015 533ndash548[13] Leonardo De Moura and Nikolaj Bjoslashrner 2008 Z3 An Efficient SMT

Solver In Proceedings of the Theory and Practice of Software 14th Inter-national Conference on Tools and Algorithms for the Construction andAnalysis of Systems (TACASrsquo08ETAPSrsquo08) 337ndash340

[14] Peter Dorato Vito Cerone and Chaouki Abdallah 1994 Linear-Quadratic Control An Introduction Simon amp Schuster Inc New YorkNY USA

[15] Chuchu Fan Umang Mathur Sayan Mitra and Mahesh Viswanathan2018 Controller Synthesis Made Real Reach-Avoid Specifications andLinear Dynamics In Computer Aided Verification - 30th InternationalConference CAV 2018 347ndash366

[16] Ioannis Filippidis Sumanth Dathathri Scott C Livingston NecmiyeOzay and Richard M Murray 2016 Control design for hybrid sys-tems with TuLiP The Temporal Logic Planning toolbox In 2016 IEEEConference on Control Applications CCA 2016 1030ndash1041

[17] Timon Gehr Matthew Mirman Dana Drachsler-Cohen Petar TsankovSwarat Chaudhuri and Martin T Vechev 2018 AI2 Safety and Robust-ness Certification of Neural Networks with Abstract Interpretation In2018 IEEE Symposium on Security and Privacy SP 2018 3ndash18

[18] Jeremy H Gillula and Claire J Tomlin 2012 Guaranteed Safe OnlineLearning via Reachability tracking a ground target using a quadrotorIn IEEE International Conference on Robotics and Automation ICRA2012 14-18 May 2012 St Paul Minnesota USA 2723ndash2730

[19] Sumit Gulwani Saurabh Srivastava and Ramarathnam Venkatesan2008 Program Analysis As Constraint Solving In Proceedings of the29th ACM SIGPLAN Conference on Programming Language Design andImplementation (PLDI rsquo08) 281ndash292

[20] Sumit Gulwani and Ashish Tiwari 2008 Constraint-based approachfor analysis of hybrid systems In International Conference on ComputerAided Verification 190ndash203

[21] Xiaowei Huang Marta Kwiatkowska Sen Wang and Min Wu 2017Safety Verification of Deep Neural Networks In CAV (1) (Lecture Notesin Computer Science) Vol 10426 3ndash29

[22] Dominic W Jordan and Peter Smith 1987 Nonlinear ordinary differ-ential equations (1987)

[23] Guy Katz Clark W Barrett David L Dill Kyle Julian and Mykel JKochenderfer 2017 Reluplex An Efficient SMT Solver for VerifyingDeep Neural Networks In Computer Aided Verification - 29th Interna-tional Conference CAV 2017 97ndash117

[24] Hui Kong Fei He Xiaoyu Song William N N Hung and Ming Gu2013 Exponential-Condition-Based Barrier Certificate Generationfor Safety Verification of Hybrid Systems In Proceedings of the 25thInternational Conference on Computer Aided Verification (CAVrsquo13) 242ndash257

[25] Bettina Koumlnighofer Mohammed Alshiekh Roderick Bloem LauraHumphrey Robert Koumlnighofer Ufuk Topcu and Chao Wang 2017Shield Synthesis Form Methods Syst Des 51 2 (Nov 2017) 332ndash361

[26] Benoicirct Legat Chris Coey Robin Deits Joey Huchette and AmeliaPerry 2017 (accessed Nov 16 2018) Sum-of-squares optimization inJulia httpswwwjuliaoptorgmeetingsmit2017legatpdf

[27] Yuxi Li 2017 Deep Reinforcement Learning An Overview CoRRabs170107274 (2017)

[28] Timothy P Lillicrap Jonathan J Hunt Alexander Pritzel Nicolas HeessTom Erez Yuval Tassa David Silver and Daan Wierstra 2016 Contin-uous control with deep reinforcement learning In 4th InternationalConference on Learning Representations ICLR 2016 San Juan PuertoRico May 2-4 2016 Conference Track Proceedings

[29] Horia Mania Aurelia Guy and Benjamin Recht 2018 Simple randomsearch of static linear policies is competitive for reinforcement learningIn Advances in Neural Information Processing Systems 31 1800ndash1809

[30] J Matyas 1965 Random optimization Automation and Remote Control26 2 (1965) 246ndash253

[31] Matthew Mirman Timon Gehr and Martin T Vechev 2018 Differen-tiable Abstract Interpretation for Provably Robust Neural Networks InProceedings of the 35th International Conference on Machine LearningICML 2018 3575ndash3583

[32] Teodor Mihai Moldovan and Pieter Abbeel 2012 Safe Exploration inMarkov Decision Processes In Proceedings of the 29th InternationalCoference on International Conference on Machine Learning (ICMLrsquo12)1451ndash1458

[33] Yurii Nesterov and Vladimir Spokoiny 2017 Random Gradient-FreeMinimization of Convex Functions Found Comput Math 17 2 (2017)527ndash566

[34] Stephen Prajna and Ali Jadbabaie 2004 Safety Verification of HybridSystems Using Barrier Certificates In Hybrid Systems Computationand Control 7th International Workshop HSCC 2004 477ndash492

[35] Aravind Rajeswaran Kendall Lowrey Emanuel Todorov and Sham MKakade 2017 Towards Generalization and Simplicity in Continu-ous Control In Advances in Neural Information Processing Systems30 Annual Conference on Neural Information Processing Systems 20176553ndash6564

[36] Steacutephane Ross Geoffrey J Gordon and Drew Bagnell 2011 A Re-duction of Imitation Learning and Structured Prediction to No-RegretOnline Learning In Proceedings of the Fourteenth International Confer-ence on Artificial Intelligence and Statistics AISTATS 2011 627ndash635

[37] Wenjie Ruan Xiaowei Huang and Marta Kwiatkowska 2018 Reacha-bility Analysis of Deep Neural Networks with Provable GuaranteesIn Proceedings of the Twenty-Seventh International Joint Conference onArtificial Intelligence IJCAI 2018 2651ndash2659

[38] Stefan Schaal 1999 Is imitation learning the route to humanoid robotsTrends in cognitive sciences 3 6 (1999) 233ndash242

[39] Bastian Schuumlrmann and Matthias Althoff 2017 Optimal control ofsets of solutions to formally guarantee constraints of disturbed linearsystems In 2017 American Control Conference ACC 2017 2522ndash2529

[40] David Silver Guy Lever Nicolas Heess Thomas Degris DaanWierstraandMartin Riedmiller 2014 Deterministic Policy Gradient AlgorithmsIn Proceedings of the 31st International Conference on InternationalConference on Machine Learning - Volume 32 Indash387ndashIndash395

[41] Armando Solar-Lezama 2008 Program Synthesis by Sketching PhDDissertation Berkeley CA USA Advisor(s) Bodik Rastislav

[42] Armando Solar-Lezama 2009 The Sketching Approach to ProgramSynthesis In Proceedings of the 7th Asian Symposium on ProgrammingLanguages and Systems (APLAS rsquo09) 4ndash13

[43] Youcheng Sun Min Wu Wenjie Ruan Xiaowei Huang MartaKwiatkowska and Daniel Kroening 2018 Concolic Testing for DeepNeural Networks In Proceedings of the 33rd ACMIEEE InternationalConference on Automated Software Engineering (ASE 2018) 109ndash119

[44] Russ Tedrake 2009 LQR-trees Feedback motion planning on sparserandomized trees In Robotics Science and Systems V University ofWashington Seattle USA June 28 - July 1 2009

[45] Yuchi Tian Kexin Pei Suman Jana and Baishakhi Ray 2018 DeepTestAutomated Testing of Deep-neural-network-driven Autonomous CarsIn Proceedings of the 40th International Conference on Software Engi-neering (ICSE rsquo18) 303ndash314

[46] Matteo Turchetta Felix Berkenkamp and Andreas Krause 2016 SafeExploration in Finite Markov Decision Processes with Gaussian Pro-cesses In Proceedings of the 30th International Conference on NeuralInformation Processing Systems (NIPSrsquo16) 4312ndash4320

[47] Abhinav Verma Vijayaraghavan Murali Rishabh Singh PushmeetKohli and Swarat Chaudhuri 2018 Programmatically InterpretableReinforcement Learning In Proceedings of the 35th International Con-ference on Machine Learning ICML 2018 5052ndash5061

[48] Matthew Wicker Xiaowei Huang and Marta Kwiatkowska 2018Feature-Guided Black-Box Safety Testing of Deep Neural NetworksIn Tools and Algorithms for the Construction and Analysis of Systems -24th International Conference TACAS 2018 408ndash426

15

[49] He Zhu Zikang Xiong Stephen Magill and Suresh Jagannathan 2018An Inductive Synthesis Framework for Verifiable Reinforcement Learn-ing Technical Report Galois Inc httpherowanzhugithubio

herowanzhugithubioVerifiableLearningpdf


