BOTTOM-UP SYMBOLIC CONTROL AND ADAPTIVE SYSTEMS:
ABSTRACTION, PLANNING AND LEARNING
by
Jie Fu
A dissertation submitted to the Faculty of the University of Delaware in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Mechanical Engineering
Fall 2013
© 2013 Jie Fu
All Rights Reserved
BOTTOM-UP SYMBOLIC CONTROL AND ADAPTIVE SYSTEMS:
ABSTRACTION, PLANNING AND LEARNING
by
Jie Fu
Approved: Suresh G. Advani, Ph.D.
Chair of the Department of Mechanical Engineering

Approved: Babatunde A. Ogunnaike, Ph.D.
Dean of the College of Engineering

Approved: James G. Richards, Ph.D.
Vice Provost for Graduate and Professional Education
I certify that I have read this dissertation and that in my opinion it meets the academic and professional standard required by the University as a dissertation for the degree of Doctor of Philosophy.

Signed: Herbert G. Tanner, Ph.D.
Professor in charge of dissertation

I certify that I have read this dissertation and that in my opinion it meets the academic and professional standard required by the University as a dissertation for the degree of Doctor of Philosophy.

Signed: Ioannis Poulakakis, Ph.D.
Member of dissertation committee

I certify that I have read this dissertation and that in my opinion it meets the academic and professional standard required by the University as a dissertation for the degree of Doctor of Philosophy.

Signed: Joshua L. Hertz, Ph.D.
Member of dissertation committee

I certify that I have read this dissertation and that in my opinion it meets the academic and professional standard required by the University as a dissertation for the degree of Doctor of Philosophy.

Signed: Jeffrey Heinz, Ph.D.
Member of dissertation committee
ACKNOWLEDGEMENTS
First of all, I would like to express my deepest appreciation for the support from
my advisor, Dr. Herbert Tanner. His guidance and inspirational ideas have made my
Ph.D. study a rewarding, thoughtful, and joyful journey. I would like to thank Dr.
Jeffrey Heinz for his constant help, patience, and encouragement. I would like to thank
my committee members, Dr. Ioannis Poulakakis and Dr. Joshua Hertz, for their valuable
advice regarding my research and dissertation.
Over the course of my doctoral program, I have collaborated with many researchers
from the Linguistics Department. I would like to thank all the members of
the CPS group; without the insightful discussions we had, much of the progress in this
work could not have been made. In addition, I would like to thank Dr. Jim Rogers and Dr.
John Case for generously sharing their ideas and knowledge. I also wish to thank all
the members of the Cooperative Robotics Laboratory for their friendship and help during
my stay at UDel. I would like to acknowledge the funding sources NSF CAREER #0907003,
NSF CPS #1035577, and ARL MAST CTA #W911NF-08-2-0004.
Last but not least, I want to thank my family: my parents Yingxue Wu and
Yongming Fu, and my husband Juannan Zhou, for their support, understanding, and
love.
TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
ABSTRACT

Chapter

1 INTRODUCTION
1.1 Motivation
1.2 Problem Description
1.3 Challenges
1.4 Approach Overview
1.5 Technical Objectives and Contributions
1.6 Thesis Outline

2 LITERATURE REVIEW
2.1 Hybrid Systems
2.2 Abstraction
2.3 Formal Synthesis
2.4 Perspectives

3 BOTTOM-UP SYMBOLIC PLANNING
3.1 Overview
3.2 Automata and Transition Systems
3.2.1 Automata and their Semantics
3.2.2 Transition Systems
3.2.3 Register Automata and their Semantics
3.3 Hybrid Agents
3.3.1 Mathematical Model
3.3.2 Problem Statement
3.4 Abstraction
3.4.1 Predicate Abstraction and the Induced Register Automata
3.4.2 Weak (Bi)simulations
3.5 Time-optimal Planning
3.5.1 Searching for Candidate Plans
3.5.1.1 A Graph Representation
3.5.1.2 Finding Walk Candidates
3.5.2 Dynamic Programming — a Variant
3.6 Case Study
3.6.1 Control Mode a: Nonholonomic Control
3.6.2 Control Modes b and c: Catch and Release
3.6.3 The System Model
3.6.4 Task Specifications
3.6.5 Solving the Planning Problem
3.7 Conclusions

4 ADAPTIVE CONTROL SYNTHESIS — WITH PERFECT INFORMATION
4.1 Overview
4.2 Grammatical Inference and Infinite Games
4.2.1 Infinite Words
4.2.2 Grammatical Inference
4.2.3 Specification Language
4.2.4 Infinite Games
4.3 System Behavior as Game Play
4.3.1 Constructing the Game
4.3.2 Game Theoretic Control Synthesis
4.4 Integrating Learning with Control
4.4.1 Languages of the Game
4.4.2 Learning the Game — a First Approach
4.4.3 Learning an Equivalent Game
4.4.3.1 Equivalence in Games
4.4.3.2 Learning an Equivalent Game from Positive Presentations
4.5 Case Study
4.6 Conclusions

5 CONTROL SYNTHESIS WITH PARTIAL OBSERVATIONS
5.1 Overview
5.2 Symbolic Synthesis with Partial Observations
5.2.1 The Model
5.2.2 Deterministic Finite-memory Strategy
5.2.3 Randomized Finite-memory Controllers
5.2.3.1 Randomized Controllers for Reachability Objectives
5.2.3.2 Randomized Controllers for Büchi Objectives
5.3 Discussions and Conclusions

6 OUTLOOK: MULTI-AGENT SYSTEMS
6.1 Overview
6.2 Preliminaries
6.3 Modeling a Multi-agent Game
6.3.1 Constructing the Game Arena
6.3.2 Specification Language
6.3.3 The Game Formulation
6.4 Game Theoretic Analysis
6.4.1 Pure Nash Equilibria
6.4.2 Special Cases — Büchi and Reachability Objectives
6.4.2.1 Deterministic Büchi Games
6.4.2.2 Reachability Games
6.4.3 Security Strategies
6.4.4 Cooperative Equilibria
6.5 Case Study
6.5.1 Reachability Objectives
6.5.2 Büchi Objectives
6.5.3 Strategy Alternatives for Agent 3
6.6 Conclusions

7 CONCLUSIONS AND FUTURE WORK

BIBLIOGRAPHY

Appendix

A ASYMPTOTIC (T,D) EQUIVALENCE CLASSES
B LEARNING ALGORITHM FOR THE CLASS OF STRICTLY K-LOCAL LANGUAGES
LIST OF TABLES

3.1 Pre and Post maps for the control modes of the hybrid agent.

6.1 Nash equilibria for all payoff vectors in concurrent game G with reachability objectives.

6.2 Nash equilibria for all payoff vectors in concurrent game G with Büchi objectives (in the case when ε ∉ Σi, for all i ∈ Π).
LIST OF FIGURES

3.1 An example of a 2-register automaton.

3.2 The transformation semiautomaton TR(H) of hybrid agent H, for the task specification considered.

3.3 Discretized workspace for the mobile manipulator and the optimal path. The two concentric collections of points mark parameter class representatives around the object and user positions.

4.1 The architecture of hybrid planning and control with a module for grammatical inference.

4.2 Learning and planning with a grammatical inference module.

4.3 The environment and its abstraction.

4.4 A fraction of (G, v0), where v0 = ((1, c, 1), 1). A state ((q1, q2, t), qs) means the robot is in q1; the recent consecutively closed (at most two) doors are q2; t = 1 if player 1 is to make a move, otherwise player 2 is to make a move; and the visited rooms are encoded in qs, e.g., 12 means rooms 1 and 2 have been visited.

4.5 Convergence of learning L2(G, v0): the ratio between the size of the grammar inferred by the GIM and that of L2(G, v0), in terms of the number of moves made.

5.1 A fragment of a game graph P and the observed game structure obs P.

6.1 Relating runs r in multi-agent game G, ρ in the two-player turn-based game graph H, and τ in objective Aj.

6.2 A partitioned rectangular environment in which three agents roam.

6.3 A fragment of the multi-agent arena P = 〈Q, ACT, T〉.

6.4 A fragment of the two-player turn-based game arena H.

6.5 Fragment of the partial synchronization product H.

6.6 The finite state automata (FSA) representing the agent objectives.

6.7 Fragment of the two-player turn-based arena H1.
ABSTRACT
This thesis develops an optimal planning method for a class of hybrid systems, and combines machine learning with reactive synthesis to construct adaptive controllers for finite-state transition systems with respect to high-level formal specifications in the presence of an unknown, dynamic environment.

For a class of hybrid systems that switch between different pre-defined low-level controllers, this thesis develops an automated method that builds time-optimal control mode sequences satisfying given system specifications. The planning algorithm proceeds in a bottom-up fashion. First, it abstracts a hybrid system of this class into a special finite-state automaton that can manipulate continuous data expressing attributes of the concrete dynamics. The abstraction is predicate-based, enabled by the convergence properties of the low-level continuous dynamics, and it encompasses existing low-level controllers rather than replacing them during synthesis. The abstraction is weakly simulated by its concrete system, and thus behaviors planned using the abstraction are always implementable on the concrete system.
The abstraction procedure bridges hybrid dynamical systems with formal language and automata theory, and enables us to import concepts and methodologies from those fields into control synthesis. In the presence of an unknown, dynamic, and potentially adversarial environment, this thesis develops an adaptive synthesis method, and advocates the integration of ideas from grammatical inference and reactive synthesis toward the development of an any-time control design that is guaranteed to be effective in the limit. The insight is that, at the abstraction level, the behavior the unknown environment exhibits during its interaction with the system can be treated as an unknown language, which can be identified by a learning algorithm from a finite amount of observed behavior, provided some prior knowledge is given. As the fidelity of the environment model improves, the control design becomes more effective.
The thesis then considers reactive synthesis under partial observation (not all environment actions can be completely observed by the system) and with multiple agents. Reactive control synthesis methods are developed for systems with incomplete information that ensure the specifications are satisfied surely, or almost surely (with probability 1). For the synthesis of controllers for multiple concurrent systems with individual specifications, an approach in which each system treats the others as adversarial can be unnecessarily restrictive. This thesis presents a decision procedure for agent behaviors that is based on the solution of a concurrent, multi-agent infinite game. Depending on the context in which interaction takes place, solutions can come in the form of pure (deterministic) Nash equilibria, security strategies, or cooperative pure Nash equilibria. The analysis and methods presented can be extended to the case of decentralized control design for multiple reactive systems. The thesis concludes with a brief overview and possible future research directions.
Chapter 1
INTRODUCTION
The concept of hybrid systems provides a rich modeling framework, as it captures the interaction of discrete and continuous dynamics. The discrete dynamics can express the finite operational modes the system can be in, for example, the flight modes of an aircraft such as hovering, descending, etc. The continuous dynamics are determined by the physics governing the evolution of continuous states, such as position, velocity, and acceleration, in each mode.
Many application domains for such systems are found in safety-critical settings, such as automated highway systems [53, 74], air-traffic management systems [43, 68], robotics [1, 103], etc. For such systems, the specifications and performance requirements can be more general than maintaining stability. Examples of such specifications are liveness (something good will always eventually happen), safety (nothing bad will ever happen), and fairness (all constituent processes will evolve and none will starve). While the first can to some extent be captured by notions of convergence and invariance in a classical continuous dynamical systems framework, it is not so clear how to handle the other two.
Take the air traffic control (ATC) system as an example: an ATC system has to keep track of several aircraft simultaneously, and has to be robust to delays, conflicts, weather, etc. The control objective of such a system is to ensure safety and fairness, which concerns scheduling take-offs and landings in a timely fashion. This objective cannot be expressed in a traditional specification language using terms such as asymptotic stability or settling time. For this purpose, temporal logic [81] can be used as the specification language, not only because it can express high-level control objectives, but also because it allows us to reason about the ordering of events without introducing time explicitly.
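As a hedged illustration (these formulas are not taken from the thesis; p, bad, and run_i are placeholder atomic propositions), the three requirement classes above have canonical linear temporal logic renderings, with G read as "always" and F as "eventually":

```latex
\underbrace{\mathbf{G}\,\mathbf{F}\,p}_{\text{liveness}}
\qquad
\underbrace{\mathbf{G}\,\lnot \mathit{bad}}_{\text{safety}}
\qquad
\underbrace{\textstyle\bigwedge_{i}\mathbf{G}\,\mathbf{F}\,\mathit{run}_i}_{\text{fairness}}
```

Note that none of the three mentions clock values: each constrains only the relative ordering and recurrence of events, which is exactly the point made above.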
When the specification is given as a logical formula, which is discrete in nature, a heterogeneity arises between the hybrid dynamics of the system and its discrete specification. One needs to bring both the system and its specification into the same formal framework to allow for analysis. This is what motivates abstraction-based symbolic synthesis: it enables control design with temporal logic specifications for reactive hybrid systems that continuously interact with a dynamic environment. In the ATC system, to coordinate the aircraft effectively, a symbolic approach can be developed by defining a set of conflict resolution maneuvers [98]. At the low-level design, each maneuver is a finite sequence of flight modes, such as heading, altitude, and speed changes, for the purpose of avoiding a conflict between two aircraft. With these pre-defined maneuvers, one can synthesize a symbolic control policy that ensures the satisfaction of all specifications, without having to reason explicitly about the continuous dynamics of each aircraft. Thus, with abstraction and symbolic synthesis, we can alleviate the difficulty of control design for hybrid systems and reconcile the heterogeneity between the system and its high-level logic specifications. Furthermore, through discrete abstractions, many methods for analyzing discrete systems become applicable to hybrid systems. In this dissertation, it is shown that a branch of machine learning, grammatical inference [30], can be introduced as a system identification module at the abstraction level, which is a key ingredient in constructing adaptive controllers for hybrid systems.
This chapter presents some motivation for the research described in this disser-
tation and gives a general description of the problem on which it focuses. Following
that, an overview of the approach and a brief description of the technical contributions
of this research are provided.
1.1 Motivation
A significant part of the formal analysis and synthesis of hybrid systems is driven by the idea of lifting analysis and control problems in systems with both continuous and discrete dynamics into systems with purely discrete dynamics, for which computationally efficient methods can be applied. Discrete representations of systems accommodate formal, high-level logic specifications, which are easy for system designers to understand and sufficient for expressing many complex system requirements. Generally, these methods are implemented in a two-step, hierarchical framework: a discrete controller is synthesized with respect to the control objective of the high-level discrete system, and then implemented by the low-level dynamics of the underlying hybrid system.
This hierarchical framework relies on an abstraction procedure, which constructs a discrete, finite-state transition system from the hybrid, infinite-state system, and permits analysis of the hybrid system to be performed equivalently on the discrete system with efficient computational methods. In general, the major difference between abstraction methods lies in their definitions of a state-equivalence relation. This relation induces partitions of the continuous state space such that each block contains states that behave in a "similar" manner; different blocks are then mapped to different abstract states. Currently, many abstraction methods are difficult to scale to real-world applications because of state explosion: the number of abstract states is typically exponential in the number of variables used to define the equivalence relation. To combat this, one direction is to explore a bottom-up abstraction method: for complex dynamical systems, one can design a finite set of low-level controllers for simple objectives, and then reduce complex problems to problems of finding a concatenation sequence of existing low-level controllers.
With abstraction techniques for hybrid systems in place, logical analysis can be used for abstraction-based verification and control design with respect to high-level system specifications and control objectives. In the verification of hybrid systems, model checking and theorem proving tackle different aspects of the verification problem. Model checking provides systematic ways to explore the state space of the abstract system and to verify whether a given property holds in the system. It can facilitate the detection of errors and guide system design by generating a counterexample explaining why a given property fails to hold. Theorem proving, on the other hand, provides a constructive proof showing why the correct behavior is exhibited by the system.
Intuitively, formal verification treats the system as a program, and tests or proves whether it performs as desired. Formal control synthesis is analogous to designing a program that meets a given specification. However, considering a system in isolation is always problematic because of exogenous inputs from the environment. Our goal is to synthesize controllers such that no matter how the uncontrollable, dynamic environment behaves, the system can still accomplish the assigned task. Assuming worst-case behavior for the environment, we formulate the interaction between a system and its environment as a two-player, zero-sum game: the system player aims to satisfy the desired specification while the environment player interferes to violate it. With this game formulation, we are able to convert the control design problem into finding a winning strategy for the system player against the environment. In this respect, algorithmic game theory is employed to synthesize controllers that autonomously react to external inputs from the environment in real time.
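The winning-strategy computation alluded to above can be sketched, for the simplest case of a turn-based reachability game, by the classical attractor iteration. The code below is an illustrative sketch, not from the thesis; the state names, ownership map, and edge relation are invented for the example:

```python
def attractor(states, owner, edges, target):
    """Compute the set of states from which the 'sys' player can force
    a visit to `target` in a turn-based game.

    owner[s] is 'sys' or 'env'; edges[s] lists the successors of s.
    """
    win = set(target)
    changed = True
    while changed:
        changed = False
        for s in states:
            if s in win:
                continue
            succ = edges[s]
            # System states: one successor in win suffices (system chooses).
            # Environment states: every successor must be in win.
            if owner[s] == "sys":
                ok = any(t in win for t in succ)
            else:
                ok = bool(succ) and all(t in win for t in succ)
            if ok:
                win.add(s)
                changed = True
    return win

# Tiny example: the system can force reaching s3 from s0,
# even though the environment moves at s1.
states = ["s0", "s1", "s2", "s3"]
owner = {"s0": "sys", "s1": "env", "s2": "sys", "s3": "sys"}
edges = {"s0": ["s1"], "s1": ["s2", "s3"], "s2": ["s3"], "s3": []}
print(sorted(attractor(states, owner, edges, {"s3"})))
# ['s0', 's1', 's2', 's3']
```

A controller is then read off the fixed point: at each system state in the winning region, play any edge that stays inside it.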
This game-theoretic approach, also known as reactive synthesis, produces correct-by-construction controllers, provided the environment dynamics and the initial condition of the system satisfy some known assumptions in the form of logic formulas. However, when these assumptions mistakenly or incompletely capture the dynamics of the environment, the synthesis may lead to errors and poor system performance. One approach to tackling this problem [70] is to combine iterative motion planning with reactive synthesis. This method, while capable of discovering changes in the environment model, cannot do so for the underlying dynamics that generate the changes. As a consequence, the iterative planning approach in [70], which no longer guarantees the satisfiability of the original specification, strives to satisfy some specification close to it.
The limitations of these synthesis methods are caused by the incompleteness of the system's knowledge of its environment. With insight from model-based adaptive control in traditional control theory, and from machine learning in artificial intelligence, we are interested in the following question: is there a method to complete our knowledge of the environment through its observed behavior? The rationale is that once a correct model of the environment is obtained, the problem reduces to a typical reactive synthesis problem, solvable with various existing methods. The question itself points toward a direction for answering it: the process of acquiring knowledge from experience is what we call learning. If a system learns about its unknown environment from past observations, then it can improve its current controller autonomously, adapting to the knowledge obtained about its adversary.
So far, we have considered only the case involving a single system and its dynamic environment. For this case, it is reasonable to treat the latter as an opponent. However, when the environment is composed of multiple autonomous systems, each having its own objective and preferences over the outcomes of their interaction, this view of the environment can be quite conservative: it does not allow for the possibility of cooperation for mutual benefit. Indeed, the central problem in mechanism design [78] is to motivate autonomous interacting agents by giving them the right incentives, so that some desired behavior emerges as a stable expression of their interaction. To this end, this thesis studies a range of problems in reactive synthesis for hybrid systems: abstraction, control design with incomplete information, and analysis of multi-agent systems. We show that the proposed solutions and methods can be incorporated into a coherent framework for symbolic control of hybrid systems.
1.2 Problem Description
In reactive synthesis for hybrid systems, controllers are synthesized through
discrete abstractions, and then implemented with the low-level concrete dynamics.
Similarly, adaptive symbolic control synthesis emphasizes the incorporation of learning
into synthesis at the discrete level, through appropriate abstractions for hybrid systems.
This thesis mainly aims to solve the following problems:
1. Given a system that is capable of switching between different parameterizable control modes, each with well-defined convergent continuous dynamics, determine an optimal plan in the form of a sequence of parameterized control modes such that, by executing the plan, the system is steered to a desired final state from any given initial state.

2. In the presence of an unknown, dynamic, adversarial environment, a system aims to accomplish a task specified by a high-level logic formula. Assuming that some knowledge of the environment is given a priori, construct a controller that adapts to the dynamics of the environment and eventually converges to one that allows the system to accomplish the task, whenever such an outcome is possible.

3. When a system interacting with a dynamic environment has limited sensing capabilities, the control design can be unrealizable if we require that the task be completed with certainty. Instead, under a probabilistic measure of success, i.e., requiring the task to be completed with probability 1, can we obtain more permissive control strategies for partial-observation cases?

4. For the special class of multi-agent games that arise as models of interaction among self-interested agents with objectives expressed as high-level logic formulas, what are the decision procedures for solution concepts from traditional game theory, such as Nash equilibria, security strategies, and cooperative equilibria?
1.3 Challenges
While control synthesis at the high level, based on abstraction, is attractive from both theoretical and practical perspectives, challenges remain in the design of a scalable abstraction method and in the incorporation of learning into symbolic control synthesis. In particular, what exactly is meant by learning at this abstract level is rarely formalized, even though there is a clear conceptual link between what we call adaptive symbolic control and model-based adaptive control in traditional control theory.
The problem of scalability when performing abstraction and synthesis is pressing. A significant amount of work [27, 51] on improving the scalability of abstraction focuses on the counterexample-guided abstraction refinement (CEGAR) approach [29]. This method allows us to compute an abstract system that omits enough detail to overcome the state-explosion problem, while retaining sufficient information to ensure that the specified objective can be met. The limitation of this method is that, by construction, the abstract model generated by CEGAR is task-oriented: once the objective changes, the abstract system needs to be recomputed. So far, an abstraction method that shares the advantages of CEGAR but is not task-oriented has yet to be developed. Here, we take a first step in this direction by focusing on a specific subclass of hybrid systems. For this class, we aim to develop an abstraction method that fulfills the requirements of scalability and sufficiency while being independent of the task specifications.
The other main challenge in hierarchical control design is to ensure that the system satisfies the task specification in the presence of disturbances and exogenous inputs from a dynamic environment. Current limitations of reactive synthesis for symbolic control are primarily caused by incomplete knowledge about this environment. Because of this, introducing learning into symbolic control seems almost natural. However, among the different learning paradigms in the machine learning literature, it is not clear which one interfaces well with symbolic synthesis. The choice of a learning paradigm depends not only on the information accessible to the system during its interaction with its environment, but also on what we decide we want to learn. In cases where the system interferes with the behavior of its environment, it is more meaningful to learn the interaction rather than the environment itself. Then, to be able to adapt the controller in real time, we need the learning algorithm to update the environment model efficiently. Determining appropriate learning paradigms is thus a crucial step. Eventually, our goal is to establish a complete framework that combines learning with symbolic synthesis seamlessly.
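To make the notion of a learning paradigm concrete: Appendix B concerns learning strictly k-local languages, and for k = 2 the learner amounts to collecting the bigrams attested in positive data. The following sketch is illustrative only (the function names and the boundary marker "#" are my own choices, not the thesis's notation):

```python
def learn_sl2(sample):
    """Infer a strictly 2-local grammar (a set of permitted bigrams)
    from positive example strings; '#' marks word boundaries."""
    grammar = set()
    for word in sample:
        padded = "#" + word + "#"
        grammar.update(zip(padded, padded[1:]))
    return grammar

def accepts(grammar, word):
    """A word belongs to the language iff every one of its
    boundary-padded bigrams is permitted by the grammar."""
    padded = "#" + word + "#"
    return all(bg in grammar for bg in zip(padded, padded[1:]))

g = learn_sl2(["ab", "abb"])
print(accepts(g, "abbb"))   # True: uses only attested bigrams
print(accepts(g, "ba"))     # False: bigram ('#', 'b') never observed
```

The learner converges in the limit from positive data alone, which is the kind of guarantee sought when interfacing learning with symbolic synthesis.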
1.4 Approach Overview
Let us first describe the class of hybrid systems we are focusing on. A dy-
namical system in this class is capable of switching between different continuous-state
controllers, each of which yields some convergent closed-loop continuous dynamics. In
particular, these low-level controllers are parameterizable, and the parameters allow
the designer to ensure that at steady state, the state of the system is within a desired
set.
When we say that we plan the behavior of such a system, we mean that we
determine a temporal sequence of parameterized low-level controllers. Due to the
underlying convergent dynamics of low-level controller, restrictions have to be enforced
on the concatenation of two control modes. We approach an optimal planning problem
for this class of systems in a bottom-up way: first, we propose an abstraction method that generates a discrete, finite-state abstract model which exposes and offers access to these (continuous) parameters. This abstract model accepts data words, which are sequences of tuples consisting of a label identifying the control mode to be activated, and a set of parameters for this mode. By proving that the concrete hybrid system simulates the abstract one, it is guaranteed that a data word accepted by the abstract model can always be translated into an implementable plan in the original system.
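As a concrete illustration, a data word can be represented as a plain sequence of (label, parameters) atoms. The sketch below is a minimal Python rendering of this structure; the mode names ("goto", "grasp") are hypothetical and chosen purely for illustration.

```python
from typing import NamedTuple, Tuple, List

class Atom(NamedTuple):
    """One atom of a data word: a control-mode label plus its parameters."""
    label: str
    params: Tuple[float, ...]

def string_projection(word: List[Atom]) -> List[str]:
    """The sequence of mode labels carried by the data word."""
    return [a.label for a in word]

def data_projection(word: List[Atom]) -> List[Tuple[float, ...]]:
    """The sequence of parameter values carried by the data word."""
    return [a.params for a in word]

# A hypothetical plan: activate mode "goto" with a target point,
# then mode "grasp" with a gripper aperture.
plan = [Atom("goto", (1.0, 2.0)), Atom("grasp", (0.05,))]
```

Translating such a word into an executable plan then amounts to activating each labeled mode with its attached parameters, in order.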
After translating the original specification into a specification on the abstract
model, we develop an optimal planning method at the abstract level. This method
includes two steps: the first step is to compute a finite sequence of control modes on
a finite-state transition system induced from the abstract model. In the second step,
a variant of dynamic programming is developed to determine the parameterization of
the control modes found in this sequence. The plan may eventually be sub-optimal due
to the over-approximation of continuous dynamics during the abstraction procedure.
With such an abstraction method at hand, we consider the control synthesis
problem in the presence of unknown, dynamic environments purely at the abstraction
level. We assume both the system and the environment admit discrete abstractions
in the form of some finite-state transition systems. With these abstract models, we
describe the interaction between the system and its environment as a two-player, turn-
based game. Assuming some prior knowledge is given to the system about the class of
models that its adversary may admit, a particular machine learning approach known
as grammatical inference (GI) is introduced and utilized as a system identification
method at the abstraction level. The adaptive control design proposed here brings
together exploration and exploitation: when executing the current controller, the sys-
tem explores the environment and collects the information in the form of sensory data,
which then serve as inputs to the GI module. With this information, the system re-
fines the abstract model of its environment, and subsequently the game it is involved in, through an algorithm that learns the discrete dynamics of the environment (or
their interaction) under some learning criterion. The controller is then adapted to this
new model generated by the learning algorithm. The convergence and correctness of this adaptive controller are guaranteed by the convergence of learning. It represents a
method for reactive control synthesis which is, in the limit, correct-by-construction.
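Schematically, the exploration/exploitation loop described above can be written as follows. All callbacks are hypothetical placeholders for the components developed later in the dissertation (reactive synthesis, the GI module, controller execution), not actual implementations.

```python
def adaptive_control(initial_model, synthesize, learn, execute, converged):
    """Schematic exploration/exploitation loop of the adaptive symbolic
    controller. Hypothetical callback signatures:
      synthesize(model)   -> controller winning against `model`
      execute(controller) -> finite batch of observed environment moves
      learn(model, data)  -> refined model consistent with all data so far
      converged(old, new) -> True once learning has stabilized
    """
    model = initial_model
    while True:
        controller = synthesize(model)        # exploit current knowledge
        observations = execute(controller)    # explore: collect sensory data
        refined = learn(model, observations)  # grammatical-inference step
        if converged(model, refined):
            return controller                 # correct in the limit
        model = refined
```

Because learning is decoupled from synthesis, any identification algorithm with a convergence guarantee can be plugged into the `learn` slot.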
In the case where the system has only partial observations of what goes on
around it, its interaction with the environment takes the form of a two-player, turn-
based game with incomplete information. We approach the synthesis problem using a
knowledge-based subset construction and distinguish two different success criteria for
control design: one is the notion of a sure-winning control strategy, which ensures that the goal is satisfied with certainty, and the other is that of an almost-sure control strategy, which
ensures the goal is satisfied with probability 1. We show that when the information is
complete, there is no difference between these two types of controllers. However, when
one allows probabilistic measures of success, we can find solutions that almost-surely
succeed in cases where sure-winning control strategies do not exist.
We extend formal synthesis methods to multi-agent systems with control specifications expressed in some formal logic. We model the interaction among multiple systems as a
multi-agent, concurrent, infinite game. A play in such a game is a possibly infinite
sequence of interleaved states and concurrent actions of all systems involved. Solution concepts from algorithmic game theory are employed to compute any emergent stable interaction between these rational systems, which comes in the form of a (Nash) equilibrium. We propose a decision procedure for equilibria in this class of games by
converting the multi-agent, concurrent infinite game into a two-player, turn-based in-
finite game. The winning strategy of one player in the latter can be translated into an
equilibrium in the former. Then we examine coalition games, which often arise when
individual systems are allowed to team-up for their mutual benefit.
1.5 Technical Objectives and Contributions
This dissertation integrates abstraction, optimal planning, symbolic reactive
synthesis, and machine learning to build a coherent framework for abstraction-based
adaptive control design for hybrid systems in the presence of unknown, dynamic en-
vironments. The major technical objectives and contributions of this dissertation are
the following.
1. A novel abstraction method for a class of hybrid systems. The method is scal-
able, technically sound and independent of the choice of the control objectives.
We show that with this abstraction method, an optimal planning problem can
be solved using the abstract model, and the solution is guaranteed to be imple-
mentable in the original hybrid system.
2. An adaptive symbolic control framework that integrates learning with reactive
synthesis. With correct elementary prior knowledge about the class of the un-
known, dynamic and adversarial environment the system may be interacting
with, the framework allows an automatic adaptation of the control design with
respect to any new knowledge acquired about the environment. We ensure that
the adaptive controller eventually meets the given control objective, whenever
possible.
3. A reactive synthesis method for systems with partial observation, with two different measures of success (sure and almost-sure) for control design. We show that the control design can be made more permissive by requiring only that the specification be satisfied almost surely (with probability 1).
4. Decision procedures for stable equilibrium behaviors during the interactions of
multiple systems, where each has independent control objectives. Concepts from
algorithmic game theory are brought to bear and adapted for the stability analysis
of this class of multi-agent systems. The results can be applied to the design of
decentralized multi-agent systems with control objectives expressed in some class of formal logic languages.
1.6 Thesis Outline
This dissertation is organized as follows. In Chapter 2 we provide a literature
review that offers background behind most of the theoretical results of this disserta-
tion, and we motivate the approach followed here. Chapter 3 presents a method for
abstraction and optimal planning applicable to a class of hybrid systems. We introduce
a new discrete model — the register automaton — as an abstraction model and provide
a bottom-up, hierarchical, optimal planning method with respect to given reachability
specifications. Adaptive symbolic control design methods for systems operating in an
unknown, dynamic environment are presented in Chapter 4, for the case of complete information (every state variable and environment action is observable). Experimental results that show the convergence of the adaptive controller are presented. For systems
with sensing uncertainties, Chapter 5 treats the control synthesis problem with respect
to different measures of success. We show that for a temporal logic specification such
as liveness, the controller must have finite memory and may need to be randomized.
Chapter 6 focuses on game-theoretic modeling and analysis for multi-agent concurrent
systems, in which each agent has its own task specification and preference over the
outcomes. We present decision procedures for equilibria and security strategies, and
discuss the possibility of utilizing the developed methods in decentralized control de-
sign for multi-agent systems with respect to temporal logic specifications. Chapter 7
concludes the dissertation and focuses on possible directions for future work.
Chapter 2
LITERATURE REVIEW
2.1 Hybrid Systems
As embedded computing becomes more pervasive, many engineering systems
are designed to incorporate both discrete and continuous dynamics, giving rise to
dynamical systems that are known as hybrid.
A hybrid automaton [50] is a mathematical model for a hybrid system, which
facilitates analysis and control design by combining concepts from formal language the-
ory and dynamical systems into a single theoretical framework. A hybrid automaton
includes a set of discrete system states, or modes, and transition relations between
them. In each discrete mode, the continuous state of the hybrid system is governed by
flow conditions specified by differential equations. Transitions in a hybrid automaton
can be of two types: one is the evolution that the continuous states undergo, determined by the dynamics of a given control mode; the other is the transition between discrete modes, triggered by a logical condition called the guard of that transition.
Hybrid automata can capture many characteristics of systems in practice. There are
even stochastic hybrid automata [54], which incorporate randomness into both the
discrete transitions and the continuous dynamics of discrete modes, and are capable
of expressing some of the inherent uncertainty in the system and its environment in
many real world applications.
Analysis and control of hybrid automata have been utilized in several safety-critical applications [69]. A safety property may require the state trajectories to reach a specified set while avoiding unsafe regions. In robotic systems, which constitute a nontrivial class of hybrid systems, liveness properties are often important. For instance,
in a surveillance task, a robot or a team of robots may have to ensure that some critical
regions are visited infinitely often. These types of system specifications, in general,
cannot be translated into traditional control objectives. For this purpose, temporal
logic [71] is employed to specify these high-level requirements [2, 26]. The downside
is that the introduction of such high-level requirements renders analysis and control
design problems for hybrid systems even more difficult.
To tackle these challenges, a multi-disciplinary approach to hybrid systems has
emerged, that borrows methods from computer science, control engineering and applied
mathematics. It fosters a large and growing body of work on hierarchical control design
and formal methods for hybrid systems.
2.2 Abstraction
One critical step in formal methods is the procedure of abstraction. Abstraction
can map a hybrid system into a purely discrete model of computation, which approximates the dynamics of the original system. The purpose of abstraction is to lift the
verification and control synthesis problems from the original continuous/discrete space
to a discrete space, where some of these problems can be solved at smaller analytical and computational cost.
To solve a problem in the discrete domain and transfer the solution back to the hybrid one, a formal relation between the abstract system and its concrete hybrid counterpart has to be established. To this end, a
variety of abstraction methods have been developed, based on simulation, bisimulation
relations [17], or approximate bisimulation relations [42, 92]. Depending on how a set
of continuous states is grouped into a single abstract state, an abstraction method can
be predicate-based or discretization-based, or a mix of these two. In predicate-based
methods [3], a set of predicates is selected to partition the state space, and the size
of abstract systems is exponential in the number of predicates. In discretization-based
methods such as [60, 61, 82, 94, 97], the size of the abstract model depends on the
chosen discretization resolution.
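For intuition, a predicate-based abstraction can be sketched as mapping each continuous state to the tuple of truth values of the chosen predicates; with n predicates there are at most 2^n abstract states. The predicates below are hypothetical examples over a 2-D state space.

```python
from itertools import product

# Hypothetical predicates partitioning a 2-D continuous state space.
predicates = [
    lambda x: x[0] >= 0.0,                   # right half-plane
    lambda x: x[1] >= 0.0,                   # upper half-plane
    lambda x: x[0]**2 + x[1]**2 <= 1.0,      # inside the unit disc
]

def abstract_state(x):
    """Map a continuous state to the tuple of predicate truth values."""
    return tuple(p(x) for p in predicates)

# All candidate abstract states; some tuples may correspond to empty
# regions, but the count is exponential in the number of predicates.
abstract_space = list(product([False, True], repeat=len(predicates)))
```

Every continuous state in the same cell of this partition is collapsed onto the same abstract state, which is exactly where the exponential blow-up in the number of predicates comes from.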
For abstractions to be useful, we need to omit some details of the concrete dynamics of the system. On the other hand, we also need to keep enough information so that
the important part of the behavior of the hybrid system can be correctly captured in its
abstract model. To strike a balance, counterexample-guided abstraction refinement (CEGAR), originally developed for program verification, has been introduced
for predicate abstraction of hybrid systems [4, 27]. The method starts with a coarse
abstract model of the hybrid system and refines it iteratively until a specified safety
property is verified to be satisfiable in the abstract model. Hence, the abstract system
retains just enough information to exhibit the satisfiability of the safety property. In
this way, CEGAR-based methods mitigate the state-explosion problem to some extent. However, this method has two limitations: first, the abstract model obtained for one specification may need to be further refined in order to exhibit the satisfiability of another; second, the approach has so far only been applied to safety properties, and its extension to abstractions with respect to liveness is not clear.
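The CEGAR loop itself has a simple schematic shape, sketched below with hypothetical callbacks standing in for model checking, counterexample validation against the concrete system, and refinement.

```python
def cegar(initial_abstraction, check, is_spurious, refine):
    """Counterexample-guided abstraction refinement, schematically.
    Hypothetical callback signatures:
      check(A)         -> None if the safety property holds on A,
                          else an abstract counterexample trace
      is_spurious(cex) -> True if the trace has no concrete counterpart
      refine(A, cex)   -> finer abstraction excluding the spurious trace
    """
    A = initial_abstraction
    while True:
        cex = check(A)
        if cex is None:
            return ("safe", A)        # property verified on the abstraction
        if not is_spurious(cex):
            return ("unsafe", cex)    # genuine counterexample found
        A = refine(A, cex)            # eliminate the spurious behavior
```

The loop terminates with an abstraction that retains just enough information to settle the given safety property, which is why a different property may force further refinement.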
Once the appropriate abstraction methods are obtained, verification and syn-
thesis for a hybrid system can proceed using its abstract model, with tools and methods
from model checking [28], theorem proving and algorithmic game theory [45].
2.3 Formal Synthesis
Due to the complexity and increased scale of most engineering systems, re-
searchers are interested in applying computational, algorithmic approaches to analysis
and control design of hybrid systems. For this purpose, model checking and reactive
synthesis have been introduced.
Formal synthesis of control protocols can be performed both in a top-down and
in a bottom-up fashion, or even some variant of both. Top-down approaches [34] take
the specification as an input and directly construct a hybrid automaton that satisfies
the specification. Bottom-up approaches [59], alternatively, start with the abstraction
of a given system, and attempt to determine how to satisfy a given specification in
the system at hand. For stochastic systems, which can be abstracted or modeled as
Markov decision processes (mdps), optimal control design under temporal logic constraints has been developed in a bottom-up fashion [32]. An integration of both top-down
and bottom-up approaches is introduced [62] for solving temporal logic robot motion
planning problems. In this mixed approach, the control protocol is designed at the
abstract level, and then each discrete output indicated by the discrete controller is
refined into a sequence of implementable low-level controllers in the hybrid system.
The environment dynamics can be captured by introducing uncontrollable or
probabilistic transitions into the system’s dynamics. Alternatively, reactive synthesis captures the interaction between a system and its environment as a two-player, zero-
sum game, in which the environment is assumed to satisfy a logical formula. Reactive
synthesis provides us with both a decision procedure and a control design algorithm, which,
given a system, a desired property and an assumption on the dynamics of its envi-
ronment, will verify whether there exists a realizable controller that ensures that the
property holds in the system. The application of reactive synthesis in hybrid systems
is frequently encountered in robot motion planning. General Reactivity (1) (GR(1)) formulas [80], a fragment of Linear Temporal Logic (ltl), have
been used in [63] to express the dynamics of the system, the environment behavior,
and the task specification. The satisfiability of a GR(1) formula, once verified, leads
to an automaton that implements a discrete control protocol. This discrete solution is
then refined on the concrete hybrid system and ensures the completion of the task. To
alleviate the computational complexity of ltl synthesis, one can employ receding hori-
zon control [106]. Meanwhile, to improve the efficiency of synthesis, a compositional
method can be brought to bear [105]: given a set of specifications capturing various safety and liveness constraints in a system, controllers are synthesized separately for the individual specifications and then combined in their parallel executions. Synthesis methods have also been developed for infeasible tasks [99]: an algorithm synthesizes a plan that is allowed to violate only the lowest-priority rules, and only for the shortest amount of time.
Correctness of the controller in reactive synthesis can be guaranteed if the assumption on the environment, i.e., the environment formula, is satisfied during the interaction between system and environment. For partially unknown environments, a
multi-layer synergistic framework that implements an iterative planning strategy has
been proposed [70]. When the original specification becomes unrealizable due to new,
just-discovered environment constraints, a new plan is synthesized so as to satisfy the
specification as closely as possible, according to a pre-defined metric of proximity to satisfaction. Although the discovered constraints are due to the environment dynamics, the dynamics itself is not identified. In order to identify
the dynamics of the environment, learning should be introduced. So far, the use of
machine learning in temporal logic control in the presence of unknown environments
has been limited to some application of reinforcement learning [95]. In that work,
a gradient-based approach to relational reinforcement-learning of control policies is
developed, for temporal logic planning in a stochastic environment.
2.4 Perspectives
When the environment is not stochastic, but rather dynamic and adversarial,
we need to develop a means of knowledge acquisition, so as to construct robust, adaptive
systems. Yet, except for some limited application of reinforcement learning, existing
work has not explored the full potential of different machine learning paradigms.
A key observation is that once abstracted, the behavior of both system and
environment can be viewed as a formal object (automaton, language, grammar, etc.),
and the identification of the environment model, and subsequently of all possible interactions with it, becomes essentially a process of inference: generalizing a formal object that describes this model from a finite amount of observed behavior.
With this insight, we introduce formal learning as the inference method of choice. In
formal learning theory we find several criteria and algorithms for learning formal ob-
jects such as languages from data presentations. Unlike reinforcement learning, formal
learning is decoupled from control design: it is essentially system identification. Once
the dynamics of the system is identified, any appropriate control design method of
choice can implement a control strategy based on the identified model. This structure
is reminiscent of the synergy between adaptation and control in continuous dynamical
systems. Essentially, learning from observations enables us to reduce the problem of
control design in an unknown environment to a problem well-examined in reactive
synthesis.
To this end, our goal is to develop a coherent framework that integrates formal
learning and reactive synthesis. Further, we aim to extend the solution we obtained
under the assumption of perfect observation to the case of partial observation, and from
single-agent and environment interaction to multi-agent interactions, in the context of
formal analysis and control design.
Chapter 3
BOTTOM-UP SYMBOLIC PLANNING
3.1 Overview
Many engineering systems can switch between different operating modes, and often the user cannot implement new low-level control behaviors or change pre-existing ones in these systems. For example, industrial manipulators are equipped with built-in PID joint controllers, whose gains the user can set but whose structure the user cannot modify. Manufacturers do not allow access to built-in controllers, in order to avoid unsafe operation as well as liability issues.
For planning and control design purposes, we model these systems as a class of
hybrid dynamical systems. A hybrid system in this class is capable of switching between
given operating modes, each with well-defined pre-conditions and post-conditions. Pre-
conditions determine when a certain mode can be activated, and post-conditions de-
scribe the guaranteed steady-state behavior. The control modes are parameterizable in
the sense that the set of states satisfying the pre- or post-conditions of a control mode
is determined by parameters. Besides some industrial robotic systems, many other dynamical systems fall into this class, such as those found in applications of legged locomotion [89] and devices that interact with human subjects [11, 86].
Given the special structure in this type of systems, planning problems can be
solved by sequencing and parameterizing the system’s existing low-level controllers in-
stead of designing a new controller from scratch. Also, from the system design perspective, it is meaningful to build simple controllers and reuse them for complex control
objectives. However, for this class of hybrid systems there is no decision procedure yet
on how to generate a feasible, optimal sequence of low-level controllers with respect to
a given objective. For this purpose, we consider a hierarchical framework, in which a
plan can be synthesized using the abstract model of the hybrid system and then be
implemented with the existing control modes in the concrete system.
Existing models for discrete abstraction, such as finite-state transition systems,
are incapable of manipulating continuous information, which may come in the form
of parameters. These parameters are exactly the type of information to keep track of
since they determine the sequencing of transitions in the discrete abstraction.
Hence, we adopt and adapt a new computational model, called register automa-
ton [57, 77], as the model for the abstract systems. The size of the abstract model
obtained through our method is not dependent on the discretization resolution on the
continuous state space. Instead, continuous states are grouped together according to
the convergence properties of each individual low-level controller.
The proposed abstraction links the concrete system and its abstract model
through a weak simulation relation, which ensures that the plan generated using the abstract model is feasible in the original hybrid system. We propose a planning method
with the abstract model that utilizes graph search algorithms and a variant of dynamic programming. In principle, suboptimal solutions to a planning problem can be
obtained using its abstraction. One can even obtain optimal ones if the underlying
continuous dynamics and cost functions are simple enough [38].
In Section 3.2, we introduce the models used for the abstractions and their se-
mantics. Section 3.3 introduces a model for the class of hybrid systems. Section 3.4
presents the new abstraction method and establishes a (weak) simulation between the
hybrid system and its abstract model. Section 3.5 outlines a new planning algorithm
using this abstract model. Section 3.6 presents a numerical case study, with a mobile
manipulator tasked with grasping and delivering an object through a sequence of ma-
neuvers implemented through given control laws. Section 3.7 concludes this chapter
with a summary of the results and thoughts for future extensions.
3.2 Automata and Transition Systems
3.2.1 Automata and their Semantics
In this section we review some background material about automata and formal
language theory [52].
Let Σ denote a fixed, finite alphabet, and let Σ∗ denote the set of finite strings over this alphabet. The empty string is denoted λ, and for a string w, its length is denoted |w|. A string v is a prefix (suffix) of another string w if there exists a string x ∈ Σ∗ such that w = vx (respectively, w = xv).
A semiautomaton is a tuple A = 〈Q, Σ, T〉, where Q is a finite set of states, Σ is a finite alphabet, and T : Q × Σ → Q is the transition function. The mapping from (q1, σ) to q2 via T is also written q1 −σ→ q2, and can be extended recursively in the usual way, i.e., given u = w1w2 with w1, w2 ∈ Σ∗, we have T(q1, u) = T(T(q1, w1), w2). In this context, semiautomata are assumed deterministic in transitions. If T(q, σ) is defined for a given (q, σ) ∈ Q × Σ, we write T(q, σ)↓.

We think of a deterministic finite-state automaton (dfa) as a quintuple A = 〈Q, Σ, T, I, F〉, where 〈Q, Σ, T〉 is a semiautomaton deterministic in transitions, I is the
〈Q,Σ, T, I, F 〉 where 〈Q,Σ, T 〉 is a semiautomaton deterministic in transitions, I is the
initial state, and F is the set of final states. A word w is accepted by A if T (I, w) ∈ F .
The language L(A) is the set of words accepted by A.
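A minimal executable rendering of these definitions (the transition function as a partial map, acceptance via the recursively extended T) might look as follows; the "even number of a's" language is a hypothetical example, not one used in the text.

```python
class DFA:
    """Minimal deterministic finite-state automaton, following the
    quintuple <Q, Sigma, T, I, F> described in the text."""

    def __init__(self, states, alphabet, delta, initial, finals):
        self.states, self.alphabet = states, alphabet
        self.delta, self.initial, self.finals = delta, initial, finals

    def run(self, q, word):
        """Extend T recursively: T(q, w1 w2) = T(T(q, w1), w2)."""
        for symbol in word:
            key = (q, symbol)
            if key not in self.delta:   # T(q, sigma) undefined
                return None
            q = self.delta[key]
        return q

    def accepts(self, word):
        """A word w is accepted iff T(I, w) lands in F."""
        return self.run(self.initial, word) in self.finals

# Hypothetical example: words over {a, b} with an even number of a's.
even_a = DFA({"e", "o"}, {"a", "b"},
             {("e", "a"): "o", ("o", "a"): "e",
              ("e", "b"): "e", ("o", "b"): "o"},
             "e", {"e"})
```

Dropping `initial` and `finals` from this class recovers the semiautomaton of the previous paragraph.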
3.2.2 Transition Systems
Definition 1. [28] A labeled transition system is a tuple TS = 〈Q,Σ, T 〉 with com-
ponents
Q a set of states;
Σ a set of labels;
T ⊆ Q× Σ×Q a transition relation.
The transition (q1, σ, q2) ∈ T is commonly denoted q1 −σ→ q2.
A transition system differs from a semiautomaton because in a transition system
the set of states and the set of transitions may not be finite, or even countable.
3.2.3 Register Automata and their Semantics
Given a finite alphabet Σ and a subset D ⊆ Rk, pairs of the form wi = (σi, di) ∈ Σ × D are called atoms. Concatenations of atoms form finite sequences w = w1 · · · wn over Σ × D, called data words. Let dom(w) be the index set {1, . . . , |w|} of the positions
of the atoms wi = (σi, di) in w. For i ∈ dom(w), the data projection valw(i) = di
gives the data value associated with the symbol σi. Similarly, the string projection
strw(i) = σi gives the symbol associated with a data value in atom wi. The following
computational machine 1 operates on a data word w ∈ (Σ×D)∗.
Definition 2 (Register Automaton cf. [36, 57]). A nondeterministic two-way register
automaton is a tuple R = 〈Q, q0, F,Σ, D, k, τ,∆〉, in which
Q a finite set of states;
q0 ∈ Q the initial state;
F ⊆ Q the set of final states;
Σ a finite alphabet;
D a set of data values;
k ∈ N the number of registers;
τ : {1, . . . , k} → D ∪ {∅} the register assignment;a
∆ a finite set of read,b or writec transitions.
1 In [57] the set of logical tests that the machine can perform does not explicitly appearin the definition, because they are assumed to be only equality tests. In [36], however,the definition of register automata explicitly includes a set of register tests in the formof logical propositions.
a When τ(i) = ∅, this means that register i is empty. The initial register assignment
is denoted τ0. Given (σ, d) ∈ Σ×D, a register can perform a test in the form of a
first-order logical formula ϕ, constructed using the grammar ϕ ::= d ≤ τ(i) | d < τ(i) | ¬ϕ | ϕ ∧ ϕ. The set of all such formulae is denoted Test(τ).

b Read transitions are of the form (i, q, ϕr) −σ→ (q′, δ), where q and q′ belong to Q, i ranges over {1, . . . , k}, σ over Σ, ϕr over Test(τ), and δ ∈ {right, stay, left}.

c Write transitions are of the form (q, ϕw) −σ→ (q′, i, δ), where ϕw ∈ Test(τ), q and q′ are in Q, and all other elements range over the same sets as in read transitions.
Given a data word w, a configuration γ of R is a tuple [j, q, τ ], where j is a position
in the input data word, q is a state, and τ the current register content. Configurations
γ = [1, q0, τ0] and γ = [j, q, τ ] with q ∈ F , are initial and final, respectively. Given
γ = [j, q, τ ] and input (σj, dj), the transition (i, p, ϕr) −σj→ (p′, δ) applies to γ if, and only if, p = q and ϕr is true, while (p, ϕw) −σj→ (p′, i, δ) applies to γ if, and only if, p = q and
ϕw is true.
The semantics of this machine is as follows. At configuration [j, q, τ ], the ma-
chine is in state q, the input read head is at position j in the data word, and the
contents of the registers are expressed by a vector τ . Upon reading wj = (σj, dj), if
ϕr is true and (i, q, ϕr) −σj→ (q′, δ) ∈ ∆, then R enters state q′ and the read head moves in the direction of δ, i.e., j′ = j + 1, j′ = j, or j′ = j − 1 for δ ∈ {right, stay, left}, respectively. The configuration is now [j′, q′, τ ]. If ϕw is true and (q, ϕw) −σj→ (q′, i, δ) ∈ ∆, then R enters
state q′, dj is copied to register i, and the read head moves in the direction δ (in this
order). The configuration is now [j′, q′, τ ′], where the updated register assignment τ ′
is such that for κ = 1, . . . , i − 1, i + 1, . . . , k, it is τ ′(κ) = τ(κ) and τ ′(i) = dj. The
automaton is deterministic if at each configuration for a given data atom there is at
most one transition that applies. If there are no left-transitions, the automaton is called one-way. In what follows, a simple example of a register automaton is provided to illustrate the definition.
Example 1. Consider a language over Σ × D with the following property: the data value in an atom that immediately follows an atom containing the symbol a has to be the same as the data value in the atom with the symbol a. This language is recognized by a one-way, 2-register automaton R2 = 〈Q, q0, F, Σ, 2, τ, ∆〉 = 〈{q0, q1, q2}, q0, {q0, q1}, {a, b, c}, 2, τ : {1, 2} → R ∪ {∅}, ∆〉, where τ0(1) = τ0(2) = ∅, ϕr ⇔ d = τ(i), ϕw ⇔ d ≠ τ(i), and ∆ consists of:

(q0, ϕw) −b,c→ (q0, 1, right), (q0, ϕw) −a→ (q1, 2, right), (q1, ϕw) −a,b,c→ (q2, 2, right),
(2, q1, ϕr) −b,c→ (q0, right), (2, q1, ϕr) −a→ (q1, right), (2, q2, ϕr) −a,b,c→ (q2, right),
(1, q2, ϕr) −a,b,c→ (q2, right), (q2, ϕw) −a,b,c→ (q2, 2, right).
[Figure 3.1: An example of a 2-register automaton (state diagram over q0, q1, q2 with edges labeled a, b, c).]
Intuitively, on receiving an atom with symbol a, the machine stores the data value in τ(2) and enters q1, which indicates “I have just seen an a.” If the data value of the next atom does not equal τ(2), the machine enters q2 and stays there forever; otherwise, depending on the symbol in the atom, the machine returns to q0 (if b or c) or stays in q1 (if a). The values associated with b or c are stored in τ(1).
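The behavior traced in Example 1 can be simulated directly. The sketch below is a hand-coded, one-way simulation of R2 rather than a general register-automaton interpreter; it accepts exactly when the run ends in F = {q0, q1}.

```python
def run_R2(word):
    """One-way simulation of the 2-register automaton of Example 1.
    `word` is a list of (symbol, value) atoms over {'a','b','c'} x R.
    Accepts iff every atom immediately following an 'a' carries the
    same data value as that 'a'."""
    state, tau = "q0", {1: None, 2: None}      # registers start empty
    for symbol, value in word:
        if state == "q2":                      # sink state: reject
            return False
        if state == "q1" and value != tau[2]:  # wrong value after an 'a'
            state = "q2"
            continue
        if symbol == "a":
            tau[2], state = value, "q1"        # remember the 'a' value
        else:
            tau[1], state = value, "q0"        # store b/c values in tau(1)
    return state in {"q0", "q1"}               # F = {q0, q1}
```

Running it on a word that violates the constraint drives the simulation into the sink state q2, from which no final state is reachable.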
3.3 Hybrid Agents
In this section we present the mathematical model for a class of hybrid systems,
referred to as a hybrid agent. Then we state the planning problem.
3.3.1 Mathematical Model
Our special class of hybrid systems is a hybrid agent :
Definition 3 (Hybrid Agent). The hybrid agent is a tuple
H = 〈Z,Σ, ι,P , πi,AP , f,Pre,Post, s,∆H〉
where the components are defined as follows.
Z = X × L a set of continuous and Boolean states;a
Σ a finite set of control modes;b
ι : Σ → {1, . . . , k} indices for the elements of Σ;
P ⊆ Rm the set of control parameter vectors;
πi : Rm → Rmi a set of canonical projections;c
AP a set of atomic propositions over Z × P; d
fσ : X × L× P → TX a finite set of parameterized vector fields;e
Pre : Σ → C the pre-condition of σ ∈ Σ, where C is defined next;f
Post: Σ→ C the post-condition of σ ∈ Σ;g
s : Z × P → 2P the parameter reset map;h
∆H : Z × P × Σ → Z × P × Σ the transition map.i
a Here, X ⊂ Rn is a compact set, and L ⊆ {0, 1}r, with n, r ∈ N. A state z ∈ Z is
called a composite state.
b The symbols in Σ label the different closed-loop continuous dynamics.
c For i = 1, . . . , k, we write p = (π1(p)ᵀ, . . . , πk(p)ᵀ)ᵀ.
d AP is a set of atomic propositions, denoted α. A literal β is defined to be either
α or ¬α, for some α ∈ AP. Set C is a set of logical sentences, each of which is a
conjunction of literals, i.e., C = {c = β1 ∧ β2 ∧ . . . ∧ βn | (∃α ∈ AP)[βi = α ∨ βi = ¬α]}, and for any c ∈ C, a proposition in AP appears at most once [84].
e For each σ ∈ Σ, fσ is parametrized by p ∈ P and ` ∈ L. The set X is positively
invariant [58] under fσ. Due to the compactness and invariance of X , each fσ has
a compact, attractive limit set parametrized by p ∈ P, denoted L+(p, σ) [58].
f Pre(σ) maps mode σ to a logical sentence over Z × P that needs to be satisfied
whenever the machine switches to mode σ from any other mode.
g Post(σ) maps mode σ to a logical sentence over Z × P that is satisfied when the
trajectories of fσ reach an ε-neighborhood of its limit set.
h The reset map assigns (z, p) ∈ Z × P to a subset of P which contains parameter
values p′ for which there is a mode σ, with pre-condition Pre(σ) satisfied by (z, p′).
i The transition map sends (z, p, σ) to (z, p′, σ′) if (z, p) satisfies Post(σ) and (z, p′)
satisfies Pre(σ′) with p′ ∈ s(z, p).
The configuration of H is a tuple [z, p, σ]. A transition from σi to σi+1 (if any)
is forced and can occur once the trajectory of fσi (z, p) hits an ε-neighborhood of its
limit set.2
This model describes a continuous dynamical system that switches between
different control laws based on some discrete logic. The discrete logic is a formal
system consisting of the atomic propositions in AP together with the logical connec-
tives ¬ and ∧. The semantics of the set of logical sentences C generated, expresses
the convergence guarantees available for each component vector field fσ in the form
Pre(σ) =⇒ Post(σ). (Formally, it is Pre(σ) =⇒ ♦Post(σ), where ♦ is the temporal
logic symbol for eventually; however, time here is abstracted away.) The switching
conditions, on the other hand, depend not only on the continuous variables, but also
on the discrete control modes: a transition may, or may not be triggered, depending
on which mode the hybrid system is in. Control over H is exercised by selecting a
particular sequence of parametrized control modes. Resetting in the parameters of
the system, activates transitions to specific modes, which in turn steer the continuous
variables toward predetermined limit sets.
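The switching behavior just described admits a compact computational sketch. The following Python fragment is illustrative only, under toy names of our own (Mode, step, and the reset callable are not from the text): the continuous flow is abstracted into a function that returns the composite state reached near the mode's limit set, and an input atom is accepted only if the new parameter equals the current one or is offered by the reset map, and the mode's pre-condition holds.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical toy encoding (names are ours, not from the text): a control
# mode carries its pre-condition, post-condition, and an abstracted flow that
# returns the composite state reached near the mode's limit set.
@dataclass
class Mode:
    pre: Callable    # Pre(sigma) evaluated on (z, p) -> bool
    post: Callable   # Post(sigma) evaluated on (z, p) -> bool
    flow: Callable   # (z, p) -> z', the forced evolution under sigma[p]

def step(config, atom, modes, reset):
    """Apply one data atom (sigma, p') to a configuration [z, p, sigma]."""
    z, p, _ = config
    name, p_new = atom
    mode = modes[name]
    # Admissibility: keep the current parameter, or draw the new one from the
    # reset map s(z, p); in either case Pre(sigma) must hold at (z, p_new).
    if p_new != p and p_new not in reset(z, p):
        raise ValueError("parameter not offered by the reset map")
    if not mode.pre(z, p_new):
        raise ValueError("pre-condition violated")
    z_next = mode.flow(z, p_new)     # transition forced at the limit set
    assert mode.post(z_next, p_new)  # Post(sigma) holds on arrival
    return (z_next, p_new, name)
```

A one-dimensional mode that steers the state toward the parameter value is already enough to exercise this skeleton.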
Compared to the definition for a hybrid system given in [69], H is special because
it does not involve jumps in the continuous states, its discrete transitions are forced,
and the continuous vector fields converge. The model for H, however, also allows the
system evolution to be influenced by, possibly externally set, continuous and discrete
variables (p and `). In addition, initial and final states are not explicitly marked,
2 Written L+(p, σ) ⊕ Bε(0), where ⊕ denotes the Minkowski (set) sum and Bε(x) is the open ball of radius ε centered at x.
allowing the machine to accept a family of input languages instead of a single one as
that of [69].
Let us now describe more formally the limit sets of the continuous dynamics fσ,
and highlight their link to the Pre and Post conditions of each mode. To this end,
let φσ(t; x0, ℓ, p) denote the flow of vector field fσ(x; ℓ, p) passing from x0 at time t = 0.
The positive limit set in control mode σ, when parametrized by p, is expressed as

L+(p, σ) = {y | ∃ tn : lim_{n→∞} tn = ∞, φσ(tn; x0, ℓ, p) → y as n → ∞, ∀ x0 ∈ Ω(p, σ)},
where Ω(p, σ) ⊆ X is the attraction region of control mode σ parametrized by p. We
assume that L+(p, σ), for a given σ and for all p ∈ P , is path connected.3 If it is not,
and there are isolated components L+i (p, σ) for i = 1, . . . , B(σ), one can refine a control
mode σ into σ1, . . .σB(σ), one for each L+i (p, σ). For simplicity, we assume that for H
of Definition 3, Σ does not afford any further refinement. For each discrete location
σ, the formulae Pre(σ) and Post(σ) are related to the limit sets and their attraction
regions in that location as follows:4
(z, p) ≡ (x, ℓ, p) |= Post(σ) ⇐⇒ (x, ℓ, p) ∈ {(x, ℓ, p) | x ∈ L+(p, σ) ⊕ Bε(0), ℓ ∈ L}

(z, p) ≡ (x, ℓ, p) |= Pre(σ) ⇐⇒ (x, ℓ, p) ∈ {(x, ℓ, p) | x ∈ Ω(p, σ), ℓ ∈ L}.
A state z, which together with some parameter p satisfies Pre(σ), can evolve
along φσ(t; z, p) to some other composite state z′ for which (z′, p) satisfies Post(σ),
and we write z −σ[p]→ z′. A sequence of the form (σ1, p1) · · · (σN, pN) is an input to H,
specifying how control modes are to be concatenated and parametrized in H. The
input sequence is a data word. We say that a data atom (σ1, p1) is admissible at the
initial setting (z0, p0) of H if p1 = p0 and (z0, p1) satisfies Pre(σ1), or if p1 ∈ s(z0, p0)
3 A set is path connected if any two points in the set can be connected with a path (acontinuous map from the unit interval to the set) [88].
4 Symbol |= is read “satisfies,” and we write (z, p) |= c if the valuation of logicalsentence c ∈ C over variables (z, p) is true.
and (z0, p1) satisfies Pre(σ1). A data atom (σ′, p′) is admissible in H at configuration
[z, p, σ] if there is a [z, p′, σ′] ∈ Z × P × Σ such that ∆H([z, p, σ]) = [z, p′, σ′]. A pair
of data atoms (σj, pj)(σj+1, pj+1) is admissible at configuration [z, p, σ] if (σj, pj) is
admissible at [z, p, σ], and there is a composite state z′ ∈ Z to which z evolves
under σj parameterized by pj (i.e., z −σj[pj]→ z′), giving a configuration [z′, pj, σj] where
the second input atom (σj+1, pj+1) is also admissible. A data word w is admissible in
H if every prefix of w is admissible.
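Prefix admissibility of a data word can be sketched directly from this definition. The function below is illustrative, with the hybrid agent reduced to three hypothetical callables (pre, evolve, reset) of our own naming.

```python
# A minimal sketch of prefix admissibility for a data word; the hybrid agent
# is abstracted into three callables, all of them illustrative placeholders.
def admissible(word, z0, p0, pre, evolve, reset):
    """word: sequence of data atoms (sigma, p). Returns True iff every
    prefix of the word is admissible from the initial setting (z0, p0)."""
    z, p = z0, p0
    for sigma, p_new in word:
        # Either keep the current parameter, or draw p_new from s(z, p).
        if p_new != p and p_new not in reset(z, p):
            return False
        if not pre(sigma, z, p_new):   # Pre(sigma) must hold at (z, p_new)
            return False
        z, p = evolve(sigma, z, p_new), p_new  # z --sigma[p]--> z'
    return True
```

Checking atom by atom in this way mirrors the requirement that admissibility of a word reduces to admissibility of all of its prefixes.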
3.3.2 Problem Statement
The planning problem addressed in this chapter is a reachability problem: for a
given Spec ∈ C, the goal is to design a control policy that drives the system from its
initial configuration to a configuration where Spec is satisfied.
Problem 1. Given a hybrid agent H at an initial configuration satisfying a formula
Init ∈ C, find an admissible sequence (σ1, p1) · · · (σN , pN) so that the configuration of
H after N transitions, for some N ∈ N, satisfies Spec ∈ C.
3.4 Abstraction
In this section we employ predicate abstraction to induce a discrete, finite-state
model of the concrete hybrid agent. Then we show that the hybrid agent weakly
simulates its abstract model.
3.4.1 Predicate Abstraction and the Induced Register Automata
Each hybrid agent H can be associated to a special one-register automaton.
Since we do not mark initial and final states in H, the discrete system is a semiau-
tomaton. We say that this one-register semiautomaton is induced by H. The relation
between the state-parameter pairs of H, and the states of the register semiautomaton
is expressed by a map.
Definition 4 (Valuation map). The valuation map VM : Z × P → Q ⊆ {1,0}^{|AP|} is
a function that maps a state-parameter pair (z, p) to a binary vector q ∈ Q of length
|AP|. The entry at position i in q, denoted q[i], is 1 or 0 depending on whether αi
in AP evaluated at (z, p) is true or false, respectively. For q ∈ Q, we denote this
valuation αi(z, p) = q[i].
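A minimal sketch of the valuation map VM, assuming the atomic propositions are supplied as an ordered list of Boolean predicates (the example predicates below are our own, purely illustrative):

```python
# Sketch of VM: map a state-parameter pair (z, p) to the binary vector of
# truth values of the atomic propositions, in a fixed order.
def valuation(z, p, predicates):
    """predicates: ordered list of functions alpha_i(z, p) -> bool.
    Returns the tuple q with q[i] = 1 iff alpha_i(z, p) holds."""
    return tuple(1 if alpha(z, p) else 0 for alpha in predicates)

# Illustrative propositions over a scalar state and parameter.
preds = [lambda z, p: z > 0,             # alpha_1: "state is positive"
         lambda z, p: abs(z - p) < 0.5]  # alpha_2: "state is near p"
```

Each abstract state q of the induced machine is then just one such bit vector realized by some pair (z, p).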
With reference to H and VM(·), a set-valued map λ : P × Q × Q × Σ → 2^P is
defined as

λ(τ; q, q′, σ′) ≜ {p′ | (∀z : VM(z, τ) = q)[p′ ∈ s(z, τ) ∧ (z, p′) |= Pre(σ′) ∧ VM(z, p′) = q′]}. (3.1)

Note that λ may not be defined for every q, σ and q′.
The register semiautomaton R(H) which serves as an abstraction of H is now
defined as follows.5
Definition 5 (Induced register semiautomaton). The deterministic finite one-way reg-
ister semiautomaton induced by hybrid agent H (with reference to Definition 3), is a
tuple R(H) = 〈Q, Σ, P, 1, τ, ∆R〉, with
Q a finite set of states; a
Σ the alphabet (same as that of H);
P the data set (same as that of H);
1 an m-dimensional array register;
τ : 1 → P ∪ {∅} the register assignment; b
∆R a finite set of read c and write d transitions.
a The set of states is defined as

Q = {q ∈ {0,1}^{|AP|} : ∃ (z, p) ∈ Z × P : VM(z, p) = q}.
5 This machine has only one register, so to lighten notation we drop the argumentfrom the current assignment of the register.
b Given input data atom (σ, p) ∈ Σ × P, the set Test(τ) consists of formulae defined
by the grammar ϕ ::= p = τ | πj(p) = πj(τ) | p ∈ λ(τ; q, q′, σ) | ¬ϕ | ϕ ∧ ϕ, where
q, q′ ∈ Q and j ∈ {1, . . . , |Σ|}.
c A read transition (q, ϕr) −σj→ (q′, right), where ϕr is τ = pj, is defined if for all z such
that VM(z, τ) = q, the pair (z, τ) satisfies Pre(σj) and there exists a continuous
evolution z −σj[pj]→ z′ such that VM(z′, pj) = q′.
d A write transition (q, ϕw) −σj→ (q′, stay), where ϕw is p ∈ λ(τ; q, q′, σj) ∧ ¬[πι(σj)(pj) = πι(σj)(τ)], is defined if there exists a parameter p ∈ P such that the set

{p′ ∈ P | p′ ∈ λ(p; q, q′, σj) ∧ πι(σj)(p′) ≠ πι(σj)(p)}

is not empty.
With the machine at configuration [j, q, τ], and upon receiving input wj = (σj, pj), if
pj = τ, the read transition (q, ϕr) −σj→ (q′, right) applies as long as it is in ∆R. In this
case, the machine moves to state q′, and the input read head advances one position.
If, on the other hand, πι(σj)(pj) ≠ πι(σj)(τ), while data value pj belongs to the set
λ(τ; q, q′, σj) for some q′ ∈ Q, then the write transition (q, ϕw) −σj→ (q′, stay) applies as
long as it is in ∆R. Then the machine reaches q′ without moving the input read head,
and overwrites the content of its register with pj.
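The read/write semantics just described can be sketched as follows. The transition table and the write_ok test (a stand-in for membership in λ) are illustrative toy inputs, not derived from any particular H.

```python
# Illustrative sketch of the one-register semiautomaton's operation: a write
# transition overwrites the register without advancing the read head; a read
# transition fires when the input datum matches the register and advances it.
def run(word, q0, tau0, read_edges, write_ok):
    """read_edges: dict (q, sigma) -> q'.  write_ok(tau, q, sigma, p) -> bool
    approximates the test 'p in lambda(tau; q, q', sigma)'.
    Returns the final (state, register) after consuming the data word."""
    q, tau = q0, tau0
    for sigma, p in word:
        if p != tau:
            if not write_ok(tau, q, sigma, p):
                raise ValueError("no write transition applies")
            tau = p                      # silent write: head stays put
        q = read_edges[(q, sigma)]       # observable read: head advances
    return q, tau
```

Note that each atom triggers at most one write before its read, which is exactly the composite-transition structure established in Proposition 1 below.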
A data atom (σ, p) is admissible at configuration [j, q, τ] if there is a transition
in ∆R that applies to [j, q, τ ] when (σ, p) appears at the input. A pair of data atoms
(σ1, p1)(σ2, p2) is admissible if there is a transition in ∆R that applies to some configu-
ration [j, q, τ ] on input (σ1, p1), taking R(H) to configuration [j+ 1, q′, τ ′], where some
other transition in ∆R applies on input (σ2, p2). A data word w is admissible if every
prefix of w is admissible. Compared to the register automaton of Definition 2, the
construction of Definition 5 differs. First, there are no initial and final states—it is a
semiautomaton—and second, there is a single register that stores an array rather than
a single variable. Register tests, though, are performed element-wise on the register.
For the special class of hybrid systems considered here, the registers of such a
model turn out to be adequate for capturing the continuous behavior, up to the resolution allowed by the given set of atomic propositions. The only change we introduce
to the standard register automaton model is the capacity to perform inequality tests
on the data. However, given that the most basic logical operation is set inclusion, an
equality test can only be performed by means of a conjunction of inequality tests.
Thus, the extension we propose does not fundamentally require any additional computational power on the part of the machine.
The write transitions in R(H) are silent, in the sense that they do not advance
the read head of the machine and thus do not produce any observable change. Read
transitions, on the other hand, are observable. A concatenation of any number of silent
transitions with a single observable transition triggered by input atom wj, taking the
machine from state q to state q′ is denoted q −wj⇝ q′, and we refer to this transition
sequence as a composite transition. Since only one observable transition is taken in a
composite transition, the read head advances only one step. A composite transition is
maximal if the machine cannot make another transition without reading a new data
atom.
Proposition 1. Let w = w1 · · ·wn be an admissible input sequence for R(H). Then
any maximal composite transition from state q to state q′ contains either a single read
transition, or a write transition followed by a read transition.
Proof. Let R(H) be at configuration [j, q, τ ]. Suppose for the data atom wj = (σj, pj),
R(H) takes a composite transition, q −wj⇝ q′. If pj = τ then the machine jumps from
q to q′ and advances the read head by one position. In this case, the configuration
changes from [j, q, τ ] to [j + 1, q′, τ ]. If pj 6= τ , no read transition applies, which
means that a write transition must have taken place. Once this write transition is
completed, τ has the value of pj. The machine still reads wj = (σj, pj) on the input
tape, since the read head has not advanced. But now, upon reading wj again, the
machine finds τ = pj. A read transition is triggered and the read head advances one
position forward. Configuration [j, q, τ ] changes first to some intermediate [j, qt, τ ′]
after the write transition, and then to the final [j + 1, q′, τ ′] after the read transition.
In any case, a composite transition either includes a single read transition or a write
transition followed by a read transition—the latter referred to as a write-read transition
pair.
3.4.2 Weak (Bi)simulations
To ensure that any plan generated by the abstract model is feasible in the concrete
hybrid agent, the two systems have to be related formally. In our case, this relation is a
weak (bi)simulation.
In a transition system TS = 〈Q, Σ, T〉, the alphabet can be partitioned
into two subsets: Σε ⊆ Σ and Σ \ Σε. We call a transition silent if it is labeled with a
label from Σε, and observable otherwise. We write q ⇝ q′ to denote that q′ is reachable
from q with an arbitrary number of silent transitions, and q −σ⇝ q′ if q′ is reachable from
q by a composite transition containing one observable transition labeled σ.
Definition 6 (Weak (observable) simulation [90]). Consider two (labeled) transition
systems over the same input alphabet Σ: TS1 = 〈Q1,Σ, T1〉 and TS2 = 〈Q2,Σ, T2〉. Let
Σε ⊂ Σ be a set of labels for silent transitions. An ordered binary relation R ⊆ Q1×Q2
is a weak (observable) simulation if: (i) R is total, i.e., for any q1 ∈ Q1 there exists a
state q2 ∈ Q2 such that (q1, q2) ∈ R, and (ii) for every ordered pair (q1, q2) ∈ R, if there
exists a state q′1 ∈ Q1 which the machine can reach with a composite transition from
q1, i.e., q1 −σ⇝1 q′1, then there also exists q′2 ∈ Q2 that can be reached with a composite
transition from q2, i.e., q2 −σ⇝2 q′2, and (q′1, q′2) ∈ R. Then TS2 weakly simulates TS1
and we write TS2 ≳ TS1.
In other words, TS2 weakly simulates TS1 if any input admissible in TS1 is also
admissible in TS2. In that sense, a hybrid agent that weakly simulates its induced
register semiautomaton can implement every input sequence admissible in the register
semiautomaton. Indeed, we show that this is the case:
Theorem 1. Hybrid agent H weakly simulates its induced register semiautomaton
R(H) in the sense that the ordered total binary relation R defined as (q, z) ∈ R ⇔ ∃ p ∈ P : VM(z, p) = q, satisfies

(q, z) ∈ R and q −wj⇝ q′ with wj = (σj, pj) =⇒ ∃ z′ ∈ Z : z −σj[pj]→ z′ with (q′, z′) ∈ R. (3.2)
Proof. First note that relation R is total by construction, since any state q ∈ Q is by
definition the image under the valuation map VM of some (z, p) ∈ Z ×P . To establish
that R is a weak simulation, let the register semiautomaton R(H) be at configuration
[j, q, τ], with a state q for which we can find a state z ∈ Z in H to relate it with:
(q, z) ∈ R. Suppose now that R(H) takes a (composite) transition on wj; then, according
to Proposition 1, this composite transition consists of either a single read transition,
[j, q, τ] −wj→ [j + 1, q′, τ], or a write-read pair: [j, q, τ] −wj→ [j, qt, τ′] −wj→ [j + 1, q′, τ′]. The
mere existence of a transition originating from q on input (σj, pj) ensures that for any
z that satisfies VM(z, τ) = q, it holds that either (z, τ) satisfies Pre(σj), with τ = pj (if
we have a single read transition), or that the ι(σj) components of the control parameter
and register do not match, meaning πι(σj)(τ) ≠ πι(σj)(pj), and there is some qt ∈ Q,
such that VM(z, pj) = qt and pj ∈ λ(τ ; q, qt, σj) (if we have a write-read transition
pair). In the latter case, by the definition of λ, we know that (z, pj) must satisfy
Pre(σj). If wj triggers a single read transition (the case τ = pj), then there must exist
a continuous evolution in H in control mode σj parameterized by pj, taking z to z′
(namely, z −σj[pj]→ z′) at which VM(z′, pj) = q′; it follows that (q′, z′) ∈ R. If, instead, wj
triggers a write-read transition pair, then after updating its register by setting τ = pj,
R(H) still reads (σj, pj) as the input. Since (z, pj) satisfies Pre(σj) and now τ = pj,
R(H) has to take a read transition to reach q′. The argument of the previous case
applies and completes the proof.
Theorem 1 suggests that while all admissible input sequences in R(H) will also
be admissible in H, a control policy that takes H from its
present state into another that satisfies Spec, might not have a matching run in R(H).
This is not necessary for the purposes of planning, but it is essential for verification. To
ensure a matching run we need to strengthen the link between the two models. Theorem
2 gives sufficient conditions for a weak bisimulation to be established between H and
R(H).
Theorem 2. Given the hybrid agent H and its induced register semiautomaton R(H),
the binary relation R defined as (q, z) ∈ R ⇐⇒ ∃p ∈ P , VM(z, p) = q, is a weak
bisimulation relation under the following conditions:
a) given p ∈ P, for any two z1, z2 ∈ Z, if VM(z1, p) = VM(z2, p) = q, then
whenever (z1, p) satisfies Pre(σ), (z2, p) also satisfies Pre(σ). In addition,
the parametrized control mode σ[p] that takes z1 to z′1 takes z2 to some z′2 for which
VM(z′1, p) = VM(z′2, p).

b) given p ∈ P, for any two z1, z2 ∈ Z for which VM(z1, p) = VM(z2, p) = q, if
p′ ∈ s(z1, p) and VM(z1, p′) = q′, then p′ ∈ s(z2, p) and VM(z2, p′) = q′.

c) given z ∈ Z and any p1, p2 ∈ P, if (z, p1) satisfies Pre(σ) and (z, p2) does
not satisfy Pre(σ), then the ι(σ) components of p1 and p2 do not match: πι(σ)(p1) ≠ πι(σ)(p2).
Proof. Since we know from Theorem 1 that H weakly simulates R(H), we only need
to show the implication (3.2) in the opposite direction: if the conditions above are
satisfied, then given (q, z) ∈ R ⊆ Q × Z, z −σ[p]→ z′ =⇒ ∃ q′ ∈ Q : q −(σ,p)⇝ q′ ∧ (q′, z′) ∈ R. To this end, select any po ∈ P such that q = VM(z, po), and examine the two
possibilities:
Case 1, where po = p
The evolution z −σ[p]→ z′ implies that the pair (z, p) satisfies Pre(σ). Given
condition a), we know that any other z1 such that VM(z1, p) = q (and is therefore
related to q), will also make a pair (z1, p) that satisfies Pre(σ). This means that any
such z1 will evolve to some z′1 in mode σ parameterized by p. We can collect all these
limit points z′1 to a set Z′(p) ≜ {z′ | z −σ[p]→ z′, for some z : VM(z, p) = q}. Condition a)
also ensures that for any z′1, z′2 ∈ Z′(p), VM(z′1, p) = VM(z′2, p) = q′ for some state q′ ∈ Q.
Based on Definition 5, there exists a read transition in R(H) taking q to q′ upon input
(σ, p). All z′ ∈ Z ′(p) give VM(z′, p) = q′ and thus (q′, z′) ∈ R.
Case 2, where po ≠ p
Without loss of generality assume that (z, po) does not satisfy Pre(σ); otherwise, we can have z −σ[po]→ z′, which reduces this case to Case 1. Condition c) then
requires that πι(σ)(po) ≠ πι(σ)(p). Since we are given that z −σ[p]→ z′, we can conclude that
(z, p) satisfies Pre(σ). The definition of the reset map then suggests that p ∈ s(z, po).
Let VM(z, p) = qt. Condition b) ensures that for any state z1 that makes a pair (z1, po)
such that VM(z1, po) = VM(z, po) = q, it is p ∈ s(z1, po) and VM(z1, p) = VM(z, p) = qt.
Let the set of all such states z1 be Z1. Since (z, p) satisfies Pre(σ) and z ∈ Z1,
using condition a) we have that for all z1 ∈ Z1, the pair (z1, p) satisfies Pre(σ).
Recall that λ(po; q, qt, σ) = {p ∈ P | (∀z : VM(z, po) = q)[p ∈ s(z, po) ∧ (z, p) |= Pre(σ) ∧ VM(z, p) = qt]}, and note that this set is nonempty since it always contains
p. Therefore, a write transition (q, ϕw) −σ→ (qt, stay) applies on input atom (σ, p) with
formula ϕw expressed as p ∈ λ(po; q, qt, σ) ∧ πι(σ)(po) ≠ πι(σ)(p). This write transition
takes q to qt and updates the register with p. Now we have Case 1.
Supported by Theorems 1 and 2, we proceed with abstraction-based time-optimal
planning, knowing that a plan generated by the induced register semiautomaton is always
implementable in the hybrid agent.
3.5 Time-optimal Planning
We propose a two-step procedure for abstraction-based time-optimal planning.
The first step is to determine a sequence of symbols in Σ, that is, a sequence of
controllers, such that with some parameterization of this sequence, the control objective
can be satisfied. The second step takes this sequence and determines the parameters for
each individual controller, such that the goal state is reached with the optimal cost.
Any transition in R(H) may incur a cost, but in this context we assume only ob-
servable transitions do so. The cost of an observable transition in R(H), corresponding
to a continuous evolution in H, is determined by the component continuous dynamics
active during that time period, the initial conditions for the continuous states, and the
assignment of parameters. The component dynamics when H is at control mode σ is
expressed as ẋ = fσ(x, ℓ, p), with σ ∈ Σ, p ∈ P, ℓ ∈ L, and x ∈ X. An incremental
cost function F : X × R+ → R+ is used to define the atomic cost gσ(x, ℓ, p) for H
evolving in control mode σ along flow φσ(t; x, ℓ, p) for t ∈ [t0, tf]:

gσ(x, ℓ, p) ≜ ∫_{t0}^{tf} F(x(t)) dt .
We define the incremental cost using the indicator function:

F(x(t)) ≜ 1_{(L+(p,σ) ⊕ Bε(0))^c}(x(t))

where 1_A denotes the indicator function of set A, ⊕ the Minkowski (set) sum, Bε(x) is
the open ball of radius ε of appropriate dimension centered at x, and (·)^c denotes set
complement. Other choices are of course possible; however, this choice of F yields an
atomic cost gσ which measures the time it takes the flow of vector field fσ(x, ℓ, p) to
hit an ε-neighborhood of L+(p, σ):

gσ(x, ℓ, p) = ∫_{0}^{∞} F(φσ(t; x, ℓ, p)) dt . (3.3)
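The cost (3.3) can be approximated numerically: integrating the indicator along a simulated flow amounts to recording the time at which the trajectory first enters the ε-neighborhood of the limit set. The sketch below uses an illustrative scalar mode and forward-Euler integration; the names and dynamics are our own, not from the text.

```python
# Numeric sketch of the atomic cost (3.3): the integral of the indicator of
# the complement of the eps-neighborhood equals the time the flow spends
# outside it, i.e., the time to first hit the neighborhood.
def atomic_cost(f, x0, limit_pt, eps, dt=1e-3, t_max=100.0):
    x, t = x0, 0.0
    while abs(x - limit_pt) >= eps:      # F = 1 outside the neighborhood
        if t >= t_max:
            return float("inf")          # did not converge within the horizon
        x += dt * f(x)                   # forward-Euler step of x' = f(x)
        t += dt
    return t                             # integral of F along the flow

# Illustrative mode sigma[p]: x' = -(x - p) converges to the limit point p.
cost = atomic_cost(lambda x: -(x - 1.0), x0=0.0, limit_pt=1.0, eps=0.1)
```

For this linear mode the hitting time is close to ln(1/ε), consistent with the Lyapunov-based over-approximations mentioned below.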
In an admissible data word w = (σ1, p1) . . . (σN , pN), for any σi−1, σi appearing
consecutively in str(w), the data value pi−1 that comes along with σi−1 should match
with some state z ∈ Z, in a way that the pair (z, pi−1) satisfies Post(σi−1). In addition,
for that same z, (z, pi−1) either also satisfies Pre(σi), or its image under the reset map
s contains some other pi 6= pi−1, for which (z, pi) satisfies Pre(σi).
We can thus eliminate the dependence of gσ on z = (x, `) in (3.3) by conserva-
tively over-approximating the atomic cost for a transition σi ∈ str(w), using a function
of parameters:
gσi(pi−1, pi) ≜ max_{z : (z, pi−1) ∈ S} ∫_{0}^{∞} F(φσi(t; z, pi)) dt (3.4)
where
S ≜ {z | (z, pi−1) |= Pre(σi) ∧ Post(σi−1)}, if pi = pi−1;
{z | (z, pi−1) |= Post(σi−1), pi ∈ s(z, pi−1), (z, pi) |= Pre(σi)}, otherwise.
The integral in (3.4) does not always have to be computed explicitly. This is because
the time required for a continuous state x ∈ X to converge under controller σ to an
ε-neighborhood of L+(p, σ) can be over-approximated using Lyapunov-based techniques,
discussed briefly in Appendix A.
The accumulated cost Jw for executing data word w = (σ1, p1) . . . (σN , pN) from
configuration [z, p, σ], assuming that w is admissible at [z, p, σ], is upper bounded by
Jw(z, p) ≤ J̄w(z, p) ≜ gσ1(z, p1) + Σ_{i=2}^{N} gσi(valw(i − 1), valw(i)).
The optimization problem can then be stated as follows:
Problem 2. With the hybrid agent H at an initial configuration [z0, p0, σ0], where
(z0, p0) satisfies Init ∈ C and σ0 ∈ Σ, find, out of all admissible sequences w =
(σ1, p1) · · · (σN, pN) solving Problem 1, the one that achieves min_{{pj}_{j=1}^{N}} J̄w(z0, p0).
3.5.1 Searching for Candidate Plans
Due to the existence of registers, register automata do not have a standard
graphical representation, so it is not clear how reachability analysis can be performed
using graph search methods. In such a machine, a state may be reached either by
a read, or a write transition; however, the nature of the incoming transition matters
when it comes to reasoning as to what happens next. Configurations, on the other
hand, cannot be enumerated due to the inclusion of the continuous data in τ .
3.5.1.1 A Graph Representation
For the purpose of planning using graph search algorithms, we suggest an embed-
ding of R(H) into a labeled transition system, hereby referred to as the transformation
semiautomaton, which brings out some information about register updates and the
nature of transitions.
Definition 7. The transformation semiautomaton of R(H) is a tuple TR(H) = 〈Q, Σ, ∆R〉 consisting of:
Q ⊆ Q × {p, p′} a finite set of states; a
Σ = Σ ∪ Λ ∪ {θ} a set of transition labels; b
∆R a set of transitions of four types. c–f
a Q contains couples where the first element is a state of R(H) and the second element
is a symbol, either p or p′. Whenever a state in R(H) is reached with a write
transition, its corresponding state in TR(H) is marked with a p′.
b Subset Λ contains labels indexing all different possible write transitions in R(H),
each write transition assigned to a unique λ in Λ. The singleton {θ} contains an
auxiliary label marking trivial write transitions (write self-loops) in R(H) which do
not modify the register content.
c One type is (q, p) −λi⇢ (q′, p′), defined if q′ is accessible from q in R(H) via a write
transition (q, ϕw) −σ→ (q′, stay).
d Another type is (q, p) −θ⇢ (q, p′), defined if q is accessible from any q′ ∈ Q via a
write transition.
e A third type is (q, p) −σ→ (q′, p), defined if there exists a read transition (q, ϕr) −σ→ (q′, right) and q is not accessible via a write transition from any other state in R(H).
f The last type is (q, p′) −σ→ (q′, p), defined if there exists a read transition (q, ϕr) −σ→ (q′, right) and q is accessible via at least one write transition from some state in
R(H).
We define the injective function Λ : Λ→ Q×Q that singles out the transition
of TR(H) that is labeled by the particular label in Λ. Consequently, λ(τ; q, q′, σ) ≡ λ(τ; Λ(λ), σ).
It is straightforward to show that TR(H) and R(H) are weakly bisimilar; intu-
itively, one merges any pair of states of the form (q, p) and (q, p′) in TR(H).
Proposition 2. The transformation semiautomaton TR(H) and the induced register
semiautomaton R(H) are weakly bisimilar: there exists an ordered binary relation R
on Q × Q such that: (i) R is total, and (ii) whenever (q, (q, ∗))6 ∈ R and there exists
a read or write transition from q to some q′ in R(H) for some σ ∈ Σ, then there exists
a composite transition in TR(H), (q, ∗) −a⇝ (q′, ∗) with a ∈ Σ ∪ Λ and (q′, (q′, ∗)) ∈ R.
Conversely, if there is a transition in TR(H) taking (q, ∗) to (q′, ∗), then there exists
a composite transition in R(H) taking q to q′ while (q, (q, ∗)) ∈ R and (q′, (q′, ∗)) ∈ R.
Proof. Define R implicitly as a partition on Q in which (q, p) and (q, p′) belong in the
same block and the equivalence class is labeled by q. Note that TR(H) is constructed
in a way that guarantees R to be total. First take the case of a read transition
(q, ϕr) −σ→ (q′, right) in R(H). By construction, TR(H) can take either (q, p) −σ→ (q′, p)
or (q, p′) −σ→ (q′, p), and obviously (q′, (q′, p)) ∈ R. Any transition (q, ϕw) −σ→ (q′, stay)
in R(H) can be matched by the transition (q, p) −λ⇢ (q′, p′) in TR(H), and since
(q′, (q′, p′)) ∈ R, it follows that TR(H) ≳ R(H).

The other direction is shown as follows: consider any (q, ∗) −a⇝ (q′, ∗), with
a ∈ Σ ∪ Λ. If a ∈ Σ, three possible cases arise: a) (q, p) −a→ (q′, p), b) (q, p′) −a→ (q′, p),
and c) (q, p) −θ⇢ (q, p′) −a→ (q′, p), where θ is the auxiliary label. In all three cases, the
end state (q′, p) is related to q′ via R and there is always a transition of the form
(q, ϕr) −a→ (q′, right) in R(H) by construction of TR(H). If a ∈ Λ, then by construction
there is a transition (q, ϕw) −σ→ (q′, stay) in R(H), and since q′ can be reached by a
write transition, there exists (q, p) −a⇢ (q′, p′) ∈ ∆R. Since both (q, (q, p)) and
(q′, (q′, p′)) belong in R, we conclude that it is also the case that R(H) ≳ TR(H).
3.5.1.2 Finding Walk Candidates
6 ∗ stands for either p or p′.

Let us define a set of ternary vectors q̂ ∈ {0, 1, ∗}^{|AP|}, where the semantics of
∗ at location i within a vector q̂ is that atomic proposition αi can be either true or
false; we do not know which. In that sense, a ternary vector q̂ can be identified with a
set of binary vectors, and thus we may write q ∈ q̂. Recalling formula Spec in Problem 1,
we represent the set of all binary vectors q for which a pair (z, p) with VM(z, p) = q
satisfies Spec, by a single ternary vector qSpec. If qSpec[i] = ∗, this means that αi does
not appear in Spec.
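Membership of a binary valuation in the set encoded by a ternary vector reduces to an element-wise check; a small illustrative sketch (the names and example vectors are our own):

```python
# Sketch of matching a binary valuation q against a ternary vector q_spec,
# where "*" means "either truth value is acceptable".
def matches(q, q_spec):
    return all(s == "*" or b == s for b, s in zip(q, q_spec))

# Illustrative Spec over three propositions: alpha_1 must be true, alpha_3
# must be false, and alpha_2 does not appear in Spec.
q_spec = (1, "*", 0)
```

Any state whose valuation matches q_spec is then an acceptable final state for the reachability problem.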
Now we recast Problem 2, which is given on H, as a problem defined on R(H)
and TR(H):
Problem 3. For a given Spec ∈ C and a pair (z0, p0) satisfying Init, and for any
qf ∈ qSpec, find a data word w = (σ1, p1) · · · (σN , pN) for which
1. there exists a walk w in TR(H) from (q0, p) to (qf, p) with q0 = VM(z0, p0) and
qf ∈ qSpec, such that its projection to Σ, denoted w↾Σ, satisfies w↾Σ = str(w),
and

2. J̄w(z0, p0) is minimized with respect to {pj}_{j=1}^{N}, where pN = pf as specified in
Spec.
Condition 2) restates the optimality requirement of Problem 2. Theorem 1
ensures that w is a solution to Problem 1.
For the first part, because of the restrictions imposed by the set-valued map
λ, the run on the transformation automaton might have to revisit some state in Q a
few times in order to bring the parameter to the value specified in Spec. Thus, we
need to search for walks, 7 instead of simple paths. To find the walks we augment
TR(H) by adding the initial and desired final states based on Problem 3, and obtain
a dfa. Then we generate a regular expression (regex) of this dfa.8 From this regex
we can construct successively longer walks satisfying condition 1) of Problem 3, and
then optimize them using a modified version of dynamic programming discussed next.
7 A walk is a path which may include cycles.
8 A regex is defined recursively as follows [52]: (1) the empty string ε and all σ ∈ Σ are regexs; (2) if r and s are regexs, then rs (concatenation), (r + s) (union) and r∗, s∗
(Kleene-closure) are regexs; (3) there are no regexs other than those constructed by applying rules (1) and (2) above a finite number of times.
With cycles allowed there is no bound on the length of admissible strings in the dfa,
and thus we limit the number of walks that can be checked for optimality by setting
an upper bound on the cost, based on an assumed maximum affordable cost, and the
cost of the least expensive observable transition.
With initial state (q0, p) and final state (qf , p), we obtain the dfa
〈TR(H), (q0, p), (qf , p)〉 = 〈Q, Σ, ∆R, (q0, p), (qf , p)〉
and find a regex, denoted RE(H), associated with this dfa using known methods [18].
Replacing every occurrence of the Kleene star ∗ in RE(H) with a natural number
gives a set W(m) of all admissible walks of length m in the dfa, W(m) ≜ {w | w ∈
RE(H), |w| = m}. Any walk in W(m) has a matching admissible input data word on
R(H) (Theorem 1). However, TR(H) has no information on specific register values,
and thus the corresponding admissible data word may not comply with the requirement
for p0 and pf . To remove inadmissible walks we develop a procedure for translating
a walk in TR(H) to a family w of data words in R(H), in which all individual words
w have the same symbol string str(w) but different data value assignments. The
domains of possible data value assignments are specified by a sequence of set-valued
maps:
Given a walk w = u1 · · ·um, set i := 1, j := 1, and for 1 ≤ i ≤ m, distinguish
three cases:
1. ui ∈ Σ: then, set σj := ui, wj := (σj, pj), Mj(·) := idP (·), j := j + 1, i := i+ 1;
2. ui ∈ Λ: then, set σj := ui+1, Mj(·) := λ(· ; Λ(ui), σj), wj := (σj, pj), j := j + 1,
i := i+ 2;
3. otherwise, set σj := ui+1, Mj(·) := idP (·), wj := (σj, pj), j := j + 1, i := i+ 2.
In the above, idP : p 7→ p is the identity map on P . A walk w = u1 . . . um is thus
translated into a family of data words w = (σ1, p1) · · · (σN , pN), and a sequence of
set-valued maps Mi(·) : P → 2^P, for i ∈ {1, . . . , N}.
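The translation procedure above, together with the consistency test (3.5) below, can be sketched as follows. The label sets, the map attached to each λ ∈ Λ, and the goal set are illustrative placeholders of our own.

```python
# Sketch of the walk-translation procedure: a symbol from Sigma yields a data
# atom with the identity map; a label from Lambda attaches its set-valued map
# lambda(.) to the following symbol; the trivial write label theta also
# contributes the identity map.
def translate(walk, Sigma, Lam, theta, lam_maps):
    """Returns (symbol string, list of set-valued maps M_i : p -> set)."""
    syms, maps, i = [], [], 0
    ident = lambda p: {p}
    while i < len(walk):
        u = walk[i]
        if u in Sigma:                       # case 1: plain read
            syms.append(u); maps.append(ident); i += 1
        elif u in Lam:                       # case 2: write, then read
            syms.append(walk[i + 1]); maps.append(lam_maps[u]); i += 2
        else:                                # case 3: trivial write theta
            syms.append(walk[i + 1]); maps.append(ident); i += 2
    return syms, maps

def consistent(p0, maps, goal_set):
    """Composed-map consistency check: M_N o ... o M_1(p0) must meet the goal."""
    reach = {p0}
    for M in maps:
        reach = set().union(*(M(p) for p in reach)) if reach else set()
    return bool(reach & goal_set)
```

A walk passes the test exactly when the composed set-valued maps, started from p0, can reach a parameter value compatible with the final state.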
To check whether a walk w generates a data word w ∈ w that can match the pa-
rameter specifications, we use the sequence of set-valued maps Mi(·)Ni=1 constructed
by this procedure and verify the consistency condition
{p ∈ P | ∃ z ∈ Z : VM(z, p) = qf} ∩ MN ◦ · · · ◦ M1(p0) ≠ ∅. (3.5)
3.5.2 Dynamic Programming — a Variant
We start with the best case—the shortest possible walk found—and test locally
for optimality. If an optimal solution is encountered, the search stops.
Let the maximal allowable cost for any solution to Problem 3 be Jmax. Then,
if the minimum cost of executing an observable transition labeled with σ ∈ Σ is some
Jmin > 0, an upper bound U on the length of data words translated from walks is
U ≜ ⌈Jmax / Jmin⌉.
If (3.5) holds, then there exists a sequence of N parameter values {pi}_{i=1}^{N} with
|w↾Σ| = N ≤ m such that w = (σ1, p1) · · · (σN, pN) is an admissible input for
R(H). Input w takes the register semiautomaton from configuration [1, q0, p0] to some
configuration [N + 1, qf, pf] where qf ∈ qSpec. Among all walks which pass the test
(3.5), we pick the shortest one as the candidate most likely to yield the optimal solution
to Problem 3.
In the case a candidate walk is found, we modify the standard dynamic programming (DP) algorithm of [14], and apply it in its new form to obtain an optimal
sequence of parameters. We first obtain a set of subsets of P, denoted {Pi}_{i∈dom(w)},
where each Pi consists of all parameter values that can be used to parametrize a control
mode (data atom) at stage i of execution of an input data word in w, and is found as
Pi ≜ Mi ◦ · · · ◦ M1(p0) ∩ (MN ◦ · · · ◦ Mi+1)^{−1}(S) ,

where S = {p | ∃ z ∈ Z : VM(z, p) = qf}. Closed-form expressions for the optimal
values of parameters and the accumulated cost can be obtained in the special case
where the continuous dynamics of each control mode associated with a data atom
in the input w is linear, and the related atomic cost is quadratic. In more general
42
(nonlinear) cases, sets Pi, for i ∈ dom(w) may have to be discretized. Naturally, the
resolution of this discretization affects the optimality of the solution obtained.
Assuming a general case where closed-form solutions for the optimal parameters
are impractical, consider a partition of Pi into Ki blocks, enumerate the blocks,
and let pi[k] denote the representative of the parameter values belonging to block
k ∈ {1, . . . , Ki}. The DP algorithm selects the optimal sequence of parameter repre-
sentatives p1*, . . . , pN* in the family w as follows.
Let i = N , and for each pN−1[k] ∈ PN−1, k = 1, . . . , KN−1, set

    PN*(pN−1[k]) := arg min_{pN [j] ∈ PN} gσN (pN−1[k], pN [j])            (3.6a)
    JN*(pN−1[k]) := gσN (pN−1[k], PN*(pN−1[k])) .                         (3.6b)
This process constructs two discrete maps on PN−1. The first map associates pN−1[k]
with the value PN*(pN−1[k]), to which the parameter should be reset in order to trigger
the transition with minimum cost. The second map associates a representative
pN−1[k], assumed to be written in the register before the σN transition is triggered,
with the minimum accumulated cost JN*(pN−1[k]) incurred during the σN transition.
For i = N − 1, . . . , 2 we repeat

    Pi*(pi−1[k]) := arg min_{pi[j] ∈ Pi} { gσi (pi−1[k], pi[j]) + J*_{i+1}(pi[j]) }        (3.7a)
    Ji*(pi−1[k]) := gσi (pi−1[k], Pi*(pi−1[k])) + J*_{i+1}(Pi*(pi−1[k])) .                 (3.7b)
Finally, for i = 1 we finish by setting

    P1*(p0) := arg min_{p1[j] ∈ P1} { gσ1 (z0, p1[j]) + J2*(p1[j]) }       (3.8a)
    J1*(z0, p0) := gσ1 (z0, P1*(p0)) + J2*(P1*(p0)) .                      (3.8b)
The optimal sequence of parameter representatives p1*, . . . , pN* is then obtained
iteratively. This sequence identifies a particular member of the input word family w
as the solution w* to Problem 3. The (conservative) accumulated cost is given by
J1*(z0, p0).
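The backward recursion (3.6)–(3.8) can be sketched for discretized parameter sets as follows. The function and variable names are ours, and the scalar quadratic toy costs merely stand in for the transition costs gσi of the thesis.

```python
# Sketch of the backward DP recursion (3.6)-(3.8) over discretized
# parameter sets P_1, ..., P_N.  Illustrative only.

def solve_dp(P, g, z0_cost):
    """
    P       : list of N lists; P[i] holds the stage-(i+1) representatives
    g       : list of N-1 cost functions g[i](p_prev, p_next), stages 2..N
    z0_cost : first-stage cost g_sigma_1(z0, p1)
    Returns (optimal parameter sequence, accumulated cost).
    """
    N = len(P)
    J = {p: 0.0 for p in P[-1]}            # cost-to-go beyond stage N is 0
    policy = [None] * N                    # policy[i]: best p_{i+1} given p_i
    for i in range(N - 1, 0, -1):          # stages N, N-1, ..., 2
        stage, best = {}, {}
        for p_prev in P[i - 1]:
            cand = [(g[i - 1](p_prev, p) + J[p], p) for p in P[i]]
            stage[p_prev], best[p_prev] = min(cand)
        J, policy[i] = stage, best
    # stage 1: the initial continuous state z0 enters through z0_cost
    total, p1 = min((z0_cost(p) + J[p], p) for p in P[0])
    seq = [p1]
    for i in range(1, N):                  # roll the policy maps forward
        seq.append(policy[i][seq[-1]])
    return seq, total

# Toy run with N = 3 scalar stages and quadratic stage costs:
P = [[0.0, 1.0], [0.0, 2.0], [1.0, 3.0]]
g = [lambda a, b: (a - b) ** 2, lambda a, b: (a - b) ** 2]
seq, cost = solve_dp(P, g, z0_cost=lambda p: p ** 2)
print(seq, cost)  # [0.0, 0.0, 1.0] 1.0
```

Storing the argmin maps (the `policy` list) at every stage is what allows the optimal sequence to be read off forward once J1* is known, exactly as in the text.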
The time complexity of this variant of DP with discretized parameter space
is polynomial, O(NK²), where N = |w| and K = max_{i=1,...,N} Ki. To generate a
candidate data word for the DP algorithm, one checks (3.5), which in the worst case
requires the enumeration of all data words of maximal length U .
The solution obtained can be suboptimal because: (i) when a weak
bisimulation cannot be established between H and R(H), there may exist sequences
of parametrized control modes with lower cost that are only admissible in H; (ii) the
accumulated cost computed in R(H) over-approximates the time needed for executing
w from (z, p) in H, and it is conceivable that a word with a higher accumulated cost Jw
might actually be executed faster; (iii) if the upper bound U on the length of data words
is smaller than the length of the optimal solution, then the optimal solution is not
analyzed; and (iv) the discretization of the parameter space introduces quantization
errors.
3.6 Case Study
The effectiveness of the abstraction and planning algorithms is illustrated with an
example. The problem to be solved is as follows: a mobile manipulator is instructed to
fetch a document at the printer and deliver it to the user. The locations of the printer
and the user are known.
The mobile manipulator consists of two subsystems: a wheeled mobile platform,
and a two degree-of-freedom robotic arm moving on a vertical plane. The robot exhibits
three different behaviors: (i) it can move from some initial position to a desired posture
(position and orientation), (ii) it can reach out with its arm, grasp an object in the
workspace and hold it, and (iii) it can reach out with its arm to a desired position
and release an object held in its gripper. When the robot performs any one of these
maneuvers we say that it is in a particular control mode, and these modes are labeled a,
b, and c, respectively. The controller responsible for each of these behaviors is given to
us a priori and no access to its low-level software is permitted. We have to determine
the sequence and parameterization of the controllers to achieve the desired outcome:
printout delivered to user.
One obvious (to a human) solution is to bring the robot to the vicinity of the
printer, have it reach out and pick up the printout from the output tray, then navigate
to the user and deliver the paper stack. However, it is not clear how such a plan can
be generated automatically.
3.6.1 Control Mode a: Nonholonomic Control
The mobile platform is modeled kinematically as a unicycle

    ẋ = v cos ϑ        ẏ = v sin ϑ        ϑ̇ = ω

where the velocity v and the angular velocity ω are the control inputs. Control mode
a steers the robot's posture Xp ≝ (x, y, ϑ)ᵀ ∈ R² × S¹ from an initial configuration
Xp0 = (x0, y0, ϑ0)ᵀ to a target Xpf = (xf , yf , ϑf )ᵀ. A coordinate transformation
naturally reduces this problem to steering the unicycle to the origin.⁹
The controller in mode a is designed based on [83]. Let x1 = ϑ mod 2π,
x2 = x cos ϑ + y sin ϑ, x3 = −2(x sin ϑ − y cos ϑ) + (ϑ mod 2π)(x cos ϑ + y sin ϑ).
Define

    ω = −k1 x1 + k3 x3^r x2
    v = −k1 x2 + 0.5 (x1 x2 − x3) ω

where k1, k3 > 0 are control gains, r = m/n, and m < n are odd naturals. The closed-
loop system is

    ẋ1 = −k1 x1 + k3 x3^r x2        ẋ2 = −k1 x2        ẋ3 = −k3 x3^r .        (3.9)

Vector field fa is defined by the right-hand sides of (3.9). It can be verified that with
C ≝ x3(0)^{1−r} and tf ≝ C / (k3(1 − r)), when t ≤ tf , x3(t) = sign(C) | |C| − k3(1 − r)t |^{1/(1−r)}, and
⁹ Here, the workspace is obstacle-free. If obstacles are present, one may replace the
controller in mode a with one that can handle obstacles, such as [102]. The challenge
lies in approximating the convergence rate; for this, see Appendix A.
for t ≥ tf , x3(t) = 0. Then for t ≥ tf ,

    x1(t) = x1(tf ) e^{−k1(t−tf )} ,        x2(t) = x2(tf ) e^{−k1(t−tf )} ,

where

    x1(tf ) = e^{−k1 tf} ( x1(0) + k3 x2(0) ∫₀^{tf} e^{2k1 s} (C − k3(1 − r)s)^{r/(1−r)} ds )

and x2(tf ) = x2(0) e^{−k1 tf} .
Post(a) is defined as the area where x1 and x2 are in a ball of radius ε around
the origin. It is guaranteed that Post(a) is satisfied in time at most

    max{ tf + ln( √(2 x1(tf )²) / ε ) / k1 ,  ln( √(2 x2(0)²) / ε ) / k1 }

after switching to control mode a.
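The convergence of the closed loop (3.9) can be checked numerically. The sketch below integrates (3.9) with forward Euler under assumed gains k1 = k3 = 1 and exponent r = 1/3; it is an illustration of the vector field's asymptotic behavior, not the controller implementation used in the thesis.

```python
import math

def simulate_mode_a(x, k1=1.0, k3=1.0, r=1/3, dt=1e-3, T=20.0):
    """Forward-Euler integration of (3.9); r = m/n with m < n odd."""
    x1, x2, x3 = x

    def pow_r(v):
        # sign-preserving real odd root |v|^r * sign(v)
        return math.copysign(abs(v) ** r, v)

    for _ in range(int(T / dt)):
        dx1 = -k1 * x1 + k3 * pow_r(x3) * x2
        dx2 = -k1 * x2
        dx3 = -k3 * pow_r(x3)
        x1, x2, x3 = x1 + dt * dx1, x2 + dt * dx2, x3 + dt * dx3
    return x1, x2, x3

x1, x2, x3 = simulate_mode_a((0.5, -1.0, 0.8))
print(max(abs(x1), abs(x2), abs(x3)) < 1e-2)  # True: near the origin
```

Note that x3 reaches (a numerical neighborhood of) zero in finite time, after which x1 and x2 decay exponentially, matching the case analysis at t = tf above.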
3.6.2 Control Modes b and c: Catch and Release
In control modes b and c, the robot's arm maneuvers to pick up and release an
object, respectively. The arm is mounted on the mobile platform at a height hp. The
lengths of the two arm links are l1, l2, and the corresponding joint angles are ψ1, ψ2. Let
Ψ ≝ (ψ1, ψ2)ᵀ. The workspace of the arm is the set of end-effector absolute positions
pa = (pxa, pya, pza)ᵀ ∈ R³, reachable in the sense that given Xp = (x, y, ϑ) we have

    √( (pxa − x)² + (pya − y)² + (pza − hp)² ) ∈ [ |l1 − l2|, |l1 + l2| ]
    tan ϑ = (pya − y) / (pxa − x) .                                        (3.10)
If (3.10) is true, we write pa ∈ W (Xp). The system is kinematically redundant: for
a given pa, many postures Xp can satisfy (3.10). Let the set of all these postures be
W−1(pa).
Inverse kinematics yields the joint angles Ψd ≝ (ψ1d, ψ2d)ᵀ that position the
end-effector at a desired pa:

    ψ2d = cos⁻¹ [ ( (pxa − x)² + (pya − y)² + (pza − hp)² − (l1² + l2²) ) / (2 l1 l2) ]

    ψ1d = tan⁻¹ [ ( (pza − hp)(l1 + l2 cos ψ2d) − l2 √((pxa − x)² + (pya − y)²) sin ψ2d )
                  / ( l2 (pza − hp) sin ψ2d + √((pxa − x)² + (pya − y)²) (l1 + l2 cos ψ2d) ) ] .
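The inverse kinematics above can be validated against the forward kinematics of a planar two-link arm. In the sketch below, d stands for the horizontal distance √((pxa − x)² + (pya − y)²) from the base to the target, and h = pza − hp for the target height above the shoulder; the function names and test values are ours.

```python
import math

def two_link_ik(d, h, l1, l2):
    """Joint angles (psi1, psi2) placing the end-effector at (d, h)."""
    c2 = (d * d + h * h - l1 * l1 - l2 * l2) / (2 * l1 * l2)
    psi2 = math.acos(c2)                        # elbow angle
    psi1 = math.atan2(
        h * (l1 + l2 * math.cos(psi2)) - l2 * d * math.sin(psi2),
        l2 * h * math.sin(psi2) + d * (l1 + l2 * math.cos(psi2)))
    return psi1, psi2

def forward(psi1, psi2, l1, l2):
    """End-effector (d, h) from joint angles; used to validate the IK."""
    return (l1 * math.cos(psi1) + l2 * math.cos(psi1 + psi2),
            l1 * math.sin(psi1) + l2 * math.sin(psi1 + psi2))

psi1, psi2 = two_link_ik(0.25, 0.10, l1=0.2, l2=0.2)
d, h = forward(psi1, psi2, 0.2, 0.2)
print(abs(d - 0.25) < 1e-9 and abs(h - 0.10) < 1e-9)  # True
```

The ψ1d expression is the tangent-subtraction form of the usual atan2(h, d) − atan2(l2 sin ψ2, l1 + l2 cos ψ2), which is why the round trip through the forward kinematics recovers the target exactly.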
Let Ψh ≝ (ψ1h, ψ2h)ᵀ denote the center of the arm's workspace, the joint angle com-
bination for which the distance between the end-effector and the workspace boundary is
maximized. With the workspace being a compact set, the existence of this joint angle
configuration is ensured.

The error in joint angles is written Eψ(t) ≝ Ψ(t) − Ψd. With direct joint angle
control, and with steady state considered reached when |ψ1 − ψ1d| ≤ ε and |ψ2 − ψ2d| ≤ ε,
vector fields fb and fc are defined by the closed-loop joint error dynamics Ėψ = −K Eψ,
where K ≝ diag(b1, b2). The difference between the two control modes is that while in
mode b the arm's gripper is initially open and closes to grasp the object at the desired
end-effector position, in mode c the originally closed gripper opens at the arm's desired
configuration. With the arm anywhere within its workspace, the maximum time to
complete a pick (b) or place (c) maneuver is

    Tj = max{ 2 ln( |ψ1h − ψ1d| / ε ) / b1 ,  2 ln( |ψ2h − ψ2d| / ε ) / b2 } .
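A quick numeric check of the settling-time bound Tj: under the error dynamics Ėψ = −K Eψ, each error component decays as e_i(t) = (ψih − ψid) e^{−bi t} when starting from the workspace center. The gains, tolerance, and angles below are illustrative values, not from the thesis.

```python
import math

# Illustrative values: decay gains, tolerance, workspace center, target.
b1, b2, eps = 2.0, 3.0, 0.01
psi_h = (0.0, math.pi)
psi_d = (0.8, 2.0)

Tj = max(2 * math.log(abs(psi_h[0] - psi_d[0]) / eps) / b1,
         2 * math.log(abs(psi_h[1] - psi_d[1]) / eps) / b2)

# Error of each joint at time Tj under e_i(t) = |dpsi_i| * exp(-b_i t)
errors = [abs(h - d) * math.exp(-b * Tj)
          for h, d, b in zip(psi_h, psi_d, (b1, b2))]
print(Tj > 0 and all(e <= eps for e in errors))  # True
```

The factor of 2 in Tj makes the bound conservative: each residual error at Tj is of order ε²/|ψih − ψid|, well inside the ε tolerance.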
3.6.3 The System Model
We model the robot as a hybrid agent H = 〈Z, P, Σ, ι, πi, AP, fσ, Pre, Post, s, ∆H〉
with components:

    Z = X × L                          set of composite states ᵃ
    P = R² × S¹ × R³                   set of control parameters ᵇ
    Σ = {a, b, c}                      set of control modes ᶜ
    ι = {(a, 1), (b, 2), (c, 3)}       indexing bijection on Σ
    πi, i ∈ {1, 2, 3}                  projection functions on p ∈ P ᵈ
    fσ, σ ∈ Σ                          parameterized vector fields ᶜ
    AP = {α1, α2, α3, α4}              indexed atomic propositions ᵉ
    Pre : Σ → C                        precondition of mode σ ∈ Σ ᶠ
    Post : Σ → C                       postcondition of mode σ ∈ Σ ᶠ
    s : Z × P → 2^P                    system parameter reset map ᵍ
    ∆H : Z × P × Σ → Z × P × Σ         mode transition map ʰ
ᵃ X = R² × S¹ × S² × R³ is the set of continuous variables describing the posture of
the platform Xp ∈ R² × S¹, the joint angles of the arm Ψ ∈ S², and the position of
the manipulated object Xo ≝ (xo, yo, zo)ᵀ ∈ R³. Here, L = {g} contains a single
Boolean variable g that expresses whether the gripper is closed (g = 1), or not
(g = 0).

ᵇ The parameter vector p = (ppᵀ, paᵀ)ᵀ ∈ P describes the desired posture pp ∈ R² × S¹
for the mobile platform and the absolute position reference pa ∈ R³ for the arm's
end-effector. Component pa ∈ R³ parameterizes modes b, c.

ᶜ In control mode a, the mobile platform evolves according to fa and converges to
a desired posture pp; in control mode b, the joint angles evolve under fb, and the arm
picks up an object at Xo and holds it; in mode c, the joint angles evolve under fc
and the arm releases the object at pa.

ᵈ Defined as π1(p) ≝ pp, π2(p) ≝ pa, π3(p) ≝ pa.

ᵉ Proposition α1 is Xp ∈ pp ⊕ Bε(0); when true, it means that the platform is
ε-close to its reference position. Proposition α2 is Xo ∈ pa ⊕ Bε(0); when true, the
object is in an ε-neighborhood of position pa. Proposition α3 is pa ∈ W(pp); when true,
it means that, with the platform at pp, the parameter component pa, specifying a reference
location for the end-effector, is within the reachable workspace. Proposition α4 is
true iff g = 1.

ᶠ C is the set of logical sentences obtained from AP. Table 3.1 summarizes the Pre
and Post for each mode.

ᵍ For p = (ppᵀ, paᵀ)ᵀ and p′ = (p′pᵀ, p′aᵀ)ᵀ, writing p′ ∈ s(z, p) implies that p′a ∉ pa + Bε(0)
or p′p ∉ pp + Bε(0).
ʰ Exactly as in Definition 3.

The induced register semiautomaton for H is the tuple R(H) = 〈Q, Σ, P ∪ {∅}, τ, ∆R〉,
with:
Table 3.1: Pre and Post maps for the control modes of the hybrid agent.

            a       b                          c
    Pre     ¬α1     α1 ∧ α2 ∧ α3 ∧ (¬α4)       α1 ∧ (¬α2) ∧ α3 ∧ α4
    Post    α1      α1 ∧ (¬α2) ∧ α3 ∧ α4       α1 ∧ α2 ∧ α3 ∧ (¬α4)
    Q                        set of states ᵅ
    τ : {1} → P ∪ {∅}        register assignment
    ∆R                       transition relation ᵝ ᵞ

ᵅ This set can be practically restricted to {0000, 1000, 1110, 1011, 1001, 0110, 0001}.
More states exist, but for this task of reaching qf from q0 (see Section 3.6.4), the
remaining states are either unreachable from q0, or cannot reach qf , and thus are
ignored.

ᵝ The read transitions are the following:

    (0000, ϕr) →a (1000, right),  (1110, ϕr) →b (1011, right),  (1011, ϕr) →c (1110, right),
    (0001, ϕr) →a (1001, right),  (0110, ϕr) →a (1110, right).

ᵞ The write transitions are the following:

    (1000, ϕw) →a (0000, stay),  (1000, ϕw) →b (1110, stay),  (1011, ϕw) →a (0001, stay),
    (1011, ϕw) →c (1011, stay),  (1001, ϕw) →c (1011, stay),  (1110, ϕw) →a (0110, stay).
The set-valued maps λ appearing in the write transitions of R(H) are defined
through (3.1):

    λ(τ ; 1000, 0000, a) = {p′ ∈ P | p′a = pa, p′p ∈ R² × S¹ \ W⁻¹(pa) \ {pp}}
    λ(τ ; 1000, 1110, b) = {p′ ∈ P | p′p = pp ∈ W⁻¹(Xo), p′a = Xo}
    λ(τ ; 1011, 0001, a) = {p′ ∈ P | p′p ∈ R² × S¹ \ W⁻¹(pa), p′a = pa}
    λ(τ ; 1011, 1011, c) = {p′ ∈ P | p′p = pp, p′a ∈ W(pp) \ {pa}}
    λ(τ ; 1001, 1011, c) = {p′ ∈ P | p′p = pp, p′a ∈ W(pp)}
    λ(τ ; 1110, 0110, a) = {p′ ∈ P | p′p ∈ W⁻¹(pa) \ {pp}, p′a = pa} .
[Figure: state-transition diagram omitted; states 0000, 1000, 1110, 1011, 1001, 0110
(each paired with register content p or p′), with edges labeled by modes a, b, c and
maps λ1, . . . , λ6, θ.]

Figure 3.2: The transformation semiautomaton TR(H) of hybrid agent H, for the task
specification considered.
For a fixed tuple (q, q′, σ) ∈ Q × Q × Σ, the set-valued map λ(· ; q, q′, σ) maps τ ∈ P
to a subset of P , and can be inverted on appropriate subsets of P :

    λ⁻¹(p′; 1000, 0000, a) = {p ∈ P | pp ∈ R² × S¹ \ ({p′p} ∪ W⁻¹(pa)), pa = p′a}
    λ⁻¹(p′; 1000, 1110, b) = {p ∈ P | pp = p′p ∈ W⁻¹(p′a), pa ∈ R³ \ W(p′p)}
    λ⁻¹(p′; 1011, 0001, a) = {p ∈ P | pp ∈ W⁻¹(p′a), pa = p′a}
    λ⁻¹(p′; 1011, 1011, c) = {p ∈ P | pp = p′p, pa ∈ W(p′p) \ {p′a}}
    λ⁻¹(p′; 1001, 1011, c) = {p ∈ P | pp = p′p, pa ∈ R³ \ W(p′p)}
    λ⁻¹(p′; 1110, 0110, a) = {p ∈ P | pp ∈ W⁻¹(p′a) \ {p′p}, pa = p′a} .
The transformation semiautomaton TR(H) = 〈Q, Σ, ∆R〉 is described graphi-
cally in Fig. 3.2. The assignment of labels λi to transitions in R(H) is done by the
function Λ : {λ1, . . . , λ6} → Q × Q. Explicitly, Λ(λ1) = (1000, 0000), Λ(λ2) = (1000, 1110),
Λ(λ3) = (1011, 0001), Λ(λ4) = (1001, 1011), Λ(λ5) = (1011, 1011), and Λ(λ6) =
(1110, 0110).
3.6.4 Task Specifications
Given some initial configuration for the robot, Xp(0) = (0, 1, π/4)ᵀ, Ψ(0) = Ψh =
(0, π)ᵀ, g = 0, and the manipulated object Xo(0) = (−1, 2, 0.3)ᵀ, we seek a time-optimal
plan for the robot to pick up the object and deliver it to a user located at Xu = (2, 3, 0.4)ᵀ.
To avoid trivial solutions, we assume that Xo(0) ∉ W(Xp(0)) and W⁻¹(Xo(0)) ∩
W⁻¹(Xu) = ∅, which means that the object is not within the vicinity of the initial base location,
and that the arm cannot deliver the object to the user without the robot base having
to reposition itself.
Assume the register is initialized with p0 = (Xp(0)ᵀ, Xuᵀ)ᵀ, which sets the register
semiautomaton to state 1000. When the user receives the object at time tf , the system
state holds constant for t > tf . At time tf , we have Xu ∈ W(Xp(tf )), Xp(tf ) ∈ π1(pf ) + Bε(0),
Xo = Xu, and g = 0; thus α1, α2, α3 evaluate true. The semiautomaton would then be
at state 1110, while π2(pf ) = Xu. Thus, when (z, p) satisfies Spec, this means that
VM(z, p) = 1110, and π2(p) = π2(pf ) = Xu. The objective is thus to find the shortest
walk w from (1000, p) to (1110, p) in TR(H), which ensures that (3.5) is satisfied for some
data word in the family w given by the translation procedure.
3.6.5 Solving the Planning Problem
For the dfa obtained from TR(H) with initial (1000, p) and final (1110, p)
states, the equivalent RE(H) is:

    RE(H) = (λ1 a)* ( λ2 b (λ3 a λ4 c + (λ5 + θ) c) ( θ b (λ3 a λ4 c + (λ5 + θ) c) + λ6 a )* ).    (3.11)

By replacing the Kleene stars in (3.11) with natural numbers, we obtain strings that
correspond to walks of certain length in the graph of TR(H). Let us denote by W(m) the
set of walks of length m. Substitution in (3.11) verifies that every walk has to be
of even length with m > 3. For m = 4, we find W(4) = {λ2 b λ5 c, λ2 b θ c}. Walk w =
λ2 b λ5 c translates to w = (b, p1)(c, p2) and {Mj(·)}_{j=1}^2 = {M1 = λ(· ; Λ(λ2), b), M2 =
λ(· ; Λ(λ5), c)}, in which Λ(λ2) = (1000, 1110) and Λ(λ5) = (1011, 1011). The
resulting map gives M(p0) = M2 ◦ M1((Xp(0)ᵀ, Xuᵀ)ᵀ) = ∅, since M1((Xp(0)ᵀ, Xuᵀ)ᵀ) =
{(Xp(0)ᵀ, p′aᵀ)ᵀ | p′a ∈ {Xo(0)} ∩ W(Xp(0))} = ∅ because the initial position of the
object is not within the workspace of the mobile platform. The same procedure applies
to the other walk, and it turns out that neither satisfies (3.5). For m = 6, (3.11)
generates W(6) = {λ1 a λ2 b λ5 c, λ1 a λ2 b θ c, λ2 b λ3 a λ4 c, λ2 b λ5 c λ6 a, λ2 b θ c λ6 a},
all of which are rejected. For example, walk w = λ1 a λ2 b λ5 c translates to w =
(a, p1)(b, p2)(c, p3) and M(·) = λ(· ; Λ(λ5), c) ◦ λ(· ; Λ(λ2), b) ◦ λ(· ; Λ(λ1), a), so M(p0) =
{(p′pᵀ, p′aᵀ)ᵀ | p′p ∈ W⁻¹(Xo(0)), p′a ∈ W(p′p) \ {Xo(0)}}; but for all p′p ∈ W⁻¹(Xo(0)), one
has Xu ∉ W(p′p) \ {Xo(0)}, unless the user can get the object without the robot's
base moving, which is trivial.
Finally, for m = 8, we find a walk w = λ1 a λ2 b λ3 a λ4 c, which translates to w =
(a, p1)(b, p2)(a, p3)(c, p4), with M1(·) = λ(· ; Λ(λ1), a), M2(·) = λ(· ; Λ(λ2), b), M3(·) =
λ(· ; Λ(λ3), a), and M4(·) = λ(· ; Λ(λ4), c). Since the composition of maps M(p0) =
{(p′pᵀ, p′aᵀ)ᵀ | p′p ∈ R² × S¹, p′a ∈ W(p′p)} allows p′a = Xu, this walk is a candidate,
and the search is terminated.
Now we resort to DP to obtain the optimal sequence of parameter vectors
pi = (ppiᵀ, paiᵀ)ᵀ for i = 1, . . . , 4 = N . We have p4 = pf , which must satisfy π2(p4) = Xu.
The range of possible parameter values at each stage is:

    P1 = M1((Xp(0)ᵀ, Xuᵀ)ᵀ) ∩ (M4 ◦ M3 ◦ M2)⁻¹( W⁻¹(Xu) × {Xu} )
       = ( R² × S¹ \ ({Xp(0)} ∪ W⁻¹(Xu)) ) × {Xu} ∩ W⁻¹(Xo(0)) × R³
       = W⁻¹(Xo(0)) × {Xu}

    P2 = (M4 ◦ M3)⁻¹( W⁻¹(Xu) × {Xu} ) ∩ M2 ◦ M1((Xp(0)ᵀ, Xuᵀ)ᵀ)
       = ( R² × S¹ \ W⁻¹(Xu) ) × ( R³ \ {Xu} ) ∩ W⁻¹(Xo(0)) × {Xo(0)}
       = W⁻¹(Xo(0)) × {Xo(0)}

    P3 = M3 ◦ M2 ◦ M1((Xp(0)ᵀ, Xuᵀ)ᵀ) ∩ M4⁻¹( W⁻¹(Xu) × {Xu} )
       = ( R² × S¹ \ W⁻¹(Xo(0)) ) × {Xo(0)} ∩ W⁻¹(Xu) × ( R³ \ {Xu} )
       = W⁻¹(Xu) × {Xo(0)}

    P4 = W⁻¹(Xu) × {Xu} .
We discretize the domain of parameter pp using a polar coordinate system, in which the
radial increment between successive parameter settings is 0.06 m, and the angular in-
crement is 10°. Figure 3.3 shows two sets of possible parameter settings for the position
component of pp, clustered around the object's position at Xo = (−1, 2, 0.3)ᵀ, and the
user's location at Xu = (2, 3, 0.4)ᵀ. The geometric parameters of the robot are l1 = l2 =
0.2 m and hp = 0.15 m. Then, after setting

    rmin(z) = √( max{0, (l1 − l2)² − (z − hp)²} ),    rmax(z) = √( (l1 + l2)² − (z − hp)² ),

the domains P1, P2, P3 and P4 are covered by sets of points {p1[k]}_{k=1}^{N1},
{p2[k]}_{k=1}^{N2}, {p3[k]}_{k=1}^{N3}, and {p4[k]}_{k=1}^{N4}, respectively, where

    N1 = N2 = 36 ⌊ (rmax(0.3) − rmin(0.3)) / 0.06 ⌋ ,    N3 = N4 = 36 ⌊ (rmax(0.4) − rmin(0.4)) / 0.06 ⌋ .
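The counts N1, . . . , N4 follow directly from the stated increments and geometry; a quick check using only the values given in the case study (0.06 m radial steps, 10° angular steps giving 36 sectors, l1 = l2 = 0.2 m, hp = 0.15 m):

```python
import math

l1 = l2 = 0.2   # link lengths [m]
hp = 0.15       # arm mounting height [m]

def r_min(z):
    return math.sqrt(max(0.0, (l1 - l2) ** 2 - (z - hp) ** 2))

def r_max(z):
    return math.sqrt((l1 + l2) ** 2 - (z - hp) ** 2)

# 36 angular sectors times the number of 0.06 m radial rings
N1 = N2 = 36 * math.floor((r_max(0.3) - r_min(0.3)) / 0.06)  # object at z = 0.3
N3 = N4 = 36 * math.floor((r_max(0.4) - r_min(0.4)) / 0.06)  # user at z = 0.4
print(N1, N3)
```

With l1 = l2, the inner radius rmin vanishes, so the rings cover the full disk of radius rmax(z) around each target.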
[Figure: plot omitted; axes X ∈ [−1.5, 2.5], Y ∈ [1, 3.5].]

Figure 3.3: Discretized workspace for the mobile manipulator and optimal path. The
two concentric collections of points mark parameter class representatives around the
object and user positions.
The DP algorithm described in Section 3.5 runs as follows.

i = 4: For every p3[k], compute (3.6)

    P4*(p3[k]) = arg min_{p4[j] ∈ P4} gc(p3[k], p4[j]) .

i = 3, 2: For every p2[k] and p1[k], compute (3.7)

    P3*(p2[k]) = arg min_{p3[j] ∈ P3} { ga(p2[k], p3[j]) + J4*(p3[j]) }
    P2*(p1[k]) = arg min_{p2[j] ∈ P2} { gb(p1[k], p2[j]) + J3*(p2[j]) }

i = 1: Finish by evaluating (3.8) for z0 = (Xp(0)ᵀ, Xo(0)ᵀ)ᵀ

    P1*(p0) = arg min_{p1[j] ∈ P1} { ga(z0, p1[j]) + J2*(p1[j]) } .
We find p1* = (−0.60, 1.85, 2.79, 2, 3, 0.4), p2* = (−0.60, 1.85, 2.79, −1, 2, 0.3), p3* =
(2.10, 2.91, 2.62, −1, 2, 0.3), and p4* = (2.10, 2.91, 2.62, 2, 3, 0.4). The accumulated cost
is Jw(z0, p0) = gc + ga + gb + ga = 5.49 + 31.72 + 6.08 + 35.80 = 79.09 seconds. The resulting
path of the mobile manipulator on the horizontal plane is shown in Fig. 3.3.
3.7 Conclusions
In this chapter, an abstraction method for a special class of hybrid systems is in-
troduced, which approximates the original system with a discrete system of manageable
size and is independent of control objectives. The method depends on the convergent
continuous dynamics of the system, which affords a partitioning of the continuous state
space based on the asymptotic properties of the vector fields, and on the capacity of the
system to re-parametrize its continuous controllers. With this abstraction method, an
abstraction-based optimal planning method is demonstrated. Although the
method applies only to this special class of systems, the class
represents a wide range of systems for which low-level stable controllers have been
designed and can be reused to achieve complicated control objectives.
The partitioning introduced by this abstraction gives rise to purely discrete
abstract systems (with no dynamics on the continuous values) which are weakly simulated
by the underlying concrete hybrid dynamics. Since the method does not require a state
quantization, the state-explosion problem is avoided and the solution thus scales to
practical applications. Moreover, as the abstract model is derived directly from the
dynamics of the underlying hybrid agents, there is no dependence on the control
objective. Hence, once the specification or objective changes, the same abstract
model can be reused for optimal planning, with a newly designated initial state and
set of final states.
With the established weak (bi)simulation relation, it is guaranteed that any
plan generated by this abstraction-based planning method is implementable in the
concrete hybrid system, provided the model of the system is correct and there is no
external disturbance from the environment. Due to the over-approximation incurred
in the process of abstraction, the plan is in general suboptimal, except under some special
conditions which are identified.
Chapter 4
ADAPTIVE CONTROL SYNTHESIS WITH PERFECT INFORMATION
4.1 Overview
In this chapter, we show that game theory and grammatical inference can be
jointly utilized to synthesize and implement adaptive controllers for finite-state tran-
sition systems operating in unknown, dynamic, and potentially adversarial environ-
ments. Finite-state transition systems can arise as discrete abstractions of dynamical
systems [5, 13, 92, 93]. The synthesis of controllers becomes adaptive in the sense that
the agent completes the information that is missing from its model about its environ-
ment, and subsequently updates its control policy, during execution time.
Reactive synthesis and algorithmic game theory have been introduced for formal
control design in the presence of dynamic environments, in which the system computes
the control output based on real-time information [34, 63, 106]. In that line of work, an assump-
tion on the environment is known [63], and the specification of the system is satisfiable
provided the environment dynamics satisfy this assumption. For cases when the envi-
ronment is partially unknown and continuously discovered as the system executes its
actions, an iterative planning framework has been developed [70]. However, the re-synthesized
motion plan cannot guarantee satisfaction of the original task specification.
The following question is raised in this dissertation: is there a method to convert
the problem of designing a symbolic controller for a system that interacts with an un-
known adversarial environment into a synthesis problem where the environment dynamics
are known? If so, several known design solutions, e.g., algorithmic game theory [45] and
discrete event system (des) control theory [21], could be applied. This chapter answers
this question in the affirmative by proposing a framework that incorporates learning into
control design at the abstract level. The identification of the environment model, and
subsequently of all possible interactions with it, is essentially a process of inference: the
generalization of a formal object that can describe this model from a finite amount
of observed behaviors. Grammatical inference (GI), a sub-field of machine learn-
ing, is a paradigm that identifies formal objects through presentations of examples, with
or without a teacher [30], and thus the methodology naturally fits this problem
formulation.
Fig. 4.1 serves as a graphical description of our framework integrating learn-
ing with control synthesis. With product operations, we combine the system, its task
specification, and its unknown dynamic environment into a game. On a game graph
constructed from the agent's inferred model of its environment, a controller is then
derived. The environment model may be crude and incorrect in the beginning, but as
the agent collects more observations, its grammatical inference module refines it, and,
under certain conditions on the observation data and the environment model structure,
the fidelity of the continuously updated model converges to a point where a controller
is found, whenever the latter exists.
[Figure: block diagram omitted; blocks include robot(s)/environment, abstraction,
identification (GIM), planning/learning, control, transition systems, actuators,
sensors, and the specification, connected in a feedback loop.]

Figure 4.1: The architecture of hybrid planning and control with a module for gram-
matical inference.
The framework is modular and flexible: different types of grammatical infer-
ence algorithms can be applied to learn the behavior of the environment under certain
conditions, without imposing constraints on the method to be used for control; learn-
ing is decoupled from control. In the following, we propose two different approaches
for implementing adaptive control design within this framework and demonstrate their
effectiveness with a robot motion planning example.
4.2 Grammatical Inference and Infinite Games
This section gives some background on temporal logic, algorithmic games, and
grammatical inference.
4.2.1 Infinite Words
Given a finite alphabet Σ, the set of infinite sequences is denoted Σω. A word
w ∈ Σω is called an ω-word. An ω-regular language L is a subset of Σω. The set of prefixes
of an ω-regular language L is denoted Pr(L) = {u ∈ Σ* | (∃w ∈ L)(∃v ∈ Σω)[uv = w]}.
Given an ω-word w, Occ(w) denotes the set of symbols occurring in w, and Inf(w) is
the set of symbols occurring infinitely often in w. Given a finite word w ∈ Σ*, last(w)
denotes the last symbol of w. We refer to the (i + 1)-th symbol in a word w by writing
w(i); the first symbol in w is indexed with i = 0.
We extend the definition of automata in Section 3.2.1 to machines that
accept ω-regular languages. An automaton is a quintuple A = 〈Q, Σ, T, I, Acc〉 where
〈Q, Σ, T 〉 is a semiautomaton (sa) deterministic in transitions, I is the set of initial
states, and Acc is the acceptance component. A word w = σ0σ1 . . . generates the run
ρw = q0q1 . . . in A if and only if T (qi, σi) = qi+1, for 0 ≤ i < |w|. Different types of
acceptance components give rise to:

• finite state automata, in which case Acc = F ⊆ Q, and A accepts w ∈ Σ* if the
run ρw ∈ Q* satisfies ρw(0) ∈ I and last(ρw) ∈ F , and

• Büchi automata, in which case Acc = F ⊆ Q, and A accepts w ∈ Σω if the run
ρw ∈ Qω satisfies ρw(0) ∈ I and Inf(ρw) ∩ F ≠ ∅.
The set of (in)finite words accepted by A is the language of A, denoted L(A).
An automaton is deterministic if it is deterministic in transitions and I is a singleton.
In this case, with a slight abuse of notation, we denote by I the single initial state. A
dfa with the fewest states recognizing a language L is called a canonical
automaton for L. Unless otherwise specified, we understand that A is the sa obtained
from an fsa A by unmarking the initial and final states of A.

An automaton is complete if for any q ∈ Q and σ ∈ Σ, T (q, σ) is defined. Any
automaton can be made complete by adding a non-final state sink such that whenever
T (q, σ) is undefined for some q ∈ Q and σ ∈ Σ, we let T (q, σ) = sink.
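For ultimately periodic words of the form u vω, Büchi acceptance on a deterministic complete automaton can be decided by running the automaton until the state at the start of a v-block repeats; the symbols occurring infinitely often in the run are exactly those inside the repeating cycle. The automaton below is an illustrative two-state example, not one from this dissertation.

```python
def buchi_accepts(T, q0, F, u, v):
    """T: dict (state, symbol) -> state (deterministic, complete on the
    symbols used).  Accepts u v^omega iff the run visits F infinitely often."""
    q = q0
    for a in u:                       # consume the finite prefix u
        q = T[(q, a)]
    seen = {}                         # state at the start of each v-block
    blocks = []                       # states visited within each v-block
    while q not in seen:
        seen[q] = len(blocks)
        block = []
        for a in v:
            q = T[(q, a)]
            block.append(q)
        blocks.append(block)
    # Inf(run) = states inside the repeating cycle of v-blocks
    inf_states = set().union(*blocks[seen[q]:])
    return bool(inf_states & set(F))

# q1 is visited infinitely often iff the word has infinitely many a's.
T = {('q0', 'a'): 'q1', ('q0', 'b'): 'q0',
     ('q1', 'a'): 'q1', ('q1', 'b'): 'q0'}
print(buchi_accepts(T, 'q0', {'q1'}, u='bb', v='ab'))  # True
print(buchi_accepts(T, 'q0', {'q1'}, u='ab', v='b'))   # False
```

This lasso-shaped check is the standard way the Inf(·) condition above is evaluated on concrete, finitely represented ω-words.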
4.2.2 Grammatical Inference
A positive presentation φ of a language L is a total function φ : N → L ∪ {#}
such that for every w ∈ L, there exists n ∈ N such that φ(n) = w [55]. Here # denotes
a pause, a moment in time when no information is forthcoming. A presentation φ can
also be understood as an infinite sequence φ(0)φ(1) · · · containing every element of L,
interspersed with pauses. Let φ[i] denote the finite sequence φ(0)φ(1) . . . φ(i).

Grammars are finite descriptions of potentially infinite languages. The language
of a grammar G is L(G). A learner (learning algorithm, or grammatical inference
machine (GIM)) is a program that takes the first i elements of a presentation, i.e.,
φ[i], and outputs a grammar G, written GIM(φ[i]) = G. The grammar output
by the GIM is the learner's hypothesis of the language. A learner GIM identifies in the
limit from positive presentations a class of languages L if for all L ∈ L, and for all
presentations φ of L, there exists an n ∈ N such that for all m ≥ n, GIM outputs a
grammar GIM(φ[m]) = G with L(G) = L [44].
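A standard textbook illustration of identification in the limit (not taken from this dissertation) is the learner that conjectures, as its grammar, exactly the set of strings observed so far; this GIM identifies the class of finite languages from positive presentations, since after every element of L has appeared the hypothesis never changes.

```python
def gim(prefix):
    """prefix = phi[i]: a list of strings and '#' pauses.  The hypothesis
    grammar is represented extensionally, as a frozenset of strings."""
    return frozenset(w for w in prefix if w != '#')

L = {'ab', 'aab'}                       # target finite language
phi = ['ab', '#', 'aab', '#', 'ab']     # a prefix of a positive presentation

# The learner's successive hypotheses GIM(phi[0]), GIM(phi[1]), ...
hypotheses = [gim(phi[:i + 1]) for i in range(len(phi))]
print(hypotheses[-1] == frozenset(L))   # True: converged to the target
```

Richer language classes require more sophisticated generalization; the point of the definition is only that the hypothesis stabilizes on a correct grammar after finitely many observations.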
4.2.3 Specification Language
We use ltl [33] to concisely specify desired system properties such as response,
liveness, safety, stability, and guarantee [28]. Informally speaking, ltl allows one to
reason about the change over time of the truth values of logical propositions. ltl is
built recursively from a set of predicates P as follows:

    ϕ := p | ¬ϕ | ϕ1 ∨ ϕ2 | ◯ϕ | ϕ1 U ϕ2 | ⊤ | ⊥,    p ∈ P,

where ⊤ and ⊥ are unconditional true and false, respectively, and "next" (◯) and
"until" (U) are temporal operators.

Given negation (¬) and disjunction (∨), we can define conjunction (∧), impli-
cation ( =⇒ ), and equivalence (⇔). Additional temporal operators can be derived,
such as "eventually" (♦ϕ ≝ ⊤ U ϕ) and "always" (□ϕ ≝ ¬♦¬ϕ).

An ltl formula ϕ over P can be translated into an ω-regular language over the
alphabet 2^P, and one can construct a Büchi automaton that accepts this language with
the methods in [9, 41].
4.2.4 Infinite Games
This section briefly reviews deterministic turn-based, two-player zero-sum games
with perfect information.

Definition 8 ( [45]). A two-player turn-based zero-sum game is a tuple G = 〈V1 ∪ V2,
Σ1 ∪ Σ2, T, I, F 〉, where 1) Vi is the set of states at which player i moves, with
V1 ∩ V2 = ∅ and V = V1 ∪ V2; 2) Σi is the set of actions for player i, with Σ1 ∩ Σ2 = ∅;
3) T : Vi × Σi → Vj is the transition function, where (i, j) ∈ {(1, 2), (2, 1)}; 4) I is
the set of initial game states; and 5) F ⊆ V1 ∪ V2 is the winning condition: a run ρ
is winning for player 1 if Occ(ρ) ∩ F ≠ ∅ in reachability games, Occ(ρ) ⊆ F in safety
games, and Inf(ρ) ∩ F ≠ ∅ in Büchi games.

A run ρ = v0v1v2 . . . ∈ V * (resp. V ω) is a finite (resp. infinite) sequence of states such that
for any 0 ≤ i < |ρ|, there exists σ ∈ Σ with T (vi, σ) = vi+1. A play p = v0σ0v1σ1 . . . ∈
(V ∪ Σ)* (or (V ∪ Σ)ω) is a finite (or infinite) interleaved sequence of states and actions
such that the projection of p onto V is a run ρ in the game, and the projection of p
onto Σ is a word that generates the run ρ.
A strategy for player i in game G is a function Si : V *Vi → 2^{Σi} that takes a
run ρ and outputs a set of actions for player i to take. It satisfies the condition that,
for any run ρ ∈ V *Vi, σ ∈ Si(ρ) implies that T (last(ρ), σ) is defined. A memoryless
strategy Si : Vi → 2^{Σi} outputs an action for player i depending only on the current
state of the game. For reachability and Büchi games, a memoryless winning strategy
always exists for one of the players [45].

We say player 1 follows strategy S1 in a play p = v0σ0v1σ1 · · · if for all n ≥ 1,
σ2n−2 ∈ S1(v0v1 · · · v2n−2). The definition of player 2 following strategy S2 is
obtained dually. An initialized game, denoted (G, v0), is the game G with a designated
initial state v0 ∈ I. A strategy WSi is winning for player i if every run in
(G, v0), with player i adhering to WSi, results in player i winning. The winning region
of player i, denoted Wini, is the set of states from which player i has a winning strategy.
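For reachability games, player 1's winning region and a memoryless winning strategy can be computed with the classical attractor fixed point: iteratively add player 1 states with some successor already winning, and player 2 states all of whose successors are winning. The game graph below is an illustrative toy, not one from this dissertation.

```python
def reach_attractor(V1, V2, edges, F):
    """edges: dict v -> {action: successor}.  Returns (Win1, memoryless
    strategy on player-1 states)."""
    win = set(F)
    strategy = {}
    changed = True
    while changed:
        changed = False
        for v in V1 | V2:
            if v in win:
                continue
            succ = edges.get(v, {})
            if v in V1 and any(t in win for t in succ.values()):
                # player 1 needs only one action into the winning set
                strategy[v] = next(s for s, t in succ.items() if t in win)
                win.add(v); changed = True
            elif v in V2 and succ and all(t in win for t in succ.values()):
                # player 2 cannot avoid the winning set
                win.add(v); changed = True
    return win, strategy

V1, V2 = {'a', 'c'}, {'b', 'd'}
edges = {'a': {'x': 'b'}, 'b': {'y': 'c', 'z': 'd'},
         'c': {'x': 'goal'}, 'd': {'y': 'd'}}
win, strat = reach_attractor(V1 | {'goal'}, V2, edges, F={'goal'})
print('c' in win, 'a' in win)  # True False: from b, player 2 escapes to d
```

The strategy recorded at each newly added player-1 state is memoryless by construction, matching the existence result cited above.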
We define a projection operator for a tuple s = (s1, . . . , sN ) ∈ S1 × S2 × . . . × SN
as πi((s1, . . . , sN )) = si, for 1 ≤ i ≤ N . For a set of tuples S, we write πi(S) =
⋃_{s∈S} πi(s), and for a sequence of tuples w = s1s2 . . . , we apply the operator element-
wise in the form πi(w) = πi(s1)πi(s2) . . . .
4.3 System Behavior as Game Play
First, the interaction between two dynamical systems (the agent and its
environment) is modeled as a two-player, turn-based, zero-sum game in which the task
specification of the system determines the winning condition for the system player.
4.3.1 Constructing the Game
Let AP be the set of atomic propositions describing world states (i.e., the state
of the combined agent-environment system). The set of world states C is defined to
be the set of all conjunctions of atomic propositions or their negations, i.e., C = {c =
ℓ1 ∧ ℓ2 ∧ . . . ∧ ℓn | (∃α ∈ AP)[ℓi = α ∨ ℓi = ¬α]}, such that each proposition
in AP appears at most once in any c ∈ C.
Assume now that the behavior of both the agent (player 1) and its environ-
ment (player 2) can be captured by some labeled transition system (lts), A1 =
〈Q1,Σ1, T1,AP1, LB1〉 for player 1, and A2 = 〈Q2,Σ2, T2,AP2, LB2〉, for player 2, where
for i = 1, 2, each component 〈Qi,Σi, Ti〉 is a sa, AP i is the set of atomic propositions
that can be changed by player i’s actions, AP = AP1 ∪ AP2, and LBi : Qi → C is a
labeling function.
We assume an action σ ∈ Σi has conditional effects:
1. the pre-condition of action σ, denoted Pre(σ) ∈ C, is a sentence that needs to
be satisfied for the player to initiate action σ, and
2. the post-condition of action σ, denoted Post(σ) ∈ C, is the sentence that is
satisfied when action σ is completed.
Given c ∈ C, if c =⇒ Pre(σ), then the effect of action σ on c, denoted σ(c) ∈ C, is
the unique world state after performing σ when the world state is c. These conditional
effects can be directly related with the pre- and post-conditions of control modes in a
hybrid agent (see Chapter 3).
Without loss of generality, we assume the alphabets of A1 and A2 to be disjoint,
i.e., Σ1 ∩ Σ2 = ∅. A player i may give up her turn, in which case we say
that she "plays" a generic (silent) action εi ∈ Σi. We assume Pre(εi) = Post(εi) = ⊤,
for i = 1, 2. In addition, εi(c) = c, i.e., a silent action cannot change the world state.
Definition 9 (Turn-based product). Given the models of system and environment
A1 = 〈Q1,Σ1, T1,AP1, LB1〉 and A2 = 〈Q2,Σ2, T2,AP2, LB2〉, the turn-based product
P = 〈Q,Σ, δ,AP , LB〉 is a lts denoted A1 A2, defined as follows:
Q = Q1 × Q2 × {0, 1} is the set of states, where the last component is a Boolean
variable t ∈ {0, 1} denoting whose turn it is to play: t = 1 for player 1, t = 0
for player 2.
Σ = Σ1 ∪ Σ2 is the alphabet.
δ is the transition relation: δ((q1, q2, t), σ) = (q′1, q2, 0) if t = 1 and q′1 = T1(q1, σ),
with LB1(q1) ∧ LB2(q2) =⇒ Pre(σ); and δ((q1, q2, t), σ) = (q1, q′2, 1) if t = 0 and
q′2 = T2(q2, σ), with LB1(q1) ∧ LB2(q2) =⇒ Pre(σ).
LB : Q → C is the labeling function, defined for (q1, q2, t) ∈ Q by LB(q1, q2, t) =
LB1(q1) ∧ LB2(q2) ≠ ⊥.
The time complexity of constructing P is polynomial in the sizes of the two
players' ltss.
The task specification is given as an ltl formula Ω over AP and can be translated
into a language over the set of world states [41], accepted by a complete deterministic
automaton As = 〈S, C, Ts, Is, Fs〉 with sink ∈ S. Intuitively, the task specification
encoded in As specifies a set of histories over the world states.
The turn-based product P gives snapshots of different stages of a game. It does
not capture the game history that led to each stage. We overcome this lack
of memory in P through another product operation, between P and As.
Definition 10 (Two-player turn-based game automaton). Given the turn-based prod-
uct P = 〈Q,Σ, δ, LB〉 and the task specification As = 〈S, C, Ts, Is, Fs〉, a two-player
turn-based game automaton is constructed as a special product of P and As, denoted
G = P nAs = (A1 A2)nAs = 〈V,Σ, T, I, F 〉, where
V = V1 ∪ V2, where V1 ⊆ {(q, s) | q = (q1, q2, 1) ∈ Q ∧ s ∈ S} is the set of states at
which player 1 makes a move and V2 ⊆ {(q, s) | q = (q1, q2, 0) ∈ Q ∧ s ∈ S} is
the set of states of player 2.

T : V × Σ → V is the transition relation, defined by T((q, s), σ) = (q′, s′) if and
only if δ(q, σ) = q′ and Ts(s, c) = s′ with c = LB(q′).

I = {(q, s) ∈ V | s = Ts(Is, LB(q))} is the set of possible initial game states.

F = {(q, s) ∈ V | s ∈ Fs} is the winning condition.
From P and As the game automaton G is constructed in time polynomial in the
size of P and As. With a slight abuse of notation, the labeling function in G is defined
as LB(v) = LB(π1(v)) where π1(v) ∈ Q.
For a fixed initial state v0 ∈ I, when As is a dfa, (G, v0) is a reachability game.
When As is a deterministic Büchi automaton (dba), (G, v0) is a Büchi game. The runs
in (G, v0) and As are related as follows:
ρ in G :   (q(0), s(0)) —σ1→ (q(1), s(1)) . . .
ρs in As : Is —LB(q(0))→ s(0) —LB(q(1))→ s(1) . . .   (4.1)

where LB(q(1)) = σ1(LB(q(0))).
If player 1 wins (G, v0), then the task specification encoded in As is satisfied when player 1 follows her winning strategy:

Proposition 3. For any winning run ρ ∈ V∗ (or Vω) of player 1 in G, LB(ρ) ∈ C∗ (or
Cω) is accepted by As.
Proof. Since ρ is winning for player 1, in a reachability (resp. Büchi) game, last(ρ) ∈ F (resp. Inf(ρ) ∩ F ≠ ∅). Projecting ρ on the state set S of As, we obtain last(π2(ρ)) ∈ π2(F) ⊆ Fs (resp. Inf(π2(ρ)) ∩ Fs ≠ ∅). Since the run in As corresponding to ρ is
ρs = Is π2(ρ) by (4.1), we have last(ρs) ∈ Fs (resp. Inf(ρs) ∩ Fs ≠ ∅), and thus the
word generating ρs, which is LB(ρ), is accepted by As by the definition of the acceptance
component.
4.3.2 Game Theoretic Control Synthesis
For a game G = 〈V1 ∪ V2, Σ1 ∪ Σ2, T, I, F〉 and a set of states X ⊆ V, the
attractor [45] of X, denoted Attr(X), is the largest set of states W ⊇ X in G from
which player 1 can force a run into X. It is defined recursively as follows. Let W0 = X
and set
Wi+1 := Wi ∪ {v ∈ V1 | (∃σ : T(v, σ) ↓)[T(v, σ) ∈ Wi]}
          ∪ {v ∈ V2 | (∀σ : T(v, σ) ↓)[T(v, σ) ∈ Wi]} .   (4.2)
Since G is finite, there exists a smallest m ∈ N such that Wm+1 = Wm = Attr(X).
If G is a reachability game, the winning region of player 1 is Win1 = Attr(F ) and
the winning region of player 2 is Win2 = V \Win1. Player 1 has a memoryless winning
strategy if the game starts at some initial state v0 ∈ Win1 ∩ I. Given v0 ∈ Win1 ∩ I, the
memoryless winning strategy WS1 is computed as follows: (1) obtain the subsets
Yi, i = 0, . . . , m, by letting Y0 = W0 = F and setting Yi := Wi \ Wi−1 for
all i ∈ {1, . . . , m}; (2) for v ∈ Yi ∩ V1 with 1 ≤ i ≤ m, define WS1(v) = {σ ∈ Σ1 | T(v, σ) ∈ Yi−1}.
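The attractor recursion (4.2) and the level-set strategy extraction admit a direct transcription. The encodings are illustrative (T maps (state, action) pairs to successor states), not the thesis' implementation:

```python
def attractor(V1, V2, T, X):
    """Level sets of (4.2): levels[i] = Y_i = W_i \\ W_{i-1}; union = Attr(X)."""
    W = set(X)
    levels = [set(X)]
    while True:
        new = set()
        for v in V1 - W:
            # player-1 state: some enabled action leads into W
            if any(u in W for (s, a), u in T.items() if s == v):
                new.add(v)
        for v in V2 - W:
            # player-2 state: every enabled action leads into W (and one exists)
            succ = [u for (s, a), u in T.items() if s == v]
            if succ and all(u in W for u in succ):
                new.add(v)
        if not new:
            return levels, W
        W |= new
        levels.append(new)

def reachability_strategy(V1, T, levels):
    """Memoryless WS1: at each v in Y_i ∩ V1 (i >= 1), pick an action into Y_{i-1}."""
    rank = {v: i for i, Y in enumerate(levels) for v in Y}
    WS1 = {}
    for (v, a), u in T.items():
        if v in V1 and rank.get(v, 0) >= 1 and rank.get(u, -1) == rank[v] - 1:
            WS1.setdefault(v, a)
    return WS1
```

Each application of the strategy decreases the level index, so play reaches Y0 = F in at most m moves.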
In the case that G is a Büchi game, the winning region of player 1, Win1, is
obtained by recursively computing the set of states Z [45] in the following way:
1. Z0 = V ,
2. for i ≥ 0, Xi := Attr(Zi), and
Yi := {v ∈ V1 | (∃σ ∈ Σ1 : T(v, σ) ↓)[T(v, σ) ∈ Xi]}
    ∪ {v ∈ V2 | (∀σ ∈ Σ2 : T(v, σ) ↓)[T(v, σ) ∈ Xi]} ,   (4.3)
3. Zi+1 = Yi ∩ F .
The set Z = Zm = Zm+1 is the fixed point and the winning region for player 1
is Win1 := Attr(Z). The memoryless winning strategy WS1 of player 1 on Win1 is
computed as follows: (1) for v ∈ Win1 \ Z the winning strategy is defined in the same
way as that for the reachability game on the graph of G in which Z is the winning
condition; (2) if v ∈ Z, then define WS1(v) = {σ ∈ Σ1 | T(v, σ) ∈ Win1}. Applying
WS1(v) leads to a state within Win1, from which point onwards the strategy defined in
case (1) applies.
The time complexities of solving reachability and Büchi games are linear and
polynomial, respectively, in the size of the game automaton G.
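The Büchi fixed point of steps 1-3 can be transcribed compactly. As before, the encoding is illustrative (T maps (state, action) pairs to successors), and `attr` re-implements the attractor of (4.2) so the sketch is self-contained:

```python
def attr(V1, V2, T, X):
    """Attr(X): states from which player 1 can force a visit to X."""
    W = set(X)
    while True:
        new = set()
        for v in (V1 | V2) - W:
            succ = [u for (s, a), u in T.items() if s == v]
            if v in V1 and any(u in W for u in succ):
                new.add(v)
            elif v in V2 and succ and all(u in W for u in succ):
                new.add(v)
        if not new:
            return W
        W |= new

def buchi_win1(V1, V2, T, F):
    """Iterate Z0 = V, Xi = Attr(Zi), Yi as in (4.3), Zi+1 = Yi ∩ F."""
    Z = V1 | V2
    while True:
        X = attr(V1, V2, T, Z)
        Y = set()
        for v in V1 | V2:
            succ = [u for (s, a), u in T.items() if s == v]
            if v in V1 and any(u in X for u in succ):
                Y.add(v)
            elif v in V2 and succ and all(u in X for u in succ):
                Y.add(v)
        Z_next = Y & set(F)
        if Z_next == Z:            # fixed point reached
            return attr(V1, V2, T, Z)   # Win1 = Attr(Z)
        Z = Z_next
```

Each outer iteration removes at least one state from Z, which bounds the number of iterations by |V| and yields the polynomial complexity cited above.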
4.4 Integrating Learning with Control
The problem of synthesizing a strategy (controller) has a solution if player 1
has full knowledge of the game that is being played. Suppose, however, that player 1
has knowledge of her own capabilities and objective, but does not have full knowledge
of the capabilities of player 2. How can player 1 plan effectively given her incomplete
knowledge about which game she is actually playing?
In this section, we show that when player 1 does not have complete knowledge of
her opponent, and hence of the game, the integration of GI as a learning mechanism,
in a setting where the game is repeated sufficiently many times, can eventually give
player 1 a winning strategy.
4.4.1 Languages of the Game
The intuition is that, at the abstraction level, the behaviors of both the system
and its environment can be understood as languages. The identification of a
model of the environment, or of the interaction, thus becomes the problem of learning the
language of the environment or of the game. We first define what we mean by the
languages of the environment and of the game.
For a designated initial state v0 ∈ V , the language of the game (G, v0), denoted
L(G, v0) = L(〈V,Σ, T, v0, V 〉), is the set of finite prefixes of all possible behaviors of
two players (sequences of interleaving actions) in the game. The language of player i,
for i ∈ {1, 2}, is the projection of L(G, v0) on Σi, denoted Li(G, v0).
We assume the game (G, v0) is played repeatedly. During game play, the
system obtains positive presentations of the languages of the game and of the environment.
Let φ be the presentation of L(G, v0) obtained in the repeated game; define
φ(0) = λ, and denote by φ[i] the presentation obtained after move i = 1, . . . , n. Since
games are repeated, the move index i counts from the first move in the very first game
until the current move in the latest game. If move i+1 is the first in one of the repeated
games and player k plays σ ∈ Σk then φ(i + 1) = σ; otherwise φ(i + 1) = φ(i)σ. The
projection of φ on player 2’s alphabet is a positive presentation of L2(G, v0), denoted
φ2.
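The bookkeeping for φ and its projection φ2 can be sketched as follows. Single-character actions and the alphabet below are toy stand-ins for the players' actual alphabets:

```python
def extend_presentation(phi, move, new_game):
    """phi(i+1) = move if move i+1 starts a new game, else phi(i) + move."""
    return move if new_game else phi + move

def project(word, sigma_i):
    """Projection of an action sequence onto player i's alphabet."""
    return ''.join(a for a in word if a in sigma_i)

phi = ''
history = []  # the prefixes phi[1], phi[2], ... observed so far
for move, new_game in [('a', True), ('x', False), ('b', False), ('y', False)]:
    phi = extend_presentation(phi, move, new_game)
    history.append(phi)
```

With Σ2 = {x, y}, the projection of the final prefix onto player 2's alphabet yields the positive presentation φ2 fed to the GIM.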
4.4.2 Learning the Game — a First Approach
In this section we consider the case where the system cannot interfere with the
environment's actions during their interaction, that is, L2(G, v0) = L(〈Q2, Σ2, T2, q20, Q2〉), where q20 is the initial state of the environment given the initial game state v0;
that is, v0 = (q0, s0) and q0 = (q10, q20, t). When L2(G, v0) belongs to certain classes of
languages, we can in this case directly identify the model of the environment in
the form of an sa, and consequently the game.
The assumptions that allow the implementation of the first approach are the
following:
Assumption 1. 1) Player 1 cannot restrict player 2; 2) The model of player 2 is
identifiable in the limit from positive presentations by a GIM; 3) Player 1 has prior
knowledge for selecting the correct GIM; and 4) the observed behavior of player 2 suffices
for a correct inference to be made, i.e., it contains a characteristic sample.
Definition 11. Let L be a class of languages identifiable in the limit from positive
presentation by a normal-form learner GIM, the output of which is an fsa. Then we
say that an sa A = 〈Q,Σ, T 〉, where sink /∈ Q, is identifiable in the limit from positive
presentations if for any q0 ∈ Q, the language accepted by fsa A = 〈Q,Σ, T, q0, Q〉 is
in L, and given a positive presentation φ of L(A), there exists an m ∈ N such that
∀n ≥ m, GIM(φ[m]) = GIM(φ[n]) = A. The learner GIMSA for sa A is constructed
from the output of GIM by unmarking the initial and final states.
Let SA(GIM) be the set of sas identifiable in the limit from positive presentations
by the normal-form learner GIM. Now given an sa A1, an objective As, and a class of
semiautomata SA, define the class of games
GAMES(A1, As, SA) = {G | ∃A2 ∈ SA : G = (A1 A2) ⋉ As} .
For this class of games we have the following result.
Theorem 3. If, for all A2 ∈ SA(GIM), there exists Â2 ∈ range(GIM) such that L(Â2) =
L2((A1 A2) ⋉ As), then GAMES(A1, As, SA(GIM)) is identifiable in the limit from
positive presentations.
Proof. For any game G ∈ GAMES(A1, As, SA(GIM)) and any data presentation φ of
L(G, v0), denote by φ2[n], for n ∈ N, the projection of φ[n] on Σ2. Then define a learning
algorithm Alg as follows:

∀φ, ∀n ∈ N, Alg(φ[n]) = (A1 GIMSA(φ2[n])) ⋉ As .

We show that Alg identifies GAMES(A1, As, SA(GIM)) in the limit.

To this end, consider any game G ∈ GAMES(A1, As, SA(GIM)); there is an
A2 ∈ SA(GIM) such that G = (A1 A2) ⋉ As. Consider now any data presentation φ
of L(G, v0). Then φ2 is a data presentation of L2(G, v0). By assumption there exists
Â2 ∈ range(GIM) such that L(Â2) = L2(G, v0). Thus φ2 is also a data presentation
of L(Â2). Therefore, there is an m ∈ N such that for all n ≥ m it is the case that
GIMSA(φ2[n]) = A2. Consequently, there is m′ = 2m such that for all n ≥ m′,
Alg(φ[n]) = (A1 A2) ⋉ As = G.

Since G and φ were selected arbitrarily, the proof is complete.
Winning Strategy:        WS1[0]   WS1[1]   . . .   WS1[i]   . . .  →  WS1
                           ↑        ↑                ↑
Hypothesis of the Game:   G[0]     G[1]    . . .    G[i]    . . .  →  G
                           ↑        ↑                ↑
Hypothesis of Player 2:   A2[0]    A2[1]   . . .    A2[i]   . . .  →  A2
                           ↑        ↑                ↑
Data Presentation:        φ2[0]    φ2[1]   . . .    φ2[i]   . . .
Figure 4.2: Learning and planning with a grammatical inference module.
Figure 4.2 illustrates how identification in the limit proceeds. Through interactions with player 2, player 1 observes a finite initial segment of a positive presentation φ2[i] of L2(G, v0), and uses the GIM to update a hypothesized model of player
2. Specifically, the output of GIM(φ2[i]) is a dfa (see Appendix A) which, after removing the initial state and the finality of the final states, yields a semiautomaton
A2[i]. The labeling function LB2[i] in A2[i] is defined as LB2[i](q) = ∧σ∈IN(q) Post(σ), where
IN(q) ≜ {σ ∈ Σ2 | (∃q′ ∈ Q2[i])[T2[i](q′, σ) = q]} is the set of labels on the incoming
transitions of state q. The computation of the labeling function takes time
linear in the size of A2[i]. Given LB2[i], the interaction function U2(·) is updated in linear
time O(|Q1| × |Q2[i]|). Based on the interaction functions and the updated model of
player 2, player 1 constructs a hypothesis (model) G[i], capturing her1 best guess
of the game being played, and uses this model to compute WS1[i], which converges to
the true WS1 as A2[i] converges to the true A2. Strategies WS1[i], for i < n, are the best
responses for the system given the information it has so far, but having been devised
based on incorrect hypotheses about the game being played, they cannot guarantee
winning. There is no guaranteed upper bound on the number of games player 1 has to
play before the learning process converges because one does not know at which point
a characteristic sample of player 2’s behavior is observed. However, as soon as this
happens, convergence is guaranteed. The game learning procedure is summarized in
the following.
1. The game starts at initial state v0 ∈ I, i := 0, and the hypothesized game is
G[0].

2. At state v = ((q1, q2, 1), qs), player 1 computes Win1[i] in G[i]. If v ∈ Win1[i], a winning
strategy WS1[i] exists in (G[i], v); player 1 plays σ = WS1[i](v) and proceeds to step 4.
If v ∉ Win1[i], player 1 loses and jumps to step 3; if T(v, σ) ∈ F, player 1 wins and
jumps to step 5.

3. With probability p, player 1 makes a move selected at random from the moves
available at that instant and jumps to step 4; with probability 1 − p, player 1
jumps to step 5.

4. Player 2 makes a move. Player 1 observes the move, updates A2[i] to A2[i+1] and
G[i] to G[i+1], sets i := i + 1, and goes to step 2.

5. The game is restarted at a random initial state v0. If v0 ∉ Win1[i], player 1
makes a random move and goes to step 4; otherwise, player 1 jumps to step 2.
When player 1 finds herself out of her assumed winning set she can either quit and
restart the game, or explore an action with probability 0 ≤ p ≤ 1 and keep playing
hoping that her opponent’s response allows her to improve her hypothesis of the game.
1 In this context, we refer to player 1 as a “she” and player 2 as a “he”.
4.4.3 Learning an Equivalent Game
The assumptions of the first approach can be restrictive when the environment's
behavior can be constrained by the system during their interactions. For
example, a mobile robot's environment may consist of another mobile robot, and
the interference between actions then goes in both directions. Moreover, the learning
algorithms used in the first approach are restricted to those that output fsas.
To relax these assumptions, in this section we combine
action model learning [84] with grammatical inference to directly identify a game equivalent
to the one being played. The equivalence considered here is a modified
version of the game equivalence in [15], and it guarantees that even if the true model
of the environment is never found, the controllers built on the equivalent model
are effective in terms of satisfying the system specification.
We assume the following condition for the second approach to apply.
Assumption 2. The language of the environment in the system and environment
interaction, L2(G, v0), belongs to a class of languages identifiable in the limit from
positive presentations, and player 1 has correct prior knowledge of the class of languages
to which L2(G, v0) belongs.
4.4.3.1 Equivalence in Games
Through the concept of bisimulation in transition systems we establish the
equivalence between two games.
Definition 12. [91] A bisimulation of two transition systems P = 〈Q, Σ, δ, LB〉 and
P′ = 〈Q′, Σ, δ′, LB′〉 is a binary relation R ⊆ Q × Q′ such that whenever (q, q′) ∈ R and
σ ∈ Σ, the following conditions hold:
(i) LB(q) = LB′(q′).
(ii) if δ(q, σ) = p, then δ′(q′, σ) = p′ for some p′ ∈ Q′ such that (p, p′) ∈ R.
(iii) if δ′(q′, σ) = p′, then δ(q, σ) = p for some p ∈ Q such that (p, p′) ∈ R.
We write P ≃ P′ if P and P′ are bisimilar. For designated initial states q0, q′0, we say
(P, q0) is bisimilar to (P′, q′0), and write (P, q0) ≃ (P′, q′0), if and only if after trimming
all states inaccessible from q0 and q′0, P ≃ P′ and (q0, q′0) ∈ R.
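A naive fixpoint check of Definition 12, on two deterministic ltss encoded as dictionaries, might look as follows. This is an illustrative sketch (state and label names are placeholders), not an optimized partition-refinement algorithm:

```python
def bisimulation(delta1, delta2, LB1, LB2, alphabet):
    """Largest bisimulation R between two deterministic lts, per Definition 12."""
    # start from all label-compatible pairs (condition (i))
    R = {(q, qp) for q in LB1 for qp in LB2 if LB1[q] == LB2[qp]}
    changed = True
    while changed:
        changed = False
        for (q, qp) in set(R):
            for a in alphabet:
                p, pp = delta1.get((q, a)), delta2.get((qp, a))
                # conditions (ii)/(iii): sigma-moves must match and land in R
                if (p is None) != (pp is None) or (p is not None and (p, pp) not in R):
                    R.discard((q, qp))
                    changed = True
                    break
    return R
```

The returned relation is empty exactly when no pair of states survives the refinement, i.e., when the systems are not bisimilar from any pair of label-compatible states.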
The following definition is adapted from [15].
Definition 13 (Equivalence between games). Two games (G, v0) = 〈V, Σ, T, v0, F〉 and
(G′, v′0) = 〈V′, Σ, T′, v′0, F′〉 are equivalent if there exist two functions r : V∗ → V′∗ and
r′ : V′∗ → V∗ such that, given a winning strategy WS1 : V∗ → Σ1 of player 1 in (G, v0),
the strategy WS′1 : V′∗ → Σ1 defined by WS′1(ρ′) = WS1(r′(ρ′)), for any ρ′ = v′0 v′1 . . . v′n ∈ V′∗,
is winning for player 1 in (G′, v′0), and vice versa.
Proposition 4. If (P, q0) and (P′, q′0) are bisimilar, then the games (G, v0) = (P, q0) ⋉
As and (G′, v′0) = (P′, q′0) ⋉ As are equivalent, for any deterministic objective automaton
As = 〈S, C, Ts, Is, Fs〉.
Proof. Define r, r′ of Definition 13 by complete induction. For the initial states v0 =
(q0, s0) and v′0 = (q′0, s0), where s0 = Ts(Is, LB(q0)) = Ts(Is, LB′(q′0)) since LB(q0) =
LB′(q′0), let r(v0) = v′0 and r′(v′0) = v0; we then have π2(v0) = π2(v′0) and (π1(v0), π1(v′0)) =
(q0, q′0) ∈ R. Next, suppose r, r′ are defined for finite runs ρ = v0 v1 . . . vn and ρ′ =
v′0 v′1 . . . v′n such that r(ρ) = ρ′, r′(ρ′) = ρ, and for all 0 ≤ i ≤ n, π2(vi) = π2(v′i) ∈ S and
(π1(vi), π1(v′i)) ∈ R.

Consider σ ∈ Σ for which T(vn, σ) ↓, and let vn+1 = T(vn, σ). Writing vn =
(qn, sn) and v′n = (q′n, s′n), the inductive hypothesis on r, r′ gives (qn, q′n) ∈ R and s′n = sn.
Since As is total and δ′(q′n, σ) is defined by bisimulation, T′(v′n, σ) is defined. The
transitions in G and G′ are related:

G :  (qn, sn)  —σ→  (qn+1, sn+1)
       R ↕               R ↕
G′ : (q′n, sn)  —σ→  (q′n+1, s′n+1)

where sn+1 = Ts(sn, LB(qn+1)) and s′n+1 = Ts(sn, LB′(q′n+1)). Since q′n+1 is related to
qn+1 through R, from LB(qn+1) = LB′(q′n+1) we must have s′n+1 = sn+1.
Let r(ρ vn+1) = ρ′ v′n+1 and r′(ρ′ v′n+1) = ρ vn+1; inductively, it follows that for
two runs ρ ∈ V∗ and ρ′ ∈ V′∗ such that r(ρ) = ρ′ and r′(ρ′) = ρ, it holds that

(∀i : 0 ≤ i < |ρ|)[π2(vi) = π2(v′i) ∧ (π1(vi), π1(v′i)) ∈ R] .
Now suppose WS1 : V ∗ → Σ1 is a winning strategy for player 1 in (G, v0). For
any run ρ produced by player 1 applying WS1, let r(ρ) be the run produced by player
1 applying WS′1 (Definition 13) and note that LB(ρ) = LB′(r(ρ)), where LB and LB′ are
the labeling functions of G and G ′, respectively. Because of this latter equality between
the images of the labeling functions, when ρ is winning for player 1, by Proposition 3
we can infer LB(ρ) is accepted by As, and consequently r(ρ) is winning for player 1 in
(G ′, v′0) since LB′(r(ρ)) = LB(ρ) is accepted by As as well.
Proposition 4 sets the theoretical foundation that allows us to compute a winning
strategy for player 1 in a game equivalent to the original one when the latter is unknown
but can be learned from positive presentations.
4.4.3.2 Learning an Equivalent Game from Positive Presentations
The learning module in our framework combines two learning processes that
work in parallel: one aims to identify a transition system that keeps track of the
updates of world states during the course of the game, and the other is a typical GIM.
By combining these two we are able to compute a game equivalent to the true game
in the sense of Definition 13.
We assume that player 1 always knows whose turn it is (i.e., the Boolean value
t ∈ {0, 1}) and has full observation of the set of atomic propositions AP, i.e., at any
time instance during the game, player 1 knows the current evaluation of α for each
α ∈ AP whose value can be determined (either true or false). In the repeated game,
the (move) index i counts from the very first game until the current move.
Definition 14 (World state transition system). During repeated play of the game
(G, v0), the world state transition system constructed by player 1 at index n
is W(n) = 〈C × {0, 1}, Σ, Tw, (c0, t0)〉, where C × {0, 1} is the set of states and (c0, t0) is
the initial state;2 Tw : (C × {0, 1}) × Σ → C × {0, 1} is the transition relation, defined
from the observations of player 1 as follows:

1. t = 1: Tw((c, 1), σ) = (c′, 0) is defined if c ∈ C with c =⇒ Pre(σ), where
c′ = σ(c) is the world state that captures the effect of σ on world state c.

2. t = 0: Tw((c, 0), σ) = (c′, 1) is defined if, for c ∈ C, after player 2 plays σ, the
observed world state is c′.
In the course of the game G, player 1 updates W as follows. Let the world state at
index n be c ∈ C. At n + 1, suppose player 2 plays σ ∈ Σ2 and the world state becomes c′.
Then W(n + 1) is obtained from W(n) incrementally by first adding state (c′, 1) if it is
not already included in the state set, and then defining a transition Tw((c, 0), σ) = (c′, 1)
if it does not already exist. Outgoing transitions of (c′, 1) are subsequently added according
to the definition of Tw: for any σ ∈ Σ1 such that Pre(σ) is satisfied by c′, we add a
transition from (c′, 1) to (σ(c′), 0) labeled σ.
The incremental construction of W is reminiscent of learning an action model
with full observation [84], treating the actions of player 2 as the set of actions whose
conditional effects have to be learned. The convergence of learning is guaranteed:
informally, according to the turn-based product P, the set of world states that can
actually be encountered during the players' interaction is fixed. Once the set of states in
W(i), for some i ∈ N, converges to a set that contains LB(Q), transitions are added
over a known state set using the rules defined above. Convergence is reached when
no more transitions can be added. The construction of the world-state transition system
is linear in the size of the state space.
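One incremental update step of W can be sketched as follows. Here `pre` and `effect` are illustrative stand-ins for Pre(σ) and σ(c), world states are encoded as frozensets of true propositions, and the door example is a toy:

```python
def update_W(states, Tw, c, sigma, c_obs, sigma1, pre, effect):
    """One incremental step of W after player 2 plays sigma, taking world
    state c to the observed world state c_obs."""
    states.add((c_obs, 1))                       # add (c', 1) if missing
    Tw.setdefault(((c, 0), sigma), (c_obs, 1))   # record player 2's observed move
    for a in sigma1:                             # player 1's outgoing moves
        if pre(a, c_obs):                        # Pre(a) satisfied by c'
            states.add((effect(a, c_obs), 0))
            Tw.setdefault(((c_obs, 1), a), (effect(a, c_obs), 0))
    return states, Tw
```

Because `setdefault` never overwrites an existing transition, repeated observations leave W unchanged once it has converged, matching the termination condition above.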
Definition 15. Suppose that upon initialization of the game (G, v0) player 1 is
at state I1, that W(n) = 〈C × {0, 1}, Σ, Tw, (c0, t0)〉 has been constructed at index n, and that
a hypothesis of L2(G, v0) is given by a GIM in the form of a grammar G, for which
2 By construction, not all states in C × {0, 1} can be accessed in W(n).
one finds a dfa B = 〈Qh2, Σ2, Th2, Ih2, Fh2〉 such that L(B) = L(G). The hypothesized
turn-based product is

HP = W(n) ×s Ā1 ×s B = W(n) ×s 〈Q1, Σ1, T1, I1, Q1〉 ×s B = 〈H, Σ, δ′, h0, LB′〉

where H = C × {1, 0} × Q1 × Qh2 is the state set and h0 = (c0, t0, I1, Ih2) is the initial
state; the transition relation δ′ is defined as follows: for h = (c, t, q1, qh2),

1) σ ∈ Σ1 ∧ t = 1: δ′(h, σ) = (Tw((c, t), σ), T1(q1, σ), qh2);

2) σ ∈ Σ2 ∧ t = 0: δ′(h, σ) = (Tw((c, t), σ), q1, Th2(qh2, σ)) .

The labeling function LB′ : H → C is defined such that for any h = (c, t, q1, qh2) ∈ H,
LB′(h) = π1(h) = c.
Theorem 4. Let GIM be a learning algorithm that identifies in the limit, from positive
presentations, a class of regular languages L. Suppose the game (G, v0) with v0 = (q0, s0)
is such that L2(G, v0) ∈ L, and consider any positive presentation of L2(G, v0), denoted
φ2 : N → L2(G, v0) ∪ {#}. Let the algorithm Alg be defined by

Alg(φ2[n]) ≜ (W(n) ×s Ā1 ×s A2(φ2[n])) ⋉ As ,

where n ∈ N, A2(φ2[n]) = 〈Qh2, Σ2, Th2, Ih2, Fh2〉 is a dfa that accepts the language
generated by the grammar GIM(φ2[n]), and Ā1 is obtained from A1 by assigning I1 =
π1(q0) as the initial state and making all states final. Then there exists some index N ∈ N
such that

1. L(GIM(φ2[N])) = L2(G, v0);

2. W(n) = W(N) for all n ≥ N;

3. Alg(φ2[N]) is game equivalent to (G, v0).
Proof. It suffices to prove that at index N ∈ N the hypothesized turn-based product
satisfies W(N) ×s Ā1 ×s A2(φ2[N]) ≃ (P, q0) = 〈Q, Σ, δ, q0, LB〉, because it then follows from
Proposition 4 that Alg(φ2[N]) is equivalent to (G, v0).

At index N, let the hypothesized turn-based product be HP = W(N) ×s Ā1 ×s A2(φ2[N]) = 〈H, Σ, δ′, h0, LB′〉. Let the relation R ⊆ Q × H be defined by (q0, h0) ∈ R,
with (q, h) ∈ R whenever there exists w ∈ Σ+ such that δ(q0, w) = q and δ′(h0, w) = h.
We show R is a bisimulation:
First we show that if (q, h) ∈ R, then LB(q) = LB′(h). Given that LB′(h0) =
c0 is the world state at the initialization of the game, LB′(h0) = LB(q0) = c0 by
the uniqueness of the initial world state. For (q, h) ∈ R, assume without loss of
generality that there exists σ ∈ Σ for which δ(q, σ) = q′ and δ′(h, σ) = h′. Since in a
deterministic game the world state resulting from applying action σ to a given world state c is
unique, and given LB(q) = LB′(h) = c, we can infer LB(q′) = LB′(h′) =
σ(c) = c′ ∈ C. Inductively, it follows that for any (q, h) ∈ R, LB(q) = LB′(h).
Next we show that if (q, h) ∈ R, then for every σ such that δ(q, σ) = q′ ↓, there
must exist h′ ∈ H such that δ′(h, σ) = h′ and (q′, h′) ∈ R and vice versa. So far, we
have
(q, h) ∈ R =⇒ LB(q) = LB′(h) = c ∧ (∃w ∈ Σ∗)[δ(q0, w) = q ∧ δ′(h0, w) = h] .
Consider any σ such that δ(q, σ) ↓ and let q′ = δ(q, σ) = δ(q0, wσ). In showing that
δ′(h, σ) ↓, two cases can arise:
Case 1: σ ∈ Σ1. By the definition of the turn-based product, and from the fact that σ
is taken at state q of P, we can infer LB(q) =⇒ Pre(σ), which means that in W(N),
Tw((c, 1), σ) = (σ(c), 0) is defined. Meanwhile, as δ(q0, wσ) ↓, let u1 be the projection
of w on Σ1, and note that u1σ ∈ L1(G, v0) ⊆ L(A1) implies T1(I1, u1σ) ↓. By Definition
15, we have δ′(h, σ) ↓. Let h′ = δ′(h0, wσ) = δ′(h, σ). Then (q′, h′) ∈ R by the
definition of R.
Case 2: σ ∈ Σ2. Similarly to the previous case, since δ(q0, wσ) ↓, let u2 be the projection
of w on Σ2, and note that u2σ ∈ L2(G, v0) ⊆ L(A2). As L(A2(φ2[N])) = L2(G, v0)
in the limit, it follows that Th2(Ih2, u2σ) ↓. For Tw((c, 0), σ) to be defined in W(N),
player 1 must have observed player 2 taking action σ when the world state is c. In
the true turn-based product, given LB(q) = c, unless player 2 never plays σ at q (in
which case we can safely assume wσ ∉ L(G, v0)), there must exist a time index k ≤ N
at which the transition labeled σ from (c, 0) is added to W(k). Now since Th2(Ih2, u2σ) ↓
and Tw((c, 0), σ) ↓, by construction we have h′ = δ′(h, σ) = δ′(h0, wσ) ↓, and thus
(q′, h′) ∈ R by the definition of R.
So far we have shown that P is simulated by HP: any sequence of actions of
players 1 and 2 in P can be matched with a sequence of actions in HP. For the other
direction, note that any sequence of actions in HP can also be matched by a sequence
of actions in P, because observed behaviors originate from the true turn-based product
P, which captures all possible interactions. Having shown W(N) ×s Ā1 ×s A2(φ2[N]) ≃ (P, q0), Proposition 4 allows us to conclude that the game Alg(φ2[N]) is equivalent to
(G, v0).
We say Alg identifies a game equivalent to (G, v0) in the limit from positive presentations of L2(G, v0). From Theorem 4 and Proposition 4, it follows that the winning
strategy of player 1 computed using Alg(φ2[N]) ensures the satisfaction of the task
specification accepted by As. The winning strategy computed in Alg(φ2[N]) converges,
through the functions r, r′, to the winning strategy for player 1 in (G, v0). There is no
need to compute r and r′ explicitly, because for any finite run ρ′ in Alg(φ2[N]), the action WS′1(ρ′)
computed using Alg(φ2[N]) is exactly the action given by WS1(r′(ρ′)) in (G, v0).
Until GIM converges, there can be no guarantee that an effective strategy can
be found. However, since the output of GIM is always consistent with the history of
observed environment behavior, the adaptive controller performs at least as well as any
controller synthesized without making any inference. As observations accumulate,
the adaptive controller improves monotonically.
The computational complexity of learning depends on which grammatical inference
algorithm is used by GIM. It has been shown [49] that many lattice classes of
languages are learnable by algorithms that are set-driven
(i.e., the order of data is irrelevant) and poly-time iterative (i.e., the next hypothesis
can be computed in polynomial time from the previous hypothesis and the current data point
alone).
4.5 Case Study
In this section, we illustrate the method presented herein with a robot motion
planning example: a robot (player 1) aims to visit rooms 1 through 4 in Fig. 4.3a,
while the doors a, b, c, d, e, f, g are controlled by an adversary (player 2).
We assume player 2 adheres to the following rules: 1) doors b, c, e, f (around
room 0) can be kept closed for at most two rounds, and the rest can be closed for at
most one round;3 2) two consecutively closed doors, if not the same, must be adjacent
to each other (doors are adjacent if connected via a wall; for example, doors b, c are
adjacent to a). Player 1 can either stay in her current room or move to an adjacent
one if the door connecting the two is open, but cannot stay in rooms 3, 4, 0 for more
than one round. Note that although in principle our method applies to cases where
player 1 restricts the behavior of player 2, in this particular example this is not the
case.4
As the first game starts, player 1 is informed that the language of player 2 is
in the class of strictly 3-local languages [48] (see Appendix for a characterization of,
and available GIMs for, this class of languages). This information is inferred from the fact
that player 2 has a finite memory of size 3. Figure 4.3b gives a graphical description
of a fragment of the game automaton (G, v0), which has a total of 1214 states and 3917
transitions. The winning region for player 1 contains 996 states.
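A set-driven learner for strictly k-local languages, of the generic kind assumed available here, can be sketched in a few lines: the grammar is simply the set of k-factors of the boundary-padded data. The door-sequence samples below are illustrative, and this is the textbook SL-k factor learner, not the specific GIM used in the experiment:

```python
def sl_factors(word, k=3):
    """k-factors of a word padded with boundary markers '#'."""
    padded = '#' * (k - 1) + word + '#' * (k - 1)
    return {padded[i:i + k] for i in range(len(padded) - k + 1)}

def sl_learn(sample, k=3):
    """Set-driven SL-k learner: the grammar is the union of observed k-factors."""
    grammar = set()
    for w in sample:
        grammar |= sl_factors(w, k)
    return grammar

def sl_accepts(grammar, word, k=3):
    """A word is in the hypothesized language iff all its k-factors are allowed."""
    return sl_factors(word, k) <= grammar
```

The learner is set-driven (only the set of observed factors matters, not their order) and converges exactly when a characteristic sample, here all 121 factors of length 3, has been observed.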
3 Doors can be automatic sliding doors with different—designed—closing time spans.
4 Perhaps that could happen if the robot were to remain at a certain door, thus preventing it from closing, but this would not be particularly beneficial in terms of achieving its goal.
(a) A graphical depiction of the environment: rooms 0-4 connected by doors a-g.
(b) A fragment of A2, where the initial state is c. Since player 2 cannot close b, c, e, f for more than two rounds, she needs to maintain a memory of size 2 keeping track of the two most recently closed doors; e.g., ab means door a was closed in the previous turn and the currently closed door is b. Upon initialization, c means that so far only door c has been closed.
Figure 4.3: The environment and its abstraction.
Figure 4.4: A fragment of (G, v0), where v0 = ((1, c, 1), 1). A state ((q1, q2, t), qs) means the robot is in room q1; the recently consecutively closed (at most two) doors are q2; t = 1 if player 1 is to make a move, otherwise player 2 is to make a move; and the visited rooms are encoded in qs, e.g., 12 means rooms 1 and 2 have been visited.
Figure 4.5 shows how the learning algorithm converges after about 3895 turns
(368 games). Convergence is quantified by measuring the ratio of the size (cardinality)
of grammar GIM(φ[n]) over that of the grammar describing L2(G, v0); the latter has 121
factors of length 3 (see Appendix B for background details). Interestingly, although it
takes 368 games for the learning algorithm to converge, we observe that after 7 games
player 1 loses only once (in the 51st game). This fact suggests that the controllers
computed using the hypothesized games, even when those are not game-equivalent to
the actual game, can still be effective. We compare this outcome with the case where
player 1 has no capacity to learn, and replans in a way similar to [70] using a naive
model for player 2: when player 1 observes a door is closed (open), she will assume
the door will remain closed (open) until she observes it opening (closing) again. With
player 2 exploiting every opportunity to prevail, player 1 achieves a win ratio of 27%
when no learning is employed, compared to a ratio of 98% when GIM is used.
Figure 4.5: Convergence of learning L2(G, v0): the ratio between the size of the grammar inferred by the GIM and that of L2(G, v0), as a function of the number of moves made. (Axes: move index vs. rate of convergence, from 0 to 1.)
4.6 Conclusions
This chapter shows how particular classes of reactive systems, which capture the
interactions between systems and their environments, can be identified in the limit. By
doing so, the control policy for the system can successively adapt to the identified model
of the environment as the interaction unfolds. The prerequisites for this are as follows:
1) the behavior of the environment in the reactive system corresponds to a language which
belongs to a class of languages identifiable in the limit from positive presentations;
2) the system has prior knowledge of the class of languages to which the environment
behavior belongs. Provided these conditions hold, it is guaranteed that the system
can compute a winning strategy (control policy) which converges to the true winning
strategy in the limit from positive presentations.
The learning results in this chapter are primarily made possible by factoring the
game according to its natural subsystems: the dynamics of the system, the dynamics of
its environment, and the task specification. This isolated the uncertainty in the game
to the model of the environment. Consequently, the framework is flexible and modular:
for a continuous dynamical system interacting with an unknown environment, as long
as both the dynamical system and the environment afford consistent abstractions—in
the sense that abstract plans are always implementable—in the form of finite-state
machines, a grammatical inference module can be incorporated and is guaranteed to
work provided the prerequisites above hold.
An implicit assumption in this work is the availability of complete and precise
sensing information during the execution of the controllers. In practice, this ideal
assumption of complete information is in general difficult to realize. The proposed
framework, however, affords extensions to games with imperfect information.
Control synthesis with partial observations has been studied
extensively in des theory [21]. In parallel, the solution concept for games with partial
observations is developed in algorithmic game theory [23]. In the next chapter, we
study the control synthesis for reactive systems with partial observations and provide
an outlook for the integration of learning methods under noisy data with the synthesis
methods, in order to construct adaptive temporal-logic controllers for systems with
limited sensing modalities.
Chapter 5
CONTROL SYNTHESIS WITH PARTIAL OBSERVATIONS
5.1 Overview
An implicit yet unrealistic assumption in the work on control synthesis with
temporal logic specifications is the availability of complete and precise sensing infor-
mation during the execution of the controllers.
In this chapter, we develop automatic synthesis methods for systems with partial
observations based on solution concepts for two-player, turn-based, temporal-logic
games with incomplete information [23]. A direct application of results in algorithmic
game theory is impossible due to the lack of an appropriate definition of a sensor in the
game formulation. In control theory of discrete event systems, two definitions of sensor
game formulation. In control theory of discrete event systems, two definitions of sensor
models have been introduced. The first definition of sensing uncertainty [21] simply
partitions the set of events into observable and unobservable ones, and only captures
global sensing uncertainties. For example, if the robot positioned at (x, y) cannot
obtain information for the value of y, then any event that involves the variable y will
be unobservable. Another definition of sensor models introduces a mask function which
is a mapping from the state and/or action space into the observation space [64]. In this
chapter, we introduce an observation function based on the second definition, which
can be used to capture both local and global sensing uncertainties. With the sensor
model defined, we formulate the interaction between a system and its environment
as a variant of a partial-information, turn-based, temporal-logic game [6, 23] and derive
control synthesis methods with respect to the given system’s specification.
In the case of partial observation, a general approach to control design for a
reactive system is to construct another system with complete information based on
a knowledge-based subset construction [66, 96] and then apply the solution for games
with complete information [45]. Methods are also developed when the specification is
expressed in modal µ-calculus [7]. Recently, a nondeterministic control policy has been
introduced for reachability objectives [108]. For safety objectives in infinite state des,
Kalyon et al. [56] introduce a synthesis method that generates a k-memory controller
(a controller with finite memory of length k).
In this chapter, we examine synthesis problems under partial observations with
temporal logic specifications including safety, obligation (reachability), liveness and
persistence (Buchi) requirements. Two different measures of success are discussed.
One requires the task to be completed with certainty (sure winning). The other requires the
specification to be met with probability 1 (almost-sure winning). In the case of complete
information, these two measures are equivalent because the strategies used by the
system and its environment are both deterministic. In the case of partial observations,
when the system cannot always be sure of its current state and can only hypothesize
about the set of states to which the current state might belong, it has to select an
action randomly from a set of admissible actions.
We show that by allowing the system to keep a finite memory of the history and to
randomize its choice of actions, more permissive control policies can be derived with
respect to temporal logic specifications under the almost-sure winning criterion. Thus,
when a deterministic control policy cannot be computed, one can still try to compute
a randomized one that realizes the objective with probability 1.
5.2 Symbolic Synthesis with Partial Observations
In this section, we formalize the notion of sensor and present a game formula-
tion that captures the interaction between the system and its environment under an
incomplete information regime. Then, by adapting the solutions for games with partial
observations [23], automatic synthesis methods with respect to reachability and Buchi
objectives are developed.
5.2.1 The Model
For a set S, let |S| be the cardinality of S. For a finite set S, a probability
distribution on S is a function Pr : S → [0, 1] such that Σ_{s∈S} Pr(s) = 1. We denote
the set of all probability distributions on S by D(S). Let a be an event, and Pr(a) be
the probability of event a happening. For example, if a := (x ≥ 5), then Pr(a) is the
probability that x is greater than or equal to 5.
Similar to Chapter 4, we model the system and its environment as players:
Ai = 〈Qi,Σi, Ti,AP i, LBi〉, i = 1, 2. In the case of partial observations, we assume
that the system has a finite number of sensor configurations Θ and there is a surjective
function that maps a system state into a sensor configuration at that state: γ : Q1 → Θ.
We introduce two variables act and sensor. For each σ ∈ Σ, let act = σ be a
predicate, which evaluates true at a state q ∈ Q if and only if the most recent action
before reaching q is σ. For each θ ∈ Θ, let sensor = θ be a predicate, which evaluates
true at q ∈ Q if the current sensor configuration for the system is θ. Then, we augment
the set of predicates AP by AP := AP ∪ {act = σ | σ ∈ Σ} ∪ {sensor = θ | θ ∈ Θ} ∪ {t}.
Recall that t is the turn variable and when t = 1 (resp. t = 0) the system (resp.
environment) chooses an action. After augmenting AP, a world state c ∈ C becomes
a conjunction of literals over AP such that an atomic proposition or its negation
occurs only once in c. We assume that the value of the variable sensor, the turn variable t, and the
set of predicates {act = σ | σ ∈ Σ1} are always observable by the system.
With the augmented set of atomic propositions, we define a game graph that
captures the interaction between the system and its environment.
Definition 16. A game graph that captures the interaction of a system
A1 = 〈Q1,Σ1, T1,AP1, LB1〉 and its environment A2 = 〈Q2,Σ2, T2,AP2, LB2〉 is a tuple
P = 〈Q,Σ, δ,AP, LB〉 where the components are defined as follows.

Q = Q1 × Q2 × Θ × Σ × {0, 1} is the set of states.

Σ = Σ1 ∪ Σ2 is the alphabet.

δ : Q × Σ → Q is the transition function, defined by:
• δ((q1, q2, θ, σ, 1), σ′) = (q′1, q2, θ′, σ′, 0), where q′1 = T1(q1, σ′), LB1(q1) ∧ LB2(q2) =⇒ Pre(σ′), γ(q1) = θ and γ(q′1) = θ′;
• δ((q1, q2, θ, σ, 0), σ′) = (q1, q′2, θ, σ′, 1), where q′2 = T2(q2, σ′), LB1(q1) ∧ LB2(q2) =⇒ Pre(σ′) and γ(q1) = θ.

AP is the set of atomic propositions.

LB : Q → C is the labeling function, defined such that given q = (q1, q2, θ, σ, x), LB(q) = LB1(q1) ∧ LB2(q2) ∧ (act = σ) ∧ (sensor = θ) ∧ (t = x).
We define an observation function obs : C → O where O is the finite set of
observations. An observation is a disjunction of world states to which the system thinks
the true world state may belong, without being sure exactly which one is the true state.
A sensor model is a tuple Sensor = 〈Θ, C,O, obs〉. The observation function obs(·)
is extended to runs: given ρ = q0q1 . . . ∈ Q∗ (Qω), let obsR(ρ) = obs(c0)obs(c1) . . .
where LB(qi) = ci for 0 ≤ i < |ρ|. Given two runs ρ, ρ′ ∈ Q∗ (Qω), player 1 cannot
distinguish them if and only if obsR(ρ) = obsR(ρ′). In this case we say ρ and ρ′ are
observation-equivalent, denoted ρ ≡ ρ′.
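The extension of obs(·) from world states to runs, and the induced observation-equivalence, can be sketched in a few lines. The dict-based encodings of LB and obs below are illustrative assumptions, not part of the formal development.

```python
# Sketch of observation-equivalence between runs: `labels` encodes the
# labeling function LB (state -> world state), and `obs` encodes the
# observation function (world state -> observation symbol).

def obs_run(run, labels, obs):
    """Extend obs(.) to a run q0 q1 ... : the sequence obs(c0) obs(c1) ..."""
    return tuple(obs[labels[q]] for q in run)

def observation_equivalent(r1, r2, labels, obs):
    """Two runs are indistinguishable iff their observation sequences match."""
    return obs_run(r1, labels, obs) == obs_run(r2, labels, obs)

# Toy instance: states 0 and 1 share observation o1, so the runs 0.2 and
# 1.2 are observation-equivalent.
labels = {0: 'c0', 1: 'c1', 2: 'c2'}
obs = {'c0': 'o1', 'c1': 'o1', 'c2': 'o2'}
print(observation_equivalent([0, 2], [1, 2], labels, obs))  # True
```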
Based on the definition of strategy in Chapter 4, for the partial observation
cases, player 1 is limited to use the following class of strategies.
Definition 17. An observation-based deterministic (resp. randomized) strategy for
player 1 is a function S1 : Q∗ → Σ1 (resp. S1 : Q∗ → D(Σ1)) that satisfies: (1) S1
is a deterministic (resp. randomized) strategy of player 1; and (2) for any two runs ρ, ρ′, if
ρ ≡ ρ′, then S1(ρ) = S1(ρ′).
The intuition is that since player 1 cannot distinguish two runs, the actions she
takes, or the probability distributions over the set of actions for these two runs, have
to be the same.
The target problem in this chapter is the following:
Problem 4. Given a reactive system (turn-based product) P = 〈Q,Σ, δ,AP , LB〉, the
sensor model Sensor = 〈Θ, C,O, obs〉 and a specification ϕ in the form of a temporal
logic formula, determine whether there exists an observation-based deterministic control
policy S1 such that for any strategy of the environment S2, ϕ is satisfied surely. If
no such controller exists, then determine whether an observation-based randomized
strategy S1 exists such that for any strategy of the environment S2, ϕ is satisfied almost
surely (with probability 1).
We consider reachability and Buchi objectives. The specification ϕ over AP
can be translated into an objective automaton As = 〈S, C, Ts, Is, Fs〉. For a reachability
objective, As is a dfa, and for a Buchi objective, As is a dba.
5.2.2 Deterministic Finite-memory Strategy
In this section, we develop a synthesis method as a solution to Problem 4, with
respect to reachability and Buchi temporal logic specifications. First, the system and
the interaction it has with its environment through partial observations are captured
by a game graph with complete information, defined as follows.
Definition 18 (Observed game graph). For the game graph P = 〈Q,Σ, δ,AP, LB〉 with
a designated initial state q0, denoted (P, q0), an observed game graph with designated
initial state is a tuple obsP = 〈Qo,O ∪ Σ1, δo, qo0〉 where the components are defined as
follows.

Qo ⊆ 2^Q is the set of states, partitioned into player 1's and player 2's states Qo = Qo1 ∪ Qo2, where Qo1 = {qo ⊆ Q | ∀q ∈ qo, LB(q) =⇒ t = 1} and Qo2 = {qo ⊆ Q | ∀q ∈ qo, LB(q) =⇒ t = 0}.

δo : Qo × (O ∪ Σ1) → Qo is the transition function, with δo(qo1, a) = qo2 if one of the following conditions holds.
• a ∈ Σ1, qo1 ∈ Qo1, and qo2 = {q′ ∈ Q2 | ∃q ∈ qo1, δ(q, a) = q′}; in this case qo2 ∈ Qo2.
• a ∈ O and qo1 ∈ Qo2. If there exist σ ∈ Σ1 and qo ∈ Qo1 such that δo(qo, σ) = qo1, then qo2 = {q ∈ qo1 | obs(LB(q)) = a} with qo2 ⊆ qo1 and qo2 ∈ Qo2. Otherwise, every transition leading to qo1 is labeled with some a ∈ O, and qo2 = {q′ ∈ Q | (∃q ∈ qo1)[∃σ ∈ Σ2, δ(q, σ) = q′ ∧ obs(LB(q′)) = a ∧ (a =⇒ act = σ)]}, with qo2 ∈ Qo1.

qo0 = {q0} ∈ Qo1 is the designated initial state.
Intuitively, given a state qo1 ∈ Qo1, player 1 chooses an action σ ∈ Σ1 and
hypothesizes that she can possibly arrive at any state within qo2 = δo(qo1, σ). Once
the transition takes place, she obtains an observation o1 and then filters the hypothesis
that she previously made. Since it is now the environment's turn, the environment
picks an action and, as a result, another observation o2 is obtained by the system.
Player 1, based on the new observation o2 and her previous hypothesis of states from
o1, determines the set of states that she can be in and then selects an action admissible
at her current state. It is assumed that for any qo ∈ Qo1 and for any pair q, q′ ∈ qo, the
set of available actions for player 1 is the same in both q and q′.
We illustrate the construction by a simple example in Fig. 5.1. Here, the states in
the same block have the same image under the observation function, e.g. obs(LB(0)) =
obs(LB(1)) = o1 ∈ O. When player 1 obtains observation o1, she is not certain which
state she is in: it can either be 0 or 1, and the only available action at these two states
is a. Player 1 knows that after taking a she will be in either 2 or 3 but is not sure
which one of the two she lands at before she takes action a. When player 1 takes action
a, and if she observes o2, then she is certain that the state is 2 and the previous state
was 0. Then player 2 may pick action b or c; either way, player 1 receives the same
observation o4, and thus she is again uncertain as to whether she is in 4 or 5.
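The belief update behind Definition 18 consists of two steps: an action update followed by an observation filter. The sketch below is a hypothetical Python encoding of these two steps, instantiated on the fragment of Fig. 5.1; the names delta and obs_of are illustrative assumptions.

```python
# Hypothetical belief-update sketch: `delta` encodes the transition dict
# (state, action) -> state, and `obs_of` maps each state to its observation.

def post_action(belief, action, delta):
    """After playing `action`, the belief is the set of possible successors."""
    return frozenset(delta[(q, action)] for q in belief if (q, action) in delta)

def filter_obs(belief, o, obs_of):
    """After observing o, keep only the states consistent with it."""
    return frozenset(q for q in belief if obs_of[q] == o)

# The fragment of Fig. 5.1: from belief {0, 1}, action a leads to {2, 3};
# observing o2 then resolves the belief to {2}.
delta = {(0, 'a'): 2, (1, 'a'): 3, (2, 'b'): 4, (2, 'c'): 5, (3, 'b'): 6}
obs_of = {0: 'o1', 1: 'o1', 2: 'o2', 3: 'o3', 4: 'o4', 5: 'o4', 6: 'o5'}

b = post_action({0, 1}, 'a', delta)   # frozenset({2, 3})
b = filter_obs(b, 'o2', obs_of)       # frozenset({2})
```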
A run ρ = q0q1q2 . . . qn ∈ Q∗ is related to a run ρo = qo0 qo1 . . . qom ∈ (Qo)∗,
m ≠ n, as follows: let qo0 = {q0}. If LB(q1) =⇒ act = σ, obs(LB(q1)) = o1, and
obs(LB(q2)) = o2, then qo1 = δo(qo0, σ), qo2 = δo(qo1, o1), qo3 = δo(qo2, o2), and so on.
(a) The game graph P .
(b) The observed game graph obsP .
Figure 5.1: A fragment of a game graph P and the observed game structure obsP .
Lemma 1. For two runs ρ1, ρ2 ∈ Q∗ (Qω), if ρ1 ≡ ρ2, then the corresponding runs
ρo1, ρo2 ∈ (Qo)∗ ((Qo)ω) satisfy ρo1 = ρo2.

Proof. Since ρ1 ≡ ρ2, obsR(ρ1) = obsR(ρ2) and the sequences of player 1's actions are
the same in both ρ1 and ρ2; hence the same string w ∈ (O ∪ Σ1)∗ (or (O ∪ Σ1)ω) is
generated by both ρ1 and ρ2. The runs in obsP generated by the same string w have to be the
same, i.e., ρo1 = ρo2.
For a given objective automaton As = 〈S, C, Ts, Is, Fs〉, with respect to the
observed game graph obsP, we define the following observed game:

obsG = obsP ⋉ As = 〈M,O ∪ Σ1,∆,m0, Fo〉

where the components are defined as follows.

M = M1 ∪ M2, where Mi = Qoi × S is the set of states for player i.

∆ : M × (O ∪ Σ1) → M is the transition function, with ∆(m, a) = m′ if one of the following conditions holds.
• a ∈ O, m = (qo1, s) ∈ M2, m′ = (qo2, s′) ∈ M1 ∪ M2, δo(qo1, a) = qo2 and Ts(s, c) = s′, where c ∈ C, a =⇒ c, and for any c′ such that Ts(s, c′) is defined, either c =⇒ c′ or c ∧ c′ = ⊥.
• a ∈ Σ1, m = (qo1, s) ∈ M1, m′ = (qo2, s′) ∈ M2, δo(qo1, a) = qo2 and s′ = s.

m0 = (qo0, Ts(s0, LB(q0))) is the initial state; it is assumed that player 1 knows the initial state q0 and the initial world state LB(q0).

Fo = {(qo, s) ∈ M | s ∈ Fs} is the winning condition. If As is a dfa, a run ρ ∈ M∗ is winning for player 1 if and only if last(ρ) ∈ Fo. If As is a dba, a run ρ ∈ Mω is winning for player 1 if and only if Inf(ρ) ∩ Fo ≠ ∅.

Definition 19. With reference to obsG = 〈M,O ∪ Σ1,∆,m0, Fo〉, a finite-memory
controller (i.e. finite-memory winning strategy) for player 1 is a deterministic Moore
machine
M = 〈M,m0,O ∪ Σ1,Σ1,∆,WS1〉
where M is a finite set of memory states, m0 ∈M is the initial memory state, O∪Σ1
is the alphabet, ∆ : M × (O ∪ Σ1) → M is the state update function (transition function),
and WS1 : M1 → Σ1 is the next-action function if the controller is deterministic, or
WS1 : M1 → D(Σ1), where D(Σ1) is the set of probability distributions over Σ1, if the controller
is randomized.
The semantics of this controller is as follows: after player 1 plays σ at memory
state m, she will receive two consecutive observations o1, o2 ∈ O, and the controller
updates to memory state m′ = ∆(m, σo1o2). In response, player 1 selects an action or a
distribution over available actions according to WS1(m′). The deterministic controller,
at memory state m, suggests action WS1(m) = σ ∈ Σ1. The randomized controller
assigns probability WS1(m)(σ) to selecting action σ (note that WS1(m) is a probability
distribution WS1(m) : Σ1 → [0, 1]). In what follows we compute the next-action
function WS1 to complete the controller M.
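The execution semantics just described can be sketched in a few lines, assuming dict encodings of ∆ and WS1; the names Delta, WS1, step and next_action are illustrative, not from the text.

```python
import random

# Sketch of executing the finite-memory controller of Definition 19:
# the memory advances on the played action and the two subsequent
# observations, and the next action is read (or sampled) from WS1.

def step(m, sigma, o1, o2, Delta):
    """Advance the memory state on the word sigma o1 o2."""
    for a in (sigma, o1, o2):
        m = Delta[(m, a)]
    return m

def next_action(m, WS1, rng=random):
    """WS1[m] is either an action (deterministic controller) or a dict
    action -> probability (randomized controller), sampled from."""
    choice = WS1[m]
    if isinstance(choice, dict):
        actions, weights = zip(*choice.items())
        return rng.choices(actions, weights=weights)[0]
    return choice

# Toy instance: after playing a and observing o1 then o2, the memory
# moves from m0 to m1, where the controller randomizes between b and c.
Delta = {('m0', 'a'): 'mA', ('mA', 'o1'): 'mB', ('mB', 'o2'): 'm1'}
WS1 = {'m0': 'a', 'm1': {'b': 0.5, 'c': 0.5}}
```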
Theorem 5. There exists an observation-based deterministic finite-memory controller
M = 〈M,m0,O ∪ Σ1,Σ1,∆,WS1〉 for player 1 in the game graph P with respect to
specification ϕ, if and only if there exists a memory-less deterministic winning strategy
WS1 : M1 → Σ1 for player 1 in the observed game obsG.
Proof. The proof follows from existing solutions to games with partial observations
[7, 23].
The game obsG is simply a deterministic game with complete information. We
can directly apply the result in [45] to obtain a deterministic memoryless winning
strategy WS1 : M1 → Σ1 for player 1, if one exists. We complete the synthesis
of a finite-memory controller by incorporating WS1 as the next-action function in M.
5.2.3 Randomized Finite-memory Controllers
Even if a deterministic finite-memory controller cannot be found, it is still pos-
sible that we can find a randomized finite-memory controller that ensures the specifi-
cation is met with probability 1. For this purpose, we augment (P, q0) with obsG to
obtain the following two-player game:
G = (P, q0) × obsG = 〈V,Σ, T, v0, F 〉

where the components are defined as follows.

V = V1 ∪ V2, where Vi ⊆ Qi × Mi is the set of states of player i.

T : V × Σ → V is the transition function, with T (v, σ) = v′ if one of the following conditions holds:
• v = (q,m) ∈ V1, δ(q, σ) = q′, ∆(m, σ obs(LB(q′))) = m′, and v′ = (q′,m′);
• v = (q,m) ∈ V2, δ(q, σ) = q′, ∆(m, obs(LB(q′))) = m′, and v′ = (q′,m′).

v0 = (q0,m0) is the initial game state.

F = {(q,m) | m ∈ Fo} is the winning condition.

If As is a dfa, the game is a reachability game. If As is a dba, the game is a Buchi
game.
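The two cases of the product transition T can be sketched as one small routine. The encodings below (transition dicts delta and Delta, observation map obs_lb, turn predicate turn1) are illustrative assumptions.

```python
# Sketch of the product transition T in G = (P, q0) x obsG: `delta` is
# P's transition dict, `Delta` is the Moore machine's, `obs_lb` maps a
# P-state to the observation of its label, `turn1` tests player 1's states.

def product_step(v, sigma, delta, Delta, obs_lb, turn1):
    q, m = v
    q_next = delta[(q, sigma)]
    if turn1(q):
        # player 1's move: the memory first reads the action sigma
        m = Delta[(m, sigma)]
    # in both cases the memory then reads the new state's observation
    m_next = Delta[(m, obs_lb[q_next])]
    return (q_next, m_next)

# Toy check: a player-1 move followed by a player-2 move.
delta = {('p', 'a'): 'r', ('r', 'b'): 's'}
Delta = {('m0', 'a'): 'm1', ('m1', 'oR'): 'm2', ('m2', 'oS'): 'm3'}
obs_lb = {'r': 'oR', 's': 'oS'}
turn1 = lambda q: q == 'p'
v = product_step(('p', 'm0'), 'a', delta, Delta, obs_lb, turn1)   # ('r', 'm2')
v = product_step(v, 'b', delta, Delta, obs_lb, turn1)             # ('s', 'm3')
```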
Given a pair of strategies S1, S2 for player 1 and 2 respectively, and an initial
state v ∈ V , the set of runs of the game generated by this pair of strategies is denoted
Out(v, (S1, S2)). A run is almost-sure winning for player 1 if it satisfies the task speci-
fication with probability 1. A strategy WS1 for player 1 is almost-sure winning in the
game starting with v0 ∈ V , if and only if for any strategy S2 of player 2, and for any
ρ ∈ Out(v0, (WS1, S2)), ρ is almost-sure winning for player 1. The set of states from
which there exists an almost-sure winning strategy for player 1 is referred to as the
almost-sure winning region of player 1.
Given a state v ∈ V , the projection of v onto the set of memory states M is π2(v).
Let the set of observation-equivalent game states of state v be [v] = {v′ ∈ V | π2(v′) =
π2(v)}. Intuitively, for two runs ρ, ρ′ ∈ V ∗ such that T (v0, ρ) = v1 and T (v0, ρ′) = v2,
if π2(v1) = π2(v2), then the system arrives at the same state in the Moore machine and
thus chooses the same action (or distribution over actions) according to the next-action
function. Hence, we introduce the definition of an observation-based memoryless
deterministic (resp. randomized) strategy WS1 : V1 → Σ1 (resp. WS1 : V1 → D(Σ1)) in
G, which is such that if [v] = [v′] for any two states v, v′ ∈ V1, then WS1(v) = WS1(v′).
5.2.3.1 Randomized Controllers for Reachability Objectives
Similar to the case of complete observation, where the sure-winning strategy for
player 1 is related to the concept of the attractor, the notion of almost-sure attractor
in the partial observation case can be used to define the almost-sure winning region of
player 1, with respect to both reachability and Buchi objectives.
Definition 20. Consider the game G = 〈V,Σ, T, v0, F 〉. The almost-sure attractor of
X ⊆ V , denoted ASAttr(X), is the set of states such that for any v ∈ ASAttr(X), there
exists an observation-based randomized strategy WS1 : V1 → D(Σ1) that ensures, for
any strategy S2 of player 2 and any ρ ∈ Out(v, (WS1, S2)), that ρ reaches a state in X
with probability 1. That is, for any player 2 strategy S2, (∀v ∈ ASAttr(X))[∀ρ ∈ Out(v, (WS1, S2)), Pr(last(ρ) ∈ X) = 1].
For a given set of states X, we provide a procedure for computing ASAttr(X).
1. First we introduce the function Allow : V1 × 2^V → 2^Σ1 defined by

Allow(v, Z) = {σ ∈ Σ1 | T (v, σ) ↓ ∧ T (v, σ) ∈ Z}

and set Allow([v], Z) = ⋂_{v′∈[v]} Allow(v′, Z). Intuitively, as long as the current
state v′ ∈ [v], an action in Allow([v], Z) will never lead to a state outside Z.

We introduce the function Spre : 2^V × 2^V → 2^V defined by

Spre(Z, Y ) = {v ∈ V1 | (∃σ ∈ Allow([v], Z))[T (v, σ) ∈ Y ]} ∪ {v ∈ V2 | (∀σ ∈ Σ : T (v, σ) ↓)[T (v, σ) ∈ Y ]}

Informally, Spre(Z, Y ) is the set of states satisfying the following conditions: for a
player 1 state in this set, there exists an allowed action for player 1 that leads to a
state in Y ; for a player 2 state in this set, any action of player 2 leads to a state
within Y .

2. Let Z0 = V and X0 = X, with j := 0, i := 0. Inductively,

Xi+1 = Xi ∪ Spre(Zj, Xi), repeated until i = i∗ ∈ N with Xi∗+1 = Xi∗;
then Zj+1 = Xi∗, repeated until j = j∗ ∈ N with Zj∗+1 = Zj∗ = Z.    (5.1)

The fixed point Z in (5.1) is the almost-sure attractor of X: Z = ASAttr(X). The
complexity of computing the almost-sure attractor for a given game automaton
G in this way is polynomial in the size of the game G.
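The two nested fixed points in (5.1) translate almost directly into code. The sketch below is a hypothetical Python encoding, assuming the game is given as explicit sets and dicts (V1, V2, a transition dict T, the action sets, and a map cls inducing the observation classes [v]); none of these names come from the text.

```python
# Sketch of the fixed-point computation (5.1) of ASAttr(X). Assumed
# encodings: V1, V2 are the player state sets, T is a dict
# (state, action) -> state, Sigma1 (resp. Sigma) is player 1's (resp.
# the full) action set, and cls maps a state to its observation class,
# so that [v] = {v' in V1 : cls[v'] == cls[v]}.

def allow_class(v, Z, V1, T, Sigma1, cls):
    """Allow([v], Z): actions keeping every state in [v] inside Z."""
    peers = [u for u in V1 if cls[u] == cls[v]]
    return {s for s in Sigma1
            if all((u, s) in T and T[(u, s)] in Z for u in peers)}

def spre(Z, Y, V1, V2, T, Sigma1, Sigma, cls):
    """Spre(Z, Y) as defined in the procedure above."""
    out = {v for v in V1
           if any(T[(v, s)] in Y
                  for s in allow_class(v, Z, V1, T, Sigma1, cls))}
    out |= {v for v in V2
            if all(T[(v, s)] in Y for s in Sigma if (v, s) in T)}
    return out

def as_attr(X, V1, V2, T, Sigma1, Sigma, cls):
    """Outer/inner fixed point of (5.1): returns Z = ASAttr(X)."""
    Z = V1 | V2
    while True:
        W = set(X)                  # inner iteration starts from X^0 = X
        while True:
            W_next = W | spre(Z, W, V1, V2, T, Sigma1, Sigma, cls)
            if W_next == W:
                break
            W = W_next
        if W == Z:                  # outer fixed point reached
            return Z
        Z = W
```

On a toy game, as_attr returns the whole state space when player 2 cannot avoid the target, and only the target itself when she has an escape move.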
Theorem 6. Given Z = ASAttr(X), define the memoryless randomized strategy WS1 : V1 → D(Σ1) by

WS1(v)(σ) = 1/|Allow([v], Z)| for each σ ∈ Allow([v], Z),

where WS1(v)(σ) is the probability of selecting action σ at state v.
That is, given v ∈ Z, player 1 picks σ ∈ Allow([v], Z) uniformly at random. By
adhering to WS1, from any state v ∈ ASAttr(X), player 1 can force a run in the game
G to reach X with probability 1.
Proof. Given Z, one can compute a list of sets of states W0, . . . ,Wk such that W0 = X
and, for 0 ≤ i < k, Wi+1 = Wi ∪ Spre(Z,Wi), where k = min{i | Wi = Wi+1}. This
sequence of sets is essentially the sequence X0, . . . , Xk in (5.1) with Zj = Z. Since Z
is the fixed point, Wk = Z. For each v ∈ Z \ W0, there exists a unique
ordinal i such that v ∈ Wi+1 \ Wi. By construction, if v ∈ V1 ∩ Z, then there exists σ ∈
Allow([v], Z) such that T (v, σ) ∈ Wi. If player 1 selects σ, she enters Wi and, inductively,
forces a run ρ with length |ρ| ≤ k such that last(ρ) ∈ W0 = X. With the
randomized strategy, however, σ is chosen with probability WS1(v)(σ) = 1/|Allow([v], Z)|. Hence, after
player 1 makes her move, the probability that v′ ∈ Wi is Pr(v′ ∈ Wi) = 1/|Allow([v], Z)| ≥ 1/n,
where n = max_{v∈V} |Allow([v], Z)|. When some σ′ ≠ σ is selected, T (v, σ′) ∈ Wj for some
0 ≤ j ≤ k. Note that T (v, σ′) ∉ V \ Z, because no allowed move of player 1 can
force the run out of Z.

Let Pr(v,♦k W0) denote the probability of reaching W0 from state v within k turns.
For any v ∈ Z and all strategies of player 2, since v ∈ Wi for some i ≤ k, when player
1 applies WS1 we have Pr(v,♦k W0) ≥ (1/n)^k > 0, and the probability of not reaching W0 within
k turns is at most 1 − (1/n)^k =: r. In this case, from the resulting game state v′, which is
in Wj for some 0 < j ≤ k, the probability of not reaching W0 within k turns is again at most r;
inductively, the probability of a path starting from v eventually reaching W0 is
Pr(v,♦W0) = lim_{ℓ→∞}(1 − r^ℓ) = 1, since r < 1.

Thus X is reached with probability 1 using WS1: for any v ∈ ASAttr(X) and any
player 2 strategy S2, the set of runs ρ ∈ Out(v, (WS1, S2)) satisfies Pr(last(ρ) ∈ X) =
1.
For the case when As is a dfa capturing a reachability objective, take the target
set of states X to be F . If v0 ∈ ASAttr(F ), then the computed strategy
WS1 : V1 → D(Σ1) is exactly the randomized finite-memory controller that ensures the
state set F is visited with probability 1.
Proposition 5. Let WS1 : V1 → D(Σ1) be the almost-sure winning strategy
for player 1 in G. For any two states v1 = (q1,m1), v2 = (q2,m2) ∈ V with
m1 = m2, it holds that WS1(v1) = WS1(v2).
Proof. For i = 1, 2, WS1(vi)(σ) = 1/|Allow([vi], Z)| for each σ ∈ Allow([vi], Z). If m1 = m2,
by the definition of observation-equivalence of states, we can infer [v1] = [v2]. Therefore,
WS1(v1) = WS1(v2).
Hence WS1 can also be defined by WS1 : M → D(Σ1). This discussion com-
pletes the description of the finite memory controller — the Moore machine M =
〈M,m0,O ∪ Σ1,Σ1,∆,WS1〉. During game-play, player 1 updates the memory state
using the transition function ∆ based on her observations, and at her turn, selects
an action according to the probability distribution given by WS1(m) where m is the
current memory state. By Theorem 6 we can ensure the task is accomplished with
probability 1. Note that WS1 is memoryless in G but the Moore machine M that
includes WS1 implements a finite-memory strategy.
5.2.3.2 Randomized Controllers for Buchi Objectives
For Buchi objectives, a run ρ is almost-sure winning for player 1 in G if and
only if Pr(ρ |= □♦F ) = 1; that is, the set of runs along which F is always reachable
but is visited only finitely many times is almost empty, i.e., has probability
measure 0. In other words, F is visited infinitely often with probability 1.
Theorem 7. In the two-player turn-based Buchi game G = 〈V,Σ, T, v0, F 〉, there exists
an observation-based randomized strategy WS1 : V1 → D(Σ1) which, if player 1 adheres
to it, wins G almost surely if and only if v0 ∈ ASWin1. Here ASWin1 is the winning
region of player 1, obtained as follows: let Z0 := V , j := 0, and define inductively

Xj = ASAttr(Zj), Y j = Spre(Xj, Xj), Zj+1 = Y j ∩ F,
repeated until j = j∗ ∈ N with Z = Zj∗+1 = Zj∗; then ASWin1 := ASAttr(Z).    (5.2)

The observation-based randomized strategy WS1 that ensures player 1 almost surely
wins the game is defined by

WS1(v)(σ) = 1/|Allow([v],ASWin1)| for each σ ∈ Allow([v],ASWin1).

That is, given state v ∈ ASWin1, player 1 picks σ ∈ Allow([v],ASWin1) uniformly at
random.
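A compact way to express the outer iteration (5.2) is to take the ASAttr and Spre operators as black-box parameters. In the sketch below, the helper functions reach and pre are stand-ins for those operators on a one-player toy; all names are illustrative assumptions.

```python
# Sketch of the fixed point (5.2): the game-specific ASAttr and Spre
# operators are passed in as functions.

def as_win_buchi(V, F, as_attr, spre):
    """Iterate Xj = ASAttr(Zj), Yj = Spre(Xj, Xj), Zj+1 = Yj & F
    until Zj stabilizes; return ASWin1 = ASAttr(Z)."""
    Z = set(V)
    while True:
        X = as_attr(Z)
        Y = spre(X, X)
        Z_next = Y & set(F)
        if Z_next == Z:
            return as_attr(Z)
        Z = Z_next

# Toy: a one-player cycle 0 -> 1 -> 0 with F = {1}; state 1 can always
# be revisited, so the whole cycle is almost-sure winning.
succ = {0: {1}, 1: {0}}

def reach(Z):                    # stands in for ASAttr on this toy
    W = set(Z)
    while True:
        W_next = W | {v for v in succ if succ[v] & W}
        if W_next == W:
            return W
        W = W_next

def pre(Z, Y):                   # stands in for Spre on this toy
    return {v for v in succ if succ[v] & Y and succ[v] <= Z}

print(as_win_buchi({0, 1}, {1}, reach, pre))  # {0, 1}
```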
Proof. Given the fixed point Z in (5.2), one can compute a list of sets of states
W0, . . . ,Wk where W0 = Z ⊆ F and, for 0 ≤ i < k, Wi+1 = Wi ∪ Spre(Z,Wi), with
k = min{i | Wi = Wi+1}. This sequence is essentially the sequence of sets X0, . . . , Xk
in (5.1) in the computation of ASAttr(Z). By construction, for every v ∈ V1 ∩ Z,
there exists σ ∈ Allow([v],ASWin1) such that T (v, σ) ∈ ASAttr(Z) and WS1(v)(σ) > 0.
In addition, for all v′ ∈ [v] with v′ ≠ v, T (v′, σ) ∈ ASAttr(Z) by the definition of
Allow([v],ASWin1) (note that ASWin1 = ASAttr(Z)). Intuitively, once v ∈ Z ⊆ F is
visited, by adhering to the randomized strategy WS1 player 1 is ensured to stay in
ASWin1. For a state v ∈ (V1 ∩ ASWin1) \ Z, one can identify an ordinal i ∈ N such
that v ∈ Wi+1 \ Wi and there exists σ ∈ Allow([v],ASWin1) such that T (v, σ) ∈ Wi.
Under the randomized strategy WS1, Pr(T (v, σ) ∈ Wi) = 1/|Allow([v],ASWin1)| ≥ 1/n, where
n = max_{v∈V} |Allow([v],ASWin1)|. As in the proof of Theorem 6, player 1 forces a
run into Z with probability 1 by adhering to WS1. Upon reaching Z, for all actions of
player 2 and all actions of player 1 indicated by WS1, the game reaches some state in
ASAttr(Z), from which player 1 again forces a run into Z with probability 1
by adhering to WS1. Hence, for all v ∈ ASAttr(Z), the probability of player 1 always
eventually visiting Z ⊆ F is 1.
Similar to the case of reachability objectives, by Proposition 5 we obtain WS1 :
M → D(Σ1) and complete the finite-memory controller, i.e., the Moore machine M, that
ensures the Buchi objective is satisfied with probability 1.
5.3 Discussions and Conclusions
In this chapter, we presented automatic synthesis methods for reactive systems
in the presence of incomplete information for temporal logic specifications. To construct
adaptive controllers in the presence of incomplete information, one future direction is
to develop an algorithm that identifies or learns a model for the unknown environment
with data obtained with limited sensing capability, and then to investigate how learning
can be incorporated into control synthesis with partial observations.
The extension of adaptive control synthesis to the partial observation case is
not straightforward. The challenge is that the data presentation obtained with
respect to the environment behavior is incomplete. So far, there are limited results on
grammatical inference with incomplete or noisy data presentations, and to the best of
our knowledge, methods that work with the definition of limited sensing modality
used in reactive synthesis have yet to be developed. Existing work [12, 20, 46] concentrates
for the most part on noisy data presentations, in which, for example, a string
that does not belong to the target language may be presented as positive data. Probably
approximately correct semantics have been extended [73] to the case of learning
concepts (Boolean functions) with incomplete data. On the synthesis
side, [22] established an equivalence relation between games with partial observations
and games with probabilistic uncertainty. This result indicates that the interaction
between the system and its environment under partial observations can be converted
into an equivalent partially observable Markov model. This makes it possible to apply
learning algorithms for hidden Markov models, or probabilistic deterministic finite-
state automata [10], to identify an equivalent game with probabilistic uncertainty from
the original game with partial observation.
Chapter 6
OUTLOOK: MULTI-AGENT SYSTEMS
6.1 Overview
In previous chapters, we analyzed synthesis problems for a single autonomous
system interacting with a hostile environment by formulating their interaction as a two-
player, zero-sum temporal logic game. In the case of multiple interacting autonomous
agents, where each has its own objective, the approach adopted earlier may no longer be
applicable or satisfactory. In the real world, there are many instances of multi-agent
interaction that are not adversarial. Sometimes, an agent exploits others'
capabilities towards its own objective.
In the literature, multi-agent control synthesis with respect to temporal logic speci-
fications is typically done for cooperative systems. Given a global objective, all agents
cooperate to achieve it [25, 100]. In this setting, control synthesis hinges on a task
decomposition problem: how to decompose the global goal into subtasks, in a way
that completion of these subtasks implies that the goal is achieved. In its most general
form, this is a standard problem in supervisory control [104] and concurrency the-
ory [75]. To this point, there is existing work [65] for systems with computation tree
logic (ctl) task specifications. When the global task is in the form of an ltl formula,
methods have been developed [24] to break up the global ltl specification into a set
of control and communication policies for each agent to follow. Centralized controllers
can also be synthesized in the face of modeling uncertainty [101]. Meanwhile, instances
where agents have their own temporal logic specifications and treat each other as part
of some dynamic uncontrollable environment have also been examined [63]. When
other agents can be modeled as stochastic processes, probabilistic verification methods
can be applied to control a single deterministic agent [106].
However, this line of work does not provide enough insight into the interaction
of multiple non-cooperative agents. The central problem in mechanism design [78] is
to motivate autonomous interacting agents by giving them the right incentives, so that
some desired behavior emerges as a stable expression of interaction between them. The
reason to consider decentralized control is that a centralized plan/control policy admits
a single point of failure. In contrast, a distributed design affords more robustness for
the entire multi-agent system.
Here we adopt the approach of rational synthesis [37], which poses the control
synthesis problem for this class of multi-agent systems inside a non-zero-sum game
theoretic framework (cf. [63]). In particular, we offer a game theoretic analysis for the
class of multi-agent systems in which each player is assigned an ω-regular objective
and has a preference over all possible outcomes of their interactions. The approach
builds on the availability of methods for extracting discrete abstractions in the form of
labeled transition systems from continuous or hybrid dynamical systems (see Chapter 2
and [13,92,93,109]).
Capturing the interactions of different agents with independent objectives in
the form of a non-cooperative, concurrent graph game [16], we develop a decision
process for the computation of pure Nash equilibria associated with specified outcomes.
The difference between our work and existing formulations [16] is that in the latter,
agents are assigned an ordered set of tasks, whereas in our formulation each agent
is assigned a single task and has preferences over different game outcomes. We also
analyze the case of ω-regular objectives and identify the conditions under which
the set of pure Nash equilibria can be computed in polynomial or linear time. In
cases where inter-agent communication [87,107] or learning [39] cannot be realized, we
propose a different solution concept based on security strategies, which ensure a utility
or performance bound for a particular agent irrespective of the other agents' strategies.
We finally present the decision procedure for this type of equilibrium in coalition
games, in which groups of agents are allowed to communicate and team up.
6.2 Preliminaries
A deterministic Muller automaton (dma) is an automaton A = 〈Q, Σ, T, I, Acc〉 where
the acceptance component is expressed as Acc = F ⊆ 2^Q, and the machine
accepts a word w ∈ Σ^ω if and only if the run ρ on that word satisfies ρ(0) ∈ I and
Inf(ρ) ∈ F. Given an sa A = 〈Q, Σ, T〉, a cycle ρ = ρ(0)ρ(1)ρ(2) . . . ρ(n) is a run in A
such that ρ(0) = ρ(n) and for all 0 ≤ i < j ≤ n − 1, ρ(i) ≠ ρ(j). We write Cycles(A) to
denote the set of all cycles in A.
The task specifications considered here are given in the form of formulas in the
monadic second-order logic of one successor (s1s). This logic is an extension of first-order logic,
with quantified variables denoting subsets of the considered relational structures; for
a precise definition of the syntax and semantics of s1s, see [79]. ltl is a fragment of
s1s, and Büchi's theorem [19] establishes the equivalence between an s1s formula
and a dma: for any s1s formula φ over a set of atomic propositions AP, there exists
a dma with alphabet Σ ⊆ 2^AP accepting exactly the set of infinite words over
Σ satisfying φ. It is known that s1s is strictly more expressive than ltl. In
addition, a nondeterministic Büchi automaton is more expressive than a deterministic
one, but no more expressive than a deterministic Muller automaton.
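Since the constructions below manipulate dmas throughout, it may help to see how Muller acceptance can be checked mechanically. The following sketch is illustrative only and uses a hypothetical encoding: a deterministic transition dict `delta`, an initial state `q0`, and an acceptance table `table` given as a set of frozensets of states. It decides acceptance of an ultimately periodic word of the form prefix·loop^ω by computing Inf(ρ) directly (the loop must be nonempty).

```python
def run_word(delta, state, word):
    """Apply the transition function delta along a finite word.

    Returns the list of states entered and the final state reached."""
    visited = []
    for sym in word:
        state = delta[(state, sym)]
        visited.append(state)
    return visited, state

def muller_accepts(delta, q0, table, prefix, loop):
    """Check whether the dma accepts the ultimately periodic word prefix . loop^omega."""
    _, state = run_word(delta, q0, prefix)
    seen = {}                # state at a loop boundary -> index of that boundary
    boundary_states = []
    while state not in seen:             # determinism guarantees termination
        seen[state] = len(boundary_states)
        boundary_states.append(state)
        _, state = run_word(delta, state, loop)
    # Boundaries from seen[state] onward repeat forever; the states visited
    # inside those repeating loop traversals are exactly Inf(rho).
    inf_states = set()
    for s in boundary_states[seen[state]:]:
        visited, _ = run_word(delta, s, loop)
        inf_states |= set(visited)
    return frozenset(inf_states) in table
```

The key step is that a deterministic automaton reading a periodic suffix must eventually cycle through a fixed sequence of loop-boundary states, so Inf(ρ) is finite to compute.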
6.3 Modeling a Multi-agent Game
6.3.1 Constructing the Game Arena
We capture the interaction between autonomous, independent agents as a con-
current game. Similar to Chapters 4 and 5, let AP be a set of atomic propositions
and C be the set of world states (conjunctions of literals over AP). For each agent we
have a model in the form of a tuple Ai = 〈Qi, Σi, δi, APi, LBi〉. The conditional effect
of an action σ ∈ Σi is captured by its pre- and post-conditions, denoted Pre(σ) and
Post(σ), respectively. Whenever δi(q, σ) ↓, we have LBi(q) =⇒ Pre(σ); similarly,
when we observe a transition from q to q′ on action σ, compactly expressed as q −σ→ q′,
it has to hold that LBi(q′) =⇒ Post(σ).
We capture the concurrent interaction between agents by means of the following
construction.
Definition 21 (Concurrent product). For a set of agents Ai, with i ∈ Π = {1, 2, . . . , N},
their concurrent product is a tuple A1 A2 · · · AN = 〈Q, ACT , T, LB〉 where

Q ⊆ Q1 × Q2 × · · · × QN the set of states.
ACT = Σ1 × · · · × ΣN the alphabet. α
LB : Q → C the labeling function. β
T : Q × ACT → Q the transition function. γ

α Each a = (a1, a2, . . . , aN) ∈ ACT is an action profile, encoding the actions played
by all agents simultaneously.

β Given q = (q1, . . . , qN) ∈ Q, LB(q) = ∧i∈Π LBi(qi) is a logical sentence which is true
at state q.

γ Given q = (q1, . . . , qN) and a = (a1, . . . , aN) ∈ ACT , we have

T (q, a) = T((q1, . . . , qN), (a1, . . . , aN)) = (q′1, . . . , q′N)

provided that for all i ∈ Π, (i) q′i = δi(qi, ai), and (ii) LB(q) =⇒ Pre(ai).
The arena of the game expresses all possible interactions between agents, and
is captured by the concurrent product of Definition 21. The arena A1 A2 . . . AN is
itself an sa, which we denote P = 〈Q,ACT , T 〉. The arena does not incorporate the
agents’ objectives: it just describes what they can do.
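The construction of Definition 21 can be sketched in code. This is a minimal illustration under a simplifying assumption: labels and preconditions are modeled as sets of atomic propositions, and the entailment LB(q) =⇒ Pre(ai) becomes set containment. The names `agents`, `labels`, and `pre` are hypothetical, not from the thesis.

```python
from itertools import product

def concurrent_product(agents, labels, pre):
    """Transition map of the concurrent product (cf. Definition 21).

    agents: list of per-agent transition dicts delta_i, (state, action) -> state
    labels: function, joint state tuple -> set of true atomic propositions LB(q)
    pre:    function, action -> set of propositions the action requires (Pre)
    """
    states = [set(s for s, _ in d) | set(d.values()) for d in agents]
    actions = [set(a for _, a in d) for d in agents]
    T = {}
    for q in product(*states):
        for a in product(*actions):
            # (i) every local transition must be defined, and
            # (ii) the joint label must entail each action's precondition
            if all((qi, ai) in d for qi, ai, d in zip(q, a, agents)) and \
               all(pre(ai) <= labels(q) for ai in a):
                T[(q, a)] = tuple(d[(qi, ai)] for qi, ai, d in zip(q, a, agents))
    return T
```

The joint transition is defined exactly when every agent can move locally and every action's precondition is entailed by the joint label, mirroring conditions (i) and (ii) of the definition.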
6.3.2 Specification Language
The objective of an agent is given as an s1s formula, which is translated into a
language over 2^AP. Using the labeling function LB, the objective is translated into a
language Ωi over C, accepted by a total dma Ai = 〈Si, C, Ti, Ii, Fi〉 with a sink state
sink ∈ Si and Lω(Ai) = Ωi. The set of all agents' objectives is denoted {Ωi}i∈Π, and
the collection of deterministic Muller automata that capture these objectives is written {Ai}i∈Π.
6.3.3 The Game Formulation
The concurrent game of a set of agents indexed in Π, with objectives {Ωi}i∈Π on
the arena P, is denoted G = 〈Π, P, Mov, {Ωi}i∈Π〉. In this tuple, Mov : Q × Π → 2^Σ,
where Σ = ∪i∈Π Σi, is a set-valued map which, for state q ∈ Q and agent i ∈ Π, outputs
the set of actions available to agent i at state q. Formally, we write Mov(q, i) = {a[i] ∈
Σi | T (q, a) ↓}, where T is the transition map in P. An initialized game (G, q(0)) is the
game G with a specified initial state q(0) ∈ Q. An initialized arena (P, q(0)) is defined
in a similar fashion: it corresponds to the sa P with a designated initial state q(0).
A play p = q(0)a(0)q(1)a(1)q(2)a(2) . . . in (G, q(0)) is an interleaving sequence of
states and action profiles such that for all i ≥ 0 we have T(q(i), a(i)) = q(i+1). A
run ρ = q(0)q(1) . . . is the projection of a play p onto the set of states. A deterministic
strategy for agent i in (G, q(0)) is a map fi : Q^∗ → Σi such that for every
ρ = q(0)q(1) . . . ∈ Q^∗ and every k ≥ 1, fi(Pr=k(ρ)) ∈ Mov(q(k−1), i). A deterministic
strategy profile f = (f1, . . . , fN) is a tuple of strategies, with fi being the strategy of
agent i. The set of all strategy profiles is denoted SP. We consider only deterministic
strategies. We say that a run ρ is compatible with a strategy profile f = (f1, . . . , fN)
if it is produced when every agent i adheres to strategy fi. In a particular game
(G, q(0)), the set of runs that are compatible with strategy profile f is the outcome of
the game for this strategy profile, and is denoted Out(q(0), f). Thus, outcomes are all
possible game plays that can result from the application of a specific strategy profile.

The payoff of agent i is given by a function ui : Q × SP → {0, 1} defined as
ui(q, f) = 1 if and only if for all ρ ∈ Out(q, f), LB(ρ) ∈ Ωi. The payoff vector is the
tuple made of the payoffs of all agents: u(q, f) = (u1(q, f), . . . , uN(q, f)); we then say
that f yields the payoff vector u(q, f). In game (G, q(0)), the set of all possible payoff
vectors is denoted PV = ⋃_{f∈SP} u(q(0), f).
Definition 22. A preference relation for agent i is a partial order ≼i defined over PV:
for u1, u2 ∈ PV, if u1 ≼i u2, then agent i either prefers a strategy profile f2 with
which u(q(0), f2) = u2 over a strategy profile f1 with which u(q(0), f1) = u1, or is at
least indifferent between f1 and f2.

We say agent i is indifferent between payoff vectors u1, u2 whenever u1 ≼i u2
and u2 ≼i u1. In this case we write u1 ≃i u2.
With the game formulated and the agents' objectives and interests defined, we
proceed with the game theoretic analysis.
6.4 Game Theoretic Analysis
For games that arise as models for interaction in multi-agent systems that consist
of self-interested agents with objectives expressed in the form of an s1s formula, this
section defines two solution concepts: one in the form of a pure Nash equilibrium, and
another in the form of a security strategy.
The former expresses a solution, a tuple of strategies of all agents, that is stable
in the following sense: if each player behaves rationally, keeping its own interests in
mind when deciding its actions, then no unilateral deviation from this solution makes
sense, because the deviating agent cannot be better off. Such solutions are
essentially pure Nash equilibria, which emerge in cases of interaction between intelligent
autonomous agents. The second solution concept takes the conservative view of a player
who has no knowledge of the others' objectives or levels of rationality; what she tries to
do, therefore, is to find the particular behavior that minimizes her losses under the
worst, arguably irrational, possible scenario.
6.4.1 Pure Nash Equilibria
First we define the notion of pure Nash equilibrium in the multi-agent concurrent
games with individual temporal logic objectives and preference orderings.
Definition 23 (Pure Nash equilibrium). A deterministic strategy profile f is a pure
Nash equilibrium in an initialized multi-agent non-cooperative game (G, q(0)) if, for
any agent i ∈ Π, any strategy profile f′ obtained by i unilaterally deviating from f
satisfies u(q(0), f′) ≼i u(q(0), f).
Since we consider only pure Nash equilibria, we will from now on refer to them
simply as equilibria. Following [16], we employ an alternative procedure that
directly computes a set of pure equilibria for a multi-agent game in this class, by
answering the following question:

Problem 5. For a payoff vector u ∈ PV ⊆ {0, 1}^N, is there an equilibrium f in the
game (G, q(0)) such that u(q(0), f) = u?
The decision procedure presented below differs from alternative solutions [16] in
that each agent ranks the set of outcomes based on explicit preference relations and has
a single ω-regular objective. The outline of the proposed process is as follows. First,
the concurrent multi-player game G is used to construct a zero-sum two-player
turn-based game H between two fictitious players, I and II. This is done incrementally
in two steps: we first form a factor arena H from the arena P of G, and based on H, we
then construct the full arena H̄ of the two-player game using the synchronized product,
which incorporates all player objectives {Ωi}i∈Π. The synchronized product operation,
defined more generally in the context of transition systems [8], is particularized here
for the case of automata:
Definition 24 (cf. [8]). Given a set of automata Ai = 〈Qi, Σi, Ti, q_i^(0), Fi〉, with 1 ≤
i ≤ n, their synchronized product is an automaton expressed as

A1 ⋉ A2 ⋉ · · · ⋉ An = 〈∏_{i=1}^n Qi, ⋃_{i=1}^n Σi, T, (q_1^(0), . . . , q_n^(0)), ∏_{i=1}^n Fi〉

where the transition relation T is defined as follows: for q = (q1, q2, . . . , qn),
T (q, σ) = (q′1, . . . , q′n), where q′i = Ti(qi, σ) if Ti(qi, σ) is defined, and otherwise q′i = qi.
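The synchronized product of Definition 24 can be sketched directly: a component moves on a shared symbol when its local transition is defined, and stays put otherwise. This is an illustrative sketch; the encoding of each automaton as a dict with hypothetical keys `transitions`, `initial`, and `alphabet` is an assumption.

```python
def synchronized_product(automata):
    """Reachable part of the synchronized product (cf. Definition 24).

    Each automaton is a dict with keys:
      "transitions": dict, (state, symbol) -> state
      "initial":     initial state
      "alphabet":    set of symbols
    """
    init = tuple(a["initial"] for a in automata)
    alphabet = set().union(*(a["alphabet"] for a in automata))
    T = {}
    frontier, seen = [init], {init}
    while frontier:
        q = frontier.pop()
        for sym in alphabet:
            q_next = tuple(
                a["transitions"].get((qi, sym), qi)   # move if defined, else stay
                for qi, a in zip(q, automata)
            )
            T[(q, sym)] = q_next
            if q_next not in seen:
                seen.add(q_next)
                frontier.append(q_next)
    return T, init
```

Only the reachable joint states are built, which is usually what one wants in practice rather than the full Cartesian product.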
The next step in our procedure (Proposition 6) is to show that all cycles in
the arena H̄ of game H are produced by a subset of agents adhering to a sequence of
action profiles that no one deviates from. This is important because accepting runs
in a dma are always cycles.¹

¹ In a dma A = 〈Q, Σ, T, I, F〉, the table F is a subset of Cycles(A) [67]: for each
infinite run ρ, there exists a cycle C such that Inf(ρ) = C.

Then we show (Proposition 7) that all possible payoff
vectors which result from adopting some strategy profile f in G can be enumerated by
looking at the cycles in H̄. With this at hand, Theorem 8 characterizes the equilibria
in G in the form of particular winning strategies for player I in H, where the winning
conditions are defined with respect to given payoff vectors. These winning plays, which
can be computed using existing methods [76], correspond to (pure Nash) equilibria in
G. We thus have a direct way to determine the set of equilibria for the multi-agent
concurrent game in this class.
The starting point in our process is the same as in existing solutions [16]: we
define a set of suspect agents. Suppose that in G, T(q, a) = q′. For an action profile
b, the set of suspect agents [16] triggering a transition from q to q′ is

Susp((q, q′), b) = {k ∈ Π | ∃σ ∈ Mov(q, k), b[k ↦ σ] = a ∧ T (q, a) = q′} ,

where b[k ↦ σ] is the action profile obtained from b when agent k decides unilaterally
to play some action σ instead of b[k]. If, for example, b = (b1, b2, . . . , bN), then
b[k ↦ σ] = (b1, . . . , bk−1, σ, bk+1, . . . , bN).

In this context, agent k is suspected to have triggered a transition from q to q′
if her unilateral deviation b[k ↦ σ] = a suffices to initiate the transition T (q, a) = q′.
Naturally, when a = b, Susp((q, q′), b) = Π.
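The suspect set is a direct computation over unilateral deviations. A minimal sketch, with hypothetical names: `moves(q, k)` returns the actions available to agent k at q, and `T` is the concurrent transition map.

```python
def susp(q, q_next, b, moves, T):
    """Agents suspected of triggering the transition q -> q_next under profile b.

    moves(q, k): iterable of actions available to agent k at state q
    T: dict mapping (state, action_profile) -> state
    """
    suspects = set()
    for k in range(len(b)):
        for sigma in moves(q, k):
            a = b[:k] + (sigma,) + b[k + 1:]   # the deviation b[k -> sigma]
            if T.get((q, a)) == q_next:
                suspects.add(k)
                break
    return suspects
```

Note that when the prescribed profile b itself triggers the transition, every agent's trivial "deviation" to its own prescribed action matches, so the suspect set is all of Π, as the text observes.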
To solve Problem 5, the multi-agent concurrent arena P = 〈Q, ACT , T 〉 is trans-
formed into the arena of a two-player turn-based game [16], with two fictitious players:
player I and player II. The two-player turn-based arena factor is a semiautomaton
H = 〈V, ACT ∪ Q, Th〉, with components defined as follows:

V = VI ∪ VII the set of states. α
ACT ∪ Q the alphabet. β
Th the transition relation. γ

α VI ⊆ Q × 2^Π, and VII ⊆ Q × 2^Π × ACT .

β ACT = Σ1 × · · · × ΣN represents the available moves for player I, and Q the moves
for player II.

γ Given v ∈ V , either

v = (q, X) ∈ VI , and for any a ∈ ACT with T (q, a) ↓, Th(v, a) := (q, X, a) ∈ VII ; or

v = (q, X, a) ∈ VII , and for any q′ ∈ Q with X ′ = X ∩ Susp((q, q′), a) ≠ ∅,
Th(v, q′) := (q′, X ′) ∈ VI .
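The factor construction above can be sketched as a forward exploration from the initial player-I state. This is an illustration under assumed encodings: player-I states are pairs (q, X), player-II states are triples (q, X, a), and `susp_fn` is assumed given (e.g., computed as in the Susp definition).

```python
def turn_based_factor(q0, Pi, T, susp_fn):
    """Build the two-player turn-based factor H from the concurrent arena.

    T: dict, (state, action_profile) -> state
    Pi: frozenset of agent indices (the full suspect set initially)
    susp_fn(q, q2, a): suspect set for the transition q -> q2 under profile a
    """
    v0 = (q0, Pi)
    Th, frontier, seen = {}, [v0], {v0}
    while frontier:
        v = frontier.pop()
        if len(v) == 2:                      # player I state: pick an action profile
            q, X = v
            for (q_src, a) in T:
                if q_src == q:
                    w = (q, X, a)
                    Th[(v, a)] = w
                    if w not in seen:
                        seen.add(w); frontier.append(w)
        else:                                # player II state: pick a successor state
            q, X, a = v
            for q2 in {dst for (q_src, _), dst in T.items() if q_src == q}:
                X2 = X & susp_fn(q, q2, a)
                if X2:                       # keep only nonempty suspect sets
                    w = (q2, X2)
                    Th[(v, q2)] = w
                    if w not in seen:
                        seen.add(w); frontier.append(w)
    return Th, v0
```

Player II's move to a state q2 shrinks the suspect set by intersection, which is exactly the mechanism Proposition 6 below exploits.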
In this two-player game the fictitious players alternate: at each turn, one picks a
state in the original game as its move, and the other picks an action profile as its move.
For an initialized concurrent game (G, q(0)), the initial condition for the associated
two-player game is v(0) = (q(0), Π), i.e., a pair consisting of the initial game state q(0)
in G and the set of player indices Π. The initial condition of an objective automaton
Ai is s_i^(0) = Ti(Ii, LB(q(0))), and expresses the degree to which the objective of player
i is satisfied when the game is initialized. We write (Ai, s_i^(0)) to emphasize that the
automaton Ai has initial state s_i^(0). The objectives {Ωi}i∈Π of the concurrent game
(G, q(0)) are incorporated into the initialized two-player arena (H, v(0)) to yield the
two-player game using the synchronized product:

H̄ = (H, v(0)) ⋉ (A1, T1(I1, π1(v(0)))) ⋉ · · · ⋉ (AN, TN(IN, π1(v(0)))) = 〈V̄, ACT ∪ Q, T̄, v̄(0)〉 , (6.1)

with the understanding that the projection operator π1 singles out the first component
in v(0) and gives π1(v(0)) = q(0), while the remaining components are described as

V̄ = V̄I ∪ V̄II the set of states. α
ACT , Q the sets of actions for players I and II, respectively.
T̄ the transition function. β
v̄(0) ∈ V̄I the initial state. γ
α V̄I = VI × S1 × · · · × SN, with Si the states of Ai, are the states where player I
takes a transition; V̄II = VII × S1 × · · · × SN are the states where player II moves.

β Given a state v̄ = (v, s1, . . . , sN):
if v ∈ VI and σ ∈ ACT , then T̄(v̄, σ) := (v′, s1, . . . , sN) provided that v′ = Th(v, σ);
if, on the other hand, v ∈ VII and σ ∈ Q, then T̄(v̄, σ) := (v′, s′1, . . . , s′N), provided
that v′ = Th(v, σ) and for each i ∈ Π it is s′i = Ti(si, LB(σ)).

γ It is the tuple (v(0), s_1^(0), . . . , s_N^(0)), where for each i ∈ Π, s_i^(0) = Ti(Ii, LB(π1(v(0)))) =
Ti(Ii, LB(q(0))). It is assumed that player I moves first.
Consider an arbitrary state of the two-player game v̄ = (v, s), and note that v
can be either in VI or in VII. In the first case we have (v, s) = ((q, X), s) with q ∈ Q
and X ∈ 2^Π, while in the second, (v, s) = ((q, X, a), s) with a ∈ ACT . Now applying
the projection twice, we get π1(π1(v̄)) = π1(v) = q as the state of the underlying
multi-agent arena P associated to v̄, and π2(π1(v̄)) = π2(v) = X as the set of agents
included in the expression of v̄ ∈ V̄. To directly distinguish the images of these πi ◦ πj
compositions, we use the more intuitive notation Agt := π2 ◦ π1 and State := π1 ◦ π1;
thus Agt(v̄) is the set of agents in v̄, and State(v̄) is the state in Q encoded in v̄. We say
that player II follows player I on a run ρ = v̄(0)v̄(1) . . . ∈ V̄^ω if for all i ≥ 0, Agt(v̄(i)) =
Agt(v̄(i+1)) = Π. Given ρ = v̄(0)v̄(1) . . . ∈ V̄^ω, let Agt(ρ) = Agt(v̄(0))Agt(v̄(1)) . . . and
State(ρ) = State(v̄(0))State(v̄(1)) . . ..
Problem 5 requires us to find the exact set of possible payoff vectors PV in a
game (G, q(0)). The following two propositions provide the answer to this question.
Proposition 6. Given a cycle C ∈ Cycles(H̄), for each pair (v̄1, v̄2) ∈ C × C we
have Agt(v̄1) = Agt(v̄2).

Proof. For any transition from v̄ ∈ V̄II where Agt(v̄) = X, the destination state
v̄′ ∈ V̄I contains the intersection of X with the set of suspect agents; hence Agt(v̄′) ⊆
Agt(v̄). Given a run ρ ∈ V̄^ω, for all i ≥ 0 it holds that Agt(ρ(i+1)) ⊆ Agt(ρ(i)); in other
words, the set of agents along a run cannot grow, due to the intersection taken
with the set of suspect agents. Thus in a cycle C ∈ Cycles(H̄), for v̄1, v̄2 ∈ C there
exists one path within C from v̄1 to v̄2, and another from v̄2 to v̄1. From the first
we infer Agt(v̄2) ⊆ Agt(v̄1), and from the latter Agt(v̄1) ⊆ Agt(v̄2). It follows that
Agt(v̄1) = Agt(v̄2).

Proposition 6 implies that for any state v̄ of a cycle C ∈ Cycles(H̄), Agt(v̄) is
invariant. This set is thus a feature, a special property of the particular cycle, and for
this reason we denote it Agt(C).
Proposition 7. For an initialized game (G, q(0)), the set of all possible payoff vectors is
PV = ⋃_{C∈Cycles(H̄)} u(C) ⊆ {0, 1}^N, where u(C) = (u1(C), . . . , uN(C)) is a tuple defined
as follows: for i = 1, . . . , N, ui(C) = 1 if {s[i] | (v, s) ∈ C} ∈ Fi, i.e., the acceptance
table of dma Ai, and ui(C) = 0 otherwise.
Proof. It suffices to show that given an initial state q(0) ∈ Q and any strategy
profile f, there exists a cycle C ∈ Cycles(H̄) such that for all i ∈ Π, ui(C) = ui(q(0), f).
Let r = q(0)q(1)q(2) . . . be a run compatible with f, generated by the input word
w = a(0)a(1) . . . ∈ ACT ^ω. By construction, there exists a run ρ in H̄ such that
State(ρ) = r and for all i ≥ 0, Agt(ρ(i)) = Π. Let τ ∈ S_j^ω be the run in Aj generated by
r. Since r, ρ and τ are related as shown in Fig. 6.1, we have Inf(τ) = {s[j] | (v, s) ∈ Inf(ρ)}.
[Figure 6.1 depicts the three runs aligned: each transition q(i) −a(i)→ q(i+1) of r in G
corresponds in H̄ to the pair of moves ((q(i), Π), s(i+1)) −a(i)→ ((q(i), Π, a(i)), s(i+1))
−q(i+1)→ ((q(i+1), Π), s(i+2)), and in Aj to the transition s(i+1)[j] −q(i+1)→ s(i+2)[j],
starting from Ij −q(0)→ s(1)[j].]

Figure 6.1: Relating the run r in the multi-agent game G, the run ρ in the two-player
turn-based game graph H̄, and the run τ in the objective automaton Aj.
If C ∈ Cycles(H̄) is such that Inf(ρ) = C, then Inf(τ) = {s[j] | (v, s) ∈ C}. As defined in
the statement of the proposition, uj(C) = 1 if and only if {s[j] | (v, s) ∈ C} ∈ Fj, and we
directly get that in this case Inf(τ) ∈ Fj. Now if Inf(τ) ∈ Fj then uj(q(0), f) = 1,
which means that uj(C) = uj(q(0), f). The proof is complete since f was selected
arbitrarily.
So payoff vectors are associated to cycles. For a specific payoff vector, we define
the winning condition that completes the definition of the two-player game, through
a set-valued objective function. The objective function OBJ is a map
from the set of payoff vectors PV to subsets of cycles in Cycles(H̄), such that all cycles
C in the image OBJ(u), for some u ∈ PV, are associated to payoff vectors which
the players in Agt(C) do not prefer over u. Formally, this is expressed as

OBJ(u) ≜ {C ∈ Cycles(H̄) | ∀i ∈ Agt(C), u(C) ≼i u} . (6.2)

The objective function allows us to complete the description of the two-player
Muller game. For each payoff vector u, the Muller game H(u) is the one played on
arena H̄, in which the objective of fictitious player I is OBJ(u). (Since H(u) is a zero-sum
game, player II wins at Cycles(H̄) \ OBJ(u).) Therefore,

H(u) = 〈 V̄I ∪ V̄II, ACT ∪ Q, T̄, v̄(0), (OBJ(u), Cycles(H̄) \ OBJ(u)) 〉 .
As it turns out, given u ∈ PV, whether an equilibrium yielding u exists can be
determined constructively by computing the winning strategy of player I in the Muller
game H(u):

Theorem 8. Given a payoff vector u ∈ PV in game (G, q(0)), there exists an equilib-
rium f such that u(q(0), f) = u if and only if the following two conditions are satisfied
(in the order given):

1. there exists a winning strategy for player I in H(u), and

2. there exists a run ρ ∈ V̄^ω in H(u) with ρ(0) = v̄(0) for which Inf(ρ) ∩ {C ∈
OBJ(u) | u(C) = u} ≠ ∅, and for all i ≥ 0, ρ(i) ∈ WinI and Agt(ρ(i)) = Π.
Proof. Let WSI be the winning strategy of (fictitious) player I given its Muller
objective OBJ(u). Let C be a cycle in OBJ(u) that satisfies condition 2 of the theorem,
namely u(C) = u, Agt(C) = Π, and there exists a run ρ in H̄ that starts at v̄(0), never
leaves the winning region of player I, and visits C infinitely often. For an odd number
m ≥ 1, we obtain a finite prefix of ρ of the form Pr=m(ρ) = ρ(0)ρ(1) . . . ρ(m−1). At
the last state in this prefix, ρ(m−1) = ((q, Π), s), player I can make a move a ∈
WSI(Pr=m(ρ)) and reach state ρ(m) = T̄(ρ(m−1), a) = ((q, Π, a), s). If every agent
in Π adheres to action profile a, it means that player II selects action q^o = T (q, a),
and the next state in H(u) becomes ρ(m+1) = ((q^o, Π), s′), with s′[i] = Ti(s[i], LB(q^o))
for all i ∈ Π. However, if any agent in Π (say j) deviates and unilaterally changes its
action, the action profile in G changes from a to a[j ↦ σ] = a′, with T (q, a′) = q′.
In H(u), this amounts to player II selecting some action q′ ≠ q^o, which brings H(u) to
state ((q′, Π ∩ Susp((q, q′), a)), s′′) ≠ ρ(m+1). Agent j is now suspected of triggering the
transition from q to q′: j ∈ Susp((q, q′), a).

Since by assumption ρ(m) ∈ WinI, any action of fictitious player II (including
q′) still keeps the game in the winning region WinI of fictitious player I. If player
I adheres to WSI, game H(u) will still end up in some C′ ∈ OBJ(u). Since for all
i ∈ Agt(C′) we know that u(C′) ≼i u from the definition of the objective function,
the suspect agent j ∈ Susp((q, q′), a), who is in Agt(C′), will not be happier with the
outcome of her unilateral deviation once she compares it with the outcome resulting
from sticking to the equilibrium: based on her preference relation, the payoff vector
obtained by visiting C′ infinitely often is not ranked higher than the one obtained by
visiting C infinitely often.

Since m and j are selected arbitrarily, by Definition 23 there must exist an
equilibrium f associated with payoff vector u. The strategy profile f can be computed
from ρ: if w is the ω-word generating the run ρ, f is the projection of w on the set of
action profiles. In this way, u(q(0), f) = u.
An equilibrium corresponds essentially to an optimal strategy profile (note that
different equilibria are not comparable) for a group of agents with independent objec-
tives and preferences over outcomes.
Lemma 2. If there exists u ∈ PV such that for every agent i ∈ Π the payoff vector u
ranks highest in preference, that is, for any u′ ∈ PV \ {u} and any i ∈ Π, u′ ≼i u,
then there exists a pure Nash equilibrium that yields u.

Proof. Suppose such a u ∈ PV exists. By Proposition 7, for any
C ∈ Cycles(H̄) and any i ∈ Agt(C), it is u(C) ≼i u. From (6.2), it follows that
OBJ(u) = Cycles(H̄). In H(u), player I has objective F1 = OBJ(u) = Cycles(H̄) and
player II strives for the opposite, F2 = Cycles(H̄) \ OBJ(u), which is empty. Because of
this, player I always wins, since every play can only end up visiting cycles in Cycles(H̄)
infinitely often. (For the computation of the equilibrium f, see the last part of the
proof of Theorem 8; in this case WinI includes all states in H(u).)

An equilibrium as in Lemma 2 does not always exist. As a result, the quest for
equilibria in scenarios of multi-agent concurrent interaction cannot always be reduced
to a global optimization problem.
6.4.2 Special Cases — Büchi and Reachability Objectives

In the previous section, the equilibrium or the deterministic security strategy is
found by solving a two-player turn-based Muller game. In general, this operation has
computational complexity O(3^n) [35], where n is the number of game states. In what
follows, we analyze two interesting special cases in which the computational complexity
is significantly smaller: the cases where an objective is defined in the form of a
dba or a dfa. Then the set of equilibria (and security strategies, for that matter) can be
found in time polynomial and linear in the size of the two-player game arena, respectively.
We do not have to turn all types of games into Muller games.
6.4.2.1 Deterministic Büchi Games
Objectives expressed using dbas can also be defined using dmas. In principle,
one can convert a dba to a dma² and apply the process described in the previous
sections. However, the additional structure of the deterministic Büchi machine allows
for tangible computational gains when searching for equilibria. For the Büchi objective
Ωi of agent i, there is a dba Ai = 〈Si, C, Ti, Ii, Fi〉 that accepts exactly those runs that
satisfy Ωi. With respect to the multi-agent game (G, q(0)), the arena H̄ of the two-player
turn-based game is constructed using the methods of Section 6.4.1:

H̄ = (H, v(0)) ⋉ (A1, T1(I1, q(0))) ⋉ · · · ⋉ (AN, TN(IN, q(0))) = 〈V̄, ACT ∪ Q, T̄, v̄(0)〉 ,

where V̄ = V̄I ∪ V̄II. For each v̄ = (v, s) ∈ V̄, a payoff vector u(v̄) = (u1(v̄), . . . , uN(v̄)) is
computed such that ui(v̄) = 1 if s[i] ∈ Fi, and ui(v̄) = 0 otherwise. The set of payoff
vectors is PV = ⋃_{v̄∈V̄} u(v̄). The next proposition states that we can determine whether
there exists an equilibrium associated with a given u in the multi-player concurrent
Büchi game (G, q(0)) by solving a two-player turn-based Büchi game H(u). This is
significant because the equilibria in the latter can be obtained in time polynomial in
the size (i.e., the number of game states and transitions) of the Büchi game H(u).

² A dba with acceptance component Acc = F can be converted into a dma by defining
the acceptance component of the latter as Acc′ = F = {S ⊆ Q | S ∩ F ≠ ∅}.
Proposition 8. Given the initialized game (G, q(0)) and a payoff vector u ∈ PV, a
two-player turn-based Büchi game can be constructed as

H(u) = 〈V̄, ACT ∪ Q, T̄, v̄(0), F(u)〉

where F(u) = {v̄ ∈ V̄ | ∀i ∈ Agt(v̄), u(v̄) ≼i u}. There exists a pure Nash equilibrium
f such that u(q(0), f) = u if and only if the following conditions are satisfied (in the
order given):

1. player I wins the Büchi game H(u);

2. there exists a run ρ ∈ V̄^ω in H(u) that satisfies ρ(0) = v̄(0); for all i ≥ 0, ρ(i) ∈ WinI
and Agt(ρ(i)) = Π; Inf(ρ) ∩ {v̄ ∈ F(u) | u(v̄) = u} ≠ ∅; and Inf(ρ) ∩ (V̄ \ F(u)) = ∅.

Proof. It follows directly from the proof for the case of Muller objectives: if player
I wins the game, then the set of states visited infinitely often is contained in F(u). That
is, a unilateral deviation made by one of the agents j ∈ Π will not give her a payoff
vector better than u. The second condition ensures the existence of an equilibrium
f associated with payoff vector u: if w is an ω-word that generates a run ρ in
H(u) satisfying (1) for all k ≥ 0, ρ(k) ∈ WinI and Agt(ρ(k)) = Π (meaning that all rational
agents adhere to this policy), and (2) Inf(ρ) ∩ (V̄ \ F(u)) = ∅ and there exists v̄ ∈ Inf(ρ)
with u(v̄) = u, then the equilibrium f is just the projection of w on the set of action profiles.
The computational complexity of solving two-player turn-based Büchi games is
O(n(m + n)), where n is the number of game states and m is the number of transitions
in the game H(u) [45].
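Two-player turn-based Büchi games can be solved with the textbook attractor-based procedure (this is a generic sketch, not necessarily the specific algorithm of [45]): repeatedly remove the opponent's attractor of the states that cannot even reach the Büchi set once. In the encoding below, `owner[v]` names the player who moves at v, with player 0 playing the role of player I; totality of the game graph is assumed.

```python
def attractor(V, edges, owner, target, player):
    """States in V from which `player` can force a visit to `target` (subgame on V)."""
    attr = set(target) & V
    changed = True
    while changed:
        changed = False
        for v in V - attr:
            succ = [w for w in edges.get(v, ()) if w in V]
            if not succ:
                continue
            # player's own states need one good move; opponent states need all moves good
            if (owner[v] == player and any(w in attr for w in succ)) or \
               (owner[v] != player and all(w in attr for w in succ)):
                attr.add(v)
                changed = True
    return attr

def buchi_winning_region(V, edges, owner, F):
    """Winning region of player 0 for 'visit F infinitely often' (total game assumed)."""
    V = set(V)
    while True:
        reach = attractor(V, edges, owner, set(F) & V, 0)
        losing = V - reach            # player 1 avoids F forever from these states
        if not losing:
            return V                  # player 0 can force re-visiting F ad infinitum
        V -= attractor(V, edges, owner, losing, 1)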
6.4.2.2 Reachability Games
Reachability objectives correspond to first-order logic formulas. An automaton
accepting a reachability objective is an fsa. Again, we can view an fsa as a special
case of a dba3. However, as in the case of the previous section, there is a faster way.
In this case, the game equilibria can be computed in linear time, as solutions of a
two-player turn-based reachability game H.
The reachability objective Ωi for agent i ∈ Π is just a formula (can be thought
of as a regular expression) evaluated true only for those strings that are accepted by
the dfa Ai = 〈Si, C, Ti, Ii, Fi〉. In this case, with respect to (G, q(0)), the two-player
turn-based arena H is obtained once again by computing the synchronized product
H =(H, v(0)
)n(A1, T1(I1, q
(0)))n · · ·n
(AN , TN(IN , q
(0)))
= 〈VI ∪ VII ,ACT ∪Q, T , v(0)〉 .
For each v = (v, s) of H, a payoff vector is computed: u(v) =(u1(v), . . . , uN(v)
)where
ui(v) = 1 if s[i] ∈ Fi, and 0 otherwise. The set of payoff vectors is PV =⋃v∈V u(v).
The proposition that follows is in the spirit of Proposition 8 of the previous section,
in the sense that the decision problem of finding an equilibrium in the multi-agent
3 An fsa A can be converted into an dba A′ by adding a self-loop labeled “λ” (theempty string) for each q ∈ F = Acc of A, and letting Acc′ of A′ be F [19].
reachability game G is mapped to a corresponding decision problem in a two-player
zero-sum reachability game H.
Proposition 9. Given a game (G, q(0)) and a payoff vector u ∈ PV, a two-player
turn-based game arena can be constructed as H̄ = 〈V̄, ACT ∪ Q, T̄, v̄(0)〉. If u ∈ PV,
then there exists a pure Nash equilibrium f such that u(q(0), f) = u if the following
conditions are satisfied (in the order given):

1. player I wins the reachability game H(u) = 〈V̄, ACT ∪ Q, T̄, v̄(0), SafeI〉, where
SafeI is a set of states computed iteratively as follows:

(a) let Safe_I^(0) = {v̄ ∈ V̄ | ∀i ∈ Agt(v̄), u(v̄) ≼i u};

(b) for i ≥ 0, Safe_I^(i+1) = Safe_I^(i) ∩ NextI(Safe_I^(i)), where

NextI(W) := {v̄ ∈ V̄I ∩ W | (∃a ∈ ACT ) [T̄(v̄, a) ∈ W]} ∪
{v̄ ∈ V̄II ∩ W | (∀q ∈ Q : T̄(v̄, q) ↓) [T̄(v̄, q) ∈ W]}.

SafeI = Safe_I^(m) = Safe_I^(m+1) is the fixed point.

2. there exists a run ρ = ρ(0)ρ(1) . . . ρ(m), m ∈ N, such that for all i ∈ {0, . . . , m},
ρ(i) ∈ WinI, where WinI is the winning region of player I, Agt(ρ(i)) = Π, ρ(m) ∈ SafeI,
and u(ρ(m)) = u.
Proof. The set SafeI has the following property: for any v̄ ∈ SafeI, if v̄ is
a state where player I moves, then by definition there exists a move of player I
that keeps the game within SafeI; if v̄ is a state where player II moves, no move
player II can make takes the game outside SafeI. Since SafeI ⊆ Safe_I^(0)
by construction, we have that for any v̄ ∈ SafeI and every i ∈ Agt(v̄),
u(v̄) ≼i u. If a two-player turn-based reachability game is played with SafeI as the
objective of player I, any outcome ρ that results from player I adhering to her
winning strategy WSI satisfies Occ(ρ) ∩ SafeI ≠ ∅. Once the game state is in SafeI,
player I can apply strategy WSI, which at any state v̄ ∈ SafeI presents an action profile
a ∈ ACT that keeps the game within SafeI. Any agent i ∈ Π who unilaterally deviates
from this strategy profile does so at her own cost: she will find that the outcome of
the game is no more preferable to her than the outcome associated with
the payoff vector u. It remains to verify the existence of an equilibrium f
associated with u: if this equilibrium exists, then we can find a word w that generates
a run ρ in H(u) satisfying (1) for all 0 ≤ k ≤ |ρ|, ρ(k) ∈ WinI and Agt(ρ(k)) = Π (all agents
must adhere to the strategy), and (2) there exists ℓ ≥ 0 with ρ(ℓ) ∈ SafeI and u(ρ(ℓ)) = u.
In that case, the equilibrium is the projection of w onto the set of action profiles.
The complexity of solving two-player turn-based reachability or safety games is
O(m + n), where n is the number of game states and m is the number of transitions
in the game H(u) [45].
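The iterative computation of SafeI in Proposition 9 is a standard safety fixed point and can be sketched directly. The encoding is hypothetical: `T` maps (state, move) pairs to successor states, `V_I`/`V_II` partition the states by the player who moves, and `safe0` is the initial set of states whose payoff vectors are not preferred over u.

```python
def safe_fixed_point(V_I, V_II, T, safe0):
    """Iterate Safe^(i+1) = Safe^(i) ∩ Next_I(Safe^(i)) to its fixed point SafeI.

    T: dict, (state, move) -> state; safe0: the initial set Safe^(0).
    """
    safe = set(safe0)
    while True:
        nxt = set()
        for v in safe:
            succs = [T[(u, m)] for (u, m) in T if u == v]
            if v in V_I:
                # player I only needs one move that stays inside the safe set
                if any(w in safe for w in succs):
                    nxt.add(v)
            else:
                # every defined move of player II must stay inside (vacuous if none)
                if all(w in safe for w in succs):
                    nxt.add(v)
        if nxt == safe:
            return safe
        safe = nxt
```

Each pass removes player-I states with no safe move and player-II states with some escaping move; termination is guaranteed because the set shrinks monotonically.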
6.4.3 Security Strategies
If game (G, q(0)) is played only once, and agents cannot communicate to decide
on which equilibrium to adopt, the implementation of an equilibrium strategy profile
becomes problematic. This is because each agent has to perform her own calculations, and in general there can be several equilibria yielding the same payoff vector. These equilibria are neither comparable nor interchangeable. Even though there exists a set of pure equilibria, each of which is an optimal solution, if the agents do not agree on which one to adhere to jointly, the strategy profile that they end up following may lead to an inferior outcome and need not itself be an equilibrium, a phenomenon known as "thrashing."
In the face of uncertainty about the behavior of other agents, one reasonable
strategy for agent i is to secure a payoff vector above some specified level, against any
(rational or irrational) behavior of the others. Such a solution concept is similar to the
notion of security strategy in matrix games [87].
Definition 25 (Pure security strategy). A pure security strategy for agent i, denoted f_i^s : Q∗ → Σi, with respect to a designated security level u ∈ PV, satisfies

for all f ∈ SP such that f[i] = f_i^s, it is u ≼i u(q(0), f) .
That is, by adhering to f_i^s, agent i can ensure a payoff vector ranked at least as high as u. Intuitively, the pure security strategy is the best choice for agent i in the absence of information about the objectives, rationality, and preferences of the other agents.
Our approach to analyzing this case is similar in spirit to the treatment of
the previous section. For player i, we take the arena P of the initialized multi-agent
concurrent game (G, q(0)), and construct an arena for an agent-specific two-player turn-
based game (the superscript i is added to stress that this is the particular agent’s
defensive view of the world)
H^i = 〈V^i, Σi ∪ Q, T^i_h〉

where

• V^i = V^i_I ∪ V^i_II is the set of states, with V^i_I = {(q, {i}) | q ∈ Q} the set of states where player I makes a move, and V^i_II = {(q, Π[−i], σ) | q ∈ Q, σ ∈ Σi} the set of states where player II makes a move (here Π[−i] ≡ Π \ {i}).

• Σi, the action set of agent i, is the alphabet of player I.

• Q is the alphabet of player II.

• T^i_h is the transition function: given v ∈ V^i,
  – if v = (q, {i}) ∈ V^i_I, then for each σ ∈ Σi for which there exists an a ∈ ACT such that T(q, a) is defined and a[i] = σ, set T^i_h((q, {i}), σ) := (q, Π[−i], σ);
  – if v = (q, Π[−i], σ) ∈ V^i_II, then for each q′ ∈ Q for which there exists an a ∈ ACT such that T(q, a) = q′ and a[i] = σ, set T^i_h((q, Π[−i], σ), q′) := (q′, {i}).
In this arena, player I is agent i who at each turn can select an action from its
action set. Player II, who represents the whole collective, implements an action profile
which includes the choice of agent i, and whatever all other agents decide to do. This
arena is initialized with (q(0), {i}), and combined with the other agents' objectives using the synchronized product

H^i = (H^i, (q(0), {i})) ⋉ (A1, T1(I1, q(0))) ⋉ · · · ⋉ (AN, TN(IN, q(0))) = 〈V^i, Σi ∪ Q, T^i, v(0)i〉

where

• V^i = V^i_I ∪ V^i_II is the set of states, where the player-I states are of the form (v, s1, . . . , sN) with v a player-I state of the arena H^i above and each si a state of Ai, and similarly for the player-II states.

• Σi is the set of actions of player I.

• Q is the set of actions of player II.

• T^i is the transition function: for v = (v, s1, . . . , sN),
  – if v = (q, {i}) ∈ V^i_I and σ ∈ Σi, then T^i(v, σ) := (v′, s1, . . . , sN), provided that v′ = T^i_h(v, σ);
  – if, on the other hand, v = (q, Π[−i], σ) ∈ V^i_II and q′ ∈ Q, then T^i(v, q′) := (v′, s′1, . . . , s′N), provided that v′ = T^i_h(v, q′) and, for each j ∈ Π, s′j = Tj(sj, LB(q′)).

• v(0)i = ((q(0), {i}), s(0)1, . . . , s(0)N) is the initial state, where s(0)j = Tj(Ij, LB(q(0))) for each j ∈ Π.
For each cycle C ∈ Cycles(H^i), the payoff vector is u(C) = (u1(C), . . . , uN(C)), where for each j ∈ Π, uj(C) = 1 if {s[j] | (v, s) ∈ C} ∩ Fj ≠ ∅, and uj(C) = 0 otherwise. Note that the two-player game on arena H^i expresses the particular agent's defensive view of the game dynamics, and thus this game's objective function is slightly different, skewed toward the agent's conservative game-play:

OBJi(u) := {C ∈ Cycles(H^i) | u(C) ≽i u} . (6.3)
In view of (6.3), a security level for agent i in game (G, q(0)) is a specific u ∈ PV. Playing defensively, agent i wants to end the game on one of the cycles in OBJi(u); this way, the game's outcome, measured in terms of payoff vectors, is at least as good as u.
With the definition of the objective function in (6.3), the description of the two-player turn-based game related to the security level u for agent i can be completed as follows:

H^i(u) := 〈V^i_I ∪ V^i_II, Σi ∪ Q, T^i, v(0)i, (OBJi(u), Cycles(H^i) \ OBJi(u))〉 .
Given v = (v, s), where v = (q, {i}) or v = (q, Π[−i], σ), let State(v) = q ∈ Q be the state of the underlying multi-agent arena P associated with v. The following statement establishes the conditions under which there exists a security strategy that enables agent i to achieve such a lower bound on the outcomes:
Theorem 9. In the concurrent multi-agent game (G, q(0)), there exists a security strategy f_i^s with respect to the security level u for agent i, if player I wins in game H^i(u).
Proof. Suppose that the winning strategy for player I in H^i(u) is WS^i_I : (V^i)∗V^i_I → 2^Σi. At each turn when she moves, player I selects an action σ according to WS^i_I; subsequently, player II selects a state q ∈ Q which can be reached through an action profile a ∈ ACT with entry i of a being σ. Since WS^i_I is a winning strategy for player I, no matter how player II plays, player I wins the game by visiting infinitely often a cycle in the winning condition OBJi(u). On all cycles in OBJi(u), agent i in (G, q(0)) achieves a payoff vector which ranks at least as high as u according to this agent's preference. The security strategy f_i^s : Q∗ → 2^Σi for agent i in (G, q(0)) is then derived from WS^i_I as follows: given r = v(0)v(1) . . . v(n) ∈ (V^i)∗V^i_I, note that for every even k < n, State(v(k)) = State(v(k+1)). Projecting the run r onto the state set Q of G and removing all repeated entries, we obtain a run in G, ρr = State(v(0)) State(v(2)) . . . State(v(n−2)) State(v(n)). Then f_i^s(ρr) = WS^i_I(r).
The next statement guarantees the existence of at least one security level for each agent in the concurrent multi-agent game. The level itself, however, may not be particularly desirable for the associated agent.
Lemma 3. In every game (G, q(0)), the following statements hold:
• For each agent, there exists a security strategy for at least one security level;
• For each agent, either there is a unique highest security level, or there exists a set
of highest security levels, between the elements of which the agent is indifferent.
Proof. For the first part, we reason as follows. For each agent i, consider the lowest-ranked payoff vector u ∈ PV such that for any C ∈ Cycles(H^i), we have u ≼i u(C). Then OBJi(u) = Cycles(H^i), and the winning condition in H^i(u) becomes (Cycles(H^i), ∅). (Player I wins on every cycle, while player II wins nowhere.) In this game player I certainly has a winning strategy, because every run in H^i(u) ends up visiting one of the cycles in Cycles(H^i) infinitely often. The second part follows from the definition of the preference relation.
6.4.4 Cooperative Equilibria
In the cases considered so far, each agent acts independently, and cooperation is implicit through the ordering of possible outcomes in the preference relations: agent i cooperates with agent j if the success of both i and j makes i happier than when she succeeds alone. In this section, cooperation is considered explicitly. We identify cooperation as a concurrent deviation from an equilibrium policy for the purpose of collectively achieving some better outcome.

Our notion of such a cooperative equilibrium is related to stability solutions in coalition games: the group of agents who form a coalition is not determined a priori
but emerges through the computation of the equilibrium.4
In a concurrent game G, a team is a subset X of the set Π of agents. A unilateral team deviation by team X ∈ 2^Π from an action profile a is denoted a[X ↦ σ] = (a′1, a′2, . . . , a′N), where σ = (bj)j∈X is the tuple of the actions of all agents in team X, ordered by their index; we have a′i ≡ ai if i ∉ X and a′i ≡ bi if i ∈ X. The set
4 Recall the prisoner's dilemma problem and note that the optimal payoff cannot be achieved unless both prisoners deviate together from their lower-payoff equilibrium policy.
of teams in G is denoted Teams ⊆ 2^Π. Note that nothing prevents those agents from switching teams or breaking up, provided that at any given moment the teams in the game belong to Teams. This opens up a realm of possibilities, allowing teams to be formed and dissolved in an opportunistic way.
In this context, the concept of suspect agents generalizes to that of a suspect team: suppose a transition from q to q′ exists in G; then for an action profile b ∈ ACT, the set of suspect teams triggering a transition from q to q′ is

SuspTeams((q, q′), b) := {X ∈ Teams | (∃σi ∈ Mov(q, i) for each i ∈ X) [T(q, b[X ↦ (σi)i∈X]) = q′]} .
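The set SuspTeams can be computed directly by trying every joint deviation of each candidate team. The following Python sketch illustrates this under assumed data-structure choices (the partial transition function as a dict, teams as frozensets); the function name and signature are hypothetical.

```python
from itertools import product

def suspect_teams(T, q, q_next, b, teams, moves):
    """Teams X that could have triggered q -> q_next by jointly
    deviating from the announced profile b.

    T     : dict mapping (state, profile) -> next state (partial fn)
    b     : announced action profile, a tuple indexed by agent
    teams : iterable of candidate teams (frozensets of agent indices)
    moves : dict mapping (state, agent) -> available actions Mov(q, i)
    """
    suspects = set()
    for X in teams:
        members = sorted(X)
        # try every joint deviation of the team's members
        for choice in product(*(moves[(q, i)] for i in members)):
            a = list(b)
            for i, sigma in zip(members, choice):
                a[i] = sigma
            if T.get((q, tuple(a))) == q_next:
                suspects.add(X)
                break
    return suspects
```

Note that the announced profile itself counts as a (trivial) deviation, so a team whose announced actions already trigger the transition is also a suspect, matching the definition above.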
Definition 26 (Pure cooperative equilibrium). A strategy profile f is a cooperative equilibrium in an initialized multi-agent non-cooperative game (G, q(0)) if for any team X ∈ Teams, and for any strategy profile f′ obtained from f by a unilateral team deviation of X, it holds that for all k ∈ X, u(q(0), f′) ≼k u(q(0), f).
Just as we did in all previous cases where agents played on their own, we can
still use the multi-player concurrent arena P = 〈Q,ACT , T 〉 to construct a two-player
turn-based arena H in the familiar form
H = 〈V, ACT ∪ Q, Th〉

where

• V = VI ∪ VII is the set of states, with VI ⊆ Q × 2^Teams and VII ⊆ Q × 2^Teams × ACT.

• ACT ∪ Q is the alphabet: ACT = Σ1 × · · · × ΣN represents the available moves of player I, and Q the moves of player II.

• Th is the transition function, defined below.
Given v ∈ V, either

• v = (q, S) ∈ VI, where S ⊆ Teams, and for any a ∈ ACT with T(q, a) ↓, Th((q, S), a) := (q, S, a) ∈ VII; or

• v = (q, S, a) ∈ VII, and for any q′ ∈ Q with SuspTeams((q, q′), a) ∩ S ≠ ∅, Th(v, q′) := (q′, S′) ∈ VI, where S′ = {X ∈ Teams | X ⊆ Y ∧ Y ∈ SuspTeams((q, q′), a) ∩ S}. Intuitively, S′ includes not only the suspect teams intersecting with S, but also every team that is a subset of one of those suspect teams.
The analysis of equilibria in this game, where opportunistic teams of agents can play against each other when the interests of the teammates align, can be performed in exactly the same way as in Sections 6.4.1–6.4.2. In fact, the cases considered in those sections are merely special cases of the one considered here, in which each team is a singleton: Teams = {{i} | i ∈ Π}. Since players in Π have their own objectives and preference relations, teams can form and dissolve in an ad-hoc fashion, depending on the opportunities of the moment, the interests of the agents, and the agents' preference relations over outcomes of the game.
6.5 Case Study
We consider a scenario in which three agents Π = {1, 2, 3} need to visit different rooms of the environment in Fig. 6.2. The rooms are indexed A, B, C, D. Agents can pass through doors a, b, c and d, but only one agent at a time can go through a given door. The dynamics of each agent is modeled in the form of an sa Ai, depicted graphically in Fig. 6.6a. The concurrent product of the agent dynamics A1, A2, A3 yields the concurrent multi-agent game arena P in Fig. 6.3, in which the constraint that only one agent can pass through a door at any given time is encoded.

In Fig. 6.3, a state (i, j, k) is represented as ijk: agent 1 is in room i, agent 2 in room j and agent 3 in room k. Transitions are labeled with the action profiles σ1σ2σ3
that trigger them, where each σi denotes the door through which agent i goes. For all (ai)i∈Π ∈ ACT and i ≠ j ∈ Π, if ai, aj ≠ ε, then ai ≠ aj; this captures the constraint that two agents cannot pass through the same door simultaneously. The set of atomic propositions is AP = {α(i, m) : robot i is in room m, i ∈ {1, 2, 3}, m ∈ {A, B, C, D}}. The set of propositions evaluated true at q ∈ Q indicates the current locations of the agents.
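The door-exclusion constraint on action profiles can be enumerated directly. The sketch below is a hypothetical illustration (function and parameter names are not from the thesis); the stay-put action ε is represented as the character 'e' and may be shared by several agents.

```python
from itertools import product

def valid_profiles(doors, n_agents, eps='e'):
    """Enumerate action profiles in which no two agents use the same
    door simultaneously; `eps` (stay put) is exempt from the rule."""
    actions = list(doors) + [eps]
    profiles = []
    for prof in product(actions, repeat=n_agents):
        chosen = [a for a in prof if a != eps]
        if len(chosen) == len(set(chosen)):   # all chosen doors distinct
            profiles.append(prof)
    return profiles
```

For two agents and two doors a, b this keeps 7 of the 9 candidate profiles, discarding exactly those where both agents pick the same door.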
Based on the concurrent product P, we compute the two-player, turn-based game H in Fig. 6.4. Each state is a tuple of either two components (e.g., (ABC, {1, 2, 3})) or three (e.g., (ABC, {1, 2, 3}, acb)). In the first case the state is in VI, while in the second it is in VII. The semantics of a state in VII (e.g., (ABC, {1, 2, 3}, acb)) is that the agents are in the rooms marked by the first component (i.e., 1 in A, 2 in B and 3 in C), the agents suspected of triggering the transition there are the ones in the second component (i.e., all of them), and the agents are supposed to execute the actions specified in the third component (i.e., 1 goes through a, 2 through c and 3 through b). The semantics of a state in VI (say, (BDD, {3})) is that the agents are now where the first component says (i.e., 1 in B, 2 in D and 3 also in D) and that, for this state to have been reached, the agents in the second component (i.e., 3) are suspected of triggering the transition; the action profile that was actually implemented to reach that particular state was acd. Comparing acd with acb, it is clear that agent 3 deviated.
Figure 6.2: A partitioned rectangular environment in which three agents roam.
6.5.1 Reachability Objectives
Let us consider three different combinations of preference relations and rules for
team formation, and see what type of interaction behaviors can emerge as a result. We
will assume that the objective of agent i, with i ∈ Π, is a reachability objective Ωi,
Figure 6.3: A fragment of the multi-agent arena P = 〈Q,ACT , T 〉.
Figure 6.4: A fragment of the two-player turn-based game arena H.
Figure 6.5: Fragment of the partial synchronization product H.
(a) The sa modeling agent dynamics. A transition labeled ε means that the agent stays in the same room.

(b) A1: visit rooms A, B in any order.

(c) A fragment of A3: visit all rooms in any order. A transition label {A, B} stands for the world state c such that c = (A ∧ ¬(B ∨ C ∨ D)) ∨ (B ∧ ¬(A ∨ C ∨ D)).
Figure 6.6: The fsas representing the agent objectives.
equivalent to an fsa denoted Ai. The objective of agent 1, Ω1, is to visit rooms A and
B; the associated fsa A1 is shown in Fig. 6.6b. The objective of agent 2, Ω2, is to
visit rooms C and D, and its associated fsa is obtained from the fsa of Fig. 6.6b by relabeling the rooms as follows: A ↦ C, B ↦ D, C ↦ A and D ↦ B. The objective
of agent 3, Ω3, is to visit all rooms, and a fragment of the associated fsa appears in
Fig. 6.6c.
Case 1: Everyone for themselves.
Agents selfishly focus on achieving their own objectives, and there are no teams. Formally, this is expressed in the form of preference relations as follows: u ≺i u′, with i ∈ Π, if u[i] = 0 and u′[i] = 1; similarly, u ≃i u′ if u[i] = u′[i]. With the arena of the two-player reachability game being the sa H shown in Fig. 6.4, and the objective fsas Ai of Fig. 6.6, the synchronization product gives us H, a fragment of which is shown in Fig. 6.5.
The set of payoff vectors in this two-player reachability game arena H is PV = {0, 1}^3, with elements of the form u(v), in which the argument is a game state v = (v, s) ∈ VI ∪ VII. The structure of state v can be seen in the graph of Fig. 6.5. The first component of each such state, v, is essentially one of the states of H, which appear as node labels in the graph of Fig. 6.4. The second component of v, s (see Fig. 6.5), is a tuple of three states of the agents' objective automata. Clearly, s keeps track of what each agent has achieved so far in terms of its objective: the first element relates to agent 1, and if s[1] = AB then agent 1 has achieved its goal; the second element relates to agent 2, and when s[2] = CD, agent 2 has accomplished its task; the third element relates to agent 3, and reads s[3] = ABCD if agent 3 has completed its task.
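The bookkeeping performed by s, advancing each agent's objective automaton as the game visits new states, can be sketched as follows. This is a simplified illustration rather than the construction used in the thesis: transition tables are dicts, and a missing entry leaves an automaton state unchanged.

```python
def advance(automata, s, label):
    """Advance each agent's objective automaton on the label of the
    newly reached game state.

    automata : list of dicts mapping (state, label) -> next state
    s        : tuple of current automaton states, one per agent
    label    : the atomic-proposition label of the new game state
    """
    return tuple(automata[j].get((s[j], label), s[j])
                 for j in range(len(s)))
```

For instance, an automaton for "visit A and B in any order" reaches its accepting state AB only after both rooms have been seen, regardless of visits to other rooms in between.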
Let us single out the payoff vector u = (0, 0, 1) in PV. We can verify that the designated initial state of the game, ((ABC, {1, 2, 3}), A, 0, C), is in the winning region WinI of player I, by computing the equilibrium associated with (0, 0, 1) in the reachability game H(u). This equilibrium corresponds to a sequence of action profiles, a strategy profile f, from the state ABC of the multi-player arena P, which suggests a sequence of (concurrent) moves for each agent:

f = (εaε)(εab)(baε)(dba)(dbc)(bac)(εac) .
For example, according to this strategy, in the opening of the game agent 1 and agent
3 are to remain still, while agent 2 is to go through door a; then while agent 1 is still
at rest, agent 2 crosses door a again (in the opposite direction) and agent 3 springs
into action going through door b.
Following the strategy profile f, agents 1, 2, and 3 eventually find themselves in rooms A, A, and D, having already visited rooms {A, C, D}, {A, B, C}, and {A, B, C, D}, respectively. Agent 3 has achieved its goal, but agents 1 and 2 have not: agent 1 wanted to visit A and B, while agent 2 needed to go to C and D. Neither agent 1 nor agent 2 can achieve a better payoff by unilaterally deviating from f. However, if they deviate together, e.g., by jointly playing aεε instead of baε as the third action profile, then at least one of the two (in this case, agent 1) accomplishes its goal.
An exhaustive analysis of all possible payoff vectors in PV reveals that there exists an equilibrium for each of them (first row of Table 6.1).
Table 6.1: Nash equilibria for all payoff vectors in concurrent game G with reachability objectives

PV        000  001  010  100  110  011  101  111
case 1     ✓    ✓    ✓    ✓    ✓    ✓    ✓    ✓
case 2     ✗    ✗    ✗    ✓    ✓    ✗    ✓    ✓
case 3     ✓    ✗    ✓    ✓    ✓    ✗    ✗    ✓
Case 2: Selfish individuals in teams.
Agents in teams can deviate concurrently from an equilibrium policy. In this case, consider the set of possible teams Teams = {{1}, {2}, {1, 2}, {3}} in game G; agents 1 and 2 can work together if doing so serves their common interests. Note that while they can form a team and cooperate in an ad-hoc way, they can still perform unilateral deviations as individuals.
Solving the reachability game from the same initial condition for all payoff vectors in PV produces the second row of equilibria in Table 6.1. Now only half of the possible payoff vectors are associated with equilibria, and those that are appear biased toward solutions that yield higher payoffs for agents 1 and 2.
Case 3: Teaming against others.
In this case, we do not explicitly define possible team groupings. Instead, we prescribe preference relations over the set of possible payoff vectors and let the agents choose how to team up. What is of interest is the kind of gameplay strategies that emerge as stable equilibria.
The preference relation of agent 1 explicitly defines the following order among the eight possible outcomes:

(0, 0, 1) ≺1 (0, 1, 1) ≺1 (1, 0, 1) ≺1 (1, 1, 1) ≺1 (0, 0, 0) ≺1 (0, 1, 0) ≺1 (1, 0, 0) ≺1 (1, 1, 0) , (6.4)
which reads: "ideally I want myself and agent 2 to achieve our goals but not agent 3, and if I cannot have that I would rather win alone; if this is not possible I can let agent 2 win, but under no circumstances do I let 3 get her way; if 3 really has to win, then my preferences are the same as in the case where she loses." Agent 2 similarly prefers
the outcomes in the order

(0, 0, 1) ≺2 (1, 0, 1) ≺2 (0, 1, 1) ≺2 (1, 1, 1) ≺2 (0, 0, 0) ≺2 (1, 0, 0) ≺2 (0, 1, 0) ≺2 (1, 1, 0) . (6.5)
Agent 3 plays selfishly, with her mind set on achieving her own objective. Note that here agents 1 and 2 are radicalized: they prefer failure to letting agent 3 achieve her objective.
In this scenario (see Table 6.1), the payoff vector 010, which does not correspond to an equilibrium when agents 1 and 2 play selfishly, now does, because agent 1 lets agent 2 succeed as long as agent 3 loses. An implication of this observation is that by
simply redefining the preference relations, an opportunistic alliance between agents 1
and 2 can emerge.
6.5.2 Büchi Objectives

In this section we incorporate Büchi objectives into the multi-agent game: the objectives of agents 1 and 2 are to visit rooms A and B, and rooms C and D, respectively, infinitely often. Agent 3 needs to visit rooms A, B, and D infinitely often. The temporal logic formulae5 are

Ω1 : □♦(A ∧ ♦B)
Ω2 : □♦(C ∧ ♦D)
Ω3 : □♦(A ∧ ♦(B ∧ ♦D)) .
Any of the three objectives can be accepted by a suitable dba. It turns out that when ε ∈ Σi for all i ∈ Π, meaning that any agent can remain stationary at any time step, in all three cases we can find a pure Nash equilibrium for each payoff vector. However, if we restrict the behavior of agents by requiring ε ∉ Σi for all i ∈ Π, then we observe some interesting behaviors, as indicated in Table 6.2. In case 3, when agents 1 and 2 team up with the preferences indicated in (6.4) and (6.5), there does not exist an equilibrium that ensures the success of agent 3.
Table 6.2: Nash equilibria for all payoff vectors in concurrent game G with Büchi objectives (in the case when ε ∉ Σi, for all i ∈ Π)

PV        000  001  010  100  110  011  101  111
case 1     ✓    ✓    ✓    ✓    ✓    ✓    ✓    ✓
case 2     ✓    ✓    ✓    ✓    ✓    ✓    ✓    ✓
case 3     ✓    ✗    ✓    ✓    ✓    ✗    ✗    ✗
5 For semantics of temporal logic formulae, see [33].
6.5.3 Strategy Alternatives for Agent 3
In case 3 of Section 6.5.1, if agent 3 knows that the other two agents can team up against her and thus prevent her from completing her task, she may consider the following two options: she can either announce to the other players that she is willing to play fair, and call on everyone6 to follow the strategy associated with payoff (1, 1, 1), which allows everyone to win; or, if this is not an option, e.g., she cannot communicate, she can simply plan for the worst. Planning for the worst case amounts to searching for a security strategy.
Using Lemma 3, we can compute the best security level for every agent by solving a two-player turn-based game H^i(u) for each i = 1, 2, 3, with respect to different payoff vectors u. Figure 6.7 shows a fragment of the two-player turn-based arena H^1. It turns out that for both agents 1 and 2, the highest security level is (1, 0, 0); for agent 3, there is a set of highest security levels {(0, 0, 0), (1, 0, 0), (0, 1, 0), (1, 1, 0)}, between which agent 3 is indifferent based on her preference. It is a lost cause: no matter how she plays, she cannot meet her own objective if the other two agents both prefer that she lose.
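The search for the highest security levels can be sketched generically. Assuming a hypothetical solver wins(u) that decides whether player I wins H^i(u), one checks every payoff vector and keeps the maximal securable ones; all names below are illustrative, not from the thesis.

```python
def best_security_levels(payoff_vectors, prefers, wins):
    """Maximal security levels for one agent.

    payoff_vectors : iterable of candidate payoff vectors u
    prefers(v, u)  : True if the agent strictly prefers v over u
    wins(u)        : True if the agent has a security strategy for u
                     (i.e., player I wins the game H^i(u))
    """
    securable = [u for u in payoff_vectors if wins(u)]
    # keep u only if no strictly preferred v is also securable
    return [u for u in securable
            if not any(prefers(v, u) for v in securable if v != u)]
```

By Lemma 3, securable is never empty, so this always returns at least one level; when several maximal levels remain, the agent is indifferent between them.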
The existence of an equilibrium associated with payoff vector (1, 1, 1) does not necessarily ensure that agent 3 has a security strategy that can force these payoffs. This (Nash) equilibrium is stable under the implicit assumption that players behave strictly rationally; that is, they will always prefer to improve the outcome with respect to their own preferences when they have the choice.
Figure 6.7: Fragment of the two-player turn-based arena H1.
6 The utterance of a strategy profile in this case is different from the one found when communication in game theory is considered [87], because here it is supposed to be binding for all players.
6.6 Conclusions
This chapter suggests a game-theoretic approach to decentralized planning for the class of multi-agent systems with independent ω-regular objectives. The analysis of Nash equilibria in such collections of autonomous, rational systems that do not share the same objectives has not received enough attention in the temporal logic control and discrete event systems literature. A method is developed for constructing a multi-agent concurrent game which captures the interaction of multiple non-cooperative systems. Each subsystem is assigned an individual task defined using an ω-regular logical formula and has its own preference over all possible outcomes of the interactions. A pure Nash equilibrium in the resulting game is a collection of control strategies for the set of agents. We also analyze security strategies as an alternative solution concept for this special class of multi-agent concurrent games, and then introduce the notion of cooperative equilibrium in coalition games, which allows the overall system to produce behaviors with implicit or explicit cooperation between subsystems.
Chapter 7
CONCLUSIONS AND FUTURE WORK
In this thesis, the main focus is on optimal planning and adaptive control design for hybrid systems that interact with changing environments. The specifications and control objectives for such systems are given in terms of logical formulas over predicates which capture the interesting behavior of the systems when operating in their dynamic environments. We solve these problems within a hierarchical framework: by obtaining an abstraction for the class of hybrid systems considered, the planning and control synthesis problems can be lifted to the purely discrete level. Controllers and plans synthesized at the abstraction level are implementable in the original hybrid systems thanks to special simulation relations linking the concrete system and the abstract one.
To make the abstraction of hybrid systems scalable and computationally feasible, the methodology adopts a bottom-up approach. We show that a special class of hybrid systems, in which the continuous dynamics are convergent and the system is capable of re-parameterizing its continuous controllers, affords a partition of the continuous state space based on the asymptotic properties of the vector fields. This partition gives rise to purely discrete abstractions, which are weakly simulated by the underlying concrete hybrid dynamics. Solutions obtained through this process are in general suboptimal, except under certain special conditions which we identify.
In the presence of an unknown but rule-governed environment, grammatical inference can be incorporated as an identification method along a path toward building
robust, reactive and adaptive systems. Starting with an incomplete model of the environment, a system iteratively updates the model based on observations of the environment's behavior, using an appropriate grammatical inference algorithm selected based on whatever prior knowledge is available. If none is available, a hypothesis is made about the class of models to which the adversary dynamics belongs. If the hypothesis is correct, and a characteristic sample of the opponent's behavior (language) is observed, the learned model converges to the actual environment model in finitely many steps. Combining ideas from action model learning with grammatical inference, it is then shown that with the learning component we can eventually construct a game equivalent to the true game actually being played between the system and its environment. Due to this equivalence, a winning strategy computed on the hypothesized game is just as effective as the true winning strategy computed on the game with complete information. In the proposed adaptive control architecture, learning and control are combined in a modular way, in the sense that a range of different control synthesis methods can be adapted and used in conjunction with a variety of different grammatical inference algorithms.
Although reactive synthesis produces correct-by-construction controllers, the assumption of complete information is in general hard to satisfy due to limited sensing capabilities. For the case of partial observation, we defined a sensor model and formulated the interaction between the system and its environment as a two-player game with incomplete information. Control methods are developed with respect to sure-winning and almost-sure winning criteria. From a practical point of view, controllers that do not require extra memory to keep track of the history are more desirable; however, we show that in the case of partial observation it is not possible to obtain a memoryless controller. Randomized control policies can be found that ensure the task is accomplished with probability 1 in cases where a deterministic controller may not exist.
Treating the environment as an adversary can be unnecessarily defensive when the environment of a system is in fact a collection of other systems, each of which has its own task specification in the form of a temporal logic formula and its own preferences over their
interaction. We formulated the interaction of multiple individually rational systems as a multi-agent noncooperative game and adapted results from algorithmic game theory to compute (pure Nash) equilibria, security strategies and cooperative equilibria in this multi-agent system. The analysis of equilibria and cooperative equilibria can be applied to decentralized planning and control design of multi-agent systems. When the right incentives are given to individual agents, a globally desired behavior can emerge from their interaction. Moreover, this emergent behavior is robust in the sense that if a single agent fails, the interaction of the rest can converge to another desired stable point (an infinite sequence of concurrent actions). The analysis of security strategies can be used for control synthesis of a system in the presence of a dynamic environment, in which both the system and the environment act concurrently.
Future work can focus on extending the abstraction method for the special class of hybrid systems in Chapter 3 to stochastic systems, on adaptive control for the partial observation case, and on decentralized planning and control design of multi-agent systems with respect to a set of temporal logic specifications.
The current bottom-up abstraction method is restricted to the class of hybrid systems in which each low-level controller is deterministic, in the sense that for a given state within its region of attraction, after the controller is initiated, the state of the system will certainly satisfy the predicates that characterize its limit set. There are many hybrid systems whose existing low-level controllers are probabilistic: for example, due to exogenous disturbances and unmodeled dynamics, the convergence of a controller may be described by a probability distribution over a set of state sets. It would be meaningful to extend the proposed abstraction method to this class of stochastic hybrid systems. In addition, this thesis considers qualitative reachability properties; a promising direction is to consider quantitative measures (positive probability) for abstraction-based optimal planning of stochastic hybrid systems, as well as more general temporal logic properties such as liveness and safety.
For adaptive control in the case of partial observations, there is a need to develop a learning algorithm that identifies a model of the environment, or some
model that is observation-equivalent to it. To identify a model of the environment, it is necessary to incorporate a filtering method which removes noise from the observed environment behavior in a computationally efficient way. It is also important to ensure that the data presentation still contains a characteristic sample after the removal of noisy information. If none of these conditions can be fulfilled due to sensing uncertainty, a promising direction is to identify an equivalent game based on the notion of observation equivalence for the model of the environment. The intuition is that when the system observes only partial behavior of its environment, it suffices to identify a model which, given the sensing uncertainty, exhibits to the system the same observable behaviors as the true environment does.
The work on game-theoretic modeling and analysis of multi-agent systems pro-
vides theoretical results on how to compute the set of pure Nash or cooperative equi-
libria in the system. Although it can be adapted to decentralized control design for
multi-agent systems (by assigning different preference orderings over outcomes to dif-
ferent agents, and computing the sets of equilibria by solving the resulting games),
one still needs to design a communication protocol that realizes an equilibrium strategy
profile, and to quantify the preference orderings and task specifications by means of
utility functions. A possible direction along these lines is to extend decentralized control
methods based on solution concepts for multi-agent finite-stage games to the case of
infinite-stage games with winning conditions expressed in temporal logic.
BIBLIOGRAPHY
[1] Eric Aaron, Harold Sun, Franjo Ivancic, and Dimitris Metaxas. A hybrid dynamical systems approach to intelligent low-level navigation. In IEEE Proceedings of Computer Animation, pages 154–163, 2002.
[2] Rajeev Alur, Costas Courcoubetis, Thomas A. Henzinger, and Pei-Hsin Ho. Hybrid automata: An algorithmic approach to the specification and verification of hybrid systems. In Robert L. Grossman, Anil Nerode, Anders P. Ravn, and Hans Rischel, editors, Hybrid Systems: Computation and Control, volume 736 of Lecture Notes in Computer Science, pages 209–229. Springer Berlin Heidelberg, 1993.
[3] Rajeev Alur, Thao Dang, and Franjo Ivancic. Reachability analysis of hybrid systems via predicate abstraction. In Claire J. Tomlin and Mark R. Greenstreet, editors, Hybrid Systems: Computation and Control, volume 2289 of Lecture Notes in Computer Science, pages 35–48. Springer Berlin Heidelberg, 2002.
[4] Rajeev Alur, Thao Dang, and Franjo Ivancic. Predicate abstraction for reachability analysis of hybrid systems. ACM Transactions on Embedded Computing Systems, 5(1):152–199, 2006.
[5] Rajeev Alur, Thomas A. Henzinger, Gerardo Lafferriere, and George J. Pappas. Discrete abstractions of hybrid systems. Proceedings of the IEEE, 88(7):971–984, July 2000.
[6] Krzysztof R. Apt and Erich Grädel. Lectures in Game Theory for Computer Scientists. Cambridge University Press, 2011.
[7] A. Arnold, A. Vincent, and I. Walukiewicz. Games for synthesis of controllers with partial observation. Theoretical Computer Science, 303(1):7–34, 2003.
[8] André Arnold. Synchronized products of transition systems and their analysis. In Jörg Desel and Manuel Silva, editors, Application and Theory of Petri Nets, volume 1420 of Lecture Notes in Computer Science, pages 26–27. Springer Berlin Heidelberg, 1998.
[9] Tomáš Babiak, Mojmír Křetínský, Vojtěch Řehák, and Jan Strejček. LTL to Büchi automata translation: Fast and more deterministic. In Tools and Algorithms for the Construction and Analysis of Systems, pages 95–109. Springer, 2012.
[10] Raphaël Bailly. Quadratic weighted automata: Spectral algorithm and likelihood maximization. Journal of Machine Learning Research, 20:147–162, 2011.
[11] Sai K. Banala, Sunil K. Agrawal, Seok Hun Kim, and John P. Scholz. Novel gait adaptation and neuromotor training results using an active leg exoskeleton. IEEE/ASME Transactions on Mechatronics, 15(2):216–225, 2010.
[12] Leonor Becerra Bonache, Colin de la Higuera, Jean-Christophe Janodet, and Frédéric Tantini. Learning balls of strings with correction queries. In Joost N. Kok, Jacek Koronacki, Ramon López de Mántaras, Stan Matwin, Dunja Mladenić, and Andrzej Skowron, editors, Machine Learning: ECML 2007, volume 4701 of Lecture Notes in Computer Science, pages 18–29. Springer Berlin Heidelberg, 2007.
[13] Calin Belta, Antonio Bicchi, Magnus Egerstedt, Emilio Frazzoli, Eric Klavins, and George Pappas. Symbolic planning and control of robot motion. IEEE Robotics & Automation Magazine, 14(1):61–70, 2007.
[14] Dimitri P. Bertsekas. Dynamic Programming and Optimal Control, Two Volume Set. Athena Scientific, 2nd edition, 2001.
[15] Dietmar Berwanger and Łukasz Kaiser. Information tracking in games on graphs. Journal of Logic, Language and Information, 19(4):395–412, 2010.
[16] Patricia Bouyer, Romain Brenguier, Nicolas Markey, and Michael Ummels. Concurrent games with ordered objectives. In Lars Birkedal, editor, Foundations of Software Science and Computational Structures, volume 7213 of Lecture Notes in Computer Science, pages 301–315. Springer Berlin Heidelberg, 2012.
[17] Mireille Broucke. A geometric approach to bisimulation and verification of hybrid systems. In Frits W. Vaandrager and Jan H. van Schuppen, editors, Hybrid Systems: Computation and Control, volume 1569 of Lecture Notes in Computer Science, pages 61–75. Springer Berlin Heidelberg, 1999.
[18] Janusz A. Brzozowski. Derivatives of regular expressions. Journal of the ACM, 11:481–494, October 1964.
[19] Julius R. Büchi. On a decision method in restricted second-order arithmetic. In International Congress on Logic, Methodology, and Philosophy of Science, pages 1–11. Stanford University Press, 1962.
[20] John Case, Sanjay Jain, and Frank Stephan. Vacillatory and BC learning on noisy data. Theoretical Computer Science, 241(1–2):115–141, 2000.
[21] Christos Cassandras and Stéphane Lafortune. Introduction to Discrete Event Systems. Kluwer, 1999.
[22] Krishnendu Chatterjee, Martin Chmelík, and Rupak Majumdar. Equivalence of games with probabilistic uncertainty and partial-observation games. In Supratik Chakraborty and Madhavan Mukund, editors, Automated Technology for Verification and Analysis, Lecture Notes in Computer Science, pages 385–399. Springer Berlin Heidelberg, 2012.
[23] Krishnendu Chatterjee, Laurent Doyen, Thomas A. Henzinger, and Jean-François Raskin. Algorithms for omega-regular games with imperfect information. In Zoltán Ésik, editor, Computer Science Logic, volume 4207 of Lecture Notes in Computer Science, pages 287–302. Springer, 2006.
[24] Yushan Chen, Xu Chu Ding, and Calin Belta. Synthesis of distributed control and communication schemes from global LTL specifications. In IEEE Conference on Decision and Control, pages 2718–2723, Orlando, FL, 2011.
[25] Yushan Chen, Xu Chu Ding, Alin Stefanescu, and Calin Belta. A formal approach to deployment of robotic teams in an urban-like environment. In A. Martinoli, F. Mondada, N. Correll, G. Mermoud, M. Egerstedt, M. A. Hsieh, L. E. Parker, and K. Støy, editors, Distributed Autonomous Robotic Systems, volume 83 of Springer Tracts in Advanced Robotics, pages 313–327. Springer Berlin Heidelberg, 2013.
[26] Alongkrit Chutinan and Bruce H. Krogh. Verification of polyhedral-invariant hybrid automata using polygonal flow pipe approximations. In Frits W. Vaandrager and Jan H. van Schuppen, editors, Hybrid Systems: Computation and Control, volume 1569 of Lecture Notes in Computer Science, pages 76–90. Springer Berlin Heidelberg, 1999.
[27] Edmund Clarke, Ansgar Fehnker, Zhi Han, Bruce Krogh, Joël Ouaknine, Olaf Stursberg, and Michael Theobald. Abstraction and counterexample-guided refinement in model checking of hybrid systems. International Journal of Foundations of Computer Science, 14(4):583–604, 2003.
[28] Edmund M. Clarke Jr., Orna Grumberg, and Doron A. Peled. Model Checking. MIT Press, 1999.
[29] Satyaki Das and David L. Dill. Counter-example based predicate discovery in predicate abstraction. In Formal Methods in Computer-Aided Design, pages 19–32. Springer, 2002.
[30] Colin de la Higuera. Grammatical Inference: Learning Automata and Grammars. Cambridge University Press, 2010.
[31] Aldo De Luca and Antonio Restivo. A characterization of strictly locally testable languages and its application to subsemigroups of a free semigroup. Information and Control, 44(3):300–319, March 1980.
[32] Xu Chu Ding, Stephen L. Smith, Calin Belta, and Daniela Rus. MDP optimal control under temporal logic constraints. In 50th IEEE Conference on Decision and Control and European Control Conference, pages 532–538. IEEE, 2011.
[33] E. Allen Emerson. Temporal and modal logic. In Handbook of Theoretical Computer Science, Volume B: Formal Models and Semantics, pages 995–1072, 1990.
[34] Georgios E. Fainekos, Savvas G. Loizou, and George J. Pappas. Translating temporal logic to controller specifications. In 45th IEEE Conference on Decision and Control, pages 899–904, 2006.
[35] John Fearnley and Martin Zimmermann. Playing Muller games in a hurry. In A. Montanari, M. Napoli, and M. Parente, editors, Proceedings of the First Symposium on Games, Automata, Logic, and Formal Verification, volume 25, pages 146–161, 2010.
[36] Diego Figueira, Piotr Hofman, and Sławomir Lasota. Relating timed and register automata. In Proceedings of the 17th International Workshop on Expressiveness in Concurrency, pages 61–75, 2010.
[37] D. Fisman, O. Kupferman, and Y. Lustig. Rational synthesis. In Tools and Algorithms for the Construction and Analysis of Systems, pages 190–204, 2010.
[38] Jie Fu and Herbert G. Tanner. Optimal planning on register automata. In American Control Conference, pages 4540–4545, June 2012.
[39] Drew Fudenberg and David K. Levine. The Theory of Learning in Games, volume 1 of MIT Press Books. The MIT Press, 1998.
[40] Pedro García, Enrique Vidal, and José Oncina. Learning locally testable languages in the strict sense. In Proceedings of the Workshop on Algorithmic Learning Theory, pages 325–338, 1990.
[41] Paul Gastin and Denis Oddoux. Fast LTL to Büchi automata translation. In Gérard Berry, Hubert Comon, and Alain Finkel, editors, Proceedings of the 13th International Conference on Computer Aided Verification (CAV'01), volume 2102 of Lecture Notes in Computer Science, pages 53–65, Paris, France, July 2001. Springer.
[42] Antoine Girard and George J. Pappas. Hierarchical control system design using approximate simulation. Automatica, 45:566–571, 2009.
[43] William Glover and John Lygeros. A stochastic hybrid model for air traffic control simulation. In Rajeev Alur and George J. Pappas, editors, Hybrid Systems: Computation and Control, volume 2993 of Lecture Notes in Computer Science, pages 372–386. Springer Berlin Heidelberg, 2004.
[44] E. Mark Gold. Language identification in the limit. Information and Control, 10(5):447–474, 1967.
[45] Erich Grädel, Wolfgang Thomas, and Thomas Wilke, editors. Automata, Logics, and Infinite Games: A Guide to Current Research. Springer-Verlag New York, Inc., New York, NY, USA, 2002.
[46] Amaury Habrard, Marc Bernard, and Marc Sebban. Improvement of the state merging rule on noisy data in probabilistic grammatical inference. In Nada Lavrač, Dragan Gamberger, Hendrik Blockeel, and Ljupčo Todorovski, editors, Machine Learning: ECML 2003, volume 2837 of Lecture Notes in Computer Science, pages 169–180. Springer Berlin Heidelberg, 2003.
[47] Jeffrey Heinz. Inductive Learning of Phonotactic Patterns. PhD thesis, University of California, Los Angeles, 2007.
[48] Jeffrey Heinz. String extension learning. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 897–906, Uppsala, Sweden, July 2010.
[49] Jeffrey Heinz, Anna Kasprzik, and Timo Kötzing. Learning with lattice-structured hypothesis spaces. Theoretical Computer Science, 457:111–127, October 2012.
[50] Thomas A. Henzinger. The theory of hybrid automata. In Proceedings of the Eleventh Annual IEEE Symposium on Logic in Computer Science, pages 278–292, 1996.
[51] Thomas A. Henzinger, Ranjit Jhala, and Rupak Majumdar. Counterexample-guided control. Springer, 2003.
[52] John E. Hopcroft, Rajeev Motwani, and Jeffrey D. Ullman. Introduction to Automata Theory, Languages, and Computation (3rd Edition). Addison-Wesley, 2006.
[53] Roberto Horowitz and Pravin Varaiya. Control design of an automated highway system. Proceedings of the IEEE, 88(7):913–925, 2000.
[54] Jianghai Hu, John Lygeros, and Shankar Sastry. Towards a theory of stochastic hybrid systems. In Nancy Lynch and Bruce H. Krogh, editors, Hybrid Systems: Computation and Control, volume 1790 of Lecture Notes in Computer Science, pages 160–173. Springer Berlin Heidelberg, 2000.
[55] Sanjay Jain, Daniel Osherson, James S. Royer, and Arun Sharma. Systems That Learn: An Introduction to Learning Theory. Learning, Development and Conceptual Change. The MIT Press, 2nd edition, 1999.
[56] Gabriel Kalyon, Tristan Le Gall, Hervé Marchand, and Thierry Massart. Symbolic supervisory control of infinite transition systems under partial observation using abstract interpretation. Discrete Event Dynamic Systems, 22(2):121–161, 2012.
[57] Michael Kaminski and Nissim Francez. Finite-memory automata. Theoretical Computer Science, 134(2):329–363, 1994.
[58] Hassan Khalil. Nonlinear Systems. Prentice Hall, third edition, 2002.
[59] Marius Kloetzer and Calin Belta. A fully automated framework for control of linear systems from temporal logic specifications. IEEE Transactions on Automatic Control, 53(1):287–297, 2008.
[60] Xenofon D. Koutsoukos, Panos J. Antsaklis, James A. Stiver, and Michael D. Lemmon. Supervisory control of hybrid systems. Proceedings of the IEEE, 88(7):1026–1049, 2000.
[61] S. Kowalewski, S. Engell, J. Preußig, and O. Stursberg. Verification of logic controllers for continuous plants using condition/event-system models. Automatica, 35:505–518, 1999.
[62] Hadas Kress-Gazit, Georgios E. Fainekos, and George J. Pappas. Where's Waldo? Sensor-based temporal logic motion planning. In IEEE International Conference on Robotics and Automation, pages 3116–3121, 2007.
[63] Hadas Kress-Gazit, Georgios E. Fainekos, and George J. Pappas. Temporal-logic-based reactive mission and motion planning. IEEE Transactions on Robotics, 25(6):1370–1381, 2009.
[64] Ratnesh Kumar, Vijay Garg, and Steven I. Marcus. Predicates and predicate transformers for supervisory control of discrete event dynamical systems. IEEE Transactions on Automatic Control, 38:232–247, 1995.
[65] Bruno Lacerda and Pedro U. Lima. Linear-time temporal logic control of discrete event models of cooperative robots. Journal of Physical Agents, 2(1):53–61, 2008.
[66] H. Lamouchi and J. Thistle. Effective control synthesis for DES under partial observations. In Proceedings of the 39th IEEE Conference on Decision and Control, volume 1, pages 22–28, 2000.
[67] Helmut Lescow and Jens Vöge. Minimal separating sets for Muller automata. In Derick Wood and Sheng Yu, editors, Automata Implementation, volume 1436 of Lecture Notes in Computer Science, pages 109–121. Springer Berlin/Heidelberg, 1998.
[68] Carolos Livadas, John Lygeros, and Nancy A. Lynch. High-level modeling and analysis of the traffic alert and collision avoidance system. Proceedings of the IEEE, 88(7):926–948, 2000.
[69] John Lygeros and Shankar Sastry. Hybrid systems: modeling, analysis and control. Preprint, 1999.
[70] Matthew R. Maly, Morteza Lahijanian, Lydia E. Kavraki, Hadas Kress-Gazit, and Moshe Y. Vardi. Iterative temporal motion planning for hybrid systems in partially unknown environments. In Proceedings of the 16th International Conference on Hybrid Systems: Computation and Control, pages 353–362, New York, NY, USA, 2013. ACM.
[71] Zohar Manna and Amir Pnueli. The Temporal Logic of Reactive and Concurrent Systems. Springer-Verlag New York, Inc., New York, NY, USA, 1992.
[72] Robert McNaughton and Seymour Papert. Counter-Free Automata. MIT Press, 1971.
[73] Loizos Michael. Learning from partial observations. In Proceedings of the 20th International Joint Conference on Artificial Intelligence, pages 968–974, San Francisco, CA, USA, 2007. Morgan Kaufmann Publishers Inc.
[74] Stefan Mitsch, Sarah M. Loos, and André Platzer. Towards formal verification of freeway traffic control. In IEEE/ACM Third International Conference on Cyber-Physical Systems, pages 171–180, 2012.
[75] Madhavan Mukund. From Global Specifications to Distributed Implementations, pages 19–34. Kluwer Academic Publishers, 2002.
[76] Daniel Neider, Roman Rabinovich, and Martin Zimmermann. Down the Borel hierarchy: Solving Muller games via safety games. In Proceedings of the Third International Symposium on Games, Automata, Logics and Formal Verification, pages 169–182, 2012.
[77] Frank Neven, Thomas Schwentick, and Victor Vianu. Finite state machines for strings over infinite alphabets. ACM Transactions on Computational Logic, 5(3):403–435, 2004.
[78] Noam Nisan and Amir Ronen. Algorithmic mechanism design. In Proceedings of the 31st ACM Symposium on Theory of Computing, pages 129–140, 1999.
[79] Dominique Perrin and Jean-Éric Pin. Infinite Words: Automata, Semigroups, Logic and Games. Elsevier, 2004.
[80] Nir Piterman and Amir Pnueli. Synthesis of Reactive(1) designs. In Proceedings of Verification, Model Checking, and Abstract Interpretation, pages 364–380. Springer, 2006.
[81] Amir Pnueli. The temporal logic of programs. In 18th Annual Symposium on Foundations of Computer Science, pages 46–57, 1977.
[82] J. Raisch and S.D. O'Young. Discrete approximations and supervisory control of continuous systems. IEEE Transactions on Automatic Control, 43(4):569–573, 1998.
[83] Velupillai Sankaranarayanan and Ravi N. Banavar. Switched Finite Time Control of a Class of Underactuated Systems, volume 333 of Lecture Notes in Control and Information Sciences. Springer, 2006.
[84] Raymond Reiter. Knowledge in Action: Logical Foundations for Specifying and Implementing Dynamical Systems. MIT Press, 2001.
[85] James Rogers and Geoffrey Pullum. Aural pattern recognition experiments and the subregular hierarchy. Journal of Logic, Language and Information, 20:329–342, 2011.
[86] Jonathan Schiff, Philip S. Li, and Marc Goldstein. Robotic microsurgical vasovasostomy and vasoepididymostomy: a prospective random study in a rat model. Journal of Urology, 171:1720–1725, 2004.
[87] Yoav Shoham and Kevin Leyton-Brown. Multiagent Systems: Algorithmic, Game-Theoretic, and Logical Foundations. Cambridge University Press, 2009.
[88] George F. Simmons. Introduction to Topology and Modern Analysis. Krieger Publishing Company, 2003.
[89] K. Sreenath, H.-W. Park, I. Poulakakis, and J. W. Grizzle. A compliant hybrid zero dynamics controller for stable, efficient and fast bipedal walking on MABEL. International Journal of Robotics Research, 30(9):1170–1193, 2011.
[90] Colin Stirling. Modal and temporal logics for processes. In Faron Moller and Graham Birtwistle, editors, Logics for Concurrency: Structure versus Automata. Springer, 1996.
[91] Colin Stirling. The joys of bisimulation. In Proceedings of the 23rd International Symposium on Mathematical Foundations of Computer Science, volume 1450, pages 142–151, 1998.
[92] Paulo Tabuada. Approximate simulation relations and finite abstractions of quantized control systems. In A. Bemporad, A. Bicchi, and G. Buttazzo, editors, Hybrid Systems: Computation and Control, volume 4416 of Lecture Notes in Computer Science, pages 529–542. Springer-Verlag, 2007.
[93] Herbert Tanner, Jie Fu, Chetan Rawal, Jorge Piovesan, and Chaouki Abdallah. Finite abstractions for hybrid systems with stable continuous dynamics. Discrete Event Dynamic Systems, 22:83–99, 2012.
[94] Y. Tazaki and J. Imura. Finite abstractions of discrete-time linear systems and its application to optimal control. In Proceedings of the 17th IFAC World Congress, pages 4656–4661, 2008.
[95] Sylvie Thiébaux, Charles Gretton, John Slaney, David Price, and Froduald Kabanza. Decision-theoretic planning with non-Markovian rewards. Journal of Artificial Intelligence Research, 25(1):17–74, January 2006.
[96] J. G. Thistle and H. M. Lamouchi. Effective control synthesis for partially observed discrete-event systems. SIAM Journal on Control and Optimization, 48(3):1858–1887, June 2009.
[97] Ashish Tiwari and Gaurav Khanna. Series of abstractions for hybrid automata. In Claire J. Tomlin and Mark R. Greenstreet, editors, Hybrid Systems: Computation and Control, volume 2289 of Lecture Notes in Computer Science, pages 465–478. Springer Berlin Heidelberg, 2002.
[98] C. Tomlin, G.J. Pappas, and S. Sastry. Conflict resolution for air traffic management: a study in multiagent hybrid systems. IEEE Transactions on Automatic Control, 43(4):509–521, 1998.
[99] Jana Tumova, Gavin C. Hall, Sertac Karaman, Emilio Frazzoli, and Daniela Rus. Least-violating control strategy synthesis with safety rules. In Proceedings of the 16th International Conference on Hybrid Systems: Computation and Control, HSCC '13, pages 1–10, New York, NY, USA, 2013. ACM.
[100] A. Ulusoy, S.L. Smith, Xu Chu Ding, C. Belta, and D. Rus. Optimal multi-robot path planning with temporal logic constraints. In IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 3087–3092, 2011.
[101] Alphan Ulusoy, Stephen L. Smith, Xu Chu Ding, and Calin Belta. Robust multi-robot optimal path planning with temporal logic constraints. In 2012 IEEE International Conference on Robotics and Automation, pages 4693–4698. IEEE, 2012.
[102] Luis Valbuena and Herbert G. Tanner. Hybrid potential field based control of differential drive mobile robots. Journal of Intelligent & Robotic Systems, 68(3–4):307–322, 2012.
[103] Eric R. Westervelt, Jessy W. Grizzle, Christine Chevallereau, Jun Ho Choi, and Benjamin Morris. Feedback Control of Dynamic Bipedal Robot Locomotion. CRC Press, Boca Raton, 2007.
[104] Y. Willner and M. Heymann. Supervisory control of concurrent discrete-event systems. International Journal of Control, 54(5):1143–1169, 1991.
[105] Eric M. Wolff, Ufuk Topcu, and Richard M. Murray. Efficient reactive controller synthesis for a fragment of linear temporal logic. In Proceedings of the International Conference on Robotics and Automation, 2013 (in press).
[106] T. Wongpiromsarn, U. Topcu, and R.M. Murray. Receding horizon temporal logic planning. IEEE Transactions on Automatic Control, 57(11):2817–2830, 2012.
[107] Michael J. Wooldridge. Introduction to Multiagent Systems. John Wiley & Sons, Inc., New York, NY, USA, 2001.
[108] S. Xu and R. Kumar. Discrete event control under nondeterministic partial observation. In IEEE International Conference on Automation Science and Engineering, pages 127–132. IEEE, 2009.
[109] M. Zamani, G. Pola, M. Mazo, and P. Tabuada. Symbolic models for nonlinear control systems without stability assumptions. IEEE Transactions on Automatic Control, 57(7):1804–1809, 2012.
[110] V. I. Zubov. Mathematical Methods for the Study of Automatic Control Systems. Pergamon Press/Macmillan, 1963.
Appendix A
ASYMPTOTIC (T,D) EQUIVALENCE CLASSES
Denote dist(x, A) the distance between point x and the set A, defined as dist(x, A) := inf_{y ∈ A} ‖x − y‖.
Theorem 10 (Zubov [110]). The set Ω is the region of attraction of a periodic orbit
x = ϕ(t) with period T, if and only if there exist two functions V(x) and W(x) defined
on Ω satisfying: 1) V(x) is continuous on Ω, and the domain of W(x) can be extended
to the entire X; 2) V(x) ∈ (0, 1) for all x ∈ Ω \ ϕ, and V(x) = 0 for dist(x, ϕ) = 0;
3) W(x) > 0 for dist(x, ϕ) > 0, and W(x) = 0 for dist(x, ϕ) = 0; 4)

∇V^⊤ f(x) = −W(x) √(1 + ‖f‖²) (1 − V) ,      (A.1)

5) lim_{x → ∂Ω} V(x) = 1.
Proposition 10. Consider a system ẋ = f(x), and assume that its trajectories remain
inside a compact set Ω and that the (attractive) limit set L⁺ of the trajectories contains
a single, isolated component. Denote φ(t; x(0)) the trajectory of f starting at x(0), and
let V(x) be a solution to (A.1) that satisfies the requirements of Theorem 10. Then
the trajectories of f starting in Ω enter an ε-neighborhood of L⁺ in finite time, at most
T = (1/d) ln((1 − c)/(1 − C)), where¹ c := min_{x ∈ cl(L⁺ ⊕ B_ε(0))} V(x), C := max_{x ∈ Ω} V(x),
and d > 0 is the decay constant obtained in the proof.
Proof. Pick W(x) = dist(x, ϕ). This choice trivially satisfies the requirements of
Theorem 10. Now let V(x) be a solution of (A.1) that conforms with the conditions of
Theorem 10. Then from (A.1), for x ∈ Ω \ (L⁺ ⊕ B_ε(0)) it follows that V̇ ≤ −d(1 − V)
for some constant d > 0, and applying the Comparison Lemma with C = max_{x ∈ Ω} V(x)
one obtains V(x(t)) ≤ 1 − (1 − C)e^{dt}. Let V(x) = c be the largest level set of V included
in the closure of L⁺ ⊕ B_ε(0). Then the following upper bound can be obtained on the
time required for the flows starting within the level set V(x) = C to reach L⁺ ⊕ B_ε(0):
t ≤ (1/d) ln((1 − c)/(1 − C)). Setting T ≜ (1/d) ln((1 − c)/(1 − C)), the proof is completed.

¹ cl is used to denote set closure.
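The bound in the proof can be sanity-checked numerically. The sketch below is ours, with illustrative constants d, c, C not taken from the thesis: it integrates the comparison system V̇ = −d(1 − V) from V(0) = C with forward Euler, and confirms that the first time V drops to the level c agrees with T = (1/d) ln((1 − c)/(1 − C)).

```python
import math

# Illustrative constants (assumed for this sketch, not from the thesis)
d, c, C = 1.0, 0.1, 0.9

# Closed-form bound from the proof: T = (1/d) ln((1 - c)/(1 - C))
T = (1.0 / d) * math.log((1.0 - c) / (1.0 - C))

# Forward-Euler integration of the comparison system V' = -d (1 - V),
# starting from the worst-case initial value V(0) = C.
dt, t, V = 1e-5, 0.0, C
while V > c:
    V += dt * (-d * (1.0 - V))
    t += dt

print(f"closed-form bound T = {T:.4f}, Euler hitting time = {t:.4f}")
```

With these constants T = ln 9, and the numerically computed hitting time matches it to the integration tolerance.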
143
Appendix B
LEARNING ALGORITHM FOR THE CLASS OF STRICTLY K-LOCAL LANGUAGES
A string u is a factor of a string w iff there exist x, y ∈ Σ∗ such that w = xuy. If in
addition |u| = k, then u is a k-factor of w. The k-factor function factor_k : Σ∗ → 2^{Σ^{≤k}}
maps a word w to the set of k-factors within it if |w| > k; otherwise it maps w to the
singleton set {w}. This function is extended to languages as factor_k(L) := ⋃_{w ∈ L} factor_k(w).
A language L is Strictly k-Local (SL_k) iff there exists a finite set G ⊆ factor_k(♯Σ∗♯), such
that L = {w ∈ Σ∗ | factor_k(♯w♯) ⊆ G}, where ♯ is a special symbol indicating the
beginning and end of a string. The set G is the grammar that generates L.
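These definitions translate directly into code. The following sketch is our own illustration (function names hypothetical; the character '#' stands in for the word-boundary symbol): it computes the k-factors of a string and tests SL_k membership against a grammar.

```python
def factors_k(w: str, k: int) -> set:
    """Set of k-factors of w; if w is no longer than k, the singleton {w}."""
    if len(w) <= k:
        return {w}
    return {w[i:i + k] for i in range(len(w) - k + 1)}

def in_sl_k(w: str, grammar: set, k: int) -> bool:
    """w belongs to the SL_k language of `grammar` iff every k-factor
    of the boundary-marked string #w# lies in the grammar."""
    return factors_k('#' + w + '#', k) <= grammar

# Hypothetical SL_2 grammar forbidding the factor 'aa' over {a, b}:
G = {'#a', '#b', 'ab', 'ba', 'bb', 'a#', 'b#'}
print(in_sl_k('abab', G, 2), in_sl_k('aab', G, 2))  # True False
```

The membership test is just a subset check, which is what makes these languages so easy to learn.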
A language is called Strictly Local if it is Strictly k-Local for some k. There are
many distinct characterizations of this class. For example, the Strictly Local languages
are exactly those recognized by (generalized) Myhill graphs, those definable in a
restricted propositional logic over a successor function, and those closed under suffix
substitution [31, 72, 85]. Furthermore, there are known methods for translating between
automata-theoretic representations of Strictly Local languages and these other
characterizations.
Theorem 11 ([40]). For known k, the Strictly k-Local languages are identifiable in
the limit from positive data.
Readers are referred to the cited papers for a proof of this theorem. We sketch
the basic idea here using the grammars for Strictly k-Local languages defined above.
Consider any L ∈ SL_k. The grammar for L is G = factor_k(♯ · L · ♯),¹ and G contains
only finitely many strings.

¹ The operator · : 2^{Σ∗} × 2^{Σ∗} → 2^{Σ∗} concatenates string sets: given S₁, S₂ ⊆ Σ∗, S₁ · S₂ = {xy | x ∈ S₁, y ∈ S₂}.
A poly-time, incremental, and set-driven learning algorithm for the Strictly k-Local
languages is GIM, defined in [48] by: 1) i = 0: GIM(φ[i]) := ∅; 2) φ(i) = # (a pause in
the presentation): GIM(φ[i]) := GIM(φ[i − 1]); 3) otherwise: GIM(φ[i]) := GIM(φ[i − 1]) ∪ factor_k(♯φ(i)♯).
There is some finite point in every data presentation of L at which the learning
algorithm converges to the grammar of L, because the cardinality of G is finite. This
particular algorithm is analyzed in [48], and is a special case of lattice-structured
learning [49].
Example 2. Consider the strictly 2-local language L over Σ = {a, b} consisting of
the strings that have neither aa nor ba as a factor; that is, L = S̄ for S = (Σ∗aaΣ∗) ∪
(Σ∗baΣ∗), where S̄ denotes the complement of the set S with respect to Σ∗. The
grammar is G = factor₂(♯ · L · ♯) = {♯a, ♯b, ab, bb, b♯, a♯}. Obviously, aaa ∉ L because
factor₂(♯aaa♯) = {♯a, aa, a♯} ⊄ G.

Learning proceeds as follows: given a positive presentation φ where φ(1) = ab,
φ(2) = bb, φ(3) = a, applying the learning algorithm yields GIM(φ[1]) = factor₂(♯ab♯) =
{♯a, ab, b♯}; GIM(φ[2]) = GIM(φ[1]) ∪ factor₂(♯bb♯) = {♯a, ♯b, ab, bb, b♯}; GIM(φ[3]) =
GIM(φ[2]) ∪ factor₂(♯a♯) = G. The learner converges after having observed only three strings.
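A minimal sketch of the GIM learner (our own rendering; '#' replaces the boundary symbol, and None models a pause in the presentation) reproduces Example 2:

```python
def factors_k(w, k):
    # k-factors of w, or {w} itself when w is no longer than k
    if len(w) <= k:
        return {w}
    return {w[i:i + k] for i in range(len(w) - k + 1)}

def gim(presentation, k):
    """String extension learner for SL_k: the conjectured grammar is the
    running union of k-factors of the boundary-marked data seen so far."""
    G = set()                 # GIM(phi[0]) = empty set
    for datum in presentation:
        if datum is None:     # pause: conjecture unchanged
            continue
        G |= factors_k('#' + datum + '#', k)
    return G

# Replaying Example 2: the presentation ab, bb, a suffices for convergence.
grammar = gim(['ab', None, 'bb', 'a'], 2)
print(sorted(grammar))  # the six factors of the target grammar
```

Because the conjecture only ever grows, and the target grammar is finite, convergence is guaranteed once every grammar factor has appeared in some datum.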
This learning algorithm does not output a finite-state automaton, but sets of
factors. However, there is an easy way to convert any grammar of factors into an
acceptor which recognizes the same Strictly Local language. This acceptor is not the
canonical acceptor for the language, but it is a normal form. It is helpful to define
the function suf_k(L) = {v ∈ Σᵏ | (∃w ∈ L)(∃u ∈ Σ∗)[w = uv]}. Given k and a set of
factors G ⊆ factor_k(♯ · Σ∗ · ♯), construct a finite-state acceptor A_G = 〈Q, Σ, T, I, Acc〉 as follows.

• Q = suf_{k−1}(Pr(L(G)))

• (∀u ∈ Σ^{≤1})(∀σ ∈ Σ)(∀v ∈ Σ∗)[T(uv, σ) = vσ ⇔ uv, vσ ∈ Q]

• I = {λ} if L(G) ≠ ∅, and I = ∅ otherwise

• Acc = suf_{k−1}(L(G))

The proof that L(A_G) = L(G) is given in [47, p. 106].
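A sketch of the conversion (our own rendering, assuming the single-'#' boundary marking used above; all names hypothetical): states are the trailing (k−1)-symbol windows of the '#'-prefixed input read so far, a transition is legal exactly when the k-factor it completes belongs to G, and a state accepts when appending the closing '#' completes a grammar factor.

```python
def build_acceptor(grammar, k, alphabet):
    """Build a DFA (delta, accept, start) from a set G of k-factors."""
    start = '#'
    states, delta, stack = {start}, {}, [start]
    while stack:
        q = stack.pop()
        for a in alphabet:
            word = q + a
            # The k-factor completed by reading `a` in state q:
            if len(word) >= k and word[-k:] not in grammar:
                continue               # illegal factor: no transition
            r = word[-(k - 1):]        # next state: last k-1 symbols
            delta[(q, a)] = r
            if r not in states:
                states.add(r)
                stack.append(r)
    accept = {q for q in states if (q + '#')[-k:] in grammar}
    return delta, accept, start

def accepts(delta, accept, start, w):
    q = start
    for a in w:
        if (q, a) not in delta:
            return False
        q = delta[(q, a)]
    return q in accept

# Grammar from Example 2 (strings with neither 'aa' nor 'ba' as a factor):
G = {'#a', '#b', 'ab', 'bb', 'b#', 'a#'}
delta, accept, start = build_acceptor(G, 2, 'ab')
results = [accepts(delta, accept, start, w) for w in ['ab', 'bb', 'aba', 'aaa']]
print(results)  # [True, True, False, False]
```

This is not the construction from [47] verbatim, only an executable approximation of the same idea: the reachable states here play the role of Q, and the transition legality check enforces membership of every completed k-factor in G.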