BOTTOM-UP SYMBOLIC CONTROL AND ADAPTIVE SYSTEMS:
ABSTRACTION, PLANNING AND LEARNING
by
Jie Fu
A dissertation submitted to the Faculty of the University of Delaware in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Mechanical Engineering
Fall 2013
© 2013 Jie Fu
All Rights Reserved
BOTTOM-UP SYMBOLIC CONTROL AND ADAPTIVE SYSTEMS:
ABSTRACTION, PLANNING AND LEARNING
by
Jie Fu
Approved: Suresh G. Advani, Ph.D.
Chair of the Department of Mechanical Engineering

Approved: Babatunde A. Ogunnaike, Ph.D.
Dean of the College of Engineering

Approved: James G. Richards, Ph.D.
Vice Provost for Graduate and Professional Education
I certify that I have read this dissertation and that in my opinion it meets the academic and professional standard required by the University as a dissertation for the degree of Doctor of Philosophy.

Signed: Herbert G. Tanner, Ph.D.
Professor in charge of dissertation

I certify that I have read this dissertation and that in my opinion it meets the academic and professional standard required by the University as a dissertation for the degree of Doctor of Philosophy.

Signed: Ioannis Poulakakis, Ph.D.
Member of dissertation committee

I certify that I have read this dissertation and that in my opinion it meets the academic and professional standard required by the University as a dissertation for the degree of Doctor of Philosophy.

Signed: Joshua L. Hertz, Ph.D.
Member of dissertation committee

I certify that I have read this dissertation and that in my opinion it meets the academic and professional standard required by the University as a dissertation for the degree of Doctor of Philosophy.

Signed: Jeffrey Heinz, Ph.D.
Member of dissertation committee
ACKNOWLEDGEMENTS
First of all, I would like to express my deepest appreciation for the support from
my advisor, Dr. Herbert Tanner. His guidance and inspirational ideas have made my
Ph.D. study a rewarding, thoughtful, and joyful journey. I would like to thank Dr.
Jeffrey Heinz for his constant help, patience, and encouragement. I would like to thank
my committee members, Dr. Ioannis Poulakakis and Dr. Joshua Hertz, for their valuable
advice regarding my research and dissertation.
Over the course of my doctoral program, I have collaborated with many researchers
from the Linguistics Department. I would like to thank all the members of
the CPS group; without the insightful discussions we had, much of the progress in this
work could not have been made. In addition, I would like to thank Dr. Jim Rogers and Dr.
John Case for generously sharing their ideas and knowledge. I also wish to thank all
the members of the Cooperative Robotics Laboratory for their friendship and help during
my stay at UDel. I would like to acknowledge the funding sources NSF CAREER #0907003,
NSF CPS #1035577, and ARL MAST CTA #W911NF-08-2-0004.
Last but not least, I want to thank my family: my parents Yingxue Wu and
Yongming Fu, and my husband Juannan Zhou, for their support, understanding, and
love.
TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
ABSTRACT

Chapter

1 INTRODUCTION
1.1 Motivation
1.2 Problem Description
1.3 Challenges
1.4 Approach Overview
1.5 Technical Objectives and Contributions
1.6 Thesis Outline

2 LITERATURE REVIEW
2.1 Hybrid Systems
2.2 Abstraction
2.3 Formal Synthesis
2.4 Perspectives

3 BOTTOM-UP SYMBOLIC PLANNING
3.1 Overview
3.2 Automata and Transition Systems
3.2.1 Automata and their Semantics
3.2.2 Transition Systems
3.2.3 Register Automata and their Semantics
3.3 Hybrid Agents
3.3.1 Mathematical Model
3.3.2 Problem Statement
3.4 Abstraction
3.4.1 Predicate Abstraction and the Induced Register Automata
3.4.2 Weak (Bi)simulations
3.5 Time-optimal Planning
3.5.1 Searching for Candidate Plans
3.5.1.1 A Graph Representation
3.5.1.2 Finding Walk Candidates
3.5.2 Dynamic Programming — a Variant
3.6 Case Study
3.6.1 Control Mode a: Nonholonomic Control
3.6.2 Control Modes b and c: Catch and Release
3.6.3 The System Model
3.6.4 Task Specifications
3.6.5 Solving the Planning Problem
3.7 Conclusions

4 ADAPTIVE CONTROL SYNTHESIS — WITH PERFECT INFORMATION
4.1 Overview
4.2 Grammatical Inference and Infinite Games
4.2.1 Infinite Words
4.2.2 Grammatical Inference
4.2.3 Specification Language
4.2.4 Infinite Games
4.3 System Behavior as Game Play
4.3.1 Constructing the Game
4.3.2 Game Theoretic Control Synthesis
4.4 Integrating Learning with Control
4.4.1 Languages of the Game
4.4.2 Learning the Game — a First Approach
4.4.3 Learning an Equivalent Game
4.4.3.1 Equivalence in Games
4.4.3.2 Learning an Equivalent Game from Positive Presentations
4.5 Case Study
4.6 Conclusions

5 CONTROL SYNTHESIS WITH PARTIAL OBSERVATIONS
5.1 Overview
5.2 Symbolic Synthesis with Partial Observations
5.2.1 The Model
5.2.2 Deterministic Finite-memory Strategy
5.2.3 Randomized Finite-memory Controllers
5.2.3.1 Randomized Controllers for Reachability Objectives
5.2.3.2 Randomized Controllers for Büchi Objectives
5.3 Discussions and Conclusions

6 OUTLOOK: MULTI-AGENT SYSTEMS
6.1 Overview
6.2 Preliminaries
6.3 Modeling a Multi-agent Game
6.3.1 Constructing the Game Arena
6.3.2 Specification Language
6.3.3 The Game Formulation
6.4 Game Theoretic Analysis
6.4.1 Pure Nash Equilibria
6.4.2 Special Cases — Büchi and Reachability Objectives
6.4.2.1 Deterministic Büchi Games
6.4.2.2 Reachability Games
6.4.3 Security Strategies
6.4.4 Cooperative Equilibria
6.5 Case Study
6.5.1 Reachability Objectives
6.5.2 Büchi Objectives
6.5.3 Strategy Alternatives for Agent 3
6.6 Conclusions

7 CONCLUSIONS AND FUTURE WORK

BIBLIOGRAPHY

Appendix

A ASYMPTOTIC (T,D) EQUIVALENCE CLASSES
B LEARNING ALGORITHM FOR THE CLASS OF STRICTLY K-LOCAL LANGUAGES
LIST OF TABLES

3.1 Pre and Post maps for the control modes of the hybrid agent.

6.1 Nash equilibria for all payoff vectors in concurrent game G with reachability objectives.

6.2 Nash equilibria for all payoff vectors in concurrent game G with Büchi objectives (in the case when ε ∉ Σi, for all i ∈ Π).
LIST OF FIGURES

3.1 An example of a 2-register automaton.

3.2 The transformation semiautomaton TR(H) of hybrid agent H, for the task specification considered.

3.3 Discretized workspace for the mobile manipulator and the optimal path. The two concentric collections of points mark parameter class representatives around the object and user positions.

4.1 The architecture of hybrid planning and control with a module for grammatical inference.

4.2 Learning and planning with a grammatical inference module.

4.3 The environment and its abstraction.

4.4 A fraction of (G, v0), where v0 = ((1, c, 1), 1). A state ((q1, q2, t), qs) means the robot is in q1; the recent consecutively closed (at most two) doors are q2; t = 1 if player 1 is to make a move, otherwise player 2 is to make a move; and the visited rooms are encoded in qs, e.g., 12 means rooms 1 and 2 have been visited.

4.5 Convergence of learning L2(G, v0): the ratio between the size of the grammar inferred by the GIM and that of L2(G, v0), in terms of the number of moves made.

5.1 A fragment of a game graph P and the observed game structure obs P.

6.1 Relating runs r in multi-agent game G, ρ in the two-player turn-based game graph H, and τ in objective Aj.

6.2 A partitioned rectangular environment in which three agents roam.

6.3 A fragment of the multi-agent arena P = 〈Q, ACT, T〉.

6.4 A fragment of the two-player turn-based game arena H.

6.5 Fragment of the partial synchronization product H.

6.6 The finite state automata (FSA) representing the agent objectives.

6.7 Fragment of the two-player turn-based arena H1.
ABSTRACT
This thesis develops an optimal planning method for a class of hybrid systems, and combines machine learning with reactive synthesis to construct adaptive controllers for finite-state transition systems with respect to high-level formal specifications in the presence of an unknown, dynamic environment.

For a class of hybrid systems that switch between different pre-defined low-level controllers, this thesis develops an automated method that builds time-optimal control mode sequences satisfying given system specifications. The planning algorithm proceeds in a bottom-up fashion. First, it abstracts a hybrid system of this class into a special finite-state automaton that can manipulate continuous data expressing attributes of the concrete dynamics. The abstraction is predicate-based, enabled by the convergence properties of the low-level continuous dynamics, and it encompasses existing low-level controllers rather than replacing them during synthesis. The abstraction is weakly simulated by its concrete system, and thus behaviors planned using the abstraction are always implementable on the concrete system.
The abstraction procedure bridges hybrid dynamical systems with formal language and automata theory, and enables us to import concepts and methodologies from those fields into control synthesis. In the presence of an unknown, dynamic, and potentially adversarial environment, this thesis develops an adaptive synthesis method, and advocates the integration of ideas from grammatical inference and reactive synthesis toward the development of an any-time control design that is guaranteed to be effective in the limit. The insight is that, at the abstraction level, the behavior the unknown environment exhibits during its interaction with the system can be treated as an unknown language, which can be identified by a learning algorithm from a finite amount of observed behavior, provided some prior knowledge is given. As the fidelity of the environment model improves, the control design becomes more effective.
The thesis then considers reactive synthesis under partial observation (not all environment actions can be completely observed by the system) and with multiple agents. Reactive control synthesis methods are developed for systems with incomplete information that ensure the specifications are satisfied surely, or almost surely (with probability 1). For the synthesis of controllers for multiple concurrent systems with individual specifications, an approach in which each system treats the others as adversarial can be unnecessarily restrictive. This thesis presents a decision procedure for agent behaviors that is based on the solution of a concurrent, multi-agent infinite game. Depending on the context in which interaction takes place, solutions can come in the form of pure (deterministic) Nash equilibria, security strategies, or cooperative pure Nash equilibria. The analysis and methods presented can be extended to the case of decentralized control design for multiple reactive systems. The thesis concludes with a brief overview and possible future research directions.
Chapter 1
INTRODUCTION
The concept of hybrid systems provides a rich modeling framework, as it captures the interaction of discrete and continuous dynamics. The discrete dynamics can express the finite operational modes the system can be in, for example, the flight modes of an aircraft such as hovering, descending, etc. The continuous dynamics are determined by the physics governing the evolution of continuous states, such as position, velocity, and acceleration, in each mode.
Many application domains for such systems are found in safety-critical settings, such as automated highway systems [53, 74], air-traffic management systems [43, 68], robotics [1, 103], etc. For such systems, the specifications and performance requirements can be more general than maintaining stability. Examples of such specifications are liveness (something good will always eventually happen), safety (nothing bad will ever happen), and fairness (all constituent processes will evolve and none will starve). While the first can to some extent be captured by notions of convergence and invariance in a classical continuous dynamical systems framework, it is not so clear how to handle the other two.
Take the air traffic control (ATC) system as an example: an ATC system has to keep track of several aircraft simultaneously, and has to be robust to delays, conflicts, weather, etc. The control objective of such a system is to ensure safety and fairness, which concerns scheduling take-offs and landings in a timely fashion. This objective cannot be expressed in a traditional specification language using terms such as asymptotic stability or settling time. For this purpose, temporal logic [81] can be used as the specification language, not only because it can express high-level control objectives, but also because it allows us to reason about the ordering of events without introducing time explicitly.
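As a hedged illustration (these formulas are not taken from the thesis; p, bad, and run_i are placeholder atomic propositions), the three requirement classes above have canonical linear temporal logic renderings, with G read as "always" and F as "eventually":

```latex
\underbrace{\mathbf{G}\,\mathbf{F}\,p}_{\text{liveness}}
\qquad
\underbrace{\mathbf{G}\,\lnot \mathit{bad}}_{\text{safety}}
\qquad
\underbrace{\textstyle\bigwedge_{i}\mathbf{G}\,\mathbf{F}\,\mathit{run}_i}_{\text{fairness}}
```

Note that none of the three mentions clock values: each constrains only the relative ordering and recurrence of events, which is exactly the point made above.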
When the specification is given as a logical formula, which is discrete in nature, a heterogeneity arises between the hybrid dynamics of the system and its discrete specification. One needs to bring both the system and its specification into the same formal framework to allow for analysis. This is what motivates abstraction-based symbolic synthesis: it enables control design with temporal logic specifications for reactive hybrid systems that continuously interact with a dynamic environment. In the ATC system, to coordinate the aircraft effectively, a symbolic approach can be developed by defining a set of conflict resolution maneuvers [98]. At the low-level design, each maneuver is a finite sequence of flight modes, such as heading, altitude, and speed changes, for the purpose of avoiding a conflict between two aircraft. With these pre-defined maneuvers, one can synthesize a symbolic control policy that ensures the satisfaction of all specifications, without having to reason explicitly about the continuous dynamics of each aircraft. Thus, with abstraction and symbolic synthesis, we can alleviate the difficulty of control design for hybrid systems and reconcile the heterogeneity between the system and its high-level logic specifications. Furthermore, through discrete abstractions, many methods for analyzing discrete systems become applicable to hybrid systems. In this dissertation, it is shown that a branch of machine learning, grammatical inference [30], can be introduced as a system identification module at the abstraction level, which is a key ingredient in constructing adaptive controllers for hybrid systems.
This chapter presents some motivation for the research described in this disser-
tation and gives a general description of the problem on which it focuses. Following
that, an overview of the approach and a brief description of the technical contributions
of this research are provided.
1.1 Motivation
A significant part of the formal analysis and synthesis of hybrid systems is driven by the idea of lifting analysis and control problems in systems with both continuous and discrete dynamics into systems with purely discrete dynamics, for which computationally efficient methods can be applied. Discrete representations of systems accommodate formal, high-level logic specifications, which are easy for system designers to understand and sufficient for expressing many complex system requirements. Generally, these methods are implemented in a two-step, hierarchical framework: a discrete controller is synthesized with respect to the control objective of the high-level discrete system, and then implemented by the low-level dynamics of the underlying hybrid system.
This hierarchical framework relies on an abstraction procedure, which constructs a discrete, finite-state transition system from the hybrid, infinite-state system, and permits analysis of the hybrid system to be performed equivalently on the discrete system with efficient computational methods. In general, the major difference between abstraction methods lies in their definitions of a state-equivalence relation. This relation induces partitions of the continuous state space such that each block contains states that behave in a "similar" manner; different blocks are then mapped to different abstract states. Currently, many abstraction methods are difficult to scale to real-world applications because of state explosion: the number of abstract states is typically exponential in the number of variables used to define the equivalence relation. To combat this, one direction is to explore a bottom-up abstraction method: for complex dynamical systems, one can design a finite set of low-level controllers for simple objectives, and then reduce complex problems to problems of finding a concatenation sequence of existing low-level controllers.
With abstraction techniques for hybrid systems in place, logical analysis can be used for abstraction-based verification and control design with respect to high-level system specifications and control objectives. In the verification of hybrid systems, model checking and theorem proving tackle different aspects of the verification problem. Model checking provides systematic ways to explore the state space of the abstract system and to verify whether a given property holds in the system. It can facilitate the detection of errors and guide system design by generating a counterexample explaining why a given property fails to hold. Theorem proving, on the other hand, provides a constructive proof showing why the correct behavior is exhibited by the system.
Intuitively, formal verification treats the system as a program, and tests or proves whether it performs as desired. Formal control synthesis is analogous to designing a program that meets a given specification. However, considering a system in isolation is always problematic because of exogenous inputs from the environment. Our goal is to synthesize controllers such that no matter how the uncontrollable, dynamic environment behaves, the system can still accomplish the assigned task. Assuming worst-case behavior for the environment, we formulate the interaction between a system and its environment as a two-player, zero-sum game: the system player aims to satisfy the desired specification while the environment player interferes to violate it. With this game formulation, we are able to convert the control design problem into finding a winning strategy for the system player against the environment. In this respect, algorithmic game theory is employed to synthesize controllers that autonomously react to external inputs from the environment in real time.
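The winning-strategy computation alluded to above can be sketched, for the simplest case of a turn-based reachability game, by the classical attractor iteration. The code below is an illustrative sketch, not from the thesis; the state names, ownership map, and edge relation are invented for the example:

```python
def attractor(states, owner, edges, target):
    """Compute the set of states from which the 'sys' player can force
    a visit to `target` in a turn-based game.

    owner[s] is 'sys' or 'env'; edges[s] lists the successors of s.
    """
    win = set(target)
    changed = True
    while changed:
        changed = False
        for s in states:
            if s in win:
                continue
            succ = edges[s]
            # System states: one successor in win suffices (system chooses).
            # Environment states: every successor must be in win.
            if owner[s] == "sys":
                ok = any(t in win for t in succ)
            else:
                ok = bool(succ) and all(t in win for t in succ)
            if ok:
                win.add(s)
                changed = True
    return win

# Tiny example: the system can force reaching s3 from s0,
# even though the environment moves at s1.
states = ["s0", "s1", "s2", "s3"]
owner = {"s0": "sys", "s1": "env", "s2": "sys", "s3": "sys"}
edges = {"s0": ["s1"], "s1": ["s2", "s3"], "s2": ["s3"], "s3": []}
print(sorted(attractor(states, owner, edges, {"s3"})))
# ['s0', 's1', 's2', 's3']
```

A controller is then read off the fixed point: at each system state in the winning region, play any edge that stays inside it.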
This game-theoretic approach, also known as reactive synthesis, produces correct-by-construction controllers, provided the environment dynamics and the initial condition of the system satisfy some known assumptions in the form of logic formulas. However, when these assumptions mistakenly or incompletely capture the dynamics of the environment, the synthesis may lead to errors and poor system performance. One approach to tackling this problem [70] is to combine iterative motion planning with reactive synthesis. This method, while capable of discovering changes in the environment model, cannot do so for the underlying dynamics that generate the changes. As a consequence, the iterative planning approach in [70], which no longer guarantees the satisfiability of the original specification, strives to satisfy some specification close to it.
The limitations of these synthesis methods are caused by the incompleteness of the system's knowledge of its environment. With insight from model-based adaptive control in traditional control theory, and from machine learning in artificial intelligence, we are interested in the following question: is there a method to complete our knowledge of the environment through its observed behavior? The rationale is that once a correct model of the environment is obtained, the problem reduces to a typical reactive synthesis problem, solvable with various existing methods. The question itself points toward a direction for answering it: the process of acquiring knowledge from experience is what we call learning. If a system learns about its unknown environment from past observations, then it can improve its current controller autonomously, adapting to the knowledge obtained about its adversary.
So far, we have considered only the case involving a single system and its dynamic environment. For this case, it is reasonable to treat the latter as an opponent. However, when the environment is composed of multiple autonomous systems, each having its own objective and preferences over the outcomes of their interaction, this view of the environment can be quite conservative: it does not allow for the possibility of cooperation for mutual benefit. Indeed, the central problem in mechanism design [78] is to motivate autonomous interacting agents by giving them the right incentives, so that some desired behavior emerges as a stable expression of their interaction. To this end, this thesis studies a range of problems in reactive synthesis for hybrid systems: abstraction, control design with incomplete information, and analysis of multi-agent systems. We show that the proposed solutions and methods can be incorporated into a coherent framework for symbolic control of hybrid systems.
1.2 Problem Description
In reactive synthesis for hybrid systems, controllers are synthesized through
discrete abstractions, and then implemented with the low-level concrete dynamics.
Similarly, adaptive symbolic control synthesis emphasizes the incorporation of learning
into synthesis at the discrete level, through appropriate abstractions for hybrid systems.
This thesis mainly aims to solve the following problems:
1. Given a system that is capable of switching between different parameterizable control modes, each with well-defined convergent continuous dynamics, determine an optimal plan in the form of a sequence of parameterized control modes such that, by executing the plan, the system is steered to a desired final state from any given initial state.

2. In the presence of an unknown, dynamic, adversarial environment, a system aims to accomplish a task specified by a high-level logic formula. Assuming that some knowledge of the environment is given a priori, construct a controller that adapts to the dynamics of the environment and eventually converges to one that allows the system to accomplish the task, whenever such an outcome is possible.

3. When a system interacting with a dynamic environment has limited sensing capabilities, the control design can be unrealizable if we require that the task be completed with certainty. Instead, under a probabilistic measure of success, i.e., requiring the task to be completed with probability 1, can we obtain more permissive control strategies for partial-observation cases?

4. For the special class of multi-agent games that arise as models of interaction among self-interested agents with objectives expressed as high-level logic formulas, what are the decision procedures for solution concepts from traditional game theory, such as Nash equilibria, security strategies, and cooperative equilibria?
1.3 Challenges
While control synthesis at the high level, based on abstraction, is attractive from both theoretical and practical perspectives, challenges remain in the design of a scalable abstraction method and in the incorporation of learning into symbolic control synthesis. In particular, what exactly is meant by learning at this abstract level is rarely formalized, even though there is a clear conceptual link between what we call adaptive symbolic control and model-based adaptive control in traditional control theory.
The problem of scalability when performing abstraction and synthesis is pressing. A significant amount of work [27, 51] on improving the scalability of abstraction focuses on the counterexample-guided abstraction refinement (CEGAR) approach [29]. This method allows us to compute an abstract system that omits enough detail to overcome the state-explosion problem, while retaining sufficient information to ensure that the specified objective can be met. The limitation of this method is that, by construction, the abstract model generated by CEGAR is task-oriented: once the objective changes, the abstract system needs to be recomputed. So far, an abstraction method that shares the advantages of CEGAR but is not task-oriented has yet to be developed. Here, we take a first step in this direction by focusing on a specific subclass of hybrid systems. For this class, we aim to develop an abstraction method that fulfills the requirements of scalability and sufficiency while being independent of the task specifications.
The other main challenge in hierarchical control design is to ensure that the system satisfies the task specification in the presence of disturbances and exogenous inputs from a dynamic environment. Current limitations of reactive synthesis for symbolic control are primarily caused by incomplete knowledge about this environment. Because of this, introducing learning into symbolic control seems almost natural. However, among the different learning paradigms in the machine learning literature, it is not clear which one interfaces well with symbolic synthesis. The choice of a learning paradigm depends not only on the information accessible to the system during its interaction with its environment, but also on what we decide we want to learn. In cases where the system interferes with the behavior of its environment, it is more meaningful to learn the interaction rather than the environment itself. Then, to be able to adapt the controller in real time, we need the learning algorithm to update the environment model efficiently. Determining appropriate learning paradigms is thus a crucial step. Eventually, our goal is to establish a complete framework that combines learning with symbolic synthesis seamlessly.
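To make the notion of a learning paradigm concrete: Appendix B concerns learning strictly k-local languages, and for k = 2 the learner amounts to collecting the bigrams attested in positive data. The following sketch is illustrative only (the function names and the boundary marker "#" are my own choices, not the thesis's notation):

```python
def learn_sl2(sample):
    """Infer a strictly 2-local grammar (a set of permitted bigrams)
    from positive example strings; '#' marks word boundaries."""
    grammar = set()
    for word in sample:
        padded = "#" + word + "#"
        grammar.update(zip(padded, padded[1:]))
    return grammar

def accepts(grammar, word):
    """A word belongs to the language iff every one of its
    boundary-padded bigrams is permitted by the grammar."""
    padded = "#" + word + "#"
    return all(bg in grammar for bg in zip(padded, padded[1:]))

g = learn_sl2(["ab", "abb"])
print(accepts(g, "abbb"))   # True: uses only attested bigrams
print(accepts(g, "ba"))     # False: bigram ('#', 'b') never observed
```

The learner converges in the limit from positive data alone, which is the kind of guarantee sought when interfacing learning with symbolic synthesis.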
1.4 Approach Overview
Let us first describe the class of hybrid systems we are focusing on. A dy-
namical system in this class is capable of switching between different continuous-state
controllers, each of which yields some convergent closed-loop continuous dynamics. In
particular, these low-level controllers are parameterizable, and the parameters allow
the designer to ensure that at steady state, the state of the system is within a desired
set.
When we say that we plan the behavior of such a system, we mean that we
determine a temporal sequence of parameterized low-level controllers. Due to the
underlying convergent dynamics of low-level controller, restrictions have to be enforced
on the concatenation of two control modes. We approach an optimal planning problem
for this class of systems in a bottom-up way: first, we propose an abstraction method that generates a discrete, finite-state abstract model which exposes and offers access to these (continuous) parameters. This abstract model accepts data words, which are sequences of tuples consisting of a label identifying the control mode to be activated, and a set of parameters for this mode. By proving that the concrete hybrid system simulates the abstract one, it is guaranteed that a data word accepted by the abstract model can always be translated into an implementable plan in the original system.
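As a concrete illustration, a data word can be represented as a plain sequence of (label, parameters) atoms. The sketch below is a minimal Python rendering of this structure; the mode names ("goto", "grasp") are hypothetical and chosen purely for illustration.

```python
from typing import NamedTuple, Tuple, List

class Atom(NamedTuple):
    """One atom of a data word: a control-mode label plus its parameters."""
    label: str
    params: Tuple[float, ...]

def string_projection(word: List[Atom]) -> List[str]:
    """The sequence of mode labels carried by the data word."""
    return [a.label for a in word]

def data_projection(word: List[Atom]) -> List[Tuple[float, ...]]:
    """The sequence of parameter values carried by the data word."""
    return [a.params for a in word]

# A hypothetical plan: activate mode "goto" with a target point,
# then mode "grasp" with a gripper aperture.
plan = [Atom("goto", (1.0, 2.0)), Atom("grasp", (0.05,))]
```

Translating such a word into an executable plan then amounts to activating each labeled mode with its attached parameters, in order.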
After translating the original specification into a specification on the abstract
model, we develop an optimal planning method at the abstract level. This method
includes two steps: the first step is to compute a finite sequence of control modes on
a finite-state transition system induced from the abstract model. In the second step,
a variant of dynamic programming is developed to determine the parameterization of
the control modes found in this sequence. The plan may eventually be sub-optimal due
to the over-approximation of continuous dynamics during the abstraction procedure.
With such an abstraction method at hand, we consider the control synthesis
problem in the presence of unknown, dynamic environments purely at the abstraction
level. We assume both the system and the environment admit discrete abstractions
in the form of some finite-state transition systems. With these abstract models, we
describe the interaction between the system and its environment as a two-player, turn-
based game. Assuming some prior knowledge is given to the system about the class of
models that its adversary may admit, a particular machine learning approach known
as grammatical inference (GI) is introduced and utilized as a system identification
method at the abstraction level. The adaptive control design proposed here brings
together exploration and exploitation: when executing the current controller, the sys-
tem explores the environment and collects the information in the form of sensory data,
which then serve as inputs to the GI module. With this information, the system re-
fines the abstract model of its environment, and subsequently the game it is involved in, through an algorithm that learns the discrete dynamics of the environment (or
their interaction) under some learning criterion. The controller is then adapted to this
new model generated by the learning algorithm. The convergence and correctness of this adaptive controller are guaranteed by the convergence of learning. It represents a
method for reactive control synthesis which is, in the limit, correct-by-construction.
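Schematically, the exploration/exploitation loop described above can be written as follows. All callbacks are hypothetical placeholders for the components developed later in the dissertation (reactive synthesis, the GI module, controller execution), not actual implementations.

```python
def adaptive_control(initial_model, synthesize, learn, execute, converged):
    """Schematic exploration/exploitation loop of the adaptive symbolic
    controller. Hypothetical callback signatures:
      synthesize(model)   -> controller winning against `model`
      execute(controller) -> finite batch of observed environment moves
      learn(model, data)  -> refined model consistent with all data so far
      converged(old, new) -> True once learning has stabilized
    """
    model = initial_model
    while True:
        controller = synthesize(model)        # exploit current knowledge
        observations = execute(controller)    # explore: collect sensory data
        refined = learn(model, observations)  # grammatical-inference step
        if converged(model, refined):
            return controller                 # correct in the limit
        model = refined
```

Because learning is decoupled from synthesis, any identification algorithm with a convergence guarantee can be plugged into the `learn` slot.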
In the case where the system has only partial observations of what goes on
around it, its interaction with the environment takes the form of a two-player, turn-
based game with incomplete information. We approach the synthesis problem using a
knowledge-based subset construction and distinguish two different success criteria for
control design: one is the notion of a sure-winning control strategy, which ensures that the goal is satisfied with certainty, and the other is that of an almost-sure control strategy, which
ensures the goal is satisfied with probability 1. We show that when the information is
complete, there is no difference between these two types of controllers. However, when
one allows probabilistic measures of success, we can find solutions that almost-surely
succeed in cases where sure-winning control strategies do not exist.
We extend formal synthesis methods to multi-agent systems with control specifications expressed in some formal logic. We model the interaction among multiple systems as a
multi-agent, concurrent, infinite game. A play in such a game is a possibly infinite
sequence of interleaved states and concurrent actions of all systems involved. Solution concepts from algorithmic game theory are employed to compute any emergent stable interaction between these rational systems, which comes in the form of a (Nash) equilibrium. We propose a decision procedure for equilibria in this class of games by
converting the multi-agent, concurrent infinite game into a two-player, turn-based in-
finite game. The winning strategy of one player in the latter can be translated into an
equilibrium in the former. Then we examine coalition games, which often arise when
individual systems are allowed to team-up for their mutual benefit.
1.5 Technical Objectives and Contributions
This dissertation integrates abstraction, optimal planning, symbolic reactive
synthesis, and machine learning to build a coherent framework for abstraction-based
adaptive control design for hybrid systems in the presence of unknown, dynamic en-
vironments. The major technical objectives and contributions of this dissertation are
the following.
1. A novel abstraction method for a class of hybrid systems. The method is scal-
able, technically sound and independent of the choice of the control objectives.
We show that with this abstraction method, an optimal planning problem can
be solved using the abstract model, and the solution is guaranteed to be imple-
mentable in the original hybrid system.
2. An adaptive symbolic control framework that integrates learning with reactive
synthesis. With correct elementary prior knowledge about the class of the un-
known, dynamic and adversarial environment the system may be interacting
with, the framework allows an automatic adaptation of the control design with
respect to any new knowledge acquired about the environment. We ensure that
the adaptive controller eventually meets the given control objective, whenever
possible.
3. A reactive synthesis method for systems with partial observation, with two different measures of success (sure and almost-sure) for control design. We show that the control design can be made more permissive by requiring only that the specification be satisfied almost surely (with probability 1).
4. Decision procedures for stable equilibrium behaviors during the interactions of
multiple systems, where each has independent control objectives. Concepts from
algorithmic game theory are brought to bear and adapted for the stability analysis
of this class of multi-agent systems. The results can be applied to the design of
decentralized multi-agent systems with control objectives expressed in some class of formal logic languages.
1.6 Thesis Outline
This dissertation is organized as follows. In Chapter 2 we provide a literature
review that offers background behind most of the theoretical results of this disserta-
tion, and we motivate the approach followed here. Chapter 3 presents a method for
abstraction and optimal planning applicable to a class of hybrid systems. We introduce
a new discrete model — the register automaton — as an abstraction model and provide
a bottom-up, hierarchical, optimal planning method with respect to given reachability
specifications. Adaptive symbolic control design methods for systems operating in an
unknown, dynamic environment are presented in Chapter 4, for the case of complete information (every state variable and environment action is observable). Experimental results that show the convergence of the adaptive controller are presented. For systems
with sensing uncertainties, Chapter 5 treats the control synthesis problem with respect
to different measures of success. We show that for a temporal logic specification such
as liveness, the controller must have finite memory and may need to be randomized.
Chapter 6 focuses on game-theoretic modeling and analysis for multi-agent concurrent
systems, in which each agent has its own task specification and preference over the
outcomes. We present decision procedures for equilibria and security strategies, and
discuss the possibility of utilizing the developed methods in decentralized control de-
sign for multi-agent systems with respect to temporal logic specifications. Chapter 7
concludes the dissertation and focuses on possible directions for future work.
Chapter 2
LITERATURE REVIEW
2.1 Hybrid Systems
As embedded computing becomes more pervasive, many engineering systems
are designed to incorporate both discrete and continuous dynamics, giving rise to
dynamical systems that are known as hybrid.
A hybrid automaton [50] is a mathematical model for a hybrid system, which
facilitates analysis and control design by combining concepts from formal language the-
ory and dynamical systems into a single theoretical framework. A hybrid automaton
includes a set of discrete system states, or modes, and transition relations between
them. In each discrete mode, the continuous state of the hybrid system is governed by
flow conditions specified by differential equations. Transitions in a hybrid automaton
can be of two types: one is the evolution that the continuous states undergo, determined by the dynamics of a given control mode; the other is the transition between discrete modes, triggered by a logical condition called the guard of that transition.
Hybrid automata can capture many characteristics of systems in practice. There are
even stochastic hybrid automata [54], which incorporate randomness into both the
discrete transitions and the continuous dynamics of discrete modes, and are capable
of expressing some of the inherent uncertainty in the system and its environment in
many real world applications.
Analysis and control of hybrid automata have been utilized in several safety-critical applications [69]. A safety property may require the state trajectories to reach a specified set while avoiding unsafe regions. In robotic systems, which constitute a nontrivial class of hybrid systems, liveness properties are often important. For instance,
in a surveillance task, a robot or a team of robots may have to ensure that some critical
regions are visited infinitely often. These types of system specifications, in general,
cannot be translated into traditional control objectives. For this purpose, temporal
logic [71] is employed to specify these high-level requirements [2, 26]. The downside
is that the introduction of such high-level requirements renders analysis and control
design problems for hybrid systems even more difficult.
To tackle these challenges, a multi-disciplinary approach to hybrid systems has
emerged, that borrows methods from computer science, control engineering and applied
mathematics. It fosters a large and growing body of work on hierarchical control design
and formal methods for hybrid systems.
2.2 Abstraction
One critical step in formal methods is the procedure of abstraction. Abstraction
can map a hybrid system into a purely discrete model of computation, which approximates the dynamics of the original system. The purpose of abstraction is to lift the
verification and control synthesis problems from the original continuous/discrete space
to a discrete space, where some of these problems can be solved at smaller analytical and computational cost.
To solve a problem in the discrete domain and transfer the solution back to the hybrid one, a formal relation between the abstract system and its concrete hybrid counterpart has to be established. To this end, a
variety of abstraction methods have been developed, based on simulation, bisimulation
relations [17], or approximate bisimulation relations [42, 92]. Depending on how a set
of continuous states is grouped into a single abstract state, an abstraction method can
be predicate-based or discretization-based, or a mix of these two. In predicate-based
methods [3], a set of predicates is selected to partition the state space, and the size
of abstract systems is exponential in the number of predicates. In discretization-based
methods such as [60, 61, 82, 94, 97], the size of the abstract model depends on the
chosen discretization resolution.
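For intuition, a predicate-based abstraction can be sketched as mapping each continuous state to the tuple of truth values of the chosen predicates; with n predicates there are at most 2^n abstract states. The predicates below are hypothetical examples over a 2-D state space.

```python
from itertools import product

# Hypothetical predicates partitioning a 2-D continuous state space.
predicates = [
    lambda x: x[0] >= 0.0,                   # right half-plane
    lambda x: x[1] >= 0.0,                   # upper half-plane
    lambda x: x[0]**2 + x[1]**2 <= 1.0,      # inside the unit disc
]

def abstract_state(x):
    """Map a continuous state to the tuple of predicate truth values."""
    return tuple(p(x) for p in predicates)

# All candidate abstract states; some tuples may correspond to empty
# regions, but the count is exponential in the number of predicates.
abstract_space = list(product([False, True], repeat=len(predicates)))
```

Every continuous state in the same cell of this partition is collapsed onto the same abstract state, which is exactly where the exponential blow-up in the number of predicates comes from.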
For abstractions to be useful, we need to omit some details of the concrete dynamics of the system. On the other hand, we also need to keep enough information so that
the important part of the behavior of the hybrid system can be correctly captured in its
abstract model. To strike a balance, counterexample-guided abstraction refinement (CEGAR), originally developed for program verification, has been introduced
for predicate abstraction of hybrid systems [4, 27]. The method starts with a coarse
abstract model of the hybrid system and refines it iteratively until a specified safety
property is verified to be satisfiable in the abstract model. Hence, the abstract system
retains just enough information to exhibit the satisfiability of the safety property. In
this way, CEGAR-based methods mitigate the state-explosion problem to some extent. However, this method has two limitations: first, the abstract model obtained for one specification may need to be further refined in order to exhibit the satisfiability of another; second, the approach has so far only been applied to safety properties, and its extension to abstractions with respect to liveness is not clear.
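The CEGAR loop itself has a simple schematic shape, sketched below with hypothetical callbacks standing in for model checking, counterexample validation against the concrete system, and refinement.

```python
def cegar(initial_abstraction, check, is_spurious, refine):
    """Counterexample-guided abstraction refinement, schematically.
    Hypothetical callback signatures:
      check(A)         -> None if the safety property holds on A,
                          else an abstract counterexample trace
      is_spurious(cex) -> True if the trace has no concrete counterpart
      refine(A, cex)   -> finer abstraction excluding the spurious trace
    """
    A = initial_abstraction
    while True:
        cex = check(A)
        if cex is None:
            return ("safe", A)        # property verified on the abstraction
        if not is_spurious(cex):
            return ("unsafe", cex)    # genuine counterexample found
        A = refine(A, cex)            # eliminate the spurious behavior
```

The loop terminates with an abstraction that retains just enough information to settle the given safety property, which is why a different property may force further refinement.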
Once the appropriate abstraction methods are obtained, verification and syn-
thesis for a hybrid system can proceed using its abstract model, with tools and methods
from model checking [28], theorem proving and algorithmic game theory [45].
2.3 Formal Synthesis
Due to the complexity and increased scale of most engineering systems, re-
searchers are interested in applying computational, algorithmic approaches to analysis
and control design of hybrid systems. For this purpose, model checking and reactive
synthesis have been introduced.
Formal synthesis of control protocols can be performed both in a top-down and
in a bottom-up fashion, or even some variant of both. Top-down approaches [34] take
the specification as an input and directly construct a hybrid automaton that satisfies
the specification. Bottom-up approaches [59], alternatively, start with the abstraction
of a given system, and attempt to determine how to satisfy a given specification in
the system at hand. For stochastic systems, which can be abstracted or modeled as
Markov decision processes (mdps), optimal control design under temporal logic constraints has been developed in a bottom-up fashion [32]. An integration of both top-down
and bottom-up approaches is introduced [62] for solving temporal logic robot motion
planning problems. In this mixed approach, the control protocol is designed at the
abstract level, and then each discrete output indicated by the discrete controller is
refined into a sequence of implementable low-level controllers in the hybrid system.
The environment dynamics can be captured by introducing uncontrollable or
probabilistic transitions into the system’s dynamics. Alternatively, reactive synthesis captures the interaction between a system and its environment as a two-player, zero-
sum game, in which the environment is assumed to satisfy a logical formula. Reactive
synthesis provides us with both a decision procedure and a control design algorithm, which,
given a system, a desired property and an assumption on the dynamics of its envi-
ronment, will verify whether there exists a realizable controller that ensures that the
property holds in the system. The application of reactive synthesis in hybrid systems
is frequently encountered in robot motion planning. General Reactivity (1) (GR(1)) formulas [80], a fragment of Linear Temporal Logic (ltl), have
been used in [63] to express the dynamics of the system, the environment behavior,
and the task specification. The satisfiability of a GR(1) formula, once verified, leads
to an automaton that implements a discrete control protocol. This discrete solution is
then refined on the concrete hybrid system and ensures the completion of the task. To
alleviate the computational complexity of ltl synthesis, one can employ receding hori-
zon control [106]. Meanwhile, to improve the efficiency of synthesis, a compositional
method can be brought to bear [105]: given a set of specifications capturing various safety and liveness constraints in a system, controllers are synthesized separately for the individual specifications and then combined in their parallel executions. Synthesis methods have also been developed for infeasible tasks [99]: an algorithm synthesizes a plan that is allowed to violate only the lowest-priority rules, and only for the shortest amount of time.
Correctness of the controller in reactive synthesis can be guaranteed if the assumption on the environment, i.e., the environment formula, is satisfied during the interaction between system and environment. For partially unknown environments, a
multi-layer synergistic framework that implements an iterative planning strategy has
been proposed [70]. When the original specification becomes unrealizable due to new,
just-discovered environment constraints, a new plan is synthesized so as to satisfy the
specification as closely as possible, according to a pre-defined metric of proximity to satisfaction. Although the discovered constraints are due to the environment dynamics, the dynamics itself is not identified. In order to identify
the dynamics of the environment, learning should be introduced. So far, the use of
machine learning in temporal logic control in the presence of unknown environments
has been limited to some application of reinforcement learning [95]. In that work,
a gradient-based approach to relational reinforcement-learning of control policies is
developed, for temporal logic planning in a stochastic environment.
2.4 Perspectives
When the environment is not stochastic, but rather dynamic and adversarial,
we need to develop a means of knowledge acquisition, so as to construct robust, adaptive
systems. Yet, except for some limited application of reinforcement learning, existing
work has not explored the full potential of different machine learning paradigms.
A key observation is that once abstracted, the behavior of both system and
environment can be viewed as a formal object (automaton, language, grammar, etc.),
and the identification of the environment model, and subsequently of all possible interactions with it, becomes essentially a process of inference: generalizing a formal object that describes this model from a finite amount of observed behavior.
With this insight, we introduce formal learning as the inference method of choice. In
formal learning theory we find several criteria and algorithms for learning formal ob-
jects such as languages from data presentations. Unlike reinforcement learning, formal
learning is decoupled from control design: it is essentially system identification. Once
the dynamics of the system is identified, any appropriate control design method of
choice can implement a control strategy based on the identified model. This structure
is reminiscent of the synergy between adaptation and control in continuous dynamical
systems. Essentially, learning from observations enables us to reduce the problem of
control design in an unknown environment to a problem well-examined in reactive
synthesis.
To this end, our goal is to develop a coherent framework that integrates formal
learning and reactive synthesis. Further, we aim to extend the solution we obtained
under the assumption of perfect observation to the case of partial observation, and from
single-agent and environment interaction to multi-agent interactions, in the context of
formal analysis and control design.
Chapter 3
BOTTOM-UP SYMBOLIC PLANNING
3.1 Overview
Many engineering systems can switch between different operating modes, and often the user cannot implement new low-level control behaviors or change pre-existing ones in these systems. For example, industrial manipulators are equipped with built-in PID joint controllers, whose gains the user can set but whose structure the user cannot modify. Manufacturers do not allow access to built-in controllers, in order to avoid unsafe operation as well as liability issues.
For planning and control design purposes, we model these systems as a class of
hybrid dynamical systems. A hybrid system in this class is capable of switching between
given operating modes, each with well-defined pre-conditions and post-conditions. Pre-
conditions determine when a certain mode can be activated, and post-conditions de-
scribe the guaranteed steady-state behavior. The control modes are parameterizable in
the sense that the set of states satisfying the pre- or post-conditions of a control mode
is determined by parameters. Besides some industrial robotic systems, many other dynamical systems fall into this class, such as those found in applications of legged locomotion [89] and devices that interact with human subjects [11, 86].
Given the special structure in this type of systems, planning problems can be
solved by sequencing and parameterizing the system’s existing low-level controllers in-
stead of designing a new controller from scratch. Also, from the system design perspective, it is meaningful to build simple controllers and reuse them for complex control
objectives. However, for this class of hybrid systems there is no decision procedure yet
on how to generate a feasible, optimal sequence of low-level controllers with respect to
a given objective. For this purpose, we consider a hierarchical framework, in which a
plan can be synthesized using the abstract model of the hybrid system and then be
implemented with the existing control modes in the concrete system.
Existing models for discrete abstraction, such as finite-state transition systems,
are incapable of manipulating continuous information, which may come in the form
of parameters. These parameters are exactly the type of information to keep track of
since they determine the sequencing of transitions in the discrete abstraction.
Hence, we adopt and adapt a new computational model, called register automa-
ton [57, 77], as the model for the abstract systems. The size of the abstract model
obtained through our method is not dependent on the discretization resolution on the
continuous state space. Instead, continuous states are grouped together according to
the convergence properties of each individual low-level controller.
The proposed abstraction links the concrete system and its abstract model
through a weak simulation relation, which ensures that the plan generated using the abstract model is feasible in the original hybrid system. We propose a planning method
with the abstract model that utilizes graph search algorithms and a variant of dynamic programming. In principle, suboptimal solutions to a planning problem can be
obtained using its abstraction. One can even obtain optimal ones if the underlying
continuous dynamics and cost functions are simple enough [38].
In Section 3.2, we introduce the models used for the abstractions and their se-
mantics. Section 3.3 introduces a model for the class of hybrid systems. Section 3.4
presents the new abstraction method and establishes a (weak) simulation between the
hybrid system and its abstract model. Section 3.5 outlines a new planning algorithm
using this abstract model. Section 3.6 presents a numerical case study, with a mobile
manipulator tasked with grasping and delivering an object through a sequence of ma-
neuvers implemented through given control laws. Section 3.7 concludes this chapter
with a summary of the results and thoughts for future extensions.
3.2 Automata and Transition Systems
3.2.1 Automata and their Semantics
In this section we review some background material about automata and formal
language theory [52].
Let Σ denote a fixed, finite alphabet, and let Σ∗ denote the set of finite strings over this alphabet. The empty string is denoted λ, and for a string w, its length is denoted |w|. A string v is a prefix (suffix) of another string w if there exists a string x ∈ Σ∗ such that w = vx (respectively, w = xv).
A semiautomaton is a tuple A = 〈Q, Σ, T〉, where Q is a finite set of states, Σ is a finite alphabet, and T : Q × Σ → Q is the transition function. The mapping from (q1, σ) to q2 via T is also written q1 −σ→ q2, and can be extended recursively in the usual way, i.e., given u = w1w2 with w1, w2 ∈ Σ∗, we have T(q1, u) = T(T(q1, w1), w2). In this context, semiautomata are assumed deterministic in transitions. If T(q, σ) is defined for a given (q, σ) ∈ Q × Σ, we write T(q, σ)↓.

We think of a deterministic finite-state automaton (dfa) as a quintuple A = 〈Q, Σ, T, I, F〉, where 〈Q, Σ, T〉 is a semiautomaton deterministic in transitions, I is the
〈Q,Σ, T, I, F 〉 where 〈Q,Σ, T 〉 is a semiautomaton deterministic in transitions, I is the
initial state, and F is the set of final states. A word w is accepted by A if T (I, w) ∈ F .
The language L(A) is the set of words accepted by A.
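A minimal executable rendering of these definitions (the transition function as a partial map, acceptance via the recursively extended T) might look as follows; the "even number of a's" language is a hypothetical example, not one used in the text.

```python
class DFA:
    """Minimal deterministic finite-state automaton, following the
    quintuple <Q, Sigma, T, I, F> described in the text."""

    def __init__(self, states, alphabet, delta, initial, finals):
        self.states, self.alphabet = states, alphabet
        self.delta, self.initial, self.finals = delta, initial, finals

    def run(self, q, word):
        """Extend T recursively: T(q, w1 w2) = T(T(q, w1), w2)."""
        for symbol in word:
            key = (q, symbol)
            if key not in self.delta:   # T(q, sigma) undefined
                return None
            q = self.delta[key]
        return q

    def accepts(self, word):
        """A word w is accepted iff T(I, w) lands in F."""
        return self.run(self.initial, word) in self.finals

# Hypothetical example: words over {a, b} with an even number of a's.
even_a = DFA({"e", "o"}, {"a", "b"},
             {("e", "a"): "o", ("o", "a"): "e",
              ("e", "b"): "e", ("o", "b"): "o"},
             "e", {"e"})
```

Dropping `initial` and `finals` from this class recovers the semiautomaton of the previous paragraph.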
3.2.2 Transition Systems
Definition 1. [28] A labeled transition system is a tuple TS = 〈Q,Σ, T 〉 with com-
ponents
Q a set of states;
Σ a set of labels;
T ⊆ Q× Σ×Q a transition relation.
The transition (q1, σ, q2) ∈ T is commonly denoted q1 −σ→ q2.
A transition system differs from a semiautomaton because in a transition system
the set of states and the set of transitions may not be finite, or even countable.
3.2.3 Register Automata and their Semantics
Given a finite alphabet Σ and a subset D ⊆ Rk, pairs of the form wi = (σi, di) ∈ Σ × D are called atoms. Concatenations of atoms form finite sequences w = w1 · · · wn over Σ × D, called data words. Let dom(w) be the index set {1, . . . , |w|} of the positions
of the atoms wi = (σi, di) in w. For i ∈ dom(w), the data projection valw(i) = di
gives the data value associated with the symbol σi. Similarly, the string projection
strw(i) = σi gives the symbol associated with a data value in atom wi. The following
computational machine 1 operates on a data word w ∈ (Σ×D)∗.
Definition 2 (Register Automaton cf. [36, 57]). A nondeterministic two-way register
automaton is a tuple R = 〈Q, q0, F,Σ, D, k, τ,∆〉, in which
Q a finite set of states;
q0 ∈ Q the initial state;
F ⊆ Q the set of final states;
Σ a finite alphabet;
D a set of data values;
k ∈ N the number of registers;
τ : {1, . . . , k} → D ∪ {∅} the register assignment;a
∆ a finite set of read,b or writec transitions.
1 In [57] the set of logical tests that the machine can perform does not explicitly appearin the definition, because they are assumed to be only equality tests. In [36], however,the definition of register automata explicitly includes a set of register tests in the formof logical propositions.
a When τ(i) = ∅, this means that register i is empty. The initial register assignment
is denoted τ0. Given (σ, d) ∈ Σ×D, a register can perform a test in the form of a
first-order logical formula ϕ, constructed using the grammar ϕ ::= d ≤ τ(i) | d < τ(i) | ¬ϕ | ϕ ∧ ϕ. The set of all such formulae is denoted Test(τ).

b Read transitions are of the form (i, q, ϕr) −σ→ (q′, δ), where q and q′ belong to Q, i ranges over {1, . . . , k}, σ over Σ, ϕr over Test(τ), and δ ∈ {right, stay, left}.

c Write transitions are of the form (q, ϕw) −σ→ (q′, i, δ), where ϕw ∈ Test(τ), q and q′ are in Q, and all other elements range over the same sets as in read transitions.
Given a data word w, a configuration γ of R is a tuple [j, q, τ ], where j is a position
in the input data word, q is a state, and τ the current register content. Configurations
γ = [1, q0, τ0] and γ = [j, q, τ ] with q ∈ F , are initial and final, respectively. Given
γ = [j, q, τ ] and input (σj, dj), the transition (i, p, ϕr) −σj→ (p′, δ) applies to γ if, and only if, p = q and ϕr is true, while (p, ϕw) −σj→ (p′, i, δ) applies to γ if, and only if, p = q and
ϕw is true.
The semantics of this machine is as follows. At configuration [j, q, τ ], the ma-
chine is in state q, the input read head is at position j in the data word, and the
contents of the registers are expressed by a vector τ . Upon reading wj = (σj, dj), if
ϕr is true and (i, q, ϕr) −σj→ (q′, δ) ∈ ∆, then R enters state q′ and the read head moves in the direction of δ, i.e., j′ = j + 1, j′ = j, or j′ = j − 1 for δ ∈ {right, stay, left}, respectively. The configuration is now [j′, q′, τ ]. If ϕw is true and (q, ϕw) −σj→ (q′, i, δ) ∈ ∆, then R enters
state q′, dj is copied to register i, and the read head moves in the direction δ (in this
order). The configuration is now [j′, q′, τ ′], where the updated register assignment τ ′
is such that for κ = 1, . . . , i − 1, i + 1, . . . , k, it is τ ′(κ) = τ(κ) and τ ′(i) = dj. The
automaton is deterministic if at each configuration for a given data atom there is at
most one transition that applies. If there are no left-transitions, the automaton is called one-way. In what follows, a simple example of a register automaton is provided to illustrate the definition.
Example 1. Consider a language over Σ × D with the following property: the data value in an atom that immediately follows an atom containing the symbol a has to be the same as the data value in the atom with the symbol a. This language is recognized by a one-way, 2-register automaton R2 = 〈Q, q0, F, Σ, 2, τ, ∆〉 = 〈{q0, q1, q2}, q0, {q0, q1}, {a, b, c}, 2, τ : {1, 2} → R ∪ {∅}, ∆〉, where τ0(1) = τ0(2) = ∅, ϕr ⇔ d = τ(i), ϕw ⇔ d ≠ τ(i), and ∆ consists of:

(q0, ϕw) −b,c→ (q0, 1, right), (q0, ϕw) −a→ (q1, 2, right), (q1, ϕw) −a,b,c→ (q2, 2, right),
(2, q1, ϕr) −b,c→ (q0, right), (2, q1, ϕr) −a→ (q1, right), (2, q2, ϕr) −a,b,c→ (q2, right),
(1, q2, ϕr) −a,b,c→ (q2, right), (q2, ϕw) −a,b,c→ (q2, 2, right).
[Figure 3.1: An example of a 2-register automaton (state diagram over q0, q1, q2 with edges labeled a, b, c).]
Intuitively, on receiving an atom with symbol a, the machine stores the data value in τ(2) and enters q1, which indicates “I have just seen an a.” If the data value of the next atom does not equal τ(2), the machine enters q2 and stays there forever; otherwise, depending on the symbol in the atom, the machine returns to q0 (if b or c) or stays in q1 (if a). The values associated with b or c are stored in τ(1).
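The behavior traced in Example 1 can be simulated directly. The sketch below is a hand-coded, one-way simulation of R2 rather than a general register-automaton interpreter; it accepts exactly when the run ends in F = {q0, q1}.

```python
def run_R2(word):
    """One-way simulation of the 2-register automaton of Example 1.
    `word` is a list of (symbol, value) atoms over {'a','b','c'} x R.
    Accepts iff every atom immediately following an 'a' carries the
    same data value as that 'a'."""
    state, tau = "q0", {1: None, 2: None}      # registers start empty
    for symbol, value in word:
        if state == "q2":                      # sink state: reject
            return False
        if state == "q1" and value != tau[2]:  # wrong value after an 'a'
            state = "q2"
            continue
        if symbol == "a":
            tau[2], state = value, "q1"        # remember the 'a' value
        else:
            tau[1], state = value, "q0"        # store b/c values in tau(1)
    return state in {"q0", "q1"}               # F = {q0, q1}
```

Running it on a word that violates the constraint drives the simulation into the sink state q2, from which no final state is reachable.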
3.3 Hybrid Agents
In this section we present the mathematical model for a class of hybrid systems,
referred to as a hybrid agent. Then we state the planning problem.
3.3.1 Mathematical Model
Our special class of hybrid systems is a hybrid agent :
Definition 3 (Hybrid Agent). The hybrid agent is a tuple
H = 〈Z,Σ, ι,P , πi,AP , f,Pre,Post, s,∆H〉
where the components are defined as follows.
Z = X × L a set of continuous and Boolean states;a
Σ a finite set of control modes;b
ι : Σ → {1, . . . , k} indices for the elements of Σ;
P ⊆ Rm the set of control parameter vectors;
πi : Rm → Rmi a set of canonical projections;c
AP a set of atomic propositions over Z × P; d
fσ : X × L× P → TX a finite set of parameterized vector fields;e
Pre : Σ → C the pre-condition of σ ∈ Σ, where C is defined next;f
Post: Σ→ C the post-condition of σ ∈ Σ;g
s : Z × P → 2P the parameter reset map;h
∆H : Z × P × Σ → Z × P × Σ the transition map.i
a Here, X ⊂ Rn is a compact set, and L ⊆ {0, 1}r, with n, r ∈ N. A state z ∈ Z is
called a composite state.
b The symbols in Σ label the different closed-loop continuous dynamics.
c For i = 1, . . . , k, we write p = (π1(p)ᵀ, . . . , πk(p)ᵀ)ᵀ.
d AP is a set of atomic propositions, denoted α. A literal β is defined to be either
α or ¬α, for some α ∈ AP. Set C is a set of logical sentences, each of which is a
conjunction of literals, i.e., C = {c = β1 ∧ β2 ∧ . . . ∧ βn | (∃α ∈ AP)[βi = α ∨ βi = ¬α]}, and for any c ∈ C, a proposition in AP appears at most once [84].
e For each σ ∈ Σ, fσ is parametrized by p ∈ P and ` ∈ L. The set X is positively
invariant [58] under fσ. Due to the compactness and invariance of X , each fσ has
a compact, attractive limit set parametrized by p ∈ P, denoted L+(p, σ) [58].
f Pre(σ) maps mode σ to a logical sentence over Z × P that needs to be satisfied
whenever the machine switches to mode σ from any other mode.
g Post(σ) maps mode σ to a logical sentence over Z × P that is satisfied when the
trajectories of fσ reach an ε-neighborhood of its limit set.
h The reset map assigns (z, p) ∈ Z × P to a subset of P which contains parameter
values p′ for which there is a mode σ, with pre-condition Pre(σ) satisfied by (z, p′).
i The transition map sends (z, p, σ) to (z, p′, σ′) if (z, p) satisfies Post(σ) and (z, p′)
satisfies Pre(σ′) with p′ ∈ s(z, p).
The configuration of H is a tuple [z, p, σ]. A transition from σi to σi+1 (if any)
is forced and can occur once the trajectory of fσi (z, p) hits an ε-neighborhood of its
limit set.2
This model describes a continuous dynamical system that switches between
different control laws based on some discrete logic. The discrete logic is a formal
system consisting of the atomic propositions in AP together with the logical connec-
tives ¬ and ∧. The semantics of the set of logical sentences C generated, expresses
the convergence guarantees available for each component vector field fσ in the form
Pre(σ) =⇒ Post(σ). (Formally, it is Pre(σ) =⇒ ♦Post(σ), where ♦ is the temporal
logic symbol for eventually; however, time here is abstracted away.) The switching
conditions, on the other hand, depend not only on the continuous variables, but also
on the discrete control modes: a transition may, or may not be triggered, depending
on which mode the hybrid system is in. Control over H is exercised by selecting a
particular sequence of parametrized control modes. Resetting in the parameters of
the system, activates transitions to specific modes, which in turn steer the continuous
variables toward predetermined limit sets.
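The switching behavior just described admits a compact computational sketch. The following Python fragment is illustrative only, under toy names of our own (Mode, step, and the reset callable are not from the text): the continuous flow is abstracted into a function that returns the composite state reached near the mode's limit set, and an input atom is accepted only if the new parameter equals the current one or is offered by the reset map, and the mode's pre-condition holds.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical toy encoding (names are ours, not from the text): a control
# mode carries its pre-condition, post-condition, and an abstracted flow that
# returns the composite state reached near the mode's limit set.
@dataclass
class Mode:
    pre: Callable    # Pre(sigma) evaluated on (z, p) -> bool
    post: Callable   # Post(sigma) evaluated on (z, p) -> bool
    flow: Callable   # (z, p) -> z', the forced evolution under sigma[p]

def step(config, atom, modes, reset):
    """Apply one data atom (sigma, p') to a configuration [z, p, sigma]."""
    z, p, _ = config
    name, p_new = atom
    mode = modes[name]
    # Admissibility: keep the current parameter, or draw the new one from the
    # reset map s(z, p); in either case Pre(sigma) must hold at (z, p_new).
    if p_new != p and p_new not in reset(z, p):
        raise ValueError("parameter not offered by the reset map")
    if not mode.pre(z, p_new):
        raise ValueError("pre-condition violated")
    z_next = mode.flow(z, p_new)     # transition forced at the limit set
    assert mode.post(z_next, p_new)  # Post(sigma) holds on arrival
    return (z_next, p_new, name)
```

A one-dimensional mode that steers the state toward the parameter value is already enough to exercise this skeleton.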
Compared to the definition for a hybrid system given in [69], H is special because
it does not involve jumps in the continuous states, its discrete transitions are forced,
and the continuous vector fields converge. The model for H, however, also allows the
system evolution to be influenced by, possibly externally set, continuous and discrete
variables (p and `). In addition, initial and final states are not explicitly marked,
2 Written L+(p, σ) ⊕ Bε(0), where ⊕ denotes the Minkowski (set) sum and Bε(x) is the open ball of radius ε centered at x.
allowing the machine to accept a family of input languages instead of a single one as
that of [69].
Let us now describe more formally the limit sets of the continuous dynamics fσ,
and highlight their link to the Pre and Post conditions of each mode. To this end,
let φσ(t; x0, ℓ, p) denote the flow of vector field fσ(x; ℓ, p) passing from x0 at time t = 0.
The positive limit set in control mode σ, when parametrized by p, is expressed as

L+(p, σ) = {y | ∃ tn : lim_{n→∞} tn = ∞, φσ(tn; x0, ℓ, p) → y as n → ∞, ∀ x0 ∈ Ω(p, σ)},
where Ω(p, σ) ⊆ X is the attraction region of control mode σ parametrized by p. We
assume that L+(p, σ), for a given σ and for all p ∈ P , is path connected.3 If it is not,
and there are isolated components L+i (p, σ) for i = 1, . . . , B(σ), one can refine a control
mode σ into σ1, . . .σB(σ), one for each L+i (p, σ). For simplicity, we assume that for H
of Definition 3, Σ does not afford any further refinement. For each discrete location
σ, the formulae Pre(σ) and Post(σ) are related to the limit sets and their attraction
regions in that location as follows:4
(z, p) ≡ (x, ℓ, p) |= Post(σ) ⇐⇒ (x, ℓ, p) ∈ {(x, ℓ, p) | x ∈ L+(p, σ) ⊕ Bε(0), ℓ ∈ L}

(z, p) ≡ (x, ℓ, p) |= Pre(σ) ⇐⇒ (x, ℓ, p) ∈ {(x, ℓ, p) | x ∈ Ω(p, σ), ℓ ∈ L}.
A state z, which together with some parameter p satisfies Pre(σ), can evolve
along φσ(t; z, p) to some other composite state z′ for which (z′, p) satisfies Post(σ),
and we write z −σ[p]→ z′. A sequence of the form (σ1, p1) · · · (σN, pN) is an input to H,
specifying how control modes are to be concatenated and parametrized in H. The
input sequence is a data word. We say that a data atom (σ1, p1) is admissible at the
initial setting (z0, p0) of H if p1 = p0 and (z0, p1) satisfies Pre(σ1), or if p1 ∈ s(z0, p0)
3 A set is path connected if any two points in the set can be connected with a path (acontinuous map from the unit interval to the set) [88].
4 Symbol |= is read “satisfies,” and we write (z, p) |= c if the valuation of logicalsentence c ∈ C over variables (z, p) is true.
and (z0, p1) satisfies Pre(σ1). A data atom (σ′, p′) is admissible in H at configuration
[z, p, σ] if there is a [z, p′, σ′] ∈ Z × P × Σ such that ∆H([z, p, σ]) = [z, p′, σ′]. A pair
of data atoms (σj, pj)(σj+1, pj+1) is admissible at configuration [z, p, σ] if (σj, pj) is
admissible at [z, p, σ], and there is a composite state z′ ∈ Z to which z evolves
under σj parameterized by pj (i.e., z −σj[pj]→ z′), giving a configuration [z′, pj, σj] where
the second input atom (σj+1, pj+1) is also admissible. A data word w is admissible in
H if every prefix of w is admissible.
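Prefix admissibility of a data word can be sketched directly from this definition. The function below is illustrative, with the hybrid agent reduced to three hypothetical callables (pre, evolve, reset) of our own naming.

```python
# A minimal sketch of prefix admissibility for a data word; the hybrid agent
# is abstracted into three callables, all of them illustrative placeholders.
def admissible(word, z0, p0, pre, evolve, reset):
    """word: sequence of data atoms (sigma, p). Returns True iff every
    prefix of the word is admissible from the initial setting (z0, p0)."""
    z, p = z0, p0
    for sigma, p_new in word:
        # Either keep the current parameter, or draw p_new from s(z, p).
        if p_new != p and p_new not in reset(z, p):
            return False
        if not pre(sigma, z, p_new):   # Pre(sigma) must hold at (z, p_new)
            return False
        z, p = evolve(sigma, z, p_new), p_new  # z --sigma[p]--> z'
    return True
```

Checking atom by atom in this way mirrors the requirement that admissibility of a word reduces to admissibility of all of its prefixes.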
3.3.2 Problem Statement
The planning problem addressed in this chapter is a reachability problem: for a
given Spec ∈ C, the goal is to design a control policy that drives the system from its
initial configuration to a configuration where Spec is satisfied.
Problem 1. Given a hybrid agent H at an initial configuration satisfying a formula
Init ∈ C, find an admissible sequence (σ1, p1) · · · (σN , pN) so that the configuration of
H after N transitions, for some N ∈ N, satisfies Spec ∈ C.
3.4 Abstraction
In this section we employ predicate abstraction to induce a discrete, finite-state
model of the concrete hybrid agent. Then we show that the hybrid agent weakly
simulates its abstract model.
3.4.1 Predicate Abstraction and the Induced Register Automata
Each hybrid agent H can be associated to a special one-register automaton.
Since we do not mark initial and final states in H, the discrete system is a semiau-
tomaton. We say that this one-register semiautomaton is induced by H. The relation
between the state-parameter pairs of H, and the states of the register semiautomaton
is expressed by a map.
Definition 4 (Valuation map). The valuation map VM : Z × P → Q ⊆ {1,0}^{|AP|} is
a function that maps a state-parameter pair (z, p) to a binary vector q ∈ Q of length
|AP|. The entry at position i in q, denoted q[i], is 1 or 0 depending on whether αi
in AP evaluated at (z, p) is true or false, respectively. For q ∈ Q, we denote this
valuation αi(z, p) = q[i].
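A minimal sketch of the valuation map VM, assuming the atomic propositions are supplied as an ordered list of Boolean predicates (the example predicates below are our own, purely illustrative):

```python
# Sketch of VM: map a state-parameter pair (z, p) to the binary vector of
# truth values of the atomic propositions, in a fixed order.
def valuation(z, p, predicates):
    """predicates: ordered list of functions alpha_i(z, p) -> bool.
    Returns the tuple q with q[i] = 1 iff alpha_i(z, p) holds."""
    return tuple(1 if alpha(z, p) else 0 for alpha in predicates)

# Illustrative propositions over a scalar state and parameter.
preds = [lambda z, p: z > 0,             # alpha_1: "state is positive"
         lambda z, p: abs(z - p) < 0.5]  # alpha_2: "state is near p"
```

Each abstract state q of the induced machine is then just one such bit vector realized by some pair (z, p).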
With reference to H and VM(·), a set-valued map λ : P × Q × Q × Σ → 2^P is
defined as

λ(τ; q, q′, σ′) ≜ {p′ | (∀z : VM(z, τ) = q)[p′ ∈ s(z, τ) ∧ (z, p′) |= Pre(σ′) ∧ VM(z, p′) = q′]}. (3.1)

Note that λ may not be defined for every q, σ and q′.
The register semiautomaton R(H) which serves as an abstraction of H is now
defined as follows.5
Definition 5 (Induced register semiautomaton). The deterministic finite one-way reg-
ister semiautomaton induced by hybrid agent H (with reference to Definition 3), is a
tuple R(H) = 〈Q, Σ, P, 1, τ, ∆R〉, with
Q a finite set of states; a
Σ the alphabet (same as that of H);
P the data set (same as that of H);
1 an m-dimensional array register;
τ : 1 → P ∪ {∅} the register assignment; b
∆R a finite set of read c and write d transitions.
a The set of states is defined as

Q = {q ∈ {0,1}^{|AP|} : ∃ (z, p) ∈ Z × P : VM(z, p) = q}.
5 This machine has only one register, so to lighten notation we drop the argumentfrom the current assignment of the register.
b Given input data atom (σ, p) ∈ Σ × P, the set Test(τ) consists of formulae defined
by the grammar ϕ ::= p = τ | πj(p) = πj(τ) | p ∈ λ(τ; q, q′, σ) | ¬ϕ | ϕ ∧ ϕ, where
q, q′ ∈ Q and j ∈ {1, . . . , |Σ|}.
c A read transition (q, ϕr) −σj→ (q′, right), where ϕr is τ = pj, is defined if for all z such
that VM(z, τ) = q, the pair (z, τ) satisfies Pre(σj) and there exists a continuous
evolution z −σj[pj]→ z′ such that VM(z′, pj) = q′.
d A write transition (q, ϕw) −σj→ (q′, stay), where ϕw is p ∈ λ(τ; q, q′, σj) ∧ ¬[πι(σj)(pj) = πι(σj)(τ)], is defined if there exists a parameter p ∈ P such that the set

{p′ ∈ P | p′ ∈ λ(p; q, q′, σj) ∧ πι(σj)(p′) ≠ πι(σj)(p)}

is not empty.
With the machine at configuration [j, q, τ], and upon receiving input wj = (σj, pj), if
pj = τ, the read transition (q, ϕr) −σj→ (q′, right) applies as long as it is in ∆R. In this
case, the machine moves to state q′, and the input read head advances one position.
If, on the other hand, πι(σj)(pj) ≠ πι(σj)(τ), while data value pj belongs to the set
λ(τ; q, q′, σj) for some q′ ∈ Q, then the write transition (q, ϕw) −σj→ (q′, stay) applies as
long as it is in ∆R. Then the machine reaches q′ without moving the input read head,
and overwrites the content of its register with pj.
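The read/write semantics just described can be sketched as follows. The transition table and the write_ok test (a stand-in for membership in λ) are illustrative toy inputs, not derived from any particular H.

```python
# Illustrative sketch of the one-register semiautomaton's operation: a write
# transition overwrites the register without advancing the read head; a read
# transition fires when the input datum matches the register and advances it.
def run(word, q0, tau0, read_edges, write_ok):
    """read_edges: dict (q, sigma) -> q'.  write_ok(tau, q, sigma, p) -> bool
    approximates the test 'p in lambda(tau; q, q', sigma)'.
    Returns the final (state, register) after consuming the data word."""
    q, tau = q0, tau0
    for sigma, p in word:
        if p != tau:
            if not write_ok(tau, q, sigma, p):
                raise ValueError("no write transition applies")
            tau = p                      # silent write: head stays put
        q = read_edges[(q, sigma)]       # observable read: head advances
    return q, tau
```

Note that each atom triggers at most one write before its read, which is exactly the composite-transition structure established in Proposition 1 below.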
A data atom (σ, p) is admissible at configuration [j, q, τ] if there is a transition
in ∆R that applies to [j, q, τ ] when (σ, p) appears at the input. A pair of data atoms
(σ1, p1)(σ2, p2) is admissible if there is a transition in ∆R that applies to some configu-
ration [j, q, τ ] on input (σ1, p1), taking R(H) to configuration [j+ 1, q′, τ ′], where some
other transition in ∆R applies on input (σ2, p2). A data word w is admissible if every
prefix of w is admissible. Compared to the register automaton of Definition 2, the
construction of Definition 5 differs. First, there are no initial and final states—it is a
semiautomaton—and second, there is a single register that stores an array rather than
a single variable. Register tests, though, are performed element-wise on the register.
For the special class of hybrid systems considered here, the registers of such a
model turn out to be adequate for capturing the continuous behavior, up to the resolution allowed by the given set of atomic propositions. The only change we introduce
to the standard register automaton model is the capacity to perform inequality tests
on the data. However, given that the most basic logical operation is set inclusion, an
equality test can only be performed by means of a conjunction of inequality tests.
Thus, the extension we propose does not fundamentally require any additional computational power on the part of the machine.
The write transitions in R(H) are silent, in the sense that they do not advance
the read head of the machine and thus do not produce any observable change. Read
transitions, on the other hand, are observable. A concatenation of any number of silent
transitions with a single observable transition triggered by input atom wj, taking the
machine from state q to state q′ is denoted q −wj⇝ q′, and we refer to this transition
sequence as a composite transition. Since only one observable transition is taken in a
composite transition, the read head advances only one step. A composite transition is
maximal if the machine cannot make another transition without reading a new data
atom.
Proposition 1. Let w = w1 · · ·wn be an admissible input sequence for R(H). Then
any maximal composite transition from state q to state q′ contains either a single read
transition, or a write transition followed by a read transition.
Proof. Let R(H) be at configuration [j, q, τ ]. Suppose for the data atom wj = (σj, pj),
R(H) takes a composite transition, q −wj⇝ q′. If pj = τ then the machine jumps from
q to q′ and advances the read head by one position. In this case, the configuration
changes from [j, q, τ ] to [j + 1, q′, τ ]. If pj 6= τ , no read transition applies, which
means that a write transition must have taken place. Once this write transition is
completed, τ has the value of pj. The machine still reads wj = (σj, pj) on the input
tape, since the read head has not advanced. But now, upon reading wj again, the
machine finds τ = pj. A read transition is triggered and the read head advances one
position forward. Configuration [j, q, τ ] changes first to some intermediate [j, qt, τ ′]
after the write transition, and then to the final [j + 1, q′, τ ′] after the read transition.
In any case, a composite transition either includes a single read transition or a write
transition followed by a read transition—the latter referred to as a write-read transition
pair.
3.4.2 Weak (Bi)simulations
To ensure that any plan generated by the abstract model is feasible in the concrete
hybrid agent, the two systems have to be related formally. In our case, this relation is a
weak (bi)simulation.
In a transition system TS = 〈Q, Σ, T〉, the alphabet can be partitioned
into two subsets: Σε ⊆ Σ and Σ \ Σε. We call a transition silent if it is labeled with a
label from Σε, and observable otherwise. We write q ⇝ q′ to denote that q′ is reachable
from q with an arbitrary number of silent transitions, and q −σ⇝ q′ if q′ is reachable from
q by a composite transition containing one observable transition labeled σ.
Definition 6 (Weak (observable) simulation [90]). Consider two (labeled) transition
systems over the same input alphabet Σ: TS1 = 〈Q1,Σ, T1〉 and TS2 = 〈Q2,Σ, T2〉. Let
Σε ⊂ Σ be a set of labels for silent transitions. An ordered binary relation R ⊆ Q1×Q2
is a weak (observable) simulation if: (i) R is total, i.e., for any q1 ∈ Q1 there exists a
state q2 ∈ Q2 such that (q1, q2) ∈ R, and (ii) for every ordered pair (q1, q2) ∈ R, if there
exists a state q′1 ∈ Q1 which the machine can reach with a composite transition from
q1, i.e., q1 −σ⇝1 q′1, then there also exists q′2 ∈ Q2 that can be reached with a composite
transition from q2, i.e., q2 −σ⇝2 q′2, and (q′1, q′2) ∈ R. Then TS2 weakly simulates TS1
and we write TS2 ≳ TS1.
In other words, TS2 weakly simulates TS1 if any input admissible in TS1 is also
admissible in TS2. In that sense, a hybrid agent that weakly simulates its induced
register semiautomaton can implement every input sequence admissible in the register
semiautomaton. Indeed, we show that this is the case:
Theorem 1. Hybrid agent H weakly simulates its induced register semiautomaton
R(H) in the sense that the ordered total binary relation R defined as (q, z) ∈ R ⇔ ∃ p ∈ P : VM(z, p) = q, satisfies

(q, z) ∈ R and q −wj⇝ q′ with wj = (σj, pj) =⇒ ∃ z′ ∈ Z : z −σj[pj]→ z′ with (q′, z′) ∈ R. (3.2)
Proof. First note that relation R is total by construction, since any state q ∈ Q is by
definition the image under the valuation map VM of some (z, p) ∈ Z ×P . To establish
that R is a weak simulation, let the register semiautomaton R(H) be at configuration
[j, q, τ], with a state q for which we can find a state z ∈ Z in H to relate it with:
(q, z) ∈ R. Suppose now that R(H) takes a (composite) transition on wj; then, according
to Proposition 1, this composite transition consists of either a single read transition,
[j, q, τ] −wj→ [j + 1, q′, τ], or a write-read pair: [j, q, τ] −wj→ [j, qt, τ′] −wj→ [j + 1, q′, τ′]. The
mere existence of a transition originating from q on input (σj, pj) ensures that for any
z that satisfies VM(z, τ) = q, it holds that either (z, τ) satisfies Pre(σj), with τ = pj (if
we have a single read transition), or that the ι(σj) components of the control parameter
and register do not match, meaning πι(σj)(τ) ≠ πι(σj)(pj), and there is some qt ∈ Q,
such that VM(z, pj) = qt and pj ∈ λ(τ ; q, qt, σj) (if we have a write-read transition
pair). In the latter case, by the definition of λ, we know that (z, pj) must satisfy
Pre(σj). If wj triggers a single read transition (the case τ = pj), then there must exist
a continuous evolution in H in control mode σj parameterized by pj, taking z to z′
(namely, z −σj[pj]→ z′) at which VM(z′, pj) = q′; it follows that (q′, z′) ∈ R. If, instead, wj
triggers a write-read transition pair, then after updating its register by setting τ = pj,
R(H) still reads (σj, pj) as the input. Since (z, pj) satisfies Pre(σj) and now τ = pj,
R(H) has to take a read transition to reach q′. The argument of the previous case
applies and completes the proof.
Theorem 1 suggests that while all admissible input sequences in R(H) will also
be admissible in H, a control policy that takes H from its
present state into another that satisfies Spec, might not have a matching run in R(H).
This is not necessary for the purposes of planning, but it is essential for verification. To
ensure a matching run we need to strengthen the link between the two models. Theorem
2 gives sufficient conditions for a weak bisimulation to be established between H and
R(H).
Theorem 2. Given the hybrid agent H and its induced register semiautomaton R(H),
the binary relation R defined as (q, z) ∈ R ⇐⇒ ∃p ∈ P , VM(z, p) = q, is a weak
bisimulation relation under the following conditions:
a) given p ∈ P, for any two z1, z2 ∈ Z, if VM(z1, p) = VM(z2, p) = q, then
whenever (z1, p) satisfies Pre(σ), (z2, p) also satisfies Pre(σ). In addition,
the parametrized control mode σ[p] that takes z1 to z′1 takes z2 to some z′2 for which
VM(z′1, p) = VM(z′2, p).

b) given p ∈ P, for any two z1, z2 ∈ Z for which VM(z1, p) = VM(z2, p) = q, if
p′ ∈ s(z1, p) and VM(z1, p′) = q′, then p′ ∈ s(z2, p) and VM(z2, p′) = q′.

c) given z ∈ Z and any p1, p2 ∈ P, if (z, p1) satisfies Pre(σ) and (z, p2) does
not satisfy Pre(σ), then the ι(σ) components of p1 and p2 do not match: πι(σ)(p1) ≠ πι(σ)(p2).
Proof. Since we know from Theorem 1 that H weakly simulates R(H), we only need
to show the implication (3.2) in the opposite direction: if the conditions above are
satisfied, then given (q, z) ∈ R ⊆ Q × Z, z −σ[p]→ z′ =⇒ ∃ q′ ∈ Q : q −(σ,p)⇝ q′ ∧ (q′, z′) ∈ R. To this end, select any po ∈ P such that q = VM(z, po), and examine the two
possibilities:
Case 1, where po = p
The evolution z −σ[p]→ z′ implies that the pair (z, p) satisfies Pre(σ). Given
condition a), we know that any other z1 such that VM(z1, p) = q (and is therefore
related to q), will also make a pair (z1, p) that satisfies Pre(σ). This means that any
such z1 will evolve to some z′1 in mode σ parameterized by p. We can collect all these
limit points z′1 to a set Z′(p) ≜ {z′ | z −σ[p]→ z′, for some z : VM(z, p) = q}. Condition a)
also ensures that for any z′1, z′2 ∈ Z′(p), VM(z′1, p) = VM(z′2, p) = q′ for some state q′ ∈ Q.
Based on Definition 5, there exists a read transition in R(H) taking q to q′ upon input
(σ, p). All z′ ∈ Z ′(p) give VM(z′, p) = q′ and thus (q′, z′) ∈ R.
Case 2, where po ≠ p
Without loss of generality assume that (z, po) does not satisfy Pre(σ); otherwise, we can have z −σ[po]→ z′, which reduces this case to Case 1. Condition c) then
requires that πι(σ)(po) ≠ πι(σ)(p). Since we are given that z −σ[p]→ z′, we can conclude that
(z, p) satisfies Pre(σ). The definition of the reset map then suggests that p ∈ s(z, po).
Let VM(z, p) = qt. Condition b) ensures that for any state z1 that makes a pair (z1, po)
such that VM(z1, po) = VM(z, po) = q, it is p ∈ s(z1, po) and VM(z1, p) = VM(z, p) = qt.
Let the set of all such states z1 be Z1. Since (z, p) satisfies Pre(σ) and z ∈ Z1,
using condition a) we have that for all z1 ∈ Z1, the pair (z1, p) satisfies Pre(σ).
Recall that λ(po; q, qt, σ) = {p ∈ P | (∀z : VM(z, po) = q)[p ∈ s(z, po) ∧ (z, p) |= Pre(σ) ∧ VM(z, p) = qt]}, and note that this set is nonempty since it always contains
p. Therefore, a write transition (q, ϕw) −σ→ (qt, stay) applies on input atom (σ, p) with
formula ϕw expressed as p ∈ λ(po; q, qt, σ) ∧ πι(σ)(po) ≠ πι(σ)(p). This write transition
takes q to qt and updates the register with p. Now we have Case 1.
Supported by Theorems 1 and 2, we proceed with abstraction-based time-optimal
planning, knowing that a plan generated by the induced register semiautomaton is always
implementable in the hybrid agent.
3.5 Time-optimal Planning
We propose a two-step procedure for abstraction-based time-optimal planning.
The first step is to determine a sequence of symbols in Σ, that is, a sequence of
controllers, such that with some parameterization of this sequence, the control objective
can be satisfied. The second step takes this sequence and determines the parameters for
each individual controller, such that the goal state is reached with the optimal cost.
Any transition in R(H) may incur a cost, but in this context we assume only ob-
servable transitions do so. The cost of an observable transition in R(H), corresponding
to a continuous evolution in H, is determined by the component continuous dynamics
active during that time period, the initial conditions for the continuous states, and the
assignment of parameters. The component dynamics when H is at control mode σ is
expressed as ẋ = fσ(x, ℓ, p), with σ ∈ Σ, p ∈ P, ℓ ∈ L, and x ∈ X. An incremental
cost function F : X × R+ → R+ is used to define the atomic cost gσ(x, ℓ, p) for H
evolving in control mode σ along flow φσ(t; x, ℓ, p) for t ∈ [t0, tf]:

gσ(x, ℓ, p) ≜ ∫_{t0}^{tf} F(x(t)) dt .
We define the incremental cost using the indicator function:

F(x(t)) ≜ 1_{(L+(p,σ) ⊕ Bε(0))^c}(x(t))

where 1_A denotes the indicator function of set A, ⊕ the Minkowski (set) sum, Bε(x) is
the open ball of radius ε of appropriate dimension centered at x, and (·)^c denotes set
complement. Other choices are of course possible; however, this choice of F yields an
atomic cost gσ which measures the time it takes the flow of vector field fσ(x, ℓ, p) to
hit an ε-neighborhood of L+(p, σ):

gσ(x, ℓ, p) = ∫_{0}^{∞} F(φσ(t; x, ℓ, p)) dt . (3.3)
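The cost (3.3) can be approximated numerically: integrating the indicator along a simulated flow amounts to recording the time at which the trajectory first enters the ε-neighborhood of the limit set. The sketch below uses an illustrative scalar mode and forward-Euler integration; the names and dynamics are our own, not from the text.

```python
# Numeric sketch of the atomic cost (3.3): the integral of the indicator of
# the complement of the eps-neighborhood equals the time the flow spends
# outside it, i.e., the time to first hit the neighborhood.
def atomic_cost(f, x0, limit_pt, eps, dt=1e-3, t_max=100.0):
    x, t = x0, 0.0
    while abs(x - limit_pt) >= eps:      # F = 1 outside the neighborhood
        if t >= t_max:
            return float("inf")          # did not converge within the horizon
        x += dt * f(x)                   # forward-Euler step of x' = f(x)
        t += dt
    return t                             # integral of F along the flow

# Illustrative mode sigma[p]: x' = -(x - p) converges to the limit point p.
cost = atomic_cost(lambda x: -(x - 1.0), x0=0.0, limit_pt=1.0, eps=0.1)
```

For this linear mode the hitting time is close to ln(1/ε), consistent with the Lyapunov-based over-approximations mentioned below.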
In an admissible data word w = (σ1, p1) . . . (σN , pN), for any σi−1, σi appearing
consecutively in str(w), the data value pi−1 that comes along with σi−1 should match
with some state z ∈ Z, in a way that the pair (z, pi−1) satisfies Post(σi−1). In addition,
for that same z, (z, pi−1) either also satisfies Pre(σi), or its image under the reset map
s contains some other pi 6= pi−1, for which (z, pi) satisfies Pre(σi).
We can thus eliminate the dependence of gσ on z = (x, `) in (3.3) by conserva-
tively over-approximating the atomic cost for a transition σi ∈ str(w), using a function
of parameters:
gσi(pi−1, pi) ≜ max_{z : (z, pi−1) ∈ S} ∫_{0}^{∞} F(φσi(t; z, pi)) dt (3.4)
where
S ≜ {z | (z, pi−1) |= Pre(σi) ∧ Post(σi−1)}, if pi = pi−1;
{z | (z, pi−1) |= Post(σi−1), pi ∈ s(z, pi−1), (z, pi) |= Pre(σi)}, otherwise.
The integral in (3.4) does not always have to be computed explicitly. This is because
the time required for a continuous state x ∈ X to converge under controller σ to an
ε-neighborhood of L+(p, σ) can be over-approximated using Lyapunov-based techniques,
discussed briefly in Appendix A.
The accumulated cost Jw for executing data word w = (σ1, p1) . . . (σN , pN) from
configuration [z, p, σ], assuming that w is admissible at [z, p, σ], is upper bounded by
Jw(z, p) ≤ J̄w(z, p) ≜ gσ1(z, p1) + Σ_{i=2}^{N} gσi(valw(i − 1), valw(i)).
The optimization problem can then be stated as follows:
Problem 2. With the hybrid agent H at an initial configuration [z0, p0, σ0], where
(z0, p0) satisfies Init ∈ C and σ0 ∈ Σ, find, out of all admissible sequences w =
(σ1, p1) · · · (σN, pN) solving Problem 1, the one that achieves min_{{pj}_{j=1}^{N}} J̄w(z0, p0).
3.5.1 Searching for Candidate Plans
Due to the existence of registers, register automata do not have a standard
graphical representation, so it is not clear how reachability analysis can be performed
using graph search methods. In such a machine, a state may be reached either by
a read, or a write transition; however, the nature of the incoming transition matters
when it comes to reasoning as to what happens next. Configurations, on the other
hand, cannot be enumerated due to the inclusion of the continuous data in τ .
3.5.1.1 A Graph Representation
For the purpose of planning using graph search algorithms, we suggest an embed-
ding of R(H) into a labeled transition system, hereby referred to as the transformation
semiautomaton, which brings out some information about register updates and the
nature of transitions.
Definition 7. The transformation semiautomaton of R(H) is a tuple TR(H) = 〈Q, Σ, ∆R〉 consisting of:
Q ⊆ Q × {p, p′} a finite set of states; a
Σ = Σ ∪ Λ ∪ {θ} a set of transition labels; b
∆R a set of transitions of four types. c–f
a Q contains couples where the first element is a state of R(H) and the second element
is a symbol, either p or p′. Whenever a state in R(H) is reached with a write
transition, its corresponding state in TR(H) is marked with a p′.
b Subset Λ contains labels indexing all different possible write transitions in R(H),
each write transition assigned to a unique λ in Λ. The singleton {θ} contains an
auxiliary label marking trivial write transitions (write self-loops) in R(H) which do
not modify the register content.
c One type is (q, p) −λi⇢ (q′, p′), defined if q′ is accessible from q in R(H) via a write
transition (q, ϕw) −σ→ (q′, stay).
d Another type is (q, p) −θ⇢ (q, p′), defined if q is accessible from any q′ ∈ Q via a
write transition.
e A third type is (q, p) −σ→ (q′, p), defined if there exists a read transition (q, ϕr) −σ→ (q′, right) and q is not accessible via a write transition from any other state in R(H).
f The last type is (q, p′) −σ→ (q′, p), defined if there exists a read transition (q, ϕr) −σ→ (q′, right) and q is accessible via at least one write transition from some state in
R(H).
We define the injective function Λ : Λ→ Q×Q that singles out the transition
of TR(H) that is labeled by the particular label in Λ. Consequently, λ(τ; q, q′, σ) ≡ λ(τ; Λ(λ), σ).
It is straightforward to show that TR(H) and R(H) are weakly bisimilar; intu-
itively, one merges any pair of states of the form (q, p) and (q, p′) in TR(H).
Proposition 2. The transformation semiautomaton TR(H) and the induced register
semiautomaton R(H) are weakly bisimilar: there exists an ordered binary relation R
on Q × Q such that: (i) R is total, and (ii) whenever (q, (q, ∗))6 ∈ R and there exists
a read or write transition from q to some q′ in R(H) for some σ ∈ Σ, then there exists
a composite transition in TR(H), (q, ∗) −a⇝ (q′, ∗) with a ∈ Σ ∪ Λ and (q′, (q′, ∗)) ∈ R.
Conversely, if there is a transition in TR(H) taking (q, ∗) to (q′, ∗), then there exists
a composite transition in R(H) taking q to q′ while (q, (q, ∗)) ∈ R and (q′, (q′, ∗)) ∈ R.
Proof. Define R implicitly as a partition on Q in which (q, p) and (q, p′) belong in the
same block and the equivalence class is labeled by q. Note that TR(H) is constructed
in a way that guarantees R to be total. First take the case of a read transition
(q, ϕr) −σ→ (q′, right) in R(H). By construction, TR(H) can take either (q, p) −σ→ (q′, p)
or (q, p′) −σ→ (q′, p), and obviously (q′, (q′, p)) ∈ R. Any transition (q, ϕw) −σ→ (q′, stay)
in R(H) can be matched by the transition (q, p) −λ⇢ (q′, p′) in TR(H), and since
(q′, (q′, p′)) ∈ R, it follows that TR(H) ≳ R(H).

The other direction is shown as follows: consider any (q, ∗) −a⇝ (q′, ∗), with
a ∈ Σ ∪ Λ. If a ∈ Σ, three possible cases arise: a) (q, p) −a→ (q′, p), b) (q, p′) −a→ (q′, p),
and c) (q, p) −θ⇢ (q, p′) −a→ (q′, p), where θ is the auxiliary label. In all three cases, the
end state (q′, p) is related to q′ via R and there is always a transition of the form
(q, ϕr) −a→ (q′, right) in R(H) by construction of TR(H). If a ∈ Λ, then by construction
there is a transition (q, ϕw) −σ→ (q′, stay) in R(H), and since q′ can be reached by a
write transition, there exists (q, p) −a⇢ (q′, p′) ∈ ∆R. Since both (q, (q, p)) and
(q′, (q′, p′)) belong in R, we conclude that it is also the case that R(H) ≳ TR(H).
3.5.1.2 Finding Walk Candidates
6 ∗ stands for either p or p′.

Let us define a set of ternary vectors q̂ ∈ {0, 1, ∗}^{|AP|}, where the semantics of
∗ at location i within a vector q̂ is that atomic proposition αi can be either true or
false; we do not know which. In that sense, a ternary vector q̂ can be identified with a
set of binary vectors, and thus we may write q ∈ q̂. Recalling formula Spec in Problem 1,
we represent the set of all binary vectors q for which a pair (z, p) with VM(z, p) = q
satisfies Spec, by a single ternary vector qSpec. If qSpec[i] = ∗, this means that αi does
not appear in Spec.
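Membership of a binary valuation in the set encoded by a ternary vector reduces to an element-wise check; a small illustrative sketch (the names and example vectors are our own):

```python
# Sketch of matching a binary valuation q against a ternary vector q_spec,
# where "*" means "either truth value is acceptable".
def matches(q, q_spec):
    return all(s == "*" or b == s for b, s in zip(q, q_spec))

# Illustrative Spec over three propositions: alpha_1 must be true, alpha_3
# must be false, and alpha_2 does not appear in Spec.
q_spec = (1, "*", 0)
```

Any state whose valuation matches q_spec is then an acceptable final state for the reachability problem.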
Now we recast Problem 2, which is given on H, as a problem defined on R(H)
and TR(H):
Problem 3. For a given Spec ∈ C and a pair (z0, p0) satisfying Init, and for any
qf ∈ qSpec, find a data word w = (σ1, p1) · · · (σN , pN) for which
1. there exists a walk w in TR(H) from (q0, p) to (qf, p) with q0 = VM(z0, p0) and
qf ∈ qSpec, such that its projection to Σ, denoted w↾Σ, satisfies w↾Σ = str(w),
and

2. J̄w(z0, p0) is minimized with respect to {pj}_{j=1}^{N}, where pN = pf as specified in
Spec.
Condition 2) restates the optimality requirement of Problem 2. Theorem 1
ensures that w is a solution to Problem 1.
For the first part, because of the restrictions imposed by the set-valued map
λ, the run on the transformation automaton might have to revisit some state in Q a
few times in order to bring the parameter to the value specified in Spec. Thus, we
need to search for walks, 7 instead of simple paths. To find the walks we augment
TR(H) by adding the initial and desired final states based on Problem 3, and obtain
a dfa. Then we generate a regular expression (regex) of this dfa.8 From this regex
we can construct successively longer walks satisfying condition 1) of Problem 3, and
then optimize them using a modified version of dynamic programming discussed next.
7 A walk is a path which may include cycles.
8 A regex is defined recursively as follows [52]: (1) the empty string ε and all σ ∈ Σ are regexs; (2) if r and s are regexs, then rs (concatenation), (r + s) (union) and r∗, s∗
(Kleene-closure) are regexs; (3) there are no regexs other than those constructed by applying rules (1) and (2) above a finite number of times.
With cycles allowed there is no bound on the length of admissible strings in the dfa,
and thus we limit the number of walks that can be checked for optimality by setting
an upper bound on the cost, based on an assumed maximum affordable cost, and the
cost of the least expensive observable transition.
With initial state (q0, p) and final state (qf , p), we obtain the dfa
〈TR(H), (q0, p), (qf , p)〉 = 〈Q, Σ, ∆R, (q0, p), (qf , p)〉
and find a regex, denoted RE(H), associated with this dfa using known methods [18].
Replacing every occurrence of the Kleene star ∗ in RE(H) with a natural number
gives a set W(m) of all admissible walks of length m in the dfa, W(m) ≜ {w | w ∈
RE(H), |w| = m}. Any walk in W(m) has a matching admissible input data word on
R(H) (Theorem 1). However, TR(H) has no information on specific register values,
and thus the corresponding admissible data word may not comply with the requirement
for p0 and pf . To remove inadmissible walks we develop a procedure for translating
a walk in TR(H) to a family w of data words in R(H), in which all individual words
w have the same symbol string str(w) but different data value assignments. The
domains of possible data value assignments are specified by a sequence of set-valued
maps:
Given a walk w = u1 · · ·um, set i := 1, j := 1, and for 1 ≤ i ≤ m, distinguish
three cases:
1. ui ∈ Σ: then, set σj := ui, wj := (σj, pj), Mj(·) := idP (·), j := j + 1, i := i+ 1;
2. ui ∈ Λ: then, set σj := ui+1, Mj(·) := λ(· ; Λ(ui), σj), wj := (σj, pj), j := j + 1,
i := i+ 2;
3. otherwise, set σj := ui+1, Mj(·) := idP (·), wj := (σj, pj), j := j + 1, i := i+ 2.
In the above, idP : p 7→ p is the identity map on P . A walk w = u1 . . . um is thus
translated into a family of data words w = (σ1, p1) · · · (σN , pN), and a sequence of
set-valued maps Mi(·) : P → 2^P, for i ∈ {1, . . . , N}.
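The translation procedure above, together with the consistency test (3.5) below, can be sketched as follows. The label sets, the map attached to each λ ∈ Λ, and the goal set are illustrative placeholders of our own.

```python
# Sketch of the walk-translation procedure: a symbol from Sigma yields a data
# atom with the identity map; a label from Lambda attaches its set-valued map
# lambda(.) to the following symbol; the trivial write label theta also
# contributes the identity map.
def translate(walk, Sigma, Lam, theta, lam_maps):
    """Returns (symbol string, list of set-valued maps M_i : p -> set)."""
    syms, maps, i = [], [], 0
    ident = lambda p: {p}
    while i < len(walk):
        u = walk[i]
        if u in Sigma:                       # case 1: plain read
            syms.append(u); maps.append(ident); i += 1
        elif u in Lam:                       # case 2: write, then read
            syms.append(walk[i + 1]); maps.append(lam_maps[u]); i += 2
        else:                                # case 3: trivial write theta
            syms.append(walk[i + 1]); maps.append(ident); i += 2
    return syms, maps

def consistent(p0, maps, goal_set):
    """Composed-map consistency check: M_N o ... o M_1(p0) must meet the goal."""
    reach = {p0}
    for M in maps:
        reach = set().union(*(M(p) for p in reach)) if reach else set()
    return bool(reach & goal_set)
```

A walk passes the test exactly when the composed set-valued maps, started from p0, can reach a parameter value compatible with the final state.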
To check whether a walk w generates a data word w ∈ w that can match the pa-
rameter specifications, we use the sequence of set-valued maps Mi(·)Ni=1 constructed
by this procedure and verify the consistency condition
{p ∈ P | ∃ z ∈ Z : VM(z, p) = qf} ∩ MN ◦ · · · ◦ M1(p0) ≠ ∅. (3.5)
3.5.2 Dynamic Programming — a Variant
We start with the best case—the shortest possible walk found—and test locally
for optimality. If an optimal solution is encountered, the search stops.
Let the maximal allowable cost for any solution to Problem 3 be Jmax. Then,
if the minimum cost of executing an observable transition labeled with σ ∈ Σ is some
Jmin > 0, an upper bound U on the length of data words translated from walks is
U ≜ ⌈Jmax / Jmin⌉.
If (3.5) holds, then there exists a sequence of N parameter values {pi}_{i=1}^{N} with
|w↾Σ| = N ≤ m such that w = (σ1, p1) · · · (σN, pN) is an admissible input for
R(H). Input w takes the register semiautomaton from configuration [1, q0, p0] to some
configuration [N + 1, qf, pf] where qf ∈ qSpec. Among all walks which pass the test
(3.5), we pick the shortest one as the candidate most likely to yield the optimal solution
to Problem 3.
In the case a candidate walk is found, we modify the standard dynamic programming (DP) algorithm of [14], and apply it in its new form to obtain an optimal
sequence of parameters. We first obtain a set of subsets of P, denoted {Pi}_{i∈dom(w)},
where each Pi consists of all parameter values that can be used to parametrize a control
mode (data atom) at stage i of execution of an input data word in w, and is found as
Pi ≜ Mi ◦ · · · ◦ M1(p0) ∩ (MN ◦ · · · ◦ Mi+1)^{−1}(S) ,

where S = {p | ∃ z ∈ Z : VM(z, p) = qf}. Closed-form expressions for the optimal
values of parameters and the accumulated cost can be obtained in the special case
where the continuous dynamics of each control mode associated with a data atom
in the input w is linear, and the related atomic cost is quadratic. In more general
42
(nonlinear) cases, sets Pi, for i ∈ dom(w) may have to be discretized. Naturally, the
resolution of this discretization affects the optimality of the solution obtained.
Assuming a general case where closed-form solutions for the optimal parameters
are impractical, consider a partition of Pi into Ki blocks, enumerate the blocks,
and let pi[k] denote the representative of the parameter values belonging to block
k ∈ {1, . . . , Ki}. The DP algorithm selects the optimal sequence of parameter repre-
sentatives p1*, . . . , pN* in the family w as follows.
Let i = N , and for each pN−1[k] ∈ PN−1, k = 1, . . . , KN−1, set

    PN*(pN−1[k]) := arg min_{pN [j] ∈ PN} gσN (pN−1[k], pN [j])            (3.6a)
    JN*(pN−1[k]) := gσN (pN−1[k], PN*(pN−1[k])) .                         (3.6b)
This process constructs two discrete maps on PN−1. The first map associates pN−1[k]
with the value PN*(pN−1[k]), to which the parameter should be reset in order to trigger
the transition with minimum cost. The second map associates a representative
pN−1[k], assumed to be written in the register before the σN transition is triggered,
with the minimum accumulated cost JN*(pN−1[k]) incurred during the σN transition.
For i = N − 1, . . . , 2 we repeat

    Pi*(pi−1[k]) := arg min_{pi[j] ∈ Pi} { gσi (pi−1[k], pi[j]) + J*_{i+1}(pi[j]) }        (3.7a)
    Ji*(pi−1[k]) := gσi (pi−1[k], Pi*(pi−1[k])) + J*_{i+1}(Pi*(pi−1[k])) .                 (3.7b)
Finally, for i = 1 we finish by setting

    P1*(p0) := arg min_{p1[j] ∈ P1} { gσ1 (z0, p1[j]) + J2*(p1[j]) }       (3.8a)
    J1*(z0, p0) := gσ1 (z0, P1*(p0)) + J2*(P1*(p0)) .                      (3.8b)
The optimal sequence of parameter representatives p1*, . . . , pN* is then obtained
iteratively. This sequence identifies a particular member of the input word family w
as the solution w* to Problem 3. The (conservative) accumulated cost is given by
J1*(z0, p0).
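The backward recursion (3.6)–(3.8) can be sketched for discretized parameter sets as follows. The function and variable names are ours, and the scalar quadratic toy costs merely stand in for the transition costs gσi of the thesis.

```python
# Sketch of the backward DP recursion (3.6)-(3.8) over discretized
# parameter sets P_1, ..., P_N.  Illustrative only.

def solve_dp(P, g, z0_cost):
    """
    P       : list of N lists; P[i] holds the stage-(i+1) representatives
    g       : list of N-1 cost functions g[i](p_prev, p_next), stages 2..N
    z0_cost : first-stage cost g_sigma_1(z0, p1)
    Returns (optimal parameter sequence, accumulated cost).
    """
    N = len(P)
    J = {p: 0.0 for p in P[-1]}            # cost-to-go beyond stage N is 0
    policy = [None] * N                    # policy[i]: best p_{i+1} given p_i
    for i in range(N - 1, 0, -1):          # stages N, N-1, ..., 2
        stage, best = {}, {}
        for p_prev in P[i - 1]:
            cand = [(g[i - 1](p_prev, p) + J[p], p) for p in P[i]]
            stage[p_prev], best[p_prev] = min(cand)
        J, policy[i] = stage, best
    # stage 1: the initial continuous state z0 enters through z0_cost
    total, p1 = min((z0_cost(p) + J[p], p) for p in P[0])
    seq = [p1]
    for i in range(1, N):                  # roll the policy maps forward
        seq.append(policy[i][seq[-1]])
    return seq, total

# Toy run with N = 3 scalar stages and quadratic stage costs:
P = [[0.0, 1.0], [0.0, 2.0], [1.0, 3.0]]
g = [lambda a, b: (a - b) ** 2, lambda a, b: (a - b) ** 2]
seq, cost = solve_dp(P, g, z0_cost=lambda p: p ** 2)
print(seq, cost)  # [0.0, 0.0, 1.0] 1.0
```

Storing the argmin maps (the `policy` list) at every stage is what allows the optimal sequence to be read off forward once J1* is known, exactly as in the text.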
The time complexity of this variant of DP with discretized parameter space
is polynomial, O(NK²), where N = |w| and K = max_{i=1,...,N} Ki. To generate a
candidate data word for the DP algorithm, one checks (3.5), which in the worst case
requires the enumeration of all data words of maximal length U .
The solution obtained can be suboptimal because: (i) when a weak
bisimulation cannot be established between H and R(H), there may exist sequences
of parametrized control modes with lower cost that are only admissible in H; (ii) the
accumulated cost computed in R(H) over-approximates the time needed for executing
w from (z, p) in H, and it is conceivable that a word with a higher accumulated cost Jw
might actually be executed faster; (iii) if the upper bound U on the length of data words
is smaller than the length of the optimal solution, then the optimal solution is not
analyzed; and (iv) the discretization of the parameter space introduces quantization
errors.
3.6 Case Study
The effectiveness of the abstraction and planning algorithms is illustrated with an
example. The problem to be solved is as follows: a mobile manipulator is instructed to
fetch a document at the printer and deliver it to the user. The locations of the printer
and the user are known.
The mobile manipulator consists of two subsystems: a wheeled mobile platform,
and a two degree-of-freedom robotic arm moving on a vertical plane. The robot exhibits
three different behaviors: (i) it can move from some initial position to a desired posture
(position and orientation), (ii) it can reach out with its arm, grasp an object in the
workspace and hold it, and (iii) it can reach out with its arm to a desired position
and release an object held in its gripper. When the robot performs any one of these
maneuvers we say that it is in a particular control mode, and these modes are labeled a,
b, and c, respectively. The controller responsible for each of these behaviors is given to
us a priori and no access to its low-level software is permitted. We have to determine
the sequence and parameterization of the controllers to achieve the desired outcome:
printout delivered to user.
One obvious (to a human) solution is to bring the robot to the vicinity of the
printer, have it reach out and pick up the printout from the output tray, then navigate
to the user and deliver the paper stack. However, it is not clear how such a plan can
be generated automatically.
3.6.1 Control Mode a: Nonholonomic Control
The mobile platform is modeled kinematically as a unicycle

    ẋ = v cos ϑ        ẏ = v sin ϑ        ϑ̇ = ω

where the velocity v and the angular velocity ω are the control inputs. Control mode
a steers the robot's posture Xp ≝ (x, y, ϑ)ᵀ ∈ R² × S¹ from an initial configuration
Xp0 = (x0, y0, ϑ0)ᵀ to a target Xpf = (xf , yf , ϑf )ᵀ. A coordinate transformation
naturally reduces this problem to steering the unicycle to the origin.⁹
The controller in mode a is designed based on [83]. Let x1 = ϑ mod 2π,
x2 = x cos ϑ + y sin ϑ, x3 = −2(x sin ϑ − y cos ϑ) + (ϑ mod 2π)(x cos ϑ + y sin ϑ).
Define

    ω = −k1 x1 + k3 x3^r x2
    v = −k1 x2 + 0.5 (x1 x2 − x3) ω

where k1, k3 > 0 are control gains, r = m/n, and m < n are odd naturals. The closed-
loop system is

    ẋ1 = −k1 x1 + k3 x3^r x2        ẋ2 = −k1 x2        ẋ3 = −k3 x3^r .        (3.9)

Vector field fa is defined by the right-hand sides of (3.9). It can be verified that with
C ≝ x3(0)^{1−r} and tf ≝ C / (k3(1 − r)), when t ≤ tf , x3(t) = sign(C) | |C| − k3(1 − r)t |^{1/(1−r)}, and
⁹ Here, the workspace is obstacle-free. If obstacles are present, one may replace the
controller in mode a with one that can handle obstacles, such as [102]. The challenge
lies in approximating the convergence rate; for this, see Appendix A.
for t ≥ tf , x3(t) = 0. Then for t ≥ tf ,

    x1(t) = x1(tf ) e^{−k1(t−tf )} ,        x2(t) = x2(tf ) e^{−k1(t−tf )} ,

where

    x1(tf ) = e^{−k1 tf} ( x1(0) + k3 x2(0) ∫₀^{tf} e^{2k1 s} (C − k3(1 − r)s)^{r/(1−r)} ds )

and x2(tf ) = x2(0) e^{−k1 tf} .
Post(a) is defined as the area where x1 and x2 are in a ball of radius ε around
the origin. It is guaranteed that Post(a) is satisfied in time at most

    max{ tf + ln( √(2 x1(tf )²) / ε ) / k1 ,  ln( √(2 x2(0)²) / ε ) / k1 }

after switching to control mode a.
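The convergence of the closed loop (3.9) can be checked numerically. The sketch below integrates (3.9) with forward Euler under assumed gains k1 = k3 = 1 and exponent r = 1/3; it is an illustration of the vector field's asymptotic behavior, not the controller implementation used in the thesis.

```python
import math

def simulate_mode_a(x, k1=1.0, k3=1.0, r=1/3, dt=1e-3, T=20.0):
    """Forward-Euler integration of (3.9); r = m/n with m < n odd."""
    x1, x2, x3 = x

    def pow_r(v):
        # sign-preserving real odd root |v|^r * sign(v)
        return math.copysign(abs(v) ** r, v)

    for _ in range(int(T / dt)):
        dx1 = -k1 * x1 + k3 * pow_r(x3) * x2
        dx2 = -k1 * x2
        dx3 = -k3 * pow_r(x3)
        x1, x2, x3 = x1 + dt * dx1, x2 + dt * dx2, x3 + dt * dx3
    return x1, x2, x3

x1, x2, x3 = simulate_mode_a((0.5, -1.0, 0.8))
print(max(abs(x1), abs(x2), abs(x3)) < 1e-2)  # True: near the origin
```

Note that x3 reaches (a numerical neighborhood of) zero in finite time, after which x1 and x2 decay exponentially, matching the case analysis at t = tf above.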
3.6.2 Control Modes b and c: Catch and Release
In control modes b and c, the robot's arm maneuvers to pick up and release an
object, respectively. The arm is mounted on the mobile platform at a height hp. The
lengths of the two arm links are l1, l2, and the corresponding joint angles are ψ1, ψ2. Let
Ψ ≝ (ψ1, ψ2)ᵀ. The workspace of the arm is the set of end-effector absolute positions
pa = (pxa, pya, pza)ᵀ ∈ R³, reachable in the sense that given Xp = (x, y, ϑ) we have

    √( (pxa − x)² + (pya − y)² + (pza − hp)² ) ∈ [ |l1 − l2|, |l1 + l2| ]
    tan ϑ = (pya − y) / (pxa − x) .                                        (3.10)
If (3.10) is true, we write pa ∈ W (Xp). The system is kinematically redundant: for
a given pa, many postures Xp can satisfy (3.10). Let the set of all these postures be
W−1(pa).
Inverse kinematics yields the joint angles Ψd ≝ (ψ1d, ψ2d)ᵀ that position the
end-effector at a desired pa:

    ψ2d = cos⁻¹ [ ( (pxa − x)² + (pya − y)² + (pza − hp)² − (l1² + l2²) ) / (2 l1 l2) ]

    ψ1d = tan⁻¹ [ ( (pza − hp)(l1 + l2 cos ψ2d) − l2 √((pxa − x)² + (pya − y)²) sin ψ2d )
                  / ( l2 (pza − hp) sin ψ2d + √((pxa − x)² + (pya − y)²) (l1 + l2 cos ψ2d) ) ] .
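The inverse kinematics above can be validated against the forward kinematics of a planar two-link arm. In the sketch below, d stands for the horizontal distance √((pxa − x)² + (pya − y)²) from the base to the target, and h = pza − hp for the target height above the shoulder; the function names and test values are ours.

```python
import math

def two_link_ik(d, h, l1, l2):
    """Joint angles (psi1, psi2) placing the end-effector at (d, h)."""
    c2 = (d * d + h * h - l1 * l1 - l2 * l2) / (2 * l1 * l2)
    psi2 = math.acos(c2)                        # elbow angle
    psi1 = math.atan2(
        h * (l1 + l2 * math.cos(psi2)) - l2 * d * math.sin(psi2),
        l2 * h * math.sin(psi2) + d * (l1 + l2 * math.cos(psi2)))
    return psi1, psi2

def forward(psi1, psi2, l1, l2):
    """End-effector (d, h) from joint angles; used to validate the IK."""
    return (l1 * math.cos(psi1) + l2 * math.cos(psi1 + psi2),
            l1 * math.sin(psi1) + l2 * math.sin(psi1 + psi2))

psi1, psi2 = two_link_ik(0.25, 0.10, l1=0.2, l2=0.2)
d, h = forward(psi1, psi2, 0.2, 0.2)
print(abs(d - 0.25) < 1e-9 and abs(h - 0.10) < 1e-9)  # True
```

The ψ1d expression is the tangent-subtraction form of the usual atan2(h, d) − atan2(l2 sin ψ2, l1 + l2 cos ψ2), which is why the round trip through the forward kinematics recovers the target exactly.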
Let Ψh ≝ (ψ1h, ψ2h)ᵀ denote the center of the arm's workspace, the joint angle com-
bination for which the distance between the end-effector and the workspace boundary is
maximized. With the workspace being a compact set, the existence of this joint angle
configuration is ensured.

The error in joint angles is written Eψ(t) ≝ Ψ(t) − Ψd. With direct joint angle
control, and with steady state considered reached when |ψ1 − ψ1d| ≤ ε and |ψ2 − ψ2d| ≤ ε,
vector fields fb and fc are defined by the closed-loop joint error dynamics Ėψ = −K Eψ,
where K ≝ diag(b1, b2). The difference between the two control modes is that while in
mode b the arm's gripper is initially open and closes to grasp the object at the desired
end-effector position, in mode c the originally closed gripper opens at the arm's desired
configuration. With the arm anywhere within its workspace, the maximum time to
complete a pick (b) or place (c) maneuver is

    Tj = max{ 2 ln( |ψ1h − ψ1d| / ε ) / b1 ,  2 ln( |ψ2h − ψ2d| / ε ) / b2 } .
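A quick numeric check of the settling-time bound Tj: under the error dynamics Ėψ = −K Eψ, each error component decays as e_i(t) = (ψih − ψid) e^{−bi t} when starting from the workspace center. The gains, tolerance, and angles below are illustrative values, not from the thesis.

```python
import math

# Illustrative values: decay gains, tolerance, workspace center, target.
b1, b2, eps = 2.0, 3.0, 0.01
psi_h = (0.0, math.pi)
psi_d = (0.8, 2.0)

Tj = max(2 * math.log(abs(psi_h[0] - psi_d[0]) / eps) / b1,
         2 * math.log(abs(psi_h[1] - psi_d[1]) / eps) / b2)

# Error of each joint at time Tj under e_i(t) = |dpsi_i| * exp(-b_i t)
errors = [abs(h - d) * math.exp(-b * Tj)
          for h, d, b in zip(psi_h, psi_d, (b1, b2))]
print(Tj > 0 and all(e <= eps for e in errors))  # True
```

The factor of 2 in Tj makes the bound conservative: each residual error at Tj is of order ε²/|ψih − ψid|, well inside the ε tolerance.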
3.6.3 The System Model
We model the robot as a hybrid agent H = 〈Z, P, Σ, ι, πi, AP, fσ, Pre, Post, s, ∆H〉
with components:

    Z = X × L                          set of composite states ᵃ
    P = R² × S¹ × R³                   set of control parameters ᵇ
    Σ = {a, b, c}                      set of control modes ᶜ
    ι = {(a, 1), (b, 2), (c, 3)}       indexing bijection on Σ
    πi, i ∈ {1, 2, 3}                  projection functions on p ∈ P ᵈ
    fσ, σ ∈ Σ                          parameterized vector fields ᶜ
    AP = {α1, α2, α3, α4}              indexed atomic propositions ᵉ
    Pre : Σ → C                        precondition of mode σ ∈ Σ ᶠ
    Post : Σ → C                       postcondition of mode σ ∈ Σ ᶠ
    s : Z × P → 2^P                    system parameter reset map ᵍ
    ∆H : Z × P × Σ → Z × P × Σ         mode transition map ʰ
ᵃ X = R² × S¹ × S² × R³ is the set of continuous variables describing the posture of
the platform Xp ∈ R² × S¹, the joint angles of the arm Ψ ∈ S², and the position of
the manipulated object Xo ≝ (xo, yo, zo)ᵀ ∈ R³. Here, L = {g} contains a single
Boolean variable g that expresses whether the gripper is closed (g = 1), or not
(g = 0).

ᵇ The parameter vector p = (ppᵀ, paᵀ)ᵀ ∈ P describes the desired posture pp ∈ R² × S¹
for the mobile platform and the absolute position reference pa ∈ R³ for the arm's
end-effector. Component pa ∈ R³ parameterizes modes b, c.

ᶜ In control mode a, the mobile platform evolves according to fa and converges to
a desired posture pp; in control mode b, the joint angles evolve under fb, and the arm
picks up an object at Xo and holds it; in mode c, the joint angles evolve under fc
and the arm releases the object at pa.

ᵈ Defined as π1(p) ≝ pp, π2(p) ≝ pa, π3(p) ≝ pa.

ᵉ Proposition α1 is Xp ∈ pp ⊕ Bε(0); when true, it means that the platform is
ε-close to its reference position. Proposition α2 is Xo ∈ pa ⊕ Bε(0); when true, the
object is in an ε-neighborhood of position pa. Proposition α3 is pa ∈ W(pp); when true,
it means that, with the platform at pp, the parameter component pa, specifying a reference
location for the end-effector, is within the reachable workspace. Proposition α4 is
true iff g = 1.

ᶠ C is the set of logical sentences obtained from AP. Table 3.1 summarizes the Pre
and Post for each mode.

ᵍ For p = (ppᵀ, paᵀ)ᵀ and p′ = (p′pᵀ, p′aᵀ)ᵀ, writing p′ ∈ s(z, p) implies that p′a ∉ pa + Bε(0)
or p′p ∉ pp + Bε(0).
ʰ Exactly as in Definition 3.

The induced register semiautomaton for H is the tuple R(H) = 〈Q, Σ, P ∪ {∅}, τ, ∆R〉,
with:
Table 3.1: Pre and Post maps for the control modes of the hybrid agent.

            a       b                          c
    Pre     ¬α1     α1 ∧ α2 ∧ α3 ∧ (¬α4)       α1 ∧ (¬α2) ∧ α3 ∧ α4
    Post    α1      α1 ∧ (¬α2) ∧ α3 ∧ α4       α1 ∧ α2 ∧ α3 ∧ (¬α4)
    Q                        set of states ᵅ
    τ : {1} → P ∪ {∅}        register assignment
    ∆R                       transition relation ᵝ ᵞ

ᵅ This set can be practically restricted to {0000, 1000, 1110, 1011, 1001, 0110, 0001}.
More states exist, but for this task of reaching qf from q0 (see Section 3.6.4), the
remaining states are either unreachable from q0, or cannot reach qf , and thus are
ignored.

ᵝ The read transitions are the following:

    (0000, ϕr) →a (1000, right),  (1110, ϕr) →b (1011, right),  (1011, ϕr) →c (1110, right),
    (0001, ϕr) →a (1001, right),  (0110, ϕr) →a (1110, right).

ᵞ The write transitions are the following:

    (1000, ϕw) →a (0000, stay),  (1000, ϕw) →b (1110, stay),  (1011, ϕw) →a (0001, stay),
    (1011, ϕw) →c (1011, stay),  (1001, ϕw) →c (1011, stay),  (1110, ϕw) →a (0110, stay).
The set-valued maps λ appearing in the write transitions of R(H) are defined
through (3.1):

    λ(τ ; 1000, 0000, a) = {p′ ∈ P | p′a = pa, p′p ∈ R² × S¹ \ W⁻¹(pa) \ {pp}}
    λ(τ ; 1000, 1110, b) = {p′ ∈ P | p′p = pp ∈ W⁻¹(Xo), p′a = Xo}
    λ(τ ; 1011, 0001, a) = {p′ ∈ P | p′p ∈ R² × S¹ \ W⁻¹(pa), p′a = pa}
    λ(τ ; 1011, 1011, c) = {p′ ∈ P | p′p = pp, p′a ∈ W(pp) \ {pa}}
    λ(τ ; 1001, 1011, c) = {p′ ∈ P | p′p = pp, p′a ∈ W(pp)}
    λ(τ ; 1110, 0110, a) = {p′ ∈ P | p′p ∈ W⁻¹(pa) \ {pp}, p′a = pa} .
[Figure: state-transition diagram omitted; states 0000, 1000, 1110, 1011, 1001, 0110
(each paired with register content p or p′), with edges labeled by modes a, b, c and
maps λ1, . . . , λ6, θ.]

Figure 3.2: The transformation semiautomaton TR(H) of hybrid agent H, for the task
specification considered.
For a fixed tuple (q, q′, σ) ∈ Q × Q × Σ, the set-valued map λ(· ; q, q′, σ) maps τ ∈ P
to a subset of P , and can be inverted on appropriate subsets of P :

    λ⁻¹(p′; 1000, 0000, a) = {p ∈ P | pp ∈ R² × S¹ \ ({p′p} ∪ W⁻¹(pa)), pa = p′a}
    λ⁻¹(p′; 1000, 1110, b) = {p ∈ P | pp = p′p ∈ W⁻¹(p′a), pa ∈ R³ \ W(p′p)}
    λ⁻¹(p′; 1011, 0001, a) = {p ∈ P | pp ∈ W⁻¹(p′a), pa = p′a}
    λ⁻¹(p′; 1011, 1011, c) = {p ∈ P | pp = p′p, pa ∈ W(p′p) \ {p′a}}
    λ⁻¹(p′; 1001, 1011, c) = {p ∈ P | pp = p′p, pa ∈ R³ \ W(p′p)}
    λ⁻¹(p′; 1110, 0110, a) = {p ∈ P | pp ∈ W⁻¹(p′a) \ {p′p}, pa = p′a} .
The transformation semiautomaton TR(H) = 〈Q, Σ, ∆R〉 is described graphi-
cally in Fig. 3.2. The assignment of labels λi to transitions in R(H) is done by the
function Λ : {λ1, . . . , λ6} → Q × Q. Explicitly, Λ(λ1) = (1000, 0000), Λ(λ2) = (1000, 1110),
Λ(λ3) = (1011, 0001), Λ(λ4) = (1001, 1011), Λ(λ5) = (1011, 1011), and Λ(λ6) =
(1110, 0110).
3.6.4 Task Specifications
Given some initial configuration for the robot, Xp(0) = (0, 1, π/4)ᵀ, Ψ(0) = Ψh =
(0, π)ᵀ, g = 0, and the manipulated object Xo(0) = (−1, 2, 0.3)ᵀ, we seek a time-optimal
plan for the robot to pick up the object and deliver it to a user located at Xu = (2, 3, 0.4)ᵀ.
To avoid trivial solutions, we assume that Xo(0) ∉ W(Xp(0)) and W⁻¹(Xo(0)) ∩
W⁻¹(Xu) = ∅, which means that the object is not within the vicinity of the initial base location,
and that the arm cannot deliver the object to the user without the robot base having
to reposition itself.
Assume the register is initialized with p0 = (Xp(0)ᵀ, Xuᵀ)ᵀ, which sets the register
semiautomaton to state 1000. When the user receives the object at time tf , the system
state holds constant for t > tf . At time tf , we have Xu ∈ W(Xp(tf )), Xp(tf ) ∈ π1(pf ) + Bε(0),
Xo = Xu, and g = 0; thus α1, α2, α3 evaluate true. The semiautomaton would then be
at state 1110, while π2(pf ) = Xu. Thus, when (z, p) satisfies Spec, this means that
VM(z, p) = 1110, and π2(p) = π2(pf ) = Xu. The objective is thus to find the shortest
walk w from (1000, p) to (1110, p) in TR(H), which ensures that (3.5) is satisfied for some
data word in the family w given by the translation procedure.
3.6.5 Solving the Planning Problem
For the dfa obtained from TR(H) with initial (1000, p) and final (1110, p)
states, the equivalent RE(H) is:

    RE(H) = (λ1 a)* ( λ2 b (λ3 a λ4 c + (λ5 + θ) c) ( θ b (λ3 a λ4 c + (λ5 + θ) c) + λ6 a )* ).    (3.11)

By replacing the Kleene stars in (3.11) with natural numbers, we obtain strings that
correspond to walks of certain length in the graph of TR(H). Let us denote by W(m) the
set of walks of length m. Substitution in (3.11) verifies that every walk has to be
of even length with m > 3. For m = 4, we find W(4) = {λ2 b λ5 c, λ2 b θ c}. Walk w =
λ2 b λ5 c translates to w = (b, p1)(c, p2) and {Mj(·)}_{j=1}^2 = {M1 = λ(· ; Λ(λ2), b), M2 =
λ(· ; Λ(λ5), c)}, in which Λ(λ2) = (1000, 1110) and Λ(λ5) = (1011, 1011). The
resulting map gives M(p0) = M2 ◦ M1((Xp(0)ᵀ, Xuᵀ)ᵀ) = ∅, since M1((Xp(0)ᵀ, Xuᵀ)ᵀ) =
{(Xp(0)ᵀ, p′aᵀ)ᵀ | p′a ∈ {Xo(0)} ∩ W(Xp(0))} = ∅ because the initial position of the
object is not within the workspace of the mobile platform. The same procedure applies
to the other walk, and it turns out that neither satisfies (3.5). For m = 6, (3.11)
generates W(6) = {λ1 a λ2 b λ5 c, λ1 a λ2 b θ c, λ2 b λ3 a λ4 c, λ2 b λ5 c λ6 a, λ2 b θ c λ6 a},
all of which are rejected. For example, walk w = λ1 a λ2 b λ5 c translates to w =
(a, p1)(b, p2)(c, p3) and M(·) = λ(· ; Λ(λ5), c) ◦ λ(· ; Λ(λ2), b) ◦ λ(· ; Λ(λ1), a), so M(p0) =
{(p′pᵀ, p′aᵀ)ᵀ | p′p ∈ W⁻¹(Xo(0)), p′a ∈ W(p′p) \ {Xo(0)}}; but for all p′p ∈ W⁻¹(Xo(0)), one
has Xu ∉ W(p′p) \ {Xo(0)}, unless the user can get the object without the robot's
base moving, which is trivial.
Finally, for m = 8, we find a walk w = λ1 a λ2 b λ3 a λ4 c, which translates to w =
(a, p1)(b, p2)(a, p3)(c, p4), with M1(·) = λ(· ; Λ(λ1), a), M2(·) = λ(· ; Λ(λ2), b), M3(·) =
λ(· ; Λ(λ3), a), and M4(·) = λ(· ; Λ(λ4), c). Since the composition of maps M(p0) =
{(p′pᵀ, p′aᵀ)ᵀ | p′p ∈ R² × S¹, p′a ∈ W(p′p)} allows p′a = Xu, this walk is a candidate,
and the search is terminated.
Now we resort to DP to obtain the optimal sequence of parameter vectors
pi = (ppiᵀ, paiᵀ)ᵀ for i = 1, . . . , 4 = N . We have p4 = pf , which must satisfy π2(p4) = Xu.
The range of possible parameter values at each stage is:

    P1 = M1((Xp(0)ᵀ, Xuᵀ)ᵀ) ∩ (M4 ◦ M3 ◦ M2)⁻¹( W⁻¹(Xu) × {Xu} )
       = ( R² × S¹ \ ({Xp(0)} ∪ W⁻¹(Xu)) ) × {Xu} ∩ W⁻¹(Xo(0)) × R³
       = W⁻¹(Xo(0)) × {Xu}

    P2 = (M4 ◦ M3)⁻¹( W⁻¹(Xu) × {Xu} ) ∩ M2 ◦ M1((Xp(0)ᵀ, Xuᵀ)ᵀ)
       = ( R² × S¹ \ W⁻¹(Xu) ) × ( R³ \ {Xu} ) ∩ W⁻¹(Xo(0)) × {Xo(0)}
       = W⁻¹(Xo(0)) × {Xo(0)}

    P3 = M3 ◦ M2 ◦ M1((Xp(0)ᵀ, Xuᵀ)ᵀ) ∩ M4⁻¹( W⁻¹(Xu) × {Xu} )
       = ( R² × S¹ \ W⁻¹(Xo(0)) ) × {Xo(0)} ∩ W⁻¹(Xu) × ( R³ \ {Xu} )
       = W⁻¹(Xu) × {Xo(0)}

    P4 = W⁻¹(Xu) × {Xu} .
We discretize the domain of parameter pp using a polar coordinate system, in which the
radial increment between successive parameter settings is 0.06 m, and the angular in-
crement is 10°. Figure 3.3 shows two sets of possible parameter settings for the position
component of pp, clustered around the object's position at Xo = (−1, 2, 0.3)ᵀ, and the
user's location at Xu = (2, 3, 0.4)ᵀ. The geometric parameters of the robot are l1 = l2 =
0.2 m and hp = 0.15 m. Then, after setting

    rmin(z) = √( max{0, (l1 − l2)² − (z − hp)²} ),    rmax(z) = √( (l1 + l2)² − (z − hp)² ),

the domains P1, P2, P3 and P4 are covered by sets of points {p1[k]}_{k=1}^{N1},
{p2[k]}_{k=1}^{N2}, {p3[k]}_{k=1}^{N3}, and {p4[k]}_{k=1}^{N4}, respectively, where

    N1 = N2 = 36 ⌊ (rmax(0.3) − rmin(0.3)) / 0.06 ⌋ ,    N3 = N4 = 36 ⌊ (rmax(0.4) − rmin(0.4)) / 0.06 ⌋ .
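The counts N1, . . . , N4 follow directly from the stated increments and geometry; a quick check using only the values given in the case study (0.06 m radial steps, 10° angular steps giving 36 sectors, l1 = l2 = 0.2 m, hp = 0.15 m):

```python
import math

l1 = l2 = 0.2   # link lengths [m]
hp = 0.15       # arm mounting height [m]

def r_min(z):
    return math.sqrt(max(0.0, (l1 - l2) ** 2 - (z - hp) ** 2))

def r_max(z):
    return math.sqrt((l1 + l2) ** 2 - (z - hp) ** 2)

# 36 angular sectors times the number of 0.06 m radial rings
N1 = N2 = 36 * math.floor((r_max(0.3) - r_min(0.3)) / 0.06)  # object at z = 0.3
N3 = N4 = 36 * math.floor((r_max(0.4) - r_min(0.4)) / 0.06)  # user at z = 0.4
print(N1, N3)
```

With l1 = l2, the inner radius rmin vanishes, so the rings cover the full disk of radius rmax(z) around each target.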
[Figure: plot omitted; axes X ∈ [−1.5, 2.5], Y ∈ [1, 3.5].]

Figure 3.3: Discretized workspace for the mobile manipulator and optimal path. The
two concentric collections of points mark parameter class representatives around the
object and user positions.
The DP algorithm described in Section 3.5 runs as follows.

i = 4: For every p3[k], compute (3.6)

    P4*(p3[k]) = arg min_{p4[j] ∈ P4} gc(p3[k], p4[j]) .

i = 3, 2: For every p2[k] and p1[k], compute (3.7)

    P3*(p2[k]) = arg min_{p3[j] ∈ P3} { ga(p2[k], p3[j]) + J4*(p3[j]) }
    P2*(p1[k]) = arg min_{p2[j] ∈ P2} { gb(p1[k], p2[j]) + J3*(p2[j]) }

i = 1: Finish by evaluating (3.8) for z0 = (Xp(0)ᵀ, Xo(0)ᵀ)ᵀ

    P1*(p0) = arg min_{p1[j] ∈ P1} { ga(z0, p1[j]) + J2*(p1[j]) } .
We find p1* = (−0.60, 1.85, 2.79, 2, 3, 0.4), p2* = (−0.60, 1.85, 2.79, −1, 2, 0.3), p3* =
(2.10, 2.91, 2.62, −1, 2, 0.3), and p4* = (2.10, 2.91, 2.62, 2, 3, 0.4). The accumulated cost
is Jw(z0, p0) = gc + ga + gb + ga = 5.49 + 31.72 + 6.08 + 35.80 = 79.09 seconds. The resulting
path of the mobile manipulator on the horizontal plane is shown in Fig. 3.3.
3.7 Conclusions
In this chapter, an abstraction method for a special class of hybrid systems is in-
troduced, which approximates the original system with a discrete system of manageable
size and is independent of control objectives. The method depends on the convergent
continuous dynamics of the system, which affords a partitioning of the continuous state
space based on the asymptotic properties of the vector fields, and on the capacity of the
system to re-parametrize its continuous controllers. With this abstraction method, an
abstraction-based optimal planning method is demonstrated. Although the
method applies only to this special class of systems, the class
represents a wide range of systems for which low-level stable controllers have been
designed and can be reused to achieve complicated control objectives.
The partitioning introduced by this abstraction gives rise to purely discrete
abstract systems (with no dynamics on the continuous values) which are weakly simulated
by the underlying concrete hybrid dynamics. Since the method does not require a state
quantization, the state-explosion problem is avoided and the solution thus scales to
practical applications. Moreover, as the abstract model is derived directly from the
dynamics of the underlying hybrid agents, there is no dependence on the control
objective. Hence, once the specification or objective changes, the same abstract
model can be reused for optimal planning, with a newly designated initial state and
set of final states.
With the established weak (bi)simulation relation, it is guaranteed that any
plan generated by this abstraction-based planning method is implementable in the
concrete hybrid system, provided the model of the system is correct and there is no
external disturbance from the environment. Due to the over-approximation incurred
in the process of abstraction, the plan is in general suboptimal, except under some special
conditions which are identified.
Chapter 4
ADAPTIVE CONTROL SYNTHESIS WITH PERFECT INFORMATION
4.1 Overview
In this chapter, we show that game theory and grammatical inference can be
jointly utilized to synthesize and implement adaptive controllers for finite-state tran-
sition systems operating in unknown, dynamic, and potentially adversarial environ-
ments. Finite-state transition systems can arise as discrete abstractions of dynamical
systems [5, 13, 92, 93]. The synthesis of controllers becomes adaptive in the sense that
the agent completes the information that is missing from its model about its environ-
ment, and subsequently updates its control policy, during execution time.
Reactive synthesis and algorithmic game theory have been introduced for formal
control design in the presence of dynamic environments, in which the system computes
the control output based on real-time information [34, 63, 106]. In that line of work, an assump-
tion on the environment is known [63], and the specification of the system is satisfiable
provided the environment dynamics satisfy this assumption. For cases when the envi-
ronment is partially unknown and continuously discovered as the system executes its
actions, an iterative planning framework has been developed [70]. However, the re-synthesized
motion plan cannot guarantee satisfaction of the original task specification.
The following question is raised in this dissertation: is there a method to convert
the problem of designing a symbolic controller for a system that interacts with an un-
known adversarial environment into a synthesis problem where the environment dynamics
are known? If so, several known design solutions, e.g., algorithmic game theory [45] and
discrete event system (des) control theory [21], could be applied. This chapter answers
this question in the affirmative by proposing a framework that incorporates learning into
control design at the abstract level. The identification of the environment model, and
subsequently of all possible interactions with it, is essentially a process of inference: the
generalization of a formal object that can describe this model from a finite amount
of observed behaviors. Grammatical inference (GI), a sub-field of machine learn-
ing, is a paradigm that identifies formal objects through presentations of examples, with
or without a teacher [30], and thus the methodology naturally fits this problem
formulation.
Fig. 4.1 serves as a graphical description of our framework integrating learn-
ing with control synthesis. With product operations, we combine the system, its task
specification, and its unknown dynamic environment into a game. On a game graph
constructed from the agent's inferred model of its environment, a controller is then
derived. The environment model may be crude and incorrect in the beginning, but as
the agent collects more observations, its grammatical inference module refines it, and,
under certain conditions on the observation data and the environment model structure,
the fidelity of the continuously updated model converges to a point where a controller
is found, whenever the latter exists.
[Figure: block diagram omitted; blocks include robot(s)/environment, abstraction,
identification (GIM), planning/learning, control, transition systems, actuators,
sensors, and the specification, connected in a feedback loop.]

Figure 4.1: The architecture of hybrid planning and control with a module for gram-
matical inference.
The framework is modular and flexible: different types of grammatical infer-
ence algorithms can be applied to learn the behavior of the environment under certain
conditions, without imposing constraints on the method to be used for control; learn-
ing is decoupled from control. In the following, we propose two different approaches
for implementing adaptive control design within this framework and demonstrate their
effectiveness with a robot motion planning example.
4.2 Grammatical Inference and Infinite Games
This section gives some background on temporal logic, algorithmic games, and
grammatical inference.
4.2.1 Infinite Words
Given a finite alphabet Σ, the set of infinite sequences is denoted Σω. A word
w ∈ Σω is called an ω-word. An ω-regular language L is a subset of Σω. The set of prefixes
of an ω-regular language L is denoted Pr(L) = {u ∈ Σ* | (∃w ∈ L)(∃v ∈ Σω)[uv = w]}.
Given an ω-word w, Occ(w) denotes the set of symbols occurring in w, and Inf(w) is
the set of symbols occurring infinitely often in w. Given a finite word w ∈ Σ*, last(w)
denotes the last symbol of w. We refer to the (i + 1)-th symbol in a word w by writing
w(i); the first symbol in w is indexed with i = 0.
We extend the definition of automata in Section 3.2.1 to machines that
accept ω-regular languages. An automaton is a quintuple A = 〈Q, Σ, T, I, Acc〉 where
〈Q, Σ, T 〉 is a semiautomaton (sa) deterministic in transitions, I is the set of initial
states, and Acc is the acceptance component. A word w = σ0σ1 . . . generates the run
ρw = q0q1 . . . in A if and only if T (qi, σi) = qi+1, for 0 ≤ i < |w|. Different types of
acceptance components give rise to:

• finite state automata, in which case Acc = F ⊆ Q, and A accepts w ∈ Σ* if the
run ρw ∈ Q* satisfies ρw(0) ∈ I and last(ρw) ∈ F , and

• Büchi automata, in which case Acc = F ⊆ Q, and A accepts w ∈ Σω if the run
ρw ∈ Qω satisfies ρw(0) ∈ I and Inf(ρw) ∩ F ≠ ∅.
The set of (in)finite words accepted by A is the language of A, denoted L(A).
An automaton is deterministic if it is deterministic in transitions and I is a singleton.
In this case, with a slight abuse of notation, we denote by I the single initial state. A
dfa with the fewest states recognizing a language L is called a canonical
automaton for L. Unless otherwise specified, we understand that A is the sa obtained
from an fsa A by unmarking the initial and final states of A.

An automaton is complete if for any q ∈ Q and σ ∈ Σ, T (q, σ) is defined. Any
automaton can be made complete by adding a non-final state sink such that whenever
T (q, σ) is undefined for some q ∈ Q and σ ∈ Σ, we let T (q, σ) = sink.
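For ultimately periodic words of the form u vω, Büchi acceptance on a deterministic complete automaton can be decided by running the automaton until the state at the start of a v-block repeats; the symbols occurring infinitely often in the run are exactly those inside the repeating cycle. The automaton below is an illustrative two-state example, not one from this dissertation.

```python
def buchi_accepts(T, q0, F, u, v):
    """T: dict (state, symbol) -> state (deterministic, complete on the
    symbols used).  Accepts u v^omega iff the run visits F infinitely often."""
    q = q0
    for a in u:                       # consume the finite prefix u
        q = T[(q, a)]
    seen = {}                         # state at the start of each v-block
    blocks = []                       # states visited within each v-block
    while q not in seen:
        seen[q] = len(blocks)
        block = []
        for a in v:
            q = T[(q, a)]
            block.append(q)
        blocks.append(block)
    # Inf(run) = states inside the repeating cycle of v-blocks
    inf_states = set().union(*blocks[seen[q]:])
    return bool(inf_states & set(F))

# q1 is visited infinitely often iff the word has infinitely many a's.
T = {('q0', 'a'): 'q1', ('q0', 'b'): 'q0',
     ('q1', 'a'): 'q1', ('q1', 'b'): 'q0'}
print(buchi_accepts(T, 'q0', {'q1'}, u='bb', v='ab'))  # True
print(buchi_accepts(T, 'q0', {'q1'}, u='ab', v='b'))   # False
```

This lasso-shaped check is the standard way the Inf(·) condition above is evaluated on concrete, finitely represented ω-words.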
4.2.2 Grammatical Inference
A positive presentation φ of a language L is a total function φ : N → L ∪ {#}
such that for every w ∈ L, there exists n ∈ N such that φ(n) = w [55]. Here # denotes
a pause, a moment in time when no information is forthcoming. A presentation φ can
also be understood as an infinite sequence φ(0)φ(1) · · · containing every element of L,
interspersed with pauses. Let φ[i] denote the finite sequence φ(0)φ(1) . . . φ(i).

Grammars are finite descriptions of potentially infinite languages. The language
of a grammar G is L(G). A learner (learning algorithm, or grammatical inference
machine (GIM)) is a program that takes the first i elements of a presentation, i.e.,
φ[i], and outputs a grammar G, written GIM(φ[i]) = G. The grammar output
by the GIM is the learner's hypothesis of the language. A learner GIM identifies in the
limit from positive presentations a class of languages L if for all L ∈ L, and for all
presentations φ of L, there exists an n ∈ N such that for all m ≥ n, GIM outputs a
grammar GIM(φ[m]) = G with L(G) = L [44].
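A standard textbook illustration of identification in the limit (not taken from this dissertation) is the learner that conjectures, as its grammar, exactly the set of strings observed so far; this GIM identifies the class of finite languages from positive presentations, since after every element of L has appeared the hypothesis never changes.

```python
def gim(prefix):
    """prefix = phi[i]: a list of strings and '#' pauses.  The hypothesis
    grammar is represented extensionally, as a frozenset of strings."""
    return frozenset(w for w in prefix if w != '#')

L = {'ab', 'aab'}                       # target finite language
phi = ['ab', '#', 'aab', '#', 'ab']     # a prefix of a positive presentation

# The learner's successive hypotheses GIM(phi[0]), GIM(phi[1]), ...
hypotheses = [gim(phi[:i + 1]) for i in range(len(phi))]
print(hypotheses[-1] == frozenset(L))   # True: converged to the target
```

Richer language classes require more sophisticated generalization; the point of the definition is only that the hypothesis stabilizes on a correct grammar after finitely many observations.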
4.2.3 Specification Language
We use ltl [33] to concisely specify desired system properties such as response,
liveness, safety, stability, and guarantee [28]. Informally speaking, ltl allows one to
reason about the change over time of the truth values of logical propositions. ltl is
built recursively from a set of predicates P as follows:

    ϕ := p | ¬ϕ | ϕ1 ∨ ϕ2 | ◯ϕ | ϕ1 U ϕ2 | ⊤ | ⊥,    p ∈ P,

where ⊤ and ⊥ are unconditional true and false, respectively, and "next" (◯) and
"until" (U) are temporal operators.

Given negation (¬) and disjunction (∨), we can define conjunction (∧), impli-
cation ( =⇒ ), and equivalence (⇔). Additional temporal operators can be derived,
such as "eventually" (♦ϕ ≝ ⊤ U ϕ) and "always" (□ϕ ≝ ¬♦¬ϕ).

An ltl formula ϕ over P can be translated into an ω-regular language over the
alphabet 2^P, and one can construct a Büchi automaton that accepts this language with
the methods in [9, 41].
4.2.4 Infinite Games
This section briefly reviews deterministic turn-based, two-player zero-sum games
with perfect information.

Definition 8 ( [45]). A two-player turn-based zero-sum game is a tuple G = 〈V1 ∪ V2,
Σ1 ∪ Σ2, T, I, F 〉, where 1) Vi is the set of states at which player i moves, with
V1 ∩ V2 = ∅ and V = V1 ∪ V2; 2) Σi is the set of actions for player i, with Σ1 ∩ Σ2 = ∅;
3) T : Vi × Σi → Vj is the transition function, where (i, j) ∈ {(1, 2), (2, 1)}; 4) I is
the set of initial game states; and 5) F ⊆ V1 ∪ V2 is the winning condition: a run ρ
is winning for player 1 if Occ(ρ) ∩ F ≠ ∅ in reachability games, Occ(ρ) ⊆ F in safety
games, and Inf(ρ) ∩ F ≠ ∅ in Büchi games.

A run ρ = v0v1v2 . . . ∈ V * (resp. V ω) is a finite (resp. infinite) sequence of states such that
for any 0 ≤ i < |ρ|, there exists σ ∈ Σ with T (vi, σ) = vi+1. A play p = v0σ0v1σ1 . . . ∈
(V ∪ Σ)* (or (V ∪ Σ)ω) is a finite (or infinite) interleaved sequence of states and actions
such that the projection of p onto V is a run ρ in the game, and the projection of p
onto Σ is a word that generates the run ρ.
A strategy for player i in game G is a function Si : V *Vi → 2^{Σi} that takes a
run ρ and outputs a set of actions for player i to take. It satisfies the condition that,
for any run ρ ∈ V *Vi, σ ∈ Si(ρ) implies that T (last(ρ), σ) is defined. A memoryless
strategy Si : Vi → 2^{Σi} outputs an action for player i depending only on the current
state of the game. For reachability and Büchi games, a memoryless winning strategy
always exists for one of the players [45].

We say player 1 follows strategy S1 in a play p = v0σ0v1σ1 · · · if for all n ≥ 1,
σ2n−2 ∈ S1(v0v1 · · · v2n−2). The definition of player 2 following strategy S2 is
obtained dually. An initialized game, denoted (G, v0), is the game G with a designated
initial state v0 ∈ I. A strategy WSi is winning for player i if every run in
(G, v0), with player i adhering to WSi, results in player i winning. The winning region
of player i, denoted Wini, is the set of states from which player i has a winning strategy.
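For reachability games, player 1's winning region and a memoryless winning strategy can be computed with the classical attractor fixed point: iteratively add player 1 states with some successor already winning, and player 2 states all of whose successors are winning. The game graph below is an illustrative toy, not one from this dissertation.

```python
def reach_attractor(V1, V2, edges, F):
    """edges: dict v -> {action: successor}.  Returns (Win1, memoryless
    strategy on player-1 states)."""
    win = set(F)
    strategy = {}
    changed = True
    while changed:
        changed = False
        for v in V1 | V2:
            if v in win:
                continue
            succ = edges.get(v, {})
            if v in V1 and any(t in win for t in succ.values()):
                # player 1 needs only one action into the winning set
                strategy[v] = next(s for s, t in succ.items() if t in win)
                win.add(v); changed = True
            elif v in V2 and succ and all(t in win for t in succ.values()):
                # player 2 cannot avoid the winning set
                win.add(v); changed = True
    return win, strategy

V1, V2 = {'a', 'c'}, {'b', 'd'}
edges = {'a': {'x': 'b'}, 'b': {'y': 'c', 'z': 'd'},
         'c': {'x': 'goal'}, 'd': {'y': 'd'}}
win, strat = reach_attractor(V1 | {'goal'}, V2, edges, F={'goal'})
print('c' in win, 'a' in win)  # True False: from b, player 2 escapes to d
```

The strategy recorded at each newly added player-1 state is memoryless by construction, matching the existence result cited above.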
We define a projection operator for a tuple s = (s1, . . . , sN ) ∈ S1 × S2 × . . . × SN
as πi((s1, . . . , sN )) = si, for 1 ≤ i ≤ N . For a set of tuples S, we write πi(S) =
⋃_{s∈S} πi(s), and for a sequence of tuples w = s1s2 . . . , we apply the operator element-
wise in the form πi(w) = πi(s1)πi(s2) . . . .
4.3 System Behavior as Game Play
First, the interaction between two dynamical systems (the agent and its
environment) is modeled as a two-player, turn-based, zero-sum game in which the task
specification of the system determines the winning condition for the system player.
4.3.1 Constructing the Game
Let AP be the set of atomic propositions describing world states (i.e., the state
of the combined agent-environment system). The set of world states C is defined to
be the set of all conjunctions of atomic propositions or their negations, i.e., C = {c =
ℓ1 ∧ ℓ2 ∧ . . . ∧ ℓn | (∃α ∈ AP)[ℓi = α ∨ ℓi = ¬α]}, such that each proposition
in AP appears at most once in any c ∈ C.
Assume now that the behavior of both the agent (player 1) and its environ-
ment (player 2) can be captured by some labeled transition system (lts), A1 =
〈Q1,Σ1, T1,AP1, LB1〉 for player 1, and A2 = 〈Q2,Σ2, T2,AP2, LB2〉, for player 2, where
for i = 1, 2, each component 〈Qi,Σi, Ti〉 is a sa, AP i is the set of atomic propositions
that can be changed by player i’s actions, AP = AP1 ∪ AP2, and LBi : Qi → C is a
labeling function.
We assume an action σ ∈ Σi has conditional effects:
1. the pre-condition of action σ, denoted Pre(σ) ∈ C, is a sentence that needs to
be satisfied for the player to initiate action σ, and
2. the post-condition of action σ, denoted Post(σ) ∈ C, is the sentence that is
satisfied when action σ is completed.
Given c ∈ C, if c =⇒ Pre(σ), then the effect of action σ on c, denoted σ(c) ∈ C, is
the unique world state after performing σ when the world state is c. These conditional
effects can be directly related with the pre- and post-conditions of control modes in a
hybrid agent (see Chapter 3).
Without loss of generality, we assume the alphabets of A1 and A2 to be disjoint,
i.e., Σ1 ∩ Σ2 = ∅. A player i may give up her turn, in which case we say
that she "plays" a generic (silent) action εi ∈ Σi. We assume Pre(εi) = Post(εi) = ⊤,
for i = 1, 2. In addition, εi(c) = c, i.e., a silent action cannot change the world state.
Definition 9 (Turn-based product). Given the models of system and environment
A1 = 〈Q1,Σ1, T1,AP1, LB1〉 and A2 = 〈Q2,Σ2, T2,AP2, LB2〉, the turn-based product
P = 〈Q,Σ, δ,AP , LB〉 is a lts denoted A1 A2, defined as follows:
Q = Q1 × Q2 × {0, 1} is the set of states, where the last component is a Boolean
variable t ∈ {0, 1} denoting whose turn it is to play: t = 1 for player 1, t = 0
for player 2.
Σ = Σ1 ∪ Σ2 is the alphabet.
δ is the transition relation: δ((q1, q2, t), σ) = (q′1, q2, 0) if t = 1 and q′1 = T1(q1, σ),
with LB1(q1) ∧ LB2(q2) =⇒ Pre(σ); and δ((q1, q2, t), σ) = (q1, q′2, 1) if t = 0 and
q′2 = T2(q2, σ), with LB1(q1) ∧ LB2(q2) =⇒ Pre(σ).
LB : Q → C is the labeling function, defined for (q1, q2, t) ∈ Q by LB(q1, q2, t) =
LB1(q1) ∧ LB2(q2) ≠ ⊥.
The time complexity of constructing P is polynomial in the sizes of the two
players' ltss.
The task specification is given as an ltl formula Ω over AP and can be translated
into a language over the set of world states [41], accepted by a complete deterministic
automaton As = 〈S, C, Ts, Is, Fs〉 with sink ∈ S. Intuitively, the task specification
encoded in As specifies a set of histories over the world states.
The turn-based product P gives snapshots of different stages of a game. It does
not capture the game history that led to each stage. We overcome this lack
of memory in P through another product operation, between P and As.
Definition 10 (Two-player turn-based game automaton). Given the turn-based prod-
uct P = 〈Q,Σ, δ, LB〉 and the task specification As = 〈S, C, Ts, Is, Fs〉, a two-player
turn-based game automaton is constructed as a special product of P and As, denoted
G = P nAs = (A1 A2)nAs = 〈V,Σ, T, I, F 〉, where
V = V1 ∪ V2, where V1 ⊆ {(q, s) | q = (q1, q2, 1) ∈ Q ∧ s ∈ S} is the set of states at
which player 1 makes a move and V2 ⊆ {(q, s) | q = (q1, q2, 0) ∈ Q ∧ s ∈ S} is
the set of states of player 2.

T : V × Σ → V is the transition relation, defined by T((q, s), σ) = (q′, s′) if and
only if δ(q, σ) = q′ and Ts(s, c) = s′ with c = LB(q′).

I = {(q, s) ∈ V | s = Ts(Is, LB(q))} is the set of possible initial game states.

F = {(q, s) ∈ V | s ∈ Fs} is the winning condition.
From P and As the game automaton G is constructed in time polynomial in the
size of P and As. With a slight abuse of notation, the labeling function in G is defined
as LB(v) = LB(π1(v)) where π1(v) ∈ Q.
For a fixed initial state v0 ∈ I, when As is a dfa, (G, v0) is a reachability game.
When As is a deterministic Büchi automaton (dba), (G, v0) is a Büchi game. The runs
in (G, v0) and As are related as follows:
ρ in G :   (q(0), s(0)) —σ1→ (q(1), s(1)) . . .
ρs in As : Is —LB(q(0))→ s(0) —LB(q(1))→ s(1) . . .   (4.1)

where LB(q(1)) = σ1(LB(q(0))).
If player 1 wins (G, v0), then the task specification encoded in As is satisfied when player 1 follows her winning strategy:

Proposition 3. For any winning run ρ ∈ V∗ (or Vω) of player 1 in G, LB(ρ) ∈ C∗ (or
Cω) is accepted by As.
Proof. Since ρ is winning for player 1, in a reachability (resp. Büchi) game, last(ρ) ∈ F (resp. Inf(ρ) ∩ F ≠ ∅). Projecting ρ on the state set S of As, we obtain last(π2(ρ)) ∈ π2(F) ⊆ Fs (resp. Inf(π2(ρ)) ∩ Fs ≠ ∅). Since the run in As corresponding to ρ is
ρs = Is π2(ρ) by (4.1), we have last(ρs) ∈ Fs (resp. Inf(ρs) ∩ Fs ≠ ∅), and thus the
word generating ρs, which is LB(ρ), is accepted by As by the definition of the acceptance
component.
4.3.2 Game Theoretic Control Synthesis
For a game G = 〈V1 ∪ V2, Σ1 ∪ Σ2, T, I, F〉 and a set of states X ⊆ V, the
attractor [45] of X, denoted Attr(X), is the largest set of states W ⊇ X in G from
which player 1 can force a run into X. It is defined recursively as follows. Let W0 = X
and set
Wi+1 := Wi ∪ {v ∈ V1 | (∃σ : T(v, σ) ↓)[T(v, σ) ∈ Wi]}
          ∪ {v ∈ V2 | (∀σ : T(v, σ) ↓)[T(v, σ) ∈ Wi]} .   (4.2)
Since G is finite, there exists a smallest m ∈ N such that Wm+1 = Wm = Attr(X).
If G is a reachability game, the winning region of player 1 is Win1 = Attr(F ) and
the winning region of player 2 is Win2 = V \Win1. Player 1 has a memoryless winning
strategy if the game starts at some initial state v0 ∈ Win1 ∩ I. Given v0 ∈ Win1 ∩ I, the
memoryless winning strategy WS1 is computed as follows: (1) obtain the subsets
Yi, i = 0, . . . , m, by letting Y0 = W0 = F and setting Yi := Wi \ Wi−1 for
all i ∈ {1, . . . , m}; (2) for v ∈ Yi ∩ V1 with 1 ≤ i ≤ m, define WS1(v) = {σ ∈ Σ1 | T(v, σ) ∈ Yi−1}.
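The attractor recursion (4.2) and the level-set strategy extraction admit a direct transcription. The encodings are illustrative (T maps (state, action) pairs to successor states), not the thesis' implementation:

```python
def attractor(V1, V2, T, X):
    """Level sets of (4.2): levels[i] = Y_i = W_i \\ W_{i-1}; union = Attr(X)."""
    W = set(X)
    levels = [set(X)]
    while True:
        new = set()
        for v in V1 - W:
            # player-1 state: some enabled action leads into W
            if any(u in W for (s, a), u in T.items() if s == v):
                new.add(v)
        for v in V2 - W:
            # player-2 state: every enabled action leads into W (and one exists)
            succ = [u for (s, a), u in T.items() if s == v]
            if succ and all(u in W for u in succ):
                new.add(v)
        if not new:
            return levels, W
        W |= new
        levels.append(new)

def reachability_strategy(V1, T, levels):
    """Memoryless WS1: at each v in Y_i ∩ V1 (i >= 1), pick an action into Y_{i-1}."""
    rank = {v: i for i, Y in enumerate(levels) for v in Y}
    WS1 = {}
    for (v, a), u in T.items():
        if v in V1 and rank.get(v, 0) >= 1 and rank.get(u, -1) == rank[v] - 1:
            WS1.setdefault(v, a)
    return WS1
```

Each application of the strategy decreases the level index, so play reaches Y0 = F in at most m moves.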
In the case that G is a Büchi game, the winning region of player 1, Win1, is
obtained by recursively computing the set of states Z [45] in the following way:
1. Z0 = V ,
2. for i ≥ 0, Xi := Attr(Zi), and
Yi := {v ∈ V1 | (∃σ ∈ Σ1 : T(v, σ) ↓)[T(v, σ) ∈ Xi]}
    ∪ {v ∈ V2 | (∀σ ∈ Σ2 : T(v, σ) ↓)[T(v, σ) ∈ Xi]} ,   (4.3)
3. Zi+1 = Yi ∩ F .
The set Z = Zm = Zm+1 is the fixed point and the winning region for player 1
is Win1 := Attr(Z). The memoryless winning strategy WS1 of player 1 on Win1 is
computed as follows: (1) for v ∈ Win1 \ Z the winning strategy is defined in the same
way as that for the reachability game on the graph of G in which Z is the winning
condition; (2) if v ∈ Z, then define WS1(v) = {σ ∈ Σ1 | T(v, σ) ∈ Win1}. Applying
WS1(v) leads to a state within Win1, from which point onwards the strategy defined in
case (1) applies.
The time complexities of solving reachability and Büchi games are linear and
polynomial, respectively, in the size of the game automaton G.
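The Büchi fixed point of steps 1-3 can be transcribed compactly. As before, the encoding is illustrative (T maps (state, action) pairs to successors), and `attr` re-implements the attractor of (4.2) so the sketch is self-contained:

```python
def attr(V1, V2, T, X):
    """Attr(X): states from which player 1 can force a visit to X."""
    W = set(X)
    while True:
        new = set()
        for v in (V1 | V2) - W:
            succ = [u for (s, a), u in T.items() if s == v]
            if v in V1 and any(u in W for u in succ):
                new.add(v)
            elif v in V2 and succ and all(u in W for u in succ):
                new.add(v)
        if not new:
            return W
        W |= new

def buchi_win1(V1, V2, T, F):
    """Iterate Z0 = V, Xi = Attr(Zi), Yi as in (4.3), Zi+1 = Yi ∩ F."""
    Z = V1 | V2
    while True:
        X = attr(V1, V2, T, Z)
        Y = set()
        for v in V1 | V2:
            succ = [u for (s, a), u in T.items() if s == v]
            if v in V1 and any(u in X for u in succ):
                Y.add(v)
            elif v in V2 and succ and all(u in X for u in succ):
                Y.add(v)
        Z_next = Y & set(F)
        if Z_next == Z:            # fixed point reached
            return attr(V1, V2, T, Z)   # Win1 = Attr(Z)
        Z = Z_next
```

Each outer iteration removes at least one state from Z, which bounds the number of iterations by |V| and yields the polynomial complexity cited above.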
4.4 Integrating Learning with Control
The problem of synthesizing a strategy (controller) has a solution if player 1
has full knowledge of the game that is being played. Suppose, however, that player 1
has knowledge of her own capabilities and objective, but does not have full knowledge
of the capabilities of player 2. How can player 1 plan effectively given her incomplete
knowledge about which game she is actually playing?
In this section, we show that when player 1 does not have complete knowledge of
her opponent, and hence of the game, the integration of GI as a learning mechanism,
in a setting where the game is repeated sufficiently many times, can eventually give
player 1 a winning strategy.
4.4.1 Languages of the Game
The intuition is that, at the abstraction level, the behaviors of both the system
and its environment can be understood as languages. The identification of a
model of the environment, or of the interaction, thus becomes the problem of learning the
language of the environment or of the game. We first define what we mean by the
languages of the environment and of the game.
For a designated initial state v0 ∈ V , the language of the game (G, v0), denoted
L(G, v0) = L(〈V,Σ, T, v0, V 〉), is the set of finite prefixes of all possible behaviors of
two players (sequences of interleaving actions) in the game. The language of player i,
for i ∈ {1, 2}, is the projection of L(G, v0) on Σi, denoted Li(G, v0).
We assume the game (G, v0) is played repeatedly. During game play, the
system obtains positive presentations of the languages of the game and of the environment.
Let φ be the presentation of L(G, v0) obtained in the repeated game; define
φ(0) = λ, and denote by φ[i] the presentation obtained after move i = 1, . . . , n. Since
games are repeated, the move index i counts from the first move in the very first game
until the current move in the latest game. If move i+1 is the first in one of the repeated
games and player k plays σ ∈ Σk then φ(i + 1) = σ; otherwise φ(i + 1) = φ(i)σ. The
projection of φ on player 2’s alphabet is a positive presentation of L2(G, v0), denoted
φ2.
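The bookkeeping for φ and its projection φ2 can be sketched as follows. Single-character actions and the alphabet below are toy stand-ins for the players' actual alphabets:

```python
def extend_presentation(phi, move, new_game):
    """phi(i+1) = move if move i+1 starts a new game, else phi(i) + move."""
    return move if new_game else phi + move

def project(word, sigma_i):
    """Projection of an action sequence onto player i's alphabet."""
    return ''.join(a for a in word if a in sigma_i)

phi = ''
history = []  # the prefixes phi[1], phi[2], ... observed so far
for move, new_game in [('a', True), ('x', False), ('b', False), ('y', False)]:
    phi = extend_presentation(phi, move, new_game)
    history.append(phi)
```

With Σ2 = {x, y}, the projection of the final prefix onto player 2's alphabet yields the positive presentation φ2 fed to the GIM.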
4.4.2 Learning the Game — a First Approach
In this section we consider the case where the system cannot interfere with the
environment's actions during their interaction, that is, L2(G, v0) = L(〈Q2, Σ2, T2, q20, Q2〉), where q20 is the initial state of the environment given the initial game state v0;
that is, v0 = (q0, s0) and q0 = (q10, q20, t). When L2(G, v0) belongs to certain classes of
languages, we can in this case directly identify the model of the environment in
the form of an sa, and consequently the game.
The assumptions that allow the implementation of the first approach are the
following:
Assumption 1. 1) Player 1 cannot restrict player 2; 2) The model of player 2 is
identifiable in the limit from positive presentations by a GIM; 3) Player 1 has prior
knowledge for selecting the correct GIM; and 4) the observed behavior of player 2 suffices
for a correct inference to be made, i.e., it contains a characteristic sample.
Definition 11. Let L be a class of languages identifiable in the limit from positive
presentation by a normal-form learner GIM, the output of which is an fsa. Then we
say that an sa A = 〈Q,Σ, T 〉, where sink /∈ Q, is identifiable in the limit from positive
presentations if for any q0 ∈ Q, the language accepted by fsa A = 〈Q,Σ, T, q0, Q〉 is
in L, and given a positive presentation φ of L(A), there exists an m ∈ N such that
∀n ≥ m, GIM(φ[m]) = GIM(φ[n]) = A. The learner GIMSA for sa A is constructed
from the output of GIM by unmarking the initial and final states.
Let SA(GIM) be the set of sas identifiable in the limit from positive presentations
by the normal-form learner GIM. Now given an sa A1, an objective As, and a class of
semiautomata SA, define the class of games
GAMES(A1, As, SA) = {G | ∃A2 ∈ SA : G = (A1 A2) ⋉ As} .
For this class of games we have the following result.
Theorem 3. If, for all A2 ∈ SA(GIM), there exists Â2 ∈ range(GIM) such that L(Â2) =
L2((A1 A2) ⋉ As), then GAMES(A1, As, SA(GIM)) is identifiable in the limit from
positive presentations.
Proof. For any game G ∈ GAMES(A1, As, SA(GIM)) and any data presentation φ of
L(G, v0), denote by φ2[n], for n ∈ N, the projection of φ[n] on Σ2. Then define a learning
algorithm Alg as follows:

∀φ, ∀n ∈ N, Alg(φ[n]) = (A1 GIMSA(φ2[n])) ⋉ As .

We show that Alg identifies GAMES(A1, As, SA(GIM)) in the limit.

To this end, consider any game G ∈ GAMES(A1, As, SA(GIM)); there is an
A2 ∈ SA(GIM) such that G = (A1 A2) ⋉ As. Consider now any data presentation φ
of L(G, v0). Then φ2 is a data presentation of L2(G, v0). By assumption there exists
Â2 ∈ range(GIM) such that L(Â2) = L2(G, v0). Thus φ2 is also a data presentation
of L(Â2). Therefore, there is an m ∈ N such that for all n ≥ m it is the case that
GIMSA(φ2[n]) = A2. Consequently, there is m′ = 2m such that for all n ≥ m′,
Alg(φ[n]) = (A1 A2) ⋉ As = G.

Since G and φ were selected arbitrarily, the proof is complete.
Winning Strategy:        WS1[0]   WS1[1]   . . .   WS1[i]   . . .  →  WS1
                           ↑        ↑                ↑
Hypothesis of the Game:   G[0]     G[1]    . . .    G[i]    . . .  →  G
                           ↑        ↑                ↑
Hypothesis of Player 2:   A2[0]    A2[1]   . . .    A2[i]   . . .  →  A2
                           ↑        ↑                ↑
Data Presentation:        φ2[0]    φ2[1]   . . .    φ2[i]   . . .
Figure 4.2: Learning and planning with a grammatical inference module.
Figure 4.2 illustrates how identification in the limit proceeds. Through interactions with player 2, player 1 observes a finite initial segment of a positive presentation φ2[i] of L2(G, v0), and uses the GIM to update a hypothesized model of player
2. Specifically, the output of GIM(φ2[i]) is a dfa (see Appendix A) which, after removing the initial state and the finality of the final states, yields a semiautomaton
A2[i]. The labeling function LB2[i] in A2[i] is defined as LB2[i](q) = ∧σ∈IN(q) Post(σ), where
IN(q) ≜ {σ ∈ Σ2 | (∃q′ ∈ Q2[i])[T2[i](q′, σ) = q]} is the set of labels on the incoming
transitions of state q. The computation of the labeling function takes time
linear in the size of A2[i]. Given LB2[i], the interaction function U2(·) is updated in linear
time O(|Q1| × |Q2[i]|). Based on the interaction functions and the updated model of
player 2, player 1 constructs a hypothesis (model) G[i], capturing her1 best guess
of the game being played, and uses this model to compute WS1[i], which converges to
the true WS1 as A2[i] converges to the true A2. Strategies WS1[i], for i < n, are the best
responses for the system given the information it has so far, but having been devised
based on incorrect hypotheses about the game being played, they cannot guarantee
winning. There is no guaranteed upper bound on the number of games player 1 has to
play before the learning process converges because one does not know at which point
a characteristic sample of player 2’s behavior is observed. However, as soon as this
happens, convergence is guaranteed. The game learning procedure is summarized in
the following.
1. The game starts at initial state v0 ∈ I, i := 0, and the hypothesized game is
G[0].

2. At state v = ((q1, q2, 1), qs), player 1 computes Win1[i] in G[i]. If v ∈ Win1[i], a winning
strategy WS1[i] exists in (G[i], v); player 1 plays σ = WS1[i](v) and proceeds to step 4.
If v ∉ Win1[i], player 1 loses and jumps to step 3; if T(v, σ) ∈ F, player 1 wins and
jumps to step 5.

3. With probability p, player 1 makes a move selected at random from the moves
available at that instant and jumps to step 4; with probability 1 − p, player 1
jumps to step 5.

4. Player 2 makes a move. Player 1 observes the move, updates A2[i] to A2[i+1] and
G[i] to G[i+1], sets i := i + 1, and goes to step 2.

5. The game is restarted at a random initial state v0. If v0 ∉ Win1[i], player 1
makes a random move and goes to step 4; otherwise, player 1 jumps to step 2.
When player 1 finds herself out of her assumed winning set she can either quit and
restart the game, or explore an action with probability 0 ≤ p ≤ 1 and keep playing
hoping that her opponent’s response allows her to improve her hypothesis of the game.
1 In this context, we refer to player 1 as a “she” and player 2 as a “he”.
4.4.3 Learning an Equivalent Game
The assumptions of the first approach can be restrictive when the environment's
behavior can be constrained by the system during their interactions. For
example, a mobile robot's environment may consist of another mobile robot, and
the interference between actions then goes in both directions. Moreover, the learning
algorithms used in the first approach are restricted to those that output fsas.
To relax these assumptions, in this section we combine
action model learning [84] with grammatical inference to directly identify a game equivalent
to the one being played. The equivalence considered here is a modified
version of the game equivalence in [15], and it guarantees that even if the true model
of the environment is never found, the controllers built on the equivalent model
are effective in terms of satisfying the system specification.
We assume the following condition for the second approach to apply.
Assumption 2. The language of the environment in the system and environment
interaction, L2(G, v0), belongs to a class of languages identifiable in the limit from
positive presentations, and player 1 has correct prior knowledge of the class of languages
to which L2(G, v0) belongs.
4.4.3.1 Equivalence in Games
Through the concept of bisimulation in transition systems we establish the
equivalence between two games.
Definition 12. [91] A bisimulation of two transition systems P = 〈Q, Σ, δ, LB〉 and
P′ = 〈Q′, Σ, δ′, LB′〉 is a binary relation R ⊆ Q × Q′ such that whenever (q, q′) ∈ R and
σ ∈ Σ, the following conditions hold:
(i) LB(q) = LB′(q′).
(ii) if δ(q, σ) = p, then δ′(q′, σ) = p′ for some p′ ∈ Q′ such that (p, p′) ∈ R.
(iii) if δ′(q′, σ) = p′, then δ(q, σ) = p for some p ∈ Q such that (p, p′) ∈ R.
We write P ≃ P′ if P and P′ are bisimilar. For designated initial states q0, q′0, we say
(P, q0) is bisimilar to (P′, q′0), and write (P, q0) ≃ (P′, q′0), if and only if after trimming
all states inaccessible from q0 and q′0, P ≃ P′ and (q0, q′0) ∈ R.
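A naive fixpoint check of Definition 12, on two deterministic ltss encoded as dictionaries, might look as follows. This is an illustrative sketch (state and label names are placeholders), not an optimized partition-refinement algorithm:

```python
def bisimulation(delta1, delta2, LB1, LB2, alphabet):
    """Largest bisimulation R between two deterministic lts, per Definition 12."""
    # start from all label-compatible pairs (condition (i))
    R = {(q, qp) for q in LB1 for qp in LB2 if LB1[q] == LB2[qp]}
    changed = True
    while changed:
        changed = False
        for (q, qp) in set(R):
            for a in alphabet:
                p, pp = delta1.get((q, a)), delta2.get((qp, a))
                # conditions (ii)/(iii): sigma-moves must match and land in R
                if (p is None) != (pp is None) or (p is not None and (p, pp) not in R):
                    R.discard((q, qp))
                    changed = True
                    break
    return R
```

The returned relation is empty exactly when no pair of states survives the refinement, i.e., when the systems are not bisimilar from any pair of label-compatible states.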
The following definition is adapted from [15].
Definition 13 (Equivalence between games). Two games (G, v0) = 〈V, Σ, T, v0, F〉 and
(G′, v′0) = 〈V′, Σ, T′, v′0, F′〉 are equivalent if there exist two functions r : V∗ → V′∗ and
r′ : V′∗ → V∗ such that, given a winning strategy WS1 : V∗ → Σ1 of player 1 in (G, v0),
the strategy WS′1 : V′∗ → Σ1 defined by WS′1(ρ′) = WS1(r′(ρ′)), for any ρ′ = v′0 v′1 . . . v′n ∈ V′∗,
is winning for player 1 in (G′, v′0), and vice versa.
Proposition 4. If (P, q0) and (P′, q′0) are bisimilar, then the games (G, v0) = (P, q0) ⋉
As and (G′, v′0) = (P′, q′0) ⋉ As are equivalent, for any deterministic objective automaton
As = 〈S, C, Ts, Is, Fs〉.
Proof. Define r, r′ of Definition 13 by complete induction. For the initial states v0 =
(q0, s0) and v′0 = (q′0, s0), where s0 = Ts(Is, LB(q0)) = Ts(Is, LB′(q′0)) since LB(q0) =
LB′(q′0), let r(v0) = v′0 and r′(v′0) = v0; we then have π2(v0) = π2(v′0) and (π1(v0), π1(v′0)) =
(q0, q′0) ∈ R. Next, suppose r, r′ are defined for finite runs ρ = v0 v1 . . . vn and ρ′ =
v′0 v′1 . . . v′n such that r(ρ) = ρ′, r′(ρ′) = ρ, and for all 0 ≤ i ≤ n, π2(vi) = π2(v′i) ∈ S and
(π1(vi), π1(v′i)) ∈ R.

Consider σ ∈ Σ for which T(vn, σ) ↓, and let vn+1 = T(vn, σ). Writing vn =
(qn, sn) and v′n = (q′n, s′n), the inductive hypothesis on r, r′ gives (qn, q′n) ∈ R and s′n = sn.
Since As is total and δ′(q′n, σ) is defined by bisimulation, T′(v′n, σ) is defined. The
transitions in G and G′ are related:

G :  (qn, sn)  —σ→  (qn+1, sn+1)
       R ↕               R ↕
G′ : (q′n, sn)  —σ→  (q′n+1, s′n+1)

where sn+1 = Ts(sn, LB(qn+1)) and s′n+1 = Ts(sn, LB′(q′n+1)). Since q′n+1 is related to
qn+1 through R, from LB(qn+1) = LB′(q′n+1) we must have s′n+1 = sn+1.
Let r(ρ vn+1) = ρ′ v′n+1 and r′(ρ′ v′n+1) = ρ vn+1; inductively, it follows that for
two runs ρ ∈ V∗ and ρ′ ∈ V′∗ such that r(ρ) = ρ′ and r′(ρ′) = ρ, it holds that

(∀i : 0 ≤ i < |ρ|)[π2(vi) = π2(v′i) ∧ (π1(vi), π1(v′i)) ∈ R] .
Now suppose WS1 : V ∗ → Σ1 is a winning strategy for player 1 in (G, v0). For
any run ρ produced by player 1 applying WS1, let r(ρ) be the run produced by player
1 applying WS′1 (Definition 13) and note that LB(ρ) = LB′(r(ρ)), where LB and LB′ are
the labeling functions of G and G ′, respectively. Because of this latter equality between
the images of the labeling functions, when ρ is winning for player 1, by Proposition 3
we can infer LB(ρ) is accepted by As, and consequently r(ρ) is winning for player 1 in
(G ′, v′0) since LB′(r(ρ)) = LB(ρ) is accepted by As as well.
Proposition 4 sets the theoretical foundation that allows us to compute a winning
strategy for player 1 in a game equivalent to the original one when the latter is unknown
but can be learned from positive presentations.
4.4.3.2 Learning an Equivalent Game from Positive Presentations
The learning module in our framework combines two learning processes that
work in parallel: one aims to identify a transition system that keeps track of the
updates of world states during the course of the game, and the other is a typical GIM.
By combining these two we are able to compute a game equivalent to the true game
in the sense of Definition 13.
We assume that player 1 always knows whose turn it is (i.e., the Boolean value
t ∈ {0, 1}) and has full observation of the set of atomic propositions AP, i.e., at any
time instance during the game, player 1 knows the current evaluation of α for each
α ∈ AP whose value can be determined (either true or false). In the repeated game,
the (move) index i counts from the very first game until the current move.
Definition 14 (World state transition system). During repeated play of the game
(G, v0), the world state transition system constructed by player 1 at index n
is W(n) = 〈C × {0, 1}, Σ, Tw, (c0, t0)〉, where C × {0, 1} is the set of states and (c0, t0) is
the initial state;2 Tw : (C × {0, 1}) × Σ → C × {0, 1} is the transition relation, defined
from the observations of player 1 as follows:

1. t = 1: Tw((c, 1), σ) = (c′, 0) is defined if c ∈ C with c =⇒ Pre(σ), where
c′ = σ(c) is the world state that captures the effect of σ on world state c.

2. t = 0: Tw((c, 0), σ) = (c′, 1) is defined if, for c ∈ C, after player 2 plays σ, the
observed world state is c′.
In the course of the game G, player 1 updates W as follows. Let the world state at
index n be c ∈ C. At n + 1, suppose player 2 plays σ ∈ Σ2 and the world state becomes c′.
Then W(n + 1) is obtained from W(n) incrementally by first adding state (c′, 1) if it is
not already included in the state set, and then defining a transition Tw((c, 0), σ) = (c′, 1)
if it does not already exist. Outgoing transitions of (c′, 1) are subsequently added according
to the definition of Tw: for any σ ∈ Σ1 such that Pre(σ) is satisfied by c′, we add a
transition from (c′, 1) to (σ(c′), 0) labeled σ.
The incremental construction of W is reminiscent of learning an action model
with full observation [84], treating the actions of player 2 as the set of actions whose
conditional effects have to be learned. The convergence of learning is guaranteed:
informally, according to the turn-based product P, the set of world states that can
actually be encountered during the players' interaction is fixed. Once the set of states in
W(i), for some i ∈ N, converges to a set that contains LB(Q), transitions are added
over a known state set using the rules defined above. Convergence is reached when
no more transitions can be added. The construction of the world-state transition system
is linear in the size of the state space.
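One incremental update step of W can be sketched as follows. Here `pre` and `effect` are illustrative stand-ins for Pre(σ) and σ(c), world states are encoded as frozensets of true propositions, and the door example is a toy:

```python
def update_W(states, Tw, c, sigma, c_obs, sigma1, pre, effect):
    """One incremental step of W after player 2 plays sigma, taking world
    state c to the observed world state c_obs."""
    states.add((c_obs, 1))                       # add (c', 1) if missing
    Tw.setdefault(((c, 0), sigma), (c_obs, 1))   # record player 2's observed move
    for a in sigma1:                             # player 1's outgoing moves
        if pre(a, c_obs):                        # Pre(a) satisfied by c'
            states.add((effect(a, c_obs), 0))
            Tw.setdefault(((c_obs, 1), a), (effect(a, c_obs), 0))
    return states, Tw
```

Because `setdefault` never overwrites an existing transition, repeated observations leave W unchanged once it has converged, matching the termination condition above.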
Definition 15. Suppose that upon initialization of the game (G, v0) player 1 is
at state I1, that W(n) = 〈C × {0, 1}, Σ, Tw, (c0, t0)〉 has been constructed at index n, and that
a hypothesis of L2(G, v0) is given by a GIM in the form of a grammar G, for which
2 By construction, not all states in C × {0, 1} can be accessed in W(n).
one finds a dfa B = 〈Qh2, Σ2, Th2, Ih2, Fh2〉 such that L(B) = L(G). The hypothesized
turn-based product is

HP = W(n) ×s Ā1 ×s B = W(n) ×s 〈Q1, Σ1, T1, I1, Q1〉 ×s B = 〈H, Σ, δ′, h0, LB′〉

where H = C × {1, 0} × Q1 × Qh2 is the state set and h0 = (c0, t0, I1, Ih2) is the initial
state; the transition relation δ′ is defined as follows: for h = (c, t, q1, qh2),

1) σ ∈ Σ1 ∧ t = 1: δ′(h, σ) = (Tw((c, t), σ), T1(q1, σ), qh2);

2) σ ∈ Σ2 ∧ t = 0: δ′(h, σ) = (Tw((c, t), σ), q1, Th2(qh2, σ)) .

The labeling function LB′ : H → C is defined such that for any h = (c, t, q1, qh2) ∈ H,
LB′(h) = π1(h) = c.
Theorem 4. Let GIM be a learning algorithm that identifies in the limit, from positive
presentations, a class of regular languages L. Suppose the game (G, v0) with v0 = (q0, s0)
is such that L2(G, v0) ∈ L, and consider any positive presentation of L2(G, v0), denoted
φ2 : N → L2(G, v0) ∪ {#}. Let the algorithm Alg be defined by

Alg(φ2[n]) ≜ (W(n) ×s Ā1 ×s A2(φ2[n])) ⋉ As ,

where n ∈ N, A2(φ2[n]) = 〈Qh2, Σ2, Th2, Ih2, Fh2〉 is a dfa that accepts the language
generated by the grammar GIM(φ2[n]), and Ā1 is obtained from A1 by assigning I1 =
π1(q0) as the initial state and making all states final. Then there exists some index N ∈ N
such that

1. L(GIM(φ2[N])) = L2(G, v0);

2. W(n) = W(N) for all n ≥ N;

3. Alg(φ2[N]) is game equivalent to (G, v0).
Proof. It suffices to prove that at index N ∈ N the hypothesized turn-based product
satisfies W(N) ×s Ā1 ×s A2(φ2[N]) ≃ (P, q0) = 〈Q, Σ, δ, q0, LB〉, because it then follows from
Proposition 4 that Alg(φ2[N]) is equivalent to (G, v0).

At index N, let the hypothesized turn-based product be HP = W(N) ×s Ā1 ×s A2(φ2[N]) = 〈H, Σ, δ′, h0, LB′〉. Let the relation R ⊆ Q × H be defined by (q0, h0) ∈ R,
with (q, h) ∈ R whenever there exists w ∈ Σ+ such that δ(q0, w) = q and δ′(h0, w) = h.
We show R is a bisimulation:
First we show that if (q, h) ∈ R, then LB(q) = LB′(h). Given that LB′(h0) =
c0 is the world state at the initialization of the game, LB′(h0) = LB(q0) = c0 by
the uniqueness of the initial world state. For (q, h) ∈ R, assume without loss of
generality that there exists σ ∈ Σ for which δ(q, σ) = q′ and δ′(h, σ) = h′. Since in a
deterministic game the world state resulting from applying action σ to a given world state c is
unique, and given LB(q) = LB′(h) = c, we can infer LB(q′) = LB′(h′) =
σ(c) = c′ ∈ C. Inductively, it follows that for any (q, h) ∈ R, LB(q) = LB′(h).
Next we show that if (q, h) ∈ R, then for every σ such that δ(q, σ) = q′ ↓, there
must exist h′ ∈ H such that δ′(h, σ) = h′ and (q′, h′) ∈ R and vice versa. So far, we
have
(q, h) ∈ R =⇒ LB(q) = LB′(h) = c ∧ (∃w ∈ Σ∗)[δ(q0, w) = q ∧ δ′(h0, w) = h] .
Consider any σ such that δ(q, σ) ↓ and let q′ = δ(q, σ) = δ(q0, wσ). In showing that
δ′(h, σ) ↓, two cases can arise:
Case 1: σ ∈ Σ1. By the definition of the turn-based product, and from the fact that σ
is taken at state q of P, we can infer LB(q) =⇒ Pre(σ), which means that in W(N),
Tw((c, 1), σ) = (σ(c), 0) is defined. Meanwhile, as δ(q0, wσ) ↓, let u1 be the projection
of w on Σ1, and note that u1σ ∈ L1(G, v0) ⊆ L(A1) implies T1(I1, u1σ) ↓. By Definition
15, we have δ′(h, σ) ↓. Let h′ = δ′(h0, wσ) = δ′(h, σ). Then (q′, h′) ∈ R by the
definition of R.
Case 2: σ ∈ Σ2. Similarly to the previous case, since δ(q0, wσ) ↓, let u2 be the projection
of w on Σ2, and note that u2σ ∈ L2(G, v0) ⊆ L(A2). As L(A2(φ2[N])) = L2(G, v0)
in the limit, it follows that Th2(Ih2, u2σ) ↓. For Tw((c, 0), σ) to be defined in W(N),
player 1 must have observed player 2 taking action σ when the world state is c. In
the true turn-based product, given LB(q) = c, unless player 2 never plays σ at q (in
which case we can safely assume wσ ∉ L(G, v0)), there must exist a time index k ≤ N
at which the transition labeled σ from (c, 0) is added to W(k). Now since Th2(Ih2, u2σ) ↓
and Tw((c, 0), σ) ↓, by construction we have h′ = δ′(h, σ) = δ′(h0, wσ) ↓, and thus
(q′, h′) ∈ R by the definition of R.
So far we have shown that P is simulated by HP: any sequence of actions of
players 1 and 2 in P can be matched with a sequence of actions in HP. For the other
direction, note that any sequence of actions in HP can also be matched by a sequence
of actions in P, because observed behaviors originate from the true turn-based product
P, which captures all possible interactions. Having shown W(N) ×s Ā1 ×s A2(φ2[N]) ≃ (P, q0), Proposition 4 allows us to conclude that the game Alg(φ2[N]) is equivalent to
(G, v0).
We say Alg identifies a game equivalent to (G, v0) in the limit from positive presentations of L2(G, v0). From Theorem 4 and Proposition 4, it follows that the winning
strategy of player 1 computed using Alg(φ2[N]) ensures the satisfaction of the task
specification accepted by As. The winning strategy computed in Alg(φ2[N]) converges,
through the functions r, r′, to the winning strategy for player 1 in (G, v0). There is no
need to compute r and r′ explicitly, because for any finite run ρ′ in Alg(φ2[N]), the action WS′1(ρ′)
computed using Alg(φ2[N]) is exactly the action given by WS1(r′(ρ′)) in (G, v0).
Until GIM converges, there can be no guarantee that an effective strategy can
be found. However, since the output of GIM is always consistent with the history of
observed environment behavior, the adaptive controller performs at least as well as any
controller synthesized without making any inference. As observations accumulate,
the adaptive controller improves monotonically.
The computational complexity of learning depends on which grammatical inference
algorithm is used by GIM. It has been shown [49] that many lattice classes of
languages are learnable by algorithms that are set-driven
(i.e., the order of data is irrelevant) and poly-time iterative (i.e., the next hypothesis
can be computed in polynomial time from the previous hypothesis and the current data point
alone).
4.5 Case Study
In this section, we illustrate the method presented herein with a robot motion
planning example: a robot (player 1) aims to visit rooms 1 through 4 in Fig. 4.3a,
while the doors a, b, c, d, e, f, g are controlled by an adversary (player 2).
We assume player 2 adheres to the following rules: 1) doors b, c, e, f (around
room 0) can be kept closed for at most two rounds, and the rest can be closed for at
most one round;3 2) two consecutively closed doors, if not the same, must be adjacent
to each other (doors are adjacent if connected via a wall; for example, doors b, c are
adjacent to a). Player 1 can either stay in her current room or move to an adjacent
one if the door connecting the two is open, but cannot stay in rooms 3, 4, 0 for more
than one round. Note that although in principle our method applies to cases where
player 1 restricts the behavior of player 2, in this particular example this is not the
case.4
As the first game starts, player 1 is informed that the language of player 2 is
in the class of strictly 3-local languages [48] (see Appendix for a characterization of,
and available GIMs for, this class of languages). This information is inferred from the fact
that player 2 has a finite memory of size 3. Figure 4.3b gives a graphical description
of a fragment of the game automaton (G, v0), which has a total of 1214 states and 3917
transitions. The winning region for player 1 contains 996 states.
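A set-driven learner for strictly k-local languages, of the generic kind assumed available here, can be sketched in a few lines: the grammar is simply the set of k-factors of the boundary-padded data. The door-sequence samples below are illustrative, and this is the textbook SL-k factor learner, not the specific GIM used in the experiment:

```python
def sl_factors(word, k=3):
    """k-factors of a word padded with boundary markers '#'."""
    padded = '#' * (k - 1) + word + '#' * (k - 1)
    return {padded[i:i + k] for i in range(len(padded) - k + 1)}

def sl_learn(sample, k=3):
    """Set-driven SL-k learner: the grammar is the union of observed k-factors."""
    grammar = set()
    for w in sample:
        grammar |= sl_factors(w, k)
    return grammar

def sl_accepts(grammar, word, k=3):
    """A word is in the hypothesized language iff all its k-factors are allowed."""
    return sl_factors(word, k) <= grammar
```

The learner is set-driven (only the set of observed factors matters, not their order) and converges exactly when a characteristic sample, here all 121 factors of length 3, has been observed.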
3 Doors can be automatic sliding doors with different—designed—closing time spans.
4 Perhaps that could happen if the robot were to remain at a certain door, thus preventing it from closing, but this would not be particularly beneficial in terms of achieving its goal.
(a) A graphical depiction of the environment: rooms 0-4 connected by doors a-g.
(b) A fragment of A2, where the initial state is c. Since player 2 cannot close b, c, e, f for more than two rounds, she needs to maintain a memory of size 2 keeping track of the two most recently closed doors; e.g., ab means door a was closed in the previous turn and the currently closed door is b. Upon initialization, c means that so far only door c has been closed.
Figure 4.3: The environment and its abstraction.
Figure 4.4: A fragment of (G, v0), where v0 = ((1, c, 1), 1). A state ((q1, q2, t), qs) means the robot is in room q1; the recently consecutively closed (at most two) doors are q2; t = 1 if player 1 is to make a move, otherwise player 2 is to make a move; and the visited rooms are encoded in qs, e.g., 12 means rooms 1 and 2 have been visited.
Figure 4.5 shows how the learning algorithm converges after about 3895 turns
(368 games). Convergence is quantified by measuring the ratio of the size (cardinality)
of grammar GIM(φ[n]) over that of the grammar describing L2(G, v0); the latter has 121
factors of length 3 (see Appendix B for background details). Interestingly, although it
takes 368 games for the learning algorithm to converge, we observe that after 7 games
player 1 loses only once (in the 51st game). This fact suggests that the controllers
computed using the hypothesized games, even when those are not game-equivalent to
the actual game, can still be effective. We compare this outcome with the case where
player 1 has no capacity to learn, and replans in a way similar to [70] using a naive
model for player 2: when player 1 observes a door is closed (open), she will assume
the door will remain closed (open) until she observes it opening (closing) again. With
player 2 exploiting every opportunity to prevail, player 1 achieves a win ratio of 27%
when no learning is employed, compared to a ratio of 98% when GIM is used.
Figure 4.5: Convergence of learning L2(G, v0): the ratio between the size of the grammar inferred by the GIM and that of L2(G, v0), as a function of the number of moves made. (Axes: move index vs. rate of convergence, from 0 to 1.)
4.6 Conclusions
This chapter shows how particular classes of reactive systems, which capture the
interactions between systems and their environments, can be identified in the limit. By
doing so, the control policy for the system can successively adapt to the identified model
of the environment as the interaction unfolds. The prerequisites for this are as follows:
1) the behavior of the environment in the reactive system corresponds to a language which
belongs to a class of languages identifiable in the limit from positive presentations;
2) the system has prior knowledge of the class of languages to which the environment
behavior belongs. Provided these conditions hold, it is guaranteed that the system
can compute a winning strategy (control policy) which converges to the true winning
strategy in the limit from positive presentations.
The learning results in this chapter are primarily made possible by factoring the
game according to its natural subsystems: the dynamics of the system, the dynamics of
its environment, and the task specification. This isolated the uncertainty in the game
to the model of the environment. Consequently, the framework is flexible and modular:
for a continuous dynamical system interacting with an unknown environment, as long
as both the dynamical system and the environment afford consistent abstractions—in
the sense that abstract plans are always implementable—in the form of finite-state
machines, a grammatical inference module can be incorporated and is guaranteed to
work provided the prerequisites above hold.
An implicit assumption in this work is the availability of complete and precise
sensing information during the execution of the controllers. In practice, this ideal
assumption of complete information is in general difficult to realize. The proposed
framework, however, affords extensions to games with imperfect information.
Control synthesis with partial observations has been studied
extensively in des theory [21]. In parallel, the solution concept for games with partial
observations is developed in algorithmic game theory [23]. In the next chapter, we
study the control synthesis for reactive systems with partial observations and provide
an outlook for the integration of learning methods under noisy data with the synthesis
methods, in order to construct adaptive temporal-logic controllers for systems with
limited sensing modalities.
Chapter 5
CONTROL SYNTHESIS WITH PARTIAL OBSERVATIONS
5.1 Overview
An implicit yet unrealistic assumption in the work on control synthesis with
temporal logic specifications is the availability of complete and precise sensing infor-
mation during the execution of the controllers.
In this chapter, we develop automatic synthesis methods for systems with partial
observations based on solution concepts for two-player, turn-based, temporal-logic
games with incomplete information [23]. A direct application of results in algorithmic
game theory is impossible due to the lack of an appropriate definition of a sensor in the
game formulation. In control theory of discrete event systems, two definitions of sensor
game formulation. In control theory of discrete event systems, two definitions of sensor
models have been introduced. The first definition of sensing uncertainty [21] simply
partitions the set of events into observable and unobservable ones, and only captures
global sensing uncertainties. For example, if the robot positioned at (x, y) cannot
obtain information for the value of y, then any event that involves the variable y will
be unobservable. Another definition of sensor models introduces a mask function which
is a mapping from the state and/or action space into the observation space [64]. In this
chapter, we introduce an observation function based on the second definition, which
can be used to capture both local and global sensing uncertainties. With the sensor
model defined, we formulate the interaction between a system and its environment
as a variant of a partial-information, turn-based, temporal-logic game [6, 23] and derive
control synthesis methods with respect to the given system’s specification.
In the case of partial observation, a general approach to control design for a
reactive system is to construct another system with complete information based on
a knowledge-based subset construction [66, 96] and then apply the solution for games
with complete information [45]. Methods are also developed when the specification is
expressed in modal µ-calculus [7]. Recently, a nondeterministic control policy has been
introduced for reachability objectives [108]. For safety objectives in infinite state des,
Kalyon et al. [56] introduce a synthesis method that generates a k-memory controller
(a controller with finite memory of length k).
In this chapter, we examine synthesis problems under partial observations with
temporal logic specifications including safety, obligation (reachability), liveness and
persistence (Buchi) requirements. Two different measures of success are discussed.
One requires the task to be completed with certainty (sure winning). The other requires the
specification to be met with probability 1 (almost-sure winning). In the case of complete
information, these two measures are equivalent because the strategies used by the
system and its environment are both deterministic. In the case of partial observations,
when the system cannot always be sure of its current state and can only hypothesize
about the set of states to which the current state might belong, it has to select an
action randomly from a set of admissible actions.
We show that by allowing the system to keep a finite memory of the history and to
randomize its choice of actions, more permissive control policies can be derived with
respect to temporal logic specifications under the almost-sure winning criterion. Thus,
when a deterministic control policy cannot be computed, one can still try to compute
a randomized one that realizes the objective with probability 1.
5.2 Symbolic Synthesis with Partial Observations
In this section, we formalize the notion of sensor and present a game formula-
tion that captures the interaction between the system and its environment under an
incomplete information regime. Then, by adapting the solutions for games with partial
observations [23], automatic synthesis methods with respect to reachability and Buchi
objectives are developed.
5.2.1 The Model
For a set S, let |S| be the cardinality of S. For a finite set S, a probability
distribution on S is a function Pr : S → [0, 1] such that Σ_{s∈S} Pr(s) = 1. We denote
the set of all probability distributions on S by D(S). Let a be an event, and Pr(a) be
the probability of event a happening. For example, if a := (x ≥ 5), then Pr(a) is the
probability that x is greater than or equal to 5.
Similar to Chapter 4, we model the system and its environment as players:
Ai = 〈Qi,Σi, Ti,AP i, LBi〉, i = 1, 2. In the case of partial observations, we assume
that the system has a finite number of sensor configurations Θ and there is a surjective
function that maps a system state into a sensor configuration at that state: γ : Q1 → Θ.
We introduce two variables act and sensor. For each σ ∈ Σ, let act = σ be a
predicate, which evaluates true at a state q ∈ Q if and only if the most recent action
before reaching q is σ. For each θ ∈ Θ, let sensor = θ be a predicate, which evaluates
true at q ∈ Q if the current sensor configuration for the system is θ. Then, we augment
the set of predicates AP by AP := AP ∪ {act = σ | σ ∈ Σ} ∪ {sensor = θ | θ ∈ Θ} ∪ {t}.
Recall that t is the turn variable and when t = 1 (resp. t = 0) the system (resp.
environment) chooses an action. After augmenting AP, a world state c ∈ C becomes
a conjunction of literals over AP such that an atomic proposition or its negation
occurs only once in c. We assume that the value of the variable sensor, the turn variable t, and the
set of predicates {act = σ | σ ∈ Σ1} are always observable by the system.
With the augmented set of atomic propositions, we define a game graph that
captures the interaction between the system and its environment.
Definition 16. A game graph that captures the interaction of a system
A1 = 〈Q1,Σ1, T1,AP1, LB1〉 and its environment A2 = 〈Q2,Σ2, T2,AP2, LB2〉 is a tuple
P = 〈Q,Σ, δ,AP, LB〉 where the components are defined as follows.

Q = Q1 × Q2 × Θ × Σ × {0, 1} is the set of states.

Σ = Σ1 ∪ Σ2 is the alphabet.

δ : Q × Σ → Q is the transition function, defined by:
• δ((q1, q2, θ, σ, 1), σ′) = (q′1, q2, θ′, σ′, 0), where q′1 = T1(q1, σ′), LB1(q1) ∧ LB2(q2) =⇒ Pre(σ′), γ(q1) = θ and γ(q′1) = θ′;
• δ((q1, q2, θ, σ, 0), σ′) = (q1, q′2, θ, σ′, 1), where q′2 = T2(q2, σ′), LB1(q1) ∧ LB2(q2) =⇒ Pre(σ′) and γ(q1) = θ.

AP is the set of atomic propositions.

LB : Q → C is the labeling function, defined such that given q = (q1, q2, θ, σ, x), LB(q) = LB1(q1) ∧ LB2(q2) ∧ (act = σ) ∧ (sensor = θ) ∧ (t = x).
We define an observation function obs : C → O where O is the finite set of
observations. An observation is a disjunction of world states to which the system thinks
the true world state may belong, without being sure exactly which one is the true state.
A sensor model is a tuple Sensor = 〈Θ, C,O, obs〉. The observation function obs(·)
is extended to runs: given ρ = q0q1 . . . ∈ Q∗ (Qω), let obsR(ρ) = obs(c0)obs(c1) . . .
where LB(qi) = ci for 0 ≤ i < |ρ|. Given two runs ρ, ρ′ ∈ Q∗ (Qω), player 1 cannot
distinguish them if and only if obsR(ρ) = obsR(ρ′). In this case we say ρ and ρ′ are
observation-equivalent, denoted ρ ≡ ρ′.
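The extension of obs(·) from world states to runs, and the induced observation-equivalence, can be sketched in a few lines. The dict-based encodings of LB and obs below are illustrative assumptions, not part of the formal development.

```python
# Sketch of observation-equivalence between runs: `labels` encodes the
# labeling function LB (state -> world state), and `obs` encodes the
# observation function (world state -> observation symbol).

def obs_run(run, labels, obs):
    """Extend obs(.) to a run q0 q1 ... : the sequence obs(c0) obs(c1) ..."""
    return tuple(obs[labels[q]] for q in run)

def observation_equivalent(r1, r2, labels, obs):
    """Two runs are indistinguishable iff their observation sequences match."""
    return obs_run(r1, labels, obs) == obs_run(r2, labels, obs)

# Toy instance: states 0 and 1 share observation o1, so the runs 0.2 and
# 1.2 are observation-equivalent.
labels = {0: 'c0', 1: 'c1', 2: 'c2'}
obs = {'c0': 'o1', 'c1': 'o1', 'c2': 'o2'}
print(observation_equivalent([0, 2], [1, 2], labels, obs))  # True
```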
Based on the definition of strategy in Chapter 4, for the partial observation
cases, player 1 is limited to use the following class of strategies.
Definition 17. An observation-based deterministic (resp. randomized) strategy for
player 1 is a function S1 : Q∗ → Σ1 (resp. S1 : Q∗ → D(Σ1)) that satisfies: (1) S1
is a deterministic (resp. randomized) strategy of player 1; and (2) for any two runs ρ, ρ′, if
ρ ≡ ρ′, then S1(ρ) = S1(ρ′).
The intuition is that since player 1 cannot distinguish two runs, the actions she
takes, or the probability distributions over the set of actions for these two runs, have
to be the same.
The target problem in this chapter is the following:
Problem 4. Given a reactive system (turn-based product) P = 〈Q,Σ, δ,AP , LB〉, the
sensor model Sensor = 〈Θ, C,O, obs〉 and a specification ϕ in the form of a temporal
logic formula, determine whether there exists an observation-based deterministic control
policy S1 such that for any strategy of the environment S2, ϕ is satisfied surely. If
no such controller exists, then determine whether an observation-based randomized
strategy S1 exists such that for any strategy of the environment S2, ϕ is satisfied almost
surely (with probability 1).
We consider reachability and Buchi objectives. The specification ϕ over AP
can be translated into an objective automaton As = 〈S, C, Ts, Is, Fs〉. For a reachability
objective, As is a dfa, and for a Buchi objective, As is a dba.
5.2.2 Deterministic Finite-memory Strategy
In this section, we develop a synthesis method as a solution to Problem 4, with
respect to reachability and Buchi temporal logic specifications. First, the system and
the interaction it has with its environment through partial observations are captured
by a game graph with complete information, defined as follows.
Definition 18 (Observed game graph). For the game graph P = 〈Q,Σ, δ,AP, LB〉 with
a designated initial state q0, denoted (P, q0), an observed game graph with designated
initial state is a tuple obsP = 〈Qo,O ∪ Σ1, δo, qo0〉 where the components are defined as
follows.

Qo ⊆ 2^Q is the set of states, partitioned into player 1's and player 2's states Qo = Qo1 ∪ Qo2, where Qo1 = {qo ⊆ Q | ∀q ∈ qo, LB(q) =⇒ t = 1} and Qo2 = {qo ⊆ Q | ∀q ∈ qo, LB(q) =⇒ t = 0}.

δo : Qo × (O ∪ Σ1) → Qo is the transition function, with δo(qo1, a) = qo2 if one of the following conditions holds.
• a ∈ Σ1, qo1 ∈ Qo1, and qo2 = {q′ ∈ Q2 | ∃q ∈ qo1, δ(q, a) = q′}; in this case qo2 ∈ Qo2.
• a ∈ O and qo1 ∈ Qo2. If there exist σ ∈ Σ1 and qo ∈ Qo1 such that δo(qo, σ) = qo1, then qo2 = {q ∈ qo1 | obs(LB(q)) = a} with qo2 ⊆ qo1 and qo2 ∈ Qo2. Otherwise, every transition leading to qo1 is labeled with some a ∈ O, and qo2 = {q′ ∈ Q | (∃q ∈ qo1)[∃σ ∈ Σ2, δ(q, σ) = q′ ∧ obs(LB(q′)) = a ∧ (a =⇒ act = σ)]}, with qo2 ∈ Qo1.

qo0 = {q0} ∈ Qo1 is the designated initial state.
Intuitively, given a state qo1 ∈ Qo1, player 1 chooses an action σ ∈ Σ1 and
hypothesizes that she can possibly arrive at any state within qo2 = δo(qo1, σ). Once
the transition takes place, she obtains an observation o1 and then filters the hypothesis
that she previously made. Since it is now the environment's turn, the environment
picks an action and, as a result, another observation o2 is obtained by the system.
Player 1, based on the new observation o2 and her previous hypothesis of states from
o1, determines the set of states that she can be in and then selects an action admissible
at her current state. It is assumed that for any qo ∈ Qo1 and for any pair q, q′ ∈ qo, the
set of available actions for player 1 is the same in both q and q′.
We illustrate the construction by a simple example in Fig. 5.1. Here, the states in
the same block have the same image under the observation function, e.g. obs(LB(0)) =
obs(LB(1)) = o1 ∈ O. When player 1 obtains observation o1, she is not certain which
state she is in: it can either be 0 or 1, and the only available action at these two states
is a. Player 1 knows that after taking a she will be in either 2 or 3 but is not sure
which one of the two she lands at before she takes action a. When player 1 takes action
a, and if she observes o2, then she is certain that the state is 2 and the previous state
was 0. Then player 2 may pick action b or c; either way, player 1 receives the same
observation o4, and thus she is again uncertain as to whether she is in 4 or 5.
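The belief update behind Definition 18 consists of two steps: an action update followed by an observation filter. The sketch below is a hypothetical Python encoding of these two steps, instantiated on the fragment of Fig. 5.1; the names delta and obs_of are illustrative assumptions.

```python
# Hypothetical belief-update sketch: `delta` encodes the transition dict
# (state, action) -> state, and `obs_of` maps each state to its observation.

def post_action(belief, action, delta):
    """After playing `action`, the belief is the set of possible successors."""
    return frozenset(delta[(q, action)] for q in belief if (q, action) in delta)

def filter_obs(belief, o, obs_of):
    """After observing o, keep only the states consistent with it."""
    return frozenset(q for q in belief if obs_of[q] == o)

# The fragment of Fig. 5.1: from belief {0, 1}, action a leads to {2, 3};
# observing o2 then resolves the belief to {2}.
delta = {(0, 'a'): 2, (1, 'a'): 3, (2, 'b'): 4, (2, 'c'): 5, (3, 'b'): 6}
obs_of = {0: 'o1', 1: 'o1', 2: 'o2', 3: 'o3', 4: 'o4', 5: 'o4', 6: 'o5'}

b = post_action({0, 1}, 'a', delta)   # frozenset({2, 3})
b = filter_obs(b, 'o2', obs_of)       # frozenset({2})
```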
A run ρ = q0q1q2 . . . qn ∈ Q∗ is related to a run ρo = qo0 qo1 . . . qom ∈ (Qo)∗,
m ≠ n, as follows: let qo0 = {q0}. If LB(q1) =⇒ act = σ, obs(LB(q1)) = o1, and
obs(LB(q2)) = o2, then qo1 = δo(qo0, σ), qo2 = δo(qo1, o1), qo3 = δo(qo2, o2), and so on.
(a) The game graph P .
(b) The observed game graph obsP .
Figure 5.1: A fragment of a game graph P and the observed game structure obsP .
Lemma 1. For two runs ρ1, ρ2 ∈ Q∗ (Qω), if ρ1 ≡ ρ2, then the corresponding runs
ρo1, ρo2 ∈ (Qo)∗ ((Qo)ω) satisfy ρo1 = ρo2.

Proof. Since ρ1 ≡ ρ2, obsR(ρ1) = obsR(ρ2) and the sequences of player 1's actions are
the same in both ρ1 and ρ2; hence the same string w ∈ (O ∪ Σ1)∗ (or (O ∪ Σ1)ω) is
generated by both ρ1 and ρ2. The runs in obsP generated by the same string w have to be the
same, i.e., ρo1 = ρo2.
For a given objective automaton As = 〈S, C, Ts, Is, Fs〉, with respect to the
observed game graph obsP, we define the following observed game:

obsG = obsP ⋉ As = 〈M,O ∪ Σ1,∆,m0, Fo〉

where the components are defined as follows.

M = M1 ∪ M2, where Mi = Qoi × S is the set of states for player i.

∆ : M × (O ∪ Σ1) → M is the transition function, with ∆(m, a) = m′ if one of the following conditions holds.
• a ∈ O, m = (qo1, s) ∈ M2, m′ = (qo2, s′) ∈ M1 ∪ M2, δo(qo1, a) = qo2 and Ts(s, c) = s′, where c ∈ C, a =⇒ c, and for any c′ such that Ts(s, c′) is defined, either c =⇒ c′ or c ∧ c′ = ⊥.
• a ∈ Σ1, m = (qo1, s) ∈ M1, m′ = (qo2, s′) ∈ M2, δo(qo1, a) = qo2 and s′ = s.

m0 = (qo0, Ts(s0, LB(q0))) is the initial state; it is assumed that player 1 knows the initial state q0 and the initial world state LB(q0).

Fo = {(qo, s) ∈ M | s ∈ Fs} is the winning condition. If As is a dfa, a run ρ ∈ M∗ is winning for player 1 if and only if last(ρ) ∈ Fo. If As is a dba, a run ρ ∈ Mω is winning for player 1 if and only if Inf(ρ) ∩ Fo ≠ ∅.

Definition 19. With reference to obsG = 〈M,O ∪ Σ1,∆,m0, Fo〉, a finite-memory
controller (i.e. finite-memory winning strategy) for player 1 is a deterministic Moore
machine
M = 〈M,m0,O ∪ Σ1,Σ1,∆,WS1〉
where M is a finite set of memory states, m0 ∈M is the initial memory state, O∪Σ1
is the alphabet, ∆ : M × (O ∪ Σ1) → M is the state update function (transition function),
and WS1 : M1 → Σ1 is the next-action function if the controller is deterministic, or
WS1 : M1 → D(Σ1), where D(Σ1) is the set of probability distributions over Σ1, if the controller
is randomized.
The semantics of this controller is as follows: after player 1 plays σ at memory
state m, she will receive two consecutive observations o1, o2 ∈ O, and the controller
updates to memory state m′ = ∆(m, σo1o2). In response, player 1 selects an action or a
distribution over available actions according to WS1(m′). The deterministic controller,
at memory state m, suggests action WS1(m) = σ ∈ Σ1. The randomized controller
assigns probability WS1(m)(σ) to selecting action σ (note that WS1(m) is a probability
distribution WS1(m) : Σ1 → [0, 1]). In what follows we compute the next-action
function WS1 to complete the controller M.
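The execution semantics just described can be sketched in a few lines, assuming dict encodings of ∆ and WS1; the names Delta, WS1, step and next_action are illustrative, not from the text.

```python
import random

# Sketch of executing the finite-memory controller of Definition 19:
# the memory advances on the played action and the two subsequent
# observations, and the next action is read (or sampled) from WS1.

def step(m, sigma, o1, o2, Delta):
    """Advance the memory state on the word sigma o1 o2."""
    for a in (sigma, o1, o2):
        m = Delta[(m, a)]
    return m

def next_action(m, WS1, rng=random):
    """WS1[m] is either an action (deterministic controller) or a dict
    action -> probability (randomized controller), sampled from."""
    choice = WS1[m]
    if isinstance(choice, dict):
        actions, weights = zip(*choice.items())
        return rng.choices(actions, weights=weights)[0]
    return choice

# Toy instance: after playing a and observing o1 then o2, the memory
# moves from m0 to m1, where the controller randomizes between b and c.
Delta = {('m0', 'a'): 'mA', ('mA', 'o1'): 'mB', ('mB', 'o2'): 'm1'}
WS1 = {'m0': 'a', 'm1': {'b': 0.5, 'c': 0.5}}
```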
Theorem 5. There exists an observation-based deterministic finite-memory controller
M = 〈M,m0,O ∪ Σ1,Σ1,∆,WS1〉 for player 1 in the game graph P with respect to
specification ϕ, if and only if there exists a memory-less deterministic winning strategy
WS1 : M1 → Σ1 for player 1 in the observed game obsG.
Proof. The proof follows from existing solutions to games with partial observations
[7, 23].
The game obsG is simply a deterministic game with complete information. We
can directly apply the result in [45] to obtain a deterministic memoryless winning
strategy WS1 : M1 → Σ1 for player 1, if one exists. We complete the synthesis
of a finite-memory controller by incorporating WS1 as the next-action function in M.
5.2.3 Randomized Finite-memory Controllers
Even if a deterministic finite-memory controller cannot be found, it is still pos-
sible that we can find a randomized finite-memory controller that ensures the specifi-
cation is met with probability 1. For this purpose, we augment (P, q0) with obsG to
obtain the following two-player game:
G = (P, q0) × obsG = 〈V,Σ, T, v0, F 〉

where the components are defined as follows.

V = V1 ∪ V2, where Vi ⊆ Qi × Mi is the set of states of player i.

T : V × Σ → V is the transition function, with T (v, σ) = v′ if one of the following conditions holds:
• v = (q,m) ∈ V1, δ(q, σ) = q′, ∆(m, σ obs(LB(q′))) = m′, and v′ = (q′,m′);
• v = (q,m) ∈ V2, δ(q, σ) = q′, ∆(m, obs(LB(q′))) = m′, and v′ = (q′,m′).

v0 = (q0,m0) is the initial game state.

F = {(q,m) | m ∈ Fo} is the winning condition.

If As is a dfa, the game is a reachability game. If As is a dba, the game is a Buchi
game.
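The two cases of the product transition T can be sketched as one small routine. The encodings below (transition dicts delta and Delta, observation map obs_lb, turn predicate turn1) are illustrative assumptions.

```python
# Sketch of the product transition T in G = (P, q0) x obsG: `delta` is
# P's transition dict, `Delta` is the Moore machine's, `obs_lb` maps a
# P-state to the observation of its label, `turn1` tests player 1's states.

def product_step(v, sigma, delta, Delta, obs_lb, turn1):
    q, m = v
    q_next = delta[(q, sigma)]
    if turn1(q):
        # player 1's move: the memory first reads the action sigma
        m = Delta[(m, sigma)]
    # in both cases the memory then reads the new state's observation
    m_next = Delta[(m, obs_lb[q_next])]
    return (q_next, m_next)

# Toy check: a player-1 move followed by a player-2 move.
delta = {('p', 'a'): 'r', ('r', 'b'): 's'}
Delta = {('m0', 'a'): 'm1', ('m1', 'oR'): 'm2', ('m2', 'oS'): 'm3'}
obs_lb = {'r': 'oR', 's': 'oS'}
turn1 = lambda q: q == 'p'
v = product_step(('p', 'm0'), 'a', delta, Delta, obs_lb, turn1)   # ('r', 'm2')
v = product_step(v, 'b', delta, Delta, obs_lb, turn1)             # ('s', 'm3')
```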
Given a pair of strategies S1, S2 for player 1 and 2 respectively, and an initial
state v ∈ V , the set of runs of the game generated by this pair of strategies is denoted
Out(v, (S1, S2)). A run is almost-sure winning for player 1 if it satisfies the task speci-
fication with probability 1. A strategy WS1 for player 1 is almost-sure winning in the
game starting with v0 ∈ V , if and only if for any strategy S2 of player 2, and for any
ρ ∈ Out(v0, (WS1, S2)), ρ is almost-sure winning for player 1. The set of states from
which there exists an almost-sure winning strategy for player 1 is referred to as the
almost-sure winning region of player 1.
Given a state v ∈ V , the projection of v onto the set of memory states M is π2(v).
Let the set of observation-equivalent game states of state v be [v] = {v′ ∈ V | π2(v′) =
π2(v)}. Intuitively, for two runs ρ, ρ′ ∈ V ∗ such that T (v0, ρ) = v1 and T (v0, ρ′) = v2,
if π2(v1) = π2(v2), then the system arrives at the same state in the Moore machine and
thus chooses the same action (or distribution over actions) according to the next-action
function. Hence, we introduce the definition of an observation-based memoryless
deterministic (resp. randomized) strategy WS1 : V1 → Σ1 (resp. WS1 : V1 → D(Σ1)) in
G, which is such that if [v] = [v′] for any two states v, v′ ∈ V1, then WS1(v) = WS1(v′).
5.2.3.1 Randomized Controllers for Reachability Objectives
Similar to the case of complete observation, where the sure-winning strategy for
player 1 is related to the concept of the attractor, the notion of almost-sure attractor
in the partial observation case can be used to define the almost-sure winning region of
player 1, with respect to both reachability and Buchi objectives.
Definition 20. Consider the game G = 〈V,Σ, T, v0, F 〉. The almost-sure attractor of
X ⊆ V , denoted ASAttr(X), is the set of states such that for any v ∈ ASAttr(X), there
exists an observation-based randomized strategy WS1 : V1 → D(Σ1) that ensures, for
any strategy S2 of player 2 and any ρ ∈ Out(v, (WS1, S2)), that ρ reaches a state in X
with probability 1. That is, for any player 2 strategy S2, (∀v ∈ ASAttr(X))[∀ρ ∈ Out(v, (WS1, S2)), Pr(last(ρ) ∈ X) = 1].
For a given set of states X, we provide a procedure for computing ASAttr(X).
1. First we introduce the function Allow : V1 × 2^V → 2^Σ1 defined by

Allow(v, Z) = {σ ∈ Σ1 | T (v, σ) ↓ ∧ T (v, σ) ∈ Z}

and set Allow([v], Z) = ⋂_{v′∈[v]} Allow(v′, Z). Intuitively, as long as the current
state v′ ∈ [v], an action in Allow([v], Z) will never lead to a state outside Z.

We introduce the function Spre : 2^V × 2^V → 2^V defined by

Spre(Z, Y ) = {v ∈ V1 | (∃σ ∈ Allow([v], Z))[T (v, σ) ∈ Y ]} ∪ {v ∈ V2 | (∀σ ∈ Σ : T (v, σ) ↓)[T (v, σ) ∈ Y ]}

Informally, Spre(Z, Y ) is the set of states satisfying the following conditions: for a
player 1 state in this set, there exists an allowed action for player 1 that leads to a
state in Y ; for a player 2 state in this set, any action of player 2 leads to a state
within Y .

2. Let Z0 = V and X0 = X, with j := 0, i := 0. Inductively,

Xi+1 = Xi ∪ Spre(Zj, Xi), repeated until i = i∗ ∈ N with Xi∗+1 = Xi∗;
then Zj+1 = Xi∗, repeated until j = j∗ ∈ N with Zj∗+1 = Zj∗ = Z.    (5.1)

The fixed point Z in (5.1) is the almost-sure attractor of X: Z = ASAttr(X). The
complexity of computing the almost-sure attractor for a given game automaton
G in this way is polynomial in the size of the game G.
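The two nested fixed points in (5.1) translate almost directly into code. The sketch below is a hypothetical Python encoding, assuming the game is given as explicit sets and dicts (V1, V2, a transition dict T, the action sets, and a map cls inducing the observation classes [v]); none of these names come from the text.

```python
# Sketch of the fixed-point computation (5.1) of ASAttr(X). Assumed
# encodings: V1, V2 are the player state sets, T is a dict
# (state, action) -> state, Sigma1 (resp. Sigma) is player 1's (resp.
# the full) action set, and cls maps a state to its observation class,
# so that [v] = {v' in V1 : cls[v'] == cls[v]}.

def allow_class(v, Z, V1, T, Sigma1, cls):
    """Allow([v], Z): actions keeping every state in [v] inside Z."""
    peers = [u for u in V1 if cls[u] == cls[v]]
    return {s for s in Sigma1
            if all((u, s) in T and T[(u, s)] in Z for u in peers)}

def spre(Z, Y, V1, V2, T, Sigma1, Sigma, cls):
    """Spre(Z, Y) as defined in the procedure above."""
    out = {v for v in V1
           if any(T[(v, s)] in Y
                  for s in allow_class(v, Z, V1, T, Sigma1, cls))}
    out |= {v for v in V2
            if all(T[(v, s)] in Y for s in Sigma if (v, s) in T)}
    return out

def as_attr(X, V1, V2, T, Sigma1, Sigma, cls):
    """Outer/inner fixed point of (5.1): returns Z = ASAttr(X)."""
    Z = V1 | V2
    while True:
        W = set(X)                  # inner iteration starts from X^0 = X
        while True:
            W_next = W | spre(Z, W, V1, V2, T, Sigma1, Sigma, cls)
            if W_next == W:
                break
            W = W_next
        if W == Z:                  # outer fixed point reached
            return Z
        Z = W
```

On a toy game, as_attr returns the whole state space when player 2 cannot avoid the target, and only the target itself when she has an escape move.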
Theorem 6. Given Z = ASAttr(X), define the memoryless randomized strategy WS1 : V1 → D(Σ1) by

WS1(v)(σ) = 1/|Allow([v], Z)| for each σ ∈ Allow([v], Z),

where WS1(v)(σ) is the probability of selecting action σ at state v.
That is, given v ∈ Z, player 1 picks σ ∈ Allow([v], Z) uniformly at random. By
adhering to WS1, from any state v ∈ ASAttr(X), player 1 can force a run in the game
G to reach X with probability 1.
Proof. Given Z, one can compute a list of sets of states W0, . . . ,Wk such that W0 = X
and, for 0 ≤ i < k, Wi+1 = Wi ∪ Spre(Z,Wi), where k = min{i | Wi = Wi+1}. This
sequence of sets is essentially the sequence X0, . . . , Xk in (5.1) with Zj = Z. Since Z
is the fixed point, Wk = Z. For each v ∈ Z \ W0, there exists a unique
ordinal i such that v ∈ Wi+1 \ Wi. By construction, if v ∈ V1 ∩ Z, then there exists σ ∈
Allow([v], Z) such that T (v, σ) ∈ Wi. If player 1 selects σ, she enters Wi and, inductively,
forces a run ρ with length |ρ| ≤ k such that last(ρ) ∈ W0 = X. With the
randomized strategy, however, σ is chosen with probability WS1(v)(σ) = 1/|Allow([v], Z)|. Hence, after
player 1 makes her move, the probability that v′ ∈ Wi is Pr(v′ ∈ Wi) = 1/|Allow([v], Z)| ≥ 1/n,
where n = max_{v∈V} |Allow([v], Z)|. When some σ′ ≠ σ is selected, T (v, σ′) ∈ Wj for some
0 ≤ j ≤ k. Note that T (v, σ′) ∉ V \ Z, because no allowed move of player 1 can
force the run out of Z.

Let Pr(v,♦k W0) denote the probability of reaching W0 from state v within k turns.
For any v ∈ Z and all strategies of player 2, since v ∈ Wi for some i ≤ k, when player
1 applies WS1 we have Pr(v,♦k W0) ≥ (1/n)^k > 0, and the probability of not reaching W0 within
k turns is at most 1 − (1/n)^k =: r. In this case, from the resulting game state v′, which is
in Wj for some 0 < j ≤ k, the probability of not reaching W0 within k turns is again at most r;
inductively, the probability of a path starting from v eventually reaching W0 is
Pr(v,♦W0) = lim_{ℓ→∞}(1 − r^ℓ) = 1, since r < 1.

Thus X is reached with probability 1 using WS1: for any v ∈ ASAttr(X) and any
player 2 strategy S2, the set of runs ρ ∈ Out(v, (WS1, S2)) satisfies Pr(last(ρ) ∈ X) =
1.
For the case when As is a dfa capturing a reachability objective, take the target
set of states X to be F . If v0 ∈ ASAttr(F ), then the computed strategy
WS1 : V1 → D(Σ1) is exactly the randomized finite-memory controller that ensures the
state set F is visited with probability 1.
Proposition 5. Let WS1 : V1 → D(Σ1) be the almost-sure winning strategy
for player 1 in G. For any two states v1 = (q1,m1), v2 = (q2,m2) ∈ V with
m1 = m2, it holds that WS1(v1) = WS1(v2).
Proof. For i = 1, 2, WS1(vi)(σ) = 1/|Allow([vi], Z)| for each σ ∈ Allow([vi], Z). If m1 = m2,
by the definition of observation-equivalence of states, we can infer [v1] = [v2]. Therefore,
WS1(v1) = WS1(v2).
Hence WS1 can also be defined by WS1 : M → D(Σ1). This discussion com-
pletes the description of the finite memory controller — the Moore machine M =
〈M,m0,O ∪ Σ1,Σ1,∆,WS1〉. During game-play, player 1 updates the memory state
using the transition function ∆ based on her observations, and at her turn, selects
an action according to the probability distribution given by WS1(m) where m is the
current memory state. By Theorem 6 we can ensure the task is accomplished with
probability 1. Note that WS1 is memoryless in G but the Moore machine M that
includes WS1 implements a finite-memory strategy.
5.2.3.2 Randomized Controllers for Buchi Objectives
For Buchi objectives, a run ρ is almost-sure winning for player 1 in G if and
only if Pr(ρ |= □♦F ) = 1; that is, the set of runs along which F is always reachable
but is visited only finitely many times is almost empty, i.e., has probability
measure 0. In other words, F is visited infinitely often with probability 1.
Theorem 7. In the two-player turn-based Buchi game G = 〈V,Σ, T, v0, F 〉, there exists
an observation-based randomized strategy WS1 : V1 → D(Σ1) which, if player 1 adheres
to it, wins G almost surely if and only if v0 ∈ ASWin1. Here ASWin1 is the winning
region of player 1, obtained as follows: let Z0 := V , j := 0, and define inductively

Xj = ASAttr(Zj), Y j = Spre(Xj, Xj), Zj+1 = Y j ∩ F,
repeated until j = j∗ ∈ N with Z = Zj∗+1 = Zj∗; then ASWin1 := ASAttr(Z).    (5.2)

The observation-based randomized strategy WS1 that ensures player 1 almost surely
wins the game is defined by

WS1(v)(σ) = 1/|Allow([v],ASWin1)| for each σ ∈ Allow([v],ASWin1).

That is, given state v ∈ ASWin1, player 1 picks σ ∈ Allow([v],ASWin1) uniformly at
random.
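A compact way to express the outer iteration (5.2) is to take the ASAttr and Spre operators as black-box parameters. In the sketch below, the helper functions reach and pre are stand-ins for those operators on a one-player toy; all names are illustrative assumptions.

```python
# Sketch of the fixed point (5.2): the game-specific ASAttr and Spre
# operators are passed in as functions.

def as_win_buchi(V, F, as_attr, spre):
    """Iterate Xj = ASAttr(Zj), Yj = Spre(Xj, Xj), Zj+1 = Yj & F
    until Zj stabilizes; return ASWin1 = ASAttr(Z)."""
    Z = set(V)
    while True:
        X = as_attr(Z)
        Y = spre(X, X)
        Z_next = Y & set(F)
        if Z_next == Z:
            return as_attr(Z)
        Z = Z_next

# Toy: a one-player cycle 0 -> 1 -> 0 with F = {1}; state 1 can always
# be revisited, so the whole cycle is almost-sure winning.
succ = {0: {1}, 1: {0}}

def reach(Z):                    # stands in for ASAttr on this toy
    W = set(Z)
    while True:
        W_next = W | {v for v in succ if succ[v] & W}
        if W_next == W:
            return W
        W = W_next

def pre(Z, Y):                   # stands in for Spre on this toy
    return {v for v in succ if succ[v] & Y and succ[v] <= Z}

print(as_win_buchi({0, 1}, {1}, reach, pre))  # {0, 1}
```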
Proof. Given the fixed point Z in (5.2), one can compute a list of sets of states
W0, . . . ,Wk where W0 = Z ⊆ F and, for 0 ≤ i < k, Wi+1 = Wi ∪ Spre(Z,Wi), with
k = min{i | Wi = Wi+1}. This sequence is essentially the sequence of sets X0, . . . , Xk
in (5.1) in the computation of ASAttr(Z). By construction, for every v ∈ V1 ∩ Z,
there exists σ ∈ Allow([v],ASWin1) such that T (v, σ) ∈ ASAttr(Z) and WS1(v)(σ) > 0.
In addition, for all v′ ∈ [v] with v′ ≠ v, T (v′, σ) ∈ ASAttr(Z) by the definition of
Allow([v],ASWin1) (note that ASWin1 = ASAttr(Z)). Intuitively, once v ∈ Z ⊆ F is
visited, by adhering to the randomized strategy WS1 player 1 is ensured to stay in
ASWin1. For a state v ∈ (V1 ∩ ASWin1) \ Z, one can identify an ordinal i ∈ N such
that v ∈ Wi+1 \ Wi and there exists σ ∈ Allow([v],ASWin1) such that T (v, σ) ∈ Wi.
Under the randomized strategy WS1, Pr(T (v, σ) ∈ Wi) = 1/|Allow([v],ASWin1)| ≥ 1/n, where
n = max_{v∈V} |Allow([v],ASWin1)|. As in the proof of Theorem 6, player 1 forces a
run into Z with probability 1 by adhering to WS1. Upon reaching Z, for all actions of
player 2 and all actions of player 1 indicated by WS1, the game reaches some state in
ASAttr(Z), from which player 1 again forces a run into Z with probability 1
by adhering to WS1. Hence, for all v ∈ ASAttr(Z), the probability of player 1 always
eventually visiting Z ⊆ F is 1.
Similar to the case of reachability objectives, by Proposition 5 we obtain WS1 :
M → D(Σ1) and complete the finite-memory controller, i.e., the Moore machine M, that
ensures the Buchi objective is satisfied with probability 1.
5.3 Discussions and Conclusions
In this chapter, we presented automatic synthesis methods for reactive systems
in the presence of incomplete information for temporal logic specifications. To construct
adaptive controllers in the presence of incomplete information, one future direction is
to develop an algorithm that identifies or learns a model for the unknown environment
with data obtained with limited sensing capability, and then to investigate how learning
can be incorporated into control synthesis with partial observations.
The extension of adaptive control synthesis to the partial observation case is
not straightforward. The challenge is that the data presentation obtained with
respect to the environment behavior is incomplete. So far, there are limited results on
grammatical inference with incomplete or noisy data presentations, and to the best of
our knowledge, methods that work with the definition of limited sensing modality
used in reactive synthesis have yet to be developed. Existing work [12, 20, 46] concentrates
for the most part on noisy data presentations, in which, for example, a string
that does not belong to the target language may be presented as positive data. Probably
approximately correct semantics have been extended [73] to the case of learning
concepts (Boolean functions) with incomplete data. On the synthesis
side, [22] established an equivalence relation between games with partial observations
and games with probabilistic uncertainty. This result indicates that the interaction
between the system and its environment under partial observations can be converted
into an equivalent partially observable Markov model. This makes it possible to apply
learning algorithms for hidden Markov models, or probabilistic deterministic finite-
state automata [10], to identify an equivalent game with probabilistic uncertainty from
the original game with partial observation.
Chapter 6
OUTLOOK: MULTI-AGENT SYSTEMS
6.1 Overview
In previous chapters, we analyzed synthesis problems for a single autonomous
system interacting with a hostile environment by formulating their interaction as a two-
player, zero-sum temporal logic game. In the case of multiple interacting autonomous
agents, where each has its own objective, the approach adopted earlier may no longer be
applicable or satisfactory. In the real world, there are many instances of multi-agent
interaction that are not adversarial. Sometimes, an agent exploits others'
capabilities towards its own objective.
In the literature, multi-agent control synthesis with respect to temporal logic speci-
fications is typically done for cooperative systems. Given a global objective, all agents
cooperate to achieve it [25, 100]. In this setting, control synthesis hinges on a task
decomposition problem: how to decompose the global goal into subtasks, in a way
that completion of these subtasks implies that the goal is achieved. In its most general
form, this is a standard problem in supervisory control [104] and concurrency the-
ory [75]. To this point, there is existing work [65] for systems with computation tree
logic (ctl) task specifications. When the global task is in the form of an ltl formula,
methods have been developed [24] to break up the global ltl specification into a set
of control and communication policies for each agent to follow. Centralized controllers
can also be synthesized in the face of modeling uncertainty [101]. Meanwhile, instances
where agents have their own temporal logic specifications and treat each other as part
of some dynamic uncontrollable environment have also been examined [63]. When
other agents can be modeled as stochastic processes, probabilistic verification methods
can be applied to control a single deterministic agent [106].
However, this line of work does not provide enough insight into the interaction
of multiple non-cooperative agents. The central problem in mechanism design [78] is
to motivate autonomous interacting agents by giving them the right incentives, so that
some desired behavior emerges as a stable expression of interaction between them. The
reason to consider decentralized control is that a centralized plan/control policy admits
a single point of failure. In contrast, a distributed design affords more robustness for
the entire multi-agent system.
Here we adopt the approach of rational synthesis [37], which poses the control
synthesis problem for this class of multi-agent systems inside a non-zero-sum game
theoretic framework (cf. [63]). In particular, we offer a game theoretic analysis for the
class of multi-agent systems in which each player is assigned an ω-regular objective
and has a preference over all possible outcomes of their interactions. The approach
builds on the availability of methods for extracting discrete abstractions in the form of
labeled transition systems from continuous or hybrid dynamical systems (see Chapter 2
and [13,92,93,109]).
Capturing the interactions of different agents with independent objectives in
the form of a non-cooperative, concurrent graph game [16], we develop a decision
process for the computation of pure Nash equilibria associated with specified outcomes.
The difference between our work and existing formulations [16] is that in the latter,
agents are assigned an ordered set of tasks, whereas in our formulation each agent
is assigned a single task and has preferences over different game outcomes. We also
analyze the case of ω-regular objectives and identify the conditions under which
the set of pure Nash equilibria can be computed in polynomial or linear time. In
cases where inter-agent communication [87,107] or learning [39] cannot be realized, we
propose a different solution concept based on security strategies, which ensure a utility
or performance bound for a particular agent irrespective of the other agents' strategies.
We finally present the decision procedure for this type of equilibrium in coalition
games, in which groups of agents are allowed to communicate and team up.
6.2 Preliminaries
A deterministic Muller automaton (dma) is an automaton A = 〈Q, Σ, T, I, Acc〉 where
the acceptance component is expressed as Acc = F ⊆ 2^Q, and the machine
accepts a word w ∈ Σ^ω if and only if the run ρ on that word satisfies ρ(0) ∈ I and
Inf(ρ) ∈ F. Given an sa A = 〈Q, Σ, T〉, a cycle ρ = ρ(0)ρ(1)ρ(2) . . . ρ(n) is a run in A
such that ρ(0) = ρ(n) and for all 0 ≤ i < j ≤ n − 1, ρ(i) ≠ ρ(j). We write Cycles(A) to
denote the set of all cycles in A.
The task specifications considered here are given in the form of formulas in the
monadic second-order logic of one successor (s1s). This logic is an extension of first-order logic,
with quantified variables denoting subsets of the considered relational structures; for
a precise definition of the syntax and semantics of s1s, see [79]. ltl is a fragment of
s1s, and Büchi's theorem [19] establishes the equivalence between an s1s formula
and a dma: for any s1s formula φ over a set of atomic propositions AP, there exists
a dma with alphabet Σ ⊆ 2^AP accepting exactly the set of infinite words over
Σ satisfying φ. It is known that s1s is strictly more expressive than ltl. In
addition, a nondeterministic Büchi automaton is more expressive than a deterministic
one, but no more expressive than a deterministic Muller automaton.
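Since the constructions below manipulate dmas throughout, it may help to see how Muller acceptance can be checked mechanically. The following sketch is illustrative only and uses a hypothetical encoding: a deterministic transition dict `delta`, an initial state `q0`, and an acceptance table `table` given as a set of frozensets of states. It decides acceptance of an ultimately periodic word of the form prefix·loop^ω by computing Inf(ρ) directly (the loop must be nonempty).

```python
def run_word(delta, state, word):
    """Apply the transition function delta along a finite word.

    Returns the list of states entered and the final state reached."""
    visited = []
    for sym in word:
        state = delta[(state, sym)]
        visited.append(state)
    return visited, state

def muller_accepts(delta, q0, table, prefix, loop):
    """Check whether the dma accepts the ultimately periodic word prefix . loop^omega."""
    _, state = run_word(delta, q0, prefix)
    seen = {}                # state at a loop boundary -> index of that boundary
    boundary_states = []
    while state not in seen:             # determinism guarantees termination
        seen[state] = len(boundary_states)
        boundary_states.append(state)
        _, state = run_word(delta, state, loop)
    # Boundaries from seen[state] onward repeat forever; the states visited
    # inside those repeating loop traversals are exactly Inf(rho).
    inf_states = set()
    for s in boundary_states[seen[state]:]:
        visited, _ = run_word(delta, s, loop)
        inf_states |= set(visited)
    return frozenset(inf_states) in table
```

The key step is that a deterministic automaton reading a periodic suffix must eventually cycle through a fixed sequence of loop-boundary states, so Inf(ρ) is finite to compute.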
6.3 Modeling a Multi-agent Game
6.3.1 Constructing the Game Arena
We capture the interaction between autonomous, independent agents as a con-
current game. Similar to Chapters 4 and 5, let AP be a set of atomic propositions
and C be the set of world states (conjunctions of literals over AP). For each agent we
have a model in the form of a tuple Ai = 〈Qi, Σi, δi, APi, LBi〉. The conditional effect
of an action σ ∈ Σi is captured by its pre- and post-conditions, denoted Pre(σ) and
Post(σ), respectively. Whenever δi(q, σ) ↓, we have LBi(q) =⇒ Pre(σ); similarly,
when we observe a transition from q to q′ on action σ, compactly expressed as q −σ→ q′,
it has to hold that LBi(q′) =⇒ Post(σ).
We capture the concurrent interaction between agents by means of the following
construction.
Definition 21 (Concurrent product). For a set of agents Ai, with i ∈ Π = {1, 2, . . . , N},
their concurrent product is a tuple A1 A2 · · · AN = 〈Q, ACT , T, LB〉 where

Q ⊆ Q1 × Q2 × · · · × QN the set of states.
ACT = Σ1 × · · · × ΣN the alphabet. α
LB : Q → C the labeling function. β
T : Q × ACT → Q the transition function. γ

α Each a = (a1, a2, . . . , aN) ∈ ACT is an action profile, encoding the actions played
by all agents simultaneously.

β Given q = (q1, . . . , qN) ∈ Q, LB(q) = ∧i∈Π LBi(qi) is a logical sentence which is true
at state q.

γ Given q = (q1, . . . , qN) and a = (a1, . . . , aN) ∈ ACT , we have

T (q, a) = T((q1, . . . , qN), (a1, . . . , aN)) = (q′1, . . . , q′N)

provided that for all i ∈ Π, (i) q′i = δi(qi, ai), and (ii) LB(q) =⇒ Pre(ai).
The arena of the game expresses all possible interactions between agents, and
is captured by the concurrent product of Definition 21. The arena A1 A2 . . . AN is
itself an sa, which we denote P = 〈Q,ACT , T 〉. The arena does not incorporate the
agents’ objectives: it just describes what they can do.
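The construction of Definition 21 can be sketched in code. This is a minimal illustration under a simplifying assumption: labels and preconditions are modeled as sets of atomic propositions, and the entailment LB(q) =⇒ Pre(ai) becomes set containment. The names `agents`, `labels`, and `pre` are hypothetical, not from the thesis.

```python
from itertools import product

def concurrent_product(agents, labels, pre):
    """Transition map of the concurrent product (cf. Definition 21).

    agents: list of per-agent transition dicts delta_i, (state, action) -> state
    labels: function, joint state tuple -> set of true atomic propositions LB(q)
    pre:    function, action -> set of propositions the action requires (Pre)
    """
    states = [set(s for s, _ in d) | set(d.values()) for d in agents]
    actions = [set(a for _, a in d) for d in agents]
    T = {}
    for q in product(*states):
        for a in product(*actions):
            # (i) every local transition must be defined, and
            # (ii) the joint label must entail each action's precondition
            if all((qi, ai) in d for qi, ai, d in zip(q, a, agents)) and \
               all(pre(ai) <= labels(q) for ai in a):
                T[(q, a)] = tuple(d[(qi, ai)] for qi, ai, d in zip(q, a, agents))
    return T
```

The joint transition is defined exactly when every agent can move locally and every action's precondition is entailed by the joint label, mirroring conditions (i) and (ii) of the definition.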
6.3.2 Specification Language
The objective of an agent is given as an s1s formula, which is translated into a
language over 2^AP. Using the labeling function LB, the objective is translated into a
language Ωi over C, accepted by a total dma Ai = 〈Si, C, Ti, Ii, Fi〉 with a sink state
sink ∈ Si and Lω(Ai) = Ωi. The set of all agents' objectives is denoted {Ωi}i∈Π, and
the collection of deterministic Muller automata that capture these objectives is written {Ai}i∈Π.
6.3.3 The Game Formulation
The concurrent game of a set of agents indexed in Π, with objectives {Ωi}i∈Π on
the arena P, is denoted G = 〈Π, P, Mov, {Ωi}i∈Π〉. In this tuple, Mov : Q × Π → 2^Σ,
where Σ = ∪i∈Π Σi, is a set-valued map which, for state q ∈ Q and agent i ∈ Π, outputs
the set of actions available to agent i at state q. Formally, we write Mov(q, i) = {a[i] ∈
Σi | T (q, a) ↓}, where T is the transition map in P. An initialized game (G, q(0)) is the
game G with a specified initial state q(0) ∈ Q. An initialized arena (P, q(0)) is defined
in a similar fashion: it corresponds to the sa P with a designated initial state q(0).
A play p = q(0)a(0)q(1)a(1)q(2)a(2) . . . in (G, q(0)) is an interleaving sequence of
states and action profiles such that for all i ≥ 0 we have T(q(i), a(i)) = q(i+1). A
run ρ = q(0)q(1) . . . is the projection of a play p onto the set of states. A deterministic
strategy for agent i in (G, q(0)) is a map fi : Q^∗ → Σi such that for every
ρ = q(0)q(1) . . . ∈ Q^∗ and every k ≥ 1, fi(Pr=k(ρ)) ∈ Mov(q(k−1), i). A deterministic
strategy profile f = (f1, . . . , fN) is a tuple of strategies, with fi being the strategy of
agent i. The set of all strategy profiles is denoted SP. We consider only deterministic
strategies. We say that a run ρ is compatible with a strategy profile f = (f1, . . . , fN)
if it is produced when every agent i adheres to strategy fi. In a particular game
(G, q(0)), the set of runs that are compatible with strategy profile f is the outcome of
the game for this strategy profile, and is denoted Out(q(0), f). Thus, outcomes are all
possible game plays that can result from the application of a specific strategy profile.

The payoff of agent i is given by a function ui : Q × SP → {0, 1} defined as
ui(q, f) = 1 if and only if for all ρ ∈ Out(q, f), LB(ρ) ∈ Ωi. The payoff vector is the
tuple made of the payoffs of all agents: u(q, f) = (u1(q, f), . . . , uN(q, f)); we then say
that f yields the payoff vector u(q, f). In game (G, q(0)), the set of all possible payoff
vectors is denoted PV = ⋃_{f∈SP} u(q(0), f).
Definition 22. A preference relation for agent i is a partial order ≼i defined over PV:
for u1, u2 ∈ PV, if u1 ≼i u2, then agent i either prefers a strategy profile f2 with
which u(q(0), f2) = u2 over a strategy profile f1 with which u(q(0), f1) = u1, or is at
least indifferent between f1 and f2.

We say agent i is indifferent between payoff vectors u1, u2 whenever u1 ≼i u2
and u2 ≼i u1. In this case we write u1 ≃i u2.
With the game formulated and the agents' objectives and interests defined, we
proceed with the game theoretic analysis.
6.4 Game Theoretic Analysis
For games that arise as models for interaction in multi-agent systems that consist
of self-interested agents with objectives expressed in the form of an s1s formula, this
section defines two solution concepts: one in the form of a pure Nash equilibrium, and
another in the form of a security strategy.
The former expresses a solution, a tuple of strategies of all agents, that is stable
in the following sense: if each player behaves rationally, keeping its own interests in
mind when deciding its actions, then no unilateral deviation from this solution makes
sense, because the deviating agent cannot be better off. Such solutions are
essentially pure Nash equilibria, which emerge in cases of interaction between intelligent
autonomous agents. The second solution concept takes the conservative view of a player
who has no knowledge of the others' objectives or levels of rationality; what she tries to
do, therefore, is to find the particular behavior that minimizes her losses under the
worst, arguably irrational, possible scenario.
6.4.1 Pure Nash Equilibria
First we define the notion of pure Nash equilibrium in the multi-agent concurrent
games with individual temporal logic objectives and preference orderings.
Definition 23 (Pure Nash equilibrium). A deterministic strategy profile f is a pure
Nash equilibrium in an initialized multi-agent non-cooperative game (G, q(0)) if, for
any agent i ∈ Π, any strategy profile f′ obtained by i unilaterally deviating from f
satisfies u(q(0), f′) ≼i u(q(0), f).
Since we consider only pure Nash equilibria, we will from now on refer to them
simply as equilibria. Following [16], we employ an alternative procedure that
directly computes a set of pure equilibria for a multi-agent game in this class, by
answering the following question:

Problem 5. For a payoff vector u ∈ PV ⊆ {0, 1}^N, is there an equilibrium f in the
game (G, q(0)) such that u(q(0), f) = u?
The decision procedure presented below differs from alternative solutions [16] in
that each agent ranks the set of outcomes based on explicit preference relations and has
a single ω-regular objective. The outline of the proposed process is as follows. First,
the concurrent multi-player game G is used to construct a zero-sum two-player
turn-based game H between two fictitious players, I and II. This is done incrementally
in two steps: we first form a factor arena H from the arena P of G, and based on H, we
then construct the full arena H̄ of the two-player game using the synchronized product,
which incorporates all player objectives {Ωi}i∈Π. The synchronized product operation,
defined more generally in the context of transition systems [8], is particularized here
for the case of automata:
Definition 24 (cf. [8]). Given a set of automata Ai = 〈Qi, Σi, Ti, q_i^(0), Fi〉, with 1 ≤
i ≤ n, their synchronized product is an automaton expressed as

A1 ⋉ A2 ⋉ · · · ⋉ An = 〈∏_{i=1}^n Qi, ⋃_{i=1}^n Σi, T, (q_1^(0), . . . , q_n^(0)), ∏_{i=1}^n Fi〉

where the transition relation T is defined as follows: for q = (q1, q2, . . . , qn),
T (q, σ) = (q′1, . . . , q′n), where q′i = Ti(qi, σ) if Ti(qi, σ) is defined, and otherwise q′i = qi.
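The synchronized product of Definition 24 can be sketched directly: a component moves on a shared symbol when its local transition is defined, and stays put otherwise. This is an illustrative sketch; the encoding of each automaton as a dict with hypothetical keys `transitions`, `initial`, and `alphabet` is an assumption.

```python
def synchronized_product(automata):
    """Reachable part of the synchronized product (cf. Definition 24).

    Each automaton is a dict with keys:
      "transitions": dict, (state, symbol) -> state
      "initial":     initial state
      "alphabet":    set of symbols
    """
    init = tuple(a["initial"] for a in automata)
    alphabet = set().union(*(a["alphabet"] for a in automata))
    T = {}
    frontier, seen = [init], {init}
    while frontier:
        q = frontier.pop()
        for sym in alphabet:
            q_next = tuple(
                a["transitions"].get((qi, sym), qi)   # move if defined, else stay
                for qi, a in zip(q, automata)
            )
            T[(q, sym)] = q_next
            if q_next not in seen:
                seen.add(q_next)
                frontier.append(q_next)
    return T, init
```

Only the reachable joint states are built, which is usually what one wants in practice rather than the full Cartesian product.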
The next step in our procedure (Proposition 6) is to show that all cycles in
the arena H̄ of game H are produced by a subset of agents adhering to a sequence of
action profiles that no one deviates from. This is important because accepting runs
in a dma are always cycles.¹

¹ In a dma A = 〈Q, Σ, T, I, F〉, the table F is a subset of Cycles(A) [67]: for each
infinite run ρ, there exists a cycle C such that Inf(ρ) = C.

Then we show (Proposition 7) that all possible payoff
vectors which result from adopting some strategy profile f in G can be enumerated by
looking at the cycles in H̄. With this at hand, Theorem 8 characterizes the equilibria
in G in the form of particular winning strategies for player I in H, where the winning
conditions are defined with respect to given payoff vectors. These winning plays, which
can be computed using existing methods [76], correspond to (pure Nash) equilibria in
G. We thus have a direct way to determine the set of equilibria for the multi-agent
concurrent game in this class.
The starting point in our process is the same as in existing solutions [16]: we
define a set of suspect agents. Suppose that in G, T(q, a) = q′. For an action profile
b, the set of suspect agents [16] triggering a transition from q to q′ is

Susp((q, q′), b) = {k ∈ Π | ∃σ ∈ Mov(q, k), b[k ↦ σ] = a ∧ T (q, a) = q′} ,

where b[k ↦ σ] is the action profile obtained from b when agent k decides unilaterally
to play some action σ instead of b[k]. If, for example, b = (b1, b2, . . . , bN), then
b[k ↦ σ] = (b1, . . . , bk−1, σ, bk+1, . . . , bN).

In this context, agent k is suspected to have triggered a transition from q to q′
if her unilateral deviation b[k ↦ σ] = a suffices to initiate the transition T (q, a) = q′.
Naturally, when a = b, Susp((q, q′), b) = Π.
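The suspect set is a direct computation over unilateral deviations. A minimal sketch, with hypothetical names: `moves(q, k)` returns the actions available to agent k at q, and `T` is the concurrent transition map.

```python
def susp(q, q_next, b, moves, T):
    """Agents suspected of triggering the transition q -> q_next under profile b.

    moves(q, k): iterable of actions available to agent k at state q
    T: dict mapping (state, action_profile) -> state
    """
    suspects = set()
    for k in range(len(b)):
        for sigma in moves(q, k):
            a = b[:k] + (sigma,) + b[k + 1:]   # the deviation b[k -> sigma]
            if T.get((q, a)) == q_next:
                suspects.add(k)
                break
    return suspects
```

Note that when the prescribed profile b itself triggers the transition, every agent's trivial "deviation" to its own prescribed action matches, so the suspect set is all of Π, as the text observes.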
To solve Problem 5, the multi-agent concurrent arena P = 〈Q, ACT , T 〉 is trans-
formed into the arena of a two-player turn-based game [16], with two fictitious players:
player I and player II. The two-player turn-based arena factor is a semiautomaton
H = 〈V, ACT ∪ Q, Th〉, with components defined as follows:

V = VI ∪ VII the set of states. α
ACT ∪ Q the alphabet. β
Th the transition relation. γ

α VI ⊆ Q × 2^Π, and VII ⊆ Q × 2^Π × ACT .

β ACT = Σ1 × · · · × ΣN represents the available moves for player I, and Q the moves
for player II.

γ Given v ∈ V , either

v = (q, X) ∈ VI , and for any a ∈ ACT with T (q, a) ↓, Th(v, a) := (q, X, a) ∈ VII ; or

v = (q, X, a) ∈ VII , and for any q′ ∈ Q with X ′ = X ∩ Susp((q, q′), a) ≠ ∅,
Th(v, q′) := (q′, X ′) ∈ VI .
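The factor construction above can be sketched as a forward exploration from the initial player-I state. This is an illustration under assumed encodings: player-I states are pairs (q, X), player-II states are triples (q, X, a), and `susp_fn` is assumed given (e.g., computed as in the Susp definition).

```python
def turn_based_factor(q0, Pi, T, susp_fn):
    """Build the two-player turn-based factor H from the concurrent arena.

    T: dict, (state, action_profile) -> state
    Pi: frozenset of agent indices (the full suspect set initially)
    susp_fn(q, q2, a): suspect set for the transition q -> q2 under profile a
    """
    v0 = (q0, Pi)
    Th, frontier, seen = {}, [v0], {v0}
    while frontier:
        v = frontier.pop()
        if len(v) == 2:                      # player I state: pick an action profile
            q, X = v
            for (q_src, a) in T:
                if q_src == q:
                    w = (q, X, a)
                    Th[(v, a)] = w
                    if w not in seen:
                        seen.add(w); frontier.append(w)
        else:                                # player II state: pick a successor state
            q, X, a = v
            for q2 in {dst for (q_src, _), dst in T.items() if q_src == q}:
                X2 = X & susp_fn(q, q2, a)
                if X2:                       # keep only nonempty suspect sets
                    w = (q2, X2)
                    Th[(v, q2)] = w
                    if w not in seen:
                        seen.add(w); frontier.append(w)
    return Th, v0
```

Player II's move to a state q2 shrinks the suspect set by intersection, which is exactly the mechanism Proposition 6 below exploits.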
In this two-player game the fictitious players alternate: at each turn, one picks a
state in the original game as its move, and the other picks an action profile as its move.
For an initialized concurrent game (G, q(0)), the initial condition for the associated
two-player game is v(0) = (q(0), Π), i.e., a pair consisting of the initial game state q(0)
in G and the set of player indices Π. The initial condition of an objective automaton
Ai is s_i^(0) = Ti(Ii, LB(q(0))), and expresses the degree to which the objective of player
i is satisfied when the game is initialized. We write (Ai, s_i^(0)) to emphasize that the
automaton Ai has initial state s_i^(0). The objectives {Ωi}i∈Π of the concurrent game
(G, q(0)) are incorporated into the initialized two-player arena (H, v(0)) to yield the
two-player game using the synchronized product:

H̄ = (H, v(0)) ⋉ (A1, T1(I1, π1(v(0)))) ⋉ · · · ⋉ (AN, TN(IN, π1(v(0)))) = 〈V̄, ACT ∪ Q, T̄, v̄(0)〉 , (6.1)

with the understanding that the projection operator π1 singles out the first component
in v(0) and gives π1(v(0)) = q(0), while the remaining components are described as

V̄ = V̄I ∪ V̄II the set of states. α
ACT , Q the sets of actions for players I and II, respectively.
T̄ the transition function. β
v̄(0) ∈ V̄I the initial state. γ
α V̄I = VI × S1 × · · · × SN, with Si the states of Ai, are the states where player I
takes a transition; V̄II = VII × S1 × · · · × SN are the states where player II moves.

β Given a state v̄ = (v, s1, . . . , sN):
if v ∈ VI and σ ∈ ACT , then T̄(v̄, σ) := (v′, s1, . . . , sN) provided that v′ = Th(v, σ);
if, on the other hand, v ∈ VII and σ ∈ Q, then T̄(v̄, σ) := (v′, s′1, . . . , s′N), provided
that v′ = Th(v, σ) and for each i ∈ Π it is s′i = Ti(si, LB(σ)).

γ It is the tuple (v(0), s_1^(0), . . . , s_N^(0)), where for each i ∈ Π, s_i^(0) = Ti(Ii, LB(π1(v(0)))) =
Ti(Ii, LB(q(0))). It is assumed that player I moves first.
Consider an arbitrary state of the two-player game v̄ = (v, s), and note that v
can be either in VI or in VII. In the first case we have (v, s) = ((q, X), s) with q ∈ Q
and X ∈ 2^Π, while in the second, (v, s) = ((q, X, a), s) with a ∈ ACT . Now applying
the projection twice, we get π1(π1(v̄)) = π1(v) = q as the state of the underlying
multi-agent arena P associated to v̄, and π2(π1(v̄)) = π2(v) = X as the set of agents
included in the expression of v̄ ∈ V̄. To directly distinguish the images of these πi ◦ πj
compositions, we use the more intuitive notation Agt := π2 ◦ π1 and State := π1 ◦ π1;
thus Agt(v̄) is the set of agents in v̄, and State(v̄) is the state in Q encoded in v̄. We say
that player II follows player I on a run ρ = v̄(0)v̄(1) . . . ∈ V̄^ω if for all i ≥ 0, Agt(v̄(i)) =
Agt(v̄(i+1)) = Π. Given ρ = v̄(0)v̄(1) . . . ∈ V̄^ω, let Agt(ρ) = Agt(v̄(0))Agt(v̄(1)) . . . and
State(ρ) = State(v̄(0))State(v̄(1)) . . ..
Problem 5 requires us to find the exact set of possible payoff vectors PV in a
game (G, q(0)). The following two propositions provide the answer to this question.
Proposition 6. Given a cycle C ∈ Cycles(H̄), for each pair (v̄1, v̄2) ∈ C × C we
have Agt(v̄1) = Agt(v̄2).

Proof. For any transition from v̄ ∈ V̄II where Agt(v̄) = X, the destination state
v̄′ ∈ V̄I contains the intersection of X with the set of suspect agents; hence Agt(v̄′) ⊆
Agt(v̄). Given a run ρ ∈ V̄^ω, for all i ≥ 0 it holds that Agt(ρ(i+1)) ⊆ Agt(ρ(i)); in other
words, the set of agents along a run cannot grow, due to the intersection taken
with the set of suspect agents. Thus in a cycle C ∈ Cycles(H̄), for v̄1, v̄2 ∈ C there
exists one path within C from v̄1 to v̄2, and another from v̄2 to v̄1. From the first
we infer Agt(v̄2) ⊆ Agt(v̄1), and from the latter Agt(v̄1) ⊆ Agt(v̄2). It follows that
Agt(v̄1) = Agt(v̄2).

Proposition 6 implies that for any state v̄ of a cycle C ∈ Cycles(H̄), Agt(v̄) is
invariant. This set is thus a feature, a special property of the particular cycle, and for
this reason we denote it Agt(C).
Proposition 7. For an initialized game (G, q(0)), the set of all possible payoff vectors is
PV = ⋃_{C∈Cycles(H̄)} u(C) ⊆ {0, 1}^N, where u(C) = (u1(C), . . . , uN(C)) is a tuple defined
as follows: for i = 1, . . . , N, ui(C) = 1 if {s[i] | (v, s) ∈ C} ∈ Fi, i.e., the acceptance
table of dma Ai, and ui(C) = 0 otherwise.
Proof. It suffices to show that given an initial state q(0) ∈ Q and any strategy
profile f, there exists a cycle C ∈ Cycles(H̄) such that for all i ∈ Π, ui(C) = ui(q(0), f).
Let r = q(0)q(1)q(2) . . . be a run compatible with f, generated by the input word
w = a(0)a(1) . . . ∈ ACT ^ω. By construction, there exists a run ρ in H̄ such that
State(ρ) = r and for all i ≥ 0, Agt(ρ(i)) = Π. Let τ ∈ S_j^ω be the run in Aj generated by
r. Since r, ρ and τ are related as shown in Fig. 6.1, we have Inf(τ) = {s[j] | (v, s) ∈ Inf(ρ)}.
[Figure 6.1 depicts the three runs aligned: each transition q(i) −a(i)→ q(i+1) of r in G
corresponds in H̄ to the pair of moves ((q(i), Π), s(i+1)) −a(i)→ ((q(i), Π, a(i)), s(i+1))
−q(i+1)→ ((q(i+1), Π), s(i+2)), and in Aj to the transition s(i+1)[j] −q(i+1)→ s(i+2)[j],
starting from Ij −q(0)→ s(1)[j].]

Figure 6.1: Relating the run r in the multi-agent game G, the run ρ in the two-player
turn-based game graph H̄, and the run τ in the objective automaton Aj.
If C ∈ Cycles(H̄) is such that Inf(ρ) = C, then Inf(τ) = {s[j] | (v, s) ∈ C}. As defined in
the statement of the proposition, uj(C) = 1 if and only if {s[j] | (v, s) ∈ C} ∈ Fj, and we
directly get that in this case Inf(τ) ∈ Fj. Now if Inf(τ) ∈ Fj then uj(q(0), f) = 1,
which means that uj(C) = uj(q(0), f). The proof is complete since f was selected
arbitrarily.
So payoff vectors are associated to cycles. For a specific payoff vector, we define
the winning condition that completes the definition of the two-player game, through
a set-valued objective function. The objective function OBJ is a map
from the set of payoff vectors PV to subsets of cycles in Cycles(H̄), such that all cycles
C in the image OBJ(u), for some u ∈ PV, are associated to payoff vectors which
the players in Agt(C) do not prefer over u. Formally, this is expressed as

OBJ(u) ≜ {C ∈ Cycles(H̄) | ∀i ∈ Agt(C), u(C) ≼i u} . (6.2)

The objective function allows us to complete the description of the two-player
Muller game. For each payoff vector u, the Muller game H(u) is the one played on
arena H̄, in which the objective of fictitious player I is OBJ(u). (Since H(u) is a zero-sum
game, player II wins at Cycles(H̄) \ OBJ(u).) Therefore,

H(u) = 〈 V̄I ∪ V̄II, ACT ∪ Q, T̄, v̄(0), (OBJ(u), Cycles(H̄) \ OBJ(u)) 〉 .
As it turns out, given u ∈ PV, whether an equilibrium yielding u exists can be
determined constructively by computing the winning strategy of player I in the Muller
game H(u):

Theorem 8. Given a payoff vector u ∈ PV in game (G, q(0)), there exists an equilib-
rium f such that u(q(0), f) = u if and only if the following two conditions are satisfied
(in the order given):

1. there exists a winning strategy for player I in H(u), and

2. there exists a run ρ ∈ V̄^ω in H(u) with ρ(0) = v̄(0) for which Inf(ρ) ∩ {C ∈
OBJ(u) | u(C) = u} ≠ ∅, and for all i ≥ 0, ρ(i) ∈ WinI and Agt(ρ(i)) = Π.
Proof. Let WSI be the winning strategy of (fictitious) player I given its Muller
objective OBJ(u). Let C be a cycle in OBJ(u) that satisfies condition 2 of the theorem,
namely u(C) = u, Agt(C) = Π, and there exists a run ρ in H̄ that starts at v̄(0), never
leaves the winning region of player I, and visits C infinitely often. For an odd number
m ≥ 1, we obtain a finite prefix of ρ of the form Pr=m(ρ) = ρ(0)ρ(1) . . . ρ(m−1). At
the last state in this prefix, ρ(m−1) = ((q, Π), s), player I can make a move a ∈
WSI(Pr=m(ρ)) and reach state ρ(m) = T̄(ρ(m−1), a) = ((q, Π, a), s). If every agent
in Π adheres to action profile a, it means that player II selects action q^o = T (q, a),
and the next state in H(u) becomes ρ(m+1) = ((q^o, Π), s′), with s′[i] = Ti(s[i], LB(q^o))
for all i ∈ Π. However, if any agent in Π (say j) deviates and unilaterally changes its
action, the action profile in G changes from a to a[j ↦ σ] = a′, with T (q, a′) = q′.
In H(u), this amounts to player II selecting some action q′ ≠ q^o, which brings H(u) to
state ((q′, Π ∩ Susp((q, q′), a)), s′′) ≠ ρ(m+1). Agent j is now suspected of triggering the
transition from q to q′: j ∈ Susp((q, q′), a).

Since by assumption ρ(m) ∈ WinI, any action of fictitious player II (including
q′) still keeps the game in the winning region WinI of fictitious player I. If player
I adheres to WSI, game H(u) will still end up in some C′ ∈ OBJ(u). Since for all
i ∈ Agt(C′) we know that u(C′) ≼i u from the definition of the objective function,
the suspect agent j ∈ Susp((q, q′), a), who is in Agt(C′), will not be happier with the
outcome of her unilateral deviation once she compares it with the outcome resulting
from sticking to the equilibrium: based on her preference relation, the payoff vector
obtained by visiting C′ infinitely often is not ranked higher than the one obtained by
visiting C infinitely often.

Since m and j are selected arbitrarily, by Definition 23 there must exist an
equilibrium f associated with payoff vector u. The strategy profile f can be computed
from ρ: if w is the ω-word generating the run ρ, f is the projection of w on the set of
action profiles. In this way, u(q(0), f) = u.
An equilibrium corresponds essentially to an optimal strategy profile (note that
different equilibria are not comparable) for a group of agents with independent objec-
tives and preferences over outcomes.
Lemma 2. If there exists u ∈ PV such that for every agent i ∈ Π the payoff vector u
ranks highest in preference, that is, for any u′ ∈ PV \ {u} and any i ∈ Π, u′ ≼i u,
then there exists a pure Nash equilibrium that yields u.

Proof. Suppose such a u ∈ PV exists. By Proposition 7, for any
C ∈ Cycles(H̄) and any i ∈ Agt(C), it is u(C) ≼i u. From (6.2), it follows that
OBJ(u) = Cycles(H̄). In H(u), player I has objective F1 = OBJ(u) = Cycles(H̄) and
player II strives for the opposite, F2 = Cycles(H̄) \ OBJ(u), which is empty. Because of
this, player I always wins, since every play can only end up visiting cycles in Cycles(H̄)
infinitely often. (For the computation of the equilibrium f, see the last part of the
proof of Theorem 8; in this case WinI includes all states in H(u).)

An equilibrium as in Lemma 2 does not always exist. As a result, the quest for
equilibria in scenarios of multi-agent concurrent interaction cannot always be reduced
to a global optimization problem.
6.4.2 Special Cases — Büchi and Reachability Objectives

In the previous section, the equilibrium or the deterministic security strategy is
found by solving a two-player turn-based Muller game. In general, this operation has
computational complexity O(3^n) [35], where n is the number of game states. In what
follows, we analyze two interesting special cases in which the computational complexity
is significantly smaller: the cases where an objective is defined in the form of a
dba or a dfa. Then the set of equilibria (and security strategies, for that matter) can be
found in time polynomial and linear in the size of the two-player game arena, respectively.
We do not have to turn all types of games into Muller games.
6.4.2.1 Deterministic Büchi Games
Objectives expressed using dbas can also be defined using dmas. In principle,
one can convert a dba to a dma² and apply the process described in the previous
sections. However, the additional structure of the deterministic Büchi machine allows
for tangible computational gains when searching for equilibria. For the Büchi objective
Ωi of agent i, there is a dba Ai = 〈Si, C, Ti, Ii, Fi〉 that accepts exactly those runs that
satisfy Ωi. With respect to the multi-agent game (G, q(0)), the arena H̄ of the two-player
turn-based game is constructed using the methods of Section 6.4.1:

H̄ = (H, v(0)) ⋉ (A1, T1(I1, q(0))) ⋉ · · · ⋉ (AN, TN(IN, q(0))) = 〈V̄, ACT ∪ Q, T̄, v̄(0)〉 ,

where V̄ = V̄I ∪ V̄II. For each v̄ = (v, s) ∈ V̄, a payoff vector u(v̄) = (u1(v̄), . . . , uN(v̄)) is
computed such that ui(v̄) = 1 if s[i] ∈ Fi, and ui(v̄) = 0 otherwise. The set of payoff
vectors is PV = ⋃_{v̄∈V̄} u(v̄). The next proposition states that we can determine whether
there exists an equilibrium associated with a given u in the multi-player concurrent
Büchi game (G, q(0)) by solving a two-player turn-based Büchi game H(u). This is
significant because the equilibria in the latter can be obtained in time polynomial in
the size (i.e., the number of game states and transitions) of the Büchi game H(u).

² A dba with acceptance component Acc = F can be converted into a dma by defining
the acceptance component of the latter as Acc′ = F = {S ⊆ Q | S ∩ F ≠ ∅}.
Proposition 8. Given the initialized game (G, q(0)) and a payoff vector u ∈ PV, a
two-player turn-based Büchi game can be constructed as

H(u) = 〈V̄, ACT ∪ Q, T̄, v̄(0), F(u)〉

where F(u) = {v̄ ∈ V̄ | ∀i ∈ Agt(v̄), u(v̄) ≼i u}. There exists a pure Nash equilibrium
f such that u(q(0), f) = u if and only if the following conditions are satisfied (in the
order given):

1. player I wins the Büchi game H(u);

2. there exists a run ρ ∈ V̄^ω in H(u) that satisfies ρ(0) = v̄(0); for all i ≥ 0, ρ(i) ∈ WinI
and Agt(ρ(i)) = Π; Inf(ρ) ∩ {v̄ ∈ F(u) | u(v̄) = u} ≠ ∅; and Inf(ρ) ∩ (V̄ \ F(u)) = ∅.

Proof. It follows directly from the proof for the case of Muller objectives: if player
I wins the game, then the set of states visited infinitely often is contained in F(u). That
is, a unilateral deviation made by one of the agents j ∈ Π will not give her a payoff
vector better than u. The second condition ensures the existence of an equilibrium
f associated with payoff vector u: if w is an ω-word that generates a run ρ in
H(u) satisfying (1) for all k ≥ 0, ρ(k) ∈ WinI and Agt(ρ(k)) = Π (meaning that all rational
agents adhere to this policy), and (2) Inf(ρ) ∩ (V̄ \ F(u)) = ∅ and there exists v̄ ∈ Inf(ρ)
with u(v̄) = u, then the equilibrium f is just the projection of w on the set of action profiles.
The computational complexity of solving two-player turn-based Büchi games is
O(n(m + n)), where n is the number of game states and m is the number of transitions
in the game H(u) [45].
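Two-player turn-based Büchi games can be solved with the textbook attractor-based procedure (this is a generic sketch, not necessarily the specific algorithm of [45]): repeatedly remove the opponent's attractor of the states that cannot even reach the Büchi set once. In the encoding below, `owner[v]` names the player who moves at v, with player 0 playing the role of player I; totality of the game graph is assumed.

```python
def attractor(V, edges, owner, target, player):
    """States in V from which `player` can force a visit to `target` (subgame on V)."""
    attr = set(target) & V
    changed = True
    while changed:
        changed = False
        for v in V - attr:
            succ = [w for w in edges.get(v, ()) if w in V]
            if not succ:
                continue
            # player's own states need one good move; opponent states need all moves good
            if (owner[v] == player and any(w in attr for w in succ)) or \
               (owner[v] != player and all(w in attr for w in succ)):
                attr.add(v)
                changed = True
    return attr

def buchi_winning_region(V, edges, owner, F):
    """Winning region of player 0 for 'visit F infinitely often' (total game assumed)."""
    V = set(V)
    while True:
        reach = attractor(V, edges, owner, set(F) & V, 0)
        losing = V - reach            # player 1 avoids F forever from these states
        if not losing:
            return V                  # player 0 can force re-visiting F ad infinitum
        V -= attractor(V, edges, owner, losing, 1)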
6.4.2.2 Reachability Games
Reachability objectives correspond to first-order logic formulas. An automaton
accepting a reachability objective is an fsa. Again, we can view an fsa as a special
case of a dba3. However, as in the case of the previous section, there is a faster way.
In this case, the game equilibria can be computed in linear time, as solutions of a
two-player turn-based reachability game H.
The reachability objective Ωi for agent i ∈ Π is just a formula (can be thought
of as a regular expression) evaluated true only for those strings that are accepted by
the dfa Ai = 〈Si, C, Ti, Ii, Fi〉. In this case, with respect to (G, q(0)), the two-player
turn-based arena H is obtained once again by computing the synchronized product
H =(H, v(0)
)n(A1, T1(I1, q
(0)))n · · ·n
(AN , TN(IN , q
(0)))
= 〈VI ∪ VII ,ACT ∪Q, T , v(0)〉 .
For each v = (v, s) of H, a payoff vector is computed: u(v) =(u1(v), . . . , uN(v)
)where
ui(v) = 1 if s[i] ∈ Fi, and 0 otherwise. The set of payoff vectors is PV =⋃v∈V u(v).
The proposition that follows is in the spirit of Proposition 8 of the previous section,
in the sense that the decision problem of finding an equilibrium in the multi-agent
3 An fsa A can be converted into an dba A′ by adding a self-loop labeled “λ” (theempty string) for each q ∈ F = Acc of A, and letting Acc′ of A′ be F [19].
reachability game G is mapped to a corresponding decision problem in a two-player
zero-sum reachability game H.
Proposition 9. Given a game (G, q(0)) and a payoff vector u ∈ PV, a two-player
turn-based game arena can be constructed as H̄ = 〈V̄, ACT ∪ Q, T̄, v̄(0)〉. If u ∈ PV,
then there exists a pure Nash equilibrium f such that u(q(0), f) = u if the following
conditions are satisfied (in the order given):

1. player I wins the reachability game H(u) = 〈V̄, ACT ∪ Q, T̄, v̄(0), SafeI〉, where
SafeI is a set of states computed iteratively as follows:

(a) let Safe_I^(0) = {v̄ ∈ V̄ | ∀i ∈ Agt(v̄), u(v̄) ≼i u};

(b) for i ≥ 0, Safe_I^(i+1) = Safe_I^(i) ∩ NextI(Safe_I^(i)), where

NextI(W) := {v̄ ∈ V̄I ∩ W | (∃a ∈ ACT ) [T̄(v̄, a) ∈ W]} ∪
{v̄ ∈ V̄II ∩ W | (∀q ∈ Q : T̄(v̄, q) ↓) [T̄(v̄, q) ∈ W]}.

SafeI = Safe_I^(m) = Safe_I^(m+1) is the fixed point.

2. there exists a run ρ = ρ(0)ρ(1) . . . ρ(m), m ∈ N, such that for all i ∈ {0, . . . , m},
ρ(i) ∈ WinI, where WinI is the winning region of player I, Agt(ρ(i)) = Π, ρ(m) ∈ SafeI,
and u(ρ(m)) = u.
Proof. The set SafeI has the following property: for any v̄ ∈ SafeI, if v̄ is
a state where player I moves, then by definition there exists a move of player I
that keeps the game within SafeI; if v̄ is a state where player II moves, no move
player II can make takes the game outside SafeI. Since SafeI ⊆ Safe_I^(0)
by construction, we have that for any v̄ ∈ SafeI and every i ∈ Agt(v̄),
u(v̄) ≼i u. If a two-player turn-based reachability game is played with SafeI as the
objective of player I, any outcome ρ that results from player I adhering to her
winning strategy WSI satisfies Occ(ρ) ∩ SafeI ≠ ∅. Once the game state is in SafeI,
player I can apply strategy WSI, which at any state v̄ ∈ SafeI presents an action profile
a ∈ ACT that keeps the game within SafeI. Any agent i ∈ Π who unilaterally deviates
from this strategy profile does so at her own cost: she will find that the outcome of
the game is no more preferable to her than the outcome associated with
the payoff vector u. It remains to verify the existence of an equilibrium f
associated with u: if this equilibrium exists, then we can find a word w that generates
a run ρ in H(u) satisfying (1) for all 0 ≤ k ≤ |ρ|, ρ(k) ∈ WinI and Agt(ρ(k)) = Π (all agents
must adhere to the strategy), and (2) there exists ℓ ≥ 0 with ρ(ℓ) ∈ SafeI and u(ρ(ℓ)) = u.
In that case, the equilibrium is the projection of w onto the set of action profiles.
The complexity of solving two-player turn-based reachability or safety games is
O(m + n), where n is the number of game states and m is the number of transitions
in the game H(u) [45].
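The iterative computation of SafeI in Proposition 9 is a standard safety fixed point and can be sketched directly. The encoding is hypothetical: `T` maps (state, move) pairs to successor states, `V_I`/`V_II` partition the states by the player who moves, and `safe0` is the initial set of states whose payoff vectors are not preferred over u.

```python
def safe_fixed_point(V_I, V_II, T, safe0):
    """Iterate Safe^(i+1) = Safe^(i) ∩ Next_I(Safe^(i)) to its fixed point SafeI.

    T: dict, (state, move) -> state; safe0: the initial set Safe^(0).
    """
    safe = set(safe0)
    while True:
        nxt = set()
        for v in safe:
            succs = [T[(u, m)] for (u, m) in T if u == v]
            if v in V_I:
                # player I only needs one move that stays inside the safe set
                if any(w in safe for w in succs):
                    nxt.add(v)
            else:
                # every defined move of player II must stay inside (vacuous if none)
                if all(w in safe for w in succs):
                    nxt.add(v)
        if nxt == safe:
            return safe
        safe = nxt
```

Each pass removes player-I states with no safe move and player-II states with some escaping move; termination is guaranteed because the set shrinks monotonically.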
6.4.3 Security Strategies
If game (G, q(0)) is played only once, and agents cannot communicate to decide
on which equilibrium to adopt, the implementation of an equilibrium strategy profile
becomes problematic. This is because each agent has to perform her own calculations, and in general there can be several equilibria yielding the same payoff vector. These equilibria are neither comparable nor interchangeable. Even though there exists a set of pure equilibria, each of which is an optimal solution, if the agents do not agree on which one to adhere to jointly, the strategy profile that they end up following may lead to an inferior outcome and need not itself be an equilibrium, a phenomenon known as "thrashing."
In the face of uncertainty about the behavior of other agents, one reasonable
strategy for agent i is to secure a payoff vector above some specified level, against any
(rational or irrational) behavior of the others. Such a solution concept is similar to the
notion of security strategy in matrix games [87].
Definition 25 (Pure security strategy). A pure security strategy for agent i, denoted f_i^s : Q∗ → Σi, with respect to a designated security level u ∈ PV, satisfies

for all f ∈ SP such that f[i] = f_i^s, it is u ≼i u(q(0), f) .
That is, by adhering to f_i^s, agent i can ensure a payoff vector ranked at least as high as u. Intuitively, the pure security strategy is the best choice for agent i in the absence of information about the objectives, rationality, and preferences of the other agents.
Our approach to analyzing this case is similar in spirit to the treatment of
the previous section. For player i, we take the arena P of the initialized multi-agent
concurrent game (G, q(0)), and construct an arena for an agent-specific two-player turn-
based game (the superscript i is added to stress that this is the particular agent’s
defensive view of the world)
H^i = 〈V^i, Σi ∪ Q, T^i_h〉

where

• V^i = V^i_I ∪ V^i_II is the set of states, with V^i_I = {(q, {i}) | q ∈ Q} the set of states where player I makes a move, and V^i_II = {(q, Π[−i], σ) | q ∈ Q, σ ∈ Σi} the set of states where player II makes a move (here Π[−i] ≡ Π \ {i}).

• Σi, the action set of agent i, is the alphabet of player I.

• Q is the alphabet of player II.

• T^i_h is the transition function: given v ∈ V^i,
  – if v = (q, {i}) ∈ V^i_I, then for each σ ∈ Σi for which there exists an a ∈ ACT such that T(q, a) is defined and a[i] = σ, set T^i_h((q, {i}), σ) := (q, Π[−i], σ);
  – if v = (q, Π[−i], σ) ∈ V^i_II, then for each q′ ∈ Q for which there exists an a ∈ ACT such that T(q, a) = q′ and a[i] = σ, set T^i_h((q, Π[−i], σ), q′) := (q′, {i}).
In this arena, player I is agent i who at each turn can select an action from its
action set. Player II, who represents the whole collective, implements an action profile
which includes the choice of agent i, and whatever all other agents decide to do. This
arena is initialized with (q(0), {i}), and combined with the other agents' objectives using the synchronized product

H^i = (H^i, (q(0), {i})) ⋉ (A1, T1(I1, q(0))) ⋉ · · · ⋉ (AN, TN(IN, q(0))) = 〈V^i, Σi ∪ Q, T^i, v(0)i〉

where

• V^i = V^i_I ∪ V^i_II is the set of states, where the player-I states are of the form (v, s1, . . . , sN) with v a player-I state of the arena H^i above and each si a state of Ai, and similarly for the player-II states.

• Σi is the set of actions of player I.

• Q is the set of actions of player II.

• T^i is the transition function: for v = (v, s1, . . . , sN),
  – if v = (q, {i}) ∈ V^i_I and σ ∈ Σi, then T^i(v, σ) := (v′, s1, . . . , sN), provided that v′ = T^i_h(v, σ);
  – if, on the other hand, v = (q, Π[−i], σ) ∈ V^i_II and q′ ∈ Q, then T^i(v, q′) := (v′, s′1, . . . , s′N), provided that v′ = T^i_h(v, q′) and, for each j ∈ Π, s′j = Tj(sj, LB(q′)).

• v(0)i = ((q(0), {i}), s(0)1, . . . , s(0)N) is the initial state, where s(0)j = Tj(Ij, LB(q(0))) for each j ∈ Π.
For each cycle C ∈ Cycles(H^i), the payoff vector is u(C) = (u1(C), . . . , uN(C)), where for each j ∈ Π, uj(C) = 1 if {s[j] | (v, s) ∈ C} ∩ Fj ≠ ∅, and uj(C) = 0 otherwise. Note that the two-player game on arena H^i expresses the particular agent's defensive view of the game dynamics, and thus this game's objective function is slightly different, skewed toward the agent's conservative game-play:

OBJi(u) := {C ∈ Cycles(H^i) | u(C) ≽i u} . (6.3)
In view of (6.3), a security level for agent i in game (G, q(0)) is a specific u ∈ PV. Playing defensively, agent i wants to end the game on one of the cycles in OBJi(u); this way, the game's outcome, measured in terms of payoff vectors, is at least as good as u.
With the definition of the objective function in (6.3), the description of the two-player turn-based game related to the security level u for agent i can be completed as follows:

H^i(u) := 〈V^i_I ∪ V^i_II, Σi ∪ Q, T^i, v(0)i, (OBJi(u), Cycles(H^i) \ OBJi(u))〉 .
Given v = (v, s), where v = (q, {i}) or v = (q, Π[−i], σ), let State(v) = q ∈ Q be the state of the underlying multi-agent arena P associated with v. The following statement establishes the conditions under which there exists a security strategy that enables agent i to achieve such a lower bound on the outcomes:
Theorem 9. In the concurrent multi-agent game (G, q(0)), there exists a security strategy f_i^s with respect to the security level u for agent i, if player I wins in game H^i(u).
Proof. Suppose that the winning strategy for player I in H^i(u) is WS^i_I : (V^i)∗V^i_I → 2^Σi. At each turn when she moves, player I selects an action σ according to WS^i_I; subsequently, player II selects a state q ∈ Q which can be reached through an action profile a ∈ ACT with entry i of a being σ. Since WS^i_I is a winning strategy for player I, no matter how player II plays, player I wins the game by visiting infinitely often a cycle in the winning condition OBJi(u). On all cycles in OBJi(u), agent i in (G, q(0)) achieves a payoff vector which ranks at least as high as u according to this agent's preference. The security strategy f_i^s : Q∗ → 2^Σi for agent i in (G, q(0)) is then derived from WS^i_I as follows: given r = v(0)v(1) . . . v(n) ∈ (V^i)∗V^i_I, note that for every even k < n, State(v(k)) = State(v(k+1)). Projecting the run r onto the state set Q of G and removing all repeated entries, we obtain a run in G, ρr = State(v(0)) State(v(2)) . . . State(v(n−2)) State(v(n)). Then f_i^s(ρr) = WS^i_I(r).
The next statement guarantees the existence of at least one security level for each agent in the concurrent multi-agent game. The level itself, however, may not be particularly desirable for the associated agent.
Lemma 3. In every game (G, q(0)), the following statements hold:
• For each agent, there exists a security strategy for at least one security level;
• For each agent, either there is a unique highest security level, or there exists a set
of highest security levels, between the elements of which the agent is indifferent.
Proof. For the first part, we reason as follows. For each agent i, consider the lowest-ranked payoff vector u ∈ PV such that for any C ∈ Cycles(H^i), we have u ≼i u(C). Then OBJi(u) = Cycles(H^i), and the winning condition in H^i(u) becomes (Cycles(H^i), ∅). (Player I wins on every cycle, while player II wins nowhere.) In this game player I certainly has a winning strategy, because every run in H^i(u) ends up visiting one of the cycles in Cycles(H^i) infinitely often. The second part follows from the definition of the preference relation.
6.4.4 Cooperative Equilibria
In the cases considered so far, each agent acts independently, and cooperation is implicit through the ordering of possible outcomes in the preference relations: agent i cooperates with agent j if the success of both i and j makes i happier than when she succeeds alone. In this section, cooperation is considered explicitly. We identify cooperation as a concurrent deviation from an equilibrium policy for the purpose of collectively achieving some better outcome.

Our notion of such a cooperative equilibrium is related to stability solutions in coalition games: the group of agents who form a coalition is not determined a priori
but emerges through the computation of the equilibrium.4
In a concurrent game G, a team is a subset X of the set Π of agents. A unilateral team deviation by team X ∈ 2^Π from an action profile a is denoted a[X ↦ σ] = (a′1, a′2, . . . , a′N), where σ = (bj)j∈X is the tuple of the actions of all agents in team X, ordered by their index; we have a′i ≡ ai if i ∉ X and a′i ≡ bi if i ∈ X. The set
4 Recall the prisoner's dilemma problem and note that the optimal payoff cannot be achieved unless both prisoners deviate together from their lower-payoff equilibrium policy.
of teams in G is denoted Teams ⊆ 2^Π. Note that nothing prevents those agents from switching teams or breaking up, provided that at any given moment the teams in the game belong to Teams. This opens up a realm of possibilities, allowing teams to be formed and dissolved in an opportunistic way.
In this context, the concept of suspect agents generalizes to that of a suspect team: suppose a transition from q to q′ exists in G; then for an action profile b ∈ ACT, the set of suspect teams triggering a transition from q to q′ is

SuspTeams((q, q′), b) := {X ∈ Teams | (∃σi ∈ Mov(q, i) for each i ∈ X) [T(q, b[X ↦ (σi)i∈X]) = q′]} .
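The set SuspTeams can be computed directly by trying every joint deviation of each candidate team. The following Python sketch illustrates this under assumed data-structure choices (the partial transition function as a dict, teams as frozensets); the function name and signature are hypothetical.

```python
from itertools import product

def suspect_teams(T, q, q_next, b, teams, moves):
    """Teams X that could have triggered q -> q_next by jointly
    deviating from the announced profile b.

    T     : dict mapping (state, profile) -> next state (partial fn)
    b     : announced action profile, a tuple indexed by agent
    teams : iterable of candidate teams (frozensets of agent indices)
    moves : dict mapping (state, agent) -> available actions Mov(q, i)
    """
    suspects = set()
    for X in teams:
        members = sorted(X)
        # try every joint deviation of the team's members
        for choice in product(*(moves[(q, i)] for i in members)):
            a = list(b)
            for i, sigma in zip(members, choice):
                a[i] = sigma
            if T.get((q, tuple(a))) == q_next:
                suspects.add(X)
                break
    return suspects
```

Note that the announced profile itself counts as a (trivial) deviation, so a team whose announced actions already trigger the transition is also a suspect, matching the definition above.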
Definition 26 (Pure cooperative equilibrium). A strategy profile f is a cooperative equilibrium in an initialized multi-agent non-cooperative game (G, q(0)) if for any team X ∈ Teams, and for any strategy profile f′ obtained from f by a unilateral team deviation of X, it holds that for all k ∈ X, u(q(0), f′) ≼k u(q(0), f).
Just as we did in all previous cases where agents played on their own, we can
still use the multi-player concurrent arena P = 〈Q,ACT , T 〉 to construct a two-player
turn-based arena H in the familiar form
H = 〈V, ACT ∪ Q, Th〉

where

• V = VI ∪ VII is the set of states, with VI ⊆ Q × 2^Teams and VII ⊆ Q × 2^Teams × ACT.

• ACT ∪ Q is the alphabet: ACT = Σ1 × · · · × ΣN represents the available moves of player I, and Q the moves of player II.

• Th is the transition function, defined below.
Given v ∈ V, either

• v = (q, S) ∈ VI, where S ⊆ Teams, and for any a ∈ ACT with T(q, a) ↓, Th((q, S), a) := (q, S, a) ∈ VII; or

• v = (q, S, a) ∈ VII, and for any q′ ∈ Q with SuspTeams((q, q′), a) ∩ S ≠ ∅, Th(v, q′) := (q′, S′) ∈ VI, where S′ = {X ∈ Teams | X ⊆ Y ∧ Y ∈ SuspTeams((q, q′), a) ∩ S}. Intuitively, S′ includes not only the suspect teams intersecting with S, but also every team that is a subset of one of those suspect teams.
The analysis of equilibria in this game, where opportunistic teams of agents can play against each other when the interests of the teammates align, can be performed in exactly the same way as in Sections 6.4.1–6.4.2. In fact, the cases considered in those sections are merely special cases of the one considered here, in which each team is a singleton: Teams = {{i} | i ∈ Π}. Since players in Π have their own objectives and preference relations, teams can form and dissolve in an ad-hoc fashion, depending on the opportunities of the moment, the interests of the agents, and the agents' preference relations over outcomes of the game.
6.5 Case Study
We consider a scenario in which three agents Π = {1, 2, 3} need to visit different rooms of the environment in Fig. 6.2. The rooms are indexed A, B, C, D. Agents can pass through doors a, b, c and d, but only one agent at a time can go through a given door. The dynamics of each agent is modeled in the form of an sa Ai, depicted graphically in Fig. 6.6a. The concurrent product of the agent dynamics A1, A2, A3 yields the concurrent multi-agent game arena P in Fig. 6.3, in which the constraint that only one agent can pass through a door at any given time is encoded.

In Fig. 6.3, a state (i, j, k) is represented as ijk: agent 1 is in room i, agent 2 in room j and agent 3 in room k. Transitions are labeled with the action profiles σ1σ2σ3
that trigger them, where each σi denotes the door through which agent i goes. For all (ai)i∈Π ∈ ACT and i ≠ j ∈ Π, if ai, aj ≠ ε, then ai ≠ aj; this captures the constraint that two agents cannot pass through the same door simultaneously. The set of atomic propositions is AP = {α(i, m) : robot i is in room m, i ∈ {1, 2, 3}, m ∈ {A, B, C, D}}. The set of propositions evaluated true at q ∈ Q indicates the current locations of the agents.
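The door-exclusion constraint on action profiles can be enumerated directly. The sketch below is a hypothetical illustration (function and parameter names are not from the thesis); the stay-put action ε is represented as the character 'e' and may be shared by several agents.

```python
from itertools import product

def valid_profiles(doors, n_agents, eps='e'):
    """Enumerate action profiles in which no two agents use the same
    door simultaneously; `eps` (stay put) is exempt from the rule."""
    actions = list(doors) + [eps]
    profiles = []
    for prof in product(actions, repeat=n_agents):
        chosen = [a for a in prof if a != eps]
        if len(chosen) == len(set(chosen)):   # all chosen doors distinct
            profiles.append(prof)
    return profiles
```

For two agents and two doors a, b this keeps 7 of the 9 candidate profiles, discarding exactly those where both agents pick the same door.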
Based on the concurrent product P, we compute the two-player, turn-based game H in Fig. 6.4. Each state is a tuple of either two components (e.g., (ABC, {1, 2, 3})) or three (e.g., (ABC, {1, 2, 3}, acb)). In the first case the state is in VI, while in the second it is in VII. The semantics of a state in VII (e.g., (ABC, {1, 2, 3}, acb)) is that the agents are in the rooms marked by the first component (i.e., 1 in A, 2 in B and 3 in C), the agents suspected of triggering the transition there are the ones in the second component (i.e., all of them), and the agents are supposed to execute the actions specified in the third component (i.e., 1 goes through a, 2 through c and 3 through b). The semantics of a state in VI (say, (BDD, {3})) is that the agents are now where the first component says (i.e., 1 in B, 2 in D and 3 also in D) and that, for this state to have been reached, the agents in the second component (i.e., 3) are suspected of triggering the transition; the action profile that was actually implemented to reach that particular state was acd. Comparing acd with acb, it is clear that agent 3 deviated.
Figure 6.2: A partitioned rectangular environment in which three agents roam.
6.5.1 Reachability Objectives
Let us consider three different combinations of preference relations and rules for
team formation, and see what type of interaction behaviors can emerge as a result. We
will assume that the objective of agent i, with i ∈ Π, is a reachability objective Ωi,
Figure 6.3: A fragment of the multi-agent arena P = 〈Q,ACT , T 〉.
Figure 6.4: A fragment of the two-player turn-based game arena H.
Figure 6.5: Fragment of the partial synchronization product H.
(a) The sa modeling agent dynamics. A transition labeled ε means that the agent stays in the same room.

(b) A1: visit rooms A, B in any order.

(c) A fragment of A3: visit all rooms in any order. A transition label {A, B} stands for the world state c such that c = (A ∧ ¬(B ∨ C ∨ D)) ∨ (B ∧ ¬(A ∨ C ∨ D)).
Figure 6.6: The fsas representing the agent objectives.
equivalent to an fsa denoted Ai. The objective of agent 1, Ω1, is to visit rooms A and
B; the associated fsa A1 is shown in Fig. 6.6b. The objective of agent 2, Ω2, is to
visit rooms C and D, and its associated fsa is obtained from the fsa of Fig. 6.6b by relabeling the rooms as follows: A ↦ C, B ↦ D, C ↦ A and D ↦ B. The objective
of agent 3, Ω3, is to visit all rooms, and a fragment of the associated fsa appears in
Fig. 6.6c.
Case 1: Everyone for themselves.
Agents selfishly focus on achieving their own objectives, and there are no teams. Formally, this is expressed in the form of preference relations as follows: u ≺i u′, with i ∈ Π, if u[i] = 0 and u′[i] = 1; similarly, u ≃i u′ if u[i] = u′[i]. With the arena of the two-player reachability game being the sa H shown in Fig. 6.4, and the objective fsas Ai of Fig. 6.6, the synchronization product gives us H, a fragment of which is shown in Fig. 6.5.
The set of payoff vectors in this two-player reachability game arena H is PV = {0, 1}^3, with elements of the form u(v), in which the argument is a game state v = (v, s) ∈ VI ∪ VII. The structure of state v can be seen in the graph of Fig. 6.5. The first component of each such state, v, is essentially one of the states of H, which appear as node labels in the graph of Fig. 6.4. The second component of v, s (see Fig. 6.5), is a tuple of three states of the agents' objective automata. Clearly, s keeps track of what each agent has achieved so far in terms of its objective: the first element relates to agent 1, and if s[1] = AB then agent 1 has achieved its goal; the second element relates to agent 2, and when s[2] = CD, agent 2 has accomplished its task; the third element relates to agent 3, and reads s[3] = ABCD if agent 3 has completed its task.
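The bookkeeping performed by s, advancing each agent's objective automaton as the game visits new states, can be sketched as follows. This is a simplified illustration rather than the construction used in the thesis: transition tables are dicts, and a missing entry leaves an automaton state unchanged.

```python
def advance(automata, s, label):
    """Advance each agent's objective automaton on the label of the
    newly reached game state.

    automata : list of dicts mapping (state, label) -> next state
    s        : tuple of current automaton states, one per agent
    label    : the atomic-proposition label of the new game state
    """
    return tuple(automata[j].get((s[j], label), s[j])
                 for j in range(len(s)))
```

For instance, an automaton for "visit A and B in any order" reaches its accepting state AB only after both rooms have been seen, regardless of visits to other rooms in between.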
Let us single out the payoff vector u = (0, 0, 1) in PV. We can verify that the designated initial state of the game, ((ABC, {1, 2, 3}), A, 0, C), is in the winning region WinI of player I, by computing the equilibrium associated with (0, 0, 1) in the reachability game H(u). This equilibrium corresponds to a sequence of action profiles, a strategy profile f, from the state ABC of the multi-player arena P, which suggests a sequence of (concurrent) moves for each agent:

f = (εaε)(εab)(baε)(dba)(dbc)(bac)(εac) .
For example, according to this strategy, in the opening of the game agent 1 and agent
3 are to remain still, while agent 2 is to go through door a; then while agent 1 is still
at rest, agent 2 crosses door a again (in the opposite direction) and agent 3 springs
into action going through door b.
Following the strategy profile f, agents 1, 2, and 3 eventually find themselves in rooms A, A, and D, having already visited rooms {A, C, D}, {A, B, C}, and {A, B, C, D}, respectively. Agent 3 has achieved its goal, but agents 1 and 2 have not: agent 1 wanted to visit A and B, while agent 2 needed to go to C and D. Neither agent 1 nor agent 2 can achieve a better payoff by unilaterally deviating from f. However, if they deviate together, e.g., by jointly playing aεε instead of baε as the third action profile, then at least one of the two (in this case, agent 1) accomplishes its goal.
An exhaustive analysis of all possible payoff vectors in PV reveals that there exists an equilibrium for each of them (first row of Table 6.1).
Table 6.1: Nash equilibria for all payoff vectors in concurrent game G with reachability objectives

PV        000  001  010  100  110  011  101  111
case 1     ✓    ✓    ✓    ✓    ✓    ✓    ✓    ✓
case 2     ✗    ✗    ✗    ✓    ✓    ✗    ✓    ✓
case 3     ✓    ✗    ✓    ✓    ✓    ✗    ✗    ✓
Case 2: Selfish individuals in teams.
Agents in teams can deviate concurrently from an equilibrium policy. In this case, consider the set of possible teams Teams = {{1}, {2}, {1, 2}, {3}} in game G; agents 1 and 2 can work together if doing so serves their common interests. Note that while they can form a team and cooperate in an ad-hoc way, they can still perform unilateral deviations as individuals.
Solving the reachability game from the same initial condition for all payoff vectors in PV produces the second row of equilibria in Table 6.1. Now only half of the possible payoff vectors are associated with equilibria, and those that are appear biased toward solutions that yield higher payoffs for agents 1 and 2.
Case 3: Teaming against others.
In this case, we do not explicitly define possible team groupings. Instead, we prescribe preference relations over the set of possible payoff vectors and let the agents choose how to team up. What is of interest is the kind of gameplay strategies that emerge as stable equilibria.
The preference relation of agent 1 explicitly defines the following order among the eight possible outcomes:

(0, 0, 1) ≺1 (0, 1, 1) ≺1 (1, 0, 1) ≺1 (1, 1, 1) ≺1 (0, 0, 0) ≺1 (0, 1, 0) ≺1 (1, 0, 0) ≺1 (1, 1, 0) , (6.4)
which reads: "ideally I want myself and agent 2 to achieve our goals but not agent 3, and if I cannot have that I would rather win alone; if this is not possible I can let agent 2 win, but under no circumstances do I let 3 get her way; if 3 really has to win, then my preferences are the same as in the case where she loses." Agent 2 similarly prefers
the outcomes in the order

(0, 0, 1) ≺2 (1, 0, 1) ≺2 (0, 1, 1) ≺2 (1, 1, 1) ≺2 (0, 0, 0) ≺2 (1, 0, 0) ≺2 (0, 1, 0) ≺2 (1, 1, 0) . (6.5)
Agent 3 plays selfishly, with her mind set on achieving her own objective. Note that here agents 1 and 2 are radicalized: they prefer failure to letting agent 3 achieve her objective.
In this scenario (see Table 6.1), the payoff vector 010, which does not correspond to an equilibrium when agents 1 and 2 play selfishly, now does, because agent 1 lets agent 2 succeed as long as agent 3 loses. An implication of this observation is that by
simply redefining the preference relations, an opportunistic alliance between agents 1
and 2 can emerge.
6.5.2 Büchi Objectives

In this section we incorporate Büchi objectives into the multi-agent game: the objectives of agents 1 and 2 are to visit rooms A and B, and rooms C and D, respectively, infinitely often. Agent 3 needs to visit rooms A, B, and D infinitely often. The temporal logic formulae5 are

Ω1 : □♦(A ∧ ♦B)
Ω2 : □♦(C ∧ ♦D)
Ω3 : □♦(A ∧ ♦(B ∧ ♦D)) .
Any of the three objectives can be accepted by a suitable dba. It turns out that when ε ∈ Σi for all i ∈ Π, meaning that any agent can remain stationary at any time step, in all three cases we can find a pure Nash equilibrium for each payoff vector. However, if we restrict the behavior of agents by requiring ε ∉ Σi for all i ∈ Π, then we observe some interesting behaviors, as indicated in Table 6.2. In case 3, when agents 1 and 2 team up with the preferences indicated in (6.4) and (6.5), there does not exist an equilibrium that ensures the success of agent 3.
Table 6.2: Nash equilibria for all payoff vectors in concurrent game G with Büchi objectives (in the case when ε ∉ Σi, for all i ∈ Π)

PV        000  001  010  100  110  011  101  111
case 1     ✓    ✓    ✓    ✓    ✓    ✓    ✓    ✓
case 2     ✓    ✓    ✓    ✓    ✓    ✓    ✓    ✓
case 3     ✓    ✗    ✓    ✓    ✓    ✗    ✗    ✗
5 For semantics of temporal logic formulae, see [33].
6.5.3 Strategy Alternatives for Agent 3
In case 3 of Section 6.5.1, if agent 3 knows that the other two agents can team up against her and thus prevent her from completing her task, she may consider the following two options: she can either announce to the other players that she is willing to play fair, and call on everyone6 to follow the strategy associated with payoff (1, 1, 1), which allows everyone to win; or, if this is not an option, e.g., she cannot communicate, she can simply plan for the worst. Planning for the worst case amounts to searching for a security strategy.
Using Lemma 3, we can compute the best security level for every agent by solving a two-player turn-based game H^i(u) for each i = 1, 2, 3, with respect to different payoff vectors u. Figure 6.7 shows a fragment of the two-player turn-based arena H^1. It turns out that for both agents 1 and 2, the highest security level is (1, 0, 0); for agent 3, there is a set of highest security levels {(0, 0, 0), (1, 0, 0), (0, 1, 0), (1, 1, 0)}, between which agent 3 is indifferent based on her preference. It is a lost cause: no matter how she plays, she cannot meet her own objective if the other two agents both prefer that she lose.
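The search for the highest security levels can be sketched generically. Assuming a hypothetical solver wins(u) that decides whether player I wins H^i(u), one checks every payoff vector and keeps the maximal securable ones; all names below are illustrative, not from the thesis.

```python
def best_security_levels(payoff_vectors, prefers, wins):
    """Maximal security levels for one agent.

    payoff_vectors : iterable of candidate payoff vectors u
    prefers(v, u)  : True if the agent strictly prefers v over u
    wins(u)        : True if the agent has a security strategy for u
                     (i.e., player I wins the game H^i(u))
    """
    securable = [u for u in payoff_vectors if wins(u)]
    # keep u only if no strictly preferred v is also securable
    return [u for u in securable
            if not any(prefers(v, u) for v in securable if v != u)]
```

By Lemma 3, securable is never empty, so this always returns at least one level; when several maximal levels remain, the agent is indifferent between them.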
The existence of an equilibrium associated with payoff vector (1, 1, 1) does not necessarily ensure that agent 3 has a security strategy that can force these payoffs. This (Nash) equilibrium is stable under the implicit assumption that players behave strictly rationally; that is, they will always prefer to improve the outcome with respect to their own preferences when they have the choice.
Figure 6.7: Fragment of the two-player turn-based arena H1.
6 The utterance of a strategy profile in this case is different from the one found when communication in game theory is considered [87], because here it is supposed to be binding for all players.
6.6 Conclusions
This chapter suggests a game-theoretic approach to decentralized planning for the class of multi-agent systems with independent ω-regular objectives. The analysis of Nash equilibria in such collections of autonomous, rational systems that do not share the same objectives has not received enough attention in the temporal logic control and discrete event systems literature. A method is developed for constructing a multi-agent concurrent game which captures the interaction of multiple non-cooperative systems. Each subsystem is assigned an individual task defined using an ω-regular logical formula and has its own preference over all possible outcomes of the interactions. A pure Nash equilibrium in the resulting game is a collection of control strategies for the set of agents. We also analyze security strategies as an alternative solution concept for this special class of multi-agent concurrent games, and then introduce the notion of cooperative equilibrium in coalition games, which allows the overall system to produce behaviors with implicit or explicit cooperation between subsystems.
Chapter 7
CONCLUSIONS AND FUTURE WORK
In this thesis, the main focus is on optimal planning and adaptive control design for hybrid systems that interact with changing environments. The specifications and control objectives for such systems are given in terms of logical formulas over predicates which capture the interesting behavior of the systems when operating in their dynamic environments. We solve these problems within a hierarchical framework: by obtaining an abstraction for the class of hybrid systems considered, the planning and control synthesis problems can be lifted to the purely discrete level. Controllers and plans synthesized at the abstraction level are implementable in the original hybrid systems thanks to special simulation relations linking the concrete system and the abstract one.
To make the abstraction of hybrid systems scalable and computationally feasible, the methodology adopts a bottom-up approach. We show that a special class of hybrid systems, in which the continuous dynamics are convergent and the system is capable of re-parameterizing its continuous controllers, affords a partition of the continuous state space based on the asymptotic properties of the vector fields. This partition gives rise to purely discrete abstractions, which are weakly simulated by the underlying concrete hybrid dynamics. Solutions obtained through this process are in general suboptimal, except under certain special conditions which we identify.
In the presence of an unknown but rule-governed environment, grammatical inference can be incorporated as an identification method along a path toward building
robust, reactive and adaptive systems. Starting with an incomplete model of the environment, a system iteratively updates the model based on observations of the environment's behavior, using an appropriate grammatical inference algorithm selected based on whatever prior knowledge is available. If none is available, a hypothesis is made about the class of models to which the adversary dynamics belongs. If the hypothesis is correct, and a characteristic sample of the opponent's behavior (language) is observed, the learned model converges to the actual environment model in finitely many steps. Combining ideas from action model learning with grammatical inference, it is then shown that with the learning component we can eventually construct a game equivalent to the true game actually being played between the system and its environment. Due to this equivalence, a winning strategy computed on the hypothesized game is just as effective as the true winning strategy computed on the game with complete information. In the proposed adaptive control architecture, learning and control are combined in a modular way, in the sense that a range of different control synthesis methods can be adapted and used in conjunction with a variety of different grammatical inference algorithms.
Although reactive synthesis produces correct-by-construction controllers, the assumption of complete information is in general hard to satisfy due to limited sensing capabilities. For the case of partial observation, we defined a sensor model and formulated the interaction between the system and its environment as a two-player game with incomplete information. Control methods are developed with respect to sure-winning and almost-sure winning criteria. From a practical point of view, controllers that do not require extra memory to keep track of the history are more desirable; however, we show that in the case of partial observation it is not possible to obtain a memoryless controller. Randomized control policies can be found that ensure the task is accomplished with probability 1 in cases where a deterministic controller may not exist.
Treating the environment as an adversary can be unnecessarily defensive when the environment of a system is in fact a collection of other systems, each of which has its own task specification in the form of a temporal logic formula and its own preferences over their
interaction. We formulated the interaction of multiple individually rational systems as a multi-agent noncooperative game and adapted results from algorithmic game theory to compute (pure Nash) equilibria, security strategies and cooperative equilibria in this multi-agent system. The analysis of equilibria and cooperative equilibria can be applied to decentralized planning and control design of multi-agent systems. When the right incentives are given to individual agents, a globally desired behavior can emerge from their interaction. Moreover, this emergent behavior is robust in the sense that if a single agent fails, the interaction of the rest can converge to another desired stable point (an infinite sequence of concurrent actions). The analysis of security strategies can be used for control synthesis of a system in the presence of a dynamic environment, in which both the system and the environment act concurrently.
Future work can focus on extending the abstraction method for the special class of hybrid systems in Chapter 3 to stochastic systems, on adaptive control for the partial observation case, and on decentralized planning and control design of multi-agent systems with respect to a set of temporal logic specifications.
The current bottom-up abstraction method is restricted to the class of hybrid systems in which each low-level controller is deterministic, in the sense that for a given state within its region of attraction, after the controller is initiated, the state of the system will certainly satisfy the predicates that characterize its limit set. There are many hybrid systems whose existing low-level controllers are probabilistic: for example, due to exogenous disturbances and unmodeled dynamics, the convergence of a controller may be described by a probability distribution over a set of state sets. It would be meaningful to extend the proposed abstraction method to this class of stochastic hybrid systems. In addition, this thesis considers qualitative reachability properties; a promising direction is to consider quantitative measures (positive probability) for abstraction-based optimal planning of stochastic hybrid systems, as well as more general temporal logic properties such as liveness and safety.
For adaptive control in the case of partial observations, there is a need to develop a learning algorithm that identifies a model of the environment, or some
model that is observation-equivalent to it. To identify a model of the environment, it is necessary to incorporate a filtering method which removes noise from the observed environment behavior in a computationally efficient way. It is also important to ensure that the data presentation still contains a characteristic sample after the removal of noisy information. If none of these conditions can be fulfilled due to sensing uncertainty, a promising direction is to identify an equivalent game based on the notion of observation equivalence for the model of the environment. The intuition is that when the system observes only partial behavior of its environment, it suffices to identify a model which, given the sensing uncertainty, exhibits to the system the same observable behaviors as the true environment does.
The work on game-theoretic modeling and analysis of multi-agent systems pro-
vides theoretical results on how to compute the set of pure Nash or cooperative equi-
libria in the system. Although it can be adapted to decentralized control design for
multi-agent systems (by assigning different preference orderings over outcomes to dif-
ferent agents, and computing the sets of equilibria by solving the resulting games),
one still needs to design a communication protocol that realizes an equilibrium strategy
profile, and to quantify the preference orderings and task specifications by means of
utility functions. A possible direction along these lines is to extend decentralized control
methods based on solution concepts for multi-agent finite-stage games to the case of
infinite-stage games with winning conditions expressed in temporal logic.
BIBLIOGRAPHY
[1] Eric Aaron, Harold Sun, Franjo Ivancic, and Dimitris Metaxas. A hybrid dynamical systems approach to intelligent low-level navigation. In IEEE Proceedings of Computer Animation, pages 154–163, 2002.
[2] Rajeev Alur, Costas Courcoubetis, Thomas A. Henzinger, and Pei-Hsin Ho. Hybrid automata: An algorithmic approach to the specification and verification of hybrid systems. In Robert L. Grossman, Anil Nerode, Anders P. Ravn, and Hans Rischel, editors, Hybrid Systems: Computation and Control, volume 736 of Lecture Notes in Computer Science, pages 209–229. Springer Berlin Heidelberg, 1993.
[3] Rajeev Alur, Thao Dang, and Franjo Ivancic. Reachability analysis of hybrid systems via predicate abstraction. In Claire J. Tomlin and Mark R. Greenstreet, editors, Hybrid Systems: Computation and Control, volume 2289 of Lecture Notes in Computer Science, pages 35–48. Springer Berlin Heidelberg, 2002.
[4] Rajeev Alur, Thao Dang, and Franjo Ivancic. Predicate abstraction for reachability analysis of hybrid systems. ACM Transactions on Embedded Computing Systems, 5(1):152–199, 2006.
[5] Rajeev Alur, Thomas A. Henzinger, Gerardo Lafferriere, and George J. Pappas. Discrete abstractions of hybrid systems. Proceedings of the IEEE, 88(7):971–984, July 2000.
[6] Krzysztof R. Apt and Erich Grädel. Lectures in Game Theory for Computer Scientists. Cambridge University Press, 2011.
[7] A. Arnold, A. Vincent, and I. Walukiewicz. Games for synthesis of controllers with partial observation. Theoretical Computer Science, 303(1):7–34, 2003.
[8] André Arnold. Synchronized products of transition systems and their analysis. In Jörg Desel and Manuel Silva, editors, Application and Theory of Petri Nets, volume 1420 of Lecture Notes in Computer Science, pages 26–27. Springer Berlin Heidelberg, 1998.
[9] Tomáš Babiak, Mojmír Křetínský, Vojtěch Řehák, and Jan Strejček. LTL to Büchi automata translation: Fast and more deterministic. In Tools and Algorithms for the Construction and Analysis of Systems, pages 95–109. Springer, 2012.
[10] Raphaël Bailly. Quadratic weighted automata: Spectral algorithm and likelihood maximization. Journal of Machine Learning Research, 20:147–162, 2011.
[11] Sai K. Banala, Sunil K. Agrawal, Seok Hun Kim, and John P. Scholz. Novel gait adaptation and neuromotor training results using an active leg exoskeleton. IEEE/ASME Transactions on Mechatronics, 15(2):216–225, 2010.
[12] Leonor Becerra Bonache, Colin de la Higuera, Jean-Christophe Janodet, and Frédéric Tantini. Learning balls of strings with correction queries. In Joost N. Kok, Jacek Koronacki, Ramon López de Mántaras, Stan Matwin, Dunja Mladenić, and Andrzej Skowron, editors, Machine Learning: ECML 2007, volume 4701 of Lecture Notes in Computer Science, pages 18–29. Springer Berlin Heidelberg, 2007.
[13] Calin Belta, Antonio Bicchi, Magnus Egerstedt, Emilio Frazzoli, Eric Klavins, and George Pappas. Symbolic planning and control of robot motion. IEEE Robotics & Automation Magazine, 14(1):61–70, 2007.
[14] Dimitri P. Bertsekas. Dynamic Programming and Optimal Control, Two Volume Set. Athena Scientific, 2nd edition, 2001.
[15] Dietmar Berwanger and Łukasz Kaiser. Information tracking in games on graphs. Journal of Logic, Language and Information, 19(4):395–412, 2010.
[16] Patricia Bouyer, Romain Brenguier, Nicolas Markey, and Michael Ummels. Concurrent games with ordered objectives. In Lars Birkedal, editor, Foundations of Software Science and Computational Structures, volume 7213 of Lecture Notes in Computer Science, pages 301–315. Springer Berlin Heidelberg, 2012.
[17] Mireille Broucke. A geometric approach to bisimulation and verification of hybrid systems. In Frits W. Vaandrager and Jan H. van Schuppen, editors, Hybrid Systems: Computation and Control, volume 1569 of Lecture Notes in Computer Science, pages 61–75. Springer Berlin Heidelberg, 1999.
[18] Janusz A. Brzozowski. Derivatives of regular expressions. Journal of the ACM, 11:481–494, October 1964.
[19] Julius R. Büchi. On a decision method in restricted second-order arithmetic. In International Congress on Logic, Methodology, and Philosophy of Science, pages 1–11. Stanford University Press, 1962.
[20] John Case, Sanjay Jain, and Frank Stephan. Vacillatory and BC learning on noisy data. Theoretical Computer Science, 241(1–2):115–141, 2000.
[21] Christos Cassandras and Stéphane Lafortune. Introduction to Discrete Event Systems. Kluwer, 1999.
[22] Krishnendu Chatterjee, Martin Chmelík, and Rupak Majumdar. Equivalence of games with probabilistic uncertainty and partial-observation games. In Supratik Chakraborty and Madhavan Mukund, editors, Automated Technology for Verification and Analysis, Lecture Notes in Computer Science, pages 385–399. Springer Berlin Heidelberg, 2012.
[23] Krishnendu Chatterjee, Laurent Doyen, Thomas A. Henzinger, and Jean-François Raskin. Algorithms for omega-regular games with imperfect information. In Zoltán Ésik, editor, Computer Science Logic, volume 4207 of Lecture Notes in Computer Science, pages 287–302. Springer, 2006.
[24] Yushan Chen, Xu Chu Ding, and Calin Belta. Synthesis of distributed control and communication schemes from global LTL specifications. In IEEE Conference on Decision and Control, pages 2718–2723, Orlando, FL, 2011.
[25] Yushan Chen, Xu Chu Ding, Alin Stefanescu, and Calin Belta. A formal approach to deployment of robotic teams in an urban-like environment. In A. Martinoli, F. Mondada, N. Correll, G. Mermoud, M. Egerstedt, M. A. Hsieh, L. E. Parker, and K. Støy, editors, Distributed Autonomous Robotic Systems, volume 83 of Springer Tracts in Advanced Robotics, pages 313–327. Springer Berlin Heidelberg, 2013.
[26] Alongkrit Chutinan and Bruce H. Krogh. Verification of polyhedral-invariant hybrid automata using polygonal flow pipe approximations. In Frits W. Vaandrager and Jan H. van Schuppen, editors, Hybrid Systems: Computation and Control, volume 1569 of Lecture Notes in Computer Science, pages 76–90. Springer Berlin Heidelberg, 1999.
[27] Edmund Clarke, Ansgar Fehnker, Zhi Han, Bruce Krogh, Joël Ouaknine, Olaf Stursberg, and Michael Theobald. Abstraction and counterexample-guided refinement in model checking of hybrid systems. International Journal of Foundations of Computer Science, 14(4):583–604, 2003.
[28] Edmund M. Clarke Jr., Orna Grumberg, and Doron A. Peled. Model Checking. MIT Press, 1999.
[29] Satyaki Das and David L. Dill. Counter-example based predicate discovery in predicate abstraction. In Formal Methods in Computer-Aided Design, pages 19–32. Springer, 2002.
[30] Colin de la Higuera. Grammatical Inference: Learning Automata and Grammars. Cambridge University Press, 2010.
[31] Aldo De Luca and Antonio Restivo. A characterization of strictly locally testable languages and its application to subsemigroups of a free semigroup. Information and Control, 44(3):300–319, March 1980.
[32] Xu Chu Ding, Stephen L. Smith, Calin Belta, and Daniela Rus. MDP optimal control under temporal logic constraints. In 50th IEEE Conference on Decision and Control and European Control Conference, pages 532–538. IEEE, 2011.
[33] E. Allen Emerson. Temporal and modal logic. In Handbook of Theoretical Computer Science, Volume B: Formal Models and Semantics, pages 995–1072, 1990.
[34] Georgios E. Fainekos, Savvas G. Loizou, and George J. Pappas. Translating temporal logic to controller specifications. In 45th IEEE Conference on Decision and Control, pages 899–904, 2006.
[35] John Fearnley and Martin Zimmermann. Playing Muller games in a hurry. In A. Montanari, M. Napoli, and M. Parente, editors, Proceedings of the First Symposium on Games, Automata, Logic, and Formal Verification, volume 25, pages 146–161, 2010.
[36] Diego Figueira, Piotr Hofman, and Sławomir Lasota. Relating timed and register automata. In Proceedings of the 17th International Workshop on Expressiveness in Concurrency, pages 61–75, 2010.
[37] D. Fisman, O. Kupferman, and Y. Lustig. Rational synthesis. In Tools and Algorithms for the Construction and Analysis of Systems, pages 190–204, 2010.
[38] Jie Fu and Herbert G. Tanner. Optimal planning on register automata. In American Control Conference, pages 4540–4545, June 2012.
[39] Drew Fudenberg and David K. Levine. The Theory of Learning in Games, volume 1 of MIT Press Books. The MIT Press, 1998.
[40] Pedro García, Enrique Vidal, and José Oncina. Learning locally testable languages in the strict sense. In Proceedings of the Workshop on Algorithmic Learning Theory, pages 325–338, 1990.
[41] Paul Gastin and Denis Oddoux. Fast LTL to Büchi automata translation. In Gérard Berry, Hubert Comon, and Alain Finkel, editors, Proceedings of the 13th International Conference on Computer Aided Verification (CAV'01), volume 2102 of Lecture Notes in Computer Science, pages 53–65, Paris, France, July 2001. Springer.
[42] Antoine Girard and George J. Pappas. Hierarchical control system design using approximate simulation. Automatica, 45:566–571, 2009.
[43] William Glover and John Lygeros. A stochastic hybrid model for air traffic control simulation. In Rajeev Alur and George J. Pappas, editors, Hybrid Systems: Computation and Control, volume 2993 of Lecture Notes in Computer Science, pages 372–386. Springer Berlin Heidelberg, 2004.
[44] E. Mark Gold. Language identification in the limit. Information and Control, 10(5):447–474, 1967.
[45] Erich Grädel, Wolfgang Thomas, and Thomas Wilke, editors. Automata, Logics, and Infinite Games: A Guide to Current Research. Springer-Verlag New York, Inc., New York, NY, USA, 2002.
[46] Amaury Habrard, Marc Bernard, and Marc Sebban. Improvement of the state merging rule on noisy data in probabilistic grammatical inference. In Nada Lavrač, Dragan Gamberger, Hendrik Blockeel, and Ljupčo Todorovski, editors, Machine Learning: ECML 2003, volume 2837 of Lecture Notes in Computer Science, pages 169–180. Springer Berlin Heidelberg, 2003.
[47] Jeffrey Heinz. Inductive Learning of Phonotactic Patterns. PhD thesis, University of California, Los Angeles, 2007.
[48] Jeffrey Heinz. String extension learning. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 897–906, Uppsala, Sweden, July 2010.
[49] Jeffrey Heinz, Anna Kasprzik, and Timo Kötzing. Learning with lattice-structured hypothesis spaces. Theoretical Computer Science, 457:111–127, October 2012.
[50] Thomas A. Henzinger. The theory of hybrid automata. In Proceedings of the Eleventh Annual IEEE Symposium on Logic in Computer Science, pages 278–292, 1996.
[51] Thomas A. Henzinger, Ranjit Jhala, and Rupak Majumdar. Counterexample-guided control. Springer, 2003.
[52] John E. Hopcroft, Rajeev Motwani, and Jeffrey D. Ullman. Introduction to Automata Theory, Languages, and Computation (3rd Edition). Addison-Wesley, 2006.
[53] Roberto Horowitz and Pravin Varaiya. Control design of an automated highway system. Proceedings of the IEEE, 88(7):913–925, 2000.
[54] Jianghai Hu, John Lygeros, and Shankar Sastry. Towards a theory of stochastic hybrid systems. In Nancy Lynch and Bruce H. Krogh, editors, Hybrid Systems: Computation and Control, volume 1790 of Lecture Notes in Computer Science, pages 160–173. Springer Berlin Heidelberg, 2000.
[55] Sanjay Jain, Daniel Osherson, James S. Royer, and Arun Sharma. Systems That Learn: An Introduction to Learning Theory. Learning, Development and Conceptual Change. The MIT Press, 2nd edition, 1999.
[56] Gabriel Kalyon, Tristan Le Gall, Hervé Marchand, and Thierry Massart. Symbolic supervisory control of infinite transition systems under partial observation using abstract interpretation. Discrete Event Dynamic Systems, 22(2):121–161, 2012.
[57] Michael Kaminski and Nissim Francez. Finite-memory automata. Theoretical Computer Science, 134(2):329–363, 1994.
[58] Hassan Khalil. Nonlinear Systems. Prentice Hall, third edition, 2002.
[59] Marius Kloetzer and Calin Belta. A fully automated framework for control of linear systems from temporal logic specifications. IEEE Transactions on Automatic Control, 53(1):287–297, 2008.
[60] Xenofon D. Koutsoukos, Panos J. Antsaklis, James A. Stiver, and Michael D. Lemmon. Supervisory control of hybrid systems. Proceedings of the IEEE, 88(7):1026–1049, 2000.
[61] S. Kowalewski, S. Engell, J. Preußig, and O. Stursberg. Verification of logic controllers for continuous plants using condition/event-system models. Automatica, 35:505–518, 1999.
[62] Hadas Kress-Gazit, Georgios E. Fainekos, and George J. Pappas. Where's Waldo? Sensor-based temporal logic motion planning. In IEEE International Conference on Robotics and Automation, pages 3116–3121, 2007.
[63] Hadas Kress-Gazit, Georgios E. Fainekos, and George J. Pappas. Temporal-logic-based reactive mission and motion planning. IEEE Transactions on Robotics, 25(6):1370–1381, 2009.
[64] Ratnesh Kumar, Vijay Garg, and Steven I. Marcus. Predicates and predicate transformers for supervisory control of discrete event dynamical systems. IEEE Transactions on Automatic Control, 38:232–247, 1995.
[65] Bruno Lacerda and Pedro U. Lima. Linear-time temporal logic control of discrete event models of cooperative robots. Journal of Physical Agents, 2(1):53–61, 2008.
[66] H. Lamouchi and J. Thistle. Effective control synthesis for DES under partial observations. In Proceedings of the 39th IEEE Conference on Decision and Control, volume 1, pages 22–28, 2000.
[67] Helmut Lescow and Jens Vöge. Minimal separating sets for Muller automata. In Derick Wood and Sheng Yu, editors, Automata Implementation, volume 1436 of Lecture Notes in Computer Science, pages 109–121. Springer Berlin/Heidelberg, 1998.
[68] Carolos Livadas, John Lygeros, and Nancy A. Lynch. High-level modeling and analysis of the traffic alert and collision avoidance system. Proceedings of the IEEE, 88(7):926–948, 2000.
[69] John Lygeros and Shankar Sastry. Hybrid systems: modeling, analysis and control. Preprint, 1999.
[70] Matthew R. Maly, Morteza Lahijanian, Lydia E. Kavraki, Hadas Kress-Gazit, and Moshe Y. Vardi. Iterative temporal motion planning for hybrid systems in partially unknown environments. In Proceedings of the 16th International Conference on Hybrid Systems: Computation and Control, pages 353–362, New York, NY, USA, 2013. ACM.
[71] Zohar Manna and Amir Pnueli. The Temporal Logic of Reactive and Concurrent Systems. Springer-Verlag New York, Inc., New York, NY, USA, 1992.
[72] Robert McNaughton and Seymour Papert. Counter-Free Automata. MIT Press, 1971.
[73] Loizos Michael. Learning from partial observations. In Proceedings of the 20th International Joint Conference on Artificial Intelligence, pages 968–974, San Francisco, CA, USA, 2007. Morgan Kaufmann Publishers Inc.
[74] Stefan Mitsch, Sarah M. Loos, and André Platzer. Towards formal verification of freeway traffic control. In IEEE/ACM Third International Conference on Cyber-Physical Systems, pages 171–180, 2012.
[75] Madhavan Mukund. From Global Specifications to Distributed Implementations, pages 19–34. Kluwer Academic Publishers, 2002.
[76] Daniel Neider, Roman Rabinovich, and Martin Zimmermann. Down the Borel hierarchy: Solving Muller games via safety games. In Proceedings of the Third International Symposium on Games, Automata, Logics and Formal Verification, pages 169–182, 2012.
[77] Frank Neven, Thomas Schwentick, and Victor Vianu. Finite state machines for strings over infinite alphabets. ACM Transactions on Computational Logic, 5(3):403–435, 2004.
[78] Noam Nisan and Amir Ronen. Algorithmic mechanism design. In Proceedings of the 31st ACM Symposium on Theory of Computing, pages 129–140, 1999.
[79] Dominique Perrin and Jean-Éric Pin. Infinite Words: Automata, Semigroups, Logic and Games. Elsevier, 2004.
[80] Nir Piterman and Amir Pnueli. Synthesis of Reactive(1) designs. In Proceedings of Verification, Model Checking, and Abstract Interpretation, pages 364–380. Springer, 2006.
[81] Amir Pnueli. The temporal logic of programs. In 18th Annual Symposium on Foundations of Computer Science, pages 46–57, 1977.
[82] J. Raisch and S.D. O'Young. Discrete approximations and supervisory control of continuous systems. IEEE Transactions on Automatic Control, 43(4):569–573, 1998.
[83] Velupillai Sankaranarayanan and Ravi N. Banavar. Switched Finite Time Control of a Class of Underactuated Systems, volume 333 of Lecture Notes in Control and Information Sciences. Springer, 2006.
[84] Raymond Reiter. Knowledge in Action: Logical Foundations for Specifying and Implementing Dynamical Systems. MIT Press, 2001.
[85] James Rogers and Geoffrey Pullum. Aural pattern recognition experiments and the subregular hierarchy. Journal of Logic, Language and Information, 20:329–342, 2011.
[86] Jonathan Schiff, Philip S. Li, and Marc Goldstein. Robotic microsurgical vasovasostomy and vasoepididymostomy: a prospective random study in a rat model. Journal of Urology, 171:1720–1725, 2004.
[87] Yoav Shoham and Kevin Leyton-Brown. Multiagent Systems: Algorithmic, Game-Theoretic, and Logical Foundations. Cambridge University Press, 2009.
[88] George F. Simmons. Introduction to Topology and Modern Analysis. Krieger Publishing Company, 2003.
[89] K. Sreenath, H.-W. Park, I. Poulakakis, and J. W. Grizzle. A compliant hybrid zero dynamics controller for stable, efficient and fast bipedal walking on MABEL. International Journal of Robotics Research, 30(9):1170–1193, 2011.
[90] Colin Stirling. Modal and temporal logics for processes. In Faron Moller and Graham Birtwistle, editors, Logics for Concurrency: Structure versus Automata. Springer, 1996.
[91] Colin Stirling. The joys of bisimulation. In Proceedings of the 23rd International Symposium on Mathematical Foundations of Computer Science, volume 1450, pages 142–151, 1998.
[92] Paulo Tabuada. Approximate simulation relations and finite abstractions of quantized control systems. In A. Bemporad, A. Bicchi, and G. Buttazzo, editors, Hybrid Systems: Computation and Control, volume 4416 of Lecture Notes in Computer Science, pages 529–542. Springer-Verlag, 2007.
[93] Herbert Tanner, Jie Fu, Chetan Rawal, Jorge Piovesan, and Chaouki Abdallah. Finite abstractions for hybrid systems with stable continuous dynamics. Discrete Event Dynamic Systems, 22:83–99, 2012.
[94] Y. Tazaki and J. Imura. Finite abstractions of discrete-time linear systems and its application to optimal control. In Proceedings of the 17th IFAC World Congress, pages 4656–4661, 2008.
[95] Sylvie Thiébaux, Charles Gretton, John Slaney, David Price, and Froduald Kabanza. Decision-theoretic planning with non-Markovian rewards. Journal of Artificial Intelligence Research, 25(1):17–74, January 2006.
[96] J. G. Thistle and H. M. Lamouchi. Effective control synthesis for partially observed discrete-event systems. SIAM Journal on Control and Optimization, 48(3):1858–1887, June 2009.
[97] Ashish Tiwari and Gaurav Khanna. Series of abstractions for hybrid automata. In Claire J. Tomlin and Mark R. Greenstreet, editors, Hybrid Systems: Computation and Control, volume 2289 of Lecture Notes in Computer Science, pages 465–478. Springer Berlin Heidelberg, 2002.
[98] C. Tomlin, G.J. Pappas, and S. Sastry. Conflict resolution for air traffic management: a study in multiagent hybrid systems. IEEE Transactions on Automatic Control, 43(4):509–521, 1998.
[99] Jana Tumova, Gavin C. Hall, Sertac Karaman, Emilio Frazzoli, and Daniela Rus. Least-violating control strategy synthesis with safety rules. In Proceedings of the 16th International Conference on Hybrid Systems: Computation and Control, HSCC '13, pages 1–10, New York, NY, USA, 2013. ACM.
[100] A. Ulusoy, S.L. Smith, Xu Chu Ding, C. Belta, and D. Rus. Optimal multi-robot path planning with temporal logic constraints. In IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 3087–3092, 2011.
[101] Alphan Ulusoy, Stephen L. Smith, Xu Chu Ding, and Calin Belta. Robust multi-robot optimal path planning with temporal logic constraints. In 2012 IEEE International Conference on Robotics and Automation, pages 4693–4698. IEEE, 2012.
[102] Luis Valbuena and Herbert G. Tanner. Hybrid potential field based control of differential drive mobile robots. Journal of Intelligent & Robotic Systems, 68(3–4):307–322, 2012.
[103] Eric R. Westervelt, Jessy W. Grizzle, Christine Chevallereau, Jun Ho Choi, and Benjamin Morris. Feedback Control of Dynamic Bipedal Robot Locomotion. CRC Press, Boca Raton, 2007.
[104] Y. Willner and M. Heymann. Supervisory control of concurrent discrete-event systems. International Journal of Control, 54(5):1143–1169, 1991.
[105] Eric M. Wolff, Ufuk Topcu, and Richard M. Murray. Efficient reactive controller synthesis for a fragment of linear temporal logic. In Proceedings of the International Conference on Robotics and Automation, 2013 (in press).
[106] T. Wongpiromsarn, U. Topcu, and R.M. Murray. Receding horizon temporal logic planning. IEEE Transactions on Automatic Control, 57(11):2817–2830, 2012.
[107] Michael J. Wooldridge. Introduction to Multiagent Systems. John Wiley & Sons, Inc., New York, NY, USA, 2001.
[108] S. Xu and R. Kumar. Discrete event control under nondeterministic partial observation. In IEEE International Conference on Automation Science and Engineering, pages 127–132. IEEE, 2009.
[109] M. Zamani, G. Pola, M. Mazo, and P. Tabuada. Symbolic models for nonlinear control systems without stability assumptions. IEEE Transactions on Automatic Control, 57(7):1804–1809, 2012.
[110] V. I. Zubov. Mathematical Methods for the Study of Automatic Control Systems. Pergamon Press/Macmillan, 1963.
Appendix A
ASYMPTOTIC (T,D) EQUIVALENCE CLASSES
Denote dist(x, A) the distance between point x and the set A, defined as dist(x, A) := inf_{y ∈ A} ‖x − y‖.
Theorem 10 (Zubov [110]). The set Ω is the region of attraction of a periodic orbit
x = ϕ(t) with period T, if and only if there exist two functions V(x) and W(x) defined
on Ω satisfying: 1) V(x) is continuous on Ω, and the domain of W(x) can be extended
to the entire X; 2) V(x) ∈ (0, 1) for all x ∈ Ω \ ϕ, and V(x) = 0 for dist(x, ϕ) = 0;
3) W(x) > 0 for dist(x, ϕ) > 0, and W(x) = 0 for dist(x, ϕ) = 0; 4)

∇V^⊤ f(x) = −W(x) √(1 + ‖f‖²) (1 − V) ,      (A.1)

5) lim_{x → ∂Ω} V(x) = 1.
Proposition 10. Consider a system ẋ = f(x), and assume that its trajectories remain
inside a compact set Ω and that the (attractive) limit set L⁺ of the trajectories contains
a single, isolated component. Denote φ(t; x(0)) the trajectory of f starting at x(0), and
let V(x) be a solution to (A.1) that satisfies the requirements of Theorem 10. Then
the trajectories of f starting in Ω enter an ε-neighborhood of L⁺ in finite time, at most
T = (1/d) ln((1 − c)/(1 − C)), where¹ c := min_{x ∈ cl(L⁺ ⊕ B_ε(0))} V(x), C := max_{x ∈ Ω} V(x),
and d > 0 is the decay constant obtained in the proof.
Proof. Pick W(x) = dist(x, ϕ). This choice trivially satisfies the requirements of
Theorem 10. Now let V(x) be a solution of (A.1) that conforms with the conditions of
Theorem 10. Then from (A.1), for x ∈ Ω \ (L⁺ ⊕ B_ε(0)) it follows that V̇ ≤ −d(1 − V)
for some constant d > 0, and applying the Comparison Lemma with C = max_{x ∈ Ω} V(x)
one obtains V(x(t)) ≤ 1 − (1 − C)e^{dt}. Let V(x) = c be the largest level set of V included
in the closure of L⁺ ⊕ B_ε(0). Then the following upper bound can be obtained on the
time required for the flows starting within the level set V(x) = C to reach L⁺ ⊕ B_ε(0):
t ≤ (1/d) ln((1 − c)/(1 − C)). Setting T ≜ (1/d) ln((1 − c)/(1 − C)), the proof is completed.

¹ cl is used to denote set closure.
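The bound in the proof can be sanity-checked numerically. The sketch below is ours, with illustrative constants d, c, C not taken from the thesis: it integrates the comparison system V̇ = −d(1 − V) from V(0) = C with forward Euler, and confirms that the first time V drops to the level c agrees with T = (1/d) ln((1 − c)/(1 − C)).

```python
import math

# Illustrative constants (assumed for this sketch, not from the thesis)
d, c, C = 1.0, 0.1, 0.9

# Closed-form bound from the proof: T = (1/d) ln((1 - c)/(1 - C))
T = (1.0 / d) * math.log((1.0 - c) / (1.0 - C))

# Forward-Euler integration of the comparison system V' = -d (1 - V),
# starting from the worst-case initial value V(0) = C.
dt, t, V = 1e-5, 0.0, C
while V > c:
    V += dt * (-d * (1.0 - V))
    t += dt

print(f"closed-form bound T = {T:.4f}, Euler hitting time = {t:.4f}")
```

With these constants T = ln 9, and the numerically computed hitting time matches it to the integration tolerance.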
143
Appendix B
LEARNING ALGORITHM FOR THE CLASS OF STRICTLY K-LOCAL LANGUAGES
A string u is a factor of a string w iff there exist x, y ∈ Σ∗ such that w = xuy. If in
addition |u| = k, then u is a k-factor of w. The k-factor function factor_k : Σ∗ → 2^{Σ^{≤k}}
maps a word w to the set of k-factors within it if |w| > k; otherwise it maps w to the
singleton set {w}. This function is extended to languages as factor_k(L) := ⋃_{w ∈ L} factor_k(w).
A language L is Strictly k-Local (SL_k) iff there exists a finite set G ⊆ factor_k(♯Σ∗♯), such
that L = {w ∈ Σ∗ | factor_k(♯w♯) ⊆ G}, where ♯ is a special symbol indicating the
beginning and end of a string. The set G is the grammar that generates L.
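These definitions translate directly into code. The following sketch is our own illustration (function names hypothetical; the character '#' stands in for the word-boundary symbol): it computes the k-factors of a string and tests SL_k membership against a grammar.

```python
def factors_k(w: str, k: int) -> set:
    """Set of k-factors of w; if w is no longer than k, the singleton {w}."""
    if len(w) <= k:
        return {w}
    return {w[i:i + k] for i in range(len(w) - k + 1)}

def in_sl_k(w: str, grammar: set, k: int) -> bool:
    """w belongs to the SL_k language of `grammar` iff every k-factor
    of the boundary-marked string #w# lies in the grammar."""
    return factors_k('#' + w + '#', k) <= grammar

# Hypothetical SL_2 grammar forbidding the factor 'aa' over {a, b}:
G = {'#a', '#b', 'ab', 'ba', 'bb', 'a#', 'b#'}
print(in_sl_k('abab', G, 2), in_sl_k('aab', G, 2))  # True False
```

The membership test is just a subset check, which is what makes these languages so easy to learn.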
A language is called Strictly Local if it is Strictly k-Local for some k. There are
many distinct characterizations of this class. For example, the Strictly Local languages
are exactly those recognized by (generalized) Myhill graphs, those definable in a
restricted propositional logic over a successor function, and those closed under suffix
substitution [31, 72, 85]. Furthermore, there are known methods for translating between
automata-theoretic representations of Strictly Local languages and these other
characterizations.
Theorem 11 ([40]). For known k, the Strictly k-Local languages are identifiable in
the limit from positive data.
Readers are referred to the cited papers for a proof of this theorem. We sketch
the basic idea here using the grammars for Strictly k-Local languages defined above.
Consider any L ∈ SL_k. The grammar for L is G = factor_k(♯ · L · ♯),¹ and G contains
only finitely many strings.

¹ The operator · : 2^{Σ∗} × 2^{Σ∗} → 2^{Σ∗} concatenates string sets: given S₁, S₂ ⊆ Σ∗, S₁ · S₂ = {xy | x ∈ S₁, y ∈ S₂}.
A poly-time, incremental, and set-driven learning algorithm for the Strictly k-Local
languages is GIM, defined in [48] by: 1) i = 0: GIM(φ[i]) := ∅; 2) φ(i) = # (a pause in
the presentation): GIM(φ[i]) := GIM(φ[i − 1]); 3) otherwise: GIM(φ[i]) := GIM(φ[i − 1]) ∪ factor_k(♯φ(i)♯).
There is some finite point in every data presentation of L at which the learning
algorithm converges to the grammar of L, because the cardinality of G is finite. This
particular algorithm is analyzed in [48], and is a special case of lattice-structured
learning [49].
Example 2. Consider the strictly 2-local language L over Σ = {a, b} consisting of
the strings that have neither aa nor ba as a factor; that is, L = S̄ for S = (Σ∗aaΣ∗) ∪
(Σ∗baΣ∗), where S̄ denotes the complement of the set S with respect to Σ∗. The
grammar is G = factor₂(♯ · L · ♯) = {♯a, ♯b, ab, bb, b♯, a♯}. Obviously, aaa ∉ L because
factor₂(♯aaa♯) = {♯a, aa, a♯} ⊄ G.

Learning proceeds as follows: given a positive presentation φ where φ(1) = ab,
φ(2) = bb, φ(3) = a, applying the learning algorithm yields GIM(φ[1]) = factor₂(♯ab♯) =
{♯a, ab, b♯}; GIM(φ[2]) = GIM(φ[1]) ∪ factor₂(♯bb♯) = {♯a, ♯b, ab, bb, b♯}; GIM(φ[3]) =
GIM(φ[2]) ∪ factor₂(♯a♯) = G. The learner converges after having observed only three strings.
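A minimal sketch of the GIM learner (our own rendering; '#' replaces the boundary symbol, and None models a pause in the presentation) reproduces Example 2:

```python
def factors_k(w, k):
    # k-factors of w, or {w} itself when w is no longer than k
    if len(w) <= k:
        return {w}
    return {w[i:i + k] for i in range(len(w) - k + 1)}

def gim(presentation, k):
    """String extension learner for SL_k: the conjectured grammar is the
    running union of k-factors of the boundary-marked data seen so far."""
    G = set()                 # GIM(phi[0]) = empty set
    for datum in presentation:
        if datum is None:     # pause: conjecture unchanged
            continue
        G |= factors_k('#' + datum + '#', k)
    return G

# Replaying Example 2: the presentation ab, bb, a suffices for convergence.
grammar = gim(['ab', None, 'bb', 'a'], 2)
print(sorted(grammar))  # the six factors of the target grammar
```

Because the conjecture only ever grows, and the target grammar is finite, convergence is guaranteed once every grammar factor has appeared in some datum.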
This learning algorithm does not output a finite-state automaton, but sets of
factors. However, there is an easy way to convert any grammar of factors into an
acceptor which recognizes the same Strictly Local language. This acceptor is not the
canonical acceptor for the language, but it is a normal form. It is helpful to define
the function suf_k(L) = {v ∈ Σᵏ | (∃w ∈ L)(∃u ∈ Σ∗)[w = uv]}. Given k and a set of
factors G ⊆ factor_k(♯ · Σ∗ · ♯), construct a finite-state acceptor A_G = 〈Q, Σ, T, I, Acc〉 as follows.

• Q = suf_{k−1}(Pr(L(G)))

• (∀u ∈ Σ^{≤1})(∀σ ∈ Σ)(∀v ∈ Σ∗)[T(uv, σ) = vσ ⇔ uv, vσ ∈ Q]

• I = {λ} if L(G) ≠ ∅, and I = ∅ otherwise

• Acc = suf_{k−1}(L(G))

The proof that L(A_G) = L(G) is given in [47, p. 106].
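A sketch of the conversion (our own rendering, assuming the single-'#' boundary marking used above; all names hypothetical): states are the trailing (k−1)-symbol windows of the '#'-prefixed input read so far, a transition is legal exactly when the k-factor it completes belongs to G, and a state accepts when appending the closing '#' completes a grammar factor.

```python
def build_acceptor(grammar, k, alphabet):
    """Build a DFA (delta, accept, start) from a set G of k-factors."""
    start = '#'
    states, delta, stack = {start}, {}, [start]
    while stack:
        q = stack.pop()
        for a in alphabet:
            word = q + a
            # The k-factor completed by reading `a` in state q:
            if len(word) >= k and word[-k:] not in grammar:
                continue               # illegal factor: no transition
            r = word[-(k - 1):]        # next state: last k-1 symbols
            delta[(q, a)] = r
            if r not in states:
                states.add(r)
                stack.append(r)
    accept = {q for q in states if (q + '#')[-k:] in grammar}
    return delta, accept, start

def accepts(delta, accept, start, w):
    q = start
    for a in w:
        if (q, a) not in delta:
            return False
        q = delta[(q, a)]
    return q in accept

# Grammar from Example 2 (strings with neither 'aa' nor 'ba' as a factor):
G = {'#a', '#b', 'ab', 'bb', 'b#', 'a#'}
delta, accept, start = build_acceptor(G, 2, 'ab')
results = [accepts(delta, accept, start, w) for w in ['ab', 'bb', 'aba', 'aaa']]
print(results)  # [True, True, False, False]
```

This is not the construction from [47] verbatim, only an executable approximation of the same idea: the reachable states here play the role of Q, and the transition legality check enforces membership of every completed k-factor in G.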