
Page 1

Uncertain Multiagent Systems: Games and Learning

H. Jin Kim, Songhwai Oh and Shankar Sastry

University of California, Berkeley

July 17, 2002

Decision-Making under Uncertainty ONR MURI

Page 2

Outline

• Hierarchical architecture for multiagent operations

• Partial observation Markov games (POMGame)

• Berkeley pursuit-evasion game (PEG) setup

• From PEG to unmanned dynamic battlefield

– Model predictive techniques for dynamic replanning

– Multi-target tracking (detect, ID, track)

– Dynamic model selection for estimating adversarial intent

Page 3

Partial-observation Probabilistic Pursuit-Evasion Game (PEG) with 4 UGVs & 1 UAV

A prototype system of fully autonomous mobile teams of intelligent and networked sensing agents deployed to discover and track mobile targets in unmapped environments.

Page 4

Uncertainty pervades every layer!

Hierarchy in Berkeley Platform

[Architecture diagram: a Strategy Planner and Map Builder exchange the positions of targets, obstacles, and agents over the Communications Network and send desired agent actions downward; a Tactical Planner & Regulation layer (tactical planner, trajectory planner, regulation) turns these into control signals (linear acceleration, angular velocity) for the UAV and UGV dynamics; Vehicle-level sensor fusion combines INS, GPS, ultrasonic altimeter, vision, and actuator encoder measurements (actuator positions, inertial positions, height over terrain, detected obstacles and targets) into the state of agents; targets, terrain, and exogenous disturbances act on the vehicles.]

Page 5

Pursuit-Evasion Game Experiment Setup

[Setup diagram: the Ground Command Post issues waypoint commands to the pursuer UAV and pursuer UGVs, which report position and vehicle status; the evader UGV's location is detected by the vision system.]

Page 6

Information Flow in UC Berkeley PEG Platform

[Information flow diagram: a ground-based Strategy Planner, comprising a Map Builder, Policy Calculator, and Probability Map, communicates over the wireless network with the pursuer UAV (flight computer and vision computer), a pursuer UGV (vision computer and motion controller), and the evader UGV (vision computer/map builder and motion controller); the vehicles send vision data, current positions, and waypoint requests; the planner returns the current coordination of agents, agent position requests, and current position information for the ground station display.]

Page 7

Lessons Learned and UAV/UGV Objective

• A scalable/replicable system that delivers missions reliably under uncertainty, and evaluation of its performance

• Hierarchical architecture design and analysis
– High-level decision making in a discrete space
– Physical-layer control in a continuous space

• Hierarchical decomposition requires tight interaction between layers to achieve cooperative behavior, to deconflict, and to support constraints

• Confronting uncertainty arising from partially observable, dynamically changing environments and intelligent adversaries

Page 8

Representing and Managing Uncertainty

• Uncertainty is introduced through various channels
– Sensing: unable to determine the current state of the world
– Prediction: unable to infer the future state of the world
– Actuation: unable to take the desired action to properly affect the state of the world

• Different types of uncertainty can be addressed by different approaches
– Nondeterministic uncertainty: Robust Control
– Probabilistic uncertainty: (Partially Observable) Markov Decision Processes
– Adversarial uncertainty: Game Theory
– The POMGame combines the probabilistic and adversarial views

Page 9

Partial Observation Markov Games (POMGame)
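In one common formulation (the notation here is assumed, since the slide's formal definition was graphical), a POMGame for agents $i = 1, \dots, n$ is a tuple

$$\big\langle \mathcal{X},\; \{\mathcal{A}_i\},\; \{\mathcal{Y}_i\},\; P,\; \{O_i\},\; \{R_i\} \big\rangle$$

with state space $\mathcal{X}$, per-agent action sets $\mathcal{A}_i$ and observation sets $\mathcal{Y}_i$, state transition kernel $P(x' \mid x, a_1, \dots, a_n)$, observation models $O_i(y_i \mid x)$, and per-agent reward functions $R_i(x, a_1, \dots, a_n)$.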

Page 10

Policy for POMGames

• Optimal value function of a state: the expected sum of rewards that the agent will gain by executing the optimal policy starting from that state (see the sketch below)

• Poorly understood: analysis exists only for very specially structured games, such as games with complete information on one side

• Special case: partially observable Markov decision processes (POMDP)
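In this POMDP special case, the optimal value function over belief states satisfies the familiar fixed-point equation; a standard discounted form (the discount factor $\gamma$ and the notation are assumptions, not taken from the slide):

$$V^*(b) = \max_{a} \Big[\, r(b, a) + \gamma \sum_{y} P(y \mid b, a)\, V^*\big(\tau(b, a, y)\big) \Big]$$

where $b$ is the belief state and $\tau(b, a, y)$ is its Bayes update after taking action $a$ and observing $y$.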

Page 11

Berkeley Pursuit-Evasion Game (PEG) Setup

Page 12

Abstraction of Pursuit-Evasion Game

• A partial-observation stochastic pursuit-evasion game in a 2-D grid world, between (heterogeneous) teams of n_e evaders and n_p pursuers

• At each time t, each evader and pursuer, located at x_e(t) and x_p(t) respectively,
– takes an observation over its visibility region
– updates its belief state
– chooses an action from its admissible action set (one round of this loop is sketched below)

• Goal: capture of the evader, or survival
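A minimal skeleton of one round of this loop; the class and method names (world.observe, agent.update_belief, and so on) are illustrative placeholders, not the Berkeley implementation:

```python
def game_step(agents, world):
    """One synchronous round of the grid-world pursuit-evasion game."""
    for agent in agents:               # pursuers and evaders alike
        y = world.observe(agent)       # observation over the visibility region
        agent.update_belief(y)         # Bayes update of the belief state
        a = agent.choose_action()      # action from the admissible set
        world.apply(agent, a)
    return world.evader_captured()     # capture ends the game; evaders play for survival
```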

Page 13

Optimal Pursuit Policy

• Performance measure: capture time

• The optimal policy minimizes the expected capture time. With $\mathcal{Y}^{fnd}_t$ denoting the set of all observation sequences $\{y(1), \dots, y(t)\}$ associated with an evader not being found up to time $t$:

$$T^* := \min\{\, t \ge 1 : \text{the evader is found at time } t \,\}, \qquad \mathbb{E}[T^*] = \sum_{t \ge 0} P\big(\{y(1), \dots, y(t)\} \in \mathcal{Y}^{fnd}_t\big)$$

Page 14

Optimal Pursuit Policy – Dynamic Programming Formulation
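The recursion itself was rendered graphically on the original slide; a plausible form, consistent with minimizing the expected capture time over belief states (the value function $V$ and belief update $\tau$ are assumed notation):

$$V(b) = \min_{a} \Big[\, 1 + \sum_{y \,:\, \text{evader not found}} P(y \mid b, a)\, V\big(\tau(b, a, y)\big) \Big], \qquad V(b) = 0 \;\text{ upon capture}$$

Each time step incurs unit cost until capture, so $V(b)$ is the minimum expected capture time from belief $b$.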

Page 15

Persistent Pursuit Policies

• Solving for the optimal policy of a partial-observation Markov game of non-trivial size using dynamic programming is computationally intractable.

• If the pursuit policy is persistent with a period T, then the expected capture time is bounded.

Page 16

Example of Persistent Pursuit Policies

Greedy Policy

– Pursuer moves to the neighboring cell with the highest probability of having an evader at the next instant

– Strategic planner assigns more importance to local or immediate considerations

Global Maximum Policy

– Pursuer moves toward the cell with the highest probability of containing an evader at the next instant, weighted by a distance metric
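A sketch of both policies over a probability (evader) map, assuming a 2-D numpy array indexed by grid cell; the distance weighting is one plausible choice, not necessarily the one used in the experiments:

```python
import numpy as np

def greedy_move(pos, evader_map):
    """Move to the adjacent cell with the highest evader probability."""
    rows, cols = evader_map.shape
    r, c = pos
    best, best_cell = -1.0, pos
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            nr, nc = r + dr, c + dc
            if (dr or dc) and 0 <= nr < rows and 0 <= nc < cols:
                if evader_map[nr, nc] > best:
                    best, best_cell = evader_map[nr, nc], (nr, nc)
    return best_cell

def global_max_move(pos, evader_map, weight=0.1):
    """Step toward the cell whose probability, discounted by distance, is highest."""
    rows, cols = evader_map.shape
    r, c = pos
    rr, cc = np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij")
    score = evader_map - weight * (np.abs(rr - r) + np.abs(cc - c))  # Manhattan penalty
    tr, tc = np.unravel_index(np.argmax(score), score.shape)
    return (r + int(np.sign(tr - r)), c + int(np.sign(tc - c)))     # one step toward it
```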

Page 17

Experimental Results: Pursuit Evasion Games with Four UGVs and a UAV

Page 18

Game-theoretic Policy Search Paradigm

• A large number of variables affect the solution

• Many interesting games, including pursuit-evasion, are large games with partial information, and finding optimal solutions is well outside the capability of current algorithms

• An approximate solution is not necessarily bad: there might be simple policies with satisfactory performance

Choose a good policy from a restricted class of policies!

• We can find approximately optimal solutions from restricted classes, using sparse sampling and a provably convergent policy search algorithm
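A minimal sketch of choosing a good policy from a restricted (finite) class by Monte Carlo evaluation; simulate is a hypothetical rollout function returning one game's payoff for a given policy, and this sketch does not reproduce the sparse-sampling or convergence machinery referred to above:

```python
import numpy as np

def evaluate_policy(policy, simulate, n_rollouts=50):
    """Monte Carlo estimate of a policy's expected payoff."""
    return float(np.mean([simulate(policy) for _ in range(n_rollouts)]))

def search_policy_class(policies, simulate, n_rollouts=50):
    """Pick the best-scoring policy from a restricted class."""
    scores = [evaluate_policy(p, simulate, n_rollouts) for p in policies]
    return policies[int(np.argmax(scores))]
```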

Page 19

Constructing a Policy Class

• Given a mission with specific goals, we
– decompose the problem in terms of the functions that need to be achieved for success and the means that are available
– analyze how a human team would solve the problem
– determine a list of important factors that complicate task performance, such as safety or physical constraints:

• Maximize aerial coverage

• Stay within communications range

• Penalize actions that lead an agent into a danger zone

• Maximize the explored region

• Minimize fuel usage, …

Page 20

Policy Representation

• Quantize the above features and define a feature vector f(h, a) that consists of the estimates of the above quantities for each action a given the agents' history h

• Estimate the 'goodness' of each action by a function Q_w(h, a) = w · f(h, a), where w is the weighting vector to be learned

• Choose the action that maximizes Q_w(h, a), or choose a randomized action according to a distribution that favors high-scoring actions (a sketch follows below)
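A sketch of this representation; the softmax form and temperature parameter for the randomized variant are assumptions, since the slide's exact distribution was graphical:

```python
import numpy as np

def action_scores(w, features):
    """Linear 'goodness' Q_w(h, a) = w . f(h, a) for every action.

    features: (num_actions, num_features) array of quantized policy
    features (coverage, comms range, danger, exploration, fuel, ...).
    """
    return features @ w

def choose_action(w, features, temperature=None, rng=None):
    q = action_scores(w, features)
    if temperature is None:                    # deterministic: maximize Q
        return int(np.argmax(q))
    rng = rng or np.random.default_rng()
    p = np.exp((q - q.max()) / temperature)    # numerically stable softmax
    p /= p.sum()
    return int(rng.choice(len(q), p=p))        # randomized action
```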

Page 21

Example: Policy Feature

• Maximize collective aerial coverage → maximize the distance between agents, e.g. ‖x_i(a) − x_j‖ for each teammate j, where x_i(a) is the location at which pursuer i lands by taking action a from its current position

• Try to visit an unexplored region with a high possibility of detecting an evader, e.g. the distance to x_f, where x_f is the position arrived at by the action that maximizes the evader-map value along the frontier

Page 22

Example: Policy Feature (Continued)

• Prioritize actions that are more compatible with the dynamics of agents

• Policy representation

Page 23

Benchmarking Experiments

• Performance of two pursuit policies compared in terms of capture time

• Experiment 1: two pursuers against an evader who moves greedily with respect to the pursuers' locations

• Experiment 2: when the position of the evader at each step is detected by the sensor network with only 10% accuracy, two optimized pursuers took 24.1 steps, while the one-step greedy pursuers took over 146 steps on average, to capture the evader in a 30 by 30 grid

Grid size   1-Greedy pursuers   Optimized pursuers
10 by 10    (7.3, 4.8)*         (5.1, 2.7)
20 by 20    (42.3, 19.2)        (12.3, 4.3)

* (mean, standard deviation) of capture time in steps

Page 24

Why General-sum Games?

"All too often in OR dealing with military problems, war is viewed as a zero-sum two-person game with perfect information. Here I must state as forcibly as I know that war is not a zero-sum two-person game with perfect information. Anybody who sincerely believes it is a fool. Anybody who reaches conclusions based on such an assumption and then tries to peddle these conclusions without revealing the quicksand they are constructed on is a charlatan....There is, in short, an urgent need to develop positive-sum game theory and to urge the acceptance of its precepts upon our leaders throughout the world."

Joseph H. Engel, Retiring Presidential Address to the Operations Research Society of America, October 1969

Page 25

General-sum Games

• Depending on the cooperation between the players:
– Noncooperative
– Cooperative

• Depending on the least expected payoff that a player is willing to accept:
– Nash's special/general bargaining solution

• By restricting the blue and red policy classes to finite size, we reduce the POMGame to a bimatrix game (a sketch follows on the 'From POMGame To Bimatrix Game' slide)

Page 26

From PEG to Combat Scenarios

• Adversarial attack: reds do not just evade but also attack → blues cannot blindly pursue reds

• Unknown number/capability of adversaries → dynamic selection of the relevant red model from unstructured observations

• Deconfliction between layers and teams

• Increased number of features → diversify possible solutions when the uncertainty is high

Page 27

From POMGame To Bimatrix Game
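A sketch of the reduction under assumed names: simulate is a hypothetical rollout returning one game's (blue, red) payoffs for a given policy pair, and the matrix entries are estimated by Monte Carlo averaging. Equilibria or bargaining solutions of (A, B) can then be computed with standard bimatrix methods.

```python
import numpy as np

def estimate_bimatrix(blue_policies, red_policies, simulate, n_runs=100):
    """Reduce the POMGame to a bimatrix game over finite policy classes.

    Entry (i, j) of A (resp. B) is blue's (resp. red's) average payoff
    when blue plays policy i against red policy j.
    """
    m, n = len(blue_policies), len(red_policies)
    A, B = np.zeros((m, n)), np.zeros((m, n))
    for i, bp in enumerate(blue_policies):
        for j, rp in enumerate(red_policies):
            payoffs = np.array([simulate(bp, rp) for _ in range(n_runs)])
            A[i, j], B[i, j] = payoffs.mean(axis=0)
    return A, B
```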

Page 28

Dynamic Bayesian Model Selection

• Dynamic Bayesian model selection (DBMS) is a generalized model-selection approach for time-series data in which the number of components can vary with time

• If K is the number of components at any instant and T is the length of the time series, then there are O(2^{KT}) possible models (each of K components may be present or absent at each of T steps), which demands an efficient algorithm

• The problem is formulated using Bayesian hierarchical modeling and solved using suitably adapted reversible jump MCMC methods

Page 29

DBMS

Page 30

DBMS: Graphical Representation

[Graphical model legend]
• Dirichlet prior
• A – transition matrix for m_t
• Dirichlet prior
• w_t – component weights
• z_t – allocation variable
• F – transition dynamics
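Read generatively, and with distributions assumed here rather than taken from the slide, the legend suggests a hierarchy along these lines:

$$m_t \mid m_{t-1} \sim A_{m_{t-1},\,\cdot}\,, \qquad w_t \mid m_t \sim \mathrm{Dirichlet}(\cdot), \qquad z_t \mid w_t \sim \mathrm{Discrete}(w_t), \qquad x_t \mid x_{t-1} \sim F(\,\cdot \mid x_{t-1})$$

where $m_t$ is the number of components at time $t$, $z_t$ assigns each observation to a component, and $F$ propagates the component states.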

Page 31

DBMS

Page 32

DBMS: Multi-target Tracking Example

Page 33

[Plot: estimated target positions; true target trajectories (+); observations]

Page 34

[Plot: estimated target positions; true target trajectories (+); observations]

Page 35

Summary

• Decomposition of complex multiagent operation problems requires tight interaction between subsystems and human intervention

• Partial observation Markov games provide a mathematical representation of a hierarchical multiagent system operating under adversarial and environmental uncertainty

• The policy class framework provides a setup for including human experience

• Policy search methods and sparse sampling produce computationally tractable algorithms that generate approximate solutions to partially observable Markov games

• Model predictive (receding horizon) techniques can be used for dynamic replanning to deconflict and coordinate between vehicles, layers, or subtasks

Page 36

THE END

Page 37

Acting under Partial Observations

• We need to use memory of previous actions and observations to disambiguate the current state.

• The state estimate, or belief state
– the posterior probability distribution over states
– the likelihood that the world is actually in state x at time t, given the agent's past experience (i.e., action and observation histories)
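In symbols (standard POMDP notation, consistent with the update on the next slide):

$$b_t(x) := P\big(x_t = x \mid y_{1:t},\, a_{1:t-1}\big)$$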

Page 38

Updating Belief State

– The belief state can be updated recursively using the estimated world model and Bayes' rule:

$$b_{t+1}(x') = \eta \;\underbrace{P\big(y_{t+1} \mid x'\big)}_{\text{new info on the state of the world}} \;\underbrace{\sum_{x} P\big(x' \mid x, a_t\big)\, b_t(x)}_{\text{new info on prediction}}$$

where $\eta$ is a normalizing constant.

Page 39

Pursuit-Evasion Game Experiment

PEG with four UGVs
• Global-max pursuit policy
• Simulated camera view (radius 7.5 m with 50-degree conic view)
• Pursuer: 0.3 m/s, evader: 0.5 m/s max

Page 40

Pursuit-Evasion Game Experiment

PEG with four UGVs
• Global-max pursuit policy
• Simulated camera view (radius 7.5 m with 50-degree conic view)
• Pursuer: 0.3 m/s, evader: 0.5 m/s max

Page 41

Experimental Results: Evaluation of Policies for Different Visibility

• The global-max policy performs better than greedy, since the greedy policy selects movements based only on local considerations.

• Both policies perform better with the trapezoidal view, since the camera rotates fast enough to compensate for the narrow field of view.

[Plot: capture time of the greedy and global-max policies for different pursuer visibility regions; three pursuers with trapezoidal or omnidirectional view; randomly moving evader]

Page 42

Experimental Results: Evader’s Speed vs. Intelligence

• Having a more intelligent evader increases the capture time

• It is harder to capture an intelligent evader at a higher speed

• The capture time of a fast random evader is shorter than that of a slower random evader, when the evader's speed is only slightly higher than that of the pursuers

[Plot: capture time for different speeds and levels of intelligence of the evader; three pursuers with trapezoidal view and global-max policy; max pursuer speed 0.3 m/s]

Page 43

Coordination under Multiple Sources of Commands

• When different agents or layers specify multiple, possibly conflicting goals or actions, how can the system prioritize or resolve them?
– A priori assignment of degrees of authority
– Surge in coordination demand when the situation deviates from textbook cases: can the overall system adapt in real time?

• Intermediate, cooperative modes of interaction between layers, agents, and the human operator, based on anticipatory reasoning, are desirable