Approximate Solutions for Partially Observable Stochastic Games with Common Payoffs

Rosemary Emery-Montemerlo

joint work with

Geoff Gordon, Jeff Schneider and Sebastian Thrun

July 21, 2004 AAMAS 2004

Robot Teams


With limited communication, existing paradigms for decentralized robot control are not sufficient

Game-theoretic methods are necessary for multi-robot coordination under these conditions


Decentralized Decision Making


A robot cannot choose actions based only on joint observations consistent with its own sensor readings

It must consider all joint observations that are consistent with its possible sensor readings


Relationship Between Decision Theoretic Models

[Figure: diagram relating MDP, POMDP and "?"; labels: State Space, Belief Space, Distribution over Belief Space]


Models of Multi-Agent Systems

Partially observable stochastic games
  Generalization of stochastic games to partially observable worlds

Related models
  DEC-POMDP [Bernstein et al., 2000]
  MTDP [Pynadath and Tambe, 2002]
  I-POMDP [Gmytrasiewicz and Doshi, 2004]
  POIPSG [Peshkin et al., 2000]


Partially Observable Stochastic Games

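$\text{POSG} = \{I, S, A, Z, T, R, O\}$

$I$ is the set of agents, $I = \{1, \ldots, n\}$
$S$ is the set of states
$A$ is the set of actions, $A = A_1 \times \cdots \times A_n$
$Z$ is the set of observations, $Z = Z_1 \times \cdots \times Z_n$
$T$ is the transition function, $T: S \times A \to S$
$R$ is the reward function, $R: S \times A \to \mathbb{R}$
$O$ are the observation emission probabilities, $O: S \times Z \times A \to [0, 1]$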
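As a minimal sketch of how the tuple above might be held in code, the following Python container mirrors {I, S, A, Z, T, R, O}. The class name `POSG`, its field names, and the two-agent toy instance are illustrative assumptions and do not come from the talk. The joint action and joint observation sets are built as Cartesian products of the per-agent sets.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, List, Sequence, Tuple

Joint = Tuple[str, ...]  # one entry per agent


@dataclass(frozen=True)
class POSG:
    """Illustrative container for {I, S, A, Z, T, R, O}; all names are assumptions."""
    agents: Sequence[int]                           # I
    states: Sequence[str]                           # S
    actions: Sequence[Sequence[str]]                # A_1, ..., A_n
    observations: Sequence[Sequence[str]]           # Z_1, ..., Z_n
    transition: Callable[[str, Joint], str]         # T: S x A -> S
    reward: Callable[[str, Joint], float]           # R: S x A -> reals
    obs_prob: Callable[[str, Joint, Joint], float]  # O: S x Z x A -> [0, 1]

    def joint_actions(self) -> List[Joint]:
        """A = A_1 x ... x A_n."""
        return list(product(*self.actions))

    def joint_observations(self) -> List[Joint]:
        """Z = Z_1 x ... x Z_n."""
        return list(product(*self.observations))


# Hypothetical two-agent toy problem, used only to exercise the container.
toy = POSG(
    agents=[1, 2],
    states=["s0", "s1"],
    actions=[["left", "right"], ["left", "right"]],
    observations=[["none", "seen"], ["none", "seen"]],
    transition=lambda s, a: "s1" if a == ("right", "right") else "s0",
    reward=lambda s, a: 1.0 if s == "s1" else 0.0,
    obs_prob=lambda s, z, a: 0.25,  # uniform over the 4 joint observations
)

print(len(toy.joint_actions()))       # number of joint actions
print(len(toy.joint_observations()))  # number of joint observations
```

With two actions and two observations per agent, both joint sets have four elements, which is the growth that makes the full game expensive as agents are added.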


Solving POSGs

Solving POSGs exactly is computationally infeasible for all but very small problems


Solving POSGs

[Figure: the full POSG alongside the one-step lookahead game at time t (a Bayesian game)]

We can approximate a POSG as a series of smaller Bayesian games


Bayesian Games

Private information relevant to game
  Uncertainty in utility

Type
  Encapsulates private information
  We limit ourselves to games with a finite number of types

In robot example
  Type 1: Robot doesn’t see anything
  Type 2: Robot sees intruder at location x


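Bayesian Games

$BG = \{I, \Theta, A, p(\Theta), u\}$

$\Theta$ is the joint type space, $\Theta = \Theta_1 \times \cdots \times \Theta_n$
$\theta$ is a specific joint type, $\theta = \{\theta_1, \ldots, \theta_n\}$
$p(\Theta)$ is the common prior on the distribution over $\Theta$
$u$ is the utility function, $u = \{u_1, \ldots, u_n\}$
  $u_i(a_i, a_{-i}, (\theta_i, \theta_{-i}))$
$\sigma_i$ is a strategy for player $i$
  Defines what player $i$ does for each of its possible types
  Actions are individual actions, not joint actions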
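To make the roles of the prior, the strategies, and the utility concrete, here is a small Python sketch that evaluates the expected common payoff of a joint strategy. The function name `expected_utility`, the toy prior, and the toy utility are hypothetical and chosen only for illustration. Each strategy maps an agent's own type to an individual action, as stated above.

```python
from typing import Callable, Dict, Hashable, Sequence, Tuple

JointType = Tuple[Hashable, ...]
JointAction = Tuple[Hashable, ...]
Strategy = Dict[Hashable, Hashable]  # maps an agent's own type to its own action


def expected_utility(
    strategies: Sequence[Strategy],
    prior: Dict[JointType, float],
    utility: Callable[[JointAction, JointType], float],
) -> float:
    """Expected common payoff of a joint strategy under the common prior."""
    total = 0.0
    for theta, p in prior.items():
        # Each agent looks up an individual action using only its own type.
        joint_action = tuple(s[theta_i] for s, theta_i in zip(strategies, theta))
        total += p * utility(joint_action, theta)
    return total


# Hypothetical 2-agent game: types and actions are both 0 or 1.
prior = {(0, 0): 0.4, (1, 1): 0.4, (0, 1): 0.1, (1, 0): 0.1}


def utility(a: JointAction, theta: JointType) -> float:
    # Reward for matching one's own type, plus a larger reward for coordinating.
    return (a[0] == theta[0]) + (a[1] == theta[1]) + 2 * (a[0] == a[1])


follow_type = [{0: 0, 1: 1}, {0: 0, 1: 1}]
always_zero = [{0: 0, 1: 0}, {0: 0, 1: 0}]

print(expected_utility(follow_type, prior, utility))
print(expected_utility(always_zero, prior, utility))
```

In this toy game, following one's own type is worth about 3.6 in expectation and always choosing action 0 about 3.0, because the types are strongly correlated under the prior.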


Bayesian-Nash Equilibrium

Set of best response strategies
Each agent tries to maximize its expected utility conditioned on its probability distribution over the other agents’ types
  $p(\Theta)$
Each agent has a policy $\sigma_i$ that, given $\sigma_{-i}$, maximizes $u_i(\sigma_i, \sigma_{-i}, \theta_{-i})$
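The slides do not say how an equilibrium is computed. As a hedged illustration, the sketch below swaps in alternating maximization (iterated best response), one common choice for common-payoff games: each agent in turn picks, for every one of its types, the action that maximizes expected utility given the other agents' fixed strategies, until no agent changes. All names (`best_response`, `alternating_maximization`) and the toy game are assumptions.

```python
from typing import Dict, Tuple

JointType = Tuple[int, ...]
Strategy = Dict[int, int]  # own type -> own action


def best_response(i, strategies, types, actions, prior, utility) -> Strategy:
    """Best response of agent i to the other agents' fixed strategies."""
    response = {}
    for theta_i in types[i]:
        def value(a_i):
            # Unnormalized expected utility of a_i, summed over the joint
            # types that are consistent with agent i's own type theta_i.
            total = 0.0
            for theta, p in prior.items():
                if theta[i] != theta_i:
                    continue
                joint = tuple(
                    a_i if k == i else strategies[k][theta[k]]
                    for k in range(len(types))
                )
                total += p * utility(joint, theta)
            return total

        best = strategies[i][theta_i]
        best_value = value(best)
        for a_i in actions[i]:
            v = value(a_i)
            if v > best_value + 1e-12:  # switch only on strict improvement
                best, best_value = a_i, v
        response[theta_i] = best
    return response


def alternating_maximization(types, actions, prior, utility, init=None, max_sweeps=100):
    """Iterate best responses until no agent changes its strategy."""
    strategies = init or [{t: actions[i][0] for t in types[i]} for i in range(len(types))]
    strategies = [dict(s) for s in strategies]
    for _ in range(max_sweeps):
        changed = False
        for i in range(len(types)):
            new = best_response(i, strategies, types, actions, prior, utility)
            if new != strategies[i]:
                strategies[i], changed = new, True
        if not changed:
            break
    return strategies


# Hypothetical 2-agent game with binary types and actions.
types = [[0, 1], [0, 1]]
actions = [[0, 1], [0, 1]]
prior = {(0, 0): 0.4, (1, 1): 0.4, (0, 1): 0.1, (1, 0): 0.1}


def utility(a, theta):
    return (a[0] == theta[0]) + (a[1] == theta[1]) + 2 * (a[0] == a[1])


print(alternating_maximization(types, actions, prior, utility))
print(alternating_maximization(types, actions, prior, utility,
                               init=[{0: 0, 1: 1}, {0: 0, 1: 1}]))
```

Started from "always choose action 0" the procedure stays there, and started from "follow own type" it also stays there. Both are equilibria of this toy game, which shows that best-response iteration certifies only a local optimum and that the starting point matters.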


POSG to Bayesian Game Approximation

$\{I, S, A, Z, T, R, O\}$ to $\{I, \Theta, A, p(\Theta), u\}^t$

$I = I$
$A = A$
Type space $\Theta_i^t$ = all possible histories of agent $i$’s actions and observations up to time $t$
$p(\Theta)^t$ calculated from $S_0$, $A$, $T$, $Z$, $O$ and the previous time step $t-1$
  Prune low probability types
  Each joint type maps to a joint belief
$u$ given by heuristic and $u_i = u_j$
  QMDP
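Since each joint type maps to a joint belief, a QMDP-style utility can be read as the belief-weighted value of a joint action in the underlying fully observable MDP. The sketch below shows that computation under stated assumptions: the function names, the stochastic transition interface (a dictionary of next-state probabilities), the discount factor, and the two-state toy problem are all hypothetical, and each action label stands in for a whole joint action.

```python
def q_mdp(states, actions, trans, reward, gamma=0.95, sweeps=200):
    """Value iteration on the fully observable MDP; returns Q[s][a]."""
    V = {s: 0.0 for s in states}
    Q = {s: {a: 0.0 for a in actions} for s in states}
    for _ in range(sweeps):
        Q = {
            s: {
                a: reward(s, a)
                + gamma * sum(p * V[s2] for s2, p in trans(s, a).items())
                for a in actions
            }
            for s in states
        }
        V = {s: max(Q[s].values()) for s in states}
    return Q


def qmdp_utility(belief, joint_action, Q):
    """QMDP value of a joint action under a belief: sum over s of b(s) * Q(s, a)."""
    return sum(p * Q[s][joint_action] for s, p in belief.items())


# Hypothetical toy problem: the opponent is on the left (L) or right (R)
# and does not move; the team is rewarded for heading to the correct side.
states = ["L", "R"]
actions = ["go_left", "go_right"]  # each label stands for a joint action


def trans(s, a):
    return {s: 1.0}  # static state


def reward(s, a):
    return 1.0 if (s, a) in {("L", "go_left"), ("R", "go_right")} else 0.0


Q = q_mdp(states, actions, trans, reward, gamma=0.5)
belief = {"L": 0.8, "R": 0.2}  # joint belief attached to one joint type
print(qmdp_utility(belief, "go_left", Q))
print(qmdp_utility(belief, "go_right", Q))
```

With this belief, heading left scores about 1.8 and heading right about 1.2. QMDP assumes the state becomes fully observable after one step, so it never values information gathering.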


Algorithm

Agent $i$ (agent $j$ runs the same steps in parallel, with $j$ in place of $i$)

1. Initialize: $t = 0$, $h_i = \{\}$, $p(\Theta^0)$; $\sigma^0 = \text{solveGame}(\Theta^0, p(\Theta^0))$
2. Make Observation: $h_i = obs_i^t \cup a_i^{t-1} \cup h_i$
3. Determine Type: $\theta_i^t = \text{bestMatch}(h_i, \Theta_i^t)$
4. Execute Action: $a_i^t = \sigma_i^t(\theta_i^t)$
5. Propagate Forward: $\Theta^{t+1}$, $p(\Theta^{t+1})$
6. Find Policy for $t+1$: $\sigma^{t+1} = \text{solveGame}(\Theta^{t+1}, p(\Theta^{t+1}))$; $t = t + 1$; return to step 2
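The per-agent loop above can be sketched as a small Python skeleton. This is only a structural outline: `solve_game` and `best_match` stand in for the slide's solveGame and bestMatch, whose internals the slides do not give, and `propagate`, `observe`, and `execute` are hypothetical callables supplied by the caller. The stubs at the bottom exist only so the skeleton runs.

```python
def run_agent(i, horizon, types, prior, solve_game, best_match,
              propagate, observe, execute):
    """One agent's control loop; every callable is a caller-supplied placeholder."""
    history = []                       # h_i starts empty
    policy = solve_game(types, prior)  # policy for the time-0 game
    last_action = None                 # no action has been taken before t = 0
    for t in range(horizon):
        # Make Observation: fold the previous action and new observation into h_i.
        history.append((last_action, observe(t)))
        # Determine Type: pick the retained type that best matches the history.
        my_type = best_match(history, types[i])
        # Execute Action: look up this agent's action for its current type.
        last_action = policy[i][my_type]
        execute(last_action)
        # Propagate Forward, then Find Policy for t + 1.
        types, prior = propagate(types, prior, policy)
        policy = solve_game(types, prior)
    return history


# Trivial stubs (assumptions) that make the loop executable end to end.
log = []
history = run_agent(
    i=0,
    horizon=3,
    types=[["only"], ["only"]],
    prior={("only", "only"): 1.0},
    solve_game=lambda types, prior: [{t: "stay" for t in ts} for ts in types],
    best_match=lambda h, candidates: candidates[0],
    propagate=lambda types, prior, policy: (types, prior),
    observe=lambda t: f"z{t}",
    execute=log.append,
)
print(log)
print(history)
```

Because every agent runs the same deterministic steps on the same type space and prior, the agents can reconstruct the same joint policy without communicating, provided their `solve_game` implementations break ties identically.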


Robotic Team Tag

Version of Team Tag
  Environment is portion of Gates Hall
  Full teammate observability
  Opponent can be captured by a single robot in any state
QMDP used as heuristic
Two pioneer-class robots


Robot Policies


Lady And The Tiger [Nair et al. 2003]

[Figure: Computation Time. x-axis: Horizon (3 to 10); y-axis: Time (ms), 0 to 1,000,000; series: Full POSG, Bayesian Game Approximation]


Contributions

Algorithm for finding approximate solutions to POSGs with common payoffs
Tractability achieved by modeling POSG as a sequence of Bayesian games
Performs comparably to the full POSG for a small finite-horizon problem
Improved performance over ‘blind’ application of utility heuristic in more complex problems
Successful real-time game-theoretic controller for indoor robots


Questions?

[email protected] www.cs.cmu.edu/~remery


Back-Up Slides


Policy Performance

Lady And The Tiger [Nair et al. 2003]

[Figure: x-axis: Horizon (3 to 10); y-axis: Expected or Average Reward (-80 to 20); series: Full POSG, Bayesian Game Approximation, Selfish Policy]


Robotic Team Tag

$I = \{1, 2\}$
$S = S_1 \times S_2 \times S_{opponent}$
  $S_i = \{s_0, \ldots, s_{28}\}$, $S_{opponent} = \{s_0, \ldots, s_{28}, s_{tagged}\}$
  $|S| = 25230$
$A_i = \{N, S, E, W, Tag\}$
$Z_i = [\{s_i, -1\}, s_{-i}, a_{-i}]$
$T$: adjacent cells
$O$: see opponent if on same cell
$R$: minimize capture time
Modified from [Pineau et al. 2003]
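The stated state count can be checked directly from the product structure of the state space: each robot occupies one of 29 cells, and the opponent is either in one of those 29 cells or tagged, giving 30 opponent states.

```latex
|S| = |S_1| \cdot |S_2| \cdot |S_{opponent}| = 29 \cdot 29 \cdot 30 = 25230
```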


Environment


Robotic Team Tag Results

[Figure: Performance. y-axis: Average Discounted Value (-60 to 0); groups: Full Observability of Teammate's Position, Without Full Observability of Teammate's Position; series: Full Observability, Most Likely State, QMDP, Bayesian Approximation]


Robotic Team Tag Results

[Figure: Performance. y-axis: Average Timesteps (0 to 100); groups: Full Observability of Teammate's Position, Without Full Observability of Teammate's Position; series: Full Observability, Most Likely State, QMDP, Bayesian Approximation]
