TRANSCRIPT
Extended Markov Tracking: a Solution for the Strategic/Tactical Paradigm
Zinovi Rabinovich, Jeffrey S. Rosenschein
School of Engineering and Computer Science, Hebrew University of Jerusalem
Extended Markov Tracking – p.1/28
Agenda
Strategic/Tactical Paradigm
Markov Environment Model
Extended Markov Tracking (EMT)
Tactical Solution by EMT
Example application
Strategic Morphology
Multi-agent EMT
Multi-agent example
Future directions
Strategic/Tactical Paradigm
The Strategic/Tactical Paradigm breaks a continual planning/control problem into two basic levels:
The Strategic level deals with system formalization and the definition of ideal development.
The Tactical level deals not with the high-level tasks, but only with the ideal system development, which it attempts to implement.
The levels continually interact: the Tactical level reports its degree of success, while the Strategic level updates and modifies model parameters and the tactical target.
This paradigm is evident in many existing planning and control algorithms, and goes far back in time, even to epic stories like Ulysses' journey.
Scylla and Charybdis
Ulysses journey
Halfway up the cliff there is a cave, misty-looking and turned toward Erebos and the dark, the very direction from which, O shining Odysseus, you and your men will be steering your hollow ship; and from the hollow ship no vigorous young man with a bow could shoot to the hole in the cliffside. In that cavern Scylla lives, whose howling is terror. Her voice indeed is only as loud as a new-born puppy could make, but she herself is an evil monster. No one, not even a god encountering her, could be glad at that sight.
Ulysses journey
She has twelve feet, and all of them wave in the air. She has six necks upon her, grown to great length, and upon each neck there is a horrible head, with teeth in it, set in three rows close together and stiff, full of black death. Her body from the waist down is holed up inside the hollow cavern, but she holds her heads poked out and away from the terrible hollow, and there she fishes, peering all over the cliffside, looking for dolphins or dogfish to catch or anything bigger, some sea monster, of whom Amphitrite keeps so many; never can sailors boast aloud that their ship has passed her without any loss of men, for with each of her heads she snatches one man away and carries him off from the dark-prowed vessel.
Ulysses journey
The other cliff is lower; you will see, Odysseus, for they lie close together, you could even cast with an arrow across. There is a great fig tree grows there, dense with foliage, and under this shining Charybdis sucks down the black water. For three times a day she flows it up, and three times she sucks it terribly down; may you not be there when she sucks down water, for not even the Earthshaker could rescue you out of that evil.
But sailing your ship swiftly drive her past and avoid her and make for Scylla's rock instead, since it is far better to mourn six friends lost out of your ship than the whole company.
Ulysses journey
There are two dangers awaiting you:
Centuries took, but they know what they do.
Sailors, who dared them - none to be seen,
For all they have perished, if stories are keen.
Mortal, of dangers first, is for all,
Your crew will be fewer by second's one toll.
Take careful decision charting your course,
Make it much closer to the second of those.
Ulysses journey due to Strategy
It is the task of the Strategic level of our paradigm to formalize the problem, and it has to produce two outputs:
A formal description of the environment
The ideal development of the formal environment: the tactical target
The environment - and this will be our preferred choice - will be described by a Partially Observable Markov Environment Model.
Formal Model
Partially Observable Markov Environment Model < S, s0, A, T, O, Ω >
S - the set of system states, s0 ∈ S the initial system state
A - the set of applicable actions
T : S × A → Π(S) - the system transition function
O - the set of possible observations
Ω : S × A × S → Π(O) - the observation probability distribution
The system develops in epochs; at each epoch an observation is received, chosen according to the observation probability distribution, parametrized by the transition the system underwent.
Basic beliefs about the system state at time t are expressed by a probability vector p_t ∈ Π(S).
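As a concrete illustration, the model tuple and one development epoch can be sketched in code. The toy state space, action effects, and all probabilities below are our own illustrative assumptions, not values from the talk.

```python
import random

# A minimal sketch of the Markov environment model < S, s0, A, T, O, Omega >.
# All states, actions, and probabilities here are illustrative assumptions.
S = [0, 1, 2]            # system states
s0 = 1                   # initial system state
A = ["left", "right"]    # applicable actions

def T(s, a):
    """Transition distribution T(.|s, a), returned as {state: probability}."""
    shift = -1 if a == "left" else 1
    target = min(max(s + shift, 0), len(S) - 1)
    return {s2: (0.8 if s2 == target else 0.2 / (len(S) - 1)) for s2 in S}

def Omega(s, a, s2):
    """Observation distribution Omega(.|s, a, s2): here the new state is seen exactly."""
    return {o: (1.0 if o == s2 else 0.0) for o in S}

def step(s, a):
    """One epoch: sample the transition, then sample an observation."""
    t = T(s, a)
    s2 = random.choices(list(t), weights=list(t.values()))[0]
    om = Omega(s, a, s2)
    o = random.choices(list(om), weights=list(om.values()))[0]
    return s2, o
```

In a partially observable setting Omega would spread probability over several observations; the exact-observation choice here just keeps the sketch short.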
How model fits Strategy
It is quite easy to see how the ship's motion is described by a Markovian model < S, s0, A, T, O, Ω >:
The system states S represent different distances from Scylla and Charybdis.
The violent sea randomly throws the ship around - T - but the steering rod - A - can influence the tendency of motion to the left and right.
There is so much commotion - Ω - that you will be lucky to have even a distant clue of what is going on - O.
This makes it easy to describe the second output of the Strategic level, the tactical target: we would like the ship to always return to a prescribed distance between Scylla and Charybdis.
Under the Markov Environment Model this is simply the (conditional) distribution p(s'|s) = 1 ⟺ s' = ideal.
Ulysses journey due to Tactics
The Tactical level deals not with the problem, but only with its abstract representation.
Given a tactical target, make the formal system develop according to that target.
How?
Try to understand how the system actually develops
Correct it by means of action application
The need to know what others see
It is important to underline that we attempt to correct the impression of dynamics created by the actual development of the system state.
But we do not have exact knowledge of that impression and have to estimate it from noisy data:
First, estimating the state itself - Markov Tracking
Second, estimating the cause of the change - Dynamics Tracking
We term the overall combination Extended Markov Tracking (EMT).
Markov Tracking
In a Markov Model, 'tracking' simply means the fusion of observation data, action data, and current beliefs about the system state into new beliefs.
Since the system develops in discrete time epochs and the observation distribution is known, this can be done using the so-called Bayesian update:
p_{t+1}(s) ∝ p(o|s, a) Σ_{s'} T(s|a, s') p_t(s')
Markov tracking is used on its own merits in various solutions of Markov Decision Problems (MDPs).
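The Bayesian update above translates directly into a short function. The dictionary-based belief representation and the callables passed in are our own conventions, not an API from the talk.

```python
def bayes_update(p_t, a, o, T, Obs):
    """Bayesian belief update: p_{t+1}(s) ∝ p(o|s, a) * Σ_{s'} T(s|a, s') * p_t(s').

    p_t : {state: belief} prior over states
    T   : T(s_next, a, s_prev) -> transition probability
    Obs : Obs(o, s, a) -> observation probability p(o|s, a)
    """
    unnorm = {}
    for s in p_t:
        # prediction step: propagate the old belief through the transition model
        predicted = sum(T(s, a, sp) * p_t[sp] for sp in p_t)
        # correction step: weight by the observation likelihood
        unnorm[s] = Obs(o, s, a) * predicted
    z = sum(unnorm.values())  # normalization constant
    return {s: v / z for s, v in unnorm.items()}
```

For example, with a transition model that deterministically leads to state 1 and an exact observation of state 1, any prior collapses onto state 1.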
Markov Decision Problem (MDP)
An econometric extension of the Partially Observable Markov Environment Model:
Reward function R : S × A → ℝ
Optimality measure, e.g., maximize the average accumulated discounted reward
Algorithms exist that can solve MDPs and provide an action selection procedure that obtains the required optimality.
A wide range of system behaviors can be induced by action selection procedures dictated by a wise choice of reward function.
Problem: high computational complexity of solving MDPs in partially observable environments.
Dynamics Tracking
Dynamics tracking can be done by methods of Machine Learning:
Graphical Models
Decision Trees
Neural Networks
All are computationally hard.
Information Theory: the Kullback-Leibler divergence
D_KL(p ‖ q) = Σ_x p(x) log( p(x) / q(x) )
measures the cost of an encoding guided by distribution q of a data source governed by distribution p.
Degree of Mental Change: the same divergence measures the degree of mental change required to move from old beliefs q to new beliefs p.
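The KL divergence in the formula above is a one-liner to compute; the function below (using the usual 0·log 0 = 0 convention) is a generic sketch, not code from the talk.

```python
import math

def kl(p, q):
    """D_KL(p || q) = Σ_x p(x) log(p(x) / q(x)), with the 0·log 0 = 0 convention.
    p and q are {outcome: probability} dictionaries over the same support."""
    return sum(px * math.log(px / q[x]) for x, px in p.items() if px > 0)
```

As a sanity check: kl(p, p) is 0 for any p, and concentrating all mass on one of two equally likely outcomes costs log 2 nats.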
Extended Markov Tracking
Given that our beliefs about the system state have changed from p_t to p_{t+1}, and our previous belief about the exhibited system dynamics is PD_t, the update of this belief is the solution of the following:
PD_{t+1} = arg min_Q E_{p_{t+1}(s)} [ D_KL( Q(·|s) ‖ PD_t(·|s) ) ]
s.t. p_{t+1} = Q · p_t
Denote PD_{t+1} = H[p_{t+1}, p_t, PD_t].
Tactical solution
Now that we have a means of understanding how the system develops, we can attempt to fix it to our liking - the liking of the tactical target r : S → Π(S).
The tactical solution performs a continual loop:
EMT estimation of system development
Action choice:
a* = arg min_a D_KL( H[T_a * p_t, p_t, PD_t] ‖ r )
Application of a*.
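A simplified version of the action-choice step can be sketched as follows. Note the deliberate simplification: instead of computing the full EMT estimate H[T_a * p_t, p_t, PD_t], this sketch compares the one-step predicted belief T_a * p_t against a target state distribution. The function names and all numbers in the usage are our own assumptions.

```python
import math

def kl_list(p, q, eps=1e-12):
    """KL divergence between two probability vectors given as lists."""
    return sum(pi * math.log(pi / max(q[i], eps)) for i, pi in enumerate(p) if pi > 0)

def choose_action(p_t, actions, T, target):
    """Pick the action whose predicted next belief T_a * p_t is KL-closest
    to the target distribution. (Full EMT would instead compare the
    *estimated dynamics* H[T_a * p_t, p_t, PD_t] to the target r.)"""
    n = len(p_t)
    def predict(a):
        # T_a * p_t: predicted belief over next states under action a
        return [sum(T(s, a, sp) * p_t[sp] for sp in range(n)) for s in range(n)]
    return min(actions, key=lambda a: kl_list(predict(a), target))
```

With a deterministic "move right" action and a target concentrated near the last state, the selection favors moving right, as expected.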
Example: Steering the Ship
We tilt the steering rod to vary the ship's tendency to move left or right, so as to keep it at a specified distance from the dangerous cliffs.
Environment Model instantiation < S, s0, A, T, O, Ω >:
S = [0 : n], s0 = ⌊n/2⌋, n = 12
A ⊂ [0, 1], a finite set symmetric with respect to 0.5, |A| = 11
T(s'|s, a) is computed in such a way that the ship can make three probabilistic steps, with (p−, p+), the probabilities of a single step left or right, conforming to the following:
p is constant for all s, s' ∈ S and a ∈ A
p− = a · (1 − p) and p+ = (1 − a) · (1 − p)
O = S, Ω(o|s', a, s) = 1/3 iff |s' − o| ≤ 1
Ideal dynamics: r(s'|s) = 1 ⟺ s' = ⌊n/2⌋
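The transition function T(s'|s, a) of this instantiation can be built by composing three single probabilistic steps, exactly as described above. The per-step stay probability p = 0.4 is our own illustrative choice; the slide only says p is constant.

```python
n = 12        # states S = [0 : n], as in the slide
p_stay = 0.4  # constant per-step stay probability p (illustrative value)

def single_step(a):
    """One probabilistic step: p- = a*(1-p) left, p+ = (1-a)*(1-p) right."""
    return {-1: a * (1 - p_stay), 0: p_stay, +1: (1 - a) * (1 - p_stay)}

def transition(s, a, steps=3):
    """T(.|s, a): distribution over s' after three composed single steps,
    clamped to the state interval [0, n]."""
    dist = {s: 1.0}
    for _ in range(steps):
        nxt = {}
        for pos, pr in dist.items():
            for d, q in single_step(a).items():
                if q == 0.0:
                    continue
                pos2 = min(max(pos + d, 0), n)
                nxt[pos2] = nxt.get(pos2, 0.0) + pr * q
        dist = nxt
    return dist
```

With a = 0.5 the drift is symmetric around the current position; a > 0.5 biases the ship to the left, a < 0.5 to the right.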
Example: Results
[Figure: cumulative probability vs. distance from the global goal position (0 to 6), comparing the Tactical (EMT) and POMDP solutions.]
Strategic Morphology
Now that EMT works, we can ask: what kind of tactical targets can it really solve?
Those it can get from the Strategic level.
There are several major dimensions for tactical task specification:
Tactical target complexity
Model's action complexity
Model's state observability
Target complexity
Tactical target complexity refers to the structure of the conditional distribution r : S → Π(S).
State distribution: the tactical target r expresses a preference over states, rather than state transitions. Essentially this means that the original task was expressed by some vector p ∈ Π(S) with p = r · p.
Complete dynamics: the tactical target is an unstructured set of preferences. It may have been created by normalization from a reward function R : S × S → ℝ.
Action complexity
Action complexity refers to the composition of the action space. It is in fact possible to replace A by a Cartesian product A1 × · · · × An, expressing the presence of multiple active parties - agents.
For example, in the case of Ulysses' ship, one may think of controlling it by the coordinated actions of rowers.
EMT in this case may deal with different degrees of social awareness:
Complete social awareness: each and every agent selects actions with respect to the presence and potentials of the other agents.
Unaware scenario: agents are unaware of the other agents' presence, and respond to them only indirectly, through the environment modifications.
Observations complexity
As part of the multi-agent scenario, it is also possible to split the observations O = O1 × · · · × On and/or the system state S = S1 × · · · × Sn, disengaging and de-correlating the knowledge of different agents within the system.
The degree of mutual dependency between what agents know and experience provides an additional complexity dimension for the Strategic level to use.
Example: Springed bar problem
Consider a long bar resting its ends on two equal springs, with two agents of equal mass standing on the bar. Their task is to shift themselves around so that the bar levels.
Formally, the system state is described by the positions of the two agents on the bar, S = [1 : d_max]², where d_max is the length of the bar in "steps", and the initial state is an unbalanced one, s0 = (1, d_max/2 + 1). The action sets are A_i = {left, stay, right}, and the transition probability is built according to the physics of motion.
Observations
1. O_i = S, all positions of the two agents; Ω1 = Ω2 creates uniform noise over the immediate neighborhood of the agents' real joint position.
2. O_i = [1 : d_max], representing the position of the observing agent; Ω_i creates uniform noise over the immediate neighborhood of the observing agent's real position.
Multi-agent EMT
EMT has to be modified only slightly to admit a multi-agent scenario with complete social awareness and de-correlated observations and state. Each agent has to consider the complete Cartesian product of actions A = A1 × · · · × An, but performs only its own part.
Multi-Agent EMT loop:
EMT estimation of system development based on local (agent-specific) data
Action choice:
a⃗* = arg min_{a⃗} D_KL( H[T_{a⃗} * p_t, p_t, PD_t] ‖ r )
Application of a*_i from a* = (a*_1, ..., a*_n).
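The joint-action search over the Cartesian product can be sketched with itertools. The toy scoring function below stands in for the KL objective D_KL(H[T_a * p_t, p_t, PD_t] ‖ r) and is entirely our own assumption.

```python
from itertools import product

A1 = ["left", "stay", "right"]   # agent 1's actions
A2 = ["left", "stay", "right"]   # agent 2's actions

def score(joint):
    """Toy stand-in for D_KL(H[T_a * p_t, p_t, PD_t] || r): lower is better."""
    penalty = {"left": 1.0, "stay": 0.5, "right": 0.0}
    return penalty[joint[0]] + penalty[joint[1]]

# Each agent searches the full Cartesian product A1 x A2 ...
joint_best = min(product(A1, A2), key=score)
# ... but executes only its own component of the joint action.
my_index = 0                      # this agent is agent 1
my_action = joint_best[my_index]  # perform only its part of a* = (a*_1, a*_2)
```

Because every socially aware agent solves the same minimization, the agents implicitly coordinate on the same joint action without communicating.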
Results: First Observational scenario
In this first observation scenario, agents converge to a symmetric position around the ideal center of mass.
[Figure: position on the bar (1 ≤ pos ≤ d_max = 15) over 40 time steps, showing Agent 1, Agent 2, and the Center of Mass.]
[Figure: deviation (confidence-bar value × 10^4) over 100 time steps.]
Results: Second Observational scenario
In the second observation scenario, the agents have found an equilibrium point, where each agent occupies a far end of the bar, thus balancing it.
[Figure: position on the bar (1 ≤ pos ≤ d_max = 15) over 40 time steps, showing Agent 1, Agent 2, and the Center of Mass.]
[Figure: deviation over 100 time steps.]
whEre May iT lead?
There are still many questions to be answered and applications to be tested.
Can EMT be optimized for parametric models?
Can EMT handle balancing between multiple targets?
a* = arg min_a [ D_KL( H[. . .] ‖ r1 ) + D_KL( H[. . .] ‖ r2 ) ]
Can EMT handle a negative target? We would like to avoid certain dynamics:
a* = arg max_a D_KL( H[. . .] ‖ r )
a* = arg min_a D_KL( H[. . .] ‖ (1 − r) )
where Else May iT lead?
There are also questions regarding the Strategic/Tactical Paradigm itself:
Are there alternative tactical solutions?
How do they compare to EMT's performance?
Is there any optimality measure that we may think of?
More specifically for the Strategic level:
If the Tactical level fails to achieve its target, is there any alternative tactical target the Strategic level can provide?
Can the Strategic level claim that this tactical target is best? Most easily achieved?