TRANSCRIPT
Extended Markov Tracking: a Solution for the Strategic/Tactical Paradigm
Zinovi Rabinovich, Jeffrey S. Rosenschein
School of Engineering and Computer Science, Hebrew University of Jerusalem
Extended Markov Tracking – p.1/28
Agenda
Strategic/Tactical Paradigm
Markov Environment Model
Extended Markov Tracking (EMT)
Tactical Solution by EMT
Example application
Strategic Morphology
Multi-agent EMT
Multi-agent example
Future directions
Strategic/Tactical Paradigm
The Strategic/Tactical Paradigm breaks a continual planning/control problem into two basic levels:
The Strategic level deals with system formalization and the definition of ideal development.
The Tactical level deals not with the high-level tasks, but only with the ideal system development, which it attempts to implement.
The levels continually interact: the Tactical level reports its degree of success, while the Strategic level updates and modifies model parameters and the tactical target.
This paradigm is evident in many existing planning and control algorithms, and goes far back in time, even to epic stories like Ulysses' journey.
Scylla and Charybdis
Ulysses journey
Halfway up the cliff there is a cave, misty-looking and turned toward Erebos and the dark, the very direction from which, O shining Odysseus, you and your men will be steering your hollow ship; and from the hollow ship no vigorous young man with a bow could shoot to the hole in the cliffside. In that cavern Scylla lives, whose howling is terror. Her voice indeed is only as loud as a new-born puppy could make, but she herself is an evil monster. No one, not even a god encountering her, could be glad at that sight.
Ulysses journey
She has twelve feet, and all of them wave in the air. She has six necks upon her, grown to great length, and upon each neck there is a horrible head, with teeth in it, set in three rows close together and stiff, full of black death. Her body from the waist down is holed up inside the hollow cavern, but she holds her heads poked out and away from the terrible hollow, and there she fishes, peering all over the cliffside, looking for dolphins or dogfish to catch or anything bigger, some sea monster, of whom Amphitrite keeps so many; never can sailors boast aloud that their ship has passed her without any loss of men, for with each of her heads she snatches one man away and carries him off from the dark-prowed vessel.
Ulysses journey
The other cliff is lower; you will see, Odysseus, for they lie close together, you could even cast with an arrow across. There is a great fig tree grows there, dense with foliage, and under this shining Charybdis sucks down the black water. For three times a day she flows it up, and three times she sucks it terribly down; may you not be there when she sucks down water, for not even the Earthshaker could rescue you out of that evil.
But sailing your ship swiftly drive her past and avoid her and make for Scylla's rock instead, since it is far better to mourn six friends lost out of your ship than the whole company.
Ulysses journey
There are two dangers awaiting you:
Centuries took, but they know what they do.
Sailors, who dared them - none to be seen,
For all they have perished, if stories are keen.
Mortal, of dangers first, is for all,
Your crew will be fewer by second's one toll.
Take careful decision charting your course,
Make it much closer to the second of those.
Ulysses journey due to Strategy
It is the task of the Strategic level of our paradigm to formalize the problem, and it has to produce two outputs:
A formal description of the environment
The ideal development of the formal environment: the tactical target
The environment - and this will be our preferred choice - will be described by a Partially Observable Markov Environment Model.
Formal Model
Partially Observable Markov Environment Model < S, s0, A, T, O, Ω >
S - the set of system states, s0 ∈ S the initial system state
A - the set of applicable actions
T : S × A → Π(S) - the system transition function
O - the set of possible observations
Ω : S × A × S → Π(O) - the observation probability distribution
The system develops in epochs; at each epoch an observation is received, chosen according to the observation probability distribution, parametrized by the transition the system underwent.
Basic beliefs about the system state at time t are expressed by a probability vector p_t ∈ Π(S).
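As a concrete illustration, the model tuple and one development epoch can be sketched in code. The toy state space, action effects, and all probabilities below are our own illustrative assumptions, not values from the talk.

```python
import random

# A minimal sketch of the Markov environment model < S, s0, A, T, O, Omega >.
# All states, actions, and probabilities here are illustrative assumptions.
S = [0, 1, 2]            # system states
s0 = 1                   # initial system state
A = ["left", "right"]    # applicable actions

def T(s, a):
    """Transition distribution T(.|s, a), returned as {state: probability}."""
    shift = -1 if a == "left" else 1
    target = min(max(s + shift, 0), len(S) - 1)
    return {s2: (0.8 if s2 == target else 0.2 / (len(S) - 1)) for s2 in S}

def Omega(s, a, s2):
    """Observation distribution Omega(.|s, a, s2): here the new state is seen exactly."""
    return {o: (1.0 if o == s2 else 0.0) for o in S}

def step(s, a):
    """One epoch: sample the transition, then sample an observation."""
    t = T(s, a)
    s2 = random.choices(list(t), weights=list(t.values()))[0]
    om = Omega(s, a, s2)
    o = random.choices(list(om), weights=list(om.values()))[0]
    return s2, o
```

In a partially observable setting Omega would spread probability over several observations; the exact-observation choice here just keeps the sketch short.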
How model fits Strategy
It is quite easy to see how the ship's motion is described by a Markovian model < S, s0, A, T, O, Ω >:
The system states S represent different distances from Scylla and Charybdis.
The violent sea randomly throws the ship around - T - but the steering rod - A - can influence the tendency of motion to the left and right.
There is so much commotion - Ω - that you will be lucky to have even a distant clue of what is going on - O.
This makes it easy to describe the second output of the Strategic level, the tactical target: we would like the ship to always return to a prescribed distance between Scylla and Charybdis.
Under the Markov Environment Model this is simply the (conditional) distribution p(s'|s) = 1 ⟺ s' = ideal.
Ulysses journey due to Tactics
The Tactical level deals not with the problem, but only with its abstract representation.
Given a tactical target, make the formal system develop according to that target.
How?
Try to understand how the system actually develops
Correct it by means of action application
The need to know what others see
It is important to underline that we attempt to correct the impression of dynamics created by the actual development of the system state.
But we do not have exact knowledge of that impression and have to estimate it from noisy data:
First, estimating the state itself - Markov Tracking
Second, estimating the cause of the change - Dynamics Tracking
We term the overall combination Extended Markov Tracking (EMT).
Markov Tracking
In a Markov Model, 'tracking' simply means the fusion of observation data, action data, and current beliefs about the system state into new beliefs.
Since the system develops in discrete time epochs and the observation distribution is known, this can be done using the so-called Bayesian update:
p_{t+1}(s) ∝ p(o|s, a) Σ_{s'} T(s|a, s') p_t(s')
Markov tracking is used on its own merits in various solutions of Markov Decision Problems (MDPs).
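The Bayesian update above translates directly into a short function. The dictionary-based belief representation and the callables passed in are our own conventions, not an API from the talk.

```python
def bayes_update(p_t, a, o, T, Obs):
    """Bayesian belief update: p_{t+1}(s) ∝ p(o|s, a) * Σ_{s'} T(s|a, s') * p_t(s').

    p_t : {state: belief} prior over states
    T   : T(s_next, a, s_prev) -> transition probability
    Obs : Obs(o, s, a) -> observation probability p(o|s, a)
    """
    unnorm = {}
    for s in p_t:
        # prediction step: propagate the old belief through the transition model
        predicted = sum(T(s, a, sp) * p_t[sp] for sp in p_t)
        # correction step: weight by the observation likelihood
        unnorm[s] = Obs(o, s, a) * predicted
    z = sum(unnorm.values())  # normalization constant
    return {s: v / z for s, v in unnorm.items()}
```

For example, with a transition model that deterministically leads to state 1 and an exact observation of state 1, any prior collapses onto state 1.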
Markov Decision Problem (MDP)
An econometric extension of the Partially Observable Markov Environment Model:
Reward function R : S × A → ℝ
Optimality measure, e.g., maximize the average accumulated discounted reward
Algorithms exist that can solve MDPs and provide an action selection procedure that obtains the required optimality.
A wide range of system behaviors can be induced by action selection procedures dictated by a wise choice of reward function.
Problem: high computational complexity of solving MDPs in partially observable environments.
Dynamics Tracking
Dynamics tracking can be done by methods of Machine Learning:
Graphical Models
Decision Trees
Neural Networks
All are computationally hard.
Information Theory: the Kullback-Leibler divergence
D_KL(p ‖ q) = Σ_x p(x) log( p(x) / q(x) )
measures the cost of an encoding guided by distribution q of a data source governed by distribution p.
Degree of Mental Change: the same divergence measures the degree of mental change required to move from old beliefs q to new beliefs p.
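The KL divergence in the formula above is a one-liner to compute; the function below (using the usual 0·log 0 = 0 convention) is a generic sketch, not code from the talk.

```python
import math

def kl(p, q):
    """D_KL(p || q) = Σ_x p(x) log(p(x) / q(x)), with the 0·log 0 = 0 convention.
    p and q are {outcome: probability} dictionaries over the same support."""
    return sum(px * math.log(px / q[x]) for x, px in p.items() if px > 0)
```

As a sanity check: kl(p, p) is 0 for any p, and concentrating all mass on one of two equally likely outcomes costs log 2 nats.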
Extended Markov Tracking
Given that our beliefs about the system state have changed from p_t to p_{t+1}, and our previous belief about the exhibited system dynamics is PD_t, the update of this belief is the solution of the following:
PD_{t+1} = arg min_Q E_{p_{t+1}(s)} [ D_KL( Q(·|s) ‖ PD_t(·|s) ) ]
s.t. p_{t+1} = Q · p_t
Denote PD_{t+1} = H[p_{t+1}, p_t, PD_t].
Tactical solution
Now that we have a means of understanding how the system develops, we can attempt to fix it to our liking - the liking of the tactical target r : S → Π(S).
The tactical solution performs a continual loop:
EMT estimation of system development
Action choice:
a* = arg min_a D_KL( H[T_a * p_t, p_t, PD_t] ‖ r )
Application of a*.
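A simplified version of the action-choice step can be sketched as follows. Note the deliberate simplification: instead of computing the full EMT estimate H[T_a * p_t, p_t, PD_t], this sketch compares the one-step predicted belief T_a * p_t against a target state distribution. The function names and all numbers in the usage are our own assumptions.

```python
import math

def kl_list(p, q, eps=1e-12):
    """KL divergence between two probability vectors given as lists."""
    return sum(pi * math.log(pi / max(q[i], eps)) for i, pi in enumerate(p) if pi > 0)

def choose_action(p_t, actions, T, target):
    """Pick the action whose predicted next belief T_a * p_t is KL-closest
    to the target distribution. (Full EMT would instead compare the
    *estimated dynamics* H[T_a * p_t, p_t, PD_t] to the target r.)"""
    n = len(p_t)
    def predict(a):
        # T_a * p_t: predicted belief over next states under action a
        return [sum(T(s, a, sp) * p_t[sp] for sp in range(n)) for s in range(n)]
    return min(actions, key=lambda a: kl_list(predict(a), target))
```

With a deterministic "move right" action and a target concentrated near the last state, the selection favors moving right, as expected.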
Example: Steering the Ship
We tilt the steering rod to vary the ship's tendency to move left or right, so as to keep it at a specified distance from the dangerous cliffs.
Environment Model instantiation < S, s0, A, T, O, Ω >:
S = [0 : n], s0 = ⌊n/2⌋, n = 12
A ⊂ [0, 1], a finite set symmetric with respect to 0.5, |A| = 11
T(s'|s, a) is computed in such a way that the ship can make three probabilistic steps, with (p−, p+), the probabilities of a single step left or right, conforming to the following:
p is constant for all s, s' ∈ S and a ∈ A
p− = a · (1 − p) and p+ = (1 − a) · (1 − p)
O = S, Ω(o|s', a, s) = 1/3 iff |s' − o| ≤ 1
Ideal dynamics: r(s'|s) = 1 ⟺ s' = ⌊n/2⌋
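The transition function T(s'|s, a) of this instantiation can be built by composing three single probabilistic steps, exactly as described above. The per-step stay probability p = 0.4 is our own illustrative choice; the slide only says p is constant.

```python
n = 12        # states S = [0 : n], as in the slide
p_stay = 0.4  # constant per-step stay probability p (illustrative value)

def single_step(a):
    """One probabilistic step: p- = a*(1-p) left, p+ = (1-a)*(1-p) right."""
    return {-1: a * (1 - p_stay), 0: p_stay, +1: (1 - a) * (1 - p_stay)}

def transition(s, a, steps=3):
    """T(.|s, a): distribution over s' after three composed single steps,
    clamped to the state interval [0, n]."""
    dist = {s: 1.0}
    for _ in range(steps):
        nxt = {}
        for pos, pr in dist.items():
            for d, q in single_step(a).items():
                if q == 0.0:
                    continue
                pos2 = min(max(pos + d, 0), n)
                nxt[pos2] = nxt.get(pos2, 0.0) + pr * q
        dist = nxt
    return dist
```

With a = 0.5 the drift is symmetric around the current position; a > 0.5 biases the ship to the left, a < 0.5 to the right.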
Example: Results
[Figure: cumulative probability vs. distance from the global goal position (0 to 6), comparing the Tactical (EMT) and POMDP solutions.]
Strategic Morphology
Now that EMT works, we can ask: what kind of tactical targets can it really solve?
Those it can get from the Strategic level.
There are several major dimensions for tactical task specification:
Tactical target complexity
Model's action complexity
Model's state observability
Target complexity
Tactical target complexity refers to the structure of the conditional distribution r : S → Π(S).
State distribution: the tactical target r expresses a preference over states, rather than state transitions. Essentially this means that the original task was expressed by some vector p ∈ Π(S) with p = r · p.
Complete dynamics: the tactical target is an unstructured set of preferences. It may have been created by normalization from a reward function R : S × S → ℝ.
Action complexity
Action complexity refers to the composition of the action space. It is in fact possible to replace A by a Cartesian product A1 × · · · × An, expressing the presence of multiple active parties - agents.
For example, in the case of Ulysses' ship, one may think of controlling it by the coordinated actions of rowers.
EMT in this case may deal with different degrees of social awareness:
Complete social awareness: each and every agent selects actions with respect to the presence and potentials of the other agents.
Unaware scenario: agents are unaware of the other agents' presence, and respond to them only indirectly, through the environment modifications.
Observations complexity
As part of the multi-agent scenario, it is also possible to split the observations O = O1 × · · · × On and/or the system state S = S1 × · · · × Sn, disengaging and de-correlating the knowledge of different agents within the system.
The degree of mutual dependency between what agents know and experience provides an additional complexity dimension for the Strategic level to use.
Example: Springed bar problem
Consider a long bar resting its ends on two equal springs, with two agents of equal mass standing on the bar. Their task is to shift themselves around so that the bar levels.
Formally, the system state is described by the positions of the two agents on the bar, S = [1 : d_max]², where d_max is the length of the bar in "steps", and the initial state is an unbalanced one, s0 = (1, d_max/2 + 1). The action sets are A_i = {left, stay, right}, and the transition probability is built according to the physics of motion.
Observations
1. O_i = S, all positions of the two agents; Ω1 = Ω2 creates uniform noise over the immediate neighborhood of the agents' real joint position.
2. O_i = [1 : d_max], representing the position of the observing agent; Ω_i creates uniform noise over the immediate neighborhood of the observing agent's real position.
Multi-agent EMT
EMT has to be modified only slightly to admit a multi-agent scenario with complete social awareness and de-correlated observations and state. Each agent has to consider the complete Cartesian product of actions A = A1 × · · · × An, but performs only its own part.
Multi-Agent EMT loop:
EMT estimation of system development based on local (agent-specific) data
Action choice:
a⃗* = arg min_{a⃗} D_KL( H[T_{a⃗} * p_t, p_t, PD_t] ‖ r )
Application of a*_i from a* = (a*_1, ..., a*_n).
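The joint-action search over the Cartesian product can be sketched with itertools. The toy scoring function below stands in for the KL objective D_KL(H[T_a * p_t, p_t, PD_t] ‖ r) and is entirely our own assumption.

```python
from itertools import product

A1 = ["left", "stay", "right"]   # agent 1's actions
A2 = ["left", "stay", "right"]   # agent 2's actions

def score(joint):
    """Toy stand-in for D_KL(H[T_a * p_t, p_t, PD_t] || r): lower is better."""
    penalty = {"left": 1.0, "stay": 0.5, "right": 0.0}
    return penalty[joint[0]] + penalty[joint[1]]

# Each agent searches the full Cartesian product A1 x A2 ...
joint_best = min(product(A1, A2), key=score)
# ... but executes only its own component of the joint action.
my_index = 0                      # this agent is agent 1
my_action = joint_best[my_index]  # perform only its part of a* = (a*_1, a*_2)
```

Because every socially aware agent solves the same minimization, the agents implicitly coordinate on the same joint action without communicating.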
Results: First Observational scenario
In this first observation scenario, agents converge to a symmetric position around the ideal center of mass.
[Figure: position on the bar (1 ≤ pos ≤ d_max = 15) over 40 time steps, showing Agent 1, Agent 2, and the Center of Mass.]
[Figure: deviation (confidence-bar value × 10^4) over 100 time steps.]
Results: Second Observational scenario
In the second observation scenario, the agents have found an equilibrium point, where each agent occupies a far end of the bar, thus balancing it.
[Figure: position on the bar (1 ≤ pos ≤ d_max = 15) over 40 time steps, showing Agent 1, Agent 2, and the Center of Mass.]
[Figure: deviation over 100 time steps.]
whEre May iT lead?
There are still many questions to be answered and applications to be tested.
Can EMT be optimized for parametric models?
Can EMT handle balancing between multiple targets?
a* = arg min_a [ D_KL( H[. . .] ‖ r1 ) + D_KL( H[. . .] ‖ r2 ) ]
Can EMT handle a negative target? We would like to avoid certain dynamics:
a* = arg max_a D_KL( H[. . .] ‖ r )
a* = arg min_a D_KL( H[. . .] ‖ (1 − r) )
where Else May iT lead?
There are also questions regarding the Strategic/Tactical Paradigm itself:
Are there alternative tactical solutions?
How do they compare to EMT's performance?
Is there any optimality measure that we may think of?
More specifically for the Strategic level:
If the Tactical level fails to achieve its target, is there any alternative tactical target the Strategic level can provide?
Can the Strategic level claim that this tactical target is best? Most easily achieved?