Markov Decision Processes
AIMA: 17.1, 17.2 (excluding 17.2.3), 17.3
From utility to optimal policy

The utility function U(s) allows the agent to select the action that maximizes the expected utility of the subsequent state:

$$\pi^*(s) = \arg\max_{a \in A(s)} \sum_{s'} P(s' \mid s, a)\, U(s')$$
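As a minimal sketch of this selection in code (the MDP representation is an assumption, not from the slides: `P[s]` maps each action to a list of `(prob, next_state)` pairs, and `U` maps states to utilities):

```python
# Sketch: greedy policy extraction from a utility function U.
# Assumed representation: P[s][a] -> [(prob, next_state), ...].
def extract_policy(P, U):
    policy = {}
    for s, actions in P.items():
        # Choose the action maximizing expected utility of the next state.
        policy[s] = max(actions,
                        key=lambda a: sum(p * U[s2] for p, s2 in actions[a]))
    return policy
```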
The Bellman equation
Now, if the utility of a state is the expected sum of discounted rewards from that point onwards, then there is a direct relationship between the utility of a state and the utility of its neighbors:
The utility of a state is the immediate reward for that state plus the expected discounted utility of the next state, assuming that the agent chooses the optimal action:

$$U(s) = R(s) + \gamma \max_{a \in A(s)} \sum_{s'} P(s' \mid s, a)\, U(s') \qquad \text{(the Bellman equation)}$$
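A sketch of a single Bellman backup for one state, under the same assumed representation (`R[s]` for the immediate reward, `gamma` for the discount factor):

```python
# One Bellman backup for state s: immediate reward plus the discounted
# expected utility of the best action's successor states.
def bellman_backup(s, P, R, U, gamma):
    return R[s] + gamma * max(
        sum(p * U[s2] for p, s2 in outcomes)  # expected utility of one action
        for outcomes in P[s].values())
```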
The value iteration algorithm
For a problem with n states, there are n Bellman equations in n unknowns; however, the system is NOT linear, because of the max operator.
$$U(s) = R(s) + \gamma \max_{a \in A(s)} \sum_{s'} P(s' \mid s, a)\, U(s')$$

$$U_{i+1}(s) \leftarrow R(s) + \gamma \max_{a \in A(s)} \sum_{s'} P(s' \mid s, a)\, U_i(s')$$
Start with random U(s) and update iteratively. Guaranteed to converge to the unique solution.
Demo: http://people.cs.ubc.ca/~poole/demos/mdp/vi.html
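A sketch of the full loop under the same assumed MDP representation (the stopping rule on the maximum change, `eps`, is a common choice, not something the slides specify):

```python
# Sketch of value iteration: repeat Bellman updates over all states until
# the utilities stop changing (within eps).
def value_iteration(P, R, gamma, eps=1e-6):
    U = {s: 0.0 for s in P}                    # any initialization works
    while True:
        # One sweep of Bellman updates over all states.
        U_new = {s: R[s] + gamma * max(
                     sum(p * U[s2] for p, s2 in outcomes)
                     for outcomes in P[s].values())
                 for s in P}
        if max(abs(U_new[s] - U[s]) for s in P) < eps:
            return U_new
        U = U_new
```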
Policy iteration algorithm
It is possible to get an optimal policy even when the utility function estimate is inaccurate
If one action is clearly better than all others, then the exact magnitude of the utilities on the states involved need not be precise
| Value iteration | Policy iteration |
| --- | --- |
| Compute utilities of states | Compute utilities of states for a given policy |
| Compute optimal policy | Compute policy for the given state utilities |
Policy iteration algorithm
Policy improvement, given utilities U(s):

$$\pi^*(s) = \arg\max_{a \in A(s)} \sum_{s'} P(s' \mid s, a)\, U(s')$$

The Bellman equation (nonlinear, because of the max):

$$U(s) = R(s) + \gamma \max_{a \in A(s)} \sum_{s'} P(s' \mid s, a)\, U(s')$$

Policy evaluation with a fixed policy $\pi$ drops the max, leaving a linear equation:

$$U(s) = R(s) + \gamma \sum_{s'} P(s' \mid s, \pi(s))\, U(s')$$
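Putting the two steps together, a sketch of policy iteration under the same assumed MDP representation; the fixed number of evaluation sweeps is a simplification (exact evaluation by a linear solve appears on the next slide):

```python
# Sketch of policy iteration: alternate policy evaluation and policy
# improvement until the policy stops changing.
def policy_iteration(P, R, gamma, sweeps=50):
    policy = {s: next(iter(P[s])) for s in P}  # arbitrary initial policy
    U = {s: 0.0 for s in P}
    while True:
        # Policy evaluation (simplified: a fixed number of iterative sweeps).
        for _ in range(sweeps):
            U = {s: R[s] + gamma * sum(p * U[s2] for p, s2 in P[s][policy[s]])
                 for s in P}
        # Policy improvement: act greedily with respect to U.
        new_policy = {s: max(P[s],
                             key=lambda a: sum(p * U[s2] for p, s2 in P[s][a]))
                      for s in P}
        if new_policy == policy:
            return policy, U
        policy = new_policy
```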
Policy evaluation
For a problem with n states, policy evaluation gives n linear equations in n unknowns, solvable exactly in O(n³) time; an iterative scheme can also be used.
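For instance, a minimal numpy sketch of the exact solve, writing the system as (I − γT)U = R; the matrix T, holding P(s' | s, π(s)) under the fixed policy, and the argument names are assumptions for illustration:

```python
import numpy as np

# Exact policy evaluation: solve (I - gamma * T) U = R, where T[i, j] is
# P(j | i, pi(i)) under the fixed policy and R[i] is the reward of state i.
def evaluate_policy_exact(T, R, gamma):
    n = len(R)
    return np.linalg.solve(np.eye(n) - gamma * T, R)
```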
Summary
- Markov decision processes
- Utility of state sequences
- Utility of states
- Value iteration algorithm
- Policy iteration algorithm