Administrivia
• Final project proposals back today (w/ comments)
• Evaluated on 4 axes:
  • W&C == Writing & Clarity
  • M&P == Motivation & Problem statement
  • B&R == Background & Related work
  • RP == Research Plan
Reminders...
• Last time:
  • Bellman equation
  • Examples (pictures)
  • Solution of planning problem: policy iteration
• Today:
  • Q functions
  • The Q-learning algorithm
  • Discussion of R2
The policy iteration alg.

Function: policy_iteration
Input: MDP M = ⟨S, A, T, R⟩; discount γ
Output: optimal policy π*; opt. value func. V*
Initialization: choose π_0 arbitrarily
Repeat {
  V_i = eval_policy(M, π_i, γ)              // from Bellman eqn
  π_{i+1} = local_update_policy(π_i, V_i)
} Until (π_{i+1} == π_i)

Function: π' = local_update_policy(π, V)
for i = 1..|S| {
  π'(s_i) = argmax_{a∈A}( Σ_j T(s_i, a, s_j) · V(s_j) )
}
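A minimal executable sketch of this loop, assuming a tabular MDP with T stored as an |S|×|A|×|S| numpy array, R as a per-state reward vector, and policy evaluation done by a direct linear solve of the Bellman equations (all names here are illustrative, not from the slides):

import numpy as np

def eval_policy(T, R, pi, gamma):
    # Solve the Bellman equations for a fixed policy pi:
    # V = R + gamma * T_pi V  =>  (I - gamma*T_pi) V = R
    n = T.shape[0]
    T_pi = T[np.arange(n), pi, :]            # |S| x |S| transition matrix under pi
    return np.linalg.solve(np.eye(n) - gamma * T_pi, R)

def local_update_policy(T, V):
    # Greedy one-step lookahead: pi'(s) = argmax_a sum_s' T(s,a,s') * V(s')
    return np.argmax(T @ V, axis=1)          # (|S| x |A|) values; argmax over a

def policy_iteration(T, R, gamma=0.9):
    n_states, _, _ = T.shape
    pi = np.zeros(n_states, dtype=int)       # choose pi_0 arbitrarily
    while True:
        V = eval_policy(T, R, pi, gamma)
        pi_next = local_update_policy(T, V)
        if np.array_equal(pi_next, pi):
            return pi, V                     # converged: pi_{i+1} == pi_i
        pi = pi_next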
Q: A key operative
• Critical step in policy iteration:
    π'(s_i) = argmax_{a∈A}( Σ_j T(s_i, a, s_j) · V(s_j) )
• Asks "What happens if I ignore π for just one step, and do a instead (and then resume doing π thereafter)?"
• Alt: regardless of the current π, what would be the best a I could pick for the next timestep (greedily)?
Q: A key operative
• Commonly used operation. Gets a special name:
• Definition: the Q function is
    Q^π(s, a) = R(s) + γ · Σ_{s'} T(s, a, s') · V^π(s')
  ("take a now, then follow π thereafter")
• Policy iter says: "Figure out Q, act greedily according to Q, then update Q and repeat, until you can't do any better..."
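In array form the definition is a one-liner; this carries over the same illustrative numpy convention as the policy-iteration sketch above (T as |S|×|A|×|S|, R per-state):

import numpy as np

def q_from_v(T, R, V, gamma):
    # Q(s,a) = R(s) + gamma * sum_s' T(s,a,s') * V(s'); result is |S| x |A|
    return R[:, None] + gamma * (T @ V)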
What to do with Q
• Can think of Q as a big table: one entry for each state/action pair
• "If I'm in state s and take action a, this is my expected discounted reward..."
• A "one-step" exploration: "In state s, if I deviate from my policy π for one timestep, then keep doing π, is my life better or worse?"
• Can get V and π from Q:
    V(s) = max_{a∈A} Q(s, a)
    π(s) = argmax_{a∈A} Q(s, a)
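Concretely, with Q stored as an |S|×|A| numpy array (an assumption carried over from the sketches above), both reads are one-liners:

V  = Q.max(axis=1)       # V(s)  = max_a Q(s,a)
pi = Q.argmax(axis=1)    # pi(s) = argmax_a Q(s,a)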
Policy iteration, restated

Function: policy_iteration
Input: MDP M = ⟨S, A, T, R⟩; discount γ
Output: optimal policy π*; opt. value func. V*
Initialization: choose π_0 arbitrarily
Repeat {
  Q_i = eval_policy(M, π_i, γ)              // from Bellman eqn
  π_{i+1} = local_update_policy(π_i, Q_i)
} Until (π_{i+1} == π_i)

Function: π' = local_update_policy(π, Q)
for i = 1..|S| {
  π'(s_i) = argmax_{a∈A}( Q(s_i, a) )
}
Learning with Q
• Q and the notion of policy evaluation give us a nice way to do actual learning
• Use Q table to represent policy
• Update Q through experience
• Every time you see an (s,a,r,s') tuple, update Q
Learning with Q
• Each example of (s,a,r,s') is a sample from T(s,a,s') and from R
• W/ enough samples, can get a good idea of how the world works, where reward is, etc.
• Note: never actually learn T or R; let Q encode everything you need to know about the world
The Q-learning algorithm

Algorithm: Q_learn
Inputs: state space S; action space A;
        discount γ (0 ≤ γ < 1); learning rate α (0 ≤ α < 1)
Outputs: Q
Repeat {
  s = get_current_world_state()
  a = pick_next_action(Q, s)
  (r, s') = act_in_world(a)
  Q(s, a) = Q(s, a) + α · (r + γ · max_{a'}(Q(s', a')) − Q(s, a))
} Until (bored)
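A runnable sketch of this loop for a tabular task; the env object with reset()/step() and the ε-greedy choice standing in for pick_next_action are assumptions for illustration, not part of the slides:

import numpy as np

def q_learn(env, n_states, n_actions, gamma=0.9, alpha=0.1,
            epsilon=0.1, n_steps=100_000, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    s = env.reset()                          # assumed: returns a state id
    for _ in range(n_steps):
        # pick_next_action: epsilon-greedy in the current Q table
        if rng.random() < epsilon:
            a = int(rng.integers(n_actions))
        else:
            a = int(np.argmax(Q[s]))
        s_next, r, done = env.step(a)        # assumed: (s', reward, terminal?)
        # the update from the slide: alpha*(r + gamma*max_a' Q(s',a') - Q(s,a))
        target = r if done else r + gamma * np.max(Q[s_next])
        Q[s, a] += alpha * (target - Q[s, a])
        s = env.reset() if done else s_next
    return Q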
Well, it looks good anyway
• But are we sure it's actually learning?
• How to measure whether it's actually getting any better at the task? (Finding the goal state)
• Every 10 episodes, "freeze" the policy (turn off learning)
• Measure avg time to goal from a number of starting states
• Average over a number of test episodes to iron out noise
• Plot learning curve: # episodes of learning vs. avg performance (see the sketch below)
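One way to implement this measurement loop; run_ten_episodes and the reset-to-state interface are hypothetical stand-ins wired around a Q table like the one q_learn above produces:

def steps_to_goal(env, pi, s0, max_steps=500):
    # Roll out the frozen greedy policy from s0; count steps until the goal
    s, steps = env.reset(state=s0), 0        # assumed reset-to-state interface
    while steps < max_steps:
        s, _, done = env.step(int(pi[s]))
        steps += 1
        if done:
            break
    return steps

def learning_curve(env, Q, start_states, run_ten_episodes, n_blocks=50, n_test=5):
    curve = []
    for _ in range(n_blocks):
        run_ten_episodes(env, Q)             # learning on: updates Q in place
        pi = Q.argmax(axis=1)                # "freeze": act greedily, no updates
        runs = [steps_to_goal(env, pi, s0)
                for _ in range(n_test) for s0 in start_states]
        curve.append(sum(runs) / len(runs))  # average to iron out noise
    return curve                             # plot: 10*block episodes vs. avg steps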