© Daniel S. Weld 1
Logistics
• No reading
• First tournament Wed; details TBA
573 Topics
Agency
Problem Spaces
Search
Knowledge Representation
Planning
Perception
NLP
Multi-agent
Robotics
Uncertainty
MDPs
Supervised Learning
Reinforcement Learning
Review: MDPs
S = set of states (|S| = n)
A = set of actions (|A| = m)
Pr = transition function Pr(s, a, s′)
  Represented by a set of m n × n stochastic matrices
  Each defines a distribution over S × S
R(s) = bounded, real-valued reward function
  Represented by an n-vector
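This representation can be sketched concretely. The numbers below are an invented two-state, two-action example, not from the slides:

```python
# Hypothetical MDP: n = 2 states, m = 2 actions.
# One n x n stochastic matrix per action: Pr[a][s][s2] = Pr(s, a, s2).
Pr = [
    [[0.9, 0.1],
     [0.0, 1.0]],   # action 0
    [[0.2, 0.8],
     [0.5, 0.5]],   # action 1
]

# Bounded, real-valued reward function, represented as an n-vector: R[s].
R = [0.0, 1.0]

def is_stochastic(matrix, tol=1e-9):
    """Each row must be a probability distribution over successor states."""
    return all(abs(sum(row) - 1.0) < tol for row in matrix)
```

Each of the m matrices must have rows summing to 1, since row s is the distribution over successors of s under that action.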
Reinforcement Learning
How is learning to act possible when…
• Actions have nondeterministic effects
  And are initially unknown
• Rewards / punishments are infrequent
  Often at the end of long sequences of actions
• Learner must decide what actions to take
• World is large and complex
RL Techniques
• Temporal-difference learning
  Learns a utility function on states
    • or on [state, action] pairs
  Similar to backpropagation
    • treats the difference between expected and actual reward as an error signal that is propagated backward in time
• Exploration functions
  Balance exploration / exploitation
• Function approximation
  Compress a large state space into a small one
  Linear function approximation, neural nets, …
  Generalization
Example:
Passive RL
• Given a policy π, estimate Uπ(s)
• Not given the transition matrix, nor the reward function!
• Epochs: training sequences
(1,1)(1,2)(1,3)(1,2)(1,3)(1,2)(1,1)(1,2)(2,2)(3,2) −1
(1,1)(1,2)(1,3)(2,3)(2,2)(2,3)(3,3) +1
(1,1)(1,2)(1,1)(1,2)(1,1)(2,1)(2,2)(2,3)(3,3) +1
(1,1)(1,2)(2,2)(1,2)(1,3)(2,3)(1,3)(2,3)(3,3) +1
(1,1)(2,1)(2,2)(2,1)(1,1)(1,2)(1,3)(2,3)(2,2)(3,2) −1
(1,1)(2,1)(1,1)(1,2)(2,2)(3,2) −1
Value Function
Approach 1
• Direct estimation
  Estimate U(s) as the average total reward of epochs containing s (calculated from s to the end of the epoch)
• Pros / Cons?
  Requires a huge amount of data
  Doesn’t exploit Bellman constraints!
    Expected utility of a state = its own reward + expected utility of its successors
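Direct estimation can be sketched in a few lines. The epochs below are made up; as in the slide’s examples, the only reward is a terminal ±1, so the “total reward from s onward” is just that terminal reward (a simplification of the general case):

```python
from collections import defaultdict

def direct_estimation(epochs):
    """Estimate U(s) as the average total reward observed from s to epoch end.

    Each epoch is (list of visited states, terminal reward), matching the
    slide's training sequences, where reward arrives only at the end.
    """
    totals, counts = defaultdict(float), defaultdict(int)
    for states, terminal_reward in epochs:
        for s in states:
            totals[s] += terminal_reward
            counts[s] += 1
    return {s: totals[s] / counts[s] for s in totals}

# Two invented epochs with +1 / -1 terminal rewards:
U = direct_estimation([(["A", "B", "C"], +1),
                       (["A", "B"], -1)])
```

Note how slowly the averages converge: every state’s estimate is independent, which is exactly the failure to exploit the Bellman constraint that the slide criticizes.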
Approach 2: Adaptive Dynamic Programming
  Requires a fully observable environment
  Estimate the transition function M from training data
  Solve the Bellman equation with modified policy iteration:

    U(s) = R(s) + γ Σs′ Mss′ U(s′)

  Pros / Cons:
    Requires complete observations
    Don’t usually need the value of all states
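The two ADP steps can be sketched for a fixed policy. The transitions and rewards below are invented, and simple fixed-point iteration stands in for modified policy iteration:

```python
from collections import defaultdict

def estimate_model(transitions, states):
    """Estimate M[s][s2] as the empirical frequency of observed s -> s2."""
    counts = defaultdict(lambda: defaultdict(int))
    for s, s2 in transitions:
        counts[s][s2] += 1
    M = {}
    for s in states:
        total = sum(counts[s].values())
        M[s] = {s2: (counts[s][s2] / total if total else 0.0)
                for s2 in states}
    return M

def solve_bellman(M, R, states, gamma=0.9, iters=200):
    """Iterate U(s) = R(s) + gamma * sum_s' M[s][s'] U(s') to a fixed point."""
    U = {s: 0.0 for s in states}
    for _ in range(iters):
        U = {s: R[s] + gamma * sum(M[s][s2] * U[s2] for s2 in states)
             for s in states}
    return U

states = ["A", "B"]
transitions = [("A", "B"), ("A", "B"), ("B", "B")]  # B is absorbing here
R = {"A": 0.0, "B": 1.0}
U = solve_bellman(estimate_model(transitions, states), R, states)
```

With B absorbing and γ = 0.9, the fixed point is U(B) = 1/(1 − 0.9) = 10 and U(A) = 0.9 · U(B) = 9.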
Approach 3
• Temporal Difference Learning
  Do backups on a per-action basis
  Don’t try to estimate the entire transition function!
  For each transition from s to s′, update:

    U(s) ← U(s) + α (R(s) + γ U(s′) − U(s))
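The per-transition backup is one line of code; α and γ below are assumed example values:

```python
def td_update(U, s, r, s2, alpha=0.1, gamma=0.9):
    """U(s) <- U(s) + alpha * (R(s) + gamma * U(s') - U(s))."""
    U[s] += alpha * (r + gamma * U[s2] - U[s])
    return U

U = {"A": 0.0, "B": 1.0}
td_update(U, "A", 0.0, "B")   # moves U["A"] one step toward 0 + 0.9 * U["B"]
```

No transition matrix appears anywhere: the observed successor s′ itself serves as the sample of the next state.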
Notes
• Once U is learned, updates become 0:

    0 = α (R(s) + γ U(s′) − U(s))   when   U(s) = R(s) + γ U(s′)

  under the TD rule U(s) ← U(s) + α (R(s) + γ U(s′) − U(s))
• “TD(0)”: one-step lookahead
  Can do 2-step, 3-step, …
TD(λ)
• Or, … take it to the limit!
• Compute a weighted average of all future states:

    U(st) ← U(st) + α (R(st) + γ U(st+1) − U(st))

  becomes the weighted average

    U(st) ← U(st) + α (1 − λ) Σi≥0 λ^i (R(st+i) + γ U(st+i+1) − U(st+i))

• Implementation
  Propagate the current weighted TD onto past states
  Must memorize the states visited from the start of the epoch
TD(λ)
• The TD rule just described does 1-step lookahead – called TD(0)
  Could also look ahead 2 steps = TD(1), 3 steps = TD(2), …
• TD(λ) – compute a weighted average of all future states
• Implementation
  Propagate the current weighted temporal difference onto past states
  Requires memorizing the states visited since the beginning of the epoch (or from the point where λ^i is not too tiny)
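One common way to implement “propagate the current temporal difference onto past states” is an eligibility trace, which keeps a decaying weight per visited state instead of replaying the whole epoch. This is the trace-based formulation rather than the explicit weighted average above, and α, γ, λ are assumed example values:

```python
def td_lambda_step(U, trace, s, r, s2, alpha=0.1, gamma=0.9, lam=0.8):
    """Apply the current TD error to s and, scaled down, to earlier states."""
    delta = r + gamma * U[s2] - U[s]     # current temporal difference
    trace[s] = trace.get(s, 0.0) + 1.0   # mark s as just visited
    for state in trace:
        U[state] += alpha * delta * trace[state]
        trace[state] *= gamma * lam      # older states get exponentially less
    return U, trace

U, trace = {"A": 0.0, "B": 0.0, "C": 1.0}, {}
td_lambda_step(U, trace, "A", 0.0, "B")  # delta = 0: no visible change yet
td_lambda_step(U, trace, "B", 0.0, "C")  # delta = 0.9: updates B and, via its trace, A
```

After the second step the TD error at B has also nudged the earlier state A, with weight γλ = 0.72 of B’s own update.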
Notes II
• Online: update immediately after each action
  Works even if epochs are infinitely long
• Offline: wait until the end of an epoch
  Can perform updates in any order
  E.g. backwards in time
  (Converges faster if rewards come only at epoch end) Why?!?
• Prioritized sweeping heuristic
Q-Learning
• Version of TD learning where, instead of learning a value function on states, we learn one on [state, action] pairs
• Why do this?

    U(s) ← U(s) + α (R(s) + γ U(s′) − U(s))

  becomes

    Q(a, s) ← Q(a, s) + α (R(s) + γ maxa′ Q(a′, s′) − Q(a, s))

  where U(s) = maxa Q(a, s)
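The Q-learning backup can be sketched directly from the rule above; α and γ are assumed example values, and the tiny Q-table is invented:

```python
def q_update(Q, s, a, r, s2, actions, alpha=0.1, gamma=0.9):
    """Q(a,s) <- Q(a,s) + alpha*(R(s) + gamma * max_a' Q(a',s') - Q(a,s))."""
    best_next = max(Q[(a2, s2)] for a2 in actions)
    Q[(a, s)] += alpha * (r + gamma * best_next - Q[(a, s)])
    return Q

actions = [0, 1]
Q = {(a, s): 0.0 for a in actions for s in ["A", "B"]}
Q[(1, "B")] = 1.0                       # pretend action 1 already looks good in B
q_update(Q, "A", 0, 0.0, "B", actions)  # pulls Q[(0,"A")] toward gamma * 1.0
```

The answer to “why do this?”: the max over successor actions replaces any need for a transition model when choosing how to act, which is what makes Q-learning model-free.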
Part II
• So far, we have assumed the agent had a policy
• Now, suppose the agent must learn one
  While acting in an uncertain world
Active Reinforcement Learning
Suppose the agent must make a policy while learning

First approach:
  Start with an arbitrary policy
  Apply Q-learning
  New policy: in state s, choose the action a that maximizes Q(a, s)

Problem?
Utility of Exploration
• Too easily stuck in a non-optimal space
  “Exploration versus exploitation tradeoff”
• Solution 1
  With fixed probability, perform a random action
• Solution 2
  Increase the estimated expected value of infrequently visited states

    U+(s) ← R(s) + γ maxa f( Σs′ P(s′ | a, s) U+(s′),  N(a, s) )

  Properties of f(u, n)?
    If n > Ne, return u, i.e. the normal utility
    Else, return R+, i.e. the max possible reward
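Solution 2’s exploration function is tiny in code; R+ and Ne below are assumed example constants:

```python
R_PLUS = 2.0   # assumed optimistic bound on the best achievable reward
N_E = 5        # assumed visit count before trusting the learned utility

def f(u, n):
    """Exploration function: stay optimistic while (a, s) is under-explored."""
    return u if n > N_E else R_PLUS
```

The effect is that under-explored actions look artificially attractive, so a greedy agent keeps trying them until their true value estimate has had enough visits to be trusted.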
Part III
• The problem of large state spaces remains
  Never enough training data!
  Learning takes too long
• What to do??
Function Approximation
• Never enough training data!
  Must generalize what is learned to new situations
• Idea:
  Replace the large state table by a smaller, parameterized function
  Updating the value of one state will change the values assigned to many other, similar states
Linear Function Approximation
• Represent U(s) as a weighted sum of features (basis functions) of s:

    Û(s) = θ1 f1(s) + θ2 f2(s) + … + θn fn(s)

• Update each parameter separately, e.g.:

    θi ← θi + α (R(s) + γ Û(s′) − Û(s)) ∂Û(s)/∂θi
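For a linear Û, the partial derivative ∂Û(s)/∂θi is just the feature value fi(s), so the update is a few lines. The feature vectors below are invented, and α, γ are assumed example values:

```python
def u_hat(theta, feats):
    """U_hat(s) = theta_1 * f_1(s) + ... + theta_n * f_n(s)."""
    return sum(t * f for t, f in zip(theta, feats))

def lfa_td_update(theta, feats_s, r, feats_s2, alpha=0.1, gamma=0.9):
    """Move each theta_i along the TD error times dU_hat/dtheta_i = f_i(s)."""
    delta = r + gamma * u_hat(theta, feats_s2) - u_hat(theta, feats_s)
    return [t + alpha * delta * f for t, f in zip(theta, feats_s)]

theta = [0.0, 0.0, 0.0]                 # weights for features (1, x, y)
theta = lfa_td_update(theta, [1.0, 2.0, 3.0], 1.0, [1.0, 2.0, 2.0])
```

One observed transition has now shifted the value of every state, weighted by how similar its features are to the visited state’s.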
Example
• U(s) = θ0 + θ1 x + θ2 y
• Learns a good approximation
But What If…
• U(s) = θ0 + θ1 x + θ2 y + θ3 z
• Computed features:
    z = (xg − x)² + (yg − y)²
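The computed feature is the squared distance to the goal; the goal coordinates below are illustrative stand-ins for (xg, yg):

```python
def z_feature(x, y, xg=10.0, yg=10.0):
    """z = (xg - x)^2 + (yg - y)^2: squared distance to the goal cell."""
    return (xg - x) ** 2 + (yg - y) ** 2
```

A linear function of z can capture the radially symmetric value landscape around the goal that no linear function of raw x and y can express.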
Neural Nets
• Can create powerful function approximators
  Nonlinear
  Possibly unstable
• For TD learning, apply the difference signal to the neural-net output and perform backpropagation
Policy Search
• Represent the policy in terms of Q functions
• Gradient search
  Requires differentiability
  Stochastic policies; softmax
• Hillclimbing
  Tweak the policy and evaluate by running it
• Replaying experience
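The softmax idea above gives a differentiable, stochastic policy over Q values; the temperature τ below is an assumed parameter:

```python
import math

def softmax_policy(q_values, tau=1.0):
    """P(a) proportional to exp(Q(a, s) / tau), for a given state s."""
    exps = [math.exp(q / tau) for q in q_values]
    z = sum(exps)
    return [e / z for e in exps]

p = softmax_policy([1.0, 2.0], tau=0.5)   # favors the higher-Q action
```

Large τ approaches a uniform random policy; small τ approaches greedy action selection, so τ is another knob for the exploration/exploitation tradeoff.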
~World’s Best Player
• Neural network with 80 hidden units
  Used computed features
• 300,000 games against itself
Pole Balancing Demo
Applications to the Web: Focused Crawling
• Limited resources
  Fetch the most important pages first
• Topic-specific search engines
  Only want pages that are relevant to the topic
• Minimize stale pages
  Efficient re-fetch to keep the index timely
  How to track the rate of change for pages?
Standard Web Search Engine Architecture
[Architecture figure: crawl the web → store documents, check for duplicates, extract links → create an inverted index → inverted index (DocIds) → search engine servers, which take the user query and show results to the user]
Slide adapted from Marty Hearst / UC Berkeley
RL for Focused Crawling
• Huge, unknown environment
• Large, dynamic number of “actions”
• Consumable rewards
• Questions:
  How to represent states and actions?
  What’s the right reward? Discount factor?
  How to represent Q functions?
  Dynamic programming on a state space the size of the WWW?
Rennie & McCallum (ICML-99)
• Disregard state
• Action == bag of words in the link neighborhood
• Q function
  Train offline on a crawled corpus
  Compute reward as f(action), γ = 0.5
  Discretize rewards into bins
  Learn the Q function directly (supervised learning)
    • Add the words of the link neighborhood to the bin’s bag
    • Train a naïve Bayes classifier on the bins
Performance
Zoom on Performance
Applications to the Web II
• Improving PageRank
Methods
• Agent types
  Utility-based
  Action-value based (Q function)
  Reflex
• Passive learning
  Direct utility estimation
  Adaptive dynamic programming
  Temporal difference learning
• Active learning
  Choose a random action 1/nth of the time
  Exploration by adding to the utility function
  Q-learning (learn the action-value function directly – model free)
• Generalization
  ADDs
  Function approximation (linear functions or neural networks)
• Policy search
  Stochastic policy representation / softmax
  Reusing past experience
Summary
• Use reinforcement learning when
  The model of the world is unknown and/or rewards are delayed
• Temporal difference learning
  Simple and efficient training rule
• Q-learning eliminates the need for an explicit transition model
• Large state spaces can (sometimes!) be handled
  Function approximation, using linear functions or neural nets