TRANSCRIPT

Page 1: Logistics

© Daniel S. Weld 1

Logistics

• No Reading
• First Tournament Wed

  Details TBA

Page 2: 573 Topics

© Daniel S. Weld 2

573 Topics

Agency
Problem Spaces
Search
Knowledge Representation
Planning
Perception
NLP
Multi-agent
Robotics
Uncertainty
MDPs
Supervised Learning
Reinforcement Learning

Page 3: Review: MDPs

© Daniel S. Weld 3

Review: MDPs

S = set of states (|S| = n)

A = set of actions (|A| = m)

Pr = transition function Pr(s, a, s′), represented by a set of m n×n stochastic matrices; each defines a distribution over S×S

R(s) = bounded, real-valued reward function, represented by an n-vector
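A minimal sketch of how these four ingredients might be held in code; the variable names and the toy two-state, two-action numbers are illustrative, not from the slides:

```python
import numpy as np

n, m = 2, 2                       # |S| = n states, |A| = m actions

# Pr: one n x n stochastic matrix per action; row s is a distribution over next states s'
Pr = np.array([
    [[0.9, 0.1],                  # action 0, taken in state 0
     [0.2, 0.8]],                 # action 0, taken in state 1
    [[0.5, 0.5],                  # action 1, taken in state 0
     [0.0, 1.0]],                 # action 1, taken in state 1
])

# R: bounded, real-valued reward, one entry per state (an n-vector)
R = np.array([0.0, 1.0])

assert Pr.shape == (m, n, n)
assert np.allclose(Pr.sum(axis=2), 1.0)   # every row of every matrix sums to 1
```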

Page 4: Reinforcement Learning

© Daniel S. Weld 4

Reinforcement Learning

Page 5: How is learning to act possible when…

© Daniel S. Weld 5

How is learning to act possible when…

• Actions have non-deterministic effects
    And are initially unknown

• Rewards / punishments are infrequent
    Often at the end of long sequences of actions

• Learner must decide what actions to take
• World is large and complex

Page 6: RL Techniques

© Daniel S. Weld 6

RL Techniques

• Temporal-difference learning
    Learns a utility function on states
      • or on [state, action] pairs
    Similar to backpropagation
      • treats the difference between expected and actual reward as an error signal that is propagated backward in time

• Exploration functions
    Balance exploration / exploitation

• Function approximation
    Compress a large state space into a small one
    Linear function approximation, neural nets, …
    Generalization

Page 7: Example

© Daniel S. Weld 7

Example:

Page 8: Passive RL

© Daniel S. Weld 8

Passive RL

• Given policy π, estimate U(s)

• Not given the transition matrix, nor the reward function!

• Epochs: training sequences

(1,1)(1,2)(1,3)(1,2)(1,3)(1,2)(1,1)(1,2)(2,2)(3,2) -1
(1,1)(1,2)(1,3)(2,3)(2,2)(2,3)(3,3) +1
(1,1)(1,2)(1,1)(1,2)(1,1)(2,1)(2,2)(2,3)(3,3) +1
(1,1)(1,2)(2,2)(1,2)(1,3)(2,3)(1,3)(2,3)(3,3) +1
(1,1)(2,1)(2,2)(2,1)(1,1)(1,2)(1,3)(2,3)(2,2)(3,2) -1
(1,1)(2,1)(1,1)(1,2)(2,2)(3,2) -1
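One way to hold these training sequences in code, reused by the estimation sketches after the next slides; the representation is my own, and I assume (as the listing suggests) that the only reward shown is the ±1 received at the end of each epoch:

```python
# Each epoch: (list of grid states visited in order, terminal reward observed at the end).
epochs = [
    ([(1,1),(1,2),(1,3),(1,2),(1,3),(1,2),(1,1),(1,2),(2,2),(3,2)], -1),
    ([(1,1),(1,2),(1,3),(2,3),(2,2),(2,3),(3,3)],                   +1),
    ([(1,1),(1,2),(1,1),(1,2),(1,1),(2,1),(2,2),(2,3),(3,3)],       +1),
    ([(1,1),(1,2),(2,2),(1,2),(1,3),(2,3),(1,3),(2,3),(3,3)],       +1),
    ([(1,1),(2,1),(2,2),(2,1),(1,1),(1,2),(1,3),(2,3),(2,2),(3,2)], -1),
    ([(1,1),(2,1),(1,1),(1,2),(2,2),(3,2)],                         -1),
]
```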

Page 9: Value Function

© Daniel S. Weld 9

Value Function

Page 10: Approach 1

© Daniel S. Weld 10

Approach 1

• Direct estimation
    Estimate U(s) as the average total reward of epochs containing s (calculating from s to the end of the epoch)

• Pros / Cons?
    Requires a huge amount of data
    Doesn't exploit Bellman constraints!
      Expected utility of a state = its own reward + expected utility of successors
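A sketch of direct estimation over the epochs above, under the same assumption that reward arrives only at the end of an epoch (with per-step rewards one would accumulate the total reward from s to the end instead); each visit to s contributes one sample:

```python
from collections import defaultdict

def direct_estimate(epochs):
    """U(s) = average total reward observed from s to the end of the epochs containing s."""
    totals, counts = defaultdict(float), defaultdict(int)
    for states, terminal_reward in epochs:
        for s in states:
            # Reward only at the end => the return from every visited state is the terminal reward.
            totals[s] += terminal_reward
            counts[s] += 1
    return {s: totals[s] / counts[s] for s in totals}

U = direct_estimate(epochs)
print(U[(1, 1)])   # estimated utility of the start state
```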

Page 11: Approach 2: Adaptive Dynamic Programming

© Daniel S. Weld 11

Approach 2: Adaptive Dynamic Programming

• Requires fully observable environment
• Estimate transition function M from training data
• Solve the Bellman equation with modified policy iteration:

    U(s) = R(s) + γ Σ_s′ M_s,s′ U(s′)

• Pros / Cons:
    Requires complete observations
    Don't usually need the value of all states
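A sketch of the ADP idea under the slide's assumptions: count observed transitions while following the fixed policy to estimate M, then solve the Bellman equation for that policy. I use plain iterative evaluation rather than modified policy iteration, and a discount factor gamma; both are implementation choices, not taken from the slide:

```python
from collections import defaultdict

def adp_estimate(transitions, rewards, gamma=0.9, sweeps=100):
    """transitions: observed (s, s_next) pairs while following the policy.
       rewards: dict mapping each observed state s to its reward R(s)."""
    # Estimate M[s][s'] = Pr(s' | s, pi(s)) from transition counts.
    counts = defaultdict(lambda: defaultdict(int))
    for s, s2 in transitions:
        counts[s][s2] += 1
    M = {s: {s2: c / sum(nexts.values()) for s2, c in nexts.items()}
         for s, nexts in counts.items()}

    # Repeatedly apply U(s) = R(s) + gamma * sum_s' M[s][s'] * U(s').
    U = {s: 0.0 for s in rewards}
    for _ in range(sweeps):
        U = {s: rewards[s] + gamma * sum(p * U.get(s2, 0.0)
                                         for s2, p in M.get(s, {}).items())
             for s in rewards}
    return U
```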

Page 12: Approach 3

© Daniel S. Weld 12

Approach 3

• Temporal Difference Learning
    Do backups on a per-action basis
    Don't try to estimate the entire transition function!
    For each transition from s to s′, update:

    U(s) ← U(s) + α [ R(s) + γ U(s′) − U(s) ]
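A minimal sketch of that per-transition rule, and of running it over the training epochs shown earlier. The learning rate alpha and discount gamma are assumed (the extracted slide text drops the Greek symbols), and reward is assumed to arrive only in the terminal state:

```python
def td0_update(U, s, r, s_next, alpha=0.1, gamma=0.9):
    """One TD(0) backup after observing reward r in s and a transition to s_next."""
    U.setdefault(s, 0.0)
    U.setdefault(s_next, 0.0)
    U[s] += alpha * (r + gamma * U[s_next] - U[s])

U = {}
for states, terminal_reward in epochs:           # epochs as in the earlier sketch
    for s, s_next in zip(states, states[1:]):
        td0_update(U, s, 0.0, s_next)            # non-terminal reward assumed to be 0
    U[states[-1]] = terminal_reward              # a terminal state's utility is its reward
```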

Page 13: Notes

© Daniel S. Weld 13

Notes

• Once U is learned, updates become 0:

0 = α [ R(s) + γ U(s′) − U(s) ]   when   U(s) = R(s) + γ U(s′)

U(s) ← U(s) + α [ R(s) + γ U(s′) − U(s) ]

• “TD(0)”
    One-step lookahead
    Can do 2-step, 3-step, …

Page 14: TD(λ)

© Daniel S. Weld 14

TD(λ)

• Or, … take it to the limit!
• Compute a weighted average of all future states:

    U(s_t) ← U(s_t) + α [ R(s_t) + γ U(s_{t+1}) − U(s_t) ]

  becomes the weighted average

    U(s_t) ← U(s_t) + α [ (1 − λ) Σ_{i≥1} λ^(i−1) ( R(s_t) + γ U(s_{t+i}) ) − U(s_t) ]

• Implementation
    Propagate the current weighted TD onto past states
    Must memorize states visited from the start of the epoch
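A sketch of that implementation idea: remember the states visited so far in the epoch and, on each transition, push the current temporal difference onto earlier states with geometrically decaying weight. Discounting the trace by gamma * lam follows the usual eligibility-trace formulation, which is my assumption rather than text from the slide:

```python
def td_lambda_epoch(U, states, rewards, alpha=0.1, gamma=0.9, lam=0.7):
    """One epoch of TD(lambda); rewards[i] is the reward observed in states[i]."""
    visited = []                                    # states seen so far this epoch (oldest first)
    for i in range(len(states) - 1):
        s, s_next = states[i], states[i + 1]
        U.setdefault(s, 0.0)
        U.setdefault(s_next, 0.0)
        visited.append(s)
        delta = rewards[i] + gamma * U[s_next] - U[s]   # current temporal difference
        # Propagate the weighted TD error backward onto every memorized state.
        for k, past in enumerate(reversed(visited)):
            U[past] += alpha * (gamma * lam) ** k * delta
    return U
```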

Page 15: TD(λ)

© Daniel S. Weld 15

TD(λ)

• The TD rule just described does 1-step lookahead; it is called TD(0)
    Could also look ahead 2 steps = TD(1), 3 steps = TD(2), …
• TD(λ): compute a weighted average of all future states
• Implementation
    Propagate the current weighted temporal difference onto past states
    Requires memorizing states visited from the beginning of the epoch (or from the point where λ^i is not too tiny)

Page 16: Notes II

© Daniel S. Weld 16

Notes II

• Online: update immediately after each action
    Works even if epochs are infinitely long
• Offline: wait until the end of an epoch
    Can perform updates in any order
    E.g. backwards in time
    (Converges faster if rewards come only at epoch end)
    Why?!?

• Prioritized sweeping heuristic

Page 17: Q-Learning

© Daniel S. Weld 17

Q-Learning

• Version of TD-learning where
    instead of learning a value function on states
    we learn one on [state, action] pairs

• Why do this?

    U(s) ← U(s) + α [ R(s) + γ U(s′) − U(s) ]

  becomes

    Q(a, s) ← Q(a, s) + α [ R(s) + γ max_a′ Q(a′, s′) − Q(a, s) ]
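A sketch of the resulting model-free update, with alpha and gamma assumed as before. The greedy helper is one answer to the slide's "why do this?": acting needs only an argmax over the learned Q values, not a transition model:

```python
def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One Q-learning backup for the observed transition (s, a, r, s_next)."""
    best_next = max(Q.get((a2, s_next), 0.0) for a2 in actions)
    Q[(a, s)] = Q.get((a, s), 0.0) + alpha * (r + gamma * best_next - Q.get((a, s), 0.0))

def greedy_action(Q, s, actions):
    """Act without a model: pick the action with the highest learned Q(a, s)."""
    return max(actions, key=lambda a: Q.get((a, s), 0.0))
```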

Page 18: Part II

© Daniel S. Weld 18

Part II

• So far, we have assumed the agent had a policy π

• Now, suppose the agent must learn it
    While acting in an uncertain world

Page 19: Active Reinforcement Learning

© Daniel S. Weld 19

Active Reinforcement Learning

Suppose the agent must make a policy while learning

First approach:
    Start with an arbitrary policy
    Apply Q-Learning
    New policy: in state s, choose the action a that maximizes Q(a, s)

Problem?

Page 20: Utility of Exploration

© Daniel S. Weld 20

Utility of Exploration

• Too easily stuck in non-optimal space
    “Exploration versus exploitation tradeoff”

• Solution 1
    With fixed probability, perform a random action

• Solution 2
    Increase the estimated expected value of infrequently visited states
    Properties of f(u, n)?

    U⁺(s) ← R(s) + γ max_a f( Σ_s′ P(s′ | a, s) U⁺(s′), N(a, s) )

    If n > Ne: U, i.e. the normal utility
    Else: R⁺, i.e. the max possible reward
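A sketch of that optimistic exploration function; N_E and R_PLUS stand for the slide's visit threshold and maximum possible reward, and the particular values are placeholders:

```python
R_PLUS = 1.0     # optimistic bound on achievable reward (R+ on the slide)
N_E = 5          # how many tries before we trust the utility estimate (Ne on the slide)

def f(u, n):
    """Exploration function: optimistic until the (action, state) pair has been tried enough."""
    return u if n > N_E else R_PLUS
```

Plugged into the U⁺ backup above, f makes an action with N(a, s) ≤ Ne look as good as the best possible outcome, so the agent keeps trying it until the count crosses the threshold.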

Page 21: Part III

© Daniel S. Weld 21

Part III

• The problem of large state spaces remains
    Never enough training data!
    Learning takes too long

• What to do??

Page 22: Function Approximation

© Daniel S. Weld 22

Function Approximation

• Never enough training data!
    Must generalize what is learned to new situations

• Idea:
    Replace the large state table by a smaller, parameterized function
    Updating the value of one state will change the values assigned to many other, similar states

Page 23: Linear Function Approximation

© Daniel S. Weld 23

Linear Function Approximation

• Represent Û(s) as a weighted sum of features (basis functions) of s:

    Û(s) = θ1 f1(s) + θ2 f2(s) + … + θn fn(s)

• Update each parameter separately, e.g.:

    θi ← θi + α [ R(s) + γ Û(s′) − Û(s) ] ∂Û(s)/∂θi
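A sketch of that per-parameter TD update for the linear case, where the gradient ∂Û/∂θi is simply the feature value fi(s); the feature functions and step sizes are placeholders:

```python
def u_hat(theta, features, s):
    """Linear value estimate: weighted sum of feature values."""
    return sum(t * f(s) for t, f in zip(theta, features))

def td_update_linear(theta, features, s, r, s_next, alpha=0.05, gamma=0.9):
    """Move each weight along the TD error; for a linear form, dU/dtheta_i = f_i(s)."""
    delta = r + gamma * u_hat(theta, features, s_next) - u_hat(theta, features, s)
    return [t + alpha * delta * f(s) for t, f in zip(theta, features)]

# Features matching the grid example on the next slide: a constant term, x, and y.
features = [lambda s: 1.0, lambda s: s[0], lambda s: s[1]]
theta = [0.0, 0.0, 0.0]
theta = td_update_linear(theta, features, (1, 1), 0.0, (1, 2))
```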

Page 24: Example

© Daniel S. Weld 24

Example

• Û(s) = θ0 + θ1 x + θ2 y
• Learns a good approximation

Page 25: But What If…

© Daniel S. Weld 25

But What If…

• Û(s) = θ0 + θ1 x + θ2 y + θ3 z

• Computed Features
    z = (xg − x)² + (yg − y)²

Page 26: Neural Nets

© Daniel S. Weld 26

Neural Nets

• Can create powerful function approximators
    Nonlinear
    Possibly unstable

• For TD-learning, apply the difference signal to the neural net output and perform backpropagation

Page 27: Policy Search

© Daniel S. Weld 27

Policy Search

• Represent policy in terms of Q functions
• Gradient search
    Requires differentiability
    Stochastic policies; softmax
• Hillclimbing
    Tweak the policy and evaluate it by running it

• Replaying experience

Page 28: ~World's Best Player

© Daniel S. Weld 28

~World's Best Player

• Neural network with 80 hidden units
    Used computed features

• 300,000 games against self

Page 29: Pole Balancing Demo

© Daniel S. Weld 29

Pole Balancing Demo

Page 30: Applications to the Web: Focused Crawling

© Daniel S. Weld 30

Applications to the Web: Focused Crawling

• Limited resources
    Fetch most important pages first

• Topic-specific search engines
    Only want pages which are relevant to the topic

• Minimize stale pages
    Efficient re-fetch to keep the index timely
    How to track the rate of change for pages?

Page 31: Standard Web Search Engine Architecture

© Daniel S. Weld 31

Standard Web Search Engine Architecture

[Diagram: crawl the web → store documents, check for duplicates, extract links → create an inverted index → inverted index (DocIds) → search engine servers, which take the user query and show results to the user]

(Slide adapted from Marty Hearst / UC Berkeley)

Page 32: RL for Focused Crawling

© Daniel S. Weld 32

RL for Focused Crawling

• Huge, unknown environment
• Large, dynamic number of “actions”
• Consumable rewards

• Questions:
    How to represent states, actions?
    What's the right reward?
    Discount factor?
    How to represent Q functions?
    Dynamic programming on a state space the size of the WWW?

Page 33: Rennie & McCallum (ICML-99)

© Daniel S. Weld 33

• Disregard state
• Action == bag of words in the link neighborhood
• Q function
    Train offline on a crawled corpus
    Compute reward as f(action), γ = 0.5
    Discretize rewards into bins
    Learn the Q function directly (supervised learning)
• Add words in the link neighborhood to the bin's bag
• Train a naïve Bayes classifier on the bins

Rennie & McCallum (ICML-99)
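A rough sketch of that pipeline as I read the slide: discounted rewards computed offline are discretized into bins, a naive Bayes text classifier maps a link's neighborhood words to a bin, and the bin probabilities give an expected Q value for the link. The concrete choices here (quantile bin edges, scikit-learn as the classifier, the scoring function) are my own illustration, not details from Rennie & McCallum's paper:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

def train_link_scorer(neighborhood_texts, discounted_rewards, n_bins=3):
    """neighborhood_texts: words around each training link; discounted_rewards: its future reward."""
    edges = np.quantile(discounted_rewards, np.linspace(0, 1, n_bins + 1)[1:-1])
    bins = np.digitize(discounted_rewards, edges)            # discretize rewards into bins
    vec = CountVectorizer()
    X = vec.fit_transform(neighborhood_texts)                # bag of words per link neighborhood
    clf = MultinomialNB().fit(X, bins)                       # naive Bayes over the bins
    bin_value = {k: float(np.mean([r for r, b in zip(discounted_rewards, bins) if b == k]))
                 for k in set(bins)}
    def q_of_link(text):
        """Expected reward of following a link, under the predicted bin probabilities."""
        probs = clf.predict_proba(vec.transform([text]))[0]
        return float(sum(p * bin_value[c] for p, c in zip(probs, clf.classes_)))
    return q_of_link
```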

Page 34: Performance

© Daniel S. Weld 34

Performance

Page 35: Zoom on Performance

© Daniel S. Weld 35

Zoom on Performance

Page 36: Applications to the Web II

© Daniel S. Weld 36

Applications to the Web II

• Improving Pagerank

Page 37: Methods

© Daniel S. Weld 37

Methods

• Agent Types
    Utility-based
    Action-value based (Q function)
    Reflex

• Passive Learning
    Direct utility estimation
    Adaptive dynamic programming
    Temporal difference learning

• Active Learning
    Choose a random action 1/nth of the time
    Exploration by adding to the utility function
    Q-learning (learn the action-value function directly; model free)

• Generalization
    ADDs
    Function approximation (linear functions or neural networks)

• Policy Search
    Stochastic policy representation / softmax
    Reusing past experience

Page 38: Summary

© Daniel S. Weld 38

Summary

• Use reinforcement learning when
    The model of the world is unknown and/or rewards are delayed

• Temporal difference learning
    Simple and efficient training rule

• Q-learning eliminates the need for an explicit transition model

• Large state spaces can (sometimes!) be handled
    Function approximation, using linear functions or neural nets