TRANSCRIPT
University Paderborn
16 January 2009
RG Knowledge Based Systems
Hans Kleine Büning
Reinforcement Learning
Reinforcement Learning, Prof. Dr. Hans Kleine Büning
Outline
• Motivation
• Applications
• Markov Decision Processes
• Q-learning
• Examples
Reinforcement Learning: The Idea
• A way of programming agents by reward and punishment without specifying how the task is to be achieved
Learning to Ride a Bicycle
[Figure: interaction loop with the labels Environment, state, action]
Learning to Ride a Bicycle
• States:
– Angle of handle bars
– Angular velocity of handle bars
– Angle of bicycle to vertical
– Angular velocity of bicycle to vertical
– Acceleration of angle of bicycle to vertical
Learning to Ride a Bicycle
• Actions:
– Torque to be applied to the handle bars
– Displacement of the center of mass from the bicycle’s plane (in cm)
• Reward: is the angle of the bicycle to the vertical greater than 12°?
– no: Reward = 0
– yes: Reward = -1
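The reward rule above can be written as a small function. This is only a sketch under assumptions: the names `bicycle_reward` and `FALL_ANGLE_DEG` are hypothetical, and comparing the absolute tilt (so that leaning to either side counts) is an assumption, since the slide gives only the 12° threshold and the two reward values.

```python
# Hypothetical names; only the 12 degree threshold and the rewards 0 / -1
# come from the slide.
FALL_ANGLE_DEG = 12.0


def bicycle_reward(tilt_deg: float) -> int:
    """Return -1 if the bicycle tilts more than 12 degrees from vertical, else 0."""
    # Assumption: a tilt to either side counts, so the absolute angle is compared.
    return -1 if abs(tilt_deg) > FALL_ANGLE_DEG else 0
```

For example, `bicycle_reward(3.0)` gives 0 and `bicycle_reward(15.0)` gives -1.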
Reinforcement Learning: Applications
• Board Games
– The TD-Gammon program, based on reinforcement learning, has become a world-class backgammon player
• Mobile Robot Controlling
– Learning to Drive a Bicycle
– Navigation
– Pole-balancing
– Acrobot
• Sequential Process Controlling
– Elevator Dispatching
Key Features of Reinforcement Learning
• Learner is not told which actions to take
• Trial and error search
• Possibility of delayed reward:
– Sacrifice of short-term gains for greater long-term gains
• Explore/Exploit trade-off
• Considers the whole problem of a goal-directed agent interacting with an uncertain environment
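One common way to handle the explore/exploit trade-off is epsilon-greedy action selection. The slides do not name a particular strategy, so the following is only an illustration; `epsilon_greedy`, `q_values`, `epsilon`, and the example estimates are assumed names and values.

```python
import random


def epsilon_greedy(q_values: dict, epsilon: float, rng: random.Random) -> str:
    """Pick a random action with probability epsilon, else the best-valued one."""
    actions = list(q_values)
    if rng.random() < epsilon:
        return rng.choice(actions)           # explore: try any action
    return max(actions, key=q_values.get)    # exploit: use current estimates


rng = random.Random(0)
estimates = {"left": 0.2, "right": 0.5}      # hypothetical value estimates
greedy_choice = epsilon_greedy(estimates, epsilon=0.0, rng=rng)
```

With `epsilon=0.0` the agent never explores and always returns the currently best-valued action; larger values trade short-term reward for information about other actions.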
The Agent-Environment Interaction
• Agent and environment interact at discrete time steps: $t = 0, 1, 2, \dots$
– Agent observes state at step $t$: $s_t \in S$
– produces action at step $t$: $a_t \in A$
– gets resulting reward: $r_{t+1} \in \mathbb{R}$
– and resulting next state: $s_{t+1} \in S$
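The interaction protocol above can be sketched as a loop that records one tuple (s_t, a_t, r_{t+1}, s_{t+1}) per time step. The environment `toy_step` (a chain of states 0..4 with reward 1 for arriving in state 4) and the function `run_episode` are hypothetical and serve only to make the loop runnable.

```python
from typing import Callable, List, Tuple


def toy_step(state: int, action: int) -> Tuple[float, int]:
    """Hypothetical environment: states 0..4, action -1 or +1, reward 1 on entering state 4."""
    next_state = min(4, max(0, state + action))
    reward = 1.0 if next_state == 4 else 0.0
    return reward, next_state


def run_episode(policy: Callable[[int], int], start: int,
                steps: int) -> List[Tuple[int, int, float, int]]:
    """Record (s_t, a_t, r_{t+1}, s_{t+1}) for t = 0, 1, ..., steps - 1."""
    trajectory = []
    state = start
    for _ in range(steps):
        action = policy(state)                        # agent produces a_t
        reward, next_state = toy_step(state, action)  # environment returns r_{t+1}, s_{t+1}
        trajectory.append((state, action, reward, next_state))
        state = next_state
    return trajectory


trajectory = run_episode(lambda s: 1, start=2, steps=3)
```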
The Agent’s Goal:
• Coarsely, the agent’s goal is to get as much reward as it can over the long run
• A policy $\pi$ is a mapping from states to actions: $\pi(s) = a$
• Reinforcement learning methods specify how the agent changes its policy as a result of experience
Deterministic Markov Decision Process
Example
Example: Corresponding MDP
Example: Policy
Value of Policy and Rewards
Value of Policy and Agent’s Task
Nondeterministic Markov Decision Process
[Figure: transition probabilities P = 0.8, P = 0.1, P = 0.1]
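In a nondeterministic MDP the outcome of an action is drawn from a probability distribution. A minimal sketch of such a draw, using the three probabilities from the slide; the outcome labels `intended`, `slip_left`, and `slip_right` are hypothetical, because the figure they belong to is not part of the transcript.

```python
import random

# Probabilities from the slide; the outcome labels are hypothetical.
OUTCOMES = {"intended": 0.8, "slip_left": 0.1, "slip_right": 0.1}


def sample_outcome(rng: random.Random) -> str:
    """Draw one outcome according to the transition probabilities."""
    labels = list(OUTCOMES)
    weights = list(OUTCOMES.values())
    return rng.choices(labels, weights=weights, k=1)[0]


rng = random.Random(42)
counts = {label: 0 for label in OUTCOMES}
for _ in range(10_000):
    counts[sample_outcome(rng)] += 1
```

Over many draws the relative frequencies in `counts` approach 0.8, 0.1, and 0.1.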
Example with South-Eastern Wind
Methods
• Model (reward function and transition probabilities) is known
– discrete states: Dynamic Programming
– continuous states: Value Function Approximation + Dynamic Programming
• Model (reward function or transition probabilities) is unknown
– discrete states: Reinforcement Learning, Monte Carlo Methods
– continuous states: Value Function Approximation + Reinforcement Learning
Q-learning Algorithm
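A minimal sketch of the standard tabular Q-learning update, $Q(s,a) \leftarrow Q(s,a) + \alpha\,(r + \gamma \max_{a'} Q(s',a') - Q(s,a))$, which is not necessarily the exact formulation used in the lecture. The five-state chain environment, the parameter values for `alpha`, `gamma`, and `epsilon`, and all function names are assumptions for illustration.

```python
import random
from collections import defaultdict

ACTIONS = (-1, +1)   # hypothetical actions: move left / move right
GOAL = 4             # hypothetical terminal state of a 5-state chain


def step(state, action):
    """Deterministic toy environment: reward 1 on reaching GOAL, else 0."""
    next_state = min(GOAL, max(0, state + action))
    return (1.0 if next_state == GOAL else 0.0), next_state


def q_learning(episodes=200, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    q = defaultdict(float)                       # Q-table, initialised to 0
    for _ in range(episodes):
        state = 0
        while state != GOAL:
            if rng.random() < epsilon:           # explore
                action = rng.choice(ACTIONS)
            else:                                # exploit, ties broken at random
                best = max(q[(state, a)] for a in ACTIONS)
                action = rng.choice([a for a in ACTIONS if q[(state, a)] == best])
            reward, next_state = step(state, action)
            # The terminal state has no future value.
            future = 0.0 if next_state == GOAL else max(q[(next_state, a)] for a in ACTIONS)
            q[(state, action)] += alpha * (reward + gamma * future - q[(state, action)])
            state = next_state
    return q


q = q_learning()
greedy = {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(GOAL)}
```

In this toy chain the greedy policy read off the learned table moves right in every state. With `alpha = 1` the update reduces to $Q(s,a) \leftarrow r + \gamma \max_{a'} Q(s',a')$, the form often used for deterministic environments.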
Example
Example: Q-table Initialization
Example: Episode 1
Example: Q-table
Example: Episode 1
Example: Q-table
Example: Episode 2
Example: Q-table after Convergence
Example: Value Function after Convergence
Example: Optimal Policy
Q-learning
Convergence of Q-learning
Blackjack
• Standard rules of blackjack hold
• State space:
– element[0] - current value of player's hand (4-21)
– element[1] - value of dealer's face-up card (2-11)
– element[2] - player does not have usable ace (0/1)
• Starting states:
– player has any 2 cards (uniformly distributed), dealer has any 1 card (uniformly distributed)
• Actions:
– HIT
– STICK
• Rewards:
– -1 for a loss
– 0 for a draw
– 1 for a win
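The state and reward encoding above might be represented as follows. This is a sketch, not the lecture's implementation: the integer codes for `HIT` and `STICK`, the `REWARDS` dictionary keys, and the helper `make_state` are assumptions; only the three state elements, their ranges, and the reward values -1, 0, 1 are taken from the slide.

```python
from typing import Tuple

HIT, STICK = 0, 1                        # hypothetical integer codes for the actions
REWARDS = {"loss": -1, "draw": 0, "win": 1}


def make_state(player_sum: int, dealer_card: int, ace_flag: int) -> Tuple[int, int, int]:
    """Build the 3-element state described above, checking the stated ranges."""
    if not 4 <= player_sum <= 21:
        raise ValueError("player hand value must be in 4..21")
    if not 2 <= dealer_card <= 11:
        raise ValueError("dealer card value must be in 2..11")
    if ace_flag not in (0, 1):
        raise ValueError("ace flag must be 0 or 1")
    return (player_sum, dealer_card, ace_flag)


# Upper bound on the number of distinct states under this encoding.
NUM_STATES = (21 - 4 + 1) * (11 - 2 + 1) * 2
state = make_state(15, 10, 0)
```

With these ranges a table over states and the two actions has at most 18 × 10 × 2 × 2 entries, small enough for a tabular method.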
Blackjack: Optimal Policy
Reinforcement Learning: Example
• States
– Grid cells
• Actions
– Left
– Up
– Right
– Down
• Rewards
– Bonus 20
– Food 1
– Predator -10
– Empty grid cell -0.1
• Transition probabilities
– 0.80: the agent goes where it intends to go
– 0.20: the agent goes to any other adjacent grid cell or remains where it was (if it is on the border of the grid world, it goes to the other side)
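The transition model of this grid world can be sketched as a function that returns the distribution over next cells. Two points are assumptions, since the slide does not specify them: the remaining 0.20 is split evenly over the three other adjacent cells and staying in place, and the grid size in the example call is arbitrary. The names `MOVES`, `CELL_REWARDS`, and `transition_probs` are hypothetical.

```python
from typing import Dict, Tuple

# (row offset, column offset) for each action
MOVES = {"left": (0, -1), "up": (-1, 0), "right": (0, 1), "down": (1, 0)}
# Reward values from the slide, keyed by hypothetical cell-type names.
CELL_REWARDS = {"bonus": 20, "food": 1, "predator": -10, "empty": -0.1}


def transition_probs(cell: Tuple[int, int], action: str,
                     rows: int, cols: int) -> Dict[Tuple[int, int], float]:
    """Distribution over next cells: 0.80 intended move, 0.20 shared by the rest."""
    def shifted(move):
        # Wrap around the border: leaving one side re-enters on the other.
        return ((cell[0] + move[0]) % rows, (cell[1] + move[1]) % cols)

    probs: Dict[Tuple[int, int], float] = {}
    intended = shifted(MOVES[action])
    probs[intended] = probs.get(intended, 0.0) + 0.80
    # Assumption: the remaining 0.20 is split evenly over the three other
    # adjacent cells and staying in place (0.05 each).
    others = [shifted(m) for name, m in MOVES.items() if name != action] + [cell]
    for target in others:
        probs[target] = probs.get(target, 0.0) + 0.20 / len(others)
    return probs


probs = transition_probs((0, 0), "left", rows=4, cols=5)
```

In the example call the agent stands in the top-left corner and intends to go left, so the intended target wraps around to the last column of the same row.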