Assignment 1 - University of Queensland


TRANSCRIPT

Page 1:

Assignment 1
- Highest original score: 92.25, so everyone got 7.75 bonus points
- How is everyone doing?
  - 7: 20%
  - 6: 10%
  - 5: 20%
  - 4: 23%
- A couple of submissions originally had no report mark (fixed)
  - Bulk download from Turnitin does not download these submissions (it seems that if the author's name does not start with a letter, the submission is excluded)
- A couple of submissions had an issue with the demo mark (fixed)
- 1 group mark is still pending due to a group problem

Page 2:

Page 3:

Assignment 2
- Support code came out yesterday
- Amendment to one of the inputs; it will make computing the transition easier
- Group registration was due last Tue! If you still want to work in a group, please register on the group registration website ASAP. We'll leave it open until this Monday morning.
- Help sessions during swotvac
  - Tue 11am-1pm & Thu 11am-1pm, usual tutorial room
  - Depending on how students go, we may have 1-2 additional help sessions during week 1 of the exam period (if we do, we'll announce it on Piazza)
- Questions?

Page 4:

COMP3702/7702 Artificial Intelligence
Lecture 13: Introduction to Machine Learning and Reinforcement Learning

Hanna Kurniawati

Page 5:

Today
- What is machine learning?
- Where is it used?
- Types of machine learning algorithms:
  - Supervised learning
  - Unsupervised learning
  - Reinforcement learning

Page 6:

Reinforcement Learning
- What is Reinforcement Learning?
- Methods for solving

Page 7:

More general approaches for solving RL
- Data from interacting with the world: <s, a, r, s'>
- Model-based vs model-free: What's being learned?
- Passive vs Active: How are the data being generated?

              Passive    Active
Model-based   ✔
Model-free

Page 8:

Model-based, Active
- We need a way to decide which data to use
- Classic: interact with the world directly.
  - Decide the action we use to interact with the world, so as to balance gaining information and reaching the goal
- Nowadays: we can perform the trials in a high-fidelity simulator
  - Decide the action we use to interact with the simulated world (of course, the hope is that the simulation is close to reality), so as to balance gaining information and reaching the goal
  - Need to consider how well the result transfers from the simulator to the real world.

Page 9:

Bayesian Reinforcement Learning
- Bayesian view:
  - The parameters (T & R) we want to estimate are represented as random variables
  - Start with a prior over models
  - Compute the posterior based on data
- Quite useful when the agent actively gathers data
  - Can decide how to balance exploration & exploitation, or how to improve the model & solve the problem optimally
  - Often represented as Partially Observable Markov Decision Processes (POMDPs)
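
To make the prior-to-posterior step concrete, here is a minimal sketch (an illustration, not from the slides) that keeps a Bayesian estimate of the unknown transition function T by maintaining Dirichlet counts over observed <s, a, s'> data; the state names and prior strength are made up:

```python
from collections import defaultdict

class DirichletTransitionModel:
    """Bayesian estimate of T(s, a, s') learned from observed <s, a, s'> data."""

    def __init__(self, states, prior_count=1.0):
        self.states = list(states)
        self.prior_count = prior_count      # symmetric Dirichlet prior ("pseudo-counts")
        self.counts = defaultdict(float)    # (s, a, s') -> number of times observed

    def update(self, s, a, s_next):
        """Posterior update: simply add the new observation to the counts."""
        self.counts[(s, a, s_next)] += 1.0

    def posterior_mean(self, s, a, s_next):
        """Posterior mean of T(s, a, s') under the Dirichlet posterior."""
        numer = self.prior_count + self.counts[(s, a, s_next)]
        denom = sum(self.prior_count + self.counts[(s, a, s2)] for s2 in self.states)
        return numer / denom

# Usage: observe a few transitions, then query the estimated model.
model = DirichletTransitionModel(states=["s1", "s2"])
model.update("s1", "right", "s2")
model.update("s1", "right", "s2")
model.update("s1", "right", "s1")
print(model.posterior_mean("s1", "right", "s2"))   # 3/5 = 0.6 with this prior
```

In Bayes-RL proper, the belief over T & R becomes part of the POMDP state, as described on the next slides; the point of the sketch is only the "prior plus data gives posterior" mechanics.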

Page 10:

Bayesian Reinforcement Learning
- The problem of solving an MDP with unknown T & R can be represented as a POMDP with a partially observed MDP model
- POMDP model:
  - S: MDP states × T × R
  - A: MDP actions
  - T(s, a, s'): the transition, assuming the MDP model is as described by POMDP state s
  - Ω: the resulting next state and reward of the MDP
  - Z(s, a, o): perceived next state & reward, assuming the MDP model is as described by POMDP state s
  - R(s, a): the reward, assuming the MDP model is as described by POMDP state s

Page 11:

Bayesian Reinforcement Learning
- The optimal policy of the POMDP is optimal exploration vs exploitation
  - It will try to balance building the most accurate model & working directly towards achieving the goal.
  - It will make the MDP agent receive the maximum reward given the initially unknown T & R.
- Building the best model is just an intermediate step, not the end goal!

Page 12:

More general approaches for solving RL
- Data from interacting with the world: <s, a, r, s'>
- Model-based vs model-free: What's being learned?
- Passive vs Active: How are the data being generated?

              Passive    Active
Model-based   ✔          ✔
Model-free

Page 13:

Model-free
- Two flavors:
  - Learn the value functions / Q-value functions and then compute the policy.
  - Learn the policy directly
- First, we'll look at learning (estimating) the value / Q-value functions:
  - Monte Carlo
  - Temporal Difference

Page 14:

Monte Carlo
- Goal: given a policy, learn the value of the policy when T & R are unknown
- Assumption: episodic MDP
  - Each episode (i.e., each run) is guaranteed to terminate within a finite amount of time.
- Loop over:
  - Generate an episode
  - Compute the total discounted reward for the episode
  - Update the value

Page 15:

Monte Carlo update
- Suppose we have the following episode:

  [Episode: s1 →π(s1) s2 →π(s2) s3 →π(s3) s4 →π(s4) s5 → … →π(sn-1) sn, with rewards r1, r2, r3, r4, r5, …, rn]

- For each si, R(si) = ri + γ ri+1 + … + γ^(n-i) rn
- Value update:
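
A standard way to write this update, with a learning rate α (the same α that appears with TD learning later in the deck), is:

  V(si) ← V(si) + α[ R(si) − V(si) ]

With α set to 1/(number of visits to si so far), this is just a running average of the sampled returns R(si).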

Page 16:

Monte Carlo
- First visit:
  - Only update the value of a state if it is visited for the first time in the sampled episode
- Every visit:
  - Update the value of a state whenever it is visited (regardless of the time step)
- Converges to the true value by the law of large numbers
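
A minimal first-visit Monte Carlo sketch for estimating V under a fixed policy; the environment interface (a generate_episode function returning (state, reward) pairs) is an assumption for illustration, not something specified on the slides:

```python
from collections import defaultdict

def first_visit_mc(generate_episode, gamma=0.9, num_episodes=1000):
    """Estimate V(s) for a fixed policy by averaging first-visit returns.

    `generate_episode` is assumed to return a list of (state, reward) pairs,
    where the reward is the one received on leaving that state.
    """
    values = defaultdict(float)
    visit_counts = defaultdict(int)

    for _ in range(num_episodes):
        episode = generate_episode()

        # Discounted return R(si) from each step to the end of the episode.
        returns = []
        g = 0.0
        for state, reward in reversed(episode):
            g = reward + gamma * g
            returns.append((state, g))
        returns.reverse()

        # First-visit: only the first occurrence of each state in the episode counts.
        seen = set()
        for state, g in returns:
            if state in seen:
                continue
            seen.add(state)
            visit_counts[state] += 1
            # Incremental average, equivalent to alpha = 1 / (#visits).
            values[state] += (g - values[state]) / visit_counts[state]

    return values
```

For every-visit Monte Carlo, the `seen` check is simply dropped.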

Page 17:

Law of Large Numbers
- Weak law of large numbers.
- Strong law of large numbers.
- What's the difference?
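
For reference, the standard statements for i.i.d. samples X1, X2, … with mean μ and sample mean X̄n = (X1 + … + Xn)/n are:

- Weak law: X̄n converges to μ in probability, i.e., for every ε > 0, P(|X̄n − μ| > ε) → 0 as n → ∞.
- Strong law: X̄n converges to μ almost surely, i.e., P(lim X̄n = μ) = 1.

Almost-sure convergence implies convergence in probability, so the strong law is the stronger guarantee; this is the sense in which the Monte Carlo estimates above converge to the true values.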

Page 18:

Temporal Difference (TD) Learning
- One of the most famous Reinforcement Learning approaches
- Idea: iteratively reduce the difference between the value or Q-value estimates

  Q(st, at) = Q(st, at) + α[ rt + γ Q(st+1, at+1) − Q(st, at) ]
  V(st) = V(st) + α[ rt + γ V(st+1) − V(st) ]

  where α is a constant in [0, 1] representing the learning rate. In some implementations, it decreases as the amount of data increases, e.g., set to 1/(#visits + 1).
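
A minimal tabular TD(0) sketch of the V update above, applied to a stream of transitions collected while following a fixed policy (the (s, r, s') tuple format is an assumption for illustration):

```python
from collections import defaultdict

def td0_value_estimate(transitions, alpha=0.1, gamma=0.9):
    """Tabular TD(0): V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s))."""
    V = defaultdict(float)
    for s, r, s_next in transitions:
        td_error = r + gamma * V[s_next] - V[s]
        V[s] += alpha * td_error
    return V
```

Unlike the Monte Carlo sketch, each update uses only one observed step plus the current estimate V(s'), so learning happens before the episode finishes.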

Page 19:

Temporal Difference (TD) Learning
- Given a policy π, suppose the reinforcement learning agent traverses the following episode:

  [Episode: s1 →π(s1) s2 →π(s2) s3 →π(s3) s4 →π(s4) s5 → … →π(sn-1) sn, with rewards r1, r2, r3, r4, r5, …, rn]

- Value update after each step: V(si) = V(si) + α[ ri + γ V(si+1) − V(si) ]

Page 20:

Monte Carlo vs Temporal Difference Learning
- The Monte Carlo approach updates the value after the episode is finished
- Temporal Difference updates the value after each step

Page 21:

Monte Carlo vs Temporal Difference Learning
- Suppose you want to predict the time you need to get to your home
- Monte Carlo:
  - Record the time on day i
  - The next day, update based on the recorded time from day i
  - α = 1

Page 22:

Monte Carlo vs Temporal Difference Learning
- Suppose you want to predict the time you need to get to your home
- Temporal Difference:
  - Record the time on day i
  - Update as the episode progresses
  - α = 1

Page 23:

TD Learning - Variants
- Q-learning
- SARSA (State-Action-Reward-State-Action)
- General: TD(λ)

Page 24:

Q-Learning: Off-Policy TD Control
- Off-policy: update the Q-value based on the (estimated) best next action, even though it's not the action performed
  - The policy being followed is not the same as the policy being evaluated.
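
A minimal sketch of the tabular Q-learning update: the target bootstraps from the greedy (max) action at the next state, regardless of what the behaviour policy actually does. The ε-greedy behaviour policy shown here is a common choice, assumed for illustration:

```python
import random
from collections import defaultdict

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Off-policy TD control: bootstrap from the best action at s_next."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    """Behaviour policy: mostly greedy, occasionally random, for exploration."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

# Q-table with a default value of 0 for unseen (state, action) pairs.
Q = defaultdict(float)
```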

Page 25:

SARSA: On-Policy TD Control
- Consider the actual action that the agent will take at the next state
- Data is (s, a, r, s', a')
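
For contrast, a minimal sketch of the SARSA update, which bootstraps from the action a' the agent actually takes at the next state (same tabular Q representation as the Q-learning sketch above):

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """On-policy TD control: bootstrap from the action actually taken at s_next."""
    Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])
```

The only difference from Q-learning is that Q(s_next, a_next) replaces the max over actions, so the learned values reflect the exploration behaviour of the policy itself.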

Page 26:

Example: Q-learning vs SARSA
- A mouse moving next to a cliff
  - Blue: mouse, Green: cheese, Red: cliff

  [Two panels comparing the learned behaviour: Q-learning (left) vs SARSA (right)]

Page 27:

TD Learning - Variants
- ✔ Q-learning
- ✔ SARSA (State-Action-Reward-State-Action)
- General: TD(λ)

Page 28:

A more general TD Learning
- We can actually do more steps

Page 29:

n-step TD method for Value Estimation
- Given a policy π, suppose the reinforcement learning agent traverses the following episode:

  [Episode: s1 →π(s1) s2 →π(s2) s3 →π(s3) s4 →π(s4) s5 → … →π(sn-1) sn, with rewards r1, r2, r3, r4, r5, …, rn]

- For each si, Rn(si) = ri + γ ri+1 + … + γ^n ri+n
- Value update:

  V(st) = V(st) + α[ Rn(st) + γ^(n+1) V(st+n+1) − V(st) ]
        = V(st) + α[ rt + γ rt+1 + γ^2 rt+2 + … + γ^n rt+n + γ^(n+1) V(st+n+1) − V(st) ]
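
A small sketch of the n-step update above for a single state along a recorded episode; the episode format (a list of (state, reward) pairs, rewards indexed as on the slide) and the use of a defaultdict for V are assumptions for illustration:

```python
def n_step_update(V, episode, t, n, alpha=0.1, gamma=0.9):
    """n-step TD update of V for the state visited at step t.

    `V` is assumed to be a defaultdict(float); `episode` is a list of
    (state, reward) pairs, with r_t the reward recorded at step t.
    """
    # n-step return: R_n(s_t) = r_t + gamma*r_{t+1} + ... + gamma^n * r_{t+n}
    last = min(t + n, len(episode) - 1)
    g = sum(gamma ** (k - t) * episode[k][1] for k in range(t, last + 1))
    # Bootstrap with gamma^(n+1) * V(s_{t+n+1}) if the episode is long enough.
    if last + 1 < len(episode):
        g += gamma ** (last + 1 - t) * V[episode[last + 1][0]]
    s_t = episode[t][0]
    V[s_t] += alpha * (g - V[s_t])
    return V
```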

Page 30:

TD(λ) – Update Rule
- Weighted sum of the total reward over sequences of different lengths, with weights (1 − λ), (1 − λ)λ, (1 − λ)λ^2, …, (1 − λ)λ^(n−1)

Page 31:

TD(λ) – Update Rule
- Let

  R^λ(st) = (1 − λ) Σ_{n≥1} λ^(n−1) Rn(st)

  The update rule of TD(λ) is

  V(st) = V(st) + α[ R^λ(st) − V(st) ]


Page 33:

TD(λ) – Implementation
- Use an eligibility trace as the weight of a visited state
  - The eligibility trace of a state at time t, denoted et(s), represents the eligibility of the state for undergoing learning changes.
- Defined as:

  et(s) = γλ et−1(s)        if s ≠ st
  et(s) = γλ et−1(s) + 1    if s = st

- When the state has been visited recently, its eligibility is high; as time progresses, its eligibility decreases.

Page 34:

TD(λ) – Algorithm
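
A minimal sketch of the standard backward-view TD(λ) for value estimation, using the eligibility trace defined on the previous slide (this is the textbook version, not necessarily the exact algorithm on the slide; the (s, r, s', done) transition format is an assumption):

```python
from collections import defaultdict

def td_lambda(transitions, alpha=0.1, gamma=0.9, lam=0.8):
    """Backward-view TD(lambda) value estimation with accumulating traces.

    `transitions` is a list of (s, r, s_next, done) tuples gathered while
    following a fixed policy; `done` marks the end of an episode.
    """
    V = defaultdict(float)
    e = defaultdict(float)   # eligibility traces e_t(s)

    for s, r, s_next, done in transitions:
        # TD error for the current step.
        delta = r + gamma * (0.0 if done else V[s_next]) - V[s]

        # Accumulating trace: the state just visited gets +1.
        e[s] += 1.0

        # Every state is updated in proportion to its eligibility,
        # and every trace then decays by gamma * lambda.
        for state in list(e.keys()):
            V[state] += alpha * delta * e[state]
            e[state] *= gamma * lam

        if done:
            e.clear()   # traces reset at episode boundaries

    return V
```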

Page 35:

Reinforcement Learning
✔ What is Reinforcement Learning
✔ Methods for Solving

              Passive    Active
Model-based   ✔          ✔
Model-free    ✔          ✔

Model-based: some supervised learning can be used (Passive); POMDP for Bayes-RL (Active)
Model-free: Monte Carlo, Temporal Difference (TD) & its variants (Q-learning, SARSA), TD(λ)

Page 36:

COMP3702/7702: Artificial Intelligence
How to develop agents that can:

1. Make good decisions when information about the problem is accurate and abundant.
   - Agent design problem, search in discrete space, search in continuous space with application to motion planning, logical representation, validity (model checking, theorem proving), satisfiability (DPLL, GSAT).
2. Make good decisions when information about the problem is inaccurate and limited.
   - Worst case: And-Or tree, min-max tree, minimax algorithm, alpha-beta pruning
   - Stochastic: utility theory, MDP, value iteration, policy iteration, online solving, POMDP.
3. Learn and improve their decision-making capability over time.
   - Reinforcement learning: multi-arm bandit, Bayes-RL (POMDP), TD learning, Monte Carlo, Q-learning, SARSA

Page 37:

We didn't talk about ethics much, but …
- With knowledge comes responsibility
- AI (including machine learning) systems can be very biased!!!
- An AI system (at least at the moment) cannot and should not be used as a justification for discriminatory behavior
- At the very least, please make sure your users are aware of the possible bias
- An AI system, like any other system, is a tool. It can be used for good or for bad. Hopefully you'll use it for good :)

Page 38:

What's next?
- Machine Learning (COMP4702 / COMP7702)
- Research?
  - Decision making under uncertainty, including its relation to machine learning
  - But, this is my last semester at UQ
  - I'll be moving to ANU in January, and so is my research

Page 39:

Thank you
Hope you learn a thing or two