Assignment 1 - University of Queensland


TRANSCRIPT

Page 1:

Assignment 1
- Highest original score: 92.25, so everyone got 7.75 bonus points
- How is everyone doing?
  - 7: 20%
  - 6: 10%
  - 5: 20%
  - 4: 23%
- A couple of submissions originally had no report mark (fixed)
  - Bulk download from Turnitin does not download these submissions (it seems that if the author's name does not start with a letter, the submission is excluded)
- A couple of submissions had an issue with the demo mark (fixed)
- 1 group mark is still pending due to a group problem

Page 2:

Page 3:

Assignment 2
- Support code came out yesterday
- Amendment to one of the inputs; it will make computing the transition easier
- Group registration was due last Tue! If you still want to work in a group, please register on the group registration website ASAP. We'll leave it open until this Monday morning.
- Help sessions during swotvac
  - Tue 11am-1pm & Thu 11am-1pm, usual tutorial room
  - Depending on how students go, we may have 1-2 additional help sessions during week 1 of the exam period (if we do, we'll announce it on Piazza)
- Questions?

Page 4:

COMP3702/7702 Artificial Intelligence
Lecture 13: Introduction to Machine Learning and Reinforcement Learning

Hanna Kurniawati

Page 5:

Today
- What is machine learning?
- Where is it used?
- Types of machine learning algorithms:
  - Supervised learning
  - Unsupervised learning
  - Reinforcement learning

Page 6:

Reinforcement Learning
- What is Reinforcement Learning?
- Methods for solving

Page 7:

More general approaches for solving RL
- Data from interacting with the world: <s, a, r, s'>
- Model-based vs model-free: What's being learned?
- Passive vs Active: How are the data being generated?

              Passive    Active
Model-based   ✔
Model-free

Page 8:

Model-based, Active
- We need a way to decide which data to use
- Classic: interact with the world directly.
  - Decide the action we use to interact with the world, so as to balance gaining information and reaching the goal
- Nowadays: we can perform the trials in a high-fidelity simulator
  - Decide the action we use to interact with the simulated world (of course, the hope is that the simulation is close to reality), so as to balance gaining information and reaching the goal
  - Need to consider how well the result transfers from the simulator to the real world.

Page 9:

Bayesian Reinforcement Learning
- Bayesian view:
  - The parameters (T & R) we want to estimate are represented as random variables
  - Start with a prior over models
  - Compute the posterior based on data
- Quite useful when the agent actively gathers data
  - Can decide how to balance exploration & exploitation, or how to improve the model & solve the problem optimally
  - Often represented as Partially Observable Markov Decision Processes (POMDPs)
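
To make the prior-to-posterior step concrete, here is a minimal sketch (an illustration, not from the slides) that keeps a Bayesian estimate of the unknown transition function T by maintaining Dirichlet counts over observed <s, a, s'> data; the state names and prior strength are made up:

```python
from collections import defaultdict

class DirichletTransitionModel:
    """Bayesian estimate of T(s, a, s') learned from observed <s, a, s'> data."""

    def __init__(self, states, prior_count=1.0):
        self.states = list(states)
        self.prior_count = prior_count      # symmetric Dirichlet prior ("pseudo-counts")
        self.counts = defaultdict(float)    # (s, a, s') -> number of times observed

    def update(self, s, a, s_next):
        """Posterior update: simply add the new observation to the counts."""
        self.counts[(s, a, s_next)] += 1.0

    def posterior_mean(self, s, a, s_next):
        """Posterior mean of T(s, a, s') under the Dirichlet posterior."""
        numer = self.prior_count + self.counts[(s, a, s_next)]
        denom = sum(self.prior_count + self.counts[(s, a, s2)] for s2 in self.states)
        return numer / denom

# Usage: observe a few transitions, then query the estimated model.
model = DirichletTransitionModel(states=["s1", "s2"])
model.update("s1", "right", "s2")
model.update("s1", "right", "s2")
model.update("s1", "right", "s1")
print(model.posterior_mean("s1", "right", "s2"))   # 3/5 = 0.6 with this prior
```

In Bayes-RL proper, the belief over T & R becomes part of the POMDP state, as described on the next slides; the point of the sketch is only the "prior plus data gives posterior" mechanics.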

Page 10:

Bayesian Reinforcement Learning
- The problem of solving an MDP with unknown T & R can be represented as a POMDP with a partially observed MDP model
- POMDP model:
  - S: MDP states × T × R
  - A: MDP actions
  - T(s, a, s'): the transition, assuming the MDP model is as described by POMDP state s
  - Ω: the resulting next state and reward of the MDP
  - Z(s, a, o): perceived next state & reward, assuming the MDP model is as described by POMDP state s
  - R(s, a): the reward, assuming the MDP model is as described by POMDP state s

Page 11:

Bayesian Reinforcement Learning
- The optimal policy of the POMDP is optimal exploration vs exploitation
  - It will try to balance building the most accurate model & working directly towards achieving the goal.
  - It will make the MDP agent receive the maximum reward given the initially unknown T & R.
- Building the best model is just an intermediate step, not the end goal!

Page 12:

More general approaches for solving RL
- Data from interacting with the world: <s, a, r, s'>
- Model-based vs model-free: What's being learned?
- Passive vs Active: How are the data being generated?

              Passive    Active
Model-based   ✔          ✔
Model-free

Page 13:

Model-free
- Two flavors:
  - Learn the value functions / Q-value functions and then compute the policy.
  - Learn the policy directly
- First, we'll look at learning (estimating) the value / Q-value functions:
  - Monte Carlo
  - Temporal Difference

Page 14:

Monte Carlo
- Goal: given a policy, learn the value of the policy when T & R are unknown
- Assumption: episodic MDP
  - Each episode (i.e., each run) is guaranteed to terminate within a finite amount of time.
- Loop over:
  - Generate an episode
  - Compute the total discounted reward for the episode
  - Update the value

Page 15:

Monte Carlo update
- Suppose we have the following episode:

  [Episode: s1 →π(s1) s2 →π(s2) s3 →π(s3) s4 →π(s4) s5 → … →π(sn-1) sn, with rewards r1, r2, r3, r4, r5, …, rn]

- For each si, R(si) = ri + γ ri+1 + … + γ^(n-i) rn
- Value update:
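
A standard way to write this update, with a learning rate α (the same α that appears with TD learning later in the deck), is:

  V(si) ← V(si) + α[ R(si) − V(si) ]

With α set to 1/(number of visits to si so far), this is just a running average of the sampled returns R(si).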

Page 16:

Monte Carlo
- First visit:
  - Only update the value of a state if it is visited for the first time in the sampled episode
- Every visit:
  - Update the value of a state whenever it is visited (regardless of the time step)
- Converges to the true value by the law of large numbers
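
A minimal first-visit Monte Carlo sketch for estimating V under a fixed policy; the environment interface (a generate_episode function returning (state, reward) pairs) is an assumption for illustration, not something specified on the slides:

```python
from collections import defaultdict

def first_visit_mc(generate_episode, gamma=0.9, num_episodes=1000):
    """Estimate V(s) for a fixed policy by averaging first-visit returns.

    `generate_episode` is assumed to return a list of (state, reward) pairs,
    where the reward is the one received on leaving that state.
    """
    values = defaultdict(float)
    visit_counts = defaultdict(int)

    for _ in range(num_episodes):
        episode = generate_episode()

        # Discounted return R(si) from each step to the end of the episode.
        returns = []
        g = 0.0
        for state, reward in reversed(episode):
            g = reward + gamma * g
            returns.append((state, g))
        returns.reverse()

        # First-visit: only the first occurrence of each state in the episode counts.
        seen = set()
        for state, g in returns:
            if state in seen:
                continue
            seen.add(state)
            visit_counts[state] += 1
            # Incremental average, equivalent to alpha = 1 / (#visits).
            values[state] += (g - values[state]) / visit_counts[state]

    return values
```

For every-visit Monte Carlo, the `seen` check is simply dropped.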

Page 17:

Law of Large Numbers
- Weak law of large numbers.
- Strong law of large numbers.
- What's the difference?
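
For reference, the standard statements for i.i.d. samples X1, X2, … with mean μ and sample mean X̄n = (X1 + … + Xn)/n are:

- Weak law: X̄n converges to μ in probability, i.e., for every ε > 0, P(|X̄n − μ| > ε) → 0 as n → ∞.
- Strong law: X̄n converges to μ almost surely, i.e., P(lim X̄n = μ) = 1.

Almost-sure convergence implies convergence in probability, so the strong law is the stronger guarantee; this is the sense in which the Monte Carlo estimates above converge to the true values.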

Page 18:

Temporal Difference (TD) Learning
- One of the most famous Reinforcement Learning approaches
- Idea: iteratively reduce the difference between the value or Q-value estimates

  Q(st, at) = Q(st, at) + α[ rt + γ Q(st+1, at+1) − Q(st, at) ]
  V(st) = V(st) + α[ rt + γ V(st+1) − V(st) ]

  where α is a constant in [0, 1] representing the learning rate. In some implementations, it decreases as the amount of data increases, e.g., set to 1/(#visits + 1).
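
A minimal tabular TD(0) sketch of the V update above, applied to a stream of transitions collected while following a fixed policy (the (s, r, s') tuple format is an assumption for illustration):

```python
from collections import defaultdict

def td0_value_estimate(transitions, alpha=0.1, gamma=0.9):
    """Tabular TD(0): V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s))."""
    V = defaultdict(float)
    for s, r, s_next in transitions:
        td_error = r + gamma * V[s_next] - V[s]
        V[s] += alpha * td_error
    return V
```

Unlike the Monte Carlo sketch, each update uses only one observed step plus the current estimate V(s'), so learning happens before the episode finishes.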

Page 19:

Temporal Difference (TD) Learning
- Given a policy π, suppose the reinforcement learning agent traverses the following episode:

  [Episode: s1 →π(s1) s2 →π(s2) s3 →π(s3) s4 →π(s4) s5 → … →π(sn-1) sn, with rewards r1, r2, r3, r4, r5, …, rn]

- Value update after each step: V(si) = V(si) + α[ ri + γ V(si+1) − V(si) ]

Page 20:

Monte Carlo vs Temporal Difference Learning
- The Monte Carlo approach updates the value after the episode is finished
- Temporal Difference updates the value after each step

Page 21:

Monte Carlo vs Temporal Difference Learning
- Suppose you want to predict the time you need to get to your home
- Monte Carlo:
  - Record the time on day i
  - The next day, update based on the recorded time from day i
  - α = 1

Page 22:

Monte Carlo vs Temporal Difference Learning
- Suppose you want to predict the time you need to get to your home
- Temporal Difference:
  - Record the time on day i
  - Update as the episode progresses
  - α = 1

Page 23:

TD Learning - Variants
- Q-learning
- SARSA (State-Action-Reward-State-Action)
- General: TD(λ)

Page 24:

Q-Learning: Off-Policy TD Control
- Off-policy: update the Q-value based on the (estimated) best next action, even though it's not the action performed
  - The policy being followed is not the same as the policy being evaluated.
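
A minimal sketch of the tabular Q-learning update: the target bootstraps from the greedy (max) action at the next state, regardless of what the behaviour policy actually does. The ε-greedy behaviour policy shown here is a common choice, assumed for illustration:

```python
import random
from collections import defaultdict

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Off-policy TD control: bootstrap from the best action at s_next."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    """Behaviour policy: mostly greedy, occasionally random, for exploration."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

# Q-table with a default value of 0 for unseen (state, action) pairs.
Q = defaultdict(float)
```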

Page 25:

SARSA: On-Policy TD Control
- Consider the actual action that the agent will take at the next state
- Data is (s, a, r, s', a')
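
For contrast, a minimal sketch of the SARSA update, which bootstraps from the action a' the agent actually takes at the next state (same tabular Q representation as the Q-learning sketch above):

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """On-policy TD control: bootstrap from the action actually taken at s_next."""
    Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])
```

The only difference from Q-learning is that Q(s_next, a_next) replaces the max over actions, so the learned values reflect the exploration behaviour of the policy itself.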

Page 26:

Example: Q-learning vs SARSA
- A mouse moving next to a cliff
  - Blue: mouse, Green: cheese, Red: cliff

  [Two panels comparing the learned behaviour: Q-learning (left) vs SARSA (right)]

Page 27:

TD Learning - Variants
- ✔ Q-learning
- ✔ SARSA (State-Action-Reward-State-Action)
- General: TD(λ)

Page 28:

A more general TD Learning
- We can actually do more steps

Page 29:

n-step TD method for Value Estimation
- Given a policy π, suppose the reinforcement learning agent traverses the following episode:

  [Episode: s1 →π(s1) s2 →π(s2) s3 →π(s3) s4 →π(s4) s5 → … →π(sn-1) sn, with rewards r1, r2, r3, r4, r5, …, rn]

- For each si, Rn(si) = ri + γ ri+1 + … + γ^n ri+n
- Value update:

  V(st) = V(st) + α[ Rn(st) + γ^(n+1) V(st+n+1) − V(st) ]
        = V(st) + α[ rt + γ rt+1 + γ^2 rt+2 + … + γ^n rt+n + γ^(n+1) V(st+n+1) − V(st) ]
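
A small sketch of the n-step update above for a single state along a recorded episode; the episode format (a list of (state, reward) pairs, rewards indexed as on the slide) and the use of a defaultdict for V are assumptions for illustration:

```python
def n_step_update(V, episode, t, n, alpha=0.1, gamma=0.9):
    """n-step TD update of V for the state visited at step t.

    `V` is assumed to be a defaultdict(float); `episode` is a list of
    (state, reward) pairs, with r_t the reward recorded at step t.
    """
    # n-step return: R_n(s_t) = r_t + gamma*r_{t+1} + ... + gamma^n * r_{t+n}
    last = min(t + n, len(episode) - 1)
    g = sum(gamma ** (k - t) * episode[k][1] for k in range(t, last + 1))
    # Bootstrap with gamma^(n+1) * V(s_{t+n+1}) if the episode is long enough.
    if last + 1 < len(episode):
        g += gamma ** (last + 1 - t) * V[episode[last + 1][0]]
    s_t = episode[t][0]
    V[s_t] += alpha * (g - V[s_t])
    return V
```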

Page 30:

TD(λ) – Update Rule
- Weighted sum of the total reward over sequences of different lengths, with weights (1 − λ), (1 − λ)λ, (1 − λ)λ^2, …, (1 − λ)λ^(n−1)

Page 31:

TD(λ) – Update Rule
- Let

  R^λ(st) = (1 − λ) Σ_{n≥1} λ^(n−1) Rn(st)

  The update rule of TD(λ) is

  V(st) = V(st) + α[ R^λ(st) − V(st) ]


Page 33:

TD(λ) – Implementation
- Use an eligibility trace as the weight of a visited state
  - The eligibility trace of a state at time t, denoted et(s), represents the eligibility of the state for undergoing learning changes.
- Defined as:

  et(s) = γλ et−1(s)        if s ≠ st
  et(s) = γλ et−1(s) + 1    if s = st

- When the state has been visited recently, its eligibility is high; as time progresses, its eligibility decreases.

Page 34:

TD(λ) – Algorithm
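
A minimal sketch of the standard backward-view TD(λ) for value estimation, using the eligibility trace defined on the previous slide (this is the textbook version, not necessarily the exact algorithm on the slide; the (s, r, s', done) transition format is an assumption):

```python
from collections import defaultdict

def td_lambda(transitions, alpha=0.1, gamma=0.9, lam=0.8):
    """Backward-view TD(lambda) value estimation with accumulating traces.

    `transitions` is a list of (s, r, s_next, done) tuples gathered while
    following a fixed policy; `done` marks the end of an episode.
    """
    V = defaultdict(float)
    e = defaultdict(float)   # eligibility traces e_t(s)

    for s, r, s_next, done in transitions:
        # TD error for the current step.
        delta = r + gamma * (0.0 if done else V[s_next]) - V[s]

        # Accumulating trace: the state just visited gets +1.
        e[s] += 1.0

        # Every state is updated in proportion to its eligibility,
        # and every trace then decays by gamma * lambda.
        for state in list(e.keys()):
            V[state] += alpha * delta * e[state]
            e[state] *= gamma * lam

        if done:
            e.clear()   # traces reset at episode boundaries

    return V
```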

Page 35:

Reinforcement Learning
✔ What is Reinforcement Learning
✔ Methods for Solving

              Passive    Active
Model-based   ✔          ✔
Model-free    ✔          ✔

Model-based: some supervised learning can be used (Passive); POMDP for Bayes-RL (Active)
Model-free: Monte Carlo, Temporal Difference (TD) & its variants (Q-learning, SARSA), TD(λ)

Page 36:

COMP3702/7702: Artificial Intelligence
How to develop agents that can:

1. Make good decisions when information about the problem is accurate and abundant.
   - Agent design problem, search in discrete space, search in continuous space with application to motion planning, logical representation, validity (model checking, theorem proving), satisfiability (DPLL, GSAT).
2. Make good decisions when information about the problem is inaccurate and limited.
   - Worst case: And-Or tree, min-max tree, minimax algorithm, alpha-beta pruning
   - Stochastic: utility theory, MDP, value iteration, policy iteration, online solving, POMDP.
3. Learn and improve their decision-making capability over time.
   - Reinforcement learning: multi-arm bandit, Bayes-RL (POMDP), TD learning, Monte Carlo, Q-learning, SARSA

Page 37:

We didn't talk about ethics much, but …
- With knowledge comes responsibility
- AI (including machine learning) systems can be very biased!!!
- An AI system (at least at the moment) cannot and should not be used as a justification for discriminatory behavior
- At the very least, please make sure your users are aware of the possible bias
- An AI system, like any other system, is a tool. It can be used for good or for bad. Hopefully you'll use it for good :)

Page 38:

What's next?
- Machine Learning (COMP4702 / COMP7702)
- Research?
  - Decision making under uncertainty, including its relation to machine learning
  - But, this is my last semester at UQ
  - I'll be moving to ANU in January, and so is my research

Page 39:

Thank you
Hope you learn a thing or two