
RL 2: It’s 2:00 AM. Do you know where your mouse is?

First up: Vote!

•Albuquerque Municipal Election today (Oct 4)

•Not all of you are eligible to vote, I know...

•... But if you are, you should.

•Educate yourself first!

•Mayor

•City councilors

•Bonds (what will ABQ spend its money on?)

•Propositions (election finance, min wage, voter ID)

•Polls close at 7:00 PM today...

Voting resources

•City of Albuquerque web site: www.cabq.gov

•League of Women Voters web site: http://www.lwvabc.org/elections/2005VG_English.html

News o’ the day

•Wall Street Journal reports: “Microsoft Windows Officially Broken”

•In 2004, MS Longhorn (successor to XP) bogged down

•Whole code base had to be scrapped & started afresh ⇒ Vista

•Point: not MS bashing (much)

•Importance of software process

•MS moved to a more agile process for Vista

•Test first

•Rigorous regression testing

•Better coding infrastructure

Administrivia

•Grading: P1 rollout grading is finished

•I will send grade reports this afternoon & tomorrow morning

•Prof Lane out of town Oct 11

•Andree Jacobsen will cover

•Stefano Markidis out of town Oct 19

•Will announce new office hours presently

Your place in History

•Last time:

•Q2

•Introduction to Reinforcement Learning (RL)

Your place in History

•This time:

✓P2M1 due

✓Voting

✓News

✓Administrivia

✓Q&A

•More on RL

•Design exercise: WorldSimulator and Terrains

Recall: Mack & his maze

•Mack lives a hard life as a psychology test subject

•Has to run around mazes all day, finding food and avoiding electric shocks

•Needs to know how to find cheese quickly, while getting shocked as little as possible

•Q: How can Mack learn to find his way around?


Reward over time

[Diagram: a graph of states s1–s11 connected by transitions]

Reward over time

[Diagram: the same state graph, with the trajectory s1 → s4 → s11 → s10 → … highlighted]

V(s1) = R(s1) + R(s4) + R(s11) + R(s10) + ...

Reward over time

[Diagram: the same state graph, with a different trajectory s1 → s2 → s6 → … highlighted]

V(s1) = R(s1) + R(s2) + R(s6) + ...
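To make the sums above concrete: the value of a start state along one particular trajectory is just the accumulated reward at each state visited, and different trajectories from the same start can yield different totals. A minimal Java sketch, using the slide’s state names but made-up reward values:

```java
import java.util.List;
import java.util.Map;

public class TrajectoryValue {
    public static void main(String[] args) {
        // Hypothetical rewards for a few states (values made up).
        Map<String, Double> R = Map.of(
            "s1", 0.0, "s4", -1.0, "s11", 0.0, "s10", 10.0);

        // One possible trajectory from s1, as on the slide.
        List<String> trajectory = List.of("s1", "s4", "s11", "s10");

        // V(s1) along this trajectory = R(s1)+R(s4)+R(s11)+R(s10)
        double v = trajectory.stream().mapToDouble(R::get).sum();
        System.out.println("V(s1) along this trajectory = " + v); // 9.0
    }
}
```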

Where can you go?

•Definition: Complete set of all states agent could be in is called the state space: S

•Could be discrete or continuous

•For Project 2: states are discrete

•Q: what is the state space for P2?

•Size of state space: |S|

•Q: How big is the state space for P2?

Where can you go?

•Definition: Complete set of actions an agent could take is called the action space: A

•Again, discrete or continuous

•Again, P2: A is discrete

•Q: What is A for P2? Size?

•Again, size: |A| (see the sketch below for both S and A)
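As a concrete illustration of discrete S and A, here is a hedged Java sketch. GridState and Move are hypothetical stand-ins, not P2’s actual classes, and it assumes the state is nothing more than the agent’s cell in a grid:

```java
// Hypothetical discrete state and action spaces for a grid world.
public class Spaces {
    // A state: the agent's (row, col) cell in a width x height grid.
    record GridState(int row, int col) {}

    // A discrete action space with |A| = 4.
    enum Move { NORTH, SOUTH, EAST, WEST }

    public static void main(String[] args) {
        int width = 10, height = 10;
        // If the state is just the agent's cell, |S| = width * height.
        System.out.println("|S| = " + (width * height));
        System.out.println("|A| = " + Move.values().length);
    }
}
```

If the state also tracked, say, whether the cheese has been eaten, |S| would double: state spaces grow multiplicatively with each added feature.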

What is it worth to you?

•Idea of “good” and “bad” places to go

•Quantified as “rewards”

•(This is where term “reinforcement learning” comes from. Originated in psychology.)

•Formally: R : S → Reals

•R(s) == reward for getting to state s

•How good or bad it is to reach state s

•Larger (more positive) is better

•Agent “wants” to get more positive reward (see the sketch below)
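In code, R : S → Reals is just a function from states to doubles. A minimal sketch, reusing the hypothetical GridState idea from above with made-up reward values:

```java
// Hypothetical reward function R : S -> Reals for a grid world.
public class RewardDemo {
    record GridState(int row, int col) {}

    // R(s): larger (more positive) is better.
    static double reward(GridState s) {
        if (s.row() == 9 && s.col() == 9) return +10.0; // cheese
        if (s.row() == 5 && s.col() == 5) return -10.0; // shock
        return -0.1; // small penalty elsewhere, discouraging wandering
    }

    public static void main(String[] args) {
        System.out.println(reward(new GridState(9, 9))); // 10.0
        System.out.println(reward(new GridState(0, 0))); // -0.1
    }
}
```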

How does it happen?

•Dynamics of agent defined by transition function

•T: S x A x S → [0,1]

•T(s,a,s’) == Pr[next state is s’ | curr state is s, act a]

•Examples from P2?

•In practice: Don’t write T down explicitly. Encoded by WorldSimulator and Terrain/agent interactions. (See the sketch below.)
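A hedged sketch of what “T encoded procedurally” can look like: rather than storing a table of probabilities, the simulator samples s’ from T(s, a, ·) each step. The 0.8 success probability and the GridState/Move names are assumptions for illustration, not P2’s actual dynamics:

```java
import java.util.Random;

// Hypothetical stochastic dynamics, encoded procedurally.
public class StepDemo {
    record GridState(int row, int col) {}
    enum Move { NORTH, SOUTH, EAST, WEST }

    static final Random RNG = new Random();

    // Sample s' ~ T(s, a, .): the intended move is executed with
    // probability 0.8; otherwise a uniformly random move happens.
    static GridState step(GridState s, Move a) {
        Move actual = (RNG.nextDouble() < 0.8)
                ? a
                : Move.values()[RNG.nextInt(4)];
        return switch (actual) {
            case NORTH -> new GridState(s.row() - 1, s.col());
            case SOUTH -> new GridState(s.row() + 1, s.col());
            case EAST  -> new GridState(s.row(), s.col() + 1);
            case WEST  -> new GridState(s.row(), s.col() - 1);
        };
    }

    public static void main(String[] args) {
        System.out.println(step(new GridState(5, 5), Move.NORTH));
    }
}
```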

The MDP

•Entire RL environment defined by a Markov decision process:

•M = 〈S, A, T, R〉

•S: state space

•A: action space

•T: transition function

•R: reward function

•Q: What modules represent these in P2? (See the sketch below.)
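Purely as a mental model (in P2 these roles are played by real modules such as WorldSimulator and the Terrain classes, not a literal record like this), here is one way the four components of M could be bundled in Java; the toy two-state MDP in main is made up:

```java
import java.util.Random;
import java.util.Set;
import java.util.function.BiFunction;
import java.util.function.ToDoubleFunction;

public class MdpSketch {
    // M = <S, A, T, R>, with T represented by a sampler for s' ~ T(s,a,.).
    record Mdp<S, A>(
            Set<S> states,                  // S
            Set<A> actions,                 // A
            BiFunction<S, A, S> sampleNext, // draws s' ~ T(s, a, .)
            ToDoubleFunction<S> reward) {}  // R(s)

    public static void main(String[] args) {
        Random rng = new Random();
        // Toy MDP: "flip" moves to the other state but fails 10% of the time.
        Mdp<String, String> m = new Mdp<>(
                Set.of("left", "right"),
                Set.of("stay", "flip"),
                (s, a) -> a.equals("stay") || rng.nextDouble() < 0.1
                        ? s
                        : (s.equals("left") ? "right" : "left"),
                s -> s.equals("right") ? 1.0 : 0.0);
        System.out.println(m.sampleNext().apply("left", "flip"));
    }
}
```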

Policies

•Total accumulated reward (value, V) depends on

•Where agent starts

•What agent does at each step (duh)

•Plan of action is called a policy, π

•Policy defines what action to take in every state of the system: π : S → A (see the sketch below)
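Since π : S → A is just a mapping from states to actions, a discrete policy can be as simple as a lookup table. A sketch with hypothetical state and action names:

```java
import java.util.Map;

public class PolicyDemo {
    // A policy pi : S -> A.
    interface Policy<S, A> {
        A actionFor(S state); // pi(s)
    }

    public static void main(String[] args) {
        // Table-backed policy over made-up state/action names.
        Map<String, String> table =
                Map.of("s1", "east", "s2", "east", "s3", "north");
        Policy<String, String> pi = table::get;
        System.out.println("pi(s2) = " + pi.actionFor("s2"));
    }
}
```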

Experience & histories

•Fundamental unit of experience in RL:

•At time t in some state si, take action aj, get reward rt, end up in state sk

•Called an experience tuple or SARSA tuple: 〈si, aj, rt, sk〉

•Set of all experience during a single episode up to time T is a history or trajectory: h = 〈s0, a0, r0, s1, a1, r1, ..., sT〉 (see the sketch below)
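A minimal sketch of an experience tuple and a history as a list of tuples; the record name, fields, and values are illustrative:

```java
import java.util.List;

public class ExperienceDemo {
    // One unit of experience: <s, a, r, s'>.
    record Experience(String s, String a, double r, String sNext) {}

    public static void main(String[] args) {
        // A short history/trajectory: experience at t = 0, 1, 2.
        List<Experience> history = List.of(
                new Experience("s1", "east", -0.1, "s4"),
                new Experience("s4", "east", -0.1, "s11"),
                new Experience("s11", "north", 10.0, "s10"));
        System.out.println("episode length T = " + history.size());
    }
}
```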

How good is a policy?

•Value is a function of start state and policy: Vπ(s1)

•Value measures:

•How good is policy π, averaged over all time, if agent starts at state s1 and runs forever? (See the estimation sketch below.)
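One operational reading of that definition: run the policy from s1, add up the rewards, and average over many episodes because the dynamics are stochastic. Everything in this sketch (the toy chain of states, the 0.8 move probability, the reward values, the fixed “always move right” policy) is an assumption for illustration:

```java
import java.util.Random;

public class ValueEstimate {
    static final Random RNG = new Random();

    // Toy chain: states 0..10; moving right succeeds with prob 0.8.
    static int step(int s) {
        return Math.min(10, RNG.nextDouble() < 0.8 ? s + 1 : s);
    }

    // -0.1 per ordinary state, +10 at the goal state 10.
    static double reward(int s) { return s == 10 ? 10.0 : -0.1; }

    public static void main(String[] args) {
        int episodes = 1000, maxSteps = 100;
        double total = 0.0;
        for (int e = 0; e < episodes; e++) {
            int s = 0;               // start state ("s1")
            double v = reward(s);
            // Follow the fixed policy until the goal (or a step cap).
            for (int t = 0; t < maxSteps && s < 10; t++) {
                s = step(s);
                v += reward(s);
            }
            total += v;
        }
        System.out.println("estimated V(s1) ~ " + total / episodes);
    }
}
```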

The goal of RL

•Agent’s goal:

•Find the best possible policy: π*

•Find policy, π*, that maximizes Vπ(s) for all s

Design Exercise: WorldSimulator & Friends

Design exercise

•Q1:

•Design the act() method in WorldSimulator

•What objects does it need to access?

•How can it take different terrains/agents into account?

•Q2:

•GridWorld2d<T> could be really large

•Most of the terrain tiles are the same everywhere

•How can you avoid millions of copies of the same tile? (One classic answer is sketched below.)
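One classic answer to Q2, offered as a hint rather than the required design, is the Flyweight pattern: keep a single immutable instance per terrain type and let every grid cell hold a reference to it. A sketch with a hypothetical Terrain record:

```java
import java.util.HashMap;
import java.util.Map;

public class TerrainFlyweight {
    // Immutable per-type data that every cell of that type can share.
    record Terrain(String name, double moveCost) {}

    private static final Map<String, Terrain> CACHE = new HashMap<>();

    // Always returns the same instance for the same terrain name.
    static Terrain of(String name, double moveCost) {
        return CACHE.computeIfAbsent(name, n -> new Terrain(n, moveCost));
    }

    public static void main(String[] args) {
        Terrain a = of("grass", 1.0);
        Terrain b = of("grass", 1.0);
        // Millions of "grass" cells can all point at one object:
        System.out.println(a == b); // true
    }
}
```

Because the shared instances are immutable, aliasing them is safe; any per-cell state (items, occupants) has to live outside the shared object.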
