
RL 2: It’s 2:00 AM. Do you know where your mouse is?

First up: Vote!

•Albuquerque Municipal Election today (Oct 4)

•Not all of you are eligible to vote, I know...

•... But if you are, you should.

•Educate yourself first!

•Mayor

•City councilors

•Bonds (what will ABQ spend its money on?)

•Propositions (election finance, min wage, voter ID)

•Polls close at 7:00 PM today...

Voting resources

•City of Albuquerque web site: www.cabq.gov

•League of Women Voters web site: http://www.lwvabc.org/elections/2005VG_English.html

News o’ the day

•Wall Street Journal reports: “Microsoft Windows Officially Broken”

•In 2004, MS Longhorn (successor to XP) bogged down

•Whole code base had to be scrapped & started afresh ⇒ Vista

•Point: not MS bashing (much)

•Importance of software process

•MS moved to a more agile process for Vista

•Test first

•Rigorous regression testing

•Better coding infrastructure

Administrivia

•Grading: P1 rollout grading is finished

•I will send grade reports this afternoon & tomorrow morning

•Prof Lane out of town Oct 11

•Andree Jacobsen will cover

•Stefano Markidis out of town Oct 19

•Will announce new office hours presently

Your place in History

•Last time:

•Q2

•Introduction to Reinforcement Learning (RL)

Your place in History

•This time:

✓P2M1 due

✓Voting

✓News

✓Administrivia

✓Q&A

•More on RL

•Design exercise: WorldSimulator and Terrains

Recall: Mack & his maze

•Mack lives a hard life as a psychology test subject

•Has to run around mazes all day, finding food and avoiding electric shocks

•Needs to know how to find cheese quickly, while getting shocked as little as possible

•Q: How can Mack learn to find his way around?


Reward over time

[Diagram: a graph of states s1–s11 connected by transitions]

Reward over time

[Diagram: the same state graph, with the trajectory s1 → s4 → s11 → s10 → … highlighted]

V(s1) = R(s1) + R(s4) + R(s11) + R(s10) + ...

Reward over time

[Diagram: the same state graph, with a different trajectory s1 → s2 → s6 → … highlighted]

V(s1) = R(s1) + R(s2) + R(s6) + ...
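To make the sums above concrete: the value of a start state along one particular trajectory is just the accumulated reward at each state visited, and different trajectories from the same start can yield different totals. A minimal Java sketch, using the slide’s state names but made-up reward values:

```java
import java.util.List;
import java.util.Map;

public class TrajectoryValue {
    public static void main(String[] args) {
        // Hypothetical rewards for a few states (values made up).
        Map<String, Double> R = Map.of(
            "s1", 0.0, "s4", -1.0, "s11", 0.0, "s10", 10.0);

        // One possible trajectory from s1, as on the slide.
        List<String> trajectory = List.of("s1", "s4", "s11", "s10");

        // V(s1) along this trajectory = R(s1)+R(s4)+R(s11)+R(s10)
        double v = trajectory.stream().mapToDouble(R::get).sum();
        System.out.println("V(s1) along this trajectory = " + v); // 9.0
    }
}
```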

Where can you go?

•Definition: Complete set of all states agent could be in is called the state space: S

•Could be discrete or continuous

•For Project 2: states are discrete

•Q: what is the state space for P2?

•Size of state space: |S|

•Q: How big is the state space for P2?

Where can you go?

•Definition: Complete set of actions an agent could take is called the action space: A

•Again, discrete or continuous

•Again, P2: A is discrete

•Q: What is A for P2? Size?

•Again, size: |A| (see the sketch below for both S and A)
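As a concrete illustration of discrete S and A, here is a hedged Java sketch. GridState and Move are hypothetical stand-ins, not P2’s actual classes, and it assumes the state is nothing more than the agent’s cell in a grid:

```java
// Hypothetical discrete state and action spaces for a grid world.
public class Spaces {
    // A state: the agent's (row, col) cell in a width x height grid.
    record GridState(int row, int col) {}

    // A discrete action space with |A| = 4.
    enum Move { NORTH, SOUTH, EAST, WEST }

    public static void main(String[] args) {
        int width = 10, height = 10;
        // If the state is just the agent's cell, |S| = width * height.
        System.out.println("|S| = " + (width * height));
        System.out.println("|A| = " + Move.values().length);
    }
}
```

If the state also tracked, say, whether the cheese has been eaten, |S| would double: state spaces grow multiplicatively with each added feature.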

What is it worth to you?

•Idea of “good” and “bad” places to go

•Quantified as “rewards”

•(This is where term “reinforcement learning” comes from. Originated in psychology.)

•Formally: R : S → Reals

•R(s) == reward for getting to state s

•How good or bad it is to reach state s

•Larger (more positive) is better

•Agent “wants” to get more positive reward (see the sketch below)
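In code, R : S → Reals is just a function from states to doubles. A minimal sketch, reusing the hypothetical GridState idea from above with made-up reward values:

```java
// Hypothetical reward function R : S -> Reals for a grid world.
public class RewardDemo {
    record GridState(int row, int col) {}

    // R(s): larger (more positive) is better.
    static double reward(GridState s) {
        if (s.row() == 9 && s.col() == 9) return +10.0; // cheese
        if (s.row() == 5 && s.col() == 5) return -10.0; // shock
        return -0.1; // small penalty elsewhere, discouraging wandering
    }

    public static void main(String[] args) {
        System.out.println(reward(new GridState(9, 9))); // 10.0
        System.out.println(reward(new GridState(0, 0))); // -0.1
    }
}
```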

How does it happen?

•Dynamics of agent defined by transition function

•T: S x A x S → [0,1]

•T(s,a,s’) == Pr[next state is s’ | curr state is s, act a]

•Examples from P2?

•In practice: Don’t write T down explicitly. Encoded by WorldSimulator and Terrain/agent interactions. (See the sketch below.)
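A hedged sketch of what “T encoded procedurally” can look like: rather than storing a table of probabilities, the simulator samples s’ from T(s, a, ·) each step. The 0.8 success probability and the GridState/Move names are assumptions for illustration, not P2’s actual dynamics:

```java
import java.util.Random;

// Hypothetical stochastic dynamics, encoded procedurally.
public class StepDemo {
    record GridState(int row, int col) {}
    enum Move { NORTH, SOUTH, EAST, WEST }

    static final Random RNG = new Random();

    // Sample s' ~ T(s, a, .): the intended move is executed with
    // probability 0.8; otherwise a uniformly random move happens.
    static GridState step(GridState s, Move a) {
        Move actual = (RNG.nextDouble() < 0.8)
                ? a
                : Move.values()[RNG.nextInt(4)];
        return switch (actual) {
            case NORTH -> new GridState(s.row() - 1, s.col());
            case SOUTH -> new GridState(s.row() + 1, s.col());
            case EAST  -> new GridState(s.row(), s.col() + 1);
            case WEST  -> new GridState(s.row(), s.col() - 1);
        };
    }

    public static void main(String[] args) {
        System.out.println(step(new GridState(5, 5), Move.NORTH));
    }
}
```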

The MDP

•Entire RL environment defined by a Markov decision process:

•M = 〈S, A, T, R〉

•S: state space

•A: action space

•T: transition function

•R: reward function

•Q: What modules represent these in P2? (See the sketch below.)
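Purely as a mental model (in P2 these roles are played by real modules such as WorldSimulator and the Terrain classes, not a literal record like this), here is one way the four components of M could be bundled in Java; the toy two-state MDP in main is made up:

```java
import java.util.Random;
import java.util.Set;
import java.util.function.BiFunction;
import java.util.function.ToDoubleFunction;

public class MdpSketch {
    // M = <S, A, T, R>, with T represented by a sampler for s' ~ T(s,a,.).
    record Mdp<S, A>(
            Set<S> states,                  // S
            Set<A> actions,                 // A
            BiFunction<S, A, S> sampleNext, // draws s' ~ T(s, a, .)
            ToDoubleFunction<S> reward) {}  // R(s)

    public static void main(String[] args) {
        Random rng = new Random();
        // Toy MDP: "flip" moves to the other state but fails 10% of the time.
        Mdp<String, String> m = new Mdp<>(
                Set.of("left", "right"),
                Set.of("stay", "flip"),
                (s, a) -> a.equals("stay") || rng.nextDouble() < 0.1
                        ? s
                        : (s.equals("left") ? "right" : "left"),
                s -> s.equals("right") ? 1.0 : 0.0);
        System.out.println(m.sampleNext().apply("left", "flip"));
    }
}
```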

Policies

•Total accumulated reward (value, V) depends on

•Where agent starts

•What agent does at each step (duh)

•Plan of action is called a policy, π

•Policy defines what action to take in every state of the system: π : S → A (see the sketch below)
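Since π : S → A is just a mapping from states to actions, a discrete policy can be as simple as a lookup table. A sketch with hypothetical state and action names:

```java
import java.util.Map;

public class PolicyDemo {
    // A policy pi : S -> A.
    interface Policy<S, A> {
        A actionFor(S state); // pi(s)
    }

    public static void main(String[] args) {
        // Table-backed policy over made-up state/action names.
        Map<String, String> table =
                Map.of("s1", "east", "s2", "east", "s3", "north");
        Policy<String, String> pi = table::get;
        System.out.println("pi(s2) = " + pi.actionFor("s2"));
    }
}
```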

Experience & histories

•Fundamental unit of experience in RL:

•At time t in some state si, take action aj, get reward rt, end up in state sk

•Called an experience tuple or SARSA tuple: 〈si, aj, rt, sk〉

•Set of all experience during a single episode up to time T is a history or trajectory: h = 〈s0, a0, r0, s1, a1, r1, ..., sT〉 (see the sketch below)
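A minimal sketch of an experience tuple and a history as a list of tuples; the record name, fields, and values are illustrative:

```java
import java.util.List;

public class ExperienceDemo {
    // One unit of experience: <s, a, r, s'>.
    record Experience(String s, String a, double r, String sNext) {}

    public static void main(String[] args) {
        // A short history/trajectory: experience at t = 0, 1, 2.
        List<Experience> history = List.of(
                new Experience("s1", "east", -0.1, "s4"),
                new Experience("s4", "east", -0.1, "s11"),
                new Experience("s11", "north", 10.0, "s10"));
        System.out.println("episode length T = " + history.size());
    }
}
```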

How good is a policy?

•Value is a function of start state and policy: Vπ(s1)

•Value measures:

•How good is policy π, averaged over all time, if agent starts at state s1 and runs forever? (See the estimation sketch below.)
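One operational reading of that definition: run the policy from s1, add up the rewards, and average over many episodes because the dynamics are stochastic. Everything in this sketch (the toy chain of states, the 0.8 move probability, the reward values, the fixed “always move right” policy) is an assumption for illustration:

```java
import java.util.Random;

public class ValueEstimate {
    static final Random RNG = new Random();

    // Toy chain: states 0..10; moving right succeeds with prob 0.8.
    static int step(int s) {
        return Math.min(10, RNG.nextDouble() < 0.8 ? s + 1 : s);
    }

    // -0.1 per ordinary state, +10 at the goal state 10.
    static double reward(int s) { return s == 10 ? 10.0 : -0.1; }

    public static void main(String[] args) {
        int episodes = 1000, maxSteps = 100;
        double total = 0.0;
        for (int e = 0; e < episodes; e++) {
            int s = 0;               // start state ("s1")
            double v = reward(s);
            // Follow the fixed policy until the goal (or a step cap).
            for (int t = 0; t < maxSteps && s < 10; t++) {
                s = step(s);
                v += reward(s);
            }
            total += v;
        }
        System.out.println("estimated V(s1) ~ " + total / episodes);
    }
}
```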

The goal of RL

•Agent’s goal:

•Find the best possible policy: π*

•Find policy, π*, that maximizes Vπ(s) for all s

Design Exercise: WorldSimulator & Friends

Design exercise

•Q1:

•Design the act() method in WorldSimulator

•What objects does it need to access?

•How can it take different terrains/agents into account?

•Q2:

•GridWorld2d<T> could be really large

•Most of the terrain tiles are the same everywhere

•How can you avoid millions of copies of the same tile? (One classic answer is sketched below.)
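One classic answer to Q2, offered as a hint rather than the required design, is the Flyweight pattern: keep a single immutable instance per terrain type and let every grid cell hold a reference to it. A sketch with a hypothetical Terrain record:

```java
import java.util.HashMap;
import java.util.Map;

public class TerrainFlyweight {
    // Immutable per-type data that every cell of that type can share.
    record Terrain(String name, double moveCost) {}

    private static final Map<String, Terrain> CACHE = new HashMap<>();

    // Always returns the same instance for the same terrain name.
    static Terrain of(String name, double moveCost) {
        return CACHE.computeIfAbsent(name, n -> new Terrain(n, moveCost));
    }

    public static void main(String[] args) {
        Terrain a = of("grass", 1.0);
        Terrain b = of("grass", 1.0);
        // Millions of "grass" cells can all point at one object:
        System.out.println(a == b); // true
    }
}
```

Because the shared instances are immutable, aliasing them is safe; any per-cell state (items, occupants) has to live outside the shared object.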
