
Page 1: Life: play and win in 20 trillion moves

Stuart Russell

Based on work with Ron Parr, David Andre, Carlos Guestrin, Bhaskara Marthi, and Jason Wolfe

Page 2: White to play and win in 2 moves

Page 3: Life

100 years x 365 days x 24 hrs x 3600 seconds x 640 muscles x 10/second = 20 trillion actions
(Not to mention choosing brain activities!)

The world has a very large, partially observable state space and an unknown model
So, being intelligent is “provably” hard
How on earth do we manage?

Page 4: Hierarchical structure!! (among other things)

Deeply nested behaviors give modularity
  E.g., tongue controls for [t] independent of everything else given decision to say “tongue”
High-level decisions give scale
  E.g., “go to IJCAI 13” is roughly 3,000,000,000 actions
  Look further ahead, reduce computation exponentially
[NIPS 97, ICML 99, NIPS 00, AAAI 02, ICML 03, IJCAI 05, ICAPS 07, ICAPS 08, ICAPS 10, IJCAI 11]

Page 5: Reminder: MDPs

Markov decision process (MDP) defined by
  Set of states S
  Set of actions A
  Transition model P(s' | s,a)
    Markovian, i.e., depends only on s_t and not on history
  Reward function R(s,a,s')
  Discount factor γ

Optimal policy π* maximizes the value function
  V^π = Σ_t γ^t R(s_t, π(s_t), s_{t+1})

Q(s,a) = Σ_{s'} P(s' | s,a) [ R(s,a,s') + γ V(s') ]

Page 6: Reminder: policy invariance

Given MDP M with reward function R(s,a,s')
and any potential function Φ(s),
any policy that is optimal for M is optimal for any version M'
with reward function R'(s,a,s') = R(s,a,s') + γ Φ(s') - Φ(s)

So we can provide hints to speed up RL by shaping the reward function
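A minimal Python sketch of the shaping idea above (potential-based shaping [ICML 99]): the shaped reward R'(s,a,s') = R(s,a,s') + γΦ(s') - Φ(s) can be computed by a small wrapper. The grid-world potential Φ used here is a made-up example, not from the slides.

# Potential-based reward shaping: R'(s,a,s') = R(s,a,s') + gamma*phi(s') - phi(s).
# Optimal policies of the shaped MDP match those of the original (policy invariance).

GAMMA = 0.9

def phi(state):
    # Hypothetical potential: negative Manhattan distance to a goal cell at (5, 5).
    x, y = state
    return -(abs(5 - x) + abs(5 - y))

def shaped_reward(r, s, s_next):
    """Shaped reward for a transition s -> s_next that originally paid r."""
    return r + GAMMA * phi(s_next) - phi(s)

# A step that moves closer to the goal gets a bonus: -1 + 0.9*(-9) + 10 = 0.9
print(shaped_reward(-1.0, (0, 0), (1, 0)))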

Page 7: Reminder: Bellman equation

Bellman equation for a fixed policy π:
  V^π(s) = Σ_{s'} P(s' | s,π(s)) [ R(s,π(s),s') + γ V^π(s') ]

Bellman equation for the optimal value function:
  V*(s) = max_a Σ_{s'} P(s' | s,a) [ R(s,a,s') + γ V*(s') ]
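As a concrete illustration of the optimal Bellman equation, here is a small value-iteration sketch in Python; the two-state MDP is an assumption made up for the example, not from the talk.

# Value iteration for the optimal Bellman equation
#   V(s) = max_a sum_{s'} P(s'|s,a) [ R(s,a,s') + gamma * V(s') ]
# on a tiny hand-made MDP (two states, two actions).

GAMMA = 0.9

# P[s][a] = list of (s_next, prob); R[(s, a, s_next)] = reward (default 0)
P = {
    0: {"stay": [(0, 1.0)], "go": [(1, 0.8), (0, 0.2)]},
    1: {"stay": [(1, 1.0)], "go": [(0, 1.0)]},
}
R = {(0, "go", 1): 5.0}

def value_iteration(tol=1e-6):
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            best = max(
                sum(p * (R.get((s, a, s2), 0.0) + GAMMA * V[s2]) for s2, p in outcomes)
                for a, outcomes in P[s].items()
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            return V

print(value_iteration())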

Page 8: Reminder: Reinforcement learning

Agent observes transitions and rewards: doing a in s leads to s' with reward r

Temporal-difference (TD) learning:
  V(s) <- (1-α) V(s) + α [ r + γ V(s') ]

Q-learning:
  Q(s,a) <- (1-α) Q(s,a) + α [ r + γ max_{a'} Q(s',a') ]

SARSA (after observing the next action a'):
  Q(s,a) <- (1-α) Q(s,a) + α [ r + γ Q(s',a') ]
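The three updates above written out as Python functions over tabular dictionaries; the step sizes and the example transition are illustrative, not from the slides.

# Tabular TD(0), Q-learning, and SARSA updates as written on the slide.
from collections import defaultdict

ALPHA, GAMMA = 0.1, 0.9
V = defaultdict(float)                 # V[s]
Q = defaultdict(float)                 # Q[(s, a)]

def td_update(s, r, s_next):
    V[s] = (1 - ALPHA) * V[s] + ALPHA * (r + GAMMA * V[s_next])

def q_learning_update(s, a, r, s_next, actions):
    target = r + GAMMA * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] = (1 - ALPHA) * Q[(s, a)] + ALPHA * target

def sarsa_update(s, a, r, s_next, a_next):
    target = r + GAMMA * Q[(s_next, a_next)]
    Q[(s, a)] = (1 - ALPHA) * Q[(s, a)] + ALPHA * target

# Example transition: doing 'go' in state 0 led to state 1 with reward 5.
q_learning_update(0, "go", 5.0, 1, actions=["stay", "go"])
print(Q[(0, "go")])    # 0.5 after one update from zero initialization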

Page 9: Reminder: RL issues

Credit assignment problem
Generalization problem

Page 10: Abstract states

First try: define an abstract MDP on abstract states -- “at home,” “at SFO,” “at IJCAI,” etc.

Page 11: Abstract states

First try: define an abstract MDP on abstract states -- “at home,” “at SFO,” “at IJCAI,” etc.

Page 12: Abstract states

First try: define an abstract MDP on abstract states -- “at home,” “at SFO,” “at IJCAI,” etc.

Oops -- lost the Markov property! Action outcome depends on the low-level state, which depends on history.

Page 13: Abstract actions

Forestier & Varaiya, 1978: supervisory control
  Abstract action is a fixed policy for a region
  “Choice states”: where the policy exits its region
  Decision process on choice states is semi-Markov

Page 14: Semi-Markov decision processes

Add random duration N for action executions:
  P(s',N | s,a)

Modified Bellman equation:
  V^π(s) = Σ_{s',N} P(s',N | s,π(s)) [ R(s,π(s),s',N) + γ^N V^π(s') ]

Modified Q-learning with abstract actions:
  For transitions within an abstract action π_a:
    r_c <- r_c + γ_c R(s,π_a(s),s') ;  γ_c <- γ_c γ
  For transitions that terminate an abstract action begun at s_0:
    Q(s_0,π_a) <- (1-α) Q(s_0,π_a) + α [ r_c + γ_c max_{a'} Q(s',π_{a'}) ]
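A Python sketch of the modified Q-learning just described: accumulate the discounted reward r_c and compound discount γ_c while the abstract action's internal policy runs, then update Q at the choice state where it was begun. The environment and policy interfaces (env.step, pi_a, pi_a.terminates) are assumptions for illustration.

# SMDP Q-learning with abstract actions (hypothetical env/policy interface).
ALPHA, GAMMA = 0.1, 0.9

def run_abstract_action(env, s0, pi_a, Q, abstract_actions):
    """env.step(a) -> (s_next, r, done); pi_a(s) -> primitive action;
    pi_a.terminates(s) -> bool; Q is a defaultdict(float) keyed by
    (state, abstract_action).  All of these are assumed interfaces."""
    r_c, gamma_c, s = 0.0, 1.0, s0
    while not pi_a.terminates(s):
        s_next, r, done = env.step(pi_a(s))
        r_c += gamma_c * r            # r_c <- r_c + gamma_c * R(s, pi_a(s), s')
        gamma_c *= GAMMA              # gamma_c <- gamma_c * gamma
        s = s_next
        if done:
            break
    # Q(s0, pi_a) <- (1-alpha) Q(s0, pi_a) + alpha [ r_c + gamma_c * max_a' Q(s, a') ]
    target = r_c + gamma_c * max(Q[(s, a2)] for a2 in abstract_actions)
    Q[(s0, pi_a)] = (1 - ALPHA) * Q[(s0, pi_a)] + ALPHA * target
    return s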

Page 15: Hierarchical abstract machines

Hierarchy of NDFAs; states come in 4 types:
  Action states execute a physical action
  Call states invoke a lower-level NDFA
  Choice points nondeterministically select the next state
  Stop states return control to the invoking NDFA

NDFA internal transitions are dictated by the percept
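A data-structure sketch in Python (field names are mine, not from the paper) of a HAM node with the four state types listed above; internal transitions are a function of the current percept.

# Sketch of a hierarchical abstract machine (HAM) node.
from dataclasses import dataclass, field
from typing import Callable, List, Optional

@dataclass
class HamState:
    kind: str                                   # "action" | "call" | "choice" | "stop"
    action: Optional[str] = None                # physical action, if kind == "action"
    machine: Optional["Machine"] = None         # lower-level NDFA, if kind == "call"
    choices: List["HamState"] = field(default_factory=list)   # if kind == "choice"
    # Deterministic internal transition, dictated by the current percept:
    next_state: Optional[Callable[[object], "HamState"]] = None

@dataclass
class Machine:
    name: str
    start: HamState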

Page 16: Running example

Peasants can move, pickup and dropoff
Penalty for collision
Cost-of-living each step
Reward for dropping off resources
Goal: gather 10 gold + 10 wood
(3L)^n states s
7^n primitive actions a

Page 17: Part of 1-peasant HAM

Page 18: Joint SMDP

Physical states S, machine states Θ, joint state space Ω = S x Θ

Joint MDP:
  Transition model for physical state (from physical MDP) and machine state (as defined by HAM)
  A(ω) is fixed except when θ is at a choice point

Reduction to SMDP: states are just those where θ is at a choice point

Page 19: Hierarchical optimality

Theorem: the optimal solution to the SMDP is hierarchically optimal for the original MDP, i.e., it maximizes value among all behaviors consistent with the HAM

Page 20: Partial programs

HAMs (and Dietterich’s MAXQ hierarchies and Sutton’s Options) are all partially constrained descriptions of behaviors with internal state

A partial program is like an ordinary program in any programming language, with a nondeterministic choice operator added to allow for ignorance about what to do

HAMs, MAXQs, Options are trivially expressible as partial programs

Standard MDP: loop forever, choose an action
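The same idea sketched in Python rather than ALisp: a partial program is ordinary code plus an unresolved choose. The learner and environment interfaces below are assumptions; the learner supplies the completion, e.g. by maximizing a learned Q-function at each labeled choice point.

# The 'Standard MDP' partial program from the slide: loop forever, choose an action.
# env.reset/step/actions and learner.choose/observe are assumed interfaces.

def standard_mdp_agent(env, learner):
    s = env.reset()
    while True:
        a = learner.choose("top-choice", state=s, options=env.actions(s))
        s, r, done = env.step(a)
        learner.observe(r, s, done)
        if done:
            break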

Page 21: RL and partial programs

[Diagram: the learning algorithm takes the partial program and the agent's experience (states and rewards s, r in; actions a out) and produces a completion of the partial program]

Page 22: RL and partial programs

[Same diagram, annotated: the learned completion is hierarchically optimal for all terminating programs]

Page 23: Single-threaded ALisp program

(defun top ()
  (loop do (choose 'top-choice
             (gather-wood) (gather-gold))))

(defun gather-wood ()
  (with-choice 'forest-choice (dest *forest-list*)
    (nav dest) (action 'get-wood) (nav *base-loc*) (action 'dropoff)))

(defun gather-gold ()
  (with-choice 'mine-choice (dest *goldmine-list*)
    (nav dest) (action 'get-gold) (nav *base-loc*) (action 'dropoff)))

(defun nav (dest)
  (until (= (pos (get-state)) dest)
    (with-choice 'nav-choice (move '(N S E W NOOP))
      (action move))))

Program state θ includes:
  Program counter
  Call stack
  Global variables

Page 24: Q-functions

Represent completions using a Q-function
Joint state ω = [s,θ]: environment state + program state
MDP + partial program = SMDP over {ω}
Q^π(ω,u) = “Expected total reward if I make choice u in ω and follow completion π thereafter”
Modified Q-learning [AR 02] finds the optimal completion

  ω                                                   u             Q(ω,u)
  ...                                                 ...           ...
  At nav-choice, Pos=(2,3), Dest=(6,5), ...           North         15
  At resource-choice, Pos=(2,3), Gold=7, Wood=3, ...  Gather-wood   -42
  ...                                                 ...           ...

Page 25: Internal state

Availability of internal state (e.g., the goal stack) can greatly simplify value functions and policies
  E.g., while navigating to location (x,y), moving towards (x,y) is a good idea
A natural local shaping potential (distance from destination) is impossible to express in external terms

Page 26: Temporal Q-decomposition

[Diagram: an execution trace through Top, GatherWood, Nav(Forest1), GatherGold, Nav(Mine2), with the reward stream split into segments Qr, Qc, Qe that sum to Q]

• Temporal decomposition Q = Qr + Qc + Qe where
• Qr(ω,u) = reward while doing u (may be many steps)
• Qc(ω,u) = reward in current subroutine after doing u
• Qe(ω,u) = reward after current subroutine
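A minimal Python sketch of how the three components are recombined when choosing at ω; the tables are placeholders, not learned values.

# Temporal Q-decomposition: value of choice u at joint state w is the sum of
# reward during u (Qr), reward in the rest of the current subroutine (Qc),
# and reward after the subroutine exits (Qe).
from collections import defaultdict

Qr, Qc, Qe = defaultdict(float), defaultdict(float), defaultdict(float)

def q_total(w, u):
    return Qr[(w, u)] + Qc[(w, u)] + Qe[(w, u)]

def best_choice(w, options):
    return max(options, key=lambda u: q_total(w, u))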

Page 27: State abstraction

• Temporal decomposition => state abstraction
• E.g., while navigating, Qc is independent of gold reserves
• In general, local Q-components can depend on few variables => fast learning

Page 28: Handling multiple effectors

Multithreaded agent programs
  Threads = tasks (e.g., Get-wood, Get-gold, Defend-Base)
  Each effector assigned to a thread
  Threads can be created/destroyed
  Effectors can be reassigned
  Effectors can be created/destroyed

Page 29: Example concurrent and single-threaded ALisp programs

An example concurrent ALisp program:

(defun top ()
  (loop do
    (until (my-effectors) (choose 'dummy))
    (setf peas (first (my-effectors)))
    (choose 'top-choice
      (spawn gather-wood peas)
      (spawn gather-gold peas))))

(defun gather-wood ()
  (with-choice 'forest-choice (dest *forest-list*)
    (nav dest) (action 'get-wood) (nav *base-loc*) (action 'dropoff)))

An example single-threaded ALisp program:

(defun top ()
  (loop do (choose 'top-choice (gather-gold) (gather-wood))))

(defun gather-wood ()
  (with-choice 'forest-choice (dest *forest-list*)
    (nav dest) (action 'get-wood) (nav *base-loc*) (action 'dropoff)))

(defun gather-gold ()
  (with-choice 'mine-choice (dest *goldmine-list*)
    (nav dest) (action 'get-gold) (nav *base-loc*) (action 'dropoff)))

(defun nav (dest)
  (until (= (my-pos) dest)
    (with-choice 'nav-choice (move '(N S E W NOOP)) (action move))))

Page 30: Concurrent ALisp semantics

Thread 1 (Peasant 1), inside gather-wood, executing (nav Wood1):

(defun gather-wood ()
  (setf dest (choose *forests*))
  (nav dest)
  (action 'get-wood)
  (nav *home-base-loc*)
  (action 'put-wood))

Thread 2 (Peasant 2), inside gather-gold, executing (nav Gold2):

(defun gather-gold ()
  (setf dest (choose *goldmines*))
  (nav dest)
  (action 'get-gold)
  (nav *home-base-loc*)
  (action 'put-gold))

(defun nav (dest)
  (loop until (at-pos dest)
        do (setf d (choose '(N S E W R)))
           (action d)))

[Diagram: across environment timesteps 26 and 27, each peasant thread cycles between running, paused making a joint choice, and waiting for the joint action]

Page 31: Concurrent ALisp semantics

Threads execute independently until they hit a choice or action
Wait until all threads are at a choice or action
  If all effectors have been assigned an action, do that joint action in the environment
  Otherwise, make a joint choice
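A Python sketch of that execution rule with a hypothetical thread interface (run_until_block, blocked_on, pending, effector, resume_with and learner.choose_jointly are assumed names): run every thread to its next action or choice; if all effectors have actions, issue the joint action, otherwise make a joint choice.

# Sketch of the concurrent-ALisp synchronization rule above.

def concurrent_step(threads, env, learner):
    for t in threads:
        t.run_until_block()                        # run until an action or a choice
    if all(t.blocked_on == "action" for t in threads):
        joint_action = {t.effector: t.pending for t in threads}
        return env.step(joint_action)              # all effectors assigned: act
    choosers = [t for t in threads if t.blocked_on == "choice"]
    joint_choice = learner.choose_jointly(choosers)    # otherwise make a joint choice
    for t, u in zip(choosers, joint_choice):
        t.resume_with(u)
    return None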

Page 32: Q-functions

To complete the partial program, at each choice state ω we need to specify choices for all choosing threads
So Q(ω,u) is as before, except u is a joint choice
Suitable SMDP Q-learning gives the optimal completion

Example Q-function row:
  ω = [Peas1 at NavChoice, Peas2 at DropoffGold, Peas3 at ForestChoice, Pos1=(2,3), Pos3=(7,4), Gold=12, Wood=14]
  u = (Peas1:East, Peas3:Forest2)
  Q(ω,u) = 15.7

Page 33: Problems with concurrent activities

Temporal decomposition of the Q-function is lost
No credit assignment among threads
  Suppose peasant 1 drops off some gold at base, while peasant 2 wanders aimlessly
  Peasant 2 thinks he’s done very well!!
This significantly slows learning as the number of peasants increases

Page 34: Threadwise decomposition

Idea: decompose reward among threads (Russell & Zimdars, 2003)
  E.g., rewards for thread j only when peasant j drops off resources or collides with other peasants
Q_j^π(ω,u) = “Expected total reward received by thread j if we make joint choice u and then do π”
Threadwise Q-decomposition: Q = Q1 + … + Qn
Recursively distributed SARSA => global optimality
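A Python sketch of the threadwise decomposition: one Q-component per thread, each updated by SARSA on its own reward share, with joint choices made by maximizing the sum. The class layout and the per-thread reward split are illustrative.

# Threadwise Q-decomposition: Q = Q1 + ... + Qn, one component per thread,
# each trained by SARSA on that thread's share of the reward.
from collections import defaultdict

ALPHA, GAMMA = 0.1, 0.9

class ThreadwiseQ:
    def __init__(self, n_threads):
        self.Q = [defaultdict(float) for _ in range(n_threads)]

    def joint_value(self, w, u):
        return sum(Qj[(w, u)] for Qj in self.Q)

    def choose(self, w, options):
        return max(options, key=lambda u: self.joint_value(w, u))

    def sarsa_update(self, w, u, rewards, w_next, u_next):
        # rewards[j] is the reward credited to thread j on this transition
        for Qj, rj in zip(self.Q, rewards):
            target = rj + GAMMA * Qj[(w_next, u_next)]
            Qj[(w, u)] = (1 - ALPHA) * Qj[(w, u)] + ALPHA * target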

Page 35: Learning threadwise decomposition

[Diagram: at an action state, action a goes to the environment and the global reward r is decomposed into per-thread rewards r1, r2, r3, routed to the Peasant 1, Peasant 2, and Peasant 3 threads (plus the Top thread)]

Page 36: Learning threadwise decomposition

[Diagram: at a choice state ω, the per-thread components Q1(ω,·), Q2(ω,·), Q3(ω,·) are summed to give Q(ω,·); the joint choice is u = argmax of the sum, and each thread then performs its own Q-update]

Page 37: Threadwise and temporal decomposition

Qj = Qj,r + Qj,c + Qj,e where
  Qj,r(ω,u) = expected reward gained by thread j while doing u
  Qj,c(ω,u) = expected reward gained by thread j after u but before leaving the current subroutine
  Qj,e(ω,u) = expected reward gained by thread j after the current subroutine

Page 38: Resource gathering with 15 peasants

[Plot: reward of learnt policy vs. number of learning steps (x 1000), comparing Flat, Undecomposed, Threadwise, and Threadwise + Temporal learners]

Page 39

Page 40: Summary of Part I

Structure in behavior seems essential for scaling up

Partial programs
  Provide natural structural constraints on policies
  Decompose value functions into simple components
  Include internal state (e.g., “goals”) that further simplifies value functions, shaping rewards

Concurrency
  Simplifies description of multieffector behavior
  Messes up temporal decomposition and credit assignment (but threadwise reward decomposition restores it)

Page 41: Outline

Hierarchical RL with abstract actions
Hierarchical RL with partial programs
Efficient hierarchical planning

Page 42: Abstract lookahead

k-step lookahead >> 1-step lookahead
  e.g., chess
k-step lookahead is no use if the steps are too small
  e.g., the first k characters of a NIPS paper
Abstract plans (high-level executions of partial programs) are shorter
  Much shorter plans for given goals => exponential savings
  Can look ahead much further into the future

[Figures: a chess lookahead tree (moves Kc3, Rxa1, Qa4, c2, Kxa1) contrasted with a character-level lookahead tree: 'a', 'b' -> 'aa', 'ab', 'ba', 'bb']

Page 43: High-level actions

Start with a restricted, classical, HTN form of partial program
A high-level action (HLA)
  Has a set of possible immediate refinements into sequences of actions, primitive or high-level
  Each refinement may have a precondition on its use
Executable plan space = all primitive refinements of Act

  [Act]
    [GoSFO, Act]
      [WalkToBART, BARTtoSFO, Act]
    [GoOAK, Act]
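A small Python sketch of the HLA structure just described (the class names and the representation of plans as tuples are mine): each HLA carries a set of refinements, each a precondition plus a sequence of primitive or high-level steps, and the plan space is generated by repeatedly refining the leftmost HLA.

# High-level actions with refinements (HTN-style).  Plans are tuples of steps.
from dataclasses import dataclass

@dataclass(frozen=True)
class Primitive:
    name: str

@dataclass(frozen=True)
class HLA:
    name: str
    refinements: tuple   # tuple of (precondition_fn, tuple_of_steps) pairs

def refine_first_hla(plan, state):
    """Yield the plans obtained by refining the leftmost HLA in `plan`."""
    for i, step in enumerate(plan):
        if isinstance(step, HLA):
            for precond, steps in step.refinements:
                if precond(state):
                    yield plan[:i] + steps + plan[i + 1:]
            return
    yield plan           # plan is already fully primitive

# Slide example: [Act] refines to [GoSFO, Act] or [GoOAK, Act]; further
# refinement gives [WalkToBART, BARTtoSFO, Act] (HLA definitions omitted).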

Page 44: Lookahead with HLAs

To do offline or online planning with HLAs, we need a model
  See the classical HTN planning literature

Page 45: Lookahead with HLAs

To do offline or online planning with HLAs, we need a model
  See the classical HTN planning literature -- not!

Drew McDermott (AI Magazine, 2000): “The semantics of hierarchical planning have never been clarified … no one has ever figured out how to reconcile the semantics of hierarchical plans with the semantics of primitive actions”

Page 46: Downward refinement

Downward refinement property (DRP):
  Every apparently successful high-level plan has a successful primitive refinement
“Holy Grail” of HTN planning: allows commitment to abstract plans without backtracking

Page 47: Downward refinement

Downward refinement property (DRP):
  Every apparently successful high-level plan has a successful primitive refinement
“Holy Grail” of HTN planning: allows commitment to abstract plans without backtracking
Bacchus and Yang, 1991: it is naïve to expect the DRP to hold in general

Page 48: Downward refinement

Downward refinement property (DRP):
  Every apparently successful high-level plan has a successful primitive refinement
“Holy Grail” of HTN planning: allows commitment to abstract plans without backtracking
Bacchus and Yang, 1991: it is naïve to expect the DRP to hold in general
MRW, 2007: Theorem: if the assertions made about HLAs are true, then the DRP always holds

Page 49: Downward refinement

Downward refinement property (DRP):
  Every apparently successful high-level plan has a successful primitive refinement
“Holy Grail” of HTN planning: allows commitment to abstract plans without backtracking
Bacchus and Yang, 1991: it is naïve to expect the DRP to hold in general
MRW, 2007: Theorem: if the assertions made about HLAs are true, then the DRP always holds
Problem: how to say true things about the effects of HLAs?

Page 50: Angelic semantics for HLAs

Begin with the atomic state-space view: start in s0, goal state g
Central idea is the reachable set of an HLA from each state
  When extended to sequences of actions, this allows proving that a plan can or cannot possibly reach the goal
May seem related to nondeterminism
  But the nondeterminism is angelic: the “uncertainty” will be resolved by the agent, not by an adversary or nature

[Figures: reachable sets of HLAs h1 and h2 within the state space S, starting from s0; the sequence h1 h2 reaches g, so h1h2 is a solution]
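A Python sketch of the central idea with exact reachable sets on an atomic state space (the toy HLAs are made up): an HLA maps a set of states to the states reachable under some refinement, and a high-level plan provably reaches g if g lies in the reachable set of the sequence. The upper and lower descriptions on the next slide bound this set when it is too hard to compute exactly.

# Angelic idea on an atomic state space: an HLA h maps a set of states to the
# set of states reachable under some primitive refinement of h.

def reachable_set_of_plan(initial_states, plan, reach):
    """reach[h] : frozenset -> frozenset, the exact reachable-set map of HLA h."""
    states = frozenset(initial_states)
    for h in plan:
        states = reach[h](states)
    return states

def plan_provably_solves(initial_states, plan, reach, goal):
    return goal in reachable_set_of_plan(initial_states, plan, reach)

# Toy example: states are integers, two hypothetical HLAs.
reach = {
    "h1": lambda S: frozenset(s + 1 for s in S) | S,   # can advance by one, or stay
    "h2": lambda S: frozenset(s + 2 for s in S),       # always advances by two
}
print(plan_provably_solves({0}, ["h1", "h2"], reach, goal=3))   # True: 0 -> 1 -> 3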

Page 51: Technical development

NCSTRIPS to describe reachable sets
  STRIPS add/delete plus “possibly add”, “possibly delete”, etc.
  E.g., GoSFO adds AtSFO, possibly adds CarAtSFO
Reachable sets may be too hard to describe exactly
  Upper and lower descriptions bound the exact reachable set (and its cost) above and below
  Still support proofs of plan success/failure
  Possibly-successful plans must be refined
Sound, complete, optimal offline planning algorithms
Complete, eventually-optimal online (real-time) search

Page 52: Example: Warehouse World

Has similarities to the blocks and taxi domains, but more choices and constraints
  Gripper must stay in bounds
  Can’t pass through blocks
  Can only turn around at the top row

Goal: have C on T4
  Can’t just move it directly
  Final plan has 22 steps:
  Left, Down, Pickup, Up, Turn, Down, Putdown, Right, Right, Down, Pickup, Left, Put, Up, Left, Pickup, Up, Turn, Right, Down, Down, Putdown

Page 53: Representing descriptions: NCSTRIPS

Assume S = set of truth assignments to propositions
Descriptions specify propositions (possibly) added/deleted by an HLA
An efficient algorithm exists to progress state sets (represented as DNF formulae) through descriptions

Navigate(xt,yt)   (Precondition: At(xs,ys))
  Upper: -At(xs,ys), +At(xt,yt), FacingRight
  Lower: IF (Free(xt,yt) ∧ Free(x,ymax)): -At(xs,ys), +At(xt,yt), FacingRight; ELSE: nil
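A Python sketch of progressing state sets through an NCSTRIPS-style description (the slide's algorithm works on DNF formulae; here states are explicit sets of true propositions, which is simpler but less compact). The GoSFO example mirrors the earlier slide.

# Progress a state (a frozenset of true propositions) through an effect with
# definite adds/deletes and "possibly add"/"possibly delete" lists.

def progress_state(state, add=(), delete=(), maybe_add=(), maybe_delete=()):
    base = (frozenset(state) - frozenset(delete)) | frozenset(add)
    results = {base}
    for prop in maybe_add:                 # branch: proposition added or not
        results |= {s | {prop} for s in results}
    for prop in maybe_delete:              # branch: proposition deleted or not
        results |= {s - {prop} for s in results}
    return results

def progress_state_set(states, **effect):
    out = set()
    for s in states:
        out |= progress_state(s, **effect)
    return out

# GoSFO adds AtSFO, possibly adds CarAtSFO:
print(progress_state({"AtHome"}, add={"AtSFO"}, delete={"AtHome"}, maybe_add={"CarAtSFO"}))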

Page 54: Experiment: Instance 3

5x8 world, 90-step plan
Flat and hierarchical planners without descriptions did not terminate within 10,000 seconds

Page 55: Online search: preliminary results

[Plot: steps to reach goal for the online search experiments]

Page 56