University of Texas at Austin - pstone/courses/394rfall19/resources/week2b-…
TRANSCRIPT
[Backup diagram sketches: DP one-step full backup vs. MC sampled trajectory to the end of the episode]

Bellman eqn. for V_π  →  Dynamic programming (DP):
- all possible transitions
- only one step
- bootstrapping

Monte Carlo est. of V_π (MC):
- only sampled transitions
- all the way to the end of the episode
- no bootstrapping

Sample of return: G_{s,i} = Σ_t r_t  ⇒  V_π(s) = E[G_t | S_t = s]
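To make the DP side of the contrast concrete, here is a minimal sketch (not from the notes) of one sweep of iterative policy evaluation: it backs up over all possible transitions, only one step deep, bootstrapping on the current estimate. The dict layout (P[s][a] as (prob, next_state, reward) triples, pi[s][a] as action probabilities) is an assumption for illustration.

```python
# Minimal sketch of one DP sweep (iterative policy evaluation).
# Assumed layout: P[s][a] = [(prob, next_state, reward), ...], pi[s][a] = action prob.
def dp_sweep(V, P, pi, gamma=0.9):
    new_V = {}
    for s in P:
        # Full backup: all actions, all possible transitions, one step deep,
        # bootstrapping on the current estimate V[s2].
        new_V[s] = sum(
            pi[s][a] * sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
            for a in P[s]
        )
    return new_V
```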
G_s^i : the ith return starting from state s, collected from policy π

On-policy prediction:
  V_π(s)   = (1/n) Σ_{i=1}^n G_s^i
  Q_π(s,a) = (1/n) Σ_{i=1}^n G_{s,a}^i
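As a sketch of the on-policy estimates above (the trajectory format is an assumption, not from the notes), V_π(s) and Q_π(s,a) are simply averages of the sampled returns:

```python
from collections import defaultdict

def mc_prediction(episodes, gamma=0.9):
    """Every-visit Monte Carlo prediction from a list of episodes,
    where each episode is a list of (state, action, reward) tuples."""
    returns_V = defaultdict(list)
    returns_Q = defaultdict(list)
    for episode in episodes:
        G = 0.0
        # Walk backwards so G is the return following each step.
        for s, a, r in reversed(episode):
            G = r + gamma * G
            returns_V[s].append(G)
            returns_Q[(s, a)].append(G)
    V = {s: sum(gs) / len(gs) for s, gs in returns_V.items()}
    Q = {sa: sum(gs) / len(gs) for sa, gs in returns_Q.items()}
    return V, Q
```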
Off-policy prediction:
  V_π(s)   = (1/n) Σ_{i=1}^n G_s^i · ρ_i
  Q_π(s,a) = (1/n) Σ_{i=1}^n G_{s,a}^i · ρ_i
  where ρ_i = Π_t π(A_t | S_t) / π_b(A_t | S_t)
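A minimal sketch of the off-policy estimate, assuming pi(a, s) and pi_b(a, s) return action probabilities (these interfaces are illustrative, not from the notes): each return collected under the behavior policy is reweighted by its importance ratio before averaging.

```python
def importance_ratio(state_actions, pi, pi_b):
    # rho_i = product over the trajectory of pi(A_t|S_t) / pi_b(A_t|S_t)
    rho = 1.0
    for s, a in state_actions:
        rho *= pi(a, s) / pi_b(a, s)
    return rho

def off_policy_value(returns, ratios):
    # Ordinary importance-sampling estimate: (1/n) * sum_i G_i * rho_i
    n = len(returns)
    return sum(G * rho for G, rho in zip(returns, ratios)) / n
```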
Consider: on-policy control vs. off-policy control w/ ε-greedy exploration.
What are the target policy π and the behavior policy π_b in each case (off-policy vs. on-policy)?
Safe off-policy evaluation:
  Given: π, δ, data from π_b.
  Return a probabilistic lower bound V_π^lb such that V_π ≥ V_π^lb with prob. 1 - δ,
  without ever running policy π!
Confidence bounds: Chernoff-Hoeffding inequality
  With probability at least 1 - δ:
    μ ≥ (1/n) Σ_{i=1}^n X_i  -  b √( ln(1/δ) / (2n) )    for 0 ≤ X_i ≤ b
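A small sketch of the bound (assumed helper, not from the notes):

```python
import math

def hoeffding_lower_bound(xs, b, delta):
    """With probability at least 1 - delta, the true mean is at least
    the sample mean minus b * sqrt(ln(1/delta) / (2n)), given 0 <= x_i <= b."""
    n = len(xs)
    return sum(xs) / n - b * math.sqrt(math.log(1.0 / delta) / (2 * n))
```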
Applied to returns G_i^{π_b} collected from π_b, with probability at least 1 - δ:
  V_{π_b} ≥ (1/n) Σ_{i=1}^n G_i^{π_b}  -  G_max √( ln(1/δ) / (2n) )    for 0 ≤ G_i ≤ G_max
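For instance, plugging returns into the same bound (the numbers below are purely illustrative):

```python
import math

returns = [3.2, 4.1, 2.8, 5.0, 3.7]   # hypothetical returns observed under pi_b
G_max, delta = 10.0, 0.05             # returns assumed to lie in [0, G_max]
n = len(returns)
lower_bound = sum(returns) / n - G_max * math.sqrt(math.log(1 / delta) / (2 * n))
print(lower_bound)  # a lower bound on V_{pi_b} that holds with probability >= 1 - delta
```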
Given returns G_1, ..., G_n collected from policy π_b, and ρ_i = Π_k π(A_k | S_k) / π_b(A_k | S_k), THEN V_π(s) =

OIS:   (1/n) Σ_{i=1}^n G_i · ρ_i

WIS:   ( Σ_{i=1}^n G_i · ρ_i ) / ( Σ_{j=1}^n ρ_j )

PDIS:  (1/n) Σ_{i=1}^n G_i^{PDIS},   where  G_i^{PDIS} = Σ_{t=0}^{T} γ^t · ρ_{0:t} · R_t   and   ρ_{a:b} = Π_{k=a}^{b} π(A_k | S_k) / π_b(A_k | S_k)
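One way to read the three estimators side by side is the following sketch (trajectory layout and policy interfaces are assumptions, not from the notes): each trajectory is a list of (s, a, r) steps, and pi(a, s), pi_b(a, s) give action probabilities under the target and behavior policies.

```python
def step_ratios(traj, pi, pi_b):
    # Per-step importance ratios pi(A_k|S_k) / pi_b(A_k|S_k).
    return [pi(a, s) / pi_b(a, s) for s, a, _ in traj]

def trajectory_return(traj, gamma):
    return sum(gamma ** t * r for t, (_, _, r) in enumerate(traj))

def ois(trajs, pi, pi_b, gamma=1.0):
    # Ordinary importance sampling: (1/n) * sum_i G_i * rho_i
    total = 0.0
    for traj in trajs:
        rho = 1.0
        for w in step_ratios(traj, pi, pi_b):
            rho *= w
        total += trajectory_return(traj, gamma) * rho
    return total / len(trajs)

def wis(trajs, pi, pi_b, gamma=1.0):
    # Weighted importance sampling: sum_i G_i * rho_i / sum_j rho_j
    num, den = 0.0, 0.0
    for traj in trajs:
        rho = 1.0
        for w in step_ratios(traj, pi, pi_b):
            rho *= w
        num += trajectory_return(traj, gamma) * rho
        den += rho
    return num / den

def pdis(trajs, pi, pi_b, gamma=1.0):
    # Per-decision importance sampling: the reward at time t is weighted only
    # by the ratios of the actions taken up to and including time t.
    total = 0.0
    for traj in trajs:
        rho_0t, G = 1.0, 0.0
        for t, (step, w) in enumerate(zip(traj, step_ratios(traj, pi, pi_b))):
            rho_0t *= w
            G += gamma ** t * rho_0t * step[2]   # step = (s, a, r)
        total += G
    return total / len(trajs)
```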