practice theory · 2020. 1. 3.
TRANSCRIPT
Practice vs. Theory

Practice: powerful modeling, simple exploration; e.g. Atari Deep Reinforcement Learning. Limited theory for rich observations.
Theory: sophisticated exploration in small-state MDPs; e.g. the E³ and R-MAX algorithms.
Goal

Develop reinforcement learning approaches guaranteed to learn an optimal policy with a small number of samples, despite rich observations.
Model | PAC guarantees
Small-state MDPs | Known
Structured large-state MDPs | New
Reactive POMDPs | Extended
Reactive PSRs | New
LQR (continuous actions) | Known
[Slide figure: the interaction protocol. At each step the agent observes a rich context x and takes an action π(x); each episode runs for a horizon of H steps.]
The Bellman optimality equation, where Γ₁ is the distribution of the initial state, Γ(x, a) the distribution of the next state, and r(x, a) the instantaneous reward:

V⋆(x) = max_a E[r(x, a)] + E_{x'∼Γ(x,a)}[V⋆(x')]
The optimal action-value function, and the optimal action it induces:

Q⋆(x, a) = E[r(x, a)] + E_{x'∼Γ(x,a)}[V⋆(x')]
π⋆(x) = argmax_a Q⋆(x, a)
Three equivalent forms of the Bellman backup at (x, a):

E[r(x, a)] + E_{x'∼Γ(x,a)}[V⋆(x')]
= E[r(x, a)] + E_{x'∼Γ(x,a)}[max_{a'} Q⋆(x', a')]
= E[r(x, a)] + E_{x'∼Γ(x,a)}[Q⋆(x', π⋆(x'))]
§
§
E 𝑓 𝑥', 𝑎' − 𝑟' − 𝑓 𝑥'CD, 𝑎'CD ,
𝑥'
E 𝑓 𝑥', 𝑎' − 𝑟' − 𝑓 𝑥'CD, 𝑎'CD ,
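From a batch of sampled transitions this expectation can be estimated by a sample mean. A sketch with invented names, where f is a candidate value function and pi_f its greedy policy:

```python
def avg_bellman_error(f, pi_f, transitions):
    """Monte Carlo estimate of E[f(x, a) - r - f(x', pi_f(x'))]
    over sampled transitions (x, a, r, x')."""
    errors = [f(x, a) - r - f(x2, pi_f(x2)) for (x, a, r, x2) in transitions]
    return sum(errors) / len(errors)
```

An estimate near zero is consistent with f = Q⋆; a large estimate certifies that f is not Q⋆.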
Validity condition
The optimal value satisfies E_{x∼Γ₁}[max_a Q⋆(x, a)] = E_{x∼Γ₁}[Q⋆(x, π⋆(x))]. By analogy, each candidate f predicts the value

V_f = E_{x∼Γ₁}[f(x, π_f(x))]

where π_f(x) = argmax_a f(x, a) is the greedy policy of f.
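Given states sampled from the initial distribution, V_f can likewise be estimated by a sample mean. A sketch (function names are invented for illustration):

```python
def predicted_value(f, pi_f, initial_states):
    """Monte Carlo estimate of V_f = E_{x~Gamma_1}[f(x, pi_f(x))]
    from states sampled from the initial distribution Gamma_1."""
    return sum(f(x, pi_f(x)) for x in initial_states) / len(initial_states)
```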
The algorithm repeats three steps:
1. Optimism under uncertainty: pick the surviving candidate f with the largest V_f, a guess for V^{π⋆} that is exact if f = Q⋆.
2. Checking our optimistic belief: run the greedy policy π_f and estimate the average Bellman errors on the collected data.
3. Prune the possible solutions: eliminate every candidate whose Bellman error on that data is large.
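The three steps can be sketched as an elimination loop over a finite candidate class. This is only a schematic illustration under invented interfaces, not the paper's algorithm with its precise deviation bounds and sample sizes:

```python
def optimistic_elimination(candidates, predicted_value, bellman_error, tol=0.1):
    """Schematic optimism / check / prune loop.

    predicted_value(f)  -> V_f, the value f predicts from the start state.
    bellman_error(g, f) -> estimated average Bellman error of candidate g
                           on data collected by running f's greedy policy.
    """
    survivors = list(candidates)
    while survivors:
        # 1. Optimism: trust the most promising surviving candidate.
        f = max(survivors, key=predicted_value)
        # 2. Check: small Bellman error on its own data validates the belief.
        if abs(bellman_error(f, f)) <= tol:
            return f
        # 3. Prune: keep only candidates consistent with the collected data.
        survivors = [g for g in survivors if abs(bellman_error(g, f)) <= tol]
    return None
```

Since Q⋆ has zero Bellman error under every roll-in policy, it is never eliminated, so the loop either returns a near-optimal candidate or keeps shrinking the class.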
Detailsat:https://arxiv.org/abs/1610.09512