An Introduction to Reinforcement Learning
Anand Subramoney, anand [at] igi.tugraz.at
Institute for Theoretical Computer Science, TU Graz, http://www.igi.tugraz.at/
Machine Learning Graz Meetup, 12th October 2017
Outline
• Introduction
• Value estimation
• Q-learning
• Policy gradient
• DQN
• A3C
What is Reinforcement Learning?
• An agent learns while interacting with the environment
• The agent receives a “reward” for each action it takes
• The goal of the agent is to maximize the reward it receives
• The agent is not told what the “right” action is, i.e. it is not supervised
Notation
• The state of the environment is $s_t$ at time $t$
  • Examples of state: the (x, y) coordinates, image pixels etc.
• At each time step $t$, the agent takes action $a_t$ (knowing $s_t$)
  • Examples of action: move right/left/up/down, acceleration of a car etc.
• Then the agent gets a reward $r_{t+1}$ (the reward that follows action $a_t$)
  • Could be 0/1 or points in the game
• The agent plays for one “episode”
  • Called “episodic” RL
  • E.g. one game until it wins/loses etc.
  • Non-episodic also possible
Notation
• Model: $\mathcal{P}^{a}_{ss'} = \Pr\{s_{t+1} = s' \mid s_t = s, a_t = a\}$
  • What is the next state given the current state and the action taken?
  • The environment can be stochastic, in which case this is a probability distribution
• Reward: $\mathcal{R}^{a}_{ss'} = E\{r_{t+1} \mid s_t = s, a_t = a, s_{t+1} = s'\}$
  • Expected value of the reward when going from one state to another taking a certain action
  • In the most general case, the reward is not deterministic
Policy
• The agent has a certain mapping between state and action
• This is called the policy of the agent
• Denoted by $\pi(s, a)$
• In the stochastic case, it is the probability distribution over actions at a given state: $\pi(s, a) = P(a_t = a \mid s_t = s)$
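As a rough illustration of the stochastic case, a policy can be stored as a table of action probabilities per state and sampled from. The state names, action names and probabilities below are made up for this sketch and do not come from the talk.

```python
import random

# Hypothetical tabular policy: pi[s][a] = P(a_t = a | s_t = s).
pi = {
    "start":  {"left": 0.2, "right": 0.8},
    "middle": {"left": 0.5, "right": 0.5},
}

def sample_action(policy, state, rng=random):
    """Draw one action from the distribution pi(state, .)."""
    actions = list(policy[state])
    weights = [policy[state][a] for a in actions]
    return rng.choices(actions, weights=weights)[0]

rng = random.Random(0)
draws = [sample_action(pi, "start", rng) for _ in range(1000)]
right_fraction = draws.count("right") / len(draws)  # should be close to 0.8
```

A deterministic policy is the special case where one action per state has probability 1.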
The goal of reinforcement learning
• Is to find a policy that maximizes the total expected reward
  • also called the “return”
• In an episode:
  $$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$$
• $\gamma$ is called the “discounting factor”
• Small $\gamma$ produces short-sighted, large $\gamma$ far-sighted policies.
• $R_t$ is always finite if $\gamma < 1$ and the local rewards $r$ are from a bounded set of numbers.
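The effect of the discounting factor can be checked numerically. The sketch below computes the return of a finite reward sequence by accumulating backwards; the reward list is an arbitrary example (a single reward of 1 arriving three steps in the future), not data from the talk.

```python
def discounted_return(rewards, gamma):
    """R_t = sum_k gamma^k * r_{t+k+1} for a finite list [r_{t+1}, r_{t+2}, ...]."""
    ret = 0.0
    # Work backwards through the episode: R = r + gamma * R_next.
    for r in reversed(rewards):
        ret = r + gamma * ret
    return ret

rewards = [0.0, 0.0, 1.0]                               # reward arrives three steps ahead
short_sighted = discounted_return(rewards, gamma=0.1)   # gamma^2 = 0.01
far_sighted = discounted_return(rewards, gamma=0.9)     # gamma^2 = 0.81
```

With a small $\gamma$ the delayed reward is almost invisible to the agent, which is why such policies are short-sighted. For rewards bounded by $r_{\max}$, the geometric series gives $|R_t| \le r_{\max} / (1 - \gamma)$.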
Example environment
The agent receives a reward of -0.001 at every step. When it reaches the goal or a pit, it obtains a reward of +1.0 or -1.0 respectively, and the episode is terminated.
The goal of reinforcement learning
• How can the agent quantify the desirability of intermediate states (where no reward, or no relevant reward, is given)?
• The difficulty is that the desirability of intermediate states depends on:
  • the concrete selection of actions AFTER being in such an intermediate state,
  • AND the desirability of subsequent intermediate states.
• The value function allows us to do this
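The value function
• Defined as:
  $$V^{\pi}(s) = E_{\pi}\{R_t \mid s_t = s\} = E_{\pi}\Big\{\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\Big|\, s_t = s\Big\}$$
• The value of a state $s$ is the expected return starting from that state $s$ and following policy $\pi$
• Satisfies the Bellman equations
• Bellman equation for $V^{\pi}$:
  $$V^{\pi}(s) = \sum_{a} \pi(s, a) \sum_{s'} \mathcal{P}^{a}_{ss'} \big[\mathcal{R}^{a}_{ss'} + \gamma V^{\pi}(s')\big]$$
  • This is a system of $|S|$ simultaneous linear equations
• Note that it is a recursive formulation of the value function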
Example value function
Calculating the value function
• If the model $\mathcal{P}^{a}_{ss'}$ and reward $\mathcal{R}^{a}_{ss'}$ are known, calculate $V^{\pi}(s)$ using iterative policy evaluation.
http://cs.stanford.edu/people/karpathy/reinforcejs/gridworld_dp.html
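Iterative policy evaluation repeatedly applies the Bellman equation as an update until the values stop changing. The following is a minimal sketch on a made-up four-state corridor (state 0 is a pit, state 3 is the goal) that borrows the reward scheme of the example environment; the corridor layout, the function names and the uniform random policy are assumptions for illustration only.

```python
GAMMA = 0.9
STATES = [0, 1, 2, 3]          # hypothetical corridor: 0 = pit, 3 = goal
TERMINAL = {0, 3}
ACTIONS = ["left", "right"]

def transitions(s, a):
    """Known model: list of (probability, next_state, reward) for (s, a)."""
    s2 = s - 1 if a == "left" else s + 1
    if s2 == 3:
        reward = 1.0
    elif s2 == 0:
        reward = -1.0
    else:
        reward = -0.001
    return [(1.0, s2, reward)]

def policy_evaluation(policy, gamma=GAMMA, tol=1e-10):
    """Sweep the Bellman equation for V^pi until the largest change is below tol."""
    V = {s: 0.0 for s in STATES}   # terminal states keep value 0
    while True:
        delta = 0.0
        for s in STATES:
            if s in TERMINAL:
                continue
            v_new = sum(
                policy[s][a] * p * (r + gamma * V[s2])
                for a in ACTIONS
                for p, s2, r in transitions(s, a)
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            return V

uniform_policy = {s: {"left": 0.5, "right": 0.5} for s in (1, 2)}
V = policy_evaluation(uniform_policy)
```

Under the uniform policy the state next to the pit ends up with a negative value and the state next to the goal with a positive one, as one would expect.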
Why value function?
• There exists a natural partial order on all possible policies:
  $$\pi' \ge \pi \ \text{ if and only if } \ V^{\pi'}(s) \ge V^{\pi}(s) \ \text{ for all } s \in S$$
• Definition: A policy $\pi'$ is called optimal if $\pi' \ge \pi$ for all policies $\pi$
• Existence of at least one optimal policy is guaranteed, and they satisfy the Bellman optimality equations.
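The action-value function
• Defined as:
  $$Q^{\pi}(s, a) = E_{\pi}\{R_t \mid s_t = s, a_t = a\} = E_{\pi}\Big\{\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\Big|\, s_t = s, a_t = a\Big\}$$
• This is called the “Q function”
• The value of taking action $a$ in state $s$ and following policy $\pi$ thereafter
• Also satisfies the Bellman equations:
  $$Q^{\pi}(s, a) = E_{\pi}\{r_{t+1} + \gamma V^{\pi}(s_{t+1}) \mid s_t = s, a_t = a\} = \sum_{s'} \mathcal{P}^{a}_{ss'} \big[\mathcal{R}^{a}_{ss'} + \gamma V^{\pi}(s')\big]$$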
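Finding an optimal policy
• Define a new policy $\pi'$ that is greedy with respect to $V^{\pi}$
  • For all states $s$: $\pi'(s) = \arg\max_a Q^{\pi}(s, a)$
  • This policy satisfies $Q^{\pi}(s, \pi'(s)) \ge V^{\pi}(s)$
  • Can be shown that:
    • $\pi' \ge \pi$ for $\gamma < 1$
    • Repeating evaluation and greedy improvement eventually converges to an optimal policy
• This works only if $V^{\pi}(s)$ can be calculated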
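Alternating policy evaluation with the greedy improvement step is usually called policy iteration. Below is a minimal sketch on a made-up five-state corridor with a known deterministic model; the environment, the deliberately bad starting policy and the fixed number of evaluation sweeps are assumptions chosen for the example.

```python
GAMMA = 0.9
N = 5                     # hypothetical corridor: state 0 = pit, state 4 = goal
ACTIONS = (-1, +1)        # move left / move right

def step_model(s, a):
    """Known deterministic model: returns (next_state, reward)."""
    s2 = s + a
    if s2 == N - 1:
        return s2, 1.0
    if s2 == 0:
        return s2, -1.0
    return s2, -0.001

def evaluate(policy, sweeps=200):
    """Iterative policy evaluation for a deterministic policy {state: action}."""
    V = [0.0] * N                       # terminal values stay at 0
    for _ in range(sweeps):
        for s in range(1, N - 1):
            s2, r = step_model(s, policy[s])
            V[s] = r + GAMMA * V[s2]
    return V

def q_value(V, s, a):
    """Q^pi(s, a) computed from V^pi and the model."""
    s2, r = step_model(s, a)
    return r + GAMMA * V[s2]

def policy_iteration():
    policy = {s: -1 for s in range(1, N - 1)}   # start by always walking to the pit
    while True:
        V = evaluate(policy)
        greedy = {s: max(ACTIONS, key=lambda a: q_value(V, s, a)) for s in policy}
        if greedy == policy:                    # no state changes its action: done
            return policy, V
        policy = greedy

optimal_policy, optimal_V = policy_iteration()
```

In this toy problem one improvement step is enough: the greedy policy moves right in every state, and a second evaluation confirms that no action changes.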
Other ways to calculate V/Q
• Monte Carlo policy evaluation
  • Sample one episode and update the value function for each state
  • $V(s_t) \leftarrow V(s_t) + \alpha\,[R_t - V(s_t)]$
  • Asymptotically converges to the true value function
• Temporal Difference (TD) learning
  • For each step of each episode:
    • Take action $a$, observe reward $r_{t+1}$ and next state $s_{t+1}$
    • $V(s_t) \leftarrow V(s_t) + \alpha\,(r_{t+1} + \gamma V(s_{t+1}) - V(s_t))$
    • The term in parentheses is the temporal difference
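Both updates need only sampled experience, not the model. The sketch below implements them as small functions and runs them on a hypothetical five-state random walk (uniform random policy, reward -1 at the left end, +1 at the right end, 0 otherwise, no discounting). The environment and the step size are assumptions for illustration; with these settings the true values of the three inner states are -0.5, 0 and 0.5.

```python
import random

def mc_update(V, episode, alpha, gamma):
    """Every-visit Monte Carlo: episode is a list of (state, reward after leaving it)."""
    ret = 0.0
    for state, reward in reversed(episode):
        ret = reward + gamma * ret              # return R_t that followed this visit
        V[state] += alpha * (ret - V[state])

def td0_update(V, state, reward, next_state, alpha, gamma):
    """TD(0): move V(s_t) towards r_{t+1} + gamma * V(s_{t+1})."""
    td_error = reward + gamma * V[next_state] - V[state]
    V[state] += alpha * td_error
    return td_error

def random_walk_episode(rng, n=5, start=2):
    """Uniform random policy on states 0..n-1; both ends are terminal."""
    s, episode = start, []
    while 0 < s < n - 1:
        s2 = s + rng.choice((-1, 1))
        r = 1.0 if s2 == n - 1 else (-1.0 if s2 == 0 else 0.0)
        episode.append((s, r, s2))
        s = s2
    return episode

rng = random.Random(0)
V_mc = [0.0] * 5
V_td = [0.0] * 5
for _ in range(20000):
    ep = random_walk_episode(rng)
    mc_update(V_mc, [(s, r) for s, r, _ in ep], alpha=0.01, gamma=1.0)
    for s, r, s2 in ep:
        td0_update(V_td, s, r, s2, alpha=0.01, gamma=1.0)
```

The Monte Carlo update has to wait for the end of the episode to know $R_t$, whereas the TD update can be applied after every single step because it bootstraps from the current estimate of $V(s_{t+1})$.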
Learning Q-function (SARSA)
• Q can be used to define a policy
  • Take action $a = \arg\max_{a'} Q(s, a')$ at every state with probability $1 - \epsilon$
  • With probability $\epsilon$ take a random action (exploration)
• Use temporal difference learning to learn the Q-function
  • For each step of each episode:
    • Take action $a$, observe reward $r_{t+1}$ and next state $s_{t+1}$
    • $Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\,(r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t))$
• The action $a_{t+1}$ used in the update is chosen by this same policy
  • Called SARSA
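A minimal tabular sketch of SARSA with an $\epsilon$-greedy policy follows. The five-state corridor (pit at state 0, goal at state 4, step reward -0.001), the hyperparameters and the function names are assumptions made for the example.

```python
import random

GAMMA, ALPHA, EPSILON = 0.9, 0.1, 0.1
N = 5                        # hypothetical corridor: 0 = pit, 4 = goal (terminal)
ACTIONS = (-1, +1)           # move left / move right

def env_step(s, a):
    """Returns (next_state, reward, done)."""
    s2 = s + a
    if s2 == N - 1:
        return s2, 1.0, True
    if s2 == 0:
        return s2, -1.0, True
    return s2, -0.001, False

def epsilon_greedy(Q, s, rng):
    """Random action with probability EPSILON, otherwise the greedy one."""
    if rng.random() < EPSILON:
        return rng.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(s, a)])

def sarsa(episodes=5000, seed=0):
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in range(N) for a in ACTIONS}
    for _ in range(episodes):
        s = N // 2
        a = epsilon_greedy(Q, s, rng)
        done = False
        while not done:
            s2, r, done = env_step(s, a)
            a2 = epsilon_greedy(Q, s2, rng)          # next action from the same policy
            target = r if done else r + GAMMA * Q[(s2, a2)]
            Q[(s, a)] += ALPHA * (target - Q[(s, a)])
            s, a = s2, a2
    return Q

Q_sarsa = sarsa()
```

The name comes from the quintuple $(s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1})$ used in each update. Because $a_{t+1}$ is the action the agent will actually take, the learned values describe the $\epsilon$-greedy policy itself, exploration included.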
Q-learning
• Use temporal difference learning to learn the Q-function
  • For each step of each episode:
    • Take action $a$, observe reward $r_{t+1}$ and next state $s_{t+1}$
    • $Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\,(r_{t+1} + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t))$
• For convergence to the optimal policy, Q-learning requires that rewards are sampled for each pair $(s, a)$ infinitely often.
• http://cs.stanford.edu/people/karpathy/reinforcejs/gridworld_td.html
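The only change relative to SARSA is the target, which uses the best next action instead of the one actually taken. Here is a minimal tabular sketch on a hypothetical five-state corridor; the environment, the hyperparameters and the $\epsilon$-greedy behaviour policy are assumptions for illustration.

```python
import random

N = 5                        # hypothetical corridor: 0 = pit, 4 = goal (terminal)
ACTIONS = (-1, +1)           # move left / move right

def env_step(s, a):
    """Returns (next_state, reward, done)."""
    s2 = s + a
    if s2 == N - 1:
        return s2, 1.0, True
    if s2 == 0:
        return s2, -1.0, True
    return s2, -0.001, False

def q_learning(episodes=2000, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in range(N) for a in ACTIONS}
    for _ in range(episodes):
        s, done = N // 2, False
        while not done:
            # Behaviour policy: epsilon-greedy, so every (s, a) keeps being tried.
            if rng.random() < epsilon:
                a = rng.choice(ACTIONS)
            else:
                a = max(ACTIONS, key=lambda b: Q[(s, b)])
            s2, r, done = env_step(s, a)
            # Target uses the greedy value of the next state, whatever is done next.
            best_next = 0.0 if done else max(Q[(s2, b)] for b in ACTIONS)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s2
    return Q

Q_star = q_learning()
greedy_policy = {s: max(ACTIONS, key=lambda b: Q_star[(s, b)]) for s in range(1, N - 1)}
```

In this deterministic toy problem the values of the "move right" actions approach 0.8081, 0.899 and 1.0, which are the optimal action values for $\gamma = 0.9$.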
Function approximation
• The Q-function can be approximated with a neural network (or any other function approximator)
• The targets for the network would be $r_{t+1} + \gamma \max_a Q(s_{t+1}, a)$
• Train the neural network with backpropagation
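To keep the example dependency-free, the sketch below swaps the neural network for the simplest "other function approximator": one linear weight vector per action. The feature vectors and hyperparameters are invented for the example. The target is treated as a constant when taking the gradient, which is the usual semi-gradient simplification.

```python
def q_values(weights, features):
    """Linear approximator: Q(s, a) = w_a . phi(s) for every action a."""
    return [sum(w * x for w, x in zip(w_a, features)) for w_a in weights]

def semi_gradient_q_update(weights, phi_s, a, r, phi_s2, done, alpha=0.1, gamma=0.9):
    """One gradient step on 0.5 * (target - Q(s, a))^2 with the target held fixed."""
    target = r if done else r + gamma * max(q_values(weights, phi_s2))
    td_error = target - q_values(weights, phi_s)[a]
    # For a linear model, dQ(s, a)/dw_a is just phi(s).
    weights[a] = [w + alpha * td_error * x for w, x in zip(weights[a], phi_s)]
    return td_error

# Two actions, two features per state (made-up numbers).
weights = [[0.0, 0.0], [0.0, 0.0]]
first_error = semi_gradient_q_update(weights, [1.0, 0.5], a=1, r=1.0,
                                     phi_s2=[0.0, 1.0], done=True)
second_error = semi_gradient_q_update(weights, [1.0, 0.5], a=0, r=0.0,
                                      phi_s2=[1.0, 0.5], done=False)
```

With a neural network the same target and error are used, and backpropagation computes the gradient with respect to all network weights instead of the single line marked above.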
The goal of reinforcement learning (repeated)
• Is to find a policy that maximizes the total expected reward
  • also called the “return”
  $$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$$
• $\gamma$ is called the “discounting factor”
• Small $\gamma$ produces short-sighted, large $\gamma$ far-sighted policies.
• $R_t$ is always finite if $\gamma < 1$ and the local rewards $r$ are from a bounded set of numbers.
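Policy Gradient
• Why not learn the policy directly?
• Define the cost function as the total expected reward:
  $$J(\theta) = E\Big\{\sum_{k=0}^{T} a_k r_k\Big\} = E\{r(\tau)\}$$
  • $a_k$ is some discounting factor
  • $r_k$ is the reward at step $k$
  • $\tau$ is a trajectory with final step $T$, and $r(\tau) = \sum_{k=0}^{T} a_k r_k$
• Learn this using gradient ascent:
  $$\theta_{t+1} = \theta_t + \eta \nabla_{\theta} J(\theta)$$
• Problems?
  • Cannot calculate the gradient of $J$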
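Policy Gradient
• It is possible to empirically estimate the gradient (Williams 1992):
  $$\nabla_{\theta} J(\theta) = E\{\nabla_{\theta} \log p_{\theta}(\tau)\,(r(\tau) - b)\} \approx \sum_{t=0}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t)\,(R_t - b)$$
  where the sum is evaluated on a sampled trajectory
• Uses the log-likelihood trick (or REINFORCE trick)
• The baseline $b$ is used to reduce the variance of the gradient estimator
• The baseline doesn't introduce bias
• Demo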
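The estimator can be sketched with a tabular softmax policy, for which $\nabla_{\theta} \log \pi_{\theta}(a \mid s)$ has the closed form "one-hot of the chosen action minus the action probabilities". The toy task below is invented for the example: episodes last three steps, the state is simply the time step, and action 1 earns a reward of 1 while action 0 earns nothing. The running-mean baseline is one simple choice among many, and returns are left undiscounted.

```python
import math
import random

T = 3          # episode length of the hypothetical task; the state is the time step
ETA = 0.1      # learning rate

def softmax(prefs):
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]

def grad_log_pi(prefs, action):
    """Gradient of log softmax(prefs)[action] w.r.t. prefs: onehot(action) - probs."""
    probs = softmax(prefs)
    return [(1.0 if i == action else 0.0) - p for i, p in enumerate(probs)]

def reinforce(episodes=3000, seed=0):
    rng = random.Random(seed)
    theta = [[0.0, 0.0] for _ in range(T)]      # action preferences per state
    baseline, n_seen = 0.0, 0
    for _ in range(episodes):
        # Sample one trajectory from the current policy.
        actions = [rng.choices((0, 1), weights=softmax(theta[t]))[0] for t in range(T)]
        rewards = [float(a) for a in actions]   # toy reward: 1 for action 1, else 0
        # Reward-to-go R_t for every step of the trajectory.
        returns, running = [0.0] * T, 0.0
        for t in reversed(range(T)):
            running += rewards[t]
            returns[t] = running
        # Gradient ascent step: sum_t grad log pi(a_t | s_t) * (R_t - b).
        for t in range(T):
            grad = grad_log_pi(theta[t], actions[t])
            theta[t] = [w + ETA * (returns[t] - baseline) * g
                        for w, g in zip(theta[t], grad)]
        # Update the baseline only afterwards, as a running mean of observed returns.
        for ret in returns:
            n_seen += 1
            baseline += (ret - baseline) / n_seen
    return theta

theta = reinforce()
prob_action_1 = [softmax(theta[t])[1] for t in range(T)]
```

The baseline does not bias the estimate because $E\{\nabla_{\theta} \log \pi_{\theta}(a \mid s)\} = 0$: the entries of `grad_log_pi` average to zero under the policy, so subtracting a constant changes only the variance.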
DQN and A3C
DQN
• Mnih, V. et al. Human-level control through deep reinforcement learning. Nature 518, 529–533 (2015).
• Uses a deep neural network to learn the Q-values
DQN: Two key ideas
• Experience replay:
  • Store earlier steps and apply Q-learning updates in random batches from this memory
• Target network:
  • The network used to compute the Q-learning targets is updated only once every C steps
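Both ideas are independent of the network itself, so they can be sketched without a deep learning library. In the snippet below a plain dictionary of action values stands in for the Q-network; the class and function names, the capacity and the example transitions are all hypothetical.

```python
import copy
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size memory of (s, a, r, s_next, done) transitions."""

    def __init__(self, capacity):
        self.memory = deque(maxlen=capacity)    # oldest transitions are dropped first

    def push(self, transition):
        self.memory.append(transition)

    def sample(self, batch_size, rng=random):
        """Random batch, which breaks the correlation between consecutive steps."""
        return rng.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)

def td_targets(batch, target_q, gamma):
    """Targets r + gamma * max_a Q_target(s', a); just r for terminal transitions."""
    return [r if done else r + gamma * max(target_q[s2])
            for (_s, _a, r, s2, done) in batch]

def sync_if_due(step, online_q, target_q, C):
    """Every C steps, replace the target network by a frozen copy of the online one."""
    return copy.deepcopy(online_q) if step % C == 0 else target_q

# A dict {state: [Q(s, a0), Q(s, a1)]} stands in for the network in this sketch.
online_q = {0: [0.0, 0.0], 1: [0.2, 0.5]}
target_q = copy.deepcopy(online_q)
buffer = ReplayBuffer(capacity=3)
for t in range(5):
    buffer.push((0, 1, 0.1 * t, 1, t == 4))     # only the last three are kept
batch = buffer.sample(2, random.Random(0))
targets = td_targets(list(buffer.memory), target_q, gamma=0.9)
```

Keeping the target copy fixed between synchronisations means the regression targets do not shift after every gradient step, which is the stabilising effect the second idea aims for.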
DQN
A3C
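• Mnih, V. et al. Asynchronous Methods for Deep Reinforcement Learning. arXiv:1602.01783 [cs] (2016).
• A3C: Asynchronous Advantage Actor Critic
• Uses policy gradient with a baseline that is the value function:
  $$\nabla_{\theta} J(\theta) \approx \sum_{t=0}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t)\,(R_t - V(s_t))$$
  • Actor: the policy $\pi_{\theta}$
  • Critic: the value function $V$
  • Advantage: the difference $R_t - V(s_t)$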
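The advantage term can be computed from a short rollout as sketched below. The function name and the numbers are invented, and bootstrapping the return from the critic's estimate of the state after the rollout is an assumption of this sketch. The asynchronous part (several workers updating shared parameters) is not shown.

```python
def returns_and_advantages(rewards, values, bootstrap_value, gamma):
    """rewards[t] follows state t, values[t] = V(s_t) from the critic.

    bootstrap_value is V of the state after the last step (0.0 if terminal).
    """
    returns, running = [], bootstrap_value
    for r in reversed(rewards):
        running = r + gamma * running           # R_t = r_{t+1} + gamma * R_{t+1}
        returns.append(running)
    returns.reverse()
    advantages = [R - v for R, v in zip(returns, values)]
    return returns, advantages

rollout_rewards = [0.0, 0.0, 1.0]
critic_values = [0.5, 0.6, 0.7]
returns, advantages = returns_and_advantages(rollout_rewards, critic_values,
                                             bootstrap_value=0.0, gamma=0.9)
# The actor is pushed along grad log pi(a_t | s_t) * advantages[t];
# the critic is trained to shrink (returns[t] - V(s_t))^2.
```

A positive advantage means the action turned out better than the critic expected, so its probability is increased; a negative one decreases it.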
A3C
Resources
• Book: Reinforcement Learning: An Introduction, Richard Sutton and Andrew Barto
  • Available online on Richard Sutton's website: http://www.incompleteideas.net/sutton/book/the-book-1st.html
• Course: Autonomously Learning Systems, IGI, TU Graz
  • 2016 website: http://www.igi.tugraz.at/lehre/Autonomously_learning_systems/WS16/
  • Next course in 2018
  • Lecture slides available there
• DQN: https://deepmind.com/research/dqn/
• OpenAI Gym: https://gym.openai.com/envs
• Deep Reinforcement Learning: Pong from Pixels (Andrej Karpathy): https://karpathy.github.io/2016/05/31/rl/
• Book: Deep Learning, Ian Goodfellow, Yoshua Bengio and Aaron Courville
  • Available online: http://www.deeplearningbook.org
• RLPy: https://rlpy.readthedocs.io/en/latest/ (Python 2.7 only)