ECE-517: Reinforcement Learning in Artificial Intelligence
Lecture 1: Course Logistics, Introduction
Dr. Itamar Arel
College of Engineering, Electrical Engineering and Computer Science Department
The University of Tennessee, Fall 2012
August 23, 2012
ECE-517 - Reinforcement Learning in AI
But first, a quick anonymous survey …
Outline
Course logistics and requirements
Course roadmap & outline
Introduction
Course Objectives
Introduce the concepts & principles governing reinforcement-based machine learning systems
Review fundamental theory:
Markov Decision Processes (MDPs)
Dynamic Programming (DP)
Practical systems; role of Neural Networks in NDP
RL learning schemes (Q-Learning, TD-Learning, etc.)
Limitations of existing techniques and how they can be improved
Discuss software and hardware implementation considerations
Long-term goal: to contribute to your understanding of the formalism, trends, and challenges in constructing RL-based agents
Course Prerequisites
A course on, or background in, probability theory is required
Matlab/C/C++ competency
A Matlab tutorial has been posted on the course website (under the schedule page)
Open-mindedness & imagination …
Course Assignments
2 small projects
Main goal – provide students with basic hands-on experience in ML behavioral simulation and result interpretation
Analysis complemented by simulation
MATLAB programming oriented
Reports should include all background, explanations, and results
5 problem assignment sets
Will cover the majority of the topics discussed in class
Assignments should be handed in before the beginning of the class
Final project
Each student/group is assigned a topic
Project report & in-class presentation
Sony AIBO Lab
Located at MK 630
6 Sony AIBO dog robots (3rd generation)
Local wireless network (for communicating with the dogs)
Code for lab project(s) will be written in Matlab
Interface has been prepared
Time slots should be coordinated with the instructor & TA
Textbooks & Reference Material
Lecture notes will be posted weekly on the course website (web.eecs.utk.edu/~itamar/courses/ECE-517), as well as …
Updated schedule
Assignment sets, sample codes
Grades
General announcements
Reading assignments will be posted on the schedule page
Textbook: R. Sutton and A. Barto, “Reinforcement Learning: An Introduction,” 1998. (available online!)
Grading policy & office hours
2 small projects – 25% (12.5 points each)
5 assignment sets – 25% (5 points each)
Midterm (in-class) – 20%
Final project – 30%
Instructor: Dr. Itamar Arel
Office hours: T/Tr 2:00 – 3:00 PM (MK 608)
My email: [email protected]
TA: Derek Rose ([email protected]) – office @ MK 606
Office hours: contact TA
Students are strongly encouraged to visit the course website (web.eecs.utk.edu/~itamar/courses/ECE-517) for announcements, lecture notes, updates, etc.
UTK Academic Honesty Statement
An essential feature of the University of Tennessee, Knoxville, is a commitment to maintaining an atmosphere of intellectual integrity and academic honesty. As a student of the university, I pledge that I will neither knowingly give nor receive any inappropriate assistance in academic work, thus affirming my own personal commitment to honor and integrity.
Bottom line: DO YOUR OWN WORK!
Understanding and constructing intelligence
What is intelligence? How do we define/evaluate it?
How can we design adaptive systems that optimize their performance as time goes by?
What are the limitations of RL-based algorithms?
How can artificial Neural Networks help us scale intelligent systems?
In what ways can knowledge be efficiently represented?
This course is NOT about …
Robotics
Machine learning (in the general sense)
Legacy AI – symbolic reasoning, logic, etc.
Image/vision/signal processing
Control systems theory
Why the course name “RL in AI”?
Course Outline & Roadmap
Introduction
Review of basic probability theory
Discrete-time/space probability theory
Discrete Markov Chains
Dynamic Programming
Markov Decision Processes (MDPs)
Partially Observable Markov Decision Processes (POMDPs)
Approximate Dynamic Programming (a.k.a. Reinforcement Learning)
Temporal Difference (TD) Learning, Planning
Midterm – Tuesday, Oct 9, 2012
Neuro-Dynamic Programming
Feedforward & Recurrent Neural Networks
Neuro-dynamic RL architecture
Applications and case studies
Final project presentations – Nov 27 – Dec 4, 2012
A detailed schedule is posted at the course website
Outline
Course logistics and requirements
Course outline & roadmap
Introduction
What is Machine Learning?
Discipline focusing on computer algorithms that learn to perform “intelligent” tasks
Learning is based on observation of data
Generally: learning to do better in the future based on what has been observed/experienced in the past
ML is a core subarea of AI, which also intersects with physics, statistics, theoretical CS, etc.
Examples of “ML” problems:
Optical character recognition
Face detection
Spoken language understanding
Customer segmentation
Weather prediction, etc.
Introduction
Why do we need good ML technology?
Human beings are lazy creatures …
Service robotics
$10B market in 2015 – Japan only!
Pattern recognition (speech, vision)
Data mining
Military applications
… many more
Many ML problems can be formulated as RL problems …
Introduction (cont.)
Learning by interacting with our environment is probably the first thing that occurs to us when we think about the nature of learning
Humans have no direct teachers
We do have a direct sensorimotor connection to the environment
We learn as we go along
Interaction with the environment teaches us what “works” and what doesn’t
We construct a “model” of our environment
This course explores a computational approach to learning from interaction with the environment
What is Reinforcement Learning?
Reinforcement learning is learning what to do – how to map situations to actions – in order to maximize a long-term objective function driven by rewards.
It is a form of learning without a supervisor: no labeled examples are provided
Two key components at the core of RL:
Trial-and-error – adapting internal representations, based on experience, to improve future performance
Delayed reward – actions are produced so as to yield long-term (not just short-term) rewards
The “agent” must be able to:
Sense its environment
Produce actions that can affect the environment
Have a goal (“momentary” cost metric) relating to its state
What is Reinforcement Learning? (cont.)
RL attempts to solve the Credit Assignment problem:
What is the long-term impact of an action taken now?
Unique to RL systems – a major challenge in ML
Necessitates an accurate model of the environment being controlled/interacted with
Something animals and humans do very well, and computers do very poorly
We’ll spend most of the semester formalizing solutions to this problem:
Philosophically
Computationally
Practically (implementation considerations)
The Big Picture
Artificial Intelligence ⊃ Machine Learning ⊃ Reinforcement Learning
Types of Machine Learning:
Supervised Learning: learn from labeled examples
Unsupervised Learning: process unlabeled examples
Example: clustering data into groups
Reinforcement Learning: learn from interaction
Defined by the problem
Many approaches are possible (including evolutionary)
Here we will focus on a particular family of approaches
Autonomous learning
Software vs. Hardware
Historically, ML has been on CS turf
Confinement to the Von Neumann architecture
Software limits scalability …
The human brain has ~10^11 processors operating at once
However, each runs at only ~150 Hz
It’s the massive parallelism that gives it its power
Even 256 processors is not “massive parallelism”
“Computer Engineering” perspective:
FPGA devices (reconfigurable computing)
GPUs
ASIC prospect
UTK/MIL group focus
Exploration vs. Exploitation
A fundamental trade-off in RL:
Exploitation of what worked in the past (to yield high reward)
Exploration of new, alternative action paths so as to learn how to make better action selections in the future
The dilemma is that neither exploration nor exploitation can be pursued exclusively without failing at the task
On a stochastic task, each action must be tried many times to gain a reliable estimate of its expected reward
We will review mathematical methods proposed to address this basic issue
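As an illustration of the trade-off, an ε-greedy rule is one common, simple scheme: exploit the best-known action most of the time, explore at random otherwise. This is only a sketch; the function name, the reward estimates, and the ε values below are invented for illustration.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick the greedy action with probability 1 - epsilon, a random one otherwise.
    q_values: list of estimated expected rewards, one entry per action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))                       # explore
    return max(range(len(q_values)), key=q_values.__getitem__)    # exploit

# With epsilon = 0 the rule always exploits the current estimates:
print(epsilon_greedy([0.2, 0.9, 0.5], epsilon=0.0))  # → 1
```

With ε > 0 every action keeps being tried occasionally, which is exactly what the stochastic-task remark above requires for reliable reward estimates.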
The Reinforcement Learning Framework
[Diagram: the agent (a.k.a. controller) sends actions to the environment (a.k.a. plant); the environment returns observations and a reward to the agent]
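The loop in the diagram can be sketched as follows. The one-state “environment” and the random agent here are purely hypothetical stand-ins, just to show the action → (observation, reward) cycle:

```python
import random

def run_episode(steps=5, seed=0):
    """Minimal agent-environment loop: the agent acts, the environment
    (a.k.a. plant) responds with an observation and a reward."""
    rng = random.Random(seed)
    total_reward = 0.0
    observation = 0                      # environment's initial observation
    for _ in range(steps):
        action = rng.choice([0, 1])      # agent (a.k.a. controller): pick an action
        reward = 1.0 if action == 1 else 0.0  # environment: emit a reward ...
        observation = action                  # ... and the next observation
        total_reward += reward           # the long-term objective the agent maximizes
    return total_reward

print(run_episode())
```

A learning agent would replace the random `rng.choice` with a policy that adapts based on the observations and rewards it receives.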
Some Examples of RL
A master chess player:
Planning – anticipating replies and counter-replies
Immediate, intuitive judgment
A mobile robot decides whether to enter a room or try to find its way back to a battery-charging station
Playing backgammon:
Obviously, a strategy is necessary
Some luck involved (stochastic game)
In all cases, the agent tries to achieve a goal despite uncertainty about its environment
The effect of an action cannot be fully predicted
Experience allows the agent to improve its performance over time
Origins of Reinforcement Learning
Artificial Intelligence
Control Theory (MDP)
Operations Research
Cognitive Science and Psychology
More recently, Neuroscience
RL has solid foundations and is a well-established research field
Elements of Reinforcement Learning
Beyond the agent and the environment, we have the following four main elements in RL:
1) Policy – defines the learning agent’s way of behaving at any given time. Roughly speaking, a policy is a mapping from (perceived) states of the environment to actions to be taken when in those states.
o Usually stochastic (adapts as you go along)
o Enough to determine the agent’s behavior
2) Reward function – defines the goal in an RL problem. Roughly speaking, it maps each perceived state (or state-action pair) of the environment to a single number, a reward, indicating the intrinsic desirability of that state.
o The agent’s goal is to maximize the reward over time
o May be stochastic
o Drives the policy employed and its adaptation
Elements of Reinforcement Learning (cont.)
3) Value function – whereas a reward function indicates what is good in an immediate sense, a value function specifies what is good in the long run.
Roughly speaking, the value of a state is the total amount of reward an agent can expect to accumulate over the future, starting from that state.
It allows the agent to look over the “horizon”
Actions are derived from value estimates, not rewards
We measure rewards, but we estimate and act upon values – this corresponds to strategic/long-term thinking
Intuitively a prerequisite for intelligence/intelligent control (plants vs. animals)
Obtaining a good value function is a key challenge in designing good RL systems
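To illustrate the reward/value distinction, consider a hypothetical three-state chain (all numbers invented): the start state yields no immediate reward, yet has high value because a large reward lies two steps ahead.

```python
# Hypothetical chain s0 -> s1 -> s2 (terminal). Reward received on
# reaching each state; s2 pays 10, everything else pays 0.
rewards = [0.0, 0.0, 10.0]
gamma = 0.9  # discount factor (an assumption of this sketch)

def value(state):
    """Value = discounted sum of the rewards collected after leaving `state`."""
    v, discount = 0.0, 1.0
    for s in range(state + 1, len(rewards)):
        v += discount * rewards[s]
        discount *= gamma
    return v

print(value(0))  # → 9.0 : zero immediate reward, but high value (0.9 * 10)
print(value(1))  # → 10.0: the reward arrives on the very next step
```

The agent acting greedily on `value` rather than on `rewards` is precisely the “look over the horizon” behavior described above.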
Elements of Reinforcement Learning (cont.)
4) Model – an observable entity that mimics the behavior of the environment.
For example, given a state and action, the model might predict the resultant next state and next reward
As we will later discuss, predictability and auto-associative memory are key attributes of the mammalian brain
Models are used for planning – any way of deciding on a course of action by considering possible future scenarios before they actually occur
Note that RL can work (sometimes very well) with an incomplete model
We’ll go over a range of model platforms to achieve the above
As a side note: RL is essentially an optimization problem. However, it is one of the many optimization problems that are extremely hard to solve optimally.
An Extended Example: Tic-Tac-Toe
Consider a classical tic-tac-toe game, in which the winner places three marks in a row, horizontally, vertically, or diagonally
Let’s assume:
We are playing against an imperfect player
Draws and losses are equally bad for us
Q: Can we design a player that will find imperfections in the opponent’s play and learn to maximize its chances of winning?
Classical machine learning schemes would never visit a state that has the potential to lead to a loss
We want to exploit the weaknesses of the opponent, so we may decide to visit a state that has the potential of leading to a loss
An Extended Example: Tic-Tac-Toe (cont.)
Using dynamic programming (DP), we can compute an optimal solution for any opponent
However, we would need specifications of the opponent (e.g. state-action probabilities)
Such information is usually unavailable to us
In RL, we estimate this information from experience
We later apply DP, or other sequential decision-making schemes, based on the model we obtained by experience
A policy tells the agent how to make its next move based on the state of the board
Winning probabilities can be derived by knowing the opponent
An Extended Example: Tic-Tac-Toe (cont.)
How do we solve this in RL …
Set up a table of numbers – one for each state of the game
This number will reflect the probability of winning from that particular state
This is treated as the state’s value, and the entire learned table denotes the value function
If V(a) > V(b), state a is preferred over state b
All states with three X’s in a row have win prob. 1
All states with three O’s in a row have win prob. 0
All other states are preset to prob. 0.5
When playing the game, we make the move that we predict would result in the state with the highest value (exploitation)
Occasionally, we choose randomly among the non-zero-valued states (exploratory moves)
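The table initialization described above can be sketched as follows. The board encoding (a 9-character string of 'X', 'O', and '-' read row by row) is an assumption of this sketch, not part of the slide:

```python
def initial_value(board):
    """Initial win-probability estimate for a board state, per the rules above:
    1.0 if X has three in a row, 0.0 if O does, 0.5 otherwise."""
    lines = [board[0:3], board[3:6], board[6:9],        # rows
             board[0::3], board[1::3], board[2::3],     # columns
             board[0] + board[4] + board[8],            # main diagonal
             board[2] + board[4] + board[6]]            # anti-diagonal
    if 'XXX' in lines:
        return 1.0   # we (playing X) have already won
    if 'OOO' in lines:
        return 0.0   # a loss – and draws are treated as equally bad
    return 0.5       # unknown outcome: preset to 0.5

print(initial_value('XXXOO----'))  # → 1.0
print(initial_value('OOOXX-X--'))  # → 0.0
print(initial_value('X--------'))  # → 0.5
```

Greedy (exploiting) play then simply picks the legal move leading to the successor board with the highest table entry.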
An Extended Example: Tic-Tac-Toe (cont.)
An Extended Example: Tic-Tac-Toe (cont.)
While playing, we update the values of the states
The current value of the earlier state is adjusted to be closer to the value of the later state:

V(s) ← V(s) + α [V(s′) − V(s)]

where α (0 < α < 1) is a learning parameter (step-size parameter), s is the state before the move, and s′ is the state after the move
This update rule is an example of the Temporal-Difference Learning method
This method performs quite well – it converges to the optimal policy (for a fixed opponent)
Can be adjusted to allow for slowly-changing opponents
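The TD update above, as a minimal sketch (the state labels and numbers are illustrative, not from the lecture):

```python
def td_update(V, s, s_next, alpha=0.1):
    """V(s) <- V(s) + alpha * (V(s') - V(s)):
    move the earlier state's value a fraction alpha toward the later state's value."""
    V[s] += alpha * (V[s_next] - V[s])
    return V

# Earlier state valued 0.5; the move led to a state that turned out to be a win (1.0):
V = {'before': 0.5, 'after': 1.0}
td_update(V, 'before', 'after', alpha=0.1)
print(V['before'])  # → 0.55, nudged toward the later state's value
```

Repeated over many games, these small corrections propagate win probabilities backward from terminal states through the table, which is how the convergence claimed above comes about.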
An Extended Example: Tic-Tac-Toe (cont.)
RL features encountered …
Emphasis on learning from interaction – in this case, with the opponent
Clear goal – correct planning takes into account delayed rewards
For example, setting up traps for a shortsighted opponent
No model of the opponent exists a priori
Although this example is a good one, RL methods can also …
be applied to infinite-horizon problems (not only ones with terminal states)
be applied to cases where there is no external adversary (e.g. a “game against nature”)
Backgammon example: ~10^20 states, using Neural Nets