Staffan Järn
Reinforcement Learning
and Genetic Algorithms
Reinforcement learning
• An intelligent learning algorithm
• Does not require the presence of a teacher
• The algorithm is given a reward (a reinforcement) for good actions
• The algorithm tries to figure out the best action to take in a given state, without knowing the final optimal solution
• The actions are based on rewards and penalties
Areas
• Robot control
• Elevator scheduling (search for patterns)
• Telecommunications (finding networks)
• Games (Chess, Backgammon)
• Financial trading
Cliffwalker program in Matlab
• Gridworld (4 x 12)
• The walker (agent) is supposed to find the shortest or safest way to the finish without falling into the cliff (blue area)
• Falling into the cliff gives 100 penalty points, and the walker has to start over again
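The gridworld described above can be sketched in a few lines of Python. This is a minimal illustration, not the Matlab Cliffwalker program itself. The slide does not say where the start, the finish, and the cliff cells are, so the layout below (start bottom-left, finish bottom-right, cliff along the bottom row between them, as in the Sutton and Barto textbook example) is an assumption, as are all names such as `step` and `is_cliff`.

```python
ROWS, COLS = 4, 12
START = (3, 0)    # assumed: bottom-left corner
GOAL = (3, 11)    # assumed: bottom-right corner
# Actions: 0 = up, 1 = right, 2 = down, 3 = left
MOVES = {0: (-1, 0), 1: (0, 1), 2: (1, 0), 3: (0, -1)}
STEP_COST = 1     # cost of an ordinary step
CLIFF_COST = 100  # penalty for falling into the cliff


def is_cliff(pos):
    """Assumed cliff cells: bottom row, strictly between start and goal."""
    row, col = pos
    return row == ROWS - 1 and 0 < col < COLS - 1


def step(pos, action):
    """Apply one action and return (next_pos, cost, done)."""
    d_row, d_col = MOVES[action]
    # Assumption: a move into a wall leaves the walker where it is.
    row = min(max(pos[0] + d_row, 0), ROWS - 1)
    col = min(max(pos[1] + d_col, 0), COLS - 1)
    nxt = (row, col)
    if is_cliff(nxt):
        # Falling costs 100 and the walker starts over.
        return START, CLIFF_COST, False
    return nxt, STEP_COST, nxt == GOAL
```

For example, `step(START, 1)` walks straight into the cliff and returns `((3, 0), 100, False)`.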
Q-learning algorithm
• A matrix, called the Q-matrix
• 48 x 4 matrix: 48 states (the 4 x 12 gridworld) x 4 actions (four directions)
• The Q-matrix contains a "price" for taking a certain action
• Initialized randomly in the beginning
• The walker has two options:
  • Take the optimal action, according to the smallest Q-value
  • Explore the gridworld by taking a random step (cannot walk into the wall)
• The Q-value is updated according to the equation every time the walker takes an action
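The choice between the optimal action and a random exploratory step can be sketched as follows. This is an illustrative sketch under assumptions: the names `init_q` and `choose_action` are hypothetical, the exploration probability is passed in as `epsilon`, and for brevity the random step is drawn from all four directions rather than excluding moves into the wall.

```python
import random

N_STATES, N_ACTIONS = 48, 4  # 4 x 12 gridworld, four directions


def init_q(rng):
    """Build a 48 x 4 Q-matrix with random initial prices in [0, 1)."""
    return [[rng.random() for _ in range(N_ACTIONS)] for _ in range(N_STATES)]


def choose_action(q, state, epsilon, rng):
    """With probability epsilon explore, otherwise take the cheapest action."""
    if rng.random() < epsilon:
        # Exploration: a random direction (wall check omitted in this sketch).
        return rng.randrange(N_ACTIONS)
    row = q[state]
    # Exploitation: Q-values are prices, so the smallest one is the best.
    return min(range(N_ACTIONS), key=row.__getitem__)
```

Because the Q-values here are costs rather than rewards, the greedy choice is the minimum, not the maximum found in most textbook formulations.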
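The new value in the Q-matrix for the previous state and the previously taken action is computed as follows: the old value multiplied by (1 - α), plus α multiplied by the sum of the cost of the step (usually 1; 100 for the cliff) and γ multiplied by the Q-value of the best action the walker can take from the new state (the optimal action):
$$Q(s, a) \leftarrow (1 - \alpha)\, Q(s, a) + \alpha \left[ c + \gamma \min_{a'} Q(s', a') \right]$$
Here the left-hand side is the new value, Q(s, a) on the right is the value from the previous step, c is the cost of the step, and the minimum over a' is the value of the best action in the new state s'.
α (alpha) = learning factor, γ (gamma) = reward factor (commonly called the discount factor)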
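The update described in words above translates almost directly into code. The sketch below is an assumption-laden illustration (the function name `q_learning_update` and the list-of-lists Q-matrix are hypothetical, not taken from the Cliffwalker program); it takes the best next action to be the one with the smallest Q-value, since the Q-matrix holds prices.

```python
def q_learning_update(q, state, action, cost, next_state, alpha, gamma):
    """Blend the old Q-value with the cost of the step plus the best next value."""
    best_next = min(q[next_state])  # value of the optimal action in the new state
    q[state][action] = (1 - alpha) * q[state][action] + alpha * (cost + gamma * best_next)
    return q[state][action]


# Worked example with two states and two actions.
q = [[5.0, 2.0],
     [3.0, 1.0]]
new_value = q_learning_update(q, state=0, action=0, cost=1, next_state=1,
                              alpha=0.1, gamma=1.0)
# 0.9 * 5.0 + 0.1 * (1 + 1.0 * 1.0) is approximately 4.7
```

With a small α the old value dominates, so a single bad step only nudges the price; repeated visits are what move it toward the true cost.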
SARSA algorithm
• Another way of updating the Q-matrix
• Not based on the next optimal move, but on the next actual move
• This means that it takes into account the risk of falling into the cliff, and will eventually arrive at a safer path
• A longer, but safer path
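The difference from Q-learning is a single term in the update. A minimal sketch, assuming the same cost-based Q-matrix as before and a hypothetical function name `sarsa_update`: the value of the action the walker actually takes next replaces the value of the best next action.

```python
def sarsa_update(q, state, action, cost, next_state, next_action, alpha, gamma):
    """Update using the action actually taken next, not the optimal one."""
    actual_next = q[next_state][next_action]
    q[state][action] = (1 - alpha) * q[state][action] + alpha * (cost + gamma * actual_next)
    return q[state][action]


# Same numbers as a Q-learning step, but the next move is exploratory
# (action 0 with price 3.0) rather than the cheapest one (price 1.0).
q = [[5.0, 2.0],
     [3.0, 1.0]]
new_value = sarsa_update(q, state=0, action=0, cost=1, next_state=1,
                         next_action=0, alpha=0.1, gamma=1.0)
# 0.9 * 5.0 + 0.1 * (1 + 1.0 * 3.0) is approximately 4.9
```

Because exploratory steps near the cliff sometimes end in the 100-point penalty, their prices leak into the neighbouring cells, which is what pushes SARSA toward the longer, safer path.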
The program...
is based on three parameters:
• learning factor (α): the higher the value, the faster the walker learns
• reward factor (γ): the higher the value, the more reward is given for good actions
• exploration factor (ε): a high value leads to more randomness
In the following example these values were used:
α = 0.1, γ = 1, ε = 0.05
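To show where the three parameters enter, here is a compact Q-learning training loop using the values above as defaults. It is a sketch under assumptions, not a port of the Matlab program: the grid layout (start bottom-left, finish bottom-right, cliff between them on the bottom row), the treatment of the finish as a terminal state with no future cost, the `max_steps` safety cap, and every function name are illustrative choices.

```python
import random

ROWS, COLS, N_ACTIONS = 4, 12, 4
START, GOAL = (3, 0), (3, 11)               # assumed layout
MOVES = [(-1, 0), (0, 1), (1, 0), (0, -1)]  # up, right, down, left


def step(pos, action):
    """Return (next_pos, cost, done); walls block movement."""
    row = min(max(pos[0] + MOVES[action][0], 0), ROWS - 1)
    col = min(max(pos[1] + MOVES[action][1], 0), COLS - 1)
    if row == ROWS - 1 and 0 < col < COLS - 1:  # assumed cliff cells
        return START, 100, False                # penalty, start over
    return (row, col), 1, (row, col) == GOAL


def train(episodes, alpha=0.1, gamma=1.0, epsilon=0.05, seed=0, max_steps=10_000):
    """Run Q-learning and return (Q-matrix, total cost of each walk)."""
    rng = random.Random(seed)
    q = [[rng.random() for _ in range(N_ACTIONS)] for _ in range(ROWS * COLS)]
    totals = []
    for _ in range(episodes):
        pos, total = START, 0
        for _ in range(max_steps):
            s = pos[0] * COLS + pos[1]
            if rng.random() < epsilon:   # exploration factor
                a = rng.randrange(N_ACTIONS)
            else:                        # cheapest known action
                a = min(range(N_ACTIONS), key=q[s].__getitem__)
            pos, cost, done = step(pos, a)
            s2 = pos[0] * COLS + pos[1]
            future = 0.0 if done else min(q[s2])
            # learning factor alpha, reward factor gamma
            q[s][a] = (1 - alpha) * q[s][a] + alpha * (cost + gamma * future)
            total += cost
            if done:
                break
        totals.append(total)
    return q, totals
```

Calling `q, totals = train(400)` gives one total cost per walk, which can be plotted against the walk number to see how quickly the walker learns under a given parameter setting.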
Results
[Figures: four plots of the walker's path in the gridworld (horizontal axis 2 to 12, vertical axis 0.5 to 4.5)]
Fig 1) Q-learning, the 100th walk
Fig 2) Q-learning, optimal solution
Fig 3) SARSA, the 100th walk
Fig 4) SARSA, optimal solution
Results
[Figure: steps and penalty (vertical axis, 0 to 2000) versus simulations (horizontal axis, 0 to 400). Legend: number of steps; penalty; random steps over the cliff]
Genetic Algorithms
GAs can be applied to the Cliffwalker problem by:
• replacing the reinforcement learning algorithm with a GA to find the best path in the gridworld, or
• finding the best learning parameters for the reinforcement learning algorithm
The conclusion is that GAs will probably not improve the results noticeably compared with reinforcement learning algorithms, since the best parameters are found very quickly.
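The second option, searching for good learning parameters, could look roughly like the sketch below. It is a generic, minimal genetic algorithm (truncation selection, uniform crossover, Gaussian mutation) and not something taken from the slides; the function name `evolve`, its arguments, and the stand-in fitness function in the usage example are all hypothetical. In practice the fitness would be something like the average cost of the last walks of a training run with the candidate α, γ, and ε.

```python
import random


def evolve(fitness, bounds, pop_size=20, generations=30, mutation_scale=0.1, seed=0):
    """Minimise fitness(params); bounds is a list of (low, high) per parameter."""
    rng = random.Random(seed)
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness)
        parents = pop[: pop_size // 2]  # the best half survives unchanged
        children = []
        while len(parents) + len(children) < pop_size:
            mum, dad = rng.sample(parents, 2)
            # Uniform crossover: each gene comes from one of the two parents.
            child = [rng.choice(pair) for pair in zip(mum, dad)]
            # Gaussian mutation, clipped to the allowed range.
            child = [min(max(g + rng.gauss(0, mutation_scale * (hi - lo)), lo), hi)
                     for g, (lo, hi) in zip(child, bounds)]
            children.append(child)
        pop = parents + children
    return min(pop, key=fitness)


# Usage with a stand-in fitness whose minimum is at (0.1, 1.0, 0.05).
def toy_fitness(p):
    return (p[0] - 0.1) ** 2 + (p[1] - 1.0) ** 2 + (p[2] - 0.05) ** 2


best = evolve(toy_fitness, bounds=[(0.0, 1.0)] * 3)
```

Each fitness evaluation would require a full training run, so a search like this costs far more than a single run with reasonable hand-picked parameters.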
Sources
• Reinforcement Learning (pdf), Jonas Waller [2005]
• Cliffwalker program, Jonas Waller [2005]
• Reinforcement Learning: An Introduction, Sutton and Barto