staffan järn. intelligent learning algortithm doesn’t require the presence of a teacher the...

Staffan Järn

Reinforcement Learning and Genetic Algorithms

Reinforcement learning

• Intelligent learning algorithm

• Doesn’t require the presence of a teacher

• The algorithm is given a reward (a reinforcement) for good actions

• The algorithm tries to figure out the best action to take in a given state, without knowing the final optimal solution

• The actions are based on rewards and penalties

Areas

• Robot control

• Elevator scheduling (search for patterns)

• Telecommunications (finding networks)

• Games (Chess, Backgammon)

• Financial trading

Cliffwalker program in Matlab

• Gridworld (4 x 12)

• The walker (agent) is supposed to find the shortest or safest way to the finish without falling into the cliff (blue area)

• Falling into the cliff gives 100 penalty points, and the walker has to start over again
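The gridworld rules above can be sketched in a few lines. This is a minimal illustration in Python, not the Matlab Cliffwalker program itself. The positions of the start, finish and cliff cells are assumptions (bottom row, as in the cliff-walking example in Sutton and Barto), and so are the names `step`, `MOVES` and `CLIFF`, the cost of 1 per ordinary step, and the choice to keep the walker in place when it bumps into a wall.

```python
ROWS, COLS = 4, 12
START, FINISH = (3, 0), (3, 11)                 # assumed: bottom-left and bottom-right
CLIFF = {(3, c) for c in range(1, 11)}          # assumed: bottom row between them
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(state, action):
    """Return (next_state, cost, done) for one move of the walker."""
    d_row, d_col = MOVES[action]
    # Clamp to the grid: a move into a wall leaves the walker where it is.
    row = min(max(state[0] + d_row, 0), ROWS - 1)
    col = min(max(state[1] + d_col, 0), COLS - 1)
    if (row, col) in CLIFF:
        return START, 100, False                # 100 penalty points, start over
    return (row, col), 1, (row, col) == FINISH
```

For example, `step(START, "right")` walks straight into the assumed cliff and sends the walker back to the start with a cost of 100.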

Q-learning algorithm

• Matrix, called the Q-matrix

• 48 x 4 matrix: 48 states (12 x 4 gridworld) x 4 actions (four directions)

• The Q-matrix contains a "price" for taking a certain action

• Initialized randomly in the beginning

• The walker has two options:

• Take the optimal action, according to the smallest Q-value

• Explore the gridworld by taking a random step (cannot walk into the wall)
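The choice between the two options can be sketched as an epsilon-greedy rule. This is an illustrative Python sketch, not code from the Cliffwalker program. The function name `choose_action` and its arguments are hypothetical, and the restriction that a random step cannot go into a wall is left out for brevity.

```python
import random

def choose_action(q_row, epsilon, rng=random):
    """Pick an action index for one state.

    q_row   -- the four Q-values ("prices") stored for this state
    epsilon -- probability of taking a random, exploratory step
    """
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))                    # explore
    # Exploit: the optimal action is the one with the smallest Q-value.
    return min(range(len(q_row)), key=q_row.__getitem__)
```

With `epsilon=0` the walker always takes the cheapest action; with `epsilon=1` every step is random.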

The Q-value is updated according to the following equation every time the walker takes an action:

$$Q(s,a) \leftarrow (1-\alpha)\,Q(s,a) + \alpha\,\bigl[c + \gamma \min_{a'} Q(s',a')\bigr]$$

In words: the new value in the Q-matrix for the previous state s and the previously taken action a is the old value multiplied by (1-α), plus α multiplied by the sum of the cost c of taking the step (usually 1, cliff 100) and γ multiplied by the Q-value of the best action the walker can take from the new state s' (the optimal action).

α = learning factor, γ = reward factor (commonly called the discount factor)
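The update described above can be written as a short function. This is a sketch in Python under stated assumptions: the Q-matrix is a list of 48 rows with 4 values each, states and actions are integer indices, and the name `q_learning_update` is hypothetical. Because the Q-matrix stores a price, the best next action is the one with the minimum value.

```python
def q_learning_update(Q, state, action, cost, next_state, alpha, gamma):
    """Update Q[state][action] in place and return the new value."""
    best_next = min(Q[next_state])          # price of the optimal next action
    Q[state][action] = ((1 - alpha) * Q[state][action]
                        + alpha * (cost + gamma * best_next))
    return Q[state][action]
```

As a worked example, with an old value of 10, a step cost of 1, a best next price of 2, α = 0.1 and γ = 1, the new value is 0.9 · 10 + 0.1 · (1 + 2) = 9.3.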

SARSA-algorithm

• Another way of updating the Q-matrix

• Not based on the next optimal move, but on the next actual move

• This means that it takes into account the risk of falling into the cliff, and will eventually arrive at a safer path

• Longer, but safer path
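The difference from Q-learning is only in which next value is used. A minimal Python sketch, assuming the same list-of-rows Q-matrix with integer state and action indices (the name `sarsa_update` is hypothetical):

```python
def sarsa_update(Q, state, action, cost, next_state, next_action, alpha, gamma):
    """Update Q[state][action] in place using the next move actually taken."""
    actual_next = Q[next_state][next_action]   # not min(Q[next_state])
    Q[state][action] = ((1 - alpha) * Q[state][action]
                        + alpha * (cost + gamma * actual_next))
    return Q[state][action]
```

Because exploratory steps near the cliff are sometimes the actual next move, their high price feeds back into the cells along the edge, which is what pushes SARSA towards the longer but safer path.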

The program...

is based on 3 parameters:

• learning factor (α): the higher, the faster the walker learns

• reward factor (γ): the higher, the more reward is given for good actions

• exploration factor (ε): a high value leads to more randomness

In the following example these values were used: α = 0.1, γ = 1, ε = 0.05
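To show where the three parameters enter, here is a compact Q-learning training loop in Python. It is a sketch under assumptions, not the Matlab program: the cliff position, the `train` function, the per-walk step cap `max_steps`, the fixed random seed, and the treatment of the finish cell as having zero future cost are all choices made for this illustration.

```python
import random

ROWS, COLS = 4, 12
START, FINISH = (3, 0), (3, 11)
CLIFF = {(3, c) for c in range(1, 11)}          # assumed cliff position
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]      # up, down, left, right

def step(state, move):
    row = min(max(state[0] + move[0], 0), ROWS - 1)
    col = min(max(state[1] + move[1], 0), COLS - 1)
    if (row, col) in CLIFF:
        return START, 100                       # penalty, walker starts over
    return (row, col), 1

def train(walks, alpha=0.1, gamma=1.0, epsilon=0.05, seed=0, max_steps=10_000):
    rng = random.Random(seed)
    # Q-matrix: four random "prices" for each of the 48 grid cells.
    Q = {(r, c): [rng.random() for _ in MOVES]
         for r in range(ROWS) for c in range(COLS)}
    costs = []
    for _ in range(walks):
        state, total = START, 0
        for _ in range(max_steps):
            if rng.random() < epsilon:          # exploration factor
                a = rng.randrange(len(MOVES))
            else:
                a = min(range(len(MOVES)), key=Q[state].__getitem__)
            nxt, cost = step(state, MOVES[a])
            # Assumption: no further cost once the finish is reached.
            target = cost if nxt == FINISH else cost + gamma * min(Q[nxt])
            Q[state][a] = (1 - alpha) * Q[state][a] + alpha * target
            state, total = nxt, total + cost
            if state == FINISH:
                break
        costs.append(total)
    return Q, costs
```

Calling `train(400)` returns the learned Q-matrix and the total cost of each walk, which is the kind of per-simulation quantity plotted in the results.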

Results

[Figures: four plots of the walker's path in the gridworld (x axis 2 to 12, y axis 0.5 to 4.5)]

Fig 1) Q-learning, the 100th walk

Fig 2) Q-learning, optimal solution

Fig 3) SARSA, the 100th walk

Fig 4) SARSA, optimal solution

Results

[Plot: steps and penalty (y axis, 0 to 2000) against simulations (x axis, 0 to 400). Labels: Number of steps; Penalty; Random steps over the cliff]

Genetic Algorithms

GAs can be applied to the Cliffwalker problem by:

• replacing the reinforcement learning algorithm with a GA to find the best path in the gridworld, or

• finding the best learning parameters for the reinforcement learning algorithm

The conclusion is that GAs will probably not improve the results noticeably compared to the reinforcement learning algorithms, since the best parameters are found very quickly.

Sources

• Reinforcement Learning (pdf), Jonas Waller [2005]

• Cliffwalker program, Jonas Waller [2005]

• Reinforcement Learning: An Introduction, Sutton and Barto
