Posted on 13-Jan-2022
UNIVERSITY OF TENNESSEE
Implementation of Dot & Box Game Agent using Reinforcement Learning Technique
Submitted By: Md. Badruddoja Majumder (Student ID# 000427702), Gangotree Chakma (Student ID# 000428264)
1. Introduction:
Machine learning is a powerful tool for solving maze-like and analytical problems efficiently. It is useful when conventional algorithms would be too complex to develop, and also in cases where the environment is not static. This adaptability makes it well suited to new challenges. In our project we have used two machine learning algorithms to create a Dots and Boxes game agent.
2. Background of the project:
The project is based on Dots & Boxes, a simple game usually played with paper and pencil. It was first published by Édouard Lucas in 1889. Typically the game board consists of nine squares, each side bounded by a pair of dots, as shown in figure 1a; however, game boards of arbitrary size can be formed. The game is usually played by two players, but there is no upper limit on the number of players that can participate. A player may draw a vertical or horizontal line between two adjacent dots if no line exists between them already. Each time four lines form a small box, a point is awarded to the player who drew the last line of that box, and that player must then draw another line. The player with the most points when all boxes in the grid have been completed wins. If the game ends in a tie, the player who drew the first line loses.
We have implemented the Dots and Boxes game using two reinforcement learning algorithms: the Monte Carlo method and the Q-learning method.
2.1. Monte Carlo Method: The Monte Carlo method is an episodic reinforcement learning technique: the agent plays a complete game, records the state-action pairs it visits, and, once the episode ends, updates the values of those pairs from the total reward observed in that episode.
2.2. Q Learning Method: Q-learning is a reinforcement learning algorithm developed by Chris Watkins in 1989. Its main advantage is that it is simple and thus easy to implement; the core update rule consists of only one line of code. In Q-learning, almost any problem can be organized into a set of situations called states. For instance, in our case each different set of drawn lines makes up a different state, and the legal moves available in a state are called actions. The simplest form of Q-learning stores a value for each state-action pair in a matrix or table. The algorithm first checks the Q-value of every state it can reach in one step. It then takes the maximum of these future values and incorporates it into the current Q-value. When feedback is given, the only value updated is the Q-value of the state-action pair that produced the feedback in question. The rule for updating Q-values is shown below:
Q(s_t, a_t) ← Q(s_t, a_t) + α [r_{t+1} + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t)]   [1]
Here, s_t and a_t correspond to the state and action at a given time. The fact that only one value is updated when feedback is given gives Q-learning an interesting property: it takes a while for the feedback to propagate backwards through the table. The next time a certain state-action pair is evaluated, its value is updated with regard to the states it can lead to directly. Through repeated iterations, the algorithm converges to a single best, stable solution.
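Equation [1] can be sketched as a one-line tabular update in Python. This is a minimal illustration, not the project's actual code; the learning rate, discount factor, and state encoding are assumed values.

```python
# Hedged sketch of the tabular Q-learning update in equation [1].
# ALPHA (learning rate) and GAMMA (discount factor) are illustrative
# assumptions, as is the tuple-of-bits state encoding.
from collections import defaultdict

ALPHA, GAMMA = 0.1, 0.9

Q = defaultdict(float)  # Q-table: (state, action) -> value, default 0

def q_update(state, action, reward, next_state, next_actions):
    """Apply Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]."""
    best_next = max((Q[(next_state, a)] for a in next_actions), default=0.0)
    Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])

# Example: one update from an empty 12-line board after drawing line 4 (index 3)
s = (0,) * 12
s_next = tuple(1 if i == 3 else 0 for i in range(12))
q_update(s, 3, 0.0, s_next, [i for i in range(12) if s_next[i] == 0])
```

Because the table defaults to zero, early updates mostly propagate the terminal reward backwards one step at a time, which matches the slow-feedback behaviour described above.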
3. Project Design:
3.1. Design objective:
Dots & Boxes is a traditional board game normally played by two people at a time. Often one may not have anybody to play with, which calls for a computer player to act as an opponent. To make the game enjoyable, the computer player should be appropriately skilled. Using two reinforcement learning algorithms, the Monte Carlo method and Q-learning, to create a computer player, the aim is to analyze the performance and efficiency of this player against different opponents. The project is designed to give an overview of how these opponents affect and improve the rate of progress and the end result of the reinforcement learning agent.
3.2. Design Challenges:
One of the most important design challenges was the size of the state space, i.e., the dimension of the Dots and Boxes matrix. The states are defined as game configurations, where each line is unique and can be either drawn or not drawn. In a normal-size Dots & Boxes game, the grid consists of 16 dots, so it can contain up to 24 lines. The number of states is thus 2^24 (about 16 million). An action is defined as drawing a line, so there are 24 different actions to choose from. Not every line may be drawn in every state, however, as some lines have already been drawn; the mean number of lines available in a state is 12. Hence the number of state-action pairs is 2^24 × 12 (about 200 million). Since it is difficult to work with such a huge amount of data, we chose a grid of 9 dots and 4 boxes, which reduces the number of state-action pairs to 2^12 × 6 (about 25 thousand).
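The counts above can be verified with a quick calculation (an illustrative check, using the averages of 12 and 6 legal moves per state quoted in the text):

```python
# State-space sizes for the full 4x4-dot board and the reduced 3x3-dot board.
full_states = 2 ** 24            # 24 possible lines, each drawn or not
full_pairs = full_states * 12    # ~12 legal moves per state on average
small_pairs = (2 ** 12) * 6      # 12 lines, ~6 legal moves on average

print(full_states)   # 16777216  (~16 million)
print(full_pairs)    # 201326592 (~200 million)
print(small_pairs)   # 24576     (~25 thousand)
```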
Another challenge was the exploration phase of the agent. The design lets the agent explore all possible states so that it can determine the highest-valued states, and thus it arrives at a better solution.
3.3. Technical Approach:
To solve the problem with reinforcement learning algorithms, we have defined the states, agents, actions, and the board symmetry using the following methods.
3.3.1. State: The whole board configuration represents the game's state. All the lines that can be drawn on the board by connecting two adjacent dots are numbered 1, 2, 3, ..., 12 (for a 3-by-3 board). Every drawn line is represented as a 1 at the corresponding index of a row matrix.
Figure: States
This row matrix is essentially the state of the game. If no lines are drawn yet, i.e., at the beginning of the game, the state is S = [0 0 0 0 0 0 0 0 0 0 0 0]. If all of the lines are already drawn, i.e., at the end of the game, the state is S = [1 1 1 1 1 1 1 1 1 1 1 1]. The total number of possible states is 2^12.
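The state representation described above can be sketched as a 12-entry 0/1 vector; the helper names below are illustrative assumptions, not the report's actual code:

```python
# Sketch of the state representation: a 12-entry 0/1 vector, one entry per
# numbered line on the 3x3-dot board (line numbers are 1-based as in the text).
def empty_state():
    """State at the start of the game: no lines drawn."""
    return [0] * 12

def draw_line(state, line):
    """Return a new state with the given line (1-12) marked as drawn."""
    new_state = list(state)
    new_state[line - 1] = 1
    return new_state

s = empty_state()
s = draw_line(s, 1)
s = draw_line(s, 12)
print(s)   # [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
```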
3.3.2. Actions: In a particular game state, the available lines form the action space for that state. If a game is in state S = [1 0 1 1 0 0 1 1 1 1 0 0], then the available actions are drawing line 2, 5, 6, 11, or 12. At the beginning of the game 12 actions are possible, and at the end of the game no more actions remain. The average number of state-action pairs for this 3-by-3 configuration of the game is 2^12 × 6 = 24,576.
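Extracting the action space from a state is then just a matter of listing the undrawn lines. A minimal sketch, consistent with the example state S above:

```python
# The action space of a state is the set of lines not yet drawn.
def available_actions(state):
    """Return the 1-based line numbers that may still be drawn."""
    return [i + 1 for i, drawn in enumerate(state) if drawn == 0]

S = [1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0]
print(available_actions(S))   # [2, 5, 6, 11, 12]
```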
3.3.3. Agents: The game is played by two agents. These agents can be of three types.
i. Random agent
ii. Reinforcement Learning agent
iii. Graphical agent
3.3.4. Symmetry: The board has four axes of symmetry, so every state has four additional symmetrical states. If we number all the lines of the board in the four different ways shown, the resulting states all represent the same position. We can leverage this idea to make the agent learn quickly about the states of the board: whenever the agent visits a state, it updates all of the symmetrical states as well.
Figure: Symmetry
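Each symmetry of the board induces a permutation of the 12 line indices, so one visited state maps directly onto its equivalent states. The sketch below assumes a particular line numbering (horizontal lines 0-5 row by row, then vertical lines 6-11 row by row); the report's actual numbering determines the real permutations, so the table here is an illustrative assumption.

```python
# Left-right mirror of the 3x3-dot board under an assumed line numbering:
# horizontal segments indexed 0-5 (row-major), vertical segments 6-11.
# MIRROR[i] is the index that line i maps to under the reflection.
MIRROR = [1, 0, 3, 2, 5, 4, 8, 7, 6, 11, 10, 9]

def mirror_state(state):
    """Return the left-right mirrored version of a 12-entry state vector."""
    out = [0] * 12
    for i, drawn in enumerate(state):
        out[MIRROR[i]] = drawn
    return out

s = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]   # only the first line drawn
print(mirror_state(s))                      # its mirror image has line 2 drawn
```

Whenever the learner updates a value for (state, action), it can apply the same update to (mirror_state(state), MIRROR[action]) and likewise for the other symmetries, which is how one visit propagates to several equivalent states.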
3.4. Algorithm: The flowchart of the working principle of the Monte Carlo method is given below.
Figure: Flowchart of Monte Carlo Method
[Flowchart summary: the agent plays a move and increments the move counter. If Reward > 0, the reward is saved and the agent moves again. If the move counter equals the number of lines possible to draw on the board, the episode ends and all state-action values visited by the agent are updated using the Monte Carlo method. Otherwise the opponent plays a move, the move counter is incremented, and play passes back according to whether the agent played the previous move.]
The flowchart of the working principle of the Q-learning method is given below.
Figure: Flowchart of Q learning Method
[Flowchart summary: the agent plays a move and increments the move counter. If Reward > 0, the reward is saved and the agent moves again. At the end of the episode, all state-action values visited by the agent are updated using the Q-learning method. Otherwise the opponent plays a move, the move counter is incremented, and play passes back according to whether the agent played the previous move.]
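Both flowcharts share the same turn logic: the two players alternate, but a player who completes a box moves again, and the episode ends when every line is drawn. A minimal sketch of that loop, with the box-detection helper left as a stub assumption (the real game checks the four sides of each box):

```python
# Sketch of the episode loop shared by both flowcharts. Players alternate
# unless a move completes a box, in which case the same player moves again.
import random

def play_episode(num_lines=12, completes_box=lambda state, line: False):
    """Play out one game, returning the sequence of (player, line) moves."""
    state = [0] * num_lines
    player = 0                   # 0 = agent, 1 = opponent
    history = []
    while 0 in state:            # episode ends when every line is drawn
        line = random.choice([i for i, d in enumerate(state) if d == 0])
        state[line] = 1
        history.append((player, line))
        if not completes_box(state, line):
            player = 1 - player  # no box completed: the turn passes
    return history

moves = play_episode()
print(len(moves))   # 12: every line is drawn exactly once
```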
3.5. A typical Game flow:
4. Experimental Results: Training with self-play: After 20,000 training games using self-play, the agent visited only around 3,500 states, whereas it visited close to 15,000 states when playing against a random opponent without leveraging symmetry.
Since the number of visited states is very low compared to the overall state-action space, the agent does not know about the unvisited states at all. In this case, the agent's performance against a strong player is expected to be low. To verify this, we arranged 10 games in which the agent played against a human opponent. The results are as follows:
Won: 1
Lost: 7
Drawn: 2
Figure: Winning percentage of the agent in every 10 games against a human opponent. These games were played after each round of training games against a random opponent.
Figure 2: States visited by the agent after every 1000 games.
Figure 3: Number of visited states after every 1000 games played against a copy of itself.