HEBREW UNIVERSITY OF JERUSALEM
INTRODUCTION TO ARTIFICIAL INTELLIGENCE
IMPLEMENTING
REINFORCEMENT LEARNING
FOR A HOCKEY TEAM
WRITERS:
YONI NIR
MARCH 2010
INDEX
Hockey Game
  Introduction and Features
  Our Place in the Game
Given Implementation of AI
Our Implementation of the team AI
  Reinforcement Learning
  Reinforcement Learning - Approaches
    Individual Learning – Un-Coached
    Group Learning – Coached
  State Abstraction
    Challenges
      Response Times
      Number of States
      State Size
    Solution
  Policy
    Q-Function Representation
    Forming the Best Policy
      Challenges
        Update Occasions
        Reward Calculation
    Implementation
Results
  Learning Process
  Choice of Action Improvement
    Q-Value Inspection
      Development of Q-Values
  Did We Win?
    Un-Coached Team
    Coached Team
    Approaches Comparison
Conclusions
HOCKEY GAME
INTRODUCTION AND FEATURES
AIsHockey is a simplified hockey computer game, written in Java by Adjunct
Professor Jouni Smed from the University of Turku [1] in Finland.
A hockey game has two teams, each with six players: goalie, right defense,
left defense, center, right wing and left wing. The objective of the game is to
score as many goals as possible while guarding the home team's goal.
Important events in the game include: goal, offside, icing (a certain hockey
violation), interfering with the goalie, keeping the puck, and more. When an
event occurs, the game stops – then the match is restarted from one of the
faceoff locations on the rink.
The rules of hockey are not followed exactly in the game engine; for instance,
player penalties do not occur. However, most rules are observed.
The implementation consists of a server and two clients, each an autonomous
process. The server multicasts the game status to the clients, who send
desired actions to the server according to the game status.
[1] http://www.utu.fi/
The representation of the state of the game received by the client describes
the following variables: players' positions and their heading angles, the puck's
position and heading angle, score, time, goal announcement, and rule-violation type.
The actions available to players include shooting/passing the puck, dashing
(advancing) somewhere, braking, or keeping the puck (goalie). Certain
actions need an angle and power value.
The game can be run in two configurations:
1. Real time (synchronized): the clients' actions are performed by the
server in synchronization with a real time clock. This mode may be run
with a GUI, displaying the full game with animated players, or in
console mode only, displaying the score on the event of a goal.
2. Unsynchronized: the actions are performed instantly, upon their
reception by the server, allowing the game to run as fast as possible.
The GUI is not supported in this mode.
OUR PLACE IN THE GAME
Our objective was to write smart clients that perform the best action in each
state. As stated earlier, a client consists of six hockey players. Each player
runs two threads: one communicates with the server, while the other extracts
information about the game state and chooses actions based on it. Everything
runs concurrently across these threads.
The main mission was implementing the react() method. A quick description
of its flow: abstracting the game state → investigating the state → choosing
the best action based on the state and any additional knowledge →
performing the best action.
A simple implementation may look like:

public class MyAI extends AI implements Constants {
    public void react() {
        if (isPuckWithinReach()) {
            // we have the puck: face the opponent's goal and shoot
            head(headingTo(0.0, THEIR_GOAL_LINE));
            brake(1.0);
            shoot(0.4);
        } else {
            // otherwise chase the puck at full speed
            dash(1.0);
            head(headingTo(puck()));
        }
    }
}
Each player in the game may run a different class, allowing different AI
strategies for the goalie and the left wing player, for example.
GIVEN IMPLEMENTATION OF AI
Jouni Smed implemented a simple but killer algorithm for playing the game.
Three classes were provided: GoalieAI.java, DefenseAI.java and
ForwardAI.java.
Below are simplified versions of the players' react() method.
The goalie always stands in a strategic location, passes the puck to another
player, or keeps the puck:
GoalieAI.java

// stand in the goal area, between the puck and the goal
if (isPuckWithinReach()) {
    if (ourMinY < theirMinY) {
        head(headingTo(ourClosestPlayer));
        shoot(1.0);
    } else {
        keepPuck();
    }
}
The defensive player passes to his forward player, guards an opponent
player, dashes toward the puck or stands in a strategic location:
DefenseAI.java

if (isPuckWithinReach()) {
    head(headingTo(player(OUR_LEFT_WING)));
    brake(1.0);
    shoot(0.7);
} else {
    if (player(OPPONENT_RIGHT_WING)[Y] < OUR_ENDZONE_FACEOFF)
        head(headingTo(player(OPPONENT_RIGHT_WING)));
    else {
        if (puck[X] < 0.0)
            head(headingTo(puck));
        else
            head(headingTo(FACEOFF_LEFT, puck[Y_COORDINATE]));
    }
    dash(0.7);
}
A forward player keeps away from an offside position, shoots when he is not
offside, passes back to his defensive player, dashes toward the puck, or heads
to his strategic location when needed:
ForwardAI.java

if (isOffside(true) && me[Y_COORDINATE] > THEIR_BLUE_LINE) {
    head(Math.PI);
    dash(1.0);
} else {
    if (isPuckWithinReach()) {
        if (!isOffside() && me[Y_COORDINATE] > CENTER_LINE) {
            head(headingTo(0.0, THEIR_GOAL_LINE));
            shoot(1.0);
        } else {
            head(headingTo(player(OUR_LEFT_DEFENSE)));
            shoot(0.5);
        }
        dash(0.2);
    } else {
        if (puck[X_COORDINATE] >= FACEOFF_LEFT + 2.0 &&
            puck[X_COORDINATE] <= FACEOFF_RIGHT - 2.0)
            head(headingTo(puck));
        else
            head(headingTo(0.0, puck[Y_COORDINATE] - 1.0));
        dash(1.0);
    }
}
These simple algorithms are deterministic, implementing simple reflex agents.
They define in advance which action to perform in each state of the game,
without using knowledge of previous states and actions or monitoring the
agent's own performance. Despite their simplicity, these algorithms were
powerful and very difficult to defeat.
OUR IMPLEMENTATION OF THE TEAM AI
REINFORCEMENT LEARNING
We chose to compete with Smed's algorithm using Reinforcement Learning (RL).
The game provides the option of running multiple episodes, learning from
each action and acting accordingly.
The new react() method can be summarized as follows: abstract and investigate
the state → choose the best action based on the learned knowledge and the
current state, or randomly select an action → extract a reward from the state
and update the learned knowledge base.
RL attempts to find a policy that suggests the best action for each state of the
world. This is done by inspecting states and the value assigned to each action
in that state (its q-value), and choosing the action with the maximal value.
Other actions are explored on random occasions.
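To make this choice rule concrete, here is a minimal ε-greedy selection sketch in Java. It is only an illustration: the q-value array and the epsilon field stand in for whatever structures the actual learner uses.

import java.util.Random;

// Illustrative epsilon-greedy selection over the q-values of one state
// (a sketch, not the project's actual code).
public class EpsilonGreedy {
    private final Random rnd = new Random();
    private final double epsilon;   // probability of exploring a random action

    public EpsilonGreedy(double epsilon) {
        this.epsilon = epsilon;
    }

    // Usually return the action with the maximal q-value; with probability
    // epsilon, return a random action instead.
    public int chooseAction(byte[] qValuesForState) {
        if (rnd.nextDouble() < epsilon)
            return rnd.nextInt(qValuesForState.length);    // explore
        int best = 0;
        for (int a = 1; a < qValuesForState.length; a++)   // exploit: argmax
            if (qValuesForState[a] > qValuesForState[best])
                best = a;
        return best;
    }
}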
In some problems, the world is approached as a Markov decision process
(MDP), where each action shifts the world from one state to another with a
certain probability. After comprehensive exploration, the state-transition
function and the rewards should be more or less clear, and the agent will
know the best action in each world-state.
The computer-hockey world is very stochastic. Even if the algorithm is
completely deterministic, the result of passing or shooting the puck is slightly
randomized. In addition, the algorithms of the other team are unknown, so
predicting their future actions is only possible after extensive learning. Finally,
the rewards of every state are not always relevant to every player's action,
further complicating the learning process.
We chose to implement SARSA, an on-policy RL algorithm: it acts on its own
policy, all actions and rewards are drawn from that same policy, and the policy
is updated with the new reward after each action.
The basic flow of the algorithm is as follows:
take the chosen action a
observe the reward r and the new state s'
choose a': the best known action, or a random action with probability ε
Q(s,a) ← Q(s,a) + α·(r + γ·Q(s',a') − Q(s,a))
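As a rough illustration, this update could be coded as below; the dense q-table and its indexing are assumptions made for the example, not the representation actually used in the project.

// Illustrative SARSA backup (a sketch, assuming a dense q-table indexed by
// state and action).
public class SarsaUpdate {
    // Q(s,a) <- Q(s,a) + alpha * (r + gamma * Q(s',a') - Q(s,a))
    static void update(double[][] qTable, int s, int a, double r,
                       int sPrime, int aPrime, double alpha, double gamma) {
        double q      = qTable[s][a];
        double target = r + gamma * qTable[sPrime][aPrime]; // a' is the action actually chosen in s'
        qTable[s][a]  = q + alpha * (target - q);
    }
}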
REINFORCEMENT LEARNING - APPROACHES
We took two approaches in competing with Smed's algorithm:
INDIVIDUAL LEARNING – UN-COACHED
Each team-mate learns and reacts autonomously. The player has a set of
nine possible actions. In each state, the player performs his best known action
or explores a different action randomly, inspects its outcomes and updates the
policy with new values for the actions.
The possible actions are:
Player has the puck:
o Pass to the team-mate in the i-th square that contains one of our players
(five such pass actions)
o Shoot
Player doesn't have the puck:
o Retreat to his original location
o Dash to the puck
o Guard the closest opponent
AIUncoachedLearner.java
enum Actions {
SHOOT, PASS1, PASS2, PASS3, PASS4, PASS5, RETREAT, TOPUCK, GUARD
}
GROUP LEARNING – COACHED
A team coach is responsible for all actions of the team. In every move, each
player requests a suggested action from the coach. The coach can now
coordinate between all players; for example, the movement of one player can
be stopped while a second player is passing the puck to the first. The coach
has about 30 different strategies to choose from. In each state, he suggests
actions according to the chosen strategy, inspects the outcome and updates
the values of the strategies.
The possible strategies are:
Team has the puck:
o Pass to a certain player and brake the player who should
receive the puck
o Shoot
Team doesn't have the puck
o 1 player dashes toward the puck
o 2 players dash toward the puck
o In both cases, all others:
If they are defensive: retreat or guard
If they are offensive: retreat or advance toward the
opponent's goal line
Each player is suggested one of the following actions (according to the
chosen strategy):
AICoach.java
enum Actions {
SHOOT, PASS_GOALIE, PASS_LD, PASS_RD, PASS_LW, PASS_RW,
PASS_CENTER, BRAKE, RETREAT, TOPUCK, GUARD, FORWARD
}
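As a rough sketch of how this player–coach coordination might be wired (the class and method names here are illustrative, not the project's actual API), each player simply asks a shared coach object what to do, and the coach answers according to the strategy it is currently following:

// Hypothetical sketch: one shared coach picks a team strategy and maps it to
// a per-player action when asked.
enum CoachAction {
    SHOOT, PASS_GOALIE, PASS_LD, PASS_RD, PASS_LW, PASS_RW,
    PASS_CENTER, BRAKE, RETREAT, TOPUCK, GUARD, FORWARD
}

class Coach {
    private int currentStrategy;            // one of the ~31 learned strategies

    // Called when the abstracted game state changes: choose a strategy for the
    // whole team (e.g. epsilon-greedily over the strategies' q-values).
    synchronized void observeState(int stateKey) {
        currentStrategy = 0;                // placeholder for the learned choice
    }

    // Each player asks which action it should perform under the current strategy.
    synchronized CoachAction suggestAction(int playerIndex) {
        // A real coach would map (currentStrategy, playerIndex) to a concrete action here.
        return CoachAction.TOPUCK;          // placeholder
    }
}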
STATE ABSTRACTION
CHALLENGES
RESPONSE TIMES
Since the game is played in real time, the calculations cannot be too slow:
each player may be required to respond hundreds of times per second,
demanding very short calculation times.
To achieve this, we needed to keep all the policy data in memory and never
allow it to be paged out to disk.
Insight: because of this constraint, our reinforcement learning implementations
could not keep up with the unsynchronized (as-fast-as-possible) game mode.
NUMBER OF STATES
Considering the complex elements in each state we could reach an endless
number of states. A state may consist of:
12 players positions, heading angle and current speed
Puck position, heading angle and current speed
For each position coordinates: 0≤x≤30, 0≤y≤60
Angle: 0 – 360
Speed: 0 – 1.0
The values above describe an enormously large state space. A quick
calculation shows that: |S| = ((12+1)30*60*360*100. Even this calculation is
modest, as the values for coordinates, angles and speed are double values.
There is no way to save q-values for each state in RAM.
STATE SIZE
If we were to uniquely describe every state in the S above, we would need
log2|S| bits per state. Adding the memory needed for the q-values of each
state, there was no way we could compete with Smed.
SOLUTION
We had to find a compact way to identify states and drastically reduce the
number of possible states. Our goal was to fit all possible states and q-values
into the RAM of a commodity PC.
Firstly, we abstracted each state by dividing the rink into squares, 3·6 possible
squares, leaving us with |S| = (3·6·360·100)^(12+1). Secondly, we neglected the
heading angle and current speed of all players, and described the heading
angle of the puck only as heading toward our goal line or toward theirs.
Now |S| = (3·6·2)^(12+1), which would still require far more than 100 GB.
Still too large.
Thirdly, we took every square on the rink and marked it with 2 bits: the first
for the existence of one of our players in this square, and the second for an
opponent. Four more bits indicate the puck's square number, and another bit
indicates its general direction:
Bits 0–23: existence of our player / an opponent player in each square (2 bits per square)
Bits 24–26: empty
Bit 27: puck direction
Bits 28–31: puck square number
This appears to be much better: now |S| = 2^32, i.e. 2^(32−3) bytes ≈ 0.5 GB.
Still, once we added the q-values and the other game assets, it would be hard
to compete with Smed on a PC without the risk of memory paging.
Our fourth idea took advantage of the fact that the rink is vertically symmetric:
every state with the puck on the right side of the rink has an exactly symmetric
state with the puck on the left, with the players' positions mirrored. The actions
taken in these states are the same as the actions chosen in the symmetric state.
Finally, working with a worst case state space of 0.25GB seemed reasonable.
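A sketch of how such a packed encoding might be implemented is shown below. It follows the bit layout described above (2 bits per square, 4 bits for the puck square, 1 bit for its direction); the exact square indexing is an assumption made for illustration, since the layout does not spell it out.

// Hypothetical sketch of packing an abstracted game state into a 32-bit int,
// following the bit layout described above.
public class StateEncoder {
    static final int PLAYER_SQUARES = 12;   // 2 bits each -> bits 0..23 (per the layout above)

    // ourPlayers[i] / theirPlayers[i]: does square i contain a player of that team?
    // puckSquare: index of the square containing the puck (4 bits -> 0..15);
    // puckTowardTheirGoal: the puck's general direction.
    static int encode(boolean[] ourPlayers, boolean[] theirPlayers,
                      int puckSquare, boolean puckTowardTheirGoal) {
        int state = 0;
        for (int sq = 0; sq < PLAYER_SQUARES; sq++) {
            if (ourPlayers[sq])   state |= 1 << (2 * sq);       // bit 2*sq: our player
            if (theirPlayers[sq]) state |= 1 << (2 * sq + 1);   // bit 2*sq+1: opponent
        }
        if (puckTowardTheirGoal)
            state |= 1 << 27;                                   // bit 27: puck direction
        state |= (puckSquare & 0xF) << 28;                      // bits 28..31: puck square
        return state;
    }
}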
POLICY
Q-FUNCTION REPRESENTATION
Considering the data above, we could not waste more memory and had to cut
bytes, paying with reduced accuracy. In the individual learning, for example,
we allowed only nine possible actions per player, each given a 1-byte q-value
in the range [-127, 127]. In the coached learning, we allowed 31 different
strategies per state.
We used a HashMap storing pointers to all visited states and their q-values,
allowing quick retrieval and updates while using memory only when needed.
Data saved for each visited state: 4 bytes + 6 players × 9 bytes, roughly 62 bytes.
Insight: Interestingly, after running the game for several hours, the HashMap
size reached only 6MB of data, covering ~65,000 states (0.02% of all possible
states).
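A compact q-table along these lines might look as follows. This is only a sketch (the class name and method signatures are assumptions), but it mirrors the layout described above: one HashMap entry per visited state, holding one byte per player per action.

import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the in-memory policy: q-values are allocated only for
// states that have actually been visited.
public class PolicyTable {
    private static final int PLAYERS = 6;
    private static final int ACTIONS = 9;    // un-coached learner: 9 actions per player
    private final Map<Integer, byte[]> table = new HashMap<Integer, byte[]>();

    private byte[] row(int state) {
        byte[] r = table.get(state);
        if (r == null) {                     // first visit: allocate one byte per (player, action)
            r = new byte[PLAYERS * ACTIONS];
            table.put(state, r);
        }
        return r;
    }

    public byte getQ(int state, int player, int action) {
        return row(state)[player * ACTIONS + action];
    }

    public void setQ(int state, int player, int action, int q) {
        int clamped = Math.max(-127, Math.min(127, q));   // keep within the 1-byte range
        row(state)[player * ACTIONS + action] = (byte) clamped;
    }

    public int getMaxAction(int state, int player, int minAction, int maxAction) {
        byte[] r = row(state);
        int best = minAction;
        for (int a = minAction + 1; a <= maxAction; a++)
            if (r[player * ACTIONS + a] > r[player * ACTIONS + best])
                best = a;
        return best;
    }
}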
FORMING THE BEST POLICY
CHALLENGES
UPDATE OCCASIONS
As stated beforehand, the game has hundreds of moves every second.
Although the game state may change rapidly, it rarely changes as a result of
every single action executed by the player. In addition, rewards such as goals
are given only after a relatively significant time period (about half a second).
For such a reward to propagate back to the relevant action, it would need to
update hundreds of state-action pairs, which would not be useful and would
significantly delay the learning process. Updating the policy every X
milliseconds is not the answer either, as important state-action pairs may be lost.
Therefore, we chose to update the policy only at key occasions:
Individual Learning:
o The player received/lost the puck
o The game has stopped (goal / rule-violation / etc.)
o The player finished playing the same defensive action (any
action not involving the puck) for roundsToDefend rounds (set to
5000 for goalie and 10,000 for field players).
Coached Learning:
o The team received/lost the puck
o The game has stopped
o The team finished playing the same defensive strategy for
roundsToDefend rounds (set to 10,000).
REWARD CALCULATION
Since goals in a hockey match are relatively rare, it is not useful to reward
actions only when a goal is scored. Doing so would make rewards very
sparse, leaving the choice of actions in most states essentially random.
Therefore we needed to grant rewards, as well as penalties, on many more
occasions, to create greater differentiation between actions.
We evaluated what makes a state good or bad and came up with the following
function:
AILearner.java

public int stateToScore() {
    int score = 0;
    if (isGoal(true))                 // we scored
        return 40;
    if (isGoal(false))                // they scored
        return -40;
    if (isIcing(true))                // we committed icing
        return -15;
    if (isInterferingGoalie(true))    // we interfere with their goalie
        return -5;
    if (isGoalieKeepingPuck(true))    // our goalie keeps the puck - violation
        return -10;
    if (isFoul(true))                 // we made a foul
        return -5;
    if (teamHasPuck() != -1)          // one of our team-mates has the puck
        score += 13;
    else
        score -= 2;
    if (isOffside(true))              // we're offside
        score -= 10;
    if (isOffside(false))             // they're offside
        score += 2;
    double puck[] = puck();
    double dist = distanceToGoal(puck);
    score += 16 - dist / 2;           // is the puck close to our or their goal?
    score += (puck[2] > 0) ? 5 : -7;  // is the puck heading toward their goal line?
    return score;
}
This score is the reward given for each state.
Obviously, changing the reward function makes the entire learning process
and its outcome dramatically different. If, for example, being offside were
penalized as heavily as conceding a goal, the team would focus on avoiding
offsides, perhaps at the cost of scoring goals.
IMPLEMENTATION
We used SARSA with ε-greedy exploration to inspect and update the policy
throughout the learning process.
A sketch of the policy-update procedure of the individual learner, inside its
react() method:
AIUncoachedLearner.java

protected void react() {
    inputToState();          // abstract the game state
    if (!isGameStopped()) {
        if (!puckWithinReach)
            keep defensive actions for roundsToDefend rounds
        else // now offensive
            discard defensive memory
        if (need a new action) {
            int minAction = puckWithinReach ? 0 : MaxActionWithPuck + 1;
            int maxAction = ...;
            if (random() < epsilon)    // explore
                action = minAction + random() * (maxAction - minAction);
            else                       // exploit: best known action
                action = policy.getMaxAction(state, thisPlayer,
                                             minAction, maxAction);
            saveState(state, action);  // remember the new state and action
            updateScores();            // update the policy with the new and old state values
        }
        perform(action);
    }
    else {
        updateScores();
    }
}

protected void updateScores() {
    if (isGameStopped())    // new episode: no successor state
        lastQ = 0;
    else
        lastQ = policy.getQ(currentState, thisPlayer, currentAction);
    updateQ = policy.getQ(oldState, thisPlayer, oldAction);
    tempQ = updateQ + alpha * (currentStateScore + gamma * lastQ - updateQ);
    policy.setQ(oldState, thisPlayer, oldAction, tempQ);   // store the updated q-value
}
RESULTS
LEARNING PROCESS
Because of the complexity of our algorithm, the learning process had to be
run in the synchronized mode. This demanded a very long learning period.
We extracted a final policy from each learning approach (individual and
coached) after about 10 hours of run-time.
CHOICE OF ACTION IMPROVEMENT
This result was hard to examine, as it required watching the GUI during play
and estimating the changes in behavior. Still, the team clearly changed
behavior during the learning process.
In the individual learning, players initially chose to retreat to their faceoff
positions, pass to the goalie, or shoot toward the goal even in hopeless
situations. In the final policy, the behavior is far less random: certain actions
are clearly preferred in certain states, which also increases the team's
possession of the puck and its number of goals.
In the coached learning, the team initially chooses inappropriate strategies:
players might pass to the goalie, retreat to their faceoff positions at the wrong
moments, or not send enough players to dash toward the puck. Again, the
final policy exhibits a more rational team.
Q-VALUE INSPECTION
The development of a certain state and its q-values during the learning
process is shown below. The example given is from the coached learning
approach.
The game state encoding: 00010000 01010010 10101000 00010111
Bits 0–23: existence of our player / an opponent player in each square (2 bits per square)
Bits 24–26: empty
Bit 27: puck direction
Bits 28–31: puck square number
DEVELOPMENT OF Q-VALUES:
After 5 goals:
Q(s) = 0, 0, 0, 0, 0, 0, 0, 0, -4, 0, 5, 0, 0, -4, 0, -4, -4, 0, 0, 0, 0, 0, -4, 0, 0, 0, 0, 0, 0, 0, 0
You can see that after 5 goals this state was already visited several times. Six
strategies were examined and updated with values, but the rewards are too
similar.
The best strategy so far is #10: 1 player goes to the puck, 1 player retreats to
his original location, and the others go forward or guard, according to their
positions.
We can assume that the puck was never yet in the team's control because
offensive strategies were not taken.
After 60 goals:
Q(S) = 90, 13, 90, 90, 0, 0, 0, 0, -4, 0, -3, -6, 0, -4, 0, -4, -4, 0, 0, -6, 0, 6, -4, -6, 0, 0, 0, 0, 0, -6, 0
The state was visited many times, many possible strategies were tested and
the highest q-values were given to the following strategies:
Defensive (team doesn’t have the puck) – strategy #21: two players dash
towards the puck, two retreat to original locations and one goes forward or
guards his closest opponent.
Offensive strategies: #0, #2 and #3: shoot the puck, or pass.
After 150 goals:
Q(S) = 90, 90, 0, 90, 0, 0, 0, -4, -4, -6, -3, -6, 27, -4, -3, -4, -4, -6, -4, -6, -6, -6, -2, -6, -4, -4, -6, -4, -4, -6, -11
The q-function developed further, covering almost all available strategies for
this state. The best strategies are now:
Defensive: #12: one player goes to the puck, 2 retreat and the others go
forward or guard.
Offensive: #0, #1 and #3: shoot the puck or pass.
DID WE WIN?
In both approaches, playing against a team using Smed's simple algorithms,
our teams won. The learned policies produced teams that scored more goals
than Smed's teams, and the difference between the goals scored by the
learning teams and those scored by Smed's teams grew as the learning
process continued.
UN-COACHED TEAM
The final score was 360:240, after 12 hours of learning, starting from scratch.
A graph of goals correlated with the elapsed time is shown below.
[Graph: cumulative goals over time, Smed's team vs. the learning team.]
At first we scored 0.6 goals for each goal they scored, but eventually we
scored 1.4 goals for each goal of Smed's team.
[Graph: ratio of our goals to Smed's goals over time.]
COACHED TEAM
Eventually we won 254:151, after 8 hours of learning, starting from scratch.
[Graph: cumulative goals over time, Smed's team vs. the AI coached team.]
[Graph: ratio of our goals to Smed's goals over time.]
APPROACHES COMPARISON
We can see from the results above that the coached approach performed
better, even though the un-coached team trained for longer. This is largely
because the players are synchronized by the coach: they stop when the puck
is passed toward them and do not pass to a player who is offside. We believe
the un-coached team could eventually learn these behaviors as well, but
within our training period it was not as smart as the coached team.
Watching the games after training, we can see that in both approaches the
preferred actions are to go to the puck, shoot it as often as possible, or pass it
to a forward player. Both teams learned to play reasonably and won their
games.
CONCLUSIONS
The challenges involved in handling this highly complex world and adapting it
to reinforcement learning were greater than expected. During our work, more
and more factors that influence the learning were discovered. The abstraction
of the states, definition of appropriate actions for players and choice of
accuracy for q-values all greatly impact the results of the learning. Each
factor was necessary for implementing a working algorithm, so we had to
make uninformed choices according to our intuition.
Thus, the exact influence of each of these factors remained unexplored – it
would demand days of learning with very minute tweaking of these details.
Perhaps, if we had defined a finer abstraction, added more possible actions,
or allowed greater accuracy in the q-values, we could have created a better
learning process. The players would have had a greater pool of choices at
finer time intervals, with each action clearly differentiated from the others.
However, a policy allowing this would have been bulky and would not have
provided good response times. In addition, a larger number of state-action
pairs might have created an unmanageable space to explore, requiring an
unreasonably long learning period to produce a working policy. Clearly, this
implementation opens endless possibilities for exploring the elements of
reinforcement learning; this project shows only one possible implementation.
Still, after viewing the results of our implementation, it clearly yielded better
teams than simple reflex agents. Both approaches (the coached and
individually acting players) showed good performance, outperforming the
simple algorithms provided by Smed. This suggests that reinforcement
learning is very powerful, and further tweaks may improve the performance
even more.