HEBREW UNIVERSITY OF JERUSALEM
INTRODUCTION TO ARTIFICIAL INTELLIGENCE
IMPLEMENTING
REINFORCEMENT LEARNING
FOR A HOCKEY TEAM
WRITERS:
YONI NIR
MARCH 2010
INDEX
Hockey Game
  Introduction and Features
  Our Place in the Game
Given Implementation of AI
Our Implementation of the team AI
  Reinforcement Learning
  Reinforcement Learning - Approaches
    Individual Learning – Un-Coached
    Group Learning – Coached
  State Abstraction
    Challenges
      Response Times
      Number of States
      State Size
    Solution
  Policy
    Q-Function Representation
    Forming the Best Policy
      Challenges
        Update Occasions
        Reward Calculation
    Implementation
Results
  Learning Process
  Choice of Action Improvement
    Q-Value Inspection
      Development of Q-Values
  Did We Win?
    Un-Coached Team
    Coached Team
    Approaches Comparison
Conclusions
HOCKEY GAME
INTRODUCTION AND FEATURES
AIsHockey is a simplified hockey computer game, written in Java by Adjunct
Professor Jouni Smed from the University of Turku [1] in Finland.
A hockey game has two teams, each with six players: goalie, right defense,
left defense, center, right wing and left wing. The objective of the game is to
score as many goals as possible while guarding the home team's goal.
Important events in the game include: goal, offside, icing (a certain hockey
violation), interfering with the goalie, keeping the puck, and more. When an
event occurs, the game stops – then the match is restarted from one of the
faceoff locations on the rink.
The rules of hockey are not followed exactly in the game engine; for instance,
player penalties do not occur. However, most rules are observed.
The implementation consists of a server and two clients, each an autonomous
process. The server multicasts the game status to the clients, who send
desired actions to the server according to the game status.
[1] http://www.utu.fi/
The representation of the state of the game received by the client describes
the following variables: players' positions and their heading angles, the puck's
position and heading angle, score, time, goal announcement, and rule-violation type.
The actions available to players include shooting/passing the puck, dashing
(advancing) somewhere, braking, or keeping the puck (goalie). Certain
actions need an angle and power value.
The game can be run in two configurations:
1. Real time (synchronized): the clients' actions are performed by the
server in synchronization with a real time clock. This mode may be run
with a GUI, displaying the full game with animated players, or in
console mode only, displaying the score on the event of a goal.
2. Unsynchronized: the actions are performed instantly, upon their
reception by the server, allowing the game to run as fast as possible.
The GUI is not supported in this mode.
OUR PLACE IN THE GAME
Our objective was to write smart clients that perform the best action in each
state. As stated earlier, a client consists of six hockey players. Each player
runs two threads: one communicates with the server, while the other extracts
information about the game state and chooses actions based on it. Everything
runs concurrently across these threads.
The main mission was implementing the react() method. A quick description
of its flow: abstracting the game state → investigating the state → choosing
the best action based on the state and any additional knowledge →
performing the best action.
A simple implementation may look like:

public class MyAI extends AI implements Constants {
    public void react() {
        if (isPuckWithinReach()) {
            // we have the puck: face the opponent's goal and shoot
            head(headingTo(0.0, THEIR_GOAL_LINE));
            brake(1.0);
            shoot(0.4);
        } else {
            // otherwise chase the puck at full speed
            dash(1.0);
            head(headingTo(puck()));
        }
    }
}
Each player in the game may run a different class, allowing different AI
strategies for the goalie and the left wing player, for example.
GIVEN IMPLEMENTATION OF AI
Jouni Smed implemented a simple but killer algorithm for playing the game.
Three classes were provided: GoalieAI.java, DefenseAI.java and
ForwardAI.java.
Below are simplified versions of the players' react() method.
The goalie always stands in a strategic location, passes the puck to another
player, or keeps the puck:
GoalieAI.java

// stand in the goal area, between the puck and the goal
if (isPuckWithinReach()) {
    if (ourMinY < theirMinY) {
        head(headingTo(ourClosestPlayer));
        shoot(1.0);
    } else {
        keepPuck();
    }
}
The defensive player passes to his forward player, guards an opponent
player, dashes toward the puck or stands in a strategic location:
DefenseAI.java

if (isPuckWithinReach()) {
    head(headingTo(player(OUR_LEFT_WING)));
    brake(1.0);
    shoot(0.7);
} else {
    if (player(OPPONENT_RIGHT_WING)[Y] < OUR_ENDZONE_FACEOFF)
        head(headingTo(player(OPPONENT_RIGHT_WING)));
    else {
        if (puck[X] < 0.0)
            head(headingTo(puck));
        else
            head(headingTo(FACEOFF_LEFT, puck[Y_COORDINATE]));
    }
    dash(0.7);
}
A forward player keeps away from an offside position, shoots when he is not
offside, passes back to his defensive player, dashes toward the puck, or heads
to his strategic location when needed:
ForwardAI.java

if (isOffside(true) && me[Y_COORDINATE] > THEIR_BLUE_LINE) {
    head(Math.PI);
    dash(1.0);
} else {
    if (isPuckWithinReach()) {
        if (!isOffside() && me[Y_COORDINATE] > CENTER_LINE) {
            head(headingTo(0.0, THEIR_GOAL_LINE));
            shoot(1.0);
        } else {
            head(headingTo(player(OUR_LEFT_DEFENSE)));
            shoot(0.5);
        }
        dash(0.2);
    } else {
        if (puck[X_COORDINATE] >= FACEOFF_LEFT + 2.0 &&
            puck[X_COORDINATE] <= FACEOFF_RIGHT - 2.0)
            head(headingTo(puck));
        else
            head(headingTo(0.0, puck[Y_COORDINATE] - 1.0));
        dash(1.0);
    }
}
These simple algorithms are deterministic, implementing simple reflex agents.
They define in advance which action to perform in each state of the game,
without using knowledge of previous states and actions or monitoring the
agent's own performance. Despite their simplicity, these algorithms were
powerful and very difficult to defeat.
OUR IMPLEMENTATION OF THE TEAM AI
REINFORCEMENT LEARNING
We chose to compete with Smed's algorithm using Reinforcement Learning (RL).
The game provides the option of running multiple episodes, learning from
each action and acting accordingly.
The new react() method can be summarized as follows: abstract and investigate
the state → choose the best action based on the learned knowledge and the
current state, or randomly select an action → extract a reward from the state
and update the learned knowledge base.
RL attempts to find a policy that suggests the best action for each state of the
world. This is done by inspecting states and the value assigned to each action
in that state (its q-value), and choosing the action with the maximal value.
Other actions are explored on random occasions.
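To make this choice rule concrete, here is a minimal ε-greedy selection sketch in Java. It is only an illustration: the q-value array and the epsilon field stand in for whatever structures the actual learner uses.

import java.util.Random;

// Illustrative epsilon-greedy selection over the q-values of one state
// (a sketch, not the project's actual code).
public class EpsilonGreedy {
    private final Random rnd = new Random();
    private final double epsilon;   // probability of exploring a random action

    public EpsilonGreedy(double epsilon) {
        this.epsilon = epsilon;
    }

    // Usually return the action with the maximal q-value; with probability
    // epsilon, return a random action instead.
    public int chooseAction(byte[] qValuesForState) {
        if (rnd.nextDouble() < epsilon)
            return rnd.nextInt(qValuesForState.length);    // explore
        int best = 0;
        for (int a = 1; a < qValuesForState.length; a++)   // exploit: argmax
            if (qValuesForState[a] > qValuesForState[best])
                best = a;
        return best;
    }
}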
In some problems, the world is approached as a Markov decision process
(MDP), where each action shifts the world from one state to another with a
certain probability. After comprehensive exploration, the state-transition
function and the rewards should be more or less clear, and the agent will
know the best action in each world-state.
The computer-hockey world is very stochastic. Even if the algorithm is
completely deterministic, the result of passing or shooting the puck is slightly
randomized. In addition, the algorithms of the other team are unknown, so
predicting their future actions is only possible after extensive learning. Finally,
the rewards of every state are not always relevant to every player's action,
further complicating the learning process.
We chose to implement SARSA, an on-policy RL algorithm: it acts on its own
policy, all actions and rewards are drawn from that same policy, and the policy
is updated with the new reward after each action.
The basic flow of the algorithm is as follows:
take the chosen action a
observe the reward r and the new state s'
choose a': the best known action, or a random action with probability ε
Q(s,a) ← Q(s,a) + α·(r + γ·Q(s',a') − Q(s,a))
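As a rough illustration, this update could be coded as below; the dense q-table and its indexing are assumptions made for the example, not the representation actually used in the project.

// Illustrative SARSA backup (a sketch, assuming a dense q-table indexed by
// state and action).
public class SarsaUpdate {
    // Q(s,a) <- Q(s,a) + alpha * (r + gamma * Q(s',a') - Q(s,a))
    static void update(double[][] qTable, int s, int a, double r,
                       int sPrime, int aPrime, double alpha, double gamma) {
        double q      = qTable[s][a];
        double target = r + gamma * qTable[sPrime][aPrime]; // a' is the action actually chosen in s'
        qTable[s][a]  = q + alpha * (target - q);
    }
}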
REINFORCEMENT LEARNING - APPROACHES
We took two approaches in competing with Smed's algorithm:
INDIVIDUAL LEARNING – UN-COACHED
Each team-mate learns and reacts autonomously. The player has a set of
nine possible actions. In each state, the player performs his best known action
or explores a different action randomly, inspects its outcomes and updates the
policy with new values for the actions.
The possible actions are:
Player has the puck:
o Pass to the team-mate in the i-th square that contains one of our players
(five such pass actions)
o Shoot
Player doesn't have the puck:
o Retreat to his original location
o Dash to the puck
o Guard the closest opponent
AIUncoachedLearner.java
enum Actions {
SHOOT, PASS1, PASS2, PASS3, PASS4, PASS5, RETREAT, TOPUCK, GUARD
}
GROUP LEARNING – COACHED
A team coach is responsible for all actions of the team. In every move, each
player requests a suggested action from the coach. The coach can now
coordinate between all players; for example, the movement of one player can
be stopped while a second player is passing the puck to the first. The coach
has about 30 different strategies to choose from. In each state, he suggests
actions according to the chosen strategy, inspects the outcome and updates
the values of the strategies.
The possible strategies are:
Team has the puck:
o Pass to a certain player and brake the player who should
receive the puck
o Shoot
Team doesn't have the puck
o 1 player dashes toward the puck
o 2 players dash toward the puck
o In both cases, all others:
If they are defensive: retreat or guard
If they are offensive: retreat or advance toward the
opponent's goal line
Each player is suggested one of the following actions (according to the
chosen strategy):
AICoach.java
enum Actions {
SHOOT, PASS_GOALIE, PASS_LD, PASS_RD, PASS_LW, PASS_RW,
PASS_CENTER, BRAKE, RETREAT, TOPUCK, GUARD, FORWARD
}
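As a rough sketch of how this player–coach coordination might be wired (the class and method names here are illustrative, not the project's actual API), each player simply asks a shared coach object what to do, and the coach answers according to the strategy it is currently following:

// Hypothetical sketch: one shared coach picks a team strategy and maps it to
// a per-player action when asked.
enum CoachAction {
    SHOOT, PASS_GOALIE, PASS_LD, PASS_RD, PASS_LW, PASS_RW,
    PASS_CENTER, BRAKE, RETREAT, TOPUCK, GUARD, FORWARD
}

class Coach {
    private int currentStrategy;            // one of the ~31 learned strategies

    // Called when the abstracted game state changes: choose a strategy for the
    // whole team (e.g. epsilon-greedily over the strategies' q-values).
    synchronized void observeState(int stateKey) {
        currentStrategy = 0;                // placeholder for the learned choice
    }

    // Each player asks which action it should perform under the current strategy.
    synchronized CoachAction suggestAction(int playerIndex) {
        // A real coach would map (currentStrategy, playerIndex) to a concrete action here.
        return CoachAction.TOPUCK;          // placeholder
    }
}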
STATE ABSTRACTION
CHALLENGES
RESPONSE TIMES
Since the game is played in real time, the calculations cannot be too slow:
each player may be required to respond hundreds of times per second,
demanding very short calculation times.
To achieve this, we needed to keep all the policy data in memory and never
allow it to be paged out to disk.
Insight: because of this constraint, our reinforcement learning implementations
could not keep up with the unsynchronized (as-fast-as-possible) game mode.
NUMBER OF STATES
Considering the complex elements in each state we could reach an endless
number of states. A state may consist of:
12 players positions, heading angle and current speed
Puck position, heading angle and current speed
For each position coordinates: 0≤x≤30, 0≤y≤60
Angle: 0 – 360
Speed: 0 – 1.0
The values above describe an enormously large state space. A quick
calculation shows that: |S| = ((12+1)30*60*360*100. Even this calculation is
modest, as the values for coordinates, angles and speed are double values.
There is no way to save q-values for each state in RAM.
STATE SIZE
If we were to uniquely describe every state in the S above, we would need
log2|S| bits per state. Adding the memory needed for the q-values of each
state, there was no way we could compete with Smed.
SOLUTION
We had to find a compact way to identify states and drastically reduce the
number of possible states. Our goal was to fit all possible states and q-values
into the RAM of a commodity PC.
Firstly, we abstracted each state by dividing the rink into squares, 3·6 possible
squares, leaving us with |S| = (3·6·360·100)^(12+1). Secondly, we neglected the
heading angle and current speed of all players, and described the heading
angle of the puck only as heading toward our goal line or toward theirs.
Now |S| = (3·6·2)^(12+1), which would still require far more than 100 GB.
Still too large.
Thirdly, we took every square on the rink and marked it with 2 bits: the first
for the existence of one of our players in this square, and the second for an
opponent. Four more bits indicate the puck's square number, and another bit
indicates its general direction:
Bits 0–23: existence of our player / an opponent player in each square (2 bits per square)
Bits 24–26: empty
Bit 27: puck direction
Bits 28–31: puck square number
This appears to be much better: now |S| = 2^32, i.e. 2^(32−3) bytes ≈ 0.5 GB.
Still, once we added the q-values and the other game assets, it would be hard
to compete with Smed on a PC without the risk of memory paging.
Our fourth idea took advantage of the fact that the rink is vertically symmetric:
every state with the puck on the right side of the rink has an exactly symmetric
state with the puck on the left, with the players' positions mirrored. The actions
taken in these states are the same as the actions chosen in the symmetric state.
Finally, working with a worst case state space of 0.25GB seemed reasonable.
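A sketch of how such a packed encoding might be implemented is shown below. It follows the bit layout described above (2 bits per square, 4 bits for the puck square, 1 bit for its direction); the exact square indexing is an assumption made for illustration, since the layout does not spell it out.

// Hypothetical sketch of packing an abstracted game state into a 32-bit int,
// following the bit layout described above.
public class StateEncoder {
    static final int PLAYER_SQUARES = 12;   // 2 bits each -> bits 0..23 (per the layout above)

    // ourPlayers[i] / theirPlayers[i]: does square i contain a player of that team?
    // puckSquare: index of the square containing the puck (4 bits -> 0..15);
    // puckTowardTheirGoal: the puck's general direction.
    static int encode(boolean[] ourPlayers, boolean[] theirPlayers,
                      int puckSquare, boolean puckTowardTheirGoal) {
        int state = 0;
        for (int sq = 0; sq < PLAYER_SQUARES; sq++) {
            if (ourPlayers[sq])   state |= 1 << (2 * sq);       // bit 2*sq: our player
            if (theirPlayers[sq]) state |= 1 << (2 * sq + 1);   // bit 2*sq+1: opponent
        }
        if (puckTowardTheirGoal)
            state |= 1 << 27;                                   // bit 27: puck direction
        state |= (puckSquare & 0xF) << 28;                      // bits 28..31: puck square
        return state;
    }
}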
POLICY
Q-FUNCTION REPRESENTATION
Considering the data above, we could not waste more memory and had to cut
bytes, paying with reduced accuracy. In the individual learning, for example,
we allowed only nine possible actions per player, each given a 1-byte q-value
in the range [-127, 127]. In the coached learning, we allowed 31 different
strategies per state.
We used a HashMap storing pointers to all visited states and their q-values,
allowing quick retrieval and updates while using memory only when needed.
Data saved for each visited state: 4 bytes + 6 players × 9 bytes, roughly 62 bytes.
Insight: Interestingly, after running the game for several hours, the HashMap
size reached only 6MB of data, covering ~65,000 states (0.02% of all possible
states).
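A compact q-table along these lines might look as follows. This is only a sketch (the class name and method signatures are assumptions), but it mirrors the layout described above: one HashMap entry per visited state, holding one byte per player per action.

import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the in-memory policy: q-values are allocated only for
// states that have actually been visited.
public class PolicyTable {
    private static final int PLAYERS = 6;
    private static final int ACTIONS = 9;    // un-coached learner: 9 actions per player
    private final Map<Integer, byte[]> table = new HashMap<Integer, byte[]>();

    private byte[] row(int state) {
        byte[] r = table.get(state);
        if (r == null) {                     // first visit: allocate one byte per (player, action)
            r = new byte[PLAYERS * ACTIONS];
            table.put(state, r);
        }
        return r;
    }

    public byte getQ(int state, int player, int action) {
        return row(state)[player * ACTIONS + action];
    }

    public void setQ(int state, int player, int action, int q) {
        int clamped = Math.max(-127, Math.min(127, q));   // keep within the 1-byte range
        row(state)[player * ACTIONS + action] = (byte) clamped;
    }

    public int getMaxAction(int state, int player, int minAction, int maxAction) {
        byte[] r = row(state);
        int best = minAction;
        for (int a = minAction + 1; a <= maxAction; a++)
            if (r[player * ACTIONS + a] > r[player * ACTIONS + best])
                best = a;
        return best;
    }
}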
FORMING THE BEST POLICY
CHALLENGES
UPDATE OCCASIONS
As stated beforehand, the game has hundreds of moves every second.
Although the game state may change rapidly, it rarely changes as a result of
every single action executed by the player. In addition, rewards such as goals
are given only after a relatively significant time period (about half a second).
For such a reward to propagate back to the relevant action, it would need to
update hundreds of state-action pairs, which would not be useful and would
significantly delay the learning process. Updating the policy every X
milliseconds is not the answer either, as important state-action pairs may be lost.
Therefore, we chose to update the policy only at key occasions:
Individual Learning:
o The player received/lost the puck
o The game has stopped (goal / rule-violation / etc.)
o The player finished playing the same defensive action (any
action not involving the puck) for roundsToDefend rounds (set to
5000 for goalie and 10,000 for field players).
Coached Learning:
o The team received/lost the puck
o The game has stopped
o The team finished playing the same defensive strategy for
roundsToDefend rounds (set to 10,000).
REWARD CALCULATION
Since goals in a hockey match are relatively rare, it is not useful to reward
actions only when a goal is scored. Doing so would make rewards very
sparse, leaving the choice of actions in most states essentially random.
Therefore we needed to grant rewards, as well as penalties, on many more
occasions, to create greater differentiation between actions.
We evaluated what makes a state good or bad and came up with the following
function:
AILearner.java

public int stateToScore() {
    int score = 0;
    if (isGoal(true))                 // we scored
        return 40;
    if (isGoal(false))                // they scored
        return -40;
    if (isIcing(true))                // we committed icing
        return -15;
    if (isInterferingGoalie(true))    // we interfere with their goalie
        return -5;
    if (isGoalieKeepingPuck(true))    // our goalie keeps the puck - violation
        return -10;
    if (isFoul(true))                 // we made a foul
        return -5;
    if (teamHasPuck() != -1)          // one of our team-mates has the puck
        score += 13;
    else
        score -= 2;
    if (isOffside(true))              // we're offside
        score -= 10;
    if (isOffside(false))             // they're offside
        score += 2;
    double puck[] = puck();
    double dist = distanceToGoal(puck);
    score += 16 - dist / 2;           // is the puck close to our or their goal?
    score += (puck[2] > 0) ? 5 : -7;  // is the puck heading toward their goal line?
    return score;
}
This score is the reward given for each state.
Obviously, changing the reward function makes the entire learning process
and its outcome dramatically different. If, for example, being offside were
penalized as heavily as conceding a goal, the team would focus on avoiding
offsides, perhaps at the cost of scoring goals.
IMPLEMENTATION
We used SARSA with ε-greedy exploration to inspect and update the policy
throughout the learning process.
A sketch of the policy-update procedure of the individual learner, inside its
react() method:
AIUncoachedLearner.java

protected void react() {
    inputToState();          // abstract the game state
    if (!isGameStopped()) {
        if (!puckWithinReach)
            keep defensive actions for roundsToDefend rounds
        else // now offensive
            discard defensive memory
        if (need a new action) {
            int minAction = puckWithinReach ? 0 : MaxActionWithPuck + 1;
            int maxAction = ...;
            if (random() < epsilon)    // explore
                action = minAction + random() * (maxAction - minAction);
            else                       // exploit: best known action
                action = policy.getMaxAction(state, thisPlayer,
                                             minAction, maxAction);
            saveState(state, action);  // remember the new state and action
            updateScores();            // update the policy with the new and old state values
        }
        perform(action);
    }
    else {
        updateScores();
    }
}

protected void updateScores() {
    if (isGameStopped())    // new episode: no successor state
        lastQ = 0;
    else
        lastQ = policy.getQ(currentState, thisPlayer, currentAction);
    updateQ = policy.getQ(oldState, thisPlayer, oldAction);
    tempQ = updateQ + alpha * (currentStateScore + gamma * lastQ - updateQ);
    policy.setQ(oldState, thisPlayer, oldAction, tempQ);   // store the updated q-value
}
RESULTS
LEARNING PROCESS
Because of the complexity of our algorithm, the learning process had to be
run in the synchronized mode. This demanded a very long learning period.
We extracted a final policy from each learning approach (individual and
coached) after about 10 hours of run-time.
CHOICE OF ACTION IMPROVEMENT
This result was hard to examine, as it required watching the GUI during play
and estimating the changes in behavior. Still, the team clearly changed
behavior during the learning process.
In the individual learning, players initially chose to retreat to their faceoff
positions, pass to the goalie, or shoot toward the goal even in hopeless
situations. In the final policy, the behavior is far less random: certain actions
are clearly preferred in certain states, which also increases the team's
possession of the puck and its number of goals.
In the coached learning, the team initially chooses inappropriate strategies:
players might pass to the goalie, retreat to their faceoff positions at the wrong
moments, or not send enough players to dash toward the puck. Again, the
final policy exhibits a more rational team.
Q-VALUE INSPECTION
The development of a certain state and its q-values during the learning
process is shown below. The example given is from the coached learning
approach.
The game state encoding: 00010000 01010010 10101000 00010111
Bits 0–23: existence of our player / an opponent player in each square (2 bits per square)
Bits 24–26: empty
Bit 27: puck direction
Bits 28–31: puck square number
DEVELOPMENT OF Q-VALUES:
After 5 goals:
Q(s) = 0, 0, 0, 0, 0, 0, 0, 0, -4, 0, 5, 0, 0, -4, 0, -4, -4, 0, 0, 0, 0, 0, -4, 0, 0, 0, 0, 0, 0, 0, 0
You can see that after 5 goals this state was already visited several times. Six
strategies were examined and updated with values, but the rewards are too
similar.
The best strategy so far is #10: 1 player goes to the puck, 1 player retreats to
his original location, and the others go forward or guard, according to their
positions.
We can assume that the puck was never yet in the team's control because
offensive strategies were not taken.
After 60 goals:
Q(S) = 90, 13, 90, 90, 0, 0, 0, 0, -4, 0, -3, -6, 0, -4, 0, -4, -4, 0, 0, -6, 0, 6, -4, -6, 0, 0, 0, 0, 0, -6, 0
The state was visited many times, many possible strategies were tested and
the highest q-values were given to the following strategies:
Defensive (team doesn’t have the puck) – strategy #21: two players dash
towards the puck, two retreat to original locations and one goes forward or
guards his closest opponent.
Offensive strategies: #0, #2 and #3: shoot the puck, or pass.
After 150 goals:
Q(S) = 90, 90, 0, 90, 0, 0, 0, -4, -4, -6, -3, -6, 27, -4, -3, -4, -4, -6, -4, -6, -6, -6, -2, -6, -4, -4, -6, -4, -4, -6, -11
The q-function developed further, covering almost all available strategies for
this state. The best strategies are now:
Defensive: #12: one player goes to the puck, 2 retreat and the others go
forward or guard.
Offensive: #0, #1 and #3: shoot the puck or pass.
DID WE WIN?
In both approaches, playing against a team using Smed's simple algorithms,
our teams won. The learned policies produced teams that scored more goals
than Smed's teams, and the difference between the goals scored by the
learning teams and those scored by Smed's teams grew as the learning
process continued.
UN-COACHED TEAM
The final score was 360:240, after 12 hours of learning, starting from scratch.
A graph of goals correlated with the elapsed time is shown below.
[Graph: cumulative goals over time, Smed's team vs. the learning team.]
At first we scored 0.6 goals for each goal they scored, but eventually we
scored 1.4 goals for each goal of Smed's team.
[Graph: ratio of our goals to Smed's goals over time.]
COACHED TEAM
Eventually we won 254:151, after 8 hours of learning, starting from scratch.
[Graph: cumulative goals over time, Smed's team vs. the AI coached team.]
[Graph: ratio of our goals to Smed's goals over time.]
APPROACHES COMPARISON
We can see from the results above that the coached approach performed
better, even though the un-coached team trained for longer. This is largely
because the players are synchronized by the coach: they stop when the puck
is passed toward them and do not pass to a player who is offside. We believe
the un-coached team could eventually learn these behaviors as well, but
within our training period it was not as smart as the coached team.
Watching the games after training, we can see that in both approaches the
preferred actions are to go to the puck, shoot it as often as possible, or pass it
to a forward player. Both teams learned to play reasonably and won their
games.
CONCLUSIONS
The challenges involved in handling this highly complex world and adapting it
to reinforcement learning were greater than expected. During our work, more
and more factors that influence the learning were discovered. The abstraction
of the states, definition of appropriate actions for players and choice of
accuracy for q-values all greatly impact the results of the learning. Each
factor was necessary for implementing a working algorithm, so we had to
make uninformed choices according to our intuition.
Thus, the exact influence of each of these factors remained unexplored – it
would demand days of learning with very minute tweaking of these details.
Perhaps, if we had defined a finer abstraction, added more possible actions,
or allowed greater accuracy in the q-values, we could have created a better
learning process. The players would have had a greater pool of choices at
finer time intervals, with each action clearly differentiated from the others.
However, a policy allowing this would have been bulky and would not have
provided good response times. In addition, a larger number of state-action
pairs might have created an unmanageable space to explore, requiring an
unreasonably long learning period to produce a working policy. Clearly, this
implementation opens endless possibilities for exploring the elements of
reinforcement learning; this project shows only one possible implementation.
Still, after viewing the results of our implementation, it clearly yielded better
teams than simple reflex agents. Both approaches (the coached and
individually acting players) showed good performance, outperforming the
simple algorithms provided by Smed. This suggests that reinforcement
learning is very powerful, and further tweaks may improve the performance
even more.