Machine Learning and Games Simon M. Lucas Centre for Computational Intelligence University of Essex, UK


Page 1:

Machine Learning and Games

Simon M. Lucas
Centre for Computational Intelligence

University of Essex, UK

Page 2:

Overview

• Games: dynamic, uncertain, open-ended
  – Ready-made test environments
  – A 21-billion-dollar industry: space for more machine learning…
• Agent architectures
  – Where the Computational Intelligence fits
  – Interfacing the neural nets etc.
  – Choice of learning machine (WPC, neural network, N-Tuple systems)
• Training algorithms
  – Evolution / co-evolution
  – TDL
  – Hybrids
• Methodology: strong belief in open competitions

Page 3:

My Angle

• Machine learning
  – How well can systems learn?
  – Given a complex, semi-structured environment
  – With indirect reward schemes

Page 4:

Sample Games

• Car Racing
• Othello
• Ms Pac-Man
  – Demo

Page 5:

Agent Basics

• Two main approaches
  – Action selector
  – State evaluator
• Each of these has strengths and weaknesses
• For any given problem, no hard-and-fast rules
  – Experiment!
• Success or failure can hinge on small details!

Page 6:

Co-evolution

Evolutionary algorithm: rank them using a league

Page 7:

(Co) Evolution v. TDL

• Temporal Difference Learning
  – Often learns much faster
  – But less robust
  – Learns during game-play
  – Uses information readily available (i.e. the current observable game-state)
• Evolution / co-evolution (vanilla form)
  – Information from game result(s)
  – Easier to apply
  – But wasteful
• Both can learn game strategy from scratch

Page 8:

In Pictures…

Page 9:

Simple Example: Mountain Car

• Often used to test TD learning methods
• Accelerate a car to reach the goal at the top of an incline
• Engine force is weaker than gravity (DEMO)

Page 10:

State Value Function

• Actions are applied to the current state to generate a set of future states

• The state value function is used to rate these

• Choose the action that leads to the highest state value

• Discrete set of actions
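Sketched in Java, this state-value scheme loops over the discrete actions, simulates one step for each, and rates the successor states; the `Model` interface, identity value function, and toy numbers below are illustrative assumptions, not code from the talk.

```java
import java.util.function.ToDoubleFunction;

public class StateValueSelector {

    // Hypothetical one-step forward model: apply an action to a state.
    interface Model {
        double apply(double state, double action);
    }

    // Rate each successor state and return the index of the best action.
    static int selectAction(double state, double[] actions, Model model,
                            ToDoubleFunction<Double> value) {
        int best = 0;
        double bestV = Double.NEGATIVE_INFINITY;
        for (int i = 0; i < actions.length; i++) {
            double next = model.apply(state, actions[i]); // simulate one step
            double v = value.applyAsDouble(next);         // state value function
            if (v > bestV) {
                bestV = v;
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Toy example: actions shift the state; value prefers larger states.
        double[] actions = {-1.0, 0.0, +1.0};
        int a = selectAction(0.0, actions, (s, act) -> s + act, s -> s);
        System.out.println(a); // prints 2: the +1.0 action wins
    }
}
```

Note that only the value function is learned here; the action set and forward model are given, which is why this approach needs a discrete set of actions.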

Page 11:

Action Selector

• A decision function selects an output directly, based on the current state of the system

• The action may be a discrete choice or a set of continuous outputs

Page 12:

TDL – State Value Learned

Page 13:

Evolution: Learns Policy, not Value

Page 14:

Example Network Found by NEAT+Q
(Whiteson and Stone, JMLR 2006)

• EvoTDL hybrid
• They used a different input coding
• So results are not directly comparable

Page 15:

~Optimal State Value Policy Function
f = abs(v)

Page 16:

Action Controller

• Directly connect velocity to output

• Simple network!
• One neuron!
• One connection!
• Easy to interpret!

vs

Page 17:

Othello

With Thomas Runarsson, University of Iceland

Page 18:

Volatile Piece Difference


Page 19:

Setup

• Use a weighted piece counter
  – Fast to compute (can play billions of games)
  – Easy to visualise
  – See if we can beat the ‘standard’ weights
• Limit search depth to 1-ply
  – Enables billions of games to be played
  – For a thorough comparison
• Focus on machine learning rather than game-tree search
• Force random moves (with prob. 0.1)
  – Get a more robust evaluation of playing ability

Page 20:

Standard “Heuristic” Weights
(lighter = more advantageous)

Page 21:

CEL Algorithm

• Evolution Strategy (ES)
  – (1, 10) (non-elitist worked best)
• Gaussian mutation
  – Fixed sigma (not adaptive)
  – Fixed works just as well here
• Fitness defined by full round-robin league performance (e.g. 1, 0, -1 for w/d/l)
• Parent-child averaging
  – Defeats the noise inherent in fitness evaluation
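One generation of this (1, 10) scheme can be sketched in Java; the toy fitness function below stands in for the round-robin league, and all names are our own, not code from the talk.

```java
import java.util.Random;

public class CelStep {

    // Stand-in for the round-robin league evaluation of the talk.
    interface Fitness {
        double eval(double[] w);
    }

    // One (1, 10)-ES generation: Gaussian mutation with fixed sigma,
    // non-elitist selection of the best offspring, then parent-child
    // averaging to damp the noise in the fitness evaluation.
    static double[] generation(double[] parent, double sigma, Fitness fit, Random rng) {
        double[] best = null;
        double bestF = Double.NEGATIVE_INFINITY;
        for (int k = 0; k < 10; k++) {                   // 10 offspring
            double[] child = parent.clone();
            for (int i = 0; i < child.length; i++)
                child[i] += sigma * rng.nextGaussian();  // fixed sigma, not adaptive
            double f = fit.eval(child);
            if (f > bestF) {
                bestF = f;
                best = child;
            }
        }
        double[] next = new double[parent.length];
        for (int i = 0; i < next.length; i++)            // parent-child averaging
            next[i] = 0.5 * (parent[i] + best[i]);
        return next;
    }

    public static void main(String[] args) {
        // Toy fitness: prefer weights near 1.0 (placeholder for league play).
        Fitness fit = w -> {
            double s = 0;
            for (double x : w) s -= (x - 1) * (x - 1);
            return s;
        };
        double[] w = generation(new double[4], 0.1, fit, new Random(42));
        System.out.println(w.length); // prints 4: same shape as the parent
    }
}
```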

Page 22:

TDL Algorithm

• Nearly as simple to apply as CEL:

public interface TDLPlayer extends Player {
    void inGameUpdate(double[] prev, double[] next);
    void terminalUpdate(double[] prev, double tg);
}

• Reward signal only given at game end
• Initial alpha and alpha cooling rate tuned empirically

Page 23:

TDL in Java
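As a hedged sketch (ours, not the original slide's code), the two update methods of the `TDLPlayer` interface might look like this for a weighted piece counter with a tanh-squashed value, trained by TD(0); the alpha handling and board encoding are assumptions.

```java
public class WpcTdl {
    // v(s) = tanh(w . s) over a 64-square board encoding.
    double[] w = new double[64];
    double alpha = 0.01; // in the talk, alpha and its cooling rate were tuned

    double value(double[] board) {
        double sum = 0;
        for (int i = 0; i < w.length; i++) sum += w[i] * board[i];
        return Math.tanh(sum);
    }

    // Move v(prev) toward v(next); the (1 - v^2) factor is d/dx tanh.
    void inGameUpdate(double[] prev, double[] next) {
        double vp = value(prev);
        double delta = alpha * (value(next) - vp) * (1 - vp * vp);
        for (int i = 0; i < w.length; i++) w[i] += delta * prev[i];
    }

    // At game end the target tg is the actual result (e.g. +1 / 0 / -1).
    void terminalUpdate(double[] prev, double tg) {
        double vp = value(prev);
        double delta = alpha * (tg - vp) * (1 - vp * vp);
        for (int i = 0; i < w.length; i++) w[i] += delta * prev[i];
    }

    public static void main(String[] args) {
        WpcTdl tdl = new WpcTdl();
        double[] board = new double[64];
        board[0] = 1;                      // a single own piece on square 0
        tdl.terminalUpdate(board, 1.0);    // a win: push v(board) toward +1
        System.out.println(tdl.w[0] > 0);  // prints true: weight moved up
    }
}
```

Because the reward only arrives at the terminal update, the in-game updates propagate it backwards through the game one step at a time over many games.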

Page 24:

CEL (1,10) v. Heuristic

Page 25:

TDL v. Random and Heuristic

Page 26:

TDL + CEL v. Heuristic (1 run)

Page 27:

Can we do better?

• Enforce symmetry
  – This speeds up learning

• Use trusty old friend: N-Tuple System

Page 28:

NTuple Systems

• W. Bledsoe and I. Browning. Pattern recognition and reading by machine. In Proceedings of the EJCC, pages 225–232, December 1959.
• Sample n-tuples of the input space
• Map sampled values to memory indexes
  – Training: adjust the values there
  – Recognition / play: sum over the values
• Superfast
• Related to:
  – Kernel trick of the SVM (non-linear map to a high-dimensional space; then a linear model)
  – Kanerva’s sparse memory model
  – Also similar to Buro’s look-up table

Page 29:

Symmetric N-Tuple Sampling

Page 30:

3-tuple Example

Page 31:

N-Tuple System

• Results used 30 random n-tuples
• Snakes created by a random 6-step walk
  – Duplicate squares deleted
• System typically has around 15000 weights
• Simple training rule:
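The training rule on the slide was shown as an equation; a minimal sketch of one n-tuple with its look-up table and a delta-rule update might look like this (the cell coding and all names are our assumptions, not the original system).

```java
public class NTupleSketch {
    // One n-tuple over a 64-square board. Cells are coded
    // 0 = empty, 1 = own piece, 2 = opponent piece.
    int[] cells;  // board squares this tuple samples
    double[] lut; // one weight per pattern: 3^n entries

    NTupleSketch(int[] cells) {
        this.cells = cells;
        this.lut = new double[(int) Math.pow(3, cells.length)];
    }

    // Map the sampled cell values to a memory index (a base-3 number).
    int index(int[] board) {
        int idx = 0;
        for (int c : cells) idx = idx * 3 + board[c];
        return idx;
    }

    // Play / recognition: just read the addressed weight.
    double value(int[] board) {
        return lut[index(board)];
    }

    // Simple delta rule: nudge the addressed weight toward the target.
    void train(int[] board, double target, double alpha) {
        int idx = index(board);
        lut[idx] += alpha * (target - lut[idx]);
    }

    public static void main(String[] args) {
        NTupleSketch t = new NTupleSketch(new int[]{0, 1, 2});
        int[] board = new int[64];
        board[0] = 1;                        // own piece
        board[1] = 2;                        // opponent piece; board[2] empty
        System.out.println(t.index(board));  // prints 15: 1*9 + 2*3 + 0
        t.train(board, 1.0, 0.5);
        System.out.println(t.value(board));  // prints 0.5 after one step
    }
}
```

A full player would sum the values of many such tuples (30 in the results above), with symmetric sampling sharing each table across the board's eight symmetries.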

Page 32:

NTuple System (TDL)
total games = 1250

Page 33:

Learned strategy…

Page 34:

Web-based League
(snapshot before CEC 2006 Competition)

Page 35:

Results versus CEC 2006 Champion
(a manual EVO / TDL hybrid)

Page 36:

N-Tuple Summary

• Stunning results compared to other game-learning architectures such as the MLP
• How might this hold for other problems?
• How easy are N-Tuples to apply to other domains?

Page 37:

Screen Capture Mode:
Ms Pac-Man Challenge

Page 38:

Robotic Car Racing

Page 39:

Conclusions

• Games are great for CI research
  – Intellectually challenging
  – Fun to work with
• Agent learning for games is still a black art
• Small details can make big differences!
  – Which inputs to use
• Big details also! (N-Tuple versus MLP)
• Grand challenge: how can we design more efficient game learners?
• EvoTDL hybrids are the way forward.

Page 40:

CIG 2008: Perth, WA; http://cigames.org