ALPHAZERO: PLANNING-BASED RL


Page 1

ALPHAZERO: PLANNING-BASED RL

Page 2

ALPHA AND MUZERO

ALPHAGO

Page 3

ALPHA AND MUZERO

GO

▸ ~10^170 board positions

▸ ~10^82 atoms in (observable) universe

▸ Might never be possible to brute-force solve Go

▸ The only ‘reward’ signal: win/loss at end of game!

Page 4

ALPHA AND MUZERO

GO

Naive Solution: Exhaustive Search

Taken from David Silver, 2020

Page 5

ALPHA AND MUZERO

TIMELINE: ALPHA-FAMILY

▸ 2016: AlphaGo

▸ Only plays Go. Beats world champion

▸ 2017: AlphaGo Zero

▸ Removed need to train on human games first

▸ 2018: AlphaZero

▸ Generalized to work on Go, Chess, Shogi, etc.

▸ Beat the world computer chess champion, Stockfish

▸ Stockfish has vast amounts of domain-specific engineering (14,000 LOC)

▸ 2019: MuZero

▸ Learns rules of games

▸ Works for single-agent reinforcement learning (e.g. Atari)

Page 6

ALPHA AND MUZERO

IDEA 1: REDUCE SEARCH SPACE WITH NEURAL NET

▸ s: state of the game (e.g. board position)

▸ p: a probability distribution over possible actions

▸ v: a scalar value. The expected outcome of the game from state s

▸ θ: learnable neural net parameters

(p, v) = f_θ(s)

<latexit sha1_base64="uusdBxvtUJIrgpooKmlkHaaq4Kg=">AAACB3icbVDLSgNBEJyNrxhfUY+CDAYhAQm7EtGLEPDiMUJekF2W2clsMmT2wUxvICy5efFXvHhQxKu/4M2/cZLsQRMLGoqqbrq7vFhwBab5beTW1jc2t/LbhZ3dvf2D4uFRW0WJpKxFIxHJrkcUEzxkLeAgWDeWjASeYB1vdDfzO2MmFY/CJkxi5gRkEHKfUwJacounZTsgMPT8NJ5e4HEF32LsuzYMGZCyqrjFklk158CrxMpICWVouMUvux/RJGAhUEGU6llmDE5KJHAq2LRgJ4rFhI7IgPU0DUnAlJPO/5jic630sR9JXSHgufp7IiWBUpPA052zo9WyNxP/83oJ+DdOysM4ARbSxSI/ERgiPAsF97lkFMREE0Il17diOiSSUNDRFXQI1vLLq6R9WbVq1auHWqnezOLIoxN0hsrIQteoju5RA7UQRY/oGb2iN+PJeDHejY9Fa87IZo7RHxifP2/8l9I=</latexit>

Page 7

IDEA 1: REDUCE SEARCH SPACE WITH NEURAL NET

REDUCE SEARCH WIDTH

Reduce search-width with policy p

Taken from David Silver, 2020

Page 8

REDUCE SEARCH DEPTH

Reduce search-depth with value v

IDEA 1: REDUCE SEARCH SPACE WITH NEURAL NET

Page 9

MAKING A MOVE: MCTS

▸ Before making a real move:

▸ Search: try the most promising moves in its mind, and see which leads to the highest value!

▸ A position that looks good at first glance might lead to checkmate in 5 moves

▸ Obviously needs to know the rules/environment model

▸ Also add some noise to search: explore vs exploit

▸ Known as Monte-Carlo Tree Search (MCTS)

▸ Use the results of this search to create a new, better policy

▸ MCTS can be seen as a policy improvement operator (see the sketch below)

IDEA 1: REDUCE SEARCH SPACE WITH NEURAL NET

<latexit sha1_base64="3fyN1bmWraMoJUgUzkUyEMInfWQ=">AAAB+XicbVDLSsNAFJ3UV62vqEs3g0VwVRKp6LLgxmWFvqAJZTKdtEMnkzBzUyghf+LGhSJu/RN3/o2TNgttPTBwOOde7pkTJIJrcJxvq7K1vbO7V92vHRweHZ/Yp2c9HaeKsi6NRawGAdFMcMm6wEGwQaIYiQLB+sHsofD7c6Y0j2UHFgnzIzKRPOSUgJFGtu1NCWReRGAahFmS5yO77jScJfAmcUtSRyXaI/vLG8c0jZgEKojWQ9dJwM+IAk4Fy2teqllC6IxM2NBQSSKm/WyZPMdXRhnjMFbmScBL9fdGRiKtF1FgJouIet0rxP+8YQrhvZ9xmaTAJF0dClOBIcZFDXjMFaMgFoYQqrjJiumUKELBlFUzJbjrX94kvZuG22zcPjXrrU5ZRxVdoEt0jVx0h1roEbVRF1E0R8/oFb1ZmfVivVsfq9GKVe6coz+wPn8AUdeULA==</latexit>

Page 10


IDEA 2: USE MCTS POLICY FOR TRAINING OBJECTIVE

▸ Reward Sparsity: win/loss/draw at end of game

▸ Can take hundreds of moves to get there!

▸ Solution: train the neural net so that p matches the MCTS policy p̂

▸ Minimize the cross-entropy between the two distributions

<latexit sha1_base64="3fyN1bmWraMoJUgUzkUyEMInfWQ=">AAAB+XicbVDLSsNAFJ3UV62vqEs3g0VwVRKp6LLgxmWFvqAJZTKdtEMnkzBzUyghf+LGhSJu/RN3/o2TNgttPTBwOOde7pkTJIJrcJxvq7K1vbO7V92vHRweHZ/Yp2c9HaeKsi6NRawGAdFMcMm6wEGwQaIYiQLB+sHsofD7c6Y0j2UHFgnzIzKRPOSUgJFGtu1NCWReRGAahFmS5yO77jScJfAmcUtSRyXaI/vLG8c0jZgEKojWQ9dJwM+IAk4Fy2teqllC6IxM2NBQSSKm/WyZPMdXRhnjMFbmScBL9fdGRiKtF1FgJouIet0rxP+8YQrhvZ9xmaTAJF0dClOBIcZFDXjMFaMgFoYQqrjJiumUKELBlFUzJbjrX94kvZuG22zcPjXrrU5ZRxVdoEt0jVx0h1roEbVRF1E0R8/oFb1ZmfVivVsfq9GKVe6coz+wPn8AUdeULA==</latexit>

L_alphazero = (z − v)² − p̂ᵀ log(p)

<latexit sha1_base64="irnyP+tuRR0IglZ4IEmSqiFRYg8=">AAACNnicbVBBSxtBGJ3V2mpaa9RjL0NDIR4Mu6LYiyB46cGCQqJCNoZvJ98mg7M7y8y3Yhz2V3nxd/TmxYOlePUndBIDbbUPBh7vvY/5vpcUSloKw7tgbv7Nwtt3i0u19x+WP67UV9dOrC6NwI7QSpuzBCwqmWOHJCk8KwxClig8TS4OJv7pJRordd6mcYG9DIa5TKUA8lK//j3OgEYClDus+i4mvCIHqhjBNRpdVXvN683LjfMtvsnjEZCbppPUFVV13uax0kPe/KNt9OuNsBVOwV+TaEYabIajfv1HPNCizDAnocDabhQW1HNgSAqFVS0uLRYgLmCIXU9zyND23PTsin/xyoCn2viXE5+qf084yKwdZ4lPTla0L72J+D+vW1L6tedkXpSEuXj+KC0VJ80nHfKBNChIjT0BYaTflYsRGBDkm675EqKXJ78mJ1utaLu1c7zd2G/P6lhkn9hn1mQR22X77Bs7Yh0m2A27Yw/sZ3Ab3Ae/gsfn6Fwwm1ln/yB4+g16Baz9</latexit>

Page 11

IDEA 2: USE MCTS POLICY FOR TRAINING OBJECTIVE

SAME IDEA: EXPERT ITERATION

Taken from: Thinking Fast and Slow with Deep Learning and Tree Search, T. Anthony et al., 2017

<latexit sha1_base64="3fyN1bmWraMoJUgUzkUyEMInfWQ=">AAAB+XicbVDLSsNAFJ3UV62vqEs3g0VwVRKp6LLgxmWFvqAJZTKdtEMnkzBzUyghf+LGhSJu/RN3/o2TNgttPTBwOOde7pkTJIJrcJxvq7K1vbO7V92vHRweHZ/Yp2c9HaeKsi6NRawGAdFMcMm6wEGwQaIYiQLB+sHsofD7c6Y0j2UHFgnzIzKRPOSUgJFGtu1NCWReRGAahFmS5yO77jScJfAmcUtSRyXaI/vLG8c0jZgEKojWQ9dJwM+IAk4Fy2teqllC6IxM2NBQSSKm/WyZPMdXRhnjMFbmScBL9fdGRiKtF1FgJouIet0rxP+8YQrhvZ9xmaTAJF0dClOBIcZFDXjMFaMgFoYQqrjJiumUKELBlFUzJbjrX94kvZuG22zcPjXrrU5ZRxVdoEt0jVx0h1roEbVRF1E0R8/oFb1ZmfVivVsfq9GKVe6coz+wPn8AUdeULA==</latexit>

(MCTS)

p

<latexit sha1_base64="kPLbD4GnT2Gi8Ih87/JES7j6E/s=">AAAB8XicbVDLSgMxFL1TX7W+qi7dBIvgqsxIRZcFNy4r9IXtUDJppg3NZIbkjlCG/oUbF4q49W/c+Tem7Sy09UDgcM695NwTJFIYdN1vp7CxubW9U9wt7e0fHB6Vj0/aJk414y0Wy1h3A2q4FIq3UKDk3URzGgWSd4LJ3dzvPHFtRKyaOE24H9GREqFgFK302I8ojoMwS2aDcsWtuguQdeLlpAI5GoPyV38YszTiCpmkxvQ8N0E/oxoFk3xW6qeGJ5RN6Ij3LFU04sbPFoln5MIqQxLG2j6FZKH+3shoZMw0CuzkPKFZ9ebif14vxfDWz4RKUuSKLT8KU0kwJvPzyVBozlBOLaFMC5uVsDHVlKEtqWRL8FZPXiftq6pXq14/1Cr1Zl5HEc7gHC7Bgxuowz00oAUMFDzDK7w5xnlx3p2P5WjByXdO4Q+czx/5E5Eu</latexit>

Page 12

IDEA 3: SELF-PLAY

▸ Difference with normal RL

▸ Don’t have an agent to play against!

▸ How to evaluate strength?

“Modern learning algorithms are outstanding test-takers: once a problem is packaged into a suitable objective, deep (reinforcement) learning algorithms often find a good solution. However, in many multi-agent domains, the question of what test to take, or what objective to optimize, is not clear... Learning in games is often conservatively formulated as training agents that tie or beat, on average, a fixed set of opponents. However, the dual task, that of generating useful opponents to train and evaluate against, is under-studied. It is not enough to beat the agents you know; it is also important to generate better opponents, which exhibit behaviours that you don’t know.” ~ Balduzzi et al. (2019)

Page 13

IDEA 3: SELF-PLAY

▸ Two types of game:

▸ Transitive: Chess, Go, etc.

▸ Intransitive (or Cyclic): Rock-Paper-Scissors, StarCraft

▸ Transitive games have a special property: if agent A beats agent B and B beats C, then A beats C

▸ Basis of the Elo rating system in Chess
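For reference, the Elo system predicts the expected score of player A against player B from the rating difference alone, a prediction that is only meaningful when strength is (roughly) transitive:

E_A = 1 / (1 + 10^((R_B − R_A) / 400))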

Page 14

IDEA 3: SELF-PLAY

▸ Self-Play Algorithm

▸ Start with neural net agent f1

▸ Play f1 against itself to generate training data

▸ Train f1 on this data using L_alphazero to produce f2

▸ Repeat

▸ Key: fn will beat f1, f2, … due to transitivity!

L_alphazero = (z − v)² − p̂ᵀ log(p)

<latexit sha1_base64="irnyP+tuRR0IglZ4IEmSqiFRYg8=">AAACNnicbVBBSxtBGJ3V2mpaa9RjL0NDIR4Mu6LYiyB46cGCQqJCNoZvJ98mg7M7y8y3Yhz2V3nxd/TmxYOlePUndBIDbbUPBh7vvY/5vpcUSloKw7tgbv7Nwtt3i0u19x+WP67UV9dOrC6NwI7QSpuzBCwqmWOHJCk8KwxClig8TS4OJv7pJRordd6mcYG9DIa5TKUA8lK//j3OgEYClDus+i4mvCIHqhjBNRpdVXvN683LjfMtvsnjEZCbppPUFVV13uax0kPe/KNt9OuNsBVOwV+TaEYabIajfv1HPNCizDAnocDabhQW1HNgSAqFVS0uLRYgLmCIXU9zyND23PTsin/xyoCn2viXE5+qf084yKwdZ4lPTla0L72J+D+vW1L6tedkXpSEuXj+KC0VJ80nHfKBNChIjT0BYaTflYsRGBDkm675EqKXJ78mJ1utaLu1c7zd2G/P6lhkn9hn1mQR22X77Bs7Yh0m2A27Yw/sZ3Ab3Ae/gsfn6Fwwm1ln/yB4+g16Baz9</latexit>

Page 15

IDEA 3: SELF-PLAY

ALPHASTAR FOR INTRANSITIVE GAMES: NEEDS HUMAN GAMES!

• https://deepmind.com/blog/alphastar-mastering-real-time-strategy-game-starcraft-ii/

Page 16


▸ For things like Atari, or almost any real-world RL problem, we don’t know the rules/model:

▸ if I have a state s1 and take action a, what will state s2 be?

▸ MuZero: use RNN to predict next state given previous states + previous actions!

▸ Then use it for planning with MCTS, as in AlphaZero

IDEA 4: A LEARNED MODEL + MCTS = POWER
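Schematically, MuZero learns three functions: a representation function h (past observations to hidden state), a dynamics function g (hidden state plus action to next hidden state and predicted reward), and a prediction function f (hidden state to policy and value). The toy sketch below uses placeholder names and unrolls a single action sequence rather than running a full MCTS, but it shows the key point: once h, g and f are learned, planning never needs to touch the real environment.

```python
def plan_with_learned_model(observations, candidate_actions, h, g, f):
    """Score one candidate action sequence entirely inside the learned model;
    the real environment is never queried during this imagined rollout."""
    s = h(observations)              # representation: encode past observations
    total_reward = 0.0
    for a in candidate_actions:
        s, r = g(s, a)               # dynamics: imagined next state + reward
        total_reward += r
    p, v = f(s)                      # prediction: policy & value at that state
    return total_reward + v          # used to compare candidate sequences
```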

Page 17


IDEA 4: A LEARNED MODEL + MCTS = POWER

[Figure: MuZero reward curve on Pac-Man]

Page 18


ALPHAZERO

▸ Idea 1: Reduce search-space with policy-value neural net

▸ Needs rules/model of environment. Can be learnt (MuZero)

▸ Idea 2: Use the MCTS policy as the training objective

▸ Idea 3: Use self-play to generate training data

▸ Idea 4: MCTS with a learned model of the world works well

Page 19

COMPARISON TO HUMANS

COMPARISON TO HUMANS: DUAL PROCESS THEORY

“System 1 operates automatically and quickly, with little or no effort and no sense of voluntary control. System 2 allocates attention to the effortful mental activities that demand it, including complex computations. The operations of System 2 are often associated with the subjective experience of agency, choice, and concentration.” ~ Daniel Kahneman, Thinking, Fast and Slow

Page 20

COMPARISON TO HUMANS

COMPARISON TO HUMANS: DUAL PROCESS THEORY

Taken from: David Barber, Learning From Scratch by Thinking Fast and Slow with Deep Learning and Tree Search

[Figure: Neural Net alone vs. Neural Net used in MCTS]

Page 21


COMPARISON TO HUMANS

https://deepmind.com/blog/article/alphazero-shedding-new-light-grand-games-chess-shogi-and-go

Page 22

LESSONS

LESSONS FROM IMPLEMENTING

DeepMind OpenSpiel (https://github.com/deepmind/open_spiel): “OpenSpiel is a collection of environments and algorithms for research in general reinforcement learning and search/planning in games.”

Page 23

LESSONS

LESSONS FROM IMPLEMENTING

▸ A simple tic-tac-toe example

▸ Takes around 2 minutes to train on a laptop
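For orientation, the environment loop that such a training script is built on looks roughly like this in OpenSpiel's Python API (random moves here rather than the trained agent):

```python
import random
import pyspiel

# Play one game of tic-tac-toe with uniformly random moves, just to show the
# environment loop that the self-play/training code is built on top of.
game = pyspiel.load_game("tic_tac_toe")
state = game.new_initial_state()
while not state.is_terminal():
    action = random.choice(state.legal_actions())
    state.apply_action(action)
print(state)             # final board
print(state.returns())   # e.g. [1.0, -1.0] if the first player won
```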

Page 24

LESSONS

LESSONS: HYPERPARAMETERS

😭😭😭😭

“The hyperparameters of AlphaGo Zero were tuned by Bayesian optimization.” ~ AlphaZero paper

▸ Luckily, there are only two game-specific hyperparameters

▸ Found reasonable values for tic-tac-toe with some trial and error

Page 25

LESSONS

LESSONS: OPEN SOURCE

▸ Code reviews by real experts

▸ You cannot buy the sort of feedback you can get from DeepMind engineers/researchers!

▸ It's easy to fool yourself that you understand something better than you do

▸ Fortuitous Connections

▸ Looks good for employers

▸ “Research and software engineer experience demonstrated via an internship, contributions to open source, work experience, or coding competitions.” ~ from recent Cape Town AI company job posting

▸ Benefits others!

Page 26

LESSONS

OPENSPIEL: CALL FOR CONTRIBUTIONS

See: https://github.com/deepmind/open_spiel/blob/master/docs/contributing.md

Page 27

BACKPROP IN THE BRAIN

THANKS FOR LISTENING!