ALPHAZERO: PLANNING-BASED RL


Page 1

ALPHAZERO: PLANNING-BASED RL

Page 2

ALPHA AND MUZERO

ALPHAGO

Page 3

ALPHA AND MUZERO

GO

▸ ~10^170 board positions

▸ ~10^82 atoms in (observable) universe

▸ Might never be possible to brute-force solve Go

▸ The only ‘reward’ signal: win/loss at end of game!

Page 4

ALPHA AND MUZERO

GO

Naive Solution: Exhaustive Search

Taken from David Silver, 2020

Page 5

ALPHA AND MUZERO

TIMELINE: ALPHA-FAMILY

▸ 2016: AlphaGo

▸ Only plays Go. Beats world champion

▸ 2017: AlphaGo Zero

▸ Removed need to train on human games first

▸ 2018: AlphaZero

▸ Generalized to work on Go, Chess, Shogi, etc.

▸ Beat the world computer chess champion, Stockfish

▸ Stockfish has vast amounts of domain-specific engineering (14,000 LOC)

▸ 2019: MuZero

▸ Learns rules of games

▸ Works for single-agent reinforcement learning (e.g. Atari)

Page 6

ALPHA AND MUZERO

IDEA 1: REDUCE SEARCH SPACE WITH NEURAL NET

▸ s: state of the game (e.g. board position)

▸ p: a probability distribution over possible actions

▸ v: a scalar value. The expected outcome of the game from state s

▸ θ: learnable neural net parameters

(p, v) = f_θ(s)

<latexit sha1_base64="uusdBxvtUJIrgpooKmlkHaaq4Kg=">AAACB3icbVDLSgNBEJyNrxhfUY+CDAYhAQm7EtGLEPDiMUJekF2W2clsMmT2wUxvICy5efFXvHhQxKu/4M2/cZLsQRMLGoqqbrq7vFhwBab5beTW1jc2t/LbhZ3dvf2D4uFRW0WJpKxFIxHJrkcUEzxkLeAgWDeWjASeYB1vdDfzO2MmFY/CJkxi5gRkEHKfUwJacounZTsgMPT8NJ5e4HEF32LsuzYMGZCyqrjFklk158CrxMpICWVouMUvux/RJGAhUEGU6llmDE5KJHAq2LRgJ4rFhI7IgPU0DUnAlJPO/5jic630sR9JXSHgufp7IiWBUpPA052zo9WyNxP/83oJ+DdOysM4ARbSxSI/ERgiPAsF97lkFMREE0Il17diOiSSUNDRFXQI1vLLq6R9WbVq1auHWqnezOLIoxN0hsrIQteoju5RA7UQRY/oGb2iN+PJeDHejY9Fa87IZo7RHxifP2/8l9I=</latexit>

Page 7

IDEA 1: REDUCE SEARCH SPACE WITH NEURAL NET

REDUCE SEARCH WIDTH

Reduce search-width with policy p

Taken from David Silver, 2020

Page 8

REDUCE SEARCH DEPTH

Reduce search-depth with value v

IDEA 1: REDUCE SEARCH SPACE WITH NEURAL NET

Page 9

MAKING A MOVE: MCTS

▸ Before making a real move:

▸ Search: try the most promising moves in its mind, and see which leads to the highest value!

▸ A position that looks good at first glance might lead to checkmate in 5 moves

▸ Obviously needs to know the rules/environment model

▸ Also add some noise to search: explore vs exploit

▸ Known as Monte-Carlo Tree Search (MCTS)

▸ Use the results of this search to create a new, better policy

▸ MCTS can be seen as a policy improvement operator (see the sketch below)

IDEA 1: REDUCE SEARCH SPACE WITH NEURAL NET

<latexit sha1_base64="3fyN1bmWraMoJUgUzkUyEMInfWQ=">AAAB+XicbVDLSsNAFJ3UV62vqEs3g0VwVRKp6LLgxmWFvqAJZTKdtEMnkzBzUyghf+LGhSJu/RN3/o2TNgttPTBwOOde7pkTJIJrcJxvq7K1vbO7V92vHRweHZ/Yp2c9HaeKsi6NRawGAdFMcMm6wEGwQaIYiQLB+sHsofD7c6Y0j2UHFgnzIzKRPOSUgJFGtu1NCWReRGAahFmS5yO77jScJfAmcUtSRyXaI/vLG8c0jZgEKojWQ9dJwM+IAk4Fy2teqllC6IxM2NBQSSKm/WyZPMdXRhnjMFbmScBL9fdGRiKtF1FgJouIet0rxP+8YQrhvZ9xmaTAJF0dClOBIcZFDXjMFaMgFoYQqrjJiumUKELBlFUzJbjrX94kvZuG22zcPjXrrU5ZRxVdoEt0jVx0h1roEbVRF1E0R8/oFb1ZmfVivVsfq9GKVe6coz+wPn8AUdeULA==</latexit>

Page 10


IDEA 2: USE MCTS POLICY FOR TRAINING OBJECTIVE

▸ Reward Sparsity: win/loss/draw at end of game

▸ Can take hundreds of moves to get there!

▸ Solution: train the neural net so that p matches the MCTS policy p̂

▸ Minimize the cross-entropy between the two distributions

<latexit sha1_base64="3fyN1bmWraMoJUgUzkUyEMInfWQ=">AAAB+XicbVDLSsNAFJ3UV62vqEs3g0VwVRKp6LLgxmWFvqAJZTKdtEMnkzBzUyghf+LGhSJu/RN3/o2TNgttPTBwOOde7pkTJIJrcJxvq7K1vbO7V92vHRweHZ/Yp2c9HaeKsi6NRawGAdFMcMm6wEGwQaIYiQLB+sHsofD7c6Y0j2UHFgnzIzKRPOSUgJFGtu1NCWReRGAahFmS5yO77jScJfAmcUtSRyXaI/vLG8c0jZgEKojWQ9dJwM+IAk4Fy2teqllC6IxM2NBQSSKm/WyZPMdXRhnjMFbmScBL9fdGRiKtF1FgJouIet0rxP+8YQrhvZ9xmaTAJF0dClOBIcZFDXjMFaMgFoYQqrjJiumUKELBlFUzJbjrX94kvZuG22zcPjXrrU5ZRxVdoEt0jVx0h1roEbVRF1E0R8/oFb1ZmfVivVsfq9GKVe6coz+wPn8AUdeULA==</latexit>

L_alphazero = (z − v)² − p̂ᵀ log(p)

<latexit sha1_base64="irnyP+tuRR0IglZ4IEmSqiFRYg8=">AAACNnicbVBBSxtBGJ3V2mpaa9RjL0NDIR4Mu6LYiyB46cGCQqJCNoZvJ98mg7M7y8y3Yhz2V3nxd/TmxYOlePUndBIDbbUPBh7vvY/5vpcUSloKw7tgbv7Nwtt3i0u19x+WP67UV9dOrC6NwI7QSpuzBCwqmWOHJCk8KwxClig8TS4OJv7pJRordd6mcYG9DIa5TKUA8lK//j3OgEYClDus+i4mvCIHqhjBNRpdVXvN683LjfMtvsnjEZCbppPUFVV13uax0kPe/KNt9OuNsBVOwV+TaEYabIajfv1HPNCizDAnocDabhQW1HNgSAqFVS0uLRYgLmCIXU9zyND23PTsin/xyoCn2viXE5+qf084yKwdZ4lPTla0L72J+D+vW1L6tedkXpSEuXj+KC0VJ80nHfKBNChIjT0BYaTflYsRGBDkm675EqKXJ78mJ1utaLu1c7zd2G/P6lhkn9hn1mQR22X77Bs7Yh0m2A27Yw/sZ3Ab3Ae/gsfn6Fwwm1ln/yB4+g16Baz9</latexit>

Page 11

IDEA 2: USE MCTS POLICY FOR TRAINING OBJECTIVE

SAME IDEA: EXPERT ITERATION

Taken from: Thinking Fast and Slow with Deep Learning and Tree Search, T. Anthony et al., 2017

<latexit sha1_base64="3fyN1bmWraMoJUgUzkUyEMInfWQ=">AAAB+XicbVDLSsNAFJ3UV62vqEs3g0VwVRKp6LLgxmWFvqAJZTKdtEMnkzBzUyghf+LGhSJu/RN3/o2TNgttPTBwOOde7pkTJIJrcJxvq7K1vbO7V92vHRweHZ/Yp2c9HaeKsi6NRawGAdFMcMm6wEGwQaIYiQLB+sHsofD7c6Y0j2UHFgnzIzKRPOSUgJFGtu1NCWReRGAahFmS5yO77jScJfAmcUtSRyXaI/vLG8c0jZgEKojWQ9dJwM+IAk4Fy2teqllC6IxM2NBQSSKm/WyZPMdXRhnjMFbmScBL9fdGRiKtF1FgJouIet0rxP+8YQrhvZ9xmaTAJF0dClOBIcZFDXjMFaMgFoYQqrjJiumUKELBlFUzJbjrX94kvZuG22zcPjXrrU5ZRxVdoEt0jVx0h1roEbVRF1E0R8/oFb1ZmfVivVsfq9GKVe6coz+wPn8AUdeULA==</latexit>

(MCTS)

p

<latexit sha1_base64="kPLbD4GnT2Gi8Ih87/JES7j6E/s=">AAAB8XicbVDLSgMxFL1TX7W+qi7dBIvgqsxIRZcFNy4r9IXtUDJppg3NZIbkjlCG/oUbF4q49W/c+Tem7Sy09UDgcM695NwTJFIYdN1vp7CxubW9U9wt7e0fHB6Vj0/aJk414y0Wy1h3A2q4FIq3UKDk3URzGgWSd4LJ3dzvPHFtRKyaOE24H9GREqFgFK302I8ojoMwS2aDcsWtuguQdeLlpAI5GoPyV38YszTiCpmkxvQ8N0E/oxoFk3xW6qeGJ5RN6Ij3LFU04sbPFoln5MIqQxLG2j6FZKH+3shoZMw0CuzkPKFZ9ebif14vxfDWz4RKUuSKLT8KU0kwJvPzyVBozlBOLaFMC5uVsDHVlKEtqWRL8FZPXiftq6pXq14/1Cr1Zl5HEc7gHC7Bgxuowz00oAUMFDzDK7w5xnlx3p2P5WjByXdO4Q+czx/5E5Eu</latexit>

Page 12

IDEA 3: SELF-PLAY

▸ Difference with normal RL

▸ Don’t have an agent to play against!

▸ How to evaluate strength?

“Modern learning algorithms are outstanding test-takers: once a problem is packaged into a suitable objective, deep (reinforcement) learning algorithms often find a good solution. However, in many multi-agent domains, the question of what test to take, or what objective to optimize, is not clear... Learning in games is often conservatively formulated as training agents that tie or beat, on average, a fixed set of opponents. However, the dual task, that of generating useful opponents to train and evaluate against, is under-studied. It is not enough to beat the agents you know; it is also important to generate better opponents, which exhibit behaviours that you don’t know.” ~ Balduzzi et al. (2019)

Page 13

IDEA 3: SELF-PLAY

▸ Two types of game:

▸ Transitive: Chess, Go, etc.

▸ Intransitive (or Cyclic): Rock-Paper-Scissors, StarCraft

▸ Transitive games have a special property: if agent A beats agent B and B beats C, then A beats C

▸ Basis of the Elo rating system in Chess
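For reference, the Elo system predicts the expected score of player A against player B from the rating difference alone, a prediction that is only meaningful when strength is (roughly) transitive:

E_A = 1 / (1 + 10^((R_B − R_A) / 400))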

Page 14

IDEA 3: SELF-PLAY

▸ Self-Play Algorithm

▸ Start with neural net agent f1

▸ Play f1 against itself to generate training data

▸ Train f1 on this data using L_alphazero to produce f2

▸ Repeat

▸ Key: fn will beat f1, f2, … due to transitivity!

L_alphazero = (z − v)² − p̂ᵀ log(p)

<latexit sha1_base64="irnyP+tuRR0IglZ4IEmSqiFRYg8=">AAACNnicbVBBSxtBGJ3V2mpaa9RjL0NDIR4Mu6LYiyB46cGCQqJCNoZvJ98mg7M7y8y3Yhz2V3nxd/TmxYOlePUndBIDbbUPBh7vvY/5vpcUSloKw7tgbv7Nwtt3i0u19x+WP67UV9dOrC6NwI7QSpuzBCwqmWOHJCk8KwxClig8TS4OJv7pJRordd6mcYG9DIa5TKUA8lK//j3OgEYClDus+i4mvCIHqhjBNRpdVXvN683LjfMtvsnjEZCbppPUFVV13uax0kPe/KNt9OuNsBVOwV+TaEYabIajfv1HPNCizDAnocDabhQW1HNgSAqFVS0uLRYgLmCIXU9zyND23PTsin/xyoCn2viXE5+qf084yKwdZ4lPTla0L72J+D+vW1L6tedkXpSEuXj+KC0VJ80nHfKBNChIjT0BYaTflYsRGBDkm675EqKXJ78mJ1utaLu1c7zd2G/P6lhkn9hn1mQR22X77Bs7Yh0m2A27Yw/sZ3Ab3Ae/gsfn6Fwwm1ln/yB4+g16Baz9</latexit>

Page 15

IDEA 3: SELF-PLAY

ALPHASTAR FOR INTRANSITIVE GAMES: NEEDS HUMAN GAMES!

• https://deepmind.com/blog/alphastar-mastering-real-time-strategy-game-starcraft-ii/

Page 16


▸ For things like Atari, or almost any real-world RL problem, we don’t know the rules/model:

▸ if I have a state s1 and take action a, what will state s2 be?

▸ MuZero: use RNN to predict next state given previous states + previous actions!

▸ Then use it for planning with MCTS, as in AlphaZero

IDEA 4: A LEARNED MODEL + MCTS = POWER
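Schematically, MuZero learns three functions: a representation function h (past observations to hidden state), a dynamics function g (hidden state plus action to next hidden state and predicted reward), and a prediction function f (hidden state to policy and value). The toy sketch below uses placeholder names and unrolls a single action sequence rather than running a full MCTS, but it shows the key point: once h, g and f are learned, planning never needs to touch the real environment.

```python
def plan_with_learned_model(observations, candidate_actions, h, g, f):
    """Score one candidate action sequence entirely inside the learned model;
    the real environment is never queried during this imagined rollout."""
    s = h(observations)              # representation: encode past observations
    total_reward = 0.0
    for a in candidate_actions:
        s, r = g(s, a)               # dynamics: imagined next state + reward
        total_reward += r
    p, v = f(s)                      # prediction: policy & value at that state
    return total_reward + v          # used to compare candidate sequences
```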

Page 17


IDEA 4: A LEARNED MODEL + MCTS = POWER

[Figure: MuZero reward curve on Pac-Man]

Page 18


ALPHAZERO

▸ Idea 1: Reduce search-space with policy-value neural net

▸ Needs rules/model of environment. Can be learnt (MuZero)

▸ Idea 2: Use the MCTS policy as the training objective

▸ Idea 3: Use self-play to generate training data

▸ Idea 4: MCTS with a learned model of the world works well

Page 19

COMPARISON TO HUMANS

COMPARISON TO HUMANS: DUAL PROCESS THEORY

“System 1 operates automatically and quickly, with little or no effort and no sense of voluntary control. System 2 allocates attention to the effortful mental activities that demand it, including complex computations. The operations of System 2 are often associated with the subjective experience of agency, choice, and concentration.” ~ Daniel Kahneman, Thinking, Fast and Slow

Page 20

COMPARISON TO HUMANS

COMPARISON TO HUMANS: DUAL PROCESS THEORY

Taken from: David Barber, Learning From Scratch by Thinking Fast and Slow with Deep Learning and Tree Search

[Figure: Neural Net alone vs. Neural Net used in MCTS]

Page 21


COMPARISON TO HUMANS

https://deepmind.com/blog/article/alphazero-shedding-new-light-grand-games-chess-shogi-and-go

Page 22

LESSONS

LESSONS FROM IMPLEMENTING

DeepMind OpenSpiel (https://github.com/deepmind/open_spiel): “OpenSpiel is a collection of environments and algorithms for research in general reinforcement learning and search/planning in games.”

Page 23

LESSONS

LESSONS FROM IMPLEMENTING

▸ A simple tic-tac-toe example

▸ Takes around 2 minutes to train on a laptop
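For orientation, the environment loop that such a training script is built on looks roughly like this in OpenSpiel's Python API (random moves here rather than the trained agent):

```python
import random
import pyspiel

# Play one game of tic-tac-toe with uniformly random moves, just to show the
# environment loop that the self-play/training code is built on top of.
game = pyspiel.load_game("tic_tac_toe")
state = game.new_initial_state()
while not state.is_terminal():
    action = random.choice(state.legal_actions())
    state.apply_action(action)
print(state)             # final board
print(state.returns())   # e.g. [1.0, -1.0] if the first player won
```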

Page 24

LESSONS

LESSONS: HYPERPARAMETERS

😭😭😭😭

“The hyperparameters of AlphaGo Zero were tuned by Bayesian optimization.” ~ AlphaZero paper

▸ Luckily, there are only two game-specific hyperparameters

▸ Found reasonable values for tic-tac-toe with some trial and error

Page 25

LESSONS

LESSONS: OPEN SOURCE

▸ Code reviews by real experts

▸ You cannot buy the sort of feedback you can get from DeepMind engineers/researchers!

▸ It's easy to fool yourself that you understand something better than you do

▸ Fortuitous Connections

▸ Looks good for employers

▸ “Research and software engineer experience demonstrated via an internship, contributions to open source, work experience, or coding competitions.” ~ from recent Cape Town AI company job posting

▸ Benefits others!

Page 26

LESSONS

OPENSPIEL: CALL FOR CONTRIBUTIONS

See: https://github.com/deepmind/open_spiel/blob/master/docs/contributing.md

Page 27

BACKPROP IN THE BRAIN

THANKS FOR LISTENING!