TTIC 31230, Fundamentals of Deep Learning
David McAllester, Winter 2020
AlphaZero Background Algorithms
AlphaGo Fan (October 2015)
AlphaGo Defeats Fan Hui, European Go Champion.
AlphaGo Lee (March 2016)
AlphaGo Zero vs. AlphaGo Lee (April 2017)
AlphaGo Lee:
• Trained on both human games and self play.
• Trained for months.
• Run on many machines with 48 TPUs for the Lee Sedol match.
AlphaGo Zero:
• Trained on self play only.
• Trained for 3 days.
• Run on one machine with 4 TPUs.
• Defeated AlphaGo Lee under match conditions 100 to 0.
AlphaZero Defeats Stockfish in Chess (December 2017)
AlphaGo Zero was a fundamental algorithmic advance for general RL.
The general RL algorithm of AlphaZero is essentially the same as that of AlphaGo Zero.
Some Background
Monte-Carlo Tree Search (MCTS)
Brügmann (1993)
To estimate the value of a position (who is ahead and by how much), run a cheap stochastic policy to generate a sequence of moves (a rollout) and see who wins.
Select the move with the best rollout value.
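The rollout idea above can be sketched in a few lines. This is a hypothetical toy example, not Go: a take-1-or-2 Nim game in which whoever takes the last stone wins, with a uniformly random policy standing in for the "cheap stochastic policy". All names (`rollout_value`, `best_rollout_move`) are invented for illustration.

```python
import random

def rollout_value(stones, player, playouts=200, seed=0):
    """Estimate a position's value by playing random games to the end.

    Toy take-1-or-2 Nim: players alternate removing 1 or 2 stones, and
    whoever takes the last stone wins.  Returns the fraction of
    `playouts` random games won by `player` (0 or 1).
    """
    if stones == 0:
        return 0.0  # the player to move has already lost
    rng = random.Random(seed)
    wins = 0
    for _ in range(playouts):
        s, p = stones, player
        while True:
            s -= rng.choice([1, 2]) if s >= 2 else 1
            if s == 0:
                wins += (p == player)  # the mover who emptied the pile wins
                break
            p = 1 - p
    return wins / playouts

def best_rollout_move(stones, player):
    """Select the move with the best rollout value, as on the slide."""
    moves = [m for m in (1, 2) if m <= stones]
    # After our move the opponent is to move, so our value is 1 - theirs.
    return max(moves, key=lambda m: 1 - rollout_value(stones - m, 1 - player))
```

From 4 stones, taking 1 leaves the opponent in the losing 3-stone position, and the rollout statistics are enough to detect this.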
(One Armed) Bandit Problems
Robbins (1952)
Consider a set of choices (different slot machines). Each choice gets a stochastic reward.
We can select a choice and get a reward as often as we like.
We would like to determine which choice is best and also to get reward as quickly as possible.
The Upper Confidence Bound (UCB) Algorithm
Lai and Robbins (1985)
For each action choice (bandit) a, construct a confidence interval for its average reward based on n(a) trials of that action.
µ(a) ∈ µ̂(a)± 2σ(a)/√n(a)
Always select

argmax_a µ̂(a) + 2σ(a)/√n(a)
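The selection rule above can be written directly from the formula. A minimal sketch, assuming Bernoulli slot machines and using the empirical mean and standard deviation per arm; the names (`ucb_index`, `ucb_select`, `history`) and the warm-up rule of trying each arm twice are assumptions of this example, not part of the original algorithm statement.

```python
import math
import random
import statistics

def ucb_index(rewards, c=2.0):
    """Upper confidence bound  mu_hat(a) + c*sigma(a)/sqrt(n(a))  for one
    arm, computed from its list of observed rewards (the slide's bound,
    with c = 2)."""
    n = len(rewards)
    if n < 2:
        return float("inf")  # try each arm twice so sigma can be estimated
    return statistics.mean(rewards) + c * statistics.stdev(rewards) / math.sqrt(n)

def ucb_select(history):
    """history[a] is the reward list for arm a; always select the arm
    whose upper confidence bound is largest."""
    return max(range(len(history)), key=lambda a: ucb_index(history[a]))

# Usage: three hypothetical Bernoulli slot machines.
rng = random.Random(0)
probs = [0.2, 0.5, 0.8]
history = [[] for _ in probs]
for _ in range(300):
    a = ucb_select(history)
    history[a].append(float(rng.random() < probs[a]))
```

Arms with few trials have a wide bound and keep getting explored; as n(a) grows the bound shrinks toward the empirical mean and play concentrates on the best arm.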
The Upper Confidence Tree (UCT) Algorithm
Kocsis and Szepesvári (2006), Gelly and Silver (2007)
The UCT algorithm grows a tree by running “simulations”.
Each simulation descends into the tree to a leaf node, expands that leaf, and returns a value.
In the UCT algorithm each move choice at each position is treated as a bandit problem.
We select the child (bandit) with the highest upper confidence bound as computed from the simulations that selected that child.
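The simulation loop above can be sketched on the same toy take-1-or-2 Nim game used for rollouts (last stone wins). This is a hypothetical minimal illustration, not the AlphaGo implementation: the function names, the exploration constant c = 1.4, and the "return the most-visited move" convention are all assumptions of this example.

```python
import math
import random

def uct_move(stones, simulations=500, c=1.4, seed=0):
    """Minimal UCT on toy take-1-or-2 Nim (whoever takes the last stone
    wins).  Each simulation descends the tree choosing the child with the
    highest upper confidence bound, expands the leaf it reaches, scores it
    with a random rollout, and backs the result up the path."""
    rng = random.Random(seed)
    N = {}  # node -> visit count, where node = (stones_left, player_to_move)
    W = {}  # node -> simulations won by the player to move at that node

    def rollout(s, p):
        while True:
            s -= rng.choice([1, 2]) if s >= 2 else 1
            if s == 0:
                return p          # the mover who took the last stone wins
            p = 1 - p

    def simulate(s, p):
        path = [(s, p)]
        while (s, p) in N and s > 0:          # descend to a leaf
            def ucb(m):
                child = (s - m, 1 - p)
                if child not in N:
                    return float("inf")       # unexplored child first
                q = 1.0 - W[child] / N[child]  # child's value as seen by p
                return q + c * math.sqrt(math.log(N[(s, p)]) / N[child])
            m = max([m for m in (1, 2) if m <= s], key=ucb)
            s, p = s - m, 1 - p
            path.append((s, p))
        # Terminal position: the previous mover won.  Otherwise expand and
        # evaluate the new leaf with a random rollout.
        winner = (1 - p) if s == 0 else rollout(s, p)
        for node in path:                      # back up the result
            N[node] = N.get(node, 0) + 1
            W[node] = W.get(node, 0) + (winner == node[1])

    for _ in range(simulations):
        simulate(stones, 0)
    moves = [m for m in (1, 2) if m <= stones]
    return max(moves, key=lambda m: N.get((stones - m, 1), 0))
```

Note how each position's move choice is exactly a bandit problem: the exploitation term q is the empirical win rate of the move, and the exploration term shrinks as that child accumulates simulations.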
Bootstrapping from Game Tree Search
Veness, Silver, Blair and Uther, NeurIPS 2009
In bootstrapped tree search we compute a min-max value Vmm(s) by tree search with a static evaluator VΦ(s). We then try to fit the static value to the min-max value.
∆Φ = −η∇Φ (VΦ(s) − Vmm(s))²
This is similar to minimizing a Bellman error between VΦ(s) and a rollout estimate of the value of s, but where the rollout estimate is replaced by a min-max tree-search estimate.
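The update above can be worked out by hand for a simple case. A minimal sketch assuming a hypothetical linear static evaluator VΦ(s) = Φ · feats(s) (systems like AlphaZero use a deep network instead); the function name `bootstrap_step` and the feature values are invented for illustration.

```python
def bootstrap_step(phi, feats, v_mm, lr=0.01):
    """One update  Delta(Phi) = -lr * grad_Phi (V_Phi(s) - V_mm(s))^2
    for a linear static evaluator V_Phi(s) = sum_i phi_i * feats_i."""
    v = sum(p * f for p, f in zip(phi, feats))   # V_Phi(s)
    err = v - v_mm                               # V_Phi(s) - V_mm(s)
    # d/dphi_i (err)^2 = 2 * err * feats_i, so the update subtracts
    # lr * 2 * err * feats_i from each parameter.
    return [p - lr * 2.0 * err * f for p, f in zip(phi, feats)]

# Repeated steps fit the static value toward the tree-search value.
phi = [0.0, 0.0]
for _ in range(200):
    phi = bootstrap_step(phi, feats=[1.0, 2.0], v_mm=1.0)
```

Each step moves VΦ(s) a fixed fraction of the way toward the min-max target, so the squared error contracts geometrically for a small enough learning rate.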