
Human-level control through deep reinforcement learning

Byeong-Sun Hong

2018-12-31

Computer Graphics @ Korea University

Copyright of figures and other materials in the paper belongs to original authors.

Volodymyr Mnih / Google DeepMind / Nature 2015


Index

• Introduction

• Background

▪ Reinforcement Learning

• Related Work

• Key Idea

• Method

• Result

• Conclusion

Introduction


Introduction

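• Reinforcement learning provides a way for an Agent to optimize its behavior in an environment.

• Many reinforcement learning algorithms

▪ Have succeeded only in domains that need human help or domains with simple State information

▪ Optimizing for the complex environments of the Real World remains a hard problem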


Contribution

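• Combining Deep Learning with reinforcement learning

▪ High-dimensional State information can also be interpreted

• Experience Replay (Replay Memory)

▪ Solves the Data Correlation problem

• Fixing the Target-Q Network

▪ Solves the Network instability problem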


Overview

Background


Reinforcement Learning

Taxonomy of Machine Learning

• Machine Learning

Source: Google Images



Environment / Agent

[Figure: Environment and Agent. Source: Google Images]



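Applying and Modeling RL

• Problems where Reinforcement Learning applies

▪ Used to solve sequential decision problems

▪ To solve a sequential decision problem with RL, it must first be modeled mathematically (as an MDP)

• MDP (Markov Decision Process)

▪ Satisfies the Markov property and has the required elements

• Markov property: the future state depends only on the current state and is independent of past states

• Required elements: State, Action, Reward, Policy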
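In symbols, the Markov property can be written as below. This uses standard notation that is assumed here rather than taken from the slide: $s_t$ and $a_t$ are the state and action at step $t$, and $P$ is the transition probability.

```latex
P(s_{t+1} \mid s_t, a_t) = P(s_{t+1} \mid s_1, a_1, \dots, s_t, a_t)
```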



State / Action / Reward

• State

• Action

• Reward

• Discount Factor
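To make the role of the Discount Factor concrete, here is a minimal sketch that computes a discounted sum of rewards. The function name `discounted_return` and the example rewards are illustrative assumptions, not part of the slides.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards, the k-th future reward weighted by gamma**k."""
    total = 0.0
    # Iterate backwards so that G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1 + 0.5 + 0.25 = 1.75
```

A Discount Factor closer to 0 makes the Agent favor immediate rewards, while a value closer to 1 weights distant rewards almost as much as immediate ones.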




Policy

• Policy

▪ The action the Agent should take in every state

• The ultimate goal of reinforcement learning

▪ Find the optimal Policy, the one that yields the highest Reward!




Value Function

• Value Function

▪ The total reward expected to be received from now on

▪ There are two kinds: the State-Value Function and the Action-Value Function

• State-Value Function

• Action-Value Function(Q-Function)

𝑅 = Reward, 𝛾 = Discount factor
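In the usual textbook form, the two value functions can be written as below. This is a sketch under assumed notation: $R_{t+k+1}$ is the reward received $k+1$ steps after time $t$, $\pi$ is the Policy, and $\gamma$ is the Discount factor from the legend.

```latex
V_\pi(s)    = \mathbb{E}_\pi\!\left[ \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s \right]
Q_\pi(s, a) = \mathbb{E}_\pi\!\left[ \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s,\ A_t = a \right]
```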



RL Road Map

[Figure: RL road map, with labels On-Policy, Off-Policy, and Q-Network]



Model-Based vs Model-Free

• Planning / Learning

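▪ Model-Based = when the Model of the environment is known exactly, the problem is solved by computation -> Planning – State-Value Function

▪ Model-Free = the Model of the environment is unknown, and the problem is solved by learning through interaction -> Learning – Action-Value Function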




Model-Free RL

• Monte-Carlo Control

▪ Update after running one Episode to its end

• Temporal-Difference Control

▪ Update after a single Step
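The difference in update timing can be sketched as follows. For brevity the sketch updates state values stored in a dictionary; the control variants named on the slide apply the same idea to Q-values. The names `mc_update` and `td_update` and the step size `alpha` are illustrative assumptions.

```python
def mc_update(V, episode, alpha, gamma):
    """Monte-Carlo: wait for the episode to end, then move each visited
    state's value towards the return actually observed from it."""
    G = 0.0
    for state, reward in reversed(episode):  # episode = [(state, reward), ...]
        G = reward + gamma * G
        V[state] += alpha * (G - V[state])


def td_update(V, state, reward, next_state, alpha, gamma):
    """Temporal-Difference: update right after one step, bootstrapping
    from the current estimate of the next state's value."""
    target = reward + gamma * V[next_state]
    V[state] += alpha * (target - V[state])


V = {"A": 0.0, "B": 0.0, "end": 0.0}
td_update(V, "B", 1.0, "end", alpha=0.5, gamma=0.9)           # after one step
mc_update(V, [("A", 0.0), ("B", 1.0)], alpha=0.5, gamma=0.9)  # after the episode
```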




On-Policy vs Off-Policy

• On-Policy: the Policy the Agent is currently acting with and the Policy being improved are the same

▪ SARSA

• Off-Policy: the Policy the Agent is currently acting with and the Policy being improved are separate

▪ Q-learning
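The two update rules differ only in which next-state Q-value enters the target. A minimal tabular sketch, assuming `Q[state][action]` is a nested list and that `alpha` (step size) and `gamma` (Discount factor) are supplied by the caller:

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    """On-policy: the target uses a_next, the action the behaviour
    policy actually takes in s_next."""
    target = r + gamma * Q[s_next][a_next]
    Q[s][a] += alpha * (target - Q[s][a])


def q_learning_update(Q, s, a, r, s_next, alpha, gamma):
    """Off-policy: the target uses the greedy (max) action in s_next,
    whatever the behaviour policy does next."""
    target = r + gamma * max(Q[s_next])
    Q[s][a] += alpha * (target - Q[s][a])


# Two states, two actions; Q[state][action].
Q1 = [[0.0, 0.0], [1.0, 5.0]]
Q2 = [[0.0, 0.0], [1.0, 5.0]]
sarsa_update(Q1, s=0, a=0, r=0.0, s_next=1, a_next=0, alpha=1.0, gamma=1.0)
q_learning_update(Q2, s=0, a=0, r=0.0, s_next=1, alpha=1.0, gamma=1.0)
print(Q1[0][0], Q2[0][0])  # 1.0 5.0
```

With the same transition, SARSA learns from the action that was actually chosen next (value 1.0), while Q-learning learns from the best available action (value 5.0).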


[Figure: SARSA and Q-Learning]


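• The methods so far can only be used in very simple environments such as Gridworld

▪ As the environment grows more complex, every Q-value has to be updated individually (far too many)

▪ Instead of updating the Q-values directly, a Parameter w is introduced and w is updated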
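Updating w instead of individual Q-values can be sketched with a linear approximator. This is an illustration under assumptions: the linear form, the feature vector `phi`, and the names `q_value` and `update_w` are hypothetical and stand in for whatever approximator is used.

```python
def q_value(w, features):
    """Approximate Q(s, a) as a dot product between weights and features."""
    return sum(wi * xi for wi, xi in zip(w, features))


def update_w(w, features, target, alpha):
    """One gradient step on 0.5 * (target - Q)^2 with respect to w,
    treating the target as a constant. For a linear Q, the gradient of Q
    with respect to w is the feature vector itself."""
    error = target - q_value(w, features)
    return [wi + alpha * error * xi for wi, xi in zip(w, features)]


w = [0.0, 0.0]
phi = [1.0, 2.0]  # hypothetical feature vector for one (state, action) pair
w = update_w(w, phi, target=1.0, alpha=0.1)
print(w)  # [0.1, 0.2]
```

Because w is shared across all state-action pairs, one update changes the estimate for every pair with overlapping features, which is what makes large state spaces tractable.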


Q-Network


[Figure: SARSA, Q-Learning, and Q-Network]

Related Work


Related Work


Temporal Difference Learning and TD-Gammon

[Gerald Tesauro / Communications of the ACM 1995]

Reinforcement learning for robot soccer

[Martin Riedmiller et al. / Autonomous Robots 2009]

An Object-Oriented Representation for Efficient Reinforcement Learning

[Carlos Diuk et al. / ICML 2008]


Related Work

Atari Game Learning

• The Arcade Learning Environment: An Evaluation Platform for General Agents

▪ [Marc G. Bellemare (Univ. of Alberta) et al. / JAIR 2013]


Related Work

Convolutional Neural Network

• ImageNet Classification with Deep Convolutional Neural Networks

▪ [Krizhevsky A. (Univ. of Toronto) / ILSVRC 2012]

Key Idea


Previous Limitation_1

• Correlation between samples


Key Idea_1

• Experience Replay(Replay Memory)

[Figure: Replay Memory storing (state, action, reward, next_state) tuples, from which samples are drawn for the Update]
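A Replay Memory of this kind can be sketched as a fixed-size buffer with uniform random sampling. The class name `ReplayMemory` and its methods are illustrative assumptions, not the authors' code.

```python
import random
from collections import deque


class ReplayMemory:
    """Fixed-size buffer of (state, action, reward, next_state) tuples."""

    def __init__(self, capacity):
        # A deque with maxlen drops the oldest transition when full.
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # Uniform sampling without replacement breaks up the temporal
        # correlation between consecutive transitions.
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


memory = ReplayMemory(capacity=3)
for t in range(5):
    memory.push(t, 0, 0.0, t + 1)
batch = memory.sample(2)
```

Here only the three most recent transitions survive, and each Update draws from them at random instead of using the latest transition directly.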


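Previous Limitation_2

• Non-stationary Target

▪ During an Update the network serves as its own Target, which makes learning very unstable

[Figure: Main Network, with Loss computation and Weight Update]

Target: the sum of the Reward and the highest Q-value in the next State

Prediction: the Q-value produced by the current Weight

Loss = $\left( r + \gamma \max_{a'} Q(s', a'; \theta_i) - Q(s, a; \theta_i) \right)^2$

𝑟 = Reward, 𝛾 = Discount factor

𝑠 = State, 𝑎 = Action (𝑠′, 𝑎′ = next State and next Action)

𝜃 = Weight, 𝑖 = Iteration

𝑄 = Q-value Function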


Key Idea_2

• Fixed Q-target

[Figure: the Main Network outputs the Q-value from its Weight, the Loss is computed and the Main Network's Weight is updated; the Weight is copied to the Target Network at regular intervals]
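The Fixed Q-target idea can be sketched with a small table standing in for the network weights. Everything named here (`td_target`, `train_step`, `SYNC_EVERY`, and the toy transition) is a hypothetical illustration; a real implementation would copy neural-network weights instead of a table.

```python
import copy


def td_target(q_target, reward, next_state, gamma):
    """Target value computed from the frozen copy, not the main network."""
    return reward + gamma * max(q_target[next_state])


def train_step(q_main, q_target, transition, alpha, gamma):
    """Move the main estimate towards the target; q_target is left untouched."""
    s, a, r, s_next = transition
    error = td_target(q_target, r, s_next, gamma) - q_main[s][a]
    q_main[s][a] += alpha * error


SYNC_EVERY = 3                     # hypothetical sync interval
q_main = [[0.0, 0.0], [0.0, 0.0]]  # table standing in for network weights
q_target = copy.deepcopy(q_main)

for step in range(1, 7):
    train_step(q_main, q_target, (0, 1, 1.0, 0), alpha=0.5, gamma=0.9)
    if step % SYNC_EVERY == 0:
        q_target = copy.deepcopy(q_main)  # copy weights at a fixed interval
```

Between syncs the target stays constant, so the Main Network chases a fixed value rather than one that moves with every Weight Update.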






Key Idea_3

• Convolutional Neural Network

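▪ Advances in Deep Learning made it possible to take in and analyze large Input Data

▪ Unlike earlier work, which received a low-level State, the Pixels of the whole screen can now be used as the State

[Figure: network diagram with three Conv. layers and an Action output]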

Method


Method

Data Pre-Processing (1/2)

210x160 pixels, 128 colors -> 84x84 pixels, grayscale
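A minimal sketch of this reduction, in pure Python for readability. The slide gives only the input and output sizes, so the luminance weights and the nearest-neighbour resize used here are assumptions, and `to_gray`, `resize_nearest`, and `preprocess` are hypothetical names.

```python
def to_gray(frame):
    """Convert rows of (r, g, b) pixels to luminance values in [0, 255]."""
    return [[0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row]
            for row in frame]


def resize_nearest(image, out_h, out_w):
    """Nearest-neighbour resize of a 2-D list."""
    in_h, in_w = len(image), len(image[0])
    return [[image[y * in_h // out_h][x * in_w // out_w]
             for x in range(out_w)]
            for y in range(out_h)]


def preprocess(frame):
    return resize_nearest(to_gray(frame), 84, 84)


# A dummy all-white RGB frame with 210 rows and 160 columns.
frame = [[(255, 255, 255)] * 160 for _ in range(210)]
small = preprocess(frame)
print(len(small), len(small[0]))  # 84 84
```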


Method

Data Pre-Processing (2/2)

Frames: s1, s2, s3, s4, s5, ...

State_1 = {s1, s2, s3, s4}

State_2 = {s2, s3, s4, s5}

State_3 = {s3, s4, s5, s6}
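The sliding window over frames can be sketched with a bounded queue. The class name `FrameStack` is an illustrative assumption, and strings stand in for the preprocessed frames.

```python
from collections import deque


class FrameStack:
    """Keeps the k most recent frames and exposes them as one state."""

    def __init__(self, k=4):
        self.frames = deque(maxlen=k)

    def push(self, frame):
        self.frames.append(frame)  # the oldest frame drops out when full
        return tuple(self.frames)


stack = FrameStack(k=4)
states = [stack.push(f) for f in ["s1", "s2", "s3", "s4", "s5", "s6"]]
print(states[3])  # ('s1', 's2', 's3', 's4')
print(states[4])  # ('s2', 's3', 's4', 's5')
print(states[5])  # ('s3', 's4', 's5', 's6')
```

Stacking several frames lets the network see motion, such as the direction of a ball, which a single frame cannot convey.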


Method

Model architecture
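As a rough check on the architecture, the sketch below computes the feature-map size after each convolution. The layer sizes (32 filters of 8x8 with stride 4, 64 of 4x4 with stride 2, 64 of 3x3 with stride 1, applied to an 84x84x4 input) come from the Nature paper rather than from this slide, so treat them as an assumption here.

```python
def conv_out(size, kernel, stride):
    """Output width of a convolution without padding."""
    return (size - kernel) // stride + 1


# (filters, kernel, stride) per convolutional layer.
LAYERS = [(32, 8, 4), (64, 4, 2), (64, 3, 1)]

size, channels = 84, 4  # 84x84 input, 4 stacked frames
for filters, kernel, stride in LAYERS:
    size, channels = conv_out(size, kernel, stride), filters
    print(f"{size}x{size}x{channels}")  # 20x20x32, then 9x9x64, then 7x7x64
flat = size * size * channels  # input width of the first fully connected layer
print(flat)  # 3136
```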


Method

Algorithm

• Model free

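▪ Uses the Action-Value Function

• Off-policy

▪ Uses the Q-Learning Algorithm

• Uses the 𝜀-greedy method

▪ 𝜀 is set to a number between 0 and 1

▪ Draw a Random number between 0 and 1; if it is greater than 𝜀, select the Action with the Max Q-value, and if it is smaller, select a Random Action

• Loss Function

Loss = $\left( r + \gamma \max_{a'} Q(s', a'; \theta_i^-) - Q(s, a; \theta_i) \right)^2$

𝑟 = Reward, 𝛾 = Discount factor

𝑠 = State, 𝑎 = Action (𝑠′, 𝑎′ = next State and next Action)

𝜃 = Weight (𝜃⁻ = Weight of the fixed Target Network), 𝑖 = Iteration

𝑄 = Q-value Function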
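The 𝜀-greedy selection rule can be sketched in a few lines. The function name `epsilon_greedy` and the optional `rng` argument are illustrative assumptions; the comparison follows the rule stated on this slide.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick the max-Q action when the random draw exceeds epsilon,
    otherwise pick a uniformly random action."""
    if rng.random() > epsilon:
        return max(range(len(q_values)), key=lambda a: q_values[a])
    return rng.randrange(len(q_values))


q_values = [0.1, 0.7, 0.2]
action = epsilon_greedy(q_values, epsilon=0.1)  # usually 1, occasionally random
```

A larger 𝜀 means more exploration; a smaller 𝜀 means the Agent mostly exploits its current Q-value estimates.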


Method

Training Details (1/3)

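• All 49 Atari 2600 Games were trained with the same Network Architecture.

• Total training length: 50 million Frames (about 38 days of game experience)

• For games with multiple lives, losing the last life marks the end of an Episode

• Hyperparameter and Optimization Parameter values were found using only 5 games (Pong, Breakout, Seaquest, Space Invaders, Beam Rider)

▪ Tuning on every game would require too much computation

▪ For the other games the values are fixed as listed in Extended Data Table 1.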


Method


• Optimization (Gradient Descent, using RMSProp)

• Policy = 𝜀-greedy

▪ 𝜀 is reduced from 1.0 to 0.1 over the first 1 million Frames, then fixed at 0.1

• Replay Memory = stores the most recent 1 million Frames

• Reward Clipping

▪ So that different games can be trained with the same Model

• Positive = +1 / Negative = -1 / Else = 0

𝜃 = Weight, 𝐽(𝜃𝑡) = Loss, 𝜂 = Learning rate, 𝜀 = constant that keeps the denominator from becoming 0
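The 𝜀 schedule and Reward Clipping can be sketched as two small functions. The slide states only the start value, end value, and duration of the schedule, so the linear shape is an assumption here (it matches the paper's description), and `epsilon_at` and `clip_reward` are hypothetical names.

```python
def epsilon_at(frame, start=1.0, end=0.1, anneal_frames=1_000_000):
    """Linearly anneal epsilon from start to end, then hold it at end."""
    fraction = min(frame / anneal_frames, 1.0)
    return start + fraction * (end - start)


def clip_reward(reward):
    """Map any positive reward to +1, any negative reward to -1, zero to 0."""
    return (reward > 0) - (reward < 0)


print(epsilon_at(0), epsilon_at(500_000), epsilon_at(2_000_000))
print(clip_reward(400), clip_reward(-3), clip_reward(0))  # 1 -1 0
```

Clipping discards the magnitude of a reward, which keeps the scale of the error comparable across games whose raw scores differ by orders of magnitude.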


Method



Method

Training Algorithm for Deep Q-networks
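The overall loop (𝜀-greedy action, store the transition, sample a random minibatch, update towards a target from the fixed copy, sync periodically) can be sketched on a toy problem. This is not the authors' implementation: a Q-table stands in for the deep network, the chain environment `env_step` is hypothetical, and all hyperparameter values are arbitrary choices for illustration.

```python
import copy
import random
from collections import deque

N_STATES, N_ACTIONS = 5, 2  # toy chain: action 1 moves right, action 0 left


def env_step(state, action):
    """Hypothetical toy environment; reward 1 for reaching the right end."""
    nxt = min(state + 1, N_STATES - 1) if action == 1 else max(state - 1, 0)
    done = nxt == N_STATES - 1
    return nxt, (1.0 if done else 0.0), done


def train(episodes=200, gamma=0.9, alpha=0.1, epsilon=0.1,
          batch_size=8, capacity=500, sync_every=20, seed=0):
    rng = random.Random(seed)
    q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]  # "main network"
    q_target = copy.deepcopy(q)                        # "target network"
    memory = deque(maxlen=capacity)                    # replay memory
    step = 0
    for _ in range(episodes):
        state, done, t = 0, False, 0
        while not done and t < 50:
            # epsilon-greedy action selection, ties broken at random
            if rng.random() > epsilon:
                best = max(q[state])
                ties = [a for a in range(N_ACTIONS) if q[state][a] == best]
                action = rng.choice(ties)
            else:
                action = rng.randrange(N_ACTIONS)
            nxt, reward, done = env_step(state, action)
            memory.append((state, action, reward, nxt, done))
            state, t, step = nxt, t + 1, step + 1
            # sample a random minibatch and update towards the fixed target
            if len(memory) >= batch_size:
                for s, a, r, s2, d in rng.sample(list(memory), batch_size):
                    y = r if d else r + gamma * max(q_target[s2])
                    q[s][a] += alpha * (y - q[s][a])
            # every sync_every steps, copy the main weights to the target
            if step % sync_every == 0:
                q_target = copy.deepcopy(q)
    return q


q = train()
```

In the full method the table update is replaced by a gradient step on the squared error between the target and the network's Q-value for the sampled minibatch.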


Method

Evaluation Procedure

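• DQN Agent vs Professional Human Tester

▪ DQN Agent constraints

• 𝜀 is set to 0.05

• Can take an Action at 10 Hz

▪ 10 Hz is roughly the fastest rate at which a person can press a Button

• The result is the average over 30 Episodes of up to 5 minutes of play each

▪ Professional Human Tester constraints

• No pausing, saving, or loading

• Audio output disabled, so that vision is the only sensory input

• After 2 hours of practice, the result is the average over 20 Episodes of up to 5 minutes of play each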

Result


Result

Comparison with Others (1/3)


Result



Result

Best / Worst Agent Game

Worst: Montezuma’s Revenge, Gravitar, Private Eye

Best: Boxing, Breakout, Video Pinball


Result

Value changes during a game


Result

Results over the course of Training

Space Invaders / Seaquest


Result

With vs without Replay and Target Q


Result

DQN vs Linear


Result

t-SNE


Result

t-SNE

Human

DQN


Result

Breakout Video


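Conclusion

• Demonstrated that a single Architecture can adapt to many environments

▪ Trained from only Pixel values and the Score, with the same Algorithm, Network Architecture, and hyperparameters

• Using Experience Replay and fixing the Target-Q stabilized training, producing an Agent that is comparable to or better than humans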


Q & A


Appendix

RMSProp

• RMSprop

▪ One of the Gradient Descent methods

▪ Created to compensate for the shrinking step size of Adagrad

𝜃 = Weight, 𝐽(𝜃𝑡) = Loss, 𝜂 = Learning rate, 𝜀 = constant that keeps the denominator from becoming 0
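A minimal sketch of the basic RMSProp update, using the symbols from the legend (`lr` is 𝜂 and `eps` is 𝜀). The decay rate of 0.9 and the other default values are assumptions for illustration, and this is the textbook form, not necessarily the exact variant used for training.

```python
import math


def rmsprop_step(theta, grad, avg_sq, lr=0.01, decay=0.9, eps=1e-8):
    """One RMSProp update for lists of parameters and gradients.

    avg_sq is a running average of squared gradients. Because old
    gradients decay away, the effective step size does not keep
    shrinking the way it does with Adagrad's ever-growing sum.
    """
    new_avg = [decay * v + (1 - decay) * g * g for v, g in zip(avg_sq, grad)]
    new_theta = [t - lr * g / math.sqrt(v + eps)
                 for t, g, v in zip(theta, grad, new_avg)]
    return new_theta, new_avg


# Minimise J(theta) = theta^2, whose gradient is 2 * theta.
theta, avg_sq = [1.0], [0.0]
for _ in range(200):
    theta, avg_sq = rmsprop_step(theta, [2 * theta[0]], avg_sq)
```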


Appendix

Linear Approximation (1/3)

• BASIC

▪ Uses the NTSC (National Television System Committee) format

▪ Uses 128 colors

• BASS

▪ Uses the SECAM (Séquentiel couleur avec mémoire, sequential color with memory) format

▪ Reduces the palette to 8 colors and computes the interaction between colors


Appendix


• DISCO

▪ Groups the Objects that appear into Classes and checks the interaction between Classes


Appendix


• LSH(Locality Sensitive Hashing)

▪ The Input screen is cut into small pieces and hashed down

• RAM

▪ Uses the RAM contents

▪ Analyzes the 128 Bytes = 1024 bits of values


Appendix

The 5 Games used for Parameter tuning

Beam Rider

Pong

Seaquest

Breakout

Space Invaders
