Machine Learning
Lecture 11
Deep Learning
RNN & GAN
Dr. Patrick Chan
South China University of Technology, China
Agenda
Recurrent Neural Network
Structure
Backpropagation Through Time
Types
Example: Image Captioning
Generative Adversarial Network
Generative and Discriminative models
Training Process and Loss Function
Problem of Traditional ML
Traditional ML assumes:
The format of every sample is identical
The decision on each sample is independent
Therefore it cannot deal with data that has:
Length variation
Information carried by the sample sequence
Recurrent Neural Network
Sequence learning: learn from and handle sequential data
Application Example:
Language
Video
A Recurrent Neural Network learns the information of the sequence
RNN
Structure
[Figure: a feed-forward network maps x to h to ŷ, with weights only between layers; a recurrent network adds a feedback connection from the hidden layer back to itself. tanh is used as the activation.]
RNN
Structure
[Figure: feed-forward vs. recurrent structure, annotated with weights]
Feed-Forward Neural Network:
$h = \phi(W_{xh}\,x)$
$\hat{y} = \phi(W_{hy}\,h)$
Recurrent Neural Network:
$h^{(t)} = \phi(W_{xh}\,x^{(t)} + W_{hh}\,h^{(t-1)})$
$\hat{y}^{(t)} = \phi(W_{hy}\,h^{(t)})$
RNN
Structure
An RNN is multiple copies of the same network, each passing a message to its successor
$W_{xh}$, $W_{hh}$, and $W_{hy}$ are shared (they do not change across time steps)
[Figure: a looped RNN cell with weights $W_{xh}$, $W_{hh}$, $W_{hy}$. Unfolding the loop gives a chain $h^{(0)} \to h^{(1)} \to h^{(2)} \to \dots \to h^{(t)}$, where step $t$ takes $x^{(t)}$ and $h^{(t-1)}$ and outputs $\hat{y}^{(t)}$, reusing the same weights at every step.]
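To make the weight sharing concrete, here is a minimal NumPy sketch of the unfolded forward pass (the layer sizes, tanh on both layers, and random weights are illustrative assumptions, not values from the lecture):

import numpy as np

rng = np.random.default_rng(0)
D_x, D_h = 4, 3                            # assumed input and hidden sizes
W_xh = 0.1 * rng.normal(size=(D_h, D_x))   # input-to-hidden weights
W_hh = 0.1 * rng.normal(size=(D_h, D_h))   # hidden-to-hidden weights (the loop)
W_hy = 0.1 * rng.normal(size=(1, D_h))     # hidden-to-output weights

def rnn_forward(xs, h):
    # Unfold the loop: the SAME W_xh, W_hh, W_hy are reused at every step
    y_hats = []
    for x in xs:                           # one iteration per time step
        h = np.tanh(W_xh @ x + W_hh @ h)   # h(t) = phi(W_xh x(t) + W_hh h(t-1))
        y_hats.append(np.tanh(W_hy @ h))   # yhat(t) = phi(W_hy h(t))
    return y_hats

xs = [rng.normal(size=D_x) for _ in range(5)]   # a length-5 input sequence
print(rnn_forward(xs, h=np.zeros(D_h)))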
RNN
Example
Character-level language model
Generates one new character at a time by outputting the probability distribution of the next character, given the sequence of previous characters
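A hedged sketch of this sampling loop: the code below draws characters one at a time from a softmax over the RNN output. The four-character vocabulary and the untrained random weights are assumptions, so the generated text is gibberish until the weights are learned:

import numpy as np

rng = np.random.default_rng(1)
vocab = ['h', 'e', 'l', 'o']                 # assumed toy vocabulary
V, D_h = len(vocab), 8
W_xh = 0.1 * rng.normal(size=(D_h, V))
W_hh = 0.1 * rng.normal(size=(D_h, D_h))
W_hy = 0.1 * rng.normal(size=(V, D_h))       # untrained weights, illustration only

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def sample(seed_idx, n):
    # Generate n characters, one at a time, from P(next char | previous chars)
    h, idx, out = np.zeros(D_h), seed_idx, []
    for _ in range(n):
        x = np.zeros(V)
        x[idx] = 1.0                         # one-hot of the previous character
        h = np.tanh(W_xh @ x + W_hh @ h)
        p = softmax(W_hy @ h)                # distribution over the next character
        idx = rng.choice(V, p=p)             # sample the next character
        out.append(vocab[idx])
    return ''.join(out)

print(sample(seed_idx=0, n=10))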
RNN
Different Types
One to One: Feed-Forward Network
One to Many: Image Captioning (image → sequence of words)
Many to One: Sentiment Classification (sequence of words → sentiment)
Many to Many: Translation (sequence of words → sequence of words)
Many to Many: Video Classification at the frame level (frame → class)
RNN
Backpropagation Through Time
Parameters of the RNN ($W_{xh}$, $W_{hh}$, and $W_{hy}$) are shared
Different time steps affect each other
Derivatives are aggregated across time steps
A special backpropagation is therefore needed: Backpropagation Through Time (BPTT)
RNN
Backpropagation Through Time
[Figure: RNN unrolled for three steps, with loss $J^{(t)}$ comparing $\hat{y}^{(t)}$ to the target $y^{(t)}$ at each step]
Using $h^{(t)} = \phi(W_{xh}\,x^{(t)} + W_{hh}\,h^{(t-1)})$ and $\hat{y}^{(t)} = \phi(W_{hy}\,h^{(t)})$:
$\frac{\partial J}{\partial W_{hy}} = \sum_t \frac{\partial J^{(t)}}{\partial W_{hy}}$
$\frac{\partial J^{(t)}}{\partial W_{hy}} = \frac{\partial}{\partial W_{hy}}\,\frac{1}{2}\big(y^{(t)} - \hat{y}^{(t)}\big)^2 = -\big(y^{(t)} - \hat{y}^{(t)}\big)\,\frac{\partial \hat{y}^{(t)}}{\partial W_{hy}}$
$\frac{\partial \hat{y}^{(t)}}{\partial W_{hy}} = \frac{\partial \phi(W_{hy}\,h^{(t)})}{\partial W_{hy}} = \phi'(W_{hy}\,h^{(t)})\,\frac{\partial (W_{hy}\,h^{(t)})}{\partial W_{hy}} = \phi'(W_{hy}\,h^{(t)})\,h^{(t)}$
RNN
Backpropagation Through Time
The same chain rule applies to $W_{hh}$, except for the last factor:
$\frac{\partial J^{(t)}}{\partial W_{hh}} = -\big(y^{(t)} - \hat{y}^{(t)}\big)\,\frac{\partial \hat{y}^{(t)}}{\partial W_{hh}}$
$\frac{\partial \hat{y}^{(t)}}{\partial W_{hh}} = \phi'(W_{hy}\,h^{(t)})\,\frac{\partial (W_{hy}\,h^{(t)})}{\partial W_{hh}} = \phi'(W_{hy}\,h^{(t)})\,W_{hy}\,\frac{\partial h^{(t)}}{\partial W_{hh}}$
RNN
Backpropagation Through Time
The remaining factor $\frac{\partial h^{(t)}}{\partial W_{hh}}$ is recursive, since $h^{(t)}$ depends on $h^{(t-1)}$:
$\frac{\partial h^{(t)}}{\partial W_{hh}} = \frac{\partial \phi(W_{xh}\,x^{(t)} + W_{hh}\,h^{(t-1)})}{\partial W_{hh}} = \phi'(\cdot)\,\Big(h^{(t-1)} + W_{hh}\,\frac{\partial h^{(t-1)}}{\partial W_{hh}}\Big)$
(a recursive function; the product-rule terms involving $\frac{\partial W_{xh}}{\partial W_{hh}}$ and $\frac{\partial x^{(t)}}{\partial W_{hh}}$ vanish, since neither depends on $W_{hh}$)
Base case: $\frac{\partial h^{(1)}}{\partial W_{hh}} = \phi'(\cdot)\,h^{(0)}$
RNN
Backpropagation Through Time
Similarly for $W_{xh}$:
$\frac{\partial J^{(t)}}{\partial W_{xh}} = -\big(y^{(t)} - \hat{y}^{(t)}\big)\,\frac{\partial \hat{y}^{(t)}}{\partial W_{xh}}$
$\frac{\partial \hat{y}^{(t)}}{\partial W_{xh}} = \phi'(W_{hy}\,h^{(t)})\,\frac{\partial (W_{hy}\,h^{(t)})}{\partial W_{xh}} = \phi'(W_{hy}\,h^{(t)})\,W_{hy}\,\frac{\partial h^{(t)}}{\partial W_{xh}}$
RNN
Backpropagation Through Time
$\frac{\partial h^{(t)}}{\partial W_{xh}} = \frac{\partial \phi(W_{xh}\,x^{(t)} + W_{hh}\,h^{(t-1)})}{\partial W_{xh}} = \phi'(\cdot)\,\Big(x^{(t)} + W_{hh}\,\frac{\partial h^{(t-1)}}{\partial W_{xh}}\Big)$ (a recursive function)
Base case: $\frac{\partial h^{(1)}}{\partial W_{xh}} = \phi'(\cdot)\,x^{(1)}$
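A hedged numerical check of the recursion above: the scalar sketch below (the toy sequence, tanh activation, and weight values are all assumptions) accumulates $\partial J/\partial W_{hh}$ with the recursive formula and compares it against a finite-difference estimate:

import numpy as np

phi = np.tanh
dphi = lambda a: 1.0 - np.tanh(a) ** 2
w_xh, w_hh, w_hy = 0.5, 0.8, 1.2               # scalar weights for clarity
xs, ys = [0.3, -0.1, 0.7], [0.2, 0.4, -0.3]    # toy inputs and targets
h0 = 0.1

def forward(w_xh, w_hh, w_hy):
    h, J, hs, pre = h0, 0.0, [h0], []
    for x, y in zip(xs, ys):
        a = w_xh * x + w_hh * h                # pre-activation of h(t)
        h = phi(a)
        J += 0.5 * (y - phi(w_hy * h)) ** 2    # J = sum_t 1/2 (y(t) - yhat(t))^2
        hs.append(h)
        pre.append(a)
    return J, hs, pre

J, hs, pre = forward(w_xh, w_hh, w_hy)
grad, dh = 0.0, 0.0                            # base recursion state: dh(0)/dw_hh = 0
for t, (x, y) in enumerate(zip(xs, ys)):
    # dh(t)/dw_hh = phi'(a(t)) * (h(t-1) + w_hh * dh(t-1)/dw_hh)
    dh = dphi(pre[t]) * (hs[t] + w_hh * dh)
    y_hat = phi(w_hy * hs[t + 1])
    grad += -(y - y_hat) * dphi(w_hy * hs[t + 1]) * w_hy * dh

eps = 1e-6                                     # finite-difference comparison
num = (forward(w_xh, w_hh + eps, w_hy)[0] -
       forward(w_xh, w_hh - eps, w_hy)[0]) / (2 * eps)
print(grad, num)                               # the two values should agree closely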
RNN
Other Structures
RNNs with multiple hidden layers
[Figure: a stacked RNN with several hidden layers, unrolled over time; each layer passes its hidden state both upward to the next layer and forward to the next time step.]
RNN
Other Structures
Bi-directional RNN
Processes the input sequence in both the forward and the reverse direction
Popular in speech recognition
[Figure: a bi-directional RNN, with one chain of hidden states running forward in time and a second chain running backward; both chains feed each output.]
RNN
Example: Image Captioning
[Figure: image captioning with an RNN. Image features initialize the network; given <START> it outputs "straw", then, given "straw", it outputs "hat", and finally <END>.]
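A minimal sketch of the greedy decoding loop this figure depicts. The four-word vocabulary, the image-to-hidden projection W_ih, and the untrained random weights are all hypothetical, so a real model would produce this caption only after training:

import numpy as np

rng = np.random.default_rng(2)
vocab = ['<START>', 'straw', 'hat', '<END>']    # hypothetical toy vocabulary
V, D_h, D_img = len(vocab), 8, 16
W_ih = 0.1 * rng.normal(size=(D_h, D_img))      # image features -> initial hidden state
W_xh = 0.1 * rng.normal(size=(D_h, V))
W_hh = 0.1 * rng.normal(size=(D_h, D_h))
W_hy = 0.1 * rng.normal(size=(V, D_h))

def caption(img_feat, max_len=10):
    # Greedy decoding: start from <START>, emit words until <END>
    h = np.tanh(W_ih @ img_feat)                # condition the RNN on the image
    idx, words = 0, []                          # index 0 is <START>
    for _ in range(max_len):
        x = np.zeros(V)
        x[idx] = 1.0                            # one-hot of the previous word
        h = np.tanh(W_xh @ x + W_hh @ h)
        idx = int(np.argmax(W_hy @ h))          # most likely next word
        if vocab[idx] == '<END>':
            break
        words.append(vocab[idx])
    return ' '.join(words)

print(caption(rng.normal(size=D_img)))          # e.g. 'straw hat' after training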
RNN
Long-Term Dependency
RNNs perform badly when a task requires a long-term dependency
Example:
A language model trying to predict the next word based on the previous ones
I grew up in France… I speak fluent French
"I speak fluent" suggests the next word is a language
To narrow down which language, the context "France" is required, which needs a long-range dependency
RNN
Long Short-Term Memory
[Figure: a standard RNN cell (a single tanh layer) compared with an LSTM cell, which carries a cell state $c^{(t)}$ alongside the hidden state $h^{(t)}$ and uses gates $f^{(t)}$, $i^{(t)}$, $o^{(t)}$ and a candidate state $\tilde{c}^{(t)}$.]
RNN
Long Short-Term Memory
Four key components
Cell state
Forget Gate
Input Gate
Output Gate
[Figure: LSTM cell diagram, repeated with each of the four components highlighted]
http://colah.github.io/posts/2015-08-Understanding-LSTMs/
RNN
Long Short-Term Memory
Cell state
Runs straight down the entire chain with only minor linear interactions
Easy for information to flow along it unchanged
[Figure: the cell state path from $c^{(t-1)}$ to $c^{(t)}$ running straight across the top of the LSTM cell]
RNN
Long Short-Term Memory
Cell state can be modified by gates
Gates: Optionally let information through
Sigmoid function
Range: 0–1
Describes how much of each component should be let through
An LSTM has three gates to manage the cell state
[Figure: a gate is a sigmoid layer followed by a pointwise multiplication]
RNN
Long Short-Term Memory
Forget Gate
Based on $h^{(t-1)}$ and $x^{(t)}$, outputs a number between 0 and 1 for each element of the cell state $c^{(t-1)}$
1: completely keep the value
0: completely remove the value
$f^{(t)} = \sigma\big(W_f\,[h^{(t-1)}, x^{(t)}] + b_f\big)$
Example: the cell state might include the gender of the present subject, so that the correct pronoun (he/she) can be used. When a new subject appears, the previous gender should be forgotten.
RNN
Long Short-Term Memory
Input Gate
Decides what new information is stored in the cell state
tanh: proposes the information for the update as a new candidate cell state $\tilde{c}^{(t)}$
sigmoid: determines how much of the candidate should get involved in the update
$i^{(t)} = \sigma\big(W_i\,[h^{(t-1)}, x^{(t)}] + b_i\big)$
$\tilde{c}^{(t)} = \tanh\big(W_c\,[h^{(t-1)}, x^{(t)}] + b_c\big)$
Example: add the gender of the new subject to the cell state.
RNN
Long Short-Term Memory
$c^{(t-1)}$ is updated to the new cell state $c^{(t)}$
Forgetting old things by multiplying by $f^{(t)}$
Learning new things by adding $i^{(t)} * \tilde{c}^{(t)}$
$c^{(t)} = f^{(t)} * c^{(t-1)} + i^{(t)} * \tilde{c}^{(t)}$
Example: drop the information about the old subject's gender and add the new information.
RNN
Long Short-Term Memory
Output Gate
Outputs extracted (tanh) and filtered (sigmoid) information from the cell state
tanh: extracts the information to be output
sigmoid: decides how much of the cell state should be output
$o^{(t)} = \sigma\big(W_o\,[h^{(t-1)}, x^{(t)}] + b_o\big)$
$h^{(t)} = o^{(t)} * \tanh\big(c^{(t)}\big)$
Example: output information relevant to the current state, e.g., whether the subject is singular or plural, so that the correct form of a verb can be determined.
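Putting the four components together, here is a hedged NumPy sketch of one LSTM step implementing the equations above (the sizes, zero biases, and random weights are assumptions for illustration):

import numpy as np

rng = np.random.default_rng(3)
D_x, D_h = 4, 3                                  # assumed input and hidden sizes

def make_W():
    return 0.1 * rng.normal(size=(D_h, D_x + D_h))

W_f, W_i, W_c, W_o = make_W(), make_W(), make_W(), make_W()
b_f = b_i = b_c = b_o = np.zeros(D_h)            # zero biases for simplicity

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev):
    z = np.concatenate([h_prev, x])              # [h(t-1), x(t)]
    f = sigmoid(W_f @ z + b_f)                   # forget gate: what to erase
    i = sigmoid(W_i @ z + b_i)                   # input gate: how much to write
    c_tilde = np.tanh(W_c @ z + b_c)             # candidate cell state
    c = f * c_prev + i * c_tilde                 # forget old things, add new things
    o = sigmoid(W_o @ z + b_o)                   # output gate: how much to reveal
    h = o * np.tanh(c)                           # new hidden state
    return h, c

h, c = np.zeros(D_h), np.zeros(D_h)
for x in [rng.normal(size=D_x) for _ in range(5)]:
    h, c = lstm_step(x, h, c)
print(h, c)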
Generative VS Discriminative
Is it a lion or a cat?
Method 1 (Generative Model):
Find out all possible images of lions and cats
Compare the image with the collected lion and cat images
Method 2 (Discriminative Model):
Find out the differences between lions and cats
Identify the image according to those differences
Generative VS Discriminative
Generative Model
Understands everything: $p(x, y)$
A more difficult task than the discriminative model
Classifies samples
Able to generate samples
Discriminative Model
Understands the difference between classes: $p(y \mid x)$
Can only classify samples
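The two views are connected by Bayes' rule, which is one way to see why the generative task is the harder one: the joint density a generative model learns already determines the conditional a discriminative model learns, but not vice versa:

\[
p(y \mid x) \;=\; \frac{p(x, y)}{p(x)} \;=\; \frac{p(x \mid y)\,p(y)}{\sum_{y'} p(x \mid y')\,p(y')}
\]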
Generative Model
Aims to increase the similarity between the real model and the simulated model
[Figure: a real model producing a real face vs. a simulated model producing a generated face]
Generative Model
How to quantify the similarity of real and generated distributions?
Explicit way (evaluation)
E.g., a Gaussian assumption
Needs prior knowledge
The evaluation may not be reasonable for some applications
Implicit way
Generative Adversarial Network
Evaluated by the accuracy of a discriminator
Adversarial Concept
An arms race between defender and attacker
Example: face recognition
Defender: face recognition. Attacker: put a photo in front of the camera.
Defender: depth map. Attacker: a 3D fake model.
Generative Adversarial Network
The Generative Adversarial Network (GAN) quantifies the similarity implicitly, by the accuracy of classifying the samples
Good classification indicates that the real and generated samples are different
A GAN contains:
A Generative Model (G)
A Discriminative Model (D)
[Figure: the real model and the generative model both feed samples to the discriminative model, which decides: real or generated?]
GAN
Training Process
Generative Model G(z)
Generates samples similar to the real ones
Takes noise (z) as input in order to generate a different sample each time
z typically has a much lower dimensionality than x
Discriminative Model D(x)
Classifies whether a sample is real or fake
If x is real, D(x) = 1; otherwise, D(x) = 0
[Figure: noise z feeds the generative model; its output and real samples feed the discriminative model, which decides: real or generated?]
GAN
Training Process
G aims to fool D
D aims not to be fooled
The models are trained simultaneously
As G gets better, D has a more challenging task
As D gets better, G has a more challenging task
Only G is used in the end
D exists to assist the training of G
GAN
Training Process
[Figure: GAN training, shown as a sequence of snapshots. Green solid line: probability density function (PDF) of G; black dotted line: PDF of the original x; blue dashed line: output of the discriminator D.]
Start: G is not similar to x, and D is unstable
After D is updated, D wins: it distinguishes real from generated well
After G is updated, G wins: D cannot distinguish well, and its output tends toward 0.5
GAN
Training Process
Finally (hopefully…): G has learned well (its PDF is identical to that of x), so D can no longer separate real from generated samples and outputs 0.5 everywhere
[Figure legend as above: green solid line: PDF of G; black dotted line: PDF of the original x; blue dashed line: output of D]
GAN
Training Process
Loss function for D:
If x is real, D(x) = 1; otherwise, D(x) = 0
Minimize the classification error, i.e. D maximizes $\log D(x) + \log\big(1 - D(G(z))\big)$
Loss function for G:
Maximize the error of D: $J^{(G)} = -J^{(D)}$, a minimax procedure
Real: $D(x) \to 1 \;\Rightarrow\; \log D(x) \to 0$
Fake: $D(G(z)) \to 0 \;\Rightarrow\; \log\big(1 - D(G(z))\big) \to 0$
GAN
Training Process
Train both G and D simultaneously
Stochastic gradient descent
Two ways of training:
(1) Compute the gradients of $J^{(D)}$ and $J^{(G)}$, and update both together
(2) Freeze one model, compute the gradient and update the other; then vice versa (see the sketch below)
One model can be trained without altering the other
With one model fixed, the other can be trained for multiple epochs
Increasing the ability of one side assigns a more difficult task to the other
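A minimal PyTorch sketch of training style (2), alternating between the two models. The toy Gaussian "real" data, network sizes, and learning rates are assumptions, and the generator uses the common non-saturating loss (maximize log D(G(z))) rather than literally maximizing D's error:

import torch
import torch.nn as nn

torch.manual_seed(0)
D_z, D_x = 8, 2                                   # assumed noise and data dimensions
G = nn.Sequential(nn.Linear(D_z, 32), nn.ReLU(), nn.Linear(32, D_x))
D = nn.Sequential(nn.Linear(D_x, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())
opt_G = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_D = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

def real_batch(n=64):                             # stand-in for the real data model
    return 0.5 * torch.randn(n, D_x) + 2.0

for step in range(1000):
    # (a) Update D with G frozen: push D(x) -> 1 on real, D(G(z)) -> 0 on fake
    x = real_batch()
    fake = G(torch.randn(x.size(0), D_z)).detach()   # detach: do not update G here
    loss_D = (bce(D(x), torch.ones(x.size(0), 1)) +
              bce(D(fake), torch.zeros(x.size(0), 1)))
    opt_D.zero_grad()
    loss_D.backward()
    opt_D.step()

    # (b) Update G with D frozen: push D(G(z)) -> 1, i.e. fool the discriminator
    z = torch.randn(64, D_z)
    loss_G = bce(D(G(z)), torch.ones(64, 1))
    opt_G.zero_grad()
    loss_G.backward()
    opt_G.step()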
GAN
Problem
Training a GAN is very difficult
The networks are difficult to converge
Ideal goal: G and D reach the desired equilibrium, but this is rare
GANs have yet to converge on large problems
D often becomes too strong too quickly, and G ends up not learning anything
G has a more complicated task than D:
D focuses on the difference
G focuses on the distribution
References
https://www.cs.toronto.edu/~tingwuwang/rnn_tutorial.pdf
https://medium.com/deep-math-machine-learning-ai/chapter-10-deepnlp-recurrent-neural-networks-with-math-c4a6846a50a2
http://colah.github.io/posts/2015-08-Understanding-LSTMs/