1/55

CS7015 (Deep Learning) : Lecture 7
Autoencoders and relation to PCA, Regularization in autoencoders, Denoising autoencoders, Sparse autoencoders, Contractive autoencoders

Mitesh M. Khapra
Department of Computer Science and Engineering, Indian Institute of Technology Madras

2/55

Module 7.1: Introduction to Autoencoders

3/55

(Figure: an autoencoder mapping an input $x_i$ through encoder weights $W$ to a hidden layer $h$, and through decoder weights $W^*$ to a reconstruction $\hat{x}_i$)

$h = g(W x_i + b)$

$\hat{x}_i = f(W^* h + c)$

An autoencoder is a special type of feed-forward neural network which does the following:

Encodes its input $x_i$ into a hidden representation $h$

Decodes the input again from this hidden representation

The model is trained to minimize a certain loss function which will ensure that $\hat{x}_i$ is close to $x_i$ (we will see some such loss functions soon)
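To make the two equations concrete, here is a minimal NumPy sketch of a single forward pass (not from the lecture; the sizes n, k and the choices g = sigmoid, f = identity are illustrative assumptions):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Illustrative sizes (assumptions): dim(x_i) = n, dim(h) = k
n, k = 8, 3
rng = np.random.default_rng(0)

W      = rng.normal(size=(k, n))   # encoder weights
b      = np.zeros(k)               # encoder bias
W_star = rng.normal(size=(n, k))   # decoder weights W*
c      = np.zeros(n)               # decoder bias

x_i   = rng.normal(size=n)
h     = sigmoid(W @ x_i + b)       # h = g(W x_i + b), with g = sigmoid
x_hat = W_star @ h + c             # x_hat_i = f(W* h + c), with f = identity
print(h.shape, x_hat.shape)        # (3,) (8,)
```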


4/55

$h = g(W x_i + b)$

$\hat{x}_i = f(W^* h + c)$

An autoencoder where $\dim(h) < \dim(x_i)$ is called an under-complete autoencoder.

Let us consider the case where $\dim(h) < \dim(x_i)$.

If we are still able to reconstruct $x_i$ perfectly from $h$, then what does it say about $h$?

$h$ is a loss-free encoding of $x_i$. It captures all the important characteristics of $x_i$.

Do you see an analogy with PCA?


5/55

$h = g(W x_i + b)$

$\hat{x}_i = f(W^* h + c)$

An autoencoder where $\dim(h) \geq \dim(x_i)$ is called an over-complete autoencoder.

Let us consider the case when $\dim(h) \geq \dim(x_i)$.

In such a case the autoencoder could learn a trivial encoding by simply copying $x_i$ into $h$ and then copying $h$ into $\hat{x}_i$.

Such an identity encoding is useless in practice as it does not really tell us anything about the important characteristics of the data.
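As a toy illustration of this trivial solution (not from the lecture; identity activations, zero biases, and the sizes are assumptions), an over-complete autoencoder can reach zero reconstruction error by just copying the input into the first dim(x_i) units of h:

```python
import numpy as np

n, k = 5, 8                                            # k >= n: over-complete (assumed sizes)
W      = np.vstack([np.eye(n), np.zeros((k - n, n))])  # copy x_i into the first n units of h
W_star = np.hstack([np.eye(n), np.zeros((n, k - n))])  # copy those units back out

x_i   = np.array([0.25, 0.5, 1.25, 3.5, 4.5])
h     = W @ x_i                                        # g = identity for this illustration
x_hat = W_star @ h                                     # f = identity, biases set to 0
print(np.allclose(x_hat, x_i))                         # True: perfect yet uninformative reconstruction
```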


6/55

The Road Ahead

Choice of $f(x_i)$ and $g(x_i)$

Choice of loss function


8/55

(Figure: autoencoder on a binary input $x_i = (0, 1, 1, 0, 1)$, with $h = g(W x_i + b)$ and $\hat{x}_i = f(W^* h + c)$)

$g$ is typically chosen as the sigmoid function.

Suppose all our inputs are binary (each $x_{ij} \in \{0, 1\}$). Which of the following functions would be most apt for the decoder?

$\hat{x}_i = \tanh(W^* h + c)$

$\hat{x}_i = W^* h + c$

$\hat{x}_i = \text{logistic}(W^* h + c)$

Logistic, as it naturally restricts all outputs to be between 0 and 1.


9/55

(Figure: autoencoder on a real-valued input $x_i = (0.25, 0.5, 1.25, 3.5, 4.5)$, with $h = g(W x_i + b)$ and $\hat{x}_i = f(W^* h + c)$)

Again, $g$ is typically chosen as the sigmoid function.

Suppose all our inputs are real (each $x_{ij} \in \mathbb{R}$). Which of the following functions would be most apt for the decoder?

$\hat{x}_i = \tanh(W^* h + c)$

$\hat{x}_i = W^* h + c$

$\hat{x}_i = \text{logistic}(W^* h + c)$

What will logistic and tanh do? They will restrict the reconstructed $\hat{x}_i$ to lie in $[0, 1]$ or $[-1, 1]$, whereas we want $\hat{x}_i \in \mathbb{R}^n$; so the linear decoder $\hat{x}_i = W^* h + c$ is the apt choice here.
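A small sketch contrasting the two decoder choices (illustrative only; the shapes and values are assumptions): a logistic decoder squashes every output into (0, 1), which suits binary inputs, while a linear decoder can produce arbitrary real values:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
n, k = 5, 3                                  # assumed sizes
W_star, c = rng.normal(size=(n, k)), np.zeros(n)
h = rng.uniform(size=k)                      # some hidden representation

x_hat_logistic = sigmoid(W_star @ h + c)     # always in (0, 1): apt for binary x_i
x_hat_linear   = W_star @ h + c              # unbounded: apt for real-valued x_i

print(x_hat_logistic.min() > 0 and x_hat_logistic.max() < 1)   # True
print(x_hat_linear)                          # entries can be any real numbers
```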


11/55

$h = g(W x_i + b)$

$\hat{x}_i = f(W^* h + c)$

Consider the case when the inputs are real valued.

The objective of the autoencoder is to reconstruct $\hat{x}_i$ to be as close to $x_i$ as possible.

This can be formalized using the following objective function:

$$\min_{W, W^*, c, b} \frac{1}{m} \sum_{i=1}^{m} \sum_{j=1}^{n} (\hat{x}_{ij} - x_{ij})^2$$

i.e.,

$$\min_{W, W^*, c, b} \frac{1}{m} \sum_{i=1}^{m} (\hat{x}_i - x_i)^T (\hat{x}_i - x_i)$$

We can then train the autoencoder just like a regular feedforward network using backpropagation.

All we need is a formula for $\frac{\partial \mathscr{L}(\theta)}{\partial W^*}$ and $\frac{\partial \mathscr{L}(\theta)}{\partial W}$, which we will see now.
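A minimal sketch (assumed array layout: one example per row) of this objective, checking that the double sum over i and j and the vector form $(\hat{x}_i - x_i)^T(\hat{x}_i - x_i)$ give the same value:

```python
import numpy as np

def squared_error_objective(X, X_hat):
    # X, X_hat: (m, n) arrays of inputs and reconstructions
    m = X.shape[0]
    return np.sum((X_hat - X) ** 2) / m      # (1/m) sum_i sum_j (x_hat_ij - x_ij)^2

rng = np.random.default_rng(0)
X, X_hat = rng.normal(size=(10, 4)), rng.normal(size=(10, 4))

vector_form = np.mean([(xh - x) @ (xh - x) for x, xh in zip(X, X_hat)])
print(np.isclose(squared_error_objective(X, X_hat), vector_form))   # True
```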


12/55

$$\mathscr{L}(\theta) = (\hat{x}_i - x_i)^T (\hat{x}_i - x_i)$$

(Figure: the autoencoder drawn as a two-layer feedforward network with $h_0 = x_i$, pre-activations $a_1, a_2$, hidden layer $h_1$, output $h_2 = \hat{x}_i$, and weights $W$, $W^*$)

Note that the loss function is shown for only one training example.

$$\frac{\partial \mathscr{L}(\theta)}{\partial W^*} = \frac{\partial \mathscr{L}(\theta)}{\partial h_2} \frac{\partial h_2}{\partial a_2} \frac{\partial a_2}{\partial W^*}$$

$$\frac{\partial \mathscr{L}(\theta)}{\partial W} = \frac{\partial \mathscr{L}(\theta)}{\partial h_2} \frac{\partial h_2}{\partial a_2} \frac{\partial a_2}{\partial h_1} \frac{\partial h_1}{\partial a_1} \frac{\partial a_1}{\partial W}$$

We have already seen how to calculate the expressions in the boxes when we learnt backpropagation.

$$\frac{\partial \mathscr{L}(\theta)}{\partial h_2} = \frac{\partial \mathscr{L}(\theta)}{\partial \hat{x}_i} = \nabla_{\hat{x}_i} \left( (\hat{x}_i - x_i)^T (\hat{x}_i - x_i) \right) = 2(\hat{x}_i - x_i)$$
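The last identity is easy to verify numerically; the sketch below (illustrative, treating $h_2 = \hat{x}_i$ directly as the variable) compares the analytic gradient $2(\hat{x}_i - x_i)$ against central finite differences:

```python
import numpy as np

rng = np.random.default_rng(0)
x_i   = rng.normal(size=4)
x_hat = rng.normal(size=4)

analytic = 2 * (x_hat - x_i)              # dL/dh2 = 2 (x_hat - x_i)

eps = 1e-6
numeric = np.zeros_like(x_hat)
for j in range(x_hat.size):
    e = np.zeros_like(x_hat)
    e[j] = eps
    L_plus  = np.sum((x_hat + e - x_i) ** 2)
    L_minus = np.sum((x_hat - e - x_i) ** 2)
    numeric[j] = (L_plus - L_minus) / (2 * eps)

print(np.allclose(analytic, numeric))     # True
```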


13/55

(Figure: autoencoder on a binary input $x_i = (0, 1, 1, 0, 1)$, with $h = g(W x_i + b)$ and $\hat{x}_i = f(W^* h + c)$)

Consider the case when the inputs are binary.

We use a sigmoid decoder which will produce outputs between 0 and 1, and can be interpreted as probabilities.

For a single n-dimensional $i$th input we can use the following loss function:

$$\min \left\{ -\sum_{j=1}^{n} \left( x_{ij} \log \hat{x}_{ij} + (1 - x_{ij}) \log(1 - \hat{x}_{ij}) \right) \right\}$$

What value of $\hat{x}_{ij}$ will minimize this function? If $x_{ij} = 1$? If $x_{ij} = 0$?

Indeed the above function will be minimized when $\hat{x}_{ij} = x_{ij}$!

Again, all we need is a formula for $\frac{\partial \mathscr{L}(\theta)}{\partial W^*}$ and $\frac{\partial \mathscr{L}(\theta)}{\partial W}$ to use backpropagation.
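A minimal sketch of this cross-entropy loss for one binary input (the clipping constant is an assumption added only to keep the logs finite; it is not part of the slide):

```python
import numpy as np

def cross_entropy_loss(x_i, x_hat, eps=1e-12):
    # x_i: binary input, x_hat: sigmoid reconstruction in (0, 1)
    x_hat = np.clip(x_hat, eps, 1 - eps)   # numerical guard for the logs
    return -np.sum(x_i * np.log(x_hat) + (1 - x_i) * np.log(1 - x_hat))

x_i   = np.array([0., 1., 1., 0., 1.])
x_hat = np.array([0.1, 0.9, 0.8, 0.2, 0.7])
print(cross_entropy_loss(x_i, x_hat))      # small when x_hat is close to x_i
print(cross_entropy_loss(x_i, x_i))        # ~0: minimized when x_hat_ij = x_ij
```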


14/55

$$\mathscr{L}(\theta) = -\sum_{j=1}^{n} \left( x_{ij} \log \hat{x}_{ij} + (1 - x_{ij}) \log(1 - \hat{x}_{ij}) \right)$$

(Figure: the same network as before, with $h_0 = x_i$, pre-activations $a_1, a_2$, hidden layer $h_1$, output $h_2 = \hat{x}_i$, and weights $W$, $W^*$)

$$\frac{\partial \mathscr{L}(\theta)}{\partial h_2} = \begin{bmatrix} \frac{\partial \mathscr{L}(\theta)}{\partial h_{21}} & \frac{\partial \mathscr{L}(\theta)}{\partial h_{22}} & \cdots & \frac{\partial \mathscr{L}(\theta)}{\partial h_{2n}} \end{bmatrix}^T$$

$$\frac{\partial \mathscr{L}(\theta)}{\partial W^*} = \frac{\partial \mathscr{L}(\theta)}{\partial h_2} \frac{\partial h_2}{\partial a_2} \frac{\partial a_2}{\partial W^*}$$

$$\frac{\partial \mathscr{L}(\theta)}{\partial W} = \frac{\partial \mathscr{L}(\theta)}{\partial h_2} \frac{\partial h_2}{\partial a_2} \frac{\partial a_2}{\partial h_1} \frac{\partial h_1}{\partial a_1} \frac{\partial a_1}{\partial W}$$

We have already seen how to calculate the expressions in the square boxes when we learnt backpropagation.

The first two terms on the RHS can be computed as:

$$\frac{\partial \mathscr{L}(\theta)}{\partial h_{2j}} = -\frac{x_{ij}}{\hat{x}_{ij}} + \frac{1 - x_{ij}}{1 - \hat{x}_{ij}}$$

$$\frac{\partial h_{2j}}{\partial a_{2j}} = \sigma(a_{2j})(1 - \sigma(a_{2j}))$$
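Although the slide stops at these two factors, multiplying them (and using $h_{2j} = \hat{x}_{ij} = \sigma(a_{2j})$) gives a compact form that is convenient in implementations; this step is not on the slide but follows directly:

$$\frac{\partial \mathscr{L}(\theta)}{\partial a_{2j}} = \left( -\frac{x_{ij}}{\hat{x}_{ij}} + \frac{1 - x_{ij}}{1 - \hat{x}_{ij}} \right) \hat{x}_{ij} (1 - \hat{x}_{ij}) = -x_{ij}(1 - \hat{x}_{ij}) + (1 - x_{ij})\hat{x}_{ij} = \hat{x}_{ij} - x_{ij}$$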


15/55

Module 7.2: Link between PCA and Autoencoders

16/55

(Figure: the autoencoder $x_i \to h \to \hat{x}_i$ shown alongside PCA, $P^T X^T X P = D$, illustrated by data in the $x$-$y$ plane with principal directions $u_1, u_2$)

We will now see that the encoder part of an autoencoder is equivalent to PCA if we

use a linear encoder,

use a linear decoder,

use squared error loss function, and

normalize the inputs to

$$\hat{x}_{ij} = \frac{1}{\sqrt{m}} \left( x_{ij} - \frac{1}{m} \sum_{k=1}^{m} x_{kj} \right)$$


17/55

First let us consider the implication of normalizing the inputs to

$$\hat{x}_{ij} = \frac{1}{\sqrt{m}} \left( x_{ij} - \frac{1}{m} \sum_{k=1}^{m} x_{kj} \right)$$

The operation in the bracket ensures that the data now has 0 mean along each dimension $j$ (we are subtracting the mean).

Let $X'$ be this zero-mean data matrix; then what the above normalization gives us is $\hat{X} = \frac{1}{\sqrt{m}} X'$.

Now $\hat{X}^T \hat{X} = \frac{1}{m} (X')^T X'$ is the covariance matrix (recall that the covariance matrix plays an important role in PCA).
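A short NumPy sketch (illustrative sizes and toy data are assumptions) of this normalization, confirming that $\hat{X}^T \hat{X}$ equals the (biased) covariance matrix of the original data:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 100, 4
X_raw = rng.normal(size=(m, n)) * np.array([1.0, 2.0, 3.0, 4.0])   # toy data

X_prime = X_raw - X_raw.mean(axis=0)    # zero mean along each dimension j
X_hat   = X_prime / np.sqrt(m)          # X_hat = (1/sqrt(m)) X'

cov = X_hat.T @ X_hat                   # equals (1/m) X'^T X'
print(np.allclose(cov, np.cov(X_raw, rowvar=False, bias=True)))    # True
```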


Page 92: CS7015 (Deep Learning) : Lecture 7 · Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7. 5/55 x i W h W x^ i h = g(Wx i + b) x^ i = f(Wh+ c) An autoencoder where dim(h) dim(x i)

18/55

First we will show that if we use a linear decoder and a squared error loss function, then the optimal solution to the following objective function

\frac{1}{m}\sum_{i=1}^{m}\sum_{j=1}^{n} (\hat{x}_{ij} - x_{ij})^2

is obtained when we use a linear encoder.


19/55

\min_{\theta} \sum_{i=1}^{m}\sum_{j=1}^{n} (\hat{x}_{ij} - x_{ij})^2 \quad (1)

This is equivalent to

\min_{W^*, H} \; \|X - HW^*\|_F^2 \qquad \text{where } \|A\|_F = \sqrt{\sum_{i=1}^{m}\sum_{j=1}^{n} a_{ij}^2}

(just writing expression (1) in matrix form and using the definition of \|A\|_F; we are ignoring the biases)

From SVD we know that the optimal solution to the above problem is given by

HW^* = U_{.,\leq k}\, \Sigma_{k,k}\, V^T_{.,\leq k}

By matching variables one possible solution is

H = U_{.,\leq k}\, \Sigma_{k,k} \qquad W^* = V^T_{.,\leq k}
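As an aside, this rank-k SVD solution is easy to verify numerically. A small sketch (random data and an arbitrary k, chosen only for illustration; not part of the lecture):

```python
import numpy as np

m, n, k = 50, 10, 3
X = np.random.randn(m, n)

U, S, Vt = np.linalg.svd(X, full_matrices=False)   # X = U diag(S) V^T

H      = U[:, :k] * S[:k]                          # H   = U_{.,<=k} Sigma_{k,k}   (m x k)
W_star = Vt[:k, :]                                 # W^* = V^T_{.,<=k}             (k x n)

X_hat = H @ W_star                                 # best rank-k approximation of X
print(np.linalg.norm(X - X_hat, 'fro') ** 2)       # the minimum value of the objective
```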


20/55

We will now show that H is a linear encoding and find an expression for the encoder weights W

H = U_{.,\leq k}\, \Sigma_{k,k}
  = (XX^T)(XX^T)^{-1} U_{.,\leq k}\, \Sigma_{k,k}    (pre-multiplying by (XX^T)(XX^T)^{-1} = I)
  = (XV\Sigma^T U^T)(U\Sigma V^T V\Sigma^T U^T)^{-1} U_{.,\leq k}\, \Sigma_{k,k}    (using X = U\Sigma V^T)
  = XV\Sigma^T U^T (U\Sigma\Sigma^T U^T)^{-1} U_{.,\leq k}\, \Sigma_{k,k}    (V^T V = I)
  = XV\Sigma^T U^T U (\Sigma\Sigma^T)^{-1} U^T U_{.,\leq k}\, \Sigma_{k,k}    ((ABC)^{-1} = C^{-1}B^{-1}A^{-1})
  = XV\Sigma^T (\Sigma\Sigma^T)^{-1} U^T U_{.,\leq k}\, \Sigma_{k,k}    (U^T U = I)
  = XV\Sigma^T (\Sigma^T)^{-1} \Sigma^{-1} U^T U_{.,\leq k}\, \Sigma_{k,k}    ((AB)^{-1} = B^{-1}A^{-1})
  = XV\Sigma^{-1} I_{.,\leq k}\, \Sigma_{k,k}    (U^T U_{.,\leq k} = I_{.,\leq k})
  = XV I_{.,\leq k}    (\Sigma^{-1} I_{.,\leq k} = \Sigma^{-1}_{k,k})

H = XV_{.,\leq k}

Thus H is a linear transformation of X and W = V_{.,\leq k}
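The identity H = XV_{.,\leq k} can also be checked numerically. A quick sketch on random data (an assumption made only for illustration):

```python
import numpy as np

m, n, k = 50, 10, 3
X = np.random.randn(m, n)

U, S, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt.T

H_svd    = U[:, :k] * S[:k]           # U_{.,<=k} Sigma_{k,k}
H_linear = X @ V[:, :k]               # X V_{.,<=k} -- the same H, obtained as a linear encoding of X

print(np.allclose(H_svd, H_linear))   # True (up to floating point error)
```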


21/55

We have encoder W = V_{.,\leq k}

From SVD, we know that V is the matrix of eigenvectors of X^T X

From PCA, we know that P is the matrix of eigenvectors of the covariance matrix

We saw earlier that, if the entries of X are normalized by

\hat{x}_{ij} = \frac{1}{\sqrt{m}}\left(x_{ij} - \frac{1}{m}\sum_{k=1}^{m} x_{kj}\right)

then X^T X is indeed the covariance matrix

Thus, the encoder matrix for the linear autoencoder (W) and the projection matrix (P) for PCA could indeed be the same. Hence proved.


22/55

Remember

The encoder of a linear autoencoder is equivalent to PCA if we

use a linear encoder
use a linear decoder
use a squared error loss function
and normalize the inputs to

\hat{x}_{ij} = \frac{1}{\sqrt{m}}\left(x_{ij} - \frac{1}{m}\sum_{k=1}^{m} x_{kj}\right)
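A small numerical sketch of this equivalence (random data; np.linalg.eigh stands in for "the PCA projection matrix P", and the comparison is only up to sign, assuming distinct eigenvalues):

```python
import numpy as np

m, n = 200, 6
X_raw = np.random.randn(m, n) @ np.random.randn(n, n)        # correlated inputs

X = (X_raw - X_raw.mean(axis=0)) / np.sqrt(m)                # the normalization from above

_, _, Vt = np.linalg.svd(X, full_matrices=False)             # right singular vectors of X

eigvals, P = np.linalg.eigh(X.T @ X)                         # eigenvectors of the covariance matrix
P = P[:, ::-1]                                               # reorder by decreasing eigenvalue

print(np.allclose(np.abs(Vt.T), np.abs(P), atol=1e-6))       # columns agree up to sign
```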


23/55

Module 7.3: Regularization in autoencoders (Motivation)


24/55

Figure: overcomplete autoencoder x_i \to h \to \hat{x}_i with weights W and W^*

While poor generalization could happen even in undercomplete autoencoders, it is an even more serious problem for overcomplete autoencoders

Here, (as stated earlier) the model can simply learn to copy x_i to h and then h to \hat{x}_i

To avoid poor generalization, we need to introduce regularization


25/55

The simplest solution is to add an L2-regularization term to the objective function

\min_{W, W^*, b, c} \; \frac{1}{m}\sum_{i=1}^{m}\sum_{j=1}^{n} (\hat{x}_{ij} - x_{ij})^2 + \lambda \|\theta\|^2

This is very easy to implement and just adds a term \lambda W to the gradient \frac{\partial \mathscr{L}(\theta)}{\partial W} (and similarly for the other parameters)
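A minimal sketch of one such regularized gradient step for a linear autoencoder (the names, shapes, and learning rate are assumptions for illustration; differentiating \lambda\|W\|^2 gives 2\lambda W, and the factor 2 is often absorbed into \lambda):

```python
import numpy as np

m, n, k = 32, 8, 3
X = np.random.randn(m, n)
W      = np.random.randn(n, k) * 0.1          # encoder weights (used here as X @ W)
W_star = np.random.randn(k, n) * 0.1          # decoder weights
lam, eta = 1e-3, 0.1

H     = X @ W                                 # linear encoder
X_hat = H @ W_star                            # linear decoder
diff  = X_hat - X

grad_W      = (2.0 / m) * X.T @ (diff @ W_star.T) + 2 * lam * W       # data term + regularizer term
grad_W_star = (2.0 / m) * H.T @ diff + 2 * lam * W_star

W      -= eta * grad_W
W_star -= eta * grad_W_star
```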


26/55

Another trick is to tie the weights of the encoder and decoder, i.e., W^* = W^T

This effectively reduces the capacity of the autoencoder and acts as a regularizer
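A minimal sketch of the tied-weight forward pass (shapes and the sigmoid activations are assumptions; only one weight matrix W is stored and the decoder reuses its transpose):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

n, k = 8, 3
W = np.random.randn(k, n) * 0.1      # encoder weights; the decoder reuses W^T
b, c = np.zeros(k), np.zeros(n)

x     = np.random.randn(n)
h     = sigmoid(W @ x + b)           # h     = g(W x_i + b)
x_hat = sigmoid(W.T @ h + c)         # x_hat = f(W^T h + c), i.e. W^* = W^T (tied weights)
```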


27/55

Module 7.4: Denoising Autoencoders


28/55

Figure: x_i corrupted to \tilde{x}_i via P(\tilde{x}_{ij} \mid x_{ij}), then encoded to h and decoded to \hat{x}_i

A denoising autoencoder simply corrupts the input data using a probabilistic process (P(\tilde{x}_{ij} \mid x_{ij})) before feeding it to the network

A simple P(\tilde{x}_{ij} \mid x_{ij}) used in practice is the following

P(\tilde{x}_{ij} = 0 \mid x_{ij}) = q
P(\tilde{x}_{ij} = x_{ij} \mid x_{ij}) = 1 - q

In other words, with probability q the input is flipped to 0 and with probability (1 - q) it is retained as it is
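A sketch of this corruption process in numpy (q = 0.25 and the input size are arbitrary choices; the loss is still computed against the uncorrupted x):

```python
import numpy as np

def corrupt(x, q):
    """Set each entry of x to 0 with probability q, keep it unchanged with probability 1 - q."""
    mask = np.random.rand(*x.shape) >= q      # True with probability 1 - q
    return x * mask

x = np.random.rand(784)                       # e.g. one flattened 28x28 image
x_tilde = corrupt(x, q=0.25)                  # x_tilde is fed to the encoder; the target remains x
```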


29/55

How does this help?

This helps because the objective is still to reconstruct the original (uncorrupted) x_i

\arg\min_{\theta} \frac{1}{m}\sum_{i=1}^{m}\sum_{j=1}^{n} (\hat{x}_{ij} - x_{ij})^2

It no longer makes sense for the model to copy the corrupted \tilde{x}_i into h(\tilde{x}_i) and then into \hat{x}_i (the objective function will not be minimized by doing so)

Instead the model will now have to capture the characteristics of the data correctly.

For example, it will have to learn to reconstruct a corrupted \tilde{x}_{ij} correctly by relying on its interactions with other elements of x_i


30/55

We will now see a practical application in which AEs are used and then compare Denoising Autoencoders with regular autoencoders


31/55

Task: Hand-written digit recognition

Figure: MNIST Data (digits 0, 1, 2, 3, ..., 9)

|x_i| = 784 = 28 \times 28

Figure: Basic approach (we use raw data as input features)


32/55

Task: Hand-written digit recognition

Figure: MNIST Data

|x_i| = 784 = 28 \times 28, \quad x_i \in \mathbb{R}^{784}, \quad h \in \mathbb{R}^{d}

Figure: AE approach (first learn important characteristics of data)


33/55

Task: Hand-written digit recognition

Figure: MNIST Data (digits 0, 1, 2, 3, ..., 9)

|x_i| = 784 = 28 \times 28, \quad h \in \mathbb{R}^{d}

Figure: AE approach (and then train a classifier on top of this hidden representation)


34/55

We will now see a way of visualizing AEs and use this visualization to compare different AEs


35/55

We can think of each neuron as a filter which will fire (or get maximally activated) for a certain input configuration x_i

For example,

h_1 = \sigma(W_1^T x_i) \quad [\text{ignoring bias } b]

where W_1 is the trained vector of weights connecting the input to the first hidden neuron

What values of x_i will cause h_1 to be maximum (or maximally activated)?

Suppose we assume that our inputs are normalized so that \|x_i\| = 1

\max_{x_i} \; W_1^T x_i \quad \text{s.t.} \quad \|x_i\|^2 = x_i^T x_i = 1

Solution: x_i = \frac{W_1}{\sqrt{W_1^T W_1}}


36/55

Thus the inputs

x_i = \frac{W_1}{\sqrt{W_1^T W_1}}, \; \frac{W_2}{\sqrt{W_2^T W_2}}, \; \ldots, \; \frac{W_n}{\sqrt{W_n^T W_n}}

will respectively cause hidden neurons 1 to n to maximally fire

Let us plot these images (x_i's) which maximally activate the first k neurons of the hidden representations learned by a vanilla autoencoder and different denoising autoencoders

These x_i's are computed by the above formula using the weights (W_1, W_2, \ldots, W_k) learned by the respective autoencoders
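A sketch of how these images could be computed from the learned weights (the weight matrix, its shape, and the 28x28 reshape are assumptions matching the MNIST setup above; this is not the lecture's code):

```python
import numpy as np

k, n = 100, 784
W = np.random.randn(k, n)        # stand-in for trained encoder weights, one row W_l per hidden neuron

# The x that maximally activates neuron l is W_l / sqrt(W_l^T W_l), i.e. the unit-normalized weight vector
filters = W / np.linalg.norm(W, axis=1, keepdims=True)
images  = filters.reshape(k, 28, 28)   # one 28x28 image per hidden neuron, ready to plot
```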


37/55

Figure: Vanilla AE (no noise)
Figure: 25% Denoising AE (q = 0.25)
Figure: 50% Denoising AE (q = 0.5)

The vanilla AE does not learn many meaningful patterns

The hidden neurons of the denoising AEs seem to act like pen-stroke detectors (for example, in the highlighted neuron the black region is a stroke that you would expect in a '0' or a '2' or a '3' or an '8' or a '9')

As the noise increases the filters become wider because the neuron has to rely on more adjacent pixels to feel confident about a stroke


38/55

We saw one form of P(\tilde{x}_{ij} \mid x_{ij}) which flips a fraction q of the inputs to zero

Another way of corrupting the inputs is to add Gaussian noise to the input

\tilde{x}_{ij} = x_{ij} + \mathcal{N}(0, 1)

We will now use such a denoising AE on a different dataset and see its performance
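For completeness, the additive Gaussian corruption is just as easy to sketch (again illustrative, not the lecture's code):

```python
import numpy as np

def corrupt_gaussian(x):
    return x + np.random.randn(*x.shape)      # x_tilde = x + N(0, 1) noise, elementwise

x = np.random.rand(784)
x_tilde = corrupt_gaussian(x)                 # the reconstruction target is still the clean x
```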


39/55

Figure: Data
Figure: AE filters
Figure: Weight decay filters

The hidden neurons essentially behave like edge detectors

PCA does not give such edge detectors


40/55

Module 7.5: Sparse Autoencoders


41/55

A hidden neuron with sigmoid activation will have values between 0 and 1

We say that the neuron is activated when its output is close to 1 and not activated when its output is close to 0.

A sparse autoencoder tries to ensure that the neuron is inactive most of the time.


42/55

The average value of the activation of a neuron l is given by

\rho_l = \frac{1}{m}\sum_{i=1}^{m} h(x_i)_l

If the neuron l is sparse (i.e. mostly inactive) then \rho_l \to 0

A sparse autoencoder uses a sparsity parameter \rho (typically very close to 0, say, 0.005) and tries to enforce the constraint \rho_l = \rho

One way of ensuring this is to add the following term to the objective function

\Omega(\theta) = \sum_{l=1}^{k} \rho \log\frac{\rho}{\rho_l} + (1-\rho)\log\frac{1-\rho}{1-\rho_l}

When will this term reach its minimum value and what is the minimum value? Let us plot it and check.
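A sketch of this penalty in numpy (H stands for the matrix of hidden activations h(x_i) over the m training examples; the shapes and \rho = 0.005 are illustrative assumptions):

```python
import numpy as np

def sparsity_penalty(H, rho=0.005):
    """Omega(theta) for sigmoid activations H of shape (m, k), entries in (0, 1)."""
    rho_l = H.mean(axis=0)                    # average activation of each hidden neuron
    return np.sum(rho * np.log(rho / rho_l)
                  + (1 - rho) * np.log((1 - rho) / (1 - rho_l)))

H = np.random.rand(100, 32) * 0.2             # stand-in for hidden activations
print(sparsity_penalty(H))                    # >= 0, and 0 only when every rho_l equals rho
```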


43/55

Figure: plot of \Omega(\theta) as a function of \rho_l, for \rho = 0.2

The function will reach its minimum value(s) when \rho_l = \rho (and the minimum value there is 0).

Mitesh M. Khapra CS7015 (Deep Learning) : Lecture 7

44/55

Ω(θ) = ∑_{l=1}^{k} [ ρ log(ρ/ρ̂_l) + (1 − ρ) log((1 − ρ)/(1 − ρ̂_l)) ]

can be re-written as

Ω(θ) = ∑_{l=1}^{k} [ ρ log ρ − ρ log ρ̂_l + (1 − ρ) log(1 − ρ) − (1 − ρ) log(1 − ρ̂_l) ]

Now, the full objective is

L̂(θ) = L(θ) + Ω(θ)

where L(θ) is the squared error loss or cross-entropy loss and Ω(θ) is the sparsity constraint.

We already know how to calculate ∂L(θ)/∂W; let us see how to calculate ∂Ω(θ)/∂W.

By the chain rule:

∂Ω(θ)/∂W = (∂Ω(θ)/∂ρ̂) · (∂ρ̂/∂W)

where ∂Ω(θ)/∂ρ̂ = [ ∂Ω(θ)/∂ρ̂_1, ∂Ω(θ)/∂ρ̂_2, ..., ∂Ω(θ)/∂ρ̂_k ]^T

For each neuron l ∈ {1, ..., k} in the hidden layer, we have

∂Ω(θ)/∂ρ̂_l = −ρ/ρ̂_l + (1 − ρ)/(1 − ρ̂_l)

and

∂ρ̂_l/∂W = x_i (g′(W^T x_i + b))^T   (see next slide)

Finally,

∂L̂(θ)/∂W = ∂L(θ)/∂W + ∂Ω(θ)/∂W

and we know how to calculate both terms on the R.H.S.
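The two factors of the chain rule can be assembled directly. Below is a hedged NumPy sketch for a sigmoid encoder, where g′(a) = h(1 − h), using the same assumed shapes (X: m × n, W: n × k) as before; it is an illustration of the formulas on this slide, not the lecture's code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sparsity_grad_W(X, W, b, rho=0.005):
    """dOmega/dW for a sigmoid encoder, assembled from the two factors above.

    X: (m, n) inputs, W: (n, k) encoder weights, b: (k,) biases.
    Returns an (n, k) matrix whose (j, l) entry is dOmega/dW_jl.
    """
    m = X.shape[0]
    H = sigmoid(X @ W + b)            # (m, k) hidden activations
    rho_hat = H.mean(axis=0)          # (k,) average activations
    # Factor 1: dOmega/drho_hat_l = -rho/rho_hat_l + (1 - rho)/(1 - rho_hat_l)
    d_rho = -rho / rho_hat + (1 - rho) / (1 - rho_hat)
    # Factor 2: drho_hat_l/dW_jl = (1/m) * sum_i g'(a_il) * x_ij, with g'(a) = h(1 - h)
    Gp = H * (1 - H)                  # (m, k)
    # Chain rule: dOmega/dW_jl = d_rho[l] * (1/m) * sum_i Gp[i, l] * X[i, j]
    return (X.T @ (Gp * d_rho)) / m
```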

45/55

Derivation:

∂ρ̂/∂W = [ ∂ρ̂_1/∂W  ∂ρ̂_2/∂W  ...  ∂ρ̂_k/∂W ]

Each element of the above is the partial derivative of a scalar w.r.t. a matrix (and is hence itself a matrix). For a single entry W_jl of the matrix W:

∂ρ̂_l/∂W_jl = ∂[ (1/m) ∑_{i=1}^{m} g(W_{:,l}^T x_i + b_l) ] / ∂W_jl

            = (1/m) ∑_{i=1}^{m} ∂[ g(W_{:,l}^T x_i + b_l) ] / ∂W_jl

            = (1/m) ∑_{i=1}^{m} g′(W_{:,l}^T x_i + b_l) x_ij

So in matrix notation we can write it as

∂ρ̂_l/∂W = x_i (g′(W^T x_i + b))^T

where the average over the m training examples is implied and the matrix on the right collects the entries ∂ρ̂_l/∂W_jl for all j and l.
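The result is easy to verify numerically; the short sketch below (same assumed sigmoid encoder and shapes as before, purely illustrative) compares the derived expression for ∂ρ̂_l/∂W_jl with a finite-difference estimate.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rho_hat(X, W, b):
    return sigmoid(X @ W + b).mean(axis=0)    # (k,) average activations

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 6))                  # m = 50 examples, n = 6 inputs
W = rng.normal(scale=0.1, size=(6, 4))        # k = 4 hidden neurons
b = rng.normal(scale=0.1, size=4)

j, l, eps = 2, 3, 1e-6
H = sigmoid(X @ W + b)
# Formula from the derivation: (1/m) * sum_i g'(W_{:,l}^T x_i + b_l) * x_ij
analytic = np.mean(H[:, l] * (1 - H[:, l]) * X[:, j])

W_pert = W.copy()
W_pert[j, l] += eps
numeric = (rho_hat(X, W_pert, b)[l] - rho_hat(X, W, b)[l]) / eps
print(analytic, numeric)                      # the two values should agree closely
```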


46/55

Module 7.6: Contractive Autoencoders


47/55

A contractive autoencoder also tries to prevent an overcomplete autoencoder from learning the identity function.

It does so by adding the following regularization term to the loss function:

Ω(θ) = ‖J_x(h)‖²_F

where J_x(h) is the Jacobian of the encoder.

Let us see what it looks like.

[Figure: autoencoder with input x, hidden representation h, and reconstruction x̂]


48/55

If the input has n dimensions and the hidden layer has k dimensions, then the Jacobian is the k × n matrix

J_x(h) =
[ ∂h_1/∂x_1   ...   ∂h_1/∂x_n ]
[ ∂h_2/∂x_1   ...   ∂h_2/∂x_n ]
[     ...      ...      ...    ]
[ ∂h_k/∂x_1   ...   ∂h_k/∂x_n ]

In other words, the (l, j) entry of the Jacobian captures the variation in the output of the l-th neuron with a small variation in the j-th input.

‖J_x(h)‖²_F = ∑_{j=1}^{n} ∑_{l=1}^{k} (∂h_l/∂x_j)²
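For a sigmoid encoder, the Jacobian entries have the closed form ∂h_l/∂x_j = h_l(1 − h_l) w_lj, where w_lj is the weight connecting input j to neuron l, so the penalty can be computed without explicit differentiation. The sketch below (assumed shapes X: m × n, W: n × k, penalty averaged over a batch, illustrative only) implements exactly the double sum above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def contractive_penalty(X, W, b):
    """Batch-averaged ||J_x(h)||_F^2 for a sigmoid encoder.

    X: (m, n) inputs, W: (n, k) encoder weights, b: (k,) biases.
    dh_l/dx_j = h_l * (1 - h_l) * W[j, l], so for each example
    ||J||_F^2 = sum_l (h_l * (1 - h_l))^2 * sum_j W[j, l]^2.
    """
    H = sigmoid(X @ W + b)                # (m, k) hidden activations
    Gp2 = (H * (1 - H)) ** 2              # (m, k) squared derivative of the sigmoid
    col_sq = np.sum(W ** 2, axis=0)       # (k,) sum_j W[j, l]^2 for each neuron l
    return float((Gp2 @ col_sq).mean())   # average of ||J_x(h)||_F^2 over the batch
```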


49/55

What is the intuition behind this?

Consider ∂h_1/∂x_1. What does it mean if ∂h_1/∂x_1 = 0?

It means that this neuron is not very sensitive to variations in the input x_1.

But doesn't this contradict our other goal of minimizing L(θ), which requires h to capture variations in the input?

‖J_x(h)‖²_F = ∑_{j=1}^{n} ∑_{l=1}^{k} (∂h_l/∂x_j)²


50/55

Indeed it does, and that is the idea.

By pitting these two contradicting objectives against each other, we ensure that h is sensitive only to the very important variations observed in the training data.

L(θ): capture important variations in the data.

Ω(θ): do not capture variations in the data.

Tradeoff: capture only the very important variations in the data.

‖J_x(h)‖²_F = ∑_{j=1}^{n} ∑_{l=1}^{k} (∂h_l/∂x_j)²
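In practice the tradeoff is controlled by a weighting coefficient λ on the penalty. A minimal sketch of the combined objective is given below, assuming a tied-weight sigmoid encoder with a linear decoder and squared error loss; all of these choices (shapes, tying, λ value) are illustrative assumptions rather than the lecture's setup.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def contractive_objective(X, W, b, c, lam=0.1):
    """L(theta) + lam * Omega(theta) for a tied-weight sigmoid autoencoder (a sketch).

    Encoder: h = sigmoid(x W + b); decoder: x_hat = h W^T + c (linear, an assumption).
    L is the mean squared reconstruction error; Omega is the contractive penalty.
    """
    H = sigmoid(X @ W + b)                                   # (m, k)
    X_hat = H @ W.T + c                                      # (m, n) reconstructions
    recon = np.mean(np.sum((X - X_hat) ** 2, axis=1))        # L(theta)
    jac = float(((H * (1 - H)) ** 2 @ np.sum(W ** 2, axis=0)).mean())  # Omega(theta)
    return recon + lam * jac

# Usage on random data: lam near 0 favours reconstruction (capture variations),
# a large lam favours insensitivity, intermediate values keep only the important ones.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))
W = rng.normal(scale=0.1, size=(8, 4)); b = np.zeros(4); c = np.zeros(8)
print(contractive_objective(X, W, b, c, lam=0.1))
```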



51/55

Let us try to understand this with the help of an illustration.


52/55

[Figure: data points in the x-y plane, with most of the variance along direction u1 and only a small spread along the orthogonal direction u2.]

Consider the variations in the data along the directions u1 and u2.

It makes sense to encourage a neuron to be sensitive to variations along u1.

At the same time, it makes sense to inhibit a neuron from being sensitive to variations along u2 (the variation there appears to be small noise and unimportant for reconstruction).

By doing so we can balance between the contradicting goals of good reconstruction and low sensitivity.

What does this remind you of?



53/55

Module 7.7: Summary


54/55

[Figures: the autoencoder x → h → x̂, and the data scatter in the x-y plane with principal directions u1 and u2.]

PCA: P^T X^T X P = D

min_θ ‖X − H W*‖²_F, with H W* = U Σ V^T (SVD)
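The two equations above can be checked in a few lines; the sketch below (synthetic centred data, illustrative only) confirms that the rank-k reconstruction obtained from the SVD of X coincides with projecting X onto the top-k eigenvectors P of X^T X.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 200, 10, 3
X = rng.normal(size=(m, n))
X = X - X.mean(axis=0)                        # centred data matrix

# Rank-k reconstruction from the SVD of X (the optimal H W* in the objective above).
U, S, Vt = np.linalg.svd(X, full_matrices=False)
X_hat_svd = (U[:, :k] * S[:k]) @ Vt[:k, :]

# The same reconstruction via the top-k eigenvectors P of X^T X (P^T X^T X P = D).
eigvals, P = np.linalg.eigh(X.T @ X)
P_k = P[:, np.argsort(eigvals)[::-1][:k]]     # columns: top-k principal directions
X_hat_pca = X @ P_k @ P_k.T

print(np.allclose(X_hat_svd, X_hat_pca))      # True: SVD and PCA give the same rank-k fit
```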


55/55

[Figure: denoising autoencoder with corrupted input x̃_i obtained from x_i via P(x̃_ij | x_ij), hidden representation h, and reconstruction x̂_i.]

Regularization:

Ω(θ) = λ‖θ‖²   (weight decay)

Ω(θ) = ∑_{l=1}^{k} [ ρ log(ρ/ρ̂_l) + (1 − ρ) log((1 − ρ)/(1 − ρ̂_l)) ]   (sparse)

Ω(θ) = ∑_{j=1}^{n} ∑_{l=1}^{k} (∂h_l/∂x_j)²   (contractive)
