Contractive Auto-Encoders: Explicit Invariance During Feature Extraction
Akarsh Pokkunuru
EECS Department
03-16-2017
AGENDA
• Introduction to Auto-encoders
• Types of Auto-encoders
• Analysis of different Auto-encoders
• Contractive Auto-encoder
• Results and benchmarking tests
• Conclusion
Introduction To Auto-Encoders
Auto-Encoder Introduction
• Notation for Auto-Encoders: AE
• Intuition: an AE retains the good (useful) information in its input and discards the bad.
• AE is a great technique for:
– Characterizing the input distribution
– Dimensionality reduction
– Rich feature extraction
• Fewer hidden-layer nodes than input nodes creates a bottleneck; if the bottleneck is too narrow, the AE can fail to extract enough useful information.
Auto-Encoder Illustration
• Composed of two parts –
– Encoder
– Decoder
Auto-Encoder Mathematical Expression
• Encoder maps the input x to a hidden (higher-level) representation:
h = f(x) = s_f(W x + b_h)
• where h is the hidden layer representation, s_f is the encoder activation, W is the weight matrix, and b_h is the bias.
• The encoder output is a reduced-dimension, compact representation of the data.
Auto-Encoder Mathematical Expression Cont.
• Decoder tries to reconstruct the original input from the hidden representation with minimal error:
y = g(h) = s_g(W' h + b_y)
• where g(h) is the decoder mapping, s_g is the decoder activation, W' is the transpose of the encoder weights (tied weights), and b_y is the bias.
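A minimal NumPy sketch of this encoder/decoder pair, assuming sigmoid activations and tied weights (W' = W^T); the sizes and names are illustrative, not from the paper:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def encode(x, W, b_h):
    # h = f(x) = s_f(W x + b_h): map input to the hidden representation
    return sigmoid(W @ x + b_h)

def decode(h, W, b_y):
    # y = g(h) = s_g(W' h + b_y): reconstruct the input, W' = W^T (tied)
    return sigmoid(W.T @ h + b_y)

# Example: 784-dim input (e.g. an MNIST image) to 1000 hidden units
rng = np.random.default_rng(0)
W = rng.normal(0, 0.01, size=(1000, 784))
b_h, b_y = np.zeros(1000), np.zeros(784)
x = rng.random(784)
y = decode(encode(x, W, b_h), W, b_y)   # reconstruction of x
```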
Auto-Encoder Cont.
• Types of activation functions used:
– Linear (e.g. identity)
– Non-linear (e.g. sigmoid, tanh)
• Why linear activation?
– Very simple to implement
– But no interesting information at the output
• Why non-linear activation?
– Feature-rich output
– Higher computational burden
– Very popular in practice
Training AE and Cost Function
• Initialize the weights and biases of the encoder and decoder.
• Train on the data set D_n by minimizing the reconstruction error through the cost function:
τ_AE(θ) = Σ_{x ∈ D_n} L(x, g(f(x)))
• L is the reconstruction error function (e.g. mean squared error or cross-entropy) and τ_AE(θ) is the cost function.
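As an illustration, a bare-bones stochastic gradient descent loop for the tied-weight sigmoid AE sketched earlier, using the cross-entropy reconstruction error; this is a sketch under those assumptions, not the authors' training code:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_ae(X, n_hidden=1000, lr=0.1, epochs=10, seed=0):
    rng = np.random.default_rng(seed)
    n_in = X.shape[1]
    W = rng.normal(0, 0.01, size=(n_hidden, n_in))   # tied weights
    b_h, b_y = np.zeros(n_hidden), np.zeros(n_in)
    for _ in range(epochs):
        for x in X[rng.permutation(len(X))]:
            h = sigmoid(W @ x + b_h)          # encode
            y = sigmoid(W.T @ h + b_y)        # decode
            # Cross-entropy gradient w.r.t. the decoder pre-activation
            # is simply (y - x) for a sigmoid output layer.
            d_y = y - x
            d_h = (W @ d_y) * h * (1.0 - h)   # backprop into the encoder
            # Tied weights: sum the encoder-path and decoder-path gradients
            W -= lr * (np.outer(d_h, x) + np.outer(h, d_y))
            b_y -= lr * d_y
            b_h -= lr * d_h
    return W, b_h, b_y
```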
Types of Auto-Encoders
Types of Auto-Encoders
• Auto-Encoders can be categorized as follows:
– Normal AE
– Regularized AE
– Denoising AE
– Sparse AE
– Contractive AE
• We will focus on regularized, denoising and the proposed contractive AE.
Regularized Auto-Encoder
• Idea is to keep the weights small by "decaying" them, i.e. adding a weight-decay penalty to the cost:
τ_AE+wd(θ) = Σ_{x ∈ D_n} L(x, g(f(x))) + λ Σ_{ij} W_ij²
• λ controls the strength of the regularization and W are the weight parameters.
• Offers significantly better results than the basic AE on most benchmark datasets (MNIST, CIFAR, etc.)
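A small sketch of the weight-decay term; lam is an illustrative name for λ:

```python
import numpy as np

def weight_decay(W, lam=1e-4):
    penalty = lam * np.sum(W ** 2)   # added to the reconstruction cost
    grad = 2.0 * lam * W             # added to dCost/dW during SGD
    return penalty, grad
```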
Denoising Auto-Encoder
• Modification of Regularized AE.
• Idea is to add noise to the input on purpose and train the AE to reconstruct the clean version:
τ_DAE(θ) = Σ_{x ∈ D_n} E_{x̃ ~ q(x̃|x)} [ L(x, g(f(x̃))) ]
• x̃ = x + ε is the corrupted version of the input, and q(x̃|x) is the corruption process (e.g. additive Gaussian noise).
• Optimization is done by the stochastic gradient descent algorithm.
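A sketch of the corruption step, assuming additive Gaussian noise as the corruption process q(x̃|x); sigma is an illustrative noise level:

```python
import numpy as np

def corrupt(x, sigma=0.1, rng=None):
    # x~ = x + eps, with eps drawn from a zero-mean Gaussian
    if rng is None:
        rng = np.random.default_rng()
    return x + rng.normal(0.0, sigma, size=x.shape)

# Training step outline (reusing encode/decode from the earlier sketch):
#   y = decode(encode(corrupt(x), W, b_h), W, b_y)
#   minimize L(x, y)   # target is the clean input, not the noisy one
```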
Contractive Auto-Encoders
Contractive AE
• Modification of Regularized AE.
• Idea is to penalize uninteresting features, i.e. those overly sensitive to the input.
• Introduce a penalty that punishes high sensitivity of the hidden representation to the input, increasing robustness:
‖J_f(x)‖²_F = Σ_{ij} ( ∂h_j(x) / ∂x_i )²
• As a result, the learned features are flat (invariant) with respect to small variations in the input samples.
Contractive Auto-Encoder Cont.
• ‖J_f(x)‖²_F is the squared Frobenius norm of the Jacobian matrix of the encoder.
• If the encoder is linear, this penalty reduces to weight decay, so the regularized AE and CAE are identical.
• CAE and Denoising AE (DAE) behave in a similar way, but:
– the CAE encourages "flatness" of the first hidden layer directly (analytically);
– the DAE encourages "flatness" only through the reconstruction layer (stochastically).
• Even so, the cost of computing the penalty remains comparable to that of the reconstruction error!
Contractive Auto-Encoder Cont.
• The cost function is given as follows:
τ_CAE(θ) = Σ_{x ∈ D_n} ( L(x, g(f(x))) + λ ‖J_f(x)‖²_F )
• where λ has the same functionality as in the regularized AE and ‖J_f(x)‖²_F is the Jacobian penalty discussed previously.
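For a sigmoid encoder the Jacobian factorizes as J = diag(h(1-h)) W, so the penalty has a closed form whose cost is comparable to the forward pass (as noted above). A sketch under that assumption, with illustrative names:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def jacobian_penalty(x, W, b_h):
    h = sigmoid(W @ x + b_h)
    dh = h * (1.0 - h)           # per-unit sigmoid derivative
    # ||J_f(x)||_F^2 = sum_j dh_j^2 * sum_i W_ji^2
    return np.sum(dh ** 2 * np.sum(W ** 2, axis=1))

# CAE cost for one sample:
#   L(x, g(f(x))) + lam * jacobian_penalty(x, W, b_h)
```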
Example
• Received power data set – 4 million samples.
Results and Benchmarking
Considered Models For Comparison
• The models considered for performance comparison with the CAE are:
– a basic AE,
– a regularized AE with weight decay,
– a denoising AE,
– an RBM trained with contrastive divergence.
Experimental Setting
• The experimental setting for AE is as follows:
– Unsupervised training.
– First a single-layer NN, then extended to a multilayer setting.
– All auto-encoder variants used tied weights (faster convergence and fewer parameters to optimize).
– A sigmoid activation function for both encoder and decoder.
– A cross-entropy reconstruction error function.
– Optimization by stochastic gradient descent.
– 1000 hidden units per layer are used during training.
Experimental Setting
• The experimental setting for RBM neural network is as follows:
– Unsupervised training.
– First a single-layer NN, then extended to a multilayer setting.
– Contrastive divergence to train the RBM.
• After training, the learned feature-extraction parameters W, b are fed into an MLP with an additional randomly initialized output layer for classification.
• Gradient descent is then used for fine-tuning.
Results
• Two standard data sets are considered: MNIST and CIFAR-bw. The results are as follows:
Results Cont.
• SAT indicates the average fraction of saturated units.
• A unit counts as saturated when its activation falls below a lower threshold (e.g. 0.05) or above an upper threshold (e.g. 0.95).
• The Jacobian penalty is a measure of contraction (flatness): the lower its average, the better the invariance to small variations.
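A one-line sketch of the SAT measure as defined above; H is assumed to be a matrix of hidden activations (rows = samples, columns = units) and the thresholds are the example values from the slide:

```python
import numpy as np

def saturation_fraction(H, lo=0.05, hi=0.95):
    # Fraction of activations below the lower or above the upper threshold
    return np.mean((H < lo) | (H > hi))
```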
Results Cont.
• Results for stacked (multilayer) neural networks are as follows:
• A two-layer CAE is better than the other three-layer networks!
How Does Contraction Work?
• For a better understanding of how contraction works, we use the following analysis:
– Examine the local behavior around a data point when the contractive penalty is applied, through the singular values of the Jacobian matrix.
– Contraction affects not just the immediate samples but points beyond them (in mean and variance), measured by the contraction ratio d2(r)/d1 between two close points.
– The average contraction ratio for a hidden layer is defined using points randomly generated on a sphere of radius r.
Effect of Singular Values
• A large singular value corresponds to a direction of allowed variation.
• The CAE is better at characterizing the low-dimensional structure of the inputs.
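A sketch of how these singular values can be computed, reusing the closed-form Jacobian of a sigmoid encoder from the earlier sketch; names are illustrative:

```python
import numpy as np

def jacobian_singular_values(x, W, b_h):
    h = 1.0 / (1.0 + np.exp(-(W @ x + b_h)))    # sigmoid encoder
    J = (h * (1.0 - h))[:, None] * W            # J = diag(h(1-h)) W
    return np.linalg.svd(J, compute_uv=False)   # sorted in descending order
```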
Contraction Ratio
• The contraction ratio can be visualized as follows:
– x0 is some point from the validation data set.
– x1 is a randomly generated point on a sphere of radius r centered at x0 in the input space.
– The contraction ratio between x0 and x1 after mapping is given by d2(r)/d1, where:
– d1 = distance in the original input space (equal to r),
– d2 = distance in the mapped (feature) space.
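A sketch of this measurement, where f is assumed to be the trained encoder function; a ratio below 1 means the mapping contracts around x0:

```python
import numpy as np

def contraction_ratio(x0, f, r, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    u = rng.normal(size=x0.shape)
    x1 = x0 + r * u / np.linalg.norm(u)     # random point at distance r from x0
    d1 = np.linalg.norm(x1 - x0)            # distance in input space (= r)
    d2 = np.linalg.norm(f(x1) - f(x0))      # distance in feature space
    return d2 / d1
```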
Contraction ratio vs Radius
• The decrease in the CAE's contraction ratio occurs at the largest radii r.
• The CAE is trying to make the features invariant in all directions around the training examples.
• The reconstruction error counterbalances this, making sure the representation function does not collapse to a constant.
Contraction ratio vs Radius
• Measure of contraction ratio for CIFAR-bw.
Contraction ratio vs Radius
• Deeper encoders produce features that are more invariant over a farther distance.
Conclusion
• The Contractive AE uses a Jacobian penalty to induce flatness (invariance) with respect to small variations in the input.
• By looking at the contraction ratio and the singular values of the Jacobian, we have studied how the CAE becomes robust to small-scale variations in the data set.
• Finally, the penalty function helps the CAE improve performance compared to the other auto-encoders.
Thank you