Contractive Auto-Encoders: Explicit Invariance During Feature Extraction
Akarsh Pokkunuru
EECS Department
03-16-2017
AGENDA
• Introduction to Auto-encoders
• Types of Auto-encoders
• Analysis of different Auto-encoders
• Contractive Auto-encoder
• Results and benchmarking tests
• Conclusion
Introduction To Auto-Encoders
Auto-Encoder Introduction
• Notation for Auto-Encoders: AE
• Intuition: an AE retains the good (useful) information in its input and discards the bad.
• AE is a great technique for:
– Characterizing the input distribution
– Dimensionality reduction
– Rich feature extraction
• Fewer hidden-layer nodes than input nodes creates a bottleneck; if the bottleneck is too narrow, the AE can fail to extract enough useful information.
Auto-Encoder Illustration
• Composed of two parts –
– Encoder
– Decoder
Auto-Encoder Mathematical Expression
• Encoder maps the input x to a hidden (higher-level) representation:
h = f(x) = s_f(W x + b_h)
• where h is the hidden layer representation, s_f is the encoder activation, W is the weight matrix, and b_h is the bias.
• The encoder output is a reduced-dimension, compact representation of the data.
Auto-Encoder Mathematical Expression Cont.
• Decoder tries to reconstruct the original input from the hidden representation with minimal error:
y = g(h) = s_g(W' h + b_y)
• where g(h) is the decoder mapping, s_g is the decoder activation, W' is the transpose of the encoder weights (tied weights), and b_y is the bias.
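A minimal NumPy sketch of this encoder/decoder pair, assuming sigmoid activations and tied weights (W' = W^T); the sizes and names are illustrative, not from the paper:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def encode(x, W, b_h):
    # h = f(x) = s_f(W x + b_h): map input to the hidden representation
    return sigmoid(W @ x + b_h)

def decode(h, W, b_y):
    # y = g(h) = s_g(W' h + b_y): reconstruct the input, W' = W^T (tied)
    return sigmoid(W.T @ h + b_y)

# Example: 784-dim input (e.g. an MNIST image) to 1000 hidden units
rng = np.random.default_rng(0)
W = rng.normal(0, 0.01, size=(1000, 784))
b_h, b_y = np.zeros(1000), np.zeros(784)
x = rng.random(784)
y = decode(encode(x, W, b_h), W, b_y)   # reconstruction of x
```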
Auto-Encoder Cont.
• Types of activation functions used:
– Linear (e.g. identity)
– Non-linear (e.g. sigmoid, tanh)
• Why linear activation?
– Very simple to implement
– But no interesting information at the output
• Why non-linear activation?
– Feature-rich output
– Higher computational burden
– Very popular in practice
Training AE and Cost Function
• Initialize the weights and biases of the encoder and decoder.
• Train on the data set D_n by minimizing the reconstruction error through the cost function:
τ_AE(θ) = Σ_{x ∈ D_n} L(x, g(f(x)))
• L is the reconstruction error function (e.g. mean squared error or cross-entropy) and τ_AE(θ) is the cost function.
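As an illustration, a bare-bones stochastic gradient descent loop for the tied-weight sigmoid AE sketched earlier, using the cross-entropy reconstruction error; this is a sketch under those assumptions, not the authors' training code:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_ae(X, n_hidden=1000, lr=0.1, epochs=10, seed=0):
    rng = np.random.default_rng(seed)
    n_in = X.shape[1]
    W = rng.normal(0, 0.01, size=(n_hidden, n_in))   # tied weights
    b_h, b_y = np.zeros(n_hidden), np.zeros(n_in)
    for _ in range(epochs):
        for x in X[rng.permutation(len(X))]:
            h = sigmoid(W @ x + b_h)          # encode
            y = sigmoid(W.T @ h + b_y)        # decode
            # Cross-entropy gradient w.r.t. the decoder pre-activation
            # is simply (y - x) for a sigmoid output layer.
            d_y = y - x
            d_h = (W @ d_y) * h * (1.0 - h)   # backprop into the encoder
            # Tied weights: sum the encoder-path and decoder-path gradients
            W -= lr * (np.outer(d_h, x) + np.outer(h, d_y))
            b_y -= lr * d_y
            b_h -= lr * d_h
    return W, b_h, b_y
```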
Types of Auto-Encoders
Types of Auto-Encoders
• Auto-Encoders can be categorized as follows:
– Normal AE
– Regularized AE
– Denoising AE
– Sparse AE
– Contractive AE
• We will focus on regularized, denoising and the proposed contractive AE.
Regularized Auto-Encoder
• Idea is to keep the weights small by "decaying" them, i.e. adding a weight-decay penalty to the cost:
τ_AE+wd(θ) = Σ_{x ∈ D_n} L(x, g(f(x))) + λ Σ_{ij} W_ij²
• λ controls the strength of the regularization and W are the weight parameters.
• Offers significantly better results than the basic AE on most benchmark datasets (MNIST, CIFAR, etc.)
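A small sketch of the weight-decay term; lam is an illustrative name for λ:

```python
import numpy as np

def weight_decay(W, lam=1e-4):
    penalty = lam * np.sum(W ** 2)   # added to the reconstruction cost
    grad = 2.0 * lam * W             # added to dCost/dW during SGD
    return penalty, grad
```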
Denoising Auto-Encoder
• Modification of Regularized AE.
• Idea is to add noise to the input on purpose and train the AE to reconstruct the clean version:
τ_DAE(θ) = Σ_{x ∈ D_n} E_{x̃ ~ q(x̃|x)} [ L(x, g(f(x̃))) ]
• x̃ = x + ε is the corrupted version of the input, and q(x̃|x) is the corruption process (e.g. additive Gaussian noise).
• Optimization is done by the stochastic gradient descent algorithm.
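A sketch of the corruption step, assuming additive Gaussian noise as the corruption process q(x̃|x); sigma is an illustrative noise level:

```python
import numpy as np

def corrupt(x, sigma=0.1, rng=None):
    # x~ = x + eps, with eps drawn from a zero-mean Gaussian
    if rng is None:
        rng = np.random.default_rng()
    return x + rng.normal(0.0, sigma, size=x.shape)

# Training step outline (reusing encode/decode from the earlier sketch):
#   y = decode(encode(corrupt(x), W, b_h), W, b_y)
#   minimize L(x, y)   # target is the clean input, not the noisy one
```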
Contractive Auto-Encoders
Contractive AE
• Modification of Regularized AE.
• Idea is to penalize uninteresting features, i.e. those overly sensitive to the input.
• Introduce a penalty that punishes high sensitivity of the hidden representation to the input, increasing robustness:
‖J_f(x)‖²_F = Σ_{ij} ( ∂h_j(x) / ∂x_i )²
• As a result, the learned features are flat (invariant) with respect to small variations in the input samples.
Contractive Auto-Encoder Cont.
• ‖J_f(x)‖²_F is the squared Frobenius norm of the Jacobian matrix of the encoder.
• If the encoder is linear, this penalty reduces to weight decay, so the regularized AE and CAE are identical.
• CAE and Denoising AE (DAE) behave in a similar way, but:
– the CAE encourages "flatness" of the first hidden layer directly (analytically);
– the DAE encourages "flatness" only through the reconstruction layer (stochastically).
• Even so, the cost of computing the penalty remains comparable to that of the reconstruction error!
Contractive Auto-Encoder Cont.
• The cost function is given as follows:
τ_CAE(θ) = Σ_{x ∈ D_n} ( L(x, g(f(x))) + λ ‖J_f(x)‖²_F )
• where λ has the same functionality as in the regularized AE and ‖J_f(x)‖²_F is the Jacobian penalty discussed previously.
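For a sigmoid encoder the Jacobian factorizes as J = diag(h(1-h)) W, so the penalty has a closed form whose cost is comparable to the forward pass (as noted above). A sketch under that assumption, with illustrative names:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def jacobian_penalty(x, W, b_h):
    h = sigmoid(W @ x + b_h)
    dh = h * (1.0 - h)           # per-unit sigmoid derivative
    # ||J_f(x)||_F^2 = sum_j dh_j^2 * sum_i W_ji^2
    return np.sum(dh ** 2 * np.sum(W ** 2, axis=1))

# CAE cost for one sample:
#   L(x, g(f(x))) + lam * jacobian_penalty(x, W, b_h)
```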
Example
• Received power data set – 4 million samples.
Results and Benchmarking
Considered Models For Comparison
• The models considered for performance comparison with the CAE are:
– a basic AE,
– a regularized AE with weight decay,
– a denoising AE,
– an RBM trained with contrastive divergence.
Experimental Setting
• The experimental setting for AE is as follows:
– Unsupervised training.
– First a single-layer NN, then extended to a multilayer setting.
– All auto-encoder variants used tied weights (faster convergence and fewer parameters to optimize).
– A sigmoid activation function for both encoder and decoder.
– A cross-entropy reconstruction error function.
– Optimization by stochastic gradient descent.
– 1000 hidden units per layer are used during training.
Experimental Setting
• The experimental setting for RBM neural network is as follows:
– Unsupervised training.
– First a single-layer NN, then extended to a multilayer setting.
– Contrastive divergence to train the RBM.
• After training, the learned feature-extraction parameters W, b are fed into an MLP with an additional randomly initialized output layer for classification.
• Gradient descent is then used for fine-tuning.
Results
• Two standard data sets are considered: MNIST and CIFAR-bw. The results are as follows:
Results Cont.
• SAT indicates the average fraction of saturated units.
• A unit counts as saturated when its activation falls below a lower threshold (e.g. 0.05) or above an upper threshold (e.g. 0.95).
• The Jacobian penalty is a measure of contraction (flatness): the lower its average, the better the invariance to small variations.
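A one-line sketch of the SAT measure as defined above; H is assumed to be a matrix of hidden activations (rows = samples, columns = units) and the thresholds are the example values from the slide:

```python
import numpy as np

def saturation_fraction(H, lo=0.05, hi=0.95):
    # Fraction of activations below the lower or above the upper threshold
    return np.mean((H < lo) | (H > hi))
```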
Results Cont.
• Results for stacked (multilayer) neural networks are as follows:
• A two-layer CAE is better than the other three-layer networks!
How Does Contraction Work?
• For a better understanding of how contraction works, we use the following analysis:
– Examine the local behavior around a data point when the contractive penalty is applied, through the singular values of the Jacobian matrix.
– Contraction affects not just the immediate samples but points beyond them (in mean and variance), measured by the contraction ratio d2(r)/d1 between two close points.
– The average contraction ratio for a hidden layer is defined using points randomly generated on a sphere of radius r.
Effect of Singular Values
• A large singular value corresponds to a direction of allowed variation.
• The CAE is better at characterizing the low-dimensional structure of the inputs.
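A sketch of how these singular values can be computed, reusing the closed-form Jacobian of a sigmoid encoder from the earlier sketch; names are illustrative:

```python
import numpy as np

def jacobian_singular_values(x, W, b_h):
    h = 1.0 / (1.0 + np.exp(-(W @ x + b_h)))    # sigmoid encoder
    J = (h * (1.0 - h))[:, None] * W            # J = diag(h(1-h)) W
    return np.linalg.svd(J, compute_uv=False)   # sorted in descending order
```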
Contraction Ratio
• The contraction ratio can be visualized as follows:
– x0 is some point from the validation data set.
– x1 is a randomly generated point on a sphere of radius r centered at x0 in the input space.
– The contraction ratio between x0 and x1 after mapping is given by d2(r)/d1, where:
– d1 = distance in the original input space (equal to r),
– d2 = distance in the mapped (feature) space.
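A sketch of this measurement, where f is assumed to be the trained encoder function; a ratio below 1 means the mapping contracts around x0:

```python
import numpy as np

def contraction_ratio(x0, f, r, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    u = rng.normal(size=x0.shape)
    x1 = x0 + r * u / np.linalg.norm(u)     # random point at distance r from x0
    d1 = np.linalg.norm(x1 - x0)            # distance in input space (= r)
    d2 = np.linalg.norm(f(x1) - f(x0))      # distance in feature space
    return d2 / d1
```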
Contraction ratio vs Radius
• The decrease in the CAE's contraction ratio occurs at the largest radii r.
• The CAE is trying to make the features invariant in all directions around the training examples.
• The reconstruction error counterbalances this, making sure the representation function does not collapse to a constant.
Contraction ratio vs Radius
• Measure of contraction ratio for CIFAR-bw.
Contraction ratio vs Radius
• Deeper encoders produce features that are more invariant over a farther distance.
Conclusion
• The Contractive AE uses a Jacobian penalty to induce flatness (invariance) with respect to small variations in the input.
• By looking at the contraction ratio and the singular values of the Jacobian, we have studied how the CAE becomes robust to small-scale variations in the data set.
• Finally, the penalty function helps the CAE improve performance compared to the other auto-encoders.
Thank you