TRANSCRIPT
Analysis of Single-Layer Networks
Presented by Hourieh Fakourfar
Adam Coates [email protected]
Honglak Lee [email protected]
Andrew Y. Ng [email protected]
Computer Science Department, Stanford University, Stanford, CA 94305, USA
An Analysis of Single-Layer Networks in Unsupervised Feature Learning
Agenda
• Introduction
• Unsupervised Learning
• Feature Extraction
• Classification
• Experiments & Analysis
• Conclusion
Motivation
• Achieve state-of-the-art performance with simple algorithms and a single layer of features
• Avoid complexity and expense without compromising performance
Single-Layer Network
• Maps the n-dimensional input space to the m-dimensional output space
• Widely used for linearly separable problems
[Figure: input layer connected directly to output layer]
http://wwwold.ece.utep.edu/
Multi-layer Network
• One or more hidden layers
• Higher level of computation
• More complexity
• Cost efficient?
• Performance?
Unsupervised Learning
• Find hidden structure in unlabeled data
▫ Learn feature representations from unlabeled data
• Unlike supervised learning, there is no error or reward signal to evaluate a potential solution
• Approaches to unsupervised learning include:
▫ Clustering: k-means, mixture models, hierarchical clustering
▫ Blind signal separation using feature extraction techniques for dimensionality reduction: PCA, independent component analysis, non-negative matrix factorization, singular value decomposition
Benchmark Datasets
• CIFAR
• NORB
• STL
http://www.idsia.ch/~juergen/vision.html
http://www.stanford.edu/~acoates//stl10/
CIFAR-10
• 60000 32x32 color images
▫ 10 classes
▫ 6000 images per class
• 50000 training images
• 10000 test images
CIFAR-10: 10 classes with 10 random images from each
http://www.cs.toronto.edu/~kriz/cifar.html
NORB
• Images of 50 toys
• 5 generic categories
▫ Four-legged animals
▫ Human figures
▫ Airplanes
▫ Trucks
▫ Cars
• Training set:
▫ 5 instances of each category (instances 4, 6, 7, 8 and 9)
• Test set:
▫ 5 instances (instances 0, 1, 2, 3, and 5).
STL-10
• 10 classes
▫ airplane, bird, car, cat, deer, dog, horse, monkey, ship, truck.
• Image size
▫ 96x96 pixels
• Training images
▫ 500 per class (10 pre-defined folds)
• Test images
▫ 800 per class
• Unlabeled images
▫ 100000 images for unsupervised learning
Learning Framework
• Extract patches: extract random patches from unlabeled training images
• Pre-processing: apply a pre-processing stage to the patches
• Feature-mapping: learn a feature-mapping using an unsupervised learning algorithm
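Below is a minimal NumPy sketch of this three-step pipeline, assuming grayscale n x n images for brevity (the benchmark datasets are color); the function names, patch count, and patch size are illustrative assumptions, not the authors' code.

```python
import numpy as np

def extract_random_patches(images, num_patches=100000, w=6):
    """Extract num_patches random w x w patches from a stack of n x n images."""
    n = images.shape[1]
    patches = np.empty((num_patches, w * w))
    for i in range(num_patches):
        img = images[np.random.randint(len(images))]    # pick a random image
        r, c = np.random.randint(0, n - w + 1, size=2)  # random top-left corner
        patches[i] = img[r:r + w, c:c + w].ravel()
    return patches

# Pipeline outline:
# patches = extract_random_patches(unlabeled_images)   # step 1: extract patches
# patches = normalize_and_whiten(patches)              # step 2: pre-processing
# features = learn_feature_mapping(patches)            # step 3: unsupervised learning
```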
Preprocessing
• Normalization (z-scores)
▫ For images/visual data: local brightness and contrast normalization
▫ Mean subtraction and scale normalization
• Whitening
▫ Common preprocessing technique
▫ Decorrelates the data
▫ The observed vector x is linearly transformed to a new vector
Whitening
• The observed vector x is linearly transformed to a new vector such that:
▫ Its components are uncorrelated
▫ Their variances are equal to unity
• The new vector is then defined by:
▫ x_white = E D^(-1/2) E'x, where the covariance of x has eigendecomposition E D E'
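A hedged NumPy sketch of this pre-processing, combining per-patch normalization with ZCA-style whitening; the epsilon regularizers and exact normalization constants are illustrative assumptions, not values from the slides.

```python
import numpy as np

def normalize_and_whiten(X, eps_norm=10.0, eps_zca=0.1):
    """X: (num_patches, dim). Brightness/contrast normalize, then ZCA-whiten."""
    # Per-patch mean subtraction and scale (contrast) normalization
    X = X - X.mean(axis=1, keepdims=True)
    X = X / np.sqrt(X.var(axis=1, keepdims=True) + eps_norm)

    # ZCA whitening: decorrelate components and give them unit variance
    mean = X.mean(axis=0)
    Xc = X - mean
    cov = Xc.T @ Xc / Xc.shape[0]
    D, E = np.linalg.eigh(cov)                       # cov = E diag(D) E'
    W = E @ np.diag(1.0 / np.sqrt(D + eps_zca)) @ E.T
    return Xc @ W, mean, W                           # whitened data + transform
```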
Feature Mapping
• A feature mapping f : R^N -> R^K transforms an N-dimensional input patch x into a K-dimensional feature vector f(x)
Feature Extraction & Classification Framework
• Extract features from equally spaced sub-patches covering the input image
• Pool features over regions of the image to reduce dimensionality
• Train a classifier and predict labels
Feature extraction and classification
• The feature mapping yields an (n-w+1)-by-(n-w+1)-by-K image representation y_ij
• Reduce dimensionality by summing over local regions of y_ij:
▫ Split y_ij into 4 equal-sized quadrants
▫ Compute the sum of y_ij in each quadrant
▫ Obtain a 4K-dimensional feature vector for each training image and its label
• Apply (L2) SVM classification
▫ Regularization parameters determined using cross-validation
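A rough NumPy sketch of the quadrant sum-pooling described above; the feature-map layout (rows x cols x K) and the helper name are assumptions for illustration.

```python
import numpy as np

def pool_quadrants(Y):
    """Y: a (rows, cols, K) feature map y_ij for one image.
    Sum the K feature values over each of the 4 image quadrants -> 4K-dim vector."""
    rows, cols, K = Y.shape
    hr, hc = rows // 2, cols // 2
    quadrants = [Y[:hr, :hc], Y[:hr, hc:], Y[hr:, :hc], Y[hr:, hc:]]
    return np.concatenate([q.sum(axis=(0, 1)) for q in quadrants])  # shape (4K,)

# The pooled 4K-dimensional vectors (plus labels) are then used to train a
# linear L2 SVM, e.g. sklearn.svm.LinearSVC, with C chosen by cross-validation.
```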
Feature Learning Algorithms
• Sparse auto-encoder
• Sparse RBMs
• K-means clustering
• Gaussian Mixtures
Sparse Auto-encoder
• Feature mapping:
▫ f(x) = g(Wx + b)
▫ where g(z) = 1 / (1 + exp(-z)) is the logistic sigmoid function, applied component-wise to the vector z
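As a small illustration of this mapping, a NumPy sketch (W and b would come from training the sparse auto-encoder; the shapes in the comments are assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))    # applied component-wise

def encode(x, W, b):
    """Feature mapping f(x) = g(Wx + b): maps an N-dim patch to K features."""
    return sigmoid(W @ x + b)          # W: (K, N), b: (K,), x: (N,)
```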
Sparse Auto-encoder
• A nice way to do non-linear dimensionality reduction:
▫ They provide mappings both ways
▫ The learning time and memory both scale linearly with the number of training cases
• Dimensionality reduction facilitates the classification, visualization, communication, and storage of high-dimensional data.
Image Representation
[Figure: an N-dimensional input image patch is mapped through parameters W, b to K feature values]
Sparse Restricted Boltzmann Machine (RBM)
• Particular form of log-linear Markov Random Field (MRF)
• Energy function is linear:
▫ E(v,h) = -b'v - c'h - h'Wv
▫ W: weights connecting hidden and visible units
▫ b, c: offsets of the visible and hidden layers respectively
• Sparsity penalty as in the auto-encoder
[Figure: RBM with m = 3 hidden units and n = 4 visible units]
http://deeplearning.net/tutorial/rbm.html
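A tiny NumPy sketch of the energy function above; the unit counts mirror the figure, and the parameter values are placeholders rather than learned weights.

```python
import numpy as np

def rbm_energy(v, h, W, b, c):
    """E(v, h) = -b'v - c'h - h'Wv for binary visible v and hidden h."""
    return -(b @ v) - (c @ h) - (h @ W @ v)

# Shapes matching the figure: n = 4 visible units, m = 3 hidden units
v = np.array([1, 0, 1, 1])                                # visible units
h = np.array([0, 1, 1])                                   # hidden units
W = np.zeros((3, 4)); b = np.zeros(4); c = np.zeros(3)    # placeholder parameters
print(rbm_energy(v, h, W, b, c))                          # -> 0.0 with zero parameters
```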
K-means Clustering
• Standard 1-of-K, hard-assignment coding
• A non-linear mapping that attempts a "softer", distance-based coding
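A NumPy sketch of both encodings: the standard hard 1-of-K assignment and a softer, distance-based activation. The exact soft formula used here, max(0, mean distance minus distance), is an assumption drawn from the paper this talk presents.

```python
import numpy as np

def kmeans_hard_code(x, centroids):
    """Standard 1-of-K coding: 1 for the nearest centroid, 0 elsewhere."""
    d = np.linalg.norm(centroids - x, axis=1)   # distances to the K centroids
    f = np.zeros(len(centroids))
    f[np.argmin(d)] = 1.0
    return f

def kmeans_soft_code(x, centroids):
    """Softer coding: f_k = max(0, mean(d) - d_k), zero for far-away centroids."""
    d = np.linalg.norm(centroids - x, axis=1)
    return np.maximum(0.0, d.mean() - d)
```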
Gaussian Mixture Model (GMM)
• Represents the density of the data with a mixture of K Gaussian distributions
• f maps each input to the posterior membership probabilities
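A brief sketch of this posterior mapping using scikit-learn's GaussianMixture; the diagonal covariances, the number of components, and the random stand-in data are illustrative placeholders, not the authors' setup.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

K = 100                                            # illustrative number of components
whitened_patches = np.random.randn(5000, 108)      # stand-in for real whitened patches
gmm = GaussianMixture(n_components=K, covariance_type='diag').fit(whitened_patches)
features = gmm.predict_proba(whitened_patches)     # posterior memberships, shape (5000, K)
```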
Parameters
• Evaluate and assess the effect of changes in the following parameters:
▫ Whitened or raw image
▫ Number of features K
▫ Stride size s
▫ Receptive field size w
Testing Procedure
• For each unsupervised learning algorithm
▫ Train a single layer of features
Whitened or raw inputs
Choice of the parameters K, s, and w
▫ Then train a linear classifier
On a holdout set (main analysis)
On the test set (for final results)
K-means
• Whitening is a crucial pre-processing step since the clustering algorithm cannot handle the correlations in the data
[Figure: learned centroids without whitening (left) vs. with whitening (right)]
GMM
• Whitening is a crucial pre-processing step since the clustering algorithm cannot handle the correlations in the data
[Figure: learned features without whitening (left) vs. with whitening (right)]
Sparse Auto-encoder
• The effect of whitening here is somewhat ambiguous
[Figure: learned features without whitening (left) vs. with whitening (right)]
Sparse RBM
• The effect of whitening here is somewhat ambiguous
[Figure: learned features without whitening (left) vs. with whitening (right)]
Performance for Raw and Whitened Inputs
• Feature representations with K = 100, 200, 400, 800, 1200, and 1600
• All algorithms achieved higher performance by learning more features, as expected
Performance vs. Feature Stride
• “Stride” s is the spacing between patches where feature values will be extracted
• The number of features is fixed (1600)
• Receptive field size: 6 pixels
• Stride varies over 1, 2, 4, 8
Effect of Receptive Field
• Stride = 1; 1600 bases; whitening
• Tested results for w = 6, 8, 12
• Overall, the 6-pixel receptive field worked best
• Meanwhile, 12 pixels was similar to or worse than 6 or 8 pixels
• Unlike the other parameters, the receptive field size requires cross-validation to make an informed choice
Final Classification Results
Conclusions
Mean subtraction, scale normalization, and whitening
+ Large K (# of features)
+ Small s (step size or "stride")
+ Right patch size w (receptive field size)
+ Simple feature learning algorithm (soft K-means)
=
State-of-the-art results on CIFAR-10 and NORB