![Page 1: 6.S093 Visual Recognition through Machine Learning Competition](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681697c550346895de186ec/html5/thumbnails/1.jpg)
6.S093 Visual Recognition through Machine Learning
Competition
Image by kirkh.deviantart.com
Aditya Khosla and Joseph Lim
![Page 2: 6.S093 Visual Recognition through Machine Learning Competition](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681697c550346895de186ec/html5/thumbnails/2.jpg)
Today’s class• Part 1: Introduction to deep learning
• What is deep learning?• Why deep learning?• Some common deep learning algorithms
• Part 2: Deep learning tutorial• Please install Python++ now!
![Page 3: 6.S093 Visual Recognition through Machine Learning Competition](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681697c550346895de186ec/html5/thumbnails/3.jpg)
Slide credit• Many slides are taken/adapted from Andrew Ng’s
![Page 4: 6.S093 Visual Recognition through Machine Learning Competition](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681697c550346895de186ec/html5/thumbnails/4.jpg)
Typical goal of machine learning
Label: “Motorcycle”Suggest tagsImage search…
Speech recognitionMusic classificationSpeaker identification…
Web searchAnti-spamMachine translation…
text
audio
images/video
input output
ML
ML
ML
![Page 5: 6.S093 Visual Recognition through Machine Learning Competition](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681697c550346895de186ec/html5/thumbnails/5.jpg)
Typical goal of machine learning
Label: “Motorcycle”Suggest tagsImage search…
Speech recognitionMusic classificationSpeaker identification…
Web searchAnti-spamMachine translation…
text
audio
images/video
input output
ML
ML
ML
Feature engineering: most time consuming!
![Page 6: 6.S093 Visual Recognition through Machine Learning Competition](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681697c550346895de186ec/html5/thumbnails/6.jpg)
Our goal in object classification
“motorcycle”ML
![Page 7: 6.S093 Visual Recognition through Machine Learning Competition](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681697c550346895de186ec/html5/thumbnails/7.jpg)
Why is this hard?You see this:
But the camera sees this:
![Page 8: 6.S093 Visual Recognition through Machine Learning Competition](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681697c550346895de186ec/html5/thumbnails/8.jpg)
Pixel-based representation
Input
Raw image
Motorbikes“Non”-Motorbikes
Learningalgorithm
pixel 1
pixe
l 2
pixel 1
pixel 2
![Page 9: 6.S093 Visual Recognition through Machine Learning Competition](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681697c550346895de186ec/html5/thumbnails/9.jpg)
Pixel-based representation
InputMotorbikes“Non”-Motorbikes
Learningalgorithm
pixel 1
pixe
l 2
pixel 1
pixel 2
Raw image
![Page 10: 6.S093 Visual Recognition through Machine Learning Competition](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681697c550346895de186ec/html5/thumbnails/10.jpg)
Pixel-based representation
InputMotorbikes“Non”-Motorbikes
Learningalgorithm
pixel 1
pixe
l 2
pixel 1
pixel 2
Raw image
![Page 11: 6.S093 Visual Recognition through Machine Learning Competition](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681697c550346895de186ec/html5/thumbnails/11.jpg)
What we want
InputMotorbikes“Non”-Motorbikes
Learningalgorithm
pixel 1
pixe
l 2
Feature representation
handlebars
wheelE.g., Does it have Handlebars? Wheels?
Handlebars
Whe
els
Raw image Features
![Page 12: 6.S093 Visual Recognition through Machine Learning Competition](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681697c550346895de186ec/html5/thumbnails/12.jpg)
Some feature representations
SIFT Spin image
HoG RIFT
Textons GLOH
![Page 13: 6.S093 Visual Recognition through Machine Learning Competition](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681697c550346895de186ec/html5/thumbnails/13.jpg)
Some feature representations
SIFT Spin image
HoG RIFT
Textons GLOH
Coming up with features is often difficult, time-consuming, and requires expert knowledge.
![Page 14: 6.S093 Visual Recognition through Machine Learning Competition](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681697c550346895de186ec/html5/thumbnails/14.jpg)
The brain: potential motivation for deep learning
[Roe et al., 1992]
Auditory cortex learns to see!
Auditory Cortex
![Page 15: 6.S093 Visual Recognition through Machine Learning Competition](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681697c550346895de186ec/html5/thumbnails/15.jpg)
The brain adapts!
[BrainPort; Welsh & Blasch, 1997; Nagel et al., 2005; Constantine-Paton & Law, 2009]
Seeing with your tongue Human echolocation (sonar)
Haptic belt: Direction sense Implanting a 3rd eye
![Page 16: 6.S093 Visual Recognition through Machine Learning Competition](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681697c550346895de186ec/html5/thumbnails/16.jpg)
Basic idea of deep learning• Also referred to as representation learning or
unsupervised feature learning (with subtle distinctions)
• Is there some way to extract meaningful features from data even without knowing the task to be performed?
• Then, throw in some hierarchical ‘stuff’ to make it ‘deep’
![Page 17: 6.S093 Visual Recognition through Machine Learning Competition](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681697c550346895de186ec/html5/thumbnails/17.jpg)
Feature learning problem• Given a 14x14 image patch x, can represent it using 196
real numbers.
• Problem: Can we find a learn a better feature vector to represent this?
255989387899148…
![Page 18: 6.S093 Visual Recognition through Machine Learning Competition](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681697c550346895de186ec/html5/thumbnails/18.jpg)
First stage of visual processing: V1
V1 is the first stage of visual processing in the brain.Neurons in V1 typically modeled as edge detectors:
Neuron #1 of visual cortex(model)
Neuron #2 of visual cortex(model)
![Page 19: 6.S093 Visual Recognition through Machine Learning Competition](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681697c550346895de186ec/html5/thumbnails/19.jpg)
Learning sensor representations
Sparse coding (Olshausen & Field,1996)
Input: Images x(1), x(2), …, x(m) (each in Rn x n)
Learn: Dictionary of bases f1, f2, …, fk (also Rn x n), so that each input x can be approximately decomposed as:
x aj fj
s.t. aj’s are mostly zero (“sparse”)
j=1
k
![Page 20: 6.S093 Visual Recognition through Machine Learning Competition](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681697c550346895de186ec/html5/thumbnails/20.jpg)
Sparse coding illustration Natural Images Learned bases (f1 , …, f64): “Edges”
50 100 150 200 250 300 350 400 450 500
50
100
150
200
250
300
350
400
450
500
50 100 150 200 250 300 350 400 450 500
50
100
150
200
250
300
350
400
450
500
50 100 150 200 250 300 350 400 450 500
50
100
150
200
250
300
350
400
450
500
» 0.8 * + 0.3 * + 0.5 *
x » 0.8 * f36 + 0.3 * f42
+ 0.5 * f63[a1, …, a64] = [0, 0, …, 0, 0.8, 0, …, 0, 0.3, 0, …, 0, 0.5, 0]
(feature representation)
Test example
![Page 21: 6.S093 Visual Recognition through Machine Learning Competition](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681697c550346895de186ec/html5/thumbnails/21.jpg)
Sparse coding illustration
0.6 * + 0.8 * + 0.4 * f15 f28
f
37
1.3 * + 0.9 * + 0.3 *
f5 f18
f
29 • Method “invents” edge detection• Automatically learns to represent an image in terms of the edges that appear in it. Gives a more succinct, higher-level representation than the raw pixels. • Quantitatively similar to primary visual cortex (area V1) in brain.
Represent as: [a5=1.3, a18=0.9, a29 = 0.3]
Represent as: [a15=0.6, a28=0.8, a37 = 0.4]
![Page 22: 6.S093 Visual Recognition through Machine Learning Competition](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681697c550346895de186ec/html5/thumbnails/22.jpg)
Going deep
pixels
edges
object parts(combination of edges)
object models
[Honglak Lee]
Training set: Alignedimages of faces.
![Page 23: 6.S093 Visual Recognition through Machine Learning Competition](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681697c550346895de186ec/html5/thumbnails/23.jpg)
Why deep learning?
Method Accuracy
Hessian + ESURF [Williems et al 2008] 38%
Harris3D + HOG/HOF [Laptev et al 2003, 2004] 45%
Cuboids + HOG/HOF [Dollar et al 2005, Laptev 2004] 46%
Hessian + HOG/HOF [Laptev 2004, Williems et al 2008] 46%
Dense + HOG / HOF [Laptev 2004] 47%
Cuboids + HOG3D [Klaser 2008, Dollar et al 2005] 46%
Unsupervised feature learning (our method) 52%
[Le, Zhou & Ng, 2011]
Task: video activity recognition
![Page 24: 6.S093 Visual Recognition through Machine Learning Competition](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681697c550346895de186ec/html5/thumbnails/24.jpg)
TIMIT Phone classification AccuracyPrior art (Clarkson et al.,1999) 79.6%
Feature learning 80.3%
TIMIT Speaker identification AccuracyPrior art (Reynolds, 1995) 99.7%Feature learning 100.0%
Audio
Images
Multimodal (audio/video)
CIFAR Object classification AccuracyPrior art (Ciresan et al., 2011) 80.5%
Feature learning 82.0%
NORB Object classification AccuracyPrior art (Scherer et al., 2010) 94.4%
Feature learning 95.0%
AVLetters Lip reading AccuracyPrior art (Zhao et al., 2009) 58.9%
Stanford Feature learning 65.8%
GalaxyHollywood2 Classification AccuracyPrior art (Laptev et al., 2004) 48%
Feature learning 53%
KTH AccuracyPrior art (Wang et al., 2010) 92.1%
Feature learning 93.9%
UCF AccuracyPrior art (Wang et al., 2010) 85.6%
Feature learning 86.5%
YouTube AccuracyPrior art (Liu et al., 2009) 71.2%
Feature learning 75.8%
Video
Text/NLPParaphrase detection AccuracyPrior art (Das & Smith, 2009) 76.1%
Feature learning 76.4%
Sentiment (MR/MPQA data) AccuracyPrior art (Nakagawa et al., 2010) 77.3%
Feature learning 77.7%
![Page 25: 6.S093 Visual Recognition through Machine Learning Competition](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681697c550346895de186ec/html5/thumbnails/25.jpg)
Speech recognition on Android
![Page 26: 6.S093 Visual Recognition through Machine Learning Competition](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681697c550346895de186ec/html5/thumbnails/26.jpg)
Impact on speech recognition
![Page 27: 6.S093 Visual Recognition through Machine Learning Competition](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681697c550346895de186ec/html5/thumbnails/27.jpg)
Application to Google Streetview
![Page 28: 6.S093 Visual Recognition through Machine Learning Competition](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681697c550346895de186ec/html5/thumbnails/28.jpg)
ImageNet classification: 22,000 classes…smoothhound, smoothhound shark, Mustelus mustelusAmerican smooth dogfish, Mustelus canisFlorida smoothhound, Mustelus norrisiwhitetip shark, reef whitetip shark, Triaenodon obseusAtlantic spiny dogfish, Squalus acanthiasPacific spiny dogfish, Squalus suckleyihammerhead, hammerhead sharksmooth hammerhead, Sphyrna zygaenasmalleye hammerhead, Sphyrna tudesshovelhead, bonnethead, bonnet shark, Sphyrna tiburoangel shark, angelfish, Squatina squatina, monkfishelectric ray, crampfish, numbfish, torpedosmalltooth sawfish, Pristis pectinatusguitarfishroughtail stingray, Dasyatis centrourabutterfly rayeagle rayspotted eagle ray, spotted ray, Aetobatus narinaricownose ray, cow-nosed ray, Rhinoptera bonasusmanta, manta ray, devilfishAtlantic manta, Manta birostrisdevil ray, Mobula hypostomagrey skate, gray skate, Raja batislittle skate, Raja erinacea…
Stingray
Mantaray
![Page 29: 6.S093 Visual Recognition through Machine Learning Competition](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681697c550346895de186ec/html5/thumbnails/29.jpg)
0.005%Random guess
9.5% ?Feature learning From raw pixels
State-of-the-art(Weston, Bengio ‘11)
Le, et al., Building high-level features using large-scale unsupervised learning. ICML 2012
ImageNet Classification: 14M images, 22k categories
![Page 30: 6.S093 Visual Recognition through Machine Learning Competition](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681697c550346895de186ec/html5/thumbnails/30.jpg)
0.005%Random guess
9.5% 21.3%Feature learning From raw pixels
State-of-the-art(Weston, Bengio ‘11)
Le, et al., Building high-level features using large-scale unsupervised learning. ICML 2012
ImageNet Classification: 14M images, 22k categories
![Page 31: 6.S093 Visual Recognition through Machine Learning Competition](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681697c550346895de186ec/html5/thumbnails/31.jpg)
Some common deep architectures•Autoencoders•Deep belief networks (DBNs)•Convolutional variants•Sparse coding
![Page 32: 6.S093 Visual Recognition through Machine Learning Competition](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681697c550346895de186ec/html5/thumbnails/32.jpg)
Andrew Ng
Logistic regression
x1
x2
x3
+1
Logistic regression has a learned parameter vector q. On input x, it outputs:
where
Draw a logisticregression unit as:
![Page 33: 6.S093 Visual Recognition through Machine Learning Competition](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681697c550346895de186ec/html5/thumbnails/33.jpg)
Andrew Ng
Neural Network
String a lot of logistic units together. Example 3 layer network:
x1
x2
x3
+1 +1
a3
a2
a1
Layer 1 Layer 3
Layer 3
![Page 34: 6.S093 Visual Recognition through Machine Learning Competition](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681697c550346895de186ec/html5/thumbnails/34.jpg)
Andrew Ng
Neural Network
x1
x2
x3
+1 +1
Layer 1 Layer 2
Layer 4+1
Layer 3
Example 4 layer network with 2 output units:
![Page 35: 6.S093 Visual Recognition through Machine Learning Competition](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681697c550346895de186ec/html5/thumbnails/35.jpg)
Andrew Ng
Training a neural network
Given training set (x1, y1), (x2, y2), (x3, y3 ), ….
Adjust parameters q (for every node) to make:
(Use gradient descent. “Backpropagation” algorithm. Susceptible to local optima.)
![Page 36: 6.S093 Visual Recognition through Machine Learning Competition](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681697c550346895de186ec/html5/thumbnails/36.jpg)
Andrew Ng
Unsupervised feature learning with a neural network
x4
x5
x6
+1
Layer 1
Layer 2
x1
x2
x3
x4
x5
x6
x1
x2
x3
+1
Layer 3
Autoencoder.
Network is trained to output the input (learn identify function).
Trivial solution unless:- Constrain number of units in Layer 2 (learn compressed representation), or- Constrain Layer 2 to be sparse.
a1
a2
a3
![Page 37: 6.S093 Visual Recognition through Machine Learning Competition](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681697c550346895de186ec/html5/thumbnails/37.jpg)
Andrew Ng
Unsupervised feature learning with a neural network
x4
x5
x6
+1
Layer 1
Layer 2
x1
x2
x3
x4
x5
x6
x1
x2
x3
+1
Layer 3
a1
a2
a3
![Page 38: 6.S093 Visual Recognition through Machine Learning Competition](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681697c550346895de186ec/html5/thumbnails/38.jpg)
Andrew Ng
Unsupervised feature learning with a neural network
x4
x5
x6
+1
Layer 1
Layer 2
x1
x2
x3
+1
a1
a2
a3
New representation for input.
![Page 39: 6.S093 Visual Recognition through Machine Learning Competition](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681697c550346895de186ec/html5/thumbnails/39.jpg)
Andrew Ng
Unsupervised feature learning with a neural network
x4
x5
x6
+1
Layer 1
Layer 2
x1
x2
x3
+1
a1
a2
a3
![Page 40: 6.S093 Visual Recognition through Machine Learning Competition](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681697c550346895de186ec/html5/thumbnails/40.jpg)
Andrew Ng
Unsupervised feature learning with a neural network
x4
x5
x6
+1
x1
x2
x3
+1
a1
a2
a3
+1
b1
b2
b3
Train parameters so that , subject to bi’s being sparse.
![Page 41: 6.S093 Visual Recognition through Machine Learning Competition](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681697c550346895de186ec/html5/thumbnails/41.jpg)
Andrew Ng
Unsupervised feature learning with a neural network
x4
x5
x6
+1
x1
x2
x3
+1
a1
a2
a3
+1
b1
b2
b3
Train parameters so that , subject to bi’s being sparse.
![Page 42: 6.S093 Visual Recognition through Machine Learning Competition](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681697c550346895de186ec/html5/thumbnails/42.jpg)
Andrew Ng
Unsupervised feature learning with a neural network
x4
x5
x6
+1
x1
x2
x3
+1
a1
a2
a3
+1
b1
b2
b3
Train parameters so that , subject to bi’s being sparse.
![Page 43: 6.S093 Visual Recognition through Machine Learning Competition](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681697c550346895de186ec/html5/thumbnails/43.jpg)
Andrew Ng
Unsupervised feature learning with a neural network
x4
x5
x6
+1
x1
x2
x3
+1
a1
a2
a3
+1
b1
b2
b3
New representation for input.
![Page 44: 6.S093 Visual Recognition through Machine Learning Competition](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681697c550346895de186ec/html5/thumbnails/44.jpg)
Andrew Ng
Unsupervised feature learning with a neural network
x4
x5
x6
+1
x1
x2
x3
+1
a1
a2
a3
+1
b1
b2
b3
![Page 45: 6.S093 Visual Recognition through Machine Learning Competition](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681697c550346895de186ec/html5/thumbnails/45.jpg)
Andrew Ng
Unsupervised feature learning with a neural network
x4
x5
x6
+1
x1
x2
x3
+1
a1
a2
a3
+1
b1
b2
b3
+1
c1
c2
c3
![Page 46: 6.S093 Visual Recognition through Machine Learning Competition](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681697c550346895de186ec/html5/thumbnails/46.jpg)
Andrew Ng
Unsupervised feature learning with a neural network
x4
x5
x6
+1
x1
x2
x3
+1
a1
a2
a3
+1
b1
b2
b3
+1
c1
c2
c3
New representation for input.
Use [c1, c3, c3] as representation to feed to learning algorithm.
![Page 47: 6.S093 Visual Recognition through Machine Learning Competition](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681697c550346895de186ec/html5/thumbnails/47.jpg)
Andrew Ng
Deep Belief Net
Deep Belief Net (DBN) is another algorithm for learning a feature hierarchy.
Building block: 2-layer graphical model (Restricted Boltzmann Machine).
Can then learn additional layers one at a time.
![Page 48: 6.S093 Visual Recognition through Machine Learning Competition](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681697c550346895de186ec/html5/thumbnails/48.jpg)
Andrew Ng
Restricted Boltzmann machine (RBM)
Input [x1, x2, x3, x4]
Layer 2. [a1, a2, a3](binary-valued)
MRF with joint distribution:
Use Gibbs sampling for inference.
Given observed inputs x, want maximum likelihood estimation:
x4x1 x2 x3
a2 a3a1
![Page 49: 6.S093 Visual Recognition through Machine Learning Competition](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681697c550346895de186ec/html5/thumbnails/49.jpg)
Andrew Ng
Restricted Boltzmann machine (RBM)
Input [x1, x2, x3, x4]
Layer 2. [a1, a2, a3](binary-valued)
Gradient ascent on log P(x) :
[xiaj]obs from fixing x to observed value, and sampling a from P(a|x).
[xiaj]prior from running Gibbs sampling to convergence.
Adding sparsity constraint on ai’s usually improves results.
x4x1 x2 x3
a2 a3a1
![Page 50: 6.S093 Visual Recognition through Machine Learning Competition](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681697c550346895de186ec/html5/thumbnails/50.jpg)
Andrew Ng
Deep Belief Network
Input [x1, x2, x3, x4]
Layer 2. [a1, a2, a3]
Layer 3. [b1, b2, b3]
Similar to a sparse autoencoder in many ways. Stack RBMs on top of each other to get DBN.
Train with approximate maximum likelihood (often with sparsity constraint on ai’s):
![Page 51: 6.S093 Visual Recognition through Machine Learning Competition](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681697c550346895de186ec/html5/thumbnails/51.jpg)
Andrew Ng
Deep Belief Network
Input [x1, x2, x3, x4]
Layer 2. [a1, a2, a3]
Layer 3. [b1, b2, b3]
Layer 4. [c1, c2, c3]
![Page 52: 6.S093 Visual Recognition through Machine Learning Competition](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681697c550346895de186ec/html5/thumbnails/52.jpg)
Andrew Ng
Convolutional DBN for audio
Spectrogram
Detection units
Max pooling unit
![Page 53: 6.S093 Visual Recognition through Machine Learning Competition](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681697c550346895de186ec/html5/thumbnails/53.jpg)
Andrew Ng
Convolutional DBN for audio
Spectrogram
![Page 54: 6.S093 Visual Recognition through Machine Learning Competition](https://reader036.vdocuments.us/reader036/viewer/2022062501/5681697c550346895de186ec/html5/thumbnails/54.jpg)
Andrew Ng
Convolutional DBN for Images
Wk
Detection layer H
Max-pooling layer P
Hidden nodes (binary)
“Filter” weights (shared)
‘’max-pooling’’ node (binary)
Input data V