6.S093 Visual Recognition through Machine Learning

Competition

Image by kirkh.deviantart.com

Aditya Khosla and Joseph Lim

Today’s class
• Part 1: Introduction to deep learning
  • What is deep learning?
  • Why deep learning?
  • Some common deep learning algorithms
• Part 2: Deep learning tutorial
  • Please install Python++ now!

Slide credit
• Many slides are taken/adapted from Andrew Ng’s

Typical goal of machine learning

input → output

images/video → Label: “Motorcycle”, Suggest tags, Image search, …

Speech recognition, Music classification, Speaker identification, …

Web search, Anti-spam, Machine translation, …

Feature engineering: most time consuming!

Our goal in object classification

[image] → ML → “motorcycle”

Why is this hard?

You see this:

But the camera sees this:

Pixel-based representation

[Figure: raw images of motorbikes and “non”-motorbikes, plotted by pixel 1 and pixel 2, as input to a learning algorithm]

What we want

[Figure: raw image (pixel 1, pixel 2) mapped to features (handlebars, wheel) and fed to a learning algorithm]

Feature representation, e.g., does it have handlebars? Wheels?

Some feature representations

SIFT Spin image

HoG RIFT

Textons GLOH

Coming up with features is often difficult, time-consuming, and requires expert knowledge.

The brain: potential motivation for deep learning

[Roe et al., 1992]

Auditory cortex learns to see!

Auditory Cortex

The brain adapts!

[BrainPort; Welsh & Blasch, 1997; Nagel et al., 2005; Constantine-Paton & Law, 2009]

• Seeing with your tongue
• Human echolocation (sonar)
• Haptic belt: direction sense
• Implanting a 3rd eye

Basic idea of deep learning
• Also referred to as representation learning or unsupervised feature learning (with subtle distinctions)
• Is there some way to extract meaningful features from data even without knowing the task to be performed?
• Then, throw in some hierarchical ‘stuff’ to make it ‘deep’

Feature learning problem
• Given a 14x14 image patch x, we can represent it using 196 real numbers:

x = [255, 98, 93, 87, 89, 91, 48, …]

• Problem: Can we learn a better feature vector to represent this?
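As a small illustration of the raw-pixel representation, the sketch below flattens a 14x14 patch into a vector of 196 real numbers. The patch contents are random placeholder values, not data from the slides, and the variable names are hypothetical.

```python
import numpy as np

# Hypothetical 14x14 grayscale patch with intensities in [0, 255].
rng = np.random.default_rng(0)
patch = rng.integers(0, 256, size=(14, 14))

# Raw-pixel representation: one real number per pixel, read row by row.
x = patch.astype(np.float64).reshape(-1)

print(x.shape)  # (196,)
```

Any learned feature vector is an alternative to this `x` as the input to the learning algorithm.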

First stage of visual processing: V1

V1 is the first stage of visual processing in the brain. Neurons in V1 are typically modeled as edge detectors:

[Figure: Neuron #1 of visual cortex (model); Neuron #2 of visual cortex (model)]
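One common way to model such an edge-detecting neuron is a Gabor filter. The slides do not name a specific model, so the choice of a Gabor filter and every parameter below (size, wavelength, sigma) are assumptions made for illustration only.

```python
import numpy as np

def gabor_filter(size=9, theta=0.0, wavelength=4.0, sigma=2.0):
    """Return a size x size Gabor filter that oscillates along direction theta."""
    half = size // 2
    ys, xs = np.mgrid[-half:half + 1, -half:half + 1]
    # Coordinate along the direction in which the filter oscillates.
    u = xs * np.cos(theta) + ys * np.sin(theta)
    envelope = np.exp(-(xs ** 2 + ys ** 2) / (2.0 * sigma ** 2))
    carrier = np.sin(2.0 * np.pi * u / wavelength)  # odd phase: responds to edges
    return envelope * carrier

def response(patch, filt):
    """Model neuron output: magnitude of the filter's inner product with the patch."""
    return abs(float(np.sum(patch * filt)))

# Two synthetic 9x9 patches: a vertical edge and a horizontal edge.
vertical_edge = np.zeros((9, 9))
vertical_edge[:, 5:] = 1.0
horizontal_edge = vertical_edge.T

neuron = gabor_filter(theta=0.0)  # oscillates along x, so tuned to vertical edges
print(response(vertical_edge, neuron), response(horizontal_edge, neuron))
```

The model neuron responds strongly to the edge orientation it is tuned to and essentially not at all to the perpendicular one.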

Learning sensor representations

Sparse coding (Olshausen & Field, 1996)

Input: Images $x^{(1)}, x^{(2)}, \dots, x^{(m)}$ (each in $\mathbb{R}^{n \times n}$)

Learn: Dictionary of bases $f_1, f_2, \dots, f_k$ (also in $\mathbb{R}^{n \times n}$), so that each input $x$ can be approximately decomposed as:

$x \approx \sum_{j=1}^{k} a_j f_j$

s.t. the $a_j$’s are mostly zero (“sparse”)
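The decomposition above can be sketched for a fixed dictionary. The code below only infers the coefficients $a_j$ for one input; it does not learn the bases, which full sparse coding would also do. It uses ISTA (iterative soft-thresholding) on an L1-penalized least-squares objective, which is one standard way to obtain sparse coefficients and is an assumption here, as are the random dictionary, the sizes, and the penalty weight `lam`.

```python
import numpy as np

def soft_threshold(v, t):
    """Shrink each entry of v toward zero by t (proximal step for the L1 penalty)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def sparse_code(x, F, lam=0.1, n_iter=500):
    """ISTA for min_a 0.5*||x - F a||^2 + lam*||a||_1, with F's columns as bases."""
    step = 1.0 / np.linalg.norm(F, 2) ** 2  # 1 / Lipschitz constant of the gradient
    a = np.zeros(F.shape[1])
    for _ in range(n_iter):
        grad = F.T @ (F @ a - x)
        a = soft_threshold(a - step * grad, step * lam)
    return a

rng = np.random.default_rng(0)
n, k = 64, 128                      # hypothetical: flattened 8x8 patches, 128 bases
F = rng.standard_normal((n, k))
F /= np.linalg.norm(F, axis=0)      # unit-norm bases (random, not learned)

# Synthetic input built from three bases.
a_true = np.zeros(k)
a_true[[36, 42, 63]] = [0.8, 0.3, 0.5]
x = F @ a_true

a = sparse_code(x, F, lam=0.01)
print(np.count_nonzero(a), "nonzero coefficients out of", k)
```

The vector `a` plays the role of the feature representation; the soft-thresholding step is what pushes most of its entries to exactly zero.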

Sparse coding illustration

[Figure: natural images and learned bases ($f_1, \dots, f_{64}$): “Edges”]

Test example:

$x \approx 0.8 f_{36} + 0.3 f_{42} + 0.5 f_{63}$

$[a_1, \dots, a_{64}] = [0, 0, \dots, 0, 0.8, 0, \dots, 0, 0.3, 0, \dots, 0, 0.5, 0]$ (feature representation)

Sparse coding illustration

$x \approx 0.6 f_{15} + 0.8 f_{28} + 0.4 f_{37}$

Represent as: $[a_{15} = 0.6, a_{28} = 0.8, a_{37} = 0.4]$

$x \approx 1.3 f_{5} + 0.9 f_{18} + 0.3 f_{29}$

Represent as: $[a_{5} = 1.3, a_{18} = 0.9, a_{29} = 0.3]$

• Method “invents” edge detection.
• Automatically learns to represent an image in terms of the edges that appear in it. Gives a more succinct, higher-level representation than the raw pixels.
• Quantitatively similar to primary visual cortex (area V1) in brain.

Going deep

pixels

object parts (combination of edges)

object models

[Honglak Lee]

Training set: Aligned images of faces.

Why deep learning?

Task: video activity recognition [Le, Zhou & Ng, 2011]

| Method | Accuracy |
|---|---|
| Hessian + ESURF [Willems et al 2008] | 38% |
| Harris3D + HOG/HOF [Laptev et al 2003, 2004] | 45% |
| Cuboids + HOG/HOF [Dollar et al 2005, Laptev 2004] | 46% |
| Hessian + HOG/HOF [Laptev 2004, Willems et al 2008] | 46% |
| Dense + HOG/HOF [Laptev 2004] | 47% |
| Cuboids + HOG3D [Klaser 2008, Dollar et al 2005] | 46% |
| Unsupervised feature learning (our method) | 52% |

TIMIT Phone classification (accuracy): prior art (Clarkson et al., 1999) 79.6%; feature learning 80.3%

TIMIT Speaker identification (accuracy): prior art (Reynolds, 1995) 99.7%; feature learning 100.0%

Images

CIFAR Object classification (accuracy): prior art (Ciresan et al., 2011) 80.5%

NORB Object classification (accuracy): prior art (Scherer et al., 2010) 94.4%

Multimodal (audio/video)

AVLetters Lip reading (accuracy): prior art (Zhao et al., 2009) 58.9%; Stanford feature learning 65.8%

Galaxy

Hollywood2 Classification (accuracy): prior art (Laptev et al., 2004) 48%; feature learning 53%

KTH (accuracy): prior art (Wang et al., 2010) 92.1%

UCF (accuracy): prior art (Wang et al., 2010) 85.6%

YouTube (accuracy): prior art (Liu et al., 2009) 71.2%

Text/NLP

Paraphrase detection (accuracy): prior art (Das & Smith, 2009) 76.1%

Sentiment (MR/MPQA data) (accuracy): prior art (Nakagawa et al., 2010) 77.3%

Speech recognition on Android

Impact on speech recognition

Application to Google Streetview

ImageNet classification: 22,000 classes

…
smoothhound, smoothhound shark, Mustelus mustelus
American smooth dogfish, Mustelus canis
Florida smoothhound, Mustelus norrisi
whitetip shark, reef whitetip shark, Triaenodon obseus
Atlantic spiny dogfish, Squalus acanthias
Pacific spiny dogfish, Squalus suckleyi
hammerhead, hammerhead shark
smooth hammerhead, Sphyrna zygaena
smalleye hammerhead, Sphyrna tudes
shovelhead, bonnethead, bonnet shark, Sphyrna tiburo
angel shark, angelfish, Squatina squatina, monkfish
electric ray, crampfish, numbfish, torpedo
smalltooth sawfish, Pristis pectinatus
guitarfish
roughtail stingray, Dasyatis centroura
butterfly ray
eagle ray
spotted eagle ray, spotted ray, Aetobatus narinari
cownose ray, cow-nosed ray, Rhinoptera bonasus
manta, manta ray, devilfish
Atlantic manta, Manta birostris
devil ray, Mobula hypostoma
grey skate, gray skate, Raja batis
little skate, Raja erinacea
…

Stingray

Mantaray

ImageNet Classification: 14M images, 22k categories

Random guess: 0.005%

State-of-the-art (Weston, Bengio ‘11): 9.5%

Feature learning from raw pixels: 21.3%

Le, et al., Building high-level features using large-scale unsupervised learning. ICML 2012

Some common deep architectures
• Autoencoders
• Deep belief networks (DBNs)
• Convolutional variants
• Sparse coding

Andrew Ng

Logistic regression

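Logistic regression has a learned parameter vector $\theta$. On input $x$, it outputs:

$h_\theta(x) = \frac{1}{1 + \exp(-\theta^T x)}$

Draw a logistic regression unit as:

[Figure: a single logistic regression unit]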
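A single logistic regression unit can be sketched in a few lines. The parameter and input values below are made-up numbers for illustration, and the function names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def logistic_unit(theta, x):
    """Output of one logistic regression unit: sigmoid(theta^T x)."""
    return sigmoid(theta @ x)

theta = np.array([0.5, -1.0, 2.0])  # hypothetical learned parameters
x = np.array([1.0, 2.0, 0.5])       # hypothetical input

print(logistic_unit(theta, x))
```

The output always lies strictly between 0 and 1, and equals 0.5 when the weighted sum of the inputs is zero.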

Neural Network

String a lot of logistic units together. Example 3-layer network:

[Figure: network with Layer 1, Layer 2, Layer 3]

Neural Network

Example 4-layer network with 2 output units:

[Figure: network with Layer 1, Layer 2, Layer 3, Layer 4, and a +1 unit]
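The forward computation of such a network can be sketched as repeated application of logistic units, layer by layer. The layer widths and random weights below are assumptions chosen only to match "4 layers, 2 output units"; they are not taken from the slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    """Propagate x through successive layers of logistic units."""
    a = x
    for W, b in zip(weights, biases):
        a = sigmoid(W @ a + b)  # b holds the weights attached to the "+1" unit
    return a

rng = np.random.default_rng(0)
sizes = [3, 4, 4, 2]  # hypothetical widths: Layer 1 (input) through Layer 4 (2 outputs)
weights = [rng.standard_normal((n_out, n_in))
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n_out) for n_out in sizes[1:]]

y = forward(np.array([0.2, -0.4, 1.0]), weights, biases)
print(y.shape)  # (2,)
```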

Training a neural network

Given training set $(x_1, y_1), (x_2, y_2), (x_3, y_3), \dots$

Adjust parameters $\theta$ (for every node) to make:

$h_\theta(x_i) \approx y_i$

(Use gradient descent. “Backpropagation” algorithm. Susceptible to local optima.)
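A minimal sketch of this training procedure, assuming a squared-error loss, one hidden layer of sigmoid units, and the XOR problem as toy data. None of these choices come from the slides, and the hyperparameters are arbitrary.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, Y, n_hidden=4, lr=0.5, n_steps=2000, seed=0):
    """Batch gradient descent on squared error for a 3-layer sigmoid network."""
    rng = np.random.default_rng(seed)
    W1 = rng.standard_normal((X.shape[1], n_hidden)) * 0.5
    b1 = np.zeros(n_hidden)
    W2 = rng.standard_normal((n_hidden, Y.shape[1])) * 0.5
    b2 = np.zeros(Y.shape[1])
    losses = []
    for _ in range(n_steps):
        # Forward pass.
        H = sigmoid(X @ W1 + b1)
        P = sigmoid(H @ W2 + b2)
        losses.append(0.5 * np.mean(np.sum((P - Y) ** 2, axis=1)))
        # Backward pass (backpropagation): push the error back layer by layer.
        dP = (P - Y) * P * (1 - P) / len(X)
        dH = (dP @ W2.T) * H * (1 - H)
        # Gradient descent update.
        W2 -= lr * H.T @ dP
        b2 -= lr * dP.sum(axis=0)
        W1 -= lr * X.T @ dH
        b1 -= lr * dH.sum(axis=0)
    return (W1, b1, W2, b2), losses

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([[0], [1], [1], [0]], dtype=float)  # XOR targets
params, losses = train(X, Y)
print(losses[0], losses[-1])
```

Because the objective is non-convex, a different `seed` can lead to a different final loss, which is the local-optima issue noted above.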

Unsupervised feature learning with a neural network

[Figure: network with Layer 1, Layer 2, Layer 3]

Autoencoder.

Network is trained to output the input (learn the identity function).

Trivial solution unless:
- Constrain number of units in Layer 2 (learn compressed representation), or
- Constrain Layer 2 to be sparse.
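The first option, a Layer 2 with fewer units than the input, can be sketched as follows. The toy data (eight one-hot vectors), the 3-unit bottleneck, the squared-error loss, and the hyperparameters are all assumptions for illustration; the sparsity-constrained variant is not shown.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_autoencoder(X, n_hidden, lr=1.0, n_steps=3000, seed=0):
    """Train Layer 1 -> Layer 2 -> Layer 3 so that the output reconstructs X."""
    rng = np.random.default_rng(seed)
    n_in = X.shape[1]
    W1 = rng.standard_normal((n_in, n_hidden)) * 0.1
    b1 = np.zeros(n_hidden)
    W2 = rng.standard_normal((n_hidden, n_in)) * 0.1
    b2 = np.zeros(n_in)
    losses = []
    for _ in range(n_steps):
        A = sigmoid(X @ W1 + b1)       # Layer 2: compressed code
        X_hat = sigmoid(A @ W2 + b2)   # Layer 3: reconstruction of the input
        losses.append(0.5 * np.mean(np.sum((X_hat - X) ** 2, axis=1)))
        d_out = (X_hat - X) * X_hat * (1 - X_hat) / len(X)
        d_hid = (d_out @ W2.T) * A * (1 - A)
        W2 -= lr * A.T @ d_out
        b2 -= lr * d_out.sum(axis=0)
        W1 -= lr * X.T @ d_hid
        b1 -= lr * d_hid.sum(axis=0)

    def encode(Z):
        """Map inputs to their Layer 2 representation."""
        return sigmoid(Z @ W1 + b1)

    return encode, losses

X = np.eye(8)  # eight one-hot inputs, a toy data set
encode, losses = train_autoencoder(X, n_hidden=3)
codes = encode(X)  # 3 numbers per input instead of 8
print(codes.shape, losses[0], losses[-1])
```

The target of training is the input itself, so no labels are needed; the bottleneck is what prevents the network from simply copying its input.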

[Figure: Layer 1 and Layer 2 of the trained autoencoder]

New representation for input.

Train parameters so that $\hat{a} \approx a$, subject to the $b_i$’s being sparse.

Use $[c_1, c_2, c_3]$ as the representation to feed to the learning algorithm.
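The layer-by-layer procedure can be sketched as greedy stacking: train one autoencoder, keep its encoder, and train the next autoencoder on the encoder's output. This sketch omits the sparsity penalty for brevity, and the data, layer widths, and hyperparameters are assumptions rather than values from the slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_encoder(X, n_hidden, lr=1.0, n_steps=500, seed=0):
    """Train one autoencoder on X and return its encoder parameters (W, bias)."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((X.shape[1], n_hidden)) * 0.1
    bias = np.zeros(n_hidden)
    V = rng.standard_normal((n_hidden, X.shape[1])) * 0.1  # decoder weights
    c = np.zeros(X.shape[1])                               # decoder bias
    for _ in range(n_steps):
        H = sigmoid(X @ W + bias)
        R = sigmoid(H @ V + c)  # reconstruction of X
        dR = (R - X) * R * (1 - R) / len(X)
        dH = (dR @ V.T) * H * (1 - H)
        V -= lr * H.T @ dR
        c -= lr * dR.sum(axis=0)
        W -= lr * X.T @ dH
        bias -= lr * dH.sum(axis=0)
    return W, bias

rng = np.random.default_rng(1)
X = rng.random((50, 6))  # hypothetical inputs in [0, 1]^6

# Train one layer at a time; each layer's output is the next layer's input.
features = X
for n_hidden in (5, 4, 3):  # hypothetical widths for the a, b, and c layers
    W, bias = fit_encoder(features, n_hidden)
    features = sigmoid(features @ W + bias)

c_repr = features  # top-layer representation, one row per input
print(c_repr.shape)  # (50, 3)
```

The rows of `c_repr` are what would be passed to a supervised learning algorithm in place of the raw inputs.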

Deep Belief Net

Deep Belief Net (DBN) is another algorithm for learning a feature hierarchy.

Building block: 2-layer graphical model (Restricted Boltzmann Machine).

Can then learn additional layers one at a time.

Restricted Boltzmann machine (RBM)

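Input $[x_1, x_2, x_3, x_4]$

Layer 2. $[a_1, a_2, a_3]$ (binary-valued)

MRF with joint distribution:

$P(x, a) \propto \exp\left( \sum_{i,j} x_i W_{ij} a_j \right)$

where $W_{ij}$ is the weight between $x_i$ and $a_j$.

Use Gibbs sampling for inference.

Given observed inputs $x$, want maximum likelihood estimation:

$\max_W P(x), \quad P(x) = \sum_a P(x, a)$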

Restricted Boltzmann machine (RBM)

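Layer 2. $[a_1, a_2, a_3]$ (binary-valued)

Gradient ascent on $\log P(x)$:

$W_{ij} := W_{ij} + \alpha \left( [x_i a_j]_{\text{obs}} - [x_i a_j]_{\text{prior}} \right)$

where $W_{ij}$ is the weight between $x_i$ and $a_j$ and $\alpha$ is the learning rate.

$[x_i a_j]_{\text{obs}}$ from fixing $x$ to observed value, and sampling $a$ from $P(a|x)$.

$[x_i a_j]_{\text{prior}}$ from running Gibbs sampling to convergence.

Adding sparsity constraint on $a_i$’s usually improves results.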
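Running Gibbs sampling to convergence at every update is expensive, so a common shortcut is contrastive divergence with a single Gibbs step (CD-1). The sketch below uses that approximation instead of the exact procedure described above, assumes a bias-free binary RBM with weights `W[i, j]` between input $x_i$ and hidden unit $a_j$, and runs on random placeholder data.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_update(W, X, lr, rng):
    """One CD-1 step for a bias-free binary RBM with energy -sum_ij x_i W_ij a_j."""
    # [x_i a_j]_obs: clamp x to the data and sample a from P(a | x).
    p_a = sigmoid(X @ W)
    a = (rng.random(p_a.shape) < p_a).astype(float)
    obs = X.T @ a / len(X)
    # Approximate [x_i a_j]_prior with one Gibbs step instead of running to convergence.
    p_x = sigmoid(a @ W.T)
    x_model = (rng.random(p_x.shape) < p_x).astype(float)
    p_a_model = sigmoid(x_model @ W)  # use probabilities here to reduce sampling noise
    prior = x_model.T @ p_a_model / len(X)
    return W + lr * (obs - prior)

rng = np.random.default_rng(0)
X = (rng.random((100, 4)) < 0.5).astype(float)  # hypothetical binary inputs [x1..x4]
W = 0.01 * rng.standard_normal((4, 3))          # 4 input units, 3 hidden units
for _ in range(100):
    W = cd1_update(W, X, lr=0.1, rng=rng)
print(W.shape)  # (4, 3)
```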

Deep Belief Network

Layer 2. [a1, a2, a3]

Layer 3. [b1, b2, b3]

Similar to a sparse autoencoder in many ways. Stack RBMs on top of each other to get DBN.

Train with approximate maximum likelihood (often with a sparsity constraint on the $a_i$’s).

Deep Belief Network

Layer 2. [a1, a2, a3]

Layer 3. [b1, b2, b3]

Layer 4. [c1, c2, c3]

Convolutional DBN for audio

Spectrogram

Detection units

Max pooling unit

Convolutional DBN for Images

Detection layer H

Max-pooling layer P

Hidden nodes (binary)

“Filter” weights (shared)

“max-pooling” node (binary)

Input data V
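The two-stage structure (shared filter weights producing a detection layer H, then pooling into P) can be sketched deterministically. This is a simplified, real-valued analogue: the convolutional DBN uses binary stochastic units, which this sketch does not model. The input, filter, and block size are made-up values.

```python
import numpy as np

def detect(V, filt):
    """Valid cross-correlation of input V with one shared filter (detection layer H)."""
    fh, fw = filt.shape
    out_h, out_w = V.shape[0] - fh + 1, V.shape[1] - fw + 1
    H = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # The same filter weights are reused at every position.
            H[i, j] = np.sum(V[i:i + fh, j:j + fw] * filt)
    return H

def max_pool(H, block=2):
    """Max over non-overlapping block x block regions (pooling layer P)."""
    h = H.shape[0] // block * block
    w = H.shape[1] // block * block
    blocks = H[:h, :w].reshape(h // block, block, w // block, block)
    return blocks.max(axis=(1, 3))

V = np.arange(36, dtype=float).reshape(6, 6)  # toy 6x6 input
filt = np.array([[1.0, -1.0], [1.0, -1.0]])   # hypothetical 2x2 filter
H = detect(V, filt)
P = max_pool(H)
print(H.shape, P.shape)  # (5, 5) (2, 2)
```

Pooling shrinks the representation and makes it less sensitive to small shifts of the input.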

Tutorial

image classifier demo
