
Scene Understanding Without Labels

S. M. Ali Eslami

February 2017

Learning paradigms

[Figure: three learning paradigms]
Supervised Learning: image x → label y (e.g. "horse")
Reinforcement Learning: image x → action a (e.g. "left")
Generative Modelling: latent z → Decoder → image x, where z is a learned representation (e.g. (2.3, -1, 0.5, 3), "not blinking")

Highly structured

General Purpose Graphics Programming

Vikash Mansinghka, Tejas D. Kulkarni, Yura N. Perov, and Joshua B. Tenenbaum (2013)

Partially structured

A Stochastic Grammar of Images

Song-Chun Zhu and David Mumford (2007)

Partially structured

S. M. Ali Eslami and Christopher K. I. Williams (2012)

Fully unstructured

Geoffrey Hinton (2006), Antti Rasmus et al. (2016), Jeff Donahue et al. (2016)

Recurrent Neural Networks for Image Generation

Karol Gregor, Ivo Danihelka, Alex Graves, Danilo Jimenez Rezende, Daan Wierstra (2015)

Variational Inference

Approximate p(z|x) using q(z|x)

Parameterise q(z|x) by deep network

Minimise KL[ q(z|x) || p(z|x) ] via SGD
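Minimising KL[ q(z|x) || p(z|x) ] directly is intractable because p(z|x) is unknown; it is equivalent (up to the constant log p(x)) to maximising the evidence lower bound (ELBO), which SGD can optimise. A minimal PyTorch sketch of this recipe, with illustrative layer sizes and a Bernoulli likelihood that are assumptions rather than details from the talk:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    """Amortised variational inference: q(z|x) is parameterised by an encoder network."""
    def __init__(self, x_dim=784, z_dim=20, h_dim=200):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.mu, self.logvar = nn.Linear(h_dim, z_dim), nn.Linear(h_dim, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(), nn.Linear(h_dim, x_dim))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterisation trick
        logits = self.dec(z)
        # Negative ELBO = reconstruction term + KL[ q(z|x) || p(z) ], with prior p(z) = N(0, I)
        recon = F.binary_cross_entropy_with_logits(logits, x, reduction='sum')
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
        return (recon + kl) / x.shape[0]

# Minimise the negative ELBO with SGD/Adam:
#   loss = model(batch); loss.backward(); optimiser.step()
```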

Model

[Figure: latent variable model — latents z generate the image x; inference maps x back to z]

Recurrent Neural Networks for Image Generation

[Figure: Encoding / Inference maps the image x to latents z; Decoding / Generation maps z onto a canvas c that parameterises p(x|c). Hinton (2006)]

[Figure: the recurrent version — at each step the network Reads from the image, updates the latents z, and Writes to the canvas c_t, c_t+1, ...; the final canvas defines p(x|c_T). Gregor et al. (2015), Hinton (2006)]
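A toy NumPy sketch of the recurrent generation idea above: per-step latents are written additively onto a canvas, and the final canvas parameterises p(x|c_T). The plain linear write and the step count are illustrative assumptions; in DRAW the read and write operations are learned, attentive networks conditioned on recurrent state.

```python
import numpy as np

def generate(write_weights, z_dim=10, x_dim=784, steps=8, rng=np.random.default_rng(0)):
    """DRAW-style generation: iteratively refine a canvas with per-step latents.
    write_weights: list of (z_dim, x_dim) matrices standing in for the learned write network."""
    canvas = np.zeros(x_dim)
    for t in range(steps):
        z_t = rng.standard_normal(z_dim)          # z_t ~ p(z)
        canvas = canvas + z_t @ write_weights[t]  # c_t = c_{t-1} + write(z_t)
    return 1.0 / (1.0 + np.exp(-canvas))          # p(x | c_T): Bernoulli means via a sigmoid

# Example usage with random (untrained) write weights:
weights = [np.random.default_rng(t).standard_normal((10, 784)) * 0.1 for t in range(8)]
x_mean = generate(weights)
```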

Recurrent Neural Networks for Image Generation

Towards Conceptual Compression
Karol Gregor, Frederic Besse, Danilo Rezende, Daan Wierstra (2016)

[Figure: recurrent Read / Write steps refine the canvas c_t, c_t+1, ...; the final canvas defines p(x|c_T)]

[Figure: lossy reconstructions at increasing code lengths — 76 bits, 112 bits, 221 bits, 380 bits, 2364 bits; original raw image: 24576 bits]
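The bit counts above reflect transmitting only a prefix of the latent steps: early latents carry the global, conceptual content and later ones the detail, with each transmitted step costing roughly its KL in bits. A hedged sketch of that decoding scheme in the same toy setup as the previous block (variable names and the untrained weights are assumptions, not the paper's model):

```python
import numpy as np

def reconstruct_with_budget(posterior_z, write_weights, k, x_dim=784, rng=np.random.default_rng(1)):
    """Lossy reconstruction in the conceptual-compression style: transmit the first k
    latent steps (inferred from the image), sample the remaining steps from the prior."""
    canvas = np.zeros(x_dim)
    for t, w in enumerate(write_weights):
        z_t = posterior_z[t] if t < k else rng.standard_normal(posterior_z[t].shape)
        canvas = canvas + z_t @ w
    return 1.0 / (1.0 + np.exp(-canvas))

# Larger k costs more bits (roughly the summed KL of the transmitted steps) and yields a
# more faithful reconstruction — cf. the 76 / 112 / 221 / 380 / 2364 bit examples above.
```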

Attend, Infer, Repeat: Fast Scene Understanding with Generative Models

S. M. Ali Eslami, Nicolas Heess, Theophane Weber, Yuval Tassa, Koray Kavukcuoglu, Geoffrey Hinton (2016)

Attend, Infer, Repeat

[Figure: Model — a cause z (e.g. "blue brick") generates the image x]

[Figure: a single latent z must explain the whole "pile of bricks" — not sufficient for grasping, counting, transfer, generalisation]

[Figure: structured model — one latent per object, z1 ("blue brick, above") and z2 ("red brick, below"), each split into z_what and z_where; z_where places an attention window y_att in the image x]

[Figure: Inference Network — a recurrent network with hidden states h1, h2, h3 infers latents z1, z2, z3 from the image x, and a Decoder maps the latents back to a reconstruction y]

[Figure: each step's latent is split into z_pres, z_what and z_where; z_where crops an attention window x_att from the image, a VAE encodes it into z_what and decodes an object appearance y_att, and z_pres decides whether to attend to another object]
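A minimal NumPy sketch of the variable-length, attentive inference loop in the figure above: one step per object, each emitting z_pres (continue or stop), z_where (an attention window) and z_what (the attended object's code). The recurrent update, the cropping, and the sampling distributions are stubbed with random draws — assumptions for illustration, not the trained AIR networks:

```python
import numpy as np

def air_inference(x, max_objects=3, rng=np.random.default_rng(0)):
    """Variable-length, attentive inference: one (z_pres, z_where, z_what) triple per object."""
    h = np.zeros(64)                                      # recurrent hidden state (stub)
    objects = []
    for i in range(max_objects):
        h = np.tanh(h + rng.standard_normal(64) * 0.1)    # stand-in for an RNN update on (x, h)
        z_pres = rng.random() < 0.7                       # Bernoulli: is there another object?
        if not z_pres:
            break                                         # variable-length: stop attending
        z_where = rng.standard_normal(3)                  # scale and 2-D position of the attention window
        x_att = x                                         # crop of x at z_where (cropping omitted in this stub)
        z_what = rng.standard_normal(20)                  # VAE encoding of the attended crop x_att
        objects.append({"z_pres": 1, "z_where": z_where, "z_what": z_what})
    return objects

print(len(air_inference(np.zeros((50, 50)))), "objects inferred")
```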

Recap: Key ideas

1. Build in structure, get out meaning
2. Inference networks that are (a) recurrent, (b) variable-length, (c) attentive
3. End-to-end learning through (a) discrete and continuous variables, (b) inference and model networks

Omniglot

[Figure: AIR applied to Omniglot characters]

Representational power

[Figure: downstream questions answered from the inferred latents — Sum? Increasing order? — with example answers 6, 9, no, yes]

Additional structure

[Figure: learned latent z — a distributed vector that correlates with "blue brick"]

[Figure: specified latent z — class=brick, colour=blue, position=P, rotation=R]

[Figure: the same recurrent inference network, but with the Decoder specified (a fixed renderer) rather than learned]
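To make the learned/specified contrast concrete, here is a hedged sketch of what a specified latent and a fixed (non-learned) decoder interface could look like; the field names follow the slide, while SpecifiedLatent and render are hypothetical names and the renderer body is a placeholder:

```python
from dataclasses import dataclass

@dataclass
class SpecifiedLatent:
    """Interpretable latent with the structure named on the slide."""
    obj_class: str        # class = brick
    colour: str           # colour = blue
    position: tuple       # position = P
    rotation: float       # rotation = R

def render(z: SpecifiedLatent):
    """Specified decoder: a fixed graphics renderer replaces the learned decoder,
    so inference over z becomes inverse graphics. (Placeholder implementation.)"""
    raise NotImplementedError("plug in a renderer of your choice")

z = SpecifiedLatent(obj_class="brick", colour="blue", position=(2.3, -1.0, 0.5), rotation=3.0)
```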

Inverse graphics

Policy learning

Table-top MNIST

Unsupervised Learning of 3D Structure from Images

Danilo Rezende, S. M. Ali Eslami, Shakir Mohamed, Peter Battaglia, Max Jaderberg, Nicolas Heess (2016)

Motivation

How to recover 3D structure from 2D images? To form stable representations, regardless of camera position.

● Inherently ill-posed
○ All objects appear under self-occlusion; there are infinitely many possible explanations
○ Therefore build statistical models that know what's likely and what's not
● Even with models, inference is intractable
○ Important to capture multi-modal explanations
● How are 3D scenes best represented?
○ Meshes or voxels?
● Where is training data collected from?

Unsupervised Learning of 3D Structure from Images

Projection operators (sketch below)
Unconditional samples
Unconditional inference
Class-conditional samples
Multi-modality of inference
3D structure from multiple 2D images
Inferring object meshes
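The "Projection operators" item refers to operators that map the 3-D representation (e.g. a voxel occupancy grid) to a 2-D image, so that learning can proceed from images alone. A minimal NumPy sketch under simple assumptions (orthographic camera, occupancy combined along each viewing ray); the operators actually used in the paper may differ:

```python
import numpy as np

def project_voxels(voxels, axis=0):
    """Orthographic projection of an occupancy grid to an image: a pixel is bright
    if any voxel along its viewing ray is occupied (1 - prod(1 - occupancy))."""
    transmittance = np.prod(1.0 - np.clip(voxels, 0.0, 1.0), axis=axis)
    return 1.0 - transmittance

# Example: a 32^3 grid with a filled cube in one corner projects to a bright square.
grid = np.zeros((32, 32, 32))
grid[8:16, 8:16, 8:16] = 1.0
image = project_voxels(grid)          # shape (32, 32), values in [0, 1]
```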

Recap

● Deep Supervised Learning
● Deep Reinforcement Learning
● Model-based Methods
● Deep Variational Inference
● Structured / Unstructured Generative Models

Questions

[email protected]