
Scene Understanding Without Labels

S. M. Ali Eslami

February 2017

Learning paradigms

[Figure: three learning paradigms]
Supervised Learning: image x → label y (e.g. "horse")
Reinforcement Learning: image x → action a (e.g. "left")
Generative Modelling: latent z → Decoder → image x, where z is a learned representation (e.g. (2.3, -1, 0.5, 3), "not blinking")

Highly structured

General Purpose Graphics Programming

Vikash Mansinghka, Tejas D. Kulkarni, Yura N. Perov, and Joshua B. Tenenbaum (2013)

Partially structured

A Stochastic Grammar of Images

Song-Chun Zhu and David Mumford (2007)

Partially structured

S. M. Ali Eslami and Christopher K. I. Williams (2012)

Fully unstructured

Geoffrey Hinton (2006), Antti Rasmus et al. (2016), Jeff Donahue et al. (2016)

Recurrent Neural Networks for Image Generation

Karol Gregor, Ivo Danihelka, Alex Graves, Danilo Jimenez Rezende, Daan Wierstra (2015)

Variational Inference

Approximate p(z|x) using q(z|x)

Parameterise q(z|x) by deep network

Minimise KL[ q(z|x) || p(z|x) ] via SGD
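Minimising KL[ q(z|x) || p(z|x) ] directly is intractable because p(z|x) is unknown; it is equivalent (up to the constant log p(x)) to maximising the evidence lower bound (ELBO), which SGD can optimise. A minimal PyTorch sketch of this recipe, with illustrative layer sizes and a Bernoulli likelihood that are assumptions rather than details from the talk:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    """Amortised variational inference: q(z|x) is parameterised by an encoder network."""
    def __init__(self, x_dim=784, z_dim=20, h_dim=200):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.mu, self.logvar = nn.Linear(h_dim, z_dim), nn.Linear(h_dim, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(), nn.Linear(h_dim, x_dim))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterisation trick
        logits = self.dec(z)
        # Negative ELBO = reconstruction term + KL[ q(z|x) || p(z) ], with prior p(z) = N(0, I)
        recon = F.binary_cross_entropy_with_logits(logits, x, reduction='sum')
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
        return (recon + kl) / x.shape[0]

# Minimise the negative ELBO with SGD/Adam:
#   loss = model(batch); loss.backward(); optimiser.step()
```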

Model

[Figure: latent variable model — latents z generate the image x; inference maps x back to z]

Recurrent Neural Networks for Image Generation

[Figure: Encoding / Inference maps the image x to latents z; Decoding / Generation maps z onto a canvas c that parameterises p(x|c). Hinton (2006)]

[Figure: the recurrent version — at each step the network Reads from the image, updates the latents z, and Writes to the canvas c_t, c_t+1, ...; the final canvas defines p(x|c_T). Gregor et al. (2015), Hinton (2006)]
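A toy NumPy sketch of the recurrent generation idea above: per-step latents are written additively onto a canvas, and the final canvas parameterises p(x|c_T). The plain linear write and the step count are illustrative assumptions; in DRAW the read and write operations are learned, attentive networks conditioned on recurrent state.

```python
import numpy as np

def generate(write_weights, z_dim=10, x_dim=784, steps=8, rng=np.random.default_rng(0)):
    """DRAW-style generation: iteratively refine a canvas with per-step latents.
    write_weights: list of (z_dim, x_dim) matrices standing in for the learned write network."""
    canvas = np.zeros(x_dim)
    for t in range(steps):
        z_t = rng.standard_normal(z_dim)          # z_t ~ p(z)
        canvas = canvas + z_t @ write_weights[t]  # c_t = c_{t-1} + write(z_t)
    return 1.0 / (1.0 + np.exp(-canvas))          # p(x | c_T): Bernoulli means via a sigmoid

# Example usage with random (untrained) write weights:
weights = [np.random.default_rng(t).standard_normal((10, 784)) * 0.1 for t in range(8)]
x_mean = generate(weights)
```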

Recurrent Neural Networks for Image Generation

Towards Conceptual Compression
Karol Gregor, Frederic Besse, Danilo Rezende, Daan Wierstra (2016)

[Figure: recurrent Read / Write steps refine the canvas c_t, c_t+1, ...; the final canvas defines p(x|c_T)]

[Figure: lossy reconstructions at increasing code lengths — 76 bits, 112 bits, 221 bits, 380 bits, 2364 bits; original raw image: 24576 bits]
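The bit counts above reflect transmitting only a prefix of the latent steps: early latents carry the global, conceptual content and later ones the detail, with each transmitted step costing roughly its KL in bits. A hedged sketch of that decoding scheme in the same toy setup as the previous block (variable names and the untrained weights are assumptions, not the paper's model):

```python
import numpy as np

def reconstruct_with_budget(posterior_z, write_weights, k, x_dim=784, rng=np.random.default_rng(1)):
    """Lossy reconstruction in the conceptual-compression style: transmit the first k
    latent steps (inferred from the image), sample the remaining steps from the prior."""
    canvas = np.zeros(x_dim)
    for t, w in enumerate(write_weights):
        z_t = posterior_z[t] if t < k else rng.standard_normal(posterior_z[t].shape)
        canvas = canvas + z_t @ w
    return 1.0 / (1.0 + np.exp(-canvas))

# Larger k costs more bits (roughly the summed KL of the transmitted steps) and yields a
# more faithful reconstruction — cf. the 76 / 112 / 221 / 380 / 2364 bit examples above.
```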

Attend, Infer, Repeat: Fast Scene Understanding with Generative Models

S. M. Ali Eslami, Nicolas Heess, Theophane Weber, Yuval Tassa, Koray Kavukcuoglu, Geoffrey Hinton (2016)

Attend, Infer, Repeat

[Figure: Model — a cause z (e.g. "blue brick") generates the image x]

[Figure: a single latent z must explain the whole "pile of bricks" — not sufficient for grasping, counting, transfer, generalisation]

[Figure: structured model — one latent per object, z1 ("blue brick, above") and z2 ("red brick, below"), each split into z_what and z_where; z_where places an attention window y_att in the image x]

[Figure: Inference Network — a recurrent network with hidden states h1, h2, h3 infers latents z1, z2, z3 from the image x, and a Decoder maps the latents back to a reconstruction y]

[Figure: each step's latent is split into z_pres, z_what and z_where; z_where crops an attention window x_att from the image, a VAE encodes it into z_what and decodes an object appearance y_att, and z_pres decides whether to attend to another object]
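A minimal NumPy sketch of the variable-length, attentive inference loop in the figure above: one step per object, each emitting z_pres (continue or stop), z_where (an attention window) and z_what (the attended object's code). The recurrent update, the cropping, and the sampling distributions are stubbed with random draws — assumptions for illustration, not the trained AIR networks:

```python
import numpy as np

def air_inference(x, max_objects=3, rng=np.random.default_rng(0)):
    """Variable-length, attentive inference: one (z_pres, z_where, z_what) triple per object."""
    h = np.zeros(64)                                      # recurrent hidden state (stub)
    objects = []
    for i in range(max_objects):
        h = np.tanh(h + rng.standard_normal(64) * 0.1)    # stand-in for an RNN update on (x, h)
        z_pres = rng.random() < 0.7                       # Bernoulli: is there another object?
        if not z_pres:
            break                                         # variable-length: stop attending
        z_where = rng.standard_normal(3)                  # scale and 2-D position of the attention window
        x_att = x                                         # crop of x at z_where (cropping omitted in this stub)
        z_what = rng.standard_normal(20)                  # VAE encoding of the attended crop x_att
        objects.append({"z_pres": 1, "z_where": z_where, "z_what": z_what})
    return objects

print(len(air_inference(np.zeros((50, 50)))), "objects inferred")
```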

Recap: Key ideas

1. Build in structure, get out meaning
2. Inference networks that are (a) recurrent, (b) variable-length, (c) attentive
3. End-to-end learning through (a) discrete and continuous variables, (b) inference and model networks

Omniglot

[Figure: AIR applied to Omniglot characters]

Representational power

[Figure: downstream questions answered from the inferred latents — Sum? Increasing order? — with example answers 6, 9, no, yes]

Additional structure

[Figure: learned latent z — a distributed vector that correlates with "blue brick"]

[Figure: specified latent z — class=brick, colour=blue, position=P, rotation=R]

[Figure: the same recurrent inference network, but with the Decoder specified (a fixed renderer) rather than learned]
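To make the learned/specified contrast concrete, here is a hedged sketch of what a specified latent and a fixed (non-learned) decoder interface could look like; the field names follow the slide, while SpecifiedLatent and render are hypothetical names and the renderer body is a placeholder:

```python
from dataclasses import dataclass

@dataclass
class SpecifiedLatent:
    """Interpretable latent with the structure named on the slide."""
    obj_class: str        # class = brick
    colour: str           # colour = blue
    position: tuple       # position = P
    rotation: float       # rotation = R

def render(z: SpecifiedLatent):
    """Specified decoder: a fixed graphics renderer replaces the learned decoder,
    so inference over z becomes inverse graphics. (Placeholder implementation.)"""
    raise NotImplementedError("plug in a renderer of your choice")

z = SpecifiedLatent(obj_class="brick", colour="blue", position=(2.3, -1.0, 0.5), rotation=3.0)
```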

Inverse graphics

Policy learning

Table-top MNIST

Unsupervised Learning of 3D Structure from Images

Danilo Rezende, S. M. Ali Eslami, Shakir Mohamed, Peter Battaglia, Max Jaderberg, Nicolas Heess (2016)

Motivation

How to recover 3D structure from 2D images? To form stable representations, regardless of camera position.

● Inherently ill-posed
○ All objects appear under self-occlusion; there are infinitely many possible explanations
○ Therefore build statistical models that know what's likely and what's not
● Even with models, inference is intractable
○ Important to capture multi-modal explanations
● How are 3D scenes best represented?
○ Meshes or voxels?
● Where is training data collected from?

Unsupervised Learning of 3D Structure from Images

Projection operators (sketch below)
Unconditional samples
Unconditional inference
Class-conditional samples
Multi-modality of inference
3D structure from multiple 2D images
Inferring object meshes
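The "Projection operators" item refers to operators that map the 3-D representation (e.g. a voxel occupancy grid) to a 2-D image, so that learning can proceed from images alone. A minimal NumPy sketch under simple assumptions (orthographic camera, occupancy combined along each viewing ray); the operators actually used in the paper may differ:

```python
import numpy as np

def project_voxels(voxels, axis=0):
    """Orthographic projection of an occupancy grid to an image: a pixel is bright
    if any voxel along its viewing ray is occupied (1 - prod(1 - occupancy))."""
    transmittance = np.prod(1.0 - np.clip(voxels, 0.0, 1.0), axis=axis)
    return 1.0 - transmittance

# Example: a 32^3 grid with a filled cube in one corner projects to a bright square.
grid = np.zeros((32, 32, 32))
grid[8:16, 8:16, 8:16] = 1.0
image = project_voxels(grid)          # shape (32, 32), values in [0, 1]
```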

Recap

● Deep Supervised Learning
● Deep Reinforcement Learning
● Model-based Methods
● Deep Variational Inference
● Structured / Unstructured Generative Models

Questions

[email protected]