TRANSCRIPT
-
From language to vision and back again
Andrei Barbu
Andrei Barbu (MIT) language↔ vision August 28, 2015
Center for Brains, Minds & Machines
-
© Source Unknown. All rights reserved. This content is excluded from our Creative
Commons license. For more information, see https://ocw.mit.edu/help/faq-fair-use/.
-
Torralba et al 2003
-
-
-
hammering
hammer
-
Perception is unreliable.
Top-down knowledge affects our perception.
One integrated representation for many tasks.
-
Recognition: $S(\text{sentence}, \text{video})$
Retrieval: $\arg\max_{v \in V} S(s, v)$
Generation: $\arg\max_{s \in L} S(s, v)$
Question answering: $\arg\max_{s \in L} S'(s, s_q, v)$
Disambiguation: $\arg\max_{i \in \text{parser}(s)} S(i, v)$
Acquiring language: $\arg\max_{p} \prod_{s,v} S(s(p), v)$
Images, not videos: $S(\text{sentence}, \text{video})$
Translation: $\arg\min_{s' \in L_b} f(s, s')$
Planning: $\arg\max_{s \in L} \int_v S(s, v_0 : v : v_n)$
Theory of mind: $S(s, v)$
. . .
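The first three tasks on this slide differ only in which argument of the score is held fixed. A minimal Python sketch of that idea follows. The function names, the recognition threshold, and the toy score table are all assumptions made for illustration; the slides do not specify how S is implemented.

```python
from typing import Callable, Sequence

# Hypothetical signature: S(sentence, video) returns a compatibility score.
Score = Callable[[str, str], float]


def recognize(S: Score, sentence: str, video: str, threshold: float) -> bool:
    """Recognition: is the sentence true of the video? The threshold is an assumption."""
    return S(sentence, video) >= threshold


def retrieve(S: Score, sentence: str, videos: Sequence[str]) -> str:
    """Retrieval: argmax over videos v in V of S(s, v)."""
    return max(videos, key=lambda v: S(sentence, v))


def generate(S: Score, sentences: Sequence[str], video: str) -> str:
    """Generation: argmax over sentences s in L of S(s, v)."""
    return max(sentences, key=lambda s: S(s, video))


# Toy score table standing in for a real vision system (made-up numbers).
TOY_SCORES = {
    ("The person picked up an object.", "clip_a"): 0.9,
    ("The person picked up an object.", "clip_b"): 0.2,
    ("The person put down an object.", "clip_a"): 0.1,
    ("The person put down an object.", "clip_b"): 0.8,
}


def toy_S(sentence: str, video: str) -> float:
    """Look up the made-up score; unknown pairs score 0."""
    return TOY_SCORES.get((sentence, video), 0.0)
```

With the toy table, `retrieve(toy_S, "The person put down an object.", ["clip_a", "clip_b"])` picks `clip_b`, and `generate` over the two sentences picks the pick-up sentence for `clip_a`.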
-
Humans perform language–vision tasks all the time
Give me the cup.
Which chair should I sit in?
This is an apple.
To win this game you have to make a straight line out of your pieces.
-
Recognition: $S(\text{sentence}, \text{video})$
Retrieval: $\arg\max_{v \in V} S(s, v)$
Generation: $\arg\max_{s \in L} S(s, v)$
Question answering: $\arg\max_{s \in L} S'(s, s_q, v)$
Disambiguation: $\arg\max_{i \in \text{parser}(s)} S(i, v)$
Acquiring language: $\arg\max_{p} \prod_{s,v} S(s(p), v)$
Images, not videos: $S(\text{sentence}, \text{video})$
Translation: $\arg\min_{s' \in L_b} f(s, s')$
Planning: $\arg\max_{s \in L} \int_v S(s, v_0 : v : v_n)$
Theory of mind: $S(s, v)$
. . .
-
The person rode the skateboard leftward.
object detector, tracker, event recognizer
-
Tracks Events Sentences
-
False negatives
Object detection
Figure removed due to copyright restrictions. Please see the video.
Source: Barbu, Andrei, Aaron Michaux, Siddharth Narayanaswamy, and Jeffrey Mark Siskind. "Simultaneous object detection, tracking, and event recognition." Advances in Cognitive Systems: 203-220 (2012).
Felzenszwalb et al 2008
-
Object detection
Figure removed due to copyright restrictions. Please see the video.
Felzenszwalb et al 2008
-
Object detectors work poorly
Figure removed due to copyright restrictions. Please see the video.
Source: Russakovsky, Olga, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang et al. "Imagenet large scale visual recognition challenge." International Journal of Computer Vision 115, no. 3 (2015): 211-252.
Russakovsky et al 2015
-
Fixing bad detectors with higher-level knowledge
detection / object / frame
temporally coherent track
object detector confidence (f)
motion coherence (g)
optimal path through the lattice of detections
dynamic programming: Bellman (1957), Viterbi (1967)
$$\max_{j^1, \dots, j^T} \sum_{t=1}^{T} f(b^t_{j^t}) + \sum_{t=2}^{T} g(b^{t-1}_{j^{t-1}}, b^t_{j^t})$$
Barbu et al 2012
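The optimal path through the lattice of detections can be found with the Viterbi recurrence. The sketch below is one minimal way to write it in Python, assuming each frame supplies a list of candidate boxes and that f (detector confidence) and g (motion coherence) are passed in as plain functions. The box encoding `(x_position, detector_score)` and the toy numbers are made up for illustration and are not the authors' implementation.

```python
def viterbi_track(detections, f, g):
    """Return (score, path) maximizing sum_t f(b^t_j) + sum_{t>=2} g(b^{t-1}_j, b^t_j).

    detections[t] is the list of candidate boxes in frame t;
    path[t] is the index of the chosen box in frame t.
    """
    best = [f(b) for b in detections[0]]  # best score of a path ending at each box
    back = []                             # back-pointers, one list per frame t >= 1
    for t in range(1, len(detections)):
        new_best, ptr = [], []
        for b in detections[t]:
            cands = [best[i] + g(prev, b) for i, prev in enumerate(detections[t - 1])]
            i_star = max(range(len(cands)), key=cands.__getitem__)
            new_best.append(cands[i_star] + f(b))
            ptr.append(i_star)
        best = new_best
        back.append(ptr)
    j = max(range(len(best)), key=best.__getitem__)
    score, path = best[j], [j]
    for ptr in reversed(back):            # follow back-pointers to recover the track
        j = ptr[j]
        path.append(j)
    path.reverse()
    return score, path


# Toy lattice: boxes are (x_position, detector_score); numbers are made up.
detections = [
    [(0, 9), (10, 1)],
    [(1, 2), (10, 8)],  # the moving object at x=1 gets a weak detector score here
    [(2, 9), (10, 1)],
]
score, path = viterbi_track(
    detections,
    f=lambda b: b[1],                    # detector confidence
    g=lambda a, b: -abs(a[0] - b[0]),    # motion coherence: penalize jumps
)
print(score, path)  # prints 18 [0, 0, 0]
```

Picking the highest-scoring box per frame independently would jump to the distractor at x=10 in the middle frame. The motion-coherence term keeps the track on the weakly detected box instead.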
-
Fixing bad detectors with higher-level knowledge
detection / object / frame
temporally coherent track
object detector confidence (f)
motion coherence (g)
optimal path through the lattice of detections
dynamic programming: Bellman (1957), Viterbi (1967)
$$\max_{j^1, \dots, j^T} \sum_{t=1}^{T} f(b^t_{j^t}) + \sum_{t=2}^{T} g(b^{t-1}_{j^{t-1}}, b^t_{j^t})$$
Courtesy of Andrei Barbu, Alexander Bridge, Dan Coroian, Sven Dickinson, Sam Mussman, Siddharth Narayanaswamy, Dhaval Salvi, Lara Schmidt, Jiangnan Shangguan, Jeffrey Mark Siskind, Jarrell Waggoner, Song Wang, Jinlian Wei, Yifan Yin & Zhiqi Zhang. Used with permission.
Source: Barbu, Andrei, Alexander Bridge, Dan Coroian, Sven Dickinson, Sam Mussman, Siddharth Narayanaswamy, Dhaval Salvi et al. "Large-scale automatic labeling of video events with verbs based on event-participant interaction." arXiv preprint arXiv:1204.3616 (2012).
Barbu et al 2012
-
Feature vector — single participant
Person
position, $\frac{d}{dt}$ position, $\frac{d^2}{dt^2}$ position
aspect ratio, $\frac{d}{dt}$ aspect ratio
area, $\frac{d}{dt}$ area
object class
root-filter index
-
Feature vector — dual participant
distance, $\frac{d}{dt}$ distance
orientation, $\frac{d}{dt}$ orientation
person riding skateboard
person approaching person
skateboard approaching person
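The single- and dual-participant features can be computed from tracked boxes with finite differences. The following Python sketch assumes boxes are `(x, y, width, height)` tuples, one per frame, and approximates the time derivative by the difference between consecutive frames. The dictionary keys and the box format are assumptions for illustration. Object class and root-filter index would come from the detector and are omitted, as is the second derivative of position.

```python
import math


def center(box):
    """Center point of an (x, y, width, height) box."""
    x, y, w, h = box
    return (x + w / 2.0, y + h / 2.0)


def derivative(values):
    """Per-frame backward difference; the first frame has no predecessor and gets 0.0."""
    return [0.0] + [b - a for a, b in zip(values, values[1:])]


def single_participant(track):
    """Position, aspect ratio, area, and their first differences for one track."""
    xs = [center(b)[0] for b in track]
    ys = [center(b)[1] for b in track]
    aspect = [b[2] / b[3] for b in track]
    area = [float(b[2] * b[3]) for b in track]
    return {
        "x": xs, "dx": derivative(xs),
        "y": ys, "dy": derivative(ys),
        "aspect": aspect, "d_aspect": derivative(aspect),
        "area": area, "d_area": derivative(area),
    }


def dual_participant(track_a, track_b):
    """Distance and orientation between two tracks, plus their first differences.

    Orientation differences ignore angle wrap-around in this sketch.
    """
    dist, orient = [], []
    for a, b in zip(track_a, track_b):
        (ax, ay), (bx, by) = center(a), center(b)
        dist.append(math.hypot(bx - ax, by - ay))
        orient.append(math.atan2(by - ay, bx - ax))
    return {
        "distance": dist, "d_distance": derivative(dist),
        "orientation": orient, "d_orientation": derivative(orient),
    }
```

For a person approaching a stationary skateboard, `d_distance` is negative in every frame after the first, which is the kind of signal an "approaching" model can pick up.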
-
Event recognition
tracks
feature vectors
HMMs (a, h)
One HMM per event class
Try each HMM
Baum and Petrie (1966)
-
Event recognition
$$\max_{k^1, \dots, k^T} \sum_{t=1}^{T} h(k^t, b^t_{\hat{\jmath}^t}) + \sum_{t=2}^{T} a(k^{t-1}, k^t)$$
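Scoring a fixed track against one HMM per event class, and keeping the best class, can be sketched as below. This is an illustration only: the two toy event models, their states, and their log-scores are made up. The feature is assumed to be the per-frame change in distance between two participants, and any state is allowed as the initial state.

```python
NEG_INF = float("-inf")


def hmm_viterbi_score(features, n_states, h, a):
    """max over k^1..k^T of sum_t h(k^t, x^t) + sum_{t>=2} a(k^{t-1}, k^t)."""
    best = [h(k, features[0]) for k in range(n_states)]
    for x in features[1:]:
        best = [
            max(best[i] + a(i, k) for i in range(n_states)) + h(k, x)
            for k in range(n_states)
        ]
    return max(best)


def recognize_event(features, models):
    """Try each HMM and return the name of the best-scoring event class."""
    return max(models, key=lambda name: hmm_viterbi_score(features, *models[name]))


def approach_h(k, x):
    # State 0: getting closer (distance shrinking); state 1: stopped.
    if k == 0:
        return 0.0 if x < 0 else -5.0
    return 0.0 if x == 0 else -5.0


def approach_a(i, k):
    # Once stopped, the toy model may not go back to moving.
    return NEG_INF if (i, k) == (1, 0) else 0.0


def leave_h(k, x):
    # Single state: distance growing.
    return 0.0 if x > 0 else -5.0


def leave_a(i, k):
    return 0.0


# Each model is (number of states, output score h, transition score a).
MODELS = {
    "approach": (2, approach_h, approach_a),
    "leave": (1, leave_h, leave_a),
}
```

For example, `recognize_event([-2, -1, 0, 0], MODELS)` selects `"approach"`, since that model explains every frame without penalty while `"leave"` is penalized in each frame.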
-
Barbu et al 2012
-
Tracks Events Sentences
-
Tracking in the context of event recognition
$$\max_{j^1, \dots, j^T} \left[ \sum_{t=1}^{T} f(b^t_{j^t}) + \sum_{t=2}^{T} g(b^{t-1}_{j^{t-1}}, b^t_{j^t}) \right] + \max_{k^1, \dots, k^T} \left[ \sum_{t=1}^{T} h(k^t, b^t_{\hat{\jmath}^t}) + \sum_{t=2}^{T} a(k^{t-1}, k^t) \right]$$
-
Tracking in the context of event recognition
$$\max_{j^1, \dots, j^T} \max_{k^1, \dots, k^T} \sum_{t=1}^{T} f(b^t_{j^t}) + \sum_{t=2}^{T} g(b^{t-1}_{j^{t-1}}, b^t_{j^t}) + \sum_{t=1}^{T} h(k^t, b^t_{j^t}) + \sum_{t=2}^{T} a(k^{t-1}, k^t)$$
-
Tracking in the context of event recognition
$$\max_{j^1, \dots, j^T} \max_{k^1, \dots, k^T} \sum_{t=1}^{T} f(b^t_{j^t}) + \sum_{t=2}^{T} g(b^{t-1}_{j^{t-1}}, b^t_{j^t}) + \sum_{t=1}^{T} h(k^t, b^t_{j^t}) + \sum_{t=2}^{T} a(k^{t-1}, k^t)$$
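Maximizing the tracker terms (f, g) and the event terms (h, a) jointly amounts to running Viterbi on the product of the detection lattice and the HMM state lattice. The Python sketch below illustrates this under simplifying assumptions: boxes are `(x_position, detector_score)` pairs, any HMM state may start the sequence, and the toy event model and all numbers are invented to show the effect. It is not the authors' code.

```python
NEG_INF = float("-inf")


def joint_viterbi(detections, n_states, f, g, h, a):
    """Maximize sum f + sum g + sum h + sum a jointly over boxes j^t and states k^t.

    Returns (score, box index per frame, HMM state per frame).
    """
    # Score of each (box index, state) pair in the first frame.
    best = {
        (j, k): f(b) + h(k, b)
        for j, b in enumerate(detections[0])
        for k in range(n_states)
    }
    back = []
    for t in range(1, len(detections)):
        prev_boxes = detections[t - 1]
        new_best, ptr = {}, {}
        for j, b in enumerate(detections[t]):
            for k in range(n_states):
                def via(node, b=b, k=k):
                    # Score of reaching (b, k) through the previous node.
                    return best[node] + g(prev_boxes[node[0]], b) + a(node[1], k)
                prev = max(best, key=via)
                new_best[(j, k)] = via(prev) + f(b) + h(k, b)
                ptr[(j, k)] = prev
        best = new_best
        back.append(ptr)
    node = max(best, key=best.get)
    score, path = best[node], [node]
    for ptr in reversed(back):
        node = ptr[node]
        path.append(node)
    path.reverse()
    return score, [j for j, _ in path], [k for _, k in path]


def toy_h(k, box):
    # State 0: object still left of x=6; state 1: object has reached x>=6.
    x = box[0]
    if k == 0:
        return 0 if x < 6 else -10
    return 12 if x >= 6 else -10


def toy_a(i, k):
    # The toy event may not return from state 1 to state 0.
    return NEG_INF if (i, k) == (1, 0) else 0


# Box 0 is a confident stationary object; box 1 is a weaker, moving one.
detections = [[(0, 5), (2, 4)], [(0, 5), (4, 4)], [(0, 5), (6, 4)]]
f = lambda b: b[1]
g = lambda p, b: -(p[0] - b[0]) ** 2

tracker_only = joint_viterbi(detections, 1, f, g, lambda k, b: 0, lambda i, k: 0)
joint = joint_viterbi(detections, 2, f, g, toy_h, toy_a)
```

Here `tracker_only` follows the confident stationary object (box index 0 in every frame), while `joint` switches to the moving object (box index 1) because only that track lets the event model reach its second state.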
-
Tracking in the context of event recognition in action
tracking
tracking and event recognition
-
Tracking in the context of event recognition in action
tracking
tracking and event recognition
-
Tracks Events Sentences
-
Building sentences out of trackers and words
Viterbi tracker
[Figure: one detection lattice per track (track 1 ... track L), with candidate detections $b^t_j$, $j = 1, \dots, J_t$, in frames $t = 1, \dots, T$, scored by f and g; crossed with one HMM state lattice per word (word 1 ... word W), scored by h and a.]
$$\max_{\substack{j_1^1, \dots, j_1^T \\ \vdots \\ j_L^1, \dots, j_L^T}} \; \max_{\substack{k_1^1, \dots, k_1^T \\ \vdots \\ k_W^1, \dots, k_W^T}} \; \sum_{l=1}^{L} \left[ \sum_{t=1}^{T} f(b^t_{j_l^t}) + \sum_{t=2}^{T} g(b^{t-1}_{j_l^{t-1}}, b^t_{j_l^t}) \right] + \sum_{w=1}^{W} \left[ \sum_{t=1}^{T} h_w(k_w^t, b^t_{j^t_{\theta_w^1}}, b^t_{j^t_{\theta_w^2}}) + \sum_{t=2}^{T} a_w(k_w^{t-1}, k_w^t) \right]$$
Siddharth et al 2014
-
Building sentences out of trackers and words
Event tracker for intransitive verbs
[Figure: the detection lattice of one track, with candidate detections $b^t_j$, $j = 1, \dots, J_t$, in frames $t = 1, \dots, T$, scored by f and g; crossed with the HMM state lattice of one word, scored by h and a. The remaining tracks (up to track L) and words (up to word W) are shown in the background.]
$$\max_{j^1, \dots, j^T} \max_{k^1, \dots, k^T} \sum_{t=1}^{T} f(b^t_{j^t}) + \sum_{t=2}^{T} g(b^{t-1}_{j^{t-1}}, b^t_{j^t}) + \sum_{t=1}^{T} h(k^t, b^t_{j^t}) + \sum_{t=2}^{T} a(k^{t-1}, k^t)$$
Siddharth et al 2014
-
Building sentences out of trackers and words
Event tracker
[Figure: detection lattices for track 1 ... track L crossed with the HMM state lattice of one word, scored by h and a. The remaining words (up to word W) are shown in the background.]
$$\max_{\substack{j_1^1, \dots, j_1^T \\ \vdots \\ j_L^1, \dots, j_L^T}} \; \max_{k^1, \dots, k^T} \; \sum_{l=1}^{L} \left[ \sum_{t=1}^{T} f(b^t_{j_l^t}) + \sum_{t=2}^{T} g(b^{t-1}_{j_l^{t-1}}, b^t_{j_l^t}) \right] + \sum_{t=1}^{T} h(k^t, b^t_{j^t_{\theta^1}}, b^t_{j^t_{\theta^2}}) + \sum_{t=2}^{T} a(k^{t-1}, k^t)$$
Siddharth et al 2014
-
Building sentences out of trackers and words
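Sentence tracker
[Figure: detection lattices for track 1 ... track L crossed with HMM state lattices for word 1 ... word W.]
$$\max_{\substack{j_1^1, \dots, j_1^T \\ \vdots \\ j_L^1, \dots, j_L^T}} \; \max_{\substack{k_1^1, \dots, k_1^T \\ \vdots \\ k_W^1, \dots, k_W^T}} \; \sum_{l=1}^{L} \left[ \sum_{t=1}^{T} f(b^t_{j_l^t}) + \sum_{t=2}^{T} g(b^{t-1}_{j_l^{t-1}}, b^t_{j_l^t}) \right] + \sum_{w=1}^{W} \left[ \sum_{t=1}^{T} h_w(k_w^t, b^t_{j^t_{\theta_w^1}}, b^t_{j^t_{\theta_w^2}}) + \sum_{t=2}^{T} a_w(k_w^{t-1}, k_w^t) \right]$$
Siddharth et al 2014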
-
Sentences
The tall person quickly rode the horse leftward away from the other horse.
agent-location patient-location
. . .
source-location
-
Understanding sentences as a whole
We can differentiate events based on the verb:
The person picked up an object.
The person put down an object.
Siddharth et al 2014
© Journal of Artificial Intelligence Research. All rights reserved. This content is excluded from our Creative Commons license. For more information, see https://ocw.mit.edu/help/faq-fair-use/.
Source: Yu, Haonan, N. Siddharth, Andrei Barbu, and Jeffrey Mark Siskind. "A Compositional Framework for Grounding Language Inference, Generation, and Acquisition in Video." J. Artif. Intell. Res. (JAIR) 52 (2015): 601-713.
-
Understanding sentences as a whole
We can differentiate events based on the subject noun:
The backpack approached the bin.
The chair approached the bin.
Siddharth et al 2014
-
Understanding sentences as a whole
We can differentiate events based on an adjective in the subject NP:
The red object approached the chair.
The blue object approached the chair.
Siddharth et al 2014
-
Understanding sentences as a whole
We can differentiate events based on a preposition in the object NP:
The person picked up an object to the right of the bin.
The person picked up an object to the left of the bin.
Siddharth et al 2014
-
Recognition: $S(\text{sentence}, \text{video})$
Retrieval: $\arg\max_{v \in V} S(s, v)$
Generation: $\arg\max_{s \in L} S(s, v)$
Question answering: $\arg\max_{s \in L} S'(s, s_q, v)$
Disambiguation: $\arg\max_{i \in \text{parser}(s)} S(i, v)$
Acquiring language: $\arg\max_{p} \prod_{s,v} S(s(p), v)$
Images, not videos: $S(\text{sentence}, \text{video})$
Translation: $\arg\min_{s' \in L_b} f(s, s')$
Planning: $\arg\max_{s \in L} \int_v S(s, v_0 : v : v_n)$
Theory of mind: $S(s, v)$
. . .
-
Sentential retrieval
[Figure: images © Warner Bros. Family Entertainment, United Artists, Buena Vista Pictures, Columbia Pictures, Metro-Goldwyn-Mayer, and Universal Pictures. All rights reserved. This content is excluded from our Creative Commons license. For more information, see https://ocw.mit.edu/help/faq-fair-use/.]
-
Recognition: $S(\text{sentence}, \text{video})$
Retrieval: $\arg\max_{v \in V} S(s, v)$
Generation: $\arg\max_{s \in L} S(s, v)$
Question answering: $\arg\max_{s \in L} S'(s, s_q, v)$
Disambiguation: $\arg\max_{i \in \text{parser}(s)} S(i, v)$
Acquiring language: $\arg\max_{p} \prod_{s,v} S(s(p), v)$
Images, not videos: $S(\text{sentence}, \text{video})$
Translation: $\arg\min_{s' \in L_b} f(s, s')$
Planning: $\arg\max_{s \in L} \int_v S(s, v_0 : v : v_n)$
Theory of mind: $S(s, v)$
. . .
-
Generating sentences
S → NP VP
NP → D [A] N [PP]
D → an | the
A → blue | red
N → person | backpack | chair | bin | object
PP → P NP
P → to the left of | to the right of
VP → V NP [Adv] [PPM]
V → approached | carried | picked up | put down
Adv → quickly | slowly
PPM → PM NP
PM → towards | away from
147,123,874,800 sentences without recursion
“the person carried the backpack”
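The sentence count on this slide can be reproduced by counting the expansions of the grammar. The Python sketch below does so under one assumption about what "without recursion" means: an NP may take one PP modifier, but the NP inside that PP may not take another. Under that reading the count matches the figure on the slide; the function names and the `depth` parameter are illustrative.

```python
# Terminal choices, copied from the grammar on the slide.
D = ["an", "the"]
A = ["blue", "red"]
N = ["person", "backpack", "chair", "bin", "object"]
P = ["to the left of", "to the right of"]
V = ["approached", "carried", "picked up", "put down"]
ADV = ["quickly", "slowly"]
PM = ["towards", "away from"]


def count_np(depth):
    """Number of NPs (NP -> D [A] N [PP]) whose PP modifiers nest at most `depth` levels."""
    base = len(D) * (1 + len(A)) * len(N)  # optional adjective: absent or one of A
    if depth == 0:
        return base
    # Optional PP: absent, or P followed by a shallower NP.
    return base * (1 + len(P) * count_np(depth - 1))


def count_sentences(depth):
    """Number of sentences S -> NP VP with VP -> V NP [Adv] [PPM]."""
    np_count = count_np(depth)
    vp_count = len(V) * np_count * (1 + len(ADV)) * (1 + len(PM) * np_count)
    return np_count * vp_count


print(count_sentences(1))  # prints 147123874800
```

Even this small grammar is far too large to enumerate, so generation has to search the space of sentences rather than score each one in turn.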
-
Generated sentences
The person to the right of the bin picked up the backpack.
-
Recognition: $S(\text{sentence}, \text{video})$
Retrieval: $\arg\max_{v \in V} S(s, v)$
Generation: $\arg\max_{s \in L} S(s, v)$
Question answering: $\arg\max_{s \in L} S'(s, s_q, v)$
Disambiguation: $\arg\max_{i \in \text{parser}(s)} S(i, v)$
Acquiring language: $\arg\max_{p} \prod_{s,v} S(s(p), v)$
Images, not videos: $S(\text{sentence}, \text{video})$
Translation: $\arg\min_{s' \in L_b} f(s, s')$
Planning: $\arg\max_{s \in L} \int_v S(s, v_0 : v : v_n)$
Theory of mind: $S(s, v)$
. . .
-
Recognition: $S(\text{sentence}, \text{video})$
Retrieval: $\arg\max_{v \in V} S(s, v)$
Generation: $\arg\max_{s \in L} S(s, v)$
Question answering: $\arg\max_{s \in L} S(Q(s, s_q), v)$
Disambiguation: $\arg\max_{i \in \text{parser}(s)} S(i, v)$
Acquiring language: $\arg\max_{p} \prod_{s,v} S(s(p), v)$
Images, not videos: $S(\text{sentence}, \text{video})$
Translation: $\arg\min_{s' \in L_b} f(s, s')$
Planning: $\arg\max_{s \in L} \int_v S(s, v_0 : v : v_n)$
Theory of mind: $S(s, v)$
. . .
-
Question answering
What did the person put on top of the red car?
The person put NP on top of the red car.
The person put the pear on top of the red car.
-
Question answering
Who put an object on top of the red car?
NP put an object on top of the red car.
The person on the left of the car put an object on top of the red car.
-
Generation for question answering
Who put an object on top of the red car?
NP put an object on top of the red car.
· · ·
The person put an object on top of the red car.
· · ·
The person on the left of the car put an object on top of the red car.
-
Generation for question answering
Who put an object on top of the red car?
NP put an object on top of the red car.
· · ·
The person put an object on top of the red car.
· · ·
The person on the left of the car put an object on top of the red car.
-
Discriminative generation for question answering
Who put an object on top of the red car?
NP put an object on top of the red car.
· · ·
The person put an object on top of the red car.
· · ·
The person on the left of the car put an object on top of the red car.
-
Question answering
Who put an object on top of the red car?
NP put an object on top of the red car.
The person on the left of the car put an object on top of the red car.
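Turning a question into a sentence template with an NP slot, then letting generation fill the slot, can be sketched as follows. This Python sketch handles only subject questions that begin with "Who", takes the candidate noun phrases as given, and uses a made-up score table. The helper names and scores are assumptions for illustration, and the discriminative criterion from the earlier slide is not modeled.

```python
def who_template(question):
    """Turn 'Who <rest>?' into the template 'NP <rest>.' (subject questions only)."""
    if not (question.startswith("Who ") and question.endswith("?")):
        raise ValueError("this sketch only handles questions of the form 'Who ...?'")
    return "NP" + question[3:-1] + "."


def fill(template, noun_phrase):
    """Substitute the noun phrase for the NP slot and capitalize the sentence."""
    sentence = template.replace("NP", noun_phrase, 1)
    return sentence[0].upper() + sentence[1:]


def answer(S, question, candidate_nps, video):
    """argmax over candidate answers s of S(Q(s, s_q), v), with Q realized by fill()."""
    template = who_template(question)
    return max((fill(template, np) for np in candidate_nps), key=lambda s: S(s, video))


# Made-up scores standing in for the sentence tracker.
TOY_SCORES = {
    "The person put an object on top of the red car.": 0.6,
    "The person on the left of the car put an object on top of the red car.": 0.9,
    "The chair put an object on top of the red car.": 0.1,
}


def toy_S(sentence, video):
    """Look up the made-up score; the video argument is ignored in this toy."""
    return TOY_SCORES.get(sentence, 0.0)
```

Calling `answer(toy_S, "Who put an object on top of the red car?", ["the person", "the person on the left of the car", "the chair"], "clip")` returns the sentence with the more specific subject, because the toy table scores it highest.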
-
Recognition: $S(\text{sentence}, \text{video})$
Retrieval: $\arg\max_{v \in V} S(s, v)$
Generation: $\arg\max_{s \in L} S(s, v)$
Question answering: $\arg\max_{s \in L} S(Q(s, s_q), v)$
Disambiguation: $\arg\max_{i \in \text{parser}(s)} S(i, v)$
Acquiring language: $\arg\max_{p} \prod_{s,v} S(s(p), v)$
Images, not videos: $S(\text{sentence}, \text{video})$
Translation: $\arg\min_{s' \in L_b} f(s, s')$
Planning: $\arg\max_{s \in L} \int_v S(s, v_0 : v : v_n)$
Theory of mind: $S(s, v)$
. . .
-
Disambiguation
Danny approached the chair with a bag.
Berzak et al 2015
-
Disambiguation
Danny approached the chair with a bag.
(S (NP (NNP Sam)) (VP (VP (V (VBD approached)) (NP (DT the) (NN chair))) (PP (IN with) (NP (DT a) (NN bag)))))
(S (NP (NNP Sam)) (VP (V (VBD approached)) (NP (NP (DT the) (NN chair)) (PP (IN with) (NP (DT a) (NN bag))))))
Berzak et al 2015
-
Disambiguation
PP Attachment: Danny looked at Andrei with a telescope.
VP Attachment
Conjunction
Logical Form
Anaphora
Ellipsis
Courtesy of Yevgeni Berzak, Andrei Barbu, Daniel Harari, Boris Katz & Shimon Ullman. License CC BY.
Source: Berzak, Yevgeni, Andrei Barbu, Daniel Harari, Boris Katz, and Shimon Ullman. "Do you see what i mean? visual resolution of linguistic ambiguities." arXiv preprint arXiv:1603.08079 (2016).
Berzak et al 2015
-
Disambiguation
PP Attachment
VP Attachment: Andrei approached the person holding a green chair.
Conjunction
Logical Form
Anaphora
Ellipsis
Berzak et al 2015
-
Disambiguation
PP Attachment
VP Attachment
Conjunction: Danny and Andrei picked up the yellow bag and chair.
Logical Form
Anaphora
Ellipsis
Berzak et al 2015
-
Disambiguation
PP Attachment
VP Attachment
Conjunction
Logical Form: Someone put down the bags.
Anaphora
Ellipsis
Berzak et al 2015
-
Disambiguation
PP Attachment
VP Attachment
Conjunction
Logical Form
Anaphora: Danny picked up the bag and the chair. It is yellow.
Ellipsis
Berzak et al 2015
-
Disambiguation
PP Attachment
VP Attachment
Conjunction
Logical Form
Anaphora
Ellipsis: Danny left Andrei. Also Yevgeni.
Berzak et al 2015
-
Not just parse trees
Danny and Andrei moved a chair.
Danny and Andrei move the same chair.
chair(x)
move(Danny, x), move(Andrei, x)
Danny and Andrei move different chairs.
chair(x), chair(y)
move(Danny, x), move(Andrei, y), x ≠ y
-
Not just parse trees
Danny and Andrei moved a chair.
Danny and Andrei move the same chair.
chair(x), person(y), person(z), y ≠ z
move(y, x), move(z, x)
Danny and Andrei move different chairs.
chair(x), chair(y), person(z), person(w), z ≠ w
move(z, x), move(w, y), x ≠ y
-
Recognition: $S(\text{sentence}, \text{video})$
Retrieval: $\arg\max_{v \in V} S(s, v)$
Generation: $\arg\max_{s \in L} S(s, v)$
Question answering: $\arg\max_{s \in L} S(Q(s, s_q), v)$
Disambiguation: $\arg\max_{i \in \text{parser}(s)} S(i, v)$
Acquiring language: $\arg\max_{p} \prod_{s,v} S(s(p), v)$
Images, not videos: $S(\text{sentence}, \text{video})$
Translation: $\arg\min_{s' \in L_b} f(s, s')$
Planning: $\arg\max_{s \in L} \int_v S(s, v_0 : v : v_n)$
Theory of mind: $S(s, v)$
. . .
-
Language learning
Split into two tasks:
Learning word meaning
Learning syntax
-
Language learning: word meaning
person
chair
picked up
approached
backpack
The chair approached the backpack.
The person picked up the backpack.
The person picked up the chair.
-
Language learning: word meaning
person
chair
picked up
approached
backpack
The chair approached the backpack.
The person picked up the backpack.
The person picked up the chair.
Yu et al 2014
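The idea of choosing word meanings that make every training sentence score well against its paired video can be shown with a deliberately tiny search. In the Python sketch below, the parameters p are just an assignment of each noun to one visual concept, the score is a made-up stand-in, and the search is brute force over all assignments. This is a toy instance of maximizing a product of sentence scores over a corpus, not the authors' learning procedure, and the concept labels and corpus encoding are assumptions.

```python
from itertools import product
from math import prod


def sentence_score(words, lexicon, visible):
    """Toy S(s(p), v): a word contributes 0.9 if its hypothesized concept is visible, else 0.1."""
    return prod(0.9 if lexicon[w] in visible else 0.1 for w in words)


def learn_lexicon(vocabulary, concepts, corpus):
    """Brute-force argmax over lexicons of the product of sentence scores over the corpus."""
    best_lexicon, best_value = None, -1.0
    for assignment in product(concepts, repeat=len(vocabulary)):
        lexicon = dict(zip(vocabulary, assignment))
        value = prod(sentence_score(words, lexicon, visible) for words, visible in corpus)
        if value > best_value:
            best_lexicon, best_value = lexicon, value
    return best_lexicon


# Nouns of the three training sentences, each paired with the concepts visible in its video.
CORPUS = [
    (("chair", "backpack"), {"CHAIR", "BACKPACK"}),   # The chair approached the backpack.
    (("person", "backpack"), {"PERSON", "BACKPACK"}),  # The person picked up the backpack.
    (("person", "chair"), {"PERSON", "CHAIR"}),        # The person picked up the chair.
]

lexicon = learn_lexicon(
    ["person", "chair", "backpack"], ["PERSON", "CHAIR", "BACKPACK"], CORPUS
)
```

No single sentence identifies which word names which object, but each noun appears in two videos that share exactly one concept, so the product over the corpus is maximized only by the correct pairing.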
-
Language learning
Split into two tasks:
Learning word meaning
Learning syntax
-
Language learning: syntax; in progress
Danny approached the chair with a bag.
≈
parser
(S (NP (NNP Sam)) (VP (VP (V (VBD approached)) (NP (DT the) (NN chair))) (PP (IN with) (NP (DT a) (NN bag)))))
(S (NP (NNP Sam)) (VP (V (VBD approached)) (NP (NP (DT the) (NN chair)) (PP (IN with) (NP (DT a) (NN bag))))))
· · ·
-
Language learning: syntax; in progress
Danny approached the chair with a bag.
≈
parser
[Figure: the two candidate parse trees below, shown four times.]
(S (NP (NNP Sam)) (VP (VP (V (VBD approached)) (NP (DT the) (NN chair))) (PP (IN with) (NP (DT a) (NN bag)))))
(S (NP (NNP Sam)) (VP (V (VBD approached)) (NP (NP (DT the) (NN chair)) (PP (IN with) (NP (DT a) (NN bag))))))
· · ·
-
Courtesy of Elsevier, Inc., http://www.sciencedirect.com. Used with permission.
Source: Pilley, John W., and Alliston K. Reid. "Border collie comprehends object names as verbal referents." Behavioural processes 86, no. 2 (2011): 184-195.
Pilley and Reid 2011
-
Recognition: $S(\text{sentence}, \text{video})$
Retrieval: $\arg\max_{v \in V} S(s, v)$
Generation: $\arg\max_{s \in L} S(s, v)$
Question answering: $\arg\max_{s \in L} S(Q(s, s_q), v)$
Disambiguation: $\arg\max_{i \in \text{parser}(s)} S(i, v)$
Acquiring language: $\arg\max_{p} \prod_{s,v} S(s(p), v)$
Images, not videos: $S(\text{sentence}, \text{video})$
Translation: $\arg\min_{s' \in L_b} f(s, s')$
Planning: $\arg\max_{s \in L} \int_v S(s, v_0 : v : v_n)$
Theory of mind: $S(s, v)$
. . .
-
Recognition: $S(\text{sentence}, \text{video})$
Retrieval: $\arg\max_{v \in V} S(s, v)$
Generation: $\arg\max_{s \in L} S(s, v)$
Question answering: $\arg\max_{s \in L} S(Q(s, s_q), v)$
Disambiguation: $\arg\max_{i \in \text{parser}(s)} S(i, v)$
Acquiring language: $\arg\max_{p} \prod_{s,v} S(s(p), v)$
Images, not videos: $S(\text{sentence}, \text{Flow}(\text{image}))$
Translation: $\arg\min_{s' \in L_b} f(s, s')$
Planning: $\arg\max_{s \in L} \int_v S(s, v_0 : v : v_n)$
Theory of mind: $S(s, v)$
. . .
-
Single-frame optical flow
-
Single-frame optical flow
Input Flow Predicted
-
Recognition: $S(\text{sentence}, \text{video})$
Retrieval: $\arg\max_{v \in V} S(s, v)$
Generation: $\arg\max_{s \in L} S(s, v)$
Question answering: $\arg\max_{s \in L} S(Q(s, s_q), v)$
Disambiguation: $\arg\max_{i \in \text{parser}(s)} S(i, v)$
Acquiring language: $\arg\max_{p} \prod_{s,v} S(s(p), v)$
Images, not videos: $S(\text{sentence}, \text{Flow}(\text{image}))$
Translation: $\arg\min_{s' \in L_b} f(s, s')$
Planning: $\arg\max_{s \in L} \int_v S(s, v_0 : v : v_n)$
Theory of mind: $S(s, v)$
. . .
-
Recognition: $S(\text{sentence}, \text{video})$
Retrieval: $\arg\max_{v \in V} S(s, v)$
Generation: $\arg\max_{s \in L} S(s, v)$
Question answering: $\arg\max_{s \in L} S(Q(s, s_q), v)$
Disambiguation: $\arg\max_{i \in \text{parser}(s)} S(i, v)$
Acquiring language: $\arg\max_{p} \prod_{s,v} S(s(p), v)$
Images, not videos: $S(\text{sentence}, \text{Flow}(\text{image}))$
Translation: $\arg\min_{s' \in L_b} \int_v |S(s', v) - S(s, v)|$
Planning: $\arg\max_{s \in L} \int_v S(s, v_0 : v : v_n)$
Theory of mind: $S(s, v)$
. . .
-
Statistical machine translation
Sam was happy
Sam a fost fericita
Сэм была счастлива
parallel corpus
In Thai you specify siblings by age not gender.
In English you specify relative time but you don’t need to in Chinese.
Guugu Yimithirr language only uses absolute directions.
Many languages don’t distinguish blue/green.
Swahili specifies color as “the color of X”.
In Turkish you have to report if something is hearsay.
-
Translation by imagination
sentence → (sample) → videos → (generation) → translation
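Translation by imagination can be read as picking the target-language sentence whose scores agree most closely with the source sentence across imagined videos. A minimal Python sketch follows, assuming the integral over videos is approximated by a sum over a finite sample and that one score function covers both languages. The placeholder sentence labels, video names, and scores are all made up.

```python
def translate(S, sentence, target_sentences, sampled_videos):
    """argmin over target sentences s' of sum_v |S(s', v) - S(s, v)| on the sampled videos."""
    def gap(candidate):
        return sum(abs(S(candidate, v) - S(sentence, v)) for v in sampled_videos)
    return min(target_sentences, key=gap)


# Made-up scores; "target A" and "target B" stand in for sentences in the other language.
TOY_SCORES = {
    ("The person picked up the chair.", "video_pickup"): 0.9,
    ("The person picked up the chair.", "video_putdown"): 0.1,
    ("target A", "video_pickup"): 0.9,
    ("target A", "video_putdown"): 0.1,
    ("target B", "video_pickup"): 0.1,
    ("target B", "video_putdown"): 0.9,
}


def toy_S(sentence, video):
    """Look up the made-up score; unknown pairs score 0."""
    return TOY_SCORES.get((sentence, video), 0.0)


best = translate(
    toy_S,
    "The person picked up the chair.",
    ["target A", "target B"],
    ["video_pickup", "video_putdown"],
)
```

In this toy, `best` is `"target A"`, the candidate that is true of the same sampled videos as the source sentence. No parallel corpus is involved; the videos act as the shared meaning representation.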
-
Recognition: $S(\text{sentence}, \text{video})$
Retrieval: $\arg\max_{v \in V} S(s, v)$
Generation: $\arg\max_{s \in L} S(s, v)$
Question answering: $\arg\max_{s \in L} S(Q(s, s_q), v)$
Disambiguation: $\arg\max_{i \in \text{parser}(s)} S(i, v)$
Acquiring language: $\arg\max_{p} \prod_{s,v} S(s(p), v)$
Images, not videos: $S(\text{sentence}, \text{Flow}(\text{image}))$
Translation: $\arg\min_{s' \in L_b} \int_v |S(s', v) - S(s, v)|$
Planning: $\arg\max_{s \in L} \int_v S(s, v_0 : v : v_n)$
Theory of mind: $S(s, v)$
. . .
-
Recognition: $S(\text{sentence}, \text{video})$
Retrieval: $\arg\max_{v \in V} S(s, v)$
Generation: $\arg\max_{s \in L} S(s, v)$
Question answering: $\arg\max_{s \in L} S(Q(s, s_q), v)$
Disambiguation: $\arg\max_{i \in \text{parser}(s)} S(i, v)$
Acquiring language: $\arg\max_{p} \prod_{s,v} S(s(p), v)$
Images, not videos: $S(\text{sentence}, \text{Flow}(\text{image}))$
Translation: $\arg\min_{s' \in L_b} \int_v |S(s', v) - S(s, v)|$
Planning: $\arg\max_{s \in L} \int_v S(s, v_0 : v : v_n)$
Theory of mind: $S(s, v)$
. . .
-
Recognition: $S(\text{sentence}, \text{video})$
Retrieval: $\arg\max_{v \in V} S(s, v)$
Generation: $\arg\max_{s \in L} S(s, v)$
Question answering: $\arg\max_{s \in L} S(Q(s, s_q), v)$
Disambiguation: $\arg\max_{i \in \text{parser}(s)} S(i, v)$
Acquiring language: $\arg\max_{p} \prod_{s,v} S(s(p), v)$
Images, not videos: $S(\text{sentence}, \text{Flow}(\text{image}))$
Translation: $\arg\min_{s' \in L_b} \int_v |S(s', v) - S(s, v)|$
Planning: $\arg\max_{s \in L} \int_v S(s, v_0 : v : v_n)$
Theory of mind: $S(s, \text{tracks})\, S(\text{tracks}, v)$
. . .
-
Recognition: $S(\text{sentence}, \text{video})$
Retrieval: $\arg\max_{v \in V} S(s, v)$
Generation: $\arg\max_{s \in L} S(s, v)$
Question answering: $\arg\max_{s \in L} S(Q(s, s_q), v)$
Disambiguation: $\arg\max_{i \in \text{parser}(s)} S(i, v)$
Acquiring language: $\arg\max_{p} \prod_{s,v} S(s(p), v)$
Images, not videos: $S(\text{sentence}, \text{Flow}(\text{image}))$
Translation: $\arg\min_{s' \in L_b} \int_v |S(s', v) - S(s, v)|$
Planning: $\arg\max_{s \in L} \int_v S(s, v_0 : v : v_n)$
Theory of mind: $S(\text{planner}, \text{tracks})\, S(s, \text{tracks})\, S(\text{tracks}, v)$
. . .
-
Siddharth et al 2012
-
The long road ahead . . .
Coherent stories
3D
Forces & contact relations
Segmentation
Parts and low-level features
Theory of mind
Physics
Modification
The vast majority of verbs: absolve, admire, anger, approve, bark, bend, chase, cheat, complete, concede, discover, fire, follow, fumble, hurry, race, recruit, reject, scratch, steal, taste, want, etc.
Metaphoric extension
etc.
-
© sodlvs at Youtube.com. All rights reserved. This content is excluded from our Creative Commons license. For more information, see https://ocw.mit.edu/help/faq-fair-use/.
-
© Erickson. All rights reserved. This content is excluded from our Creative Commons license. For more information, see https://ocw.mit.edu/help/faq-fair-use/.
-
The long road ahead . . .
Coherent stories
3D
Forces & contact relations
Segmentation
Parts and low-level features
Theory of mind
Physics
Modification
The vast majority of verbs: absolve, admire, anger, approve, bark, bend, chase, cheat, complete, concede, discover, fire, follow, fumble, hurry, race, recruit, reject, scratch, steal, taste, want, etc.
Metaphoric extension
etc.
-
-
The long road ahead . . .
Coherent stories
3D
Forces & contact relations
Segmentation
Parts and low-level features
Theory of mind
Physics
Modification
The vast majority of verbs: absolve, admire, anger, approve, bark, bend, chase, cheat, complete, concede, discover, fire, follow, fumble, hurry, race, recruit, reject, scratch, steal, taste, want, etc.
Metaphoric extension
etc.
-
-
The long road ahead . . .
Coherent stories
3D
Forces & contact relations
Segmentation
Parts and low-level features
Theory of mind
Physics
Modification
The vast majority of verbs: absolve, admire, anger, approve, bark, bend, chase, cheat, complete, concede, discover, fire, follow, fumble, hurry, race, recruit, reject, scratch, steal, taste, want, etc.
Metaphoric extension
etc.
-
Thanks to many great collaborators
Yevgeni Berzak, Danny Harari, Maximilian Nickel, Candance Ross, Victor Carbarera, Santiago Perez,
Boris Katz, Shimon Ullman, Tomaso Poggio
Siddharth Narayanaswamy, Jeffrey Siskind, Sven Dickinson, Song Wang, Haonan Yu, Caiming Xiong
-
Recognition: $S(\text{sentence}, \text{video})$
Retrieval: $\arg\max_{v \in V} S(s, v)$
Generation: $\arg\max_{s \in L} S(s, v)$
Question answering: $\arg\max_{s \in L} S(Q(s, s_q), v)$
Disambiguation: $\arg\max_{i \in \text{parser}(s)} S(i, v)$
Acquiring language: $\arg\max_{p} \prod_{s,v} S(s(p), v)$
Images, not videos: $S(\text{sentence}, \text{Flow}(\text{image}))$
Translation: $\arg\min_{s' \in L_b} \int_v |S(s', v) - S(s, v)|$
Planning: $\arg\max_{s \in L} \int_v S(s, v_0 : v : v_n)$
Theory of mind: $S(\text{planner}, \text{tracks})\, S(s, \text{tracks})\, S(\text{tracks}, v)$
. . .
-
MIT OpenCourseWare
https://ocw.mit.edu
Resource: Brains, Minds and Machines Summer Course
Tomaso Poggio and Gabriel Kreiman
The following may not correspond to a particular course on MIT OpenCourseWare, but has been provided by the author as an individual learning resource.
For information about citing these materials or our Terms of Use, visit: https://ocw.mit.edu/terms.