from language to vision and back again...source: barbu, andrei, alexander bridge, dan coroian, sven...

95
From language to vision and back again Andrei Barbu Andrei Barbu (MIT) language vision August 28, 2015 Center for Brains, Minds & Machines 1

Upload: others

Post on 30-Jan-2021

0 views

Category:

Documents


0 download

TRANSCRIPT

  • From language to vision and back again

    Andrei Barbu

    Andrei Barbu (MIT) language↔ vision August 28, 2015

    Center for Brains,Minds & Machines

    1

  • Andrei Barbu (MIT) language↔ vision August 28, 2015

    © Source Unknown. All rights reserved. This content is excluded from our Creative

    Commons license. For more information, see https://ocw.mit.edu/help/faq-fair-use/.

    2

    https://ocw.mit.edu/help/faq-fair-use/

  • Andrei Barbu (MIT) language↔ visionTorralba et al 2003

    August 28, 20153

  • Andrei Barbu (MIT) language↔ vision August 28, 20154

  • Andrei Barbu (MIT) language↔ vision August 28, 20155

  • hammering

    hammer

    Andrei Barbu (MIT) language↔ vision August 28, 20156

  • Perception is unreliable.

    Top-down knowledge affects our perception.

    One integrated representation for many tasks.

    Andrei Barbu (MIT) language↔ vision August 28, 20157

  • S(sentence, video)

    argmax S(s, v)v∈V

    argmax S(s, v)s∈L

    argmax S′(s, sq, v)s∈L

    argmax S(i, v)i∈parser(s)

    argmax∏

    S(s(p), v)p s,v

    S(sentence, video)

    argmin f (s, s′)s′∈Lb

    argmax∫

    S(s, v0 : v : vn)s∈L v

    S(s, v)

    Recognition

    Retrieval

    Generation

    Question answering

    Disambiguation

    Acquiring language

    Images, not videos

    Translation

    Planning

    Theory of mind

    . . .Andrei Barbu (MIT) language↔ vision August 28, 20158

  • Humans perform language–vision tasks all the time

    Give me the cup.Which chair should I sit in?This is an apple.To win this game you have to make a straight line out of your pieces.

    Andrei Barbu (MIT) language↔ vision August 28, 20159

  • argmax S(s, v)v∈V

    argmax S(s, v)s∈L

    argmax S′(s, sq, v)s∈L

    argmax S(i, v)i∈parser(s)

    argmax S(s(p), v)p

    ∏s,v

    S(sentence, video)

    argmin f (s, s′)s′∈Lb

    argmax∫

    S(s, v0 : v : vn)s∈L v

    S(s, v)

    Recognition S(sentence, video)

    Retrieval

    Generation

    Question answering

    Disambiguation

    Acquiring language

    Images, not videos

    Translation

    Planning

    Theory of mind

    . . .Andrei Barbu (MIT) language↔ vision August 28, 201510

  • The person rode the skateboard leftward.

    object detector, tracker, event recognizer

    Andrei Barbu (MIT) language↔ vision August 28, 201511

  • Tracks Events Sentences

    Andrei Barbu (MIT) language↔ vision August 28, 201512

  • False negatives

    Object detection

    Andrei Barbu (MIT) language↔ vision

    Figure removed due to copyright restrictions. Please see the video.

    Source: Barbu, Andrei, Aaron Michaux, Siddharth Narayanaswamy, and Jeffrey Mark Siskind. "Simultaneousobject detection, tracking, and event recognition." Advances in Cognitive Systems: 203-220 (2012).

    Felzenszwalb et al 2008August 28, 201513

  • Object detection

    Andrei Barbu (MIT) language↔ vision

    Figure removed due to copyright restrictions. Please see the video.

    Source: Barbu, Andrei, Aaron Michaux, Siddharth Narayanaswamy, and Jeffrey Mark Siskind. "Simultaneousobject detection, tracking, and event recognition." Advances in Cognitive Systems: 203-220 (2012).

    Felzenszwalb et al 2008August 28, 201514

  • Object detection

    Andrei Barbu (MIT) language↔ vision

    Figure removed due to copyright restrictions. Please see the video.

    Source: Barbu, Andrei, Aaron Michaux, Siddharth Narayanaswamy, and Jeffrey Mark Siskind. "Simultaneousobject detection, tracking, and event recognition." Advances in Cognitive Systems: 203-220 (2012).

    Felzenszwalb et al 2008August 28, 201515

  • Object detectors work poorly

    Andrei Barbu (MIT) language↔ vision

    Figure removed due to copyright restrictions. Please see the video.

    Source: Russakovsky, Olga, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean

    Ma, Zhiheng Huang et al. "Imagenet large scale visual recognition challenge." International

    Journal of Computer Vision 115, no. 3 (2015): 211-252.

    Russakovsky et al 2015August 28, 201516

  • Fixing bad detectors with higher-level knowledge

    detection / object / frametemporally coherent track

    object detector confi-dence (f)motion coherence (g)

    optimal path through thelattice of detectionsdynamic programmingBellman (1957), Viterbi (1967)

    T

    max∑ T

    f (bt t 1 tjt) + gj1 ..., jT 1

    ∑(bjt

    −−1 , bjt)

    , t= t=2

    Andrei Barbu (MIT) language↔ visionBarbu et al 2012

    August 28, 201517

  • Fixing bad detectors with higher-level knowledge

    detection / object / frametemporally coherent track

    object detector confi-dence (f)motion coherence (g)

    optimal path through thelattice of detectionsdynamic programmingBellman (1957), Viterbi (1967)

    T

    max∑ T

    f (bt t 1 tjt) + gj1 ..., jT 1

    ∑(bjt

    −−1 , bjt)

    , t= t=2

    Andrei Barbu (MIT) language↔ vision

    Courtesy of Andrei Barbu, Alexander Bridge, Dan Coroian, Sven Dickinson,

    Sam Mussman, Siddharth Narayanaswamy, Dhaval Salvi, Lara Schmidt,Jiangnan Shangguan, Jeffrey Mark Siskind, Jarrell Waggoner, Song Want,Jinlian Wei, Yifan Yin & Zhiqi Zhang. Used with permission.Source: Barbu, Andrei, Alexander Bridge, Dan Coroian, Sven Dickinson, Sam

    Mussman, Siddharth Narayanaswamy, Dhaval Salvi et al. "Large-scale automaticlabeling of video events with verbs based on event-participant interaction."arXiv preprint arXiv:1204.3616 (2012).

    Barbu et al 2012August 28, 201518

  • Feature vector — single participant

    PersonPerson

    position positiondt dt2

    positiond d

    aspect ratio aspect ratio area areadt dt

    object class root-filter index

    d d2

    Andrei Barbu (MIT) language↔ vision August 28, 201519

  • Feature vector — dual participant

    distance distance orientation orientationdt dtd d

    person riding skateboard person approaching person skateboard approaching person

    Andrei Barbu (MIT) language↔ vision August 28, 201520

  • Event recognition

    tracks· · · · · · · · ·

    a

    h

    feature vectors

    Baum and Petrie (1966)

    OneHMMper event classTry each HMM

    HMMs

    Andrei Barbu (MIT) language↔ vision August 28, 201521

  • Event recognition

    max , bt 1 tk k

    ∑h t(kt t̂ ) +1,...

    =1

    ∑a(k − , k )

    T, t t=2

    T T

    Andrei Barbu (MIT) language↔ vision August 28, 201522

  • Andrei Barbu (MIT) language↔ visionBarbu et al 2012

    August 28, 201523

  • Tracks Events Sentences

    Andrei Barbu (MIT) language↔ vision August 28, 201524

  • Tracking in the context of event recognition

    ∑T T T Tmax f t 1 t t t 1 t(bt tjt) + g(b

    −1 , b ) t) + a(k − , k )jt jt + max h(k , bj ,...,jT 1t

    ∑̂1 − k kT,...,

    =1 t=2

    ∑t=1

    ∑t=2

    Andrei Barbu (MIT) language↔ vision August 28, 201525

  • Tracking in the context of event recognition

    max max∑

    f t(bt bt t btjt) +∑

    g(b t−11 , jt) +∑

    h(k , t ) +j j∑

    a kt−1, kt( )j1 T 1 T,...,j k ,...,k

    t=1 t=2 t=1 t=2

    T T T T

    Andrei Barbu (MIT) language↔ vision August 28, 201526

  • Tracking in the context of event recognition

    ∑T Tmax

    ∑ Tmax f t) + g( t , bt(btj b

    t−11 jt) +

    ∑ Th t(k , btt ) +

    ∑a(kt−1, kt)

    1 jj1 j...,jT, k T,...,k−

    t=1 t=2 t=1 t=2

    Andrei Barbu (MIT) language↔ vision August 28, 201527

  • Tracking in the context of event recognition in action

    tracking tracking and event recognition

    Andrei Barbu (MIT) language↔ vision August 28, 201528

    Courtesy of Andrei Barbu, Alexander Bridge, Dan Coroian, Sven Dickinson, Sam Mussman, SiddharthNarayanaswamy, Dhaval Salvi, Lara Schmidt, Jiangnan Shangguan, Jeffrey Mark Siskind, JarrellWaggoner, Song Want, Jinlian Wei, Yifan Yin & Zhiqi Zhang. Used with permission.Source: Barbu, Andrei, Alexander Bridge, Dan Coroian, Sven Dickinson, Sam Mussman, SiddharthNarayanaswamy, Dhaval Salvi et al. "Large-scale automatic labeling of video events with verbsbased on event-participant interaction." arXiv preprint arXiv:1204.3616 (2012).

  • Tracking in the context of event recognition in action

    tracking tracking and event recognition

    Andrei Barbu (MIT) language↔ vision August 28, 201529

  • Tracks Events Sentences

    Andrei Barbu (MIT) language↔ vision August 28, 201530

  • track L

    Building sentences out of trackers and words

    Viterbi tracker

    track 1

    × · · ·×...

    ......

    ...

    . . .

    . . .

    . . .

    . . .

    f g

    t = 1 t = 2 t = 3 t = T

    j = 1

    j = 3

    j = 2

    j = Jt

    b11

    b12

    b13

    b1J1

    b21

    b22

    b23

    b2J2

    b31

    b32

    b33

    b3J3

    bT1

    bT2

    bT3

    bTJT

    ×...

    ......

    ...

    . . .

    . . .

    . . .

    . . .

    h a

    × · · ·×...

    ......

    ...

    . . .

    . . .

    . . .

    . . .

    h a

    word 1 word W

    ...j1L,..., j

    TL

    maxk11 ,..., k

    T1

    ...k1W ,..., k

    TW

    L∑l=1

    T T

    max f t(bt 1 tjt) + g(b−

    j11, , jT l jt−1

    , bjt)l... l1 t=1 t=2

    ∑ ∑

    +

    W∑w=1

    T∑t=1

    hw(ktw, btjtθ1w

    , btjtθ2w

    ) +

    T∑t=2

    a(kt−1, kt)

    Andrei Barbu (MIT) language↔ visionSiddharth et al 2014

    August 28, 201531

  • track L

    Building sentences out of trackers and words

    Event tracker for intransitive verbs

    track 1

    × · · ·×...

    ......

    ...

    . . .

    . . .

    . . .

    . . .

    f g

    t = 1 t = 2 t = 3 t = T

    j = 1

    j = 3

    j = 2

    j = Jt

    b11

    b12

    b13

    b1J1

    b21

    b22

    b23

    b2J2

    b31

    b32

    b33

    b3J3

    bT1

    bT2

    bT3

    bTJT

    ×

    × · · ·×. . .

    . . .

    . . .

    . . . .. . . .. . . .

    . . .

    h a

    word W

    L

    1. .. .

    ∑l=

    . .j1L,..., j

    TL k

    1W ,..., k

    TW

    word 1

    T T T T

    max max∑

    f bt t 1 t t t t 1 t( jt) +∑

    g(b t− 1 , bjt) +∑

    h(k , b̂jt) +∑

    a(k − , k )1 1 T jj T l

    −l

    1,..., j1 k1 ,..., k l1 t=1 t=2 t=1 t=2

    W∑w=1

    Andrei Barbu (MIT) language↔ visionSiddharth et al 2014

    August 28, 201532

  • Building sentences out of trackers and words

    Event tracker

    track 1 track L

    ×

    × · · ·×. . .

    . . .

    . . .

    . . . .. . . .. . . .

    . . .

    h a

    word W

    ..

    ∑Ww=1

    .k1 TW ,..., kW

    word 1

    max maxj11, k

    1,..., kT..., jT1 1

    ∑L1 l=1

    ..

    ∑T ∑T T Tf bt t 1 t t t t t 1 t( jt) + g(bjt

    −1 , bjt) + h(k , b̂jt , b̂jt ) + a(k

    − , k )l l 1 2l

    t 1 t 2

    ∑t

    θ θ= = =1

    ∑t=2

    .j1L , j

    T,... L

    Andrei Barbu (MIT) language↔ visionSiddharth et al 2014

    August 28, 201533

  • Building sentences out of trackers and words

    Sentence tracker

    track 1 track L

    word 1 word W

    ∑L ∑T ∑T W T Tmax max f bt g bt−1 bt

    ∑∑h kt bt( jt) + ( t 1 , jt) + w( w, jt , b

    t t 1 tjt ) +

    ∑aw(k− w

    − , kw)1 T 1 T l jj j k k l 1 2,..., l1 1 1 ,..., 1 l=1 t=1 t=2 w 1 θw θw.

    = t=1 t.

    =2

    . .. .j1 T 1L, kW , , k

    T..., jL ... W

    Andrei Barbu (MIT) language↔ visionSiddharth et al 2014

    August 28, 201534

  • Sentences

    The tall person quickly rode the horse leftward away from the other horse.

    agent-location patient-location

    . . .

    source-location

    Andrei Barbu (MIT) language↔ vision

    © Source Unknown. All rights reserved. This content is excluded from our Creative

    Commons license. For more information, see https://ocw.mit.edu/help/faq-fair-use/.

    August 28, 201535

    https://ocw.mit.edu/help/faq-fair-use/

  • Understanding sentences as a whole

    We can differentiate events based on the verb:The person picked up an object.The person put down an object.

    Andrei Barbu (MIT) language↔ visionSiddharth et al 2014

    August 28, 2015

    © Journal of Artificial Intelligence Research. All rights reserved. This content is excluded from ourCreative Commons license. For more information, see https://ocw.mit.edu/help/faq-fair-use/.Source: Yu, Haonan, N. Siddharth, Andrei Barbu, and Jeffrey Mark Siskind. "A Compositional

    Framework for Grounding Language Inference, Generation, and Acquisition in Video."

    J. Artif. Intell. Res.(JAIR) 52 (2015): 601-713.

    36

    https://ocw.mit.edu/help/faq-fair-use/

  • Understanding sentences as a whole

    We can differentiate events based on the subject noun:The backpack approached the bin.The chair approached the bin.Andrei Barbu (MIT) language↔ vision

    Siddharth et al 2014August 28, 2015

    © Journal of Artificial Intelligence Research. All rights reserved. This content is excluded from ourCreative Commons license. For more information, see https://ocw.mit.edu/help/faq-fair-use/.Source: Yu, Haonan, N. Siddharth, Andrei Barbu, and Jeffrey Mark Siskind. "A Compositional

    Framework for Grounding Language Inference, Generation, and Acquisition in Video."

    J. Artif. Intell. Res.(JAIR) 52 (2015): 601-713.

    37

    https://ocw.mit.edu/help/faq-fair-use/

  • Understanding sentences as a whole

    We can differentiate events based on an adjective in the subject NP:The red object approached the chair.The blue object approached the chair.Andrei Barbu (MIT) language↔ vision

    Siddharth et al 2014August 28, 2015

    © Journal of Artificial Intelligence Research. All rights reserved. This content is excluded from ourCreative Commons license. For more information, see https://ocw.mit.edu/help/faq-fair-use/.

    Source: Yu, Haonan, N. Siddharth, Andrei Barbu, and Jeffrey Mark Siskind. "A Compositional

    Framework for Grounding Language Inference, Generation, and Acquisition in Video."

    J. Artif. Intell. Res.(JAIR) 52 (2015): 601-713.

    38

    https://ocw.mit.edu/help/faq-fair-use/

  • Understanding sentences as a whole

    We can differentiate events based on a preposition in the object NP:The person picked up an object to the right of the bin.The person picked up an object to the left of the bin.Andrei Barbu (MIT) language↔ vision

    Siddharth et al 2014August 28, 2015

    © Journal of Artificial Intelligence Research. All rights reserved. This content is excluded from ourCreative Commons license. For more information, see https://ocw.mit.edu/help/faq-fair-use/.Source: Yu, Haonan, N. Siddharth, Andrei Barbu, and Jeffrey Mark Siskind. "A Compositional

    Framework for Grounding Language Inference, Generation, and Acquisition in Video."

    J. Artif. Intell. Res.(JAIR) 52 (2015): 601-713.

    39

    https://ocw.mit.edu/help/faq-fair-use/

  • argmax S(s, v)s∈L

    argmax S′(s, sq, v)s∈L

    argmax S(i, v)i∈parser(s)

    argmax S(s(p), v)p

    ∏s,v

    S(sentence, video)

    argmin f (s, s′)s′∈Lb

    argmax∫

    S(s, v0 : v : vn)s∈L v

    S(s, v)

    Recognition S(sentence, video)

    Retrieval argmax S(s, v)v∈V

    Generation

    Question answering

    Disambiguation

    Acquiring language

    Images, not videos

    Translation

    Planning

    Theory of mind

    . . .Andrei Barbu (MIT) language↔ vision August 28, 201540

  • Sentential retrieval

    Andrei Barbu (MIT) language↔ vision

    © Warner Bros. Family

    Entertainment. All rights

    reserved. This content is

    excluded from our Creative

    Commons license. For more

    information, see https://ocw.

    mit.edu/help/faq-fair-use/.

    © United Artists. All

    rights reserved. This

    content is excluded

    from our Creative

    Commons license.

    or more information,

    see https://ocw.mit.edu/help/faq-fair-use/.

    © Warner Bros. Family

    Entertainment. All rights

    reserved. This content is

    excluded from ourCreative Commons license.

    For more information, see

    https://ocw.mit.edu/help/

    faq-fair use/.

    © Buena Vista Pictures.

    All rights reserved.

    Th is content isexcluded from our

    Creative Commons

    license. For more

    information, see

    https://ocw.mit.edu

    /help/faq-fair-use/.

    © Columbia Pictures. All

    rights reserved. This

    content is excluded

    from our Creative

    Commons license.

    For more information,

    see https://ocw.mit.

    edu/help/faq-fair-use/.-

    August 28, 201541

    © Metro-Goldwyn-Mayer.

    All rights reserved. This

    content is excluded from

    o ur Creative Commonslicense. For more

    information, see https://ocw.mit.edu/help/faq-fair-use/.

    © Columbia Pictures. All rightsreserved. This content ise xcluded from our Creative Commons license. For more information, see https://ocw.

    mit.edu/help/faq-fair-use/.

    © Universal Pictures. All

    rights reserved. This

    content is excluded

    from our Creative

    Commons license. Formore information, see

    https://ocw.mit.edu/

    help/faq-fair-use/.

    © United Artists. All

    rights reserved. This

    content is excluded

    from our Creative

    Commons license. For

    more information, see

    https://ocw.mit.edu/

    help/faq-fair-use/.

    © Warner Bros. Family

    Entertainment. All rights

    reserved. This content is

    xcluded from our Creative

    Commons license. Formore information, see

    https://ocw.mit.edu/help/faq-fair-use/

    e

    .

    https://ocw.mit.edu/help/faq-fair-use/https://ocw.mit.edu/help/faq-fair-use/https://ocw.mit.edu/help/faq-fair-use/https://ocw.mit.edu/help/faq-fair-use/https://ocw.mit.edu/help/faq-fair-use/https://ocw.mit.edu/help/faq-fair-use/https://ocw.mit.edu/help/faq-fair-use/https://ocw.mit.edu/help/faq-fair-use/https://ocw.mit.edu/help/faq-fair-use/https://ocw.mit.edu/help/faq-fair-use/https://ocw.mit.edu/help/faq-fair-use/https://ocw.mit.edu/help/faq-fair-use/https://ocw.mit.edu/help/faq-fair-use/https://ocw.mit.edu/help/faq-fair-use/https://ocw.mit.edu/help/faq-fair-use/https://ocw.mit.edu/help/faq-fair-use/https://ocw.mit.edu/help/faq-fair-use/https://ocw.mit.edu/help/faq-fair-use/https://ocw.mit.edu/help/faq-fair-use/https://ocw.mit.edu/help/faq-fair-use/

  • argmax S′(s, sq, v)s∈L

    argmax S(i, v)i∈parser(s)

    argmax∏

    S(s(p), v)p s,v

    S(sentence, video)

    argmin f (s, s′)s′∈Lb

    argmax∫

    S(s, v0 : v : vn)s∈L v

    S(s, v)

    Recognition S(sentence, video)

    Retrieval argmax S(s, v)v∈V

    Generation argmax S(s, v)s∈L

    Question answering

    Disambiguation

    Acquiring language

    Images, not videos

    Translation

    Planning

    Theory of mind

    . . .Andrei Barbu (MIT) language↔ vision August 28, 201542

  • Generating sentences

    S→ NP VPNP→ D [A] N [PP]D→ an | theA→ blue | redN→ person | backpack | chair | bin | objectPP→ P NPP→ to the left of | to the right of

    VP→ V NP [Adv] [PPM]V→ approached | carried | picked up | put down

    Adv→ quickly | slowlyPPM → PM NPPM → towards | away from

    147,123,874,800 sentences without recursion

    “the person carried the backpack”

    Andrei Barbu (MIT) language↔ vision August 28, 201543

  • Generated sentences

    The person to the right of the bin picked up the backpack.

    Andrei Barbu (MIT) language↔ vision August 28, 201544

    © Journal of Artificial Intelligence Research. All rights reserved. This content is excluded from our

    Creative Commons license. For more information, see https://ocw.mit.edu/help/faq-fair-use/.

    Source: Yu, Haonan, N. Siddharth, Andrei Barbu, and Jeffrey Mark Siskind. "A Compositional

    Framework for Grounding Language Inference, Generation, and Acquisition in Video."

    J. Artif. Intell. Res.(JAIR) 52 (2015): 601-713.

    https://ocw.mit.edu/help/faq-fair-use/

  • argmax S(i, v)i∈parser(s)

    argmax∏

    S(s(p), v)p s,v

    S(sentence, video)

    argmin f (s, s′)s′∈Lb

    argmax∫

    S(s, v0 : v : vn)s∈L v

    S(s, v)

    Recognition S(sentence, video)

    Retrieval argmax S(s, v)v∈V

    Generation argmax S(s, v)s∈L

    Question answering argmax S′(s, sq, v)s∈L

    Disambiguation

    Acquiring language

    Images, not videos

    Translation

    Planning

    Theory of mind

    . . .Andrei Barbu (MIT) language↔ vision August 28, 201545

  • argmax S(i, v)i∈parser(s)

    argmax∏

    S(s(p), v)p s,v

    S(sentence, video)

    argmin f (s, s′)s′∈Lb

    argmax∫

    S(s, v0 : v : vn)s∈L v

    S(s, v)

    Recognition S(sentence, video)

    Retrieval argmax S(s, v)v∈V

    Generation argmax S(s, v)s∈L

    Question answering argmax S(Q(s, sq), v)s∈L

    Disambiguation

    Acquiring language

    Images, not videos

    Translation

    Planning

    Theory of mind

    . . .Andrei Barbu (MIT) language↔ vision August 28, 201546

  • Question answering

    What did the person put on top of the red car?The person put NP on top of the red car.

    The person put the pear on top of the red car.

    Andrei Barbu (MIT) language↔ vision August 28, 201547

  • Question answering

    NP put an object on top of the red car.The person on the left of the car put an object on top of the red car.

    Who put an object on top of the red car?

    Andrei Barbu (MIT) language↔ vision August 28, 201548

  • · · ·The person on the left of the car put an object on top of the red car.

    Generation for question answering

    Who put an object on top of the red car?

    NP put an object on top of the red car.· · ·

    The person put an object on top of the red car.

    Andrei Barbu (MIT) language↔ vision August 28, 201549

  • Generation for question answering

    Who put an object on top of the red car?

    NP put an object on top of the red car.· · ·

    The person put an object on top of the red car.

    · · ·The person on the left of the car put an object on top of the red car.

    Andrei Barbu (MIT) language↔ vision August 28, 201550

  • Discriminative generation for question answering

    Who put an object on top of the red car?

    NP put an object on top of the red car.· · ·

    The person put an object on top of the red car.· · ·

    The person on the left of the car put an object on top of the red car.

    Andrei Barbu (MIT) language↔ vision August 28, 201551

  • Question answering

    Who put an object on top of the red car?NP put an object on top of the red car.

    The person on the left of the car put an object on top of the red car.

    Andrei Barbu (MIT) language↔ vision August 28, 201552

  • argmax∏

    S(s(p), v)p s,v

    S(sentence, video)

    argmin f (s, s′)s′∈Lb

    argmax∫

    S(s, v0 : v : vn)s∈L v

    S(s, v)

    Recognition S(sentence, video)

    Retrieval argmax S(s, v)v∈V

    Generation argmax S(s, v)s∈L

    Question answering argmax S(Q(s, sq), v)s∈L

    Disambiguation argmax S(i, v)i∈parser(s)

    Acquiring language

    Images, not videos

    Translation

    Planning

    Theory of mind

    . . .Andrei Barbu (MIT) language↔ vision August 28, 201553

  • Disambiguation

    Andrei Barbu (MIT) language↔ vision

    Danny approached the chair with a bag.

    Berzak et al 2015August 28, 201554

  • Disambiguation

    S

    NP

    NNP

    Sam

    VP

    VP

    V

    VBD

    approached

    NP

    DT

    the

    NN

    chair

    PP

    IN

    with

    NP

    DT

    a

    NN

    bag

    S

    NP

    NNP

    Sam

    VP

    V

    VBD

    approached

    NP

    NP

    DT

    the

    NN

    chair

    PP

    IN

    with

    NP

    DT

    a

    NN

    bag

    Andrei Barbu (MIT) language↔ vision

    Danny approached the chair with a bag.

    Berzak et al 2015August 28, 201555

  • VP Attachment

    Conjunction

    Logical Form

    Anaphora

    Ellipsis

    Disambiguation

    PP AttachmentDanny looked at Andrei with a telescope.

    Andrei Barbu (MIT) language↔ vision

    Courtesy of Yevgeni Berzak, Andrei Barbu, Daniel Harari, Boris Katz & Shimon Ullman. License CC BY.

    Source: Berzak, Yevgeni, Andrei Barbu, Daniel Harari, Boris Katz, and Shimon Ullman. "Do you seewhat i mean? visual resolution of linguistic ambiguities." arXiv preprint arXiv:1603.08079 (2016).

    Berzak et al 2015August 28, 201556

  • Conjunction

    Logical Form

    Anaphora

    Ellipsis

    Disambiguation

    PP AttachmentAndrei approached the person holding a green chair.

    VP Attachment

    Andrei Barbu (MIT) language↔ visionBerzak et al 2015

    August 28, 201557

  • Logical Form

    Anaphora

    Ellipsis

    Disambiguation

    PP AttachmentDanny and Andrei picked up the yellow bag and chair.

    VP Attachment

    Conjunction

    Andrei Barbu (MIT) language↔ visionBerzak et al 2015

    August 28, 201558

    Courtesy of Yevgeni Berzak, Andrei Barbu, Daniel Harari, Boris Katz & Shimon Ullman. License CC BY.

    Source: Berzak, Yevgeni, Andrei Barbu, Daniel Harari, Boris Katz, and Shimon Ullman. "Do you seewhat i mean? visual resolution of linguistic ambiguities." arXiv preprint arXiv:1603.08079 (2016).

  • Anaphora

    Ellipsis

    Disambiguation

    PP AttachmentSomeone put down the bags.

    VP Attachment

    Conjunction

    Logical Form

    Andrei Barbu (MIT) language↔ visionBerzak et al 2015

    August 28, 201559

  • Ellipsis

    Disambiguation

    PP AttachmentDanny picked up the bag and the chair. It is yellow.

    VP Attachment

    Conjunction

    Logical Form

    Anaphora

    Andrei Barbu (MIT) language↔ visionBerzak et al 2015

    August 28, 201560

    Courtesy of Yevgeni Berzak, Andrei Barbu, Daniel Harari, Boris Katz & Shimon Ullman. License CC BY.

    Source: Berzak, Yevgeni, Andrei Barbu, Daniel Harari, Boris Katz, and Shimon Ullman. "Do you seewhat i mean? visual resolution of linguistic ambiguities." arXiv preprint arXiv:1603.08079 (2016).

  • Disambiguation

    PP AttachmentDanny left Andrei. Also Yevgeni.

    VP Attachment

    Conjunction

    Logical Form

    Anaphora

    Ellipsis

    Andrei Barbu (MIT) language↔ visionBerzak et al 2015

    August 28, 201561

    Courtesy of Yevgeni Berzak, Andrei Barbu, Daniel Harari, Boris Katz & Shimon Ullman. License CC BY.

    Source: Berzak, Yevgeni, Andrei Barbu, Daniel Harari, Boris Katz, and Shimon Ullman. "Do you seewhat i mean? visual resolution of linguistic ambiguities." arXiv preprint arXiv:1603.08079 (2016).

  • Not just parse trees

    Danny and Andrei moved a chair.

    Danny and Andrei move the same chair.chair(x)

    move(Danny, x),move(Andrei, x)

    Danny and Andrei move different chairs.chair(x), chair(y)

    move(Danny, x),move(Andrei, y), x 6= y

    Andrei Barbu (MIT) language↔ vision August 28, 201562

  • Not just parse trees

    Danny and Andrei moved a chair.

    Danny and Andrei move the same chair.chair(x),person(y),person(z), y 6= z

    move(y, x),move(z, x)

    Danny and Andrei move different chairs.chair(x), chair(y),person(z),person(w), z 6= w

    move(z, x),move(w, y), x 6= y

    Andrei Barbu (MIT) language↔ vision August 28, 201563

  • S(sentence, video)

    argmin f (s, s′)s′∈Lb

    argmax∫

    S(s, v0 : v : vn)s∈L v

    S(s, v)

    Recognition S(sentence, video)

    Retrieval argmax S(s, v)v∈V

    Generation argmax S(s, v)s∈L

    Question answering argmax S(Q(s, sq), v)s∈L

    Disambiguation argmax S(i, v)i∈parser(s)

    Acquiring language argmax∏

    S(s(p), v)p s,v

    Images, not videos

    Translation

    Planning

    Theory of mind

    . . .Andrei Barbu (MIT) language↔ vision August 28, 201564

  • Language learning

    Split into two tasks:Learning word meaningLearning syntax

    Andrei Barbu (MIT) language↔ vision August 28, 201565

  • Language learning: word meaning

    person

    chair

    picked up

    approachedbackpack

    The chair approached the backpack.

    The person picked up the backpack.

    The person picked up the chair.

    Andrei Barbu (MIT) language↔ vision66

    © Journal of Artificial Intelligence Research. All rights reserved. This content is excluded fromour Creative Commons license. For more information, see https://ocw.mit.edu/help/faq-fair-use/.Source: Yu, Haonan, N. Siddharth, Andrei Barbu, and Jeffrey Mark Siskind. "A Compositional

    Framework for Grounding Language Inference, Generation, and Acquisition in Video." J.

    Artif. Intell. Res.(JAIR) 52 (2015): 601-713.

    August 28, 2015

    https://ocw.mit.edu/help/faq-fair-use/

  • Language learning: word meaning

    person

    chair

    picked up

    approachedbackpack

    he chair approached the backpack.

    The person picked up the backpack.

    The person picked up the chair.

    T

    Andrei Barbu (MIT) language↔ vision

    © Journal of Artificial Intelligence Research. All rights reserved. This content is excluded fromour Creative Commons license. For more information, see https://ocw.mit.edu/help/faq-fair-use/.Source: Yu, Haonan, N. Siddharth, Andrei Barbu, and Jeffrey Mark Siskind. "A Compositional

    Framework for Grounding Language Inference, Generation, and Acquisition in Video." J.

    Artif. Intell. Res.(JAIR) 52 (2015): 601-713.

    Yu et al 2014August 28, 201567

    https://ocw.mit.edu/help/faq-fair-use/

  • Language learning

    Split into two tasks:Learning word meaningLearning syntax

    Andrei Barbu (MIT) language↔ vision August 28, 201568

  • Language learning: syntax; in progress

    Danny approached the chair with a bag.

    parser

    S

    NP

    NNP

    Sam

    VP

    VP

    V

    VBD

    approached

    NP

    DT

    the

    NN

    chair

    PP

    IN

    with

    NP

    DT

    a

    NN

    bag

    S

    NP

    NNP

    Sam

    VP

    V

    VBD

    approached

    NP

    NP

    DT

    the

    NN

    chair

    PP

    IN

    with

    NP

    DT

    a

    NN

    bag

    · · ·

    Andrei Barbu (MIT) language↔ vision August 28, 201569

  • Language learning: syntax; in progress

    Danny approached the chair with a bag.

    ≈parser

    S

    NP

    NNP

    Sam

    VP

    VP

    V

    VBD

    approached

    NP

    DT

    the

    NN

    chair

    PP

    IN

    with

    NP

    DT

    a

    NN

    bag

    S

    NP

    NNP

    Sam

    VP

    V

    VBD

    approached

    NP

    NP

    DT

    the

    NN

    chair

    PP

    IN

    with

    NP

    DT

    a

    NN

    bag

    S

    NP

    NNP

    Sam

    VP

    VP

    V

    VBD

    approached

    NP

    DT

    the

    NN

    chair

    PP

    IN

    with

    NP

    DT

    a

    NN

    bag

    S

    NP

    NNP

    Sam

    VP

    V

    VBD

    approached

    NP

    NP

    DT

    the

    NN

    chair

    PP

    IN

    with

    NP

    DT

    a

    NN

    bag

    S

    NP

    NNP

    Sam

    VP

    VP

    V

    VBD

    approached

    NP

    DT

    the

    NN

    chair

    PP

    IN

    with

    NP

    DT

    a

    NN

    bag

    S

    NP

    NNP

    Sam

    VP

    V

    VBD

    approached

    NP

    NP

    DT

    the

    NN

    chair

    PP

    IN

    with

    NP

    DT

    a

    NN

    bag

    S

    NP

    NNP

    Sam

    VP

    VP

    V

    VBD

    approached

    NP

    DT

    the

    NN

    chair

    PP

    IN

    with

    NP

    DT

    a

    NN

    bag

    S

    NP

    NNP

    Sam

    VP

    V

    VBD

    approached

    NP

    NP

    DT

    the

    NN

    chair

    PP

    IN

    with

    NP

    DT

    a

    NN

    bag

    · · ·

    Andrei Barbu (MIT) language↔ vision August 28, 201570

  • Andrei Barbu (MIT) language↔ vision

    Courtesy of Elsevier, Inc., http://www.sciencedirect.com. Used with permission.

    Source: Pilley, John W., and Alliston K. Reid. "Border collie comprehends object

    names as verbal referents." Behavioural processes 86, no. 2 (2011): 184-195.

    Pilley and Reid 2011August 28, 201571

    http://www.sciencedirect.com

  • argmin f (s, s′)s′∈Lb

    argmax∫

    S(s, v0 : v : vn)s∈L v

    S(s, v)

    Recognition S(sentence, video)

    Retrieval argmax S(s, v)v∈V

    Generation argmax S(s, v)s∈L

    Question answering argmax S(Q(s, sq), v)s∈L

    Disambiguation argmax S(i, v)i∈parser(s)

    Acquiring language argmax∏

    S(s(p), v)p s,v

    Images, not videos S(sentence, video)

    Translation

    Planning

    Theory of mind

    . . .Andrei Barbu (MIT) language↔ vision August 28, 201572

  • argmin f (s, s′)s′∈Lb

    argmax∫

    S(s, v0 : v : vn)s∈L v

    S(s, v)

    Recognition S(sentence, video)

    Retrieval argmax S(s, v)v∈V

    Generation argmax S(s, v)s∈L

    Question answering argmax S(Q(s, sq), v)s∈L

    Disambiguation argmax S(i, v)i∈parser(s)

    Acquiring language argmax ( (p),p

    ∏S s v)

    s,v

    Images, not videos S(sentence, Flow(image))

    Translation

    Planning

    Theory of mind

    . . .Andrei Barbu (MIT) language↔ vision August 28, 201573

  • Single-frame optical flow

    Andrei Barbu (MIT) language↔ vision

    © Source Unknown. All rights reserved. This content is excluded © Source Unknown. All rights reserved. This content is excluded

    from our Creative Commons license. For more information, see from our Creative Commons license. For more information, see

    https://ocw.mit.edu/help/faq-fair-use/. https://ocw.mit.edu/help/faq-fair-use/.

    August 28, 201574

    https://ocw.mit.edu/help/faq-fair-use/https://ocw.mit.edu/help/faq-fair-use/

  • Single-frame optical flow

    Input Flow Predicted

    Andrei Barbu (MIT) language↔ vision

    © Source Unknown. All rights reserved. This content is excluded from our Creative

    Commons license. For more information, see https://ocw.mit.edu/help/faq-fair-use/.

    August 28, 201575

    https://ocw.mit.edu/help/faq-fair-use/

  • argmax∫

    S(s, v0 : v : vn)s∈L v

    S(s, v)

    Recognition S(sentence, video)

    Retrieval argmax S(s, v)v∈V

    Generation argmax S(s, v)s∈L

    Question answering argmax S(Q(s, sq), v)s∈L

    Disambiguation argmax S(i, v)i∈parser(s)

    Acquiring language argmax S(s(p), v)p

    ∏s,v

    Images, not videos S(sentence, Flow(image))

    Translation argmin f (s, s′)s′∈Lb

    Planning

    Theory of mind

    . . .Andrei Barbu (MIT) language↔ vision August 28, 201576

  • argmax∫

    S(s, v0 : v : vn)s∈L v

    S(s, v)

    Recognition S(sentence, video)

    Retrieval argmax S(s, v)v∈V

    Generation argmax S(s, v)s∈L

    Question answering argmax S(Q(s, sq), v)s∈L

    Disambiguation argmax S(i, v)i∈parser(s)

    Acquiring language argmax∏

    S(s(p), v)p s,v

    Images, not videos S(sentence, Flow(image))

    Translation argmin∫|S(s′, v)

    s′ Lb− S(s, v)

    ∈ v|

    Planning

    Theory of mind

    . . .Andrei Barbu (MIT) language↔ vision August 28, 201577

  • Statistical machine translationSam was happy

    Sam a fost fericitaСэм была счастлива

    parallel corpus

    In Thai you specify siblings by age not gender.In English you specify relative time but you don’t need to in Chinese.Guugu Yimithirr language only uses absolute directions.Many languages don’t distinguish blue/green.Swahili specifies color as “the color of X“.In Turkish you have to report if something is hearsay.

    Andrei Barbu (MIT) language↔ vision August 28, 201578

  • Translation by imagination

    sentence

    videos

    sample

    translation

    generation

    Andrei Barbu (MIT) language↔ vision August 28, 201579

  • S(s, v)

    Recognition S(sentence, video)

    Retrieval argmax S(s, v)v∈V

    Generation argmax S(s, v)s∈L

    Question answering argmax S(Q(s, sq), v)s∈L

    Disambiguation argmax S(i, v)i∈parser(s)

    Acquiring language argmax∏

    S(s(p), v)p s,v

    Images, not videos S(sentence, Flow(image∫ ))Translation argmin S(s′, v) S(s, v)

    s′∈Lb ∫v | − |Planning argmax S(s, v0 : v : vn)

    s∈L v

    Theory of mind

    . . .Andrei Barbu (MIT) language↔ vision August 28, 201580

  • Recognition S(sentence, video)

    Retrieval argmax S(s, v)v∈V

    Generation argmax S(s, v)s∈L

    Question answering argmax S(Q(s, sq), v)s∈L

    Disambiguation argmax S(i, v)i∈parser(s)

    Acquiring language argmax∏

    S(s(p), v)p s,v

    Images, not videos S(sentence, Flow(image))

    Translation argmin S(s′, v) S(s, v)s′∈Lb

    ∫v| − |

    Planning argmax∫

    S(s, v0 : v : vn)s∈L v

    Theory of mind S(s, v)

    . . .Andrei Barbu (MIT) language↔ vision August 28, 201581

  • Recognition S(sentence, video)

    Retrieval argmax S(s, v)v∈V

    Generation argmax S(s, v)s∈L

    Question answering argmax S(Q(s, sq), v)s∈L

    Disambiguation argmax S(i, v)i∈parser(s)

    Acquiring language argmax∏

    S(s(p), v)p s,v

    Images, not videos S(sentence, Flow(image∫ ))Translation argmin |S(s′, v)− S(s, v)

    s′∈Lb v|

    Planning argmax∫

    S(s, v0 : v : vn)s∈L v

    Theory of mind S(s, tracks)S(tracks, v)

    . . .Andrei Barbu (MIT) language↔ vision August 28, 201582

  • Recognition S(sentence, video)

    Retrieval argmax S(s, v)v∈V

    Generation argmax S(s, v)s∈L

    Question answering argmax S(Q(s, sq), v)s∈L

    Disambiguation argmax S(i, v)i∈parser(s)

    Acquiring language argmaxp

    ∏S(s(p), v)

    s,v

    Images, not videos S(sentence, Flow(image))

    Translation argmin S(s′, v) S(s, v)s′∈Lb

    ∫v| − |

    Planning argmax∫

    S(s, v0 : v : vn)s∈L v

    Theory of mind S(planner, tracks)S(s, tracks)S(tracks, v)

    . . .Andrei Barbu (MIT) language↔ vision August 28, 201583

  • Andrei Barbu (MIT) language↔ visionSiddharth et al 2012

    August 28, 201584

  • SegmentationParts and low-level featuresTheory of mindPhysicsModificationThe vast majority of verbs: absolve, admire, anger, approve,

    bark, bend, chase, cheat, complete, concede, discover, fire,follow, fumble, hurry, race, recruit, reject, scratch, steal,taste, want, etc.

    Metaphoric extensionetc.

    The long road ahead . . .

    Coherent stories3DForces & contact relations

    Andrei Barbu (MIT) language↔ vision August 28, 201585

  • Andrei Barbu (MIT) language↔ vision

    © sodlvs at Youtube.com. All rights reserved. This content is excluded from our Creative

    Commons license. For more information, see https://ocw.mit.edu/help/faq-fair-use/.

    August 28, 201586

    https://ocw.mit.edu/help/faq-fair-use/

  • Andrei Barbu (MIT) language↔ vision

    © Erickson. All rights reserved. This content is excluded from our Creative Commons license. For more information, see https://ocw.mit.edu/help/faq-fair-use/.

    August 28, 201587

    https://ocw.mit.edu/help/faq-fair-use/

  • Theory of mindPhysicsModificationThe vast majority of verbs: absolve, admire, anger, approve,

    bark, bend, chase, cheat, complete, concede, discover, fire,follow, fumble, hurry, race, recruit, reject, scratch, steal,taste, want, etc.

    Metaphoric extensionetc.

    The long road ahead . . .

    Coherent stories3DForces & contact relationsSegmentationParts and low-level features

    Andrei Barbu (MIT) language↔ vision August 28, 201588

  • Andrei Barbu (MIT) language↔ vision

    © Source Unknown. All rights reserved. This content is excluded from our Creative

    Commons license. For more information, see https://ocw.mit.edu/help/faq-fair-use/.

    August 28, 201589

    https://ocw.mit.edu/help/faq-fair-use/

  • The vast majority of verbs: absolve, admire, anger, approve,bark, bend, chase, cheat, complete, concede, discover, fire,follow, fumble, hurry, race, recruit, reject, scratch, steal,taste, want, etc.

    Metaphoric extensionetc.

    The long road ahead . . .

    Coherent stories3DForces & contact relationsSegmentationParts and low-level featuresTheory of mindPhysicsModification

    Andrei Barbu (MIT) language↔ vision August 28, 201590

  • Andrei Barbu (MIT) language↔ vision

    © Source Unknown. All rights reserved. This content is excluded from our Creative

    Commons license. For more information, see https://ocw.mit.edu/help/faq-fair-use/.

    August 28, 201591

    https://ocw.mit.edu/help/faq-fair-use/

  • The long road ahead . . .

    Coherent stories3DForces & contact relationsSegmentationParts and low-level featuresTheory of mindPhysicsModificationThe vast majority of verbs: absolve, admire, anger, approve,

    bark, bend, chase, cheat, complete, concede, discover, fire,follow, fumble, hurry, race, recruit, reject, scratch, steal,taste, want, etc.

    Metaphoric extensionetc.Andrei Barbu (MIT) language↔ vision August 28, 201592

  • Thanks to many great collaborators

    Yevgeni Berzak, Danny Harari, Maximilian Nickel,Candance Ross, Victor Carbarera, Santiago Perez,

    Boris Katz, Shimon Ullman, Tomaso Poggio

    Siddharth Narayanaswamy, Jeffrey Siskind,Sven Dickinson, Song Wang,Haonan Yu, Caiming Xiong

    Andrei Barbu (MIT) language↔ vision August 28, 201593

  • Recognition S(sentence, video)

    Retrieval argmax S(s, v)v∈V

    Generation argmax S(s, v)s∈L

    Question answering argmax S(Q(s, sq), v)s∈L

    Disambiguation argmax S(i, v)i∈parser(s)

    Acquiring language argmaxp

    ∏S(s(p), v)

    s,v

    Images, not videos S(sentence, Flow(image))

    Translation argmin S(s′, v) S(s, v)s′∈Lb

    ∫v| − |

    Planning argmax∫

    S(s, v0 : v : vn)s∈L v

    Theory of mind S(planner, tracks)S(s, tracks)S(tracks, v)

    . . .Andrei Barbu (MIT) language↔ vision August 28, 201594

  • MIT OpenCourseWarehttps://ocw.mit.edu

    Resource: Brains, Minds and Machines Summer CourseTomaso Poggio and Gabriel Kreiman

    The following may not correspond to a p articular course on MIT OpenCourseWare, but has beenprovided by the author as an individual learning resource.

    For information about citing these materials or our Terms of Use, visit: https://ocw.mit.edu/terms.

    https://ocw.mit.edu/termshttps://ocw.mit.edu