
1/101 A picture is worth 13.6 words Hal Daumé III, [email protected]

A picture is worth 13.6 words (on average)

Alex Berg, Amit Goyal, Tamara Berg, Jesse Dodge, Yejin Choi, Yiannis Aloimonos, Kota Yamaguchi, Alyssa Mensch, Karl Stratos, Meg Mitchell, Xufeng Han, Ching Lik Teo, Yezhou Yang

2/101
An on-paper experiment
Write a caption for this image, one sentence in length. (In English.)

3-5/101
People write weird captions
Another dream car to add to the list, this one spotted in Hanbury St.
Shot out my car window while stuck in traffic because people in Cincinatti can't drive in the rain
1. A distorted photo of a man cutting up a large cut of meat in a garage.
2. A man smiling at the camera while carving up meat.
3. A man smiling while he cuts up a piece of meat.
4. A smiling man is standing next to a table dressing a piece of venison.
5. The man is smiling into the camera as he cuts meat.

6-8/101
Two complementary questions...
Image ⇒ Text? “two women sitting brunette blonde on bench reading magazine”
Text ⇒ Image? “looking for castles in the clouds out my car window”

Understanding and Predicting Importance in Images. BBDDGHMMSSY, CVPR 2012
Detecting Visual Text. DGHMMSYCDBB, NAACL 2012
Corpus-Guided Sentence Generation of Natural Images. YTDA, EMNLP 2011
Midge: Generating Image Descriptions from Computer Vision Detections. MDGYSHMBBD, EACL 2012

9-17/101
Why do this?
● Caption Generation
  the sheep meandered along a desolate road in the highlands of Scotland through frozen grass
● Visual Scene Construction
  the small white cat is -17 inches above the hat. the tiny white illuminator is in front of the cat. it is night. the ground is red.
  the 200 foot tall dragon is facing the 100 foot tall car. The ground is a checkerboard. the sky is pink
  Coyne & Sproat, SIGGRAPH 2001. WordsEye: An Automatic Text-to-Scene Conversion System
● Training Object Detectors from Text
  “elephant in the beach”
  “a person riding a horse” ≠ Person + Horse
  Farhadi + Sadeghi, CVPR 2011. Recognition Using Visual Phrases

18-21/101
What is “visual text”?
● Photographer/viewer distinctions
  Kevin’s mom, so punxrawkin Kev’s black flag hat
● Amount of inference
  Another dream car to add to the list, this one spotted in Hanbury St.
● Temporal events
  Tuckered out from playing in Nannie’s yard.

A phrase is visual if there is a piece of the image you can cut out, place in another image, and still use the same description.

22/101
Okay, so can we detect it?
● SBU Flickr data
● 3 NPs per caption
● 800 images: ≥3 annotations
● 48k images: 1 annotation
● People largely agree (74% whatever that means...)
● 3 NPs per caption, 70% visual

23-24/101
Okay, so can a computer detect it?
Features (Inside, Before and After):
● Word + stems
● Bigrams
● Spelling
● Hypernyms

Example: Another dream car to add to the list...
Another dream car: another anoth, dream dream, car car; another_dream, dream_car; Aa+ a+ a+; Vehicle … artifact … entity
to add: to to, add add; to_add; a+ a+

≈67% AUC
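
The four feature families on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `crude_stem` is a stand-in for whatever stemmer was actually used, `TOY_HYPERNYMS` is a hypothetical lookup table standing in for a lexical resource, and the `inside`/`after` prefixes are one guess at how "Inside, Before and After" might be encoded.

```python
import re

# Hypothetical hypernym chains; a real system would query a lexical resource.
TOY_HYPERNYMS = {"car": ["vehicle", "artifact", "entity"]}


def crude_stem(word: str) -> str:
    """Very rough suffix stripper (assumption: stands in for a real stemmer)."""
    w = word.lower()
    for suffix in ("ing", "er", "ed", "s"):
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: -len(suffix)]
    return w


def spelling_shape(word: str) -> str:
    """Map characters to classes and collapse repeated classes, e.g. 'Another' -> 'Aa+'."""
    classes = "".join(
        "A" if c.isupper() else "a" if c.islower() else "0" if c.isdigit() else "-"
        for c in word
    )
    return re.sub(r"(.)\1+", r"\1+", classes)


def phrase_features(tokens, position):
    """Word, stem, spelling, hypernym and bigram features, tagged with a position."""
    feats = []
    lowered = [t.lower() for t in tokens]
    for tok, low in zip(tokens, lowered):
        feats.append(f"{position}:word={low}")
        feats.append(f"{position}:stem={crude_stem(tok)}")
        feats.append(f"{position}:shape={spelling_shape(tok)}")
        for hyper in TOY_HYPERNYMS.get(low, []):
            feats.append(f"{position}:hyper={hyper}")
    for left, right in zip(lowered, lowered[1:]):
        feats.append(f"{position}:bigram={left}_{right}")
    return feats


# The noun phrase from the slide and the two tokens that follow it.
inside = phrase_features(["Another", "dream", "car"], "inside")
after = phrase_features(["to", "add"], "after")
print(inside + after)
```

The resulting strings would typically be fed to a linear classifier as sparse binary features; the slide does not say which classifier was used.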

25-31/101
Bootstrapping visual terminology
● Start with some seeds
● Apply bootstrapping or label propagation

Noun seeds
V (visual): car, house, tree, horse, animal, man, table, bottle, woman, computer
NV (non-visual): idea, bravery, deceit, trust, dedication, anger, humour, luck, inflation, honesty

Adjective seeds
V (visual): brown, green, wooden, striped, orange, rectangular, furry, shiny, rusty, feathered
NV (non-visual): public, original, whole, righteous, political, personal, intrinsic, individual

Color: purple, blue, maroon, beige, green
Material: plastic, cotton, wooden, metallic, silver
Shape: circular, square, round, rectangular, triangular
Size: small, big, tiny, tall, huge
Surface: coarse, smooth, furry, fluffy, rough
Direction: sideways, north, upward, left, down
Pattern: striped, dotted, checked, plaid, quilted
Quality: shiny, rusty, dirty, burned, glittery
Beauty: beautiful, cute, pretty, gorgeous, lovely
Age: young, mature, immature, older, senior
Ethnicity: french, asian, american, greek, hispanic

grayish, chestnut, emerald, rufous
oblong, hemispherical, quadrangular, convex
#A81C07

≈67% AUC
≈71% AUC
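
To make "start with some seeds, apply label propagation" concrete, here is a minimal sketch. It is not the procedure used in the talk: the word graph, its edge weights, and the word "civic" are invented for illustration, and the update rule (clamp the seeds, repeatedly average the neighbours) is just one simple form of label propagation.

```python
def propagate_labels(edges, seeds, iterations=50):
    """Spread seed scores (1.0 = visual, 0.0 = non-visual) over a weighted word graph."""
    # Build a symmetric adjacency list from (word_a, word_b, weight) triples.
    neighbours = {}
    for a, b, weight in edges:
        neighbours.setdefault(a, []).append((b, weight))
        neighbours.setdefault(b, []).append((a, weight))

    # Unlabelled words start at 0.5 (undecided); seeds keep their label.
    scores = {node: seeds.get(node, 0.5) for node in neighbours}
    for _ in range(iterations):
        updated = {}
        for node, nbrs in neighbours.items():
            if node in seeds:
                updated[node] = seeds[node]  # seeds stay clamped
            else:
                total = sum(w for _, w in nbrs)
                updated[node] = sum(scores[n] * w for n, w in nbrs) / total
        scores = updated
    return scores


# Seeds come from the slide; the edges and weights are hypothetical similarities.
seeds = {"brown": 1.0, "green": 1.0, "public": 0.0, "political": 0.0}
edges = [
    ("brown", "chestnut", 1.0),
    ("green", "emerald", 1.0),
    ("chestnut", "emerald", 0.5),
    ("public", "civic", 1.0),
    ("political", "civic", 1.0),
]
scores = propagate_labels(edges, seeds)
print({word: round(score, 3) for word, score in scores.items()})
```

Words whose score ends up near 1.0 would be added to the visual list, and words near 0.0 to the non-visual list; where the similarity graph comes from is the part this sketch leaves open.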

32-33/101
But this doesn't use the images!!!
[Bar chart, axis from 50 to 95: Random, Model, Model+Lists, Human]

34-37/101
What I used to think vision did...

38/101
Now I know better....

39/101
Adding in image features
Ecuador, amazon basin, near coca, rain forest, passion fruit flower
● Does a detector corresponding to this head noun exist?
● Did it fire?
● How many times did it fire?
● How confident was the “best” firing?
● What percentage of pixels in the image are in that bounding box?
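
The five questions above map directly onto a small feature function. The sketch below is an assumption-laden illustration: the `Detection` record, the `known_detectors` set, and the choice to measure the area of the highest-confidence box are all hypothetical and not taken from the talk.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    label: str         # object class the detector fired for
    confidence: float  # detector score
    box: tuple         # (x_min, y_min, x_max, y_max) in pixels


def vision_features(head_noun, detections, image_size, known_detectors):
    """Answer the slide's five questions for one noun phrase's head noun."""
    width, height = image_size
    feats = {
        "detector_exists": head_noun in known_detectors,
        "fired": False,
        "num_firings": 0,
        "best_confidence": 0.0,
        "best_box_fraction": 0.0,
    }
    hits = [d for d in detections if d.label == head_noun]
    if not feats["detector_exists"] or not hits:
        return feats

    # Assumption: "that bounding box" means the highest-confidence firing.
    best = max(hits, key=lambda d: d.confidence)
    x0, y0, x1, y1 = best.box
    area = max(0, x1 - x0) * max(0, y1 - y0)
    feats.update(
        fired=True,
        num_firings=len(hits),
        best_confidence=best.confidence,
        best_box_fraction=area / (width * height),
    )
    return feats


# Toy detections for a 100x100 image; "flower" is the head noun of the
# last phrase in the slide's example caption.
detections = [
    Detection("flower", 0.9, (10, 10, 60, 60)),
    Detection("flower", 0.4, (0, 0, 20, 20)),
    Detection("bird", 0.3, (0, 0, 5, 5)),
]
flower = vision_features("flower", detections, (100, 100), {"flower", "bird"})
basin = vision_features("basin", detections, (100, 100), {"flower", "bird"})
print(flower)
print(basin)
```

A head noun with no matching detector, such as "basin" here, gets all-zero features, so features of this kind can only help on phrases for which a detector exists.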

40-42/101
Results with vision features
[Bar chart, axis from 50 to 95: Random, Model, Model+Lists, +Vision, Human]
Features only available on about 11% of examples
8% improvement on phrases with recognizers

43/101
Detecting on a large scale...
bird, boat, bottle, bowl

44-46/101
Predict what people will describe
1) Given an image
What do people describe?
“two women sitting brunette blonde on bench reading magazine”
women ●
bench ●
magazine ●
grass
skirt

47-51/101
Predicting what will be described
What’s in this image?
man, baby, sling, ladder, fridge, table, watermelon, chair, boxes, cups, water bottle, wall, pacifier, beard, glasses, shirt
What do people describe?
“A bearded man is holding a child in a sling.”
“A bearded man stands while holding a small child in a green sheet.”
“A bearded man with a baby in a sling poses.”
“Man standing in kitchen with little girl in green sack.”
“Man with beard and baby”

52/101
Description factors
What factors influence what someone will describe about an image?
Two kinds of factors
– Compositional
– Semantic

53-54/101
Compositional factors
● Size/Saliency
● Location
“A sail boat on the ocean.”
“Two men standing on beach.”

55-58/101
Semantic factors
● Object Type
● Nameable Scene
● Unusualness
“girl in the street”
“kitchen in house”
“elephant in the beach”
“A tree in water and a boy with a beard”

59/101
Generating captions
a) Detect objects and scenes from the input image;
b) Estimate the optimal sentence structure quadruplet T;
c) Generate a sentence from T.

60/101
Example

61/101
Sample Results

62/101
Evaluation Result

63/101
Using large corpora to compose natural captions
(why write your own material when you can just “steal” it?)

64-65/101
Composing captions
a) monkey playing in the tree canopy, Monte Verde in the rain forest
b) capuchin monkey in front of my window
c) monkey spotted in Apenheul Netherlands under the tree
d) a white-faced or capuchin in the tree in the garden
e) the monkey sitting in a tree, posing for his picture

66-68/101
Captioning with (some) evidence
Caption images where:
● We assume some evidence for 1 object
  Tag: “mare” (evidence for horse)
&
● Object detector is confident
  High detection score

69/101
Generation: Grab 'N Mash
Grab phrases based on image similarity between query and captioned database:
● Object detection similarity: NPs, VPs
● Stuff detection similarity: PPs
● Scene similarity: PPs
Mash phrases: compose descriptions using simple rule-based concatenation

70-72/101
Getting NPs – Objects
Detect: fruit
Find matching fruit detections by color similarity
Tray of glace fruit in the market at Nice, France
Fresh fruit in the market
A box of oranges was just catching the sun, bringing out detail in the skin.
The street market in Santanyi, Mallorca is a must for the oranges and local crafts.
An orange tree in the backyard of the house.
mandarin oranges in glass bowl

73/101
Getting NPs – Objects
The muddy elephant; An elephant; small elephant; A very large and seemingly old elephant; musk male elephant; African elephant; the temple elephant
Fushia flower; a flower; a pink zinna flower; This beautiful flower; a roman pink flower; a tiny pink flower; pink bursting flowers; a perfectly pink gerbera daisy
a lonesome duck; a native new zealand duck; The duck; male Mallard duck; several other ducks; a so-called navigation duck; this duck; a duck; duck; mandarin duck

74/101
Getting VPs – objects
Detect: cow
Find matching cow detections by shape/pose similarity
theses cows live in the field behind my house
A cow eating flowers in the south of the Netherlands.
The cow was more interested in eating than looking at me with a camera!
While cycling north on Tremaine Road near Milton, this cow gazed across the road intently.

75/101
Getting PPs – stuff
Detect: grass
Find matching grass detections by color similarity
green manure in the veg field - Plaw Hatch
Found on hawthorn in boggy grass field
Sheep in a field spotted during a coastal drive from Tramore to Dungervan
I am happy in a field of green Maryland grass

76/101
Getting PPs – scenes
Extract scene descriptor
Find matching images by scene similarity
View from our B&B in this photo
Pedestrian street in the Old Lyon with stairs to climb up the hill of fourviere
I'm about to blow the building across the street over with my massive lung power.
Only in Paris will you find a bottle of wine on a table outside a bookstore

77-81/101
Composing captions
NP (object color): the sheep
VP (object pose): meandered along a desolate road
PP (scene): in the highlands of Scotland
PP (stuff): through frozen grass
Various composition patterns:
● NP VP
● NP PP_stuff
● NP PP_scene
● …
● NP VP PP_scene PP_stuff
the sheep meandered along a desolate road in the highlands of Scotland through frozen grass
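
The "mash" step is described as simple rule-based concatenation, which can be sketched as follows. Only the four patterns spelled out on the slide are included (the slide elides others with "…"), and the `PATTERNS` list, the `compose` helper, and the slot names as dictionary keys are illustrative choices rather than the system's actual code.

```python
# The four composition patterns listed on the slide, as slot sequences.
PATTERNS = [
    ("NP", "VP"),
    ("NP", "PP_stuff"),
    ("NP", "PP_scene"),
    ("NP", "VP", "PP_scene", "PP_stuff"),
]


def compose(phrases, pattern):
    """Concatenate the retrieved phrases named by `pattern`; None if a slot is missing."""
    if any(slot not in phrases for slot in pattern):
        return None
    return " ".join(phrases[slot] for slot in pattern)


# Phrases from the slide's running example, keyed by the slot they fill.
phrases = {
    "NP": "the sheep",
    "VP": "meandered along a desolate road",
    "PP_scene": "in the highlands of Scotland",
    "PP_stuff": "through frozen grass",
}

captions = []
for pattern in PATTERNS:
    caption = compose(phrases, pattern)
    if caption is not None:
        captions.append(caption)

for caption in captions:
    print(caption)
```

The longest pattern reproduces the caption shown on the slide; how the system chooses among the candidate patterns is not stated in the deck.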

82/101
Good results
cat enjoys hiding under the tree
A female Monarch butterfly was visiting the plant in my front yard in Devon 17/10/10
Stained glass window depicting Christ and numerous saints in Washington National Cathedral in the Eglise
A double-decker bus under some spreading shade trees
her flower girl dress designed by Mainbocher in the house
A duck was having a bath in the harbor at whitehaven, cumbria, england in the water near Camley St

83-86/101
Not so good results
Language issues
● A Moo cow tied up around the city eating grass in various places under the tree at the young tree
● male tiger sighting in twelve months of a street
Vision issues
● The silhouetted building and cross stands under water around Loon Mountain
● a girl walking by in a green field in the sun
Just plain silly
● dogs running pic, this time, racing through the sea at Fraisthorpe near Bridlington of Christmas tree in bed
● bike was left here by an ancient civilization not as sophisticated as our own in the grass of granite

87/101
Open question...
➢ Can we do this without using pre-defined object/scene/etc. detectors?
➢ Build a representation of each image in the database
➢ Build a representation of the test image
➢ Find 10 most similar database images
➢ Merge their NL descriptions using text-to-text generation techniques
➢ Q: Where do these representations come from???
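
The retrieval half of this proposal (represent, then find the most similar database images) can be sketched with cosine similarity over feature vectors. Everything numeric here is hypothetical: the two-dimensional vectors are made up, cosine similarity is just one plausible choice, and the slide's open question of where the representations come from is left unanswered. The final merge step (text-to-text generation) is not implemented.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_captions(query_vec, database, k=10):
    """Return the captions of the k database images most similar to the query."""
    ranked = sorted(
        database, key=lambda item: cosine(query_vec, item[0]), reverse=True
    )
    return [caption for _, caption in ranked[:k]]


# Toy database of (image representation, caption); the vectors are invented.
database = [
    ([0.9, 0.2], "capuchin monkey in front of my window"),
    ([1.0, 0.0], "the monkey sitting in a tree, posing for his picture"),
    ([0.0, 1.0], "Fresh fruit in the market"),
]
query = [1.0, 0.1]

# k=2 keeps the toy example small; the slide proposes the 10 most similar images.
top = nearest_captions(query, database, k=2)
print(top)
```

The retrieved captions would then be handed to a merging step to produce a single description.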

88-90/101
And why are we trying to do this...???
● Captioning the world for people with visual impairments
  ● But the captions we have are not really descriptive of the world
● Use vision to “ground out” language
  ● Is it turtles all the way down?
● That's how babies work!
  ● Sadly we don't have baby-esque robots yet

91/101
Why work on a task at all?
● A solution is of benefit to society
● The process focuses attention on phenomena that are worthy of study
● What is worthy of study? (IMO)
  ● Low-level linguistic phenomena that hide in the tail
  ● Human-like abilities to generalize from small data
  ● Very basic learning of correlations between different modalities (operant conditioning)
[Image: René Descartes (1596-1650)]

92-95/101
What about 2nd language learning?
● Obvious problems
  ● Assumes knowledge of 1st language
  ● Assumes knowledge of the world
  ● Still don't have a robot...
● But we do have software with exercises for SLA
It's hard for people, too!

96-97/101
Aspects of computational 2ndLL
● Very specific linguistic variants
  ● Number, case, agreement, etc.
  ● Not enough to get the majority case
● Focus on subtle visual differences

98-99/101
Aspects of computational 2ndLL
● AI-style reasoning & one-shot learning
● “It's learnable” proof of concept:

100/101
What is needed to solve this?
● Linguistic model over character sequences (words not okay!) w/o any L-specific background
● Pre-trained (?) visual detectors for objects, poses and physical relationships (e.g., gaze)
● Ability to reason and generalize from a few examples

101/101
Thanks! Questions?

Alex Berg, Amit Goyal, Tamara Berg, Jesse Dodge, Yejin Choi, Yiannis Aloimonos, Kota Yamaguchi, Alyssa Mensch, Karl Stratos, Meg Mitchell, Xufeng Han, Ching Lik Teo, Yezhou Yang