sort story: sorting jumbled images and captions into ...carjun/emnlp16-poster.pdf · sort story:...

1
] CNN Correct and Incorrect Predictions Relative position prediction (Neural Position Embeddings NPE) Approach Motivation Correct predictions Incorrect predictions Learn the temporal order of a sequence of events — learn multimodal temporal common sense via images and captions Skipthought Applications Task Sort Story: Sorting Jumbled Images and Captions into Stories Jumbled set of aligned image-caption pairs Input: a group of friends got together to celebrate st. [female]’s day. after that, everyone finished off the night with dancing. everyone waited patiently for the food to arrive. when the food arrived it was all very delicious. then everyone got together for a group photo. then everyone got together for a group photo. everyone waited patiently for the food to arrive. after that, everyone finished off the night with dancing. a group of friends got together to celebrate st. [female]’s day. when the food arrived it was all very delicious. Ordered sequence that forms a coherent story Output: Dataset Question answering Multi-document summarization Human-AI communication Harsh Agrawal* ,1 , Arjun Chandrasekaran* ,1,2 , Dhruv Batra 3,1 , Devi Parikh 3,1 , Mohit Bansal 4,2 1 Virginia Tech, 2 Toyota Technological Institute, 3 Georgia Institute of Technology, 4 UNC Chapel Hill Relative position prediction (Pairwise) Classify Position Classify Relative Position Concatenate Quantitative Results emnlp 2016 Qualitative Analysis x i Confusion matrix for predicting position SK = Skipthought Img = Image CNN activations UN = Unary NPE = Neural Position Embeddings so we had a gathering with tons of food . we were sending off our friend . we will miss him ! we each took turns writing down some memories . too much junk food was consumed by all . the group shot was perfect ! i love tripods and self-timers . we went out by the water for the party . the view was gorgeous . the women stood on the deck and conversed with each other . the children went down on the dock and played in the water . later at night there were fireworks . and of course dessert . everyone showed up to my apartment last week for the party . there were a lot of people . we all had to sit very close to each to fit . we had some food delivered . it was delicious . afterward we all sat and relaxed while drinking some tea . a man is holding his child . a family is holding a baby . a grandmother is holding her grandchild . people are sitting on the floor . Correct Positions: a man is holding a baby . 5 4 2 1 3 Correct Positions: 2 5 4 1 3 people are upset over terrorist attacks in their country . one of the protest leaders talks to the crowd and demands change . everyone is coming together to demand peace . protests are happening in location . proud people hold the country ’s flag in solidarity . Image captions are narrative and conversational Temporal common sense Less certainty about the middle elements of the story Model has learnt the 3-act structure in stories: setup, middle, and climax First and last story elements are often predicted correctly Word clouds of discriminative words at each sentence position in the story GRU MLP MLP 4800-d 1024-d 1024-d Shared Weights MLP MLP 4096-d 1024-d 1024-d ] Concatenate Train ~ 40K stories, Val ~ 5K stories, Test ~ 5K stories Absolute position prediction (Unary) 1 2 5 4 3 4800-d GRU MLP MLP MLP x i , x j = Embedded sentences x i x j O D 2 D 1 + R d Embedding space >α >α Caption i Caption j A set of 5 images and associated narrative captions form a story 5 4 2 1 3 Shared Weights 1 0 (i > j) (i < j) MLP CNN Caption i occurs earlier than Caption j in ground truth LSTM LSTM MLP x j MLP L ij Skipthought L ij = ||max(0, - (x j - x i ))|| 2 Loss = 1<=i<j =n L ij Lower is better Higher is better PW = Pairwise 0.0 0.3 0.5 0.8 1.1 1.3 1.6 Random UN SK UN SK+Img NPE PW SK PW SK+Img Voting (PW Sk+Img+NPE) Spearman’s correlation Pairwise comparison Avg. distance Word size relative frequency of occurring in that position α Unary < NPE < Pairwise < Voting Text < Img+Text We use Microsoft’s SIND (Sequential Image Narrative Dataset) Caption 4096-d Pre-trained Caption i Caption j CNN 4096-d Pre-trained GRU 4800-d All events in the story are similar No clear temporal order of events

Upload: others

Post on 11-May-2020

12 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Sort Story: Sorting Jumbled Images and Captions into ...carjun/emnlp16-poster.pdf · Sort Story: Sorting Jumbled Images and Captions into Stories Jumbled set of aligned image-caption

]CNN

Correct and Incorrect Predictions

Relative position prediction (Neural Position Embeddings — NPE)

ApproachMotivationCorrect predictions

Incorrect predictions

Learn the temporal order of a sequence of events — learn multimodal temporal common sense via images and captions

Skipthought

Applications

Task

Sort Story: Sorting Jumbled Images and Captions into Stories

Jumbled set of aligned image-caption pairs

Input:

a group of friends got together to celebrate

st. [female]’s day.

after that, everyone finished off the night

with dancing.

everyone waited patiently for the food

to arrive.

when the food arrived it was all very delicious.

then everyone got together for a group photo.

then everyone got together for a group photo.

everyone waited patiently for the food to arrive.

after that, everyone finished off the night

with dancing.

a group of friends got together to celebrate

st. [female]’s day.

when the food arrived it was all very delicious.

Ordered sequence that forms a coherent story

Output:

Dataset

• Question answering • Multi-document summarization • Human-AI communication

Harsh Agrawal*,1, Arjun Chandrasekaran*,1,2, Dhruv Batra3,1, Devi Parikh3,1, Mohit Bansal4,2 1Virginia Tech, 2Toyota Technological Institute, 3Georgia Institute of Technology, 4UNC Chapel Hill

Relative position prediction (Pairwise)

ClassifyPosition

Classify Relative Position

Concatenate

Quantitative Results

emnlp2016

Qualitative Analysis

xi

Confusion matrix for predicting position

SK = Skipthought Img = Image CNN activations

UN = Unary NPE = Neural Position Embeddings

so we had a gathering with tons of food .

we were sending off our friend . we will

miss him !

we each took turns writing down some memories .

too much junk food was

consumed by all .

the group shot was perfect ! i

love tripods and self-timers .

we went out by the water for the party . the view was gorgeous .

the women stood on the

deck and conversed with

each other .

the children went down on the

dock and played in the water .

later at night there were fireworks .

and of course dessert .

everyone showed up to my

apartment last week for the party .

there were a lot of people .

we all had to sit very close to each to fit .

we had some food delivered . it was delicious .

afterward we all sat and relaxed while drinking

some tea .

a man is holding his

child .

a family is holding a baby .

a grandmother is holding her grandchild .

people are sitting on the

floor .

Correct Positions:

a man is holding a

baby .

5 4 2 1 3

Correct Positions:

2 5 4 1 3

people are upset over terrorist

attacks in their country .

one of the protest leaders talks to the crowd and

demands change .

everyone is coming together

to demand peace .

protests are happening in

location .

proud people hold the

country ’s flag in solidarity .

• Image captions are narrative and conversational

Temporal common sense

• Less certainty about the middle elements of the story

Model has learnt the 3-act structure in stories: setup, middle, and climax

• First and last story elements are often predicted correctly

• Word clouds of discriminative words at each sentence position in the story

GRU

MLP

MLP4800-d

1024-d

1024-d

Shared Weights

MLP

MLP

4096-d 1024-d

1024-d

]Concatenate

• Train ~ 40K stories, Val ~ 5K stories, Test ~ 5K stories

Absolute position prediction (Unary)

1 2

543

4800-dGRU

MLP

MLPMLP

xi , xj = Embedded sentences

xi

xj

O D2

D1

+Rd Embedding space

Caption i

Caption j

• A set of 5 images and associated narrative captions form a story

5

4

2

1

3

Shared Weights

1

0 (i > j)

(i < j)MLP

CNN

Caption i occurs earlier than Caption j in ground truth

LSTM

LSTM

MLP

xjMLP

Lij

Skipthought

Lij = ||max(0,↵� (xj � xi))||2 Loss =X

1<=i<j=n

Lij

Lower is betterHigher is better

PW = Pairwise

0.00.30.50.81.11.31.6

Random UN SK UN SK+Img NPE PW SK PW SK+Img Voting (PW Sk+Img+NPE)

Spearman’s correlation Pairwise comparison Avg. distance

• Word size relative frequency of occurring in that position

α

Unary < NPE < Pairwise < Voting

Text < Img+Text

• We use Microsoft’s SIND (Sequential Image Narrative Dataset)

Caption

4096-dPre-trained

Caption i

Caption j

CNN 4096-dPre-trained

GRU4800-d

All events in the story are similar

No clear temporal order of events