speaking the same language: matching machine to human ... · wave apersonona surfboardridingawave...

1
Speaking the Same Language: Matching Machine to Human Captions by Adversarial Training Rakshith Shetty 1 , Marcus Rohrbach 2,3 , Lisa Anne Hendricks 2 , Mario Fritz 1 , Bernt Schiele 1 1 Max Planck Institute for Informatics 2 UC Berkeley EECS 3 Facebook AI Research Contact: [email protected] https://goo.gl/3yRVnq Summary Motivation: Image captioning models generate correct but “safe” cap- tions severely lacking in diversity compared to human written captions. Generative Adversarial Network framework used to train our caption generator Core Idea: Use GAN [1] framework to better match the data distribution. Generator produces multiple captions for an image by sampling. Discriminator scores this caption set on correctness and diversity. Significantly higher diversity, larger vocabulary, more novel sentences, better match of language statistics, while maintaining same level of correctness. Diverse captions on similar images Ours a group of people standing around a shop a group of young people standing around talking on cell phones a group of soldiers stand in front of microphones a couple of women standing next to a man in front of a store a group of people posing for a photo in formal wear Baseline a group of people standing around a table Ours a person on skis jumping over a ramp a skier is making a turn on a course a person cross country skiing on a trail a skier is headed down a steep slope a cross country skier makes his way through the snow Baseline a man riding skis down a snow covered slope Ours a surfer rides a large wave in the ocean a surfer is falling off his board as he rides a wave a person on a surfboard riding a wave a man surfing on a surfboard in rough waters a surfer rides a small wave in the ocean Baseline a man riding a wave on top of a surfboard Ours a bathroom with a walk in shower and a sink a dirty bathroom with a broken toilet and sink a view of a very nice looking rest room a white toilet in a public restroom stall a small bathroom has a broken toilet and a broken sink Baseline a bathroom with a toilet and a sink Model Generator 3-layers LSTM with residual connections. Use Gumbel-softmax approximation [2] for differentiability. Gumbel-Max Trick: r = one_hot arg max i (g i + log θ i ) Softmax approximation: r = softmax (g i + log θ i ) Feature matching loss [3] helps. Discriminator Evaluate a set of five captions per image. Compute two distances with image and sentence embeddings: Image to sentence distances – for semantic correctness Intra-sentence distances – for sufficient diversity Quantitative Results Adversarial model better matches the n-gram distribution of the dataset. Figure compares n-gram count ratios of the generated captions to true test set captions. Scatter plots show the n-gram count-ratios as a function of counts on training set. Adjoining the scatter plots on the right are the histogram plots of the count-ratios. Vocab- % Novel Method n Div-1 Div-2 mBleu-4 ulary Sentences Base-beamsearch 1 of 5 756 34.18 5 of 5 0.28 0.38 0.78 1085 44.27 Base-sampling 1 of 5 839 52.04 5 of 5 0.31 0.44 0.68 1460 55.24 1 of 5 1508 68.62 Adv-beamsearch 5 of 5 0.34 0.44 0.70 2176 72.53 1 of 5 1616 73.92 Adv-sampling 5 of 5 0.41 0.55 0.51 2671 79.84 Human 1 of 5 3347 92.80 captions 5 of 5 0.53 0.74 0.20 7253 95.05 Adversarial model has significantly better diversity statistics. Div-1 and Div-2 measure the n-gram uniqueness in the 5 samples. mBleu-4 measures the similarity in terms of Bleu-4. Vocabulary size increases by 100% and 82% when using beamsearch and sampling respectively with the adversarial model. Comparison Adversarial - Better Adversarial - Worse Beamsearch 36.9 34.8 Sampling 35.7 33.2 Human evaluation shows that the adversarial model is on par to the baseline model in correctness. Five human evaluators were asked to pick the more correct caption of the two on 482 random images. Image Feature Evalset size (p) Feature Matching Meteor Div-2 Vocab. Size Baseline Model with VGG features 0.247 0.44 1367 VGG 1 No 0.179 0.40 812 VGG 5 No 0.197 0.52 1810 VGG 5 yes 0.207 0.59 2547 ResNet 5 yes 0.236 0.55 2671 Ablation study shows that multi-caption evaluation and feature matching are key to increasing diversity Comparison done on the validation set. Switching to ResNet features help improve semantics References [1]I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in NIPS, 2014. [2] E. Jang, S. Gu, and B. Poole, “Categorical reparameterization with gumbel-softmax,” in ICLR, 2016. [3] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen, “Improved techniques for training gans,” in NIPS, 2016.

Upload: others

Post on 18-Oct-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Speaking the Same Language: Matching Machine to Human ... · wave apersonona surfboardridingawave amansurfingona surfboardinrough waters asurferridesasmall waveintheocean Baseline

Speaking the Same Language: Matching Machine to Human Captions by Adversarial TrainingRakshith Shetty1, Marcus Rohrbach2,3, Lisa Anne Hendricks2, Mario Fritz1, Bernt Schiele1

1Max Planck Institute for Informatics 2UC Berkeley EECS 3Facebook AI ResearchContact:[email protected]://goo.gl/3yRVnq

Summary

Motivation: Image captioning models generate correct but “safe” cap-tions severely lacking in diversity compared to human written captions.

Generative Adversarial Network framework used to train our caption generator

Core Idea:•Use GAN [1] framework to better match the data distribution.•Generator produces multiple captions for an image by sampling.•Discriminator scores this caption set on correctness and diversity.

Significantly higher diversity, larger vocabulary, morenovel sentences, better match of language statistics,while maintaining same level of correctness.

Diverse captions on similar images

Ours a group of peoplestanding around a shop

a group of youngpeople standing aroundtalking on cell phones

a group of soldiersstand in front ofmicrophones

a couple of womenstanding next to a manin front of a store

a group of peopleposing for a photo informal wear

Baseline a group of people standing around a table

Ours a person on skisjumping over a ramp

a skier is making aturn on a course

a person cross countryskiing on a trail

a skier is headed downa steep slope

a cross country skiermakes his way throughthe snow

Baseline a man riding skis down a snow covered slope

Ours a surfer rides a largewave in the ocean

a surfer is falling off hisboard as he rides awave

a person on asurfboard riding a wave

a man surfing on asurfboard in roughwaters

a surfer rides a smallwave in the ocean

Baseline a man riding a wave on top of a surfboard

Ours a bathroom with awalk in shower and asink

a dirty bathroom witha broken toilet and sink

a view of a very nicelooking rest room

a white toilet in apublic restroom stall

a small bathroom hasa broken toilet and abroken sink

Baseline a bathroom with a toilet and a sink

Model

Generator

• 3-layers LSTM with residual connections.•Use Gumbel-softmax approximation [2] fordifferentiability.Gumbel-Max Trick: r = one_hot

argmaxi

(gi + log θi)

Softmax approximation: r′ = softmax (gi + log θi)•Feature matching loss [3] helps.

Discriminator

•Evaluate a set of five captions per image.•Compute two distances with image and sentenceembeddings:• Image to sentence distances – for semanticcorrectness

• Intra-sentence distances – for sufficient diversity

Quantitative Results

100 101 102 103 104 105

Uni-gram counts

0.0

0.5

1.0

1.5

2.0

2.5

3.0

3.5

4.0

4.5

Mea

n U

ni-g

ram

cou

nt ra

tio

adversarialbaselinetest-references

0 50 100 150 200 250 300 350 400 450number of words

0.0

0.5

1.0

1.5

2.0

2.5

3.0

3.5

4.0

Uni

-gra

m c

ount

ratio

adversarialbaselinetest-references

100 101 102 103 104

Bi-gram counts

0.0

0.5

1.0

1.5

2.0

2.5

3.0

3.5

4.0

4.5

Mea

n Bi

-gra

m c

ount

ratio

adversarialbaselinetest-references

0 50 100 150 200 250 300 350 400 450number of Bi-grams

0.0

0.5

1.0

1.5

2.0

2.5

3.0

3.5

4.0

Bi-g

ram

cou

nt ra

tio

adversarialbaselinetest-references

100 101 102 103

Tri-gram counts

0.0

0.5

1.0

1.5

2.0

2.5

3.0

3.5

4.0

4.5

Mea

n Tr

i-gra

m c

ount

ratio

adversarialbaselinetest-references

0 50 100 150 200 250 300 350 400number of Tri-grams

0.0

0.5

1.0

1.5

2.0

2.5

3.0

3.5

4.0

Tri-g

ram

cou

nt ra

tio

adversarialbaselinetest-references

Adversarial model better matches the n-gram distribution of the dataset. Figure compares n-gram count ratios of the generated captions to true test set captions.Scatter plots show the n-gram count-ratios as a function of counts on training set. Adjoining the scatter plots on the right are the histogram plots of the count-ratios.

Vocab- % NovelMethod n Div-1 Div-2 mBleu-4 ulary Sentences

Base-beamsearch 1 of 5 – – – 756 34.185 of 5 0.28 0.38 0.78 1085 44.27

Base-sampling 1 of 5 – – – 839 52.045 of 5 0.31 0.44 0.68 1460 55.241 of 5 – – – 1508 68.62Adv-beamsearch 5 of 5 0.34 0.44 0.70 2176 72.531 of 5 – – – 1616 73.92Adv-sampling 5 of 5 0.41 0.55 0.51 2671 79.84

Human 1 of 5 – – – 3347 92.80captions 5 of 5 0.53 0.74 0.20 7253 95.05

Adversarial model has significantly better diversity statistics. Div-1and Div-2 measure the n-gram uniqueness in the 5 samples. mBleu-4 measuresthe similarity in terms of Bleu-4. Vocabulary size increases by 100% and 82%when using beamsearch and sampling respectively with the adversarial model.

Comparison Adversarial - Better Adversarial - WorseBeamsearch 36.9 34.8Sampling 35.7 33.2

Human evaluation shows that the adversarial model is on par to thebaseline model in correctness. Five human evaluators were asked to pick themore correct caption of the two on 482 random images.

Image Feature Evalset size (p) Feature Matching Meteor Div-2 Vocab. SizeBaseline Model with VGG features 0.247 0.44 1367VGG 1 No 0.179 0.40 812VGG 5 No 0.197 0.52 1810VGG 5 yes 0.207 0.59 2547ResNet 5 yes 0.236 0.55 2671

Ablation study shows that multi-caption evaluation and featurematching are key to increasing diversity Comparison done on the validationset. Switching to ResNet features help improve semantics

References[1] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair,

A. Courville, and Y. Bengio, “Generative adversarial nets,” in NIPS, 2014.[2] E. Jang, S. Gu, and B. Poole, “Categorical reparameterization with gumbel-softmax,”

in ICLR, 2016.

[3] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen,“Improved techniques for training gans,” in NIPS, 2016.