
Page 1

CMU Sinbad's Submission for the DSTC7 AVSD Challenge

Ramon Sanabria*, Shruti Palaskar* and Florian Metze
Language Technologies Institute, Carnegie Mellon University
*equal contribution

Page 2

Task Description

● Generate system responses in a dialog about an input video.
● Dialog systems need to understand scenes.
● Task:
  ○ Visual question answering (VQA).
  ○ Video description.

Page 3

Investigating Different Visual Features


Page 4

Visual Features

● Place/Scene Recognition
  ○ Source: Image. Target: Scene context.
  ○ Model: http://places2.csail.mit.edu/download.html
  ○ Example: "riding arena"
● Object Recognition
  ○ Source: Image. Target: Object.
  ○ Model: http://www.image-net.org/
  ○ Example: "tree"
● Action Features
  ○ Source: Video (16 frames). Target: Action.
  ○ Model: Hara et al. 2018.
  ○ Example: "arranging flowers"
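For the image-level features, a minimal extraction sketch; it assumes torchvision's ImageNet-pretrained ResNet-50 as the object network (a Places2 checkpoint from the linked page would be plugged in the same way), as an illustration rather than the exact networks used:

```python
# Sketch: per-frame feature extraction with a pretrained image CNN.
# Assumption: torchvision's ImageNet ResNet-50 stands in for the object
# network; Places2 weights would replace these for scene features.
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

resnet = models.resnet50(pretrained=True).eval()
# Drop the classifier head: keep the 2048-d pooled feature per frame.
feature_net = torch.nn.Sequential(*list(resnet.children())[:-1])

@torch.no_grad()
def frame_features(frame_paths):
    batch = torch.stack([preprocess(Image.open(p).convert("RGB"))
                         for p in frame_paths])
    return feature_net(batch).flatten(1)  # (num_frames, 2048)
```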

Page 5

DSTC7 Baseline

● Sequence-to-sequence model (Alamri et al. 2018).
● Simple concatenation of modality features ("naive fusion").
● Encoder: 2 layers, 128 units.
● Decoder: 2 layers, 128 units.

[Diagram (Model Details): each input passes through a 2-layer BiGRU encoder; the encoder outputs are combined by naive fusion, and an attentive decoder generates the answer.]
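A minimal sketch of what "naive fusion" amounts to: pooled per-modality vectors are simply concatenated before decoding (the tensor sizes below are illustrative):

```python
import torch

def naive_fusion(features):
    """Baseline-style fusion: concatenate per-modality vectors into one.
    features: list of (batch, dim_k) tensors, one per modality."""
    return torch.cat(features, dim=-1)

# e.g. a 256-d text state (2 x 128 BiGRU directions) plus pooled visual features
fused = naive_fusion([torch.randn(8, 256), torch.randn(8, 2048)])  # (8, 2304)
```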

Page 6

Baseline Results: Feature Comparison

            BLEU-4   ROUGE-L
Baseline     0.084    0.291
Place        0.082    0.286
Object       0.083    0.287
Action       0.079    0.284
All          0.085    0.287

● Results on the proto set.
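Scores like the ones above can be computed with off-the-shelf metrics; a sketch assuming sacrebleu and rouge-score as stand-ins for the official challenge scoring scripts:

```python
# Sketch: corpus-level BLEU-4 and averaged ROUGE-L over generated answers.
import sacrebleu
from rouge_score import rouge_scorer

def evaluate(hypotheses, references):
    # sacrebleu reports BLEU on a 0-100 scale; divide by 100 to match
    # the 0-1 numbers in the tables here.
    bleu4 = sacrebleu.corpus_bleu(hypotheses, [references]).score / 100.0
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    rouge_l = sum(scorer.score(ref, hyp)["rougeL"].fmeasure
                  for hyp, ref in zip(hypotheses, references)) / len(hypotheses)
    return bleu4, rouge_l
```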

Page 7

Our Models

Page 8

Model Details

● Sequence-to-sequence model with attention (Bahdanau et al. 2015).
● Bidirectional GRU encoder-decoder.
● Encoder: 2 layers, 256 units.
● Decoder: 2-layer conditional GRU, 256 units.

[Diagram: a 2-layer BiGRU encoder feeds an attentive decoder that generates the answer.]
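A minimal PyTorch sketch of an encoder and attention with the sizes on this slide; the class and parameter names are illustrative (the actual system was built with the nmtpytorch toolkit):

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    # 2-layer bidirectional GRU with 256 units, as on the slide.
    def __init__(self, vocab_size, emb_dim=256, hidden=256):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.gru = nn.GRU(emb_dim, hidden, num_layers=2,
                          bidirectional=True, batch_first=True)

    def forward(self, tokens):           # tokens: (batch, seq)
        states, _ = self.gru(self.emb(tokens))
        return states                    # (batch, seq, 2 * hidden)

class AdditiveAttention(nn.Module):
    # Bahdanau-style attention, as used by the attentive decoder.
    def __init__(self, enc_dim, dec_dim, attn_dim=256):
        super().__init__()
        self.w_enc = nn.Linear(enc_dim, attn_dim)
        self.w_dec = nn.Linear(dec_dim, attn_dim)
        self.v = nn.Linear(attn_dim, 1)

    def forward(self, dec_state, enc_states):
        # dec_state: (batch, dec_dim); enc_states: (batch, seq, enc_dim)
        scores = self.v(torch.tanh(self.w_enc(enc_states)
                                   + self.w_dec(dec_state).unsqueeze(1)))
        alpha = torch.softmax(scores, dim=1)        # (batch, seq, 1)
        context = (alpha * enc_states).sum(dim=1)   # (batch, enc_dim)
        return context, alpha
```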

Page 9

Text-only Model

[Diagram: Summary + Question → Text Encoder (2-layer BiGRU) → Attentive Decoder → Answer]

BLEU-4: 0.105    ROUGE-L: 0.330
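The slide specifies only that the encoder reads "Summary + Question"; a hypothetical sketch of assembling that input, where the separator token and lowercasing are assumptions:

```python
def build_text_input(summary, question, sep="<sep>"):
    # Hypothetical preprocessing; the separator token is an assumption.
    return f"{summary.strip()} {sep} {question.strip()}".lower()

build_text_input("A man paints a wall.", "Is he talking?")
# -> 'a man paints a wall. <sep> is he talking?'
```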

Page 10

Action-only Model

[Diagram: ResNeXt action prediction → 2-layer BiGRU → Decoder → Answer]

BLEU-4: 0.085    ROUGE-L: 0.294
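Clip-level action features of this kind can be extracted from 16-frame chunks; a sketch using torchvision's r3d_18 as a stand-in for the 3D ResNeXt of Hara et al. 2018:

```python
# Sketch: clip-level features from 16 RGB frames with a pretrained 3D CNN.
# Assumption: r3d_18 stands in for the 3D ResNeXt the slides use.
import torch
import torchvision.models.video as video_models

model = video_models.r3d_18(pretrained=True).eval()
# Drop the classifier: keep the 512-d pooled clip feature.
backbone = torch.nn.Sequential(*list(model.children())[:-1])

@torch.no_grad()
def action_features(clip):
    # clip: (batch, 3, 16, H, W) float tensor of normalized RGB frames.
    return backbone(clip).flatten(1)  # (batch, 512)
```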

Page 11

Hierarchical Attention for Text + Video

[Diagram: Text Encoder and ResNeXt action prediction feed a Multimodal Decoder over the output vocabulary (Libovický et al. 2017)]

BLEU-4: 0.112    ROUGE-L: 0.338
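A self-contained sketch of the hierarchical attention idea (Libovický and Helcl 2017): attend within each modality first, then attend over the projected per-modality contexts. All names and dimensions are illustrative:

```python
import torch
import torch.nn as nn

def additive_attention(query, keys, w_k, w_q, v):
    # query: (batch, d_q); keys: (batch, n, d_k)
    scores = v(torch.tanh(w_k(keys) + w_q(query).unsqueeze(1)))
    alpha = torch.softmax(scores, dim=1)   # (batch, n, 1)
    return (alpha * keys).sum(dim=1), alpha

class HierarchicalAttention(nn.Module):
    """Attend within each modality, then over modality contexts."""
    def __init__(self, mod_dims, dec_dim, attn_dim=256):
        super().__init__()
        # First-level (within-modality) attention parameters.
        self.w_k1 = nn.ModuleList([nn.Linear(d, attn_dim) for d in mod_dims])
        self.w_q1 = nn.ModuleList([nn.Linear(dec_dim, attn_dim) for _ in mod_dims])
        self.v1 = nn.ModuleList([nn.Linear(attn_dim, 1) for _ in mod_dims])
        # Project contexts to a shared space for the second level.
        self.proj = nn.ModuleList([nn.Linear(d, attn_dim) for d in mod_dims])
        self.w_k2 = nn.Linear(attn_dim, attn_dim)
        self.w_q2 = nn.Linear(dec_dim, attn_dim)
        self.v2 = nn.Linear(attn_dim, 1)

    def forward(self, dec_state, enc_outputs):
        # enc_outputs: list of (batch, seq_k, dim_k), one per modality.
        contexts = [additive_attention(dec_state, enc, wk, wq, v)[0]
                    for enc, wk, wq, v in
                    zip(enc_outputs, self.w_k1, self.w_q1, self.v1)]
        stacked = torch.stack([p(c) for p, c in zip(self.proj, contexts)], 1)
        fused, weights = additive_attention(
            dec_state, stacked, self.w_k2, self.w_q2, self.v2)
        return fused, weights  # fused context + per-modality weights
```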

Page 12

All Models: Comparison

Results on DSTC data:

              BLEU-4   ROUGE-L
Text-only      0.105    0.330
Video-only     0.085    0.294
Text + Video   0.112    0.338

● Video-only is competitive with the other models.
● Text + Video improves over Text-only (but not drastically).

Page 13

Investigating Use of External Data

Page 14

External Data: How2 Dataset

● Video
● Speech
● English transcript
● Portuguese transcript
● Summary

Sanabria et al. 2018

Page 15

Model Fine-tuned with How2 Data

[Diagram: How2 tasks pairing inputs (Video, Audio, Transcript, Transcript + Video) with targets (Caption, Translation, Summary), alongside the AVSD mapping Summary + Question → Answer.]

Page 16

Summarization with How2 Data

● Summarization
  ○ Present a subset of the information in a more compact form (possibly across modalities).
● "Description" field
  ○ 2-3 sentences of metadata: template-based, provided by the uploader.
  ○ An "informative" and abstractive summary of a how-to video.
  ○ Should generate interest from a potential viewer.

Libovický et al. 2018

Page 17

Attention over the Video Features

[Figure: attention weights over video segments, with cuts marked: "talking and preparing the brush", "close-up of brushstrokes with hand", "close-up of brushstrokes without hand", "black frames at the end".]
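A figure like this is a heatmap of decoder attention weights over video segments; a minimal matplotlib sketch with illustrative inputs:

```python
import matplotlib.pyplot as plt
import numpy as np

def plot_attention(weights, output_tokens, segment_labels):
    # weights: (num_output_tokens, num_video_segments) attention matrix.
    fig, ax = plt.subplots()
    ax.imshow(weights, aspect="auto", cmap="viridis")
    ax.set_xticks(np.arange(len(segment_labels)))
    ax.set_xticklabels(segment_labels, rotation=45, ha="right")
    ax.set_yticks(np.arange(len(output_tokens)))
    ax.set_yticklabels(output_tokens)
    fig.tight_layout()
    plt.show()
```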

Page 18

Text-only Model Fine-tuned with How2

[Diagram: pre-train Transcript → Summary on How2, then fine-tune Summary + Question → Answer; Text Encoder (2-layer BiGRU) → Attentive Decoder → Answer]

BLEU-4: 0.114    ROUGE-L: 0.337
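A sketch of this two-stage recipe (pre-train on How2 summarization, then fine-tune on the AVSD answer task). The model, data loaders, and learning rates are hypothetical stand-ins; the actual system used nmtpytorch configurations:

```python
# Sketch: pre-train Transcript -> Summary on How2, then fine-tune
# Summary + Question -> Answer on AVSD with the same weights.
import torch

def run_epoch(model, loader, optimizer):
    model.train()
    for src, tgt in loader:
        optimizer.zero_grad()
        loss = model(src, tgt)   # assumed to return the seq2seq NLL loss
        loss.backward()
        optimizer.step()

def pretrain_then_finetune(model, how2_loader, avsd_loader,
                           pretrain_epochs=10, finetune_epochs=5):
    # Stage 1: summarization pre-training on How2.
    opt = torch.optim.Adam(model.parameters(), lr=4e-4)
    for _ in range(pretrain_epochs):
        run_epoch(model, how2_loader, opt)
    torch.save(model.state_dict(), "how2_pretrained.pt")

    # Stage 2: AVSD fine-tuning from the same weights, lower LR (assumed).
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    for _ in range(finetune_epochs):
        run_epoch(model, avsd_loader, opt)
```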

Page 19

Video-only Model Fine-tuned with How2

[Diagram: pre-train Video → Summary on How2, then fine-tune Video → Answer; ResNeXt action prediction → Bidirectional RNN → Decoder over the output vocabulary]

BLEU-4: 0.086    ROUGE-L: 0.300

Page 20

Text + Video Model Fine-tuned with How2

[Diagram: Text Encoder and ResNeXt action prediction feed a Multimodal Decoder over the output vocabulary (Libovický et al. 2017)]

BLEU-4: 0.113    ROUGE-L: 0.339

Page 21

All Models: Comparison + Human Ratings

              DSTC only                 DSTC + How2
              BLEU-4  ROUGE-L  Human    BLEU-4  ROUGE-L  Human
Text-only      0.105   0.330     -       0.114   0.337   3.394
Video-only     0.085   0.294     -       0.086   0.300     -
Text + Video   0.112   0.338   3.491     0.113   0.339   3.459

● How2 data improves the Text-only model.
● Video-only is competitive with the other models.
● Text + Video improves over Text-only, but not drastically.
● Our models also perform well on human evaluation.

Page 22

Model Outputs

Question: is he talking or reading out loud?
Answer: no, he is not talking at all.

Question: what's in the mug?
Answer: i don't know, i can't see the inside.

Question: hello. did someone come to the door?
Answer: no and it is a window that he is standing in front of.

Question: are they talking in the video?
Answer: not really no i don't hear anything

Page 23

References

● “Attention Strategies for Multi-Source Sequence-to-Sequence Learning”, Jindřich Libovický, Jindřich Helcl
● “Hierarchical Question-Image Co-Attention for Visual Question Answering”, Jiasen Lu, Jianwei Yang, Dhruv Batra, Devi Parikh
● “NMTPY: A Flexible Toolkit for Advanced Neural Machine Translation Systems”, Ozan Caglayan, Mercedes García-Martínez, Adrien Bardet, Walid Aransa, Fethi Bougares, Loïc Barrault
● “How2: A Large-scale Dataset for Multimodal Language Understanding”, Ramon Sanabria, Ozan Caglayan, Shruti Palaskar, Desmond Elliott, Loïc Barrault, Lucia Specia, Florian Metze
● “Multimodal Abstractive Summarization for Open-Domain Videos”, Jindřich Libovický, Shruti Palaskar, Spandana Gella, Florian Metze

Page 24

Thank you to the organizers!

Data: How2 Dataset, https://github.com/srvk/how2-dataset
Code: nmtpytorch, https://github.com/lium-lst/nmtpytorch