self-critical reasoningfor robust visual question answering · 2020. 1. 14. · jialinwu and...

Post on 14-Aug-2021

1 Views

Category:

Documents

0 Downloads

Preview:

Click to see full reader

TRANSCRIPT

Self-Critical Reasoning for Robust Visual Question Answering

Jialin Wu and Raymond J. Mooney

Visual Question Answering (VQA)

• Common VQA system

What utensil is pictured?

Answer Prediction

Knife(0.72)

Fork(0.66)Visual feature set 𝒱Original image

Capture superficial statistical correlationsbetween QA pairs

VQAsystem Knife

I won’t bother to look at the image, Ican answer your question by just

looking at the questionWhat utensil is pictured?

Original image 0

20

40

60

80

100

knife fork

Training Answer Distribution

Force VQA to focus on what humans focus on

• Extract a proposal set of objects ( ) that humans focus on.

OR

There is a fork near the cake.

Proposal object set

Human visual explanation

Human textual explanation

Force VQA to focus on what humans focus on

• Enforce the gradients for the correct answer to have the largest valuefor at least one of the extracted objects.

∇#𝑝(𝑓𝑜𝑟𝑘|𝑄, 𝒱)

Proposal object set

Influence Strengthen Loss

Results

• Compared to baseline model on VQA-CP dataset• VQA-CP dataset manually set the train and test set in very different

distribution

38

43

48

53

All

VQA scores

Baseline Ours (infl)

Over sensitivity to the most common objects

VQAsystem

I can focus on the fork but I stillthink it is a knife

What utensil is pictured?

Knife

Focused objectsfor answer “fork”

Focused objectsfor answer “knife”

Criticizing the false influential object

• Find the most influential object for the correct answer using gradients

What utensil is pictured?

∇#𝑝(𝑓𝑜𝑟𝑘|𝑄, 𝒱)

OR

There is a fork near the cake.

Answer Prediction

Knife(0.72)

Fork(0.66)

Proposal object set

Explaining prediction “fork”

Visual feature set 𝒱Original image

Human visual explanation

Human textual explanation

The most influential object

Criticizing the false influential object

• Force the object to contribute more to the correct answer.

What utensil is pictured?

∇#𝑝(𝑓𝑜𝑟𝑘|𝑄, 𝒱)

OR

There is a fork near the cake.

Answer Prediction

Knife(0.72)

Fork(0.66)

Proposal object set

Explaining prediction “fork”

Visual feature set 𝒱Original image

Human visual explanation

Human textual explanation

The most influential object

∇#𝑝(𝑘𝑛𝑖𝑓𝑒|𝑄, 𝒱)

Explaining prediction “knife”

Self Critical Loss

Our self-critical approach

VQAsystem Fork

Oh, yes, the utensil should be a fork.What utensil is pictured?

Results

• Compared to baseline model on VQA-CP dataset

3840424446485052

All

VQA scores

Baseline Ours (infl) Ours (infl + crit)

top related