PIQA: Reasoning about Physical Commonsense in Natural Language

Shailesh M Pandey

Bisk, Yonatan et al. “PIQA: Reasoning about Physical Commonsense in Natural Language.” ArXiv abs/1911.11641 (2019)

Outline
1. Motivation
2. Dataset
   2.1. Collection
   2.2. Cleaning
   2.3. Statistics
3. Experiments
   3.1. Results
4. Analysis
   4.1. Quantitative
   4.2. Qualitative
5. Critique

Motivation
● Modeling physical common sense knowledge is essential to true AI-completeness.
● Can AI systems learn to reliably answer physical common sense questions without experiencing the physical world?
  ○ Common sense properties are rarely reported directly in text.
● There has been no extensive evaluation of SOTA models on questions that require physical common sense knowledge.

Dataset
● Task: given a question and two possible answers, choose the most appropriate answer.
● Question: indicates a post-condition (goal)
● Answer: procedure for accomplishing the goal (solution)
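A minimal sketch of how one such instance could be represented. The class name, the field names, and the example text are illustrative assumptions, not taken from the paper or the dataset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PiqaInstance:
    """One binary-choice item (field names are assumed for illustration)."""
    goal: str    # the question: a post-condition to achieve
    sol1: str    # first candidate procedure
    sol2: str    # second candidate procedure
    label: int   # index of the correct solution: 0 -> sol1, 1 -> sol2

    def correct_solution(self) -> str:
        return self.sol1 if self.label == 0 else self.sol2


# Made-up example for illustration only (not a real dataset item).
example = PiqaInstance(
    goal="Dry wet shoes overnight.",
    sol1="Stuff the shoes with crumpled newspaper.",
    sol2="Stuff the shoes with crumpled plastic wrap.",
    label=0,
)
print(example.correct_solution())
```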

Dataset - Collection
● Qualification HIT for annotators
  ○ Identify well-formed (goal, solution) pairs >80% of the time.
● Provided annotators with a prompt derived from instructables.com
  ○ Drawn from six categories: costume, outside, craft, home, food, and workshop
  ○ Reminds annotators of less prototypical uses of everyday objects
● Annotators asked to construct two component tasks
  ○ Articulate the goal and solution
  ○ Perturb the solution subtly to make it invalid

Dataset - Cleaning
● Removed examples with low agreement
  ○ Correct examples that require expert knowledge are also removed
● Used AFLite to perform systematic data bias reduction
  ○ Used 5k examples to fine-tune BERT-Large
  ○ Computed corresponding embeddings of remaining instances
  ○ Used ensemble of linear classifiers (trained on random subsets) to determine if embeddings are strong indicators of the correct answer.
  ○ Discarded instances whose embeddings are highly indicative of the target label.

AFLite (Adversarial Filtering Lite)

Sakaguchi et al. 2020. WinoGrande: An Adversarial Winograd Schema Challenge at Scale. In AAAI.

[Figure: step-by-step AFLite illustration on eight toy instances, 000 to 111. Instances 000-011 have label 1 and instances 100-111 have label 0. Each panel splits the instances into training and held-out sets, and the predictions made for held-out instances are collected per instance. Final state:]

Instance   Label   Held-out predictions   Score
000        1       [ 1 0 ]                0.5
001        1       [ 1 ]                  1.0
010        1       [ ]
011        1       [ 0 ]                  0.0
100        0       [ 0 0 ]                1.0
101        0       [ ]
110        0       [ ]
111        0       [ ]

Threshold - 0.75
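The scoring and filtering step of the toy illustration can be sketched as follows. This is only a sketch under stated assumptions: the function names are hypothetical, classifier training is omitted (the held-out predictions are taken as given), and treating "score strictly above the threshold" as the removal rule is an assumption about how the threshold is applied.

```python
def predictability_scores(labels, held_out_predictions):
    """Fraction of held-out predictions that match the gold label.

    Instances that were never held out get None (no evidence either way).
    """
    scores = {}
    for instance_id, gold in labels.items():
        preds = held_out_predictions.get(instance_id, [])
        scores[instance_id] = (
            sum(p == gold for p in preds) / len(preds) if preds else None
        )
    return scores


def filter_predictable(labels, held_out_predictions, threshold=0.75):
    """Keep instances whose score is missing or not above the threshold."""
    scores = predictability_scores(labels, held_out_predictions)
    return [i for i, s in scores.items() if s is None or s <= threshold]


# Toy values from the illustration: gold labels, and the predictions each
# instance received from classifiers that did not train on it.
labels = {"000": 1, "001": 1, "010": 1, "011": 1,
          "100": 0, "101": 0, "110": 0, "111": 0}
held_out_predictions = {"000": [1, 0], "001": [1], "011": [0], "100": [0, 0]}

scores = predictability_scores(labels, held_out_predictions)
kept = filter_predictable(labels, held_out_predictions)
print(scores)
print(kept)
```

With the threshold of 0.75, instances 001 and 100 (score 1.0) are discarded as too easy to predict from their embeddings, while 000 (0.5) and 011 (0.0) are kept.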



Examples

Dataset - Statistics
● Number of QA pairs
  ○ Training: > 16k
  ○ Development: ~ 2k
  ○ Testing: ~ 3k
● Average number of words:
  ○ Goal: 7.8
  ○ Correct solution: 21.3
  ○ Incorrect solution: 21.3

Dataset - Statistics
● Nearly identical distribution of correct and incorrect solution length.
● At least 85% overlap between words used in correct and incorrect solutions.
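The slides do not say how the word overlap is measured. One plausible reading, shown here purely as an assumption, is the Jaccard overlap between the vocabularies of the correct and incorrect solutions. The function names and the toy sentences are made up for illustration.

```python
def vocabulary(texts):
    """Set of lowercased, whitespace-separated words across all texts."""
    return {word for text in texts for word in text.lower().split()}


def vocabulary_overlap(correct_solutions, incorrect_solutions):
    """Jaccard overlap of the two vocabularies (an assumed definition)."""
    correct_vocab = vocabulary(correct_solutions)
    incorrect_vocab = vocabulary(incorrect_solutions)
    union = correct_vocab | incorrect_vocab
    if not union:
        return 1.0
    return len(correct_vocab & incorrect_vocab) / len(union)


# Made-up sentences, not dataset items.
correct = ["stuff the shoes with newspaper"]
incorrect = ["stuff the shoes with plastic"]
overlap = vocabulary_overlap(correct, incorrect)
print(round(overlap, 3))
```

A high overlap under any such definition suggests that a model cannot separate correct from incorrect solutions by vocabulary alone.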

Experiments
● For each choice, provided the model with goal, solution, and [CLS].
● Extracted hidden states corresponding to [CLS].
● Applied linear transformation to each hidden state and softmax over the two options.
● Trained using a cross-entropy loss.
● Truncated examples at 150 tokens - affects ~1% of the data.
● Human performance was calculated by a majority vote on the development set.
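The scoring head described above can be sketched in plain Python. The 3-dimensional hidden states, the weights, and the function names are hypothetical stand-ins for the encoder's real [CLS] vectors, and sharing one linear layer across both options is an assumption.

```python
import math


def option_logit(hidden_state, weights, bias):
    """Linear transformation of one [CLS] hidden state to a scalar score."""
    return sum(h * w for h, w in zip(hidden_state, weights)) + bias


def softmax(logits):
    """Numerically stable softmax over the candidate options."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probabilities, gold_index):
    """Negative log-probability assigned to the correct option."""
    return -math.log(probabilities[gold_index])


# Hypothetical [CLS] states for (goal, solution 1) and (goal, solution 2).
cls_states = [[0.2, -0.1, 0.5], [0.2, -0.1, -0.5]]
weights = [1.0, 0.0, 2.0]  # assumed shared across both options
bias = 0.0

logits = [option_logit(h, weights, bias) for h in cls_states]
probabilities = softmax(logits)
loss = cross_entropy(probabilities, gold_index=0)
print(probabilities, loss)
```

Because the softmax is taken across the two options, the model is trained to rank the correct solution above the incorrect one rather than to judge each solution in isolation.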

Quantitative Analysis
● Two solution choices that differ by editing a single phrase must test the common sense understanding of that phrase.
● ~60% of the data involves a 1-2 word edit distance between solutions.
● Dataset complexity generally increases with the edit distance between the solution pairs.
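A word-level edit distance between two solutions can be computed with the standard Levenshtein recurrence over tokens. This is a sketch that assumes whitespace tokenization and unit costs for insertion, deletion, and substitution. The slides do not specify the exact variant used, and the function name is hypothetical.

```python
def word_edit_distance(first, second):
    """Levenshtein distance between two strings, counted in words."""
    a, b = first.split(), second.split()
    # previous[j] = distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, word_a in enumerate(a, start=1):
        current = [i]
        for j, word_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (word_a != word_b)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


# Made-up solution pair differing in a single word.
distance = word_edit_distance("put the lid on top", "put the lid on bottom")
print(distance)
```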

Quantitative Analysis
● RoBERTa struggles to understand certain flexible relations.
  ○ ‘before’, ‘after’, ‘top’, and ‘bottom’
● Performs worse than average on solutions differing in ‘water’ even after ~300 training examples.
● Performs much better on certain nouns, such as ‘spoon’.

Quantitative Analysis
● ‘water’ is prevalent but highly versatile.
  ○ Substituted with a variety of different household items.
● ‘spoon’ has fewer common replacements, which indicates RoBERTa understands these simple affordances.

Qualitative Analysis
● RoBERTa distinguishes prototypical correct solutions from clearly ridiculous trick solutions.
● Struggles with subtle relations and non-prototypical situations.

Critique
● Tries to advance a crucial ‘grounding’ problem
  ○ A benchmark for testing physical understanding of new models
  ○ Evaluation of physical common sense of SOTA models - unsurprisingly, these models don’t perform very well
● Good effort at creating an unbiased dataset
  ○ No ‘annotate for smart robot’ instruction to the workers.
  ○ Good cleaning of the dataset - agreement scores and AFLite.
● Reasonably good analysis of the performance of RoBERTa on their dataset.

Critique
● An intelligent model will have good performance on this benchmark, but is the converse true?
  ○ What if we pre-train RoBERTa on text from ‘instructables.com’?
● Should we expect models trained on text to have physical understanding?
  ○ How would a text-trained model know that squeezing and then releasing a bottle creates suction?
  ○ Should the focus have been on some ‘grounded’ models? e.g. VQA models.
● Is the dataset easy because we have just two choices?
● The paper does not report a few important dataset statistics
  ○ What is the distribution of words in incorrect solutions? Is it similar to the correct solutions?
  ○ How many examples were actually removed during cleaning?
● Is a majority vote a good indicator of human performance?
  ○ What is the average score of a single person?
  ○ Should the dataset have questions where majority vote gets it wrong?
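The gap between majority-vote accuracy and single-annotator accuracy can be made concrete with a toy calculation. The votes below are invented for illustration and are not from the paper. The function names are hypothetical, and an odd number of annotators is assumed so that binary votes cannot tie.

```python
from collections import Counter


def majority_vote_accuracy(votes_per_question, gold_labels):
    """Accuracy when each question is answered by the most common vote."""
    correct = 0
    for votes, gold in zip(votes_per_question, gold_labels):
        winner, _ = Counter(votes).most_common(1)[0]
        correct += winner == gold
    return correct / len(gold_labels)


def mean_individual_accuracy(votes_per_question, gold_labels):
    """Accuracy averaged over every individual vote."""
    total = sum(len(votes) for votes in votes_per_question)
    correct = sum(
        vote == gold
        for votes, gold in zip(votes_per_question, gold_labels)
        for vote in votes
    )
    return correct / total


# Invented votes: three annotators, four binary questions.
gold_labels = [0, 1, 0, 1]
votes_per_question = [[0, 0, 1], [1, 1, 0], [0, 1, 0], [1, 0, 1]]

majority = majority_vote_accuracy(votes_per_question, gold_labels)
individual = mean_individual_accuracy(votes_per_question, gold_labels)
print(majority, round(individual, 3))
```

In this toy case the majority vote is right on every question while each individual vote is right only two times out of three, which is the kind of gap the question above is probing.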

Questions?
