Towards High Precision Text Generation
Ankur Parikh
Joint work with Thibault Sellam, Ran Tian, Xuezhi Wang, Sebastian Gehrmann, Shashi Narayan, Manaal Faruqui, Bhuwan Dhingra, Diyi Yang, and Dipanjan Das.
Text Generation
● Given a source sequence x, output a target y
● Examples:
○ Translation: x = English sentence, y = French sentence
○ Summarization: x = Document, y = One-paragraph summary
○ Data-to-Text: x = Structured data, y = Textual description
Encoder Decoder Paradigm
Encoder = some neural network
Decoder (conditional language model) = some neural network
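Since the decoder is a conditional language model, generation is just a loop over P(y_t | y_<t, x). A minimal sketch of greedy decoding, with a hand-coded toy distribution standing in for the neural network (the vocabulary and probabilities here are invented purely for illustration):

```python
# Greedy decoding from a toy conditional "language model".
# The hand-coded table below stands in for any neural encoder-decoder;
# only the decoding loop itself is the point.

def toy_next_token_probs(source, prefix):
    """Return a {token: probability} distribution for the next target token."""
    if not prefix:
        return {"le": 0.9, "la": 0.1}
    if prefix[-1] == "le":
        return {"chat": 0.8, "chien": 0.2}
    return {"<eos>": 1.0}

def greedy_decode(source, max_len=10):
    prefix = []
    for _ in range(max_len):
        probs = toy_next_token_probs(source, prefix)
        token = max(probs, key=probs.get)  # pick the single most likely token
        if token == "<eos>":
            break
        prefix.append(token)
    return prefix

print(greedy_decode(["the", "cat"]))  # ['le', 'chat']
```

Beam search replaces the single argmax with the top-k extensions of several prefixes; the loop structure is otherwise the same.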
Neural Text Generation
● Neural text generation is surprisingly fluent but prone to hallucination, which makes it unsuitable for many real-world applications.
● Challenges are multifaceted:
○ Models
○ Data
○ Evaluation
Motivation - Hallucination
● Neural generation models often produce phrases that are unsupported by, or contradictory to, the source data.
ROTOWIRE [Wiseman et al. 2017]
Motivation - Hallucination
Source (Wikipedia infobox): Frank Lino
FBI surveillance photo
Birth date: October 30, 1938
Birth place: Gravesend, Brooklyn, New York, United States

Neural baseline: Frank Lino (born October 30, 1938 in Brooklyn, New York, United States) is an American criminal defense attorney.
https://en.wikipedia.org/wiki/Frank_Lino
WIKIBIO [Lebret et al. 2016]
Causes of Hallucination (Data)
● In some datasets, the target contains information that cannot be inferred from the source, due to heuristic data collection.
● This makes it unclear if hallucination is caused by modeling weaknesses or data noise.
Source (Wikipedia infobox): Frank Lino
FBI surveillance photo
Birth date: October 30, 1938
Birth place: Gravesend, Brooklyn, New York, United States
Reference: Frank ``Curly '' Lino (born October 30, 1938 Brooklyn) is a Sicilian-American Caporegime in the Bonanno crime family who later became an informant.
WIKIBIO [Lebret et al. 2016]
Causes of Hallucination (Evaluation)
BLEU prefers coverage/fluency over precision.
Source (Wikipedia infobox): Michael Dahlquist
Born: December 22, 1965, Seattle, Washington
Died: July 14, 2005 (aged 39), Skokie, Illinois
Genres: Male
Occupation: Drummer
Instruments: Drums
Reference: Michael Dahlquist (December 22 , 1965 -- July 14 , 2005) was a drummer in the Seattle band Silkworm.
Candidate 1: Michael Dahlquist (December 22 , 1965 -- July 14 , 2005) was a drummer in the California band Grateful dead.
Candidate 2: Michael Dahlquist (December 22 , 1965 -- July 14 , 2005) was a drummer.
Candidate 3: Michael Dahlquist (December 22 , 1965 -- July 14 , 2005) was a drummer from Seattle Washington.
https://en.wikipedia.org/wiki/Michael_Dahlquist
BLEU scores: Candidate 1: 0.79, Candidate 2: 0.71, Candidate 3: 0.73
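The effect can be reproduced with plain clipped n-gram precision, the core ingredient of BLEU (a simplified stand-in, without the brevity penalty or smoothing of the full metric):

```python
from collections import Counter

def ngram_precision(candidate, reference, n=2):
    """Clipped n-gram precision, the core ingredient of BLEU (simplified)."""
    cand = candidate.split()
    ref = reference.split()
    cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    if not cand_ngrams:
        return 0.0
    overlap = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
    return overlap / sum(cand_ngrams.values())

ref = "michael dahlquist was a drummer in the seattle band silkworm"
hallucinated = "michael dahlquist was a drummer in the california band grateful dead"
short_but_faithful = "michael dahlquist was a drummer"

# The hallucinated candidate still gets substantial n-gram credit
# (6 of its 10 bigrams match the reference) despite being factually
# wrong: n-gram overlap does not specifically penalize unsupported content.
print(ngram_precision(hallucinated, ref))       # 0.6
print(ngram_precision(short_but_faithful, ref))  # 1.0
```

The full metric additionally applies a brevity penalty, which is what drags the short but faithful candidate below the hallucinated one.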
Causes of Hallucination (Models)
Loss functions like maximum likelihood are often ill-suited to preventing hallucination.
Our work
Data
Evaluation
Models
Machine Translation
● ToTTo: A Controlled Table-to-Text Generation Dataset
● Handling Divergent Reference Texts when Evaluating Table-to-Text Generation
● BLEURT: Learning Robust Metrics for Text Generation
● Learning to Evaluate Translation Beyond English: BLEURT Submissions to the WMT Metrics 2020 Shared Task
● Text Generation with Exemplar-based Adaptive Decoding
● Sticking to the Facts: Confident Decoding for Faithful Data-to-Text Generation
● Consistency by Agreement in Zero-shot Neural Machine Translation
● A Multilingual View of Unsupervised Machine Translation
Focus of this Talk
Data
Evaluation
Models
Machine Translation
● ToTTo: A Controlled Table-to-Text Generation Dataset
● Handling Divergent Reference Texts when Evaluating Table-to-Text Generation
● BLEURT: Learning Robust Metrics for Text Generation
● Learning to Evaluate Translation Beyond English: BLEURT Submissions to the WMT Metrics 2020 Shared Task
● Text Generation with Exemplar-based Adaptive Decoding
● Sticking to the Facts: Confident Decoding for Faithful Data-to-Text Generation
● Consistency by Agreement in Zero-shot Neural Machine Translation
● A Multilingual View of Unsupervised Machine Translation
ToTTo: A Controlled Table-To-Text Generation Dataset
Sebastian Gehrmann, Xuezhi Wang, Manaal Faruqui, Bhuwan Dhingra, Diyi Yang, Dipanjan Das
TL;DR: Novel data-to-text dataset with clean references over Wikipedia. Shows that cleaning data does not stop hallucination.
EMNLP 2020
https://github.com/google-research-datasets/ToTTo
Overview - Controlled Generation Task
Source: table, metadata, and a set of highlighted cells
Table Title: Robert Craig (American football)
Section Title: National Football League statistics
Table Description: None

Target: one-sentence description
Craig finished his eleven NFL seasons with 8,189 rushing yards and 566 receptions for 4,911 receiving yards.

120K training examples
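Baselines for this task typically flatten the metadata and highlighted cells into a single source string for the encoder. A sketch of one plausible linearization; the tag names (`<page_title>`, `<cell>`, `<col_header>`) are an assumption for illustration and may differ from the exact scheme used for ToTTo:

```python
def linearize(page_title, section_title, highlighted_cells):
    """Flatten table metadata and highlighted cells into a source string.

    highlighted_cells: list of (column_header, value) pairs. This tagging
    scheme is illustrative, not necessarily ToTTo's exact format.
    """
    parts = [f"<page_title> {page_title} </page_title>",
             f"<section_title> {section_title} </section_title>"]
    for header, value in highlighted_cells:
        parts.append(f"<cell> {value} <col_header> {header} </col_header> </cell>")
    return " ".join(parts)

src = linearize(
    "Robert Craig (American football)",
    "National Football League statistics",
    [("Rushing yards", "8,189"), ("Receptions", "566"), ("Receiving yards", "4,911")],
)
print(src)
```

The resulting string is then fed to a standard sequence-to-sequence model exactly as a sentence would be.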
Motivation - Data Noise
● In some datasets, the target contains information that cannot be inferred from the source, due to heuristic data collection.
● This makes it unclear if hallucination is caused by modeling weaknesses or data noise.
● Asking annotators to write sentences from scratch typically results in vanilla targets that lack variety [Gururangan et al. 2018, Poliak et al. 2018].
Source (Wikipedia infobox): Frank Lino
FBI surveillance photo
Birth date: October 30, 1938
Birth place: Gravesend, Brooklyn, New York, United States
Reference: Frank ``Curly '' Lino (born October 30, 1938 Brooklyn) is a Sicilian-American Caporegime in the Bonanno crime family who later became an informant.
WIKIBIO [Lebret et al. 2016]
Motivation - Task Definition
● Many tasks are defined as summarization, which can be difficult to evaluate.
● On the other hand, tasks that are limited to verbalizing a fully specified meaning representation may not sufficiently challenge modern neural networks [Gardent et al. 2017].
ROTOWIRE [Wiseman et al. 2017]
Our Dataset
ToTTo is novel in two ways:
● Task Design: “Controlled generation”: Set of highlighted cells gives guidance as to what to generate.
● Annotation process: Annotators iteratively revise natural sentences on Wikipedia so they are faithful to the table.
● ToTTo statistics:
○ 120K training examples
○ 7,500 dev examples
○ 7,500 test examples
(Heuristic) Data Collection
Get tables from Wikipedia.
Tara Summers
1956 United States presidential election in Louisiana
(Heuristic) Data Collection
Use heuristics such as word overlap or hyperlinks to find sentences that may be related to the table.
After winning the German under-23 100 m title, she was selected to run at the 1995 World Championships in Athletics both individually and in the relay.
Gabriele Becker
(Heuristic) Data Collection
After winning the German under-23 100 m title, she was selected to run at the 1995 World Championships in Athletics both individually and in the relay.
Gabriele Becker
Heuristic sentence is too noisy to be a generation target.
Annotation Process
Annotators do the following:
● Highlight the cells that support the sentence
● Iteratively revise the sentence so that it is faithful to the table and standalone
Table Title: Gabriele BeckerSection Title: International competitions
Original sentence:
After winning the German under-23 100 m title, she was selected to run at the 1995 World Championships in Athletics both individually and in the relay.
After deletion and decontextualization:
Gabriele Becker competed at the 1995 World Championships in both individually and in the relay.
After grammar correction:
Gabriele Becker competed at the 1995 World Championships both individually and in the relay.
Dataset Statistics
| Property | Value |
| --- | --- |
| Training set size | 120,761 |
| Number of target tokens | 1,268,268 |
| Avg. target length (tokens) | 17.4 |
| Target vocabulary size | 136,777 |
| Unique tables | 83,141 |
| Rows per table (median) | 16 |
| Cells per table (median) | 87 |
| No. of highlighted cells (median) | 3 |
Annotator Agreement
| Annotation Stage | BLEU-4 |
| --- | --- |
| After Deletion | 82.9 |
| After Decontextualization | 72.56 |
| Final (After Grammar) | 68.98 |
Our iterative annotation process allows us to measure annotator agreement at each stage.
Linguistic Phenomena
| Linguistic Phenomenon | Percentage |
| --- | --- |
| Requires reference to page title | 82% |
| Requires reference to section title | 19% |
| Requires reference to table description | 3% |
| Reasoning (logical, numerical, temporal, etc.) | 21% |
| Comparison across rows / columns / cells | 13% |
| Requires background information | 12% |
Baselines
● BERT-to-BERT [Rothe et al. 2019] - BERT initialized encoder-decoder model
● Pointer Generator [See et al. 2017] - Seq2Seq with Copy mechanism
● [Puduppully et al. 2019] - Content planning mechanism for data-to-text
Baseline Results
● BLEU [Papineni et al. 2002]: n-gram metric
● PARENT [Dhingra et al. 2019]: n-gram metric for data-to-text
● BLEURT [Sellam et al. 2020]: learnt evaluation metric
Baseline Results - Human Evaluation
Compared to the human oracle, the baseline is:
● reasonably fluent
● however, considerably less faithful
(Chart values: 99.3, 93.6, 88.1, 76.2.)
Evidence that neural models hallucinate even in the presence of clean data.
Model Shortcomings - Rare topics
Table Title: Microdrive
Section Title: Microdrive models by timeline
Table Description: Date of release of large sizes
Prediction: there were 512 microdrive models in 2000: 1 gigabyte.
Reference: a second generation of microdrive was announced by ibm in 2000 with increased capacities at 512 mb and 1 gb.
Model Shortcomings - Reasoning
Table Title: Travis Kelce
Section Title: Collegiate statistics
Table Description: None
Prediction: kelce finished the 2012 season with 45 receptions for 722 yards (16.0 average) and eight touchdowns.
Reference: in travis kelce’s last collegiate season, he set personal career highs in receptions (45), receiving yards (722), yards per receptions (16.0) and receiving touchdowns (8).
Summary
Try out our dataset!
Feel free to contact:
https://github.com/google-research-datasets/ToTTo
BLEURT: Learning Robust Metrics for Text Generation
Thibault Sellam, Dipanjan Das

TL;DR: State-of-the-art learnt metric with fast domain/task adaptation.

ACL 2020

https://github.com/google-research/bleurt
Metrics are a Bottleneck to Generation Progress
| Prediction | Reference | Comments |
| --- | --- | --- |
| pete kmetovic from stanford university took third place at stanford university. | the eagles first pick, and third overall, was pete kmetovic, a halfback from stanford university. | Incorrect, and word overlap is low |
| the 1956 grand prix motorcycle racing season consisted of eight grand prix races in six classes: 500cc, 350cc, 250cc, 125cc and sidecars 500cc. | the 1956 grand prix motorcycle racing season consisted of six grand prix races in five classes: 500cc, 350cc, 250cc, 125cc and sidecars 500cc. | Incorrect, but word overlap is high |
| kelce finished the 2012 season with 45 receptions for 722 yards (16.0 average) and eight touchdowns. | in travis kelce’s last collegiate season, he set personal career highs in receptions (45), receiving yards (722), yards per receptions (16.0) and receiving touchdowns (8). | Correct, but the reference is more informative |
Learnt Metrics
● Complex, task-oriented relationships like these can only be captured by learnt metrics.
● Naive learnt metric: Fine-Tune BERT on human ratings data.
● Problem: Brittle, requires lots of fine-tuning data for every new dataset/task.
BERT [Devlin et al. 2018] + human ratings data (0.1, 0.7, 0.4, ...) = learnt metric
BLEURT
● Additional pretraining step based on synthetic data.
● Makes model robust to train/test skew and enables fast adaptation to other domains.
● State-of-the-art results on WMT 2017, 2018, 2019 and WebNLG.
BERT [Devlin et al. 2018] + synthetic data (0.2, 0.9, 0.1, ...) + human ratings data (0.1, 0.7, 0.4, ...) = BLEURT
Generating Synthetic Data
● Goal is to generate pairs (x, x’) that resemble (reference, prediction) pairs.
● Model errors typically do not resemble existing paraphrasing datasets.
● Generate synthetic data instead.
Reference: I bought soy milk from the store on Tuesday.
Predictions:
● I bought soy milk from the store on Tuesday.
● I bought soy milk from the store on Wednesday.
● I bought soy milk from the grocery store on Tuesday.
● I bought soy milk from the grocery store on Tuesday on Tuesday on Tuesday.
Generating Synthetic Data
● Strategy 1: Randomly mask text and employ BERT to fill the masks.
● Strategy 2: Use back-translation.
Masking:
I bought [MASK] milk from [MASK] store on Tuesday. → I bought almond milk from a store on Tuesday.
I [MASK] [MASK] [MASK] from the store on Tuesday. → I stole some oranges from the store on Tuesday.
Back-translation:
I bought soy milk from the store on Tuesday. → Mardi, j'ai acheté du lait de soja au magasin. → Tuesday I bought soy milk at the store.
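The mask-selection half of Strategy 1 can be sketched as follows (in BLEURT the masks are then filled by BERT, which is omitted here; the masking rate is an arbitrary choice for illustration):

```python
import random

def mask_tokens(sentence, mask_prob=0.3, seed=0):
    """Randomly replace tokens with [MASK], to be filled later by a masked LM.

    A fixed seed makes the example reproducible; real pipelines would
    draw fresh masks for every synthetic pair.
    """
    rng = random.Random(seed)
    tokens = sentence.split()
    masked = ["[MASK]" if rng.random() < mask_prob else tok for tok in tokens]
    return " ".join(masked)

print(mask_tokens("I bought soy milk from the store on Tuesday."))
```

Each masked variant, once filled in, yields an (x, x') pair whose perturbations resemble the small factual edits generation models actually make.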
Weak Supervision Signals
● Use a variety of easily computed metrics for a multi-task objective
[BLEU(x, x’), ROUGE(x, x’), BERTscore(x, x’), Entail(x, x’), ..., BackTransProb(x, x’)]
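With cheap stand-in signals (unigram precision and recall here, as toy proxies for the BLEU/ROUGE-style scores above), the multi-task target vector for one pair looks like this:

```python
def unigram_precision(x, y):
    """Fraction of x's tokens that appear in y (toy overlap signal)."""
    xs, ys = x.split(), set(y.split())
    return sum(t in ys for t in xs) / len(xs)

def signal_vector(reference, prediction):
    """Vector of weakly supervised pretraining targets for one (x, x') pair.

    Real BLEURT uses BLEU, ROUGE, BERTscore, entailment, and
    back-translation probabilities; these unigram scores are toy proxies.
    """
    return [
        unigram_precision(prediction, reference),  # BLEU-like precision
        unigram_precision(reference, prediction),  # ROUGE-recall-like
    ]

vec = signal_vector("I bought soy milk on Tuesday",
                    "I bought soy milk on Wednesday")
print(vec)  # [0.8333..., 0.8333...]
```

During pretraining, the model regresses onto every component of this vector jointly, so it absorbs many notions of similarity before ever seeing human ratings.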
BLEURT Results - WMT2017
(Chart values: 47.5, 57.9, 57.1, 72.4, 76.8, 79.2, 81.8.)
Robustness
● Skew the training set toward lower ratings and the test set toward higher ratings.
● Synthetic pretraining makes BLEURT more robust.
WebNLG Results
Sticking to the Facts: Confident Decoding for Faithful Data-to-Text Generation
TL;DR: Learnt per-token confidence score that reduces hallucination by 50% on the WikiBio dataset.

arXiv

Ran Tian, Shashi Narayan, Thibault Sellam
Reducing Hallucination
Source (Wikipedia infobox): Frank Lino
FBI surveillance photo
Birth date: October 30, 1938
Birth place: Gravesend, Brooklyn, New York, United States

Reference: Frank ``Curly '' Lino (born October 30, 1938 Brooklyn) is a Sicilian-American Caporegime in the Bonanno crime family who later became an informant.
Baseline (See et al. 2017): Frank Lino (born October 30, 1938 in Brooklyn, New York, United States) is an American criminal defense attorney.
Our model (Tian et al. 2019): Frank Lino (born October 30, 1938 in Brooklyn, New York, United States) is an American.
https://en.wikipedia.org/wiki/Frank_Lino
Intuition

Michael Eric Dyson (born October 23, 1958) is an American academic, author and radio host.

● Templatic words do not convey source information and can be generated by an unconditioned language model.
● Faithful content words require attention to the source.
● Hallucinations are typically content words that are not closely associated with the source.
Confidence Score
The confidence score should be high if the token:
● is a templatic word, or
● is supported by the source.
The confidence score should be low if the token is neither templatic nor supported by the source.
Michael Eric Dyson (born October 23, 1958) is an American academic, author and radio host.
Confidence Score - Determining if Word is Templatic
A word is considered templatic if it has high probability under a base language model that is not conditioned on the source:
Michael Eric Dyson (born October 23, 1958) is an American academic, author and radio host.
Confidence Score - Determining if Word is Supported by the Source
A word is supported by the source if it has a high attention score, computed from the attention vector and the LSTM decoder state (equation shown on the slide):

Michael Eric Dyson (born October 23, 1958) is an American academic, author and radio host.
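Putting the two tests together, the per-token confidence can be sketched as the maximum of the base language-model probability and an attention-derived support score. This is a simplified reading of the paper's learned score, not its exact parameterization:

```python
def confidence(lm_prob, attention_weights):
    """Simplified per-token confidence score.

    lm_prob: probability of the token under an unconditioned base LM
             (high for templatic words).
    attention_weights: attention over source cells at this decoding step
             (a large weight means strong source support).
    """
    support = max(attention_weights) if attention_weights else 0.0
    return max(lm_prob, support)

# "is" is templatic: the base LM predicts it easily.
print(confidence(lm_prob=0.9, attention_weights=[0.1, 0.05]))   # 0.9
# "Dyson" is content: low LM probability, but strong attention to a cell.
print(confidence(lm_prob=0.02, attention_weights=[0.7, 0.1]))   # 0.7
# A hallucinated content word: low LM probability and no source support.
print(confidence(lm_prob=0.02, attention_weights=[0.1, 0.1]))   # 0.1
```

Both cases the slide marks as safe (templatic, or source-supported) come out high, and only the unsupported-content case comes out low.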
Using and Learning the Confidence Score
The confidence score depends on learned parameters.
● How do we train it jointly with the rest of the model?
● How do we use it at inference time?
Learning Confidence Score in Training
If we knew the confidence score before training the other model parameters, we could use it to clean noisy references by removing unsupported phrases.

However, the confidence score is being learned jointly with the rest of the parameters...
Michael Eric Dyson (born October 23, 1958) is an American academic, author and radio host.
Learning Confidence Score in Training
Use a variational Bayes objective to subsample confident subsequences during training while learning the confidence score.
Confidence Score at Test Time
Based on [Braverman et al. 2018]: add a one-parameter model to the likelihood to re-calibrate output probabilities with the confidence score (the parameter is learnable; a stop-gradient is applied to the confidence score).
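One plausible form of such a one-parameter recalibration, shown purely as an illustration and not necessarily the paper's exact formula: rescale each token's probability by its confidence raised to a power lambda, then renormalize.

```python
def recalibrate(probs, confidences, lam=1.0):
    """Rescale token probabilities by confidence**lam and renormalize.

    probs, confidences: parallel lists over the vocabulary at one
    decoding step. lam stands in for the single learnable parameter
    (fixed here; in training it would receive gradients while the
    confidence scores are held behind a stop-gradient).
    """
    scores = [p * (c ** lam) for p, c in zip(probs, confidences)]
    total = sum(scores)
    return [s / total for s in scores]

probs = [0.5, 0.3, 0.2]   # raw decoder distribution over three tokens
conf = [0.9, 0.1, 0.8]    # per-token confidence scores
out = recalibrate(probs, conf, lam=1.0)
print(out)  # mass shifts away from the low-confidence middle token
```

With lam = 0 the decoder's original distribution is recovered; larger lam pushes probability mass toward confident (templatic or source-supported) tokens.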
Experimental Results
● Datasets
○ WikiBio [Lebret et al. 2016]
○ WebNLG [Gardent et al. 2017]
● Models (with and without confident decoding)
○ Pointer Generator
○ BERT encoder + LSTM decoder
● Evaluation metrics
○ BLEU
○ PARENT [Dhingra et al. 2019]
○ Human evaluation
![Page 65: Text Generation Towards High Precision](https://reader031.vdocuments.us/reader031/viewer/2022012413/616cfb629db1770554381ae8/html5/thumbnails/65.jpg)
Results - Wikibio Dataset
● Confident decoding significantly reduces hallucination; BLEU does not capture faithfulness.
(Chart values: percentages 75.0, 66.1, 80.3, 86.8, 79.4, 75.6; BLEU scores 45.6, 45.4, 41.1, 42.5, 38.1, 41.7.)
Results - Wikibio Dataset
● Fluency is similar across methods● Confidence decoding leads to minor coverage drop (can be increased using length
penalty but at a cost to fluency)
(Chart values: 97.6%, 89.0%, 93.6%, 95.2%, 97.0%, 94.6%; 40.6%, 40.6%, 39.4%, 38.4%, 37.8%, 38.7%.)
Conclusion
● In this talk, we presented a multi-faceted approach to tackling hallucination in text generation models from the perspectives of data, evaluation, and models.
● Some links:
○ ToTTo dataset: https://github.com/google-research-datasets/ToTTo
○ BLEURT: https://github.com/google-research/bleurt
Thank you!