Adversarial Contrastive Estimation (ACL 2018)
Avishek (Joey) Bose*, Huan Ling*, Yanshuai Cao* | Borealis AI, University of Toronto | August 2, 2018



Page 2: Contrastive Estimation

Many machine learning models learn by trying to separate positive examples from negative examples.

Positive examples are taken from the observed real data distribution (training set)

Negative examples are any other configurations that are not observed

Data comes in the form of tuples or triplets; (x+, y+) and (x+, y−) are positive and negative data points respectively.


Page 3: Easy Negative Examples with NCE

Noise Contrastive Estimation samples negatives by taking p(y−|x+) to be some unconditional pnce(y). What's wrong with this?

The negative y− in (x+, y−) is not tailored toward x+

It is difficult to choose hard negatives as training progresses

The model doesn't learn discriminating features between positive and hard negative examples

NCE negatives are easy !!!
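The point above can be seen in a minimal sketch of NCE-style negative sampling, using a hypothetical toy vocabulary; the noise distribution pnce(y) here is the unigram distribution over the corpus, and crucially it never consults x+:

```python
import random

# Hypothetical toy vocabulary with unigram counts; p_nce(y) is the
# unconditional unigram distribution, independent of the query x+.
counts = {"cat": 10, "dog": 8, "sat": 6, "quantum": 1}
total = sum(counts.values())
p_nce = {w: c / total for w, c in counts.items()}

def sample_nce_negatives(x_pos, k, seed=0):
    """Draw k negatives from p_nce(y). Note that x_pos is ignored:
    the negatives are not tailored to the positive example at all."""
    rng = random.Random(seed)
    words = list(p_nce)
    weights = [p_nce[w] for w in words]
    return rng.choices(words, weights=weights, k=k)

negatives = sample_nce_negatives(("cat", "sat"), k=5)
```

Because the sampler ignores its first argument, frequent but irrelevant words dominate the negatives, which is exactly why they are "easy".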


Page 4: Hard Negative Examples

Informal Definition: Hard negative examples are data points that are extremely difficult for the training model to distinguish from positive examples.

Hard negatives result in higher losses and thus more informative gradients

They are not necessarily closest to a positive data point in embedding space


Page 5: Technical Contributions

Adversarial Contrastive Estimation: a general technique for hard negative mining using a Conditional-GAN-like setup.

A novel entropy regularizer that prevents generator mode collapse and has good empirical benefits

A strategy for handling false negative examples that allows training to progress

Empirical validation across 3 different embedding tasks with state-of-the-art results on some metrics


Page 6: Adversarial Contrastive Estimation

Problem: We want to generate negatives that ... "fool" a discriminative model into misclassifying them.

Solution: Use a Conditional GAN to sample hard negatives given x+. We can augment NCE with an adversarial sampler: λ pnce(y) + (1 − λ) gθ(y|x).
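The augmented sampler can be sketched as follows, with both sampler callables as hypothetical stand-ins: with probability λ draw from the fixed NCE noise distribution, otherwise from the conditional adversarial generator.

```python
import random

def sample_negative(x_pos, nce_sampler, generator_sampler, lam, seed=0):
    """Sample y- from the mixture lam * p_nce(y) + (1 - lam) * g_theta(y|x).
    nce_sampler ignores x_pos; generator_sampler conditions on it."""
    rng = random.Random(seed)
    if rng.random() < lam:
        return nce_sampler()            # unconditional noise sample
    return generator_sampler(x_pos)     # hard negative conditioned on x+

# Usage with dummy samplers standing in for p_nce and g_theta:
nce = lambda: "easy_negative"
gen = lambda x: "hard_negative_for_" + x
```

Keeping a λ-weighted NCE component preserves coverage of the candidate space even if the generator temporarily concentrates on a few negatives.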


Page 7: Conditional GAN


Page 8: Adversarial Contrastive Estimation


Page 9: The ACE Generator

The ACE generator defines a categorical distribution over all possible y− values

Picking a negative example is a discrete choice and not differentiable

The simplest way to train via policy gradients is the REINFORCE gradient estimator

Learning is done via a GAN-style min-max game:

minω maxθ V(ω, θ) = minω maxθ Ep+(x) L(ω, θ; x) (1)
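As a numerical sanity check on the REINFORCE estimator mentioned above, here is a minimal sketch on a toy categorical generator over 3 candidate negatives; the logits and rewards are made-up illustration values, not from the paper.

```python
import math
import random

logits = [0.5, -0.2, 0.1]    # toy generator logits over 3 candidates
rewards = [1.0, 0.0, 0.5]    # toy discriminator-derived reward per candidate

def softmax(zs):
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    s = sum(exps)
    return [e / s for e in exps]

def reinforce_grad(n_samples, seed=0):
    """Monte-Carlo estimate of d E[R] / d logits: sample i ~ softmax(logits),
    then accumulate R_i * d log p(i)/d logit_j = R_i * (1[i == j] - p_j)."""
    rng = random.Random(seed)
    p = softmax(logits)
    grad = [0.0] * len(logits)
    for _ in range(n_samples):
        i = rng.choices(range(len(p)), weights=p)[0]
        for j in range(len(p)):
            grad[j] += rewards[i] * ((1.0 if i == j else 0.0) - p[j])
    return [g / n_samples for g in grad]

# Closed form for comparison: d E[R] / d logit_j = p_j * (R_j - E[R]).
p = softmax(logits)
expected_reward = sum(pi * ri for pi, ri in zip(p, rewards))
exact = [pi * (ri - expected_reward) for pi, ri in zip(p, rewards)]
est = reinforce_grad(100000)
```

With enough samples the score-function estimate converges to the closed-form gradient, confirming the estimator is unbiased (though, as a later slide notes, high-variance).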


Page 10: Technical Contributions for effective training

Problem: GAN training can suffer from mode collapse. What happens if the generator collapses on its favorite few negative examples?

Solution: Add an entropy regularizer term to the generator's loss:

Rent(x) = max(0, c − H(gθ(y|x))) (2)

H(gθ(y|x)) is the entropy of the categorical distribution

c = log(k) is the entropy of a uniform distribution over k choices
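Equation (2) transcribes directly into code, assuming gθ(y|x) is given as a list of categorical probabilities over k candidates:

```python
import math

def entropy_regularizer(probs):
    """R_ent(x) = max(0, c - H(g_theta(y|x))) with c = log(k): zero while the
    generator is as spread out as uniform, and growing as the distribution
    collapses onto a few favorite negatives."""
    h = -sum(p * math.log(p) for p in probs if p > 0)
    c = math.log(len(probs))
    return max(0.0, c - h)
```

A uniform distribution incurs no penalty, while a near-collapsed one is penalized by roughly c − H, pushing the generator back toward diversity.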


Page 11: Technical Contributions for effective training

Problem: The Generator can sample false negatives → gradient cancellation

Solution: Apply an additional filtering technique, whenever computationally feasible:

1. Maintain an in-memory hash map of the training data; the Discriminator uses it to filter out false negatives

2. The Generator receives a penalty for producing a false negative

3. The Entropy Regularizer spreads out the probability mass
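Steps 1 and 2 above can be sketched with a hypothetical in-memory set of observed training pairs:

```python
# Hypothetical training set held in memory as a hash set (step 1).
training_pairs = {("cat", "sat"), ("dog", "ran")}

def filter_false_negatives(x_pos, candidates):
    """Split generated candidates into true and false negatives: any candidate
    that forms an observed training pair with x_pos is a false negative, is
    filtered out by the Discriminator, and would earn the Generator a penalty."""
    kept, false_negs = [], []
    for y in candidates:
        (false_negs if (x_pos, y) in training_pairs else kept).append(y)
    return kept, false_negs

kept, false_negs = filter_false_negatives("cat", ["sat", "ran", "flew"])
```

Set membership is O(1) on average, so the filter adds negligible cost when the training data fits in memory.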


Page 14: Technical Contributions for effective training

Problem: REINFORCE is known to have extremely high variance.

Solution: Reduce variance using the self-critical baseline. Other baselines and gradient estimators are also good options.


Page 15: Technical Contributions for effective training

Problem: The generator is not learning from the NCE samples.

Solution: Use importance sampling. The generator can leverage NCE samples for exploration in an off-policy scheme; the modified reward is weighted by gθ(y−|x)/pnce(y−).
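The reweighting is a one-liner; the probabilities below are made-up illustration values.

```python
def importance_weighted_reward(reward, g_prob, nce_prob):
    """Off-policy correction: y- was drawn from p_nce rather than from
    g_theta, so its reward is reweighted by g_theta(y-|x) / p_nce(y-) before
    being used to update the generator."""
    return reward * (g_prob / nce_prob)

w = importance_weighted_reward(1.0, g_prob=0.3, nce_prob=0.1)
```

Samples the generator would itself have favored are up-weighted, so the NCE exploration samples still produce useful, unbiased generator updates.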


Page 16: Related Work


Page 17: Contemporary Work

GANs for NLP that are close to our work:

MaskGAN, Fedus et al. 2018

Incorporating GAN for Negative Sampling in Knowledge Representation Learning, Wang et al. 2018

KBGAN, Cai and Wang 2017


Page 18: Example: Knowledge Graph Embeddings

Data comes in the form of triplets (head entity, relation, tail entity). For example: {United States of America, partially contained by ocean, Pacific}

Basic Idea: The embeddings for h, r, t should roughly satisfy h + r ≈ t

Link Prediction: The goal is to learn from observed positive entity relations and predict missing links.
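The h + r ≈ t intuition can be illustrated with a hypothetical TransE-style scoring function; the 2-d embeddings below are made-up toy values.

```python
def transe_score(h, r, t):
    """Energy ||h + r - t||_1: a well-fit (true) triplet scores near zero,
    while a corrupted triplet scores higher."""
    return sum(abs(hi + ri - ti) for hi, ri, ti in zip(h, r, t))

good = transe_score([1.0, 0.0], [0.0, 1.0], [1.0, 1.0])   # h + r == t
bad = transe_score([1.0, 0.0], [0.0, 1.0], [3.0, -2.0])   # corrupted tail
```

Link prediction then amounts to ranking candidate heads or tails by this energy and predicting the lowest-scoring ones.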


Page 19: ACE for Knowledge Graph Embeddings

Positive Triplet: ξ+ = (h+, r+, t+)

Negative Triplet: either a negative head or a negative tail is sampled, i.e. ξ− = (h−, r+, t+) or ξ− = (h+, r+, t−)

Loss Function:
L = max(0, η + sω(ξ+) − sω(ξ−)) (3)

ACE Generator: gθ(t−|r+, h+) or gθ(h−|r+, t+), parametrized by a feedforward neural net.
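Equation (3) in code, treating sω as an energy where lower is better for true triplets:

```python
def hinge_loss(score_pos, score_neg, eta=1.0):
    """L = max(0, eta + s(xi+) - s(xi-)): the positive triplet's energy must
    beat the negative's by the margin eta. Hard negatives (low s(xi-))
    therefore produce larger losses and stronger gradients."""
    return max(0.0, eta + score_pos - score_neg)
```

This is exactly why hard negative mining matters here: easy negatives push the loss to zero and contribute nothing, while hard ones keep the margin term active.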


Page 20: ACE for Knowledge Graph Embeddings


Page 21: Experimental Result: Ablation Study


Page 22: ACE for Order Embeddings

Hypernym Prediction: A hypernym pair is a pair of concepts where the first concept is a specialization or an instance of the second.

The goal is to learn embeddings that are hierarchy preserving. The root node is at the origin, and all other embeddings lie in the positive semi-space.

The constraint enforces the magnitude of the parent's embedding to be smaller than the child's in every dimension.

Sibling nodes are not subject to this constraint.
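The hierarchy constraint above can be turned into a penalty, sketched here with made-up 2-d embeddings:

```python
def order_violation(parent, child):
    """Order-embedding penalty ||max(0, parent - child)||^2: zero exactly
    when every coordinate of the parent is <= the child's, i.e. when the
    hierarchy constraint is satisfied."""
    return sum(max(0.0, p - c) ** 2 for p, c in zip(parent, child))
```

Siblings can violate each other's coordinates freely because the penalty is only ever applied along parent-child pairs.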


Page 23: Order Embeddings (Vendrov et al. 2016)

Image taken from Vendrov et al. 2016


Page 24: ACE for Order Embeddings


Page 25: ACE for Word Embeddings: WordSim353


Page 26: ACE for Word Embeddings: Stanford Rare Word


Page 27: Discriminator Loss on NCE vs. Adversarial Examples


Page 28: Nearest Neighbors for NCE vs. ACE


Page 29: Questions?

Blog post: http://borealisai.com/2018/07/13/adversarial-contrastive-estimation-harder-better-faster-stronger/
