Inference by randomly perturbing max-solvers
Tamir Hazan, University of Haifa

Posted on 15-Oct-2020

Page 1:

Inference by randomly perturbing max-solvers

Tamir Hazan University of Haifa


Page 2:

Inference in machine learning

• 20 years ago: does an image contain a person?

Page 3:

Inference in machine learning

• 10 years ago: which object is in the image?

Page 4:

Inference in machine learning

• Today’s challenge: exponentially many options

• For each pixel: decide if it is foreground or background.

• The space of possible structures is exponential

Page 5:

Inference in machine learning

• Interactive annotation:

Page 6:

Outline

• Random perturbation - why and how?
  - Sampling likely structures as fast as finding the most likely one.

• Connections and alternatives to the Gibbs distribution:
  - the marginal polytope
  - non-MCMC sampling for Gibbs with perturb-max

• Application: interactive annotation.
  - New entropy bounds for perturb-max models.

Page 9:

Inference in machine learning

• Machine learning applications are characterized by:
  - complex structures $y = (y_1, \dots, y_n)$, e.g., $y \in \{0,1\}^n$

Page 12:

Inference in machine learning

• Machine learning applications are characterized by:
  - complex structures $y = (y_1, \dots, y_n)$
  - a potential function that scores these structures (high score vs. low score):

$\theta(y_1, \dots, y_n) = \sum_{i \in V} \theta_i(y_i) + \sum_{i,j \in E} \theta_{i,j}(y_i, y_j)$

• For machine learning we need to efficiently infer from distributions over complex structures.

Page 15:

Gibbs distribution

$p(y_1, \dots, y_n) = \frac{1}{Z} \exp\Big( \sum_i \theta_i(y_i) + \sum_{i,j} \theta_{i,j}(y_i, y_j) \Big)$

• MCMC samplers: Gibbs sampling, Metropolis-Hastings, Swendsen-Wang

• Many efficient sampling algorithms for special cases:
  - Ising models (Jerrum 93)
  - Counting bipartite matchings in planar graphs (Kasteleyn 61)
  - Approximating the permanent (Jerrum 04)
  - Many others…
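To make the Gibbs distribution above concrete, here is a minimal sketch that enumerates a toy three-variable chain and normalizes $\exp(\theta(y))$ by brute force; all potential values are made up for illustration.

```python
import itertools
import math

# Toy three-variable chain matching the slide's potential
# theta(y) = sum_i theta_i(y_i) + sum_{i,j in E} theta_ij(y_i, y_j).
# All numbers below are hypothetical.
theta_node = [[0.5, -0.5], [0.0, 0.2], [-0.3, 0.3]]      # theta_node[i][y_i]
theta_edge = {(0, 1): [[1.0, -1.0], [-1.0, 1.0]],        # smoothness terms
              (1, 2): [[1.0, -1.0], [-1.0, 1.0]]}

def score(y):
    """theta(y): node terms plus edge terms."""
    s = sum(theta_node[i][yi] for i, yi in enumerate(y))
    s += sum(theta_edge[(i, j)][y[i]][y[j]] for (i, j) in theta_edge)
    return s

# Exact Gibbs distribution p(y) = exp(theta(y)) / Z by enumeration.
states = list(itertools.product([0, 1], repeat=3))
Z = sum(math.exp(score(y)) for y in states)
p = {y: math.exp(score(y)) / Z for y in states}
```

Enumeration is only feasible here because there are $2^3$ structures; the whole point of the talk is that real models have exponentially many.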

Page 16:

Sampling likely structures

• Sampling from the Gibbs distribution is provably hard in AI applications (Goldberg 05, Jerrum 93)

$p(y) \propto \exp\Big( \sum_i \theta_i(y_i) + \sum_{i,j} \theta_{i,j}(y_i, y_j) \Big)$

• Data term: $\theta_i(y_i) = \log p(y_i \mid x_i)$, where $x_i$ is the RGB color of pixel $i$

• Smoothness term: $\theta_{i,j}(y_i, y_j) = 1$ if $y_i = y_j$, $-1$ otherwise

Page 20:

Sampling likely structures

• Sampling from the Gibbs distribution is provably hard in AI applications (Goldberg 05, Jerrum 93), where $\theta_i(y_i) = \log p(y_i \mid x_i)$

• Recall: sampling from the Gibbs distribution is easy in Ising models (Jerrum 93), where $\theta_i(y_i) = 0$

• Data terms (signals) that are important in AI applications significantly change the complexity of sampling

Page 23:

Most likely structure

• Instead of sampling, it may be significantly faster to find the most likely structure:

$y^* = \arg\max_{y_1, \dots, y_n} \sum_i \theta_i(y_i) + \sum_{i,j} \theta_{i,j}(y_i, y_j)$

• Maximum a-posteriori (MAP) inference.

• Many efficient optimization algorithms for special cases:
  - Belief propagation: trees (Pearl 88), perfect graphs (Jebara 10)
  - Graph-cuts for image segmentation
  - Branch and bound (Rother 09), branch and cut (Gurobi)
  - Linear programming relaxations (Schlesinger 76, Wainwright 05, Kolmogorov 06, Werner 07, Sontag 08, Hazan 10, Batra 10, Nowozin 10, Pletscher 12, Kappes 13, Savchynskyy 13, Tarlow 13, Kohli 13, Jancsary 13, Schwing 13)
  - CKY for parsing
  - Many others…
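As an illustration of the "efficient special cases" above, here is a minimal max-product (Viterbi-style) dynamic program for MAP on a chain; the instance and its numbers are hypothetical.

```python
import itertools

def map_chain(theta_node, theta_edge):
    """MAP on a chain by max-product dynamic programming (Viterbi).

    theta_node[i][a] scores state a at node i; theta_edge[i][a][b] scores
    the edge (i, i+1). A toy sketch of efficient MAP for a special case;
    runtime is linear in the chain length.
    """
    n = len(theta_node)
    m = [list(theta_node[0])]        # m[i][b] = best score of a prefix ending in b
    back = []
    for i in range(1, n):
        row, bp = [], []
        for b in range(len(theta_node[i])):
            a = max(range(len(theta_node[i - 1])),
                    key=lambda a: m[i - 1][a] + theta_edge[i - 1][a][b])
            row.append(m[i - 1][a] + theta_edge[i - 1][a][b] + theta_node[i][b])
            bp.append(a)
        m.append(row)
        back.append(bp)
    y = [max(range(len(m[-1])), key=lambda b: m[-1][b])]   # trace back the argmax
    for bp in reversed(back):
        y.append(bp[y[-1]])
    return list(reversed(y)), max(m[-1])

# Made-up instance: three binary variables on a chain.
theta_node = [[0.0, 1.0], [0.5, 0.0], [0.0, 0.3]]
theta_edge = [[[0.2, -0.2], [-0.2, 0.2]], [[0.2, -0.2], [-0.2, 0.2]]]
y_map, best = map_chain(theta_node, theta_edge)
```

The same recursion with sums in place of maxes would compute $Z$, which is why tree-structured models are "easy" for both tasks.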

Page 26:

The challenge

Sample from the likely high-dimensional structures (with millions of variables, e.g., image segmentation with 12 million pixels) as efficiently as optimizing.

Page 27:

Most likely structure

• Selecting the maximizing structure is appropriate when one structure (e.g., segmentation / parse) dominates the others

[plot: scores over structures, with a single dominant peak at $y^*$]

$\theta(y) = \sum_i \theta_i(y_i) + \sum_{i,j} \theta_{i,j}(y_i, y_j), \qquad y = (y_1, \dots, y_n)$

Page 31:

Most likely structure

• The maximizing structure is not robust when there are multiple high-scoring alternatives

[plot: scores over structures, several near-maximal alternatives around $y^*$]

• Probabilistic vs. rule-based: rule-based grammars do not generalize well across domains and languages.

Page 36:

Random perturbations

• Randomly perturbing the system reveals its complexity
  - little effect when the maximizing structure is "evident"
  - substantial effect when there are alternative high-scoring structures

[plots: scores over structures before and after perturbation, maximizer $y^*$]

• Related work:
  - McFadden 74 (discrete choice theory)
  - Talagrand 94, Barvinok 07 (canonical processes)

Page 42:

Random perturbations

[plot: scores $\theta(y)$ over structures $y$, maximum $\theta(y^*)$ at $y^*$, perturbation $\gamma(y)$]

• Notation:
  - scores (potential): $\theta(y)$
  - perturbations: $\gamma(y)$
  - perturbed score: $\theta(y) + \gamma(y)$

• For every structure $y$, the perturbation value $\gamma(y)$ is a random variable ($y$ is an index; the traditional notation is $\gamma_y$).

• Perturb-max models: how stable is the maximal structure to random changes in the potential function?

Page 47:

Outline

• Random perturbation - why and how? - Sampling likely structures as fast as finding the most likely one.

• Connections and Alternatives to Gibbs distribution: - the marginal polytope - non-MCMC sampling for Gibbs with perturb-max

• Application: interactive annotation. - New entropy bounds for perturb-max models.

Page 48:

Perturb-max models

• Theorem: Let $\gamma(y)$ be i.i.d. zero-mean Gumbel random variables,

$F(t) \overset{\mathrm{def}}{=} P[\gamma(y) \le t] = \exp(-\exp(-t)), \qquad f(t) = F'(t) = \exp(-t)\, F(t)$

then the perturb-max model is the Gibbs distribution:

$\frac{1}{Z} \exp(\theta(y)) = P_{\gamma \sim \mathrm{Gumbel}}\big[\, y = \arg\max_{\hat{y}} \{ \theta(\hat{y}) + \gamma(\hat{y}) \} \,\big]$
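The theorem can be checked empirically with the Gumbel-max trick: perturb each score with i.i.d. Gumbel noise, take the argmax, and compare the empirical argmax frequencies to the Gibbs probabilities. A sketch with made-up scores (the argmax distribution is unchanged by any common shift of the noise, so the zero-mean convention does not matter here):

```python
import math
import random

def gumbel():
    """Sample gamma with CDF F(t) = exp(-exp(-t)) by inverting the CDF."""
    return -math.log(-math.log(random.random()))

def perturb_max_frequencies(theta, trials=200_000, seed=0):
    """Empirical P[y = argmax {theta(y) + gamma(y)}] with i.i.d. Gumbel noise."""
    random.seed(seed)
    counts = [0] * len(theta)
    for _ in range(trials):
        counts[max(range(len(theta)), key=lambda i: theta[i] + gumbel())] += 1
    return [c / trials for c in counts]

theta = [1.0, 0.0, -1.0]                     # hypothetical scores
Z = sum(math.exp(t) for t in theta)
gibbs = [math.exp(t) / Z for t in theta]     # target Gibbs probabilities
freqs = perturb_max_frequencies(theta)
```

With enough trials the two vectors agree to within Monte Carlo error.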

Page 51:

Perturb-max models

• Why the Gumbel distribution? Because the maximum of Gumbel variables is itself a Gumbel variable.

• Lemma: Let $\gamma(y)$ be i.i.d. Gumbel, $P[\gamma(y) \le t] = F(t) = \exp(-\exp(-t))$. Then $\max_y \{ \theta(y) + \gamma(y) \}$ has a Gumbel distribution whose mean is $\log Z$, where $Z = \sum_y \exp(\theta(y))$.

• Proof:

$P_\gamma\big[ \max_y \{ \theta(y) + \gamma(y) \} \le t \big] = \prod_y F(t - \theta(y)) = \exp\Big( -\sum_y \exp(-(t - \theta(y))) \Big) = F(t - \log Z)$
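A quick Monte Carlo check of this lemma, using the standard Gumbel CDF $F(t)=\exp(-\exp(-t))$ exactly as stated: under that convention the noise itself has mean equal to the Euler-Mascheroni constant, so the mean of the max is $\log Z$ plus that constant (centering the noise, as in the zero-mean convention, leaves exactly $\log Z$). Scores are hypothetical:

```python
import math
import random

def gumbel():
    """Standard Gumbel sample, CDF F(t) = exp(-exp(-t))."""
    return -math.log(-math.log(random.random()))

EULER = 0.5772156649015329   # mean of the standard Gumbel distribution

def expected_max(theta, trials=200_000, seed=1):
    """Monte Carlo estimate of E[max_y {theta(y) + gamma(y)}]."""
    random.seed(seed)
    return sum(max(t + gumbel() for t in theta) for _ in range(trials)) / trials

theta = [0.5, -0.2, 1.3, 0.0]                # hypothetical scores
log_Z = math.log(sum(math.exp(t) for t in theta))
est = expected_max(theta)
# With standard (uncentered) Gumbel noise, E[max] = log Z + EULER.
```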

Page 54:

Perturb-max models

• Max stability:

$\frac{1}{Z} \exp(\theta(y)) = P_{\gamma \sim \mathrm{Gumbel}}\big[\, y = \arg\max_{\hat{y}} \{ \theta(\hat{y}) + \gamma(\hat{y}) \} \,\big]$

• Implications (taking gradients):

$\log\Big( \sum_y \exp(\theta(y)) \Big) = E_{\gamma \sim \mathrm{Gumbel}}\Big[ \max_y \{ \theta(y) + \gamma(y) \} \Big]$

Page 55:

Perturb-max models

• Representing the Gibbs distribution with perturb-max models may require an exponential number of perturbations:

$P_\gamma\big[\, y = \arg\max_{\hat{y}} \{ \theta(\hat{y}) + \gamma(\hat{y}) \} \,\big], \qquad y = (y_1, \dots, y_n)$

• Instead, use low-dimensional perturbations:

$P_\gamma\Big[\, y = \arg\max_{\hat{y}} \Big\{ \theta(\hat{y}) + \sum_{i=1}^n \gamma_i(\hat{y}_i) \Big\} \,\Big]$
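A sketch of the low-dimensional variant on a tiny pairwise model with made-up potentials: one Gumbel per (variable, state) pair, i.e., 6 perturbations instead of one per each of the $2^3$ structures, with the argmax found by brute force (a real application would call a max-solver here):

```python
import itertools
import math
import random

def gumbel():
    return -math.log(-math.log(random.random()))

# Tiny pairwise model with hypothetical potentials, states in {0,1}^3.
theta_node = [[0.3, -0.3], [0.0, 0.4], [-0.2, 0.2]]
theta_edge = {(0, 1): [[0.8, -0.8], [-0.8, 0.8]],
              (1, 2): [[0.8, -0.8], [-0.8, 0.8]]}
states = list(itertools.product([0, 1], repeat=3))

def score(y):
    return (sum(theta_node[i][yi] for i, yi in enumerate(y))
            + sum(theta_edge[(i, j)][y[i]][y[j]] for (i, j) in theta_edge))

base = {y: score(y) for y in states}

def low_dim_perturb_max(trials=100_000, seed=2):
    """P[y = argmax {theta(y) + sum_i gamma_i(y_i)}]: one i.i.d. Gumbel per
    (variable, state) pair -- 6 perturbations instead of 2^3 = 8."""
    random.seed(seed)
    counts = {y: 0 for y in states}
    for _ in range(trials):
        g = [[gumbel(), gumbel()] for _ in range(3)]
        best = max(states,
                   key=lambda y: base[y] + g[0][y[0]] + g[1][y[1]] + g[2][y[2]])
        counts[best] += 1
    return {y: c / trials for y, c in counts.items()}

p_low = low_dim_perturb_max()
```

With low-dimensional perturbations the resulting distribution is no longer exactly Gibbs, which is the trade-off the following slides examine.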

Page 60:

Outline

• Random perturbation - why and how? - Sampling likely structures as fast as finding the most likely one.

• Connections and Alternatives to Gibbs distribution: - the marginal polytope - non-MCMC sampling for Gibbs with perturb-max

• Application: interactive annotation. - New entropy bounds for perturb-max models.

Page 61:

The marginal polytope

$\theta(y_1, \dots, y_n) = \sum_{i \in V} \theta_i(y_i) + \sum_{i,j \in E} \theta_{i,j}(y_i, y_j)$

• Example: a chain $y_1 - y_2 - y_3$ of binary variables with local potentials $\theta_1(0), \theta_1(1), \dots, \theta_3(0), \theta_3(1)$ and pairwise potentials $\theta_{1,2}(0,0), \dots, \theta_{2,3}(1,1)$.

• A vector $\mu$ collects the node and edge marginals:

$\mu = \big( \mu_1(0), \mu_1(1), \mu_2(0), \mu_2(1), \mu_3(0), \mu_3(1), \mu_{1,2}(0,0), \mu_{1,2}(0,1), \mu_{1,2}(1,0), \mu_{1,2}(1,1), \mu_{2,3}(0,0), \mu_{2,3}(0,1), \mu_{2,3}(1,0), \mu_{2,3}(1,1) \big)$

• $\mu$ lies in the marginal polytope $M$ iff there exists $p(y_1, y_2, y_3)$ such that

$\mu_1(y_1) = \sum_{y_2, y_3} p(y_1, y_2, y_3), \ \dots \qquad \mu_{1,2}(y_1, y_2) = \sum_{y_3} p(y_1, y_2, y_3), \ \dots$

Page 68:

The marginal polytope $M$

• Gibbs distribution (minimal):

$p(y) \propto \exp\Big( \sum_i \theta_i(y_i) + \sum_{i,j} \theta_{i,j}(y_i, y_j) \Big)$

• Perturb-max model (minimal):

$p(y) = P_\gamma\Big[\, y = \arg\max_{\hat{y}} \Big\{ \sum_i \theta_i(\hat{y}_i) + \sum_{i,j} \theta_{i,j}(\hat{y}_i, \hat{y}_j) + \sum_i \gamma_i(\hat{y}_i) \Big\} \,\Big]$

Page 75:

The marginal polytope $M$

$p(y) = P_\gamma\Big[\, y = \arg\max_{\hat{y}} \Big\{ \sum_i \theta_i(\hat{y}_i) + \sum_{i,j} \theta_{i,j}(\hat{y}_i, \hat{y}_j) + \sum_i \gamma_i(\hat{y}_i) \Big\} \,\Big]$ (minimal)

$\mu = \big( \mu_1(0), \mu_1(1), \mu_2(0), \mu_2(1), \mu_3(0), \mu_3(1), \mu_{1,2}(0,0), \dots, \mu_{2,3}(1,1) \big)$

• Proof idea: the marginals are the gradients of the expected perturbed max-value,

$\mu_i(y_i) = \frac{\partial\, E_\gamma\big[ \max_{\hat{y}} \{ \sum_i \theta_i(\hat{y}_i) + \sum_{i,j} \theta_{i,j}(\hat{y}_i, \hat{y}_j) + \sum_i \gamma_i(\hat{y}_i) \} \big]}{\partial \theta_i(y_i)}$

$\mu_{i,j}(y_i, y_j) = \frac{\partial\, E_\gamma\big[ \max_{\hat{y}} \{ \sum_i \theta_i(\hat{y}_i) + \sum_{i,j} \theta_{i,j}(\hat{y}_i, \hat{y}_j) + \sum_i \gamma_i(\hat{y}_i) \} \big]}{\partial \theta_{i,j}(y_i, y_j)}$
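The proof idea can be sanity-checked numerically: with common random draws, a finite difference of $E_\gamma[\max \dots]$ in $\theta_1(0)$ should match the frequency with which the perturbed argmax sets $y_1 = 0$. A sketch on a hypothetical two-variable model:

```python
import itertools
import math
import random

def gumbel(rng):
    """Standard Gumbel sample via CDF inversion, F(t) = exp(-exp(-t))."""
    return -math.log(-math.log(rng.random()))

# Hypothetical two-variable pairwise model with binary states.
theta_node = [[0.4, -0.4], [0.1, -0.1]]
theta_edge = [[0.6, -0.6], [-0.6, 0.6]]
states = list(itertools.product([0, 1], repeat=2))

def perturbed_score(th, g, y):
    a, b = y
    return th[0][a] + th[1][b] + theta_edge[a][b] + g[0][a] + g[1][b]

def check(trials=50_000, eps=1e-4, seed=3):
    """Estimate both sides of mu_1(0) = dE[max]/d theta_1(0) with the
    same Gumbel draws: a finite difference of the expected max-value
    vs. the frequency of y_1 = 0 at the perturbed argmax."""
    rng = random.Random(seed)
    bumped = [[theta_node[0][0] + eps, theta_node[0][1]], theta_node[1]]
    diff_sum, freq = 0.0, 0
    for _ in range(trials):
        g = [[gumbel(rng), gumbel(rng)] for _ in range(2)]
        m0 = max(perturbed_score(theta_node, g, y) for y in states)
        m1 = max(perturbed_score(bumped, g, y) for y in states)
        diff_sum += (m1 - m0) / eps
        best = max(states, key=lambda y: perturbed_score(theta_node, g, y))
        freq += (best[0] == 0)
    return diff_sum / trials, freq / trials

grad_est, marginal_est = check()
```

Per draw, bumping $\theta_1(0)$ raises the max by exactly $\epsilon$ when the argmax has $y_1 = 0$ and by nothing otherwise, so the two estimates agree almost exactly.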

Page 79:

Outline

• Random perturbation - why and how? - Sampling likely structures as fast as finding the most likely one.

• Connections and Alternatives to Gibbs distribution: - the marginal polytope - non-MCMC sampling for Gibbs with perturb-max

• Application: interactive annotation. - New entropy bounds for perturb-max models.

Page 80:

Non-MCMC sampling

• Perturb-max provably approximates the Gibbs marginals on trees [H., Maji, Jaakkola 13], and does so in practice on general graphs.

• Perturb-max samples from tree-shaped Gibbs distributions [Gane, H., Jaakkola 14].

Page 81:

Outline

• Random perturbation - why and how? - Sampling likely structures as fast as finding the most likely one.

• Connections and Alternatives to Gibbs distribution: - the marginal polytope - non-MCMC sampling for Gibbs distributions with perturb-max

• Application: interactive annotation. - New entropy bounds for perturb-max models.

Page 82:

Image annotation

• Image annotation is a time-consuming (and tedious) task. Can computers do it for us?

Page 83

Image annotation

• Why not use the most likely annotation instead?

Page 84

Image annotation

• Why not use the most likely annotation instead?

• The most likely annotation is inaccurate around:
  - "thin" areas

Page 85

Image annotation

• Why not use the most likely annotation instead?

• The most likely annotation is inaccurate around:
  - "thin" areas
  - clutter

Page 86

Interactive image annotation

• Perturb-max models expose the decision boundary.

Page 87

Interactive image annotation

• Perturb-max models expose the decision boundary.

Page 88

Interactive image annotation

• Perturb-max models expose the decision boundary.

• Interactive annotation directs the human annotator to areas of uncertainty, significantly reducing annotation time [Maji, H., Jaakkola 14].
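One way to realize "directing the annotator to areas of uncertainty" is to draw several perturb-max annotations and rank pixels by how much the samples disagree. The helper names below are hypothetical, and random binary maps stand in for real perturb-max samples:

```python
import numpy as np

def uncertainty_map(samples):
    """Per-pixel disagreement across {0,1} annotation samples (Bernoulli variance)."""
    freq = np.mean(samples, axis=0)    # empirical P[pixel = foreground]
    return freq * (1.0 - freq)         # 0 when all samples agree, 0.25 at a 50/50 split

def query_pixels(samples, k):
    """Indices of the k most uncertain pixels, to show the annotator first."""
    u = uncertainty_map(samples).ravel()
    return np.argsort(u)[::-1][:k]

# Stand-in for real perturb-max samples: 20 sampled 8x8 binary annotations.
rng = np.random.default_rng(2)
samples = (rng.random((20, 8, 8)) < 0.5).astype(int)
print(query_pixels(samples, 5))
```

In an interactive loop, the annotator's answers for these pixels would be clamped into the model before the next round of perturb-max samples is drawn.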

Page 89

Uncertainty

• Entropy:
  H(p_\theta) = -\sum_y p_\theta(y) \log p_\theta(y)

• Entropy = uncertainty:
  - It is a nonnegative function over probability distributions.
  - It attains its maximal value at the uniform distribution.
  - It attains its minimal value at a zero-one (point-mass) distribution.

• Computing the entropy requires summing over exponentially many configurations y = (y_1, ..., y_n).

• Can we bound it with the perturb-max approach?

Page 90

Uncertainty

• Perturb-max models:
  p_\theta(y) \stackrel{def}{=} P_\gamma\big[\, y = \arg\max_y\{\theta(y) + \sum_{i=1}^n \gamma_i(y_i)\}\,\big]

• Entropy:
  H(p_\theta) = -\sum_y p_\theta(y) \log p_\theta(y)

• Entropy bound:
  H(p_\theta) \le \mathbb{E}_\gamma\big[\sum_{i=1}^n \gamma_i(y^*_i)\big], \qquad y^* = \arg\max_y\{\theta(y) + \sum_{i=1}^n \gamma_i(y_i)\}

Page 91

Uncertainty

• How does it compare to standard entropy bounds?

• Perturb-max entropy bound:
  H(p_\theta) \le \mathbb{E}\big[\sum_i \gamma_i(y^*_i)\big] = \sum_i \mathbb{E}\big[\gamma_i(y^*_i)\big]

• Standard entropy independence bound:
  H(p_\theta) \le \sum_i H(p_\theta(y_i)), \qquad p_\theta(y_i) = P_\gamma[\, y_i = y^*_i\,], \quad y^* = \arg\max_y\{\theta(y) + \sum_{i=1}^n \gamma_i(y_i)\}

• The perturb-max entropy bound requires fewer samples, since the tail of the sampled average decreases exponentially [Orabona, H., Sarwate, Jaakkola 14].
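Both bounds can be checked against each other on a toy model; a sketch with zero-mean unary Gumbel perturbations (hypothetical 3-variable chain; exhaustive maximization stands in for a max-solver, and the perturb-max model's entropy is estimated from the same samples):

```python
import itertools
import numpy as np

rng = np.random.default_rng(3)
n = 3
theta_i = rng.normal(size=(n, 2))            # unary potentials
theta_ij = rng.normal(size=(n - 1, 2, 2))    # chain pairwise potentials

def potential(y):
    s = sum(theta_i[i, y[i]] for i in range(n))
    return s + sum(theta_ij[i, y[i], y[i + 1]] for i in range(n - 1))

configs = list(itertools.product([0, 1], repeat=n))
pots = {y: potential(y) for y in configs}

# Estimate the perturb-max distribution and the bound E[sum_i gamma_i(y*_i)].
counts = {y: 0 for y in configs}
bound = 0.0
num_samples = 20000
for _ in range(num_samples):
    gamma = rng.gumbel(loc=-np.euler_gamma, size=(n, 2))   # zero-mean Gumbel
    y_star = max(configs, key=lambda y: pots[y] + sum(gamma[i, y[i]] for i in range(n)))
    counts[y_star] += 1
    bound += sum(gamma[i, y_star[i]] for i in range(n))
bound /= num_samples

p = np.array([counts[y] / num_samples for y in configs])
H_perturb = -np.sum(p[p > 0] * np.log(p[p > 0]))           # entropy of the perturb-max model

# Independence bound: sum of single-variable marginal entropies.
marg = np.zeros((n, 2))
for pi, y in zip(p, configs):
    for i in range(n):
        marg[i, y[i]] += pi
H_indep = -np.sum(marg[marg > 0] * np.log(marg[marg > 0]))

print(H_perturb, H_indep, bound)  # the entropy sits below both bounds
```

Here `bound` is the perturb-max bound E[\sum_i \gamma_i(y^*_i)] estimated by Monte Carlo, and `H_indep` is the classical subadditivity bound.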

Page 92

Perturb-max entropy bounds

• Spin glass, 10x10 grid:
  \sum_i \theta_i(y_i) + \sum_{i,j} \theta_{i,j}(y_i, y_j)

[Plot: entropy estimate vs. coupling strength \lambda on a 3x3 grid, with energy E = \sum_i \theta_i y_i - \sum_{i,j} \lambda \theta_{ij} y_i y_j; curves: Entropy, Marginal entropy, Perturb entropy.]

Page 93

Interactive image annotation

[Figure: initial boundary vs. final boundary; current vs. refined annotations.]

Page 94

Outline

• Random perturbation - why and how?
  - Sampling likely structures as fast as finding the most likely one.

• Connections and alternatives to the Gibbs distribution:
  - the marginal polytope
  - non-MCMC sampling for Gibbs distributions with perturb-max

• Application: interactive annotation.
  - New entropy bounds for perturb-max models.

Page 95

Open problems

• Perturb-max models:
  - When does fixing variables in the max-function amount to statistical conditioning?
  - When do perturb-max models preserve the most likely assignment?
  - How does the dimension of the perturbations affect the model's properties?
  - In what ways do higher-dimensional perturbations reveal complex structures in the model?
  - How to apply perturbations in restricted spaces, e.g., super-modular potential functions?
  - How to encourage diverse sampling?
  - Perturb-max models stabilize the prediction. Do they connect computational and statistical stability?

Page 96

Thank you

Page 97

Uncertainty*

• Theorem: let
  y^* = \arg\max_y \{\theta(y) + \sum_{i=1}^n \gamma_i(y_i)\}.
  Then
  H(p_\theta) \le \mathbb{E}_\gamma\big[\sum_{i=1}^n \gamma_i(y^*_i)\big]

• Proof idea: conjugate duality,
  H(p) = \min_{\hat\theta} \big\{ \log Z(\hat\theta) - \sum_y \hat\theta(y) p(y) \big\}

• Upper-bound the log-partition function by low-dimensional perturbations:
  \log Z(\hat\theta) \le \mathbb{E}_\gamma\big[ \max_y \{\hat\theta(y) + \sum_i \gamma_i(y_i)\} \big]

Page 100

The flashback slide

• Max stability:
  \log\big(\sum_y \exp(\theta(y))\big) = \mathbb{E}_{\gamma \sim Gumbel}\big[\max_y \{\theta(y) + \gamma(y)\}\big]

Page 101

Uncertainty*

• Plugging the log-partition bound into the duality:
  H(p) \le \min_{\hat\theta} \big\{ \mathbb{E}_\gamma\big[\max_y \{\hat\theta(y) + \sum_i \gamma_i(y_i)\}\big] - \sum_y \hat\theta(y) p(y) \big\}

• Evaluating at p = p_\theta and \hat\theta = \theta:
  H(p_\theta) \le \mathbb{E}_\gamma\big[\max_y \{\theta(y) + \sum_i \gamma_i(y_i)\}\big] - \sum_y \theta(y) p_\theta(y)

• Since p_\theta(y) = P_\gamma[y = y^*], the subtracted sum equals \mathbb{E}_\gamma[\theta(y^*)], so the right-hand side reduces to \mathbb{E}_\gamma\big[\sum_i \gamma_i(y^*_i)\big], which proves the theorem.
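The max-stability identity behind the log-partition bound can be verified numerically; note that it holds for zero-mean Gumbel perturbations, so numpy's default Gumbel (whose mean is Euler's constant, about 0.577) is shifted accordingly in this sketch:

```python
import numpy as np

rng = np.random.default_rng(4)
theta = rng.normal(size=8)   # theta(y) over 8 configurations y

# Left-hand side: the log-partition function.
log_z = np.log(np.sum(np.exp(theta)))

# Right-hand side: E_gamma[max_y {theta(y) + gamma(y)}] with one independent
# zero-mean Gumbel variable per configuration (the full, high-dimensional perturbation).
num_samples = 200000
gammas = rng.gumbel(loc=-np.euler_gamma, size=(num_samples, theta.size))
rhs = np.max(theta + gammas, axis=1).mean()

print(log_z, rhs)  # equal up to Monte Carlo error
```

With only unary perturbations \gamma_i(y_i), the right-hand side becomes an upper bound rather than an equality, which is exactly the inequality used in the entropy proof.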