
Recent Advances in Hilbert Space Representation of Probability Distributions

Krikamol Muandet
Max Planck Institute for Intelligent Systems
Tübingen, Germany

RegML 2020, Genova, Italy
July 1, 2020


Reference

Kernel Mean Embedding of Distributions: A Review and Beyond. KM, K. Fukumizu, B. Sriperumbudur, and B. Schölkopf. Foundations and Trends in Machine Learning, 2017.


Kernel Methods

From Points to Probability Measures

Embedding of Marginal Distributions

Embedding of Conditional Distributions

Recent Development


Classification Problem

[Figure: scatter plot of the training data, labeled +1 / −1, in the input space (x1, x2).]


Feature Map

φ : (x1, x2) ↦ (x1², x2², √2 x1x2)

[Figure: the labeled data in the input space (x1, x2) and, after applying φ, in the feature space (φ1, φ2, φ3).]

〈φ(x), φ(z)〉R³ = x1²z1² + x2²z2² + 2(x1x2)(z1z2) = (x1z1 + x2z2)² = (x · z)²

Question: How do we generalize the idea of an implicit feature map?


https://xkcd.com/655/

Recipe for ML Problems

1. Collect a data set D = {x1, x2, . . . , xn}.
2. Specify or learn a feature map φ : X → H.
3. Apply the feature map: Dφ = {φ(x1), φ(x2), . . . , φ(xn)}.
4. Solve the (easier) problem in the feature space H using Dφ.

A minimal code sketch of this recipe follows below.
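To make the recipe concrete, here is a minimal Python sketch (not part of the original slides) that follows the four steps with the explicit quadratic feature map from the previous slide; the toy two-ring data set and the ridge-regularized linear classifier are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# 1. Collect a data set D = {x_1, ..., x_n}: two concentric rings,
#    which are not linearly separable in the input space.
n = 200
radius = np.where(rng.random(n) < 0.5, 0.5, 1.0)
angle = rng.uniform(0.0, 2.0 * np.pi, size=n)
X = np.c_[radius * np.cos(angle), radius * np.sin(angle)]
X += 0.05 * rng.standard_normal(X.shape)
y = np.where(radius < 0.75, 1.0, -1.0)          # inner ring -> +1, outer ring -> -1

# 2. Specify a feature map phi : (x1, x2) -> (x1^2, x2^2, sqrt(2) x1 x2).
def phi(X):
    return np.c_[X[:, 0] ** 2, X[:, 1] ** 2, np.sqrt(2.0) * X[:, 0] * X[:, 1]]

# 3. Apply the feature map (plus a constant bias feature).
Phi = np.c_[phi(X), np.ones(n)]

# 4. Solve the (easier) problem in feature space: a ridge-regularized
#    linear classifier on the +/-1 labels.
lam = 1e-3
w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(Phi.shape[1]), Phi.T @ y)
print("training accuracy:", np.mean(np.sign(Phi @ w) == y))
```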


Representation Learning

Perceptron¹: f (x) = w⊤x + b

Explicit Representation
f (x) = w2⊤ σ(w1⊤ x + b1) + b2

Implicit Representation
x ↦ k(x, ·),   f (x) = w⊤φ(x) + b

¹ Rosenblatt 1958; Minsky and Papert 1969


Kernels

A function k : X × X → R is called a kernel on X if there exists a Hilbert space H and a map φ : X → H such that for all x, x′ ∈ X we have

k(x, x′) = 〈φ(x), φ(x′)〉H

We call φ a feature map and H a feature space associated with k.

Example

1. k(x, x′) = (x · x′)² for x, x′ ∈ R²
   - φ(x) = (x1², x2², √2 x1x2)
   - H = R³

2. k(x, x′) = (x · x′ + c)^m for x, x′ ∈ R^d
   - dim(H) = (d+m choose m)

3. k(x, x′) = exp(−γ‖x − x′‖²)
   - H = R^∞ (infinite-dimensional)

[Diagram: x, x′ are mapped to φ(x), φ(x′); their inner product 〈·, ·〉 gives k(x, x′).]

A short sketch below checks example 1 numerically.
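A short numerical check of example 1, assuming nothing beyond NumPy: the implicit kernel evaluation (x · z)² and the explicit inner product 〈φ(x), φ(z)〉 agree. The RBF kernel appears only as a function, since its feature space cannot be written out explicitly.

```python
import numpy as np

def phi(x):
    # Explicit feature map for k(x, x') = (x . x')^2 on R^2.
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2.0) * x[0] * x[1]])

def poly2_kernel(x, z):
    return float(np.dot(x, z)) ** 2

def rbf_kernel(x, z, gamma=1.0):
    # Gaussian RBF kernel; its feature space is infinite-dimensional.
    return float(np.exp(-gamma * np.sum((x - z) ** 2)))

rng = np.random.default_rng(0)
x, z = rng.standard_normal(2), rng.standard_normal(2)

# Implicit and explicit computations agree up to floating-point error.
print(poly2_kernel(x, z), float(phi(x) @ phi(z)))
print(rbf_kernel(x, z))
```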


Positive Definite Kernels

A function k : X × X → R is called positive definite if, for all n ∈ N, α1, . . . , αn ∈ R and all x1, . . . , xn ∈ X , we have

α⊤Kα = ∑_{i=1}^n ∑_{j=1}^n αi αj k(xj , xi ) ≥ 0,   K := [k(xi , xj )]_{i,j=1}^n

Equivalently, the Gram matrix K is positive definite.

Any explicit kernel is positive definite
For any kernel k(x , x ′) := 〈φ(x), φ(x ′)〉H,

∑_{i=1}^n ∑_{j=1}^n αi αj k(xj , xi ) = 〈 ∑_{i=1}^n αi φ(xi ), ∑_{j=1}^n αj φ(xj ) 〉H ≥ 0.

Positive definiteness is a necessary (and sufficient) condition for k to be a kernel, i.e., to admit a feature-space representation. (A numerical check on a Gram matrix follows below.)
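The definition can be checked numerically on a sample: the Gram matrix of a valid kernel has no (significantly) negative eigenvalues, and α⊤Kα is non-negative for any α. A minimal sketch with a Gaussian RBF kernel (the data and bandwidth are illustrative choices):

```python
import numpy as np

def rbf_gram(X, gamma=0.5):
    # Gram matrix K_ij = exp(-gamma * ||x_i - x_j||^2).
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-gamma * sq_dists)

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))
K = rbf_gram(X)

# alpha^T K alpha >= 0 for all alpha <=> K has no negative eigenvalues
# (up to numerical round-off).
print("smallest eigenvalue:", np.linalg.eigvalsh(K).min())
alpha = rng.standard_normal(50)
print("alpha^T K alpha    :", float(alpha @ K @ alpha))
```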


Reproducing Kernel Hilbert Spaces

Let H be a Hilbert space of real-valued functions on X .

1. The space H is called a reproducing kernel Hilbert space (RKHS) over X if for all x ∈ X the Dirac functional δx : H → R defined by

   δx(f ) := f (x), f ∈ H,

   is continuous.

2. A function k : X × X → R is called a reproducing kernel of H if k(·, x) ∈ H for all x ∈ X and the reproducing property

   f (x) = 〈f , k(·, x)〉H

   holds for all f ∈ H and all x ∈ X .

Aronszajn (1950)²: "There is a one-to-one correspondence between the reproducing kernel k and the RKHS H."

² N. Aronszajn. Theory of reproducing kernels. Transactions of the American Mathematical Society, 68(3):337–404, 1950.


RKHS as Feature Space

Reproducing kernels are kernels
Let H be a Hilbert space on X with a reproducing kernel k. Then H is an RKHS and is also a feature space of k, where the feature map φ : X → H is given by

φ(x) = k(·, x).

We call φ the canonical feature map.

Proof
We fix an x′ ∈ X and write f := k(·, x′). Then, for x ∈ X , the reproducing property implies

〈φ(x′), φ(x)〉 = 〈k(·, x′), k(·, x)〉 = 〈f , k(·, x)〉 = f (x) = k(x, x′).


RKHS as Feature Space

Universal kernels (Steinwart 2002)
A continuous kernel k on a compact metric space X is called universal if the RKHS H of k is dense in C(X ), i.e., for every function g ∈ C(X ) and all ε > 0 there exists an f ∈ H such that

‖f − g‖∞ ≤ ε.

Universal approximation theorem (Cybenko 1989)
Given any ε > 0 and f ∈ C(X ), there exists

h(x) = ∑_{i=1}^n αi ϕ(wi⊤ x + bi)

such that |f (x) − h(x)| < ε for all x ∈ X .


Quick Summary

- A positive definite kernel k(x, x′) defines an implicit feature map:

  k(x, x′) = 〈φ(x), φ(x′)〉H

- There exists a unique reproducing kernel Hilbert space (RKHS) H of functions on X for which k is a reproducing kernel:

  f (x) = 〈f , k(·, x)〉H,   k(x, x′) = 〈k(·, x), k(·, x′)〉H.

- Implicit representation of data points:
  - Support vector machine (SVM)
  - Gaussian process (GP)
  - Neural tangent kernel (NTK)

- Good references on kernel methods:
  - Support vector machines (2008), Christmann and Steinwart.
  - Gaussian processes for ML (2005), Rasmussen and Williams.
  - Learning with kernels (1998), Schölkopf and Smola.


Probability Measures

[Figure: four application settings involving probability measures — learning on distributions/point clouds, group anomaly/OOD detection, generalization across environments (training distributions P1_XY, . . . , PN_XY vs. an unseen test distribution P_XY), and statistical and causal inference.]


Embedding of Dirac Measures

[Figure: points x, y in the input space X are mapped to the functions k(x, ·), k(y, ·) in the feature space H.]

x ↦ k(·, x),   δx ↦ ∫ k(·, z) dδx(z) = k(·, x)


Embedding of Marginal Distributions

[Figure: distributions P and Q on X are mapped to their mean embeddings µP and µQ in the RKHS H.]

Probability measure
Let P be a probability measure defined on a measurable space (X , Σ) with a σ-algebra Σ.

Kernel mean embedding
Let P be the space of all probability measures P. A kernel mean embedding is defined by

µ : P → H,   P ↦ ∫ k(·, x) dP(x).

Remark: The kernel k is Bochner integrable if it is bounded.


Embedding of Marginal Distributions

[Figure: distributions P and Q on X are mapped to their mean embeddings µP and µQ in the RKHS H.]

- If EX∼P[√k(X, X)] < ∞, then µP ∈ H and, for every f ∈ H,

  〈f , µP〉 = 〈f , EX∼P[k(·, X)]〉 = EX∼P[〈f , k(·, X)〉] = EX∼P[f (X)].

  (An empirical illustration of this identity is sketched below.)

- The kernel k is said to be characteristic if the map

  P ↦ µP

  is injective, i.e., ‖µP − µQ‖H = 0 if and only if P = Q.
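The identity 〈f , µP〉 = EX∼P[f (X)] can be illustrated with the empirical embedding µ̂P: for an RKHS function f = ∑j cj k(·, zj), the inner product 〈f , µ̂P〉 equals the sample average of f. A small sketch (the kernel, sample, and test points are illustrative choices):

```python
import numpy as np

def rbf(x, z, gamma=0.5):
    return np.exp(-gamma * np.sum((x - z) ** 2, axis=-1))

rng = np.random.default_rng(0)
X = rng.normal(loc=1.0, scale=2.0, size=(500, 1))     # sample x_1, ..., x_n ~ P

# An RKHS function f = sum_j c_j k(., z_j).
Z = rng.standard_normal((5, 1))
c = rng.standard_normal(5)
f = lambda x: sum(c[j] * rbf(x, Z[j]) for j in range(5))

# mu_P_hat(z) = (1/n) sum_i k(x_i, z), so <f, mu_P_hat>_H = sum_j c_j mu_P_hat(z_j).
mu_hat = lambda z: float(np.mean(rbf(X, z)))
inner_product = sum(c[j] * mu_hat(Z[j]) for j in range(5))

# ... which equals the sample average of f, i.e. E_{X ~ P_hat}[f(X)].
sample_average = float(np.mean([f(X[i]) for i in range(len(X))]))
print(inner_product, sample_average)
```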


Interpretation of Kernel Mean Representation

What properties are captured by µP?

- k(x, x′) = 〈x, x′〉  →  the first moment of P
- k(x, x′) = (〈x, x′〉 + 1)^p  →  moments of P up to order p ∈ N
- k(x, x′) universal/characteristic  →  all information about P

Moment-generating function
Consider k(x, x′) = exp(〈x, x′〉). Then µP = EX∼P[e^〈X, ·〉].

Characteristic function
If k(x, y) = ψ(x − y) where ψ is a positive definite function, then

µP(y) = ∫ ψ(x − y) dP(x) = Λk · ϕP

for a positive finite measure Λk.


Characteristic Kernels

- All universal kernels are characteristic, but characteristic kernels may not be universal.

- Important characterizations:
  - Discrete kernels on discrete spaces
  - Shift-invariant kernels on R^d whose Fourier transform has full support
  - Integrally strictly positive definite (ISPD) kernels
  - Characteristic kernels on groups

- Examples of characteristic kernels:

  Gaussian RBF kernel: k(x, x′) = exp(−‖x − x′‖₂² / (2σ²))

  Laplacian kernel: k(x, x′) = exp(−‖x − x′‖₁ / σ)

- Kernel choice vs. parametric assumption:
  - A parametric assumption is susceptible to model misspecification.
  - But the choice of kernel matters in practice.
  - We can optimize the kernel to maximize the performance of the downstream task.


Kernel Mean Estimation

- Given an i.i.d. sample x1, x2, . . . , xn from P, we can estimate µP by

  µ̂P := (1/n) ∑_{i=1}^n k(xi , ·) ∈ H,   P̂ := (1/n) ∑_{i=1}^n δxi .

- For each f ∈ H, we have EX∼P̂[f (X )] = 〈f , µ̂P〉.

- Consistency: with probability at least 1 − δ,

  ‖µ̂P − µP‖H ≤ 2 √(EX∼P[k(X ,X )] / n) + √(2 log(1/δ) / n).

- The rate Op(n^{−1/2}) was shown to be minimax optimal.³

- As with James-Stein estimators, the estimate can be improved by shrinkage estimators:⁴

  µ̂α := α f ∗ + (1 − α) µ̂P,   f ∗ ∈ H.

  (A small numerical illustration of the empirical embedding and its convergence follows below.)

³ Tolstikhin et al. Minimax Estimation of Kernel Mean Embeddings. JMLR, 2017.
⁴ Muandet et al. Kernel Mean Shrinkage Estimators. JMLR, 2016.
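A small numerical illustration of the estimator and the Op(n^{−1/2}) rate, assuming P = N(0, 1) and a Gaussian RBF kernel, for which µP and ‖µP‖²H have closed forms (standard Gaussian integrals, stated here as an assumption of the sketch, not taken from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.5

def k(a, b):
    # Gaussian RBF kernel on R.
    return np.exp(-gamma * (a[:, None] - b[None, :]) ** 2)

# Closed forms for P = N(0, 1) with this kernel (assumed for the demo):
#   mu_P(z)        = exp(-gamma z^2 / (1 + 2 gamma)) / sqrt(1 + 2 gamma)
#   <mu_P, mu_P>_H = E[k(X, X')] = 1 / sqrt(1 + 4 gamma)
mu_P = lambda z: np.exp(-gamma * z ** 2 / (1 + 2 * gamma)) / np.sqrt(1 + 2 * gamma)
mu_P_sq_norm = 1.0 / np.sqrt(1 + 4 * gamma)

for n in [100, 400, 1600]:
    x = rng.standard_normal(n)
    # ||mu_P_hat - mu_P||_H^2 expanded via kernel evaluations and the closed forms.
    sq_err = k(x, x).mean() - 2.0 * mu_P(x).mean() + mu_P_sq_norm
    print(n, np.sqrt(max(sq_err, 0.0)))     # decays roughly like n^{-1/2}
```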


Recovering Samples/Distributions

- An approximate pre-image problem:

  θ∗ = arg min_θ ‖µ − µPθ‖²H,   where µ is a given (empirical) mean embedding.

- The distribution Pθ is assumed to be in a certain class, e.g., a mixture of Gaussians

  Pθ(x) = ∑_{k=1}^K πk N (x; µk , Σk),   ∑_{k=1}^K πk = 1.

- Kernel herding generates deterministic pseudo-samples by greedily minimizing the squared error (a code sketch follows below)

  E²T = ‖µP − (1/T) ∑_{t=1}^T k(·, xt)‖²H.

- Negative autocorrelation: O(1/T ) rate of convergence.

- Deep generative models (see the following slides).
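A minimal sketch of kernel herding over a finite candidate grid: each pseudo-sample is chosen greedily to keep the RKHS distance to the empirical embedding µ̂P small. The specific greedy score used here is one common form of the update; the kernel, candidate grid, and sample sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 2.0
k = lambda a, b: np.exp(-gamma * (a[:, None] - b[None, :]) ** 2)

# Empirical target embedding mu_P_hat from an i.i.d. sample of P = N(0, 1).
sample = rng.standard_normal(2000)
candidates = np.linspace(-4.0, 4.0, 801)          # finite search space for the argmax
mu_hat = k(candidates, sample).mean(axis=1)       # mu_P_hat evaluated at each candidate

# Greedy herding: the next pseudo-sample maximizes mu_P_hat minus the
# average similarity to the pseudo-samples already chosen.
herd = []
for t in range(50):
    penalty = k(candidates, np.array(herd)).sum(axis=1) / (t + 1) if herd else 0.0
    herd.append(candidates[np.argmax(mu_hat - penalty)])

herd = np.array(herd)
print("sample mean vs. herded mean:", sample.mean(), herd.mean())
```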


Quick Summary

- The kernel mean embedding of a distribution P and its empirical estimate:

  µP := ∫ k(·, x) dP(x),   µ̂P := (1/n) ∑_{i=1}^n k(xi , ·).

- If k is characteristic, µP captures all information about P.

- All universal kernels are characteristic, but not vice versa.

- The empirical estimate µ̂P requires no parametric assumption about P.

- It can be estimated consistently, i.e., with probability at least 1 − δ,

  ‖µ̂P − µP‖H ≤ 2 √(EX∼P[k(X ,X )] / n) + √(2 log(1/δ) / n).

- Given the embedding µP, it is possible to reconstruct the distribution or generate samples from it.


Application: High-Level Generalization

[Figure: illustrations of the four application settings.]

- Learning from Distributions: KM, Fukumizu, Dinuzzo, Schölkopf. NIPS 2012.
- Group Anomaly Detection: KM and Schölkopf. UAI 2013.
- Domain Generalization: KM et al. ICML 2013; Zhang, KM et al. ICML 2013.
- Cause-Effect Inference: Lopez-Paz, KM et al. JMLR 2015, ICML 2015.


Support Measure Machine (SMM)
KM, K. Fukumizu, F. Dinuzzo, and B. Schölkopf (NeurIPS 2012)

x ↦ k(·, x),   δx ↦ ∫ k(·, z) dδx(z),   P ↦ ∫ k(·, z) dP(z)

Training data: (P1, y1), (P2, y2), . . . , (Pn, yn) ∼ P × Y

Theorem (Distributional representer theorem)
Under technical assumptions on Ω : [0, +∞) → R and a loss function ℓ : (P × R²)^m → R ∪ {+∞}, any f ∈ H minimizing

ℓ(P1, y1, EP1 [f ], . . . , Pm, ym, EPm [f ]) + Ω(‖f ‖H)

admits a representation of the form

f = ∑_{i=1}^m αi Ex∼Pi [k(x , ·)] = ∑_{i=1}^m αi µPi .

(A small code sketch of learning on distributions via mean embeddings follows below.)


Supervised Learning on Point Clouds

Training set (S1, y1), . . . , (Sn, yn) with Si = {x_j^(i)} ∼ Pi (X ).

Causal Prediction: X → Y or X ← Y ?
Lopez-Paz, KM, B. Schölkopf, I. Tolstikhin. JMLR 2015, ICML 2015.

Topological Data Analysis
G. Kusano, K. Fukumizu, and Y. Hiraoka. JMLR 2018.


Domain Generalization
Blanchard et al., NeurIPS 2012; KM, D. Balduzzi, B. Schölkopf, ICML 2013

[Figure: training data drawn from several distributions P1_XY, . . . , PN_XY and unseen test data drawn from a new distribution P_XY.]

K((Pi , x), (Pj , x′)) = k1(Pi , Pj) k2(x , x′) = k1(µPi , µPj ) k2(x , x′)


Comparing Distributions

- Maximum mean discrepancy (MMD) corresponds to the RKHS distance between mean embeddings:

  MMD²(P,Q,H) = ‖µP − µQ‖²H = ‖µP‖²H − 2〈µP, µQ〉H + ‖µQ‖²H.

- MMD is an integral probability metric (IPM):

  MMD(P,Q,H) := sup_{h∈H, ‖h‖≤1} | ∫ h(x) dP(x) − ∫ h(x) dQ(x) |.

- If k is characteristic, then ‖µP − µQ‖H = 0 if and only if P = Q.

- Given {xi}_{i=1}^n ∼ P and {yj}_{j=1}^m ∼ Q, the empirical (unbiased) MMD is

  MMD²_u(P,Q,H) = 1/(n(n−1)) ∑_{i=1}^n ∑_{j≠i} k(xi , xj) + 1/(m(m−1)) ∑_{i=1}^m ∑_{j≠i} k(yi , yj) − 2/(nm) ∑_{i=1}^n ∑_{j=1}^m k(xi , yj).

  (A code sketch of this estimator follows below.)
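A direct implementation of the unbiased estimator above; the Gaussian RBF kernel and the toy distributions are illustrative choices.

```python
import numpy as np

def rbf_gram(X, Y, gamma=0.5):
    sq = np.sum(X ** 2, axis=1)[:, None] + np.sum(Y ** 2, axis=1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-gamma * sq)

def mmd2_unbiased(X, Y, gamma=0.5):
    # Unbiased estimate of ||mu_P - mu_Q||_H^2: diagonal terms are excluded
    # from the within-sample averages, exactly as in the formula above.
    Kxx, Kyy, Kxy = rbf_gram(X, X, gamma), rbf_gram(Y, Y, gamma), rbf_gram(X, Y, gamma)
    n, m = len(X), len(Y)
    return ((Kxx.sum() - np.trace(Kxx)) / (n * (n - 1))
            + (Kyy.sum() - np.trace(Kyy)) / (m * (m - 1))
            - 2.0 * Kxy.mean())

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(500, 2))
Y_same = rng.normal(0.0, 1.0, size=(500, 2))
Y_shift = rng.normal(0.5, 1.0, size=(500, 2))
print("same distribution:   ", mmd2_unbiased(X, Y_same))     # close to zero
print("shifted distribution:", mmd2_unbiased(X, Y_shift))    # clearly positive
```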


Kernel Two-Sample Testing
Gretton et al., JMLR 2012

[Figure: two samples, one drawn from P and one from Q.]

Question: Given {xi}_{i=1}^n ∼ P and {yj}_{j=1}^n ∼ Q, check whether P = Q.

H0 : P = Q,   H1 : P ≠ Q

- MMD test statistic:

  t² = MMD²_u(P,Q,H) = 1/(n(n−1)) ∑_{1≤i≠j≤n} h((xi , yi ), (xj , yj))

  where h((xi , yi ), (xj , yj)) = k(xi , xj) + k(yi , yj) − k(xi , yj) − k(xj , yi ).

  (A sketch of a permutation-based test using this statistic follows below.)
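The slide gives the test statistic; to actually run a test one also needs its null distribution. A permutation test is one standard way to calibrate it. In this sketch the kernel and sample sizes are illustrative assumptions, and the statistic is the unbiased MMD² from the previous slide rather than the paired U-statistic form.

```python
import numpy as np

def mmd2_unbiased(X, Y, gamma=0.5):
    # Unbiased MMD^2 with a Gaussian RBF kernel (same estimator as above).
    k = lambda A, B: np.exp(-gamma * np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1))
    Kxx, Kyy, Kxy = k(X, X), k(Y, Y), k(X, Y)
    n, m = len(X), len(Y)
    return ((Kxx.sum() - np.trace(Kxx)) / (n * (n - 1))
            + (Kyy.sum() - np.trace(Kyy)) / (m * (m - 1))
            - 2.0 * Kxy.mean())

def mmd_permutation_test(X, Y, n_perm=200, seed=0):
    # Calibrate the test under H0: P = Q by re-assigning the pooled samples at random.
    rng = np.random.default_rng(seed)
    pooled, n = np.vstack([X, Y]), len(X)
    observed = mmd2_unbiased(X, Y)
    null = []
    for _ in range(n_perm):
        idx = rng.permutation(len(pooled))
        null.append(mmd2_unbiased(pooled[idx[:n]], pooled[idx[n:]]))
    p_value = float(np.mean(np.array(null) >= observed))
    return observed, p_value

rng = np.random.default_rng(1)
X = rng.normal(0.0, 1.0, size=(200, 1))
Y = rng.normal(0.3, 1.0, size=(200, 1))
print(mmd_permutation_test(X, Y))     # (statistic, permutation p-value)
```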


Generative Adversarial Networks

Learn a deep generative model G via a minimax optimization

min_G max_D Ex [log D(x)] + Ez [log(1 − D(G (z)))]

where D is a discriminator and z ∼ N (0, σ²I).

[Figure: random noise z is passed through a generator Gθ; a discriminator Dφ decides whether its input is real data x or a synthetic sample Gθ(z). In the MMD variant, the discriminator is replaced by a test of whether ‖µX − µGθ(Z)‖H is zero (an MMD test) between real data {xi} and synthetic data {Gθ(zi)}.]


Generative Moment Matching Network

- The GAN aims to match two distributions, P(X ) and Gθ.

- The generative moment matching network (GMMN), proposed by Dziugaite et al. (2015) and Li et al. (2015), considers

  min_θ ‖µX − µGθ(Z)‖²H = min_θ ‖ ∫ φ(X ) dP(X ) − ∫ φ(X ) dGθ(X ) ‖²H
                        = min_θ ( sup_{h∈H, ‖h‖≤1} | ∫ h dP − ∫ h dGθ | )²

- Many tricks have been proposed to improve the GMMN:
  - Optimized kernels and feature extractors (Sutherland et al., 2017; Li et al., 2017a)
  - Gradient regularization (Binkowski et al., 2018; Arbel et al., 2018)
  - Repulsive loss (Wang et al., 2019)
  - Optimized witness points (Mehrjou et al., 2019)
  - Etc.


Conditional Distribution P(Y |X )

[Figure: joint data over (X, Y) illustrating the conditional distributions P(Y |X = x).]

A collection of distributions PY := {P(Y |X = x) : x ∈ X}.

- For each x ∈ X , we can define an embedding of P(Y |X = x) as

  µY |x := ∫_Y ϕ(Y ) dP(Y |X = x) = EY |x [ϕ(Y )]

  where ϕ : Y → G is a feature map of Y .


Embedding of Conditional Distributions

Figure: the feature map k(x, ·) ∈ H is mapped by C_YX C_XX^{-1} to the conditional mean embedding μ_{Y|X=x} ∈ G, which represents P(Y | X = x) with density p(y | x).

The conditional mean embedding of P(Y | X) can be defined as

  U_{Y|X} : H → G,   U_{Y|X} := C_YX C_XX^{-1}.


Conditional Mean Embedding

▶ To fully represent P(Y | X), we need to perform both conditioning and conditional expectation.
▶ To represent P(Y | X = x) for x ∈ X, it follows that

  E_{Y|x}[ϕ(Y) | X = x] = U_{Y|X} k(x, ·) = C_YX C_XX^{-1} k(x, ·) =: μ_{Y|x}.

▶ It follows from the reproducing property of G that

  E_{Y|x}[g(Y) | X = x] = ⟨μ_{Y|x}, g⟩_G,   ∀g ∈ G.

▶ In an infinite-dimensional RKHS, C_XX^{-1} does not exist. Hence, we often use

  U_{Y|X} := C_YX (C_XX + εI)^{-1}.

▶ Conditional mean estimator (a numerical sketch follows below):

  μ̂_{Y|x} = Σ_{i=1}^n β_i(x) ϕ(y_i),   β(x) := (K + nεI)^{-1} k_x.
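A minimal sketch of the estimator above (data, kernel bandwidth, and the test function g are illustrative choices): β(x) = (K + nεI)^{-1} k_x, and ⟨μ̂_{Y|x}, g⟩ = Σ_i β_i(x) g(y_i) estimates E[g(Y) | X = x].

import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    return np.exp(-(A[:, None] - B[None, :]) ** 2 / (2 * sigma**2))

rng = np.random.default_rng(0)
n = 300
x = rng.uniform(-2, 2, n)
y = np.sin(x) + 0.1 * rng.normal(size=n)          # joint sample from P(X, Y)

eps = 1e-3
K = gaussian_kernel(x, x, sigma=0.5)              # K_ij = k(x_i, x_j)
x_star = 1.0
k_x = gaussian_kernel(x, np.array([x_star]), sigma=0.5).ravel()
beta = np.linalg.solve(K + n * eps * np.eye(n), k_x)   # beta(x*)

g = lambda t: t**2                                # an illustrative test function g
print(beta @ g(y))                                # estimate of E[g(Y) | X = x*]
print(np.sin(x_star)**2 + 0.01)                   # true value, since Y|X=x* ~ N(sin(x*), 0.1^2)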


Counterfactual Mean Embedding
KM, Kanagawa, Saengkyongam, Marukatat, JMLR 2020 (accepted)

In economics, social science, and public policy, we need to evaluate the distributional treatment effect (DTE)

  P_{Y*_0}(·) − P_{Y*_1}(·),

where Y*_0 and Y*_1 are potential outcomes of a treatment policy T.

▶ We can only observe either P_{Y*_0} or P_{Y*_1}.
▶ Counterfactual distribution:

  P_{Y⟨0|1⟩}(y) = ∫ P_{Y_0|X_0}(y | x) dP_{X_1}(x).

▶ The counterfactual distribution P_{Y⟨0|1⟩}(y) can be estimated using the kernel mean embedding (see the sketch below).
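A hedged sketch of one way to estimate the counterfactual mean embedding μ_{Y⟨0|1⟩} = ∫ μ_{Y_0|x} dP_{X_1}(x): average the conditional mean estimator from the previous slides over the treated covariate sample. Variable names, kernels, and the simulated data are illustrative; see the CME paper for the exact estimator and its guarantees.

import numpy as np

def gauss(a, b, s=1.0):
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * s**2))

rng = np.random.default_rng(1)
n, m, eps = 400, 400, 1e-3
x0 = rng.normal(0, 1, n); y0 = x0 + 0.1 * rng.normal(size=n)   # control group (X_0, Y_0)
x1 = rng.normal(1, 1, m); y1 = x1 + 0.1 * rng.normal(size=m)   # treated group (X_1, Y_1)

K00 = gauss(x0, x0)                       # Gram matrix on control covariates
K01 = gauss(x0, x1)                       # cross Gram: control vs. treated covariates
# Weights of mu_hat_{Y<0|1>} = sum_i beta_i * l(y0_i, .)
beta = np.linalg.solve(K00 + n * eps * np.eye(n), K01 @ np.ones(m) / m)

# Squared MMD between the counterfactual embedding and the embedding of the
# observed treated outcomes: a plug-in statistic for a distributional effect.
L00 = gauss(y0, y0); L01 = gauss(y0, y1); L11 = gauss(y1, y1)
dte_stat = beta @ L00 @ beta - 2 * beta @ L01 @ np.ones(m) / m + L11.mean()
print(dte_stat)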


Quantum Mean Embedding


Quick Summary

▶ Many applications require the information in P(Y | X).
▶ The Hilbert space embedding of P(Y | X) is not a single element, but an operator U_{Y|X} mapping from H to G:

  μ_{Y|x} = U_{Y|X} k(x, ·) = C_YX C_XX^{-1} k(x, ·),
  ⟨μ_{Y|x}, g⟩_G = E_{Y|x}[g(Y) | X = x].

▶ The conditional mean operator and its empirical estimate:

  U_{Y|X} := C_YX (C_XX + εI)^{-1},   Û_{Y|X} = Ĉ_YX (Ĉ_XX + εI)^{-1}.

▶ Probabilistic inference, such as the sum, product, and Bayes' rules, can be performed via the embeddings.


Kernel Methods

From Points to Probability Measures

Embedding of Marginal Distributions

Embedding of Conditional Distributions

Recent Development


Machine Learning in Economics

Application areas: recommendation, autonomous cars, healthcare, finance, law enforcement, public policy.


Instrumental Variable Regression

Figure: a causal graph I → E → Y with an unobserved confounder S affecting both E and Y (I: season of birth, E: education, Y: income, S: socioeconomic status).

▶ We aim to estimate a function f from a structural equation model

  Y = f(E) + ε,   E[ε | E] ≠ 0.

▶ We have an instrumental variable I with the property E[ε | I] = 0, i.e.,

  E[Y − f(E) | I] = 0.

▶ Conditional moment restriction (CMR): E[ψ(Z; θ) | X] = 0, with

  Z = (E, Y),   X = I,   θ = f,   ψ(Z; θ) = Y − f(E)   (simulated in the sketch below).
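An illustrative simulation of the structural equation model on this slide: a hidden confounder S drives both E and Y, so E[ε | E] ≠ 0, while the instrument I is independent of S, so E[ε | I] = 0. All coefficients and variable names are made up for the sketch.

import numpy as np

rng = np.random.default_rng(0)
n = 100_000
S = rng.normal(size=n)                               # unobserved confounder (socioeconomic status)
I = rng.normal(size=n)                               # instrument (e.g. season of birth)
E = 0.8 * I + 0.6 * S + 0.3 * rng.normal(size=n)     # education
f = lambda e: 2.0 * e                                # true structural function (illustrative)
eps = 0.9 * S + 0.3 * rng.normal(size=n)             # structural noise, correlated with E via S
Y = f(E) + eps                                       # income

print(np.corrcoef(eps, E)[0, 1])                     # clearly nonzero: E[eps | E] != 0
print(np.corrcoef(eps, I)[0, 1])                     # approximately zero: E[eps | I] = 0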


Conditional Moment Restriction (CMR)
Newey (1993), Ai and Chen (2003)

There exists a true parameter θ_0 ∈ Θ that satisfies

  E[ψ(Z; θ_0) | X] = 0,   P_X-a.s.,

where ψ : Z × Θ → R^q is a generalized residual function.

▶ The function ψ is known and problem-dependent, e.g.,

  ψ(Z; θ) = Y − f(E),   Z = (Y, E),   X = I,   θ = f.

▶ The CMR implies the unconditional moment restriction (UMR)

  E[ψ(Z; θ_0)^⊤ f(X)] = 0

  for any measurable vector-valued function f : X → R^q. The function f(X) is often called an instrument.

▶ Given instruments f_1, ..., f_m, one can use the generalized method of moments (GMM) to learn the parameter θ.


Maximum Moment Restriction (MMR)
KM, W. Jitkrittum, J. Kübler, UAI 2020

Let F be a space of instruments f(x). Then

  E[ψ(Z; θ_0) | X] = 0  (CMR)   ⇔   sup_{f ∈ F} | E[ψ(Z; θ_0)^⊤ f(X)] | = 0,

where the supremum on the right is MMR(F, θ_0).

▶ The equivalence above holds if F is a universal vector-valued RKHS.
▶ Let μ_θ := E[K_X ψ(Z; θ)]. Then

  MMR(F, θ) := sup_{f ∈ F, ‖f‖ ≤ 1} | E[ψ(Z; θ)^⊤ f(X)] | = ‖ E[K_X ψ(Z; θ)] ‖_F = ‖μ_θ‖_F.

▶ MMR²(F, θ) = E[ψ(Z; θ)^⊤ K(X, X′) ψ(Z′; θ)], where (X′, Z′) is an independent copy of (X, Z).


Maximum Moment Restriction (MMR)
KM, W. Jitkrittum, J. Kübler, UAI 2020

Parameter Estimation

Given observations {(x_i, z_i)}_{i=1}^n from P(X, Z), we aim to estimate θ_0 by minimizing the empirical MMR² (a numerical sketch follows after this slide):

  θ̂ = argmin_{θ ∈ Θ} MMR̂²(F, θ) = argmin_{θ ∈ Θ} (1 / (n(n−1))) Σ_{1 ≤ i ≠ j ≤ n} ψ(z_i; θ)^⊤ K(x_i, x_j) ψ(z_j; θ).

Hypothesis Testing

Given observations {(x_i, z_i)}_{i=1}^n from P(X, Z) and the parameter estimate θ̂, we aim to test

  H_0: MMR²(F, θ̂) = 0   versus   H_1: MMR²(F, θ̂) ≠ 0.
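A minimal sketch of the empirical MMR² objective above, specialized to the IV residual ψ(z; θ) = y − f_θ(e) with a linear f_θ(e) = θ·e (an illustrative parametrization); θ is then chosen by minimizing the U-statistic over a grid.

import numpy as np

def gauss(a, b, s=1.0):
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * s**2))

def mmr2(psi, K):
    # (1 / (n(n-1))) * sum over i != j of psi_i K_ij psi_j
    n = len(psi)
    M = np.outer(psi, psi) * K
    return (M.sum() - np.trace(M)) / (n * (n - 1))

rng = np.random.default_rng(0)
n = 2000
S = rng.normal(size=n); X = rng.normal(size=n)              # X plays the role of the instrument
E = 0.8 * X + 0.6 * S + 0.3 * rng.normal(size=n)
Y = 2.0 * E + 0.9 * S + 0.3 * rng.normal(size=n)            # true theta_0 = 2

K = gauss(X, X)                                             # kernel on the conditioning variable
thetas = np.linspace(0.0, 4.0, 81)
objective = [mmr2(Y - th * E, K) for th in thetas]
print(thetas[int(np.argmin(objective))])                    # close to 2.0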


Conditional Moment Embedding

Figure: The conditional moments E[ψ(Z; θ) | X] for different parameters θ (here θ_0, θ_1, θ_2) are uniquely (P_X-almost surely) embedded into the RKHS F as μ_{θ_0}, μ_{θ_1}, μ_{θ_2}.

Kernel Conditional Moment Test via Maximum Moment Restriction (UAI 2020)
Paper: https://arxiv.org/abs/2002.09225
Code: https://github.com/krikamol/kcm-test


Future Direction

Contact: [email protected]   Website: http://krikamol.org


Covariance Operators

▶ Let H, G be RKHSs on X, Y with feature maps

  φ(x) = k(x, ·),   ϕ(y) = ℓ(y, ·).

▶ Let C_XX and C_YX be the covariance operator on X and the cross-covariance operator from X to Y, i.e.,

  C_XX = ∫ φ(X) ⊗ φ(X) dP(X),   C_YX = ∫ ϕ(Y) ⊗ φ(X) dP(Y, X).

▶ Alternatively, C_YX is the unique bounded operator satisfying (illustrated in the finite-dimensional sketch below)

  ⟨g, C_YX f⟩_G = Cov[g(Y), f(X)].

▶ If E_{Y|X}[g(Y) | X = ·] ∈ H for g ∈ G, then

  C_XX E_{Y|X}[g(Y) | X = ·] = C_XY g.
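A small finite-dimensional sketch (explicit feature maps rather than an RKHS, and all names illustrative) of the characterization ⟨g, C_YX f⟩_G = Cov[g(Y), f(X)]: with centered feature vectors the empirical cross-covariance operator is just a matrix, and applying it to functions expressed in the features reproduces the empirical covariance.

import numpy as np

phi  = lambda x: np.stack([x, x**2], axis=-1)        # feature map on X (illustrative)
vphi = lambda y: np.stack([y, np.sin(y)], axis=-1)   # feature map on Y (illustrative)

rng = np.random.default_rng(0)
n = 50_000
x = rng.normal(size=n)
y = 0.5 * x + 0.1 * rng.normal(size=n)

PhiX, PhiY = phi(x), vphi(y)
PhiXc, PhiYc = PhiX - PhiX.mean(0), PhiY - PhiY.mean(0)
C_YX = PhiYc.T @ PhiXc / n                           # empirical cross-covariance operator

f_w = np.array([1.0, -0.5])                          # f(x) = <f_w, phi(x)>
g_w = np.array([0.3,  2.0])                          # g(y) = <g_w, vphi(y)>
print(g_w @ C_YX @ f_w)                              # <g, C_YX f>
print(np.cov(PhiY @ g_w, PhiX @ f_w, bias=True)[0, 1])  # empirical Cov[g(Y), f(X)]; same value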


Conditional Mean Estimation

▶ Given a joint sample (x_1, y_1), ..., (x_n, y_n) from P(X, Y), we have

  Ĉ_XX = (1/n) Σ_{i=1}^n φ(x_i) ⊗ φ(x_i),   Ĉ_YX = (1/n) Σ_{i=1}^n ϕ(y_i) ⊗ φ(x_i).

▶ Then, μ_{Y|x} for some x ∈ X can be estimated as

  μ̂_{Y|x} = Ĉ_YX (Ĉ_XX + εI)^{-1} k(x, ·) = Φ (K + nεI_n)^{-1} k_x = Σ_{i=1}^n β_i ϕ(y_i),

  where ε > 0 is a regularization parameter and

  Φ = [ϕ(y_1), ..., ϕ(y_n)],   K_ij = k(x_i, x_j),   k_x = [k(x_1, x), ..., k(x_n, x)]^⊤.

▶ Under some technical assumptions, μ̂_{Y|x} → μ_{Y|x} as n → ∞ (a numerical sketch follows below).
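A numerical sketch of the convergence claim (illustrative data, with the kernel bandwidth and ε kept fixed): the plug-in estimate of E[Y | X = x*], namely Σ_i β_i y_i with β = (K + nεI)^{-1} k_{x*}, moves toward the true conditional mean as n grows.

import numpy as np

def gauss(a, b, s=0.5):
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * s**2))

rng = np.random.default_rng(0)
x_star, eps = 1.0, 1e-3
for n in [50, 200, 1000, 4000]:
    x = rng.uniform(-2, 2, n)
    y = np.sin(x) + 0.1 * rng.normal(size=n)
    K = gauss(x, x)
    k_x = gauss(x, np.array([x_star]))[:, 0]
    beta = np.linalg.solve(K + n * eps * np.eye(n), k_x)
    # Moves toward sin(1.0) ~ 0.841 as n grows; with fixed eps and bandwidth a small bias remains.
    print(n, beta @ y)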


Kernel Sum Rule: P(X) = Σ_Y P(X, Y)

▶ By the law of total expectation,

  μ_X = E_X[φ(X)] = E_Y[ E_{X|Y}[φ(X) | Y] ] = E_Y[ U_{X|Y} ϕ(Y) ] = U_{X|Y} E_Y[ϕ(Y)] = U_{X|Y} μ_Y.

▶ Let μ̂_Y = Σ_{i=1}^m α_i ϕ(ỹ_i) and Û_{X|Y} = Ĉ_XY (Ĉ_YY + λI)^{-1}. Then,

  μ̂_X = Û_{X|Y} μ̂_Y = Ĉ_XY (Ĉ_YY + λI)^{-1} μ̂_Y = Υ (L + nλI)^{-1} L̃ α,

  where α = (α_1, ..., α_m)^⊤, Υ = [φ(x_1), ..., φ(x_n)], L_ij = l(y_i, y_j), and L̃_ij = l(y_i, ỹ_j).

▶ That is, we have (see the sketch below)

  μ̂_X = Σ_{j=1}^n β_j φ(x_j)   with   β = (L + nλI)^{-1} L̃ α.
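A minimal sketch of the kernel sum rule computation above (illustrative kernels and data): β = (L + nλI)^{-1} L̃ α turns the prior weights α on the points ỹ_i into weights β on the joint sample x_j.

import numpy as np

def gauss(a, b, s=1.0):
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * s**2))

rng = np.random.default_rng(0)
n, m, lam = 500, 200, 1e-3
y = rng.normal(size=n)                     # joint sample: y_j
x = y + 0.1 * rng.normal(size=n)           #               x_j ~ P(X | Y = y_j)

yt = rng.normal(loc=0.5, size=m)           # sample representing the prior pi(Y)
alpha = np.full(m, 1.0 / m)                # mu_hat_Y = (1/m) sum_i l(yt_i, .)

L  = gauss(y, y)                           # L_ij  = l(y_i, y_j)
Lt = gauss(y, yt)                          # Lt_ij = l(y_i, yt_j)
beta = np.linalg.solve(L + n * lam * np.eye(n), Lt @ alpha)

# mu_hat_X = sum_j beta_j k(x_j, .); as a crude plug-in estimate of E[X] under the
# prior, the weighted mean below is shifted toward the prior mean 0.5 rather than 0.
print(beta @ x)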


Kernel Product Rule: P(X, Y) = P(Y | X) P(X)

▶ We can factorize μ_XY = E_XY[φ(X) ⊗ ϕ(Y)] as

  E_Y[ E_{X|Y}[φ(X) | Y] ⊗ ϕ(Y) ] = U_{X|Y} E_Y[ϕ(Y) ⊗ ϕ(Y)],
  E_X[ E_{Y|X}[ϕ(Y) | X] ⊗ φ(X) ] = U_{Y|X} E_X[φ(X) ⊗ φ(X)].

▶ Let μ⊗_X = E_X[φ(X) ⊗ φ(X)] and μ⊗_Y = E_Y[ϕ(Y) ⊗ ϕ(Y)].
▶ Then, the product rule becomes

  μ_XY = U_{X|Y} μ⊗_Y = U_{Y|X} μ⊗_X.

▶ Alternatively, we may write the above formulation as

  C_XY = U_{X|Y} C_YY   and   C_YX = U_{Y|X} C_XX.

▶ The kernel sum and product rules can be combined to obtain the kernel Bayes' rule.⁵

⁵ Fukumizu et al. Kernel Bayes' Rule. JMLR, 2013.

52/53

Kernel Product Rule: P(X ,Y ) = P(Y |X )P(X )

I We can factorize µXY = EXY [φ(X )⊗ ϕ(Y )] as

EY [EX |Y [φ(X )|Y ]⊗ ϕ(Y )] = UX |YEY [ϕ(Y )⊗ ϕ(Y )]

EX [EY |X [ϕ(Y )|X ]⊗ φ(X )] = UY |XEX [φ(X )⊗ φ(X )]

I Let µ⊗X = EX [φ(X )⊗ φ(X )] and µ⊗Y = EY [ϕ(Y )⊗ ϕ(Y )].

I Then, the product rule becomes

µXY = UX |Yµ⊗Y = UY |Xµ⊗X .

I Alternatively, we may write the above formulation as

CXY = UX |Y CYY and CYX = UY |XCXX

I The kernel sum and product rules can be combined to obtain thekernel Bayes’ rule.5

5Fukumizu et al. Kernel Bayes’ Rule. JMLR. 2013

Page 133: Recent Advances in Hilbert Space Representation of Probability …lcsl.mit.edu/courses/regml/regml2020/slides/talk2.pdf · 2020-07-01 · Reproducing Kernel Hilbert Spaces Let Hbe

52/53

Kernel Product Rule: P(X ,Y ) = P(Y |X )P(X )

I We can factorize µXY = EXY [φ(X )⊗ ϕ(Y )] as

EY [EX |Y [φ(X )|Y ]⊗ ϕ(Y )] = UX |YEY [ϕ(Y )⊗ ϕ(Y )]

EX [EY |X [ϕ(Y )|X ]⊗ φ(X )] = UY |XEX [φ(X )⊗ φ(X )]

I Let µ⊗X = EX [φ(X )⊗ φ(X )] and µ⊗Y = EY [ϕ(Y )⊗ ϕ(Y )].

I Then, the product rule becomes

µXY = UX |Yµ⊗Y = UY |Xµ⊗X .

I Alternatively, we may write the above formulation as

CXY = UX |Y CYY and CYX = UY |XCXX

I The kernel sum and product rules can be combined to obtain thekernel Bayes’ rule.5

5Fukumizu et al. Kernel Bayes’ Rule. JMLR. 2013

Page 134: Recent Advances in Hilbert Space Representation of Probability …lcsl.mit.edu/courses/regml/regml2020/slides/talk2.pdf · 2020-07-01 · Reproducing Kernel Hilbert Spaces Let Hbe

52/53

Kernel Product Rule: P(X ,Y ) = P(Y |X )P(X )

I We can factorize µXY = EXY [φ(X )⊗ ϕ(Y )] as

EY [EX |Y [φ(X )|Y ]⊗ ϕ(Y )] = UX |YEY [ϕ(Y )⊗ ϕ(Y )]

EX [EY |X [ϕ(Y )|X ]⊗ φ(X )] = UY |XEX [φ(X )⊗ φ(X )]

I Let µ⊗X = EX [φ(X )⊗ φ(X )] and µ⊗Y = EY [ϕ(Y )⊗ ϕ(Y )].

I Then, the product rule becomes

µXY = UX |Yµ⊗Y = UY |Xµ⊗X .

I Alternatively, we may write the above formulation as

CXY = UX |Y CYY and CYX = UY |XCXX

I The kernel sum and product rules can be combined to obtain thekernel Bayes’ rule.5

5Fukumizu et al. Kernel Bayes’ Rule. JMLR. 2013

Page 135: Recent Advances in Hilbert Space Representation of Probability …lcsl.mit.edu/courses/regml/regml2020/slides/talk2.pdf · 2020-07-01 · Reproducing Kernel Hilbert Spaces Let Hbe

52/53

Kernel Product Rule: P(X ,Y ) = P(Y |X )P(X )

I We can factorize µXY = EXY [φ(X )⊗ ϕ(Y )] as

EY [EX |Y [φ(X )|Y ]⊗ ϕ(Y )] = UX |YEY [ϕ(Y )⊗ ϕ(Y )]

EX [EY |X [ϕ(Y )|X ]⊗ φ(X )] = UY |XEX [φ(X )⊗ φ(X )]

I Let µ⊗X = EX [φ(X )⊗ φ(X )] and µ⊗Y = EY [ϕ(Y )⊗ ϕ(Y )].

I Then, the product rule becomes

µXY = UX |Yµ⊗Y = UY |Xµ⊗X .

I Alternatively, we may write the above formulation as

CXY = UX |Y CYY and CYX = UY |XCXX

I The kernel sum and product rules can be combined to obtain thekernel Bayes’ rule.5

5Fukumizu et al. Kernel Bayes’ Rule. JMLR. 2013

Page 136: Recent Advances in Hilbert Space Representation of Probability …lcsl.mit.edu/courses/regml/regml2020/slides/talk2.pdf · 2020-07-01 · Reproducing Kernel Hilbert Spaces Let Hbe

53/53

Calibration of Computer Simulation
Kennedy and O'Hagan (2002); Kisamori et al. (AISTATS 2020)

Figure taken from Kisamori et al. (2020).

The computer simulator: r(x, θ), θ ∈ Θ.
The posterior embedding: μ_{Θ|r*} := ∫ k_Θ(·, θ) dP_π(θ | r*).