
Determinantal Point Processes

Jose Gallego-Posada, April 2021

Brahms 7-8

sli.do -- #MilaDPP

Today's agenda

• Why DPPs?

• Definition and properties

• Sampling

• Applications

The bible for DPPs in ML:

Determinantal Point Processes for Machine Learning
Alex Kulesza and Ben Taskar (2012), Foundations and Trends in Machine Learning [link]

Presentation based on slides by:

• Simon Barthelmé, Nicolas Tremblay – EUSIPCO19 [link]

• Alex Kulesza, Ben Taskar and Jennifer Gillenwater – CVPR13 [link]

DPPy: Guillaume Gautier, Rémi Bardenet, Guillermo Polito, Michal Valko

Slides and demos: https://github.com/jgalle29/dpp_slides

Variance reduction – Mean estimation

[Figure: mean estimation with IID samples vs DPP samples; BT19, dpp_demo]

Determinantal Point Processes

Base set 𝒴 = {1, …, n} from which we sample a random subset 𝒀.

𝒀 is distributed according to a point process 𝒫 over 2^𝒴.

𝒫(𝒀 = Y) depends on the determinant of a matrix selected based on the elements of Y.

Poisson Process

• Simplest point process… too simple!

• Element memberships are parameterized by independent Bernoulli rvs.

• Special case of a DPP with marginal kernel 𝔎 = D_p, the diagonal matrix with 𝔎_ii = p_i.

𝒫(𝒀 = Y) = ∏_{i∈Y} p_i ∏_{i∉Y} (1 − p_i)

Representing repulsion

Desiderata:

i. Density is tractable, including the normalization constant

ii. Inclusion probabilities (marginals) are tractable

iii. Sampling is tractable

iv. Model is easy to understand

Contrary to most Gibbs processes (normalized, exponentiated potentials), DPPs tick all the boxes.

GMs vs DPPs

Loopy, negative interactions are hard (inference becomes intractable in the worst case).

Global, negative interactions are easy.

[KTG13]

𝔏-ensembles

• Model repulsion based on similarity between elements of 𝒴.

• Similarity between elements i and j is stored in 𝔏_ij.

• We assume 𝔏 to be positive definite.

• 𝔏 is known as the likelihood kernel.

We say that 𝒀 is distributed according to a DPP if:

𝒫(𝒀 = Y) ∝ det 𝔏_Y

Where did the repulsion go?

Write 𝔏 = 𝔅ᵀ𝔅, so that the columns 𝔅_i are embeddings of the elements of 𝒴. Then 𝔏_Y = [𝔅ᵀ𝔅]_Y and

𝒫(𝒀 = Y) ∝ det 𝔏_Y = Vol²({𝔅_i}_{i∈Y})

[Figure: embedding vectors 𝔅_1, 𝔅_2, and the submatrix 𝔏_{1,2,4} picked out of 𝔅ᵀ𝔅]

Where did the repulsion go?

𝒫(𝒀 = {i, j}) ∝ 𝒫(𝒀 = {i}) 𝒫(𝒀 = {j}) − (𝔏_ij / det(𝔏 + 𝕀))²

Vol²({𝔅_i}_{i∈Y}) = det 𝔏_Y

[Figure: parallelepiped spanned by 𝔅_1, 𝔅_2; KTG13]

Where did the repulsion go?

Probability under a DPP grows with the volume spanned by the selected embeddings.

[Figure: det 𝔏_Y as spanned volume; BT19, dpp_demo]
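A minimal numpy sketch (my own illustration, not from the deck): the Gram determinant det 𝔏_Y = det(𝔅_Yᵀ𝔅_Y) is the squared volume spanned by the selected embeddings, so near-duplicate items get near-zero probability:

import numpy as np

b1 = np.array([1.0, 0.0])                       # embedding of item 1
b2_similar = np.array([0.99, 0.1])
b2_similar /= np.linalg.norm(b2_similar)        # nearly parallel to b1
b2_diverse = np.array([0.0, 1.0])               # orthogonal to b1

for b2 in (b2_similar, b2_diverse):
    B_Y = np.stack([b1, b2], axis=1)            # columns = selected embeddings
    L_Y = B_Y.T @ B_Y                           # [B^T B]_Y
    print(np.linalg.det(L_Y))                   # ~0.01 (similar) vs 1.0 (diverse)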

Normalization

∑_{A ⊆ Y ⊆ 𝒴} det 𝔏_Y = det(𝔏 + 𝕀_Ā)

where 𝕀_Ā is the diagonal matrix with ones on the complement of A. Taking A = ∅ gives an analytic normalization constant: det(𝔏 + 𝕀)!
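The identity is easy to verify by brute force on a small example. A sketch (my own, using the convention det 𝔏_∅ = 1):

import itertools
import numpy as np

rng = np.random.default_rng(0)
n = 5
B = rng.standard_normal((n, n))
L = B @ B.T                                        # random PSD likelihood kernel

def sum_dets_containing(L, A=()):
    """Brute-force sum of det L_Y over all Y with A ⊆ Y ⊆ {0, ..., n-1}."""
    rest = [i for i in range(L.shape[0]) if i not in A]
    total = 0.0
    for r in range(len(rest) + 1):
        for extra in itertools.combinations(rest, r):
            Y = sorted(set(A) | set(extra))
            total += np.linalg.det(L[np.ix_(Y, Y)]) if Y else 1.0
    return total

I_comp = np.diag([0.0 if i in (0, 2) else 1.0 for i in range(n)])  # identity on complement of A
print(np.isclose(sum_dets_containing(L), np.linalg.det(L + np.eye(n))))          # A = {}
print(np.isclose(sum_dets_containing(L, A=(0, 2)), np.linalg.det(L + I_comp)))   # A = {0, 2}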

Exploit linear-algebraic properties to make inference/sampling easy (or feasible in high dimensions).

Marginal kernels

• Consider a DPP with 𝔏-ensemble 𝔏.

• The inclusion (marginal) probability that 𝒀 contains a set S is given by:

𝒫(S ⊂ 𝒀) = (1 / det(𝔏 + 𝕀)) ∑_{S ⊆ Y} det 𝔏_Y = det 𝔎_S

with 𝔎 = 𝔏(𝔏 + 𝕀)⁻¹.

• 𝔎 is known as the marginal kernel of the DPP.

• 𝒫(i ∈ 𝒀) = 𝔎_ii.

• 𝔼|𝒀| = 𝔼[∑_i 𝟙{i ∈ 𝒀}] = ∑_i 𝒫(i ∈ 𝒀) = tr 𝔎.
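These facts can be checked by enumeration on a small 𝒴. A sketch (my own, not from the deck):

import itertools
import numpy as np

rng = np.random.default_rng(1)
n = 4
B = rng.standard_normal((n, n))
L = B @ B.T
K = L @ np.linalg.inv(L + np.eye(n))          # marginal kernel K = L (L + I)^{-1}

Z = np.linalg.det(L + np.eye(n))              # normalization constant
marginals, expected_size = np.zeros(n), 0.0
for r in range(n + 1):
    for Y in itertools.combinations(range(n), r):
        p = (np.linalg.det(L[np.ix_(Y, Y)]) if Y else 1.0) / Z
        marginals[list(Y)] += p
        expected_size += r * p

print(np.allclose(marginals, np.diag(K)))     # P(i in Y) = K_ii
print(np.isclose(expected_size, np.trace(K))) # E|Y| = tr K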

Conditioning

𝒫(B ⊂ 𝒀 | A ⊂ 𝒀) = 𝒫(A ∪ B ⊂ 𝒀) / 𝒫(A ⊂ 𝒀) = det 𝔎_{A∪B} / det 𝔎_A = det(𝔎_B − 𝔎_BA 𝔎_A⁻¹ 𝔎_AB)

using the block structure

𝔎_{A∪B} = [ 𝔎_A  𝔎_AB ; 𝔎_BA  𝔎_B ]

and the Schur complement identity

det 𝔎_{A∪B} = det 𝔎_A · det(𝔎_B − 𝔎_BA 𝔎_A⁻¹ 𝔎_AB)

DPPs are closed under conditioning!
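A quick numerical check of the Schur-complement identity (my own sketch; the identity holds for any matrix with invertible 𝔎_A):

import numpy as np

rng = np.random.default_rng(2)
n = 6
M = rng.standard_normal((n, n))
K = M @ M.T / n                               # stand-in for a marginal kernel
A, B = [0, 1], [3, 4]

KA, KB = K[np.ix_(A, A)], K[np.ix_(B, B)]
KAB, KBA = K[np.ix_(A, B)], K[np.ix_(B, A)]

lhs = np.linalg.det(K[np.ix_(A + B, A + B)]) / np.linalg.det(KA)
rhs = np.linalg.det(KB - KBA @ np.linalg.inv(KA) @ KAB)
print(np.isclose(lhs, rhs))                   # det K_{A∪B} / det K_A = det(Schur complement)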

Complexity?

• Evaluation of 𝔏 – 𝒪(n²)

• Normalization constant – 𝒪(n³) [determinant]

• Marginal probabilities – 𝒪(n³) [matrix inversion]

• Conditional probabilities – 𝒪(n³) [Schur complement]

Questions?

Brahms 7-8

Extensions

• Conditional DPPs

• k-DPPs

• Structured DPPs

• Non-symmetric DPPs

k-DPPs

𝒫(𝒀 = Y) ∝ det 𝔏_Y 𝟙{|Y| = k}

• In practical applications it is often preferable to limit the cardinality of the sampled set:

• Search results

• Minibatch selection

• Summarization

• Normalization constant: ∑_{|Y|=k} det 𝔏_Y = e_k(λ_1, …, λ_n) [k-th elementary symmetric polynomial]

• Special case: 1-DPP, which samples a single element i with probability proportional to 𝔏_ii

• A k-DPP need not have a corresponding marginal kernel
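The elementary symmetric polynomial admits a simple dynamic program (a sketch of the recursion behind [KT12, Alg. 7]; the brute-force comparison is my own check):

import itertools
import numpy as np

def elem_sym_poly(lam, k):
    """e_k(lam_1, ..., lam_n) via e_k^(n) = e_k^(n-1) + lam_n * e_{k-1}^(n-1)."""
    e = np.zeros(k + 1)
    e[0] = 1.0
    for lmb in lam:
        for deg in range(k, 0, -1):    # high degrees first: each lambda used once
            e[deg] += lmb * e[deg - 1]
    return e[k]

rng = np.random.default_rng(3)
n, k = 6, 3
B = rng.standard_normal((n, n))
L = B @ B.T
lam = np.linalg.eigvalsh(L)

brute = sum(np.linalg.det(L[np.ix_(Y, Y)])
            for Y in itertools.combinations(range(n), k))
print(np.isclose(brute, elem_sym_poly(lam, k)))   # sum_{|Y|=k} det L_Y = e_k(lam)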

Elementary π-DPPs

• Special case: a k-DPP with k = rank(𝔏) and 𝔏 = VΛVᵀ has marginal kernel 𝔎 = VVᵀ.

• A DPP is called elementary if the spectrum of its marginal kernel is {0, 1}.

• We denote this process by 𝒫^V, where V is a set of orthonormal vectors and 𝔎^V = ∑_{v∈V} vvᵀ.

• If 𝒀 ∼ 𝒫^V, then |𝒀| = |V| with probability one. (|𝒀| is a sum of Bernoulli rvs.)

• 𝔎^V is a projection matrix, so these are also called projection DPPs.

Hierarchy of DPPs

[Venn diagram: Strongly Rayleigh measures, DPPs, 𝔏-ensembles, k-DPPs, π-DPPs]

Cauchy-Binet Lemma

F_n = (φⁿ − ψⁿ) / (φ − ψ)   [Binet's formula for the Fibonacci numbers]

[Portrait: Jacques Philippe Marie Binet]

• Consider matrices A of size r × s and B of size s × r.

• For each r-subset Y ⊂ {1, …, s}, construct the square matrices A_{:Y} and B_{Y:}.

det(AB) = ∑_{|Y|=r} det A_{:Y} det B_{Y:}

[Proof]
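A brute-force check of the lemma (my own sketch):

import itertools
import numpy as np

rng = np.random.default_rng(4)
r, s = 3, 6
A = rng.standard_normal((r, s))
B = rng.standard_normal((s, r))

rhs = sum(np.linalg.det(A[:, list(Y)]) * np.linalg.det(B[list(Y), :])
          for Y in itertools.combinations(range(s), r))
print(np.isclose(np.linalg.det(A @ B), rhs))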

DPPs as mixture models

𝒫(𝒀 = Y) ∝ det 𝔏_Y = det [VΛVᵀ]_Y
= det(V_{Y:} √Λ √Λ (V_{Y:})ᵀ)
= ∑_{|J|=|Y|} det(V_{YJ} Λ_{JJ} Vᵀ_{YJ})   [Cauchy-Binet]
= ∑_{|J|=|Y|} det(V_{YJ} Vᵀ_{YJ}) det Λ_{JJ}

The first factor, det(V_{YJ} Vᵀ_{YJ}), is an elementary DPP; the weight det Λ_{JJ} is a diagonal 𝔏-ensemble.

Sampling

• Consider a DPP with 𝔏-ensemble 𝔏 = ∑_n λ_n v_n v_nᵀ.

• For each subset J ⊂ 𝒴, let V_J denote the set {v_n}_{n∈J} and 𝒫^{V_J} the corresponding elementary DPP.

𝒫 ∝ ∑_{J⊂𝒴} 𝒫^{V_J} ∏_{n∈J} λ_n = ∑_{J⊂𝒴} 𝒫^{V_J} det Λ_{JJ}

Factorize the original DPP as a mixture of elementary DPPs.

Sampling via spectral decomposition

STAGE ONE: Choose an elementary DPP 𝒫^{V_J} according to its mixture weight, Pr(J) ∝ ∏_{n∈J} λ_n.

STAGE TWO: Draw a sample from 𝒫^{V_J} by sequentially exploiting the closure of DPPs under conditioning.

[KT12 – p.145]
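A compact numpy sketch of the two-stage spectral sampler ([KT12, Alg. 1]; my own implementation, assuming 𝔏 is symmetric PSD):

import numpy as np

def sample_dpp(L, rng=None):
    """Draw one sample from the DPP with L-ensemble L via the spectral method."""
    rng = rng or np.random.default_rng()
    lam, V = np.linalg.eigh(L)                    # L = V diag(lam) V^T
    # STAGE ONE: keep eigenvector n with probability lam_n / (lam_n + 1)
    V = V[:, rng.random(len(lam)) < lam / (1.0 + lam)]
    # STAGE TWO: sample |J| items from the elementary DPP P^{V_J}
    Y = []
    while V.shape[1] > 0:
        probs = np.sum(V**2, axis=1)
        probs /= probs.sum()                      # P(i) = ||V_{i,:}||^2 / #columns
        i = rng.choice(len(probs), p=probs)
        Y.append(i)
        # condition on i in Y: project the column span orthogonally to e_i
        j = int(np.argmax(np.abs(V[i, :])))       # a column with V[i, j] != 0
        V = V - np.outer(V[:, j] / V[i, j], V[i, :])
        V = np.delete(V, j, axis=1)
        if V.shape[1] > 0:
            V, _ = np.linalg.qr(V)                # re-orthonormalize the basis
    return sorted(Y)

B = np.random.default_rng(5).standard_normal((20, 20))
print(sample_dpp(B @ B.T))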

Sampling in action

[Figure: step-by-step spectral sampling; KT12]

Advanced sampling

• Spectral method for sampling has cost 𝒪(n² + n³ + nk²) [kernel evaluation + eigendecomposition + one sample]

• Dual sampling: instead of using 𝔏 = 𝔅ᵀ𝔅 with 𝔅 of size d × n, use ℭ = 𝔅𝔅ᵀ [KT12 §3.3]

• Random projections

• Nyström approximations: low-rank approximation [Li, Jegelka, Sra 16a]

• MCMC sampling [LJS16b]

• Add, remove, swap moves

• Fast mixing of the chains can be proven in terms of total variation

• Distortion-free intermediate sampling [Dereziński 18; CDV20]

• Suitably construct an intermediate subset σ and then subsample from it
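In practice these samplers are packaged in the DPPy library by Gautier, Bardenet, Polito and Valko. A usage sketch, assuming the FiniteDPP interface from the DPPy docs (check the documentation for exact signatures):

import numpy as np
from dppy.finite_dpps import FiniteDPP

rng = np.random.default_rng(6)
B = rng.standard_normal((50, 50))
L = B @ B.T

dpp = FiniteDPP('likelihood', L=L)   # L-ensemble parameterization
dpp.sample_exact()                   # spectral sampler
dpp.sample_exact_k_dpp(size=10)      # fixed-cardinality k-DPP sample
print(dpp.list_of_samples)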

Learning DPPs

• Basic setting: maximum likelihood.

• Given subsets {Y_t}_{t=1}^T of 𝒴, parameterize the 𝔏-ensemble as 𝔏(θ):

argmax_θ log ∏_t 𝒫_θ(Y_t) = ∑_t [ log det 𝔏_{Y_t}(θ) − log det(𝔏(θ) + 𝕀) ]

• Can use gradient-based methods for optimizing θ.

• Can be extended to conditioning on a covariate X: 𝔏(θ, X)

• For each X we have a DPP

• X may be a query during search on which we want to condition the distribution over results

• See [KT12 §4] for more details.
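A minimal numpy sketch of the log-likelihood (my own; in practice 𝔏(θ) would come from a differentiable model and one would use autodiff):

import numpy as np

def dpp_log_likelihood(L, subsets):
    """sum_t [ log det L_{Y_t} - log det(L + I) ] for observed subsets Y_t."""
    _, log_Z = np.linalg.slogdet(L + np.eye(L.shape[0]))
    ll = 0.0
    for Y in subsets:
        _, log_det = np.linalg.slogdet(L[np.ix_(Y, Y)])
        ll += log_det - log_Z
    return ll

rng = np.random.default_rng(7)
B = rng.standard_normal((6, 6))
print(dpp_log_likelihood(B @ B.T, [[0, 2], [1, 3, 5]]))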

Applications

[Figure: image search (relevance vs diversity) and extractive summarization; KT12]

Applications

• (Quasi) Monte Carlo integration (Gautier et al., On two ways to use DPPs for Monte Carlo integration, 2019)

• Minibatch sampling for SGD (Zhang et al., DPPs for Mini-Batch Diversification, 2017)

• Coresets (Tremblay et al., DPPs for Coresets, 2018)

DPPs in Randomized LinAlg

• Consider a linear regression problem with a tall, full-rank matrix X ∈ ℝ^{n×d} with n ≫ d:

w* = argmin_w ‖Xw − y‖² = X†y

• Sketching: work with an approximating matrix X̃ (subset of rows, low-rank)

• Usual bounds have an (ε, δ)-PAC flavour

• If S ∼ d-DPP(XXᵀ), then 𝔼[X_{S:}⁻¹ y_S] = w* [leverage scores]

• If S ∼ DPP((1/λ) XXᵀ), then 𝔼[X_{S:}† y_S] = argmin_w ‖Xw − y‖² + λ‖w‖² [ridge least squares]

[DM20]
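The first unbiasedness claim can be checked exactly on a tiny problem by enumerating all size-d row subsets (my own sketch of volume sampling, i.e. the d-DPP(XXᵀ)):

import itertools
import numpy as np

rng = np.random.default_rng(8)
n, d = 8, 2
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)
w_star = np.linalg.lstsq(X, y, rcond=None)[0]       # w* = X^+ y

# d-DPP(X X^T) over size-d subsets: P(S) ∝ det(X_S)^2 (volume sampling)
subsets = list(itertools.combinations(range(n), d))
weights = np.array([np.linalg.det(X[list(S), :])**2 for S in subsets])
probs = weights / weights.sum()

estimate = sum(p * np.linalg.solve(X[list(S), :], y[list(S)])
               for p, S in zip(probs, subsets))
print(np.allclose(estimate, w_star))                # E[X_{S:}^{-1} y_S] = w*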

Minibatch sampling for LinReg

• Previously we related DPP sampling to properties of the analytic solution

• What is the influence of non-iid sampling during stochastic optimization?

• Previous work by [Zhang, Kjellström, Mandt 17] for variance reduction

• Toy example: linear model

• Per-example gradients are 'constant' directions determined by the data points

• Redundant points lead to redundant sampled gradients

• Sample minibatches S ∼ d-DPP(XXᵀ) and run SGD with momentum (see the sketch after this list)
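A sketch of the experiment (my own reconstruction, again assuming DPPy's k-DPP sampler; the learning rate and momentum values are illustrative, not those of the deck):

import numpy as np
from dppy.finite_dpps import FiniteDPP

rng = np.random.default_rng(9)
n, d = 200, 5
X = rng.standard_normal((n, d))
w_true = rng.standard_normal(d)
y = X @ w_true + 0.1 * rng.standard_normal(n)

dpp = FiniteDPP('likelihood', L=X @ X.T)       # row similarity = inner products

w, v = np.zeros(d), np.zeros(d)
lr, momentum = 0.01, 0.9
for step in range(500):
    S = dpp.sample_exact_k_dpp(size=d)         # diverse minibatch of d rows
    grad = 2 * X[S].T @ (X[S] @ w - y[S]) / d  # minibatch gradient of mean squared error
    v = momentum * v - lr * grad
    w += v
print(np.linalg.norm(w - w_true))              # should approach the noise floor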

Minibatch sampling for LinReg

[Figure: train/test curves, d-DPP vs IID minibatches, for learning rates η = 1×10⁻¹, 2.5×10⁻¹, 3.5×10⁻¹, 4×10⁻¹; optim_demo]

Overparameterized regime

[Figure: training curves, k-DPP vs IID minibatches; optim_demo]

Determinantal Point Processes are elegant, efficient and useful models of repulsion.
