Jose Gallego-Posada April 2021
Determinantal Point Processes
Brahms 7-8
sli.do -- #MilaDPP
Today's agenda
• Why DPPs?
• Definition and properties
• Sampling
• Applications
Bible for DPPs in ML:
Foundations and Trends in Machine Learning
Determinantal Point Processes for Machine Learning
Alex Kulesza and Ben Taskar (2012) [link]
Presentation based on slides by:
• Simon Barthelmé, Nicolas Tremblay, EUSIPCO19 [link]
• Alex Kulesza, Ben Taskar and Jennifer Gillenwater, CVPR13 [link]
Guillaume Gautier, Rémi Bardenet, Guillermo Polito, Michal Valko
https://github.com/jgalle29/dpp_slides
Variance reduction: Mean estimation
[Figure: IID samples vs DPP samples; BT19, dpp_demo]
Determinantal Point Process
• Base set $A = \{1, \dots, N\}$ from which we sample a random subset $S$.
• $S$ is distributed according to a point process $\mathcal{P}$ over $2^A$.
• $\mathcal{P}(S)$ depends on the determinant of a matrix selected based on the elements of $S$.
Poisson Process
• Simplest point process… too simple!
• Element memberships are parameterized by independent Bernoulli random variables.
• Special case of a DPP with (diagonal) marginal kernel $K = D_p$.

$\mathcal{P}(S) = \prod_{i \in S} p_i \prod_{i \notin S} (1 - p_i)$
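A minimal numpy sketch (toy values of $p$ assumed for illustration): with $K = D_p$, every inclusion probability $\det K_T$ factorizes into $\prod_{i \in T} p_i$, so the process is just independent coin flips.

```python
import numpy as np

p = np.array([0.2, 0.5, 0.9])   # toy inclusion probabilities (illustrative)
K = np.diag(p)                  # marginal kernel of the Poisson/Bernoulli process

# Marginals under the DPP: P(T <= S) = det K_T, which factorizes (independence)
for T in [(0,), (0, 1), (0, 1, 2)]:
    idx = np.array(T)
    det_KT = np.linalg.det(K[np.ix_(idx, idx)])
    assert np.isclose(det_KT, p[idx].prod())

# Sampling reduces to independent coin flips
rng = np.random.default_rng(0)
S = np.flatnonzero(rng.random(len(p)) < p)
print("sample:", S)
```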
Representing repulsion
Desiderata:
i. Density is tractable, including the normalization constant
ii. Inclusion probabilities (marginals) are tractable
iii. Sampling is tractable
iv. Model is easy to understand

Contrary to most Gibbs processes (normalized, exponentiated potentials), DPPs tick all the boxes.
GMs vs DPPs
• Loopy, negative interactions are hard (inference becomes intractable in the worst case)
• Global, negative interactions are easy
[KTG13]
$L$-ensembles
• Model repulsion based on similarity between elements of $A$.
• Similarity between elements $i$ and $j$ is stored in $L_{ij}$.
• We assume $L$ to be positive definite.
• $L$ is known as the likelihood kernel.

We say that $S$ is distributed according to a DPP if
$\mathcal{P}(S) \propto \det L_S$,
where $L_S = [L_{ij}]_{i,j \in S}$ is the submatrix of $L$ indexed by $S$.
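A brute-force sketch of an $L$-ensemble on a tiny ground set (random positive-definite $L$ assumed), enumerating $\det L_S$ for every subset:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
N = 4
B = rng.standard_normal((N, N))
L = B @ B.T + 1e-6 * np.eye(N)   # random positive-definite likelihood kernel

# Unnormalized probability of each subset S: det(L_S); det of the empty matrix is 1
weights = {}
for k in range(N + 1):
    for S in combinations(range(N), k):
        idx = np.array(S, dtype=int)
        weights[S] = np.linalg.det(L[np.ix_(idx, idx)]) if k else 1.0

Z = sum(weights.values())
probs = {S: w / Z for S, w in weights.items()}
print(max(probs, key=probs.get))  # most likely subset
```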
Where did the repulsion go?
Write $L = B^\top B$, so that $L_{ij} = b_i^\top b_j$ for the columns $b_i$ of $B$ (an embedding of $A$):

$\mathcal{P}(S) \propto \det L_S = \det(B_S)^2$, with $B_S = [b_i]_{i \in S}$

[Figure: the embedding vectors $b_1, b_2, \dots$ and the column selection $B_{\{1,2,4\}}$]
Where did the repulsion go?
$\mathcal{P}(\{i,j\} \subseteq S) - \mathcal{P}(i \in S)\,\mathcal{P}(j \in S) = -K_{ij}^2 \le 0$

$\mathrm{Vol}^2(B_S) = \det L_S$
[KTG13]
[Figure: parallelogram spanned by $b_1$ and $b_2$]
Where did the repulsion go?
[Figure: probability under a DPP grows with the volume spanned by $B_S$; BT19, dpp_demo]
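A small sketch of the volume intuition, with two assumed embeddings: nearly parallel columns give a small $\det L_S$, orthogonal columns a large one.

```python
import numpy as np

# Two embeddings of a 2-element set: nearly parallel vs orthogonal vectors
B_similar = np.array([[1.0, 0.95],
                      [0.0, 0.31]])     # columns b_1, b_2 almost collinear
B_diverse = np.array([[1.0, 0.0],
                      [0.0, 1.0]])      # orthogonal columns

for B in (B_similar, B_diverse):
    L = B.T @ B                  # Gram matrix: L_ij = <b_i, b_j>
    print(np.linalg.det(L))      # squared volume of the spanned parallelogram
# Nearly parallel vectors span little volume, so the pair {1,2} is unlikely;
# orthogonal vectors span volume 1, so the pair is favoured.
```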
Normalization
$\sum_{S \subseteq A} \det L_S = \det(L + I)$
An analytic normalization constant!
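A quick brute-force check of the identity on a random kernel (small $N$, all $2^N$ subsets enumerated):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(1)
N = 5
B = rng.standard_normal((N, N))
L = B @ B.T                       # random positive semi-definite kernel

# Sum det(L_S) over all subsets; the empty set contributes 1
total = 1.0
for k in range(1, N + 1):
    for S in combinations(range(N), k):
        idx = np.array(S)
        total += np.linalg.det(L[np.ix_(idx, idx)])

assert np.isclose(total, np.linalg.det(L + np.eye(N)))  # sum_S det L_S = det(L + I)
```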
Exploit linear-algebraic properties to make inference/sampling easy (or feasible in high dimensions)
Marginal kernels
• Consider a DPP with $L$-ensemble $L$.
• The inclusion (marginal) probability that $S$ contains a set $T$ is given by

$\mathcal{P}(T \subseteq S) = \frac{1}{\det(L + I)} \sum_{S' \supseteq T} \det L_{S'} = \det K_T$, with $K = L (L + I)^{-1}$.

• $K$ is known as the marginal kernel of the DPP.
• $\mathcal{P}(i \in S) = K_{ii}$.
• $\mathbb{E}|S| = \mathbb{E}\left[\sum_i \mathbb{1}_{i \in S}\right] = \sum_i \mathcal{P}(i \in S) = \operatorname{tr} K$.
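A sketch verifying the marginal formulas by enumeration (random $L$ assumed): the per-item marginals match $\operatorname{diag}(K)$ and the expected size matches $\operatorname{tr} K$.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(2)
N = 4
B = rng.standard_normal((N, N))
L = B @ B.T
K = L @ np.linalg.inv(L + np.eye(N))    # marginal kernel K = L(L + I)^{-1}

# Enumerate P(S) for every subset, accumulate P(i in S)
Z = np.linalg.det(L + np.eye(N))
marginal = np.zeros(N)
for k in range(1, N + 1):
    for S in combinations(range(N), k):
        idx = np.array(S)
        pS = np.linalg.det(L[np.ix_(idx, idx)]) / Z
        marginal[idx] += pS

assert np.allclose(marginal, np.diag(K))          # P(i in S) = K_ii
print("E|S| =", marginal.sum(), "= tr K =", np.trace(K))
```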
Conditioning
$\mathcal{P}(B \subseteq S \mid A \subseteq S) = \frac{\mathcal{P}(A \cup B \subseteq S)}{\mathcal{P}(A \subseteq S)} = \frac{\det K_{A \cup B}}{\det K_A} = \det\left(K_B - K_{BA} K_A^{-1} K_{AB}\right)$

using the block structure and the Schur complement:

$K_{A \cup B} = \begin{pmatrix} K_A & K_{AB} \\ K_{BA} & K_B \end{pmatrix}, \qquad \det K_{A \cup B} = \det K_A \, \det\left(K_B - K_{BA} K_A^{-1} K_{AB}\right)$

DPPs are closed under conditioning!
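A numerical check of the Schur determinant identity behind the conditioning formula (random positive-definite $K$ and arbitrary index sets assumed):

```python
import numpy as np

rng = np.random.default_rng(3)
M = rng.standard_normal((6, 6))
K = M @ M.T                            # any symmetric positive-definite matrix

A  = np.array([0, 1])                  # conditioning set
Bs = np.array([2, 3])                  # query set
AB = np.concatenate([A, Bs])

K_A  = K[np.ix_(A, A)]
K_B  = K[np.ix_(Bs, Bs)]
K_AB = K[np.ix_(A, Bs)]
schur = K_B - K_AB.T @ np.linalg.inv(K_A) @ K_AB   # K_B - K_BA K_A^{-1} K_AB

lhs = np.linalg.det(K[np.ix_(AB, AB)])
rhs = np.linalg.det(K_A) * np.linalg.det(schur)
assert np.isclose(lhs, rhs)   # det K_{A u B} = det K_A * det(Schur complement)
```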
Complexity?
• Evaluation of $L$: $\mathcal{O}(N^2)$
• Normalization constant: $\mathcal{O}(N^3)$ [determinant]
• Marginal probabilities: $\mathcal{O}(N^3)$ [matrix inversion]
• Conditional probabilities: $\mathcal{O}(N^3)$ [Schur complement]
Questions?
Brahms 7-8
Extensions
• Conditional DPPs
• $k$-DPPs
• Structured DPPs
• Non-symmetric DPPs
$k$-DPPs
• In practical applications, it is often preferable to limit the cardinality of the sampled set:
  • Search results
  • Minibatch selection
  • Summarization

$\mathcal{P}(S) \propto \det L_S \cdot \mathbb{1}\left[\,|S| = k\,\right]$

• Normalization constant: $\sum_{|S| = k} \det L_S = e_k(\lambda_1, \dots, \lambda_N)$ [the $k$-th elementary symmetric polynomial of the eigenvalues of $L$]
• Special case: 1-DPP
• Need not have a corresponding marginal kernel
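A sketch of the normalization computation: `elem_sympoly` (a hypothetical helper name) evaluates $e_k$ via the standard recursion and is checked against brute-force enumeration.

```python
import numpy as np
from itertools import combinations

def elem_sympoly(lam, k):
    """e_0..e_k of the values lam, via the recursion
    e_l(lam_1..lam_n) = e_l(lam_1..lam_{n-1}) + lam_n * e_{l-1}(lam_1..lam_{n-1})."""
    N = len(lam)
    E = np.zeros((k + 1, N + 1))
    E[0, :] = 1.0
    for l in range(1, k + 1):
        for n in range(1, N + 1):
            E[l, n] = E[l, n - 1] + lam[n - 1] * E[l - 1, n - 1]
    return E[:, N]

rng = np.random.default_rng(4)
N, k = 6, 3
B = rng.standard_normal((N, N))
L = B @ B.T
lam = np.linalg.eigvalsh(L)

# Check: sum over |S| = k of det L_S equals e_k(lambda_1, ..., lambda_N)
brute = sum(np.linalg.det(L[np.ix_(S, S)])
            for S in map(np.array, combinations(range(N), k)))
assert np.isclose(brute, elem_sympoly(lam, k)[k])
```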
Elementary DPPs
• A DPP is called elementary if the spectrum of its marginal kernel is contained in $\{0, 1\}$, i.e. $K^V = \sum_{v \in V} v v^\top$ for a set $V$ of orthonormal vectors.
• We denote this process by $\mathcal{P}^V$.
• If $S \sim \mathcal{P}^V$, then $|S| = |V|$ with probability one ($|S|$ is a sum of Bernoulli rvs).
• $K$ is a projection matrix, so these are also called projection DPPs.
• Special case: the $k$-DPP with $k = \operatorname{rank}(L)$ and $L = V \Lambda V^\top$ has marginal kernel $K = V V^\top$.
Hierarchy of DPPs
[Diagram: Strongly Rayleigh measures contain DPPs, which contain $L$-ensembles; $k$-DPPs form a related, partially overlapping class]
Cauchy-Binet Lemma
• Consider matrices $A$ of size $k \times N$ and $B$ of size $N \times k$.
• For each $k$-subset $S \subseteq \{1, \dots, N\}$, construct the square matrices $A_{:S}$ (columns of $A$) and $B_{S:}$ (rows of $B$):

$\det(AB) = \sum_{|S| = k} \det A_{:S} \, \det B_{S:}$

[Proof] [Portraits: A.-L. Cauchy and J.-P.-M. Binet]
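A brute-force numerical check of the lemma on random matrices:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(5)
k, N = 3, 6
A = rng.standard_normal((k, N))
B = rng.standard_normal((N, k))

# det(AB) = sum over k-subsets S of det(A[:, S]) * det(B[S, :])
rhs = sum(np.linalg.det(A[:, S]) * np.linalg.det(B[S, :])
          for S in map(np.array, combinations(range(N), k)))
assert np.isclose(np.linalg.det(A @ B), rhs)
```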
DPPs as mixture models
With the eigendecomposition $L = V \Lambda V^\top$, apply Cauchy-Binet twice:

$\mathcal{P}(S) \propto \det L_S = \det\left(V \Lambda V^\top\right)_S = \det\left(V_{S:} \, \Lambda \, V_{S:}^\top\right)$
$= \sum_{J : |J| = |S|} \det\left(V_{SJ} \Lambda_J\right) \det\left(V_{SJ}^\top\right) = \sum_{J : |J| = |S|} \underbrace{\det\left(V_{SJ} V_{SJ}^\top\right)}_{\text{elementary DPP}} \; \underbrace{\det \Lambda_J}_{\text{diagonal } L\text{-ensemble}}$
Sampling
• Consider a DPP with $L$-ensemble $L = \sum_n \lambda_n v_n v_n^\top$.
• For each subset $J \subseteq A$, let $V_J$ denote the set $\{v_n\}_{n \in J}$, with elementary DPP $\mathcal{P}^{V_J}$:

$\mathcal{P} \propto \sum_{J \subseteq A} \mathcal{P}^{V_J} \prod_{n \in J} \lambda_n = \sum_{J \subseteq A} \mathcal{P}^{V_J} \det \Lambda_J$

Factorize the original DPP as a mixture of elementary DPPs.
Sampling via spectral decomposition
STAGE ONE: choose an elementary DPP $\mathcal{P}^{V_J}$ according to its mixture weight, $\Pr(J) \propto \prod_{n \in J} \lambda_n$.
STAGE TWO: draw a sample from $\mathcal{P}^{V_J}$ by sequentially exploiting the closure of DPPs under conditioning.
[KT12, p. 145]
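A compact numpy sketch of the two-stage sampler, following the structure of [KT12] Algorithm 1 (the rank-one update plus QR re-orthonormalization is one common implementation choice, not the only one):

```python
import numpy as np

def sample_dpp(L, rng):
    """Two-stage spectral sampler for an L-ensemble DPP (KT12, Alg. 1)."""
    lam, V = np.linalg.eigh(L)                 # L = V diag(lam) V^T

    # STAGE ONE: Pr(J) prop. to prod lam_n factorizes, so include eigenvector n
    # independently with probability lam_n / (lam_n + 1)
    keep = rng.random(len(lam)) < lam / (lam + 1.0)
    Vk = V[:, keep]                            # orthonormal basis of the elementary DPP

    # STAGE TWO: sample from P^{V_J}, one item per step, conditioning the
    # remaining subspace on the item just drawn
    Y = []
    while Vk.shape[1] > 0:
        probs = (Vk ** 2).sum(axis=1) / Vk.shape[1]   # P(i) = ||row i||^2 / k
        probs /= probs.sum()                          # guard against float drift
        i = rng.choice(L.shape[0], p=probs)
        Y.append(i)
        # project the basis onto the subspace orthogonal to e_i, dropping one dim
        j = np.argmax(np.abs(Vk[i, :]))               # column with nonzero i-th entry
        v = Vk[:, j].copy()
        Vk = np.delete(Vk, j, axis=1)
        Vk -= np.outer(v, Vk[i, :] / v[i])            # zero out row i of the rest
        if Vk.shape[1] > 0:
            Vk, _ = np.linalg.qr(Vk)                  # re-orthonormalize
    return sorted(Y)

rng = np.random.default_rng(6)
B = rng.standard_normal((8, 8))
print(sample_dpp(B @ B.T, rng))
```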
Sampling in action
[Figure: step-by-step spectral sampling; KT12]
Advanced sampling
• The spectral method costs $\mathcal{O}(N^2 + N^3 + N k^2)$: building $L$, eigendecomposition, and sampling.
• Dual sampling: instead of using $L = B^\top B$ with $B$ of size $D \times N$, use $C = B B^\top$ [KT12 §3.3]
  • Random projections
• Nyström approximations: low-rank approximation [Li, Jegelka, Sra 16a]
• MCMC sampling [LJS16b]
  • Add, remove, swap moves
  • Fast mixing can be proven for these chains in terms of total variation
• Distortion-free intermediate sampling [Dereziński 18; CDV20]
  • Suitably construct an intermediate subset $\sigma$ and then subsample from it
Learning DPPs
• Basic setting: maximum likelihood
• Given subsets $\{S_t\}_{t=1}^{T}$ of $A$, parameterize the $L$-ensemble as $L(\theta)$:

$\operatorname{argmax}_\theta \; \log \prod_t \mathcal{P}_\theta(S_t) = \operatorname{argmax}_\theta \; \sum_t \left[ \log \det L_{S_t}(\theta) - \log \det\left(L(\theta) + I\right) \right]$

• Can use gradient-based methods for optimizing $\theta$
• Can be extended to conditioning on a covariate $X$: $L(\theta, X)$
  • For each $X$ we have a DPP
  • $X$ may be a query during search, on which we want to condition the distribution over results
• See [KT12 §4] for more details
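A hedged sketch of gradient-based MLE using PyTorch autodiff, with a low-rank parameterization $L(\theta) = B B^\top + \epsilon I$ (one convenient choice, not the only one); the "observed" subsets below are random stand-ins, where real data would be samples from an actual DPP.

```python
import torch

torch.manual_seed(0)
N, D, T = 20, 5, 200
# Stand-in observations: random subsets of sizes 2..5 (illustrative only)
samples = [torch.randperm(N)[:torch.randint(2, 6, (1,)).item()] for _ in range(T)]

B = torch.randn(N, D, requires_grad=True)
opt = torch.optim.Adam([B], lr=1e-2)
I = torch.eye(N)

for step in range(300):
    L = B @ B.T + 1e-3 * I                        # keeps L positive definite
    ll = sum(torch.logdet(L[S][:, S]) for S in samples)
    ll = ll - T * torch.logdet(L + I)             # shared normalization term
    opt.zero_grad()
    (-ll).backward()                              # gradient ascent on log-likelihood
    opt.step()
```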
Applications
[Figure: image search (relevance vs diversity) and extractive summarization; KT12]
Applications
• (Quasi) Monte Carlo integration (Gautier et al., On two ways to use DPPs for Monte Carlo integration, 2019)
• Mini-batch sampling for SGD (Zhang et al., DPPs for Mini-Batch Diversification, 2017)
• Coresets (Tremblay et al., DPPs for Coresets, 2018)
DPPs in Randomized LinAlg
• Consider a linear regression problem with a tall, full-rank matrix $X \in \mathbb{R}^{n \times d}$, $n \gg d$:

$w^* = \operatorname{argmin}_w \|Xw - y\|^2 = X^{+} y$

• Sketching: approximate $X$ with $\tilde{X}$ (subset of rows, low rank)
• The usual bounds have an $(\epsilon, \delta)$-PAC flavour
• If $S \sim d\text{-DPP}(X X^\top)$, then $\mathbb{E}\left[X_{S:}^{-1} y_S\right] = w^*$ [leverage scores]
• If $S \sim \text{DPP}\left(\frac{1}{\lambda} X X^\top\right)$, then $\mathbb{E}\left[X_{S:}^{+} y_S\right] = \operatorname{argmin}_w \|Xw - y\|^2 + \lambda \|w\|^2$ [ridge least squares]
[DM20]
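A sketch verifying the first (unregularized) identity by exact enumeration of all $d$-subsets, since the $d$-DPP($XX^\top$) restricted to size-$d$ sets is volume sampling with $\Pr(S) \propto \det(X_{S:})^2$:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(7)
n, d = 8, 3
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)
w_star = np.linalg.lstsq(X, y, rcond=None)[0]

# d-DPP(X X^T) on size-d subsets: P(S) prop. to det((X X^T)_S) = det(X_S)^2
subsets = [np.array(S) for S in combinations(range(n), d)]
w = np.array([np.linalg.det(X[S]) ** 2 for S in subsets])
w /= w.sum()

# E[X_S^{-1} y_S] under the d-DPP recovers the least-squares solution exactly
est = sum(p * np.linalg.solve(X[S], y[S]) for p, S in zip(w, subsets))
assert np.allclose(est, w_star)
```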
Minibatch sampling for LinReg
• Previously we related sampling to properties of the analytic solution
• What is the influence of non-IID sampling during stochastic optimization?
• Previous work by [Zhang, Kjellström, Mandt 17] on variance reduction
• Toy example: linear model
  • Gradients are "constant" and correspond to points
  • Redundant points lead to redundant sampled gradients
• Sample minibatches $S \sim k\text{-DPP}(X X^\top)$ and run SGD with momentum (see the sketch below)
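A sketch of this experiment's structure on toy data, assuming the DPPy package (`pip install dppy`; `FiniteDPP` with a likelihood kernel and `sample_exact_k_dpp` per its documented API); data, step size, and momentum values are illustrative.

```python
import numpy as np
from dppy.finite_dpps import FiniteDPP

rng = np.random.RandomState(8)
n, d, k = 200, 5, 10
X = rng.randn(n, d)
y = X @ rng.randn(d) + 0.1 * rng.randn(n)

dpp = FiniteDPP('likelihood', L=X @ X.T)   # likelihood kernel over data points

w, v = np.zeros(d), np.zeros(d)            # SGD with momentum
for step in range(100):
    S = dpp.sample_exact_k_dpp(size=k, random_state=rng)   # diverse minibatch
    grad = 2 * X[S].T @ (X[S] @ w - y[S]) / k              # minibatch MSE gradient
    v = 0.9 * v - 0.1 * grad
    w += v
```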
Minibatch sampling for LinReg
[Figure: train/test loss for $k$-DPP vs IID minibatches at learning rates $\eta \in \{1 \times 10^{-1}, 2.5 \times 10^{-1}, 3.5 \times 10^{-1}, 4 \times 10^{-1}\}$; optim_demo]
Overparameterized regime
[Figure: training loss for $k$-DPP vs IID minibatches; optim_demo]
Determinantal Point Processes
are elegant, efficient and useful
models of repulsion