machinelearning.math.rs/Zecevic-DPP.pdf · 2019-10-23


Determinantal Point Processes

Anđelka Zečević, [email protected]

Determinantal Point Processes (DPPs)

Problem of interest: how can we select a subset of a given set that has both good quality and good diversity?

Sources:
Alex Kulesza & Ben Taskar, 2013: Determinantal Point Processes for Machine Learning
Alexei Borodin, 2009: Determinantal Point Processes

2


Motivation - Information Retrieval

3


Motivation - Recommender Systems

4


Motivation - Extractive Text Summarization

5

Discrete Point Processes

● Number of items (videos, sentences, …): N
● Ground set 𝒴: the set of item indexes {1, 2, …, N}
● Number of subsets: 2^N
● Point process: a probability measure over the 2^N subsets of the ground set

6

Discrete Point Processes: Independent Point Process

Each element i is included independently with probability p_i, so P(Y) = ∏_{i∈Y} p_i ∏_{i∉Y} (1 - p_i).

7
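As a sanity check, the probabilities an independent point process assigns to all 2^N subsets can be enumerated directly. This is a small illustrative sketch; the inclusion probabilities p_i are made up:

```python
from itertools import combinations
import numpy as np

# An independent point process assigns each subset Y the probability
#   P(Y) = prod_{i in Y} p_i * prod_{i not in Y} (1 - p_i).
# The inclusion probabilities p_i below are illustrative.
p = np.array([0.9, 0.5, 0.1])
N = len(p)

def prob(Y):
    inc = np.prod([p[i] for i in Y]) if Y else 1.0
    exc = np.prod([1.0 - p[i] for i in range(N) if i not in Y])
    return inc * exc

# Enumerate all 2^N subsets: the probabilities must sum to 1.
subsets = [set(c) for k in range(N + 1) for c in combinations(range(N), k)]
total = sum(prob(Y) for Y in subsets)
print(total)
```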

DPP - marginal kernel

Let K = [K_ij] be an N × N real, symmetric matrix with the properties:

- K is positive semidefinite: x^T K x ≥ 0 for all x ∈ R^N (equivalently, all principal minors of K are non-negative; equivalently, all eigenvalues of K are non-negative)
- all eigenvalues of K are bounded above by 1

The matrix K is the so-called marginal kernel.

Intuition: K_ij represents the similarity of items i and j.

8

DPP - marginal kernel

If A ⊆ 𝒴, then K_A = [K_ij]_{i,j∈A} is the submatrix of K with rows and columns indexed by the elements of A.

9

DPP - definition

A point process 𝒫 is called a determinantal point process if, for a random subset Y drawn according to 𝒫 and for every A ⊆ 𝒴, it holds that

P(A ⊆ Y) = det(K_A)

Adaptation: det(K_∅) = 1 by convention.

Notation: Y ~ DPP(K)

10

DPP - diversification

For A = {i}: P(i ∈ Y) = K_ii

For A = {i, j}: P({i, j} ⊆ Y) = K_ii K_jj - K_ij K_ji = P(i ∈ Y) P(j ∈ Y) - K_ij^2

A large K_ij (similar items) makes i and j less likely to appear together: DPPs have the ability to diversify!

11
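The repulsion can be seen numerically. A minimal sketch with a hand-picked 2 × 2 marginal kernel (the entries are illustrative, not from the slides):

```python
import numpy as np

# Marginals of a DPP with a small hand-picked marginal kernel K.
# Since P(A ⊆ Y) = det(K_A):
#   P(i in Y)   = K_ii
#   P(i,j in Y) = K_ii K_jj - K_ij^2  <=  P(i in Y) P(j in Y)
# so a large off-diagonal K_ij makes items i and j repel.
K = np.array([[0.6, 0.4],
              [0.4, 0.5]])

p_i = K[0, 0]                 # P(0 in Y)
p_j = K[1, 1]                 # P(1 in Y)
p_ij = np.linalg.det(K)       # P({0,1} ⊆ Y) = 0.6*0.5 - 0.4^2
print(p_ij, p_i * p_j)
```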


DPP - diversification

12


DPP - diversification

13

DPP properties

Restriction
If Y is distributed as a DPP with marginal kernel K and A ⊆ 𝒴, then Y ∩ A is distributed as a DPP with marginal kernel K_A.

Complement
If Y is distributed as a DPP with marginal kernel K, then 𝒴 ∖ Y is distributed as a DPP with marginal kernel I - K.

Scaling
If K = ɣK′ for some 0 ≤ ɣ < 1, then for all A ⊆ 𝒴 we have det(K_A) = ɣ^|A| det(K′_A).

14

L-ensemble

Let L = [L_ij] be an N × N real, symmetric, positive semidefinite matrix.

An L-ensemble is a point process that satisfies

P_L(Y) ∝ det(L_Y)

Normalization constant: ∑_{Y⊆𝒴} det(L_Y) = det(L + I)

For any Y ⊆ 𝒴 we therefore have P_L(Y) = det(L_Y) / det(L + I)

Borodin & Rains, 2005

15
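The normalization identity ∑_{Y⊆𝒴} det(L_Y) = det(L + I) can be verified by brute force on a small random PSD matrix (an illustrative sketch, not part of the original slides):

```python
from itertools import combinations
import numpy as np

# Brute-force check of the L-ensemble normalization constant:
#   sum over all subsets Y of det(L_Y) equals det(L + I).
rng = np.random.default_rng(0)
G = rng.standard_normal((4, 4))
L = G @ G.T                      # a small random symmetric PSD matrix
N = L.shape[0]

def det_sub(M, Y):
    idx = list(Y)
    # By convention the determinant of the empty minor is 1.
    return np.linalg.det(M[np.ix_(idx, idx)]) if idx else 1.0

Z = sum(det_sub(L, c) for k in range(N + 1) for c in combinations(range(N), k))
print(Z, np.linalg.det(L + np.eye(N)))
```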

DPP and L-ensemble

● An L-ensemble is a DPP with marginal kernel K given by K = L(L + I)^{-1} = I - (L + I)^{-1}

● Not all DPPs are L-ensembles! When any eigenvalue of K achieves the upper bound of 1, the DPP is not an L-ensemble.

16
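The relation K = L(L + I)^{-1} can be checked numerically: the marginal P(A ⊆ Y), obtained by summing exact subset probabilities det(L_Y)/det(L + I), should match det(K_A). A sketch with a small random L:

```python
from itertools import combinations
import numpy as np

# Verify that K = L (L + I)^{-1} reproduces the L-ensemble marginals.
rng = np.random.default_rng(0)
G = rng.standard_normal((4, 4))
L = G @ G.T
N = L.shape[0]
I = np.eye(N)
K = L @ np.linalg.inv(L + I)

def det_sub(M, Y):
    idx = list(Y)
    return np.linalg.det(M[np.ix_(idx, idx)]) if idx else 1.0

Z = np.linalg.det(L + I)
A = {0, 2}
# Sum exact probabilities of every subset containing A.
marginal = sum(det_sub(L, set(c)) / Z
               for k in range(N + 1)
               for c in combinations(range(N), k)
               if A <= set(c))
print(marginal, det_sub(K, A))
```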

DPP and L-ensemble

If L = ∑_{n=1}^{N} λ_n v_n v_n^T is an eigendecomposition of L, then

K = ∑_{n=1}^{N} (λ_n / (λ_n + 1)) v_n v_n^T

is an eigendecomposition of K.

17


Sampling Algorithm

18

Sampling Algorithm Complexity

Algorithm complexity: O(N k^3), where k = |V| is the number of selected eigenvectors.

Gram-Schmidt orthonormalization: O(N k^2)

Bottleneck: the eigendecomposition of L, with complexity O(N^3).

There exist exact and approximate DPP sampling algorithms with lower complexity.

Dereziński et al., 2019

19
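For reference, the spectral sampling algorithm can be sketched in a few lines of NumPy, following Kulesza & Taskar (Algorithm 1). This is an illustrative implementation with a random demonstration kernel, not the authors' code:

```python
import numpy as np

def sample_dpp(L, rng):
    """Spectral DPP sampler for an L-ensemble (illustrative sketch)."""
    lam, eigvecs = np.linalg.eigh(L)
    # Phase 1: keep eigenvector n independently w.p. lambda_n / (1 + lambda_n).
    keep = rng.random(lam.shape[0]) < lam / (1.0 + lam)
    V = eigvecs[:, keep]
    Y = []
    while V.shape[1] > 0:
        # Phase 2: P(pick item i) = (1/|V|) * sum over columns v of v_i^2.
        probs = (V ** 2).sum(axis=1) / V.shape[1]
        i = int(rng.choice(probs.shape[0], p=probs))
        Y.append(i)
        # Zero out coordinate i using one column with V[i, j] != 0,
        # drop that column, and re-orthonormalize the rest (QR step).
        j = int(np.argmax(np.abs(V[i, :])))
        V = V - np.outer(V[:, j], V[i, :] / V[i, j])
        V = np.delete(V, j, axis=1)
        if V.shape[1] > 0:
            V, _ = np.linalg.qr(V)
    return sorted(Y)

rng = np.random.default_rng(0)
G = rng.standard_normal((5, 5))
L = G @ G.T          # a random PSD kernel for demonstration
print(sample_dpp(L, rng))
```

The QR call plays the role of the Gram-Schmidt step mentioned above; the O(N^3) eigendecomposition at the top is the stated bottleneck.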


Sampling Algorithm - Visualization

20

NP Hardness

Ko et al. (1995): finding the set Y that maximizes det(L_Y) is NP-hard.

log det(L_Y) is a submodular function, so the maximization can be approximated in polynomial time by a greedy algorithm.

21

Submodularity

A set function f is submodular if f(A ∪ {i}) - f(A) ≥ f(B ∪ {i}) - f(B) whenever A ⊆ B and i ∉ B: adding an element to a larger set helps less (diminishing returns).

22

Question of Diversity

L can be decomposed as L = B^T B for some D × N matrix B, with D ≤ N. Let B_i be the i-th column of B. Then det(L_Y) is the squared volume of the parallelepiped spanned by the vectors {B_i : i ∈ Y}, so more orthogonal (diverse) sets of items receive higher probability.

23
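The volume interpretation is easy to see on a toy example. The feature vectors below are illustrative:

```python
import numpy as np

# With L = B^T B, det(L_Y) is the squared volume of the parallelepiped
# spanned by the columns {B_i : i in Y}. Near-parallel (similar) items
# give a small volume, orthogonal (diverse) items a large one.
B = np.array([[1.0, 0.9, 0.0],
              [0.0, 0.1, 1.0]])   # D = 2 features, N = 3 items
L = B.T @ B

near_parallel = np.linalg.det(L[np.ix_([0, 1], [0, 1])])  # items 0, 1 similar
orthogonal = np.linalg.det(L[np.ix_([0, 2], [0, 2])])     # items 0, 2 orthogonal
print(near_parallel, orthogonal)
```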

Question of Quality

The columns of B are vectors that represent the items of the ground set. Each column can be written as B_i = q_i φ_i, where q_i ≥ 0 measures the quality of item i and φ_i is its normalized diversity feature vector (‖φ_i‖ = 1).

24

Question of Quality

Similarity matrix S: S_ij = φ_i^T φ_j, so that L_ij = q_i S_ij q_j.

25


Question of Quality and Diversity

26

Decomposition for large sets

Dual representation: C = B B^T is a D × D real, symmetric, positive semidefinite matrix. It is much smaller than L = B^T B when D ≪ N, yet shares its nonzero eigenvalues with L.

27
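The eigenvalue equivalence between L and its dual can be checked directly; B is random here, purely for illustration:

```python
import numpy as np

# Dual representation: C = B B^T (D x D) has the same nonzero
# eigenvalues as L = B^T B (N x N), so for D << N one can work
# with the much smaller matrix C.
rng = np.random.default_rng(0)
D, N = 3, 8
B = rng.standard_normal((D, N))
L = B.T @ B
C = B @ B.T

eig_L = np.sort(np.linalg.eigvalsh(L))[-D:]   # the D nonzero eigenvalues of L
eig_C = np.sort(np.linalg.eigvalsh(C))
print(np.allclose(eig_L, eig_C))
```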

Decomposition for large sets

Dual representation:

28

Decomposition for large sets

Dual representation:

29

Decomposition for large sets

Random projection: the dimension D of the diversity features can be large. Idea: project the diversity vectors to a space of low dimension d.

There are theoretical guarantees that random projection approximately preserves distances.

30

Learning

Training data: pairs of inputs and example subsets.

We assume that the DPP kernel L(X; θ) is parametrized in terms of generic parameters θ that reflect quality and/or diversity properties.

31

Learning

A conditional DPP is a conditional probabilistic model which assigns a probability P(Y | X) to every possible subset Y. The model takes the form of an L-ensemble:

P(Y | X) ∝ det(L_Y(X))

where L(X) is a positive semidefinite matrix that depends on the input X.

32

Learning

Goal: choose the parameter θ to maximize the conditional log-likelihood of the training data (so-called maximum likelihood estimation, MLE):

θ̂ = argmax_θ ∑_t log P_θ(Y^(t) | X^(t))

where P_θ(Y | X) is the conditional probability of an output Y given input X under parameter θ.

33

Learning the Parameters of DPP Kernels

● It is conjectured that MLE for DPPs is NP-hard to compute (Kulesza, 2012)
○ non-convex optimization
○ non-convexity holds even under various simplified assumptions on the form of L
○ approximating the mode of size k of a DPP to within a c^k factor (c > 1) is known to be NP-hard

● Special cases of quality or similarity functions:
○ Gaussian similarity with uniform quality
○ Gaussian similarity with Gaussian quality
○ Polynomial similarity with uniform quality

● Nelder-Mead simplex algorithm: does not require explicit knowledge of the derivatives of the log-likelihood function, but there are no theoretical guarantees of convergence to a stationary point.

34

Learning the Parameters of DPP Kernels

Heuristics:
○ Expectation-Maximization (Gillenwater et al., 2014)
○ MCMC (Affandi et al., 2014)
○ Fixed-point algorithms (Mariet & Sra, 2015)

35

Extractive Document Summarization

Goal: learn a DPP to model good summaries Y for a given input X.

Kulesza & Taskar, 2012

36

Extractive Document Summarization

Similarity features: sentences are represented as normalized tf-idf vectors.

Similarity function: cosine similarity among sentences (dot products of the normalized tf-idf vectors).

37

Extractive Document Summarization

Quality scores: modeled as a log-linear function of f_i(X), a feature vector for sentence i, with parameter vector θ.

Quality features: length, position of the sentence in its original document, mean cluster similarity, personal pronouns, LexRank score, …

38

Extractive Document Summarization

Quality and similarity are combined in the form of the conditional DPP probability:

P_θ(Y | X) ∝ (∏_{i∈Y} q_i^2(X)) · det(S_Y(X))

It holds that the log-likelihood is concave in θ.

Gradient: the gradient of the log-likelihood can be computed in closed form, enabling gradient-based optimization.

39

Extractive Document Summarization

Learning the quality parameters by gradient descent.

40

Extractive Document Summarization

Inference: for a given X and learned parameter θ, find the Y with at most b characters. Maximum a posteriori:

Y_MAP = argmax_Y P_θ(Y | X)

log P_θ(Y | X) is submodular, so it can be approximately maximized with a simple greedy algorithm.

41
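The greedy step can be sketched as follows. This is a generic greedy log-det maximizer run on an illustrative random kernel, not the authors' implementation; the character-budget constraint is omitted for brevity:

```python
import numpy as np

# Greedy MAP sketch for a DPP: repeatedly add the item that yields the
# largest det(L_Y) (i.e. the largest marginal gain in log det), up to
# k items. This is the standard greedy heuristic for submodular
# maximization; L below is a random demonstration kernel.
def greedy_map(L, k):
    Y = []
    for _ in range(k):
        best_i, best_val = None, -np.inf
        for i in range(L.shape[0]):
            if i in Y:
                continue
            idx = Y + [i]
            sign, logdet = np.linalg.slogdet(L[np.ix_(idx, idx)])
            if sign > 0 and logdet > best_val:
                best_i, best_val = i, logdet
        if best_i is None:      # no item increases the determinant
            break
        Y.append(best_i)
    return Y

rng = np.random.default_rng(0)
G = rng.standard_normal((6, 6))
L = G @ G.T
print(greedy_map(L, 3))
```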

Diversified Recommendation

● Content is presented in the form of a feed: an ordered list of items through which the user browses
● The goodness of a recommendation is measured via a utility function
● Goal: select and order a set of k items such that the utility of the set is maximized

Gillenwater et al., 2018

42

k-DPP

A k-DPP on a discrete set 𝒴 is a distribution over all subsets Y ⊆ 𝒴 with cardinality k:

P_L^k(Y) = det(L_Y) / e_k(λ_1, …, λ_N)

The normalization constant is e_k(λ_1, …, λ_N), where e_k is the k-th elementary symmetric polynomial over the eigenvalues λ_1, …, λ_N of the matrix L.

43

k-DPP

For N = 3, the elementary symmetric polynomials are:

e_1(λ_1, λ_2, λ_3) = λ_1 + λ_2 + λ_3
e_2(λ_1, λ_2, λ_3) = λ_1 λ_2 + λ_1 λ_3 + λ_2 λ_3
e_3(λ_1, λ_2, λ_3) = λ_1 λ_2 λ_3

There is an efficient recursive algorithm for computing elementary symmetric polynomials.

44
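The recursion alluded to above builds e_k over prefixes of the eigenvalue list, using e_k(λ_1..λ_n) = e_k(λ_1..λ_{n-1}) + λ_n e_{k-1}(λ_1..λ_{n-1}). A small sketch:

```python
import numpy as np

# Dynamic-programming recursion for elementary symmetric polynomials:
#   e_k(lam_1..lam_n) = e_k(lam_1..lam_{n-1}) + lam_n * e_{k-1}(lam_1..lam_{n-1})
def elementary_symmetric(lam, k_max):
    N = len(lam)
    E = np.zeros((k_max + 1, N + 1))
    E[0, :] = 1.0                      # e_0 = 1 for any prefix
    for k in range(1, k_max + 1):
        for n in range(1, N + 1):
            E[k, n] = E[k, n - 1] + lam[n - 1] * E[k - 1, n - 1]
    return E[:, N]                     # e_0 .. e_{k_max} over all N eigenvalues

# For eigenvalues 1, 2, 3 this gives e_0..e_3 = 1, 6, 11, 6,
# matching the N = 3 formulas above.
print(elementary_symmetric([1.0, 2.0, 3.0], 3))
```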

k-DPP

Sampling: analogous to the standard DPP sampler; the first phase selects exactly k eigenvectors, with probabilities computed from the elementary symmetric polynomials.

45

Diversified Recommendation

DPP inputs: personalized quality scores and pairwise item distances.

46

Diversified Recommendation

The observed interaction of user u with a feed list of length N is given as a binary vector: y_u = [0, 1, 0, 1, 1, …, 0]

Goal: maximize the total number of interactions.

Slight modification, in order to train models from records of previous interactions: maximize the cumulative gain obtained by reranking the feed items, where j is the new rank that the model assigns to an item.

47

Diversified Recommendation

The interaction y_u = [0, 1, 0, 1, 1, …, 0] can be written as Y = {2, 4, 5}.

Assumption: Y represents a draw from the probability distribution defined by a user-specific DPP.

48

Diversified Recommendation

q_i: personalized quality score
D_ij: Jaccard distance built on video descriptions

ɑ ∈ [0, 1) and σ are learned parameters.

A grid search is run to find the values of the parameters that maximize the cumulative gain.

49

Diversified Recommendation

f: parametrized quality function
g: parametrized content re-embedding function
ẟ: fixed regularization parameter
w: the parameters of L

50

Diversified Recommendation

Inference: a greedy algorithm for submodular maximization over a size-k window.

It runs k iterations, adding one video to Y in each iteration.

The greedy algorithm gives the natural order for the videos in the size-k window. Then the algorithm is repeated for the unused N - k videos.

51

DPP classes

● Discrete DPP
● k-DPP
● Continuous DPP
● Structured DPP: translation, bioinformatics, …
● Sequential DPP: video summarization, …

52

Code

Matlab: https://www.alexkulesza.com/code/dpp.tgz

Python: https://github.com/guilgautier/DPPy

53


Thank you!

54