TRANSCRIPT
DETERMINANTAL POINT PROCESSES FOR NATURAL LANGUAGE PROCESSING
Jennifer Gillenwater, joint work with Alex Kulesza and Ben Taskar
OUTLINE
Motivation & background on DPPs
Large-scale settings
Structured summarization
Other potential NLP applications
MOTIVATION & BACKGROUND
SUMMARIZATION
Quality: relevance to the topic
Diversity: coverage of core ideas
SUBSET SELECTION
AREA AS SET-GOODNESS
(feature space, with item vectors B_i, B_j and their sum B_i + B_j)
quality = sqrt(B_i^T B_i)
similarity = B_i^T B_j
area = sqrt( ||B_i||_2^2 ||B_j||_2^2 - (B_i^T B_j)^2 )
VOLUME AS SET-GOODNESS
area = sqrt( ||B_i||_2^2 ||B_j||_2^2 - (B_i^T B_j)^2 )
length = ||B_i||_2
volume = base x height: vol(B) = ||B_1||_2 * vol(proj_{perp B_1}(B_{2:N}))
AREA AS A DET
area = sqrt( ||B_i||_2^2 ||B_j||_2^2 - (B_i^T B_j)^2 )
     = det( [ ||B_i||_2^2  B_i^T B_j ; B_i^T B_j  ||B_j||_2^2 ] )^{1/2}
     = det( [B_i B_j]^T [B_i B_j] )^{1/2}

VOLUME AS A DET
vol(B_{i,j}) = det( [B_i B_j]^T [B_i B_j] )^{1/2}
vol(B)^2 = det( [B_1 ... B_N]^T [B_1 ... B_N] ) = det(B^T B) = det(L)
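The area/determinant identity on this slide is easy to check numerically. A minimal sketch (NumPy, with made-up feature vectors): the parallelogram area from the quality/similarity formula matches det(B^T B)^{1/2}.

```python
import numpy as np

# Two item feature vectors (columns of B); values are illustrative.
Bi = np.array([3.0, 0.0, 1.0])
Bj = np.array([1.0, 2.0, 0.0])
B = np.column_stack([Bi, Bj])  # D x N matrix, here 3 x 2

# Area of the parallelogram spanned by Bi and Bj, two ways.
area_direct = np.sqrt(np.dot(Bi, Bi) * np.dot(Bj, Bj) - np.dot(Bi, Bj) ** 2)
area_det = np.sqrt(np.linalg.det(B.T @ B))  # det(L)^{1/2} with L = B^T B

print(area_direct, area_det)  # the two agree
```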
COMPLEX STATISTICS
N items => 2^N sets
EFFICIENT COMPUTATION
Computing a determinant takes O(N^3) time.
POINT PROCESSES
Y = {1, . . . , N}
A point process assigns a probability to each subset, e.g. P(some subset) = 0.2.
DETERMINANTAL
P({2, 3, 5}) is proportional to the determinant of the submatrix of L indexed by {2, 3, 5}:

P({2, 3, 5}) = det( [ L22 L23 L25 ; L32 L33 L35 ; L52 L53 L55 ] ) / det(L + I)
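The normalizer det(L + I) can be verified by brute force on a small example. A sketch (NumPy, random PSD kernel): summing det(L_Y) over all 2^N subsets recovers det(L + I).

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
N = 4
B = rng.normal(size=(3, N))
L = B.T @ B  # a PSD DPP kernel

# Brute-force: sum det(L_Y) over all 2^N subsets (empty set contributes 1).
total = 0.0
for r in range(N + 1):
    for Y in itertools.combinations(range(N), r):
        if Y:
            idx = np.array(Y, dtype=int)
            total += np.linalg.det(L[np.ix_(idx, idx)])
        else:
            total += 1.0

print(total, np.linalg.det(L + np.eye(N)))  # the two agree
```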
EFFICIENT INFERENCE
Normalizing:   P_L(Y = Y)
Marginalizing: P(Y ⊆ Y)
Conditioning:  P_L(Y = B | A ⊆ Y),  P_L(Y = B | A ∩ Y = ∅)
Sampling:      Y ~ P_L
All in O(N^3) time.
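Marginals, in particular, come from the marginal kernel K = L(L + I)^{-1}: P(i ∈ Y) = K_ii. A sketch checking this against a brute-force sum over subsets (small N, random kernel; the marginal-kernel formula is standard DPP machinery, not specific to these slides):

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
N = 4
B = rng.normal(size=(3, N))
L = B.T @ B
Z = np.linalg.det(L + np.eye(N))

# Marginal kernel: K = L (L + I)^{-1}, so P(i in Y) = K_ii.
K = L @ np.linalg.inv(L + np.eye(N))

# Brute-force marginal of item 0: sum P(Y) over subsets containing 0.
p0 = 0.0
for r in range(1, N + 1):
    for Y in itertools.combinations(range(N), r):
        if 0 in Y:
            idx = np.array(Y, dtype=int)
            p0 += np.linalg.det(L[np.ix_(idx, idx)]) / Z

print(p0, K[0, 0])  # the two agree
```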
LARGE-SCALE SETTINGS
DUAL KERNEL (KULESZA AND TASKAR, NIPS 2010)
L = B^T B is N x N; the dual kernel C = B B^T is only D x D.
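The reason the dual kernel helps: L = B^T B and C = B B^T share their nonzero eigenvalues, and det(L + I_N) = det(C + I_D) (Sylvester's determinant identity), so determinant-based quantities can be computed from the small D x D matrix. A quick numerical check (random B, illustrative sizes):

```python
import numpy as np

rng = np.random.default_rng(2)
D, N = 3, 100  # few features, many items
B = rng.normal(size=(D, N))

L = B.T @ B  # N x N primal kernel
C = B @ B.T  # D x D dual kernel

# Nonzero eigenvalues coincide, and the DPP normalizer matches.
top_eigs_L = np.sort(np.linalg.eigvalsh(L))[-D:]
eigs_C = np.sort(np.linalg.eigvalsh(C))
print(np.allclose(top_eigs_L, eigs_C))
print(np.isclose(np.linalg.det(L + np.eye(N)),
                 np.linalg.det(C + np.eye(D))))
```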
DUAL INFERENCE
L = V Λ V^T,  C = V̂ Λ V̂^T,  V = B^T V̂ Λ^{-1/2}
Normalizing:   Σ_Y det(L_Y) in O(D^3)
Marginalizing & Conditioning: O(D^3 + D^2 k^2)
Sampling:      Y ~ P_L in O(N D^2 k)
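The lifting step V = B^T V̂ Λ^{-1/2} can be checked directly: eigenvectors of the small dual kernel C, mapped through B^T and rescaled, give orthonormal eigenvectors of L with the same eigenvalues. A sketch (random B, illustrative sizes):

```python
import numpy as np

rng = np.random.default_rng(3)
D, N = 3, 50
B = rng.normal(size=(D, N))
L = B.T @ B
C = B @ B.T

# Eigendecompose the small dual kernel C = Vhat diag(lam) Vhat^T ...
lam, Vhat = np.linalg.eigh(C)

# ... and lift to eigenvectors of L via V = B^T Vhat diag(lam)^{-1/2}.
V = B.T @ Vhat @ np.diag(lam ** -0.5)

# Check: L V = V diag(lam), and V has orthonormal columns.
print(np.allclose(L @ V, V @ np.diag(lam)))
print(np.allclose(V.T @ V, np.eye(D)))
```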
EXPONENTIAL N
N = O({sentence length}^{sentence length})
We want to select a diverse set of parses.
N = O({node degree}^{path length})
STRUCTURE FACTORIZATION (KULESZA AND TASKAR, NIPS 2010)
B_i = q(i) φ(i)   (quality x similarity)
A structure factors into parts: i = {i_α}_{α ∈ F}, with factors of size c.
B_i = ( Π_{α ∈ F} q(i_α) ) ( Σ_{α ∈ F} φ(i_α) )
Sampling: Y ~ P_L in O(D^2 k^3 + D k^2 M^c R), where M is the number of values a part can take and R is the number of factors.
Example: M^c R = 4^2 * 12 = 192 ≪ N = 4^12 = 16,777,216
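The gain is easy to see with the slide's numbers (a sketch; M, c, R as read off the example above, where the number of structures happens to be M^R):

```python
# Structured sampling cost scales with M^c * R (part values to the
# factor size, times the number of factors), not with N, the number
# of whole structures, which here is M^R.
M, c, R = 4, 2, 12
structured_terms = M ** c * R  # 192
num_structures = M ** R        # 16,777,216
print(structured_terms, num_structures)
```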
LARGE FEATURE SETS?
                         N = # of items
D = # of features        Large      Exponential
  Small                  dual       dual + structure
  Large                  ?          ?
RANDOM PROJECTIONS (GILLENWATER, KULESZA, AND TASKAR, EMNLP 2012)
Project the D x N feature matrix down to d x N, with d ≪ D; the structure factorization (M^c R) is preserved.
VOLUME PRESERVATION
JOHNSON AND LINDENSTRAUSS (1984): d = O(log N) preserves pairwise distances.
MAGEN AND ZOUZIAS (2008): d = O(log N) preserves volumes as well.
DPP PRESERVATION (GILLENWATER, KULESZA, AND TASKAR, EMNLP 2012)
vol^2 = det, approximately preserved for k = 1, 2, 3, ...
DPP PRESERVATION (GILLENWATER, KULESZA, AND TASKAR, EMNLP 2012)
d = O( max{ k/ε, (log(1/δ) + log N)/ε² + k } )
(k = subset size, N = total # of items)
With probability 1 − δ:  ||P_k − P̃_k||_1 ≤ e^{6kε} − 1
[Plot: L1 variational distance and memory use (bytes) vs. projection dimension]
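The projection itself is just multiplication by a scaled random Gaussian matrix. A sketch (made-up features and illustrative sizes, not the paper's data) showing that pairwise distances survive the dimensionality drop, in the Johnson-Lindenstrauss sense:

```python
import numpy as np

rng = np.random.default_rng(4)
D, N, d = 5000, 100, 50  # many features, few kept after projection

# Made-up feature vectors standing in for phi(i).
Phi = rng.normal(size=(D, N))

# Random Gaussian projection, scaled so squared norms are preserved
# in expectation.
G = rng.normal(size=(d, D)) / np.sqrt(d)
Proj = G @ Phi  # d x N

# Pairwise distances are approximately preserved.
for i, j in [(0, 1), (1, 2), (2, 3)]:
    orig = np.linalg.norm(Phi[:, i] - Phi[:, j])
    proj = np.linalg.norm(Proj[:, i] - Proj[:, j])
    print(round(proj / orig, 2))  # ratios typically near 1
```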
STRUCTURED SUMMARIZATION
NEWS THREADING
March 28: Health officials confirm Ebola outbreak in Guinea's capital
August 8: World Health Organization declares Ebola epidemic an international health emergency
September 2: GlaxoSmithKline begins Ebola vaccine drug trial
~10^360 possible threads; M ≈ 35,000
PROJECTING NEWS FEATURES
φ(i) with D = 36,356 features is projected to Gφ(i) with d = 50.
DPP THREADS
Jan 08 Jan 28 Feb 17 Mar 09 Mar 29 Apr 18 May 08 May 28 Jun 17
pope vatican church parkinson
israel palestinian iraqi israeli gaza abbas baghdad
owen nominees senate democrats judicial filibusters
social tax security democrats rove accounts
iraq iraqi killed baghdad arab marines deaths forces
Feb 24: Parkinson's Disease Increases Risks to Pope
Feb 26: Pope's Health Raises Questions About His Ability to Lead
Mar 13: Pope Returns Home After 18 Days at Hospital
Apr 01: Pope's Condition Worsens as World Prepares for End of Papacy
Apr 02: Pope, Though Gravely Ill, Utters Thanks for Prayers
Apr 18: Europeans Fast Falling Away from Church
Apr 20: In Developing World, Choice [of Pope] Met with Skepticism
May 18: Pope Sends Message with Choice of Name
QUANTITATIVE RESULTS
System       k-means   DTM      DPP
ROUGE-1F     16.5      14.7     17.2
R-SU4F       3.76      3.44     3.98
Coherence    2.73      3.2      3.3
Runtime (s)  626       19,434   252
OTHER POTENTIAL NLP APPLICATIONS
POTENTIAL APP: RE-RANKING
• Parser: simple model with local features defines basic scores for all possible parse trees
• Re-ranker: more complex model with non-local features provides more refined scores
• Typical pipeline: find the k highest-scoring parses under the simple model, then score these k with the more complex model and output the best
• Issue: the k parses may be largely redundant, so the re-ranker never gets to consider significantly different parses
IDEA: USE DPPS FOR SELECTING RE-RANKER INPUT
N = O({sentence length}^{sentence length})
We want to select a diverse set of parses.
Quality: standard parser scores
Diversity: edge lengths, POS pairs, etc.
POTENTIAL APP: WORD SENSE INDUCTION
• Goal: identify all possible senses of ambiguous words (e.g. river bank vs bank deposit)
• Typical approach: unsupervised clustering; cluster centers represent word senses
• Why DPPs fit: finding cluster centers can be re-expressed as finding a high-quality, diverse set
Quality: centrality (density of points nearby)
Diversity: same as standard WSI features
QUESTIONS?