Source: stanford.edu/~jgrimmer/text14/tc6.pdf
Text as Data
Justin Grimmer
Associate Professor, Department of Political Science
Stanford University
October 9th, 2014
The Vector Space Model of Text
1) Task: numerous tasks will suppose that we can measure document similarity or dissimilarity.

2) Objective function: for a variety of tasks, we will impose some measure or definition of similarity, dissimilarity, or distance.

- d(X_i, X_j) = dissimilarity (distance); bigger implies further apart
- s(X_i, X_j) = similarity; bigger implies closer together
- Objective functions determine which points we compare and how we aggregate similarity, dissimilarity, and distance.

3) Optimization: depends on the particular task; likely arranging/grouping objects to find similarity.

4) Validation: are the mathematical definitions of similarity actually similar for our particular purpose?
Texts and Geometry

Consider a document-term matrix:

X = \begin{pmatrix} 1 & 2 & 0 & \dots & 0 \\ 0 & 0 & 3 & \dots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & 0 & 0 & \dots & 3 \end{pmatrix}

Suppose documents live in a space. We then inherit a rich set of results from linear algebra:

- Provides a geometry, which we can modify with word weighting
- Natural notions of distance
- Kernel trick: richer comparisons of large feature spaces
- Building block for clustering, supervised learning, and scaling
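As a concrete sketch of building such a matrix: the tooling below is an assumption (the slides name none); scikit-learn's CountVectorizer turns a toy corpus into an N x J count matrix.

```python
# Minimal sketch of a document-term matrix; the corpus and the use of
# scikit-learn are illustrative assumptions, not from the slides.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat", "the dog sat on the cat", "dogs and cats"]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)            # sparse N x J count matrix
print(vectorizer.get_feature_names_out())     # the J vocabulary terms
print(X.toarray())                            # one row of counts per document
```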
Texts in Space

Doc_1 = (1, 1, 3, \dots, 5)
Doc_2 = (2, 0, 0, \dots, 1)
Doc_1, Doc_2 \in \mathbb{R}^J

Inner product between documents:

Doc_1 \cdot Doc_2 = (1, 1, 3, \dots, 5)'(2, 0, 0, \dots, 1)
= 1 \times 2 + 1 \times 0 + 3 \times 0 + \dots + 5 \times 1
= 7
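The same computation in numpy, assuming the elided "..." entries are simply dropped so the toy vectors are 4-dimensional:

```python
# Inner product of two word-count vectors; the 4-dimensional truncation
# of the slide's vectors is an assumption for illustration.
import numpy as np

doc1 = np.array([1, 1, 3, 5])
doc2 = np.array([2, 0, 0, 1])
print(doc1 @ doc2)   # 1*2 + 1*0 + 3*0 + 5*1 = 7
```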
Vector Length

[Figure: a right triangle in the (x_1, x_2) plane, with legs of length a and b and hypotenuse c = (a^2 + b^2)^{1/2}]

- Pythagorean theorem: given a side with length a,
- a side with length b, and a right angle between them,
- c = \sqrt{a^2 + b^2}
- This is true in general, in any number of dimensions
Vector (Euclidean) Length
Definition
Suppose v \in \mathbb{R}^J. Then, we will define its length as

||v|| = (v \cdot v)^{1/2} = (v_1^2 + v_2^2 + v_3^2 + \dots + v_J^2)^{1/2}
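A quick numerical check of the definition; numpy's built-in norm computes the same quantity as the inner-product formula:

```python
# Euclidean length as the square root of a vector's inner product
# with itself; np.linalg.norm is the library equivalent.
import numpy as np

v = np.array([1, 1, 3, 5])
print(np.sqrt(v @ v), np.linalg.norm(v))   # identical values: 6.0 6.0
```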
Measures of Dissimilarity

Initial guess: distance metrics. Properties of a metric (distance function) d(\cdot, \cdot); consider arbitrary documents X_i, X_j, X_k:

1) d(X_i, X_j) \geq 0
2) d(X_i, X_j) = 0 if and only if X_i = X_j
3) d(X_i, X_j) = d(X_j, X_i)
4) d(X_i, X_k) \leq d(X_i, X_j) + d(X_j, X_k)

Explore distance functions to compare documents. Do we want additional assumptions/properties?
Measuring the Distance Between Documents: Euclidean Distance

[Figure: two documents plotted in the (Word 1, Word 2) plane, both axes running from 0 to 4]
Measuring the Distance Between Documents

Definition
The Euclidean distance between documents X_i and X_j is

||X_i - X_j|| = \sqrt{ \sum_{m=1}^{J} (x_{im} - x_{jm})^2 }

Suppose X_i = (1, 4) and X_j = (2, 1). The distance between the documents is:

||(1, 4) - (2, 1)|| = \sqrt{(1 - 2)^2 + (4 - 1)^2} = \sqrt{10}
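The worked example, verified in numpy:

```python
# Euclidean distance between the slide's two toy documents.
import numpy as np

xi, xj = np.array([1, 4]), np.array([2, 1])
print(np.linalg.norm(xi - xj))   # sqrt(10) ≈ 3.1623
```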
Measuring the Distance Between Documents

There are many distance metrics. Consider the Minkowski family.

Definition
The Minkowski distance between documents X_i and X_j for value p is

d_p(X_i, X_j) = \left( \sum_{m=1}^{J} |x_{im} - x_{jm}|^p \right)^{1/p}
Members of the Minkowski Family

Manhattan metric:

d_1(X_i, X_j) = \sum_{m=1}^{J} |x_{im} - x_{jm}|

d_1((1, 4), (2, 1)) = |1 - 2| + |4 - 1| = 4

Minkowski (p) metric:

d_p(X_i, X_j) = \left( \sum_{m=1}^{J} |x_{im} - x_{jm}|^p \right)^{1/p}

d_p((1, 4), (2, 1)) = (|1 - 2|^p + |4 - 1|^p)^{1/p}
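A sketch of the family on the same pair of documents; scipy's minkowski helper is an assumed convenience (the sum-and-root formula is one line of numpy):

```python
# Minkowski distances for several p; p=1 is Manhattan, p=2 is Euclidean.
import numpy as np
from scipy.spatial.distance import minkowski

xi, xj = np.array([1, 4]), np.array([2, 1])
print(minkowski(xi, xj, p=1))   # 4.0
print(minkowski(xi, xj, p=2))   # sqrt(10) ≈ 3.1623
print(minkowski(xi, xj, p=4))   # (1 + 81)**0.25 ≈ 3.0092
```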
What Does p Do?

Increasing p gives greater importance to the coordinates with the largest differences. If we let p \to \infty, we obtain the maximum metric (Chebyshev metric):

\lim_{p \to \infty} d_p(X_i, X_j) = \max_{m=1}^{J} |x_{im} - x_{jm}|

In words: the distance between documents is just the biggest difference; all other differences do not contribute to the distance measure. Decreasing p gives greater importance to the coordinates with the smallest differences:

\lim_{p \to -\infty} d_p(X_i, X_j) = \min_{m=1}^{J} |x_{im} - x_{jm}|
Comparing the Metrics

Suppose X_i = (10, 4, 3), X_j = (0, 4, 3), and X_k = (0, 0, 0). Then:

d_1(X_i, X_j) = 10
d_1(X_i, X_k) = 10 + 4 + 3 = 17
d_2(X_i, X_j) = 10
d_2(X_i, X_k) = \sqrt{10^2 + 4^2 + 3^2} = \sqrt{125} = 11.18
d_4(X_i, X_j) = 10
d_4(X_i, X_k) = (10^4 + 4^4 + 3^4)^{1/4} = (10337)^{1/4} = 10.08
d_\infty(X_i, X_j) = 10
d_\infty(X_i, X_k) = 10
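These numbers can be reproduced directly from the Minkowski formula; the p \to \infty case is just the maximum coordinate difference:

```python
# Reproducing the slide's comparison of d_1, d_2, d_4, and d_inf.
import numpy as np

xi, xk = np.array([10, 4, 3]), np.array([0, 0, 0])
diff = np.abs(xi - xk)
for p in (1, 2, 4):
    print(p, np.sum(diff ** p) ** (1 / p))   # 17.0, 11.18, 10.08
print("inf", diff.max())                     # 10: only the biggest gap counts
```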
Are all differences equal?

The previous metrics treat all dimensions as equal. We may want to engage in some scaling/reweighting: the Mahalanobis distance.

Definition
Suppose that we have a covariance matrix \Sigma. Then we can define the Mahalanobis distance between documents X_i and X_j as

d_{Mah}(X_i, X_j) = \sqrt{ (X_i - X_j)' \Sigma^{-1} (X_i - X_j) }

More generally, \Sigma could be any symmetric, positive-definite matrix. What does \Sigma do?
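A sketch with scipy, which expects the inverse of the covariance matrix; the 2x2 Sigma below is a made-up example, not from the slides:

```python
# Mahalanobis distance between two toy documents under a hypothetical
# covariance matrix Sigma; scipy takes the inverse, VI = Sigma^{-1}.
import numpy as np
from scipy.spatial.distance import mahalanobis

Sigma = np.array([[1.0, 0.3], [0.3, 0.5]])   # hypothetical covariance
VI = np.linalg.inv(Sigma)
print(mahalanobis([1, 4], [2, 1], VI))
```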
Some Intuition: The Unit Circle

[Figure: the set of points at unit Mahalanobis distance from the origin in the (Dim 1, Dim 2) plane. \Sigma = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} gives a circle; \Sigma = \begin{pmatrix} 1 & 0 \\ 0 & 0.5 \end{pmatrix} and \Sigma = \begin{pmatrix} 0.5 & 0 \\ 0 & 1 \end{pmatrix} give axis-aligned ellipses; \Sigma = \begin{pmatrix} 1 & 0.3 \\ 0.3 & 0.5 \end{pmatrix} gives a tilted ellipse]
Measuring Distance with Mahalanobis

Special Case 1: Identity matrix

\Sigma = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}

Then the distance is Euclidean.

Special Case 2: Diagonal matrix

\Sigma = \begin{pmatrix} \sigma_1^2 & 0 & \dots & 0 \\ 0 & \sigma_2^2 & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & \sigma_J^2 \end{pmatrix}

Then

d_{Mah}(X_i, X_j) = \sqrt{ \sum_{m=1}^{J} \frac{(x_{im} - x_{jm})^2}{\sigma_m^2} }
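A numerical check of Special Case 2: with a diagonal \Sigma (hypothetical variances below), Mahalanobis distance equals Euclidean distance on variance-standardized coordinates:

```python
# Diagonal-Sigma Mahalanobis distance vs. standardize-then-Euclidean.
import numpy as np
from scipy.spatial.distance import mahalanobis

xi, xj = np.array([1.0, 4.0]), np.array([2.0, 1.0])
sig2 = np.array([1.0, 4.0])                     # hypothetical variances
VI = np.diag(1 / sig2)                          # inverse of diagonal Sigma
d_standardized = np.linalg.norm((xi - xj) / np.sqrt(sig2))
print(mahalanobis(xi, xj, VI), d_standardized)  # identical values
```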
Measuring Similarity

What properties should a similarity measure have?

- Maximum: a document with itself
- Minimum: documents that have no words in common (orthogonal)
- Increasing as more of the same words are used
- Symmetry(?): s(a, b) = s(b, a)

How should additional words be treated?
Measuring Similarity

[Figure: documents (2, 1) and (1, 4) plotted in the (Word 1, Word 2) plane]

Measure 1: inner product

(2, 1)' \cdot (1, 4) = 6
[Figure: the same plane with (4, 2) added; \theta is the angle between the vectors]

Problem(?): length dependent

(4, 2)'(1, 4) = 12

a \cdot b = ||a|| \times ||b|| \times \cos\theta
Cosine Similarity

\cos\theta = \left( \frac{a}{||a||} \right) \cdot \left( \frac{b}{||b||} \right)

(4, 2) / ||(4, 2)|| = (0.89, 0.45)
(2, 1) / ||(2, 1)|| = (0.89, 0.45)
(1, 4) / ||(1, 4)|| = (0.24, 0.97)

(0.89, 0.45)'(0.24, 0.97) = 0.65
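The slide's arithmetic as a small function:

```python
# Cosine similarity: the inner product of unit-length vectors.
import numpy as np

def cosine_sim(a, b):
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_sim(np.array([4, 2]), np.array([1, 4])))  # ≈ 0.65
print(cosine_sim(np.array([2, 1]), np.array([1, 4])))  # same: length drops out
```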
Cosine Similarity

[Figure: the document vectors projected to unit length, with angle \theta between them]

\cos\theta removes document length from the similarity measure: it projects texts to a unit-length representation on the sphere.
Von Mises-Fisher Distribution

Consider document X_i.

X_i^* = \frac{X_i}{||X_i||}

Then we might suppose:

X_i^* \sim \text{von Mises-Fisher}(\kappa, \mu)

p(x_i^* | \kappa, \mu) = c(\kappa) \exp(\kappa x_i^{*\prime} \mu)

A normal distribution, on a sphere:

- Straightforward to maximize
- Conjugate to itself
- Useful for clustering, hierarchies of topics
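A sketch of the normalization step that puts documents on the unit sphere, the support of the von Mises-Fisher distribution:

```python
# Row-normalizing a document-term matrix to unit length.
import numpy as np

X = np.array([[1.0, 1.0, 3.0, 5.0],
              [2.0, 0.0, 0.0, 1.0]])
X_star = X / np.linalg.norm(X, axis=1, keepdims=True)
print(np.linalg.norm(X_star, axis=1))   # every row now has length 1.0
```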
Kernel Similarity

Definition
Suppose we have documents X_i and X_j. Define the Gaussian kernel as

k(X_i, X_j) = \exp\left( \frac{-||X_i - X_j||^2}{\sigma^2} \right)

- Kernel of the Gaussian distribution
- \sigma^2 determines the sensitivity of the kernel
- If X_i = X_j, then k(X_i, X_j) = 1
- As X_i and X_j become more dissimilar, k(X_i, X_j) \to 0
- This result often justifies setting some kernel weights to zero
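A minimal implementation showing the two limiting behaviors:

```python
# Gaussian kernel similarity; sigma2 controls how quickly similarity decays.
import numpy as np

def gaussian_kernel(xi, xj, sigma2=1.0):
    return np.exp(-np.sum((xi - xj) ** 2) / sigma2)

xi = np.array([1.0, 4.0])
print(gaussian_kernel(xi, xi))                    # 1.0 for identical documents
print(gaussian_kernel(xi, np.array([2.0, 1.0])))  # → 0 as documents diverge
```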
The Kernel Trick

Suppose all of our documents X_i \in \mathbb{R}^J. There may be some mapping \phi: \mathbb{R}^J \to \mathbb{R}^M, where M > J, that improves our performance: a "lift" to a higher dimension. We might want, then,

s(\phi(X_i), \phi(X_j)) = \langle \phi(X_i), \phi(X_j) \rangle

- The only thing we care about, though, is the inner product of the transformed variables
- So long as we can calculate the inner product, we need not make the transformation explicit
- Kernels provide methods for capturing a wide array of transformations
- Kernel trick: calculate inner products on the untransformed data (e.g., with the Gaussian kernel), implicitly using a wide array of \phi's
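To make the trick concrete, here is the standard degree-2 polynomial kernel identity, a stand-in example not from the slides (the Gaussian kernel's implicit \phi is infinite-dimensional, so it cannot be written out this way): the inner product of the explicitly lifted vectors equals the squared inner product of the raw vectors.

```python
# Kernel trick demo: <phi(a), phi(b)> = (a . b)^2 for the degree-2
# polynomial kernel, so the lift phi never has to be computed.
import numpy as np

def phi(x):
    # explicit lift of a 2-vector into R^3
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

a, b = np.array([2.0, 1.0]), np.array([1.0, 4.0])
print(phi(a) @ phi(b))   # 36.0, via the explicit transformation
print((a @ b) ** 2)      # 36.0, via the kernel on untransformed data
```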
Weighting Words

Are all words created equal?

- Treat all words equally: lots of noise
- Reweight words: accentuate words that are likely to be informative; make specific assumptions about the characteristics of informative words

How to generate weights?

- Assumptions about separating words
- Use a training set to identify separating words (Monroe, ideology measurement)
Weighting Words: TF-IDF Weighting

What properties do words need to separate concepts?

- Used frequently
- But not too frequently

Example: if all statements about OBL contain "Bin Laden", then this word contributes nothing to the similarity/dissimilarity measures.

Inverse document frequency:

n_j = number of documents in which word j occurs

\text{idf}_j = \log \frac{N}{n_j}

\mathbf{idf} = (\text{idf}_1, \text{idf}_2, \dots, \text{idf}_J)
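A sketch of the idf computation on a toy count matrix, assuming natural log (the slides just write "log"):

```python
# Inverse document frequency from an N x J document-term matrix.
import numpy as np

X = np.array([[1, 2, 0],
              [0, 0, 3],
              [1, 0, 3]])
N = X.shape[0]
n_j = (X > 0).sum(axis=0)   # number of documents containing word j
idf = np.log(N / n_j)
print(idf)                  # widely used words get weight near 0
```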
Weighting Words: TF-IDF Weighting

Why log?

- Maximum at n_j = 1
- Decreases at rate 1/n_j \Rightarrow a diminishing "penalty" for more common use
- Other functional forms are fine; they embed assumptions about the penalization of common use
Weighting Words: TF-IDF

X_{i,\text{idf}} \equiv \underbrace{X_i}_{\text{tf}} \times \mathbf{idf} = (X_{i1} \times \text{idf}_1, X_{i2} \times \text{idf}_2, \dots, X_{iJ} \times \text{idf}_J)

X_{j,\text{idf}} \equiv X_j \times \mathbf{idf} = (X_{j1} \times \text{idf}_1, X_{j2} \times \text{idf}_2, \dots, X_{jJ} \times \text{idf}_J)

How does this matter for measuring similarity/dissimilarity? Inner product:

X_{i,\text{idf}} \cdot X_{j,\text{idf}} = (X_i \times \mathbf{idf})'(X_j \times \mathbf{idf}) = (\text{idf}_1^2 \times X_{i1} \times X_{j1}) + (\text{idf}_2^2 \times X_{i2} \times X_{j2}) + \dots + (\text{idf}_J^2 \times X_{iJ} \times X_{jJ})
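A short sketch of the weighted inner product (same invented toy matrix as above):

    import numpy as np

    X = np.array([[1, 2, 0, 1],
                  [0, 0, 3, 1],
                  [1, 0, 0, 1]])
    idf = np.log(X.shape[0] / (X > 0).sum(axis=0))

    X_idf = X * idf                          # tf-idf: column j scaled by idf_j
    s = X_idf[0] @ X_idf[1]                  # inner product in the weighted space
    s_check = np.sum(idf**2 * X[0] * X[1])   # sum_j idf_j^2 * X_1j * X_2j, as on the slide
    assert np.isclose(s, s_check)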
Weighting Words: Inner Product

Define the diagonal weight matrix

Σ = diag(idf_1^2, idf_2^2, ..., idf_J^2)

If we use tf-idf for our documents, then

d_2(X_i, X_j) = sqrt( sum_{m=1}^{J} (x_{im,idf} − x_{jm,idf})^2 )
              = sqrt( (X_i − X_j)' Σ (X_i − X_j) )

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 28 / 44
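The two expressions for the distance agree, as a quick sketch confirms (same invented data):

    import numpy as np

    X = np.array([[1, 2, 0, 1],
                  [0, 0, 3, 1],
                  [1, 0, 0, 1]])
    idf = np.log(X.shape[0] / (X > 0).sum(axis=0))
    X_idf = X * idf
    Sigma = np.diag(idf**2)       # the diagonal weight matrix from the slide

    diff = X[0] - X[1]
    d_direct = np.sqrt(np.sum((X_idf[0] - X_idf[1])**2))   # Euclidean in tf-idf space
    d_quad = np.sqrt(diff @ Sigma @ diff)                  # quadratic-form version
    assert np.isclose(d_direct, d_quad)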
Final Product

Applying some measure of distance (or similarity, if symmetric) yields:

D = [ 0        d(1,2)   d(1,3)   ...  d(1,N)
      d(2,1)   0        d(2,3)   ...  d(2,N)
      d(3,1)   d(3,2)   0        ...  d(3,N)
      ...      ...      ...      ...  ...
      d(N,1)   d(N,2)   d(N,3)   ...  0      ]

The lower triangle contains the unique information: N(N − 1)/2 entries.

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 29 / 44
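Building D for a small corpus is a few lines (a sketch; the helper name is mine):

    import numpy as np

    def distance_matrix(X_w):
        """Pairwise Euclidean distances between rows of a (weighted)
        document-term matrix: N x N, symmetric, zero diagonal."""
        diffs = X_w[:, None, :] - X_w[None, :, :]
        return np.sqrt((diffs ** 2).sum(axis=-1))

    X_w = np.array([[1., 2., 0.],
                    [0., 0., 3.],
                    [1., 0., 1.]])
    D = distance_matrix(X_w)
    print(np.tril(D))    # lower triangle: N(N - 1)/2 = 3 unique entries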
Spirling and Indian Treaties

Spirling (2013): models treaties between the US and Native Americans

Why?

- American political development
- IR theories of treaties and treaty violations
- Comparative studies of indigenous/colonialist interaction
- Political science question: how did Native Americans lose land so quickly?

The paper does a lot. We're going to focus on:

- Today: text representation and similarity calculation
- Tuesday: projecting to a low-dimensional space

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 30 / 44
Spirling and Indian Treaties

How do we preserve word order and semantic language?

After stemming, stopping, and bag-of-wording:

- Peace Between Us
- No Peace Between Us

are identical.

Spirling uses a more complicated representation of texts, one with broad application, to preserve word order: he analyzes the K-substrings of each text (e.g., of "Peace Between Us"), as sketched below.

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 31 / 44
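A minimal sketch of the substring idea, using contiguous word sequences (whether the paper's K-substrings are defined over words or characters, the point is the same: word order now matters):

    def k_substrings(text, k):
        """All contiguous length-k word sequences in a text."""
        words = text.lower().split()
        return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

    a = k_substrings("Peace Between Us", 2)
    b = k_substrings("No Peace Between Us", 2)
    print(a & b)   # shared: ('peace', 'between'), ('between', 'us')
    print(b - a)   # ('no', 'peace') -- the two texts are no longer identical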
Kernel Trick

- Kernel methods: represent texts and measure similarity simultaneously
- Compare only the substrings that appear in both documents (without explicitly quantifying entire documents)
- Problem solved:
  - Arthur gives all his money to Justin
  - Justin gives all his money to Arthur
  Discarding word order: same sentence. Kernel: different sentences.

Spirling uses kernel methods to measure similarity.

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 32 / 44
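A minimal sketch of a substring-style kernel (a simple spectrum kernel over word bigrams with cosine normalization -- my choices for illustration, not Spirling's exact kernel):

    from collections import Counter
    from math import sqrt

    def ngrams(text, k=2):
        w = text.lower().split()
        return Counter(tuple(w[i:i + k]) for i in range(len(w) - k + 1))

    def substring_kernel(s, t, k=2):
        """Similarity from shared substrings only: iterates over n-grams
        present in both texts, never over the full feature space."""
        a, b = ngrams(s, k), ngrams(t, k)
        dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
        norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
        return dot / norm

    s1 = "Arthur gives all his money to Justin"
    s2 = "Justin gives all his money to Arthur"
    print(substring_kernel(s1, s2))   # about 0.67: word order now separates them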
Similarity and Dissimilarity of Many Things

Throughout the course we'll measure similarity between documents. We'll also (implicitly) study the similarity of probability distributions, so we develop a measure of distribution dissimilarity.

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 33 / 44
Similarity of Probability Distributions

Definition
Suppose P is a continuous random variable with density p : R → R_+ and Q is a continuous random variable with density q : R → R_+. We can define the KL divergence between P and Q as

KL(P||Q) = ∫_{−∞}^{∞} p(x) log( p(x) / q(x) ) dx

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 34 / 44
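A numerical sketch of the definition (Riemann sum; the grid and example densities are my choices):

    import numpy as np

    def kl_divergence(p, q, xs):
        """KL(P||Q) approximated on an evenly spaced grid xs;
        assumes q(x) > 0 wherever p(x) > 0."""
        px, qx = p(xs), q(xs)
        mask = px > 0
        dx = xs[1] - xs[0]
        return np.sum(px[mask] * np.log(px[mask] / qx[mask])) * dx

    normal = lambda mu: (lambda x: np.exp(-(x - mu)**2 / 2) / np.sqrt(2 * np.pi))
    xs = np.linspace(-10, 10, 200_000)
    print(kl_divergence(normal(0), normal(1), xs))   # ~0.5 = (0 - 1)^2 / 2, the known closed form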
Assessing Similarity of Other Things
KL-divergence measures dissimilarity between two distributions.
Justin Grimmer (Stanford University) Text as Data October 9th, 2014 35 / 44
Consider a function: f(x) = −x². It maps numbers to other numbers.

[Figure: plot of f(x) = −x² for x ∈ [−4, 4]; axes x and f(x)]

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 36 / 44
Take some input (−2 here).

[Figure: the same plot, with the input x = −2 marked]

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 37 / 44
Then obtain the value of f(−2).

[Figure: the same plot, tracing from x = −2 to the curve]

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 38 / 44
Then obtain the value of f(−2) = −4.

[Figure: the same plot, with the point (−2, −4) marked]

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 39 / 44
KL(q||p) is a functional. A functional takes functions as inputs and returns a real number.

KL(q||p) maps from sets of distributions q ∈ Q and p ∈ P to the positive real numbers.

For example, we could set q = Uniform(0, 1) and p = Normal(0, 1):

KL(Uniform(0,1) || Normal(0,1)) = 1.09

[Figure: densities of Uniform(0,1) and Normal(0,1) on the same axes]

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 40 / 44
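The 1.09 can be checked in closed form: on (0, 1) the uniform density is 1, and with φ the standard normal density,

    KL = ∫_0^1 log( 1/φ(x) ) dx = ∫_0^1 [ (1/2) log(2π) + x²/2 ] dx = (1/2) log(2π) + 1/6 ≈ 0.919 + 0.167 ≈ 1.09.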
If q and p are the same distribution, then KL(q||p) = 0.

Variational approximation (topic models!): approximate one distribution p with another, simpler distribution q. Then make this approximation the best possible: minimize the KL divergence.

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 41 / 44
A simple example.

Approximate a Normal(0,1) with a symmetric uniform distribution, Uniform(−b, b). Choose b to minimize KL(Uniform(−b, b) || Normal(0,1)).

[Figure: standard normal density with a candidate uniform approximation]

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 42 / 44
Answer: b = √3 (a quick check appears below).

[Figure: standard normal density with the best uniform approximation, Uniform(−√3, √3)]

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 43 / 44
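A quick check of the answer (the closed-form objective follows from the KL definition; the grid search is my choice to keep dependencies minimal):

    import numpy as np

    # KL(Uniform(-b, b) || Normal(0, 1)) in closed form:
    # integrating (1/2b) log[ (1/2b) / phi(x) ] over (-b, b) gives
    #   -log(2b) + (1/2) log(2*pi) + b^2 / 6
    def kl_uniform_normal(b):
        return -np.log(2 * b) + 0.5 * np.log(2 * np.pi) + b**2 / 6

    bs = np.linspace(0.1, 5, 100_000)
    print(bs[np.argmin(kl_uniform_normal(bs))])   # ~1.732 = sqrt(3)

Setting the derivative −1/b + b/3 to zero gives b² = 3 directly.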
1) Documents in a vector space ⇒ a geometry of texts

2) Many methods to measure similarity and dissimilarity
Justin Grimmer (Stanford University) Text as Data October 9th, 2014 44 / 44