text as datastanford.edu/~jgrimmer/text14/tc6.pdf · suppose documents live in aspace rich set of...

221
Text as Data Justin Grimmer Associate Professor Department of Political Science Stanford University October 9th, 2014 Justin Grimmer (Stanford University) Text as Data October 9th, 2014 1 / 44

Upload: others

Post on 03-Jul-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Text as Data

Justin Grimmer

Associate ProfessorDepartment of Political Science

Stanford University

October 9th, 2014

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 1 / 44

Page 2: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

The Vector Space Model of Text

1) Task:- Numerous tasks will suppose that we can measure document similarity

or dissimiliarity

2) Objective Function- For a variety of tasks, will impose some measure or definition of

similarity, dissimilarity, or distance.

d(X i ,X j) = Dissimilarity(Distance) Bigger implies further apart

s(X i ,X j) = Similarity Bigger implies closer together

- Objective functions determine which points we compare andaggregate similarity, dissimilarity, and distance

3) Optimization- Depends on the particular task, likely arranging/grouping objects to

find similarity

4) Validation- Are the mathematical definitions of similarity actually similar for our

particular purpose?

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 2 / 44

Page 3: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Texts and Geometry

Consider a document-term matrix

X =

1 2 0 . . . 00 0 3 . . . 0...

......

. . ....

1 0 0 . . . 3

Suppose documents live in a space rich set of results from linearalgebra

- Provides a geometry modify with word weighting

- Natural notions of distance

- Kernel Trick: richer comparisons of large feature spaces

- Building block for clustering, supervised learning, and scaling

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 3 / 44

Page 4: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Texts and Geometry

Consider a document-term matrix

X =

1 2 0 . . . 00 0 3 . . . 0...

......

. . ....

1 0 0 . . . 3

Suppose documents live in a space

rich set of results from linearalgebra

- Provides a geometry modify with word weighting

- Natural notions of distance

- Kernel Trick: richer comparisons of large feature spaces

- Building block for clustering, supervised learning, and scaling

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 3 / 44

Page 5: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Texts and Geometry

Consider a document-term matrix

X =

1 2 0 . . . 00 0 3 . . . 0...

......

. . ....

1 0 0 . . . 3

Suppose documents live in a space rich set of results from linearalgebra

- Provides a geometry modify with word weighting

- Natural notions of distance

- Kernel Trick: richer comparisons of large feature spaces

- Building block for clustering, supervised learning, and scaling

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 3 / 44

Page 6: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Texts and Geometry

Consider a document-term matrix

X =

1 2 0 . . . 00 0 3 . . . 0...

......

. . ....

1 0 0 . . . 3

Suppose documents live in a space rich set of results from linearalgebra

- Provides a geometry

modify with word weighting

- Natural notions of distance

- Kernel Trick: richer comparisons of large feature spaces

- Building block for clustering, supervised learning, and scaling

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 3 / 44

Page 7: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Texts and Geometry

Consider a document-term matrix

X =

1 2 0 . . . 00 0 3 . . . 0...

......

. . ....

1 0 0 . . . 3

Suppose documents live in a space rich set of results from linearalgebra

- Provides a geometry modify with word weighting

- Natural notions of distance

- Kernel Trick: richer comparisons of large feature spaces

- Building block for clustering, supervised learning, and scaling

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 3 / 44

Page 8: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Texts and Geometry

Consider a document-term matrix

X =

1 2 0 . . . 00 0 3 . . . 0...

......

. . ....

1 0 0 . . . 3

Suppose documents live in a space rich set of results from linearalgebra

- Provides a geometry modify with word weighting

- Natural notions of distance

- Kernel Trick: richer comparisons of large feature spaces

- Building block for clustering, supervised learning, and scaling

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 3 / 44

Page 9: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Texts and Geometry

Consider a document-term matrix

X =

1 2 0 . . . 00 0 3 . . . 0...

......

. . ....

1 0 0 . . . 3

Suppose documents live in a space rich set of results from linearalgebra

- Provides a geometry modify with word weighting

- Natural notions of distance

- Kernel Trick: richer comparisons of large feature spaces

- Building block for clustering, supervised learning, and scaling

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 3 / 44

Page 10: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Texts and Geometry

Consider a document-term matrix

X =

1 2 0 . . . 00 0 3 . . . 0...

......

. . ....

1 0 0 . . . 3

Suppose documents live in a space rich set of results from linearalgebra

- Provides a geometry modify with word weighting

- Natural notions of distance

- Kernel Trick: richer comparisons of large feature spaces

- Building block for clustering, supervised learning, and scaling

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 3 / 44

Page 11: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Texts in Space

Doc1 = (1, 1, 3, . . . , 5)

Doc2 = (2, 0, 0, . . . , 1)

Doc1,Doc2 ∈ <J

Inner Product between documents:

Doc1 ·Doc2 = (1, 1, 3, . . . , 5)′(2, 0, 0, . . . , 1)

= 1× 2 + 1× 0 + 3× 0 + . . .+ 5× 1

= 7

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 4 / 44

Page 12: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Texts in Space

Doc1 = (1, 1, 3, . . . , 5)

Doc2 = (2, 0, 0, . . . , 1)

Doc1,Doc2 ∈ <J

Inner Product between documents:

Doc1 ·Doc2 = (1, 1, 3, . . . , 5)′(2, 0, 0, . . . , 1)

= 1× 2 + 1× 0 + 3× 0 + . . .+ 5× 1

= 7

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 4 / 44

Page 13: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Texts in Space

Doc1 = (1, 1, 3, . . . , 5)

Doc2 = (2, 0, 0, . . . , 1)

Doc1,Doc2 ∈ <J

Inner Product between documents:

Doc1 ·Doc2 = (1, 1, 3, . . . , 5)′(2, 0, 0, . . . , 1)

= 1× 2 + 1× 0 + 3× 0 + . . .+ 5× 1

= 7

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 4 / 44

Page 14: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Texts in Space

Doc1 = (1, 1, 3, . . . , 5)

Doc2 = (2, 0, 0, . . . , 1)

Doc1,Doc2 ∈ <J

Inner Product between documents:

Doc1 ·Doc2 = (1, 1, 3, . . . , 5)′(2, 0, 0, . . . , 1)

= 1× 2 + 1× 0 + 3× 0 + . . .+ 5× 1

= 7

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 4 / 44

Page 15: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Texts in Space

Doc1 = (1, 1, 3, . . . , 5)

Doc2 = (2, 0, 0, . . . , 1)

Doc1,Doc2 ∈ <J

Inner Product between documents:

Doc1 ·Doc2 = (1, 1, 3, . . . , 5)′(2, 0, 0, . . . , 1)

= 1× 2 + 1× 0 + 3× 0 + . . .+ 5× 1

= 7

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 4 / 44

Page 16: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Texts in Space

Doc1 = (1, 1, 3, . . . , 5)

Doc2 = (2, 0, 0, . . . , 1)

Doc1,Doc2 ∈ <J

Inner Product between documents:

Doc1 ·Doc2 = (1, 1, 3, . . . , 5)′(2, 0, 0, . . . , 1)

= 1× 2 + 1× 0 + 3× 0 + . . .+ 5× 1

= 7

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 4 / 44

Page 17: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Texts in Space

Doc1 = (1, 1, 3, . . . , 5)

Doc2 = (2, 0, 0, . . . , 1)

Doc1,Doc2 ∈ <J

Inner Product between documents:

Doc1 ·Doc2 = (1, 1, 3, . . . , 5)′(2, 0, 0, . . . , 1)

= 1× 2 + 1× 0 + 3× 0 + . . .+ 5× 1

= 7

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 4 / 44

Page 18: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Texts in Space

Doc1 = (1, 1, 3, . . . , 5)

Doc2 = (2, 0, 0, . . . , 1)

Doc1,Doc2 ∈ <J

Inner Product between documents:

Doc1 ·Doc2 = (1, 1, 3, . . . , 5)′(2, 0, 0, . . . , 1)

= 1× 2 + 1× 0 + 3× 0 + . . .+ 5× 1

= 7

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 4 / 44

Page 19: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Vector Length

x_1

x_2

- Pythogorean Theorem:Side with length a

- Side with length b andright triangle

- c =√a2 + b2

- This is generally true

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 5 / 44

Page 20: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Vector Length

x_1

x_2

a

- Pythogorean Theorem:Side with length a

- Side with length b andright triangle

- c =√a2 + b2

- This is generally true

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 5 / 44

Page 21: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Vector Length

x_1

x_2

a

b

- Pythogorean Theorem:Side with length a

- Side with length b andright triangle

- c =√a2 + b2

- This is generally true

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 5 / 44

Page 22: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Vector Length

x_1

x_2

a

bc =

(a^2 + b^2)^(1/2)

- Pythogorean Theorem:Side with length a

- Side with length b andright triangle

- c =√a2 + b2

- This is generally true

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 5 / 44

Page 23: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Vector Length

x_1

x_2

a

bc =

(a^2 + b^2)^(1/2)

- Pythogorean Theorem:Side with length a

- Side with length b andright triangle

- c =√a2 + b2

- This is generally true

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 5 / 44

Page 24: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Vector (Euclidean) Length

Definition

Suppose v ∈ <J . Then, we will define its length as

||v || = (v · v)1/2

= (v21 + v22 + v23 + . . .+ v2J )1/2

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 6 / 44

Page 25: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Measures of Dissimilarity

Initial guess Distance metricsProperties of a metric: (distance function) d(·, ·). Consider arbitrarydocuments X i , X j , X k

1) d(X i ,X j) ≥ 0

2) d(X i ,X j) = 0 if and only if X i = X j

3) d(X i ,X j) = d(X j ,X i )

4) d(X i ,X k) ≤ d(X i ,X j) + d(X j ,X k)

Explore distance functions to compare documents

Do we want additionalassumptions/properties?

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 7 / 44

Page 26: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Measures of Dissimilarity

Initial guess Distance metricsProperties of a metric: (distance function) d(·, ·). Consider arbitrarydocuments X i , X j , X k

1) d(X i ,X j) ≥ 0

2) d(X i ,X j) = 0 if and only if X i = X j

3) d(X i ,X j) = d(X j ,X i )

4) d(X i ,X k) ≤ d(X i ,X j) + d(X j ,X k)

Explore distance functions to compare documents

Do we want additionalassumptions/properties?

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 7 / 44

Page 27: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Measures of Dissimilarity

Initial guess Distance metricsProperties of a metric: (distance function) d(·, ·). Consider arbitrarydocuments X i , X j , X k

1) d(X i ,X j) ≥ 0

2) d(X i ,X j) = 0 if and only if X i = X j

3) d(X i ,X j) = d(X j ,X i )

4) d(X i ,X k) ≤ d(X i ,X j) + d(X j ,X k)

Explore distance functions to compare documents

Do we want additionalassumptions/properties?

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 7 / 44

Page 28: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Measures of Dissimilarity

Initial guess Distance metricsProperties of a metric: (distance function) d(·, ·). Consider arbitrarydocuments X i , X j , X k

1) d(X i ,X j) ≥ 0

2) d(X i ,X j) = 0 if and only if X i = X j

3) d(X i ,X j) = d(X j ,X i )

4) d(X i ,X k) ≤ d(X i ,X j) + d(X j ,X k)

Explore distance functions to compare documents

Do we want additionalassumptions/properties?

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 7 / 44

Page 29: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Measures of Dissimilarity

Initial guess Distance metricsProperties of a metric: (distance function) d(·, ·). Consider arbitrarydocuments X i , X j , X k

1) d(X i ,X j) ≥ 0

2) d(X i ,X j) = 0 if and only if X i = X j

3) d(X i ,X j) = d(X j ,X i )

4) d(X i ,X k) ≤ d(X i ,X j) + d(X j ,X k)

Explore distance functions to compare documents

Do we want additionalassumptions/properties?

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 7 / 44

Page 30: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Measures of Dissimilarity

Initial guess Distance metricsProperties of a metric: (distance function) d(·, ·). Consider arbitrarydocuments X i , X j , X k

1) d(X i ,X j) ≥ 0

2) d(X i ,X j) = 0 if and only if X i = X j

3) d(X i ,X j) = d(X j ,X i )

4) d(X i ,X k) ≤ d(X i ,X j) + d(X j ,X k)

Explore distance functions to compare documents

Do we want additionalassumptions/properties?

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 7 / 44

Page 31: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Measures of Dissimilarity

Initial guess Distance metricsProperties of a metric: (distance function) d(·, ·). Consider arbitrarydocuments X i , X j , X k

1) d(X i ,X j) ≥ 0

2) d(X i ,X j) = 0 if and only if X i = X j

3) d(X i ,X j) = d(X j ,X i )

4) d(X i ,X k) ≤ d(X i ,X j) + d(X j ,X k)

Explore distance functions to compare documents Do we want additionalassumptions/properties?

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 7 / 44

Page 32: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Measuring the Distance Between DocumentsEuclidean Distance

0 1 2 3 4

01

23

4

Word 1

Wor

d 2

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 8 / 44

Page 33: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Measuring the Distance Between DocumentsEuclidean Distance

0 1 2 3 4

01

23

4

Word 1

Wor

d 2

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 8 / 44

Page 34: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Measuring the Distance Between DocumentsEuclidean Distance

0 1 2 3 4

01

23

4

Word 1

Wor

d 2

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 8 / 44

Page 35: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Measuring the Distance Between Documents

Definition

The Euclidean distance between documents X i and X j as

||X i − X j || =

√√√√ J∑m=1

(xim − xjm)2

Suppose X i = (1, 4) and X j = (2, 1). The distance between thedocuments is:

||(1, 4)− (2, 1)|| =√

(1− 2)2 + (4− 1)2

=√

10

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 9 / 44

Page 36: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Measuring the Distance Between Documents

Definition

The Euclidean distance between documents X i and X j as

||X i − X j || =

√√√√ J∑m=1

(xim − xjm)2

Suppose X i = (1, 4) and X j = (2, 1). The distance between thedocuments is:

||(1, 4)− (2, 1)|| =√

(1− 2)2 + (4− 1)2

=√

10

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 9 / 44

Page 37: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Measuring the Distance Between Documents

Many distance metrics

Consider the Minkowski family

Definition

The Minkowski Distance between documents X i and X j for value p is

dp(Xi ,Xj) =

(J∑

m=1

|xim − xjm|p)1/p

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 10 / 44

Page 38: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Measuring the Distance Between Documents

Many distance metrics Consider the Minkowski family

Definition

The Minkowski Distance between documents X i and X j for value p is

dp(Xi ,Xj) =

(J∑

m=1

|xim − xjm|p)1/p

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 10 / 44

Page 39: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Measuring the Distance Between Documents

Many distance metrics Consider the Minkowski family

Definition

The Minkowski Distance between documents X i and X j for value p is

dp(Xi ,Xj) =

(J∑

m=1

|xim − xjm|p)1/p

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 10 / 44

Page 40: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Members of the Minkowski Family

Manhattan metric

d1(Xi ,Xj) =J∑

m=1

|xim − xjm|

d1((1, 4), (2, 1)) = |1|+ |3| = 4

Minkowski (p) metric

dp(Xi ,Xj) =

(J∑

m=1

|xim − xjm|p)1/p

dp((1, 4), (2, 1)) = (|1− 2|p + |4− 1|p)1/p

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 11 / 44

Page 41: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Members of the Minkowski Family

Manhattan metric

d1(Xi ,Xj) =J∑

m=1

|xim − xjm|

d1((1, 4), (2, 1)) = |1|+ |3| = 4

Minkowski (p) metric

dp(Xi ,Xj) =

(J∑

m=1

|xim − xjm|p)1/p

dp((1, 4), (2, 1)) = (|1− 2|p + |4− 1|p)1/p

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 11 / 44

Page 42: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Members of the Minkowski Family

Manhattan metric

d1(Xi ,Xj) =J∑

m=1

|xim − xjm|

d1((1, 4), (2, 1)) = |1|+ |3| = 4

Minkowski (p) metric

dp(Xi ,Xj) =

(J∑

m=1

|xim − xjm|p)1/p

dp((1, 4), (2, 1)) = (|1− 2|p + |4− 1|p)1/p

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 11 / 44

Page 43: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Members of the Minkowski Family

Manhattan metric

d1(Xi ,Xj) =J∑

m=1

|xim − xjm|

d1((1, 4), (2, 1)) = |1|+ |3| = 4

Minkowski (p) metric

dp(Xi ,Xj) =

(J∑

m=1

|xim − xjm|p)1/p

dp((1, 4), (2, 1)) = (|1− 2|p + |4− 1|p)1/p

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 11 / 44

Page 44: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Members of the Minkowski Family

Manhattan metric

d1(Xi ,Xj) =J∑

m=1

|xim − xjm|

d1((1, 4), (2, 1)) = |1|+ |3| = 4

Minkowski (p) metric

dp(Xi ,Xj) =

(J∑

m=1

|xim − xjm|p)1/p

dp((1, 4), (2, 1)) = (|1− 2|p + |4− 1|p)1/p

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 11 / 44

Page 45: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Members of the Minkowski Family

Manhattan metric

d1(Xi ,Xj) =J∑

m=1

|xim − xjm|

d1((1, 4), (2, 1)) = |1|+ |3| = 4

Minkowski (p) metric

dp(Xi ,Xj) =

(J∑

m=1

|xim − xjm|p)1/p

dp((1, 4), (2, 1)) = (|1− 2|p + |4− 1|p)1/p

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 11 / 44

Page 46: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

What Does p Do?

Increasing p greater importance of coordinates with largest differencesIf we let p →∞ Obtain maximum-metric (Chebyshev’s Metric)

limp→∞

dp(Xi ,Xj) =J

maxm=1|xim − xjm|

In words: distance between documents only the biggest differenceAll other differences do not contribute to distance measureDecreasing p greater importance of coordinates with smallestdifferences

limp→−∞

dp(Xi ,Xj) =J

minm=1|xim − xjm|

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 12 / 44

Page 47: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

What Does p Do?

Increasing p greater importance of coordinates with largest differences

If we let p →∞ Obtain maximum-metric (Chebyshev’s Metric)

limp→∞

dp(Xi ,Xj) =J

maxm=1|xim − xjm|

In words: distance between documents only the biggest differenceAll other differences do not contribute to distance measureDecreasing p greater importance of coordinates with smallestdifferences

limp→−∞

dp(Xi ,Xj) =J

minm=1|xim − xjm|

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 12 / 44

Page 48: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

What Does p Do?

Increasing p greater importance of coordinates with largest differencesIf we let p →∞ Obtain maximum-metric (Chebyshev’s Metric)

limp→∞

dp(Xi ,Xj) =J

maxm=1|xim − xjm|

In words: distance between documents only the biggest differenceAll other differences do not contribute to distance measureDecreasing p greater importance of coordinates with smallestdifferences

limp→−∞

dp(Xi ,Xj) =J

minm=1|xim − xjm|

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 12 / 44

Page 49: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

What Does p Do?

Increasing p greater importance of coordinates with largest differencesIf we let p →∞ Obtain maximum-metric (Chebyshev’s Metric)

limp→∞

dp(Xi ,Xj) =J

maxm=1|xim − xjm|

In words: distance between documents only the biggest differenceAll other differences do not contribute to distance measureDecreasing p greater importance of coordinates with smallestdifferences

limp→−∞

dp(Xi ,Xj) =J

minm=1|xim − xjm|

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 12 / 44

Page 50: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

What Does p Do?

Increasing p greater importance of coordinates with largest differencesIf we let p →∞ Obtain maximum-metric (Chebyshev’s Metric)

limp→∞

dp(Xi ,Xj) =J

maxm=1|xim − xjm|

In words: distance between documents only the biggest difference

All other differences do not contribute to distance measureDecreasing p greater importance of coordinates with smallestdifferences

limp→−∞

dp(Xi ,Xj) =J

minm=1|xim − xjm|

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 12 / 44

Page 51: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

What Does p Do?

Increasing p greater importance of coordinates with largest differencesIf we let p →∞ Obtain maximum-metric (Chebyshev’s Metric)

limp→∞

dp(Xi ,Xj) =J

maxm=1|xim − xjm|

In words: distance between documents only the biggest differenceAll other differences do not contribute to distance measure

Decreasing p greater importance of coordinates with smallestdifferences

limp→−∞

dp(Xi ,Xj) =J

minm=1|xim − xjm|

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 12 / 44

Page 52: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

What Does p Do?

Increasing p greater importance of coordinates with largest differencesIf we let p →∞ Obtain maximum-metric (Chebyshev’s Metric)

limp→∞

dp(Xi ,Xj) =J

maxm=1|xim − xjm|

In words: distance between documents only the biggest differenceAll other differences do not contribute to distance measureDecreasing p greater importance of coordinates with smallestdifferences

limp→−∞

dp(Xi ,Xj) =J

minm=1|xim − xjm|

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 12 / 44

Page 53: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

What Does p Do?

Increasing p greater importance of coordinates with largest differencesIf we let p →∞ Obtain maximum-metric (Chebyshev’s Metric)

limp→∞

dp(Xi ,Xj) =J

maxm=1|xim − xjm|

In words: distance between documents only the biggest differenceAll other differences do not contribute to distance measureDecreasing p greater importance of coordinates with smallestdifferences

limp→−∞

dp(Xi ,Xj) =J

minm=1|xim − xjm|

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 12 / 44

Page 54: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Comparing the Metrics

Suppose X i = (10, 4, 3), X j = (0, 4, 3), and X k = (0, 0, 0)

Then:

d1(X i ,X j) = 10

d1(X i ,X k) = 10 + 4 + 3 = 17

d2(X i ,X j) = 10

d2(X i ,X k) =√

102 + 42 + 32 =√

125 = 11.18

d4(X i ,X j) = 10

d4(X i ,X k) =√

104 + 44 + 34 = (10337)1/4 = 10.08

d∞(X i ,X j) = 10

d∞(X i ,X k) = 10

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 13 / 44

Page 55: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Comparing the Metrics

Suppose X i = (10, 4, 3), X j = (0, 4, 3), and X k = (0, 0, 0)Then:

d1(X i ,X j) = 10

d1(X i ,X k) = 10 + 4 + 3 = 17

d2(X i ,X j) = 10

d2(X i ,X k) =√

102 + 42 + 32 =√

125 = 11.18

d4(X i ,X j) = 10

d4(X i ,X k) =√

104 + 44 + 34 = (10337)1/4 = 10.08

d∞(X i ,X j) = 10

d∞(X i ,X k) = 10

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 13 / 44

Page 56: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Comparing the Metrics

Suppose X i = (10, 4, 3), X j = (0, 4, 3), and X k = (0, 0, 0)Then:

d1(X i ,X j) = 10

d1(X i ,X k) = 10 + 4 + 3 = 17

d2(X i ,X j) = 10

d2(X i ,X k) =√

102 + 42 + 32 =√

125 = 11.18

d4(X i ,X j) = 10

d4(X i ,X k) =√

104 + 44 + 34 = (10337)1/4 = 10.08

d∞(X i ,X j) = 10

d∞(X i ,X k) = 10

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 13 / 44

Page 57: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Comparing the Metrics

Suppose X i = (10, 4, 3), X j = (0, 4, 3), and X k = (0, 0, 0)Then:

d1(X i ,X j) = 10

d1(X i ,X k) = 10 + 4 + 3 = 17

d2(X i ,X j) = 10

d2(X i ,X k) =√

102 + 42 + 32 =√

125 = 11.18

d4(X i ,X j) = 10

d4(X i ,X k) =√

104 + 44 + 34 = (10337)1/4 = 10.08

d∞(X i ,X j) = 10

d∞(X i ,X k) = 10

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 13 / 44

Page 58: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Comparing the Metrics

Suppose X i = (10, 4, 3), X j = (0, 4, 3), and X k = (0, 0, 0)Then:

d1(X i ,X j) = 10

d1(X i ,X k) = 10 + 4 + 3 = 17

d2(X i ,X j) = 10

d2(X i ,X k) =√

102 + 42 + 32 =√

125 = 11.18

d4(X i ,X j) = 10

d4(X i ,X k) =√

104 + 44 + 34 = (10337)1/4 = 10.08

d∞(X i ,X j) = 10

d∞(X i ,X k) = 10

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 13 / 44

Page 59: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Comparing the Metrics

Suppose X i = (10, 4, 3), X j = (0, 4, 3), and X k = (0, 0, 0)Then:

d1(X i ,X j) = 10

d1(X i ,X k) = 10 + 4 + 3 = 17

d2(X i ,X j) = 10

d2(X i ,X k) =√

102 + 42 + 32 =√

125 = 11.18

d4(X i ,X j) = 10

d4(X i ,X k) =√

104 + 44 + 34 = (10337)1/4 = 10.08

d∞(X i ,X j) = 10

d∞(X i ,X k) = 10

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 13 / 44

Page 60: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Comparing the Metrics

Suppose X i = (10, 4, 3), X j = (0, 4, 3), and X k = (0, 0, 0)Then:

d1(X i ,X j) = 10

d1(X i ,X k) = 10 + 4 + 3 = 17

d2(X i ,X j) = 10

d2(X i ,X k) =√

102 + 42 + 32 =√

125 = 11.18

d4(X i ,X j) = 10

d4(X i ,X k) =√

104 + 44 + 34 = (10337)1/4 = 10.08

d∞(X i ,X j) = 10

d∞(X i ,X k) = 10

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 13 / 44

Page 61: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Comparing the Metrics

Suppose X i = (10, 4, 3), X j = (0, 4, 3), and X k = (0, 0, 0)Then:

d1(X i ,X j) = 10

d1(X i ,X k) = 10 + 4 + 3 = 17

d2(X i ,X j) = 10

d2(X i ,X k) =√

102 + 42 + 32 =√

125 = 11.18

d4(X i ,X j) = 10

d4(X i ,X k) =√

104 + 44 + 34 = (10337)1/4 = 10.08

d∞(X i ,X j) = 10

d∞(X i ,X k) = 10

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 13 / 44

Page 62: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Comparing the Metrics

Suppose X i = (10, 4, 3), X j = (0, 4, 3), and X k = (0, 0, 0)Then:

d1(X i ,X j) = 10

d1(X i ,X k) = 10 + 4 + 3 = 17

d2(X i ,X j) = 10

d2(X i ,X k) =√

102 + 42 + 32 =√

125 = 11.18

d4(X i ,X j) = 10

d4(X i ,X k) =√

104 + 44 + 34 = (10337)1/4 = 10.08

d∞(X i ,X j) = 10

d∞(X i ,X k) = 10

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 13 / 44

Page 63: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Comparing the Metrics

Suppose X i = (10, 4, 3), X j = (0, 4, 3), and X k = (0, 0, 0)Then:

d1(X i ,X j) = 10

d1(X i ,X k) = 10 + 4 + 3 = 17

d2(X i ,X j) = 10

d2(X i ,X k) =√

102 + 42 + 32 =√

125 = 11.18

d4(X i ,X j) = 10

d4(X i ,X k) =√

104 + 44 + 34 = (10337)1/4 = 10.08

d∞(X i ,X j) = 10

d∞(X i ,X k) = 10

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 13 / 44

Page 64: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Are all differences equal?

Previous metrics treat all dimensions as equal

We may want to engage in some scaling/reweightingMahalanobis Distance

Definition

Suppose that we have a covariance matrix Σ. Then we can define theMahalanobis Distance between documents X i and X j as

,

dMah(X i ,X j) =√

(X i − X j)′Σ−1(X i − X j)

More generally: Σ could be symmetric and positive-definiteWhat does Σ do?

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 14 / 44

Page 65: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Are all differences equal?

Previous metrics treat all dimensions as equalWe may want to engage in some scaling/reweighting

Mahalanobis Distance

Definition

Suppose that we have a covariance matrix Σ. Then we can define theMahalanobis Distance between documents X i and X j as

,

dMah(X i ,X j) =√

(X i − X j)′Σ−1(X i − X j)

More generally: Σ could be symmetric and positive-definiteWhat does Σ do?

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 14 / 44

Page 66: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Are all differences equal?

Previous metrics treat all dimensions as equalWe may want to engage in some scaling/reweightingMahalanobis Distance

Definition

Suppose that we have a covariance matrix Σ. Then we can define theMahalanobis Distance between documents X i and X j as

,

dMah(X i ,X j) =√

(X i − X j)′Σ−1(X i − X j)

More generally: Σ could be symmetric and positive-definiteWhat does Σ do?

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 14 / 44

Page 67: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Are all differences equal?

Previous metrics treat all dimensions as equalWe may want to engage in some scaling/reweightingMahalanobis Distance

Definition

Suppose that we have a covariance matrix Σ

. Then we can define theMahalanobis Distance between documents X i and X j as

,

dMah(X i ,X j) =√

(X i − X j)′Σ−1(X i − X j)

More generally: Σ could be symmetric and positive-definiteWhat does Σ do?

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 14 / 44

Page 68: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Are all differences equal?

Previous metrics treat all dimensions as equalWe may want to engage in some scaling/reweightingMahalanobis Distance

Definition

Suppose that we have a covariance matrix Σ. Then we can define theMahalanobis Distance between documents X i and X j as ,

dMah(X i ,X j) =√

(X i − X j)′Σ−1(X i − X j)

More generally: Σ could be symmetric and positive-definiteWhat does Σ do?

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 14 / 44

Page 69: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Are all differences equal?

Previous metrics treat all dimensions as equalWe may want to engage in some scaling/reweightingMahalanobis Distance

Definition

Suppose that we have a covariance matrix Σ. Then we can define theMahalanobis Distance between documents X i and X j as ,

dMah(X i ,X j) =√

(X i − X j)′Σ−1(X i − X j)

More generally: Σ could be symmetric and positive-definiteWhat does Σ do?

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 14 / 44

Page 70: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Are all differences equal?

Previous metrics treat all dimensions as equalWe may want to engage in some scaling/reweightingMahalanobis Distance

Definition

Suppose that we have a covariance matrix Σ. Then we can define theMahalanobis Distance between documents X i and X j as ,

dMah(X i ,X j) =√

(X i − X j)′Σ−1(X i − X j)

More generally: Σ could be symmetric and positive-definite

What does Σ do?

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 14 / 44

Page 71: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Are all differences equal?

Previous metrics treat all dimensions as equalWe may want to engage in some scaling/reweightingMahalanobis Distance

Definition

Suppose that we have a covariance matrix Σ. Then we can define theMahalanobis Distance between documents X i and X j as ,

dMah(X i ,X j) =√

(X i − X j)′Σ−1(X i − X j)

More generally: Σ could be symmetric and positive-definiteWhat does Σ do?

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 14 / 44

Page 72: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Some Intuition: The Unit Circle

−1.0 −0.5 0.0 0.5 1.0

−1.

0−

0.5

0.0

0.5

1.0

Dim1

Dim

2

Σ =

(1 00 1

)

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 15 / 44

Page 73: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Some Intuition: The Unit Circle

−1.0 −0.5 0.0 0.5 1.0

−1.

0−

0.5

0.0

0.5

1.0

Dim1

Dim

2

Σ =

(1 00 0.5

)

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 15 / 44

Page 74: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Some Intuition: The Unit Circle

−1.0 −0.5 0.0 0.5 1.0

−1.

0−

0.5

0.0

0.5

1.0

Dim1

Dim

2

Σ =

(0.5 00 1

)

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 15 / 44

Page 75: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Some Intuition: The Unit Circle

−1.0 −0.5 0.0 0.5 1.0

−1.

0−

0.5

0.0

0.5

1.0

Dim1

Dim

2

Σ =

(1 0.30.3 0.5

)

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 15 / 44

Page 76: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Some Intuition: The Unit Circle

−1.0 −0.5 0.0 0.5 1.0

−1.

0−

0.5

0.0

0.5

1.0

Dim1

Dim

2

Σ =

(1 0.30.3 0.5

)

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 15 / 44

Page 77: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Some Intuition: The Unit Circle

−1.0 −0.5 0.0 0.5 1.0

−1.

0−

0.5

0.0

0.5

1.0

Dim1

Dim

2

Σ =

(1 0.30.3 0.5

)

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 15 / 44

Page 78: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Measuring Distance with MahalanobisSpecial Case 1: Identity Matrix

Σ =

(1 00 1

)Then distance is EuclideanSpecial Case 2: Diagonal Matrix

Σ =

σ21 0 . . . 00 σ22 . . . 0...

.... . .

...0 0 . . . σ2J

Then

dMah(X i ,X j) =

√√√√ J∑m=1

(xim − xjm)2

σ2m

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 16 / 44

Page 79: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Measuring Distance with MahalanobisSpecial Case 1: Identity Matrix

Σ =

(1 00 1

)

Then distance is EuclideanSpecial Case 2: Diagonal Matrix

Σ =

σ21 0 . . . 00 σ22 . . . 0...

.... . .

...0 0 . . . σ2J

Then

dMah(X i ,X j) =

√√√√ J∑m=1

(xim − xjm)2

σ2m

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 16 / 44

Page 80: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Measuring Distance with MahalanobisSpecial Case 1: Identity Matrix

Σ =

(1 00 1

)Then distance is Euclidean

Special Case 2: Diagonal Matrix

Σ =

σ21 0 . . . 00 σ22 . . . 0...

.... . .

...0 0 . . . σ2J

Then

dMah(X i ,X j) =

√√√√ J∑m=1

(xim − xjm)2

σ2m

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 16 / 44

Page 81: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Measuring Distance with MahalanobisSpecial Case 1: Identity Matrix

Σ =

(1 00 1

)Then distance is EuclideanSpecial Case 2: Diagonal Matrix

Σ =

σ21 0 . . . 00 σ22 . . . 0...

.... . .

...0 0 . . . σ2J

Then

dMah(X i ,X j) =

√√√√ J∑m=1

(xim − xjm)2

σ2m

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 16 / 44

Page 82: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Measuring Distance with MahalanobisSpecial Case 1: Identity Matrix

Σ =

(1 00 1

)Then distance is EuclideanSpecial Case 2: Diagonal Matrix

Σ =

σ21 0 . . . 00 σ22 . . . 0...

.... . .

...0 0 . . . σ2J

Then

dMah(X i ,X j) =

√√√√ J∑m=1

(xim − xjm)2

σ2m

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 16 / 44

Page 83: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Measuring Similarity

What properties should similarity measure have?

- Maximum: document with itself

- Minimum: documents have no words in common (orthogonal )

- Increasing when more of same words used

- ? s(a, b) = s(b, a).

How should additional words be treated?

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 17 / 44

Page 84: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Measuring Similarity

What properties should similarity measure have?

- Maximum: document with itself

- Minimum: documents have no words in common (orthogonal )

- Increasing when more of same words used

- ? s(a, b) = s(b, a).

How should additional words be treated?

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 17 / 44

Page 85: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Measuring Similarity

What properties should similarity measure have?

- Maximum: document with itself

- Minimum: documents have no words in common (orthogonal )

- Increasing when more of same words used

- ? s(a, b) = s(b, a).

How should additional words be treated?

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 17 / 44

Page 86: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Measuring Similarity

What properties should similarity measure have?

- Maximum: document with itself

- Minimum: documents have no words in common (orthogonal )

- Increasing when more of same words used

- ? s(a, b) = s(b, a).

How should additional words be treated?

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 17 / 44

Page 87: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Measuring Similarity

What properties should similarity measure have?

- Maximum: document with itself

- Minimum: documents have no words in common (orthogonal )

- Increasing when more of same words used

- ? s(a, b) = s(b, a).

How should additional words be treated?

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 17 / 44

Page 88: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Measuring Similarity

What properties should similarity measure have?

- Maximum: document with itself

- Minimum: documents have no words in common (orthogonal )

- Increasing when more of same words used

- ? s(a, b) = s(b, a).

How should additional words be treated?

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 17 / 44

Page 89: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Measuring Similarity

What properties should similarity measure have?

- Maximum: document with itself

- Minimum: documents have no words in common (orthogonal )

- Increasing when more of same words used

- ? s(a, b) = s(b, a).

How should additional words be treated?

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 17 / 44

Page 90: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Measuring Similarity

0 1 2 3 4

01

23

4

Word 1

Wor

d 2

Measure 1: Inner product

(2, 1)′ · (1, 4) = 6

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 18 / 44

Page 91: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Measuring Similarity

0 1 2 3 4

01

23

4

Word 1

Wor

d 2

Measure 1: Inner product

(2, 1)′ · (1, 4) = 6

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 18 / 44

Page 92: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

0 1 2 3 4

01

23

4

Word 1

Wor

d 2

Problem(?): length dependent

(4, 2)′(1, 4) = 12

a · b = ||a|| × ||b|| × cos θ

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 19 / 44

Page 93: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

0 1 2 3 4

01

23

4

Word 1

Wor

d 2

Problem(?): length dependent

(4, 2)′(1, 4) = 12

a · b = ||a|| × ||b|| × cos θ

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 19 / 44

Page 94: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

0 1 2 3 4

01

23

4

Word 1

Wor

d 2

Problem(?): length dependent

(4, 2)′(1, 4) = 12

a · b = ||a|| × ||b|| × cos θ

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 19 / 44

Page 95: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

0 1 2 3 4

01

23

4

Word 1

Wor

d 2

θ

Problem(?): length dependent

(4, 2)′(1, 4) = 12

a · b = ||a|| × ||b|| × cos θ

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 19 / 44

Page 96: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Cosine Similarity

cos θ =

(a

||a||

)·(

b

||b||

)(4, 2)

||(4, 2)||= (0.89, 0.45)

(2, 1)

||(2, 1)||= (0.89, 0.45)

(1, 4)

||(1, 4)||= (0.24, 0.97)

(0.89, 0.45)′(0.24, 0.97) = 0.65

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 20 / 44

Page 97: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Cosine Similarity

cos θ =

(a

||a||

)·(

b

||b||

)

(4, 2)

||(4, 2)||= (0.89, 0.45)

(2, 1)

||(2, 1)||= (0.89, 0.45)

(1, 4)

||(1, 4)||= (0.24, 0.97)

(0.89, 0.45)′(0.24, 0.97) = 0.65

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 20 / 44

Page 98: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Cosine Similarity

cos θ =

(a

||a||

)·(

b

||b||

)(4, 2)

||(4, 2)||= (0.89, 0.45)

(2, 1)

||(2, 1)||= (0.89, 0.45)

(1, 4)

||(1, 4)||= (0.24, 0.97)

(0.89, 0.45)′(0.24, 0.97) = 0.65

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 20 / 44

Page 99: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Cosine Similarity

cos θ =

(a

||a||

)·(

b

||b||

)(4, 2)

||(4, 2)||= (0.89, 0.45)

(2, 1)

||(2, 1)||= (0.89, 0.45)

(1, 4)

||(1, 4)||= (0.24, 0.97)

(0.89, 0.45)′(0.24, 0.97) = 0.65

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 20 / 44

Page 100: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Cosine Similarity

cos θ =

(a

||a||

)·(

b

||b||

)(4, 2)

||(4, 2)||= (0.89, 0.45)

(2, 1)

||(2, 1)||= (0.89, 0.45)

(1, 4)

||(1, 4)||= (0.24, 0.97)

(0.89, 0.45)′(0.24, 0.97) = 0.65

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 20 / 44

Page 101: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Cosine Similarity

cos θ =

(a

||a||

)·(

b

||b||

)(4, 2)

||(4, 2)||= (0.89, 0.45)

(2, 1)

||(2, 1)||= (0.89, 0.45)

(1, 4)

||(1, 4)||= (0.24, 0.97)

(0.89, 0.45)′(0.24, 0.97) = 0.65

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 20 / 44

Page 102: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Cosine Similarity

0 1 2 3 4

01

23

4

Word 1

Wor

d 2

θ

cos θ: removes document length from similarity measure

Projects texts to unit length representation onto sphere

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 20 / 44

Page 103: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Cosine Similarity

0 1 2 3 4

01

23

4

Word 1

Wor

d 2

θ

cos θ: removes document length from similarity measureProjects texts to unit length representation onto sphere

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 20 / 44

Page 104: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Cosine Similarity

0 1 2 3 4

01

23

4

Word 1

Wor

d 2

θ

cos θ: removes document length from similarity measureProjects texts to unit length representation onto sphere

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 20 / 44

Page 105: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Von Mises-Fisher Distribution

Consider document X i .

X ∗i =X i

||X i ||

Then we might suppose:

X ∗i ∼ von Mises-Fisher(κ,µ)

p(x i |κ,µ) = c(κ) exp (κx∗i µ)

Normal distribution, on a sphere

- Straightforward to Maximize

- Conjugate to itself

- Useful for clustering, hierarchies of topics

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 21 / 44

Page 106: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Von Mises-Fisher Distribution

Consider document X i .

X ∗i =X i

||X i ||

Then we might suppose:

X ∗i ∼ von Mises-Fisher(κ,µ)

p(x i |κ,µ) = c(κ) exp (κx∗i µ)

Normal distribution, on a sphere

- Straightforward to Maximize

- Conjugate to itself

- Useful for clustering, hierarchies of topics

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 21 / 44

Page 107: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Von Mises-Fisher Distribution

Consider document X i .

X ∗i =X i

||X i ||

Then we might suppose:

X ∗i ∼ von Mises-Fisher(κ,µ)

p(x i |κ,µ) = c(κ) exp (κx∗i µ)

Normal distribution, on a sphere

- Straightforward to Maximize

- Conjugate to itself

- Useful for clustering, hierarchies of topics

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 21 / 44

Page 108: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Von Mises-Fisher Distribution

Consider document X i .

X ∗i =X i

||X i ||

Then we might suppose:

X ∗i ∼ von Mises-Fisher(κ,µ)

p(x i |κ,µ) = c(κ) exp (κx∗i µ)

Normal distribution, on a sphere

- Straightforward to Maximize

- Conjugate to itself

- Useful for clustering, hierarchies of topics

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 21 / 44

Page 109: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Von Mises-Fisher Distribution

Consider document X i .

X ∗i =X i

||X i ||

Then we might suppose:

X ∗i ∼ von Mises-Fisher(κ,µ)

p(x i |κ,µ) = c(κ) exp (κx∗i µ)

Normal distribution, on a sphere

- Straightforward to Maximize

- Conjugate to itself

- Useful for clustering, hierarchies of topics

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 21 / 44

Page 110: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Von Mises-Fisher Distribution

Consider document X i .

X ∗i =X i

||X i ||

Then we might suppose:

X ∗i ∼ von Mises-Fisher(κ,µ)

p(x i |κ,µ) = c(κ) exp (κx∗i µ)

Normal distribution, on a sphere

- Straightforward to Maximize

- Conjugate to itself

- Useful for clustering, hierarchies of topics

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 21 / 44

Page 111: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Von Mises-Fisher Distribution

Consider document X i .

X ∗i =X i

||X i ||

Then we might suppose:

X ∗i ∼ von Mises-Fisher(κ,µ)

p(x i |κ,µ) = c(κ) exp (κx∗i µ)

Normal distribution, on a sphere

- Straightforward to Maximize

- Conjugate to itself

- Useful for clustering, hierarchies of topics

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 21 / 44

Page 112: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Von Mises-Fisher Distribution

Consider document X i .

X ∗i =X i

||X i ||

Then we might suppose:

X ∗i ∼ von Mises-Fisher(κ,µ)

p(x i |κ,µ) = c(κ) exp (κx∗i µ)

Normal distribution, on a sphere

- Straightforward to Maximize

- Conjugate to itself

- Useful for clustering, hierarchies of topics

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 21 / 44

Page 113: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Von Mises-Fisher Distribution

Consider document X i .

X ∗i =X i

||X i ||

Then we might suppose:

X ∗i ∼ von Mises-Fisher(κ,µ)

p(x i |κ,µ) = c(κ) exp (κx∗i µ)

Normal distribution, on a sphere

- Straightforward to Maximize

- Conjugate to itself

- Useful for clustering, hierarchies of topics

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 21 / 44

Page 114: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Kernel Similarity

Definition

Suppose we have documents X i and X j . Define the Gaussian kernel as

k(X i ,X j) = exp

(−||X i − X j ||2

σ2

)

Kernel of the Gaussian distribution

σ2 = determines sensitivity of the kernel

If X i = X j then k(X i ,X j) = 1

As X i and X j become more dissimilar, then k(X i ,X j) = 0

Result often justify setting some kernel weights to zero

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 22 / 44

Page 115: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Kernel Similarity

Definition

Suppose we have documents X i and X j . Define the Gaussian kernel as

k(X i ,X j) = exp

(−||X i − X j ||2

σ2

)

Kernel of the Gaussian distribution

σ2 = determines sensitivity of the kernel

If X i = X j then k(X i ,X j) = 1

As X i and X j become more dissimilar, then k(X i ,X j) = 0

Result often justify setting some kernel weights to zero

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 22 / 44

Page 116: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Kernel Similarity

Definition

Suppose we have documents X i and X j . Define the Gaussian kernel as

k(X i ,X j) = exp

(−||X i − X j ||2

σ2

)

Kernel of the Gaussian distribution

σ2 = determines sensitivity of the kernel

If X i = X j then k(X i ,X j) = 1

As X i and X j become more dissimilar, then k(X i ,X j) = 0

Result often justify setting some kernel weights to zero

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 22 / 44

Page 117: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Kernel Similarity

Definition

Suppose we have documents X i and X j . Define the Gaussian kernel as

k(X i ,X j) = exp

(−||X i − X j ||2

σ2

)

Kernel of the Gaussian distribution

σ2 = determines sensitivity of the kernel

If X i = X j then k(X i ,X j) = 1

As X i and X j become more dissimilar, then k(X i ,X j) = 0

Result often justify setting some kernel weights to zero

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 22 / 44

Page 118: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Kernel Similarity

Definition

Suppose we have documents X i and X j . Define the Gaussian kernel as

k(X i ,X j) = exp

(−||X i − X j ||2

σ2

)

Kernel of the Gaussian distribution

σ2 = determines sensitivity of the kernel

If X i = X j then k(X i ,X j) = 1

As X i and X j become more dissimilar, then k(X i ,X j) = 0

Result often justify setting some kernel weights to zero

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 22 / 44

Page 119: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Kernel Similarity

Definition

Suppose we have documents X i and X j . Define the Gaussian kernel as

k(X i ,X j) = exp

(−||X i − X j ||2

σ2

)

Kernel of the Gaussian distribution

σ2 = determines sensitivity of the kernel

If X i = X j then k(X i ,X j) = 1

As X i and X j become more dissimilar, then k(X i ,X j) = 0

Result often justify setting some kernel weights to zero

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 22 / 44

Page 120: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

The Kernel Trick

Suppose all of our documents X i ∈ <J

There may be some mapping φ : <J → <M where M > J that improvesour performance “lift” to higher dimensionWe might want, then,

s(φ(X i ), φ(X j)) = < φ(X i ), φ(X j) >

- The only thing we care about, though is inner product of transformedvariables

- So long as we can calculate inner product, we need not makeexplicit transformation

- Kernels provide methods for capture wide array of transformations.

- Kernel Trick calculate inner products on untransformed data(Gaussian Kernel), implicitly use wide array of φ’s.

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 23 / 44

Page 121: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

The Kernel Trick

Suppose all of our documents X i ∈ <J

There may be some mapping φ : <J → <M where M > J that improvesour performance “lift” to higher dimension

We might want, then,

s(φ(X i ), φ(X j)) = < φ(X i ), φ(X j) >

- The only thing we care about, though is inner product of transformedvariables

- So long as we can calculate inner product, we need not makeexplicit transformation

- Kernels provide methods for capture wide array of transformations.

- Kernel Trick calculate inner products on untransformed data(Gaussian Kernel), implicitly use wide array of φ’s.

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 23 / 44

Page 122: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

The Kernel Trick

Suppose all of our documents X i ∈ <J

There may be some mapping φ : <J → <M where M > J that improvesour performance “lift” to higher dimensionWe might want, then,

s(φ(X i ), φ(X j)) = < φ(X i ), φ(X j) >

- The only thing we care about, though is inner product of transformedvariables

- So long as we can calculate inner product, we need not makeexplicit transformation

- Kernels provide methods for capture wide array of transformations.

- Kernel Trick calculate inner products on untransformed data(Gaussian Kernel), implicitly use wide array of φ’s.

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 23 / 44

Page 123: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

The Kernel Trick

Suppose all of our documents X i ∈ <J

There may be some mapping φ : <J → <M where M > J that improvesour performance “lift” to higher dimensionWe might want, then,

s(φ(X i ), φ(X j)) = < φ(X i ), φ(X j) >

- The only thing we care about, though is inner product of transformedvariables

- So long as we can calculate inner product, we need not makeexplicit transformation

- Kernels provide methods for capture wide array of transformations.

- Kernel Trick calculate inner products on untransformed data(Gaussian Kernel), implicitly use wide array of φ’s.

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 23 / 44

Page 124: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

The Kernel Trick

Suppose all of our documents X i ∈ <J

There may be some mapping φ : <J → <M where M > J that improvesour performance “lift” to higher dimensionWe might want, then,

s(φ(X i ), φ(X j)) = < φ(X i ), φ(X j) >

- The only thing we care about, though is inner product of transformedvariables

- So long as we can calculate inner product, we need not makeexplicit transformation

- Kernels provide methods for capture wide array of transformations.

- Kernel Trick calculate inner products on untransformed data(Gaussian Kernel), implicitly use wide array of φ’s.

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 23 / 44

Page 125: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

The Kernel Trick

Suppose all of our documents X i ∈ <J

There may be some mapping φ : <J → <M where M > J that improvesour performance “lift” to higher dimensionWe might want, then,

s(φ(X i ), φ(X j)) = < φ(X i ), φ(X j) >

- The only thing we care about, though is inner product of transformedvariables

- So long as we can calculate inner product, we need not makeexplicit transformation

- Kernels provide methods for capture wide array of transformations.

- Kernel Trick calculate inner products on untransformed data(Gaussian Kernel), implicitly use wide array of φ’s.

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 23 / 44

Page 126: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

The Kernel Trick

Suppose all of our documents X i ∈ <J

There may be some mapping φ : <J → <M where M > J that improvesour performance “lift” to higher dimensionWe might want, then,

s(φ(X i ), φ(X j)) = < φ(X i ), φ(X j) >

- The only thing we care about, though is inner product of transformedvariables

- So long as we can calculate inner product, we need not makeexplicit transformation

- Kernels provide methods for capture wide array of transformations.

- Kernel Trick calculate inner products on untransformed data(Gaussian Kernel), implicitly use wide array of φ’s.

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 23 / 44

Page 127: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

The Kernel Trick

Suppose all of our documents X i ∈ <J

There may be some mapping φ : <J → <M where M > J that improvesour performance “lift” to higher dimensionWe might want, then,

s(φ(X i ), φ(X j)) = < φ(X i ), φ(X j) >

- The only thing we care about, though is inner product of transformedvariables

- So long as we can calculate inner product, we need not makeexplicit transformation

- Kernels provide methods for capture wide array of transformations.

- Kernel Trick calculate inner products on untransformed data(Gaussian Kernel), implicitly use wide array of φ’s.

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 23 / 44

Page 128: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Weighting Words

Are all words created equal?

- Treat all words equally

- Lots of noise

- Reweight words

- Accentuate words that are likely to be informative- Make specific assumptions about characteristics of informative words

How to generate weights?

- Assumptions about separating words

- Use training set to identify separating words (Monroe, Ideologymeasurement)

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 24 / 44

Page 129: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Weighting Words

Are all words created equal?

- Treat all words equally

- Lots of noise

- Reweight words

- Accentuate words that are likely to be informative- Make specific assumptions about characteristics of informative words

How to generate weights?

- Assumptions about separating words

- Use training set to identify separating words (Monroe, Ideologymeasurement)

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 24 / 44

Page 130: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Weighting Words

Are all words created equal?

- Treat all words equally

- Lots of noise

- Reweight words

- Accentuate words that are likely to be informative- Make specific assumptions about characteristics of informative words

How to generate weights?

- Assumptions about separating words

- Use training set to identify separating words (Monroe, Ideologymeasurement)

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 24 / 44

Page 131: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Weighting Words

Are all words created equal?

- Treat all words equally

- Lots of noise

- Reweight words

- Accentuate words that are likely to be informative- Make specific assumptions about characteristics of informative words

How to generate weights?

- Assumptions about separating words

- Use training set to identify separating words (Monroe, Ideologymeasurement)

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 24 / 44

Page 132: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Weighting Words

Are all words created equal?

- Treat all words equally

- Lots of noise

- Reweight words

- Accentuate words that are likely to be informative

- Make specific assumptions about characteristics of informative words

How to generate weights?

- Assumptions about separating words

- Use training set to identify separating words (Monroe, Ideologymeasurement)

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 24 / 44

Page 133: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Weighting Words

Are all words created equal?

- Treat all words equally

- Lots of noise

- Reweight words

- Accentuate words that are likely to be informative- Make specific assumptions about characteristics of informative words

How to generate weights?

- Assumptions about separating words

- Use training set to identify separating words (Monroe, Ideologymeasurement)

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 24 / 44

Page 134: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Weighting Words

Are all words created equal?

- Treat all words equally

- Lots of noise

- Reweight words

- Accentuate words that are likely to be informative- Make specific assumptions about characteristics of informative words

How to generate weights?

- Assumptions about separating words

- Use training set to identify separating words (Monroe, Ideologymeasurement)

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 24 / 44

Page 135: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Weighting Words

Are all words created equal?

- Treat all words equally

- Lots of noise

- Reweight words

- Accentuate words that are likely to be informative- Make specific assumptions about characteristics of informative words

How to generate weights?

- Assumptions about separating words

- Use training set to identify separating words (Monroe, Ideologymeasurement)

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 24 / 44

Page 136: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Weighting Words

Are all words created equal?

- Treat all words equally

- Lots of noise

- Reweight words

- Accentuate words that are likely to be informative- Make specific assumptions about characteristics of informative words

How to generate weights?

- Assumptions about separating words

- Use training set to identify separating words (Monroe, Ideologymeasurement)

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 24 / 44

Page 137: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Weighting Words: TF-IDF Weighting

What properties do words need to separate concepts?

- Used frequently

- But not too frequently

Ex. If all statements about OBL contain Bin Laden than this contributesnothing to similarity/dissimilarity measuresInverse document frequency:

nj = No. documents in which word j occurs

idfj = logN

njidf = (idf1, idf2, . . . , idfJ)

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 25 / 44

Page 138: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Weighting Words: TF-IDF Weighting

What properties do words need to separate concepts?

- Used frequently

- But not too frequently

Ex. If all statements about OBL contain Bin Laden than this contributesnothing to similarity/dissimilarity measuresInverse document frequency:

nj = No. documents in which word j occurs

idfj = logN

njidf = (idf1, idf2, . . . , idfJ)

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 25 / 44

Page 139: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Weighting Words: TF-IDF Weighting

What properties do words need to separate concepts?

- Used frequently

- But not too frequently

Ex. If all statements about OBL contain Bin Laden than this contributesnothing to similarity/dissimilarity measuresInverse document frequency:

nj = No. documents in which word j occurs

idfj = logN

njidf = (idf1, idf2, . . . , idfJ)

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 25 / 44

Page 140: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Weighting Words: TF-IDF Weighting

What properties do words need to separate concepts?

- Used frequently

- But not too frequently

Ex. If all statements about OBL contain Bin Laden than this contributesnothing to similarity/dissimilarity measures

Inverse document frequency:

nj = No. documents in which word j occurs

idfj = logN

njidf = (idf1, idf2, . . . , idfJ)

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 25 / 44

Page 141: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Weighting Words: TF-IDF Weighting

What properties do words need to separate concepts?

- Used frequently

- But not too frequently

Ex. If all statements about OBL contain Bin Laden than this contributesnothing to similarity/dissimilarity measuresInverse document frequency:

nj = No. documents in which word j occurs

idfj = logN

njidf = (idf1, idf2, . . . , idfJ)

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 25 / 44

Page 142: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Weighting Words: TF-IDF Weighting

What properties do words need to separate concepts?

- Used frequently

- But not too frequently

Ex. If all statements about OBL contain Bin Laden than this contributesnothing to similarity/dissimilarity measuresInverse document frequency:

nj = No. documents in which word j occurs

idfj = logN

njidf = (idf1, idf2, . . . , idfJ)

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 25 / 44

Page 143: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Weighting Words: TF-IDF Weighting

What properties do words need to separate concepts?

- Used frequently

- But not too frequently

Ex. If all statements about OBL contain Bin Laden than this contributesnothing to similarity/dissimilarity measuresInverse document frequency:

nj = No. documents in which word j occurs

idfj = logN

nj

idf = (idf1, idf2, . . . , idfJ)

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 25 / 44

Page 144: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Weighting Words: TF-IDF Weighting

What properties do words need to separate concepts?

- Used frequently

- But not too frequently

Ex. If all statements about OBL contain Bin Laden than this contributesnothing to similarity/dissimilarity measuresInverse document frequency:

nj = No. documents in which word j occurs

idfj = logN

njidf = (idf1, idf2, . . . , idfJ)

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 25 / 44

Page 145: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Weighting Words: TF-IDF Weighting

Why log ?

- Maximum at nj = 1

- Decreases at rate 1nj⇒ diminishing “penalty” for more common use

- Other functional forms are fine, embed assumptions aboutpenalization of common use

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 26 / 44

Page 146: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Weighting Words: TF-IDF Weighting

Why log ?

- Maximum at nj = 1

- Decreases at rate 1nj⇒ diminishing “penalty” for more common use

- Other functional forms are fine, embed assumptions aboutpenalization of common use

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 26 / 44

Page 147: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Weighting Words: TF-IDF Weighting

Why log ?

- Maximum at nj = 1

- Decreases at rate 1nj⇒ diminishing “penalty” for more common use

- Other functional forms are fine, embed assumptions aboutpenalization of common use

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 26 / 44

Page 148: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Weighting Words: TF-IDF Weighting

Why log ?

- Maximum at nj = 1

- Decreases at rate 1nj⇒ diminishing “penalty” for more common use

- Other functional forms are fine, embed assumptions aboutpenalization of common use

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 26 / 44

Page 149: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Weighting Words: TF-IDF

Xi ,idf ≡ Xi︸︷︷︸tf

×idf = (Xi1 × idf1,Xi2 × idf2, . . . ,XiJ × idfJ)

Xj ,idf ≡ Xj × idf = (Xj1 × idf1,Xj2 × idf2, . . . ,XjJ × idfJ)

How Does This Matter For Measuring Similarity/Dissimilarity?Inner Product

Xi ,idf · Xj ,idf = (Xi × idf)′(Xj × idf)

= (idf21 × Xi1 × Xj1) + (idf22 × Xi2 × Xj2) +

. . .+ (idf2J × XiJ × XjJ)

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 27 / 44

Page 150: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Weighting Words: TF-IDF

Xi ,idf ≡ Xi︸︷︷︸tf

×idf = (Xi1 × idf1,Xi2 × idf2, . . . ,XiJ × idfJ)

Xj ,idf ≡ Xj × idf = (Xj1 × idf1,Xj2 × idf2, . . . ,XjJ × idfJ)

How Does This Matter For Measuring Similarity/Dissimilarity?Inner Product

Xi ,idf · Xj ,idf = (Xi × idf)′(Xj × idf)

= (idf21 × Xi1 × Xj1) + (idf22 × Xi2 × Xj2) +

. . .+ (idf2J × XiJ × XjJ)

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 27 / 44

Page 151: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Weighting Words: TF-IDF

Xi ,idf ≡ Xi︸︷︷︸tf

×idf = (Xi1 × idf1,Xi2 × idf2, . . . ,XiJ × idfJ)

Xj ,idf ≡ Xj × idf = (Xj1 × idf1,Xj2 × idf2, . . . ,XjJ × idfJ)

How Does This Matter For Measuring Similarity/Dissimilarity?Inner Product

Xi ,idf · Xj ,idf = (Xi × idf)′(Xj × idf)

= (idf21 × Xi1 × Xj1) + (idf22 × Xi2 × Xj2) +

. . .+ (idf2J × XiJ × XjJ)

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 27 / 44

Page 152: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Weighting Words: TF-IDF

Xi ,idf ≡ Xi︸︷︷︸tf

×idf = (Xi1 × idf1,Xi2 × idf2, . . . ,XiJ × idfJ)

Xj ,idf ≡ Xj × idf = (Xj1 × idf1,Xj2 × idf2, . . . ,XjJ × idfJ)

How Does This Matter For Measuring Similarity/Dissimilarity?

Inner Product

Xi ,idf · Xj ,idf = (Xi × idf)′(Xj × idf)

= (idf21 × Xi1 × Xj1) + (idf22 × Xi2 × Xj2) +

. . .+ (idf2J × XiJ × XjJ)

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 27 / 44

Page 153: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Weighting Words: TF-IDF

Xi ,idf ≡ Xi︸︷︷︸tf

×idf = (Xi1 × idf1,Xi2 × idf2, . . . ,XiJ × idfJ)

Xj ,idf ≡ Xj × idf = (Xj1 × idf1,Xj2 × idf2, . . . ,XjJ × idfJ)

How Does This Matter For Measuring Similarity/Dissimilarity?Inner Product

Xi ,idf · Xj ,idf = (Xi × idf)′(Xj × idf)

= (idf21 × Xi1 × Xj1) + (idf22 × Xi2 × Xj2) +

. . .+ (idf2J × XiJ × XjJ)

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 27 / 44

Page 154: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Weighting Words: TF-IDF

Xi ,idf ≡ Xi︸︷︷︸tf

×idf = (Xi1 × idf1,Xi2 × idf2, . . . ,XiJ × idfJ)

Xj ,idf ≡ Xj × idf = (Xj1 × idf1,Xj2 × idf2, . . . ,XjJ × idfJ)

How Does This Matter For Measuring Similarity/Dissimilarity?Inner Product

Xi ,idf · Xj ,idf = (Xi × idf)′(Xj × idf)

= (idf21 × Xi1 × Xj1) + (idf22 × Xi2 × Xj2) +

. . .+ (idf2J × XiJ × XjJ)

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 27 / 44

Page 155: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Weighting Words: TF-IDF

Xi ,idf ≡ Xi︸︷︷︸tf

×idf = (Xi1 × idf1,Xi2 × idf2, . . . ,XiJ × idfJ)

Xj ,idf ≡ Xj × idf = (Xj1 × idf1,Xj2 × idf2, . . . ,XjJ × idfJ)

How Does This Matter For Measuring Similarity/Dissimilarity?Inner Product

Xi ,idf · Xj ,idf = (Xi × idf)′(Xj × idf)

= (idf21 × Xi1 × Xj1) + (idf22 × Xi2 × Xj2) +

. . .+ (idf2J × XiJ × XjJ)

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 27 / 44

Page 156: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Weighting Words: Inner Product

Define:

Σ =

idf21 0 0 . . . 0

0 idf22 0 . . . 0...

......

. . ....

0 0 0 . . . idf2J

If we use tf-idf for our documents, then

d2(X i ,X j) =

√√√√ J∑m=1

(xim,idf − xjm,idf)2

=√

(X i − X j)′Σ(X i − X j)

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 28 / 44

Page 157: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Weighting Words: Inner Product

Define:

Σ =

idf21 0 0 . . . 0

0 idf22 0 . . . 0...

......

. . ....

0 0 0 . . . idf2J

If we use tf-idf for our documents, then

d2(X i ,X j) =

√√√√ J∑m=1

(xim,idf − xjm,idf)2

=√

(X i − X j)′Σ(X i − X j)

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 28 / 44

Page 158: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Weighting Words: Inner Product

Define:

Σ =

idf21 0 0 . . . 0

0 idf22 0 . . . 0...

......

. . ....

0 0 0 . . . idf2J

If we use tf-idf for our documents, then

d2(X i ,X j) =

√√√√ J∑m=1

(xim,idf − xjm,idf)2

=√

(X i − X j)′Σ(X i − X j)

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 28 / 44

Page 159: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Weighting Words: Inner Product

Define:

Σ =

idf21 0 0 . . . 0

0 idf22 0 . . . 0...

......

. . ....

0 0 0 . . . idf2J

If we use tf-idf for our documents, then

d2(X i ,X j) =

√√√√ J∑m=1

(xim,idf − xjm,idf)2

=√

(X i − X j)′Σ(X i − X j)

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 28 / 44

Page 160: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Final Product

Applying some measure of distance, similarity (if symmetric) yields:

D =

0 d(1, 2) d(1, 3) . . . d(1,N)

d(2, 1) 0 d(2, 3) . . . d(2,N)d(3, 1) d(3, 2) 0 . . . d(3,N)

......

.... . .

...d(N, 1) d(N, 2) d(N, 3) . . . 0

Lower Triangle contains unique information N(N − 1)/2

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 29 / 44

Page 161: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Spirling and Indian Treaties

Spirling (2013): model Treaties between US and Native Americans

Why?

- American political development

- IR Theories of Treaties and Treaty Violations

- Comparative studies of indigenous/colonialist interaction

- Political Science question: how did Native Americans lose land soquickly?

Paper does a lot. We’re going to focus on

- Today: Text representation and similarity calculation

- Tuesday: Projecting to low dimensional space

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 30 / 44

Page 162: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Spirling and Indian Treaties

Spirling (2013): model Treaties between US and Native AmericansWhy?

- American political development

- IR Theories of Treaties and Treaty Violations

- Comparative studies of indigenous/colonialist interaction

- Political Science question: how did Native Americans lose land soquickly?

Paper does a lot. We’re going to focus on

- Today: Text representation and similarity calculation

- Tuesday: Projecting to low dimensional space

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 30 / 44

Page 163: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Spirling and Indian Treaties

Spirling (2013): model Treaties between US and Native AmericansWhy?

- American political development

- IR Theories of Treaties and Treaty Violations

- Comparative studies of indigenous/colonialist interaction

- Political Science question: how did Native Americans lose land soquickly?

Paper does a lot. We’re going to focus on

- Today: Text representation and similarity calculation

- Tuesday: Projecting to low dimensional space

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 30 / 44

Page 164: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Spirling and Indian Treaties

Spirling (2013): model Treaties between US and Native AmericansWhy?

- American political development

- IR Theories of Treaties and Treaty Violations

- Comparative studies of indigenous/colonialist interaction

- Political Science question: how did Native Americans lose land soquickly?

Paper does a lot. We’re going to focus on

- Today: Text representation and similarity calculation

- Tuesday: Projecting to low dimensional space

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 30 / 44

Page 165: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Spirling and Indian Treaties

Spirling (2013): model Treaties between US and Native AmericansWhy?

- American political development

- IR Theories of Treaties and Treaty Violations

- Comparative studies of indigenous/colonialist interaction

- Political Science question: how did Native Americans lose land soquickly?

Paper does a lot. We’re going to focus on

- Today: Text representation and similarity calculation

- Tuesday: Projecting to low dimensional space

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 30 / 44

Page 166: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Spirling and Indian Treaties

Spirling (2013): model Treaties between US and Native AmericansWhy?

- American political development

- IR Theories of Treaties and Treaty Violations

- Comparative studies of indigenous/colonialist interaction

- Political Science question: how did Native Americans lose land soquickly?

Paper does a lot. We’re going to focus on

- Today: Text representation and similarity calculation

- Tuesday: Projecting to low dimensional space

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 30 / 44

Page 167: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Spirling and Indian Treaties

Spirling (2013): model Treaties between US and Native AmericansWhy?

- American political development

- IR Theories of Treaties and Treaty Violations

- Comparative studies of indigenous/colonialist interaction

- Political Science question: how did Native Americans lose land soquickly?

Paper does a lot. We’re going to focus on

- Today: Text representation and similarity calculation

- Tuesday: Projecting to low dimensional space

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 30 / 44

Page 168: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Spirling and Indian Treaties

Spirling (2013): model Treaties between US and Native AmericansWhy?

- American political development

- IR Theories of Treaties and Treaty Violations

- Comparative studies of indigenous/colonialist interaction

- Political Science question: how did Native Americans lose land soquickly?

Paper does a lot. We’re going to focus on

- Today: Text representation and similarity calculation

- Tuesday: Projecting to low dimensional space

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 30 / 44

Page 169: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Spirling and Indian Treaties

Spirling (2013): model Treaties between US and Native AmericansWhy?

- American political development

- IR Theories of Treaties and Treaty Violations

- Comparative studies of indigenous/colonialist interaction

- Political Science question: how did Native Americans lose land soquickly?

Paper does a lot. We’re going to focus on

- Today: Text representation and similarity calculation

- Tuesday: Projecting to low dimensional space

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 30 / 44

Page 170: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Spirling and Indian Treaties

How do we preserve word order and semantic language?After stemming, stopping, bag of wording:

- Peace Between Us

- No Peace Between Us

are identical.Spirling uses complicated representation of texts to preserve word order broad applicationPeace Between Us

Analyzes K-substrings

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 31 / 44

Page 171: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Spirling and Indian Treaties

How do we preserve word order and semantic language?After stemming, stopping, bag of wording:

- Peace Between Us

- No Peace Between Us

are identical.Spirling uses complicated representation of texts to preserve word order broad applicationPeace Between Us

Analyzes K-substrings

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 31 / 44

Page 172: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Spirling and Indian Treaties

How do we preserve word order and semantic language?After stemming, stopping, bag of wording:

- Peace Between Us

- No Peace Between Us

are identical.Spirling uses complicated representation of texts to preserve word order broad applicationPeace Between Us

Analyzes K-substrings

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 31 / 44

Page 173: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Spirling and Indian Treaties

How do we preserve word order and semantic language?After stemming, stopping, bag of wording:

- Peace Between Us

- No Peace Between Us

are identical.Spirling uses complicated representation of texts to preserve word order broad applicationPeace Between Us

Analyzes K-substrings

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 31 / 44

Page 174: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Spirling and Indian Treaties

How do we preserve word order and semantic language?After stemming, stopping, bag of wording:

- Peace Between Us

- No Peace Between Us

are identical.Spirling uses complicated representation of texts to preserve word order broad applicationPeace Between Us

Analyzes K-substrings

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 31 / 44

Page 175: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Spirling and Indian Treaties

How do we preserve word order and semantic language?After stemming, stopping, bag of wording:

- Peace Between Us

- No Peace Between Us

are identical.Spirling uses complicated representation of texts to preserve word order broad applicationPeace Between Us

Analyzes K-substrings

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 31 / 44

Page 176: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Spirling and Indian Treaties

How do we preserve word order and semantic language?After stemming, stopping, bag of wording:

- Peace Between Us

- No Peace Between Us

are identical.Spirling uses complicated representation of texts to preserve word order broad applicationPeace Between Us

Analyzes K-substrings

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 31 / 44

Page 177: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Spirling and Indian Treaties

How do we preserve word order and semantic language?After stemming, stopping, bag of wording:

- Peace Between Us

- No Peace Between Us

are identical.Spirling uses complicated representation of texts to preserve word order broad applicationPeace Between Us

Analyzes K-substrings

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 31 / 44

Page 178: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Spirling and Indian Treaties

How do we preserve word order and semantic language?After stemming, stopping, bag of wording:

- Peace Between Us

- No Peace Between Us

are identical.Spirling uses complicated representation of texts to preserve word order broad applicationPeace Between Us

Analyzes K-substrings

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 31 / 44

Page 179: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Spirling and Indian Treaties

How do we preserve word order and semantic language?After stemming, stopping, bag of wording:

- Peace Between Us

- No Peace Between Us

are identical.Spirling uses complicated representation of texts to preserve word order broad applicationPeace Between Us

Analyzes K-substrings

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 31 / 44

Page 180: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Spirling and Indian Treaties

How do we preserve word order and semantic language?After stemming, stopping, bag of wording:

- Peace Between Us

- No Peace Between Us

are identical.Spirling uses complicated representation of texts to preserve word order broad applicationPeace Between Us

Analyzes K-substrings

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 31 / 44

Page 181: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Spirling and Indian Treaties

How do we preserve word order and semantic language?After stemming, stopping, bag of wording:

- Peace Between Us

- No Peace Between Us

are identical.Spirling uses complicated representation of texts to preserve word order broad applicationPeace Between Us

Analyzes K-substrings

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 31 / 44

Page 182: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Spirling and Indian Treaties

How do we preserve word order and semantic language?After stemming, stopping, bag of wording:

- Peace Between Us

- No Peace Between Us

are identical.Spirling uses complicated representation of texts to preserve word order broad applicationPeace Between Us

Analyzes K-substrings

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 31 / 44

Page 183: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Kernel Trick

- Kernel Methods: Represent texts, measure similarity simultaneously

- Compare only substrings in both documents (without explicitlyquantifying entire documents)

- Problem solved:

- Arthur gives all his money to Justin- Justin gives all his money to Arthur- Discard word order: same sentence Kernel : different sentences.

Uses Kernel methods to measure similarity

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 32 / 44

Page 184: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Kernel Trick

- Kernel Methods: Represent texts, measure similarity

simultaneously

- Compare only substrings in both documents (without explicitlyquantifying entire documents)

- Problem solved:

- Arthur gives all his money to Justin- Justin gives all his money to Arthur- Discard word order: same sentence Kernel : different sentences.

Uses Kernel methods to measure similarity

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 32 / 44

Page 185: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Kernel Trick

- Kernel Methods: Represent texts, measure similarity simultaneously

- Compare only substrings in both documents (without explicitlyquantifying entire documents)

- Problem solved:

- Arthur gives all his money to Justin- Justin gives all his money to Arthur- Discard word order: same sentence Kernel : different sentences.

Uses Kernel methods to measure similarity

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 32 / 44

Page 186: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Kernel Trick

- Kernel Methods: Represent texts, measure similarity simultaneously

- Compare only substrings in both documents (without explicitlyquantifying entire documents)

- Problem solved:

- Arthur gives all his money to Justin- Justin gives all his money to Arthur- Discard word order: same sentence Kernel : different sentences.

Uses Kernel methods to measure similarity

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 32 / 44

Page 187: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Kernel Trick

- Kernel Methods: Represent texts, measure similarity simultaneously

- Compare only substrings in both documents (without explicitlyquantifying entire documents)

- Problem solved:

- Arthur gives all his money to Justin- Justin gives all his money to Arthur- Discard word order: same sentence Kernel : different sentences.

Uses Kernel methods to measure similarity

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 32 / 44

Page 188: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Kernel Trick

- Kernel Methods: Represent texts, measure similarity simultaneously

- Compare only substrings in both documents (without explicitlyquantifying entire documents)

- Problem solved:

- Arthur gives all his money to Justin

- Justin gives all his money to Arthur- Discard word order: same sentence Kernel : different sentences.

Uses Kernel methods to measure similarity

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 32 / 44

Page 189: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Kernel Trick

- Kernel Methods: Represent texts, measure similarity simultaneously

- Compare only substrings in both documents (without explicitlyquantifying entire documents)

- Problem solved:

- Arthur gives all his money to Justin- Justin gives all his money to Arthur

- Discard word order: same sentence Kernel : different sentences.

Uses Kernel methods to measure similarity

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 32 / 44

Page 190: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Kernel Trick

- Kernel Methods: Represent texts, measure similarity simultaneously

- Compare only substrings in both documents (without explicitlyquantifying entire documents)

- Problem solved:

- Arthur gives all his money to Justin- Justin gives all his money to Arthur- Discard word order: same sentence

Kernel : different sentences.

Uses Kernel methods to measure similarity

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 32 / 44

Page 191: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Kernel Trick

- Kernel Methods: Represent texts, measure similarity simultaneously

- Compare only substrings in both documents (without explicitlyquantifying entire documents)

- Problem solved:

- Arthur gives all his money to Justin- Justin gives all his money to Arthur- Discard word order: same sentence Kernel : different sentences.

Uses Kernel methods to measure similarity

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 32 / 44

Page 192: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Kernel Trick

- Kernel Methods: Represent texts, measure similarity simultaneously

- Compare only substrings in both documents (without explicitlyquantifying entire documents)

- Problem solved:

- Arthur gives all his money to Justin- Justin gives all his money to Arthur- Discard word order: same sentence Kernel : different sentences.

Uses Kernel methods to measure similarity

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 32 / 44

Page 193: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Kernel Trick

- Kernel Methods: Represent texts, measure similarity simultaneously

- Compare only substrings in both documents (without explicitlyquantifying entire documents)

- Problem solved:

- Arthur gives all his money to Justin- Justin gives all his money to Arthur- Discard word order: same sentence Kernel : different sentences.

Uses Kernel methods to measure similarity

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 32 / 44

Page 194: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Kernel Trick

- Kernel Methods: Represent texts, measure similarity simultaneously

- Compare only substrings in both documents (without explicitlyquantifying entire documents)

- Problem solved:

- Arthur gives all his money to Justin- Justin gives all his money to Arthur- Discard word order: same sentence Kernel : different sentences.

Uses Kernel methods to measure similarity

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 32 / 44

Page 195: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Kernel Trick

- Kernel Methods: Represent texts, measure similarity simultaneously

- Compare only substrings in both documents (without explicitlyquantifying entire documents)

- Problem solved:

- Arthur gives all his money to Justin- Justin gives all his money to Arthur- Discard word order: same sentence Kernel : different sentences.

Uses Kernel methods to measure similarity

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 32 / 44

Page 196: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Kernel Trick

- Kernel Methods: Represent texts, measure similarity simultaneously

- Compare only substrings in both documents (without explicitlyquantifying entire documents)

- Problem solved:

- Arthur gives all his money to Justin- Justin gives all his money to Arthur- Discard word order: same sentence Kernel : different sentences.

Uses Kernel methods to measure similarity

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 32 / 44

Page 197: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Similarity and Dissimilarity of Many Things

Throughout the course we’ll measure similarity between documentsWe’ll also (implicitly) study similarity of probability distributionsDevelop a measure of distribution dissimilarity

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 33 / 44

Page 198: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Similarity of Probability Distributions

Definition

Suppose P is a continuous random variable with density p : < → < and Qis a continuous random variable with density q : < → q.We can define the KL-Divergence between P and Q as

KL(P||Q) =

∫ ∞−∞

p(x) logp(x)

q(x)dx

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 34 / 44

Page 199: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Assessing Similarity of Other Things

KL-divergence measures dissimilarity between two distributions.

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 35 / 44

Page 200: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Consider a function. f (x) = −x2.

Maps numbers to other numbers.

−4 −2 0 2 4

−15

−10

−5

0

x

f(x)

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 36 / 44

Page 201: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Consider a function. f (x) = −x2.Maps numbers to other numbers.

−4 −2 0 2 4

−15

−10

−5

0

x

f(x)

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 36 / 44

Page 202: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Consider a function. f (x) = −x2.Maps numbers to other numbers.

−4 −2 0 2 4

−15

−10

−5

0

x

f(x)

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 36 / 44

Page 203: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Take some input (-2 here)

−4 −2 0 2 4

−15

−10

−5

0

x

f(x)

−2

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 37 / 44

Page 204: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Then obtain the value of f (−2)

−4 −2 0 2 4

−15

−10

−5

0

x

f(x)

−2

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 38 / 44

Page 205: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Then obtain the value of f (−2) = −4

−4 −2 0 2 4

−15

−10

−5

0

x

f(x)

−2

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 39 / 44

Page 206: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

KL(q||p) is a functional.

A functional takes functions as inputs, returns areal number.KL(q||p) maps from sets of distributions q ∈ Q and p ∈ P to positive realnumbers.For example, we could set q = Uniform(0,1) and p = Normal(0, 1)KL(Uniform(0,1)||Normal(0,1)) = 1.09

−4 −2 0 2 4

0.0

0.2

0.4

0.6

0.8

1.0

1.2

x

Den

sity

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 40 / 44

Page 207: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

KL(q||p) is a functional. A functional takes functions as inputs, returns areal number.

KL(q||p) maps from sets of distributions q ∈ Q and p ∈ P to positive realnumbers.For example, we could set q = Uniform(0,1) and p = Normal(0, 1)KL(Uniform(0,1)||Normal(0,1)) = 1.09

−4 −2 0 2 4

0.0

0.2

0.4

0.6

0.8

1.0

1.2

x

Den

sity

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 40 / 44

Page 208: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

KL(q||p) is a functional. A functional takes functions as inputs, returns areal number.KL(q||p) maps from sets of distributions q ∈ Q and p ∈ P to positive realnumbers.

For example, we could set q = Uniform(0,1) and p = Normal(0, 1)KL(Uniform(0,1)||Normal(0,1)) = 1.09

−4 −2 0 2 4

0.0

0.2

0.4

0.6

0.8

1.0

1.2

x

Den

sity

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 40 / 44

Page 209: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

KL(q||p) is a functional. A functional takes functions as inputs, returns areal number.KL(q||p) maps from sets of distributions q ∈ Q and p ∈ P to positive realnumbers.For example, we could set q = Uniform(0,1) and p = Normal(0, 1)

KL(Uniform(0,1)||Normal(0,1)) = 1.09

−4 −2 0 2 4

0.0

0.2

0.4

0.6

0.8

1.0

1.2

x

Den

sity

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 40 / 44

Page 210: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

KL(q||p) is a functional. A functional takes functions as inputs, returns areal number.KL(q||p) maps from sets of distributions q ∈ Q and p ∈ P to positive realnumbers.For example, we could set q = Uniform(0,1) and p = Normal(0, 1)KL(Uniform(0,1)||Normal(0,1)) = 1.09

−4 −2 0 2 4

0.0

0.2

0.4

0.6

0.8

1.0

1.2

x

Den

sity

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 40 / 44

Page 211: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

If q and p are the same distribution then KL(q||p) = 0.

Variational Approximation (topic models!): approximate one distributionp, with another, simpler distribution q.Then make this approximation the best possible–minimize theKL-divergence.

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 41 / 44

Page 212: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

If q and p are the same distribution then KL(q||p) = 0.Variational Approximation (topic models!): approximate one distributionp, with another, simpler distribution q.

Then make this approximation the best possible–minimize theKL-divergence.

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 41 / 44

Page 213: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

If q and p are the same distribution then KL(q||p) = 0.Variational Approximation (topic models!): approximate one distributionp, with another, simpler distribution q.Then make this approximation the best possible–minimize theKL-divergence.

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 41 / 44

Page 214: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

A simple example.

Approximate a Normal(0,1) with symmetric Uniform distribution,Uniform(-b, b).Choose b to min. KL(Uniform(-b, b)|| Normal(0,1))

−4 −2 0 2 4

0.0

0.1

0.2

0.3

0.4

x

Den

sity

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 42 / 44

Page 215: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

A simple example.Approximate a Normal(0,1) with symmetric Uniform distribution,Uniform(-b, b).

Choose b to min. KL(Uniform(-b, b)|| Normal(0,1))

−4 −2 0 2 4

0.0

0.1

0.2

0.3

0.4

x

Den

sity

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 42 / 44

Page 216: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

A simple example.Approximate a Normal(0,1) with symmetric Uniform distribution,Uniform(-b, b).Choose b to min. KL(Uniform(-b, b)|| Normal(0,1))

−4 −2 0 2 4

0.0

0.1

0.2

0.3

0.4

x

Den

sity

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 42 / 44

Page 217: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

A simple example.Approximate a Normal(0,1) with symmetric Uniform distribution,Uniform(-b, b).Choose b to min. KL(Uniform(-b, b)|| Normal(0,1))

−4 −2 0 2 4

0.0

0.1

0.2

0.3

0.4

x

Den

sity

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 42 / 44

Page 218: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Answer:

b =√

3

−4 −2 0 2 4

0.0

0.1

0.2

0.3

0.4

x

Den

sity

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 43 / 44

Page 219: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Answer:b =√

3

−4 −2 0 2 4

0.0

0.1

0.2

0.3

0.4

x

Den

sity

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 43 / 44

Page 220: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

Answer:b =√

3

−4 −2 0 2 4

0.0

0.1

0.2

0.3

0.4

x

Den

sity

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 43 / 44

Page 221: Text as Datastanford.edu/~jgrimmer/Text14/tc6.pdf · Suppose documents live in aspace rich set of results from linear algebra-Provides ageometry modify with word weighting-Natural

1) Documents in vector space geometry of texts

2) Many methods to measure similarity and dissimilarity

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 44 / 44