Source: stanford.edu/~jgrimmer/text14/tc6.pdf
Text as Data
Justin Grimmer
Associate Professor, Department of Political Science
Stanford University
October 9th, 2014
The Vector Space Model of Text
1) Task: numerous tasks will suppose that we can measure document similarity or dissimilarity.

2) Objective function: for a variety of tasks, we will impose some measure or definition of similarity, dissimilarity, or distance.

- d(X_i, X_j) = dissimilarity (distance); bigger implies further apart
- s(X_i, X_j) = similarity; bigger implies closer together
- Objective functions determine which points we compare and how we aggregate similarity, dissimilarity, and distance.

3) Optimization: depends on the particular task; likely arranging/grouping objects to find similarity.

4) Validation: are the mathematical definitions of similarity actually similar for our particular purpose?
Texts and Geometry

Consider a document-term matrix:

X = \begin{pmatrix} 1 & 2 & 0 & \dots & 0 \\ 0 & 0 & 3 & \dots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & 0 & 0 & \dots & 3 \end{pmatrix}

Suppose documents live in a space. We then inherit a rich set of results from linear algebra:

- Provides a geometry, which we can modify with word weighting
- Natural notions of distance
- Kernel trick: richer comparisons of large feature spaces
- Building block for clustering, supervised learning, and scaling
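As a concrete sketch of building such a matrix: the tooling below is an assumption (the slides name none); scikit-learn's CountVectorizer turns a toy corpus into an N x J count matrix.

```python
# Minimal sketch of a document-term matrix; the corpus and the use of
# scikit-learn are illustrative assumptions, not from the slides.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat", "the dog sat on the cat", "dogs and cats"]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)            # sparse N x J count matrix
print(vectorizer.get_feature_names_out())     # the J vocabulary terms
print(X.toarray())                            # one row of counts per document
```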
Texts in Space

Doc_1 = (1, 1, 3, \dots, 5)
Doc_2 = (2, 0, 0, \dots, 1)
Doc_1, Doc_2 \in \mathbb{R}^J

Inner product between documents:

Doc_1 \cdot Doc_2 = (1, 1, 3, \dots, 5)'(2, 0, 0, \dots, 1)
= 1 \times 2 + 1 \times 0 + 3 \times 0 + \dots + 5 \times 1
= 7
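The same computation in numpy, assuming the elided "..." entries are simply dropped so the toy vectors are 4-dimensional:

```python
# Inner product of two word-count vectors; the 4-dimensional truncation
# of the slide's vectors is an assumption for illustration.
import numpy as np

doc1 = np.array([1, 1, 3, 5])
doc2 = np.array([2, 0, 0, 1])
print(doc1 @ doc2)   # 1*2 + 1*0 + 3*0 + 5*1 = 7
```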
Vector Length

[Figure: a right triangle in the (x_1, x_2) plane, with legs of length a and b and hypotenuse c = (a^2 + b^2)^{1/2}]

- Pythagorean theorem: given a side with length a,
- a side with length b, and a right angle between them,
- c = \sqrt{a^2 + b^2}
- This is true in general, in any number of dimensions
Vector (Euclidean) Length
Definition
Suppose v \in \mathbb{R}^J. Then, we will define its length as

||v|| = (v \cdot v)^{1/2} = (v_1^2 + v_2^2 + v_3^2 + \dots + v_J^2)^{1/2}
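A quick numerical check of the definition; numpy's built-in norm computes the same quantity as the inner-product formula:

```python
# Euclidean length as the square root of a vector's inner product
# with itself; np.linalg.norm is the library equivalent.
import numpy as np

v = np.array([1, 1, 3, 5])
print(np.sqrt(v @ v), np.linalg.norm(v))   # identical values: 6.0 6.0
```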
Measures of Dissimilarity

Initial guess: distance metrics. Properties of a metric (distance function) d(\cdot, \cdot); consider arbitrary documents X_i, X_j, X_k:

1) d(X_i, X_j) \geq 0
2) d(X_i, X_j) = 0 if and only if X_i = X_j
3) d(X_i, X_j) = d(X_j, X_i)
4) d(X_i, X_k) \leq d(X_i, X_j) + d(X_j, X_k)

Explore distance functions to compare documents. Do we want additional assumptions/properties?
Measuring the Distance Between Documents: Euclidean Distance

[Figure: two documents plotted in the (Word 1, Word 2) plane, both axes running from 0 to 4]
Measuring the Distance Between Documents

Definition
The Euclidean distance between documents X_i and X_j is

||X_i - X_j|| = \sqrt{ \sum_{m=1}^{J} (x_{im} - x_{jm})^2 }

Suppose X_i = (1, 4) and X_j = (2, 1). The distance between the documents is:

||(1, 4) - (2, 1)|| = \sqrt{(1 - 2)^2 + (4 - 1)^2} = \sqrt{10}
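The worked example, verified in numpy:

```python
# Euclidean distance between the slide's two toy documents.
import numpy as np

xi, xj = np.array([1, 4]), np.array([2, 1])
print(np.linalg.norm(xi - xj))   # sqrt(10) ≈ 3.1623
```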
Measuring the Distance Between Documents

There are many distance metrics. Consider the Minkowski family.

Definition
The Minkowski distance between documents X_i and X_j for value p is

d_p(X_i, X_j) = \left( \sum_{m=1}^{J} |x_{im} - x_{jm}|^p \right)^{1/p}
Members of the Minkowski Family

Manhattan metric:

d_1(X_i, X_j) = \sum_{m=1}^{J} |x_{im} - x_{jm}|

d_1((1, 4), (2, 1)) = |1 - 2| + |4 - 1| = 4

Minkowski (p) metric:

d_p(X_i, X_j) = \left( \sum_{m=1}^{J} |x_{im} - x_{jm}|^p \right)^{1/p}

d_p((1, 4), (2, 1)) = (|1 - 2|^p + |4 - 1|^p)^{1/p}
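A sketch of the family on the same pair of documents; scipy's minkowski helper is an assumed convenience (the sum-and-root formula is one line of numpy):

```python
# Minkowski distances for several p; p=1 is Manhattan, p=2 is Euclidean.
import numpy as np
from scipy.spatial.distance import minkowski

xi, xj = np.array([1, 4]), np.array([2, 1])
print(minkowski(xi, xj, p=1))   # 4.0
print(minkowski(xi, xj, p=2))   # sqrt(10) ≈ 3.1623
print(minkowski(xi, xj, p=4))   # (1 + 81)**0.25 ≈ 3.0092
```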
What Does p Do?

Increasing p gives greater importance to the coordinates with the largest differences. If we let p \to \infty, we obtain the maximum metric (Chebyshev metric):

\lim_{p \to \infty} d_p(X_i, X_j) = \max_{m=1}^{J} |x_{im} - x_{jm}|

In words: the distance between documents is just the biggest difference; all other differences do not contribute to the distance measure. Decreasing p gives greater importance to the coordinates with the smallest differences:

\lim_{p \to -\infty} d_p(X_i, X_j) = \min_{m=1}^{J} |x_{im} - x_{jm}|
Comparing the Metrics

Suppose X_i = (10, 4, 3), X_j = (0, 4, 3), and X_k = (0, 0, 0). Then:

d_1(X_i, X_j) = 10
d_1(X_i, X_k) = 10 + 4 + 3 = 17
d_2(X_i, X_j) = 10
d_2(X_i, X_k) = \sqrt{10^2 + 4^2 + 3^2} = \sqrt{125} = 11.18
d_4(X_i, X_j) = 10
d_4(X_i, X_k) = (10^4 + 4^4 + 3^4)^{1/4} = (10337)^{1/4} = 10.08
d_\infty(X_i, X_j) = 10
d_\infty(X_i, X_k) = 10
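These numbers can be reproduced directly from the Minkowski formula; the p \to \infty case is just the maximum coordinate difference:

```python
# Reproducing the slide's comparison of d_1, d_2, d_4, and d_inf.
import numpy as np

xi, xk = np.array([10, 4, 3]), np.array([0, 0, 0])
diff = np.abs(xi - xk)
for p in (1, 2, 4):
    print(p, np.sum(diff ** p) ** (1 / p))   # 17.0, 11.18, 10.08
print("inf", diff.max())                     # 10: only the biggest gap counts
```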
Are all differences equal?

The previous metrics treat all dimensions as equal. We may want to engage in some scaling/reweighting: the Mahalanobis distance.

Definition
Suppose that we have a covariance matrix \Sigma. Then we can define the Mahalanobis distance between documents X_i and X_j as

d_{Mah}(X_i, X_j) = \sqrt{ (X_i - X_j)' \Sigma^{-1} (X_i - X_j) }

More generally, \Sigma could be any symmetric, positive-definite matrix. What does \Sigma do?
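A sketch with scipy, which expects the inverse of the covariance matrix; the 2x2 Sigma below is a made-up example, not from the slides:

```python
# Mahalanobis distance between two toy documents under a hypothetical
# covariance matrix Sigma; scipy takes the inverse, VI = Sigma^{-1}.
import numpy as np
from scipy.spatial.distance import mahalanobis

Sigma = np.array([[1.0, 0.3], [0.3, 0.5]])   # hypothetical covariance
VI = np.linalg.inv(Sigma)
print(mahalanobis([1, 4], [2, 1], VI))
```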
Some Intuition: The Unit Circle

[Figure: the set of points at unit Mahalanobis distance from the origin in the (Dim 1, Dim 2) plane. \Sigma = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} gives a circle; \Sigma = \begin{pmatrix} 1 & 0 \\ 0 & 0.5 \end{pmatrix} and \Sigma = \begin{pmatrix} 0.5 & 0 \\ 0 & 1 \end{pmatrix} give axis-aligned ellipses; \Sigma = \begin{pmatrix} 1 & 0.3 \\ 0.3 & 0.5 \end{pmatrix} gives a tilted ellipse]
Measuring Distance with Mahalanobis

Special Case 1: Identity matrix

\Sigma = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}

Then the distance is Euclidean.

Special Case 2: Diagonal matrix

\Sigma = \begin{pmatrix} \sigma_1^2 & 0 & \dots & 0 \\ 0 & \sigma_2^2 & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & \sigma_J^2 \end{pmatrix}

Then

d_{Mah}(X_i, X_j) = \sqrt{ \sum_{m=1}^{J} \frac{(x_{im} - x_{jm})^2}{\sigma_m^2} }
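A numerical check of Special Case 2: with a diagonal \Sigma (hypothetical variances below), Mahalanobis distance equals Euclidean distance on variance-standardized coordinates:

```python
# Diagonal-Sigma Mahalanobis distance vs. standardize-then-Euclidean.
import numpy as np
from scipy.spatial.distance import mahalanobis

xi, xj = np.array([1.0, 4.0]), np.array([2.0, 1.0])
sig2 = np.array([1.0, 4.0])                     # hypothetical variances
VI = np.diag(1 / sig2)                          # inverse of diagonal Sigma
d_standardized = np.linalg.norm((xi - xj) / np.sqrt(sig2))
print(mahalanobis(xi, xj, VI), d_standardized)  # identical values
```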
Measuring Similarity

What properties should a similarity measure have?

- Maximum: a document with itself
- Minimum: documents that have no words in common (orthogonal)
- Increasing as more of the same words are used
- Symmetry(?): s(a, b) = s(b, a)

How should additional words be treated?
Measuring Similarity

[Figure: documents (2, 1) and (1, 4) plotted in the (Word 1, Word 2) plane]

Measure 1: inner product

(2, 1)' \cdot (1, 4) = 6
[Figure: the same plane with (4, 2) added; \theta is the angle between the vectors]

Problem(?): length dependent

(4, 2)'(1, 4) = 12

a \cdot b = ||a|| \times ||b|| \times \cos\theta
Cosine Similarity

\cos\theta = \left( \frac{a}{||a||} \right) \cdot \left( \frac{b}{||b||} \right)

(4, 2) / ||(4, 2)|| = (0.89, 0.45)
(2, 1) / ||(2, 1)|| = (0.89, 0.45)
(1, 4) / ||(1, 4)|| = (0.24, 0.97)

(0.89, 0.45)'(0.24, 0.97) = 0.65
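The slide's arithmetic as a small function:

```python
# Cosine similarity: the inner product of unit-length vectors.
import numpy as np

def cosine_sim(a, b):
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_sim(np.array([4, 2]), np.array([1, 4])))  # ≈ 0.65
print(cosine_sim(np.array([2, 1]), np.array([1, 4])))  # same: length drops out
```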
Cosine Similarity

[Figure: the document vectors projected to unit length, with angle \theta between them]

\cos\theta removes document length from the similarity measure: it projects texts to a unit-length representation on the sphere.
Von Mises-Fisher Distribution

Consider document X_i.

X_i^* = \frac{X_i}{||X_i||}

Then we might suppose:

X_i^* \sim \text{von Mises-Fisher}(\kappa, \mu)

p(x_i^* | \kappa, \mu) = c(\kappa) \exp(\kappa x_i^{*\prime} \mu)

A normal distribution, on a sphere:

- Straightforward to maximize
- Conjugate to itself
- Useful for clustering, hierarchies of topics
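A sketch of the normalization step that puts documents on the unit sphere, the support of the von Mises-Fisher distribution:

```python
# Row-normalizing a document-term matrix to unit length.
import numpy as np

X = np.array([[1.0, 1.0, 3.0, 5.0],
              [2.0, 0.0, 0.0, 1.0]])
X_star = X / np.linalg.norm(X, axis=1, keepdims=True)
print(np.linalg.norm(X_star, axis=1))   # every row now has length 1.0
```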
Kernel Similarity

Definition
Suppose we have documents X_i and X_j. Define the Gaussian kernel as

k(X_i, X_j) = \exp\left( \frac{-||X_i - X_j||^2}{\sigma^2} \right)

- Kernel of the Gaussian distribution
- \sigma^2 determines the sensitivity of the kernel
- If X_i = X_j, then k(X_i, X_j) = 1
- As X_i and X_j become more dissimilar, k(X_i, X_j) \to 0
- This result often justifies setting some kernel weights to zero
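A minimal implementation showing the two limiting behaviors:

```python
# Gaussian kernel similarity; sigma2 controls how quickly similarity decays.
import numpy as np

def gaussian_kernel(xi, xj, sigma2=1.0):
    return np.exp(-np.sum((xi - xj) ** 2) / sigma2)

xi = np.array([1.0, 4.0])
print(gaussian_kernel(xi, xi))                    # 1.0 for identical documents
print(gaussian_kernel(xi, np.array([2.0, 1.0])))  # → 0 as documents diverge
```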
The Kernel Trick

Suppose all of our documents X_i \in \mathbb{R}^J. There may be some mapping \phi: \mathbb{R}^J \to \mathbb{R}^M, where M > J, that improves our performance: a "lift" to a higher dimension. We might want, then,

s(\phi(X_i), \phi(X_j)) = \langle \phi(X_i), \phi(X_j) \rangle

- The only thing we care about, though, is the inner product of the transformed variables
- So long as we can calculate the inner product, we need not make the transformation explicit
- Kernels provide methods for capturing a wide array of transformations
- Kernel trick: calculate inner products on the untransformed data (e.g., with the Gaussian kernel), implicitly using a wide array of \phi's
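To make the trick concrete, here is the standard degree-2 polynomial kernel identity, a stand-in example not from the slides (the Gaussian kernel's implicit \phi is infinite-dimensional, so it cannot be written out this way): the inner product of the explicitly lifted vectors equals the squared inner product of the raw vectors.

```python
# Kernel trick demo: <phi(a), phi(b)> = (a . b)^2 for the degree-2
# polynomial kernel, so the lift phi never has to be computed.
import numpy as np

def phi(x):
    # explicit lift of a 2-vector into R^3
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

a, b = np.array([2.0, 1.0]), np.array([1.0, 4.0])
print(phi(a) @ phi(b))   # 36.0, via the explicit transformation
print((a @ b) ** 2)      # 36.0, via the kernel on untransformed data
```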
Weighting Words

Are all words created equal?

- Treat all words equally: lots of noise
- Reweight words: accentuate words that are likely to be informative; make specific assumptions about the characteristics of informative words

How to generate weights?

- Assumptions about separating words
- Use a training set to identify separating words (Monroe, ideology measurement)
Weighting Words: TF-IDF Weighting

What properties do words need to separate concepts?

- Used frequently
- But not too frequently

Example: if all statements about OBL contain "Bin Laden", then this word contributes nothing to the similarity/dissimilarity measures.

Inverse document frequency:

n_j = number of documents in which word j occurs

\text{idf}_j = \log \frac{N}{n_j}

\mathbf{idf} = (\text{idf}_1, \text{idf}_2, \dots, \text{idf}_J)
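A sketch of the idf computation on a toy count matrix, assuming natural log (the slides just write "log"):

```python
# Inverse document frequency from an N x J document-term matrix.
import numpy as np

X = np.array([[1, 2, 0],
              [0, 0, 3],
              [1, 0, 3]])
N = X.shape[0]
n_j = (X > 0).sum(axis=0)   # number of documents containing word j
idf = np.log(N / n_j)
print(idf)                  # widely used words get weight near 0
```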
Weighting Words: TF-IDF Weighting

Why log?

- Maximum at n_j = 1
- Decreases at rate 1/n_j \Rightarrow a diminishing "penalty" for more common use
- Other functional forms are fine; they embed assumptions about the penalization of common use
Weighting Words: TF-IDF

X_{i,\text{idf}} \equiv \underbrace{X_i}_{\text{tf}} \times \mathbf{idf} = (X_{i1} \times \text{idf}_1, X_{i2} \times \text{idf}_2, \dots, X_{iJ} \times \text{idf}_J)

X_{j,\text{idf}} \equiv X_j \times \mathbf{idf} = (X_{j1} \times \text{idf}_1, X_{j2} \times \text{idf}_2, \dots, X_{jJ} \times \text{idf}_J)

How does this matter for measuring similarity/dissimilarity? Inner product:

X_{i,\text{idf}} \cdot X_{j,\text{idf}} = (X_i \times \mathbf{idf})'(X_j \times \mathbf{idf}) = (\text{idf}_1^2 \times X_{i1} \times X_{j1}) + (\text{idf}_2^2 \times X_{i2} \times X_{j2}) + \dots + (\text{idf}_J^2 \times X_{iJ} \times X_{jJ})
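A short sketch of the weighted inner product (same invented toy matrix as above):

    import numpy as np

    X = np.array([[1, 2, 0, 1],
                  [0, 0, 3, 1],
                  [1, 0, 0, 1]])
    idf = np.log(X.shape[0] / (X > 0).sum(axis=0))

    X_idf = X * idf                          # tf-idf: column j scaled by idf_j
    s = X_idf[0] @ X_idf[1]                  # inner product in the weighted space
    s_check = np.sum(idf**2 * X[0] * X[1])   # sum_j idf_j^2 * X_1j * X_2j, as on the slide
    assert np.isclose(s, s_check)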
Weighting Words: Inner Product

Define the diagonal weight matrix

Σ = diag(idf_1^2, idf_2^2, ..., idf_J^2)

If we use tf-idf for our documents, then

d_2(X_i, X_j) = sqrt( sum_{m=1}^{J} (x_{im,idf} − x_{jm,idf})^2 )
              = sqrt( (X_i − X_j)' Σ (X_i − X_j) )

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 28 / 44
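The two expressions for the distance agree, as a quick sketch confirms (same invented data):

    import numpy as np

    X = np.array([[1, 2, 0, 1],
                  [0, 0, 3, 1],
                  [1, 0, 0, 1]])
    idf = np.log(X.shape[0] / (X > 0).sum(axis=0))
    X_idf = X * idf
    Sigma = np.diag(idf**2)       # the diagonal weight matrix from the slide

    diff = X[0] - X[1]
    d_direct = np.sqrt(np.sum((X_idf[0] - X_idf[1])**2))   # Euclidean in tf-idf space
    d_quad = np.sqrt(diff @ Sigma @ diff)                  # quadratic-form version
    assert np.isclose(d_direct, d_quad)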
Final Product

Applying some measure of distance (or similarity, if symmetric) yields:

D = [ 0        d(1,2)   d(1,3)   ...  d(1,N)
      d(2,1)   0        d(2,3)   ...  d(2,N)
      d(3,1)   d(3,2)   0        ...  d(3,N)
      ...      ...      ...      ...  ...
      d(N,1)   d(N,2)   d(N,3)   ...  0      ]

The lower triangle contains the unique information: N(N − 1)/2 entries.

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 29 / 44
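Building D for a small corpus is a few lines (a sketch; the helper name is mine):

    import numpy as np

    def distance_matrix(X_w):
        """Pairwise Euclidean distances between rows of a (weighted)
        document-term matrix: N x N, symmetric, zero diagonal."""
        diffs = X_w[:, None, :] - X_w[None, :, :]
        return np.sqrt((diffs ** 2).sum(axis=-1))

    X_w = np.array([[1., 2., 0.],
                    [0., 0., 3.],
                    [1., 0., 1.]])
    D = distance_matrix(X_w)
    print(np.tril(D))    # lower triangle: N(N - 1)/2 = 3 unique entries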
Spirling and Indian Treaties

Spirling (2013): models treaties between the US and Native Americans

Why?

- American political development
- IR theories of treaties and treaty violations
- Comparative studies of indigenous/colonialist interaction
- Political science question: how did Native Americans lose land so quickly?

The paper does a lot. We're going to focus on:

- Today: text representation and similarity calculation
- Tuesday: projecting to a low-dimensional space

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 30 / 44
Spirling and Indian Treaties

How do we preserve word order and semantic language?

After stemming, stopping, and bag-of-wording:

- Peace Between Us
- No Peace Between Us

are identical.

Spirling uses a more complicated representation of texts, one with broad application, to preserve word order: he analyzes the K-substrings of each text (e.g., of "Peace Between Us"), as sketched below.

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 31 / 44
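A minimal sketch of the substring idea, using contiguous word sequences (whether the paper's K-substrings are defined over words or characters, the point is the same: word order now matters):

    def k_substrings(text, k):
        """All contiguous length-k word sequences in a text."""
        words = text.lower().split()
        return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

    a = k_substrings("Peace Between Us", 2)
    b = k_substrings("No Peace Between Us", 2)
    print(a & b)   # shared: ('peace', 'between'), ('between', 'us')
    print(b - a)   # ('no', 'peace') -- the two texts are no longer identical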
Kernel Trick

- Kernel methods: represent texts and measure similarity simultaneously
- Compare only the substrings that appear in both documents (without explicitly quantifying entire documents)
- Problem solved:
  - Arthur gives all his money to Justin
  - Justin gives all his money to Arthur
  Discarding word order: same sentence. Kernel: different sentences.

Spirling uses kernel methods to measure similarity.

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 32 / 44
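A minimal sketch of a substring-style kernel (a simple spectrum kernel over word bigrams with cosine normalization -- my choices for illustration, not Spirling's exact kernel):

    from collections import Counter
    from math import sqrt

    def ngrams(text, k=2):
        w = text.lower().split()
        return Counter(tuple(w[i:i + k]) for i in range(len(w) - k + 1))

    def substring_kernel(s, t, k=2):
        """Similarity from shared substrings only: iterates over n-grams
        present in both texts, never over the full feature space."""
        a, b = ngrams(s, k), ngrams(t, k)
        dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
        norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
        return dot / norm

    s1 = "Arthur gives all his money to Justin"
    s2 = "Justin gives all his money to Arthur"
    print(substring_kernel(s1, s2))   # about 0.67: word order now separates them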
Similarity and Dissimilarity of Many Things

Throughout the course we'll measure similarity between documents. We'll also (implicitly) study the similarity of probability distributions, so we develop a measure of distribution dissimilarity.

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 33 / 44
Similarity of Probability Distributions

Definition
Suppose P is a continuous random variable with density p : R → R_+ and Q is a continuous random variable with density q : R → R_+. We can define the KL divergence between P and Q as

KL(P||Q) = ∫_{−∞}^{∞} p(x) log( p(x) / q(x) ) dx

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 34 / 44
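A numerical sketch of the definition (Riemann sum; the grid and example densities are my choices):

    import numpy as np

    def kl_divergence(p, q, xs):
        """KL(P||Q) approximated on an evenly spaced grid xs;
        assumes q(x) > 0 wherever p(x) > 0."""
        px, qx = p(xs), q(xs)
        mask = px > 0
        dx = xs[1] - xs[0]
        return np.sum(px[mask] * np.log(px[mask] / qx[mask])) * dx

    normal = lambda mu: (lambda x: np.exp(-(x - mu)**2 / 2) / np.sqrt(2 * np.pi))
    xs = np.linspace(-10, 10, 200_000)
    print(kl_divergence(normal(0), normal(1), xs))   # ~0.5 = (0 - 1)^2 / 2, the known closed form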
Assessing Similarity of Other Things
KL-divergence measures dissimilarity between two distributions.
Justin Grimmer (Stanford University) Text as Data October 9th, 2014 35 / 44
Consider a function: f(x) = −x². It maps numbers to other numbers.

[Figure: plot of f(x) = −x² for x ∈ [−4, 4]; axes x and f(x)]

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 36 / 44
Take some input (−2 here).

[Figure: the same plot, with the input x = −2 marked]

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 37 / 44
Then obtain the value of f(−2).

[Figure: the same plot, tracing from x = −2 to the curve]

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 38 / 44
Then obtain the value of f(−2) = −4.

[Figure: the same plot, with the point (−2, −4) marked]

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 39 / 44
KL(q||p) is a functional. A functional takes functions as inputs and returns a real number.

KL(q||p) maps from sets of distributions q ∈ Q and p ∈ P to the positive real numbers.

For example, we could set q = Uniform(0, 1) and p = Normal(0, 1):

KL(Uniform(0,1) || Normal(0,1)) = 1.09

[Figure: densities of Uniform(0,1) and Normal(0,1) on the same axes]

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 40 / 44
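The 1.09 can be checked in closed form: on (0, 1) the uniform density is 1, and with φ the standard normal density,

    KL = ∫_0^1 log( 1/φ(x) ) dx = ∫_0^1 [ (1/2) log(2π) + x²/2 ] dx = (1/2) log(2π) + 1/6 ≈ 0.919 + 0.167 ≈ 1.09.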
If q and p are the same distribution, then KL(q||p) = 0.

Variational approximation (topic models!): approximate one distribution p with another, simpler distribution q. Then make this approximation the best possible: minimize the KL divergence.

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 41 / 44
A simple example.

Approximate a Normal(0,1) with a symmetric uniform distribution, Uniform(−b, b). Choose b to minimize KL(Uniform(−b, b) || Normal(0,1)).

[Figure: standard normal density with a candidate uniform approximation]

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 42 / 44
Answer: b = √3 (a quick check appears below).

[Figure: standard normal density with the best uniform approximation, Uniform(−√3, √3)]

Justin Grimmer (Stanford University) Text as Data October 9th, 2014 43 / 44
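A quick check of the answer (the closed-form objective follows from the KL definition; the grid search is my choice to keep dependencies minimal):

    import numpy as np

    # KL(Uniform(-b, b) || Normal(0, 1)) in closed form:
    # integrating (1/2b) log[ (1/2b) / phi(x) ] over (-b, b) gives
    #   -log(2b) + (1/2) log(2*pi) + b^2 / 6
    def kl_uniform_normal(b):
        return -np.log(2 * b) + 0.5 * np.log(2 * np.pi) + b**2 / 6

    bs = np.linspace(0.1, 5, 100_000)
    print(bs[np.argmin(kl_uniform_normal(bs))])   # ~1.732 = sqrt(3)

Setting the derivative −1/b + b/3 to zero gives b² = 3 directly.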
1) Documents in a vector space ⇒ a geometry of texts

2) Many methods to measure similarity and dissimilarity
Justin Grimmer (Stanford University) Text as Data October 9th, 2014 44 / 44