Text as Data
Justin Grimmer
Associate Professor, Department of Political Science
Stanford University
November 10th, 2014
Justin Grimmer (Stanford University) Text as Data November 10th, 2014 1 / 40
Supervised Learning Methods
1) Task
   - Classify documents into pre-existing categories
   - Measure the proportion of documents in each category
2) Objective function
   1) Penalized regressions
      - Ridge regression
      - LASSO regression
   2) Classification surface: Support Vector Machines
   3) Measure proportions: Naive Bayes(ish) objective
3) Optimization
   - Depends on method
4) Validation
   - Obtain predicted fit for new data $f(X_i, \theta)$
   - Examine prediction performance: compare classification to gold standard
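The four steps can be sketched end to end with a toy penalized regression. This is only a hedged illustration: the simulated document-term matrix, the ridge penalty $\lambda = 1$, and the train/test split are all made-up assumptions, not the lecture's data.

```python
import numpy as np

# Toy stand-in for a document-term matrix with labels in {-1, +1}.
rng = np.random.default_rng(0)
N, J = 200, 50
beta_true = rng.normal(size=J)
X = rng.normal(size=(N, J))
y = np.sign(X @ beta_true + rng.normal(size=N))

# 1) Task: classify documents into pre-existing categories.
X_train, X_new = X[:150], X[150:]
y_train, y_new = y[:150], y[150:]

# 2) Objective: penalized least squares (ridge); lambda = 1 is arbitrary.
lam = 1.0

# 3) Optimization: closed form, (X'X + lam * I)^{-1} X'Y.
beta_hat = np.linalg.solve(X_train.T @ X_train + lam * np.eye(J),
                           X_train.T @ y_train)

# 4) Validation: predicted fit f(X_i, beta_hat) on new documents,
#    compared against the gold-standard labels.
pred = np.sign(X_new @ beta_hat)
accuracy = (pred == y_new).mean()
```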
Regression models

Suppose we have $N$ documents, with each document $i$ having label $y_i \in \{-1, 1\}$ ({liberal, conservative}). We represent each document $i$ as $\mathbf{x}_i = (x_{i1}, x_{i2}, \ldots, x_{iJ})$.

$$f(\beta, X, Y) = \sum_{i=1}^{N} (y_i - \beta'\mathbf{x}_i)^2$$

$$\hat{\beta} = \arg\min_{\beta} \left\{ \sum_{i=1}^{N} (y_i - \beta'\mathbf{x}_i)^2 \right\} = (X'X)^{-1} X'Y$$

Problem:
- $J$ will likely be large (perhaps $J > N$)
- There are many correlated variables

Predictions will be variable
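The problem is concrete: as soon as $J > N$, $X'X$ is rank-deficient, so $(X'X)^{-1}$ does not exist and infinitely many coefficient vectors fit the training data equally well. A quick numerical sketch (dimensions are arbitrary):

```python
import numpy as np

# With more terms than documents, the normal equations break down.
rng = np.random.default_rng(1)
N, J = 20, 100                     # J > N: more features than observations
X = rng.normal(size=(N, J))
XtX = X.T @ X                      # J x J matrix, but rank at most N

rank = np.linalg.matrix_rank(XtX)  # rank N = 20, far short of J = 100,
                                   # so XtX is singular and OLS is not unique
```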
Mean Square Error

Suppose $\theta$ is the true parameter value and $\hat{\theta}$ an estimator of it.

Bias:

$$\text{Bias} = E[\hat{\theta}] - \theta$$

We may care about average squared distance from the truth:

$$
\begin{aligned}
E[(\hat{\theta} - \theta)^2] &= E[\hat{\theta}^2] - 2\theta E[\hat{\theta}] + \theta^2 \\
&= E[\hat{\theta}^2] - E[\hat{\theta}]^2 + E[\hat{\theta}]^2 - 2\theta E[\hat{\theta}] + \theta^2 \\
&= E[\hat{\theta}^2] - E[\hat{\theta}]^2 + (E[\hat{\theta}] - \theta)^2 \\
&= \text{Var}(\hat{\theta}) + \text{Bias}^2
\end{aligned}
$$

To reduce MSE, we are willing to induce bias to decrease variance: methods that shrink coefficients toward zero
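The decomposition is easy to confirm by simulation. Below, a hypothetical shrinkage estimator $\hat{\theta} = c \cdot \bar{x}$ with $c < 1$ accepts some bias in exchange for lower variance; every constant is illustrative.

```python
import numpy as np

# Monte Carlo check that E[(theta_hat - theta)^2] = Var + Bias^2.
rng = np.random.default_rng(2)
theta, sigma, n, c = 2.0, 3.0, 10, 0.8   # made-up truth, noise, sample size
reps = 200_000

draws = rng.normal(theta, sigma, size=(reps, n))
theta_hat = c * draws.mean(axis=1)       # shrunk sample mean

mse = np.mean((theta_hat - theta) ** 2)
var = theta_hat.var()                    # Var(theta_hat)
bias_sq = (theta_hat.mean() - theta) ** 2

# The shrunk estimator is biased, yet for these parameter values its MSE
# beats the unbiased sample mean, whose MSE is sigma^2 / n.
```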
Ridge Regression

Penalty for model complexity:

$$f(\beta, X, Y) = \sum_{i=1}^{N} \left( y_i - \beta_0 - \sum_{j=1}^{J} \beta_j x_{ij} \right)^2 + \underbrace{\lambda \sum_{j=1}^{J} \beta_j^2}_{\text{Penalty}}$$

where:
- $\beta_0$ = intercept
- $\lambda$ = penalty parameter
Ridge Regression Optimization

$$
\begin{aligned}
\hat{\beta}_{\text{Ridge}} &= \arg\min_{\beta} \{ f(\beta, X, Y) \} \\
&= \arg\min_{\beta} \left\{ \sum_{i=1}^{N} \left( y_i - \beta_0 - \sum_{j=1}^{J} \beta_j x_{ij} \right)^2 + \lambda \sum_{j=1}^{J} \beta_j^2 \right\} \\
&= \arg\min_{\beta} \left\{ (Y - X\beta)'(Y - X\beta) + \lambda \beta'\beta \right\} \\
&= (X'X + \lambda I_J)^{-1} X'Y
\end{aligned}
$$

Demean the data and set $\beta_0 = \bar{y} = \frac{\sum_{i=1}^{N} y_i}{N}$.
Ridge Regression Intuition (1)

Suppose $X'X = I_J$. Then:

$$\hat{\beta} = (X'X)^{-1} X'Y = X'Y$$

$$
\begin{aligned}
\hat{\beta}_{\text{ridge}} &= (X'X + \lambda I_J)^{-1} X'Y \\
&= (I_J + \lambda I_J)^{-1} X'Y \\
&= (I_J + \lambda I_J)^{-1} \hat{\beta}
\end{aligned}
$$

$$\hat{\beta}_j^{\text{Ridge}} = \frac{\hat{\beta}_j}{1 + \lambda}$$
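The shrinkage factor $1/(1 + \lambda)$ can be verified directly: build an $X$ with orthonormal columns via a QR decomposition (so $X'X = I_J$ exactly) and compare the ridge fit to OLS. Sizes and $\lambda$ are arbitrary.

```python
import numpy as np

# Orthonormal design: ridge is just OLS scaled by 1 / (1 + lambda).
rng = np.random.default_rng(4)
N, J, lam = 50, 5, 2.0
Q, _ = np.linalg.qr(rng.normal(size=(N, J)))  # columns satisfy Q'Q = I_J
Y = rng.normal(size=N)

beta_ols = Q.T @ Y                            # (X'X)^{-1} X'Y reduces to X'Y
beta_ridge = np.linalg.solve(Q.T @ Q + lam * np.eye(J), Q.T @ Y)
# beta_ridge should equal beta_ols / (1 + lam), coordinate by coordinate
```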
Ridge Regression Intuition (2)

Place a normal prior on each coefficient and a normal likelihood on each label:

$$\beta_j \sim \text{Normal}(0, \tau^2)$$

$$y_i \sim \text{Normal}(\beta_0 + \mathbf{x}_i'\beta, \sigma^2)$$

$$
\begin{aligned}
p(\beta | X, Y) &\propto \prod_{j=1}^{J} p(\beta_j) \prod_{i=1}^{N} p(y_i | \mathbf{x}_i, \beta) \\
&\propto \prod_{j=1}^{J} \frac{1}{\sqrt{2\pi}\tau} \exp\left( \frac{-\beta_j^2}{2\tau^2} \right) \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi}\sigma} \exp\left( -\frac{(y_i - \beta_0 - \mathbf{x}_i'\beta)^2}{2\sigma^2} \right)
\end{aligned}
$$
Ridge Regression Intuition (2)

$$\log p(\beta | X, Y) = -\sum_{j=1}^{J} \frac{\beta_j^2}{2\tau^2} - \sum_{i=1}^{N} \frac{(y_i - \beta_0 - \mathbf{x}_i'\beta)^2}{2\sigma^2} + \text{const}$$

$$-2\sigma^2 \log p(\beta | X, Y) = \sum_{i=1}^{N} (y_i - \beta_0 - \mathbf{x}_i'\beta)^2 + \sum_{j=1}^{J} \frac{\sigma^2}{\tau^2} \beta_j^2 + \text{const}$$

where:
- $\lambda = \frac{\sigma^2}{\tau^2}$

So the ridge objective is (up to a constant) the scaled negative log posterior: the ridge estimate is the MAP estimate under a normal prior on the coefficients.
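This equivalence can be checked numerically: with $\lambda = \sigma^2/\tau^2$, differences in $-2\sigma^2 \log p(\beta|X,Y)$ match differences in the ridge objective exactly, because the normalizing constants cancel. All data and variances below are illustrative.

```python
import numpy as np

# Check that -2 sigma^2 * log p(beta | X, Y) matches the ridge objective
# with lam = sigma^2 / tau^2, up to a constant; all numbers are made up.
rng = np.random.default_rng(5)
N, J = 30, 4
sigma, tau = 1.5, 0.5
lam = sigma**2 / tau**2

X = rng.normal(size=(N, J))
Y = rng.normal(size=N)
beta0 = Y.mean()

def log_posterior(b):
    # normal prior on each beta_j plus normal likelihood for each y_i
    # (the normalizing constant log p(X, Y) is dropped)
    lp = np.sum(-b**2 / (2 * tau**2) - 0.5 * np.log(2 * np.pi * tau**2))
    lp += np.sum(-(Y - beta0 - X @ b)**2 / (2 * sigma**2)
                 - 0.5 * np.log(2 * np.pi * sigma**2))
    return lp

def ridge_objective(b):
    return np.sum((Y - beta0 - X @ b)**2) + lam * np.sum(b**2)

# Differences in the scaled negative log posterior equal differences in
# the ridge objective: the additive constants cancel.
```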
Lasso Regression Objective Function/Optimization

Different penalty for model complexity:

$$f(\beta, X, Y) = \sum_{i=1}^{N} \left( y_i - \beta_0 - \sum_{j=1}^{J} \beta_j x_{ij} \right)^2 + \underbrace{\lambda \sum_{j=1}^{J} |\beta_j|}_{\text{Penalty}}$$

- Optimization is non-linear (absolute value)
- Coordinate descent:
  - Start with Ridge
  - Sub-differential, update steps
- Induces sparsity: sets some coefficients exactly to zero
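The coordinate-descent step is a soft-threshold applied to one coefficient at a time, which is where the exact zeros come from. A minimal sketch on made-up sparse data (it initializes at zero rather than at the ridge solution, and $\lambda$ is arbitrary):

```python
import numpy as np

# Minimal lasso coordinate descent via soft-thresholding (illustrative only).
rng = np.random.default_rng(6)
N, J, lam = 100, 20, 50.0
X = rng.normal(size=(N, J))
beta_true = np.zeros(J)
beta_true[:3] = [3.0, -2.0, 1.5]            # sparse truth: 3 active terms
y = X @ beta_true + rng.normal(size=N)

def soft_threshold(z, g):
    return np.sign(z) * max(abs(z) - g, 0.0)

beta = np.zeros(J)
for _ in range(100):                        # full sweeps over coordinates
    for j in range(J):
        r_j = y - X @ beta + X[:, j] * beta[j]   # residual excluding term j
        z = X[:, j] @ r_j
        # sub-differential solution for coordinate j of
        # sum (y - X beta)^2 + lam * sum |beta_j|
        beta[j] = soft_threshold(z, lam / 2) / (X[:, j] @ X[:, j])

n_zero = int(np.sum(beta == 0.0))           # exact zeros: induced sparsity
```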
Lasso Regression Objective Function/Optimization
Different Penalty for Model Complexity
f (β,X ,Y ) =N∑i=1
yi − β0 +J∑
j=1
βjxij
2
+ λ
J∑j=1
|βj |︸︷︷︸Penalty
- Optimization is non-linear (Absolute Value)
- Coordinate Descent- Start with Ridge- Sub-differential, update steps
- Induces sparsity sets some coefficients to zero
Justin Grimmer (Stanford University) Text as Data November 10th, 2014 10 / 40
![Page 54: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc15.pdf · = Var( ) + Bias2 To reduce MSE, we are willing to induce bias to decrease variance methods thatshrink coe](https://reader034.vdocuments.us/reader034/viewer/2022043023/5f3ec56d9b949b3cfa5bebea/html5/thumbnails/54.jpg)
Lasso Regression Intuition 1, Soft Thresholding

Suppose again X′X = I_J

f(β, X, Y) = (Y − Xβ)′(Y − Xβ) + λ Σ_{j=1}^{J} |β_j|
           = −2β′X′Y + β′β + λ Σ_{j=1}^{J} |β_j| + constant

The coefficient is

β_j^LASSO = sign(β̂_j) ( |β̂_j| − λ )₊

where β̂_j is the least squares estimate, and:

- sign(·) is 1 or −1
- ( |β̂_j| − λ )₊ = max(|β̂_j| − λ, 0)
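The soft-thresholding operator is a one-liner. A minimal Python sketch (illustrative, not the course's R code):

```python
def soft_threshold(beta_hat, lam):
    """sign(beta_hat) * (|beta_hat| - lam)+, the LASSO solution when X'X = I."""
    mag = max(abs(beta_hat) - lam, 0.0)
    if mag == 0.0:
        return 0.0
    return mag if beta_hat > 0 else -mag

print(soft_threshold(2.5, 1.0))   # 1.5: shrunk toward zero by lam
print(soft_threshold(-2.5, 1.0))  # -1.5
print(soft_threshold(0.4, 1.0))   # 0.0: small coefficients snap exactly to zero
```

Every coefficient is pulled toward zero by λ, and any coefficient smaller than λ in magnitude is set exactly to zero, which is where the sparsity comes from.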
Lasso Regression Intuition 1, Soft Thresholding

Compare soft assignment

β_j^LASSO = sign(β̂_j) ( |β̂_j| − λ )₊

with hard assignment, which selects the M biggest components:

β_j^subset = β̂_j · I( |β̂_j| ≥ |β̂_(M)| )

Intuition 2: a double exponential (Laplace) prior on the coefficients

Why does LASSO induce sparsity?
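The soft/hard contrast can be made concrete. A small Python sketch (illustrative names; not from the slides): soft thresholding shrinks the surviving coefficients, while hard (best-subset style) thresholding keeps them untouched.

```python
def soft(beta_hat, lam):
    """LASSO soft threshold: sign(b) * (|b| - lam)+."""
    mag = max(abs(beta_hat) - lam, 0.0)
    return 0.0 if mag == 0.0 else (mag if beta_hat > 0 else -mag)

def hard(beta_hats, M):
    """Hard assignment: keep the M largest |coefficients| as-is, zero the rest."""
    cutoff = sorted((abs(b) for b in beta_hats), reverse=True)[M - 1]
    return [b if abs(b) >= cutoff else 0.0 for b in beta_hats]

beta_hats = [3.0, -2.0, 0.5]
print([soft(b, 1.0) for b in beta_hats])  # [2.0, -1.0, 0.0]: survivors are shrunk
print(hard(beta_hats, 2))                 # [3.0, -2.0, 0.0]: survivors kept as-is
```

Both zero out small coefficients; only the soft version also shrinks the ones it keeps.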
Comparing Ridge and LASSO

[Figure: Ridge Regression (plot, axes from −4 to 4)]

[Figure: LASSO Regression (plot, axes from −4 to 4)]
Comparing Ridge and LASSO

Contrast β = (1/√2, 1/√2) and β = (1, 0)

Under ridge:

Σ_{j=1}^{2} β_j² = 1/2 + 1/2 = 1
Σ_{j=1}^{2} β_j² = 1 + 0 = 1

Under LASSO:

Σ_{j=1}^{2} |β_j| = 1/√2 + 1/√2 = √2
Σ_{j=1}^{2} |β_j| = 1 + 0 = 1

The ridge penalty treats the two vectors identically, while the LASSO penalty is larger for the dense vector (√2 > 1): LASSO prefers the sparse solution.
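The arithmetic above checks out numerically. A quick Python sketch (illustrative only):

```python
import math

dense = (1 / math.sqrt(2), 1 / math.sqrt(2))
sparse = (1.0, 0.0)

def ridge_penalty(beta):
    """Sum of squared coefficients (L2)."""
    return sum(b * b for b in beta)

def lasso_penalty(beta):
    """Sum of absolute coefficients (L1)."""
    return sum(abs(b) for b in beta)

print(round(ridge_penalty(dense), 9), ridge_penalty(sparse))  # 1.0 1.0
print(round(lasso_penalty(dense), 9), lasso_penalty(sparse))  # 1.414213562 1.0
```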
Selecting λ

How do we determine λ? Cross-validation (lecture on Thursday)

Applying the models gives a score (probability) of a document belonging to a class; threshold it to classify

To the R code!
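The cross-validation logic behind selecting λ can be sketched quickly (in Python rather than the course's R; the function and names are illustrative):

```python
def kfold_indices(n, k):
    """Split document indices 0..n-1 into k (near-)equal contiguous folds."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

# For each candidate lambda: hold out one fold, fit on the remaining k-1 folds,
# score the held-out fold, and pick the lambda with the best average held-out
# performance.
print([len(f) for f in kfold_indices(1000, 5)])  # [200, 200, 200, 200, 200]
```

In practice the folds would be shuffled before splitting so each fold is representative.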
Assessing Models (Elements of Statistical Learning)

- Model selection: tuning parameters to select the final model (next week's discussion)
- Model assessment: after selecting a model, estimating its error in classification
Comparing Training and Validation Set

Text classification and model assessment

- Replicate the classification exercise with the validation set
- General principle of classification/prediction
- Compare supervised learning labels to hand labels

Confusion matrix
Comparing Training and Validation Set

Representation of test statistics from dictionary week (along with some new ones):

                                     Actual Label
Classification (algorithm)     Liberal              Conservative
Liberal                        True Liberal         False Liberal
Conservative                   False Conservative   True Conservative

Accuracy = (TrueLib + TrueCons) / (TrueLib + TrueCons + FalseLib + FalseCons)

Precision_Liberal = True Liberal / (True Liberal + False Liberal)

Recall_Liberal = True Liberal / (True Liberal + False Conservative)

F_Liberal = 2 · Precision_Liberal · Recall_Liberal / (Precision_Liberal + Recall_Liberal)
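The four statistics fall directly out of the 2x2 table. A minimal Python sketch (the course uses R; names here are illustrative):

```python
def liberal_metrics(true_lib, false_lib, false_cons, true_cons):
    """Accuracy, precision, recall, and F for the liberal class."""
    total = true_lib + true_cons + false_lib + false_cons
    accuracy = (true_lib + true_cons) / total
    precision = true_lib / (true_lib + false_lib)
    recall = true_lib / (true_lib + false_cons)
    f = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f

# Hypothetical counts: 40 true liberal, 10 false liberal,
# 10 false conservative, 40 true conservative.
acc, prec, rec, f = liberal_metrics(40, 10, 10, 40)
print(round(acc, 2), round(prec, 2), round(rec, 2), round(f, 2))  # 0.8 0.8 0.8 0.8
```

Precision asks "of the documents the algorithm called liberal, how many were?"; recall asks "of the actually liberal documents, how many did it find?".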
ROC Curve

ROC as a measure of model performance

Recall_Liberal = True Liberal / (True Liberal + False Conservative)

Recall_Conservative = True Conservative / (True Conservative + False Liberal)

Tension:

- Classify everything liberal: Recall_Liberal = 1; Recall_Conservative = 0
- Classify everything conservative: Recall_Liberal = 0; Recall_Conservative = 1

Characterize the tradeoff: plot the True Positive Rate (Recall_Liberal) against the False Positive Rate (1 − Recall_Conservative)
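Sweeping the classification threshold traces out the curve. A small Python sketch (illustrative scores and names, not from the slides):

```python
def roc_points(scores, labels, thresholds):
    """(False Positive Rate, True Positive Rate) at each threshold; label 1 = liberal."""
    pos = sum(1 for y in labels if y == 1)
    neg = sum(1 for y in labels if y == -1)
    pts = []
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == -1)
        pts.append((fp / neg, tp / pos))
    return pts

scores = [0.9, 0.8, 0.4, 0.2]   # hypothetical model scores for four documents
labels = [1, 1, -1, -1]         # 1 = liberal, -1 = conservative
print(roc_points(scores, labels, [0.0, 0.5, 1.1]))
# [(1.0, 1.0), (0.0, 1.0), (0.0, 0.0)]: everything-liberal, a good cutoff, nothing-liberal
```

The extreme thresholds reproduce the tension above; intermediate thresholds fill in the curve between them.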
Precision/Recall Tradeoff

[Figure: ROC curve, True Positive Rate vs. False Positive Rate, both axes from 0 to 1]
Simple Classification Example

Analyzing House press releases. Hand code 1,000 press releases into:

- Advertising
- Credit Claiming
- Position Taking

Divide the 1,000 press releases into two sets

- 500: training set
- 500: test set

Initial exploration provides a baseline measurement of classifier performance; improve it by improving the model fit
Example from First Model Fit

                                          Actual Label
Classification (Naive Bayes)   Position Taking   Advertising   Credit Claim.
Position Taking                      10                0              0
Advertising                           2               40              2
Credit Claiming                      80               60            306

Accuracy = (10 + 40 + 306)/500 = 0.71

Precision_PT = 10/10 = 1

Recall_PT = 10/(10 + 2 + 80) = 0.11

Precision_AD = 40/(40 + 2 + 2) = 0.91

Recall_AD = 40/(40 + 60) = 0.4

Precision_Credit = 306/(306 + 80 + 60) = 0.69

Recall_Credit = 306/(306 + 2) = 0.99
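These per-category statistics generalize the 2x2 case: precision for a category divides its diagonal entry by its row sum, recall by its column sum. A Python sketch reproducing the table's numbers (illustrative; the course code is in R):

```python
# Rows: algorithm's classification; columns: actual label.
# Category order: Position Taking, Advertising, Credit Claiming.
confusion = [
    [10,  0,   0],   # classified Position Taking
    [ 2, 40,   2],   # classified Advertising
    [80, 60, 306],   # classified Credit Claiming
]

n = sum(sum(row) for row in confusion)
accuracy = sum(confusion[k][k] for k in range(3)) / n
precision = [confusion[k][k] / sum(confusion[k]) for k in range(3)]
recall = [confusion[k][k] / sum(row[k] for row in confusion) for k in range(3)]

print(round(accuracy, 2))                # 0.71
print([round(p, 2) for p in precision])  # [1.0, 0.91, 0.69]
print([round(r, 2) for r in recall])     # [0.11, 0.4, 0.99]
```

The pattern is typical of an unbalanced classifier: Credit Claiming absorbs most documents, giving it high recall but lower precision, while Position Taking has perfect precision and terrible recall.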
Fit Statistics in R

The RWeka library provides amazing functionality, and you can also easily code the statistics yourself
Support Vector Machines

Document i is a J × 1 vector of counts, x_i = (x_{1i}, x_{2i}, ..., x_{Ji})

Suppose we have two classes, C_1 and C_2:

- Y_i = 1 if i is in class 1
- Y_i = −1 if i is in class 2

Suppose the classes are separable:

- Draw a line between the groups
- Goal: identify the line in the middle
- Maximum margin
Support Vector Machines: Maximum Margin Classifier (Bishop 2006)

[Figure: scatter plot of two classes of points]
![Page 98: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc15.pdf · = Var( ) + Bias2 To reduce MSE, we are willing to induce bias to decrease variance methods thatshrink coe](https://reader034.vdocuments.us/reader034/viewer/2022043023/5f3ec56d9b949b3cfa5bebea/html5/thumbnails/98.jpg)
Support Vector Machines: Algebra (Bishop 2006)

Goal: create a score to classify:

s(x_i) = β'x_i + b

- β determines the orientation (slope) of the decision surface
- b determines its location (moves the surface up or down)
- If s(x_i) > 0 → class 1
- If s(x_i) < 0 → class 2
- |s(x_i)| / ||β|| = document i's distance from the decision surface (the margin)
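The scoring rule can be sketched in a few lines of R; the coefficients and document vector below are hypothetical, chosen only to illustrate the algebra:

```r
# Hypothetical coefficients for a 3-term vocabulary; not fitted to any data
beta <- c(0.5, -1.2, 2.0)  # orientation (slope) of the decision surface
b    <- -0.3               # offset: moves the surface up or down
x_i  <- c(1, 0, 1)         # one document's feature vector

s <- sum(beta * x_i) + b               # score s(x_i) = beta' x_i + b
class_i <- if (s > 0) 1 else 2         # sign of the score gives the class
margin  <- abs(s) / sqrt(sum(beta^2))  # distance from the decision surface
```

Here s = 2.2 > 0, so the document falls in class 1, at distance 2.2/||β|| from the surface.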
Support Vector Machines: Algebra (Bishop 2006)

Objective function: maximum margin.
min_i [ |s(x_i)| ] identifies the point closest to the decision surface. We want to identify β and b that maximize the margin:

arg max_{β,b} { (1/||β||) min_i [ |s(x_i)| ] } = arg max_{β,b} { (1/||β||) min_i [ |β'x_i + b| ] }

This is a constrained optimization problem, solved as a quadratic programming problem.
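A standard step (Bishop 2006, ch. 7) makes this tractable: rescaling (β, b) leaves the surface unchanged, so fix the closest point at |β'x_i + b| = 1. Maximizing 1/||β|| is then equivalent to:

```latex
% Canonical form of the maximum-margin problem after rescaling
\arg\min_{\beta,\, b}\; \tfrac{1}{2}\lVert \beta \rVert^{2}
\quad \text{subject to} \quad y_i\,(\beta' x_i + b) \ge 1,\quad i = 1, \dots, N
```

This is the quadratic program that SVM software actually solves.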
What About Overlap? (Bishop 2006)

- It is rare that the classes are perfectly separable.
- Define slack variables:

ξ_i = 0 if document i is correctly classified
ξ_i = |s(x_i)| if document i is incorrectly classified

Tradeoff:

- Maximize the margin between the correctly classified groups
- Minimize the error from misclassified documents

arg max_{β,b} { (1/||β||) min_i [ |β'x_i + b| ] − C Σ_{i=1}^{N} ξ_i }

C captures the tradeoff (the slack term enters with a minus sign so that misclassification is penalized; equivalently, Bishop 2006 minimizes C Σ_i ξ_i + (1/2)||β||²).
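A minimal R sketch of the slack variables, assuming a hypothetical already-fitted surface (β, b) and labels coded y_i ∈ {−1, 1}:

```r
# Hypothetical fitted surface; the three documents are made-up examples
beta <- c(1, -1)
b    <- 0
X <- rbind(c(2, 0), c(0, 2), c(1, 2))  # rows are document vectors
y <- c(1, -1, 1)                       # true labels in {-1, 1}

s  <- as.vector(X %*% beta + b)        # scores s(x_i)
xi <- ifelse(y * s > 0, 0, abs(s))     # xi = 0 if correct, |s(x_i)| if not
total_slack <- sum(xi)                 # the term the objective weights by C
```

Only the third document is misclassified here, so it alone contributes slack; a larger C punishes such errors more heavily relative to the margin term.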
How to Handle Multiple Comparisons?

- It is rare that we want to classify only two categories.
- How do we handle classification into K groups?

1) Set up K one-versus-rest classification problems:
- Compare each class to all other classes
- Problem: can lead to inconsistent results
- Solution(?): select the category with the largest “score”
- Problem: the scores are not on comparable scales
2) Common solution: set up K(K − 1)/2 one-versus-one classifications
- Take a vote to select the class (still suboptimal)
3) Simultaneous estimation is possible, but much slower
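The pairwise setup can be sketched in R; the vote vector below stands in for hypothetical winners from the K(K − 1)/2 classifiers:

```r
K <- 5
n_pairwise <- K * (K - 1) / 2          # 10 one-versus-one classifiers

# Hypothetical winning class from each of the 10 pairwise classifiers
votes <- c(2, 2, 2, 1, 3, 3, 1, 2, 2, 3)
predicted <- as.integer(names(which.max(table(votes))))  # majority vote
```

Note that `which.max` breaks ties by taking the first maximum, one illustration of why the vote is still suboptimal.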
R Code to Run SVMs

library(e1071)
fit <- svm(T ~ ., data = as.data.frame(tdm), type = 'C-classification',
           kernel = 'linear')

where: type = 'C-classification' → classification;
kernel → allows for distortion of the feature space. Options:

- linear
- polynomial
- radial
- sigmoid

preds <- predict(fit, newdata = as.data.frame(tdm[-c(1:no.train), ]))
Example of SVMs in Political Science Research

Hillard, Purpura, and Wilkerson: SVMs to code topics/subtopics for the Policy Agendas Project.

SVMs are underutilized in political science.
ReadMe: Optimization for a Different Goal (Hopkins and King 2010)

Naive Bayes (and, next week, SVMs) focus on classifying individual documents. But what if we care only about the category proportions?
Hopkins and King (2010): a method for characterizing the distribution of classes directly. It can be much more accurate than individual classifiers and requires fewer assumptions (the labeled documents need not be a random sample).

- King and Lu (2008): derive a method for characterizing causes of death from verbal autopsies
- Hopkins and King (2010): extend the method to text documents

Basic intuition:

- Examine the joint distribution of word characteristics (without making a Naive Bayes-like independence assumption)
- Focusing only on distributions is what makes this analysis possible
ReadMe: Optimization for a Different Goal (Hopkins and King 2010)

Measure only the presence/absence of each term, giving a (J × 1) vector:

x_i = (1, 0, 0, 1, . . . , 0)

What are the possible realizations of x_i? There are 2^J possible vectors.

Define:

- P(x) = probability of observing x
- P(x|C_j) = probability of observing x conditional on category C_j
- P(X|C) = the matrix collecting these conditional probability vectors
- P(C) = (P(C_1), P(C_2), . . . , P(C_K)) = the target quantity of interest
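The mechanics can be sketched with toy numbers in R: given P(X|C) estimated from labeled documents and P(X) observed in the unlabeled set, solve P(X) = P(X|C) P(C) for P(C) by least squares. All numbers below are made up, and the vocabulary is tiny so the word-profile rows are enumerable:

```r
# Toy version: 3 word-stem profiles (rows), 2 categories (columns);
# each column of P(X|C) sums to 1 over the profiles
P_X_given_C <- rbind(c(0.6, 0.1),
                     c(0.3, 0.2),
                     c(0.1, 0.7))
P_C_true <- c(0.25, 0.75)            # true unlabeled-set proportions

P_X <- P_X_given_C %*% P_C_true      # what we observe in the unlabeled set

# Least-squares estimate of P(C) from P(X) = P(X|C) P(C)
P_C_hat <- as.vector(qr.solve(P_X_given_C, P_X))
```

In this noiseless toy case the least-squares step recovers the true proportions exactly; in practice P(X|C) is estimated from the labeled set, so the solution is approximate.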
ReadMe: Optimization for a Different Goal (Hopkins and King 2010)
P(x)    =    P(x|C)    ×    P(C)
(2^J × 1)   (2^J × K)      (K × 1)

A matrix algebra problem to solve for P(C). Like Naive Bayes, this requires two pieces to estimate. Complication: 2^J >> the number of documents, so use kernel smoothing methods (without a formal model):
- P(x) = estimate directly from test set
- P(x |C ) = estimate from training set
- Key assumption: P(x|C) in the training set is equivalent to P(x|C) in the test set
- If that holds, we can perform biased sampling of documents and worry less about drift...
Algorithm Summarized
- Estimate P(x) from the test set
- Estimate P(x|C) from the training set
- Use P(x) and P(x|C) to solve for P(C)
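The three steps above can be sketched in Python on simulated data. The term-occurrence rates, sample sizes, and the plain least-squares solve are illustrative assumptions only; the actual ReadMe implementation uses kernel smoothing over random subsets of terms rather than the full 2^J profile set.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
J, K = 4, 2                                           # 4 terms, 2 categories
profiles = list(itertools.product([0, 1], repeat=J))  # all 2^J presence profiles
index = {p: i for i, p in enumerate(profiles)}

# Hypothetical per-category word-occurrence rates (assumed for the demo)
rates = np.array([[0.8, 0.6, 0.2, 0.1],   # category 1
                  [0.1, 0.3, 0.7, 0.9]])  # category 2

def draw(category, n):
    """Draw n binary term-presence vectors from a category's rates."""
    return (rng.random((n, J)) < rates[category]).astype(int)

def profile_dist(docs):
    """Empirical distribution over the 2^J presence/absence profiles."""
    counts = np.zeros(len(profiles))
    for row in docs:
        counts[index[tuple(row)]] += 1
    return counts / counts.sum()

# Training set: its category mix need not match the test set (biased sampling is fine)
train = {c: draw(c, 2000) for c in range(K)}
# Test set with true proportions P(C) = (0.3, 0.7)
true_pc = np.array([0.3, 0.7])
test = np.vstack([draw(c, int(5000 * p)) for c, p in enumerate(true_pc)])

A = np.column_stack([profile_dist(train[c]) for c in range(K)])  # estimate of P(x|C)
b = profile_dist(test)                                           # estimate of P(x)

# Solve P(x) = P(x|C) P(C) by least squares, then project onto the simplex
est, *_ = np.linalg.lstsq(A, b, rcond=None)
est = np.clip(est, 0, None)
est /= est.sum()
print(est)  # should be close to (0.3, 0.7)
```

Note that no individual document is ever classified: only the two profile distributions enter the solve.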
Assessing Model Performance
Not classifying individual documents → different standards.

Mean Square Error:

E[(θ̂ − θ)^2] = Var(θ̂) + Bias(θ̂, θ)^2

Suppose we have the true proportions P(C)^true. Then we estimate the Root Mean Square Error and the Mean Absolute Prediction Error:

RMSE = √( (1/K) Σ_{j=1}^{K} (P(C_j)^true − P̂(C_j))^2 )

Mean Abs. Prediction Error = (1/K) Σ_{j=1}^{K} |P(C_j)^true − P̂(C_j)|
Visualize: plot true and estimated proportions
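Both error measures are direct computations on the true and estimated proportion vectors; a minimal sketch (the proportions below are made up for illustration):

```python
import numpy as np

# Hypothetical true and estimated category proportions (K = 4 categories)
true_pc = np.array([0.10, 0.25, 0.40, 0.25])
est_pc  = np.array([0.12, 0.22, 0.41, 0.25])

rmse = np.sqrt(np.mean((true_pc - est_pc) ** 2))  # root mean square error
mape = np.mean(np.abs(true_pc - est_pc))          # mean absolute prediction error
print(rmse, mape)
```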
Using the House Press Release Data
Method        RMSE    APSE
------        ----    ----
ReadMe        0.036   0.056
Naive Bayes   0.096   0.14
SVM           0.052   0.084
Code to Run in R
Control file (one row per document):

filename                truth  trainingset
20July2009LEWIS53.txt     4        1
26July2006LEWIS249.txt    2        0

tdm <- undergrad(control = control, fullfreq = FALSE)
process <- preprocess(tdm)
output <- readme(process)

output$est.CSMF   ## estimated proportion in each category
output$true.CSMF  ## available if the validation set is labeled (but not used in training)
Model Selection!