Text as Data
Justin Grimmer
Associate Professor, Department of Political Science
Stanford University
November 10th, 2014
Justin Grimmer (Stanford University) Text as Data November 10th, 2014 1 / 40
Supervised Learning Methods
1) Task
   - Classify documents into pre-existing categories
   - Measure the proportion of documents in each category
2) Objective function
   1) Penalized regressions
      - Ridge regression
      - LASSO regression
   2) Classification surface: Support Vector Machines
   3) Measure proportions: Naive Bayes(ish) objective
3) Optimization
   - Depends on method
4) Validation
   - Obtain predicted fit for new data $f(X_i, \theta)$
   - Examine prediction performance: compare classification to gold standard
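The four steps can be sketched end to end with a toy penalized regression. This is only a hedged illustration: the simulated document-term matrix, the ridge penalty $\lambda = 1$, and the train/test split are all made-up assumptions, not the lecture's data.

```python
import numpy as np

# Toy stand-in for a document-term matrix with labels in {-1, +1}.
rng = np.random.default_rng(0)
N, J = 200, 50
beta_true = rng.normal(size=J)
X = rng.normal(size=(N, J))
y = np.sign(X @ beta_true + rng.normal(size=N))

# 1) Task: classify documents into pre-existing categories.
X_train, X_new = X[:150], X[150:]
y_train, y_new = y[:150], y[150:]

# 2) Objective: penalized least squares (ridge); lambda = 1 is arbitrary.
lam = 1.0

# 3) Optimization: closed form, (X'X + lam * I)^{-1} X'Y.
beta_hat = np.linalg.solve(X_train.T @ X_train + lam * np.eye(J),
                           X_train.T @ y_train)

# 4) Validation: predicted fit f(X_i, beta_hat) on new documents,
#    compared against the gold-standard labels.
pred = np.sign(X_new @ beta_hat)
accuracy = (pred == y_new).mean()
```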
Regression models

Suppose we have $N$ documents, with each document $i$ having label $y_i \in \{-1, 1\}$ ({liberal, conservative}). We represent each document $i$ as $\mathbf{x}_i = (x_{i1}, x_{i2}, \ldots, x_{iJ})$.

$$f(\beta, X, Y) = \sum_{i=1}^{N} (y_i - \beta'\mathbf{x}_i)^2$$

$$\hat{\beta} = \arg\min_{\beta} \left\{ \sum_{i=1}^{N} (y_i - \beta'\mathbf{x}_i)^2 \right\} = (X'X)^{-1} X'Y$$

Problem:
- $J$ will likely be large (perhaps $J > N$)
- There are many correlated variables

Predictions will be variable
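The problem is concrete: as soon as $J > N$, $X'X$ is rank-deficient, so $(X'X)^{-1}$ does not exist and infinitely many coefficient vectors fit the training data equally well. A quick numerical sketch (dimensions are arbitrary):

```python
import numpy as np

# With more terms than documents, the normal equations break down.
rng = np.random.default_rng(1)
N, J = 20, 100                     # J > N: more features than observations
X = rng.normal(size=(N, J))
XtX = X.T @ X                      # J x J matrix, but rank at most N

rank = np.linalg.matrix_rank(XtX)  # rank N = 20, far short of J = 100,
                                   # so XtX is singular and OLS is not unique
```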
Mean Square Error

Suppose $\theta$ is the true parameter value and $\hat{\theta}$ an estimator of it.

Bias:

$$\text{Bias} = E[\hat{\theta}] - \theta$$

We may care about average squared distance from the truth:

$$
\begin{aligned}
E[(\hat{\theta} - \theta)^2] &= E[\hat{\theta}^2] - 2\theta E[\hat{\theta}] + \theta^2 \\
&= E[\hat{\theta}^2] - E[\hat{\theta}]^2 + E[\hat{\theta}]^2 - 2\theta E[\hat{\theta}] + \theta^2 \\
&= E[\hat{\theta}^2] - E[\hat{\theta}]^2 + (E[\hat{\theta}] - \theta)^2 \\
&= \text{Var}(\hat{\theta}) + \text{Bias}^2
\end{aligned}
$$

To reduce MSE, we are willing to induce bias to decrease variance: methods that shrink coefficients toward zero
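The decomposition is easy to confirm by simulation. Below, a hypothetical shrinkage estimator $\hat{\theta} = c \cdot \bar{x}$ with $c < 1$ accepts some bias in exchange for lower variance; every constant is illustrative.

```python
import numpy as np

# Monte Carlo check that E[(theta_hat - theta)^2] = Var + Bias^2.
rng = np.random.default_rng(2)
theta, sigma, n, c = 2.0, 3.0, 10, 0.8   # made-up truth, noise, sample size
reps = 200_000

draws = rng.normal(theta, sigma, size=(reps, n))
theta_hat = c * draws.mean(axis=1)       # shrunk sample mean

mse = np.mean((theta_hat - theta) ** 2)
var = theta_hat.var()                    # Var(theta_hat)
bias_sq = (theta_hat.mean() - theta) ** 2

# The shrunk estimator is biased, yet for these parameter values its MSE
# beats the unbiased sample mean, whose MSE is sigma^2 / n.
```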
Ridge Regression

Penalty for model complexity:

$$f(\beta, X, Y) = \sum_{i=1}^{N} \left( y_i - \beta_0 - \sum_{j=1}^{J} \beta_j x_{ij} \right)^2 + \underbrace{\lambda \sum_{j=1}^{J} \beta_j^2}_{\text{Penalty}}$$

where:
- $\beta_0$ = intercept
- $\lambda$ = penalty parameter
Ridge Regression Optimization

$$
\begin{aligned}
\hat{\beta}_{\text{Ridge}} &= \arg\min_{\beta} \{ f(\beta, X, Y) \} \\
&= \arg\min_{\beta} \left\{ \sum_{i=1}^{N} \left( y_i - \beta_0 - \sum_{j=1}^{J} \beta_j x_{ij} \right)^2 + \lambda \sum_{j=1}^{J} \beta_j^2 \right\} \\
&= \arg\min_{\beta} \left\{ (Y - X\beta)'(Y - X\beta) + \lambda \beta'\beta \right\} \\
&= (X'X + \lambda I_J)^{-1} X'Y
\end{aligned}
$$

Demean the data and set $\beta_0 = \bar{y} = \frac{\sum_{i=1}^{N} y_i}{N}$.
Ridge Regression Intuition (1)

Suppose $X'X = I_J$. Then:

$$\hat{\beta} = (X'X)^{-1} X'Y = X'Y$$

$$
\begin{aligned}
\hat{\beta}_{\text{ridge}} &= (X'X + \lambda I_J)^{-1} X'Y \\
&= (I_J + \lambda I_J)^{-1} X'Y \\
&= (I_J + \lambda I_J)^{-1} \hat{\beta}
\end{aligned}
$$

$$\hat{\beta}_j^{\text{Ridge}} = \frac{\hat{\beta}_j}{1 + \lambda}$$
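The shrinkage factor $1/(1 + \lambda)$ can be verified directly: build an $X$ with orthonormal columns via a QR decomposition (so $X'X = I_J$ exactly) and compare the ridge fit to OLS. Sizes and $\lambda$ are arbitrary.

```python
import numpy as np

# Orthonormal design: ridge is just OLS scaled by 1 / (1 + lambda).
rng = np.random.default_rng(4)
N, J, lam = 50, 5, 2.0
Q, _ = np.linalg.qr(rng.normal(size=(N, J)))  # columns satisfy Q'Q = I_J
Y = rng.normal(size=N)

beta_ols = Q.T @ Y                            # (X'X)^{-1} X'Y reduces to X'Y
beta_ridge = np.linalg.solve(Q.T @ Q + lam * np.eye(J), Q.T @ Y)
# beta_ridge should equal beta_ols / (1 + lam), coordinate by coordinate
```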
Ridge Regression Intuition (2)

Place a normal prior on each coefficient and a normal likelihood on each label:

$$\beta_j \sim \text{Normal}(0, \tau^2)$$

$$y_i \sim \text{Normal}(\beta_0 + \mathbf{x}_i'\beta, \sigma^2)$$

$$
\begin{aligned}
p(\beta | X, Y) &\propto \prod_{j=1}^{J} p(\beta_j) \prod_{i=1}^{N} p(y_i | \mathbf{x}_i, \beta) \\
&\propto \prod_{j=1}^{J} \frac{1}{\sqrt{2\pi}\tau} \exp\left( \frac{-\beta_j^2}{2\tau^2} \right) \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi}\sigma} \exp\left( -\frac{(y_i - \beta_0 - \mathbf{x}_i'\beta)^2}{2\sigma^2} \right)
\end{aligned}
$$
Ridge Regression Intuition (2)

$$\log p(\beta | X, Y) = -\sum_{j=1}^{J} \frac{\beta_j^2}{2\tau^2} - \sum_{i=1}^{N} \frac{(y_i - \beta_0 - \mathbf{x}_i'\beta)^2}{2\sigma^2} + \text{const}$$

$$-2\sigma^2 \log p(\beta | X, Y) = \sum_{i=1}^{N} (y_i - \beta_0 - \mathbf{x}_i'\beta)^2 + \sum_{j=1}^{J} \frac{\sigma^2}{\tau^2} \beta_j^2 + \text{const}$$

where:
- $\lambda = \frac{\sigma^2}{\tau^2}$

So the ridge objective is (up to a constant) the scaled negative log posterior: the ridge estimate is the MAP estimate under a normal prior on the coefficients.
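This equivalence can be checked numerically: with $\lambda = \sigma^2/\tau^2$, differences in $-2\sigma^2 \log p(\beta|X,Y)$ match differences in the ridge objective exactly, because the normalizing constants cancel. All data and variances below are illustrative.

```python
import numpy as np

# Check that -2 sigma^2 * log p(beta | X, Y) matches the ridge objective
# with lam = sigma^2 / tau^2, up to a constant; all numbers are made up.
rng = np.random.default_rng(5)
N, J = 30, 4
sigma, tau = 1.5, 0.5
lam = sigma**2 / tau**2

X = rng.normal(size=(N, J))
Y = rng.normal(size=N)
beta0 = Y.mean()

def log_posterior(b):
    # normal prior on each beta_j plus normal likelihood for each y_i
    # (the normalizing constant log p(X, Y) is dropped)
    lp = np.sum(-b**2 / (2 * tau**2) - 0.5 * np.log(2 * np.pi * tau**2))
    lp += np.sum(-(Y - beta0 - X @ b)**2 / (2 * sigma**2)
                 - 0.5 * np.log(2 * np.pi * sigma**2))
    return lp

def ridge_objective(b):
    return np.sum((Y - beta0 - X @ b)**2) + lam * np.sum(b**2)

# Differences in the scaled negative log posterior equal differences in
# the ridge objective: the additive constants cancel.
```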
Lasso Regression Objective Function/Optimization

Different penalty for model complexity:

$$f(\beta, X, Y) = \sum_{i=1}^{N} \left( y_i - \beta_0 - \sum_{j=1}^{J} \beta_j x_{ij} \right)^2 + \underbrace{\lambda \sum_{j=1}^{J} |\beta_j|}_{\text{Penalty}}$$

- Optimization is non-linear (absolute value)
- Coordinate descent:
  - Start with Ridge
  - Sub-differential, update steps
- Induces sparsity: sets some coefficients exactly to zero
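The coordinate-descent step is a soft-threshold applied to one coefficient at a time, which is where the exact zeros come from. A minimal sketch on made-up sparse data (it initializes at zero rather than at the ridge solution, and $\lambda$ is arbitrary):

```python
import numpy as np

# Minimal lasso coordinate descent via soft-thresholding (illustrative only).
rng = np.random.default_rng(6)
N, J, lam = 100, 20, 50.0
X = rng.normal(size=(N, J))
beta_true = np.zeros(J)
beta_true[:3] = [3.0, -2.0, 1.5]            # sparse truth: 3 active terms
y = X @ beta_true + rng.normal(size=N)

def soft_threshold(z, g):
    return np.sign(z) * max(abs(z) - g, 0.0)

beta = np.zeros(J)
for _ in range(100):                        # full sweeps over coordinates
    for j in range(J):
        r_j = y - X @ beta + X[:, j] * beta[j]   # residual excluding term j
        z = X[:, j] @ r_j
        # sub-differential solution for coordinate j of
        # sum (y - X beta)^2 + lam * sum |beta_j|
        beta[j] = soft_threshold(z, lam / 2) / (X[:, j] @ X[:, j])

n_zero = int(np.sum(beta == 0.0))           # exact zeros: induced sparsity
```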
Lasso Regression Objective Function/Optimization
Different Penalty for Model Complexity
f (β,X ,Y ) =N∑i=1
yi − β0 +J∑
j=1
βjxij
2
+ λ
J∑j=1
|βj |︸︷︷︸Penalty
- Optimization is non-linear (Absolute Value)
- Coordinate Descent- Start with Ridge- Sub-differential, update steps
- Induces sparsity sets some coefficients to zero
Justin Grimmer (Stanford University) Text as Data November 10th, 2014 10 / 40
![Page 54: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc15.pdf · = Var( ) + Bias2 To reduce MSE, we are willing to induce bias to decrease variance methods thatshrink coe](https://reader034.vdocuments.us/reader034/viewer/2022043023/5f3ec56d9b949b3cfa5bebea/html5/thumbnails/54.jpg)
Lasso Regression Intuition 1, Soft Thresholding

Suppose again X′X = I_J

f(β, X, Y) = (Y − Xβ)′(Y − Xβ) + λ Σ_{j=1}^{J} |β_j|
           = −2β′X′Y + β′β + λ Σ_{j=1}^{J} |β_j| + constant

The coefficient is

β_j^LASSO = sign(β̂_j) ( |β̂_j| − λ )₊

where β̂_j is the least squares estimate, and:

- sign(·) is 1 or −1
- ( |β̂_j| − λ )₊ = max(|β̂_j| − λ, 0)
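The soft-thresholding operator is a one-liner. A minimal Python sketch (illustrative, not the course's R code):

```python
def soft_threshold(beta_hat, lam):
    """sign(beta_hat) * (|beta_hat| - lam)+, the LASSO solution when X'X = I."""
    mag = max(abs(beta_hat) - lam, 0.0)
    if mag == 0.0:
        return 0.0
    return mag if beta_hat > 0 else -mag

print(soft_threshold(2.5, 1.0))   # 1.5: shrunk toward zero by lam
print(soft_threshold(-2.5, 1.0))  # -1.5
print(soft_threshold(0.4, 1.0))   # 0.0: small coefficients snap exactly to zero
```

Every coefficient is pulled toward zero by λ, and any coefficient smaller than λ in magnitude is set exactly to zero, which is where the sparsity comes from.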
Lasso Regression Intuition 1, Soft Thresholding

Compare soft assignment

β_j^LASSO = sign(β̂_j) ( |β̂_j| − λ )₊

with hard assignment, which selects the M biggest components:

β_j^subset = β̂_j · I( |β̂_j| ≥ |β̂_(M)| )

Intuition 2: a double exponential (Laplace) prior on the coefficients

Why does LASSO induce sparsity?
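The soft/hard contrast can be made concrete. A small Python sketch (illustrative names; not from the slides): soft thresholding shrinks the surviving coefficients, while hard (best-subset style) thresholding keeps them untouched.

```python
def soft(beta_hat, lam):
    """LASSO soft threshold: sign(b) * (|b| - lam)+."""
    mag = max(abs(beta_hat) - lam, 0.0)
    return 0.0 if mag == 0.0 else (mag if beta_hat > 0 else -mag)

def hard(beta_hats, M):
    """Hard assignment: keep the M largest |coefficients| as-is, zero the rest."""
    cutoff = sorted((abs(b) for b in beta_hats), reverse=True)[M - 1]
    return [b if abs(b) >= cutoff else 0.0 for b in beta_hats]

beta_hats = [3.0, -2.0, 0.5]
print([soft(b, 1.0) for b in beta_hats])  # [2.0, -1.0, 0.0]: survivors are shrunk
print(hard(beta_hats, 2))                 # [3.0, -2.0, 0.0]: survivors kept as-is
```

Both zero out small coefficients; only the soft version also shrinks the ones it keeps.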
Comparing Ridge and LASSO

[Figure: Ridge Regression (plot, axes from −4 to 4)]

[Figure: LASSO Regression (plot, axes from −4 to 4)]
Comparing Ridge and LASSO

Contrast β = (1/√2, 1/√2) and β = (1, 0)

Under ridge:

Σ_{j=1}^{2} β_j² = 1/2 + 1/2 = 1
Σ_{j=1}^{2} β_j² = 1 + 0 = 1

Under LASSO:

Σ_{j=1}^{2} |β_j| = 1/√2 + 1/√2 = √2
Σ_{j=1}^{2} |β_j| = 1 + 0 = 1

The ridge penalty treats the two vectors identically, while the LASSO penalty is larger for the dense vector (√2 > 1): LASSO prefers the sparse solution.
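The arithmetic above checks out numerically. A quick Python sketch (illustrative only):

```python
import math

dense = (1 / math.sqrt(2), 1 / math.sqrt(2))
sparse = (1.0, 0.0)

def ridge_penalty(beta):
    """Sum of squared coefficients (L2)."""
    return sum(b * b for b in beta)

def lasso_penalty(beta):
    """Sum of absolute coefficients (L1)."""
    return sum(abs(b) for b in beta)

print(round(ridge_penalty(dense), 9), ridge_penalty(sparse))  # 1.0 1.0
print(round(lasso_penalty(dense), 9), lasso_penalty(sparse))  # 1.414213562 1.0
```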
Selecting λ

How do we determine λ? Cross-validation (lecture on Thursday)

Applying the models gives a score (probability) of a document belonging to a class; threshold it to classify

To the R code!
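The cross-validation logic behind selecting λ can be sketched quickly (in Python rather than the course's R; the function and names are illustrative):

```python
def kfold_indices(n, k):
    """Split document indices 0..n-1 into k (near-)equal contiguous folds."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

# For each candidate lambda: hold out one fold, fit on the remaining k-1 folds,
# score the held-out fold, and pick the lambda with the best average held-out
# performance.
print([len(f) for f in kfold_indices(1000, 5)])  # [200, 200, 200, 200, 200]
```

In practice the folds would be shuffled before splitting so each fold is representative.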
Assessing Models (Elements of Statistical Learning)

- Model selection: tuning parameters to select the final model (next week's discussion)
- Model assessment: after selecting a model, estimating its error in classification
Comparing Training and Validation Set

Text classification and model assessment

- Replicate the classification exercise with the validation set
- General principle of classification/prediction
- Compare supervised learning labels to hand labels

Confusion matrix
Comparing Training and Validation Set

Representation of test statistics from dictionary week (along with some new ones):

                                     Actual Label
Classification (algorithm)     Liberal              Conservative
Liberal                        True Liberal         False Liberal
Conservative                   False Conservative   True Conservative

Accuracy = (TrueLib + TrueCons) / (TrueLib + TrueCons + FalseLib + FalseCons)

Precision_Liberal = True Liberal / (True Liberal + False Liberal)

Recall_Liberal = True Liberal / (True Liberal + False Conservative)

F_Liberal = 2 · Precision_Liberal · Recall_Liberal / (Precision_Liberal + Recall_Liberal)
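The four statistics fall directly out of the 2x2 table. A minimal Python sketch (the course uses R; names here are illustrative):

```python
def liberal_metrics(true_lib, false_lib, false_cons, true_cons):
    """Accuracy, precision, recall, and F for the liberal class."""
    total = true_lib + true_cons + false_lib + false_cons
    accuracy = (true_lib + true_cons) / total
    precision = true_lib / (true_lib + false_lib)
    recall = true_lib / (true_lib + false_cons)
    f = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f

# Hypothetical counts: 40 true liberal, 10 false liberal,
# 10 false conservative, 40 true conservative.
acc, prec, rec, f = liberal_metrics(40, 10, 10, 40)
print(round(acc, 2), round(prec, 2), round(rec, 2), round(f, 2))  # 0.8 0.8 0.8 0.8
```

Precision asks "of the documents the algorithm called liberal, how many were?"; recall asks "of the actually liberal documents, how many did it find?".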
ROC Curve

ROC as a measure of model performance

Recall_Liberal = True Liberal / (True Liberal + False Conservative)

Recall_Conservative = True Conservative / (True Conservative + False Liberal)

Tension:

- Classify everything liberal: Recall_Liberal = 1; Recall_Conservative = 0
- Classify everything conservative: Recall_Liberal = 0; Recall_Conservative = 1

Characterize the tradeoff: plot the True Positive Rate (Recall_Liberal) against the False Positive Rate (1 − Recall_Conservative)
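Sweeping the classification threshold traces out the curve. A small Python sketch (illustrative scores and names, not from the slides):

```python
def roc_points(scores, labels, thresholds):
    """(False Positive Rate, True Positive Rate) at each threshold; label 1 = liberal."""
    pos = sum(1 for y in labels if y == 1)
    neg = sum(1 for y in labels if y == -1)
    pts = []
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == -1)
        pts.append((fp / neg, tp / pos))
    return pts

scores = [0.9, 0.8, 0.4, 0.2]   # hypothetical model scores for four documents
labels = [1, 1, -1, -1]         # 1 = liberal, -1 = conservative
print(roc_points(scores, labels, [0.0, 0.5, 1.1]))
# [(1.0, 1.0), (0.0, 1.0), (0.0, 0.0)]: everything-liberal, a good cutoff, nothing-liberal
```

The extreme thresholds reproduce the tension above; intermediate thresholds fill in the curve between them.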
Precision/Recall Tradeoff

[Figure: ROC curve, True Positive Rate vs. False Positive Rate, both axes from 0 to 1]
Simple Classification Example

Analyzing House press releases. Hand code 1,000 press releases into:

- Advertising
- Credit Claiming
- Position Taking

Divide the 1,000 press releases into two sets

- 500: training set
- 500: test set

Initial exploration provides a baseline measurement of classifier performance; improve it by improving the model fit
Example from First Model Fit

                                          Actual Label
Classification (Naive Bayes)   Position Taking   Advertising   Credit Claim.
Position Taking                      10                0              0
Advertising                           2               40              2
Credit Claiming                      80               60            306

Accuracy = (10 + 40 + 306)/500 = 0.71

Precision_PT = 10/10 = 1

Recall_PT = 10/(10 + 2 + 80) = 0.11

Precision_AD = 40/(40 + 2 + 2) = 0.91

Recall_AD = 40/(40 + 60) = 0.4

Precision_Credit = 306/(306 + 80 + 60) = 0.69

Recall_Credit = 306/(306 + 2) = 0.99
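These per-category statistics generalize the 2x2 case: precision for a category divides its diagonal entry by its row sum, recall by its column sum. A Python sketch reproducing the table's numbers (illustrative; the course code is in R):

```python
# Rows: algorithm's classification; columns: actual label.
# Category order: Position Taking, Advertising, Credit Claiming.
confusion = [
    [10,  0,   0],   # classified Position Taking
    [ 2, 40,   2],   # classified Advertising
    [80, 60, 306],   # classified Credit Claiming
]

n = sum(sum(row) for row in confusion)
accuracy = sum(confusion[k][k] for k in range(3)) / n
precision = [confusion[k][k] / sum(confusion[k]) for k in range(3)]
recall = [confusion[k][k] / sum(row[k] for row in confusion) for k in range(3)]

print(round(accuracy, 2))                # 0.71
print([round(p, 2) for p in precision])  # [1.0, 0.91, 0.69]
print([round(r, 2) for r in recall])     # [0.11, 0.4, 0.99]
```

The pattern is typical of an unbalanced classifier: Credit Claiming absorbs most documents, giving it high recall but lower precision, while Position Taking has perfect precision and terrible recall.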
Fit Statistics in R

The RWeka library provides amazing functionality, and you can also easily code the statistics yourself
Support Vector Machines

Document i is a J × 1 vector of counts, x_i = (x_{1i}, x_{2i}, ..., x_{Ji})

Suppose we have two classes, C_1 and C_2:

- Y_i = 1 if i is in class 1
- Y_i = −1 if i is in class 2

Suppose the classes are separable:

- Draw a line between the groups
- Goal: identify the line in the middle
- Maximum margin
Support Vector Machines: Maximum Margin Classifier (Bishop 2006)

[Figure: scatter plot of two classes of points]
![Page 98: Text as Data - Stanford Universitystanford.edu/~jgrimmer/Text14/tc15.pdf · = Var( ) + Bias2 To reduce MSE, we are willing to induce bias to decrease variance methods thatshrink coe](https://reader034.vdocuments.us/reader034/viewer/2022043023/5f3ec56d9b949b3cfa5bebea/html5/thumbnails/98.jpg)
Support Vector Machines: Algebra (Bishop 2006)

Goal: create a score to classify:

s(x_i) = β'x_i + b

- β determines the orientation (slope) of the decision surface
- b determines its location (moves the surface up or down)
- If s(x_i) > 0 → class 1
- If s(x_i) < 0 → class 2
- |s(x_i)| / ||β|| = document i's distance from the decision surface (the margin)
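The scoring rule can be sketched in a few lines of R; the coefficients and document vector below are hypothetical, chosen only to illustrate the algebra:

```r
# Hypothetical coefficients for a 3-term vocabulary; not fitted to any data
beta <- c(0.5, -1.2, 2.0)  # orientation (slope) of the decision surface
b    <- -0.3               # offset: moves the surface up or down
x_i  <- c(1, 0, 1)         # one document's feature vector

s <- sum(beta * x_i) + b               # score s(x_i) = beta' x_i + b
class_i <- if (s > 0) 1 else 2         # sign of the score gives the class
margin  <- abs(s) / sqrt(sum(beta^2))  # distance from the decision surface
```

Here s = 2.2 > 0, so the document falls in class 1, at distance 2.2/||β|| from the surface.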
Support Vector Machines: Algebra (Bishop 2006)

Objective function: maximum margin.
min_i [ |s(x_i)| ] identifies the point closest to the decision surface. We want to identify β and b that maximize the margin:

arg max_{β,b} { (1/||β||) min_i [ |s(x_i)| ] } = arg max_{β,b} { (1/||β||) min_i [ |β'x_i + b| ] }

This is a constrained optimization problem, solved as a quadratic programming problem.
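A standard step (Bishop 2006, ch. 7) makes this tractable: rescaling (β, b) leaves the surface unchanged, so fix the closest point at |β'x_i + b| = 1. Maximizing 1/||β|| is then equivalent to:

```latex
% Canonical form of the maximum-margin problem after rescaling
\arg\min_{\beta,\, b}\; \tfrac{1}{2}\lVert \beta \rVert^{2}
\quad \text{subject to} \quad y_i\,(\beta' x_i + b) \ge 1,\quad i = 1, \dots, N
```

This is the quadratic program that SVM software actually solves.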
What About Overlap? (Bishop 2006)

- It is rare that the classes are perfectly separable.
- Define slack variables:

ξ_i = 0 if document i is correctly classified
ξ_i = |s(x_i)| if document i is incorrectly classified

Tradeoff:

- Maximize the margin between the correctly classified groups
- Minimize the error from misclassified documents

arg max_{β,b} { (1/||β||) min_i [ |β'x_i + b| ] − C Σ_{i=1}^{N} ξ_i }

C captures the tradeoff (the slack term enters with a minus sign so that misclassification is penalized; equivalently, Bishop 2006 minimizes C Σ_i ξ_i + (1/2)||β||²).
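A minimal R sketch of the slack variables, assuming a hypothetical already-fitted surface (β, b) and labels coded y_i ∈ {−1, 1}:

```r
# Hypothetical fitted surface; the three documents are made-up examples
beta <- c(1, -1)
b    <- 0
X <- rbind(c(2, 0), c(0, 2), c(1, 2))  # rows are document vectors
y <- c(1, -1, 1)                       # true labels in {-1, 1}

s  <- as.vector(X %*% beta + b)        # scores s(x_i)
xi <- ifelse(y * s > 0, 0, abs(s))     # xi = 0 if correct, |s(x_i)| if not
total_slack <- sum(xi)                 # the term the objective weights by C
```

Only the third document is misclassified here, so it alone contributes slack; a larger C punishes such errors more heavily relative to the margin term.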
How to Handle Multiple Comparisons?

- It is rare that we want to classify only two categories.
- How do we handle classification into K groups?

1) Set up K one-versus-rest classification problems:
- Compare each class to all other classes
- Problem: can lead to inconsistent results
- Solution(?): select the category with the largest “score”
- Problem: the scores are not on comparable scales
2) Common solution: set up K(K − 1)/2 one-versus-one classifications
- Take a vote to select the class (still suboptimal)
3) Simultaneous estimation is possible, but much slower
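The pairwise setup can be sketched in R; the vote vector below stands in for hypothetical winners from the K(K − 1)/2 classifiers:

```r
K <- 5
n_pairwise <- K * (K - 1) / 2          # 10 one-versus-one classifiers

# Hypothetical winning class from each of the 10 pairwise classifiers
votes <- c(2, 2, 2, 1, 3, 3, 1, 2, 2, 3)
predicted <- as.integer(names(which.max(table(votes))))  # majority vote
```

Note that `which.max` breaks ties by taking the first maximum, one illustration of why the vote is still suboptimal.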
R Code to Run SVMs

library(e1071)
fit <- svm(T ~ ., data = as.data.frame(tdm), type = 'C-classification',
           kernel = 'linear')

where: type = 'C-classification' → classification;
kernel → allows for distortion of the feature space. Options:

- linear
- polynomial
- radial
- sigmoid

preds <- predict(fit, newdata = as.data.frame(tdm[-c(1:no.train), ]))
Example of SVMs in Political Science Research

Hillard, Purpura, and Wilkerson: SVMs to code topics/subtopics for the Policy Agendas Project.

SVMs are underutilized in political science.
ReadMe: Optimization for a Different Goal (Hopkins and King 2010)

Naive Bayes (and, next week, SVMs) focus on classifying individual documents. But what if we care only about the category proportions?
Hopkins and King (2010): a method for characterizing the distribution of classes directly. It can be much more accurate than individual classifiers and requires fewer assumptions (the labeled documents need not be a random sample).

- King and Lu (2008): derive a method for characterizing causes of death from verbal autopsies
- Hopkins and King (2010): extend the method to text documents

Basic intuition:

- Examine the joint distribution of word characteristics (without making a Naive Bayes-like independence assumption)
- Focusing only on distributions is what makes this analysis possible
ReadMe: Optimization for a Different Goal (Hopkins and King 2010)

Measure only the presence/absence of each term, giving a (J × 1) vector:

x_i = (1, 0, 0, 1, . . . , 0)

What are the possible realizations of x_i? There are 2^J possible vectors.

Define:

- P(x) = probability of observing x
- P(x|C_j) = probability of observing x conditional on category C_j
- P(X|C) = the matrix collecting these conditional probability vectors
- P(C) = (P(C_1), P(C_2), . . . , P(C_K)) = the target quantity of interest
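The mechanics can be sketched with toy numbers in R: given P(X|C) estimated from labeled documents and P(X) observed in the unlabeled set, solve P(X) = P(X|C) P(C) for P(C) by least squares. All numbers below are made up, and the vocabulary is tiny so the word-profile rows are enumerable:

```r
# Toy version: 3 word-stem profiles (rows), 2 categories (columns);
# each column of P(X|C) sums to 1 over the profiles
P_X_given_C <- rbind(c(0.6, 0.1),
                     c(0.3, 0.2),
                     c(0.1, 0.7))
P_C_true <- c(0.25, 0.75)            # true unlabeled-set proportions

P_X <- P_X_given_C %*% P_C_true      # what we observe in the unlabeled set

# Least-squares estimate of P(C) from P(X) = P(X|C) P(C)
P_C_hat <- as.vector(qr.solve(P_X_given_C, P_X))
```

In this noiseless toy case the least-squares step recovers the true proportions exactly; in practice P(X|C) is estimated from the labeled set, so the solution is approximate.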
ReadMe: Optimization for a Different Goal (Hopkins and King 2010)
P(x)    =    P(x|C)    ×    P(C)
(2^J × 1)   (2^J × K)      (K × 1)

A matrix algebra problem to solve for P(C). Like Naive Bayes, this requires two pieces to estimate. Complication: 2^J >> the number of documents, so use kernel smoothing methods (without a formal model):
- P(x) = estimate directly from test set
- P(x |C ) = estimate from training set
- Key assumption: P(x|C) in the training set is equivalent to P(x|C) in the test set
- If that holds, we can perform biased sampling of documents and worry less about drift...
Algorithm Summarized
- Estimate P(x) from the test set
- Estimate P(x|C) from the training set
- Use P(x) and P(x|C) to solve for P(C)
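The three steps above can be sketched in Python on simulated data. The term-occurrence rates, sample sizes, and the plain least-squares solve are illustrative assumptions only; the actual ReadMe implementation uses kernel smoothing over random subsets of terms rather than the full 2^J profile set.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
J, K = 4, 2                                           # 4 terms, 2 categories
profiles = list(itertools.product([0, 1], repeat=J))  # all 2^J presence profiles
index = {p: i for i, p in enumerate(profiles)}

# Hypothetical per-category word-occurrence rates (assumed for the demo)
rates = np.array([[0.8, 0.6, 0.2, 0.1],   # category 1
                  [0.1, 0.3, 0.7, 0.9]])  # category 2

def draw(category, n):
    """Draw n binary term-presence vectors from a category's rates."""
    return (rng.random((n, J)) < rates[category]).astype(int)

def profile_dist(docs):
    """Empirical distribution over the 2^J presence/absence profiles."""
    counts = np.zeros(len(profiles))
    for row in docs:
        counts[index[tuple(row)]] += 1
    return counts / counts.sum()

# Training set: its category mix need not match the test set (biased sampling is fine)
train = {c: draw(c, 2000) for c in range(K)}
# Test set with true proportions P(C) = (0.3, 0.7)
true_pc = np.array([0.3, 0.7])
test = np.vstack([draw(c, int(5000 * p)) for c, p in enumerate(true_pc)])

A = np.column_stack([profile_dist(train[c]) for c in range(K)])  # estimate of P(x|C)
b = profile_dist(test)                                           # estimate of P(x)

# Solve P(x) = P(x|C) P(C) by least squares, then project onto the simplex
est, *_ = np.linalg.lstsq(A, b, rcond=None)
est = np.clip(est, 0, None)
est /= est.sum()
print(est)  # should be close to (0.3, 0.7)
```

Note that no individual document is ever classified: only the two profile distributions enter the solve.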
Assessing Model Performance
Not classifying individual documents → different standards.

Mean Square Error:

E[(θ̂ − θ)^2] = Var(θ̂) + Bias(θ̂, θ)^2

Suppose we have the true proportions P(C)^true. Then we estimate the Root Mean Square Error and the Mean Absolute Prediction Error:

RMSE = √( (1/K) Σ_{j=1}^{K} (P(C_j)^true − P̂(C_j))^2 )

Mean Abs. Prediction Error = (1/K) Σ_{j=1}^{K} |P(C_j)^true − P̂(C_j)|
Visualize: plot true and estimated proportions
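Both error measures are direct computations on the true and estimated proportion vectors; a minimal sketch (the proportions below are made up for illustration):

```python
import numpy as np

# Hypothetical true and estimated category proportions (K = 4 categories)
true_pc = np.array([0.10, 0.25, 0.40, 0.25])
est_pc  = np.array([0.12, 0.22, 0.41, 0.25])

rmse = np.sqrt(np.mean((true_pc - est_pc) ** 2))  # root mean square error
mape = np.mean(np.abs(true_pc - est_pc))          # mean absolute prediction error
print(rmse, mape)
```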
Using the House Press Release Data
Method        RMSE    APSE
------        ----    ----
ReadMe        0.036   0.056
Naive Bayes   0.096   0.14
SVM           0.052   0.084
Code to Run in R
Control file (one row per document):

filename                truth  trainingset
20July2009LEWIS53.txt     4        1
26July2006LEWIS249.txt    2        0

tdm <- undergrad(control = control, fullfreq = FALSE)
process <- preprocess(tdm)
output <- readme(process)

output$est.CSMF   ## estimated proportion in each category
output$true.CSMF  ## available if the validation set is labeled (but not used in training)
Model Selection!