
Text as Data

Justin Grimmer

Associate Professor, Department of Political Science

Stanford University

November 10th, 2014

Justin Grimmer (Stanford University) Text as Data November 10th, 2014 1 / 40

Supervised Learning Methods

1) Task

- Classify documents into pre-existing categories
- Measure the proportion of documents in each category

2) Objective function

   1) Penalized regressions: ridge regression, LASSO regression
   2) Classification surface: Support Vector Machines
   3) Measure proportions: Naive Bayes(ish) objective

3) Optimization

- Depends on method

4) Validation

- Obtain predicted fit for new data f(X_i, θ)
- Examine prediction performance: compare classification to a gold standard

Justin Grimmer (Stanford University) Text as Data November 10th, 2014 2 / 40

Regression models

Suppose we have N documents, with each document i having label y_i ∈ {−1, 1} (liberal, conservative).

We represent each document i as x_i = (x_{i1}, x_{i2}, \ldots, x_{iJ}).

f(\beta, X, Y) = \sum_{i=1}^{N} (y_i - \beta' x_i)^2

\hat{\beta} = \arg\min_\beta \left\{ \sum_{i=1}^{N} (y_i - \beta' x_i)^2 \right\} = (X'X)^{-1} X'Y

Problem:

- J will likely be large (perhaps J > N)
- There are many correlated variables

So predictions will be variable.

Justin Grimmer (Stanford University) Text as Data November 10th, 2014 3 / 40
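
A minimal R sketch of the problem (not from the slides; dtm and y are made-up stand-ins for a document-term matrix and its labels): when J > N, X'X is rank deficient, so the closed-form least-squares solution above cannot be computed directly.

set.seed(1)
N <- 20; J <- 50                       # more terms than documents (J > N)
dtm <- matrix(rpois(N * J, 1), N, J)   # toy word counts
y   <- sample(c(-1, 1), N, replace = TRUE)
XtX <- t(dtm) %*% dtm
qr(XtX)$rank                           # rank < J, so solve(XtX) fails and OLS is not identified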


Mean Square Error

Suppose θ is the true parameter value and \hat{\theta} is an estimator of it.

Bias:

\text{Bias} = E[\hat{\theta} - \theta]

We may care about the average squared distance from the truth:

E[(\hat{\theta} - \theta)^2] = E[\hat{\theta}^2] - 2\theta E[\hat{\theta}] + \theta^2
= E[\hat{\theta}^2] - E[\hat{\theta}]^2 + E[\hat{\theta}]^2 - 2\theta E[\hat{\theta}] + \theta^2
= E[\hat{\theta}^2] - E[\hat{\theta}]^2 + (E[\hat{\theta} - \theta])^2
= \text{Var}(\hat{\theta}) + \text{Bias}^2

To reduce MSE, we are willing to induce bias to decrease variance: methods that shrink coefficients toward zero.

Justin Grimmer (Stanford University) Text as Data November 10th, 2014 4 / 40
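
A small simulation sketch (my own toy normal-mean example, not from the slides) of the same point: shrinking an unbiased estimate toward zero adds bias but can cut variance enough to lower the MSE.

set.seed(1)
theta  <- 0.5
est    <- replicate(10000, mean(rnorm(10, mean = theta, sd = 3)))  # unbiased estimator of theta
shrunk <- est / (1 + 1)                                            # shrink toward zero with lambda = 1
mean((est - theta)^2)      # about 0.90 (all variance, no bias)
mean((shrunk - theta)^2)   # about 0.29 (some bias, much less variance)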


Ridge Regression

Penalty for model complexity:

f(\beta, X, Y) = \sum_{i=1}^{N} \left( y_i - (\beta_0 + \sum_{j=1}^{J} \beta_j x_{ij}) \right)^2 + \underbrace{\lambda \sum_{j=1}^{J} \beta_j^2}_{\text{Penalty}}

where:

- β_0 is the intercept
- λ is the penalty parameter

Justin Grimmer (Stanford University) Text as Data November 10th, 2014 5 / 40
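
One way to fit this penalty in R is the glmnet package (an assumption on my part; the slides do not name a package, and glmnet scales the loss slightly differently). With alpha = 0 it applies the squared-coefficient penalty above; dtm and y stand in for training data.

library(glmnet)
fit_ridge <- glmnet(x = dtm, y = y, alpha = 0, lambda = 0.5)
coef(fit_ridge)   # coefficients shrunk toward zero, but none exactly zero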


Ridge Regression Optimization

\hat{\beta}^{\text{Ridge}} = \arg\min_\beta \{ f(\beta, X, Y) \}

= \arg\min_\beta \left\{ \sum_{i=1}^{N} \left( y_i - (\beta_0 + \sum_{j=1}^{J} \beta_j x_{ij}) \right)^2 + \lambda \sum_{j=1}^{J} \beta_j^2 \right\}

= \arg\min_\beta \left\{ (Y - X\beta)'(Y - X\beta) + \lambda \beta'\beta \right\}

= (X'X + \lambda I_J)^{-1} X'Y

Demean the data and set \beta_0 = \bar{y} = \frac{\sum_{i=1}^{N} y_i}{N}.

Justin Grimmer (Stanford University) Text as Data November 10th, 2014 6 / 40
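
The closed form on this slide can also be coded directly in base R; a minimal sketch, assuming X is a demeaned predictor matrix and y a demeaned outcome:

ridge_closed_form <- function(X, y, lambda) {
  J <- ncol(X)
  # (X'X + lambda * I_J)^{-1} X'Y, computed via solve() rather than an explicit inverse
  solve(t(X) %*% X + lambda * diag(J), t(X) %*% y)
}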


Ridge Regression Intuition (1)

Suppose X'X = I_J. Then

\hat{\beta} = (X'X)^{-1} X'Y = X'Y

\hat{\beta}^{\text{Ridge}} = (X'X + \lambda I_J)^{-1} X'Y
= (I_J + \lambda I_J)^{-1} X'Y
= (I_J + \lambda I_J)^{-1} \hat{\beta}

so, component by component,

\hat{\beta}^{\text{Ridge}}_j = \frac{\hat{\beta}_j}{1 + \lambda}

Justin Grimmer (Stanford University) Text as Data November 10th, 2014 7 / 40
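
A quick numerical check of this special case (simulated data, not the slides' example): with orthonormal columns, the ridge estimate is exactly the OLS estimate divided by 1 + lambda.

set.seed(1)
X <- qr.Q(qr(matrix(rnorm(100 * 5), 100, 5)))   # 100 x 5 matrix with X'X = I
y <- rnorm(100)
lambda  <- 2
b_ols   <- solve(t(X) %*% X, t(X) %*% y)
b_ridge <- solve(t(X) %*% X + lambda * diag(5), t(X) %*% y)
all.equal(b_ridge, b_ols / (1 + lambda))        # TRUE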


Ridge Regression Intuition (2)

Place a normal prior on each coefficient and a normal model on the outcome:

\beta_j \sim \text{Normal}(0, \tau^2)

y_i \sim \text{Normal}(\beta_0 + x_i'\beta, \sigma^2)

p(\beta | X, Y) \propto \prod_{j=1}^{J} p(\beta_j) \prod_{i=1}^{N} p(y_i | x_i, \beta)

\propto \prod_{j=1}^{J} \frac{1}{\sqrt{2\pi}\tau} \exp\left( \frac{-\beta_j^2}{2\tau^2} \right) \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi}\sigma} \exp\left( \frac{-(y_i - (\beta_0 + x_i'\beta))^2}{2\sigma^2} \right)

Justin Grimmer (Stanford University) Text as Data November 10th, 2014 8 / 40


Ridge Regression Intuition (2)

\log p(\beta | X, Y) = -\sum_{j=1}^{J} \frac{\beta_j^2}{2\tau^2} - \sum_{i=1}^{N} \frac{(y_i - (\beta_0 + x_i'\beta))^2}{2\sigma^2} + \text{const.}

-2\sigma^2 \log p(\beta | X, Y) = \sum_{i=1}^{N} (y_i - (\beta_0 + x_i'\beta))^2 + \sum_{j=1}^{J} \frac{\sigma^2}{\tau^2} \beta_j^2 + \text{const.}

where:

- \lambda = \frac{\sigma^2}{\tau^2} plays the role of the ridge penalty on \sum_{j=1}^{J} \beta_j^2

So maximizing this posterior is the same as minimizing the ridge objective.

Justin Grimmer (Stanford University) Text as Data November 10th, 2014 9 / 40


Lasso Regression Objective Function/Optimization

A different penalty for model complexity:

f(\beta, X, Y) = \sum_{i=1}^{N} \left( y_i - (\beta_0 + \sum_{j=1}^{J} \beta_j x_{ij}) \right)^2 + \underbrace{\lambda \sum_{j=1}^{J} |\beta_j|}_{\text{Penalty}}

- Optimization is non-linear (absolute value)
- Coordinate descent:
  - Start with ridge
  - Sub-differential, update steps
- Induces sparsity: sets some coefficients to zero

Justin Grimmer (Stanford University) Text as Data November 10th, 2014 10 / 40
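
With glmnet again (the same package assumption as above), alpha = 1 gives the absolute-value penalty, and unlike ridge many coefficients land exactly at zero; dtm and y stand in for training data.

library(glmnet)
fit_lasso <- glmnet(x = dtm, y = y, alpha = 1, lambda = 0.1)
sum(as.matrix(coef(fit_lasso)) != 0)   # only a handful of non-zero coefficients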


Lasso Regression Intuition 1, Soft Thresholding

Suppose again X'X = I_J. Then

f(\beta, X, Y) = (Y - X\beta)'(Y - X\beta) + \lambda \sum_{j=1}^{J} |\beta_j|

= Y'Y - 2\beta' X'Y + \beta'\beta + \lambda \sum_{j=1}^{J} |\beta_j|

The coefficient is

\hat{\beta}^{\text{LASSO}}_j = \text{sign}(\hat{\beta}_j) \left( |\hat{\beta}_j| - \lambda \right)_+

- \text{sign}(\cdot) is 1 or -1
- \left( |\hat{\beta}_j| - \lambda \right)_+ = \max(|\hat{\beta}_j| - \lambda, 0)

Justin Grimmer (Stanford University) Text as Data November 10th, 2014 11 / 40
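
The soft-thresholding rule on this slide translates directly into a few lines of R; a sketch with made-up OLS coefficients:

soft_threshold <- function(beta_ols, lambda) {
  sign(beta_ols) * pmax(abs(beta_ols) - lambda, 0)
}
soft_threshold(c(-2.0, -0.3, 0.1, 1.5), lambda = 0.5)
# -1.5  0.0  0.0  1.0   (small coefficients zeroed, large ones shrunk by lambda)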


Lasso Regression Intuition 1, Soft Thresholding

Compare soft thresholding,

\hat{\beta}^{\text{LASSO}}_j = \text{sign}(\hat{\beta}_j) \left( |\hat{\beta}_j| - \lambda \right)_+

with hard thresholding, which keeps the M biggest components:

\hat{\beta}^{\text{subset}}_j = \hat{\beta}_j \cdot I\left( |\hat{\beta}_j| \ge |\hat{\beta}_{(M)}| \right)

Intuition 2: the prior on the coefficients is a double exponential.

Why does LASSO induce sparsity?

Justin Grimmer (Stanford University) Text as Data November 10th, 2014 12 / 40
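
For comparison, a toy version of the hard-thresholding rule above, which keeps the M largest coefficients untouched and zeroes the rest:

hard_threshold <- function(beta_ols, M) {
  keep <- rank(-abs(beta_ols)) <= M   # TRUE for the M largest magnitudes
  beta_ols * keep
}
hard_threshold(c(-2.0, -0.3, 0.1, 1.5), M = 2)
# -2.0  0.0  0.0  1.5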


Comparing Ridge and LASSO

[Figure: Ridge Regression, plotted on axes from -4 to 4]

[Figure: LASSO Regression, plotted on axes from -4 to 4]

Justin Grimmer (Stanford University) Text as Data November 10th, 2014 13 / 40

Comparing Ridge and LASSO

Contrast \beta = (\frac{1}{\sqrt{2}}, \frac{1}{\sqrt{2}}) and \beta = (1, 0).

Under ridge:

\sum_{j=1}^{2} \beta_j^2 = \frac{1}{2} + \frac{1}{2} = 1

\sum_{j=1}^{2} \beta_j^2 = 1 + 0 = 1

Under LASSO:

\sum_{j=1}^{2} |\beta_j| = \frac{1}{\sqrt{2}} + \frac{1}{\sqrt{2}} = \sqrt{2}

\sum_{j=1}^{2} |\beta_j| = 1 + 0 = 1

The ridge penalty treats the two vectors identically; the LASSO penalty is smaller for the sparse vector (1, 0).

Justin Grimmer (Stanford University) Text as Data November 10th, 2014 14 / 40


Selecting λ

How do we determine λ? Cross validation (lecture on Thursday).

Applying the models gives a score (probability) of a document belonging to a class; threshold it to classify.

To the R code!

Justin Grimmer (Stanford University) Text as Data November 10th, 2014 15 / 40
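
A sketch of the cross-validation step with cv.glmnet (again my package assumption; the slides leave cross-validation to Thursday). dtm and y stand in for training data.

library(glmnet)
cv_fit <- cv.glmnet(x = dtm, y = y, alpha = 1, nfolds = 5)
cv_fit$lambda.min                                  # lambda with the lowest cross-validated error
scores <- predict(cv_fit, newx = dtm, s = "lambda.min")
preds  <- ifelse(scores > 0, 1, -1)                # threshold the scores to classify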


Assessing Models (Elements of Statistical Learning)

- Model selection: tuning parameters to select the final model (next week's discussion)

- Model assessment: after selecting the model, estimating the error in classification

Justin Grimmer (Stanford University) Text as Data November 10th, 2014 16 / 40

Comparing Training and Validation Set

Text classification and model assessment

- Replicate classification exercise with validation set

- General principle of classification/prediction

- Compare supervised learning labels to hand labels

Confusion matrix

Justin Grimmer (Stanford University) Text as Data November 10th, 2014 17 / 40

Comparing Training and Validation Set

Representation of the test statistics from dictionary week (along with some new ones):

                                Actual Label
Classification (algorithm)      Liberal              Conservative
Liberal                         True Liberal         False Liberal
Conservative                    False Conservative   True Conservative

\text{Accuracy} = \frac{\text{TrueLib} + \text{TrueCons}}{\text{TrueLib} + \text{TrueCons} + \text{FalseLib} + \text{FalseCons}}

\text{Precision}_{\text{Liberal}} = \frac{\text{True Liberal}}{\text{True Liberal} + \text{False Liberal}}

\text{Recall}_{\text{Liberal}} = \frac{\text{True Liberal}}{\text{True Liberal} + \text{False Conservative}}

F_{\text{Liberal}} = \frac{2 \, \text{Precision}_{\text{Liberal}} \, \text{Recall}_{\text{Liberal}}}{\text{Precision}_{\text{Liberal}} + \text{Recall}_{\text{Liberal}}}

Justin Grimmer (Stanford University) Text as Data November 10th, 2014 18 / 40


ROC Curve

ROC as a measure of model performance.

\text{Recall}_{\text{Liberal}} = \frac{\text{True Liberal}}{\text{True Liberal} + \text{False Conservative}}

\text{Recall}_{\text{Conservative}} = \frac{\text{True Conservative}}{\text{True Conservative} + \text{False Liberal}}

Tension:

- Classify everything liberal: Recall_Liberal = 1; Recall_Conservative = 0
- Classify everything conservative: Recall_Liberal = 0; Recall_Conservative = 1

Characterize the tradeoff: plot the True Positive Rate (Recall_Liberal) against the False Positive Rate (1 − Recall_Conservative).

Justin Grimmer (Stanford University) Text as Data November 10th, 2014 19 / 40
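
A minimal sketch of how such a curve is traced (made-up scores and labels, not the slides' data): sweep a threshold over the classifier's scores and record the two rates at each cut.

roc_points <- function(score, y_true) {     # y_true coded 0/1, with 1 = "liberal"
  ths <- sort(unique(score), decreasing = TRUE)
  t(sapply(ths, function(th) c(
    fpr = mean(score[y_true == 0] >= th),   # 1 - Recall_Conservative
    tpr = mean(score[y_true == 1] >= th)))) # Recall_Liberal
}
set.seed(1)
roc_points(score = rnorm(20), y_true = rbinom(20, 1, 0.5))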

Precision/Recall Tradeoff

[Figure: tradeoff curve with False Positive Rate on the x-axis and True Positive Rate on the y-axis, both from 0.0 to 1.0]

Justin Grimmer (Stanford University) Text as Data November 10th, 2014 20 / 40

Simple Classification Example

Analyzing House press releases. Hand code 1,000 press releases into:

- Advertising
- Credit Claiming
- Position Taking

Divide the 1,000 press releases into two sets:

- 500: training set
- 500: test set

Initial exploration provides a baseline measurement of classifier performance; improve it by improving the model fit.

Justin Grimmer (Stanford University) Text as Data November 10th, 2014 21 / 40

Example from First Model Fit

                                              Actual Label
Classification (Naive Bayes)    Position Taking   Advertising   Credit Claim.
Position Taking                 10                0             0
Advertising                     2                 40            2
Credit Claiming                 80                60            306

\text{Accuracy} = \frac{10 + 40 + 306}{500} = 0.71

\text{Precision}_{\text{PT}} = \frac{10}{10} = 1 \qquad \text{Recall}_{\text{PT}} = \frac{10}{10 + 2 + 80} = 0.11

\text{Precision}_{\text{AD}} = \frac{40}{40 + 2 + 2} = 0.91 \qquad \text{Recall}_{\text{AD}} = \frac{40}{40 + 60} = 0.4

\text{Precision}_{\text{Credit}} = \frac{306}{306 + 80 + 60} = 0.69 \qquad \text{Recall}_{\text{Credit}} = \frac{306}{306 + 2} = 0.99

Justin Grimmer (Stanford University) Text as Data November 10th, 2014 22 / 40

Fit Statistics in R

The RWeka library provides much of this functionality, and you can also easily code the statistics yourself.

Justin Grimmer (Stanford University) Text as Data November 10th, 2014 23 / 40
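
A sketch of coding them yourself in base R, from a hypothetical 2 x 2 confusion matrix (rows = algorithm's classification, columns = actual label), following the definitions on the earlier slide:

conf <- matrix(c(45, 5, 10, 40), nrow = 2,
               dimnames = list(pred = c("Lib", "Cons"), actual = c("Lib", "Cons")))
accuracy  <- sum(diag(conf)) / sum(conf)              # (TrueLib + TrueCons) / total
precision <- conf["Lib", "Lib"] / sum(conf["Lib", ])  # TrueLib / (TrueLib + FalseLib)
recall    <- conf["Lib", "Lib"] / sum(conf[, "Lib"])  # TrueLib / (TrueLib + FalseCons)
f1        <- 2 * precision * recall / (precision + recall)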

Support Vector Machines

Document i is a J × 1 vector of counts:

x_i = (x_{1i}, x_{2i}, \ldots, x_{Ji})

Suppose we have two classes, C_1 and C_2:

- Y_i = 1 if i is in class 1
- Y_i = −1 if i is in class 2

Suppose the classes are separable:

- Draw a line between the groups
- Goal: identify the line in the middle
- Maximum margin

Justin Grimmer (Stanford University) Text as Data November 10th, 2014 24 / 40

Support Vector Machines: Maximum Margin Classifier(Bishop 2006)

[Figure: scatter of points from two classes, illustrating the maximum-margin classifier]

Justin Grimmer (Stanford University) Text as Data November 10th, 2014 25 / 40


Support Vector Machines: Algebra (Bishop 2006)

Goal: create a score to classify:

s(x_i) = \beta' x_i + b

- β determines the orientation of the surface (slope)
- b determines the location (moves the surface up or down)
- If s(x_i) > 0 → class 1
- If s(x_i) < 0 → class 2
- \frac{|s(x_i)|}{||\beta||} = the document's distance from the decision surface (margin)

Justin Grimmer (Stanford University) Text as Data November 10th, 2014 26 / 40
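
A toy version of the scoring rule (beta, b, and x_i are made-up numbers, not fitted values):

beta <- c(0.5, -1.2, 0.3)
b    <- 0.1
x_i  <- c(2, 0, 1)
s       <- sum(beta * x_i) + b          # s(x_i) = beta'x_i + b, here 1.4, so class 1
class_i <- ifelse(s > 0, 1, -1)
margin  <- abs(s) / sqrt(sum(beta^2))   # distance from the decision surface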

Support Vector Machines: Algebra (Bishop 2006)

Objective function: maximum margin.

\min_i |s(x_i)| is the point closest to the decision surface.

We want to identify β and b that maximize the margin:

\arg\max_{\beta, b} \left\{ \frac{1}{||\beta||} \min_i |s(x_i)| \right\}

\arg\max_{\beta, b} \left\{ \frac{1}{||\beta||} \min_i |\beta' x_i + b| \right\}

This is a constrained optimization problem, solved as a quadratic programming problem.

Justin Grimmer (Stanford University) Text as Data November 10th, 2014 27 / 40


What About Overlap? (Bishop 2006)

- Rare that classes are separable.
- Define:

\xi_i = 0 if document i is correctly classified
\xi_i = |s(x_i)| if document i is incorrectly classified

Tradeoff:

- Maximize the margin between the correctly classified groups
- Minimize the error from misclassified documents

\arg\max_{\beta, b} \left\{ C \sum_{i=1}^{N} \xi_i + \frac{1}{||\beta||} \min_i |\beta' x_i + b| \right\}

C captures the tradeoff.

Justin Grimmer (Stanford University) Text as Data November 10th, 2014 28 / 40


How to Handle Multiple Comparisons?

- Rare that we only want to classify into two categories
- How do we handle classification into K groups?

1) Set up K classification problems:
   - Compare each class to all other classes
   - Problem: can lead to inconsistent results
   - Solution(?): select the category with the largest "score"
   - Problem: the scales are not comparable
2) Common solution: set up K(K − 1)/2 pairwise classifications
   - Perform a vote to select the class (still suboptimal)
3) Simultaneous estimation is possible, but much slower

Justin Grimmer (Stanford University) Text as Data November 10th, 2014 29 / 40


R Code to Run SVMs

library(e1071)

fit <- svm(T ~ ., data = as.data.frame(tdm), method = 'C',
           kernel = 'linear')

where:

- method = 'C' → classification
- kernel = 'linear' → allows for distortion of the feature space. Options:
  - linear
  - polynomial
  - radial
  - sigmoid

preds <- predict(fit, newdata = as.data.frame(tdm[-c(1:no.train), ]))

Justin Grimmer (Stanford University) Text as Data November 10th, 2014 30 / 40

Example of SVMs in Political Science Research

Hillard, Purpura, and Wilkerson: SVMs to code topics/subtopics for the Policy Agendas Project.

SVMs are underutilized in political science.

Justin Grimmer (Stanford University) Text as Data November 10th, 2014 31 / 40

ReadMe: Optimization for a Different Goal (Hopkins and King 2010)

Naive Bayes (and next week, SVM) focus on individual document classification.

But what if we are focused on proportions only?

Hopkins and King (2010): a method for characterizing the distribution of classes. It can be much more accurate than individual classifiers and requires fewer assumptions (it does not need a random sample of documents).

- King and Lu (2008): derive the method for characterizing causes of death from verbal autopsies
- Hopkins and King (2010): extend the method to text documents

Basic intuition:

- Examine the joint distribution of characteristics (without making a Naive Bayes-like assumption)
- The focus on distributions (only) is what makes this analysis possible

Justin Grimmer (Stanford University) Text as Data November 10th, 2014 32 / 40


ReadMe: Optimization for a Different Goal (Hopkins and King 2010)

Measure only the presence/absence of each term, a (J x 1) vector:

x_i = (1, 0, 0, 1, \ldots, 0)

What are the possible realizations of x_i?

- 2^J possible vectors

Define:

- P(x) = probability of observing x
- P(x | C_j) = probability of observing x conditional on category C_j
- P(X | C) = matrix collecting these vectors
- P(C) = P(C_1, C_2, \ldots, C_K), the target quantity of interest

Justin Grimmer (Stanford University) Text as Data November 10th, 2014 33 / 40


ReadMe: Optimization for a Different Goal (Hopkins and King 2010)

\underbrace{P(x)}_{2^J \times 1} = \underbrace{P(x|C)}_{2^J \times K} \underbrace{P(C)}_{K \times 1}

This is a matrix algebra problem to solve for P(C). Like Naive Bayes, it requires two pieces to estimate. Complication: 2^J >> the number of documents, so use kernel smoothing methods (without a formal model):

- P(x): estimate directly from the test set
- P(x|C): estimate from the training set
- Key assumption: P(x|C) in the training set is equivalent to P(x|C) in the test set
- If true, we can perform biased sampling of documents and worry less about drift...

Justin Grimmer (Stanford University) Text as Data November 10th, 2014 34 / 40

Algorithm Summarized

- Estimate P(x) from the test set
- Estimate P(x|C) from the training set
- Use P(x) and P(x|C) to solve for P(C) (a small sketch of this step follows below)

Justin Grimmer (Stanford University) Text as Data November 10th, 2014 35 / 40
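
A sketch of the matrix-algebra step only (this is not the ReadMe package itself, and it ignores the constraint that P(C) must be non-negative and sum to one): given estimates of P(x) and P(x|C), recover P(C) by least squares. The tiny matrices are made up, with two word-stem profiles and two categories.

P_x_given_C <- matrix(c(0.6, 0.2,
                        0.4, 0.8), nrow = 2, byrow = TRUE)   # rows: profiles, columns: categories
P_C_true <- c(0.7, 0.3)
P_x      <- P_x_given_C %*% P_C_true                         # what the test set would show
solve(t(P_x_given_C) %*% P_x_given_C, t(P_x_given_C) %*% P_x)   # recovers 0.7 and 0.3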

Assessing Model Performance

Not classifying individual documents → different standards.

Mean square error:

E[(\hat{\theta} - \theta)^2] = \text{Var}(\hat{\theta}) + \text{Bias}(\hat{\theta}, \theta)^2

Suppose we have the true proportions P(C)_{\text{true}}. Then we estimate the root mean square error and mean absolute prediction error:

\text{RMSE} = \sqrt{ \frac{ \sum_{j=1}^{J} \left( P(C_j)_{\text{true}} - \widehat{P(C_j)} \right)^2 }{J} }

\text{Mean Abs. Prediction Error} = \left| \frac{ \sum_{j=1}^{J} \left( P(C_j)_{\text{true}} - \widehat{P(C_j)} \right) }{J} \right|

Visualize: plot the true and estimated proportions.

Justin Grimmer (Stanford University) Text as Data November 10th, 2014 36 / 40
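
The two summaries coded directly, with hypothetical true and estimated proportion vectors:

p_true <- c(0.50, 0.30, 0.20)
p_hat  <- c(0.45, 0.35, 0.20)
rmse <- sqrt(sum((p_true - p_hat)^2) / length(p_true))   # root mean square error
mape <- abs(sum(p_true - p_hat) / length(p_true))        # mean absolute prediction error, as on the slide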

[Figure: true vs. estimated category proportions]

Justin Grimmer (Stanford University) Text as Data November 10th, 2014 37 / 40

Using the House Press Release Data

Method       RMSE    APSE
ReadMe       0.036   0.056
NaiveBayes   0.096   0.14
SVM          0.052   0.084

Justin Grimmer (Stanford University) Text as Data November 10th, 2014 38 / 40

Code to Run in R

Control file:

filename                 truth  trainingset
20July2009LEWIS53.txt    4      1
26July2006LEWIS249.txt   2      0

tdm <- undergrad(control = control, fullfreq = F)

process <- preprocess(tdm)

output <- readme(process)

output$est.CSMF   ## proportion in each category

output$true.CSMF  ## if labeled for the validation set (but not used in the training set)

Justin Grimmer (Stanford University) Text as Data November 10th, 2014 39 / 40

Model Selection!

Justin Grimmer (Stanford University) Text as Data November 10th, 2014 40 / 40
