CS 6501 Natural Language Processing
Text Classification (I): Logistic Regression
Yangfeng Ji
Department of Computer Science
University of Virginia
Overview
1. Problem Definition
2. Bag-of-Words Representation
3. Case Study: Sentiment Analysis
4. Logistic Regression
5. ℓ2 Regularization
6. Demo Code
Problem Definition
Case I: Sentiment Analysis
[Pang et al., 2002]
Case II: Topic Classification
Example topics
▶ Business
▶ Arts
▶ Technology
▶ Sports
▶ · · ·
Classification
▶ Input: a text x
  ▶ Example: a product review on Amazon
▶ Output: y ∈ 𝒴, where 𝒴 is the predefined category set (sample space)
  ▶ Example: 𝒴 = {Positive, Negative}
The pipeline of text classification:¹
Text → Numeric Vector x → Classifier → Category y
¹ In this course, we use x for both the text and its representation, with no distinction.
Probabilistic Formulation
With the conditional probability P(Y | X), the prediction on Y for a given text X = x is

\hat{y} = \arg\max_{y \in \mathcal{Y}} P(Y = y \mid X = x)  (1)

Or, for simplicity,

\hat{y} = \arg\max_{y \in \mathcal{Y}} P(y \mid \mathbf{x})  (2)
Key Questions
Recall
▶ The formulation defined on the previous slide

\hat{y} = \arg\max_{y \in \mathcal{Y}} P(Y = y \mid X = x)  (3)

▶ The pipeline of text classification

Text → Numeric Vector x → Classifier → Category y

Building a text classifier is about answering the following two questions:
1. How to represent a text as x?
   ▶ Bag-of-words representation
2. How to estimate P(y | x)?
   ▶ Logistic regression models
   ▶ Neural network classifiers (Lecture 3)
Bag-of-Words Representation
Bag-of-Words Representation
Example Texts
Text 1: I love coffee.
Text 2: I don’t like tea.
Step I: convert a text into a collection of tokens (e.g., tokenization)
Tokenized Texts
Tokenized text 1: I love coffee
Tokenized text 2: I don t like tea
Step II: build a dictionary/vocabulary
Vocabulary: {I, love, coffee, don, t, like, tea}
Bag-of-Words Representations
Step III: based on the vocab, convert each text into a numeric representation:

Bag-of-Words Representations
          I  love  coffee  don  t  like  tea
x^(1) =  [1   1     1       0   0   0     0]^T
x^(2) =  [1   0     0       1   1   1     1]^T
The pipeline of text classification:
Text → [Bag-of-words Representation] → Numeric Vector x → Classifier → Category y
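A minimal sketch of the three steps above in Python; the tokenizer here is deliberately naive, chosen only so that it reproduces the toy example (it is not the tokenizer used in the actual demo code).

```python
# Toy texts from the slides
texts = ["I love coffee.", "I don't like tea."]

# Step I: tokenization (naive: drop periods, split apostrophes, split on whitespace)
def tokenize(text):
    return text.replace(".", "").replace("'", " ").split()

tokenized = [tokenize(t) for t in texts]
# [['I', 'love', 'coffee'], ['I', 'don', 't', 'like', 'tea']]

# Step II: build the vocabulary (in order of first occurrence)
vocab = []
for tokens in tokenized:
    for tok in tokens:
        if tok not in vocab:
            vocab.append(tok)
# ['I', 'love', 'coffee', 'don', 't', 'like', 'tea']

# Step III: convert each text into a count vector over the vocabulary
def bow_vector(tokens, vocab):
    return [tokens.count(w) for w in vocab]

print(bow_vector(tokenized[0], vocab))  # [1, 1, 1, 0, 0, 0, 0]
print(bow_vector(tokenized[1], vocab))  # [1, 0, 0, 1, 1, 1, 1]
```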
Preprocessing for Building Vocab
1. convert all characters to lowercase
UVa, UVA → uva
Shall we always convert all words to lowercase?
Apple vs. apple
2. map low frequency words to a special token 〈unk〉
Zipf’s law: freq(w_t) ∝ 1/r_t, where freq(w_t) is the frequency of word w_t and r_t is the rank of this word
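A sketch of both preprocessing steps; the tiny corpus and the cutoff min_freq = 2 are invented for illustration (in practice the cutoff is tuned, and by Zipf's law most word types fall below it).

```python
from collections import Counter

corpus = ["UVa is in Charlottesville", "UVA offers an NLP course",
          "I love coffee", "I like coffee"]

# Step 1: lowercase everything (UVa, UVA -> uva)
tokenized = [text.lower().split() for text in corpus]

# Step 2: map low-frequency words to a special token <unk>
freq = Counter(tok for tokens in tokenized for tok in tokens)
min_freq = 2                                   # illustrative cutoff
vocab = {tok for tok, c in freq.items() if c >= min_freq} | {"<unk>"}

def map_unk(tokens):
    return [tok if tok in vocab else "<unk>" for tok in tokens]

for tokens in tokenized:
    print(map_unk(tokens))
```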
Information Embedded in BoW Representations
It is critical to keep in mind what information is preserved in bag-of-words representations:
▶ Keep:
  ▶ the words in the texts
▶ Lose:
  ▶ word order (the vector only records counts over the vocabulary: I love coffee don t like tea)
  ▶ sentence boundaries
  ▶ sentence order
  ▶ · · ·
Case Study: Sentiment Analysis
A Dummy Predictor
Consider the following toy example (adding one more example to
make it more interesting)
Tokenized Texts
Text X                                Label Y
Tokenized text 1  I love coffee       Positive
Tokenized text 2  I don t like tea    Negative
Tokenized text 3  I like coffee       Positive
What is the simplest classifier that we can construct based on this
“dataset”?
▶ Predict every text as Positive
▶ 66.7% prediction accuracy on this dataset²
▶ It has a name: majority baseline³
² The evaluation of classifiers will be discussed in one of the future lectures.
³ It can be a competitive baseline in practice, particularly for unbalanced datasets.
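A sketch of the majority baseline on this toy “dataset” (the three labels come from the table above):

```python
from collections import Counter

labels = ["Positive", "Negative", "Positive"]      # labels of the three toy texts

# Majority baseline: always predict the most frequent label in the training set
majority_label = Counter(labels).most_common(1)[0][0]
predictions = [majority_label] * len(labels)

accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
print(majority_label, f"{accuracy:.1%}")           # Positive 66.7%
```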
A Simple Predictor
Consider the following toy example, again
Tokenized Texts
Tokenized text 1: I love coffee
Tokenized text 2: I don t like tea
Tokenized text 3: I like coffee
         I  love  coffee  don  t  like  tea
x^(1)   [1   1     1       0   0   0     0]^T
w_Pos   [0   1     0       0   0   1     0]^T
w_Neg   [0   0     0       1   0   0     0]^T
The prediction of sentiment polarity can be formulated as the following:

\mathbf{w}_{\text{Pos}}^T \mathbf{x} = 1 > \mathbf{w}_{\text{Neg}}^T \mathbf{x} = 0  (4)
Essentially, it is equivalent to counting the positive and negative
words.
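A sketch of this word-counting predictor; the two indicator vectors are the toy sentiment word lists from the slide, not a real sentiment lexicon.

```python
vocab = ["I", "love", "coffee", "don", "t", "like", "tea"]

w_pos = [0, 1, 0, 0, 0, 1, 0]      # toy positive words: {love, like}
w_neg = [0, 0, 0, 1, 0, 0, 0]      # toy negative words: {don}

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def predict(x):
    # Predict Positive iff the text contains more positive than negative words
    return "Positive" if dot(w_pos, x) > dot(w_neg, x) else "Negative"

x1 = [1, 1, 1, 0, 0, 0, 0]         # "I love coffee"
print(predict(x1))                 # Positive, since w_pos^T x = 1 > w_neg^T x = 0
```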
Another Example
The limitation of word counting
         I  love  coffee  don  t  like  tea
x^(2)   [1   0     0       1   1   1     1]^T
w_Pos   [0   1     0       0   0   1     0]^T
w_Neg   [0   0     0       1   0   0     0]^T
▶ Different words should contribute differently, e.g., not vs. dislike
▶ Sentiment word lists are definitely incomplete
Example II: Positive
Din Tai Fung, every time I go eat at any one of the locations around the King County area, I keep being reminded on why I have to keep coming back to this restaurant. · · ·
Logistic Regression
Linear Models
Directly modeling a linear classifier as

h_y(\mathbf{x}) = \mathbf{w}_y^T \mathbf{x} + b_y  (5)
with
▶ x ∈ ℕ^V: vector, bag-of-words representation
▶ w_y ∈ ℝ^V: vector, classification weights associated with label y
▶ b_y ∈ ℝ: scalar, label bias in the training set
About Label Bias
Consider a case with highly imbalanced examples, where we have 90 positive examples and 10 negative examples in the training set. With

b_Pos > b_Neg,

a classifier can get 90% of its predictions correct without even resorting to the texts.
Logistic Regression
Rewrite the linear decision function in the log-probabilistic form

\log P(y \mid \mathbf{x}) \propto \underbrace{\mathbf{w}_y^T \mathbf{x} + b_y}_{h_y(\mathbf{x})}  (6)

or, in the probabilistic form,

P(y \mid \mathbf{x}) \propto \exp(\mathbf{w}_y^T \mathbf{x} + b_y)  (7)

To make sure P(y | x) is a valid definition of probability, we need \sum_{y} P(y \mid \mathbf{x}) = 1:

P(y \mid \mathbf{x}) = \frac{\exp(\mathbf{w}_y^T \mathbf{x} + b_y)}{\sum_{y' \in \mathcal{Y}} \exp(\mathbf{w}_{y'}^T \mathbf{x} + b_{y'})}  (8)

Problem: Show that the classifier based on Eq. 8 is still a linear classifier.
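A sketch of Eq. 8 in numpy; the weight and bias values are made up purely to illustrate the computation.

```python
import numpy as np

def lr_probabilities(x, W, b):
    """P(y | x) for every label y, in the softmax form of Eq. 8.

    x: (V,) bag-of-words vector; W: (K, V), one weight vector per label; b: (K,) biases.
    """
    scores = W @ x + b                      # h_y(x) = w_y^T x + b_y for each label y
    scores = scores - scores.max()          # subtract the max for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

# Made-up parameters for Y = {Negative, Positive} over the 7-word toy vocabulary
W = np.array([[0.0, -1.0, 0.0, 1.0, 0.0, 0.0, 0.5],
              [0.0,  2.0, 0.5, 0.0, 0.0, 1.0, 0.0]])
b = np.array([0.1, -0.1])
x = np.array([1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0])   # "I love coffee"

probs = lr_probabilities(x, W, b)
print(probs, probs.argmax())    # the Positive label (index 1) gets the higher probability
```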
Alternative Form
Rewriting x and w as
▶ \mathbf{x}^T = [x_1, x_2, \cdots, x_V, 1]
▶ \mathbf{w}_y^T = [w_1, w_2, \cdots, w_V, b_y]
allows us to have a more concise form

P(y \mid \mathbf{x}) = \frac{\exp(\mathbf{w}_y^T \mathbf{x})}{\sum_{y' \in \mathcal{Y}} \exp(\mathbf{w}_{y'}^T \mathbf{x})}  (9)
Comments:
▶ \frac{\exp(a)}{\sum_{a'} \exp(a')} is the softmax function
▶ This form works with any size of 𝒴; it does not have to be a binary classification problem.
Binary Classifier
Assume 𝒴 = {Neg, Pos}; then the corresponding logistic regression classifier with Y = Pos is

P(Y = \text{Pos} \mid \mathbf{x}) = \frac{1}{1 + \exp(-\mathbf{w}^T \mathbf{x})}  (10)

where w is the only parameter.
▶ P(Y = Neg | x) = 1 − P(Y = Pos | x)
▶ 1/(1 + exp(−z)) is the sigmoid function
Problem: Show Eq. 10 is a special form of Eq. 9.
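A quick numeric check (not a proof) that the two-class softmax of Eq. 9 matches the sigmoid of Eq. 10 when the single parameter is taken to be the difference of the two label weight vectors; all values here are randomly generated.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random(5)
w_pos, w_neg = rng.normal(size=5), rng.normal(size=5)

# Two-class softmax (Eq. 9)
p_softmax = np.exp(w_pos @ x) / (np.exp(w_pos @ x) + np.exp(w_neg @ x))

# Sigmoid (Eq. 10) with w = w_pos - w_neg
w = w_pos - w_neg
p_sigmoid = 1.0 / (1.0 + np.exp(-(w @ x)))

print(np.isclose(p_softmax, p_sigmoid))   # True
```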
Two Questions on Building LR Models
Two questions arise when building a logistic regression classifier

P(y \mid \mathbf{x}) = \frac{\exp(\mathbf{w}_y^T \mathbf{x})}{\sum_{y' \in \mathcal{Y}} \exp(\mathbf{w}_{y'}^T \mathbf{x})}  (11)
▶ How to learn the parameters θ = {w_y}_{y ∈ 𝒴}?
▶ Can x be better than the bag-of-words representations?
  ▶ Will be discussed in Lecture 04
Review: (Log)-likelihood Function
With a collection of training examples \{(\mathbf{x}^{(i)}, y^{(i)})\}_{i=1}^{m}, the likelihood function of {w_y}_{y ∈ 𝒴} is

L(\theta) = \prod_{i=1}^{m} P(y^{(i)} \mid \mathbf{x}^{(i)})  (12)

and the log-likelihood function is

\ell(\{\mathbf{w}_y\}) = \sum_{i=1}^{m} \log P(y^{(i)} \mid \mathbf{x}^{(i)})  (13)
Log-likelihood Function of a LR Model
With the definition of a LR model

P(y \mid \mathbf{x}) = \frac{\exp(\mathbf{w}_y^T \mathbf{x})}{\sum_{y' \in \mathcal{Y}} \exp(\mathbf{w}_{y'}^T \mathbf{x})}  (14)

the log-likelihood function is

\ell(\theta) = \sum_{i=1}^{m} \log P(y^{(i)} \mid \mathbf{x}^{(i)})  (15)
             = \sum_{i=1}^{m} \Big\{ \mathbf{w}_{y^{(i)}}^T \mathbf{x}^{(i)} - \log \sum_{y' \in \mathcal{Y}} \exp(\mathbf{w}_{y'}^T \mathbf{x}^{(i)}) \Big\}  (16)

Given the training examples \{(\mathbf{x}^{(i)}, y^{(i)})\}_{i=1}^{m}, \ell(\theta) is a function of \theta = \{\mathbf{w}_y\}.
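A sketch of Eq. 16 using a numerically stable log-sum-exp; the tiny dataset and the zero initialization are invented for illustration.

```python
import numpy as np
from scipy.special import logsumexp

def log_likelihood(W, X, y):
    """Eq. 16: sum_i [ w_{y_i}^T x_i - log sum_{y'} exp(w_{y'}^T x_i) ].

    W: (K, V) weights (biases folded in), X: (m, V) data, y: (m,) label indices.
    """
    scores = X @ W.T                            # (m, K): w_{y'}^T x_i for all i and y'
    gold = scores[np.arange(len(y)), y]         # w_{y_i}^T x_i
    return np.sum(gold - logsumexp(scores, axis=1))

# Toy data: 2 labels, 3 features (made up)
X = np.array([[1.0, 1.0, 0.0], [1.0, 0.0, 2.0]])
y = np.array([1, 0])
W = np.zeros((2, 3))
print(log_likelihood(W, X, y))                  # 2 * log(1/2) ≈ -1.386 at W = 0
```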
Optimization with Gradient
MLE is equivalent to minimizing the negative log-likelihood (NLL):

\mathrm{NLL}(\theta) = -\ell(\theta) = \sum_{i=1}^{m} \Big\{ -\mathbf{w}_{y^{(i)}}^T \mathbf{x}^{(i)} + \log \sum_{y' \in \mathcal{Y}} \exp(\mathbf{w}_{y'}^T \mathbf{x}^{(i)}) \Big\}

Then, the parameter w_y associated with label y can be updated as

\mathbf{w}_y \leftarrow \mathbf{w}_y - \eta \cdot \frac{\partial \mathrm{NLL}(\{\mathbf{w}_y\})}{\partial \mathbf{w}_y}, \quad \forall y \in \mathcal{Y}  (17)

where η is called the learning rate.
Optimization with Gradient (II)
Two questions answered by the update equation
(1) which direction?
(2) how far it should go?
\mathbf{w}_y \leftarrow \mathbf{w}_y - \underbrace{\eta}_{(2)} \cdot \underbrace{\frac{\partial \mathrm{NLL}(\{\mathbf{w}_y\})}{\partial \mathbf{w}_y}}_{(1)}  (18)

[Jurafsky and Martin, 2019]
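A sketch of repeatedly applying the update of Eq. 17; it uses the standard softmax-regression gradient ∂NLL/∂w_y = Σ_i (P(y | x_i) − 1{y_i = y}) x_i, and the toy data and hyperparameters are made up.

```python
import numpy as np

def softmax(scores):
    scores = scores - scores.max(axis=1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=1, keepdims=True)

def train(X, y, num_labels, eta=0.1, steps=100):
    """Plain gradient descent on the NLL of a logistic regression model."""
    m, V = X.shape
    W = np.zeros((num_labels, V))           # one weight vector w_y per label
    Y = np.eye(num_labels)[y]               # one-hot labels, shape (m, K)
    for _ in range(steps):
        P = softmax(X @ W.T)                # P(y' | x_i) for every example and label
        grad = (P - Y).T @ X                # dNLL/dW: one row per label
        W = W - eta * grad                  # Eq. 17 for every w_y
    return W

# Toy data (made up): 3 examples, 2 labels, 3 features
X = np.array([[1.0, 1.0, 0.0], [1.0, 0.0, 2.0], [1.0, 1.0, 1.0]])
y = np.array([1, 0, 1])
W = train(X, y, num_labels=2)
print(softmax(X @ W.T).argmax(axis=1))      # recovers the training labels [1 0 1]
```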
Training Procedure
Steps for parameter estimation, given the current parameters {w_y}:

1. Compute the derivative

\frac{\partial \mathrm{NLL}(\{\mathbf{w}_y\})}{\partial \mathbf{w}_y}, \quad \forall y \in \mathcal{Y}

2. Update the parameters with

\mathbf{w}_y \leftarrow \mathbf{w}_y - \eta \cdot \frac{\partial \mathrm{NLL}(\{\mathbf{w}_y\})}{\partial \mathbf{w}_y}, \quad \forall y \in \mathcal{Y}

3. If not done, return to step 1
Procedure of Building a Classifier
Review: the pipeline of text classification:
Text → [Bag-of-words Representation] → Numeric Vector x → [Logistic Regression] → Category y
ℓ2 Regularization
ℓ2 Regularization
The commonly used regularization trick is ℓ2 regularization. For that, we need to redefine the objective function of LR by adding an additional term:

\mathrm{Loss}(\theta) = \underbrace{\sum_{i=1}^{m} \Big\{ -\mathbf{w}_{y^{(i)}}^T \mathbf{x}^{(i)} + \log \sum_{y' \in \mathcal{Y}} \exp(\mathbf{w}_{y'}^T \mathbf{x}^{(i)}) \Big\}}_{\mathrm{NLL}} + \underbrace{\frac{\lambda}{2} \cdot \sum_{y \in \mathcal{Y}} \|\mathbf{w}_y\|_2^2}_{\ell_2 \ \mathrm{reg}}  (19)

▶ λ is the regularization parameter
ℓ2 Regularization in Gradient Descent
▶ The gradient of the loss function:

\frac{\partial \mathrm{Loss}(\theta)}{\partial \mathbf{w}_y} = \frac{\partial \mathrm{NLL}(\theta)}{\partial \mathbf{w}_y} + \lambda \mathbf{w}_y  (20)

▶ To minimize the loss, we need to update the parameters as

\mathbf{w}_y \leftarrow \mathbf{w}_y - \eta \Big( \frac{\partial \mathrm{NLL}(\theta)}{\partial \mathbf{w}_y} + \lambda \mathbf{w}_y \Big)  (21)
             = (1 - \eta\lambda) \cdot \mathbf{w}_y - \eta \frac{\partial \mathrm{NLL}(\theta)}{\partial \mathbf{w}_y}

▶ Depending on the strength (value) of λ, the regularization term tries to keep the parameter values close to 0, which to some extent can help avoid overfitting.
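The second line of Eq. 21 can be read as "shrink the weights, then take a gradient step," which is why this update is also called weight decay. A minimal sketch of one step (the values of η, λ, and the weights are placeholders):

```python
import numpy as np

def regularized_update(w_y, grad_nll_y, eta=0.1, lam=1.0):
    """One step of Eq. 21: w_y <- (1 - eta*lam) * w_y - eta * dNLL/dw_y."""
    return (1.0 - eta * lam) * w_y - eta * grad_nll_y

# With a zero NLL gradient, the update only shrinks the weights toward 0
w = np.array([4.0, -2.0, 0.5])
for _ in range(3):
    w = regularized_update(w, grad_nll_y=np.zeros(3))
print(w)   # each step multiplies the weights by (1 - eta*lam) = 0.9
```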
Learning without Regularization
In the demo code, we chose λ = 0.001 to approximate the case without regularization.
▶ Training accuracy: 99.89%
▶ Val accuracy: 52.21%
Classification Weights without Regularization
Here are some word features and their classification weights from the previous model without regularization. Positive weights indicate that the word feature contributes to the positive sentiment class, and negative weights indicate the opposite contribution.
              interesting  pleasure  boring  zoe     write   workings
Without Reg   0.011        -5.63     1.80    -5.68   -8.20   14.16
▶ negative: woody allen can write and deliver a one liner as well as anybody .
▶ positive: soderbergh , like kubrick before him , may not touch the planet ’s skin , but understands the workings of its spirit .
Learning with Regularization
We chose λ = 10².
▶ Training accuracy: 62.54%
▶ Val accuracy: 63.17%
Classification Weights with Regularization
With regularization, the classification weights make more sense to us:

              interesting  pleasure  boring  zoe     write   workings
Without Reg   0.011        -5.63     1.80    -5.68   -8.20   14.16
With Reg      0.16         0.36      -0.21   -0.057  -0.066  0.040

Regularization for Avoiding Overfitting
Reduce the correlation between the class label and some noisy features.
Demo Code
Demo
What we are going to review from this demo code
▶ NLP
  ▶ Bag-of-words representations
  ▶ Text classifiers
▶ Machine Learning
  ▶ Overfitting
  ▶ ℓ2 regularization
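The demo code itself is distributed with the course; below is a rough sketch of the same pipeline using scikit-learn, with placeholder data. Note that scikit-learn's LogisticRegression controls the ℓ2 penalty through C, the inverse regularization strength (a large C approximates no regularization, a small C means strong regularization), rather than through λ directly.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Placeholder data: replace with the course's sentiment dataset
train_texts = ["I love coffee", "I don t like tea", "I like coffee"]
train_labels = ["Positive", "Negative", "Positive"]
val_texts = ["I love tea"]
val_labels = ["Positive"]

# Bag-of-words representation (lowercasing is the default)
vectorizer = CountVectorizer(lowercase=True)
X_train = vectorizer.fit_transform(train_texts)
X_val = vectorizer.transform(val_texts)

# Logistic regression with l2 regularization; C acts roughly as the inverse of lambda in Eq. 19
clf = LogisticRegression(penalty="l2", C=1.0, max_iter=1000)
clf.fit(X_train, train_labels)

print("train acc:", accuracy_score(train_labels, clf.predict(X_train)))
print("val acc:", accuracy_score(val_labels, clf.predict(X_val)))
```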
References

Jurafsky, D. and Martin, J. (2019). Speech and Language Processing.

Pang, B., Lee, L., and Vaithyanathan, S. (2002). Thumbs up?: Sentiment classification using machine learning techniques. In Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing, Volume 10, pages 79–86. Association for Computational Linguistics.