maximum entropy models - csee.umbc.edu · we’ve only developed binary classifiers so far…...
Maximum Entropy Models/Logistic Regression
CMSC 678
UMBC
Recap from last time…
Central Question: How Well Are We Doing?
The task: what kind of problem are you solving?
Classification
• Precision, Recall, F1
• Accuracy
• Log-loss
• ROC-AUC
• …
Regression
• (Root) Mean Square Error
• Mean Absolute Error
• …
Clustering
• Mutual Information
• V-score
• …
This does not have to be the same thing as the loss function you optimize
Rule #1
We’ve only developed binary classifiers so far…
Option 1: Develop a multi-class version
Option 2: Build a one-vs-all (OvA) classifier
Option 3: Build an all-vs-all (AvA) classifier
(there can be others)
Which option you choose is problem-dependent:
1. Why might you want to use option 1 or options OvA/AvA?
2. What are the benefits of OvA vs. AvA?
3. What if you start with a balanced dataset, e.g., 100 instances per class?
Some Classification Metrics
Accuracy
Precision
Recall
AUC (Area Under Curve)
F1
Confusion Matrix (Correct Value vs. Guessed Value)
# # #
# # #
# # #
Trade-off and weight
Different ways of averaging in a multi-class & multi-label setting
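The metrics listed above can all be read off a confusion matrix. The following is a minimal sketch, not code from the course: it assumes rows index the guessed value and columns index the correct value (the slide does not say which is which), and the function names are made up for illustration.

```python
import numpy as np

def per_class_metrics(confusion):
    """Assumed layout: confusion[g, c] = items guessed as class g whose correct class is c."""
    confusion = np.asarray(confusion, dtype=float)
    true_pos = np.diag(confusion)
    guessed = confusion.sum(axis=1)   # row sums: how often each class was guessed
    correct = confusion.sum(axis=0)   # column sums: how often each class is the correct value
    zeros = np.zeros_like(guessed)
    precision = np.divide(true_pos, guessed, out=zeros.copy(), where=guessed > 0)
    recall = np.divide(true_pos, correct, out=zeros.copy(), where=correct > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom, out=zeros.copy(), where=denom > 0)
    accuracy = true_pos.sum() / confusion.sum()
    return accuracy, precision, recall, f1

def macro_and_micro_f1(confusion):
    """Two ways of averaging F1 across classes."""
    accuracy, _, _, f1 = per_class_metrics(confusion)
    macro = f1.mean()   # every class counts equally
    micro = accuracy    # single-label multi-class: pooled counts give accuracy
    return macro, micro

confusion = [[5, 1, 0],
             [2, 3, 1],
             [0, 1, 7]]
accuracy, precision, recall, f1 = per_class_metrics(confusion)
macro, micro = macro_and_micro_f1(confusion)
```

Macro averaging treats rare and frequent classes alike, while micro averaging is dominated by the frequent ones, which is one reason the choice of averaging matters on imbalanced data.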
Outline
Log-Linear (Maximum Entropy) Models
Basic Modeling
Connections to other techniques (“… by any other name…”)
Objective to optimize
Regularization
Maximum Entropy (Log-linear) Models
$p(y \mid x) \propto \exp(\theta^T f(x, y))$
“model the posterior probabilities of the K classes via linear functions in θ, while at the same time ensuring that they sum to one and remain in [0, 1]” ~ Ch 4.4
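As a worked form of the proportionality above: dividing by a normalizer turns it into an equality. The symbol Z(x) is the one the later slides use for this normalizer.

```latex
p(y \mid x) = \frac{\exp(\theta^T f(x, y))}{Z(x)},
\qquad
Z(x) = \sum_{y'} \exp(\theta^T f(x, y'))
```

The exponential keeps every term non-negative, and dividing by Z(x) makes the values sum to one over the labels, which matches the two requirements in the quote.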
Document Classification
Observed document: Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region.
Label: ATTACK
Q: What features of this document could indicate an ATTACK?
Document Classification
Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region.
ATTACK
• # killed:
• Type:
• Perp:
attack
ATTACK
Document Classification
ATTACK
Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region.
there could be many relevant clues
Features
The “clues” that help our system make its decision
Apply a vector of features $f(🗎, y) = (f_1(🗎, y), \ldots, f_K(🗎, y))$ to a given document 🗎 and possible label y
Each feature function $f_k$ can take any real value:
• binary
• count-based
• likelihood
Features that don’t “fire” don’t apply to the pair: $f_k(🗎, y) = 0$
…
f_{fatally shot, ATTACK}(🗎, ATTACK)
f_{seriously wounded, ATTACK}(🗎, ATTACK)
f_{Shining Path, ATTACK}(🗎, ATTACK)
f_{happy cat, ATTACK}(🗎, ATTACK)
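One possible way to realize such feature functions is sketched below. It assumes binary features that fire when a phrase occurs in the document and the proposed label matches the feature's label; the helper names (`make_phrase_label_feature`, `feature_vector`) and the case-insensitive substring match are illustrative choices, not something the slides specify.

```python
def make_phrase_label_feature(phrase, label):
    """Build a binary feature f_{phrase, label}(document, y)."""
    def feature(document, y):
        # fires (value 1.0) only if the proposed label matches and the phrase occurs
        return 1.0 if (y == label and phrase.lower() in document.lower()) else 0.0
    return feature

PHRASES = ["fatally shot", "seriously wounded", "Shining Path", "happy cat"]
LABELS = ["ATTACK", "TECH"]
FEATURES = [make_phrase_label_feature(p, l) for l in LABELS for p in PHRASES]

def feature_vector(document, y):
    """f(document, y) = (f_1(document, y), ..., f_K(document, y))."""
    return [f(document, y) for f in FEATURES]

doc = ("Three people have been fatally shot, and five people, including a mayor, "
       "were seriously wounded as a result of a Shining Path attack today against "
       "a community in Junin department, central Peruvian mountain region.")

attack_vector = feature_vector(doc, "ATTACK")
tech_vector = feature_vector(doc, "TECH")
```

With the label ATTACK proposed, only the ATTACK-specific features can fire; the "happy cat" feature stays at 0 for either label because the phrase never occurs in the document.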
Features: Score and Combine Our Possibilities
define for each key phrase/clue...
θ_{fatally shot, ATTACK}(🗎, ATTACK)
θ_{seriously wounded, ATTACK}(🗎, ATTACK)
θ_{Shining Path, ATTACK}(🗎, ATTACK)
θ_{happy cat, ATTACK}(🗎, ATTACK)
…
… and for each label
θ_{fatally shot, TECH}(🗎, ATTACK)
θ_{seriously wounded, TECH}(🗎, ATTACK)
θ_{Shining Path, TECH}(🗎, ATTACK)
θ_{happy cat, TECH}(🗎, ATTACK)
…
Remember: each θ_{w,l}(🗎, y) is actually computed as θ_{w,l} * f_{w,l}(🗎, y)
Not all of these will be relevant
Each of these scored features describes how “good” a particular phrase is for a given document type if the provided document 🗎 has a proposed type
Score and Combine Our Possibilities
θ1(fatally shot, ATTACK)
θ2(seriously wounded, ATTACK)
θ3(Shining Path, ATTACK)
…
Weight each of these: score how “important” each feature (clue) is
Shortcut notation: focus only on the features that “fire”
Q: How many features are there?
A: As many as you want there to be (but be careful of underfitting/overfitting)
Score and Combine Our Possibilities
θ1(fatally shot, ATTACK)
θ2(seriously wounded, ATTACK)
θ3(Shining Path, ATTACK)
…
COMBINE: posterior probability of ATTACK
Weight each of these: score how “important” each feature (clue) is
Scoring Our Possibilities
🗎: Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region.
score(🗎, ATTACK) = θ1(fatally shot, ATTACK) + θ2(seriously wounded, ATTACK) + θ3(Shining Path, ATTACK) + …
our linear regression model
Maxent Modeling
p(ATTACK | 🗎) ∝ SNAP(score(🗎, ATTACK))
What function…
operates on any real number?
is never less than 0?
f(x) = exp(x)
Maxent Modeling
p(ATTACK | 🗎) ∝ exp(score(🗎, ATTACK))
Maxent Modeling
p(ATTACK | 🗎) ∝ exp(θ1(fatally shot, ATTACK) + θ2(seriously wounded, ATTACK) + θ3(Shining Path, ATTACK) + …)
this is assuming binary features, but they don’t have to be
Maxent Modeling
p(ATTACK | 🗎) ∝ exp(weight1 * f1(fatally shot, ATTACK) + weight2 * f2(seriously wounded, ATTACK) + weight3 * f3(Shining Path, ATTACK) + …)
Maxent Modeling
p(ATTACK | 🗎) = (1/Z) exp(weight1 * f1(fatally shot, ATTACK) + weight2 * f2(seriously wounded, ATTACK) + weight3 * f3(Shining Path, ATTACK) + …)
Q: How do we define Z?
Normalization for Classification
Z = Σ_{label y} exp(weight1 * f1(fatally shot, y) + weight2 * f2(seriously wounded, y) + weight3 * f3(Shining Path, y) + …)
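The normalization above can be sketched in a few lines. This is an illustrative version only: the weight values are invented, binary features are assumed (so the score is just the sum of the weights of the clues that fire), and `maxent_posterior` is a hypothetical helper name.

```python
import math

def maxent_posterior(weights, fired_clues, labels):
    """weights: dict mapping (clue, label) -> weight; fired_clues: clues found in the document."""
    # score(document, y): sum of weights of the (clue, y) features that fire
    scores = {y: sum(weights.get((clue, y), 0.0) for clue in fired_clues) for y in labels}
    # subtracting the largest score leaves the ratios unchanged and avoids overflow in exp
    top = max(scores.values())
    unnormalized = {y: math.exp(s - top) for y, s in scores.items()}
    Z = sum(unnormalized.values())  # normalizer (of the shifted scores): sum over every label y
    return {y: u / Z for y, u in unnormalized.items()}

# Invented weights, for illustration only.
weights = {
    ("fatally shot", "ATTACK"): 2.0,
    ("seriously wounded", "ATTACK"): 1.5,
    ("Shining Path", "ATTACK"): 3.0,
    ("fatally shot", "TECH"): -1.0,
}
labels = ["ATTACK", "TECH"]
posterior = maxent_posterior(
    weights, ["fatally shot", "seriously wounded", "Shining Path"], labels)
no_clues = maxent_posterior(weights, [], labels)
```

When no clue fires, every label gets score 0 and the posterior is uniform, which is the situation the next question on the slides asks about.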
Q: What if none of our features apply?
Guiding Principle for Maximum Entropy Models
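“[The log-linear estimate] is the least biased estimate possible on the given information; i.e., it is maximally noncommittal with regard to missing information.”
Edwin T. Jaynes, 1957
exp(θ · f) ➔ exp(θ · 0) = 1: when no feature fires, every label gets the same unnormalized score, so the model falls back to a uniform distribution over the labels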
https://www.csee.umbc.edu/courses/graduate/678/spring19/loglin-tutorial
Lesson 1: Basic Feature Design
Ingredients for classification
Inject your knowledge into a learning system
Problem specific
Difficult to learn from bad ones
Feature representation
Training data: labeled examples
Model
Courtesy Hamed Pirsiavash
What features would you extract to…
• distinguish a picture of me from a picture of someone else?
• determine whether a sentence is grammatical or not?
• distinguish cancerous cells from normal cells?
Courtesy Hamed Pirsiavash
Outline
Log-Linear (Maximum Entropy) Models
Basic Modeling
Connections to other techniques (“… by any other name…”)
Objective to optimize
Regularization
Connections to Other Techniques
Log-Linear Models
• as statistical regression: (Multinomial) logistic regression, Softmax regression
“Solution” 1: A Simple Probabilistic (Linear*) Classifier
turn responses into probabilities
decision rule: $\hat{y}_i = \begin{cases} 0, & \sigma(\mathbf{w}^T \mathbf{x}_i + b) < 0.5 \\ 1, & \sigma(\mathbf{w}^T \mathbf{x}_i + b) \geq 0.5 \end{cases}$
loss function: $\ell = \mathbb{1}[y_i \, p(\hat{y}_i = 1 \mid x_i) < 0]$
minimize posterior 0-1 loss: $\min_{\mathbf{w}} \sum_i \mathbb{E}_{\hat{y}_i}\left[\mathbb{1}[y_i \, p(\hat{y}_i = 1 \mid x_i) < 0]\right] = \max_{\mathbf{w}} \sum_i p(\hat{y}_i = y_i \mid x_i)$
why MAP classifiers are reasonable
Plot from https://towardsdatascience.com/multi-layer-neural-networks-with-sigmoid-function-deep-learning-for-rookies-2-bf464f09eb7f
*linear not strictly required
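The decision rule on this slide can be sketched as follows. The weights, bias, and inputs are arbitrary illustrative values, and `predict` is a hypothetical helper name.

```python
import numpy as np

def sigmoid(z):
    """Squash any real-valued response into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict(w, b, X):
    """Return hard 0/1 decisions and the probabilities sigma(w^T x_i + b) behind them."""
    probs = sigmoid(X @ w + b)
    return (probs >= 0.5).astype(int), probs

w = np.array([1.0, -1.0])      # illustrative weights
b = 0.0                        # illustrative bias
X = np.array([[2.0, 1.0],      # w^T x + b =  1
              [0.0, 3.0],      # w^T x + b = -3
              [1.0, 1.0]])     # w^T x + b =  0
labels, probs = predict(w, b, X)
```

Because the sigmoid is monotonic and equals 0.5 at zero, thresholding the probability at 0.5 gives the same decisions as thresholding the linear response at 0.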
Connections to Other Techniques
Log-Linear Models
• as statistical regression: (Multinomial) logistic regression, Softmax regression
• based in information theory: Maximum Entropy models (MaxEnt)
Connections to Other Techniques
Log-Linear Models
• as statistical regression: (Multinomial) logistic regression, Softmax regression
• based in information theory: Maximum Entropy models (MaxEnt)
• a form of: Generalized Linear Models
Generalized Linear Models
$y = \sum_k \theta_k x_k + b$
response linear* wrt parameters
*affine is okay
the response can be a general (transformed) version of another response
logistic regression: $\log \frac{p(x = i)}{p(x = K)} = \sum_k \theta_k f(x_k, i) + b$
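As a short worked step connecting this to the log-linear model: starting from the conditional form used earlier in the slides, $p(y \mid x) = \exp(\theta^T f(x, y)) / Z(x)$, the log-odds of class i against a reference class K is linear in θ because the normalizer cancels. The notation below follows that earlier conditional form rather than the p(x = i) shorthand on this slide.

```latex
\log \frac{p(y = i \mid x)}{p(y = K \mid x)}
  = \left[\theta^T f(x, i) - \log Z(x)\right] - \left[\theta^T f(x, K) - \log Z(x)\right]
  = \theta^T \left(f(x, i) - f(x, K)\right)
```

So the transformed response (the log-odds) is the linear quantity, which is the sense in which logistic regression is a generalized linear model.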
Connections to Other Techniques
Log-Linear Models
• as statistical regression: (Multinomial) logistic regression, Softmax regression
• based in information theory: Maximum Entropy models (MaxEnt)
• a form of: Generalized Linear Models
• viewed as: Discriminative Naïve Bayes
• to be cool today :) : Very shallow (sigmoidal) neural nets
Outline
Log-Linear (Maximum Entropy) Models
Basic Modeling
Connections to other techniques (“… by any other name…”)
Objective to optimize
Regularization
Version 1: Minimize Cross Entropy Loss
$\ell_{\text{xent}}(y^*, y) = -\sum_k y^*[k] \log p(y = k)$
$y^* = (0, 0, \ldots, 1, \ldots, 0)$: one-hot vector; index of “1” indicates correct value
$\ell_{\text{xent}}(y^*, p(y))$: loss uses y (random variable), or model’s probabilities
minimize xent loss → maximize log-likelihood (A2, Q2)
objective is convex
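A small sketch of the cross-entropy computation, showing that with a one-hot target it reduces to the negative log-probability of the correct class (which is why minimizing it maximizes log-likelihood). The probabilities are made-up values and `cross_entropy` is an illustrative helper.

```python
import numpy as np

def cross_entropy(one_hot, probs):
    """-sum_k y*[k] log p(y = k) for a single example."""
    one_hot = np.asarray(one_hot, dtype=float)
    probs = np.asarray(probs, dtype=float)
    # only entries with y*[k] > 0 contribute; skipping the rest avoids 0 * log(0)
    mask = one_hot > 0
    return float(-np.sum(one_hot[mask] * np.log(probs[mask])))

y_star = [0, 0, 1, 0]                 # correct value is index 2
model_probs = [0.1, 0.2, 0.6, 0.1]    # illustrative model probabilities p(y = k)
loss = cross_entropy(y_star, model_probs)
```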
Version 2: Maximize (Full/Log) Likelihood
$\prod_i p_\theta(y_i \mid x_i) \propto \prod_i \exp(\theta^T f(x_i, y_i))$
These values can have very small magnitude ➔ underflow
Differentiating this product could be a pain
Version 2: Maximize Log-Likelihood
$\log \prod_i p_\theta(y_i \mid x_i) = \sum_i \log p_\theta(y_i \mid x_i) = \sum_i \left[\theta^T f(x_i, y_i) - \log Z(x_i)\right]$
Wide range of (negative) numbers
Sums are more stable
Differentiating this becomes nicer (even though Z depends on θ)
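To illustrate why the sum of logs is preferred, the sketch below compares the raw product of probabilities with the log-likelihood on synthetic data. It assumes `scores[i, y]` already holds θᵀf(x_i, y); the random scores and labels are placeholders, not real data.

```python
import numpy as np

def log_softmax(scores):
    """Row-wise theta^T f(x_i, y) - log Z(x_i), computed with the max-shift trick."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def log_likelihood(scores, gold):
    """sum_i log p(y_i | x_i), where gold[i] is the index of the correct label y_i."""
    log_probs = log_softmax(scores)
    return float(log_probs[np.arange(len(gold)), gold].sum())

rng = np.random.default_rng(0)
scores = rng.normal(size=(5000, 3))      # placeholder scores for 5000 examples, 3 labels
gold = rng.integers(0, 3, size=5000)     # placeholder correct labels

probs_of_gold = np.exp(log_softmax(scores))[np.arange(5000), gold]
product = float(np.prod(probs_of_gold))  # the full likelihood: underflows to 0.0
total = log_likelihood(scores, gold)     # a large negative but perfectly finite number
```

The product of thousands of values below 1 falls under the smallest representable double and is rounded to zero, while the log-likelihood stays well within range.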
Log-Likelihood Gradient
Each component k is the difference between:
the total value of feature $f_k$ in the training data
and
the total value the current model $p_\theta$ thinks it computes for feature $f_k$: $\sum_i \mathbb{E}_p[f(x_i, y')]$
“Moment Matching” A1 Q4, Eq-1 (what were the feature functions)?
https://www.csee.umbc.edu/courses/graduate/678/spring19/loglin-tutorial
Lesson 6: Gradient Optimization
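Log-Likelihood Gradient Derivation
$\nabla_\theta F(\theta) = \nabla_\theta \sum_i \left[\theta^T f(x_i, y_i) - \log Z(x_i)\right]$
$= \sum_i f(x_i, y_i) - \nabla_\theta \sum_i \log Z(x_i)$
where $Z(x_i) = \sum_{y'} \exp(\theta \cdot f(x_i, y'))$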
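Log-Likelihood Gradient Derivation
$\nabla_\theta F(\theta) = \nabla_\theta \sum_i \left[\theta^T f(x_i, y_i) - \log Z(x_i)\right]$
$= \sum_i f(x_i, y_i) - \sum_i \sum_{y'} \frac{\exp(\theta^T f(x_i, y'))}{Z(x_i)} f(x_i, y')$
use the (calculus) chain rule: $\frac{\partial}{\partial \theta} \log g(h(\theta)) = \frac{\partial \log g}{\partial h(\theta)} \frac{\partial h}{\partial \theta}$
$\frac{\exp(\theta^T f(x_i, y'))}{Z(x_i)}$: scalar p(y’ | xi)
$f(x_i, y')$: vector of functions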
Log-Likelihood Gradient Derivation
$\nabla_\theta F(\theta) = \nabla_\theta \sum_i \left[\theta^T f(x_i, y_i) - \log Z(x_i)\right]$
$= \sum_i f(x_i, y_i) - \sum_i \sum_{y'} \frac{\exp(\theta^T f(x_i, y'))}{Z(x_i)} f(x_i, y')$
Do we want these to fully match?
What does it mean if they do?
What if we have missing values in our data?
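The "observed minus expected" form of the gradient can be sketched and checked numerically. This is a sketch under stated assumptions: the array layout `feats[i, y, :] = f(x_i, y)` and the function name are illustrative choices, and the data is random.

```python
import numpy as np

def log_likelihood_and_gradient(theta, feats, gold):
    """feats[i, y] is the feature vector f(x_i, y); gold[i] is the index of y_i."""
    scores = feats @ theta                                  # theta^T f(x_i, y), shape (N, Y)
    top = scores.max(axis=1)
    log_Z = top + np.log(np.exp(scores - top[:, None]).sum(axis=1))  # log Z(x_i)
    rows = np.arange(len(gold))
    value = float((scores[rows, gold] - log_Z).sum())       # F(theta)
    probs = np.exp(scores - log_Z[:, None])                 # p(y' | x_i)
    observed = feats[rows, gold].sum(axis=0)                # sum_i f(x_i, y_i)
    expected = np.einsum("ny,nyk->k", probs, feats)         # sum_i sum_y' p(y'|x_i) f(x_i, y')
    return value, observed - expected

rng = np.random.default_rng(1)
feats = rng.normal(size=(6, 3, 4))   # 6 examples, 3 labels, 4 features (random placeholders)
gold = rng.integers(0, 3, size=6)
theta = rng.normal(size=4)
value, grad = log_likelihood_and_gradient(theta, feats, gold)

# central finite differences as an independent check of each gradient component
eps = 1e-6
numeric = np.zeros(4)
for k in range(4):
    step = np.zeros(4)
    step[k] = eps
    up, _ = log_likelihood_and_gradient(theta + step, feats, gold)
    down, _ = log_likelihood_and_gradient(theta - step, feats, gold)
    numeric[k] = (up - down) / (2 * eps)
```

The gradient is zero exactly when the observed and expected feature totals match, which is the moment-matching condition the questions above refer to.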
Outline
Log-Linear (Maximum Entropy) Models
Basic Modeling
Connections to other techniques (“… by any other name…”)
Objective to optimize
Regularization
Weight regularization R(w)
Nice if R(w) is convex
Small weights regularization
Sparsity regularization (not convex)
Family of “p-norm” regularization
• convex: $p \geq 1$
• not convex: $0 \leq p < 1$
Courtesy Hamed Pirsiavash
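The formulas for these regularizers did not survive in the transcript, so the sketch below uses the standard textbook definitions as an assumption: squared 2-norm for small weights, a count of non-zero entries for sparsity, and $(\sum_d |w_d|^p)^{1/p}$ for the p-norm family. Function names are illustrative.

```python
import numpy as np

def small_weights_penalty(w):
    """Assumed form: sum of squared weights (squared 2-norm)."""
    return float(np.sum(np.square(w)))

def sparsity_penalty(w):
    """Assumed form: number of non-zero weights."""
    return int(np.count_nonzero(w))

def p_norm_penalty(w, p):
    """Assumed form: (sum_d |w_d|^p)^(1/p), defined here for p > 0."""
    if p <= 0:
        raise ValueError("use sparsity_penalty for the p = 0 (counting) case")
    return float(np.sum(np.abs(w) ** p) ** (1.0 / p))

w = np.array([3.0, 0.0, -4.0])
l2_squared = small_weights_penalty(w)   # 9 + 16
nonzeros = sparsity_penalty(w)          # two non-zero entries
l2 = p_norm_penalty(w, 2)
l1 = p_norm_penalty(w, 1)

# For p < 1 the penalty at a midpoint can exceed the average of the endpoints,
# which is what "not convex" means here.
a, b = np.array([1.0, 0.0]), np.array([0.0, 1.0])
midpoint_penalty = p_norm_penalty((a + b) / 2, 0.5)
endpoint_average = (p_norm_penalty(a, 0.5) + p_norm_penalty(b, 0.5)) / 2
```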
Contours of p-norms
Counting non-zeros:
examine shape (slope) of surfaces to determine effect on the regularized parameters
http://en.wikipedia.org/wiki/Lp_space
Courtesy Hamed Pirsiavash
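A Simple Regularized Linear Classifier
decision rule: $\hat{y}_i = \begin{cases} 0, & \mathbf{w}^T \mathbf{x}_i < 0 \\ 1, & \mathbf{w}^T \mathbf{x}_i \geq 0 \end{cases}$
loss function: $\ell = \mathbb{1}[y_i \mathbf{w}^T \mathbf{x}_i < 0]$
fewest mistakes on training
regularize toward a simpler model
hyperparameter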
https://www.csee.umbc.edu/courses/graduate/678/spring19/loglin-tutorial
Lesson 8: Regularization
Understanding Conditioning
$p(y \mid x) \propto \exp(\theta \cdot f(x))$
Is this a good posterior classifier? (no)
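One way to see why the answer is "no": if the features depend only on x, every label receives the same score, and normalizing over labels cancels it out. The sketch below uses an arbitrary score value and a hypothetical `normalize_over_labels` helper.

```python
import math

def normalize_over_labels(score_by_label):
    """Turn per-label scores into p(y | x) by exponentiating and normalizing."""
    top = max(score_by_label.values())
    unnormalized = {y: math.exp(s - top) for y, s in score_by_label.items()}
    Z = sum(unnormalized.values())
    return {y: u / Z for y, u in unnormalized.items()}

labels = ["ATTACK", "TECH"]
score_from_x_only = 4.2   # theta . f(x): arbitrary value, identical for every label
flat = normalize_over_labels({y: score_from_x_only for y in labels})

# With features f(x, y) that also look at the label, the scores can differ per label.
informative = normalize_over_labels({"ATTACK": 4.2, "TECH": -1.0})
```

Whatever the value of θ · f(x), the first posterior is uniform, so the model cannot prefer one label over another unless the features also depend on y.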
https://www.csee.umbc.edu/courses/graduate/678/spring19/loglin-tutorial
Lesson 11: Global vs. Conditional Modeling
Connections to Other Techniques
Log-Linear Models
• as statistical regression: (Multinomial) logistic regression, Softmax regression
• based in information theory: Maximum Entropy models (MaxEnt)
• a form of: Generalized Linear Models
• viewed as: Discriminative Naïve Bayes
• to be cool today :) : Very shallow (sigmoidal) neural nets
Outline
Log-Linear (Maximum Entropy) Models
Basic Modeling
Connections to other techniques (“… by any other name…”)
Objective to optimize
Regularization