TRANSCRIPT
Sequence Modelling with Features:
Linear-Chain Conditional Random
Fields
COMP-599
Oct 3, 2016
Outline

• Sequence modelling: review and miscellaneous notes
• Hidden Markov models: shortcomings
• Generative vs. discriminative models
• Linear-chain CRFs
• Inference and learning algorithms with linear-chain CRFs
Hidden Markov Model

The graph specifies how the joint probability decomposes:

P(Q, O) = P(q_1) ∏_{t=1}^{T−1} P(q_{t+1} | q_t) ∏_{t=1}^{T} P(o_t | q_t)

[Trellis diagram: hidden states Q_1 … Q_5, each emitting an observation O_1 … O_5]

P(q_1): initial state probability
P(q_{t+1} | q_t): state transition probabilities
P(o_t | q_t): emission probabilities
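The factorization above can be computed directly. Here is a minimal sketch with a hypothetical two-state HMM (the tag names, vocabulary, and all probability values are invented for illustration):

```python
import numpy as np

# Toy 2-state HMM (hypothetical numbers, for illustration only).
# States: 0 = DT, 1 = NN; observations: 0 = "the", 1 = "dog".
pi = np.array([0.8, 0.2])              # initial state probabilities P(q_1)
A = np.array([[0.1, 0.9],              # transition probabilities P(q_{t+1} | q_t)
              [0.6, 0.4]])
B = np.array([[0.9, 0.1],              # emission probabilities P(o_t | q_t)
              [0.2, 0.8]])

def joint_prob(states, obs):
    """P(Q, O) = P(q_1) * prod_t P(q_{t+1} | q_t) * prod_t P(o_t | q_t)."""
    p = pi[states[0]]
    for t in range(1, len(states)):
        p *= A[states[t - 1], states[t]]
    for q, o in zip(states, obs):
        p *= B[q, o]
    return p

# P(Q = [DT, NN], O = ["the", "dog"]) = 0.8 * 0.9 * 0.9 * 0.8
print(joint_prob([0, 1], [0, 1]))  # 0.5184
```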
EM and Baum-Welch

EM finds a local optimum in P(O | θ).

You can show that after each step of EM:

P(O | θ^{k+1}) ≥ P(O | θ^k)

However, this is not necessarily a global optimum.
Dealing with Local Optima

Random restarts
• Train multiple models with different random initializations
• Model selection on development set to pick the best one

Biased initialization
• Bias your initialization using some external source of knowledge (e.g., external corpus counts, a clustering procedure, or expert knowledge about the domain)
• Further training will hopefully improve results
Caveats

Baum-Welch with no labelled data generally gives poor results, at least for linguistic structure (~40% accuracy, according to Johnson (2007)).

Semi-supervised learning: combine small amounts of labelled data with a larger corpus of unlabelled data.
In Practice

Per-token (i.e., per-word) accuracy results on the WSJ corpus:

  Most frequent tag baseline        ~90–94%
  HMMs (Brants, 2000)                96.5%
  Stanford tagger (Manning, 2011)    97.32%
Other Sequence Modelling Tasks

Chunking (a.k.a. shallow parsing)
• Find syntactic chunks in the sentence; not hierarchical
  [NP The chicken] [V crossed] [NP the road] [P across] [NP the lake].

Named-Entity Recognition (NER)
• Identify elements in text that correspond to some high-level categories (e.g., PERSON, ORGANIZATION, LOCATION)
  [ORG McGill University] is located in [LOC Montreal, Canada].
• Problem: need to detect spans of multiple words
First Attempt

Simply label words by their category:

  ORG     ORG   ORG    -   ORG        -   -       -  LOC
  McGill, UQAM, UdeM, and Concordia are located in Montreal.

What is the problem with this scheme?
IOB Tagging

Label whether a word is inside (I), outside (O), or at the beginning (B) of a span as well.

For n categories, 2n + 1 total tags.

  B-ORG  I-ORG      O  O       O  B-LOC     I-LOC
  McGill University is located in Montreal, Canada

  B-ORG   B-ORG B-ORG  O  B-ORG      O   O       O  B-LOC
  McGill, UQAM, UdeM, and Concordia are located in Montreal.
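Decoding IOB tags back into labelled spans is a simple scan. A minimal sketch (real NER toolkits also handle malformed sequences, e.g. an I- tag with no preceding B- tag or with a mismatched label):

```python
def iob_to_spans(tags):
    """Convert IOB tags (e.g. 'B-ORG', 'I-ORG', 'O') to labelled spans.

    Returns a list of (label, start, end) with end exclusive.
    """
    spans, start, label = [], None, None
    for i, tag in enumerate(tags):
        if tag.startswith("B-") or tag == "O":
            if label is not None:          # close the currently open span
                spans.append((label, start, i))
                start, label = None, None
            if tag.startswith("B-"):       # open a new span
                start, label = i, tag[2:]
        # an I- tag simply extends the currently open span
    if label is not None:                  # close a span ending at the last token
        spans.append((label, start, len(tags)))
    return spans

# Tags for: McGill University is located in Montreal, Canada
tags = ["B-ORG", "I-ORG", "O", "O", "O", "B-LOC", "I-LOC"]
print(iob_to_spans(tags))  # [('ORG', 0, 2), ('LOC', 5, 7)]
```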
Sequence Modelling with Linguistic Features
Shortcomings of Standard HMMs

How do we add more features to HMMs?

Features that might be useful for POS tagging:
• Word position within the sentence (1st, 2nd, last, …)
• Capitalization
• Word prefixes and suffixes (-tion, -ed, -ly, -er, re-, de-)
• Features that depend on more than the current word or the previous word
Possible to Do with HMMs

Add more emissions at each timestep:

[Diagram: one hidden state Q_1 emitting O_11 (word identity), O_12 (capitalization), O_13 (prefix feature), O_14 (word position)]

Clunky. Is there a better way to do this?
Discriminative Models

HMMs are generative models:
• They build a model of the joint probability distribution P(Q, O).
• Let's rename the variables: generative models specify P(X, Y; θ_gen).

If we are only interested in POS tagging, we can instead train a discriminative model:
• The model specifies P(Y | X; θ_disc).
• This is now a task-specific model for sequence labelling; we cannot use it to generate new samples of word and POS sequences.
Generative or Discriminative?

Naive Bayes:

P(y | x⃗) = P(y) ∏_i P(x_i | y) / P(x⃗)

Logistic regression:

P(y | x⃗) = 1 / (1 + e^{−(w_1 x_1 + w_2 x_2 + … + w_n x_n + b)})
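The logistic regression formula is a sigmoid applied to a linear score. A minimal sketch (the weights below are hypothetical, chosen so the score is zero and the model is maximally uncertain):

```python
import math

def logistic_p(x, w, b):
    """P(y = 1 | x) under logistic regression: sigmoid of the linear score."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-score))

# Score = 0.5 * 1.0 + (-0.25) * 2.0 + 0.0 = 0, so the probability is 0.5.
print(logistic_p([1.0, 2.0], [0.5, -0.25], 0.0))  # 0.5
```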
Discriminative Sequence Model

The parallel to an HMM in the discriminative case: linear-chain conditional random fields (linear-chain CRFs) (Lafferty et al., 2001)

P(Y | X) = (1 / Z(X)) exp( Σ_t Σ_k λ_k f_k(y_t, y_{t−1}, x_t) )

where the sums run over all time-steps t and all features k.

Z(X) is a normalization constant:

Z(X) = Σ_Y exp( Σ_t Σ_k λ_k f_k(y_t, y_{t−1}, x_t) )

where the outer sum runs over all possible sequences of hidden states Y.
Intuition

Standard HMM: a product of probabilities; these probabilities are defined over the identities of the states and words.
• Transition from state DT to NN: P(y_{t+1} = NN | y_t = DT)
• Emit word the from state DT: P(x_t = the | y_t = DT)

Linear-chain CRF: replace the probabilities in the product by numbers that are NOT probabilities, but exponentiated linear combinations of weights and feature values.
Features in CRFs

Standard HMM probabilities as CRF features:

• Transition from state DT to state NN:
  f_{DT→NN}(y_t, y_{t−1}, x_t) = 1(y_{t−1} = DT) 1(y_t = NN)
• Emit the from state DT:
  f_{DT→the}(y_t, y_{t−1}, x_t) = 1(y_t = DT) 1(x_t = the)
• Initial state is DT:
  f_{DT}(y_1, x_1) = 1(y_1 = DT)

Indicator function: 1(condition) = 1 if condition is true, 0 otherwise.
Features in CRFs

Additional features that may be useful:

• Word is capitalized:
  f_{cap}(y_t, y_{t−1}, x_t) = 1(y_t = ?) 1(x_t is capitalized)
• Word ends in -ed:
  f_{-ed}(y_t, y_{t−1}, x_t) = 1(y_t = ?) 1(x_t ends with ed)
• Exercise: propose more features
Inference with LC-CRFs

Dynamic programming still works: modify the forward and the Viterbi algorithms to work with the weight-feature products.

                      HMM                      LC-CRF
  Forward algorithm   P(O | θ)                 Z(X)
  Viterbi algorithm   argmax_Q P(Q, O | θ)     argmax_Y P(Y | X, θ)
Forward Algorithm for HMMs

Create trellis α_j(t) for j = 1 … N, t = 1 … T

  α_j(1) = π_j b_j(o_1)                          for j = 1 … N
  for t = 2 … T:
      for j = 1 … N:
          α_j(t) = Σ_{i=1}^{N} α_i(t−1) a_{ij} b_j(o_t)

  P(O | θ) = Σ_{j=1}^{N} α_j(T)

Runtime: O(N²T)
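The pseudocode above vectorizes naturally over states. A minimal sketch using the same hypothetical toy HMM as before (all numbers invented for illustration):

```python
import numpy as np

def forward_hmm(pi, A, B, obs):
    """Forward algorithm: returns P(O | theta) in O(N^2 T).

    pi[j]: initial probabilities; A[i, j]: transition i -> j;
    B[j, o]: emission of symbol o from state j; obs: list of symbol indices.
    """
    N, T = len(pi), len(obs)
    alpha = np.zeros((N, T))
    alpha[:, 0] = pi * B[:, obs[0]]                    # alpha_j(1)
    for t in range(1, T):
        # alpha_j(t) = sum_i alpha_i(t-1) a_ij, then multiply by b_j(o_t)
        alpha[:, t] = (alpha[:, t - 1] @ A) * B[:, obs[t]]
    return alpha[:, -1].sum()                          # sum_j alpha_j(T)

# Toy 2-state HMM with hypothetical numbers.
pi = np.array([0.8, 0.2])
A = np.array([[0.1, 0.9], [0.6, 0.4]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
print(forward_hmm(pi, A, B, [0, 1]))  # P(O = [0, 1]) = 0.5408
```

Summing the joint probability over all four state sequences by hand gives the same 0.5408, but the trellis does it without enumerating sequences.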
Forward Algorithm for LC-CRFs

Create trellis α_j(t) for j = 1 … N, t = 1 … T

  α_j(1) = exp( Σ_k λ_k f_k(y_1 = j, x_1) )      for j = 1 … N
  for t = 2 … T:
      for j = 1 … N:
          α_j(t) = Σ_{i=1}^{N} α_i(t−1) exp( Σ_k λ_k f_k(y_t = j, y_{t−1} = i, x_t) )

  Z(X) = Σ_{j=1}^{N} α_j(T)

Runtime: O(N²T)

The transition and emission probabilities are replaced by exponentials of weighted sums of features. Having Z(X) allows us to compute P(Y | X).
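A minimal sketch of this forward recursion, with the same hypothetical features as before, checked against brute-force enumeration of Z(X) (the check only works because the example is tiny):

```python
import itertools, math

TAGS = ["DT", "NN"]

def psi(y, y_prev, x, weights, feats):
    """Local factor exp(sum_k lambda_k f_k(y_t, y_{t-1}, x_t)); y_prev is None at t = 1."""
    return math.exp(sum(w * bool(f(y, y_prev, x)) for w, f in zip(weights, feats)))

def crf_forward_Z(xs, weights, feats):
    """Forward algorithm for a linear-chain CRF: returns Z(X) in O(N^2 T)."""
    alpha = {y: psi(y, None, xs[0], weights, feats) for y in TAGS}   # alpha_j(1)
    for x in xs[1:]:
        alpha = {y: sum(alpha[yp] * psi(y, yp, x, weights, feats) for yp in TAGS)
                 for y in TAGS}
    return sum(alpha.values())                                        # Z(X)

# Hypothetical indicator features, mirroring the HMM-as-CRF features.
feats = [
    lambda y, yp, x: yp == "DT" and y == "NN",
    lambda y, yp, x: y == "DT" and x == "the",
    lambda y, yp, x: yp is None and y == "DT",
]
weights = [1.5, 2.0, 1.0]
xs = ["the", "dog"]
Z = crf_forward_Z(xs, weights, feats)

# Brute-force check: sum exp(score) over all N^T tag sequences.
def seq_score(ys):
    s, prev = 0.0, None
    for y, x in zip(ys, xs):
        s += sum(w * bool(f(y, prev, x)) for w, f in zip(weights, feats))
        prev = y
    return s

Z_brute = sum(math.exp(seq_score(c)) for c in itertools.product(TAGS, repeat=len(xs)))
print(abs(Z - Z_brute) < 1e-9)  # True
```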
Viterbi Algorithm for HMMs

Create trellis δ_j(t) for j = 1 … N, t = 1 … T

  δ_j(1) = π_j b_j(o_1)                          for j = 1 … N
  for t = 2 … T:
      for j = 1 … N:
          δ_j(t) = max_i δ_i(t−1) a_{ij} b_j(o_t)

  Take max_j δ_j(T)

Runtime: O(N²T)
Viterbi Algorithm for LC-CRFs

Create trellis δ_j(t) for j = 1 … N, t = 1 … T

  δ_j(1) = exp( Σ_k λ_k f_k(y_1 = j, x_1) )      for j = 1 … N
  for t = 2 … T:
      for j = 1 … N:
          δ_j(t) = max_i δ_i(t−1) exp( Σ_k λ_k f_k(y_t = j, y_{t−1} = i, x_t) )

  Take max_j δ_j(T)

Runtime: O(N²T)

Remember that we need backpointers to recover the best tag sequence.
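A minimal sketch of LC-CRF Viterbi with backpointers, reusing the same hypothetical features and weights as the earlier examples:

```python
import math

TAGS = ["DT", "NN"]

def viterbi_crf(xs, weights, feats):
    """Viterbi for a linear-chain CRF: returns the best tag sequence."""
    def psi(y, yp, x):
        return math.exp(sum(w * bool(f(y, yp, x)) for w, f in zip(weights, feats)))

    delta = {y: psi(y, None, xs[0]) for y in TAGS}      # delta_j(1)
    backptrs = []
    for x in xs[1:]:
        new_delta, bp = {}, {}
        for y in TAGS:
            # delta_j(t) = max_i delta_i(t-1) * psi(y_t = j, y_{t-1} = i, x_t)
            best_prev = max(TAGS, key=lambda yp: delta[yp] * psi(y, yp, x))
            new_delta[y] = delta[best_prev] * psi(y, best_prev, x)
            bp[y] = best_prev
        backptrs.append(bp)
        delta = new_delta

    # Follow backpointers from the best final state.
    y = max(TAGS, key=lambda t: delta[t])
    path = [y]
    for bp in reversed(backptrs):
        y = bp[y]
        path.append(y)
    return list(reversed(path))

feats = [
    lambda y, yp, x: yp == "DT" and y == "NN",   # hypothetical features
    lambda y, yp, x: y == "DT" and x == "the",
    lambda y, yp, x: yp is None and y == "DT",
]
print(viterbi_crf(["the", "dog"], [1.5, 2.0, 1.0], feats))  # ['DT', 'NN']
```

In practice the recursion is done in log space (sums of scores instead of products of exponentials) to avoid overflow; the exponentiated form above matches the slide's pseudocode.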
Training LC-CRFs

Unlike for HMMs, there is no analytic MLE solution.

Use an iterative method to improve the data likelihood:
• Gradient descent
• A version of Newton's method to find where the gradient is 0
Convexity

Fortunately, L(θ) is a concave function (equivalently, its negation is a convex function). That means we will find the global maximum of L(θ) with gradient ascent (equivalently, gradient descent on the negative log likelihood).
Gradient Ascent

Walk in the direction of the gradient to maximize L(θ):
• a.k.a. gradient descent on a loss function

θ_new = θ_old + γ ∇L(θ)

γ is a learning rate that specifies how large a step to take.

There are more sophisticated ways to do this update:
• Conjugate gradient
• L-BFGS (approximates second-derivative information)
Gradient Descent Summary

Descent vs. ascent: by convention, think of the problem as a minimization problem, and minimize the negative log likelihood.

  Initialize θ = θ_1, θ_2, …, θ_n randomly
  Do for a while:
      Compute ∇L(θ), which requires dynamic programming (i.e., the forward algorithm)
      θ ← θ − γ ∇L(θ)
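The loop above can be sketched generically. For a CRF, `grad` would run the forward algorithm; here a simple stand-in quadratic objective (not a CRF, purely hypothetical) shows the mechanics:

```python
def gradient_descent(grad, theta, gamma=0.1, steps=200):
    """Minimal gradient-descent loop matching the pseudocode above.

    grad: function returning the gradient of the objective at theta;
    for a CRF this is where the forward algorithm would be called.
    """
    for _ in range(steps):
        g = grad(theta)
        theta = [t - gamma * gi for t, gi in zip(theta, g)]
    return theta

# Stand-in objective: f(a, b) = (a - 3)^2 + (b + 1)^2,
# with gradient (2(a - 3), 2(b + 1)) and minimum at (3, -1).
grad_f = lambda th: [2 * (th[0] - 3), 2 * (th[1] + 1)]
print(gradient_descent(grad_f, [0.0, 0.0]))  # close to [3.0, -1.0]
```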
Gradient of Log-Likelihood

Find the gradient of the log likelihood of the training corpus:

L(θ) = log ∏_i P(Y^(i) | X^(i))
Interpretation of Gradient

The overall gradient with respect to λ_k is the difference between:

Σ_i Σ_t f_k(y_t^(i), y_{t−1}^(i), x_t^(i))

the empirical count of feature f_k in the training corpus, and:

Σ_i Σ_t Σ_{y,y′} f_k(y, y′, x_t^(i)) P(y, y′ | X^(i))

the expected count of f_k as predicted by the current model.
Interpretation of Gradient

When the corpus likelihood is maximized, the gradient is zero, so the difference is zero.

Intuitively, this means that finding the parameter estimate by gradient descent is equivalent to telling our model to predict the features with the same distribution as they have in the gold standard.
Regularization

To avoid overfitting, we can encourage the weights to be close to zero.

Add a term to the corpus log likelihood:

L′(θ) = log ∏_i P(Y^(i) | X^(i)) − Σ_k λ_k² / (2σ²)

σ controls the amount of regularization. This results in an extra term in the gradient with respect to λ_k:

−λ_k / σ²
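The penalty and its gradient contribution can be checked numerically. A small sketch with hypothetical weights (the extra term is negative because the penalty is subtracted from the objective):

```python
def penalty(lams, sigma):
    """L2 penalty subtracted from the log likelihood: sum_k lam_k^2 / (2 sigma^2)."""
    return sum(l * l for l in lams) / (2 * sigma ** 2)

def penalty_grad(lams, sigma):
    """Contribution to the objective's gradient: -lam_k / sigma^2 per weight."""
    return [-l / sigma ** 2 for l in lams]

# Finite-difference check on the first weight.
lams, sigma, eps = [2.0, -1.0], 1.5, 1e-6
numeric = -(penalty([lams[0] + eps, lams[1]], sigma) - penalty(lams, sigma)) / eps
print(numeric, penalty_grad(lams, sigma)[0])  # both close to -2.0 / 1.5**2
```

Note that a smaller σ pulls the weights more strongly toward zero.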
Stochastic Gradient Descent

In the standard version of the algorithm, the gradient is computed over the entire training corpus:
• Weights are updated only once per iteration through the training corpus.

Alternative: calculate the gradient over a small mini-batch of the training corpus and update the weights then:
• Many weight updates per iteration through the training corpus
• Usually results in much faster convergence to the final solution, without loss in performance
Stochastic Gradient Descent

  Initialize θ = θ_1, θ_2, …, θ_n randomly
  Do for a while:
      Randomize the order of the samples in the training corpus
      For each mini-batch in the training corpus:
          Compute ∇L′(θ) over this mini-batch
          θ ← θ − γ ∇L′(θ)
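The mini-batch loop can be sketched on a stand-in objective (1-D least squares rather than a CRF; all data below is synthetic and for illustration only):

```python
import random

def sgd(grad_on_batch, data, theta, gamma=0.05, batch_size=2, epochs=300, seed=0):
    """Mini-batch SGD, following the pseudocode above: shuffle each epoch,
    then make one update per mini-batch instead of one per full pass."""
    rng = random.Random(seed)
    data = list(data)
    for _ in range(epochs):
        rng.shuffle(data)
        for i in range(0, len(data), batch_size):
            g = grad_on_batch(theta, data[i:i + batch_size])
            theta = theta - gamma * g
    return theta

# Stand-in objective: mean over a batch of (theta * x - y)^2,
# whose gradient is the batch mean of 2 x (theta * x - y).
def grad_batch(theta, batch):
    return sum(2 * x * (theta * x - y) for x, y in batch) / len(batch)

data = [(x, 2.0 * x) for x in [1.0, 2.0, 3.0, 4.0]]   # true theta = 2
print(sgd(grad_batch, data, theta=0.0))  # close to 2.0
```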