Final review
LING 572, Fei Xia
03/07/06
Misc
• Parts 3 and 4 were due at 6am today.
• Presentation: email me the slides by 6am on 3/9
• Final report: email me by 6am on 3/14.
• Group meetings: 1:30-4:00pm on 3/16.
Outline
• Main topics
• Applying to NLP tasks
• Tricks
Main topics
• Supervised learning
– Decision tree
– Decision list
– TBL
– MaxEnt
– Boosting
• Semi-supervised learning
– Self-training
– Co-training
– EM
– Co-EM
Main topics (cont)
• Unsupervised learning
– The EM algorithm
– The EM algorithm for PM models
• Forward-backward
• Inside-outside
• IBM models for MT
• Others
– Two dynamic models: FSA and HMM
– Re-sampling: bootstrap
– System combination
– Bagging
Main topics (cont)
• Homework
– Hw1: FSA and HMM
– Hw2: DT, DL, CNF, DNF, and TBL
– Hw3: Boosting
• Project:
– P1: Trigram (learn to use Carmel, relation between HMM and FSA)
– P2: TBL
– P3: MaxEnt
– P4: Bagging, boosting, system combination, SSL
Supervised learning
A classification problem

District  | House type    | Income | Previous Customer | Outcome
Suburban  | Detached      | High   | No                | Nothing
Suburban  | Semi-detached | High   | Yes               | Respond
Rural     | Semi-detached | Low    | No                | Respond
Urban     | Detached      | Low    | Yes               | Nothing
…
Classification and estimation problems
• Given:
– x: input attributes
– y: the goal
– training data: a set of (x, y)
• Predict y given a new x:
– y is a discrete variable → classification problem
– y is a continuous variable → estimation problem
Five ML methods
• Decision tree
• Decision list
• TBL
• Boosting
• MaxEnt
Decision tree
• Modeling: tree representation
• Training: top-down induction, greedy algorithm
• Decoding: find the path from root to a leaf node, where the tests along the path are satisfied.
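To make the greedy top-down induction concrete, here is a minimal sketch (not the course code): ID3-style splitting on the attribute with the highest information gain, for categorical attributes. The data format and attribute names are hypothetical.

# A minimal sketch (not from the slides) of greedy top-down DT induction
# with information gain, for categorical attributes.
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(examples, attr):
    """Entropy reduction from splitting `examples` (list of (features, label)) on `attr`."""
    labels = [y for _, y in examples]
    parts = {}
    for x, y in examples:
        parts.setdefault(x[attr], []).append(y)
    remainder = sum(len(p) / len(examples) * entropy(p) for p in parts.values())
    return entropy(labels) - remainder

def build_tree(examples, attrs):
    labels = [y for _, y in examples]
    if len(set(labels)) == 1 or not attrs:          # pure node or no attributes left
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(examples, a))
    tree = {"attr": best, "children": {}}
    for v in {x[best] for x, _ in examples}:
        subset = [(x, y) for x, y in examples if x[best] == v]
        tree["children"][v] = build_tree(subset, [a for a in attrs if a != best])
    return tree

def classify(tree, x):
    """Decoding: follow the path whose tests x satisfies, down to a leaf."""
    while isinstance(tree, dict):
        tree = tree["children"][x[tree["attr"]]]
    return tree

The "data splitting" weakness mentioned on the next slide is visible here: each recursive call sees only the subset of examples that reached that node.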
Decision tree (cont)
• Main algorithms: ID3, C4.5, CART
• Strengths:
– Ability to generate understandable rules
– Ability to clearly indicate the best attributes
• Weaknesses:
– Data splitting
– Trouble with non-rectangular regions
– The instability of top-down induction → bagging
Decision list
• Modeling: a list of decision rules
• Training: greedy, iterative algorithm
• Decoding: find the 1st rule that applies
• Each decision is based on a single piece of evidence, in contrast to MaxEnt, boosting, TBL
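A minimal sketch (not part of the course materials) of decision-list decoding: scan the ordered rules and return the class of the first rule whose single test fires, falling back to a default class. The example rules below are hypothetical.

# A minimal sketch (not from the slides) of decision-list decoding.
def dl_classify(rules, default, x):
    """rules: ordered list of ((attr, value), label); x: dict of attribute values."""
    for (attr, value), label in rules:
        if x.get(attr) == value:
            return label                  # first applicable rule wins
    return default

# hypothetical rules in the spirit of the customer example above
rules = [(("Income", "High"), "Respond"), (("District", "Rural"), "Respond")]
print(dl_classify(rules, "Nothing", {"District": "Urban", "Income": "High"}))  # -> Respond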
TBL
• Modeling: a list of transformations (similar to decision rules)
• Training:
– Greedy, iterative algorithm
– The concept of current state
• Decoding: apply the learned transformations, in order, to the data
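A minimal sketch (not the course implementation) of the TBL training loop and decoding, assuming each candidate transformation is a function t(x, current_label) -> new_label:

# A minimal sketch (not from the slides) of TBL: starting from an initial labeling
# (the "current state"), repeatedly pick the transformation that most reduces
# training error and apply it.
def tbl_train(xs, gold, initial_label, candidate_transforms, min_gain=1):
    cur = [initial_label(x) for x in xs]             # current state
    learned = []
    while True:
        def gain(t):
            new = [t(x, y) for x, y in zip(xs, cur)]
            return (sum(o != g for o, g in zip(cur, gold))
                    - sum(n != g for n, g in zip(new, gold)))
        best = max(candidate_transforms, key=gain)
        if gain(best) < min_gain:
            break
        cur = [best(x, y) for x, y in zip(xs, cur)]  # update the current state
        learned.append(best)
    return learned

def tbl_decode(xs, initial_label, learned):
    """Decoding: apply the learned transformations, in order, to the data."""
    cur = [initial_label(x) for x in xs]
    for t in learned:
        cur = [t(x, y) for x, y in zip(xs, cur)]
    return cur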
TBL (cont)
• Strengths:
– Minimizing the error rate directly
– Ability to handle non-classification problems
• Dynamic problem: POS tagging
• Non-classification problem: parsing
• Weaknesses:
– Transformations are hard to interpret as they interact with one another
– Probabilistic TBL: TBL-DT
Boosting
[Diagram: the learner (ML) is applied to the training sample and then to successively re-weighted samples, producing weak classifiers f1, f2, …, fT, which are combined into the final classifier f.]
Boosting (cont)
• Modeling: combining a set of weak classifiers to produce a powerful committee.
• Training: learn one classifier at each iteration
• Decoding: use the weighted majority vote of the weak classifiers
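A minimal AdaBoost-style sketch (one common instantiation of boosting, not necessarily the exact variant covered in class), for binary labels in {-1, +1}; base_learner(X, y, w) is assumed to return a weak classifier h with h(x) in {-1, +1}:

# A minimal sketch (not from the slides) of AdaBoost-style training and decoding.
import math

def adaboost_train(X, y, base_learner, T):
    n = len(X)
    w = [1.0 / n] * n                        # start with uniform example weights
    committee = []
    for _ in range(T):
        h = base_learner(X, y, w)
        err = sum(wi for wi, xi, yi in zip(w, X, y) if h(xi) != yi)
        if err == 0 or err >= 0.5:           # no usable weak classifier
            break
        alpha = 0.5 * math.log((1 - err) / err)
        committee.append((alpha, h))
        # increase the weight of misclassified examples, then renormalize
        w = [wi * math.exp(-alpha * yi * h(xi)) for wi, xi, yi in zip(w, X, y)]
        z = sum(w)
        w = [wi / z for wi in w]
    return committee

def adaboost_classify(committee, x):
    """Weighted majority vote of the weak classifiers."""
    score = sum(alpha * h(x) for alpha, h in committee)
    return 1 if score >= 0 else -1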
Boosting (cont)
• Strengths:
– It comes with a set of theoretical guarantees (e.g., on training error, test error).
– It only needs to find weak classifiers.
• Weaknesses:
– It is susceptible to noise.
– The actual performance depends on the data and the base learner.
MaxEnt

The task: find p* such that

p^* = \arg\max_{p \in P} H(p)

where

P = \{ p \mid E_p f_j = E_{\tilde{p}} f_j, \; j \in \{1, \ldots, k\} \}

If p* exists, it has the form

p^*(x) = \frac{1}{Z} \exp\Big( \sum_{j=1}^{k} \lambda_j f_j(x) \Big)
MaxEnt (cont)
• If p* exists, then

p^* = \arg\max_{q \in Q} L(q)

where

L(q) = \sum_x \tilde{p}(x) \log q(x)

Q = \Big\{ q \;\Big|\; q(x) = \frac{1}{Z} \exp\Big( \sum_{j=1}^{k} \lambda_j f_j(x) \Big) \Big\}
MaxEnt (cont)
• Training: GIS, IIS
• Feature selection:
– Greedy algorithm
– Select one (or more) at a time
• In general, MaxEnt achieves good performance on many NLP tasks.
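A minimal sketch (not from the slides) of how a trained MaxEnt model is applied to a classification task; it uses the conditional form p(y|x) ∝ exp(Σ_j λ_j f_j(x, y)) common in NLP, with hypothetical feature functions and weights:

# A minimal sketch (not from the slides) of MaxEnt decoding for classification.
import math

def maxent_prob(x, classes, features, lambdas):
    """features: list of f_j(x, y) returning 0/1; lambdas: matching list of weights."""
    scores = {y: math.exp(sum(l * f(x, y) for l, f in zip(lambdas, features)))
              for y in classes}
    Z = sum(scores.values())                 # normalizer
    return {y: s / Z for y, s in scores.items()}

def maxent_classify(x, classes, features, lambdas):
    p = maxent_prob(x, classes, features, lambdas)
    return max(p, key=p.get)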
Common issues
• Objective function / quality measure:
– DT, DL: e.g., information gain
– TBL, Boosting: minimize training errors
– MaxEnt: maximize entropy while satisfying the constraints
Common issues (cont)
• Avoiding overfitting:
– Use development data
– Two strategies:
• stop early
• post-pruning
Common issues (cont)
• Missing attribute values:
– Assume a "blank" value
– Assign the most common value among all "similar" examples in the training data
– (DL, DT): Assign a fraction of the example to each possible class.
• Continuous-valued attributes:
– Choosing thresholds by checking the training data
Common issues (cont)
• Attributes with different costs:
– DT: change the quality measure to include the costs
• Continuous-valued goal attribute:
– DT, DL: each "leaf" node is marked with a real value or a linear function
– TBL, MaxEnt, Boosting: ??
Comparison of supervised learners

                 | DT         | DL                    | TBL                             | Boosting                     | MaxEnt
Probabilistic    | PDT        | PDL                   | TBL-DT                          | Confidence                   | Y
Parametric       | N          | N                     | N                               | N                            | Y
Representation   | Tree       | Ordered list of rules | Ordered list of transformations | List of weighted classifiers | List of weighted features
Each iteration   | Attribute  | Rule                  | Transformation                  | Classifier & weight          | Feature & weight
Data processing  | Split data | Split data*           | Change cur_y                    | Reweight (x,y)               | None
Decoding         | Path       | 1st rule              | Sequence of rules               | Calc f(x)                    | Calc f(x)
Semi-supervised Learning
Semi-supervised learning
• Each learning method makes some assumptions about the problem.
• SSL works when those assumptions are satisfied.
• SSL could degrade the performance when mistakes reinforce themselves.
SSL (cont)
• We have covered four methods: self-training, co-training, EM, co-EM
Co-training
• The original paper (Blum and Mitchell, 1998):
– Two "independent" views: split the features into two sets.
– Train a classifier on each view.
– Each classifier labels data that can be used to train the other classifier.
• Extensions:
– Relax the conditional independence assumptions
– Instead of using two views, use two or more classifiers trained on the whole feature set.
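A minimal sketch (not from the paper's code) of the co-training loop, under these assumptions: each example has two views, and the per-view learners return classifiers that map a view to a (label, confidence) pair:

# A minimal sketch (not from the slides) of Blum & Mitchell-style co-training:
# two classifiers, each trained on one view, label data for each other.
def co_train(labeled, unlabeled, train1, train2, rounds=10, k=5):
    """labeled: list of ((view1, view2), y); unlabeled: list of (view1, view2)."""
    labeled = list(labeled)
    for _ in range(rounds):
        c1 = train1([(v1, y) for (v1, _), y in labeled])
        c2 = train2([(v2, y) for (_, v2), y in labeled])
        # each classifier labels the unlabeled examples it is most confident about,
        # and those examples are added to the pool that also trains the other classifier
        for clf, view in ((c1, 0), (c2, 1)):
            scored = sorted(unlabeled, key=lambda ex: clf(ex[view])[1], reverse=True)
            newly, unlabeled = scored[:k], scored[k:]
            labeled += [(ex, clf(ex[view])[0]) for ex in newly]
    return labeled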
Unsupervised learning
Unsupervised learning
• EM is a method of estimating parameters in the MLE framework.
• It finds a sequence of parameters that improve the likelihood of the training data.
The EM algorithm
• Start with an initial estimate, θ^0
• Repeat until convergence:
– E-step: calculate

Q(\theta, \theta^t) = \sum_{i=1}^{n} \sum_{y_i} P(y_i \mid x_i, \theta^t) \log P(x_i, y_i \mid \theta)

– M-step: find

\theta^{(t+1)} = \arg\max_{\theta} Q(\theta, \theta^t)
The EM algorithm (cont)
• The optimal solution for the M-step exists for many classes of problems.
• A number of well-known methods are special cases of EM.
• The EM algorithm for PM models:
– Forward-backward algorithm
– Inside-outside algorithm
– …
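To make the E-step/M-step loop concrete, here is a toy sketch (not from the slides): EM for a mixture of two biased coins, where the identity of the coin used in each trial is the hidden variable. The data and initial estimates are made up.

# A minimal sketch (not from the slides) of EM for a two-coin mixture:
# each trial of m flips comes from coin A (with prob pi) or coin B, and we
# estimate pi, thetaA, thetaB from the head counts alone.
def em_two_coins(trials, m, pi, thetaA, thetaB, iters=50):
    """trials: list of head counts; m: flips per trial; remaining args: initial estimates."""
    for _ in range(iters):
        # E-step: posterior probability that each trial came from coin A
        resp = []
        for h in trials:
            la = pi * (thetaA ** h) * ((1 - thetaA) ** (m - h))
            lb = (1 - pi) * (thetaB ** h) * ((1 - thetaB) ** (m - h))
            resp.append(la / (la + lb))
        # M-step: re-estimate the parameters from the expected counts
        pi = sum(resp) / len(trials)
        thetaA = sum(r * h for r, h in zip(resp, trials)) / (m * sum(resp))
        thetaB = sum((1 - r) * h for r, h in zip(resp, trials)) / (m * sum(1 - r for r in resp))
    return pi, thetaA, thetaB

print(em_two_coins([9, 8, 2, 1, 9], m=10, pi=0.5, thetaA=0.6, thetaB=0.4))

Each iteration improves (or leaves unchanged) the likelihood of the observed head counts, which is the guarantee stated above.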
Other topics
FSA and HMM
• Two types of HMMs:
– State-emission and arc-emission HMMs
– They are equivalent
• We can convert an HMM into a WFA
• Modeling: Markov assumption
• Training:
– Supervised: counting
– Unsupervised: forward-backward algorithm
• Decoding: Viterbi algorithm
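A minimal sketch (not the homework implementation) of Viterbi decoding for a state-emission HMM, assuming dictionaries of initial probabilities init[s], transition probabilities trans[s][s'], and emission probabilities emit[s][o]:

# A minimal sketch (not from the slides) of Viterbi decoding for an HMM.
def viterbi(obs, states, init, trans, emit):
    V = [{s: init[s] * emit[s][obs[0]] for s in states}]    # best score ending in s
    back = []
    for o in obs[1:]:
        prev = V[-1]
        col, ptr = {}, {}
        for s in states:
            best_prev = max(states, key=lambda r: prev[r] * trans[r][s])
            col[s] = prev[best_prev] * trans[best_prev][s] * emit[s][o]
            ptr[s] = best_prev
        V.append(col)
        back.append(ptr)
    # follow back-pointers from the best final state
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))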
Bootstrap
[Diagram: B bootstrap samples are drawn from the original sample; the learner (ML) is trained on each, producing f1, f2, …, fB, which are combined into f.]
Bootstrap (cont)
• A method of re-sampling:
– One original sample → B bootstrap samples
• It has a strong mathematical background.
• It is a method for estimating standard errors, bias, and so on.
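A minimal sketch (not from the slides) of bootstrap resampling used to estimate a standard error; the sample values and the choice of the mean as the statistic are made up for illustration:

# A minimal sketch (not from the slides) of the bootstrap: draw B resamples with
# replacement and use them to estimate the standard error of a statistic.
import random, statistics

def bootstrap_se(sample, statistic, B=1000):
    estimates = []
    for _ in range(B):
        resample = [random.choice(sample) for _ in sample]   # same size, with replacement
        estimates.append(statistic(resample))
    return statistics.stdev(estimates)

print(bootstrap_se([2.1, 3.5, 4.0, 2.8, 5.2], statistics.mean))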
System combination
[Diagram: different learners ML1, ML2, …, MLB produce f1, f2, …, fB, which are combined into f.]
System combination (cont)
• Hybridization: combine substructures to produce a new one.
– Voting
– Naïve Bayes
• Switching: choose one of the f_i(x)
– Similarity switching
– Naïve Bayes

f(x) = g(f_1(x), \ldots, f_m(x))
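A minimal sketch (not from the slides) of f(x) = g(f_1(x), …, f_m(x)) with g instantiated as simple majority voting over the member systems' outputs:

# A minimal sketch (not from the slides) of system combination by voting.
from collections import Counter

def combine_by_voting(classifiers, x):
    votes = [f(x) for f in classifiers]
    return Counter(votes).most_common(1)[0][0]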
Bagging
[Diagram: bagging = bootstrap + system combination. The learner (ML) is trained on each of B bootstrap samples, producing f1, f2, …, fB, which are combined into f.]
Bagging (cont)
• It is effective for unstable learning methods:
– Decision tree
– Regression tree
– Neural network
• It does not help stable learning methods:
– K-nearest neighbors
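A minimal sketch (not from the slides) that puts the two pieces together, bootstrap plus voting, to form bagging; `learner` is any function that fits a classifier on a list of labeled examples:

# A minimal sketch (not from the slides) of bagging = bootstrap + system combination.
import random
from collections import Counter

def bagging_train(data, learner, B=25):
    models = []
    for _ in range(B):
        boot = [random.choice(data) for _ in data]   # bootstrap sample
        models.append(learner(boot))
    return models

def bagging_classify(models, x):
    return Counter(m(x) for m in models).most_common(1)[0][0]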
Relations
Relations
• WFSA and HMM
• DL, DT, TBL
• EM, EM for PM
WFSA and HMM
• Add a "Start" state and a transition from "Start" to any state in the HMM.
• Add a "Finish" state and a transition from any state in the HMM to "Finish".
DT, DL, CNF, DNF, TBL
[Diagram: relations among k-CNF, k-DNF, k-DT, k-DL, and k-TBL.]
The EM algorithm
[Diagram: the generalized EM contains the EM algorithm; its special cases include PM models (forward-backward, inside-outside, the IBM models) and Gaussian mixtures.]
Solving an NLP problem
Issues
• Modeling: represent the problem as a formula and decompose the formula into a function of parameters
• Training: estimate the model parameters
• Decoding: find the best answer given the parameters
• Other issues:
– Preprocessing
– Postprocessing
– Evaluation
– …
Modeling
• Generative vs. discriminative models
• Introducing hidden variables
• The order of decomposition
P(F, E), \quad P(F \mid E), \quad P(E \mid F)

P(F \mid E) = \sum_a P(F, a \mid E)

P(F, a \mid E) = P(a \mid E) \cdot P(F \mid a, E)

P(F, a \mid E) = P(F \mid E) \cdot P(a \mid F, E)
Modeling (cont)
• Approximation / assumptions
• Final formulae and types of parameters
P(a \mid E) = \prod_i P(a_i \mid E, a_1^{i-1}) \approx \prod_i P(a_i \mid a_{i-1})

P(F \mid E) = \frac{P(m \mid l)}{(l+1)^m} \prod_{j=1}^{m} \sum_{i=0}^{l} P(f_j \mid e_i)
Modeling (cont)
• Using classifiers for non-classification problems:
– POS tagging
– Chunking
– Parsing
Training
• Objective functions:
– Maximize likelihood: EM
– Minimize error rate: TBL
– Maximum entropy: MaxEnt
– …
• Supervised, semi-supervised, unsupervised:
– Ex: maximize likelihood
• Supervised: simple counting
• Unsupervised: EM
Training (cont)
• At each iteration:
– Choose one attribute / rule / weight / … at a time, and never change it later: DT, DL, TBL
– Update all the parameters at each iteration: EM
• Choose "untrained" parameters (e.g., thresholds) using development data.
– Minimal "gain" for continuing the iteration
Decoding
• Dynamic programming:
– CYK for PCFG
– Viterbi for HMM
• Dynamic problem:
– Decode from left to right
– Features only look at the left context
– Keep the top-N hypotheses at each position
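A minimal sketch (not from the slides) of left-to-right decoding with a beam for a tagging-style dynamic problem; the scoring function is hypothetical and looks only at the left context (here, the previous tag):

# A minimal sketch (not from the slides) of left-to-right beam decoding:
# at each position keep only the top-N hypotheses.
def beam_decode(words, tags, score, N=5):
    """score(word, tag, prev_tag) -> log-prob-like value; returns the best tag sequence."""
    beam = [([], 0.0)]                                 # (tag sequence so far, score)
    for w in words:
        candidates = []
        for seq, s in beam:
            prev = seq[-1] if seq else "<s>"
            for t in tags:
                candidates.append((seq + [t], s + score(w, t, prev)))
        beam = sorted(candidates, key=lambda h: h[1], reverse=True)[:N]   # keep top-N
    return beam[0][0]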
Preprocessing
• Sentence segmentation
• Sentence alignment (for MT)
• Tokenization
• Morphing
• POS tagging
• …
Post-processing
• System combination
• Casing (MT)
• …
Evaluation
• Use standard training/test data if possible.
• Choose appropriate evaluation measures:
– WSD: for what applications?
– Word alignment: F-measure vs. AER. How does it affect the MT result?
– Parsing: F-measure vs. dependency link accuracy
Tricks
Tricks
• Algebra
• Probability
• Optimization
• Programming
Algebra
The order of sums:

\sum_{x_1} \sum_{x_2} \cdots \sum_{x_n} f(x_1, \ldots, x_n) = \sum_{x_n} \cdots \sum_{x_2} \sum_{x_1} f(x_1, \ldots, x_n)

Pulling out constants:

\sum_{x_1} \cdots \sum_{x_n} c \, f(x_1, \ldots, x_n) = c \sum_{x_1} \cdots \sum_{x_n} f(x_1, \ldots, x_n)
Algebra (cont)
The order of log and product / sum:

\log \prod_i f_i = \sum_i \log f_i

The order of sums and products:

\sum_{x_1} \sum_{x_2} \cdots \sum_{x_n} \prod_{i=1}^{n} f_i(x_i) = \prod_{i=1}^{n} \sum_{x_i} f_i(x_i)
Probability
Introducing a new random variable:

p(x \mid \theta) = \sum_y p(x, y \mid \theta) = \sum_y p(y \mid \theta)\, p(x \mid y, \theta)

The order of decomposition:

P(x, y, z) = P(x)\, P(y \mid x)\, P(z \mid x, y)

P(x, y, z) = P(y)\, P(z \mid y)\, P(x \mid y, z)
More general cases
P(A_1, \ldots, A_n) = P(A_1) \prod_{i=2}^{n} P(A_i \mid A_1, \ldots, A_{i-1})
Probability (cont)
Bayes rule:

p(y \mid x) = \frac{p(x \mid y)\, p(y)}{p(x)}

Source-channel model:

\arg\max_y p(y \mid x) = \arg\max_y p(y)\, p(x \mid y)
Probability (cont)
Normalization:

p(x) = \frac{Ct(x)}{\sum_x Ct(x)}

Jensen's inequality:

E[\log(p(x))] \le \log(E[p(x)])
Optimization
• When there is no analytical solution, use iterative approach.
• If the optimal solution to g(x) is hard to find, look for the optimal solution to a (tight) lower bound of g(x).
Optimization (cont)
• Using Lagrange multipliers:
– Constrained problem: maximize f(x) with the constraint that g(x) = 0
– Unconstrained problem: maximize f(x) – λg(x)
• Taking first derivatives to find the stationary points.
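A small worked example (not from the slides) of the same device: maximize the entropy H(p) = -\sum_i p_i \log p_i subject to \sum_i p_i = 1. The stationary point of the Lagrangian is the uniform distribution:

\Lambda(p, \lambda) = -\sum_{i=1}^{n} p_i \log p_i - \lambda \Big( \sum_{i=1}^{n} p_i - 1 \Big)

\frac{\partial \Lambda}{\partial p_i} = -\log p_i - 1 - \lambda = 0 \;\Rightarrow\; p_i = e^{-1-\lambda} \;\Rightarrow\; p_i = \frac{1}{n}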
Programming
• Using/creating a good package:– Tutorial, sample data, well-written code– Multiple levels of code
• Core ML algorithm: e.g., TBL
• Wrapper for a task: e.g., a POS tagger
• Wrapper to deal with input, output, etc.
Programming (cont)
• Good practice:
– Write notes and create wrappers (all the commands should be stored in the notes, or even better in a wrapper script)
– Use standard directory structures:
• src/, include/, exec/, bin/, obj/, docs/, sample/, data/, result/
– Give meaningful filenames to important code: e.g., build_trigram_tagger.pl rather than aaa100.exec
– Give meaningful function and variable names
– Don’t use global variables
Final words
• We have covered a lot of topics: 5+4+3+4
• It takes time to digest, but at least we understand the basic concepts.
• The next step: applying them to real applications.