Final review
LING 572, Fei Xia
03/07/06
Misc
• Parts 3 and 4 were due at 6am today.
• Presentation: email me the slides by 6am on 3/9
• Final report: email me by 6am on 3/14.
• Group meetings: 1:30-4:00pm on 3/16.
Outline
• Main topics
• Applying to NLP tasks
• Tricks
Main topics
• Supervised learning
– Decision tree
– Decision list
– TBL
– MaxEnt
– Boosting
• Semi-supervised learning
– Self-training
– Co-training
– EM
– Co-EM
Main topics (cont)
• Unsupervised learning
– The EM algorithm
– The EM algorithm for PM models
• Forward-backward
• Inside-outside
• IBM models for MT
• Others
– Two dynamic models: FSA and HMM
– Re-sampling: bootstrap
– System combination
– Bagging
Main topics (cont)
• Homework
– Hw1: FSA and HMM
– Hw2: DT, DL, CNF, DNF, and TBL
– Hw3: Boosting
• Project:
– P1: Trigram (learn to use Carmel, relation between HMM and FSA)
– P2: TBL
– P3: MaxEnt
– P4: Bagging, boosting, system combination, SSL
Supervised learning
A classification problem

District  | House type    | Income | Previous Customer | Outcome
Suburban  | Detached      | High   | No                | Nothing
Suburban  | Semi-detached | High   | Yes               | Respond
Rural     | Semi-detached | Low    | No                | Respond
Urban     | Detached      | Low    | Yes               | Nothing
…
Classification and estimation problems
• Given:
– x: input attributes
– y: the goal
– training data: a set of (x, y)
• Predict y given a new x:
– y is a discrete variable → classification problem
– y is a continuous variable → estimation problem
Five ML methods
• Decision tree
• Decision list
• TBL
• Boosting
• MaxEnt
Decision tree
• Modeling: tree representation
• Training: top-down induction, greedy algorithm
• Decoding: find the path from root to a leaf node, where the tests along the path are satisfied.
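To make the greedy top-down induction concrete, here is a minimal sketch (not the course code): ID3-style splitting on the attribute with the highest information gain, for categorical attributes. The data format and attribute names are hypothetical.

# A minimal sketch (not from the slides) of greedy top-down DT induction
# with information gain, for categorical attributes.
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(examples, attr):
    """Entropy reduction from splitting `examples` (list of (features, label)) on `attr`."""
    labels = [y for _, y in examples]
    parts = {}
    for x, y in examples:
        parts.setdefault(x[attr], []).append(y)
    remainder = sum(len(p) / len(examples) * entropy(p) for p in parts.values())
    return entropy(labels) - remainder

def build_tree(examples, attrs):
    labels = [y for _, y in examples]
    if len(set(labels)) == 1 or not attrs:          # pure node or no attributes left
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(examples, a))
    tree = {"attr": best, "children": {}}
    for v in {x[best] for x, _ in examples}:
        subset = [(x, y) for x, y in examples if x[best] == v]
        tree["children"][v] = build_tree(subset, [a for a in attrs if a != best])
    return tree

def classify(tree, x):
    """Decoding: follow the path whose tests x satisfies, down to a leaf."""
    while isinstance(tree, dict):
        tree = tree["children"][x[tree["attr"]]]
    return tree

The "data splitting" weakness mentioned on the next slide is visible here: each recursive call sees only the subset of examples that reached that node.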
Decision tree (cont)
• Main algorithms: ID3, C4.5, CART
• Strengths:
– Ability to generate understandable rules
– Ability to clearly indicate the best attributes
• Weaknesses:
– Data splitting
– Trouble with non-rectangular regions
– The instability of top-down induction → bagging
Decision list
• Modeling: a list of decision rules
• Training: greedy, iterative algorithm
• Decoding: find the 1st rule that applies
• Each decision is based on a single piece of evidence, in contrast to MaxEnt, boosting, TBL
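A minimal sketch (not part of the course materials) of decision-list decoding: scan the ordered rules and return the class of the first rule whose single test fires, falling back to a default class. The example rules below are hypothetical.

# A minimal sketch (not from the slides) of decision-list decoding.
def dl_classify(rules, default, x):
    """rules: ordered list of ((attr, value), label); x: dict of attribute values."""
    for (attr, value), label in rules:
        if x.get(attr) == value:
            return label                  # first applicable rule wins
    return default

# hypothetical rules in the spirit of the customer example above
rules = [(("Income", "High"), "Respond"), (("District", "Rural"), "Respond")]
print(dl_classify(rules, "Nothing", {"District": "Urban", "Income": "High"}))  # -> Respond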
TBL
• Modeling: a list of transformations (similar to decision rules)
• Training:
– Greedy, iterative algorithm
– The concept of current state
• Decoding: apply the learned transformations, in order, to the data
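A minimal sketch (not the course implementation) of the TBL training loop and decoding, assuming each candidate transformation is a function t(x, current_label) -> new_label:

# A minimal sketch (not from the slides) of TBL: starting from an initial labeling
# (the "current state"), repeatedly pick the transformation that most reduces
# training error and apply it.
def tbl_train(xs, gold, initial_label, candidate_transforms, min_gain=1):
    cur = [initial_label(x) for x in xs]             # current state
    learned = []
    while True:
        def gain(t):
            new = [t(x, y) for x, y in zip(xs, cur)]
            return (sum(o != g for o, g in zip(cur, gold))
                    - sum(n != g for n, g in zip(new, gold)))
        best = max(candidate_transforms, key=gain)
        if gain(best) < min_gain:
            break
        cur = [best(x, y) for x, y in zip(xs, cur)]  # update the current state
        learned.append(best)
    return learned

def tbl_decode(xs, initial_label, learned):
    """Decoding: apply the learned transformations, in order, to the data."""
    cur = [initial_label(x) for x in xs]
    for t in learned:
        cur = [t(x, y) for x, y in zip(xs, cur)]
    return cur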
TBL (cont)
• Strengths:
– Minimizing the error rate directly
– Ability to handle non-classification problems
• Dynamic problem: POS tagging
• Non-classification problem: parsing
• Weaknesses:
– Transformations are hard to interpret as they interact with one another
– Probabilistic TBL: TBL-DT
Boosting
[Diagram: the learner (ML) is applied to the training sample and then to successively re-weighted samples, producing weak classifiers f1, f2, …, fT, which are combined into the final classifier f.]
Boosting (cont)
• Modeling: combining a set of weak classifiers to produce a powerful committee.
• Training: learn one classifier at each iteration
• Decoding: use the weighted majority vote of the weak classifiers
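A minimal AdaBoost-style sketch (one common instantiation of boosting, not necessarily the exact variant covered in class), for binary labels in {-1, +1}; base_learner(X, y, w) is assumed to return a weak classifier h with h(x) in {-1, +1}:

# A minimal sketch (not from the slides) of AdaBoost-style training and decoding.
import math

def adaboost_train(X, y, base_learner, T):
    n = len(X)
    w = [1.0 / n] * n                        # start with uniform example weights
    committee = []
    for _ in range(T):
        h = base_learner(X, y, w)
        err = sum(wi for wi, xi, yi in zip(w, X, y) if h(xi) != yi)
        if err == 0 or err >= 0.5:           # no usable weak classifier
            break
        alpha = 0.5 * math.log((1 - err) / err)
        committee.append((alpha, h))
        # increase the weight of misclassified examples, then renormalize
        w = [wi * math.exp(-alpha * yi * h(xi)) for wi, xi, yi in zip(w, X, y)]
        z = sum(w)
        w = [wi / z for wi in w]
    return committee

def adaboost_classify(committee, x):
    """Weighted majority vote of the weak classifiers."""
    score = sum(alpha * h(x) for alpha, h in committee)
    return 1 if score >= 0 else -1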
Boosting (cont)
• Strengths:
– It comes with a set of theoretical guarantees (e.g., on training error, test error).
– It only needs to find weak classifiers.
• Weaknesses:
– It is susceptible to noise.
– The actual performance depends on the data and the base learner.
MaxEnt

The task: find p* such that

p^* = \arg\max_{p \in P} H(p)

where

P = \{ p \mid E_p f_j = E_{\tilde{p}} f_j, \; j \in \{1, \ldots, k\} \}

If p* exists, it has the form

p^*(x) = \frac{1}{Z} \exp\Big( \sum_{j=1}^{k} \lambda_j f_j(x) \Big)
MaxEnt (cont)
• If p* exists, then

p^* = \arg\max_{q \in Q} L(q)

where

L(q) = \sum_x \tilde{p}(x) \log q(x)

Q = \Big\{ q \;\Big|\; q(x) = \frac{1}{Z} \exp\Big( \sum_{j=1}^{k} \lambda_j f_j(x) \Big) \Big\}
MaxEnt (cont)
• Training: GIS, IIS
• Feature selection:
– Greedy algorithm
– Select one (or more) at a time
• In general, MaxEnt achieves good performance on many NLP tasks.
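A minimal sketch (not from the slides) of how a trained MaxEnt model is applied to a classification task; it uses the conditional form p(y|x) ∝ exp(Σ_j λ_j f_j(x, y)) common in NLP, with hypothetical feature functions and weights:

# A minimal sketch (not from the slides) of MaxEnt decoding for classification.
import math

def maxent_prob(x, classes, features, lambdas):
    """features: list of f_j(x, y) returning 0/1; lambdas: matching list of weights."""
    scores = {y: math.exp(sum(l * f(x, y) for l, f in zip(lambdas, features)))
              for y in classes}
    Z = sum(scores.values())                 # normalizer
    return {y: s / Z for y, s in scores.items()}

def maxent_classify(x, classes, features, lambdas):
    p = maxent_prob(x, classes, features, lambdas)
    return max(p, key=p.get)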
Common issues
• Objective function / quality measure:
– DT, DL: e.g., information gain
– TBL, Boosting: minimize training errors
– MaxEnt: maximize entropy while satisfying the constraints
Common issues (cont)
• Avoiding overfitting:
– Use development data
– Two strategies:
• stop early
• post-pruning
Common issues (cont)
• Missing attribute values:
– Assume a "blank" value
– Assign the most common value among all "similar" examples in the training data
– (DL, DT): Assign a fraction of the example to each possible class.
• Continuous-valued attributes:
– Choosing thresholds by checking the training data
Common issues (cont)
• Attributes with different costs:
– DT: change the quality measure to include the costs
• Continuous-valued goal attribute:
– DT, DL: each "leaf" node is marked with a real value or a linear function
– TBL, MaxEnt, Boosting: ??
Comparison of supervised learners

                 | DT         | DL                    | TBL                             | Boosting                     | MaxEnt
Probabilistic    | PDT        | PDL                   | TBL-DT                          | Confidence                   | Y
Parametric       | N          | N                     | N                               | N                            | Y
Representation   | Tree       | Ordered list of rules | Ordered list of transformations | List of weighted classifiers | List of weighted features
Each iteration   | Attribute  | Rule                  | Transformation                  | Classifier & weight          | Feature & weight
Data processing  | Split data | Split data*           | Change cur_y                    | Reweight (x,y)               | None
Decoding         | Path       | 1st rule              | Sequence of rules               | Calc f(x)                    | Calc f(x)
Semi-supervised Learning
Semi-supervised learning
• Each learning method makes some assumptions about the problem.
• SSL works when those assumptions are satisfied.
• SSL could degrade the performance when mistakes reinforce themselves.
SSL (cont)
• We have covered four methods: self-training, co-training, EM, co-EM
Co-training
• The original paper (Blum and Mitchell, 1998):
– Two "independent" views: split the features into two sets.
– Train a classifier on each view.
– Each classifier labels data that can be used to train the other classifier.
• Extensions:
– Relax the conditional independence assumptions
– Instead of using two views, use two or more classifiers trained on the whole feature set.
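A minimal sketch (not from the paper's code) of the co-training loop, under these assumptions: each example has two views, and the per-view learners return classifiers that map a view to a (label, confidence) pair:

# A minimal sketch (not from the slides) of Blum & Mitchell-style co-training:
# two classifiers, each trained on one view, label data for each other.
def co_train(labeled, unlabeled, train1, train2, rounds=10, k=5):
    """labeled: list of ((view1, view2), y); unlabeled: list of (view1, view2)."""
    labeled = list(labeled)
    for _ in range(rounds):
        c1 = train1([(v1, y) for (v1, _), y in labeled])
        c2 = train2([(v2, y) for (_, v2), y in labeled])
        # each classifier labels the unlabeled examples it is most confident about,
        # and those examples are added to the pool that also trains the other classifier
        for clf, view in ((c1, 0), (c2, 1)):
            scored = sorted(unlabeled, key=lambda ex: clf(ex[view])[1], reverse=True)
            newly, unlabeled = scored[:k], scored[k:]
            labeled += [(ex, clf(ex[view])[0]) for ex in newly]
    return labeled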
Unsupervised learning
Unsupervised learning
• EM is a method of estimating parameters in the MLE framework.
• It finds a sequence of parameters that improve the likelihood of the training data.
The EM algorithm
• Start with an initial estimate, θ^0
• Repeat until convergence:
– E-step: calculate

Q(\theta, \theta^t) = \sum_{i=1}^{n} \sum_{y_i} P(y_i \mid x_i, \theta^t) \log P(x_i, y_i \mid \theta)

– M-step: find

\theta^{(t+1)} = \arg\max_{\theta} Q(\theta, \theta^t)
The EM algorithm (cont)
• The optimal solution for the M-step exists for many classes of problems.
• A number of well-known methods are special cases of EM.
• The EM algorithm for PM models:
– Forward-backward algorithm
– Inside-outside algorithm
– …
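To make the E-step/M-step loop concrete, here is a toy sketch (not from the slides): EM for a mixture of two biased coins, where the identity of the coin used in each trial is the hidden variable. The data and initial estimates are made up.

# A minimal sketch (not from the slides) of EM for a two-coin mixture:
# each trial of m flips comes from coin A (with prob pi) or coin B, and we
# estimate pi, thetaA, thetaB from the head counts alone.
def em_two_coins(trials, m, pi, thetaA, thetaB, iters=50):
    """trials: list of head counts; m: flips per trial; remaining args: initial estimates."""
    for _ in range(iters):
        # E-step: posterior probability that each trial came from coin A
        resp = []
        for h in trials:
            la = pi * (thetaA ** h) * ((1 - thetaA) ** (m - h))
            lb = (1 - pi) * (thetaB ** h) * ((1 - thetaB) ** (m - h))
            resp.append(la / (la + lb))
        # M-step: re-estimate the parameters from the expected counts
        pi = sum(resp) / len(trials)
        thetaA = sum(r * h for r, h in zip(resp, trials)) / (m * sum(resp))
        thetaB = sum((1 - r) * h for r, h in zip(resp, trials)) / (m * sum(1 - r for r in resp))
    return pi, thetaA, thetaB

print(em_two_coins([9, 8, 2, 1, 9], m=10, pi=0.5, thetaA=0.6, thetaB=0.4))

Each iteration improves (or leaves unchanged) the likelihood of the observed head counts, which is the guarantee stated above.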
Other topics
FSA and HMM
• Two types of HMMs:
– State-emission and arc-emission HMMs
– They are equivalent
• We can convert an HMM into a WFA
• Modeling: Markov assumption
• Training:
– Supervised: counting
– Unsupervised: forward-backward algorithm
• Decoding: Viterbi algorithm
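A minimal sketch (not the homework implementation) of Viterbi decoding for a state-emission HMM, assuming dictionaries of initial probabilities init[s], transition probabilities trans[s][s'], and emission probabilities emit[s][o]:

# A minimal sketch (not from the slides) of Viterbi decoding for an HMM.
def viterbi(obs, states, init, trans, emit):
    V = [{s: init[s] * emit[s][obs[0]] for s in states}]    # best score ending in s
    back = []
    for o in obs[1:]:
        prev = V[-1]
        col, ptr = {}, {}
        for s in states:
            best_prev = max(states, key=lambda r: prev[r] * trans[r][s])
            col[s] = prev[best_prev] * trans[best_prev][s] * emit[s][o]
            ptr[s] = best_prev
        V.append(col)
        back.append(ptr)
    # follow back-pointers from the best final state
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))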
Bootstrap
[Diagram: B bootstrap samples are drawn from the original sample; the learner (ML) is trained on each, producing f1, f2, …, fB, which are combined into f.]
Bootstrap (cont)
• A method of re-sampling:
– One original sample → B bootstrap samples
• It has a strong mathematical background.
• It is a method for estimating standard errors, bias, and so on.
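A minimal sketch (not from the slides) of bootstrap resampling used to estimate a standard error; the sample values and the choice of the mean as the statistic are made up for illustration:

# A minimal sketch (not from the slides) of the bootstrap: draw B resamples with
# replacement and use them to estimate the standard error of a statistic.
import random, statistics

def bootstrap_se(sample, statistic, B=1000):
    estimates = []
    for _ in range(B):
        resample = [random.choice(sample) for _ in sample]   # same size, with replacement
        estimates.append(statistic(resample))
    return statistics.stdev(estimates)

print(bootstrap_se([2.1, 3.5, 4.0, 2.8, 5.2], statistics.mean))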
System combination
[Diagram: different learners ML1, ML2, …, MLB produce f1, f2, …, fB, which are combined into f.]
System combination (cont)
• Hybridization: combine substructures to produce a new one.
– Voting
– Naïve Bayes
• Switching: choose one of the f_i(x)
– Similarity switching
– Naïve Bayes

f(x) = g(f_1(x), \ldots, f_m(x))
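A minimal sketch (not from the slides) of f(x) = g(f_1(x), …, f_m(x)) with g instantiated as simple majority voting over the member systems' outputs:

# A minimal sketch (not from the slides) of system combination by voting.
from collections import Counter

def combine_by_voting(classifiers, x):
    votes = [f(x) for f in classifiers]
    return Counter(votes).most_common(1)[0][0]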
Bagging
[Diagram: bagging = bootstrap + system combination. The learner (ML) is trained on each of B bootstrap samples, producing f1, f2, …, fB, which are combined into f.]
Bagging (cont)
• It is effective for unstable learning methods:
– Decision tree
– Regression tree
– Neural network
• It does not help stable learning methods:
– K-nearest neighbors
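A minimal sketch (not from the slides) that puts the two pieces together, bootstrap plus voting, to form bagging; `learner` is any function that fits a classifier on a list of labeled examples:

# A minimal sketch (not from the slides) of bagging = bootstrap + system combination.
import random
from collections import Counter

def bagging_train(data, learner, B=25):
    models = []
    for _ in range(B):
        boot = [random.choice(data) for _ in data]   # bootstrap sample
        models.append(learner(boot))
    return models

def bagging_classify(models, x):
    return Counter(m(x) for m in models).most_common(1)[0][0]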
Relations
Relations
• WFSA and HMM
• DL, DT, TBL
• EM, EM for PM
WFSA and HMM
• Add a "Start" state and a transition from "Start" to any state in the HMM.
• Add a "Finish" state and a transition from any state in the HMM to "Finish".
DT, DL, CNF, DNF, TBL
[Diagram: relations among k-CNF, k-DNF, k-DT, k-DL, and k-TBL.]
The EM algorithm
[Diagram: the generalized EM contains the EM algorithm; its special cases include PM models (forward-backward, inside-outside, the IBM models) and Gaussian mixtures.]
Solving an NLP problem
Issues
• Modeling: represent the problem as a formula and decompose the formula into a function of parameters
• Training: estimate the model parameters
• Decoding: find the best answer given the parameters
• Other issues:
– Preprocessing
– Postprocessing
– Evaluation
– …
Modeling
• Generative vs. discriminative models
• Introducing hidden variables
• The order of decomposition
P(F, E), \quad P(F \mid E), \quad P(E \mid F)

P(F \mid E) = \sum_a P(F, a \mid E)

P(F, a \mid E) = P(a \mid E) \cdot P(F \mid a, E)

P(F, a \mid E) = P(F \mid E) \cdot P(a \mid F, E)
Modeling (cont)
• Approximation / assumptions
• Final formulae and types of parameters
P(a \mid E) = \prod_i P(a_i \mid E, a_1^{i-1}) \approx \prod_i P(a_i \mid a_{i-1})

P(F \mid E) = \frac{P(m \mid l)}{(l+1)^m} \prod_{j=1}^{m} \sum_{i=0}^{l} P(f_j \mid e_i)
Modeling (cont)
• Using classifiers for non-classification problems:
– POS tagging
– Chunking
– Parsing
Training
• Objective functions:
– Maximize likelihood: EM
– Minimize error rate: TBL
– Maximum entropy: MaxEnt
– …
• Supervised, semi-supervised, unsupervised:
– Ex: maximize likelihood
• Supervised: simple counting
• Unsupervised: EM
Training (cont)
• At each iteration:
– Choose one attribute / rule / weight / … at a time, and never change it later: DT, DL, TBL
– Update all the parameters at each iteration: EM
• Choose "untrained" parameters (e.g., thresholds) using development data.
– Minimal "gain" for continuing the iteration
Decoding
• Dynamic programming:
– CYK for PCFG
– Viterbi for HMM
• Dynamic problem:
– Decode from left to right
– Features only look at the left context
– Keep the top-N hypotheses at each position
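A minimal sketch (not from the slides) of left-to-right decoding with a beam for a tagging-style dynamic problem; the scoring function is hypothetical and looks only at the left context (here, the previous tag):

# A minimal sketch (not from the slides) of left-to-right beam decoding:
# at each position keep only the top-N hypotheses.
def beam_decode(words, tags, score, N=5):
    """score(word, tag, prev_tag) -> log-prob-like value; returns the best tag sequence."""
    beam = [([], 0.0)]                                 # (tag sequence so far, score)
    for w in words:
        candidates = []
        for seq, s in beam:
            prev = seq[-1] if seq else "<s>"
            for t in tags:
                candidates.append((seq + [t], s + score(w, t, prev)))
        beam = sorted(candidates, key=lambda h: h[1], reverse=True)[:N]   # keep top-N
    return beam[0][0]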
Preprocessing
• Sentence segmentation
• Sentence alignment (for MT)
• Tokenization
• Morphing
• POS tagging
• …
Post-processing
• System combination
• Casing (MT)
• …
Evaluation
• Use standard training/test data if possible.
• Choose appropriate evaluation measures:
– WSD: for what applications?
– Word alignment: F-measure vs. AER. How does it affect the MT result?
– Parsing: F-measure vs. dependency link accuracy
Tricks
Tricks
• Algebra
• Probability
• Optimization
• Programming
Algebra
The order of sums:

\sum_{x_1} \sum_{x_2} \cdots \sum_{x_n} f(x_1, \ldots, x_n) = \sum_{x_n} \cdots \sum_{x_2} \sum_{x_1} f(x_1, \ldots, x_n)

Pulling out constants:

\sum_{x_1} \cdots \sum_{x_n} c \, f(x_1, \ldots, x_n) = c \sum_{x_1} \cdots \sum_{x_n} f(x_1, \ldots, x_n)
Algebra (cont)
The order of log and product / sum:

\log \prod_i f_i = \sum_i \log f_i

The order of sums and products:

\sum_{x_1} \sum_{x_2} \cdots \sum_{x_n} \prod_{i=1}^{n} f_i(x_i) = \prod_{i=1}^{n} \sum_{x_i} f_i(x_i)
Probability
Introducing a new random variable:

p(x \mid \theta) = \sum_y p(x, y \mid \theta) = \sum_y p(y \mid \theta)\, p(x \mid y, \theta)

The order of decomposition:

P(x, y, z) = P(x)\, P(y \mid x)\, P(z \mid x, y)

P(x, y, z) = P(y)\, P(z \mid y)\, P(x \mid y, z)
More general cases
P(A_1, \ldots, A_n) = P(A_1) \prod_{i=2}^{n} P(A_i \mid A_1, \ldots, A_{i-1})
Probability (cont)
Bayes rule:

p(y \mid x) = \frac{p(x \mid y)\, p(y)}{p(x)}

Source-channel model:

\arg\max_y p(y \mid x) = \arg\max_y p(y)\, p(x \mid y)
Probability (cont)
Normalization:

p(x) = \frac{Ct(x)}{\sum_x Ct(x)}

Jensen's inequality:

E[\log(p(x))] \le \log(E[p(x)])
Optimization
• When there is no analytical solution, use iterative approach.
• If the optimal solution to g(x) is hard to find, look for the optimal solution to a (tight) lower bound of g(x).
Optimization (cont)
• Using Lagrange multipliers:
– Constrained problem: maximize f(x) with the constraint that g(x) = 0
– Unconstrained problem: maximize f(x) – λg(x)
• Taking first derivatives to find the stationary points.
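A small worked example (not from the slides) of the same device: maximize the entropy H(p) = -\sum_i p_i \log p_i subject to \sum_i p_i = 1. The stationary point of the Lagrangian is the uniform distribution:

\Lambda(p, \lambda) = -\sum_{i=1}^{n} p_i \log p_i - \lambda \Big( \sum_{i=1}^{n} p_i - 1 \Big)

\frac{\partial \Lambda}{\partial p_i} = -\log p_i - 1 - \lambda = 0 \;\Rightarrow\; p_i = e^{-1-\lambda} \;\Rightarrow\; p_i = \frac{1}{n}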
Programming
• Using/creating a good package:– Tutorial, sample data, well-written code– Multiple levels of code
• Core ML algorithm: e.g., TBL
• Wrapper for a task: e.g., a POS tagger
• Wrapper to deal with input, output, etc.
Programming (cont)
• Good practice:
– Write notes and create wrappers (all the commands should be stored in the notes, or even better in a wrapper script)
– Use standard directory structures:
• src/, include/, exec/, bin/, obj/, docs/, sample/, data/, result/
– Give meaningful filenames to important code: e.g., build_trigram_tagger.pl rather than aaa100.exec
– Give meaningful function and variable names
– Don’t use global variables
Final words
• We have covered a lot of topics: 5+4+3+4
• It takes time to digest, but at least we understand the basic concepts.
• The next step: applying them to real applications.