Feature Selection & Maximum Entropy: Advanced Statistical Methods in NLP (Ling 572), January 26, 2012
TRANSCRIPT
Feature Selection & Maximum Entropy
Advanced Statistical Methods in NLP, Ling 572
January 26, 2012
Roadmap
- Feature selection and weighting
  - Feature weighting
  - Chi-square feature selection
  - Chi-square feature selection example
- HW #4
- Maximum Entropy
  - Introduction: the Maximum Entropy Principle
  - Maximum Entropy NLP examples
Feature Selection Recap
Problem: the curse of dimensionality
- Data sparseness, computational cost, overfitting
Solution: dimensionality reduction
- New feature set r' such that |r'| < |r|
- Approaches (global & local):
  - Feature extraction: new features in r' are transformations of features in r
  - Feature selection: wrapper techniques, feature scoring
Feature Weighting
For text classification, typical weights include:
- Binary: weights in {0, 1}
- Term frequency (tf): # of occurrences of term tk in document di
- Inverse document frequency (idf): idf = log(N / (1 + dfk)), where dfk = # of docs in which tk appears and N = # of docs
- tf-idf = tf * idf
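These weights can be sketched directly from the definitions above. A minimal illustration with a hypothetical three-document corpus; note that with the 1 + dfk smoothing in the denominator, a term occurring in every document gets a slightly negative idf:

```python
import math

# Toy corpus: each document is a list of tokens (hypothetical data).
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "barked"],
    ["the", "cat", "and", "the", "dog"],
]

def tf(term, doc):
    """Term frequency: # of occurrences of the term in the document."""
    return doc.count(term)

def idf(term, docs):
    """Inverse document frequency, using the slide's idf = log(N / (1 + df_k))."""
    n = len(docs)
    df = sum(1 for d in docs if term in d)
    return math.log(n / (1 + df))

def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)
```

For example, "sat" appears in one document, so idf("sat") = log(3/2), while "the" appears in all three and gets idf = log(3/4) < 0.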
Chi Square
Tests for the presence/absence of a relation between random variables
- Bivariate analysis: tests 2 random variables
- Can test the strength of the relationship
- (Strictly speaking) doesn't test the direction
Chi Square Example
Can gender predict shoe choice? (Due to F. Xia)
- A: male/female (the features)
- B: shoe choice (the classes: {sandal, sneaker, ...})

|        | sandal | sneaker | leather shoe | boot | other |
|--------|--------|---------|--------------|------|-------|
| Male   | 6      | 17      | 13           | 9    | 5     |
| Female | 13     | 5       | 7            | 16   | 9     |
Comparing Distributions
Observed distribution (O):

|        | sandal | sneaker | leather shoe | boot | other |
|--------|--------|---------|--------------|------|-------|
| Male   | 6      | 17      | 13           | 9    | 5     |
| Female | 13     | 5       | 7            | 16   | 9     |

Expected distribution (E), filled in from the row and column totals:

|        | sandal | sneaker | leather shoe | boot | other | Total |
|--------|--------|---------|--------------|------|-------|-------|
| Male   | 9.5    | 11      | 10           | 12.5 | 7     | 50    |
| Female | 9.5    | 11      | 10           | 12.5 | 7     | 50    |
| Total  | 19     | 22      | 20           | 25   | 14    | 100   |

(Due to F. Xia)
Computing Chi Square
Expected value for a cell = row_total * column_total / table_total

X² = (6 − 9.5)²/9.5 + (17 − 11)²/11 + ... ≈ 14.03
Calculating X²
1. Tabulate the contingency table of observed values: O
2. Compute the row and column totals
3. Compute the table of expected values, given the row/column totals (assuming no association)
4. Compute X²
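The four steps above can be sketched as follows, using the observed shoe-choice counts from the earlier slides:

```python
# Step 1: the contingency table of observed values (shoe example, due to F. Xia).
observed = [
    [6, 17, 13, 9, 5],   # Male
    [13, 5, 7, 16, 9],   # Female
]

def chi_square(observed):
    # Step 2: row, column, and table totals.
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    # Step 3: expected counts, assuming no association.
    expected = [[r * c / total for c in col_totals] for r in row_totals]
    # Step 4: X^2 = sum over cells of (O - E)^2 / E.
    return sum(
        (o - e) ** 2 / e
        for o_row, e_row in zip(observed, expected)
        for o, e in zip(o_row, e_row)
    )

print(round(chi_square(observed), 2))  # 14.03, as on the slides
```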
For a 2x2 Table
O:

|     | !ci | ci |
|-----|-----|----|
| !tk | a   | b  |
| tk  | c   | d  |

E:

|       | !ci          | ci           | Total |
|-------|--------------|--------------|-------|
| !tk   | (a+b)(a+c)/N | (a+b)(b+d)/N | a+b   |
| tk    | (c+d)(a+c)/N | (c+d)(b+d)/N | c+d   |
| Total | a+c          | b+d          | N     |
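A small sketch of the 2x2 case: the expected counts follow the E table above, and for 2x2 tables the cell-by-cell sum is algebraically equivalent to the closed form X² = N(ad − bc)² / ((a+b)(c+d)(a+c)(b+d)). Cell labels are as in the O table; the sample counts are made up:

```python
def expected_2x2(a, b, c, d):
    """Expected counts, laid out as in the E table above."""
    n = a + b + c + d
    return [
        [(a + b) * (a + c) / n, (a + b) * (b + d) / n],  # row !tk
        [(c + d) * (a + c) / n, (c + d) * (b + d) / n],  # row tk
    ]

def chi_square_2x2(a, b, c, d):
    """Closed form: X^2 = N(ad - bc)^2 / ((a+b)(c+d)(a+c)(b+d))."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# Hypothetical counts; the two computations agree.
a, b, c, d = 30, 10, 5, 15
e = expected_2x2(a, b, c, d)
cellwise = sum(
    (o - x) ** 2 / x
    for o, x in zip([a, b, c, d], [e[0][0], e[0][1], e[1][0], e[1][1]])
)
print(round(cellwise, 4), round(chi_square_2x2(a, b, c, d), 4))
```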
X² Test
Tests whether two random variables are independent.
- Null hypothesis: the two R.V.s are independent
- Compute the X² statistic
- Compute the degrees of freedom: df = (# rows − 1)(# cols − 1)
  - Shoe example: df = (2 − 1)(5 − 1) = 4
- Look up the probability of the X² statistic value in a X² table
- If the probability is low (below some significance level), we can reject the null hypothesis
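A sketch of the full test on the shoe data. Computing the p-value itself requires the X² distribution; here the statistic is simply compared against the tabulated critical value for df = 4 at the 0.05 significance level (9.488):

```python
observed = [
    [6, 17, 13, 9, 5],   # Male
    [13, 5, 7, 16, 9],   # Female
]

def chi_square_test(observed, critical_value):
    """Return (statistic, df, reject?) for a contingency table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    stat = sum(
        (o - r * c / total) ** 2 / (r * c / total)
        for r, row in zip(row_totals, observed)
        for c, o in zip(col_totals, row)
    )
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return stat, df, stat > critical_value  # True: reject independence

# From a X^2 table: critical value for df = 4 at alpha = 0.05 is 9.488.
stat, df, reject = chi_square_test(observed, critical_value=9.488)
print(df, reject)  # 4 True
```

Since 14.03 > 9.488, the null hypothesis of independence is rejected at the 0.05 level: gender and shoe choice are associated in this sample.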
Requirements for the X² Test
- Events are assumed independent and drawn from the same distribution
- Outcomes must be mutually exclusive
- Use raw frequencies, not percentages
- Sufficient values per cell: > 5
X² Example
Shared task evaluation: Topic Detection and Tracking (aka TDT)
Sub-task: the Topic Tracking Task
- Given a small number of exemplar documents (1-4), define a topic
- Create a model that allows tracking of the topic, i.e., find all subsequent documents on this topic
- Exemplars: 1-4 newswire articles, 300-600 words each
Challenges
Many news articles look alike.
- Create a profile (feature representation) that highlights terms strongly associated with the current topic and differentiates it from all other topics
- Not all documents are labeled; only a small subset belong to topics of interest
- Must differentiate from other topics AND from the 'background'
Approach
X² feature selection:
- Assume terms have a binary representation
- Positive class: term occurrences from the exemplar docs
- Negative class: term occurrences from other classes' exemplars and 'earlier' uncategorized docs
- Compute X² for the terms; retain the terms with the highest X² scores (keep the top N)
- Create one feature set per topic to be tracked
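A toy sketch of this selection step (not the actual TDT system): each term is scored with the 2x2 X² formula over term presence vs. class, and the top-N terms are kept. The example documents are hypothetical:

```python
def chi_square_2x2(a, b, c, d):
    """X^2 = N(ad - bc)^2 / ((a+b)(c+d)(a+c)(b+d)) for a 2x2 table."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0

def select_terms(pos_docs, neg_docs, top_n):
    """Score each term by X^2 over its 2x2 (term presence x class) table."""
    vocab = set(t for doc in pos_docs + neg_docs for t in doc)
    n_pos, n_neg = len(pos_docs), len(neg_docs)
    scores = {}
    for term in vocab:
        pos_with = sum(1 for doc in pos_docs if term in doc)
        neg_with = sum(1 for doc in neg_docs if term in doc)
        # Cells: a = (no term, neg), b = (no term, pos),
        #        c = (term, neg),    d = (term, pos)
        scores[term] = chi_square_2x2(
            n_neg - neg_with, n_pos - pos_with, neg_with, pos_with
        )
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Hypothetical exemplar docs (positive) vs. background docs (negative):
pos = [{"quake", "rescue", "aid"}, {"quake", "aid", "toll"}]
neg = [{"election", "vote"}, {"vote", "poll"}, {"quake", "poll"}]
print(select_terms(pos, neg, 2))
```

Here "aid" scores highest (it appears in every positive document and no negative one), illustrating how the score rewards terms whose presence is strongly associated with the topic.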
Tracking Approach
- Build a vector space model
- Feature weighting: tf-idf (with some modifications)
- Distance measure: cosine similarity
- For each topic, select documents scoring above a threshold
- Result: improved retrieval
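A minimal sketch of the scoring step, assuming sparse term-to-weight vectors; the profile weights and threshold here are made up, not the system's actual values:

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse term->weight vectors (dicts)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def track(profile, doc_vectors, threshold):
    """Indices of documents whose similarity to the topic profile
    exceeds the threshold."""
    return [i for i, d in enumerate(doc_vectors)
            if cosine(profile, d) > threshold]

# Hypothetical tf-idf-style weights for one topic profile:
profile = {"quake": 2.0, "rescue": 1.5, "aid": 1.0}
docs = [
    {"quake": 1.0, "aid": 0.5},      # on topic
    {"election": 1.2, "vote": 0.8},  # off topic
]
print(track(profile, docs, threshold=0.5))  # [0]
```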
HW #4
Topic: Feature Selection for kNN
- Build a kNN classifier using Euclidean distance and cosine similarity
- Write a program to compute X² on a data set
- Use X² at different significance levels to filter features
- Compare the effects of the different feature filterings on kNN classification
Maximum Entropy
"MaxEnt":
- A popular machine learning technique for NLP
- First uses in NLP circa 1996 (Rosenfeld, Berger)
- Applied to a wide range of tasks: sentence boundary detection (MxTerminator, Ratnaparkhi), POS tagging (Ratnaparkhi, Berger), topic segmentation (Berger), language modeling (Rosenfeld), prosody labeling, etc.
Readings & Comments
Several readings:
- (Berger, 1996), (Ratnaparkhi, 1997), (Klein & Manning, 2003): tutorial
- Note: some of these are very 'dense'. Don't spend huge amounts of time on every detail; take a first pass before class and review after the lecture.
- Going forward, the techniques get more complex. The goal is to understand the basic model and concepts. Training is especially complex; we'll discuss it, but not implement it.
Notation Note
The notation in the literature is not entirely consistent. We'll use: input = x, output = y, pair = (x, y), consistent with Berger, 1996.
- Ratnaparkhi, 1996: input = h, output = t, pair = (h, t)
- Klein & Manning, 2003: input = d, output = c, pair = (c, d)
Joint vs. Conditional Models
Given some training data {(x, y)}, we need to learn a model Θ such that, given a new x, we can predict the label y.
Different types of models:
- Joint models (aka generative models) estimate P(x, y) by maximizing P(X, Y | Θ)
  - Most models so far: n-gram, Naïve Bayes, HMM, etc.
  - Conceptually easy to compute the weights: relative frequency
- Conditional (aka discriminative) models estimate P(y | x) by maximizing P(Y | X, Θ)
  - Models going forward: MaxEnt, SVM, CRF, ...
![Page 86: Feature Selection & Maximum Entropy Advanced Statistical Methods in NLP Ling 572 January 26, 2012 1](https://reader035.vdocuments.us/reader035/viewer/2022062422/56649eef5503460f94bff2e6/html5/thumbnails/86.jpg)
86
Joint vs Conditional Models
Assuming some training data {(x,y)}, we need to learn a model Θ s.t. given a new x, we can predict the label y.
Different types of models:
Joint models (aka generative models) estimate P(x,y) by maximizing P(X,Y|Θ)
Most models so far: n-gram, Naïve Bayes, HMM, etc.
Conceptually easy to compute the weights: relative frequency
Conditional (aka discriminative) models estimate P(y|x) by maximizing P(Y|X,Θ)
Models going forward: MaxEnt, SVM, CRF, …
Computing the weights is more complex
![Page 87: Feature Selection & Maximum Entropy Advanced Statistical Methods in NLP Ling 572 January 26, 2012 1](https://reader035.vdocuments.us/reader035/viewer/2022062422/56649eef5503460f94bff2e6/html5/thumbnails/87.jpg)
Naïve Bayes Model
The Naïve Bayes model assumes the features f are independent of each other, given the class C
(Graphical model: class node c with arrows to feature nodes f1, f2, f3, …, fk)
![Page 92: Feature Selection & Maximum Entropy Advanced Statistical Methods in NLP Ling 572 January 26, 2012 1](https://reader035.vdocuments.us/reader035/viewer/2022062422/56649eef5503460f94bff2e6/html5/thumbnails/92.jpg)
92
Naïve Bayes Model
Makes the assumption of conditional independence of the features given the class
However, this is generally unrealistic
P(“cuts”|politics) = pcuts
But is P(“cuts”|politics,“budget”) still = pcuts?
We would like a model that doesn’t make this assumption
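The failure of the independence assumption can be seen on a tiny made-up corpus (all counts below are hypothetical, purely for illustration): among ‘politics’ documents, observing “budget” sharply raises the probability of “cuts”.

```python
# Toy illustration (hypothetical counts) of why the Naive Bayes
# independence assumption is unrealistic: within the 'politics' class,
# seeing "budget" makes "cuts" far more likely than its marginal rate.
politics_docs = [
    {"budget", "cuts", "vote"},
    {"budget", "cuts", "tax"},
    {"budget", "cuts"},
    {"vote", "election"},
    {"election", "tax"},
]

def p(word, docs):
    # relative frequency of docs containing the word
    return sum(word in d for d in docs) / len(docs)

def p_given(word, context, docs):
    # P(word | context word present), by relative frequency
    with_ctx = [d for d in docs if context in d]
    return sum(word in d for d in with_ctx) / len(with_ctx)

print(p("cuts", politics_docs))                  # P("cuts"|politics) = 0.6
print(p_given("cuts", "budget", politics_docs))  # P("cuts"|politics,"budget") = 1.0
```

Naïve Bayes would use 0.6 in both situations; a model without the independence assumption can distinguish them.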
![Page 93: Feature Selection & Maximum Entropy Advanced Statistical Methods in NLP Ling 572 January 26, 2012 1](https://reader035.vdocuments.us/reader035/viewer/2022062422/56649eef5503460f94bff2e6/html5/thumbnails/93.jpg)
Model Parameters
Our model: c* = argmaxc P(c) Πj P(fj|c)
Types of parameters: two:
P(c): class priors
P(fj|c): class-conditional feature probabilities
Parameters in total: |C| + |V||C|, if the features are words in vocabulary V
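As a minimal sketch (on a hypothetical three-document corpus), both parameter types can be estimated by relative frequency, and the parameter count matches |C| + |V||C|:

```python
# Sketch: estimating the two Naive Bayes parameter types by
# relative frequency on a tiny hypothetical training set.
from collections import Counter, defaultdict

training = [  # (document words, class label) -- made-up data
    (["budget", "cuts", "vote"], "politics"),
    (["budget", "tax"], "politics"),
    (["game", "score"], "sports"),
]

class_counts = Counter(label for _, label in training)
prior = {c: n / len(training) for c, n in class_counts.items()}   # P(c)

word_counts = defaultdict(Counter)
for words, label in training:
    word_counts[label].update(words)

cond = {c: {w: n / sum(wc.values()) for w, n in wc.items()}       # P(fj|c)
        for c, wc in word_counts.items()}

vocab = {w for words, _ in training for w in words}
# total parameters: |C| priors + |V|*|C| class-conditional probabilities
n_params = len(class_counts) + len(vocab) * len(class_counts)
print(prior["politics"], n_params)   # 2/3 of docs are politics; 2 + 6*2 = 14
```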
![Page 94: Feature Selection & Maximum Entropy Advanced Statistical Methods in NLP Ling 572 January 26, 2012 1](https://reader035.vdocuments.us/reader035/viewer/2022062422/56649eef5503460f94bff2e6/html5/thumbnails/94.jpg)
94
Weights in Naïve Bayes

       c1          c2          c3          …   ck
f1     P(f1|c1)    P(f1|c2)    P(f1|c3)    …   P(f1|ck)
f2     P(f2|c1)    P(f2|c2)    …
…      …
f|V|   P(f|V||c1)              …
![Page 100: Feature Selection & Maximum Entropy Advanced Statistical Methods in NLP Ling 572 January 26, 2012 1](https://reader035.vdocuments.us/reader035/viewer/2022062422/56649eef5503460f94bff2e6/html5/thumbnails/100.jpg)
Weights in Naïve Bayes and Maximum Entropy
Naïve Bayes: the weights P(f|y) are probabilities in [0,1]
P(y|x) = P(y) Πj P(fj|y) / Z(x)
MaxEnt: the weights are real numbers; any magnitude, any sign
P(y|x) = exp(Σj λj fj(x,y)) / Z(x)
100
![Page 104: Feature Selection & Maximum Entropy Advanced Statistical Methods in NLP Ling 572 January 26, 2012 1](https://reader035.vdocuments.us/reader035/viewer/2022062422/56649eef5503460f94bff2e6/html5/thumbnails/104.jpg)
MaxEnt Overview
Prediction: P(y|x) = exp(Σj λj fj(x,y)) / Z(x)
fj(x,y): binary feature function, indicating the presence of feature j in instance x with class y
λj: feature weights, learned in training
Prediction: compute P(y|x) for each class, pick the highest-scoring y
104
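The prediction step above can be sketched directly. The λ weights here are made-up numbers standing in for learned values; each binary feature fires for one (word, class) pair:

```python
import math

# Minimal sketch of MaxEnt prediction with hypothetical weights:
#   P(y|x) = exp(sum_j lambda_j * f_j(x, y)) / Z(x)
weights = {  # hypothetical learned lambdas: (word, class) -> weight
    ("budget", "politics"): 1.5,
    ("cuts", "politics"): 0.8,
    ("game", "sports"): 2.0,
}
classes = ["politics", "sports"]

def p_y_given_x(words, y):
    # unnormalized score exp(sum of firing feature weights) for each class
    scores = {c: math.exp(sum(weights.get((w, c), 0.0) for w in words))
              for c in classes}
    z = sum(scores.values())  # normalizer Z(x)
    return scores[y] / z

x = ["budget", "cuts"]
print(max(classes, key=lambda c: p_y_given_x(x, c)))  # -> politics
```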
![Page 105: Feature Selection & Maximum Entropy Advanced Statistical Methods in NLP Ling 572 January 26, 2012 1](https://reader035.vdocuments.us/reader035/viewer/2022062422/56649eef5503460f94bff2e6/html5/thumbnails/105.jpg)
Weights in MaxEnt

       c1    c2    c3    …   ck
f1     λ1    λ8    …
f2     λ2    …
…      …
f|V|   λ6
105
![Page 110: Feature Selection & Maximum Entropy Advanced Statistical Methods in NLP Ling 572 January 26, 2012 1](https://reader035.vdocuments.us/reader035/viewer/2022062422/56649eef5503460f94bff2e6/html5/thumbnails/110.jpg)
Maximum Entropy Principle
Intuitively: model all that is known, and assume as little as possible about what is unknown
Maximum entropy = minimum commitment
Related to concepts like Occam’s razor
Laplace’s “Principle of Insufficient Reason”: when one has no information to distinguish between the probabilities of two events, the best strategy is to consider them equally likely
110
![Page 115: Feature Selection & Maximum Entropy Advanced Statistical Methods in NLP Ling 572 January 26, 2012 1](https://reader035.vdocuments.us/reader035/viewer/2022062422/56649eef5503460f94bff2e6/html5/thumbnails/115.jpg)
Example I: (K&M 2003)
Consider a coin flip: H(X) = −P(X=H) log P(X=H) − P(X=T) log P(X=T)
What values of P(X=H), P(X=T) maximize H(X)? P(X=H) = P(X=T) = 1/2
If there is no prior information, the best guess is a fair coin
What if you know P(X=H) = 0.3? Then P(X=T) = 0.7
115
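The coin-flip example can be checked numerically: a grid search over P(X=H) finds the entropy peak at the fair coin.

```python
import math

# Sketch: entropy of a coin, H(X) = -p*log2(p) - (1-p)*log2(1-p),
# is maximized at p = 1/2 (the fair coin).
def entropy(p):
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# grid search over P(X=H) in steps of 0.01
best = max((i / 100 for i in range(101)), key=entropy)
print(best, entropy(best))  # -> 0.5 1.0
print(entropy(0.3))         # if P(X=H)=0.3 is fixed, H is lower (~0.881 bits)
```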
![Page 118: Feature Selection & Maximum Entropy Advanced Statistical Methods in NLP Ling 572 January 26, 2012 1](https://reader035.vdocuments.us/reader035/viewer/2022062422/56649eef5503460f94bff2e6/html5/thumbnails/118.jpg)
Example II: MT (Berger, 1996)
Task: English → French machine translation
Specifically, translating ‘in’
Suppose we’ve seen ‘in’ translated as: {dans, en, à, au cours de, pendant}
Constraint: p(dans) + p(en) + p(à) + p(au cours de) + p(pendant) = 1
If there is no other constraint, what is the maxent model? p(dans) = p(en) = p(à) = p(au cours de) = p(pendant) = 1/5
118
![Page 124: Feature Selection & Maximum Entropy Advanced Statistical Methods in NLP Ling 572 January 26, 2012 1](https://reader035.vdocuments.us/reader035/viewer/2022062422/56649eef5503460f94bff2e6/html5/thumbnails/124.jpg)
Example II: MT (Berger, 1996)
What if we find out the translator uses dans or en 30% of the time?
Constraint: p(dans) + p(en) = 3/10
Now what is the maxent model? p(dans) = p(en) = 3/20; p(à) = p(au cours de) = p(pendant) = 7/30
What if we also know the translator picks à or dans 50% of the time? Add a new constraint: p(à) + p(dans) = 0.5. Now what is the maxent model?
Not intuitively obvious…
124
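For the first constraint set, the maxent solution (split each group uniformly) can be checked numerically: any other distribution satisfying the same constraints has lower entropy. The alternative distribution below is an arbitrary choice for comparison.

```python
import math

# Sketch checking the maxent solution for Berger's 'in' example:
# constraints are sum p = 1 and p(dans) + p(en) = 3/10.
def H(ps):
    # entropy in nats; terms with p = 0 contribute nothing
    return -sum(p * math.log(p) for p in ps if p > 0)

# maxent answer: equal split within each constrained group
maxent = [3/20, 3/20, 7/30, 7/30, 7/30]   # dans, en, a, au cours de, pendant
print(round(sum(maxent), 10))             # sums to 1.0, so constraints hold

# an arbitrary alternative that also satisfies both constraints
other = [0.1, 0.2, 0.3, 0.2, 0.2]
assert abs(other[0] + other[1] - 0.3) < 1e-9
print(H(maxent) > H(other))               # -> True: uniform split wins
```

With the extra constraint p(à) + p(dans) = 0.5 the optimum is no longer obvious by symmetry; that is exactly why iterative training algorithms are needed.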
![Page 125: Feature Selection & Maximum Entropy Advanced Statistical Methods in NLP Ling 572 January 26, 2012 1](https://reader035.vdocuments.us/reader035/viewer/2022062422/56649eef5503460f94bff2e6/html5/thumbnails/125.jpg)
125
Example III: POS (K&M, 2003)
![Page 130: Feature Selection & Maximum Entropy Advanced Statistical Methods in NLP Ling 572 January 26, 2012 1](https://reader035.vdocuments.us/reader035/viewer/2022062422/56649eef5503460f94bff2e6/html5/thumbnails/130.jpg)
130
Example III
Problem: too uniform
What else do we know? Nouns are more common than verbs
So fN = {NN, NNS, NNP, NNPS}, and E[fN] = 32/36
Also, proper nouns are more frequent than common nouns, so E[fNNP,NNPS] = 24/36
Etc.